Showing posts with label genetics. Show all posts

Saturday, December 13, 2025

Mutations That Make Us Human

The ongoing quest to make biologic sense of genomic regions that differentiate us from other apes.

Some people are still, at this late date, taken aback by the fact that we are animals, biologically hardly more than cousins to fellow apes like the chimpanzee, and descendants through billions of years of other life forms far more humble. It has taken a lot of suffering and drama to get to where we are today. But what are those specific genetic endowments that make us different from the other apes? That, like much of genetics and genetic variation, is a tough question to answer.

At the DNA level, we are roughly one percent different from chimpanzees. A recent sequencing of great apes provided a gross overview of these differences. There are inversions, and larger changes in junk DNA that can look like bigger differences, but these have little biological importance, and are not counted in the sequence difference. A difference of one percent is really quite large. For a three gigabase genome, that works out to 30 million differences. That is plenty of room for big things to happen.
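The arithmetic behind that estimate is easy to check. The sketch below assumes a haploid genome of about three billion bases and a flat one percent divergence, both round figures from the text rather than measured values; the function name is only illustrative.

```python
# Rough estimate of human-chimpanzee single-base differences.
# Both inputs are round figures taken from the text, not measurements.
GENOME_SIZE_BASES = 3_000_000_000   # about 3 gigabases (haploid)
DIVERGENCE_PERCENT = 1              # about 1% sequence difference


def estimated_differences(genome_size: int, divergence_percent: int) -> int:
    """Number of differing positions implied by a flat percent divergence."""
    return genome_size * divergence_percent // 100


differences = estimated_differences(GENOME_SIZE_BASES, DIVERGENCE_PERCENT)
print(f"{differences:,} differing positions")
```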

Gross alignment of one chromosome between the great apes. [HSA- human, PTR- chimpanzee, PPA- bonobo, GGO- gorilla, PPY- orangutan (Borneo), PAB- orangutan (Sumatra)]. Fully aligned regions (not showing smaller single nucleotide differences) are shown in blue. Large inversions of DNA order are shown in yellow. Other junk DNA gains and losses are shown in red, pink, purple. One large-scale jump of a DNA segment is shown in green. One can see that there has been significant rearrangement of genomes along the way, even as most of this chromosome (and others as well) is easily alignable and traceable through the evolutionary tree.


But most of those differences are totally unimportant. Mutations happen all the time, and most have no effect, since most positions (particularly the most variable ones) in our DNA are junk, like transposons, heterochromatin, telomeres, centromeres, introns, intergenic space, etc. Even in protein-coding genes, a third of the positions are "synonymous", with no effect on the coded amino acid, and even when an amino acid is changed, that protein's function is frequently unaffected. The next biggest group of mutations have bad effects, and are selected against. These make up the tragic pool of genetic syndromes and diseases, from mild to severe. Only a tiny proportion of mutations will have been beneficial at any point in this story. But those mutations have tremendous power. They can drag along their local DNA regions as they are positively selected, and gain "fixation" in the genome, which is to say, they are sufficiently beneficial to their hosts that they outcompete all others, with the ultimate result that the mutation becomes universal in the population- the new standard. This process happens in parallel, across all positions of the genome, all at the same time. So a process that seems painfully slow can actually add up to quite a bit of change over evolutionary time, as we see.
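To get a feel for how selection changes the odds of fixation, here is a minimal sketch of a haploid Wright-Fisher simulation. It is a textbook toy model, not anything taken from the work discussed here, and the population size, selection coefficient, and trial count are arbitrary assumptions chosen to keep the run short.

```python
import random


def fixation_fraction(pop_size: int, selection: float, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which one new mutant copy takes over the population.

    Haploid Wright-Fisher model: every generation the whole population is
    redrawn, and mutant parents are weighted by (1 + selection).
    """
    rng = random.Random(seed)
    fixed = 0
    for _ in range(trials):
        count = 1  # a single new mutant copy
        while 0 < count < pop_size:
            freq = count / pop_size
            weighted = freq * (1 + selection)
            # Chance that any one offspring descends from a mutant parent.
            p = weighted / (weighted + (1 - freq))
            count = sum(1 for _ in range(pop_size) if rng.random() < p)
        if count == pop_size:
            fixed += 1
    return fixed / trials


# Hypothetical parameters: 50 individuals, 1000 independent new mutations.
neutral = fixation_fraction(pop_size=50, selection=0.0, trials=1000)
beneficial = fixation_fraction(pop_size=50, selection=0.2, trials=1000)
print(f"neutral mutation fixed in {neutral:.1%} of trials")
print(f"beneficial mutation fixed in {beneficial:.1%} of trials")
```

Even a clearly beneficial mutation is usually lost by chance while it is still rare, but it fixes far more often than a neutral one, which in this model should fix in roughly one trial out of every `pop_size`.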

So the hunt was on to find "human accelerated regions" (HAR), which are parts of our genome that were conserved in other apes, but suddenly changed on the way to humans. There are roughly three thousand such regions, but figuring out what they might be doing is quite difficult, and there is a long tail from strong to weak effects. There are two general rationales for their occurrence. First, selection was lost over a genomic region, if that function became unimportant. That would allow faster mutation and divergence from the progenitors. Or second, some novel beneficial mutation happened there, bringing it under positive selection and to fixation. Some recent work found, interestingly, that clusters of mutations in HAR segments often have countervailing effects, with one major mutation causing one change, and a few other mutations (vs the ancestral sequence) causing opposite changes, in a process hypothesized to amount to evolutionary fine tuning.

A second property of HARs is that they are overwhelmingly not in coding regions of the genome, but in regulatory areas. They constitute fine tuning adjustments of timing and amount of gene regulation, not so much changes in the proteins produced. That is, our evolution was more about subtle changes in management of processes than of the processes themselves. A recent paper delved in detail into HAR5, one of the strongest such regions, (that is, strongest prior conservation, compared with changes in human sequence), which lies in the regulatory regions upstream of Frizzled8 (FZD8). FZD8 is a cell surface receptor, which receives signals from a class of signaling molecules called WNT (wingless and int). These molecules were originally discovered in flies, where they signal body development programs, allowing cells to know where they are and when they are in the developmental program, in relation to cells next door, and then to grow or migrate as needed. They have central roles in embryonic development, in organ development, and also in cancer, where their function is misused.

For our story, the WNT/FZD8 circuit is important in fetal brain development. Our brains undergo massive cell division and migration during fetal development, and clearly this is one of the most momentous and interesting differences between ourselves and all other animals. The current authors made mutations in mice that reproduce some of the HAR5 sequences, and investigated their effects. 

Two mouse brains at three months of age, one with the human version of the HAR5 region. Hard to see here, but the latter brain is ~7% bigger.

The authors claim that these brains, one with native mouse sequence, and the other with the human sequences from HAR5, have about a seven percent difference in mass. Thus the HAR5 region, all by itself, explains about one fourteenth of the gross difference in brain size between us and chimpanzees. 

HAR5 is a 619 base-pair region with only four sequence differences between ourselves and chimpanzees. It lies 300,000 bases upstream of FZD8, in a vast region of over a million base pairs with no genes. While this region contains many regulatory elements, (generally called enhancers or enhancer modules, only some of which are mapped), it is at the same time an example of junk DNA, where most of the individual positions in this vast sea of DNA are likely of little significance. The multifarious regulation by all these modules is of course important because this receptor participates in so many different developmental programs, and has doubtless been fine-tuned over the millennia not just for brain development, but for every location and time point where it is needed.

Location of the FZD8 gene, in the standard view of the genome at NIH. I have added an arrow that points to the tiny (in relative terms) FZD8 coding region (green), and a star at the location of HAR5, far upstream among a multitude of enhancer sequences. One can see that this upstream region is a vast area (of roughly 1.5 million bases) with no other genes in sight, providing space for extremely complicated and detailed regulation, little of which is as yet characterized.

Diving into the HAR5 functions in more detail, the authors show that the human version directly increases FZD8 gene expression (about 2 fold, in very rough terms), while deleting the region strongly decreases expression in mice. Of the four individual base changes in the HAR5 region, two have strong (additive) effects increasing FZD8 expression, while the other two have weaker, but still activating, effects. Thus, no compensatory regulation here. It is full speed ahead at HAR5 for bigger brain size. Additionally, a variant in human populations that is associated with autism spectrum disorders also resides in this region, and the authors show that this change decreases FZD8 expression about 20%. Small numbers, sure, but for a process that directs cell division over many cycles in early brain development, this kind of difference can have profound effects.


The HAR5 region causes increased transcription of FZD8, in mice, compared to the native version and a deletion.

The HAR5 region causes increased cell proliferation in embryonic day 14.5 brain areas, stained for neural markers.

"This reveals Hs-HARE5 modifies radial glial progenitor behavior, with increased self-renewal at early developmental stages followed by expanded neurogenic potential. ... Using these orthogonal strategies we show four human-specific variants in HARE5 drive increased enhancer activity which promotes progenitor proliferation. These findings illustrate how small changes in regulatory DNA can directly impact critical signaling pathways and brain development."

So there you have it. The nuts and bolts of evolution, from the molecular to the cellular, the organ, and then the organismal, levels. Humans do not just have bigger brains, but better brains, and countless other subtle differences all over the body. Each of these is directed by genetic differences, as the combined inheritance of the last six million years since our divergence from chimpanzees. Only with the modern molecular tools can we see Darwin's vision come into concrete focus, as particular, even quantum, changes in the code, and thus biology, of humanity. There is a great deal left to decipher, but the answers are all in there, waiting.


Saturday, November 22, 2025

Ground Truth for Genetic Mutations

Saturation mutagenesis shows that our estimates of the functional effect of uncharacterized mutations are not so great.

Human genomes can now be sequenced for less than $1,000. This technological revolution has enabled a large expansion of genetic testing, used for cancer tissue diagnosis and tracking, and for genetic syndrome analysis both of embryos before birth and affected people after birth. But just because a base among the 3 billion of the genome is different from the "reference" genome, that does not mean it is bad. Judging whether a variant (the modern, more neutral term for mutation) is bad takes a lot of educated guesswork.

A recent paper described a deep dive into one gene, where the authors created and characterized the functional consequence of every possible coding variant. Then they evaluated how well our current rules of thumb and prediction programs for variant analysis compare with what they found. It was a mediocre performance. The gene is CDKN2A, one of our more curious oddities. This is an important tumor suppressor gene that inhibits cell cycle progression and promotes DNA repair- it is often mutated in cancers. But it encodes not one, but two entirely different proteins, by virtue of a complex mRNA splicing pattern that uses distinct exons in some coding portions, and parts of one sequence in two different frames, to encode these two proteins, called p16 and p14. 

One gene, two proteins. CDKN2A has a splicing pattern (mRNA exons shown as boxes at top, with pink segments leading to the p14 product, and the blue segments leading to the p16 product) that generates two entirely different proteins from one gene. Each product has tumor suppressing effects, though via distinct mechanisms.

Regardless of the complex splicing and protein coding characteristics, the authors generated all possible variants in every possible coded amino acid (156 amino acids in all, as both produced proteins are relatively short). Since the primary roles of these proteins are in cell cycle and proliferation control, it was possible to assay function by their effect when expressed in cultured pancreatic cells. A deleterious effect on the protein was revealed as, paradoxically, increased growth of these cells. They found that about 600 of the 3,000 different variants in their catalog had such an effect, or 20%.
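The size of such a catalog follows from simple counting, sketched below. The assumption here (mine, not a description of the authors' pipeline) is that each of the 156 positions is swapped for each of the 19 other standard amino acids, which gives a count close to the "3,000" quoted above. The sequence used is a placeholder of the right length, not the real protein.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids, one-letter codes


def missense_variants(protein: str) -> list[str]:
    """List every single-amino-acid substitution, e.g. 'M1A' for Met at position 1 to Ala."""
    return [
        f"{ref}{pos}{alt}"
        for pos, ref in enumerate(protein, start=1)
        for alt in AMINO_ACIDS
        if alt != ref
    ]


# Placeholder sequence with 156 residues; NOT the real p16 sequence.
toy_protein = "M" + "A" * 155
catalog = missense_variants(toy_protein)

deleterious_share = 600 / 3000  # the rounded counts quoted in the text

print(f"{len(catalog)} possible single-residue substitutions")
print(f"{deleterious_share:.0%} of variants scored as deleterious")
```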

This is an expected rate of effect, on the whole. Most positions in proteins are not that important, and can be substituted by several similar amino acids. For a typical enzyme, for instance, the active site may be made up of a few amino acids in a particular orientation, and the rest of the protein is there to fold into the required shape to form that active site. Similar folding can be facilitated by numerous amino acids at most positions, as has been richly documented in evolutionary studies of closely-related proteins. These p16 and p14 proteins interact with a few partners, so they need to maintain those key interfacial surfaces to be fully functional. Additionally, the assay these researchers ran, of a few generations of growth, is far less sensitive than a long-term true evolutionary setting, which can sift out very small effects on a protein, so they were setting a relatively high bar for seeing a deleterious effect. They did a selective replication of their own study, and found a reproducibility rate of about 80%, which is not great, frankly.

"Of variants identified in patients with cancer and previously reported to be functionally deleterious in published literature and/or reported in ClinVar as pathogenic or likely pathogenic (benchmark pathogenic variants), 27 of 32 (84.4%) were functionally deleterious in our assay"

"Of 156 synonymous variants and six missense variants previously reported to be functionally neutral in published literature and/or reported in ClinVar as benign or likely benign (benchmark benign variants), all were characterized as functionally neutral in our assay "

"Of 31 VUSs previously reported to be functionally deleterious, 28 (90.3%) were functionally deleterious and 3 (9.7%) were of indeterminate function in our assay."

"Similarly, of 18 VUSs previously reported to be functionally neutral, 16 (88.9%) were functionally neutral and 2 (11.1%) were of indeterminate function in our assay"
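The percentages in these quotes can be recomputed from the counts they give. This is only a consistency check on the quoted numbers; the labels are my own shorthand, not the paper's terms.

```python
# Counts copied from the quoted passages: (shorthand label, agreeing, total).
benchmarks = [
    ("benchmark pathogenic, deleterious in assay", 27, 32),
    ("VUS reported deleterious, deleterious in assay", 28, 31),
    ("VUS reported neutral, neutral in assay", 16, 18),
]


def percent(part: int, whole: int) -> float:
    """Share of `whole` represented by `part`, as a percentage to one decimal."""
    return round(100 * part / whole, 1)


concordance = {label: percent(agree, total) for label, agree, total in benchmarks}
for label, value in concordance.items():
    print(f"{label}: {value}%")
```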

Here we get to the key issues. Variants are generally classified as benign, pathogenic/deleterious, or "variant of unknown/uncertain significance". The latter are particularly vexing to clinical geneticists. The whole point of sequencing a patient's tumor or genomic DNA is to find causal variants that can illuminate their condition, and possibly direct treatment. Seeing lots of "VUS" in the report leaves everyone in the dark. The authors pulled in all the common prediction programs that are officially sanctioned by the ACMG (American College of Medical Genetics), which is the foremost guide to clinical genetics, including the functional prediction of otherwise uncharacterized sequence variants. There are seven such programs, including one driven by AI, AlphaMissense, which is related to the Nobel prize-winning AlphaFold.

These programs strain to classify uncharacterized mutations as "likely pathogenic", "likely benign", or, if unable to make a conclusion, VUS/indeterminate. They rely on many kinds of data, like amino acid similarity, protein structure, evolutionary conservation, and known effects in proteins of related structure. They can be extensively validated against known mutations, and against new experimental work as it comes out, so we have a pretty good idea of how they perform. Thus they are trusted to some extent to provide clinical judgements, in the absence of better data. 

Each of seven programs (on bottom) gives estimations of variant effect over the same pool of mutations generated in this paper. This was a weird way to present simple data, but each bar contains the functional results the authors developed in their own data (numbers at the bottom, in parentheses, vertical). The bars were then colored with the rate of deleterious (black) vs benign (white) prediction from the program. The ideal case would be total black for the first bar in each set of three (deleterious) and total white in the third bar in each set (benign). The overall lineup/accuracy of all program predictions vs the author data was then overlaid by a red bar (right axis). The PrimateAI program was specially derived from comparison of homologous genes from primates only, yielding a high-quality dataset about the importance of each coded amino acid. However, it only gave estimates for 906 out of the whole set of 2964 variants. On the other hand, cruder programs like PolyPhen-2 gave less than 40% accuracy, which is quite disappointing for clinical use.

As shown above, the algorithms gave highly variable results, from under 40% accurate to over 80%. It is pretty clear that some of the lesser programs should be phased out. Of programs that fielded all the variants, the best were AlphaMissense and VEST, which each achieved about 70% accuracy. This is still not great. The issue is that, if a whole genome sequence is run for a patient with an obscure disease or syndrome, and variants vs the reference sequence are seen in several hundred genes, then a gene like CDKN2A could easily be pulled into the list of pathogenic (and possibly causal) variants, or be left out, on very shaky evidence. That is why even small increments in accuracy are critically important in this field. Genetic testing is a classic needle-in-a-haystack problem- a quest to find the one mutation (out of millions) that is driving a patient's cancer, or a child's inherited syndrome.

Still outstanding is the issue of non-coding variants. Genes are not just affected by mutations in their protein coding regions (indeed many important genes do not code for proteins at all), but by regulatory regions nearby and far. This is a huge area of mutation effects that are not really algorithmically accessible yet. As a prediction problem, it is far more difficult than predicting effects on a coded protein. It will require modeling of the entire gene expression apparatus, much of which remains shrouded in mystery.


Saturday, July 5, 2025

Water Sensing by WNKs

WNK kinases sense osmotic condition as well as chloride concentration to keep us hydrated.

"Water, water, everywhere, nor any drop to drink." This line from Coleridge evokes the horror of thirst on the ghost ship, as its crew can not drink salt water. Other species can, but ocean water is too strong for us, roughly four times as salty as our blood. Nevertheless, our bodies have exquisite mechanisms to manage salt concentrations, with each cell managing its own traffic, and the kidneys managing most electrolytes in the blood. It is a very difficult task that has led to clever evolutionary solutions like counter-current exchange across the nephron loops, and stark differences in those nephron cell membranes, over water or salt permeability, to maximize use of passive ion gradients. But at the heart of the system, one has to know what is going on- one has to monitor all of the electrolyte levels and overall osmotic stress.

One such monitoring thermostat for chemical balances turns out to be the WNK kinases- a family of four proteins in humans that control (by phosphorylating them) a secondary set of regulators, which in turn control many salt transporters, such as SLC12A2 and SLC12A4. These latter are passive, though regulated, co-transporters that allow chloride across the membrane when combined with a matching cation like sodium or potassium. The cations drive the process, because they are normally kept (pumped) to strong gradients across cell membranes, with high sodium outside, and high potassium inside. Thus when these co-transporters are turned on (or off), they use the cation gradients to control the chloride level in the cell, in either direction, depending on the particular transporter involved. Since the sodium and potassium levels are held at relatively static, pumped levels, it is the chloride level that helps control the overall osmotic pressure in a finely tuned way. 

A few of the ionic transactions done in the kidney.


The WNK kinases were discovered genetically, in families that showed hypertension and raised levels of chloride and potassium in the blood. These syndromes mirrored complementary syndromes caused by mutations in SLC12A2, the Na/Cl co-transporter, indicating the WNK kinases inhibit SLC12A2. It turns out that the WNKs, which are named for an unusual catalytic site (with no lysine [K]), are sensors for both chloride, which inhibits them, and for osmotic pressure, which activates them. They are expressed in different locations and have slightly different activities, (and control many more transporters and processes than discussed here), but I will treat them interchangeably here. The logic of all this is that, if osmotic pressure is low, that means that internal salt levels are low, and chloride needs to be let into the cell, by activating the cation/chloride co-transporters. Likewise, if chloride levels inside the cell are high, the WNK kinase needs to be inhibited, reducing chloride influx.

A recent paper (and prior work from the same lab) discussed structures of the WNK regulators that explain some of this behavior. WNK kinases are dimers at rest, and in that state mutually inhibit their auto-phosphorylation. It is separation and auto-phosphorylation that turns them on, after which they can then phosphorylate their target proteins, such as the secondary kinases STK39 and OSR1. The authors had previously found a chloride binding site right at the active site of the enzyme that promotes dimerization. In the current paper, they reveal a couple of clusters of water molecules which similarly affect the dimerization, and thus activity, of the enzyme.

Location of the inhibitory chloride (green) binding site in WNK1. This is right in the heart of the protein, near the active kinase site and dimerization interface with the other WNK1 partner.

While X-ray crystal structures rarely show or care much about water molecules, (they are extremely small and hard to track), here, those waters were hypothesized to be important, since WNK kinases are responsive to osmotic pressure. One way to test this is to add PEG400 to the reaction. This is a polymer (400 molecular weight) that is water-like and inert, but large in a way that crowds out water molecules from the solution. At 15% or 25% of the volume, PEG400 displaces a lot of water, lowers the water activity of a solution, and thus increases the osmotic pressure, that is, its tendency to draw water in from outside. Plants use osmotic pressure as turgor pressure, and our cells, not having cell walls, need to always be at an osmotic pressure similar to the outside, lest they swell up, or conversely shrink away. Anyhow, WNK kinases can be switched from an inactive to active state just by adding PEG400- a sure sign that they are sensors for osmotic pressure.


Water network (blue dots) within the WNK1 kinase protein. Most of the protein is colored teal, while the active site kinase area is red, and a tiny amount of the dimer partner is colored green. When this crystal is osmotically challenged, the water network collapses from 14 waters to 5, changing the structure and promoting dissociation of the dimer. In B is shown a sequence alignment over a wide evolutionary range where the amino acids that coordinate the water network (yellow) are clearly very well conserved, thus quite important.

Above is shown a closeup of the WNK1 protein, showing in teal the main backbone, including the catalytic loop. In red is the activation loop of the kinase, and in green is a little bit from the other WNK1 protein in the dimer pair. The chloride, if bound, would be located right at top center, at K375. Shown in blue are a series of fourteen water molecules that make up one so-called water network. Another smaller one was found at the interface between the two WNK1 proteins. The key finding was that, if crystalized with PEG400, this water network collapsed to only five water molecules, thereby changing the structure of the protein significantly and accounting for the dissolution of the dimer. 

Superposition of WNK1 with PEG400 (purple) and activated vs WNK1 without, in an inactive state (teal). Most of the blue waters would be gone in the purple state as well. This shows the significant structural transition, particularly in the helixes above the active site, which induce (in the purple state) dissociation of the dimer, auto-phosphorylation, and activation.

Thus there is a delicate network of water molecules tentatively held together within this protein that is highly sensitive to the ambient water activity (aka osmotic pressure). This dynamic network provides the mechanism by which the WNK proteins sense and transmit the signal that the cell requires a change in ionic flows. Generally the point is to restore homeostatic balance, but in the kidney these kinases are also used to control flows for the benefit of the organism as a whole, by regulating different transporters in different parts of the same cell- either on the blood side, or the urine side.


Saturday, June 28, 2025

Millions of Years Go by in a Day

In vitro evolution has interesting things to say about protein structure, evolution, and even AI.

The advent of DNA sequences has been revolutionary in many ways. It has been technologically transformative, is changing medical practice, has radically validated Darwin's theories of evolution, and has allowed much more accurate phylogenies to be drawn out of the history of life. As Dobzhansky said, nothing in biology makes sense except in the light of evolution. The quest to change those DNA sequences has been another technological frontier, now exemplified by the CRISPR genome editing methods. Geneticists have been inducing mutations forever, (well, for over a century), using insults like mustard gas and X-rays. This long-standing tradition is called a "screen", where, after mutagenesis, one looks for particular effects on the resulting organisms, like changes in color, malformations, defects in development. This is a sort of artificial selection, very highly directed by the experimenter, sometimes resulting in some very weird, if informative, organisms. More recently, biotechnologists have been using directed evolution systems to help develop, through a mix of random and semi-directed mutations, more capable enzymes and other proteins.

But there are many broader questions to ask about the mutational and evolutionary processes. A recent paper demonstrated an interesting mutagenic system hosted in brewer's yeast cells, which can model rapid evolution under a variety of selective constraints. The core of the system is a plasmid, replicated separately from the main genome, by an independent enzyme. This plasmid was found in a distantly related yeast, Kluyveromyces lactis, and encodes its own DNA polymerase that operates independently from the genomic replication system. This opened the way to use the plasmid replication system to host genes of interest and subject them to wildly different (which is to say faster) mutagenic rates than the rest of the organism.

This group has been laboring on this system for several years, and this paper is the culmination, developing a series of plasmid DNA polymerases that have extremely high error rates, while also having high replication activity, and also having a balanced spectrum of error types (that is, G>A as well as G>T, etc.). Indeed, they demonstrate that the error rate (of about 2 errors for every 10,000 bases replicated) is at the threshold of mutational breakdown- the level that is so high that the plasmid's other functions (which are maintained implicitly by purifying selection on activities such as expression of an antibiotic resistance gene/protein and the polymerase itself) are so rapidly impaired that the engineered system can not survive. The error rate of the host cell, in contrast, is about 1 error for every ten billion bases replicated.
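To put those two error rates side by side, here is a small calculation. The rates are the ones quoted above; the 398-codon target length is an assumption borrowed from the enzyme discussed later in this post, used only to show the scale of errors per gene copy.

```python
from fractions import Fraction

# Error rates per base replicated, as quoted in the text.
PLASMID_ERROR_RATE = Fraction(2, 10_000)
HOST_ERROR_RATE = Fraction(1, 10_000_000_000)

# Assumed target gene: 398 codons, ignoring the stop codon and any flanking sequence.
GENE_LENGTH_BASES = 398 * 3

fold_increase = PLASMID_ERROR_RATE / HOST_ERROR_RATE
errors_per_copy = PLASMID_ERROR_RATE * GENE_LENGTH_BASES

print(f"plasmid polymerase is {int(fold_increase):,}-fold more error-prone than the host")
print(f"expected errors per gene copy per replication: {float(errors_per_copy):.3f}")
```

Under these assumptions the target gene picks up a new error roughly once every four replications, before any selection acts on the result.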

What is the point of all this? While, as pointed out above, directed evolution systems and mutation/selection systems have been around for a long time, this is something quite different. This plasmid system creates high rates of mutation all the time, over a very confined target (the plasmid). The experimenters can then decide what kinds of selection pressure to put on their target gene, if any. They can place a positive selection regime on it, to drive the development of, say, a new substrate specificity for an enzyme. They can put it under negative (purifying) selection to maintain its current activity. Or they can let it spin with no selection at all, letting it degrade into a pseudogene unable to code for anything. All of these scenarios are common in nature and of interest to evolutionary biologists.

In this paper, the authors focus on one enzyme, tryptophan synthase, from a thermophilic bacterium. The aim was to see how this enzyme responded to both positive and negative selective forces in the face of high mutation rates. As it converts one nutrient, indole, into another, tryptophan, this is an enzyme whose activity is easy to assay for and to select for. In the main experiment, using many replicate cultures, they started with no selection for fifty generations, then ramped gradually to positive selection over the next hundred generations, and finished with 300 generations of purifying selection. 

Diversity, for one thing, had increased tremendously by the end of this process. At the end, an average of 21 amino acid changes had accumulated, with the most divergent proteins differing by over 60 amino acids, in a protein that started with 398 amino acids total. Secondly, there was a marked migration to net negative charge, which they speculate was due to accommodation of this thermophilic bacterial enzyme to a more temperate environment where it is a bit more difficult to evade agglomeration with other proteins. Third, changes happened more on the outside of the enzyme structure than the interior (image below). This is a very well-known and understood phenomenon, where selective constraints are much higher on interior packing of a protein and on active/catalytic site portions. Several key amino acids that contact the substrate chemicals are colored gray, meaning that they hardly varied at all in this experiment.

Structure of the TrpB enzyme, color coded for change during the evolution experiment. Note how particularly high rates of change happen in one external region (bottom) that interacts with a partner, TrpA, which was not present here. Also, gray areas with very low change tend to be in the interior and near the catalytic active site (substrate and cofactor [pyridoxal phosphate] shown in black).


Overall, the rates of mutagenesis created here over a few months in one protein approximate the kind of divergence seen between proteins of humans and mice, which have diverged for about sixty million years. The same studies one can do on such naturally diverged proteins, such as locating selectively important amino acid residues, or comparing activities of highly divergent enzymes, or studying structural constraints, one can do here on artificially evolved enzymes. And this is a general system that could be (with appropriate assays and technology) extended to many other proteins and RNAs of interest. 

One thing it can't do, however, is validate machine learning models. The researchers tried to get machine learning models that had been trained on this TrpB enzyme to classify their derived mutants. But this was almost completely unsuccessful, since machine learning (AI) systems only regurgitate what they are trained on, and can not creatively judge novel conditions.

"Although sequences that were predicted to have low fitness did exhibit little or no function in our enrichment assay, we found essentially no correlation between the predicted scores and the real enrichment scores of high-function TrpBs. For example, the highest predicted score was assigned to the nearly nonfunctional TmTriple variant."

It is important to appreciate the significance of this new mutation system, which is far more comprehensive, and a closer model of actual evolution, than are the genetic screens of yore. There, one was hunting for the "hopeful monster" resulting from one shot of X-rays, that might generate an informative phenotype- maybe by killing a gene needed for red eye color, or amplifying expression of a gene for drug resistance. Here, the levels of negative and positive selection can be subtly adjusted in a background of continuous high mutation pressure simulating millions of years of evolution, and resulting in extensively transformed target molecules.


  • Total lies come naturally to RFK Jr., as to so many in this administration.
  • With the help of crypto, our banks are not-so-unwitting conduits for crime.

Saturday, May 31, 2025

An Arms Race at the Tiniest Scale

Defense and anti-defense against genetic attack by plasmids.

Bacteria have a pretty active sex life. And like for us, this involves defense and offense, in a complicated tango of genetic exchange. Only, for bacteria, conjugation mechanisms are overwhelmingly used for attack, even though they are also the conduit of great innovations like horizontal gene transfer from different species, and antibiotic resistance. The top priority of most bacteria most of the time is to defend against alien DNA, which most of the time consists of selfish genetic elements and viruses, and they have several mechanisms to do so.

Plasmids are a very common feature of bacteria, and fundamental to genetic engineering, as the primary form of manipulated DNA. Plasmids can be amplified to high copy number in bacteria, can be cut, altered, and ligated back together- the very essence of engineering. This dates me, but I remember preparing plasmids from bacterial cultures by cesium chloride isopycnic (high speed) centrifugation, after which the plasmid (marked by the poison ethidium) would end up swimming in the middle of the gradient, and have to be sucked out with a syringe, before further steps to clean up the purified DNA. It was a rather messy, expensive, uncertain, slow, and unsafe process. 

Summary of the current paper, outlining plasmid transfer, plasmid genome structure, and some components (genes and promoters) of the leading strand as it is transferred to target cells.

Anyhow, plasmids in the wild are typically aggressive genetic elements that carry (encode) some or all of the components needed to form an injection attack complex (i.e. type IV secretion system) that conjugates with other bacteria. Plasmids can also carry many other things, like antibiotic resistance, or stray genes from prior bacterial hosts, or transposing genetic elements. So most of the time, bacteria want to defend themselves against this kind of invasion, even though some of the time the gifts they bring can end up being highly beneficial.

One of their defenses is the restriction system. Another foundation of genetic engineering was and remains the restriction enzyme. These are enzymes that cut DNA at a particular sequence. One can imagine how useful it can be to have such specific scissors for these infinitesimal molecules, and most labs would have a freezer filled with a large library of such enzymes that could be used for breaking and reconstructing new (plasmid) molecules, and also for analyzing them by their pattern of "restriction" sites. In the wild, these enzymes are paired with DNA methylation enzymes that make the host genome invisible to the restriction enzyme, leaving only newly arrived alien DNA susceptible to cleavage, and thus destruction, by these enzymes. 
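The logic of restriction plus protective methylation can be sketched in a few lines of Python. This is only a toy, under stated assumptions: it uses the well-known EcoRI site (GAATTC, cut between the G and the first A), treats the DNA as a linear top strand rather than a circular double helix, and the `digest` helper, the `methylated` argument, and the example sequence are all made up for illustration.

```python
# Toy model of a restriction enzyme and its paired methylase.
# EcoRI recognizes GAATTC and cuts after the G; everything else here
# (function names, the example sequence) is hypothetical.
SITE = "GAATTC"
CUT_OFFSET = 1  # cut falls after the first base of the site


def find_sites(dna, site=SITE):
    """Return the 0-based start position of every occurrence of the site."""
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def digest(dna, methylated=frozenset()):
    """Cut at every unmethylated site and return the resulting fragments."""
    cuts = [p + CUT_OFFSET for p in find_sites(dna) if p not in methylated]
    fragments, previous = [], 0
    for cut in cuts:
        fragments.append(dna[previous:cut])
        previous = cut
    fragments.append(dna[previous:])
    return fragments


incoming = "ATGAATTCGGCTAGAATTCTT"  # made-up sequence with two EcoRI sites
print(digest(incoming))                       # unprotected alien DNA is cut
print(digest(incoming, methylated={2, 13}))   # methylated (host-like) DNA is not
```

The same sequence is chopped into three pieces when unprotected, and left whole when both sites are marked as methylated, which is the distinction the host relies on.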

Another defense is the now-famous CRISPR system. Bacteria capture small bits of invading genomes, and, assuming they survive, knit them into special genetic modules in their own chromosomes. Then they express these small modules as RNAs that latch onto an enzyme called Cas9, (or related enzyme plus RNA systems), which are nucleases that are guided by these RNAs to cut and inactivate invading DNA that matches the RNA sequence. This system is noted as a sort of adaptive immune system that learns from experience, and passes its knowledge down genetically to future generations.
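As a rough illustration of this guided matching, here is a minimal Python sketch. The sequences and the `find_targets` helper are invented for illustration. It also assumes one detail not covered above: the commonly used Cas9 from Streptococcus pyogenes cuts only when the matching stretch is followed by an "NGG" motif (the PAM). Real targeting tolerates some mismatches, which this sketch ignores.

```python
# Toy search for Cas9 targets: exact spacer match followed by an NGG PAM.
# All names and sequences are hypothetical.

def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_targets(spacer, dna):
    """List (strand, start, pam) for each spacer match that has an NGG PAM.

    For the "-" strand, start is counted along the reverse complement.
    """
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        start = seq.find(spacer)
        while start != -1:
            pam = seq[start + len(spacer): start + len(spacer) + 3]
            if len(pam) == 3 and pam[1:] == "GG":
                hits.append((strand, start, pam))
            start = seq.find(spacer, start + 1)
    return hits


spacer = "GATTACAGATTACAGATTAC"          # remembered bit of an old invader
invader = "TTT" + spacer + "TGG" + "AAA"  # same sequence, with a PAM
bystander = "TTT" + spacer + "TAA" + "AAA"  # same sequence, no PAM

print(find_targets(spacer, invader))
print(find_targets(spacer, bystander))
```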

More detailed maps of three example plasmid genomes, showing the distribution of anti-defense and other genes at the leading edge of the genome (left end). Red and yellow colored genes are anti-defense genes of various kinds. The tiny arrows are promoters, each of the early-start kind that can operate on single-stranded DNA.

A recent paper discussed specialized anti-defenses that plasmids have against these and other bacterial defenses, including toxin/suicide systems and inducible stress responses. For it is naturally an arms race with innovation on both sides. Plasmids have an origin of replication where new DNA strands start, and this new strand is what is injected into the target cell. So there is a linear order of DNA and thus genes going into the target cell, head to tail. These researchers find that the anti-defense genes of plasmids tend to be bunched up at the head of the genome, where they get into the target cell first. These regions also have a special way to fold up their single-stranded DNA (into a cruciform shape) that forms immediate promoters that allow these "early" genes to be transcribed by the host RNA polymerase, before replication in the host cell has regenerated normal double-stranded DNA. 

The authors do a very wide survey of species and plasmids, and find that there is a wide variety of anti-defense genes, many of which have unknown functions. Among the known ones are: anti-restriction inhibitors, which directly bind and inactivate restriction enzymes, methyltransferases (MTase) that methylate restriction sites to make them look like host DNA, single strand binding proteins (SSB), which coat the single stranded DNA and protect it from detection as single stranded DNA, and may assist in repair after Cas9 or restriction cleavage, and antitoxins that inhibit the toxin systems or the SOS system. There are also anti-CRISPR proteins, which degrade the Cas9 enzyme, or inhibit its binding to target DNA.

Experiment to demonstrate the importance of being first... into the target cell. Targeting means that the bacterial cell has a CRISPR system targeting the incoming plasmid. acrIIA4 is an anti-CRISPR protein that effectively blocks cleavage of plasmid DNA by the CRISPR/Cas9 enzyme. The petri plate exhibits bacteria that can grow only if plasmid transfer was successful. See text for further details.

They finish with an elegant experiment that asks how important it is for the anti-defense gene to be at the front of the plasmid. The gene they chose was acrIIA4, which is an anti-CRISPR that very efficiently inhibits Cas9 cleavage of targeted DNA. The petri plate at top shows the growth of infected bacteria after infection by the experimental plasmid, selected for plasmid presence on antibiotic medium. The grey bar in both diagrams is the control, representing cells whose CRISPR system does not target this plasmid. The plasmid transfers fine, and the cells grow fine. In contrast, if the cell's CRISPR system does target the plasmid, lack of acrIIA4 is fatal (top), decreasing plasmid transfer by about three logs, or a thousand fold. Putting the acrIIA4 gene at the tail end of the plasmid (middle experiments) helps a little, and transfer is knocked down only a hundred fold. Putting the anti-defense gene at the front of the plasmid (leading), though, corrects plasmid defense almost fully, and transfer is down a few fold only. Lastly, if acrIIA4 is placed in opposite orientation (inverted), such that plasmid replication is needed before this gene can be expressed (it can't use the cruciform single-stranded promoter), it is virtually useless. So indeed, being first off the block when invading a target cell is critically important, since the host cell makes its defenses all the time- they are ready and waiting.

While most of these transfers are unwanted, sometimes plasmids integrate into bacterial genomes, and then when they start transferring themselves into other cells, they can bring along huge amounts of their host genomes. That starts to look like serious genetic exchange that begins to approximate sex in eukaryotes. So, some balance of defense, offense, and beneficial exchange is the lifeblood of ecology and evolution at this most ancient scale of life.


Sunday, April 13, 2025

The Genome Remains Murky

A brilliant case study identifying the molecular cause of certain neuro-developmental disorders shows how difficult genome-based diagnoses remain.

Molecular medicine is increasingly effective in assessing both hereditary syndromes and cancers. The sequencing approach generally comes in two flavors- whole genome sequencing, or exome sequencing, where only the most important (protein-coding) parts are sampled. In each case, the hunt is for mutations (more blandly called variants) that cause the syndrome being investigated, from among the large number of variants we all carry. This approach is becoming standard-of-care in oncology, due to the tremendous influence and clinical significance of cancer-driving mutations, many of which now match directly to tailored treatments that address them (thus the "precision" in precision medicine).

But another arm of precision medicine is the hunt for causes of congenital problems. There are innumerable genetic disorders whose causal analysis can lead not only to an informative diagnosis, and sometimes to useful treatments, but also to fundamental understanding of human biology. Sufferers of these syndromes may spend a lifetime searching for a diagnosis, being shuffled from one doctor or center to another and subject to various forms of hypothetical medicine, before some deep sequencing pinpoints the cause of their disease and founds a new diagnostic category that provides, if not relief, at least understanding and a medical home. 

A recent paper from Britain provided a classic of this form, investigating the causes of neurodevelopmental disorders (NDD), which encompass a huge range of problems from mild to severe. They comment that even after the most modern analysis and intensive sequencing, 60% of NDD cases still cannot be assigned causes. A large part of the problem is that, despite knowing the full sequence of the human genome, its function is less well-understood. The protein-coding genes (20,000 of those, roughly) are delineated and studied pretty closely. But that only accounts for 1 to 2% of the genome. The rest ranges from genes for a blizzard of non-coding RNAs, some of which are critical, to large regulatory regions with smatterings of important sites, to junk of various kinds- pseudogenes, relic retroviruses, repetitive elements, etc. The importance of any of these elements (and individual DNA base positions within them) varies tremendously. This means specifically that exome sequencing is not going to cut it. Exome sequencing focuses on a very small part of the genome, which is fine if your syndrome (such as a common cancer) is well characterized and known to arise from the usual suspects. But for orphan syndromes, it does not cast a wide enough net. Secondly, even with full genome sequencing, so little is known about the remoter regions of the genome that assigning a function to variations found there is difficult to impossible. It takes statistical analysis of the incidence of the variation versus the incidence of the syndrome.

These authors used a trove of data- the Genomics England 100,000 genomes project, focusing on the ~9,000 genomes in this collection from people with NDD syndromes. (Plus additional genomes collected elsewhere.) (We can note in passing that Britain's nationalized health system remains at the forefront of innovative research and care.) What they found was an unusually high incidence of a particular mutation in a non-protein-coding gene called RNU4-2. The product of this gene is an RNA called U4, which is an important part of the spliceosome, where it pairs RNA-to-RNA with another RNA, U6, in a key step of selecting the first (5-prime) side of an intron that is to be spliced out of mRNA messages. This gene would never have come up in exome analysis, being non-protein-coding. Yet it is critically important, as splicing happens to the vast majority of human genes. Additionally, differential splicing- the selection of alternative exons and splice sites in a regulated way- happens frequently in developmental programs and neurological cell types. There is a class of syndromes called spliceosomopathies that are caused by defects in mRNA splicing, and tend to appear as syndromes in these processes.

As shown in the images (all based on a large corpus of other work on spliceosomes), RNU4-2/U4 pairs intimately with the U6 spliceosomal RNA, and the mutation found by the current group (which is a single nucleotide insertion) causes a bulge in this pairing, as marked. Meanwhile, the U6 RNA pairs at the same time with the exon-intron junction of the target mRNA (bottom image), at a site that is very close to the U4 pairing region (top image). The upshot is that this single base insertion into U4 causes some portion of the target mRNAs to be mis-spliced, using non-natural 5 prime splice sites and thus altering their encoded proteins. This may cause minor problems in the protein, but more often will cause a shift in translation frame, a premature stop codon, and total loss of the functional protein. So this tiny mutation can have severe effects and is indeed genetically dominant- that is, one copy overrides a second wild-type copy to generate the NDD diseases that were studied.
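To see why picking a 5-prime splice site just one base off can be so destructive, here is a small Python sketch. It is a toy under stated assumptions: the "gene" is an invented 36-base sequence, written in DNA letters (T rather than U), the `splice` and `translate` helpers are hypothetical, and the cryptic donor position is chosen by hand rather than predicted from any spliceosome model. Only the standard genetic code table is real.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def splice(pre_mrna, donor, acceptor):
    """Remove the intron: keep bases before donor and from acceptor onward."""
    return pre_mrna[:donor] + pre_mrna[acceptor:]


def translate(mrna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


# Invented toy gene: two exons around one intron (GT...AG).
EXON1, INTRON, EXON2 = "ATGGCCAAG", "GTAAGTCCCTTTTAG", "GATCTGACCTAA"
PRE_MRNA = EXON1 + INTRON + EXON2
ACCEPTOR = len(EXON1) + len(INTRON)

normal = splice(PRE_MRNA, donor=len(EXON1), acceptor=ACCEPTOR)
shifted = splice(PRE_MRNA, donor=len(EXON1) - 1, acceptor=ACCEPTOR)  # one base early

print(translate(normal))   # full-length toy protein
print(translate(shifted))  # frame is shifted and hits a stop codon early
```

In this toy case the correctly spliced message encodes six amino acids, while the one-base-off version goes out of frame after the junction and terminates after four.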

The U4 RNA (teal) paired with the U6 RNA (gray), within an early spliceosome complex. The mutation studied here is pointed out in black (n.64_65insT - i.e. insertion of a T). Note how it would cause a bulge in the pairing. Importantly, the location in the U6 RNA that pairs with the mRNA (see below) is right next door, at the ACAGAGA (light gray). The authors use this structural work from others to suggest how the mutation they found can alter selected splicing sites and thus lead to disease. Other single nucleotide insertions that cause similar syndromes are marked with black arrows, while single nucleotide substitutions that cause less severe syndromes are marked with orange RNA segments.

The U6 RNA (pink) paired with its mRNA target to be spliced. It binds right at the intron (gray) exon (black) boundary, where the cut will eventually be made to remove the intron. The bump from the mis-paired mutant U4 RNA (see above) distorts this binding, sending U6 to select wrong locations for splicing.


The researchers went on to survey this and other spliceosomal RNA genes for similar mutations, and found few to none outside the region marked in the diagram above. For example, there is a highly similar gene called RNU4-1. But this gene is expressed about 100-fold less in brain and other tissues, making RNU4-2 the principal source of U4 RNA, and much more significant as a causal factor for NDD. It appears that other locations in RNU4-2 (and other spliceosomal RNA genes) are even more important than the one mutated location found here, thus are never found, being lethal and heavily selected against, in this highly conserved gene. 

They also noted that, while this RNU4-2 mutation is severe, and thus must happen spontaneously (i.e. not inherited from parents), it only occurs on the maternal alleles, not paternal alleles in the affected children. They speculate that this may be due to effects this gene may have in male gametogenesis, killing affected sperm preferentially, but not affected oocytes. Lastly, this set of mutations (in the small region shown in the first figure above) appears to account for, in their estimation, about 0.4 % of all NDD seen in Britain. This is a remarkably high rate for such a particular mutation that is not heritable. They speculate that some mutation hotspot kind of process may be causing these events, above the general mutation rate. What this all says about so-called "intelligent design", one may be reluctant to explore too deeply. On the other hand, this still leaves plenty of room to hunt for additional variations that cause these syndromes.

In this research, we see that clinically critical variations can pop up in many places, not just among the "usual suspects", genetically and genomically speaking. While much of the human genome is junk, most of it is also expressed (as RNA) and all of it is fair game for clinically important (if tragic) effects. The NDD syndromes caused by the mutation studied here are very severe- far more so than the ADD or mild autism diagnoses that make up most of the NDD spectrum. Understanding the causal nexus between the genome and human biology and its pathologies remains an ongoing and complicated scientific adventure.


  • Playing the heel. Being the heel
  • It sure is great to be the victim.
  • Oh, right.. now we really know what is going on.
  • More spiritual warfare.
  • Another grift.

Saturday, February 15, 2025

Cloudy, With a Chance of RNA

Long RNAs play structural and functional roles in regulation of chromosome replication and expression.

One of the wonderful properties of the fruit fly as a model system of genetics and molecular biology has been its polytene chromosomes. These are hugely expanded bundles of chromosomes, replicated thousands of times, which have been observed microscopically since the late 1800's. They exist in the larval salivary gland, where huge amounts of gene expression are needed, thus the curious evolutionary solution of expanding the number of templates, not only of the gene needed, but of the entire genome. 

These chromosomes were closely mapped and investigated, almost like runic keys to the biology of the fly, especially in the day before molecular biology. Genetic translocations, loops, and other structural variations could be directly observed. The banding patterns of light, dark, expanded, and compressed regions were mapped in excruciating detail, and mapped to genetic correlates and later to gene expression patterns. These chromosomes provided some of the first suggestions of heterochromatin- areas of the genome whose expression is shut down (repressed). They may have genes that are shut off, but they may also be structural components, such as centromeres and telomeres. These latter areas tend to have very repetitive DNA sequences, inherited from old transposons and other junk.

A diagram of polytene chromosomes, bunched up by binding at the centromeres. The banding pattern is reproducible and represents differences in proteins bound to various areas of the genome, and gene activity.

It has become apparent that RNA plays a big role in managing these areas of our chromosomes. The classic case is the XIST RNA, which is a long (17,000 bases) non-coding RNA that forms a scaffold by binding to lots of "heterogeneous" RNA-binding proteins, and most importantly, stays bound near the site of its creation, on the X chromosome. Through a regulatory cascade that is only partly understood, the XIST RNA is turned off on one of the X chromosomes, and turned on on the other one (in females), leading the XIST molecule to glue itself to its chromosome of origin, and then progressively coat the rest of that chromosome and turn it off. That is, one entire X is turned into heterochromatin by a process that requires XIST scaffolding all along its length. That results in "dosage compensation" in females, where one X is turned off in all their cells, allowing dosage (that is, the gene expression) of its expressed genes to approximate those of males, despite the presence of the extra X chromosome. Dosage is very important, as shown by Down Syndrome, which originates from a duplication of one of the smallest human chromosomes, creating imbalanced gene dosage.

A recent paper described work on "ASAR" RNAs, which similarly arise from highly repetitive areas of human chromosomes, are extremely long (180,000 bases), and control expression and chromosome replication in an allele-specific way on (at least) several non-X chromosomes. These RNAs, again, like XIST, specifically bind a bunch of heteronuclear binding proteins, which is presumably central to their function. Indeed, these researchers dissected out the 7,000 base segment of ASAR6 that is densest in protein binding sites, and found that, when transplanted into a new location, this segment has dramatic effects on chromosome condensation and replication, as shown below.

The intact 7,000 base core of ASAR6 was transplanted into chromosome 5, and mitotic chromosomes were spread and stained. The blue is a general DNA stain. The green is a stain for newly synthesized DNA, and the red is a specific probe for the ASAR6 sequence. One can see on the left that this chromosome 5 is replicating more than any other chromosome, and shows delayed condensation. In contrast, the right frame shows a control experiment where an anti-sense version of the ASAR6 7,000 base core was transplanted to chromosome 5. The antisense sequence not only does not have the wild-type function, but also inhibits any molecule that does by tightly binding to it. Here, the chromosome it resides on (arrows) is splendidly condensed, and hardly replicating at all (no green color).


Why RNA? It has become clear over the last two decades that our cells, and particularly our nuclei, are swimming with RNAs. Most of the genome is transcribed in some way or other, despite a tiny proportion of it coding for anything. 95% of the RNAs that are transcribed never get out of the nucleus. There has been a growing zoo of different kinds of non-coding RNAs functioning in translational control, ribosomal maturation, enhancer function, and here, in chromosome management. While proteins tend to be compact bundles, RNAs can be (as these ASARs are) huge, especially in one dimension, and thus capable of physically scaffolding the kinds of structures that can control large regions of chromosomes.

Chromosomes are sort of cloudy regions in our cells, long a focus of observation and clearly also a focus of countless proteins and now RNAs that bind, wind, disentangle, transcribe, replicate, and congregate around them. What all these RNAs and especially the various heteronuclear proteins actually do remains pretty unclear. But they form a sort of organelle that, while it protects and manages our DNA, remarkably also allows access to it for sequence-specific binding proteins and the many processes that plow through it.

"In addition, recent studies have proposed that abundant nuclear proteins such as HNRNPU nonspecifically interact with ‘RNA debris’ that creates a dynamic nuclear mesh that regulates interphase chromatin structure."


Saturday, February 1, 2025

Proving Evolution the Hard Way

Using genomes and codon ratios to estimate selective pressures was so easy... why is it not working?

The fruits of evolution surround us with abundance, from the tallest tree to the tiniest bacterium, and the viruses of that bacterium. But the process behind it is not immediately evident. It was relatively late in the enlightenment before Darwin came up with the stroke of insight that explained it all. Yet that mechanism of natural selection remains an abstract concept requiring an analytical mind and due respect for very inhuman scales of the time and space in play. Many people remain dumbfounded, and in denial, while evolutionary biology has forged ahead, powered by new discoveries in geology and molecular biology.

A recent paper (with review) offered a fascinating perspective, both critical and productive, on the study of evolutionary biology. It deals with the opsin protein that hosts the visual pigment 11-cis-retinal, by which we see. The retinal molecule is the same across all opsins, but different opsin proteins can "tune" the light wavelength of greatest sensitivity, creating the various retinal-opsin combinations for all visual needs, across the cone cells and rod cells. This paper considered the rhodopsin version of opsin, which we use in rod cells to perceive dim light. They observed that in fish species, the sensitivity of rhodopsin has been repeatedly adjusted to accommodate light at different depths of the water column. At shallow levels, sunlight is similar to what we see, and rhodopsin is tuned to about 500 nm, while deeper down, when the light is more blue-ish, rhodopsin is tuned towards about 480 nm maximum sensitivity. There are also special super-deep fish who see by their own red-tinged bioluminescence, and their rhodopsins are tuned to 526 nm. 

This "spectrum" of sensitivities of rhodopsin has a variety of useful scientific properties. First, the evolutionary logic is clear enough, matching the fish's vision to its environment. Second, the molecular structure of these opsins is well-understood, the genes are sequenced, and the history can be reconstructed. Third, the opsin properties can be objectively measured, unlike many sequence variations which affect more qualitative, difficult-to-observe, or impossible-to-observe biological properties. The authors used all this to carefully reconstruct exactly which amino acids in these rhodopsins were the important ones that changed between major fish lineages, going back about 500 million years.

The authors' phylogenetic tree of fish and other species they analyzed rhodopsin molecules from. Note how mammals occupy the bottom small branch, indicating how deeply the rest of the tree reaches. The numbers in the nodes indicate the wavelength sensitivity of each (current or imputed) rhodopsin. Many branches carry the authors' inference, from a reconstructed and measured protein molecule, of what precise changes happened, via positive selection, to get that lineage.

An alternative approach to evolutionary inference is a second target of these authors. That is a codon-based method, that evaluates the rate of change of DNA sites under selection versus sites not under selection. In protein coding genes (such as rhodopsin), every amino acid is encoded by a triplet of DNA nucleotides, per the genetic code. With 64 codons for ~20 amino acids, it is a redundant code where many DNA changes do not change the protein sequence. These changes are called "synonymous". If one studies the rate of change of synonymous sites in the DNA, (which form sort of a control in the experiment), compared with the rate of change of non-synonymous sites, one can get a sense of evolution at work. Changing the protein sequence is something that is "seen" by natural selection, and especially at important positions in the protein, some of which are "conserved" over billions of years. Such sites are subject to "negative" selection, which is to say rapid elimination due to the deleterious effect of that DNA and protein change.
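The distinction between synonymous and non-synonymous changes is easy to make concrete. The following Python sketch builds the standard genetic code and classifies single-base changes; the `classify` and `tally_single_base_changes` helpers are names invented here, and the sketch says nothing about which changes selection actually favors.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify(before, after):
    """Label a codon change by its effect on the encoded amino acid."""
    if CODON_TABLE[before] == CODON_TABLE[after]:
        return "synonymous"
    if CODON_TABLE[after] == "*":
        return "nonsense"  # creates a stop codon
    return "non-synonymous"


def tally_single_base_changes():
    """Classify every possible one-base change out of each sense codon."""
    counts = Counter()
    for codon, amino in CODON_TABLE.items():
        if amino == "*":
            continue  # skip changes starting from a stop codon
        for pos in range(3):
            for base in BASES:
                if base != codon[pos]:
                    mutant = codon[:pos] + base + codon[pos + 1:]
                    counts[classify(codon, mutant)] += 1
    return counts


print(classify("GGT", "GGC"))  # glycine to glycine
print(classify("GGT", "GAT"))  # glycine to aspartate
print(tally_single_base_changes())
```

Most of the silent changes fall at the third position of the codon, which is why that position serves as the built-in control in these analyses.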

Mutations in protein coding sequence can be synonymous, (bottom), with no effect, or non-synonymous (middle two cases), changing the resulting protein sequence and having some effect that may be biologically significant, thus visible to natural selection.


This analysis has been developed into a high art, also being harnessed to reveal "positive" selection. In this scenario, if the rate of change of the non-synonymous DNA sites is higher than that of the synonymous sites, or even just higher than one would expect by random chance, one can conclude that these non-synonymous sites were not just not being selected against, but were being selected for, an instance of evolution establishing change for the sake of improvement, instead of avoiding change, as usual.
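The ratio at the heart of these methods can be sketched as follows. This is a deliberately crude Python version, not the method used in any of the papers discussed: it counts synonymous "sites" per codon in the general spirit of classic counting approaches, skips codons that differ at more than one position, and applies no correction for multiple hits. The function names and the two example sequences are made up.

```python
from fractions import Fraction
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codons(seq):
    """Split a coding sequence into complete codons."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def synonymous_sites(codon):
    """Number of the codon's 3 sites counted as synonymous (fractional)."""
    total = Fraction(0)
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    total += Fraction(1, 3)
    return total


def dn_ds(seq_a, seq_b):
    """Crude non-synonymous / synonymous rate ratio for two aligned sequences."""
    pairs = list(zip(codons(seq_a), codons(seq_b)))
    syn_sites = sum((synonymous_sites(a) + synonymous_sites(b)) / 2 for a, b in pairs)
    nonsyn_sites = 3 * len(pairs) - syn_sites
    syn_diffs = nonsyn_diffs = 0
    for a, b in pairs:
        if sum(x != y for x, y in zip(a, b)) != 1:
            continue  # identical codons, or multi-hit codons this sketch skips
        if CODON_TABLE[a] == CODON_TABLE[b]:
            syn_diffs += 1
        else:
            nonsyn_diffs += 1
    if syn_diffs == 0 or nonsyn_sites == 0:
        return None  # ratio undefined without any synonymous change
    return (Fraction(nonsyn_diffs) / nonsyn_sites) / (Fraction(syn_diffs) / syn_sites)


# Made-up pair: one silent change (GCT->GCC) and one replacement (GGT->GAT).
ratio = dn_ds("ATGGCTAAAGGT", "ATGGCCAAAGAT")
print(float(ratio))
```

A ratio well below one suggests that changes to the protein are mostly being weeded out, while a ratio above one is what these methods read as positive selection.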

Now back to the rhodopsin study. These authors found that a very small number of amino acids in this protein, only 15, were the ones that influenced changes to the spectral sensitivity of these protein complexes over evolutionary time. Typically only two or three changes occurred over a shift in sensitivity in a particular lineage, and would have been the ones subject to natural selection, with all the other changes seen in the sequence being unrelated, either neutral or selected for other purposes. It is a tour de force of structural analysis, biochemical measurement, and historical reconstruction to come up with this fully explanatory model of the history of piscine rhodopsins.

But then they went on to compare what they found with what the codon-based methods had said about the matter. And they found that there was no overlap whatsoever. The amino acids identified by the "positive selection" codon-based methods were completely different than the ones they had found by spectral analysis and phylogenetic reconstruction over the history of fish rhodopsins. The accompanying review is particularly harsh about the pseudoscientific nature of this codon analysis, rubbishing the entire field. There have been other, less drastic, critiques as well.

But there is method to all this madness. The codon-based methods were originally conceived in the analysis of closely related lineages. Specifically, various Drosophila (fly) species that might have diverged over a few million years. On this time scale, positive selection has two effects. One is that a desirable amino acid (or other) variation is selected for, and thus swept to fixation in the population. The other, and corresponding effect, is that all the other variations surrounding this desirable variation (that is, which are nearby on the same chromosome) are likewise swept to fixation (as part of what is called a haplotype). That dramatically reduces the neutral variation in this region of the genome. Indeed, the effect on neutral alleles (over millions of nearby base pairs) is going to vastly overwhelm the effect from the newly established single variant that was the object of positive selection, and this imbalance will be stronger the stronger the positive selection. In the limit case, the entire genomes of those without the new positive trait/allele will be eliminated, leaving no variation at all.

Yet, on the longer time scale, over hundreds of millions of years, as was the scope of visual variation in fish, all these effects on the neutral variation level wash out, as mutation and variation processes resume, after the positively selected allele is fixed in the population. So my view of this tempest in an evolutionary teapot is that these recent authors (and whatever other authors were deploying codon analysis against this rhodopsin problem) are barking up the wrong tree, mistaking the proper scope of these analyses. Which, after all, focus on the ratio between synonymous and non-synonymous change in the genome, and thus intrinsically on recent change, not deep change in genomes.


  • That all-American mix of religion, grift, and greed.
  • Christians are now in charge.
  • Mechanisms of control by the IMF and the old economic order.
  • A new pain med, thanks to people who know what they are doing.

Saturday, December 21, 2024

Inside the Process of Speciation

Adaptive radiations are messy, so no wonder we have a hard time reconstructing them.

Darwin drew a legendary diagram in his great book, of lineage trees tracing speciation from ancestors to descendants. It was just a sketch, and naturally had clear fork points where one species turns into two. But in real life, speciation is messier, with range overlaps, inter-breeding, and difficulties telling species apart. Ornithologists are still lumping and splitting species to this day, as more data come in about ranges, genetics, sub-populations, breeding behavior, etc. And if defining existing species is difficult, defining exactly where they split in the distant past is even harder.

Darwin's notebook sketch of speciation, from ancestors ... to descendants.

The advent of molecular data from genomes gave a tremendous boost to the amount of information on which to base phylogenetic inferences. It gave us a whole new domain of life, for one thing. And it has helped sharpen countless phylogenies that had not been fully specified by fossil and morphological data. But still, difficulties remain. The deepest and most momentous divergences, like the origin of life itself, and the origin of eukaryotes, remain shrouded in hazy and inconclusive trees, as do many other lineages, such as the origin of birds. It seems to be a rule that when a group of organisms undergoes rapid evolution / speciation, the tree they are on (as reconstructed by us from contemporary data) becomes correspondingly unclear and unresolved, difficult to trace through that tumultuous time. In part this is simply a matter of timing. If dramatic events happened within a few million years a billion years ago, our ability to resolve the sequence of those events is going to be weak in any case, compared to the same events spread out over a hundred million years.

A recent paper documented some of this about phylogeny in general, by correlating times of morphological change with times of phylogenetic haziness, which they term "gene-tree conflict". That is to say, if one samples genes across genomes to draw phylogenetic trees, different genes will give different trees. And this phenomenon increases right when there are other signs of rapid evolutionary change, i.e. changing morphology.

"One insight gleaned from phylogenomics is that gene-tree conflict, frequently caused by population-level processes, is often rampant during the origin of major lineages."

They identify three mechanisms behind this observation: incomplete lineage sorting (ILS), hybridization, and rapid evolution. Obviously, these need to be unpacked a bit. ILS is a natural consequence of the fact that species arise not from single organisms, but from populations. Gene mutations that differentiate the originating and future species happen all over the respective genomes, and enter the future lineage at different times. Some may happen well after the putative speciation event, and become fixed (that is, prevalent) later in that species. Others may have happened well before the speciation event, and die off in most of the descending lineages. The fact is that not every gene is going to march in lock step with the speciation event, in terms of its variants. So phylogenetic inference is best done using lots of genes plus statistical methods to arrive at the most likely explanation of the diverse individual gene trees.
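To make the ILS idea concrete, here is a minimal sketch, not anything taken from the paper. It assumes the textbook three-species case ((A,B),C) with one sampled lineage per species, and measures the internal branch (the time A and B share an ancestral population before joining C) in coalescent units. Under those assumptions, a gene tree disagrees with the species tree with probability (2/3)·exp(-T). All function names here are made up for illustration.

```python
import math
import random


def simulate_gene_tree(t_internal, rng):
    """Return the pair of species whose gene lineages coalesce first."""
    # Looking backward in time, the A and B lineages share an ancestral
    # population for t_internal coalescent units. The waiting time for
    # two lineages to coalesce is exponential with rate 1 in those units.
    if rng.expovariate(1.0) < t_internal:
        return ("A", "B")  # sorted in time: gene tree matches species tree
    # Otherwise all three lineages reach the root population unsorted,
    # and any of the three pairs is equally likely to coalesce first.
    return rng.choice([("A", "B"), ("A", "C"), ("B", "C")])


def discordance_fraction(t_internal, n_genes=20000, seed=1):
    """Fraction of simulated genes whose tree conflicts with ((A,B),C)."""
    rng = random.Random(seed)
    discordant = sum(
        simulate_gene_tree(t_internal, rng) != ("A", "B")
        for _ in range(n_genes)
    )
    return discordant / n_genes


def expected_discordance(t_internal):
    """Analytic expectation for the same three-species model."""
    return (2.0 / 3.0) * math.exp(-t_internal)


if __name__ == "__main__":
    # Short internal branches (rapid successive splits) give more conflict.
    for t in (0.1, 0.5, 1.0, 3.0):
        print(t, round(discordance_fraction(t), 3),
              round(expected_discordance(t), 3))
```

The point of the toy model is that conflict rises steeply as the interval between successive splits shrinks, which is exactly the situation in a rapid radiation.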

Graphs drawn from different sources relating gene conflicts in lineage estimation, (top), versus rate of morphological change from the fossil record, (bottom), in birds, and over time on the X axis. There are dramatic upticks in all metrics going back towards the end-Cretaceous extinction event.


Similarly, hybridization means that proto-species are still occasionally interbreeding with their ancestors or other relatives, (think of Neanderthals), thereby mixing up the gene trees relative to the overall speciation tree. This can even happen by gene transfer mediated by viruses. "Rapid evolution" is not defined by these authors, and comes dangerously close to using the conclusion (of high morphological change during periods of "gene-tree conflict") to describe their premise. But generally, this would mean that some genes are evolving rapidly, due to novel selective pressures, thus deviating from the general march of neutral evolution that affects most loci more evenly. This rate change can mess up phylogenetic inferences, lengthening some (gene) tree branches versus others, and making a unitary tree (that is, for the species or lineage as a whole) hard to draw.

But these are all rather abstract ideas. How does this process look on the ground? A wonderful paper on the tomato gives us some insight. This group traced the evolutionary history of the wild tomato clade (Solanum sect. Lycopersicon, a section of the genus Solanum) in the South American Andes (plus the Galapagos islands offshore, interestingly enough). These form a tight group of about thirteen species that evolved from a single ancestor over the last two million years, before jumping onto our lunch plates via intensive breeding by native South Americans. This has been a rapid process of evolution, and phylogenies have been difficult to draw, for all the reasons given above. The tomatoes are mostly reproductively isolated, but not fully, and have various specializations for their microhabitats. So are they real species? And how can they evolve and specialize if they do not fully isolate from each other?

Gene-based phylogenetic tree of Andean tomato species. The consensus tree is in black at the right, while alternate trees (cloud) are drawn from 2,745 windows of 100 kb across the tomato genomes, clearly giving diverse views of the lineage tree. Lycopersicon are the species under study, while Lycopersicoides is an "outgroup" used as a control / comparison.

In the graph above, there is, as they say, rampant discord among genomic segments, versus the overall consensus tree that they arrived at:

"However, these summary support measures conceal rampant phylogenetic complexity that is evident when examining the evolutionary history of more defined genomic partitions."

For one thing, much of the sequence diversity in the ancestor survives in the descendant lineages. The founders were not single plants, by any means. Second, there has been a lot of "introgression", which is to say, breeding / hybridization between lineages after their putative separation.

Lastly, they find a high rate of novel mutations, often subject to clear positive selection. Ten enzymes in the carotenoid biosynthesis pathway, which affects fruit color in a group that has evolved red fruits, carry novel mutations. A UV light damage repair gene shows strong signs of positive selection in high-altitude species. Others show novel mutations in a temperature stress response gene, and selection on genes defending plants against heavy metals in the soil.
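The discord among genomic windows mentioned earlier can be summarized very simply: build one tree per window, then tally how often each branching order turns up. The sketch below is only an illustration of that bookkeeping, not the authors' pipeline. It assumes each window's rooted tree is already available as nested tuples of species labels, and the four toy windows and the labels A, B, C are invented.

```python
from collections import Counter


def canonical(tree):
    """Return an order-independent form of a nested-tuple rooted tree."""
    if isinstance(tree, str):
        return tree  # a leaf (species label)
    # Sort children by their printed form so that ((A,B),C) and (C,(B,A))
    # are recognized as the same branching order.
    return tuple(sorted((canonical(child) for child in tree), key=repr))


def topology_frequencies(window_trees):
    """Map each distinct topology to the fraction of windows showing it."""
    counts = Counter(canonical(tree) for tree in window_trees)
    total = sum(counts.values())
    return {topology: n / total for topology, n in counts.most_common()}


if __name__ == "__main__":
    # Hypothetical trees from four genomic windows (invented data).
    windows = [
        (("A", "B"), "C"),
        ("C", ("B", "A")),   # same topology as the first, written differently
        (("A", "C"), "B"),
        (("B", "C"), "A"),
    ]
    for topology, fraction in topology_frequencies(windows).items():
        print(topology, fraction)
```

A consensus tree corresponds to the most frequent topology, while the spread of the remaining frequencies is a rough measure of how much the "cloud" of alternate trees disagrees with it.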

Their conclusion (as that of the previous paper) is that adaptive radiations are characterized by several components that scramble normal phylogenetic analysis, including variably preserved diversity from the originating species, post-divergence gene flow (i.e. mating), and rapid adaptation to new conditions along with strong environmental selection over the pre-existing diversity. All of these mechanisms are happening at the same time, and each position in the genome is being affected at the same time, so this is a massively parallel process that, while slow in human time, can be very rapid in geologic time. They note how tomato speciation compares with some other well-known cases:

"Nonetheless, based on our crude estimates within each analysis, we infer that relatively small yet substantial fractions of the euchromatic genome are implicated in each source of genetic variation. We find little evidence that one of these processes predominates in its contribution, although our estimates suggest that de novo mutation might be relatively more influential and cross-species introgression relatively less so. This latter observation is in interesting contrast with several recent studies of animal adaptive radiations, including in Darwin’s Finches [18], Equids [14], and fish [13], where evidence suggests that hybridization and introgression might be much more pervasive and influential than previously suspected, and more abundant than we detect in Solanum."

Naturally, neither of these studies goes back in time to nail down exactly what happened during these evolutionary radiations, nor what caused them. They only give hints about causation. Why the stasis of some species, and the rapid niche-finding and filling by others? Was the motive force natural selection, or god? The latter paper gives some clear hints about possible selective pressures and rationales that were at work in the Andes and Galapagos on this section of Solanum. But it is always frustratingly a matter of abstract reasoning, in the manner of Darwin, that paints the forces at work, however detailed the genetic and biogeographic analyses and however convincing the analogous laboratory experiments on model, usually microbial, organisms. We have to think carefully, and within the discipline of known forces and mechanisms, to arrive at intellectually honest answers to these questions, insofar as they can be answered at all.