Showing posts with label genetics. Show all posts

Saturday, March 28, 2026

Death and Resurrection ... Of a Gene

The SLAMF9 gene became non-functional in the human lineage, and then later was re-activated. Why?

Biology is amazingly intricate, but it is often also needlessly complex- evidence for the haphazard, if eventually pointed, mechanisms of the evolutionary process. We will take up the discussion of "junk" DNA again next week, but molecular biology is full of redundant and excessive processes, which should certainly be mystifying from a "design" perspective. At the frontier of natural selection are neutral and near-neutral genetic elements, which change over time due to chance, lacking selection pressure towards conservation. Pseudogenes (of which we have about 20,000- almost as many as functional genes) are one form of neutral element. They are typically remnants of functional genes that have been duplicated and inactivated by mutation. They are a lively area of genome annotation because it is hard to be sure that they are really dead. Despite what looks like an inactivating mutation, they typically still produce RNA transcripts, and may produce partial or alternative proteins as well. The literature is full of experiments finding products and activities from genes annotated elsewhere as pseudogenes. And what looks like a pseudogene from one sample might just be an allele, the same gene being whole and active in other people.

So, it is hard to know what any particular genetic region is doing without a lot of evolutionary, functional, and even population analysis. A recent paper looked deeply at one gene- a gene that seems to have flipped back and forth between functional and non-functional states in the human lineage. It is a rare example of a gene coming back from what is usually a one-way trip into mutational oblivion, once its function- and thus selective pressure for conservation- have disappeared.

SLAMF9 is one of a family (signaling lymphocyte activation molecule family) of surface receptors that occur in many cells of the immune system, help activate responses in these cells, and also recognize some viruses and bacteria. They bind to each other and to other components of the immune system, creating complex signaling networks. Genes involved in our immune systems are commonly subject to rapid evolution, the arms race against our many pathogens being relentless. Sometimes that takes the form of gene inactivation, if a particular receptor, for instance, has been turned against us by a pathogen that uses it for binding and cell entry. 

This week's authors were facing a conundrum. They were studying SLAMF9, and found the mouse version easy to clone and express in the lab. But the human version ... that was another story, frustratingly impossible to express in usable amounts. When they looked at the protein sequence, they were in for a big surprise:

At the front end of SLAMF9, there is very strong conservation across mammals... except when it comes to humans! The signal peptide is what directs this protein to be inserted into the plasma membrane, and is cleaved off the mature protein. Highlighted in red is the region that is starkly different in humans, which naturally affects (not in a good way) the signal cleavage process. "a" and "b" point to important domains on the cytoplasmic side of the final protein, which are just barely preserved/conserved in the human form.

This alignment among various mammalian versions (orthologs) of SLAMF9 shows that they are all pretty much the same... except for the human version. All the way from mouse to chimpanzee, nothing has changed at the front end of this protein. That is amazing in itself, showing very strong conservation. But then, after our lineage split from chimpanzees, something weird happened. A small segment at the front of this protein is totally different. This area is important because it carries the cleavage site of the signal sequence. The signal sequence directs the protein to be sent to the membrane (as this is a trans-membrane receptor), and this cleavage site is compromised, explaining why the authors' attempts to express this protein went so poorly. It might be enough for modest expression in the natural setting, but not enough for their investigations.

At the DNA level, it is clear that what happened to the protein was a double frame shift in translation: out of frame at the front, with the frame recovered at the second mutation. The mutations must have been independent events, but the order of their occurrence is not known. The first intron trails off to the left, while the coding sequence trails off to the right.

When they looked at the DNA sequence, the reason for this change in the protein sequence became clearer. There was a frame shift, with only small changes in the DNA sequence that led to the bigger change in the protein sequence. On the left, there is a shift in the splice site at the end of the first intron (splice acceptor). This shifts the mRNA product by four bases (vs the start site of translation), creating a frame shift in translation, as portrayed in the amino acid codes given. On the right, there is a one nucleotide deletion, causing another frame shift that brings the translation back into the normal frame. 
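The frame arithmetic here can be sketched with a toy sequence (the bases below are invented for illustration, not the real SLAMF9 sequence): a 4-base gain shifts translation by 4 mod 3 = 1 position, a later 1-base deletion shifts it back by one, so only the segment between the two events is read in the wrong frame.

```python
from itertools import product

# Standard genetic code, built from the classic TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): AMINO[i] for i, c in enumerate(product(BASES, repeat=3))}

def translate(seq):
    """Translate codon by codon, stopping at the first stop (*)."""
    prot = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON[seq[i:i + 3]]
        prot.append(aa)
        if aa == "*":
            break
    return "".join(prot)

# Invented ancestral coding sequence.
ancestral = "ATGGCACTGTCTGTTGGACGTTAA"
# Human-like allele: a 4-base gain near the start (the shifted splice
# acceptor) plus a 1-base deletion downstream that restores the frame.
human = "ATG" + "TGCA" + "GCACTGTCTGT" + "GGACGTTAA"

print(translate(ancestral))  # MALSVGR*
print(translate(human))      # MCSTVCGR*
```

The shared "GR" tail shows the reading frame being recovered downstream of the deletion, just as the human SLAMF9 protein recovers its frame after the second mutation, while the intervening stretch is translated into entirely different amino acids.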

They sampled all the available archaeological samples from the human lineage- Neanderthals and Denisovans- and each was the same as the current human sequence. So, whatever happened did so between the split from chimpanzees and the advent of these Homo species. And what happened comprised two distinct events- the second frame shift and the first frame shift are independent genetic mutations. 

Which happened first? That is uncertain, but the authors show that the right-most frame shift (called g.621delT) did not influence the change in the splice site. The splice site change was caused by a series of about six mutations within the first intron, (not shown), which shifted the pattern of mRNA self-hybridization that helps direct splice site selection. So it is likely that the splice site change happened first, essentially killing the gene. And then the downstream frameshift happened later on to rescue it in a partial, not very well-expressed way. However, either mutation could have happened first to functionally kill off this gene, and then further mutation(s) to recover its function. In any case, both events happened within this roughly six-million-year time span that generated our immediate lineage, becoming firmly fixed as the only version of this gene now in our collective genome.

What might cause these events? It all goes back to the function of SLAMF9. As shown above, it is very highly conserved. But, being part of the immune system and the interface we show to pathogens, it is also on the front line of the bio-warfare arms race. As humans started ranging far beyond their original habitats, they doubtless encountered many new pathogens. It seems likely that killing off this gene might have resolved one such fight, at least for a little while, perhaps by removing a pathogen entry point. But later on, it became beneficial to recover it, which is to say that new mutations that restored its function even a little bit were evidently selected for, and spread in the population. There was a race at this point between the accumulation of more (now neutral) mutations that would have permanently inactivated this gene, and the advent of that one special mutation that could save it. The overall conservation of SLAMF9 argues that saving it must have conferred significant benefits.


Saturday, December 13, 2025

Mutations That Make Us Human

The ongoing quest to make biologic sense of genomic regions that differentiate us from other apes.

Some people are still, at this late date, taken aback by the fact that we are animals, biologically hardly more than cousins to fellow apes like the chimpanzee, and descendants through billions of years of other life forms far more humble. It has taken a lot of suffering and drama to get to where we are today. But what are those specific genetic endowments that make us different from the other apes? That, like much of genetics and genetic variation, is a tough question to answer.

At the DNA level, we are roughly one percent different from chimpanzees. A recent sequencing of great apes provided a gross overview of these differences. There are inversions, and larger changes in junk DNA that can look like bigger differences, but these have little biological importance, and are not counted in the sequence difference. A difference of one percent is really quite large. For a three-gigabase genome, that works out to 30 million differences. That is plenty of room for big things to happen.

Gross alignment of one chromosome between the great apes. [HSA- human, PTR- chimpanzee, PPA- bonobo, GGO- gorilla, PPY- orangutan (Borneo), PAB- orangutan (Sumatra)]. Fully aligned regions (not showing smaller single nucleotide differences) are shown in blue. Large inversions of DNA order are shown in yellow. Other junk DNA gains and losses are shown in red, pink, purple. One large-scale jump of a DNA segment is shown in green. One can see that there has been significant rearrangement of genomes along the way, even as most of this chromosome (and others as well) are easily alignable and traceable through the evolutionary tree.


But most of those differences are totally unimportant. Mutations happen all the time, and most have no effect, since most positions (particularly the most variable ones) in our DNA are junk, like transposons, heterochromatin, telomeres, centromeres, introns, intergenic space, etc. Even in protein-coding genes, a third of the positions are "synonymous", with no effect on the coded amino acid, and even when an amino acid is changed, that protein's function is frequently unaffected. The next biggest group of mutations have bad effects, and are selected against. These make up the tragic pool of genetic syndromes and diseases, from mild to severe. Only a tiny proportion of mutations will have been beneficial at any point in this story. But those mutations have tremendous power. They can drag along their local DNA regions as they are positively selected, and gain "fixation" in the genome, which is to say, they are sufficiently beneficial to their hosts that they outcompete all others, with the ultimate result that the mutation becomes universal in the population- the new standard. This process happens in parallel, across all positions of the genome, all at the same time. So a process that seems painfully slow can actually add up to quite a bit of change over evolutionary time, as we see.
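The race between drift and selection toward fixation can be illustrated with a minimal Wright-Fisher simulation (the population size, starting frequency, and selection coefficient below are arbitrary, chosen only so it runs quickly):

```python
import random

def fixation_runs(s, n_pop=50, p0=0.05, reps=200, max_gen=500, seed=1):
    """Fraction of Wright-Fisher replicates in which an allele with
    selective advantage s reaches fixation (frequency 1.0)."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(reps):
        p = p0
        for _ in range(max_gen):
            # selection step: weight carriers of the allele by (1 + s)
            w = p * (1 + s) / (p * (1 + s) + (1 - p))
            # drift step: binomial resampling of 2N gametes
            k = sum(rng.random() < w for _ in range(2 * n_pop))
            p = k / (2 * n_pop)
            if p in (0.0, 1.0):
                break
        fixed += (p == 1.0)
    return fixed / reps

# A neutral allele starting at 5% fixes rarely (about its starting
# frequency); even a modest advantage makes fixation the likely outcome.
print(fixation_runs(s=0.0))
print(fixation_runs(s=0.1))
```

With s = 0 the expected fixation probability is simply the starting frequency p0; selection tilts the race decisively, which is why beneficial variants, rare as they are, dominate the record of change.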

So the hunt was on to find "human accelerated regions" (HAR), which are parts of our genome that were conserved in other apes, but suddenly changed on the way to humans. There are roughly three thousand such regions, but figuring out what they might be doing is quite difficult, and there is a long tail from strong to weak effects. There are two general rationales for their occurrence. First, selection was lost over a genomic region, if that function became unimportant. That would allow faster mutation and divergence from the progenitors. Or second, some novel beneficial mutation happened there, bringing it under positive selection and to fixation. Some recent work found, interestingly, that clusters of mutations in HAR segments often have countervailing effects, with one major mutation causing one change, and a few other mutations (vs the ancestral sequence) causing opposite changes, in a process hypothesized to amount to evolutionary fine tuning. 

A second property of HARs is that they are overwhelmingly not in coding regions of the genome, but in regulatory areas. They constitute fine tuning adjustments of timing and amount of gene regulation, not so much changes in the proteins produced. That is, our evolution was more about subtle changes in management of processes than of the processes themselves. A recent paper delved in detail into HAR5, one of the strongest such regions, (that is, strongest prior conservation, compared with changes in human sequence), which lies in the regulatory regions upstream of Frizzled8 (FZD8). FZD8 is a cell surface receptor, which receives signals from a class of signaling molecules called WNT (wingless and int). These molecules were originally discovered in flies, where they signal body development programs, allowing cells to know where they are and when they are in the developmental program, in relation to cells next door, and then to grow or migrate as needed. They have central roles in embryonic development, in organ development, and also in cancer, where their function is misused.

For our story, the WNT/FZD8 circuit is important in fetal brain development. Our brains undergo massive cell division and migration during fetal development, and clearly this is one of the most momentous and interesting differences between ourselves and all other animals. The current authors made mutations in mice that reproduce some of the HAR5 sequences, and investigated their effects. 

Two mouse brains at three months of age, one with the human version of the HAR5 region. Hard to see here, but the latter brain is ~7% bigger.

The authors claim that these brains, one with native mouse sequence, and the other with the human sequences from HAR5, have about a seven percent difference in mass. Thus the HAR5 region, all by itself, explains about one fourteenth of the gross difference in brain size between us and chimpanzees. 

HAR5 is a 619 base-pair region with only four sequence differences between ourselves and chimpanzees. It lies 300,000 bases upstream of FZD8, in a vast region of over a million base pairs with no genes. While this region contains many regulatory elements, (generally called enhancers or enhancer modules, only some of which are mapped), it is at the same time an example of junk DNA, where most of the individual positions in this vast sea of DNA are likely of little significance. The multifarious regulation by all these modules is of course important because this receptor participates in so many different developmental programs, and has doubtless been fine-tuned over the millennia not just for brain development, but for every location and time point where it is needed.

Location of the FZD8 gene, in the standard view of the genome at NIH. I have added an arrow that points to the tiny (in relative terms) FZD8 coding region (green), and a star at the location of HAR5, far upstream among a multitude of enhancer sequences. One can see that this upstream region is a vast area (of roughly 1.5 million bases) with no other genes in sight, providing space for extremely complicated and detailed regulation, little of which is as yet characterized.

Diving into the HAR5 functions in more detail, the authors show that it directly increases FZD8 gene expression, (about 2 fold, in very rough terms), while deleting the region from mice strongly decreases expression in mice. Of the four individual base changes in the HAR5 region, two have strong (additive) effects increasing FZD8 expression, while the other two have weaker, but still activating, effects. Thus, no compensatory regulation here ... it is full speed ahead at HAR5 for bigger brain size. Additionally, a variant in human populations that is implicated in autism spectrum disorders also resides in this region, and the authors show that this change decreases FZD8 expression about 20%. Small numbers, sure, but for a process that directs cell division over many cycles in early brain development, this kind of difference can have profound effects.
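A toy compounding calculation (with made-up numbers, not figures from the paper) shows why a small per-cycle difference matters in a proliferating progenitor pool:

```python
# Illustrative compounding model: a small change in per-cycle progenitor
# expansion compounds over many division cycles of fetal neurogenesis.
def final_pool(expansion_per_cycle, cycles, start=1.0):
    return start * expansion_per_cycle ** cycles

base = final_pool(1.90, 12)      # hypothetical baseline expansion factor
boosted = final_pool(1.93, 12)   # ~1.6% higher expansion per cycle
print(boosted / base)            # ≈ 1.21: a ~21% larger final pool
```

The point is geometric: any percentage tweak to the per-cycle rate is raised to the power of the number of cycles, so even a sub-10% regulatory change in FZD8 signaling can translate into a sizeable difference in final brain mass.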


The HAR5 region causes increased transcription of FZD8, in mice, compared to the native version and a deletion.

The HAR5 region causes increased cell proliferation in embryonic day 14.5 brain areas, stained for neural markers.

"This reveals Hs-HARE5 modifies radial glial progenitor behavior, with increased self-renewal at early developmental stages followed by expanded neurogenic potential. ... Using these orthogonal strategies we show four human-specific variants in HARE5 drive increased enhancer activity which promotes progenitor proliferation. These findings illustrate how small changes in regulatory DNA can directly impact critical signaling pathways and brain development."

So there you have it. The nuts and bolts of evolution, from the molecular to the cellular, the organ, and then the organismal, levels. Humans do not just have bigger brains, but better brains, and countless other subtle differences all over the body. Each of these is directed by genetic differences, as the combined inheritance of the last six million years since our divergence from chimpanzees. Only with the modern molecular tools can we see Darwin's vision come into concrete focus, as particular, even quantum, changes in the code, and thus biology, of humanity. There is a great deal left to decipher, but the answers are all in there, waiting.


Saturday, November 22, 2025

Ground Truth for Genetic Mutations

Saturation mutagenesis shows that our estimates of the functional effect of uncharacterized mutations are not so great.

Human genomes can now be sequenced for less than $1,000. This technological revolution has enabled a large expansion of genetic testing, used for cancer tissue diagnosis and tracking, and for genetic syndrome analysis both of embryos before birth and affected people after birth. But just because a base among the 3 billion of the genome is different from the "reference" genome, that does not mean it is bad. Judging whether a variant (the modern, more neutral term for mutation) is bad takes a lot of educated guesswork.

A recent paper described a deep dive into one gene, where the authors created and characterized the functional consequence of every possible coding variant. Then they evaluated how well our current rules of thumb and prediction programs for variant analysis compare with what they found. It was a mediocre performance. The gene is CDKN2A, one of our more curious oddities. This is an important tumor suppressor gene that inhibits cell cycle progression and promotes DNA repair- it is often mutated in cancers. But it encodes not one, but two entirely different proteins, by virtue of a complex mRNA splicing pattern that uses distinct exons in some coding portions, and parts of one sequence in two different frames, to encode these two proteins, called p16 and p14. 

One gene, two proteins. CDKN2A has a splicing pattern (mRNA exons shown as boxes at top, with pink segments leading to the p14 product, and the blue segments leading to the p16 product) that generates two entirely different proteins from one gene. Each product has tumor suppressing effects, though via distinct mechanisms.

Notwithstanding the complex splicing and protein coding characteristics, the authors generated all possible variants in every possible coded amino acid (156 amino acids in all, as both produced proteins are relatively short). Since the primary roles of these proteins are in cell cycle and proliferation control, it was possible to assay function by their effect when expressed in cultured pancreatic cells. A deleterious effect on the protein was revealed as, paradoxically, increased growth of these cells. They found that about 600 of the 3,000 different variants in their catalog had such an effect, or 20%.

This is an expected rate of effect, on the whole. Most positions in proteins are not that important, and can be substituted by several similar amino acids. For a typical enzyme, for instance, the active site may be made up of a few amino acids in a particular orientation, and the rest of the protein is there to fold into the required shape to form that active site. Similar folding can be facilitated by numerous amino acids at most positions, as has been richly documented in evolutionary studies of closely-related proteins. These p16 and p14 proteins interact with a few partners, so they need to maintain those key interfacial surfaces to be fully functional. Additionally, the assay these researchers ran, of a few generations of growth, is far less sensitive than a long-term true evolutionary setting, which can sift out very small effects on a protein, so they were setting a relatively high bar for seeing a deleterious effect. They did a selective replication of their own study, and found a reproducibility rate of about 80%, which is not great, frankly.

"Of variants identified in patients with cancer and previously reported to be functionally deleterious in published literature and/or reported in ClinVar as pathogenic or likely pathogenic (benchmark pathogenic variants), 27 of 32 (84.4%) were functionally deleterious in our assay"

"Of 156 synonymous variants and six missense variants previously reported to be functionally neutral in published literature and/or reported in ClinVar as benign or likely benign (benchmark benign variants), all were characterized as functionally neutral in our assay "

"Of 31 VUSs previously reported to be functionally deleterious, 28 (90.3%) were functionally deleterious and 3 (9.7%) were of indeterminate function in our assay."

"Similarly, of 18 VUSs previously reported to be functionally neutral, 16 (88.9%) were functionally neutral and 2 (11.1%) were of indeterminate function in our assay"
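The concordance figures quoted in these excerpts can be recomputed directly (counts taken from the text above; the benign set is the 156 synonymous plus 6 missense variants):

```python
# Concordance of the functional assay with prior classifications,
# using the counts quoted above.
benchmarks = {
    "benchmark pathogenic, deleterious in assay": (27, 32),
    "benchmark benign, neutral in assay":         (162, 162),
    "VUS reported deleterious, confirmed":        (28, 31),
    "VUS reported neutral, confirmed":            (16, 18),
}
for label, (n, total) in benchmarks.items():
    print(f"{label}: {n}/{total} = {n/total:.1%}")
```

These reproduce the quoted 84.4%, 100%, 90.3%, and 88.9% figures, which is the sense in which the assay and the prior literature mostly, but not perfectly, agree.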

Here we get to the key issues. Variants are generally classified as benign, pathogenic/deleterious, or "variant of unknown/uncertain significance". The latter are particularly vexing to clinical geneticists. The whole point of sequencing a patient's tumor or genomic DNA is to find causal variants that can illuminate their condition, and possibly direct treatment. Seeing lots of "VUS" in the report leaves everyone in the dark. The authors pulled in all the common prediction programs that are officially sanctioned by the ACMG- American College of Medical Genetics, which is the foremost guide to clinical genetics, including the functional prediction of otherwise uncharacterized sequence variants. There are seven such programs, including one driven by AI, AlphaMissense, which is related to the Nobel prize-winning AlphaFold. 

These programs strain to classify uncharacterized mutations as "likely pathogenic", "likely benign", or, if unable to make a conclusion, VUS/indeterminate. They rely on many kinds of data, like amino acid similarity, protein structure, evolutionary conservation, and known effects in proteins of related structure. They can be extensively validated against known mutations, and against new experimental work as it comes out, so we have a pretty good idea of how they perform. Thus they are trusted to some extent to provide clinical judgements, in the absence of better data. 

Each of seven programs (on bottom) gives estimations of variant effect over the same pool of mutations generated in this paper. This was a weird way to present simple data, but each bar contains the functional results the authors developed in their own data (numbers at the bottom, in parentheses, vertical). The bars were then colored with the rate of deleterious (black) vs benign (white) prediction from the program. The ideal case would be total black for the first bar in each set of three (deleterious) and total white in the third bar in each set (benign). The overall lineup/accuracy of all program predictions vs the author data was then overlaid by a red bar (right axis). The PrimateAI program was specially derived from comparison of homologous genes from primates only, yielding a high-quality dataset about the importance of each coded amino acid. However, it only gave estimates for 906 out of the whole set of 2964 variants. On the other hand, cruder programs like PolyPhen-2 gave less than 40% accuracy, which is quite disappointing for clinical use.

As shown above, the algorithms gave highly variable results, from under 40% accurate to over 80%. It is pretty clear that some of the lesser programs should be phased out. Of programs that fielded all the variants, the best were AlphaMissense and VEST, which each achieved about 70% accuracy. This is still not great. The issue is that, if a whole genome sequence is run for a patient with an obscure disease or syndrome, and variants vs the reference sequence are seen in several hundred genes, then a gene like CDKN2A could easily be pulled into the list of pathogenic (and possibly causal) variants, or be left out, on very shaky evidence. That is why even small increments in accuracy are critically important in this field. Genetic testing is a classic needle-in-a-haystack problem- a quest to find the one mutation (out of millions) that is driving a patient's cancer, or a child's inherited syndrome.
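One wrinkle in comparing these programs is abstention: PrimateAI scored only 906 of the 2964 variants, so its accuracy is computed over a much smaller denominator than the programs that called everything. A small sketch of how coverage and accuracy should be read together (the correct-call count below is hypothetical, for illustration only):

```python
def scorecard(correct, called, total):
    """Accuracy must be read alongside coverage when a predictor
    abstains (returns no call) on part of the variant set."""
    return {
        "coverage": called / total,
        "accuracy_on_called": correct / called,
        "effective_accuracy": correct / total,  # abstentions count as misses
    }

# PrimateAI-style coverage: 906 of 2964 variants scored; the number of
# correct calls (770) is an invented figure for the sketch.
print(scorecard(correct=770, called=906, total=2964))
```

A program that looks accurate on the variants it dares to call may still leave a clinician in the dark on two-thirds of the report, which is a different kind of failure than a confident wrong call.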

Still outstanding is the issue of non-coding variants. Genes are not just affected by mutations in their protein coding regions (indeed many important genes do not code for proteins at all), but by regulatory regions nearby and far. This is a huge area of mutation effects that are not really algorithmically accessible yet. As a prediction problem, it is far more difficult than predicting effects on a coded protein. It will require modeling of the entire gene expression apparatus, much of which remains shrouded in mystery.


Saturday, July 5, 2025

Water Sensing by WNKs

WNK kinases sense osmotic condition as well as chloride concentration to keep us hydrated.

"Water, water, everywhere, nor any drop to drink." This line from Coleridge evokes the horror of thirst on the ghost ship, as its crew cannot drink salt water. Other species can, but ocean water is too strong for us, roughly four times as salty as our blood. Nevertheless, our bodies have exquisite mechanisms to manage salt concentrations, with each cell managing its own traffic, and the kidneys managing most electrolytes in the blood. It is a very difficult task that has led to clever evolutionary solutions like counter-current exchange across the nephron loops, and stark differences in those nephron cell membranes, in water or salt permeability, to maximize use of passive ion gradients. But at the heart of the system, one has to know what is going on- one has to monitor all of the electrolyte levels and overall osmotic stress.
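The "four times as salty" comparison can be put in rough van 't Hoff terms (osmotic pressure pi = cRT; the osmolarities below are round approximations, ~300 mOsm for blood and ~1100 mOsm for seawater):

```python
# Back-of-envelope van 't Hoff osmotic pressure, pi = c * R * T
R = 0.08314  # L·bar / (mol·K)
T = 310.0    # body temperature, K

def osm_pressure_bar(osmolarity_mol_per_L):
    return osmolarity_mol_per_L * R * T

blood = osm_pressure_bar(0.30)     # ~300 mOsm total solutes
seawater = osm_pressure_bar(1.10)  # ~1100 mOsm, mostly NaCl
print(seawater / blood)            # ≈ 3.7: why drinking it dehydrates us
```

Drinking seawater thus pulls water out of our tissues rather than into them; the kidneys cannot concentrate urine anywhere near that osmolarity, so the net effect is dehydration.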

One such monitoring thermostat for chemical balances turns out to be the WNK kinases- a family of four proteins in humans that control (by phosphorylating them) a secondary set of regulators, which in turn control many salt transporters, such as SLC12A2 and SLC12A4. These latter are passive, though regulated, co-transporters that allow chloride across the membrane when combined with a matching cation like sodium or potassium. The cations drive the process, because they are normally kept (pumped) to strong gradients across cell membranes, with high sodium outside, and high potassium inside. Thus when these co-transporters are turned on (or off), they use the cation gradients to control the chloride level in the cell, in either direction, depending on the particular transporter involved. Since the sodium and potassium levels are held at relatively static, pumped levels, it is the chloride level that helps control the overall osmotic pressure in a finely tuned way. 

A few of the ionic transactions done in the kidney.


The WNK kinases were discovered genetically, in families that showed hypertension and raised levels of chloride and potassium in the blood. These syndromes mirrored complementary syndromes caused by mutations in SLC12A2, the Na/Cl co-transporter, indicating the WNK kinases inhibit SLC12A2. It turns out that WNKs, which are named for an unusual catalytic site (with no lysine [K]), are sensors for both chloride, which inhibits them, and for osmotic pressure, which activates them. They are expressed in different locations and have slightly different activities, (and control many more transporters and processes than discussed here), but I will treat them interchangeably here. The logic of all this is that, if osmotic pressure is low, that means that internal salt levels are low, and chloride needs to be let into the cell, by activating the cation/chloride co-transporters. Likewise, if chloride levels inside the cell are high, the WNK kinase needs to be inhibited, reducing chloride influx. 
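This decision logic can be caricatured in a few lines (the thresholds are invented set points, and the real network involves graded, multi-input regulation, not a binary switch):

```python
def wnk_active(cl_in_mM, external_osmolarity_mOsm):
    """Cartoon of the WNK decision rule described above; the set points
    are made-up illustrative numbers, not measured values."""
    CL_SET_POINT = 60.0    # high internal chloride inhibits the kinase
    OSM_SET_POINT = 320.0  # hyperosmotic challenge activates it
    if cl_in_mM > CL_SET_POINT:
        return False       # chloride bound at the active site: off
    return external_osmolarity_mOsm > OSM_SET_POINT

# Osmotic stress with low internal chloride: kinase on, co-transporters
# activated, chloride flows in to restore balance.
print(wnk_active(30, 350))  # True
print(wnk_active(80, 350))  # False (chloride override)
print(wnk_active(30, 290))  # False (no osmotic stress)
```

The chloride input acts as negative feedback on the very flux the kinase turns on, which is what makes the system a thermostat rather than a one-way valve.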

A recent paper (and prior work from the same lab) discussed structures of the WNK regulators that explain some of this behavior. WNK kinases are dimers at rest, and in that state mutually inhibit their auto-phosphorylation. It is separation and auto-phosphorylation that turns them on, after which they can then phosphorylate their target proteins, such as the secondary kinases STK39 and OSR1. The authors had previously found a chloride binding site right at the active site of the enzyme that promotes dimerization. In the current paper, they reveal a couple of clusters of water molecules which similarly affect the dimerization, and thus activity, of the enzyme.

Location of the inhibitory chloride (green) binding site in WNK1. This is right in the heart of the protein, near the active kinase site and dimerization interface with the other WNK1 partner.

While X-ray crystal structures rarely show or care much about water molecules, (they are extremely small and hard to track), here, those waters were hypothesized to be important, since WNK kinases are responsive to osmotic pressure. One way to test this is to add PEG400 to the reaction. This is a polymer (400 molecular weight) that is water-like and inert, but large in a way that crowds out water molecules from the solution. At 15% or 25% of the volume, PEG400 displaces a lot of water, lowers the water activity of a solution, and thus increases the osmotic pressure- that is, its tendency to draw water in from outside. Plants use osmotic pressure as turgor pressure, and our cells, not having cell walls, need to always be at an osmotic pressure similar to the outside, lest they swell up, or conversely shrink away. Anyhow, WNK kinases can be switched from an inactive to active state just by adding PEG400- a sure sign that they are sensors for osmotic pressure.


Water network (blue dots) within the WNK1 kinase protein. Most of the protein is colored teal, while the active site kinase area is red, and a tiny amount of the dimer partner is colored green. When this crystal is osmotically challenged, the water network collapses from 14 waters to 5, changing the structure and promoting dissociation of the dimer. In B is shown a sequence alignment over a wide evolutionary range where the amino acids that coordinate the water network (yellow) are clearly very well conserved, thus quite important.

Above is shown a closeup of the WNK1 protein, showing in teal the main backbone, including the catalytic loop. In red is the activation loop of the kinase, and in green is a little bit from the other WNK1 protein in the dimer pair. The chloride, if bound, would be located right at top center, at K375. Shown in blue are a series of fourteen water molecules that make up one so-called water network. Another smaller one was found at the interface between the two WNK1 proteins. The key finding was that, if crystallized with PEG400, this water network collapsed to only five water molecules, thereby changing the structure of the protein significantly and accounting for the dissolution of the dimer. 

Superposition of WNK1 with PEG400 (purple) and activated vs WNK1 without, in an inactive state (teal). Most of the blue waters would be gone in the purple state as well. This shows the significant structural transition, particularly in the helices above the active site, which induce (in the purple state) dissociation of the dimer, auto-phosphorylation, and activation.

Thus there is a delicate network of water molecules tentatively held together within this protein that is highly sensitive to the ambient water activity (aka osmotic pressure). This dynamic network provides the mechanism by which the WNK proteins sense and transmit the signal that the cell requires a change in ionic flows. Generally the point is to restore homeostatic balance, but in the kidney these kinases are also used to control flows for the benefit of the organism as a whole, by regulating different transporters in different parts of the same cell- either on the blood side, or the urine side.


Saturday, June 28, 2025

Millions of Years Go by in a Day

In vitro evolution has interesting things to say about protein structure, evolution, and even AI.

The advent of DNA sequencing has been revolutionary in many ways. It has been technologically transformative, is changing medical practice, has radically validated Darwin's theories of evolution, and has allowed much more accurate phylogenies to be drawn out of the history of life. As Dobzhansky said, nothing in biology makes sense except in the light of evolution. The quest to change those DNA sequences has been another technological frontier, now exemplified by the CRISPR genome editing methods. Geneticists have been inducing mutations forever, (well, for over a century), using insults like mustard gas and X-rays. This long-standing tradition is called a "screen", where, after mutagenesis, one looks for particular effects on the resulting organisms, like changes in color, malformations, or defects in development. This is a sort of artificial selection, very highly directed by the experimenter, sometimes resulting in some very weird, if informative, organisms. More recently, biotechnologists have been using directed evolution systems to help develop, through a mix of random and semi-directed mutations, more capable enzymes and other proteins.

But there are many broader questions to ask about the mutational and evolutionary processes. A recent paper demonstrated an interesting mutagenic system hosted in brewer's yeast cells, which can model rapid evolution under a variety of selective constraints. The core of the system is a plasmid, replicated separately from the main genome, by an independent enzyme. This plasmid was found in a distantly related yeast, Kluyveromyces lactis, and encodes its own DNA polymerase that operates independently from the genomic replication system. This opened the way to use the plasmid replication system to host genes of interest and subject them to wildly different (which is to say faster) mutagenic rates than the rest of the organism.

This group has been laboring on this system for several years, and this paper is the culmination, developing a series of plasmid DNA polymerases that have extremely high error rates, while also having high replication activity, and also having a balanced spectrum of error types (that is, G>A as well as G>T, etc.). Indeed, they demonstrate that the error rate (of about 2 errors for every 10,000 bases replicated) is at the threshold of mutational breakdown- the level that is so high that the plasmid's other functions (which are maintained implicitly by purifying selection on activities such as expression of an antibiotic resistance gene/protein and the polymerase itself) are so rapidly impaired that the engineered system can not survive. The error rate of the host cell, in contrast, is about 1 error for every ten billion bases replicated.
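To put those error rates in perspective, here is a back-of-the-envelope comparison, a minimal sketch using only the figures quoted above (the 1,000-base gene length is an illustrative assumption, not from the paper):

```python
# Error rates quoted above: ~2 per 10,000 bases for the engineered
# plasmid polymerase, vs ~1 per 10 billion bases for the host genome.
plasmid_error_rate = 2 / 10_000
genome_error_rate = 1 / 10_000_000_000

gene_length = 1_000  # bases; a hypothetical target gene, for illustration

# Expected new mutations in the target gene, per replication cycle
per_generation_plasmid = plasmid_error_rate * gene_length
per_generation_genome = genome_error_rate * gene_length

print(f"plasmid-borne gene: {per_generation_plasmid:.1f} mutations/generation")  # 0.2
print(f"genomic gene:       {per_generation_genome:.0e} mutations/generation")   # 1e-07
print(f"fold acceleration:  {plasmid_error_rate / genome_error_rate:.0e}")       # 2e+06
```

In other words, a plasmid-borne gene picks up a new mutation every few generations, while the same gene in the host genome would wait millions of generations for one.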

What is the point of all this? While, as pointed out above, directed evolution systems and mutation/selection systems have been around for a long time, this is something quite different. This plasmid system creates high rates of mutation all the time, over a very confined target (the plasmid). The experimenters can then decide what kinds of selection pressure to put on their target gene, if any. They can place a positive selection regime on it, to drive the development of, say, a new substrate specificity for an enzyme. They can put it under negative (purifying) selection to maintain its current activity. Or they can let it spin with no selection at all, letting it degrade into a pseudogene unable to code for anything. All of these scenarios are common in nature and of interest to evolutionary biologists.

In this paper, the authors focus on one enzyme, tryptophan synthase, from a thermophilic bacterium. The aim was to see how this enzyme responded to both positive and negative selective forces in the face of high mutation rates. As it converts one nutrient, indole, into another, tryptophan, this is an enzyme whose activity is easy to assay for and to select for. In the main experiment, using many replicate cultures, they started with no selection for fifty generations, then ramped gradually to positive selection over the next hundred generations, and finished with 300 generations of purifying selection. 

Diversity, for one thing, had increased tremendously by the end of this process. At the end, an average of 21 amino acid changes had accumulated, with the most divergent proteins differing by over 60 amino acids, in a protein that started with 398 amino acids total. Secondly, there was a marked migration to net negative charge, which they speculate was due to accommodation of this thermophilic bacterial enzyme to a more temperate environment where it is a bit more difficult to evade agglomeration with other proteins. Third, changes happened more on the outside of the enzyme structure than the interior (image below). This is a very well-known and understood phenomenon, where selective constraints are much higher on interior packing of a protein and on active/catalytic site portions. Several key amino acids that contact the substrate chemicals are colored gray, meaning that they hardly varied at all in this experiment.

Structure of the TrpB enzyme, color coded for change during the evolution experiment. Note how particularly high rates of change happen in one external region (bottom) that interacts with a partner protein, TrpA, which was not present here. Also, gray areas with very low change tend to be in the interior and near the catalytic active site (substrate and cofactor [pyridoxal phosphate] shown in black).


Overall, the divergence created here over a few months in one protein approximates the kind of divergence seen between the proteins of humans and mice, whose lineages separated about sixty million years ago. The same studies one can do on such naturally diverged proteins, such as locating selectively important amino acid residues, or comparing activities of highly divergent enzymes, or studying structural constraints, one can do here on artificially evolved enzymes. And this is a general system that could be (with appropriate assays and technology) extended to many other proteins and RNAs of interest.

One thing it can't do, however, is validate machine learning models. The researchers tried to get machine learning models that had been trained on this TrpB enzyme to classify their derived mutants. But this was almost completely unsuccessful, since machine learning (AI) systems only regurgitate what they are trained on, and can not creatively judge novel conditions.

"Although sequences that were predicted to have low fitness did exhibit little or no function in our enrichment assay, we found essentially no correlation between the predicted scores and the real enrichment scores of high-function TrpBs. For example, the highest predicted score was assigned to the nearly nonfunctional TmTriple variant."

It is important to appreciate the significance of this new mutation system, which is far more comprehensive, and a closer model of actual evolution, than are the genetic screens of yore. There, one was hunting for the "hopeful monster" resulting from one shot of X-rays, that might generate an informative phenotype- maybe by killing a gene needed for red eye color, or amplifying expression of a gene for drug resistance. Here, the levels of negative and positive selection can be subtly adjusted in a background of continuous high mutation pressure simulating millions of years of evolution, and resulting in extensively transformed target molecules.


  • Total lies come naturally to RFK Jr., as to so many in this administration.
  • With the help of crypto, our banks are not-so-unwitting conduits for crime.

Saturday, May 31, 2025

An Arms Race at the Tiniest Scale

Defense and anti-defense against genetic attack by plasmids.

Bacteria have a pretty active sex life. And like for us, this involves defense and offense, in a complicated tango of genetic exchange. Only, for bacteria, conjugation mechanisms are overwhelmingly used for attack, even though they are also the conduit of great innovations like horizontal gene transfer from different species, and antibiotic resistance. The top priority of most bacteria most of the time is to defend against alien DNA, which most of the time comes in the form of selfish genetic elements and viruses, and they have several mechanisms to do so.

Plasmids are a very common feature of bacteria, and fundamental to genetic engineering, as the primary form of manipulated DNA. Plasmids can be amplified to high copy number in bacteria, can be cut, altered, and ligated back together- the very essence of engineering. This dates me, but I remember preparing plasmids from bacterial cultures by cesium chloride isopycnic (high speed) centrifugation, after which the plasmid (marked by the poison ethidium) would end up swimming in the middle of the gradient, and have to be sucked out with a syringe, before further steps to clean up the purified DNA. It was a rather messy, expensive, uncertain, slow, and unsafe process. 

Summary of the current paper, outlining plasmid transfer, plasmid genome structure, and some components (genes and promoters) of the leading strand as it is transferred to target cells.

Anyhow, plasmids in the wild are typically aggressive genetic elements that carry (encode) some or all of the components needed to form an injection attack complex (i.e. type IV secretion system) that conjugates with other bacteria. Plasmids can also carry many other things, like antibiotic resistance, or stray genes from prior bacterial hosts, or transposing genetic elements. So most of the time, bacteria want to defend themselves against this kind of invasion, even though some of the time the gifts they bring can end up being highly beneficial.

One of their defenses is the restriction system. Another foundation of genetic engineering was and remains the restriction enzyme. These are enzymes that cut DNA at a particular sequence. One can imagine how useful it can be to have such specific scissors for these infinitesimal molecules, and most labs would have a freezer filled with a large library of such enzymes that could be used for breaking and reconstructing new (plasmid) molecules, and also for analyzing them by their pattern of "restriction" sites. In the wild, these enzymes are paired with DNA methylation enzymes that make the host genome invisible to the restriction enzyme, leaving only newly arrived alien DNA susceptible to cleavage, and thus destruction, by these enzymes. 
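The "specific scissors" idea can be illustrated in a few lines of Python that scan a sequence for a recognition site. GAATTC is EcoRI's real recognition sequence; the plasmid sequence here is invented:

```python
def find_restriction_sites(sequence: str, site: str = "GAATTC") -> list[int]:
    """Return the 0-based position of every occurrence of the site.

    GAATTC is the (real) EcoRI recognition sequence. In the wild, the
    paired methyltransferase would methylate these same sites in the
    host genome, hiding them from the restriction enzyme.
    """
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions

plasmid = "TTGACGAATTCAGGCTTAAGAATTCCGTA"  # made-up sequence
print(find_restriction_sites(plasmid))  # → [5, 19]
```

An unmethylated invader with two such sites would be cut into three pieces and destroyed; the methylated host genome, with the same sites, is untouched.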

Another defense is the now-famous CRISPR system. Bacteria capture small bits of invading genomes, and, assuming they survive, knit them into special genetic modules in their own chromosomes. Then they express these small modules as RNAs that latch onto an enzyme called Cas9, (or related enzyme plus RNA systems), which are nucleases that are guided by these RNAs to cut and inactivate invading DNA that matches the RNA sequence. This system is noted as a sort of adaptive immune system that learns from experience, and passes its knowledge down genetically to future generations.

More detailed maps of three example plasmid genomes, showing the distribution of anti-defense and other genes at the leading edge of the genome (left end). Red and yellow colored genes are anti-defense genes of various kinds. The tiny arrows are promoters, each of the early-start kind that can operate on single-stranded DNA.

A recent paper discussed specialized anti-defenses that plasmids have against these and other bacterial defenses, including toxin/suicide systems and inducible stress responses. For it is naturally an arms race with innovation on both sides. Plasmids have an origin of replication where new DNA strands start, and this new strand is what is injected into the target cell. So there is a linear order of DNA and thus genes going into the target cell, head to tail. These researchers find that the anti-defense genes of plasmids tend to be bunched up at the head of the genome, where they get into the target cell first. These regions also have a special way to fold up their single-stranded DNA (into a cruciform shape) that forms immediate promoters that allow these "early" genes to be transcribed by the host RNA polymerase, before replication in the host cell has regenerated normal double-stranded DNA. 

The authors do a very wide survey of species and plasmids, and find that there is a wide variety of anti-defense genes, many of which have unknown functions. Among the known ones are: anti-restriction inhibitors, which directly bind and inactivate restriction enzymes; methyltransferases (MTases), which methylate restriction sites to make them look like host DNA; single-strand binding proteins (SSBs), which coat the single-stranded DNA, protect it from detection as single-stranded DNA, and may assist in repair after Cas9 or restriction cleavage; and antitoxins that inhibit the toxin systems or the SOS system. There are also anti-CRISPR proteins, which degrade the Cas9 enzyme, or inhibit its binding to target DNA.

Experiment to demonstrate the importance of being first... into the target cell. Targeting means that the bacterial cell has a CRISPR system targeting the incoming plasmid. acrIIA4 is an anti-CRISPR protein that effectively blocks cleavage of plasmid DNA by the CRISPR/Cas9 enzyme. The petri plate exhibits bacteria that can grow only if plasmid transfer was successful. See text for further details.

They finish with an elegant experiment that asks how important it is for the anti-defense gene to be at the front of the plasmid. The gene they chose was acrIIA4, which is an anti-CRISPR that very efficiently inhibits Cas9 cleavage of targeted DNA. The petri plate at top shows the growth of infected bacteria after infection by the experimental plasmid, selected for plasmid presence on antibiotic medium. The grey bars, in both diagrams, are the controls: cells whose CRISPR system does not target this plasmid. The plasmid transfers fine, and the cells grow fine. In contrast, if the cell's CRISPR system does target the plasmid, lack of acrIIA4 is fatal, (top), decreasing plasmid transfer by about three logs, or a thousand fold. Putting the acrIIA4 gene at the tail end of the plasmid (middle experiments) helps a little, and transfer is knocked down only a hundred fold. Putting the defense gene at the front of the plasmid, (leading), though, corrects plasmid defense almost fully, and transfer is down a few fold only. Lastly, if acrIIA4 is placed in opposite orientation, (inverted), such that plasmid replication is needed before this gene can be expressed (it can't use the cruciform single-stranded promoter), it is virtually useless. So indeed, being first off the block when invading a target cell is critically important, since the host cell makes its defenses all the time- they are ready and waiting.

While most of these transfers are unwanted, sometimes plasmids integrate into bacterial genomes, and then when they start transferring themselves into other cells, they can bring along huge amounts of their host genomes. That starts to look like serious genetic exchange that begins to approximate sex in eukaryotes. So, some balance of defense, offense, and beneficial exchange is the lifeblood of ecology and evolution at this most ancient scale of life.


Sunday, April 13, 2025

The Genome Remains Murky

A brilliant case study identifying the molecular cause of certain neuro-developmental disorders shows how difficult genome-based diagnoses remain.

Molecular medicine is increasingly effective in assessing both hereditary syndromes and cancers. The sequencing approach generally comes in two flavors- whole genome sequencing, or exome sequencing, where only the most important (protein-coding) parts are sampled. In each case, the hunt is for mutations (more blandly called variants) that cause the syndrome being investigated, from among the large number of variants we all carry. This approach is becoming standard-of-care in oncology, due to the tremendous influence and clinical significance of cancer-driving mutations, many of which now match directly to tailored treatments that address them (thus the "precision" in precision medicine).

But another arm of precision medicine is the hunt for causes of congenital problems. There are innumerable genetic disorders whose causal analysis can lead not only to an informative diagnosis, and sometimes to useful treatments, but also to fundamental understanding of human biology. Sufferers of these syndromes may spend a lifetime searching for a diagnosis, being shuffled from one doctor or center to another and subject to various forms of hypothetical medicine, before some deep sequencing pinpoints the cause of their disease and founds a new diagnostic category that provides, if not relief, at least understanding and a medical home. 

A recent paper from Britain provided a classic of this form, investigating the causes of neurodevelopmental disorders (NDD), which encompass a huge range of problems from mild to severe. They comment that even after the most modern analysis and intensive sequencing, 60% of NDD cases still can not be assigned causes. A large part of the problem is that, despite our knowing the full sequence of the human genome, its function is much less well understood. The protein-coding genes (20,000 of those, roughly) are delineated and studied pretty closely. But they account for only 1 to 2% of the genome. The rest ranges from genes for a blizzard of non-coding RNAs, some of which are critical, to large regulatory regions with smatterings of important sites, to junk of various kinds- pseudogenes, relic retroviruses, repetitive elements, etc. The importance of any of these elements (and individual DNA base positions within them) varies tremendously. This means specifically that exome sequencing is not going to cut it. Exome sequencing focuses on a very small part of the genome, which is fine if your syndrome (such as a common cancer) is well characterized and known to arise from the usual suspects. But for orphan syndromes, it does not cast a wide enough net. Secondly, even with full genome sequencing, so little is known about the remoter regions of the genome that assigning a function to variations found there is difficult to impossible. It takes statistical analysis of the incidence of the variation versus the incidence of the syndrome.

These authors used a trove of data- the Genomics England 100,000 genomes project, focusing on the ~9,000 genomes in this collection from people with NDD syndromes. (Plus additional genomes collected elsewhere.) (We can note in passing that Britain's nationalized health system remains at the forefront of innovative research and care.) What they found was an unusually high incidence of a particular mutation in a non-protein-coding gene called RNU4-2. The product of this gene is an RNA called U4, which is an important part of the spliceosome, where it pairs RNA-to-RNA with another RNA, U6, in a key step of selecting the first (5-prime) side of an intron that is to be spliced out of mRNA messages. This gene would never have come up in exome analysis, being non-protein-coding. Yet it is critically important, as splicing happens to the vast majority of human genes. Additionally, differential splicing- the selection of alternative exons and splice sites in a regulated way- happens frequently in developmental programs and neurological cell types. There is a class of syndromes, called spliceosomopathies, that are caused by defects in mRNA splicing, and they tend to show up as defects in just these developmental and neurological processes.

As shown in the images (all based on a large corpus of other work on spliceosomes), RNU4-2/U4 pairs intimately with the U6 spliceosomal RNA, and the mutation found by the current group (which is a single nucleotide insertion) causes a bulge in this pairing, as marked. Meanwhile, the U6 RNA pairs at the same time with the exon-intron junction of the target mRNA (bottom image), at a site that is very close to the U4 pairing region (top image). The upshot is that this single base insertion into U4 causes some portion of the target mRNAs to be mis-spliced, using non-natural 5 prime splice sites and thus altering their encoded proteins. This may cause minor problems in the protein, but more often will cause a shift in translation frame, a premature stop codon, and total loss of the functional protein. So this tiny mutation can have severe effects and is indeed genetically dominant- that is, one copy overrides a second wild-type copy to generate the NDD diseases that were studied.
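The destructive logic of a frame shift can be seen in a few lines of code. This is a toy demonstration with an invented coding sequence and a partial codon table (the codon assignments themselves are from the real genetic code), not the actual transcripts involved:

```python
# A small subset of the standard genetic code, enough for this example.
CODONS = {
    "ATG": "M", "GCT": "A", "GAA": "E", "AAA": "K",
    "TGG": "W", "GGC": "G", "TAA": "*", "TGA": "*",
}

def translate(seq: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODONS.get(seq[i:i+3], "X")  # X = codon not in our toy table
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

normal = "ATGGCTGAAAAATGGTAA"            # invented coding sequence
shifted = normal[:3] + "G" + normal[3:]  # one inserted base after the ATG

print(translate(normal))   # → MAEKW  (the intended protein)
print(translate(shifted))  # → MG     (garbled, then a premature stop)
```

One extra base downstream of the start codon scrambles every codon after it, and here runs straight into a stop- which is roughly what mis-splicing at a shifted 5-prime site does to the affected mRNAs.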

The U4 RNA (teal) paired with the U6 RNA (gray), within an early spliceosome complex. The mutation studied here is pointed out in black (n.64_65insT - i.e. insertion of a T). Note how it would cause a bulge in the pairing. Importantly, the location in the U6 RNA that pairs with the mRNA (see below) is right next door, at the ACAGAGA (light gray). The authors use this structural work from others to suggest how the mutation they found can alter selected splicing sites and thus lead to disease. Other single nucleotide insertions that cause similar syndromes are marked with black arrows, while single nucleotide substitutions that cause less severe syndromes are marked with orange RNA segments.

The U6 RNA (pink) paired with its mRNA target to be spliced. It binds right at the intron (gray) exon (black) boundary, where the cut will eventually be made to remove the intron. The bump from the mis-paired mutant U4 RNA (see above) distorts this binding, sending U6 to select wrong locations for splicing.


The researchers went on to survey this and other spliceosomal RNA genes for similar mutations, and found few to none outside the region marked in the diagram above. For example, there is a highly similar gene called RNU4-1. But this gene is expressed about 100-fold less in brain and other tissues, making RNU4-2 the principal source of U4 RNA, and a much more significant causal factor for NDD. It appears that other locations in RNU4-2 (and other spliceosomal RNA genes) are even more important than the one mutated location found here, and thus are never found mutated, presumably because they are lethal and heavily selected against in this highly conserved gene.

They also noted that, while this RNU4-2 mutation is severe, and thus must happen spontaneously (i.e. not inherited from parents), it only occurs on the maternal alleles, not paternal alleles, in the affected children. They speculate that this may be due to effects this gene may have in male gametogenesis, killing affected sperm preferentially, but not affected oocytes. Lastly, this set of mutations (in the small region shown in the first figure above) appears to account for, in their estimation, about 0.4% of all NDD seen in Britain. This is a remarkably high rate for such a particular mutation that is not heritable. They speculate that some mutation hotspot kind of process may be causing these events, above the general mutation rate. What this all says about so-called "intelligent design", one may be reluctant to explore too deeply. On the other hand, this still leaves plenty of room to hunt for additional variations that cause these syndromes.

In this research, we see that clinically critical variations can pop up in many places, not just among the "usual suspects", genetically and genomically speaking. While much of the human genome is junk, most of it is also expressed (as RNA) and all of it is fair game for clinically important (if tragic) effects. The NDD syndromes caused by the mutation studied here are very severe- far more so than the ADD or mild autism diagnoses that make up most of the NDD spectrum. Understanding the causal nexus between the genome and human biology and its pathologies remains an ongoing and complicated scientific adventure.


  • Playing the heel. Being the heel
  • It sure is great to be the victim.
  • Oh, right.. now we really know what is going on.
  • More spiritual warfare.
  • Another grift.

Saturday, February 15, 2025

Cloudy, With a Chance of RNA

Long RNAs play structural and functional roles in regulation of chromosome replication and expression.

One of the wonderful properties of the fruit fly as a model system of genetics and molecular biology has been its polytene chromosomes. These are hugely expanded bundles of chromosomes, replicated thousands of times, which have been observed microscopically since the late 1800's. They exist in the larval salivary gland, where huge amounts of gene expression are needed, thus the curious evolutionary solution of expanding the number of templates, not only of the gene needed, but of the entire genome. 

These chromosomes were closely mapped and investigated, almost like runic keys to the biology of the fly, especially in the days before molecular biology. Genetic translocations, loops, and other structural variations could be directly observed. The banding patterns of light, dark, expanded, and compressed regions were mapped in excruciating detail, and mapped to genetic correlates and later to gene expression patterns. These chromosomes provided some of the first suggestions of heterochromatin- areas of the genome whose expression is shut down (repressed). They may have genes that are shut off, but they may also be structural components, such as centromeres and telomeres. These latter areas tend to have very repetitive DNA sequences, inherited from old transposons and other junk.

A diagram of polytene chromosomes, bunched up by binding at the centromeres. The banding pattern is reproducible and represents differences in proteins bound to various areas of the genome, and gene activity.

It has become apparent that RNA plays a big role in managing these areas of our chromosomes. The classic case is the XIST RNA, which is a long (17,000 bases) non-coding RNA that forms a scaffold by binding to lots of "heterogeneous" RNA-binding proteins, and most importantly, stays bound near the site of its creation, on the X chromosome. Through a regulatory cascade that is only partly understood, the XIST gene is turned off on one of the X chromosomes and turned on on the other (in females), leading the XIST molecule to glue itself to its chromosome of origin, and then progressively coat the rest of that chromosome and turn it off. That is, one entire X is turned into heterochromatin by a process that requires XIST scaffolding all along its length. That results in "dosage compensation" in females, where one X is turned off in all their cells, allowing the dosage (that is, the gene expression) of its expressed genes to approximate that of males, despite the presence of a second X chromosome. Dosage is very important, as shown by Down Syndrome, which originates from an extra copy of one of the smallest human chromosomes, creating imbalanced gene dosage.

A recent paper described work on "ASAR" RNAs, which similarly arise from highly repetitive areas of human chromosomes, are extremely long (180,000 bases), and control expression and chromosome replication in an allele-specific way on (at least) several non-X chromosomes. These RNAs, again like XIST, specifically bind a bunch of heteronuclear binding proteins, which is presumably central to their function. Indeed, these researchers dissected out the 7,000 base segment of ASAR6 that is densest in protein binding sites, and found that, when transplanted into a new location, this segment has dramatic effects on chromosome condensation and replication, as shown below.

The intact 7,000 base core of ASAR6 was transplanted into chromosome 5, and mitotic chromosomes were spread and stained. The blue is a general DNA stain. The green is a stain for newly synthesized DNA, and the red is a specific probe for the ASAR6 sequence. One can see on the left that this chromosome 5 is replicating more than any other chromosome, and shows delayed condensation. In contrast, the right frame shows a control experiment where an anti-sense version of the ASAR6 7,000 base core was transplanted to chromosome 5. The antisense sequence not only does not have the wild-type function, but also inhibits any molecule that does by tightly binding to it. Here, the chromosome it resides on (arrows) is splendidly condensed, and hardly replicating at all (no green color).


Why RNA? It has become clear over the last two decades that our cells, and particularly our nuclei, are swimming with RNAs. Most of the genome is transcribed in some way or other, despite a tiny proportion of it coding for anything. 95% of the RNAs that are transcribed never get out of the nucleus. There has been a growing zoo of different kinds of non-coding RNAs functioning in translational control, ribosomal maturation, enhancer function, and here, in chromosome management. While proteins tend to be compact bundles, RNAs can be (as these ASARs are) huge, especially in one dimension, and thus capable of physically scaffolding the kinds of structures that can control large regions of chromosomes.

Chromosomes are sort of cloudy regions in our cells, long a focus of observation and clearly also a focus of countless proteins and now RNAs that bind, wind, disentangle, transcribe, replicate, and congregate around them. What all these RNAs and especially the various heteronuclear proteins actually do remains pretty unclear. But they form a sort of organelle that, while it protects and manages our DNA, remarkably also allows access to it for sequence-specific binding proteins and the many processes that plow through it.

"In addition, recent studies have proposed that abundant nuclear proteins such as HNRNPU nonspecifically interact with ‘RNA debris’ that creates a dynamic nuclear mesh that regulates interphase chromatin structure."


Saturday, February 1, 2025

Proving Evolution the Hard Way

Using genomes and codon ratios to estimate selective pressures was so easy... why is it not working?

The fruits of evolution surround us with abundance, from the tallest tree to the tiniest bacterium, and the viruses of that bacterium. But the process behind it is not immediately evident. It was relatively late in the Enlightenment before Darwin came up with the stroke of insight that explained it all. Yet that mechanism of natural selection remains an abstract concept requiring an analytical mind and due respect for the very inhuman scales of time and space in play. Many people remain dumbfounded, and in denial, while evolutionary biology has forged ahead, powered by new discoveries in geology and molecular biology.

A recent paper (with review) offered a fascinating perspective, both critical and productive, on the study of evolutionary biology. It deals with the opsin protein that hosts the visual pigment 11-cis-retinal, by which we see. The retinal molecule is the same across all opsins, but different opsin proteins can "tune" the light wavelength of greatest sensitivity, creating the various retinal-opsin combinations for all visual needs, across the cone cells and rod cells. This paper considered the rhodopsin version of opsin, which we use in rod cells to perceive dim light. They observed that in fish species, the sensitivity of rhodopsin has been repeatedly adjusted to accommodate light at different depths of the water column. At shallow levels, sunlight is similar to what we see, and rhodopsin is tuned to about 500 nm, while deeper down, when the light is more blue-ish, rhodopsin is tuned towards about 480 nm maximum sensitivity. There are also special super-deep fish who see by their own red-tinged bioluminescence, and their rhodopsins are tuned to 526 nm. 

This "spectrum" of sensitivities of rhodopsin has a variety of useful scientific properties. First, the evolutionary logic is clear enough, matching the fish's vision to its environment. Second, the molecular structure of these opsins is well-understood, the genes are sequenced, and the history can be reconstructed. Third, the opsin properties can be objectively measured, unlike many sequence variations which affect more qualitative, difficult-to-observe, or impossible-to-observe biological properties. The authors used all this to carefully reconstruct exactly which amino acids in these rhodopsins were the important ones that changed between major fish lineages, going back about 500 million years.

The authors' phylogenetic tree of the fish and other species whose rhodopsin molecules they analyzed. Note how mammals occupy the small branch at bottom, indicating how deeply the rest of the tree reaches. The numbers at the nodes indicate the wavelength sensitivity of each (current or imputed) rhodopsin. Many branches carry the authors' inference, from a reconstructed and measured protein molecule, of precisely which changes happened, via positive selection, to get to that lineage.

An alternative approach to evolutionary inference is a second target of these authors. This is the codon-based method, which evaluates the rate of change at DNA sites under selection versus sites not under selection. In protein-coding genes (such as rhodopsin), every amino acid is encoded by a triplet of DNA nucleotides, per the genetic code. With 64 codons for ~20 amino acids, it is a redundant code in which many DNA changes do not change the protein sequence. These changes are called "synonymous". If one compares the rate of change of synonymous sites in the DNA (which form a sort of control in the experiment) with the rate of change of non-synonymous sites, one can get a sense of evolution at work. Changing the protein sequence is something that is "seen" by natural selection, especially at important positions in the protein, some of which are "conserved" over billions of years. Such sites are subject to "negative" selection, which is to say rapid elimination of changes, due to the deleterious effects of those DNA and protein changes.
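The synonymous/non-synonymous distinction is easy to make concrete. Here is a minimal Python sketch (the example codons are illustrative, not drawn from the paper) that classifies a single codon change using the standard genetic code:

```python
# Standard genetic code, built from the canonical TCAG base ordering.
# "*" marks stop codons; the redundancy of the code means many
# single-base changes leave the encoded amino acid untouched.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def is_synonymous(codon_before, codon_after):
    """True if the DNA change does not alter the encoded amino acid."""
    return CODON_TABLE[codon_before] == CODON_TABLE[codon_after]

# GAT -> GAC: both encode aspartate (D), so invisible to selection
print(is_synonymous("GAT", "GAC"))   # True
# GAT -> GAA: aspartate (D) becomes glutamate (E), visible to selection
print(is_synonymous("GAT", "GAA"))   # False
```

The dictionary comprehension simply unrolls the 4 x 4 x 4 codon cube against the standard one-letter amino acid string, so the whole table takes three lines rather than sixty-four.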

Mutations in protein coding sequence can be synonymous, (bottom), with no effect, or non-synonymous (middle two cases), changing the resulting protein sequence and having some effect that may be biologically significant, thus visible to natural selection.


This analysis has been developed into a high art, and has also been harnessed to reveal "positive" selection. In this scenario, if the rate of change at non-synonymous DNA sites is higher than that at synonymous sites, or even just higher than one would expect by random chance, one can conclude that these non-synonymous sites were not just not being selected against, but were being selected for: an instance of evolution establishing change for the sake of improvement, instead of avoiding change, as usual.
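At its skeleton, the calculation behind these codon-based methods is a ratio, often called omega (dN/dS). Below is a hedged sketch of a much-simplified Nei–Gojobori-style count: real implementations correct for multiple hits, weigh alternative mutational paths, and handle stop codons properly, none of which this toy does, and the sequences are invented for illustration:

```python
# Standard genetic code in canonical TCAG ordering ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def synonymous_sites(codon):
    """Expected number of the codon's 3 sites that are synonymous:
    the fraction of the 9 possible single-base changes preserving
    the amino acid, scaled to 3 positions."""
    syn = sum(
        CODON[codon[:i] + b + codon[i + 1:]] == CODON[codon]
        for i in range(3) for b in BASES if b != codon[i]
    )
    return syn / 3.0

def omega(seq1, seq2):
    """Crude dN/dS for two aligned, in-frame coding sequences
    (assumes at least one synonymous difference exists)."""
    S = N = Sd = Nd = 0.0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        s = (synonymous_sites(c1) + synonymous_sites(c2)) / 2.0
        S += s              # potential synonymous sites
        N += 3.0 - s        # potential non-synonymous sites
        for j in range(3):  # classify each differing position
            if c1[j] != c2[j]:
                mutated = c1[:j] + c2[j] + c1[j + 1:]
                if CODON[mutated] == CODON[c1]:
                    Sd += 1
                else:
                    Nd += 1
    return (Nd / N) / (Sd / S)   # omega > 1 suggests positive selection

# One synonymous and one non-synonymous difference across three codons:
# omega comes out well below 1, the signature of purifying selection.
print(omega("GATGCTAAA", "GACGCTAGA"))
```

The key design point is the denominator: synonymous changes serve as the neutral baseline, so an excess of non-synonymous change over that baseline (omega > 1) is read as positive selection, and a deficit (omega < 1) as conservation.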

Now back to the rhodopsin study. These authors found that a very small number of amino acids in this protein, only 15, were the ones that influenced changes to the spectral sensitivity of these protein complexes over evolutionary time. Typically only two or three changes occurred across a given shift in sensitivity in a particular lineage, and these would have been the ones subject to natural selection, with all the other changes seen in the sequence being unrelated, either neutral or selected for other purposes. It is a tour de force of structural analysis, biochemical measurement, and historical reconstruction to come up with this fully explanatory model of the history of piscine rhodopsins.

But then they went on to compare what they found with what the codon-based methods had said about the matter. And they found that there was no overlap whatsoever. The amino acids identified by the "positive selection" codon-based methods were completely different from the ones they had found by spectral analysis and phylogenetic reconstruction over the history of fish rhodopsins. The accompanying review is particularly harsh about the pseudoscientific nature of this codon analysis, rubbishing the entire field. There have been other, less drastic, critiques as well.

But there is method to all this madness. The codon-based methods were originally conceived for the analysis of closely related lineages, specifically various Drosophila (fly) species that might have diverged over a few million years. On this time scale, positive selection has two effects. One is that a desirable amino acid (or other) variation is selected for, and thus swept to fixation in the population. The other, corresponding effect is that all the other variations surrounding this desirable variation (that is, those nearby on the same chromosome) are likewise swept to fixation, as part of what is called a haplotype. That dramatically reduces the neutral variation in this region of the genome. Indeed, the effect on neutral alleles (over millions of nearby base pairs) will vastly overwhelm the effect of the newly established single variant that was the object of positive selection, and this imbalance grows stronger as the positive selection grows stronger. In the limiting case, the entire genomes of those lacking the new positive trait/allele are eliminated, leaving no variation at all.
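This hitchhiking effect is simple enough to watch in a toy simulation. The Wright-Fisher sketch below (parameter values are arbitrary illustrations, and recombination is ignored entirely) tags every starting chromosome with a unique neutral label, gives a minority of them a beneficial allele, and lets selection and drift run. When the beneficial allele fixes, only the labels that happened to ride along with it survive:

```python
import random

def sweep(n=500, carriers=50, s=0.1, seed=3, max_gens=10_000):
    """Haploid Wright-Fisher population: one selected locus plus a
    perfectly linked neutral label per chromosome (no recombination).
    Returns the final allele frequency and the count of surviving
    neutral labels."""
    random.seed(seed)
    # The first `carriers` chromosomes carry the beneficial allele
    # (relative fitness 1 + s); every chromosome gets a unique label.
    pop = [(i < carriers, i) for i in range(n)]
    for _ in range(max_gens):
        freq = sum(beneficial for beneficial, _ in pop) / n
        if freq in (0.0, 1.0):      # allele lost or fixed
            break
        weights = [1.0 + s if beneficial else 1.0 for beneficial, _ in pop]
        pop = random.choices(pop, weights=weights, k=n)  # next generation
    labels = {label for _, label in pop}
    return freq, len(labels)

freq, surviving = sweep()
# If the sweep fixes, only labels linked to the original carriers can
# remain: neutral diversity collapses from n labels to at most `carriers`,
# and drift during the sweep usually prunes it much further.
print(freq, surviving)
```

Note that the collapse of neutral labels is a pure side effect: the labels themselves have no fitness consequence, which is exactly the hitchhiking signal the codon-era sweep scans look for.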

Yet on the longer time scale, over hundreds of millions of years, as was the scope of visual variation in fish, all these effects on the level of neutral variation wash out, as mutation and variation processes resume after the positively selected allele is fixed in the population. So my view of this tempest in an evolutionary teapot is that these recent authors (and whatever other authors were deploying codon analysis against this rhodopsin problem) are barking up the wrong tree, mistaking the proper scope of these analyses, which, after all, focus on the ratio between synonymous and non-synonymous change in the genome, and thus intrinsically on recent change, not deep change in genomes.


  • That all-American mix of religion, grift, and greed.
  • Christians are now in charge.
  • Mechanisms of control by the IMF and the old economic order.
  • A new pain med, thanks to people who know what they are doing.