Showing posts with label evolution. Show all posts

Sunday, April 13, 2025

The Genome Remains Murky

A brilliant case study identifying the molecular cause of certain neuro-developmental disorders shows how difficult genome-based diagnoses remain.

Molecular medicine is increasingly effective in assessing both hereditary syndromes and cancers. The sequencing approach generally comes in two flavors- whole genome sequencing, or exome sequencing, where only the most important (protein-coding) parts are sampled. In each case, the hunt is for mutations (more blandly called variants) that cause the syndrome being investigated, from among the large number of variants we all carry. This approach is becoming standard-of-care in oncology, due to the tremendous influence and clinical significance of cancer-driving mutations, many of which now match directly to tailored treatments that address them (thus the "precision" in precision medicine).

But another arm of precision medicine is the hunt for causes of congenital problems. There are innumerable genetic disorders whose causal analysis can lead not only to an informative diagnosis, and sometimes to useful treatments, but also to fundamental understanding of human biology. Sufferers of these syndromes may spend a lifetime searching for a diagnosis, being shuffled from one doctor or center to another and subject to various forms of hypothetical medicine, before some deep sequencing pinpoints the cause of their disease and founds a new diagnostic category that provides, if not relief, at least understanding and a medical home. 

A recent paper from Britain provided a classic of this form, investigating the causes of neurodevelopmental disorders (NDD), which encompass a huge range of problems from mild to severe. They comment that even after the most modern analysis and intensive sequencing, 60% of NDD cases still cannot be assigned causes. A large part of the problem is that, despite knowing the full sequence of the human genome, its function is less well-understood. The protein-coding genes (20,000 of those, roughly) are delineated and studied pretty closely. But that only accounts for 1 to 2% of the genome. The rest ranges from genes for a blizzard of non-coding RNAs, some of which are critical, to large regulatory regions with smatterings of important sites, to junk of various kinds- pseudogenes, relic retroviruses, repetitive elements, etc. The importance of any of these elements (and individual DNA base positions within them) varies tremendously. This means specifically that exome sequencing is not going to cut it. Exome sequencing focuses on a very small part of the genome, which is fine if your syndrome (such as a common cancer) is well characterized and known to arise from the usual suspects. But for orphan syndromes, it does not cast a wide enough net. Secondly, even with full genome sequencing, so little is known about the remoter regions of the genome that assigning a function to variations found there is difficult to impossible. It takes statistical analysis of the incidence of the variation versus the incidence of the syndrome.
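To make that last point concrete, here is a minimal sketch of one such test, a one-sided Fisher exact test on a two-by-two table of carriers versus non-carriers among cases and controls. The counts are hypothetical, chosen only for illustration, and this is not necessarily the statistic the authors used.

```python
from math import comb

def fisher_one_sided(case_carriers, case_noncarriers,
                     control_carriers, control_noncarriers):
    """Probability of seeing at least this many carriers among the cases
    if the variant had nothing to do with the syndrome."""
    total = (case_carriers + case_noncarriers
             + control_carriers + control_noncarriers)
    n_cases = case_carriers + case_noncarriers
    n_carriers = case_carriers + control_carriers
    # Sum the hypergeometric tail from the observed count upward.
    tail = sum(
        comb(n_carriers, k) * comb(total - n_carriers, n_cases - k)
        for k in range(case_carriers, min(n_cases, n_carriers) + 1)
    )
    return tail / comb(total, n_cases)

# Hypothetical counts, NOT the paper's data: 30 carriers among 9,000 cases,
# none among 30,000 controls.
p_value = fisher_one_sided(30, 8970, 0, 30000)
print(f"p = {p_value:.2e}")
```

A very small p-value says only that the variant is enriched among cases; showing how it causes disease still takes the structural and functional work described in the rest of the post.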

These authors used a trove of data- the Genomics England 100,000 genomes project, focusing on the ~9,000 genomes in this collection from people with NDD syndromes. (Plus additional genomes collected elsewhere.) (We can note in passing that Britain's nationalized health system remains at the forefront of innovative research and care.) What they found was an unusually high incidence of a particular mutation in a non-protein-coding gene called RNU4-2. The product of this gene is an RNA called U4, which is an important part of the spliceosome, where it pairs RNA-to-RNA with another RNA, U6, in a key step of selecting the first (5-prime) side of an intron that is to be spliced out of mRNA messages. This gene would never have come up in exome analysis, being non-protein-coding. Yet it is critically important, as splicing happens to the vast majority of human genes. Additionally, differential splicing- the selection of alternative exons and splice sites in a regulated way- happens frequently in developmental programs and neurological cell types. There is a class of syndromes called spliceosomopathies that are caused by defects in mRNA splicing, and tend to appear as syndromes in these processes.

As shown in the images (all based on a large corpus of other work on spliceosomes), RNU4-2/U4 pairs intimately with the U6 spliceosomal RNA, and the mutation found by the current group (which is a single nucleotide insertion) causes a bulge in this pairing, as marked. Meanwhile, the U6 RNA pairs at the same time with the exon-intron junction of the target mRNA (bottom image), at a site that is very close to the U4 pairing region (top image). The upshot is that this single base insertion into U4 causes some portion of the target mRNAs to be mis-spliced, using non-natural 5 prime splice sites and thus altering their encoded proteins. This may cause minor problems in the protein, but more often will cause a shift in translation frame, a premature stop codon, and total loss of the functional protein. So this tiny mutation can have severe effects and is indeed genetically dominant- that is, one copy overrides a second wild-type copy to generate the NDD diseases that were studied.

The U4 RNA (teal) paired with the U6 RNA (gray), within an early spliceosome complex. The mutation studied here is pointed out in black (n.64_65insT - i.e. insertion of a T). Note how it would cause a bulge in the pairing. Importantly, the location in the U6 RNA that pairs with the mRNA (see below) is right next door, at the ACAGAGA (light gray). The authors use this structural work from others to suggest how the mutation they found can alter selected splicing sites and thus lead to disease. Other single nucleotide insertions that cause similar syndromes are marked with black arrows, while single nucleotide substitutions that cause less severe syndromes are marked with orange RNA segments.

The U6 RNA (pink) paired with its mRNA target to be spliced. It binds right at the intron (gray) exon (black) boundary, where the cut will eventually be made to remove the intron. The bump from the mis-paired mutant U4 RNA (see above) distorts this binding, sending U6 to select wrong locations for splicing.
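The consequence of choosing a wrong 5-prime splice site can be sketched with a toy gene. Everything here is invented for illustration (the sequences, the two-base shift of the splice site); only the standard genetic code is real. Sequences are written in DNA letters (T rather than U) for simplicity.

```python
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))  # standard genetic code

def translate(mrna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

def splice(pre_mrna, donor, acceptor):
    """Remove the intron lying between the donor and acceptor positions."""
    return pre_mrna[:donor] + pre_mrna[acceptor:]

# Hypothetical toy gene: two exons around one intron (GT...AG).
exon1, intron, exon2 = "ATGGCCAAG", "GTGTAAGTCCTTTTCAG", "CTGATCGAGTAA"
pre_mrna = exon1 + intron + exon2
donor, acceptor = len(exon1), len(exon1) + len(intron)

normal = splice(pre_mrna, donor, acceptor)
# A cryptic 5-prime site two bases downstream keeps "GT" in the message.
mis_spliced = splice(pre_mrna, donor + 2, acceptor)

print(translate(normal))       # MAKLIE
print(translate(mis_spliced))  # MAKV
```

Two extra bases shift the reading frame, a stop codon turns up almost at once, and the rest of the protein is lost. Real genes are far longer, but the logic is the same.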


The researchers went on to survey this and other spliceosomal RNA genes for similar mutations, and found few to none outside the region marked in the diagram above. For example, there is a highly similar gene called RNU4-1. But this gene is expressed about 100-fold less in brain and other tissues, making RNU4-2 the principal source of U4 RNA, and much more significant as a causal factor for NDD. It appears that other locations in RNU4-2 (and other spliceosomal RNA genes) are even more important than the one mutated location found here, thus are never found, being lethal and heavily selected against, in this highly conserved gene. 

They also noted that, while this RNU4-2 mutation is severe, and thus must happen spontaneously (i.e. not inherited from parents), it only occurs on the maternal alleles, not paternal alleles in the affected children. They speculate that this may be due to effects this gene may have in male gametogenesis, killing affected sperm preferentially, but not affected oocytes. Lastly, this set of mutations (in the small region shown in the first figure above) appears to account for, in their estimation, about 0.4% of all NDD seen in Britain. This is a remarkably high rate for such a particular mutation that is not heritable. They speculate that some mutation hotspot kind of process may be causing these events, above the general mutation rate. What this all says about so-called "intelligent design", one may be reluctant to explore too deeply. On the other hand, this still leaves plenty of room to hunt for additional variations that cause these syndromes.

In this research, we see that clinically critical variations can pop up in many places, not just among the "usual suspects", genetically and genomically speaking. While much of the human genome is junk, most of it is also expressed (as RNA) and all of it is fair game for clinically important (if tragic) effects. The NDD syndromes caused by the mutation studied here are very severe- far more so than the ADD or mild autism diagnoses that make up most of the NDD spectrum. Understanding the causal nexus between the genome and human biology and its pathologies remains an ongoing and complicated scientific adventure.


  • Playing the heel. Being the heel
  • It sure is great to be the victim.
  • Oh, right.. now we really know what is going on.
  • More spiritual warfare.
  • Another grift.

Saturday, February 8, 2025

Sugar is the Enemy

Diabetes, cardiovascular health, and blood glucose monitoring.

Christmas brought a book titled "Outlive: The Science and Art of Longevity". Great, I thought- something light and quick, in the mode of Gwyneth Paltrow or Deepak Chopra. I have never been into self-help or health fad and diet books. Much to my surprise, however, it turned out to be a rather rigorous program of preventative medicine, with a side of critical commentary on our current medical system. A system that puts various thresholds, such as blood sugar and blood pressure, at levels that represent serious disease, and cares little about what led up to them. Among the many recommendations and areas of focus, blood glucose levels stand out, both for their pervasive impact on health and aging, and also because there are new technologies and science that can bring their dangers out of the shadows.

Reading: 

Where do cardiovascular problems, the biggest source of mortality, come from? Largely from metabolic problems in the control of blood sugar. Diabetics know that uncontrolled blood sugar is lethal, in both the acute and the long term. But the rest of us need to realize that the damage done by swings in blood sugar is more insidious and pervasive than commonly appreciated. Both microvascular damage (what is commonly associated with diabetes, in the form of problems with the small vessels of the kidney, legs, and eyes) and macrovascular damage (atherosclerosis) are due to high and variable blood sugar. The molecular biology of this was impressively unified in 2005 in the paper above, which argues that excess glucose clogs the mitochondrial respiration mechanisms. Their membrane voltage maxes out, reactive forms of oxygen accumulate, and glucose intermediates pile up in the cell. This leads to at least four different and very damaging consequences for the cell, including glucose modification (glycation) of miscellaneous proteins, a reduction of redox damage repair capacity, inflammation, and increased fatty acid export from adipocytes to endothelial (blood vessel) cells. Not good!

Continuous glucose monitored concentrations from three representative subjects, over one day. These exemplify the low, moderate, and severe variability classes, as defined by the Stanford group. Line segments are individually classed as to whether they fall into those same categories. There were 57 subjects in the study, of all ages, none with an existing diagnosis of diabetes. Yet five of them had diabetes by traditional criteria, and fourteen had pre-diabetes by those criteria. By this scheme, 25 had severe variability as their "glucotype", 25 had moderate variability, and only 7 had low variability. As these were otherwise random subjects selected to not have diabetes, this is not great news about our general public health, or the health system.

Additionally, a revolution has occurred in blood glucose monitoring, where anyone can now buy a relatively simple device (called a CGM) that gives continuous blood glucose monitoring to a cell phone, and associated analytical software. This means that the fasting blood glucose level that is the traditional test is obsolete. The recent paper from Stanford (and the literature it cites) suggests, indeed, that it is variability in blood glucose that is damaging to our tissues, more so than sustained high levels.
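A small sketch shows why a single average or fasting number can hide what a CGM reveals. The two traces below are invented, and the coefficient of variation is just one simple variability measure; it is not the classification scheme the Stanford group used.

```python
from statistics import mean, pstdev

def glucose_summary(readings_mg_dl):
    """Mean, standard deviation, and coefficient of variation of a trace."""
    avg = mean(readings_mg_dl)
    sd = pstdev(readings_mg_dl)
    return {"mean": avg, "sd": sd, "cv_percent": 100 * sd / avg}

# Hypothetical readings in mg/dL, for illustration only.
steady = [95, 100, 105, 100, 95, 100, 105, 100]
spiky = [70, 160, 75, 150, 65, 130, 80, 70]

for name, trace in (("steady", steady), ("spiky", spiky)):
    summary = glucose_summary(trace)
    print(f"{name}: mean {summary['mean']:.0f}, CV {summary['cv_percent']:.1f}%")
```

Both traces average 100 mg/dL, so a test keyed to the average would treat them alike, while the variability measure separates them clearly.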

One might ask why, if blood glucose is such a damaging and important mechanism of aging, evolution hasn't developed tighter control over it. Other ions and metabolites are kept within much tighter ranges. Sodium ranges between 135 and 145 mM, and calcium from 8.8 to 10.7 mg/dL (roughly 2.2 to 2.7 mM). Well, glucose is our food, and our need for glucose internally is highly variable. Our livers are tiny brains that try very hard to predict what we need, based on our circadian rhythms, our stress levels, our activity both current and expected. It is a difficult job, especially now that stress rarely means physical activity, and nor does travel, in our automobiles. But mainly, this is a problem of old age, so evolution cares little about it. Getting a bigger spurt of energy for a stressful event when we, in our youth, are in crisis may, in the larger scheme of things, outweigh the slow decay of the cardiovascular system in old age. Not to mention that traditional diets were not very generous at all, certainly not in sugar and refined carbohydrates.


Saturday, February 1, 2025

Proving Evolution the Hard Way

Using genomes and codon ratios to estimate selective pressures was so easy... why is it not working?

The fruits of evolution surround us with abundance, from the tallest tree to the tiniest bacterium, and the viruses of that bacterium. But the process behind it is not immediately evident. It was relatively late in the enlightenment before Darwin came up with the stroke of insight that explained it all. Yet that mechanism of natural selection remains an abstract concept requiring an analytical mind and due respect for very inhuman scales of the time and space in play. Many people remain dumbfounded, and in denial, while evolutionary biology has forged ahead, powered by new discoveries in geology and molecular biology.

A recent paper (with review) offered a fascinating perspective, both critical and productive, on the study of evolutionary biology. It deals with the opsin protein that hosts the visual pigment 11-cis-retinal, by which we see. The retinal molecule is the same across all opsins, but different opsin proteins can "tune" the light wavelength of greatest sensitivity, creating the various retinal-opsin combinations for all visual needs, across the cone cells and rod cells. This paper considered the rhodopsin version of opsin, which we use in rod cells to perceive dim light. They observed that in fish species, the sensitivity of rhodopsin has been repeatedly adjusted to accommodate light at different depths of the water column. At shallow levels, sunlight is similar to what we see, and rhodopsin is tuned to about 500 nm, while deeper down, when the light is more blue-ish, rhodopsin is tuned towards about 480 nm maximum sensitivity. There are also special super-deep fish who see by their own red-tinged bioluminescence, and their rhodopsins are tuned to 526 nm. 

This "spectrum" of sensitivities of rhodopsin has a variety of useful scientific properties. First, the evolutionary logic is clear enough, matching the fish's vision to its environment. Second, the molecular structure of these opsins is well-understood, the genes are sequenced, and the history can be reconstructed. Third, the opsin properties can be objectively measured, unlike many sequence variations which affect more qualitative, difficult-to-observe, or impossible-to-observe biological properties. The authors used all this to carefully reconstruct exactly which amino acids in these rhodopsins were the important ones that changed between major fish lineages, going back about 500 million years.

The authors' phylogenetic tree of fish and other species they analyzed rhodopsin molecules from. Note how mammals occupy the bottom small branch, indicating how deeply the rest of the tree reaches. The numbers in the nodes indicate the wavelength sensitivity of each (current or imputed) rhodopsin. Many branches carry the authors' inference, from a reconstructed and measured protein molecule, of what precise changes happened, via positive selection, to get that lineage.

An alternative approach to evolutionary inference is a second target of these authors. That is a codon-based method that evaluates the rate of change of DNA sites under selection versus sites not under selection. In protein coding genes (such as rhodopsin), every amino acid is encoded by a triplet of DNA nucleotides, per the genetic code. With 64 codons for ~20 amino acids, it is a redundant code where many DNA changes do not change the protein sequence. These changes are called "synonymous". If one studies the rate of change of synonymous sites in the DNA (which form sort of a control in the experiment), compared with the rate of change of non-synonymous sites, one can get a sense of evolution at work. Changing the protein sequence is something that is "seen" by natural selection, and especially at important positions in the protein, some of which are "conserved" over billions of years. Such sites are subject to "negative" selection, which is to say rapid elimination due to the deleterious effect of that DNA and protein change.

Mutations in protein coding sequence can be synonymous, (bottom), with no effect, or non-synonymous (middle two cases), changing the resulting protein sequence and having some effect that may be biologically significant, thus visible to natural selection.
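The redundancy of the code is easy to check directly. This sketch builds the standard genetic code, classifies single-base changes as synonymous, non-synonymous, or nonsense (a new stop codon), and tallies every possible single-base change from the 61 amino-acid-encoding codons. The function and variable names are my own.

```python
from collections import Counter

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))  # standard genetic code

def classify(codon_before, codon_after):
    """Classify a codon change by its effect on the encoded amino acid."""
    before, after = CODON_TABLE[codon_before], CODON_TABLE[codon_after]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"
    return "non-synonymous"

# Tally every single-base change starting from each sense codon.
tally = Counter()
for codon in CODONS:
    if CODON_TABLE[codon] == "*":
        continue
    for position in range(3):
        for base in BASES:
            if base != codon[position]:
                mutant = codon[:position] + base + codon[position + 1:]
                tally[classify(codon, mutant)] += 1

print(classify("CTT", "CTC"))  # synonymous: both encode leucine
print(classify("GAA", "GTA"))  # non-synonymous: glutamate to valine
print(dict(tally))
```

Most of the synonymous changes fall at the third codon position, which is why third positions serve as the rough "control" sites in these analyses.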


This analysis has been developed into a high art, also being harnessed to reveal "positive" selection. In this scenario, if the rate of change of the non-synonymous DNA sites is higher than that of the synonymous sites, or even just higher than one would expect by random chance, one can conclude that these non-synonymous sites were not just not being selected against, but were being selected for, an instance of evolution establishing change for the sake of improvement, instead of avoiding change, as usual.
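The ratio at the heart of this approach is usually written dN/dS. The sketch below is a simplified counting-style estimate with a Jukes-Cantor correction for multiple hits; the counts are hypothetical, and published analyses typically fit likelihood-based codon models rather than this back-of-the-envelope version.

```python
from math import log

def jukes_cantor(p):
    """Correct an observed proportion of differing sites for multiple hits."""
    return -0.75 * log(1 - 4 * p / 3)

def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Non-synonymous substitution rate relative to the synonymous rate."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    return dn / ds

# Hypothetical counts for two aligned genes, for illustration only.
conserved = dn_ds(nonsyn_diffs=10, nonsyn_sites=300, syn_diffs=10, syn_sites=100)
adaptive = dn_ds(nonsyn_diffs=30, nonsyn_sites=300, syn_diffs=5, syn_sites=100)
print(f"conserved gene: dN/dS = {conserved:.2f}")  # below 1: negative selection
print(f"adaptive gene:  dN/dS = {adaptive:.2f}")   # above 1: read as positive selection
```

A ratio near 1 is what one would expect if amino acid changes were no more visible to selection than silent ones.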

Now back to the rhodopsin study. These authors found that a very small number of amino acids in this protein, only 15, were the ones that influenced changes to the spectral sensitivity of these protein complexes over evolutionary time. Typically only two or three changes occurred over a shift in sensitivity in a particular lineage, and would have been the ones subject to natural selection, with all the other changes seen in the sequence being unrelated, either neutral or selected for other purposes. It is a tour de force of structural analysis, biochemical measurement, and historical reconstruction to come up with this fully explanatory model of the history of piscine rhodopsins.

But then they went on to compare what they found with what the codon-based methods had said about the matter. And they found that there was no overlap whatsoever. The amino acids identified by the "positive selection" codon-based methods were completely different than the ones they had found by spectral analysis and phylogenetic reconstruction over the history of fish rhodopsins. The accompanying review is particularly harsh about the pseudoscientific nature of this codon analysis, rubbishing the entire field. There have been other, less drastic, critiques as well.

But there is method to all this madness. The codon-based methods were originally conceived in the analysis of closely related lineages. Specifically, various Drosophila (fly) species that might have diverged over a few million years. On this time scale, positive selection has two effects. One is that a desirable amino acid (or other) variation is selected for, and thus swept to fixation in the population. The other, and corresponding effect, is that all the other variations surrounding this desirable variation (that is, which are nearby on the same chromosome) are likewise swept to fixation (as part of what is called a haplotype). That dramatically reduces the neutral variation in this region of the genome. Indeed, the effect on neutral alleles (over millions of nearby base pairs) is going to vastly overwhelm the effect from the newly established single variant that was the object of positive selection, and this imbalance will be stronger the stronger the positive selection. In the limit case, the entire genomes of those without the new positive trait/allele will be eliminated, leaving no variation at all.

Yet, on the longer time scale, over hundreds of millions of years, as was the scope of visual variation in fish, all these effects on the neutral variation level wash out, as mutation and variation processes resume, after the positively selected allele is fixed in the population. So my view of this tempest in an evolutionary teapot is that these recent authors (and whatever other authors were deploying codon analysis against this rhodopsin problem) are barking up the wrong tree, mistaking the proper scope of these analyses. Which, after all, focus on the ratio between synonymous and non-synonymous change in the genome, and thus intrinsically on recent change, not deep change in genomes.
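The time-scale argument can be given rough numbers. This sketch assumes the simplest deterministic model, a haploid population of effectively infinite size with a constant selective advantage s, and counts generations until a rare beneficial allele nears fixation. The starting frequency and values of s are arbitrary illustrations.

```python
def generations_to_sweep(start_frequency, s, threshold=0.99):
    """Generations for a favored allele to rise from start_frequency to threshold."""
    frequency, generations = start_frequency, 0
    while frequency < threshold:
        # Carriers leave (1 + s) offspring for every 1 left by non-carriers.
        frequency = frequency * (1 + s) / (1 + s * frequency)
        generations += 1
    return generations

for s in (0.01, 0.05, 0.10):
    print(f"s = {s:.2f}: {generations_to_sweep(0.001, s)} generations")
```

Under these assumptions even a modest advantage sweeps in hundreds to a couple of thousand generations, an instant next to hundreds of millions of years, which leaves ample time for neutral variation to build back up afterward.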


  • That all-American mix of religion, grift, and greed.
  • Christians are now in charge.
  • Mechanisms of control by the IMF and the old economic order.
  • A new pain med, thanks to people who know what they are doing.

Saturday, January 25, 2025

The Climate is Changing

Fires in LA, and a puff of smoke in DC.

An ill wind has blown into Washington, a government of whim and spite, eager to send out the winged monkeys to spread fear and kidnap the unfortunate. The order of the day is anything that dismays the little people. The wicked witch will probably have melted away by the time his most grievous actions come to their inevitable fruition, of besmirching and belittling our country, and impoverishing the world. Much may pass without too much harm, but the climate catastrophe is already here, burning many out of their homes, as though they were made of straw. Immoral and spiteful contrariness on this front will reap the judgement and hatred of future generations.

But hasn't the biosphere and the climate always been in flux? Such is the awful refrain from the right, in a heartless conservatism that parrots greedy, mindless propaganda. In truth, Earth has been blessed with slowness. The tectonic plates make glaciers look like race cars, and the slow dance of Earth's geology has ruled the evolution of life over the eons, allowing precious time for incredible biological diversification that covers the globe with its lush results.

A stretch of relatively unbroken rain forest, in the Amazon.

Past crises on earth have been instructive. Two of the worst were the end-Permian extinction event, about 252 million years ago (mya), and the end-Cretaceous extinction event, about 66 mya. The latter was caused by a meteor, so was a very sudden event- a shock to the whole biosphere. Following the initial impact and global fire, it is thought to have raised sun-shielding dust and sulfur, with possible acidification, lasting for years. However, it did not have very large effects on CO2, the main climate-influencing gas.

On the other hand, the end-Permian extinction event, which was significantly more severe than the end-Cretaceous event, was a more gradual affair, caused by intense volcanic eruptions in what is now Siberia. Recent findings show that this was a huge CO2 event, turning the climate of Earth upside down. CO2 went from about 400 ppm, roughly what we are at currently, to 2500 ppm. The only habitable regions were the poles, while the tropics were all desert. But the kicker is that this happened over the surprisingly short (geologically speaking) time of about 80,000 years. CO2 then stayed high for the next roughly 400,000 years, before returning slowly to its former equilibrium. This rate of rise was roughly 2.7 ppm per 100 years, yet that change killed off 90% of all life on Earth.

The momentous analysis of the end-Permian extinction event, in terms of CO2, species, and other geological markers, including sea surface temperature (SST). This paper was when the geological brevity of the event was first revealed.

Compare this to our current trajectory, where atmospheric CO2 has risen from about 280 ppm at the dawn of the industrial age to 420 ppm now. That is a rate of maybe 100 ppm per 100 years, and rising steeply. It is a rate far too high for many species, and certainly the process of evolution itself, to keep up with, tuned as it is to geologic time. As yet, this Anthropocene extinction event is not quite at the level of either the end-Permian or end-Cretaceous events. But we are getting there, going way faster than the former, and creating a more CO2-based long-term climate mess than the latter. While we may hope to forestall nuclear war and thus a closer approximation to the end-Cretaceous event, it is not looking good for the biosphere, purely from a CO2 and warming perspective, putting aside the many other plagues we have unleashed including invasive species, pervasive pollution by fertilizers, plastics and other forever chemicals, and the commandeering of all the best land for farming, urbanization, and other unnatural uses.
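The comparison of rates is simple arithmetic, sketched here with the figures quoted in the text. The 250-year span for the industrial era is my assumption; the text gives only "the dawn of the industrial age".

```python
def ppm_per_century(start_ppm, end_ppm, years):
    """Average rate of CO2 change, in ppm per 100 years."""
    return (end_ppm - start_ppm) / years * 100

# End-Permian: about 400 to 2500 ppm over about 80,000 years (from the text).
end_permian = ppm_per_century(400, 2500, 80_000)

# Industrial era: about 280 to 420 ppm; the 250-year span is an assumption.
industrial_average = ppm_per_century(280, 420, 250)

print(f"end-Permian:        {end_permian:.1f} ppm per century")
print(f"industrial average: {industrial_average:.1f} ppm per century")
print(f"ratio:              {industrial_average / end_permian:.0f}x")
```

With these inputs the end-Permian rise works out to about 2.6 ppm per century. Even the long-run industrial average, which understates the recent pace because the rise has been accelerating, comes out roughly twenty times faster.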

CO2 concentrations, along with emissions, over recent time.

We are truly out of Eden now, and the only question is whether we have the social, spiritual, and political capacity to face up to it. For the moment, obviously not. Something disturbed about our media landscape, and perhaps our culture generally, has sent us for succor, not to the Wizard who makes things better, but to the Wicked Witch of the East, who delights in lies, cruelty and destruction.


Saturday, January 18, 2025

Eking Out a Living on Ammonia

Some archaeal microorganisms have developed sophisticated nano-structures to capture their food: ammonia.

The earth's nitrogen cycle is a bit unheralded, but critical to life nonetheless. Gaseous nitrogen (N2) is all around us, but inert, given its extraordinary chemical stability. It can be broken down by lightning, but little else. It must have been very early in the history of life that the nascent chemical-biological life forms tapped out the geologically available forms of nitrogen, despite being dependent on nitrogen for countless critical aspects of organic chemistry, particularly of nucleic acids, proteins, and nucleotide cofactors. The race was then on to establish a way to capture it from the abundant, if tenaciously bound, dinitrogen of the air. It was thus very early bacteria that developed a way (heavily dependent, unsurprisingly, on catalytic metals like molybdenum and iron) to fix nitrogen, meaning breaking up the triple N≡N bond, and making ammonia, NH3 (or ammonium, NH4+). From there, the geochemical cycle of nitrogen is all downhill, with organic nitrogen being oxidized to nitric oxide (NO), nitrite (NO2-), nitrate (NO3-), and finally denitrification back to N2. Microorganisms obtain energy from all of these steps, some living exclusively on either nitrite or nitrate, oxidizing them as we oxidize carbon with oxygen to make CO2.

Nitrosopumilus, as imaged by the authors, showing its corrugated exterior, a layer entirely composed of ammonia collecting elements (can be hexameric or pentameric). Insets show an individual hexagonal complex, in face-on and transverse views. Note also the amazing resolution of other molecules, such as the ribosomes floating about.

A recent paper looked at one of these denizens beneath our feet, an archaeal species that lives on ammonia, converting it to nitrite, NO2-. It is a dominant microbe in its field, in the oceans, in soils, and in sewage treatment plants. The irony is that after we spend prodigious amounts of fossil fuels fixing huge amounts of nitrogen for fertilizer, most of which is wasted, and which today exceeds the entire global budget of naturally fixed nitrogen, we are faced with excess and damaging amounts of nitrogen in our effluent, which is then processed in complex treatment plants by our friends the microbes down the chain of oxidized states, back to gaseous N2.

Calculated structure of the ammonia-attracting pore. At right are various close-up views including the negatively charged amino acids (D, E) concentrated at the grooves of the structure, and the pores where ammonium can transit to the cell surface. 

The Nitrosopumilus genus is so successful because it has a remarkable way to capture ammonia from the environment, a way that is roughly two hundred times more efficient than that of its bacterial competitors. Its surface is covered by a curious array of hexagons, which turn out to be ammonia capture sites. In effect, its skin is a (relatively) enormous chemical antenna for ammonia, which is naturally at low concentration in sea water. These authors do a structural study, using the new methods of particle electron microscopy, to show that these hexagons have intensely negatively charged grooves and pores, to which positively charged ammonium ions are attracted. Within this outer shell, but still outside the cell membrane, enzymes at the cell surface transform the captured ammonium to other species such as hydroxylamine, which enforces the ammonium concentration gradient towards the cell surface, and which are then pumped inside.

Cartoon model of the ammonium attraction and transit mechanisms of this cell wall. 

It is a clever nano-material and micro-energetic system for concentrating a specific chemical- a method that might inspire human applications for other chemicals that we might need- chemicals whose isolation demands excessive energy, or whose geologic abundance may not last forever.


Saturday, January 4, 2025

Drilling Into the Transcriptional Core

Machine learning helps to tease out the patterns of DNA at promoters that initiate transcription.

One of the holy grails of molecular biology is the study of transcriptional initiation. While there are many levels of regulation in cells, the initiation of transcription is perhaps, of all of them, the most powerful. An organism's ability to keep the transcription of most genes off, and turn on genes that are needed to build particular tissues, and regulate others in response to other urgent needs, is the very soul of how multicellular organisms operate. The decision to transcribe a gene into its RNA message (mRNA) represents a large investment, as that transcript can last hours or more and during that time be translated into a great many protein copies. Additionally, this process identifies where, in the otherwise featureless landscape of genomic DNA, genes are located, which is another significant process, one that it took molecular biologists a long time to figure out.

Control over transcription is generally divided into two conceptual and physical regions- enhancers and promoters. Enhancers are typically far from the start site of transcription, and are modules of DNA sequences that bind innumerable regulatory proteins which collectively tune, in fine and rough ways, initiation. Promoters, in contrast, are at the core and straddle the start site of transcription (TSS, for short). They feature a much more limited set of motifs in the DNA sequence. The promoter is the site where the proteins bound to the various enhancers converge and encourage the formation of a "preinitiation complex", which includes the RNA polymerase that actually carries out transcription, plus a lot of ancillary proteins. The RNA polymerase can not initiate on its own or find a promoter on its own. It requires direction by the regulatory proteins and their promoter targets before finding its proper landing place. So the study of promoter initiation and regulation has a very long history, as a critical part of the central flow of information in molecular biology, from DNA to protein.

A schematic of a promoter, where initiation of transcription of Gene A happens, with the start site (+1) right at the boundary of the orange and green colors. At this location, the RNA polymerase will melt the DNA strands, and start synthesizing an RNA strand using the (bottom) template strand of the DNA. Regulatory proteins bound to enhancers far away in the genomic DNA bend through space to activate proteins bound at the core promoter to load the polymerase and initiate this process.

A recent paper provided a novel analysis of promoter sequences, using machine learning to derive a relatively comprehensive account of the relevant sequences. Heretofore, many promoters had been dissected in detail and several key features found. But many human promoters had none of them, showing that our knowledge was incomplete. This new approach started strictly from empirical data- the genome sequence, plus large experimental compilations of nascent RNAs, as they are expressed in various cells, and mapped to the precise base where they initiated from- that is, their respective TSS. These were all loaded into a machine learning model that was supplemented with explanatory capabilities. That is, it was not just a black box, but gave interpretable results useful to science, in the form of small sequence signatures that it found are needed to make particular promoters work. These signatures presumably bind particular proteins that are the operational engines of regulatory integration and promoter function.

The TATA motif, found about 30 base pairs upstream of the transcription start site in many promoters. This is a motif view, where the statistical prevalence of the base is reflected in the height of the letter (top, in color) and its converse is reflected below in gray. Regular patterns like this found in DNA usually mean that some protein typically binds to this site, in this case TFIID.


For example, the grand-daddy of them all is the TATA box, which dates back to bacteria / archaea and was easily dug up by this machine learning system. The composition of the TATA box is shown above in a graphical form, where the probability of occurrence (of a base in the DNA) is reflected in height of the base over the axis line. A few G/C bases surround a central motif of T/A, and the TSS is typically 30 base pairs downstream. What happens here is that one of the central proteins of the RNA polymerase positioning complex, TFIID, binds strongly to this sequence, and bends the DNA here by ninety degrees, forming a launchpad of sorts for the polymerase, which later finds and opens DNA at the transcription start site. TFIID and the TATA box are well known, so it certainly is reassuring that this algorithmic method recovered it. TATA boxes are common at regulated promoters, being highly receptive to regulation by enhancer protein complexes. This is in contrast to more uniformly expressed (housekeeping) genes which typically use other promoter DNA motifs, and incidentally tend to have much less precise TSS positions. They might have start sites that range over a hundred base pairs, more or less stochastically.
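To make the motif picture concrete, here is a minimal Python sketch of how a pattern like the TATA box can be scored along a stretch of DNA. It is not the paper's model. The frequency table is invented for illustration, the uniform background is an assumption, and the 30 base pair offset is simply the typical spacing mentioned above, measured here from the start of the motif.

```python
import math

# Hypothetical base frequencies for a six-position TATA-like motif.
# The numbers are illustrative only, not taken from the paper.
FREQS = [
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},  # T
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},  # A
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},  # T
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},  # A
    {"A": 0.60, "C": 0.05, "G": 0.05, "T": 0.30},  # A, sometimes T
    {"A": 0.85, "C": 0.05, "G": 0.05, "T": 0.05},  # A
]
BACKGROUND = 0.25  # assumed uniform background frequency for each base


def log_odds(window):
    """Sum of log2(motif frequency / background) over one motif-sized window."""
    return sum(math.log2(col[base] / BACKGROUND) for col, base in zip(FREQS, window))


def best_match(seq):
    """Return (score, position) of the best-scoring window in an uppercase ACGT string."""
    width = len(FREQS)
    scores = [(log_odds(seq[i:i + width]), i) for i in range(len(seq) - width + 1)]
    return max(scores)


def predicted_tss(seq, offset=30):
    """Guess a start site a fixed offset downstream of the best motif match."""
    _, position = best_match(seq)
    return position + offset
```

Given a toy sequence such as "GCGCGC" + "TATAAA" + "GC" * 20, best_match finds the motif at position 6 and predicted_tss returns 36. The letter heights in a motif logo are essentially a picture of a table like FREQS.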

The main advance of this paper was to find more DNA sites, and new types of sites, which collectively account for the positioning and activation of all promoters in humans. Instead of the previously known three or four factors, they found nine major DNA sequences, and a smattering of weaker patterns, which they combine into a predictive model that matches empirical data. Most of these DNA sequences were previously known, but not as part of core promoters. For example, one is called YY1, because it binds the YY1 protein, which has long been appreciated to be a transcriptional repressor, from enhancer positions. But now it turns out to also be a core promoter participant, identifying and turning on a class of promoters that, as for most of the new-found sequence elements, tend to operate genes that are not heavily regulated, but rather universally expressed and with delocalized start sites.

Motifs and initiator elements found by the current work. Each motif, presumably matched by a protein that binds it, gets its own graph of relation of the motif location (at 0 on the X axis) vs the start site of transcription that it directs, which for TATA is about 30 base pairs downstream. Most of the newly discovered motifs are bi-directional, directing start sites and transcription both upstream and downstream. This wastes a lot of effort, as the upstream transcripts are typically quickly discarded. The NFY motif has an interesting pattern of 10.5 bp periodicity of its directed start sites, which suggests that the protein complex that binds this site hugs one side of the DNA quite closely, setting up start sites on that side of the helix.

Secondly, these authors find that most of the new sequences they identify have bidirectional effects. That is, they set up promoters to fire in both directions, both typically about forty base pairs downstream and also upstream from their binding site. This explains a great deal of transcription data derived from new sequencing technologies, which shows that many promoters fire in both directions, even though the "upstream" or non-gene side transcript tends to be short-lived.


Overview of the new results, summarized by type of DNA sequence pattern. The total machine learning prediction was composed of predictions for larger motifs, which were the dominant pattern, plus a small contribution from "initiators", which comprise a few patterns right at the start site, plus a large but diffuse contribution from tiny trinucleotide patterns, such as the CG pattern known to mark active genes and carry activating DNA methylation marks.


A third finding was the set of trinucleotide motifs that serve as the sort of fudge factor for their machine learning model, filling in details to make the match to empirical data come out better. The length was set more or less arbitrarily, but they play a big part in the model fit. They note that one common example is the CG pattern, which is one of the stronger trinucleotide motifs. This pattern is known as CpG, and is the target of chemical methylation of DNA by regulatory enzymes, which helps to mark and regulate genes. The current work suggests that there may be more systems of this kind yet to be discovered, which play a modulating role in gene/promoter selection and activation.

The accuracy of this new learning and modeling system exemplifies some of the strengths of AI, of which machine learning is a sub-discipline. When there is a lot of data available, and a problem that is well defined and on the verge of solution (like the protein folding problem), then AI, or these machine learning methods, can push the field over the edge to a solution. AI / ML are powerful ways to explore a defined solution space for optimal results. They are not "intelligent" in the normal sense of the word (at least not yet), which would imply having generalized world models that would allow them to range over large areas of knowledge, solve undefined problems, and exercise common sense.


Saturday, December 21, 2024

Inside the Process of Speciation

Adaptive radiations are messy, so no wonder we have a hard time reconstructing them.

Darwin drew a legendary diagram in his great book, of lineage trees tracing speciation from ancestors to descendants. It was just a sketch, and naturally had clear fork points where one species turns into two. But in real life, speciation is messier, with range overlaps, inter-breeding, and difficulties telling species apart. Ornithologists are still lumping and splitting species to this day, as more data come in about ranges, genetics, sub-populations, breeding behavior, etc. And if defining existing species is difficult, defining exactly where they split in the distant past is even harder.

Darwin's notebook sketch of speciation, from ancestors ... to descendants.

The advent of molecular data from genomes gave a tremendous boost to the amount of information on which to base phylogenetic inferences. It gave us a whole new domain of life, for one thing. And it has helped sharpen countless phylogenies that had not been fully specified by fossil and morphological data. But still, difficulties remain. The deepest and most momentous divergences, like the origin of life itself, and the origin of eukaryotes, remain shrouded in hazy and inconclusive trees, as do many other lineages, such as the origin of birds. It seems to be a rule that when a group of organisms undergoes rapid evolution / speciation, the tree they are on (as reconstructed by us from contemporary data) becomes correspondingly unclear and unresolved, difficult to trace through that tumultuous time. In part this is simply a matter of timing. If dramatic events happened within a few million years a billion years ago, our ability to resolve the sequence of those events is going to be weak in any case, compared to the same events spread out over a hundred million years.

A recent paper documented some of this about phylogeny in general, by correlating times of morphological change with times of phylogenetic haziness, which they term "gene-tree conflict". That is to say, if one samples genes across genomes to draw phylogenetic trees, different genes will give different trees. And this phenomenon increases right when there are other signs of rapid evolutionary change, i.e. changing morphology.

"One insight gleaned from phylogenomics is that gene-tree conflict, frequently caused by population-level processes, is often rampant during the origin of major lineages."

They identify three mechanisms behind this observation: incomplete lineage sorting (ILS), hybridization, and rapid evolution. Obviously, these need to be unpacked a bit. ILS is a natural consequence of the fact that species arise not from single organisms, but from populations. Gene mutations that differentiate the originating and future species happen all over the respective genomes, and enter the future lineage at different times. Some may happen well after the putative speciation event, and become fixed (that is, prevalent) later in that species. Others may have happened well before the speciation event, and die off in most of the descending lineages. The fact is that not every gene is going to march in lock step with the speciation event, in terms of its variants. So phylogenetic inference is best done using lots of genes plus statistical methods to arrive at the most likely explanation of the diverse individual gene trees.
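How often should a single gene disagree with the species tree from ILS alone? A rough feel comes from the standard three-species coalescent model, which is a textbook result and not the method of either paper discussed here. If the branch between two successive splits lasts t coalescent units, the gene tree matches the species tree with probability 1 - (2/3)e^(-t). The Python sketch below computes that and checks it with a small simulation; the function names and parameters are my own.

```python
import math
import random


def concordance_probability(t):
    """Chance that a gene tree matches the species tree ((A,B),C),
    given an internal branch of length t in coalescent units."""
    return 1.0 - (2.0 / 3.0) * math.exp(-t)


def simulate_concordance(t, n_genes=20000, seed=1):
    """Monte Carlo version of the same model, one draw per gene."""
    rng = random.Random(seed)
    concordant = 0
    for _ in range(n_genes):
        # The A and B lineages coalesce at rate 1 within the internal branch.
        if rng.expovariate(1.0) < t:
            concordant += 1
        # Otherwise three lineages reach the deeper ancestor, and the first
        # pair to coalesce is A and B only one time in three.
        elif rng.randrange(3) == 0:
            concordant += 1
    return concordant / n_genes
```

With t = 1 about three quarters of genes agree with the species tree, while as t approaches zero, as in a rapid radiation, agreement falls toward one third, which is no better than chance among the three possible trees.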

Graphs drawn from different sources relating gene conflicts in lineage estimation, (top), versus rate of morphological change from the fossil record, (bottom), in birds, and over time on the X axis. There are dramatic upticks in all metrics going back towards the end-Cretaceous extinction event.


Similarly, hybridization means that proto-species are still occasionally interbreeding with their ancestors or other relatives, (think of Neanderthals), thereby mixing up the gene trees relative to the overall speciation tree. This can even happen by gene transfer mediated by viruses. "Rapid evolution" is not defined by these authors, and comes dangerously close to using the conclusion (of high morphological change during periods of "gene-tree conflict") to describe their premise. But generally, this would mean that some genes are evolving rapidly, due to novel selective pressures, thus deviating from the general march of neutral evolution that affects most loci more evenly. This rate change can mess up phylogenetic inferences, lengthening some (gene) tree branches versus others, and making a unitary tree (that is, for the species or lineage as a whole) hard to draw.

But these are all rather abstract ideas. How does this process look on the ground? A wonderful paper on the tomato gives us some insight. This group traced the evolutionary history of a genus of tomato (Solanum sect. Lycopersicon) in the South American Andes (plus Galapagos islands just off-shore, interestingly enough). These form a tight group of about thirteen species that evolved from a single ancestor over the last two million years, before jumping onto our lunch plates via intensive breeding by native South Americans. This has been a rapid process of evolution, and phylogenies have been difficult to draw, for all the reasons given above. The tomatoes are mostly reproductively isolated, but not fully, and have various specializations for their microhabitats. So are they real species? And how can they evolve and specialize if they do not fully isolate from each other?

Gene-based phylogenetic tree of Andean tomato species. The consensus tree is in black at the right, while alternate trees (cloud) are drawn from 2,745 windows of 100 kb across the tomato genomes, clearly giving diverse views of the lineage tree. Lycopersicon are the species under study, while Lycopersicoides is an "outgroup" genus used as a control / comparison.

In the graph above, there is, as they say, rampant discord among genomic segments, versus the overall consensus tree that they arrived at:

"However, these summary support measures conceal rampant phylogenetic complexity that is evident when examining the evolutionary history of more defined genomic partitions."

For one thing, much of the sequence diversity in the ancestor survives in the descendent lineages. The founders were not single plants, by any means. Second, there has been a lot of "introgression", which is to say, breeding / hybridization between lineages after their putative separation. 

Lastly, they find a high rate of novel mutations, often subject to clear positive selection. Ten enzymes in the carotenoid biosynthesis pathway, which affects fruit color in a group that has evolved red fruits, carry novel mutations. A UV light damage repair gene shows strong signs of positive selection, in high-altitude species. Others show novel mutations in a temperature stress response gene, and selection on genes defending plants against heavy metals in the soil.

Their conclusion (as that of the previous paper) is that adaptive radiations are characterized by several components that scramble normal phylogenetic analysis, including variably preserved diversity from the originating species, post-divergence gene flow (i.e. mating), and rapid adaptation to new conditions along with strong environmental selection over the pre-existing diversity. All of these mechanisms are happening at the same time, and each position in the genome is being affected at the same time, so this is a massively parallel process that, while slow in human time, can be very rapid in geologic time. They note how tomato speciation compares with some other well-known cases:

"Nonetheless, based on our crude estimates within each analysis, we infer that relatively small yet substantial fractions of the euchromatic genome are implicated in each source of genetic variation. We find little evidence that one of these processes predominates in its contribution, although our estimates suggest that de novo mutation might be relatively more influential and cross-species introgression relatively less so. This latter observation is in interesting contrast with several recent studies of animal adaptive radiations, including in Darwin’s Finches [18], Equids [14], and fish [13], where evidence suggests that hybridization and introgression might be much more pervasive and influential than previously suspected, and more abundant than we detect in Solanum."

Naturally, neither of these studies goes back in time to nail down exactly what happened during these evolutionary radiations, nor what caused them. They only give hints about causation. Why the stasis of some species, and the rapid niche-finding and filling by others? Was the motive force natural selection, or god? The latter paper gives some clear hints about possible selective pressures and rationales that were at work in the Andes and Galapagos on the genus of Solanum. But it is always frustratingly a matter of abstract reasoning, in the manner of Darwin, that paints the forces at work, however detailed the genetic and biogeographic analyses and however convincing the analogous laboratory experiments on model, usually microbial, organisms. We have to think carefully, and within the discipline of known forces and mechanisms, to arrive at intellectually honest answers to these questions, insofar as they can be answered at all.


Saturday, December 7, 2024

Cranking Up DNA, One Gyration at a Time

The mechanism of DNA gyrase, which supercoils bacterial DNA.

Imagine that you have a garden hose that is thirty miles long. How would you keep it from getting tangled? That is unlikely to be easy. Now add randomly placed heavy machinery that actively twists that hose as it travels / pulls along, causing it to wind up ahead, and unwind behind. And that machinery can be placed in either direction, often getting into head-on conflicts, not to mention going at quite different speeds. That is the problem our cells have, managing their DNA. 

They use a set of topoisomerases to manage the topology of DNA- that is, its twist-i-ness. One easy method is to nick the DNA on one of its two strands, allowing it to relax by spinning around the remaining phosphate bond, before resealing it back to a double strand and sending it on its way. But what if you encounter coils or knots that can't be resolved that way? The next level is to cut one entire DNA molecule, not just one side/strand of it, and pass the conflicting one through it. All organisms contain topoisomerases of both kinds, and they are essential.

How DNA gets twisted. While most topoisomerases relax DNA (top) to resolve the many twisty problems posed by transcription and replication, gyrase increases twist by grabbing and holding a quasi-positive twist, then cutting and resolving it, as shown at bottom.

Bacteria have an additional enzyme that we do not have, called gyrase, to crank up the supercoiling of their DNA, to make it easier to open for transcription. Gyrase works just like a type II topoisomerase that cuts a double-stranded DNA and lets another DNA through, but it does so in a special way that puts a twist on the DNA first, so instead of relaxing the DNA, it increases the stress. How exactly that works has been a bit mysterious, though gyrases and the general principles they operate under have been clear for decades. Gyrase uses ATP, and grabs onto two parts of a DNA molecule, one of which is pre-twisted into a coil, after which one is cut and the other passed through to create a change (-2) in the twisting number of that DNA.

A general model of gyrase action. The G segment of DNA is firmly held by the gyrase dimer in the center.  The same DNA is forcibly twisted about, around the pinwheel structures, and bent back around to enter through the N-gate (as the T segment). Then, the N gate closes, paving the way for the G-segment to be cut and separated (step 3). ATP is the energy source behind all this structural drama. The T-segment then passes through the cut, enters the C-gate, and the cycle is complete.

A recent paper determined the structure of active gyrase complexes, and was able to trace the pre-twisted conformation. This, combined with a lot of past work on the ATPase and cleavage functions of gyrase, allows a reasonably full picture of how this enzyme works. It is a symmetric dimer of a two-subunit protein, so there are four protein chains in all. There are three major regions of the full structure: the N-gate at top, where one segment (the T-segment) of DNA binds; the central DNA gate, where the other (G-segment) DNA binds and is later cut to let the T-segment through; and the C-gate, where the T-segment ends up and is released at the end of the cycle.

Focus on the pinwheel structure that dramatically pre-twists the DNA around between the G and T segments, pre-positioning the complex for strand passage and increased supercoiling.

The magic is that the T-segment and the G-segment of DNA are parts of the same DNA molecule, by being wrapped around the ears of the protein, which are also called pinwheels. That is what the newest structure solves in greatest detail. These pinwheels essentially allow the enzyme to yank an otherwise normal DNA strand into a pre-knotted (positive supercoil) form that, when cut and resolved as shown, results in a negative increment of supercoiling or twist. If they mutated the pinwheels away, the enzyme could still hold, cut, and relax DNA, but it could not increase its supercoiling. It is the ability of the pinwheel structures to set up a pre-twisted structure onto the DNA that makes this enzyme a machine to increase negative supercoiling, and thus ease other DNA transactions. 
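The bookkeeping behind "a negative increment of supercoiling" can be sketched in a few lines of Python. This uses the standard definitions of linking number and superhelical density, not anything specific to the new structure. The 10.5 base pairs per helical turn is the usual figure for relaxed DNA, and the plasmid size and cycle count below are hypothetical.

```python
BP_PER_TURN = 10.5  # helical repeat of relaxed B-form DNA (standard figure)


def relaxed_linking_number(n_bp):
    """Linking number of a relaxed circular DNA of n_bp base pairs."""
    return n_bp / BP_PER_TURN


def superhelical_density(n_bp, gyrase_cycles):
    """Superhelical density after some number of gyrase cycles,
    each of which lowers the linking number by 2."""
    lk0 = relaxed_linking_number(n_bp)
    lk = lk0 - 2 * gyrase_cycles
    return (lk - lk0) / lk0
```

For a hypothetical 4,200 bp plasmid, the relaxed linking number is 400, and twelve gyrase cycles bring the superhelical density to about -0.06, roughly the level usually quoted for bacterial DNA.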

Topoisomerase enzymes through evolution, from gyrase (left) to human topoII on the right. Note how the details of the protein structure are virtually unrecognizable, while the overall shape and DNA-binding stays the same.

Bacteria also have more normal type II topoisomerases that cut DNA merely to relax it, so one might wonder how these two enzymes get along. Well, gyrase is responsible for the overall negative supercoiling of the bacterial genome, while the other topoisomerases have more localized roles to relieve transient knots and over-twisting. Indeed, if you negatively twist DNA enough, you can separate its strands entirely, which is not usually desirable. Further research shows that too much of either topoisomerase is lethal, and that they are kept in balance by transcriptional controls over the amount of each topoisomerase. This suggests a futile cycle of DNA winding and unwinding, as the optimal condition in bacterial cells when both are present in just the right amounts. 


Saturday, November 9, 2024

Rings of Death

We make pore-forming proteins that poke holes in cells and kill them. Why?

Gasdermin proteins are parts of the immune system, and exist in bacteria as well. It was only in 2016 that their mechanism of action was discovered, as forming unusual pores. The function of these pores was originally assumed to be offensive, killing enemy cells. But it quickly became apparent that they more often kill the cells that make them, as the culmination of a process called pyroptosis, a form of (inflammatory) cell suicide. Further work has only deepened the complexity of this system, showing that gasdermin pores are more dynamic and tunable in their action than originally suspected.

The structure is quite striking. The protein starts as an auto-inhibited storage form, sitting around in the cell. When the cell comes under attack, a cascade of detection and signaling occurs that winds up expressing a family of proteases called caspases. Some of these caspases can cut the gasdermin proteins, removing their inhibitory domain and freeing them to assemble into multimers. About 26 to 32 of these activated proteins can form a ring on top of a membrane (let's say the plasma membrane), which then cooperatively jut down their tails into the membrane and make a massive hole in it.

Overall structure of assembled gasdermin protein pores.


Simulations of pore assembly, showing how the trapped membrane lipids would pop out of the center, once pore assembly is complete.


These holes, or pores, are big enough to allow small proteins through, and certainly all sorts of chemicals. So one can understand that researchers thought that these were lethal events. And gasdermins are known to directly attack bacterial cells, being responsible in part for defense against Shigella bacteria, among others. But then it was found that gasdermins are the main way that important cytokines like the highly pro-inflammatory IL-1β get out of the cell. This was certainly an unusual mode of secretion, and the gasdermin D pore seems specifically tailored, in terms of shape and charge, to conduct the mature form of IL-1β out of the cell. 

It also turned out that gasdermins don't always kill their host cells. Indeed, they are far more widely used for temporary secretion purposes than for cell killing. And this secretion can apparently be regulated, though the details of that remain unclear. In structural terms, gasdermins can apparently form partial and mini-pores that are far less lethal to their hosts, allowing, by way of their own expression levels, a sensitive titration of the level of response to whatever danger the cell is facing.

Schematic of how lower concentrations of gasdermin D (lower path, blue) allow smaller pores to form with less lethality.

Equally interesting, the bacterial forms of gasdermin have just begun to be studied. While they may have other functions, they certainly can kill their host cell in a suicide event, and researchers have shown that they can shut down phage infection of a colony or lawn of bacterial cells. That is, if a phage-infected cell can signal and activate its gasdermin proteins fast enough, it can commit suicide before the phage has time to fully replicate, beating the phage at its own race of infection and propagation. 

Bacteria committing suicide for the good of the colony or larger group? That introduces the theme of group selection, since committing suicide certainly doesn't do the individual bacterium any good. It is only in a family group, clonal colony, or similar community that suicide for the sake of the (genetically related) group makes sense. We, as multicellular organisms, are way past that point. Our cells are fully devoted to the good of the organism, not themselves. But to see this kind of heroism among bacteria is, frankly, remarkable.

Bacteria have even turned around to attack the attacker. The Shigella bacteria mentioned above, which are directly killed by gasdermins, have evolved an enzymatic activity that tags gasdermin with ubiquitin, sending it to the cellular garbage disposal and saving themselves from destruction. It is an interesting validation of the importance of gasdermins and the arms race that is afoot, within our bodies.


  • A tortured ballot.
  • Great again? Corruption and degradation is our lot.
  • We may be in a (lesser) Jacksonian age. Populism, bad taste, big hair, and mass deportation.
  • Beautiful Jupiter.
  • Bill Mitchell on our Depression job guarantee: "So for every $1 outlaid the total societal benefits were around $6 over the lifetime of the participant."
  • US sanctions are scrambling our alliances and the financial system.
  • Solar works for everyone.


Saturday, October 12, 2024

Pumping DNA

Arnold has nothing on the DNA pumps that load phages.

DNA is a very unwieldy molecule. Elegant in concept, but as organisms accumulated more features and genes, it got extremely long and twisty. So a series of management proteins arose, such as helicases and gyrases to relieve the torsional tension, and topoisomerases to cut and pass strands through each other to resolve knots. Another class is DNA pumps, which can forcefully travel along DNA to thread it into useful spaces, like the head of a phage, or a domain in our nucleus, to facilitate transcriptional isolation or organized recombination and synapsis. While other motors, acting on actin and microtubules, manage DNA segregation during mitosis, cell division, and cell movement, true DNA motors deal directly with DNA.

An iconic electron micrograph of a phage with its head blown open. The previously enclosed DNA is splayed about, suggesting the capsid's great capacity for DNA, and great pressure it was under. Inset shows an intact phage. Note the landing tentacles, which attach to the target bacterium.

There are several types of DNA pump, the lower-powered of which I have reviewed previously. The champions in terms of force, however, are the pumps that fill phage heads. Phages are viruses that infect bacteria, and they operate under a variety of limitations. Size is one- they have to be small and have small genomes, due to the small size of their targets, the brevity of their life cycle, and the mathematics of scattered propagation. Bacterial cells are under turgor pressure, of about three atmospheres, and have strong cell walls to hold everything in. So their infecting phages have several barriers to overcome. One solution is to be under even higher pressure themselves, up to about sixty atmospheres. That way, once the injection system has cut through the cell wall and inner membrane, the phage genome, which is pretty much the only thing in the phage head (or capsid), can shoot out rapidly and take over the cell. 

Schematic of late phage development, where the motor (blue) docks to the phage head and fills it with DNA, after which the tail assembly is attached.

How does the DNA loading pump work? It is closely docked into the phage head structure, has a pentagonal structure attached to the phage head, and a loosely attached, 12-sided inner rosette that they describe as a sort of bearing or ball-race. The outer pentagon has an ATPase at each vertex, and these fire sequentially during the pumping mechanism. Each ATP advances the DNA by about two base pairs. Presumably the head has a structure that guides the DNA into regular loops around its inside walls. 

Structure of the dodecameric portion of the phage DNA pump, without the ATPase pentameric portion. Obviously, the DNA threads through the center.

In the diagram below (reference), three steps are shown. First, (a, top), the "I" ATPase node (red) is linked to the "J" and "A" rosette nodes. "A" is where the rosette hooks into the DNA (red). Next, the rosette is expanded a bit, bringing "A" out of register from "I" and "C" into register with "II". At the same time, "C" links to the DNA two base pairs down from where "A" latched into it. In the third step, the rosette squashes again, the DNA ends up raised by two base pairs, and the process can start all over. It is a bit of a sleeve/ratchet mechanism. They do not speculate at this point which of these steps is the power stroke- where the ATP is hydrolyzed. Getting only two base pairs into the head per ATP doesn't seem very efficient, but it is evidently at the end of packaging, when the pressure rises to extreme levels, where this pump shines. And it can get a 19,000 bp genome into a phage head in three minutes, (~100 bp per second), so it isn't a slouch when it comes to speed, either.
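The arithmetic behind those figures is easy to check. The Python sketch below uses only the numbers quoted above (19,000 bp, three minutes, two base pairs per ATP) and assumes, for simplicity, a constant packaging rate, which surely is not true as the head fills and the pressure rises.

```python
GENOME_BP = 19_000         # genome length quoted in the text
PACKAGING_SECONDS = 3 * 60 # three minutes
BP_PER_ATP = 2             # step size per ATP quoted in the text


def packaging_stats(genome_bp=GENOME_BP, seconds=PACKAGING_SECONDS, bp_per_atp=BP_PER_ATP):
    """Average rates, assuming a constant packaging speed throughout."""
    atp_total = genome_bp / bp_per_atp
    return {
        "bp_per_second": genome_bp / seconds,
        "atp_total": atp_total,
        "atp_per_second": atp_total / seconds,
    }
```

That works out to a bit over 100 bp per second, 9,500 ATP per genome, and roughly 50 ATP hydrolyzed per second across the five ATPases.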

Model of how this pump works. See text above for details.


Not only is this pump an amazing and powerful bit of biotechnology, able to compress DNA to sixty atmospheres, but it is a fourth fundamental type of motor, in addition to the rotary motors as found in flagella, the linear motors found along actin and microtubules, and the DNA threading/looping motors of condensin/cohesin.


  • The 2024 Nobel prizes show the close nexus between computers and molecular biology. The original finding of miRNA complementarity could not have been made without a computerized sequence search.
  • When truth is a gaffe, and lies are routine.
  • Could crypto be any worse or more corrupting?