Sunday, May 29, 2022

Evolution Under (Even in) Our Noses

The Covid pandemic is a classic and blazingly fast demonstration of evolution.

Evolution has been "controversial" in some precincts. While tradition told the fable of genesis, evolution told a very different story of slow yet endless change and adaptation- a mechanistic story of how humans ultimately arose. The stark contrast between these stories, touching both on the family tree we are heir to, and also on the overall point and motivation behind the process, caused a lot of cognitive dissonance, and is a template of how a fact can be drawn into the left/right, blue/red, traditional/progressive cultural vortex.

This all came to a head a couple of decades ago, when in the process of strategic retreat, anti-evolution forces latched onto some rather potent formulations, like "just a theory", and "intelligent design". These were given a lot of think tank support and right wing money, as ways to keep doubt alive in a field that scientifically had been settled and endlessly ramified for decades. To scientists, it was the height of absurdity, but necessitated wading into the cultural sphere in various ways that didn't always connect effectively with their intended audience. But eventually, the tide turned, courts recognized that religion was behind it all, and kept it out of schools. Evolution has more or less successfully receded from hot-button status.

One of the many rearguard arguments of anti-evolutionists was that sure, there is short-term evolution, like that of microbes or viruses, but that doesn't imply that larger organisms are the way they are due to evolution and selection. That would be simply beyond the bounds of plausibility, so we should search for explanations elsewhere. At this point they were a little gun-shy and didn't go so far in public as to say that elsewhere might be in a book like the Bible. This line of argument was a little ironic, since Darwin himself hardly knew about microbes, let alone viruses, when he wrote his book. The evidence that he adduced (in some profusion) described the easily visible signs of geology, of animals and plants around the world (including familiar domestic animals), which all led to the subtle, yet vast, implications he drew about evolution by selection.

So it has been notable that the vistas of biology that opened up since that time, in microbiology, paleontology, genetics, molecular biology, et al., have all been guided by these original insights and have in turn supported them without fail. No fossils are found out of order in the strata, no genes or organisms parachute in without antecedents, and no chicken happens without an egg. Evolution makes sense of all of biology, including our current pandemic.

But you wouldn't know it from the news coverage. New variants arise into the headlines, and we are told to "brace" for the next surge, or the next season. Well, what has happened is that the SARS-COV2 virus has adapted to us, as we have to it, and we are getting along pretty well at this point. Our adaptation to it began as a social (or antisocial!) response that was very effective in frustrating transmission. But of late, it has been more a matter of training our immune systems, which have an internal selective principle. Between rampant infections and the amazing vaccines, we have put up significant protective barriers to severe illness, though not, notably, to transmission.

But what about the virus? It has adapted in the most classic of ways, by experiencing a wide variety of mutations that address its own problems of survival. It is important to remember that this virus originated in some other species (like a bat) and was not very well adapted to humans. Bats apparently have countless viruses of this kind that don't do them much harm. Similarly, HIV originated in chimpanzee viruses that didn't do them much harm either. Viruses are not inherently interested in killing us. No, they survive and transmit best if they keep us walking around, happily breathing on other people, with maybe an occasional sneeze. The ultimate goal of every virus is to stay under the radar, not causing its host to either isolate or die. (I can note parenthetically that viruses that do not hew to this paradigm, like smallpox, are typically less able to mutate, thus less adaptable, or have some other rationale for transmission than upper respiratory spread.)

And that is clearly what has happened with SARS-COV2. Local case rates in my area are quite high, and wastewater surveillance indicates even higher prevalence. Isolation and mask mandates are history. Yet hospitalizations remain very low, with no one in the ICU right now. Something wonderful has happened. Part of it is our very high local vaccination rate (96% of the population), but another part is that the virus has become less virulent as it has adapted to our physiology, immune systems, media environment and social practices, on its way to becoming endemic, and increasingly innocuous. All this in a couple of years of world-wide spread, after billions of infections and transmissions.

The succession (i.e. evolution) of variants detected in my county

The trend of local wastewater virus detection, which currently shows quite high levels, despite mild health outcomes.

So what has the virus been doing? While it has many genes and interactions with our physiology, the major focus has been on the spike protein, which is most prominent on the viral surface, is the first protein to dock to specific human proteins (the ACE2 cell surface receptor), and is the target of all the mRNA and other specific subunit vaccines. (As distinct from the killed virus vaccines that are made from whole viruses.) It is the target of 40% of the antibodies we naturally make against the whole virus, if we are infected. It is also, not surprisingly, the most heavily mutated portion of the virus, over the last couple of years of evolution. One paper counts 45 mutations in the spike protein that have risen to the level of "variants of concern" at WHO. 

"We found that most of the SARS-COV-2 genes are undergoing negative purifying selection, while the spike protein gene (S-gene) is undergoing rapid positive selection."


Structure of the spike protein, in its normal virus surface conformation, (B, C), and in its post-triggering extended conformation that reaches down into the target cell's membrane, and later pulls the two together. Top (in B, C) is where it binds to the ACE2 target on respiratory cells, and bottom is its anchor in the viral membrane coat (D shows it upside-down). At top (A) is the overall domain structure of the protein, in its linear form as synthesized, especially the RBD (receptor binding domain) and the two protease cleavage sites that prepare it for eventual triggering.


The spike protein is a machine, not just a blob. As shown in this video, it starts as a pyramidal blob flexibly tethered to the viral surface. Binding the ACE2 proteins in our respiratory tracts triggers a dramatic re-organization whereby this blob turns into a thin rope, which drops into the target cell. Meanwhile, the portion stuck to the virus unfolds as well and turns into threads that wind back around the newly formed rope, thereby pulling the virus and the target cell membrane together and ultimately fusing them. This is, mechanistically, how the virus gets inside our cells.

The triggering of the spike protein is a sensitive and adjustable process. In related viruses, the triggering is more difficult, and waits till the virus is engulfed in a vesicle that is taken into the cell, and acidified in the normal process of lysosomal destruction / ingestion of outside materials. The acidification triggers these viral spike proteins to fire and release the virus into the cell. Triggering also requires cleavage of the spike protein with proteases that cut it at two locations. Other related viruses sometimes wait for a target host protease to do the honors, but SARS-COV2 spike protein apparently is mostly cleaved during production by its originating host. This raises the stakes, since it can then more readily trigger, by accident, or once it finds proper ACE2 receptors on a target host. One theme of recent SARS-COV2 evolution is that triggering has become slightly easier, allowing the virus to infect higher up in the respiratory system. The original strains set up infections deep in the lung, but recent variants infect higher up, which lessens the systemic risks of infection to the host, promotes transmissibility, and speeds the infection and transmission process.

The mutations G339D, N440K, L452R, S477N, T478K, and E484K in the spike region that binds to ACE2 (RBD, or receptor binding domain) promote this interaction, raising transmissibility. (The nomenclature is that the number gives the position of the amino acid in the linear protein sequence, the first letter gives the original amino acid in one-letter code, and the last letter gives the mutated version.) Overall, mutations of the spike protein have increased the net charge on the spike protein significantly in the positive direction, which encourages binding to the negatively charged ACE2 protein. D614G is not in this region, but is nearby and seems to have similar effects, stabilizing the protein. The P681 mutation in one of the cleaved regions promotes proteolysis by the enzyme furin, thus making the virus more trigger-able.
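To make the nomenclature and the charge argument concrete, here is a minimal sketch that parses the labels listed above and tallies how each substitution shifts side-chain charge. The charge table is a simplification I am assuming for illustration (lysine and arginine as +1, aspartate and glutamate as -1, histidine and everything else as neutral), and the function names are hypothetical. It covers only the six RBD mutations named here, not the spike protein as a whole.

```python
import re

# Assumed simplification: side-chain charge at roughly neutral pH.
# K and R count as +1, D and E as -1, all others (including H) as 0.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def parse_mutation(label):
    """Split a label such as 'N440K' into (original, position, mutated)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"not a substitution label: {label!r}")
    original, position, mutated = match.groups()
    return original, int(position), mutated


def charge_change(label):
    """Charge of the mutated residue minus charge of the original residue."""
    original, _, mutated = parse_mutation(label)
    return CHARGE.get(mutated, 0) - CHARGE.get(original, 0)


# The RBD substitutions named in the text.
RBD_MUTATIONS = ["G339D", "N440K", "L452R", "S477N", "T478K", "E484K"]

# Summed charge shift across those six substitutions.
NET_CHANGE = sum(charge_change(label) for label in RBD_MUTATIONS)

if __name__ == "__main__":
    for label in RBD_MUTATIONS:
        print(label, parse_mutation(label), charge_change(label))
    print("net change:", NET_CHANGE)
```

Under these assumptions, one substitution (G339D) adds a negative charge, but the others more than offset it, so the sum comes out positive, in line with the trend described above. D614G, by the same bookkeeping, removes a negative charge.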

What are some other constraints on the spike protein? It needs to evade our vaccines and natural immunity, but has seemingly adapted to a here-and-gone infection style, though with periodic re-infection, like other colds. So any change is good for the purpose of camouflage, as long as its essential functions remain intact. The N-terminal, or front, domain of the spike protein, which is not involved directly in ACE2 binding, has experienced a series of mutations of this kind. An additional function it seems to have is to mimic a receptor for the cytokine interleukin 8, which attracts neutrophils and encourages activation of macrophages. Such mimicry may reduce this immune reaction, locally. 

In comparison to all these transmissibility-enhancing mutations, it is not clear yet where the mutations that decrease virulence are located. It is likely that they are more widely distributed, not in the gene encoding the spike protein. SARS-COV2 has a remarkable number of genes with various interactions with our immune systems, so the scope for tuning is prodigious. If all this can be accomplished in a couple of years, imagine what a million, or a billion, years can do for other organisms that, while they have slower reproduction cycles and more complicated networks of internal and external relations, still obey that great directive to adapt to their circumstances.


  • Late link, on receptor binding vs immune evasion tradeoffs.
  • Yes, chimpanzees can talk.
  • The rich are getting serious about destroying democracy.
  • Forced arbitration is, generally, unconscionable and should be illegal.
  • We could get by with fewer nuclear weapons.
  • Originalism would never allow automatic or semiautomatic weapons.

Saturday, May 21, 2022

What Binds to DNA?

Large scale studies of what binds to DNA over whole genomes.

Biology is full of codes. There is the genetic code, but there are many others. There are protein localization codes- short sequences on many proteins that tell them where they should be transported to, such as to the mitochondrion, lysosome, the exterior, or the nucleus. There are kinase codes- the positions on many proteins where modification by phosphate changes their behavior. There is a histone code, which is the set of acetylations and methylations on histone tails which have wide-ranging influence on transcription of DNA to RNA. There is a sugar code- the many glycosyl modifications of proteins displayed externally on cells, which affect how they are seen and work in that space. Lastly, there is a code of short sites on DNA where specialized proteins bind, and by which transcription and other processes are regulated. Humans have roughly 1600 loci, out of their 20,000 genes, which appear to encode such proteins, and each binds somewhere and does something in our biology.

The important cancer-suppressing gene p53 (green), binding to DNA (orange).  To the right is a closeup, showing a few of the detailed base contacts, with dashed black lines. Proteins that bind to DNA in sequence-specific ways feel their way around by making many such shape and charge-guided physical contacts.

The study of how and where these proteins bind has a long history, with many such proteins now exceedingly well characterized, to the atomic level. But at the genomic scale, it is still something of a crapshoot to guess where and whether some DNA site binds a regulatory protein. Such proteins have rather flexible requirements, which researchers express in "motifs". These motifs are short and typically variable, or "degenerate". That is, each position in such a motif can be one of the four bases, and frequently more than one base is allowed. In the motif shown below, for the protein ZBTB33, only one G is absolutely required. The other positions are variable to some degree. Outer areas of a binding site tend to be less selective, naturally, as they are less strongly bound by the protein. Some proteins can bind to two different motifs, and some can be accessorized by partners of various kinds to bind yet other sites. Evolution is the great tinkerer, and in this system, interactions are frequently kept rather loose and fluid, enabling precision where needed (partly by complexing numerous regulatory factors & binding sites with each other in large cassettes), but also flexibility and adaptability elsewhere.

A representation of what DNA sequences the zinc finger protein ZBTB33 binds to. Each position along the DNA site is shown as the collection of possible bases seen in functional sites, with each shown in proportion to its frequency of occurrence. The central G is the only absolutely required base, though several others are nearly invariant.
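For readers who want to see how such a motif representation is built, here is a minimal sketch that turns a handful of aligned binding sites into per-position base frequencies. The sites are invented for illustration and are not real ZBTB33 data. They are constructed so that exactly one position (a central G) is invariant while the rest tolerate some variation. The function names are hypothetical.

```python
from collections import Counter

# Hypothetical aligned binding sites, invented for illustration only.
# Every site carries G at index 3; all other positions vary somewhat.
SITES = ["TCCGCGA", "TCTGCGA", "ACCGCGA", "TCCGCAA", "TCCGAGA", "CTCGCGG"]


def frequency_matrix(sites):
    """Return one {base: frequency} dict per position of the aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("sites must be aligned to the same length")
    matrix = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        matrix.append({base: counts.get(base, 0) / len(sites) for base in "ACGT"})
    return matrix


def consensus(matrix):
    """Most frequent base at each position."""
    return "".join(max(column, key=column.get) for column in matrix)


def invariant_positions(matrix):
    """Indices where a single base accounts for every observed site."""
    return [i for i, column in enumerate(matrix) if max(column.values()) == 1.0]


MATRIX = frequency_matrix(SITES)

if __name__ == "__main__":
    print("consensus:", consensus(MATRIX))
    print("invariant positions:", invariant_positions(MATRIX))
```

A sequence logo is essentially a drawing of this matrix, with letter heights scaled by frequency or information content. Genome-wide prediction then amounts to sliding such a matrix along the DNA and scoring each window, which is where the looseness of the motif starts to generate many plausible but unverified hits.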


So the question of what binds where is not an easy one to answer, just going from the sequence of the genome. Naturally, this has been the subject of recent advances in large-scale biology, enabling researchers to, for instance, identify all the binding sites of a given protein across the genome (in a given cell type and culture condition). Or alternately to identify all the "accessible" sites across a genome (and also in a given cell type and culture condition), which would be locations where chromatin is "opened" up due to the binding of whatever regulatory proteins. This latter style of experiment naturally leads to the question- what is doing all that binding?

A recent paper comes from that field, deploying the latest machine learning and convolutional neural nets to find the answer, at least on a statistical basis. They combine a series of bulk open-chromatin experiments with a database of known transcription regulator motifs to match genomic sites with plausible proteins that bind there. In usual machine learning fashion, they reserve some of the training data for testing and validation, enabling the production of ROC statistics for accuracy and for comparison with other methods, of which there are many. But what they do not do is actually test the accuracy of their data in the lab, with actual cells and proteins. That would be hard for a bioinformatics lab! So their talk of "accuracy" is rather untethered from reality, though fine enough for the journal they published in, which is Public Library of Science, Computational Biology.
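To show what a held-out ROC statistic actually measures, here is a minimal sketch of the area under the ROC curve, computed as the fraction of (positive, negative) pairs that the model ranks correctly. The labels and scores are invented, and this is a generic textbook calculation, not the paper's evaluation pipeline.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count as half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented held-out examples: 1 = site annotated as bound, 0 = not bound.
HELD_OUT_LABELS = [1, 1, 1, 0, 0, 0]
HELD_OUT_SCORES = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

if __name__ == "__main__":
    print("AUC:", roc_auc(HELD_OUT_LABELS, HELD_OUT_SCORES))
```

The point to notice is that the labels come from the same kind of data the model was trained on. A high score says the model reproduces its reference annotations, not that the predicted proteins were shown to bind in a cell.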

All that said, this is a code that is going to be very difficult to crack, since regulatory proteins are not just highly diverse and their sites degenerate, but they are themselves regulated in many ways, by phosphorylation, sumoylation, ubiquitination, methylation, complexing with partners, the generation of variable isoforms through transcription, and cleavage, among others. The same protein that activates transcription here may repress it there. So the "motif" is a bit of a chimera, as is its effect on gene expression. The great tinkerer has gone so far down the rabbit hole that even "Deep Mind" is going to have a hard time following it down, without further empirical advances ... such as massive upgrades in methods to identify specific protein binding sites across the genome.


  • Can steel be green?
  • If Russia leaves Ukraine, the war can end very quickly. If not, then it won't.
  • What happened to Finland?
  • Ride hailing meets economics.
  • Why doesn't CPAC go all the way to Moscow?
  • Boiling point.
  • Another Ukraine end game.

Saturday, May 14, 2022

Tangling With the Network

Molecular biology needs better modeling.

Molecular biologists think in cartoons. It takes a great deal of work to establish the simplest points, like that two identifiable proteins interact with each other, or that one phosphorylates the other, which has some sort of activating effect. So biologists have been satisfied to achieve such critical identifications, and move on to other parts of the network. With 20,000 genes in humans, expressed in hundreds of cell types, regulated states and disease settings, work at this level has plenty of scope to fill years of research.

But the last few decades have brought larger scale experimentation, such as chips that can determine the levels of all proteins or mRNAs in a tissue, or the sequences of all the mRNAs expressed in a cell. And more importantly, the recognition has grown that any scientific field that claims to understand its topic needs to be able to model it, in comprehensive detail. We are not at that point in molecular biology, at all. Our experiments, even those done at large scale and with the latest technology, are in essence qualitative, not quantitative. They are also crudely interventionistic, maybe knocking out a gene entirely to see what happens in response. For a system as densely networked as the eukaryotic cell, it will take a lot more to understand and model it.

One might imagine that this is a highly detailed model of cellular responses to outside stimuli. But it is not. Some of the connections are much less important than others. Some may take hours to have the indicated effect, while others happen within seconds or less. Some labels hide vast sub-systems with their own dynamics. Important items may still be missing, or assumed into the background. Some connections may be contingent on (or even reversed by) other conditions that are not shown. This kind of cartoon is merely a suggestive gloss and far from a usable computational (or true) model of how a biological regulatory system works.


The field of biological modeling has grown communities interested in detailed modeling of metabolic networks, up to whole cells. But these remain niche activities, mostly because of a lack of data. Experiments remain steadfastly qualitative, given the difficulty of performing them at all, and the vagaries of the subjects being interrogated. So we end up with cartoons, which lack not only quantitative detail on the relative levels of each molecule, but also critical dynamics of how each relationship develops in time, whether in a time scale of seconds or milliseconds, as might be possible for phosphorylation cascades (which enable our vision, for example), or a time scale of minutes, hours, or days- the scale of changes in gene expression and longer-term developmental changes in cell fate.

These time and abundance variables are naturally critical to developing dynamic and accurate models of cellular activities. But how to get them? One approach is to work with simple systems- perhaps a bacterial cell rather than a human cell, or a stripped down minimal bacterial cell rather than the E. coli standard, or a modular metabolic sub-network. Many groups have labored for years to nail down all the parameters of such systems, work which remains only partially successful at the organismal scale.

Another approach is to assume that co-expressed genes are yoked together in expression modules, or regulated by the same upstream circuitry. This is one of the earliest forms of analysis for large scale experiments, but it ignores all the complexity of the network being observed, indeed hardly counts as modeling at all. All the activated genes are lumped together into one side, and all the down-regulated genes on the other side, perhaps filtered by biggest effect. The resulting collections are clustered by some annotation of those genes' functions, thereby helping the user infer what general cell function was being regulated in her experiment / perturbation. This could be regarded perhaps as the first step on a long road from correlation analysis of gene activities to a true modeling analysis that operates with awareness of how individual genes and their products interact throughout a network.

Another approach is to resort to a lot of fudge factors, while attempting to make a detailed model of the cell / components. Assume a stable network, and fill in all the values that could get you there, given the initial cartoon version of molecule interactions. Simple models thus become heuristic tools to hunt for missing factors that affect the system, which are then progressively filled in, hopefully by doing new experiments. Such factors could be new components, or could be unsuspected dynamics or unknown parameters of those already known. This is, incidentally, of intense interest to drug makers, whose drugs are intended to tweak just the right part of the system in order to send it to a new state- say, from cancerous back to normal, well-behaved quiescence.

A recent paper offered a version of this approach, modular response analysis (MRA). The authors use perturbation data from other labs, such as the inhibition of 1000 different genes in separately assayed cells, combined with a tentative model of the components of the network, and then deploy mathematical techniques to infer / model the dynamics of how that cellular system works in the normal case. What is observed in either case- the perturbed version, or the wild-type version- is typically a system (cell) at steady state, especially if the perturbation is something like knocking out a gene or stably expressing an inhibitor of its mRNA message. Thus, figuring out the (hidden) dynamic in between- how one stable state gets to another one after a discrete change in one or more components- is the object of this quest. Molecular biologists and geneticists have been doing this kind of thing off-the-cuff forever (with mutations, for instance, or drugs). But now we have technologies (like siRNA silencing) to do this at large scale, altering many components at will and reading off the results.

This paper extends one of the relevant mathematical methods (modular response analysis, MRA) to this large scale, and finds that, with a bit of extra data and some simplifications, it is competitive with other methods (mutual information) in creating dynamic models of cellular activities, at the scale of a thousand components, which is apparently unprecedented. At the heart of MRA are, as its name implies, modules, which break down the problem into manageable portions and allow variable amounts of detail / resolution. For their interaction model, they use a database of protein interactions, which is a reasonably comprehensive, though simplistic, place to start.
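To give a feel for what MRA computes, here is a minimal sketch of its classic small-network form, not the large-scale extension the paper describes. The assumption is that each perturbation hits exactly one module directly, and that we measure the steady-state response of every module to every perturbation (the "global" response matrix R). The direct, module-to-module influences (the "local" matrix r, with -1 on its diagonal by convention) then follow from r = -[diag(R^-1)]^-1 R^-1. The three-module network and perturbation strengths below are invented, and the helper names are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting, for small square matrices."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def local_responses(global_responses):
    """Recover r from R: divide each row of R^-1 by its diagonal, then negate."""
    inv = invert(global_responses)
    n = len(inv)
    return [[-inv[i][j] / inv[i][i] for j in range(n)] for i in range(n)]


# Hypothetical network: TRUE_LOCAL[i][j] is the direct effect of module j
# on module i. Module 0 activates 1, 1 activates 2, 2 inhibits 0.
TRUE_LOCAL = [[-1.0, 0.0, -0.5],
              [0.8, -1.0, 0.0],
              [0.0, 0.6, -1.0]]

# Hypothetical direct perturbation strengths, one per module (diagonal).
PERTURBATION = [[0.3, 0.0, 0.0],
                [0.0, -0.2, 0.0],
                [0.0, 0.0, 0.5]]

# Simulated steady-state measurements: R = -r^-1 P.
NEG_INVERSE = [[-v for v in row] for row in invert(TRUE_LOCAL)]
GLOBAL = matmul(NEG_INVERSE, PERTURBATION)

# What an experimenter would infer from the measurements alone.
RECOVERED = local_responses(GLOBAL)

if __name__ == "__main__":
    for row in RECOVERED:
        print([round(v, 6) for v in row])
```

Note that the perturbation strengths never need to be known, since they cancel when each row is divided by its diagonal. The catch is the matrix inversion: with a thousand components and noisy measurements it becomes fragile, which is presumably why extra data, simplifications, and modularization are needed at scale.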

What they find is that they can assemble an effective system that handles both real and simulated data, creating quantitative networks from their inputs of gene expression changes upon inhibition of large numbers of individual components, plus a basic database of protein relationships. And they can do so at reasonable scale, though that is dependent on the ability to modularize the interaction network, which is dangerous, as it may ignore important interactions. As a state of the art molecular biology inference system, it is hardly at the point of whole cell modeling, but is definitely a few steps ahead of the cartoons we typically work with.

The authors offer this as one result of their labors. Grey nodes are proteins, colored lines (edges) are activating or inhibiting interactions. Compared to the drawing above, it is decidedly more quantitative, with strengths of interactions shown. But timing remains a mystery, as do many other details, such as the mechanisms of the interactions.


  • Fiscal contraction + interest rate increase + trade deficit = recession.
  • The lies come back to roost.
  • Status of carbon removal.
  • A few notes on stuttering.
  • A pious person, on shades of abortion.
  • Discussion on the rise of China.