
Sunday, April 13, 2025

The Genome Remains Murky

A brilliant case study identifying the molecular cause of certain neuro-developmental disorders shows how difficult genome-based diagnoses remain.

Molecular medicine is increasingly effective in assessing both hereditary syndromes and cancers. The sequencing approach generally comes in two flavors- whole genome sequencing, or exome sequencing, where only the most important (protein-coding) parts are sampled. In each case, the hunt is for mutations (more blandly called variants) that cause the syndrome being investigated, from among the large number of variants we all carry. This approach is becoming standard-of-care in oncology, due to the tremendous influence and clinical significance of cancer-driving mutations, many of which now match directly to tailored treatments that address them (thus the "precision" in precision medicine).

But another arm of precision medicine is the hunt for causes of congenital problems. There are innumerable genetic disorders whose causal analysis can lead not only to an informative diagnosis, and sometimes to useful treatments, but also to fundamental understanding of human biology. Sufferers of these syndromes may spend a lifetime searching for a diagnosis, being shuffled from one doctor or center to another and subject to various forms of hypothetical medicine, before some deep sequencing pinpoints the cause of their disease and founds a new diagnostic category that provides, if not relief, at least understanding and a medical home. 

A recent paper from Britain provided a classic of this form, investigating the causes of neurodevelopmental (NDD) disorders, which encompass a huge range of problems from mild to severe. They comment that even after the most modern analysis and intensive sequencing, 60% of NDD cases still cannot be assigned a cause. A large part of the problem is that, although we know the full sequence of the human genome, its function is much less well understood. The protein-coding genes (20,000 of those, roughly) are delineated and studied pretty closely. But they account for only 1 to 2% of the genome. The rest ranges from genes for a blizzard of non-coding RNAs, some of which are critical, to large regulatory regions with smatterings of important sites, to junk of various kinds- pseudogenes, relic retroviruses, repetitive elements, etc. The importance of any of these elements (and individual DNA base positions within them) varies tremendously. This means specifically that exome sequencing is not going to cut it. Exome sequencing focuses on a very small part of the genome, which is fine if your syndrome (such as a common cancer) is well characterized and known to arise from the usual suspects. But for orphan syndromes, it does not cast a wide enough net. Secondly, even with full genome sequencing, so little is known about the remoter regions of the genome that assigning a function to variations found there is difficult to impossible. It takes statistical analysis of the incidence of the variant versus the incidence of the syndrome.
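
The statistical step is conceptually simple, even if the genome-wide bookkeeping is not. Below is a minimal sketch in Python of the kind of case/control comparison involved; all counts are hypothetical, and a real analysis would also have to correct for the millions of positions being tested.

```python
# Hedged sketch: does a variant occur more often among affected genomes than
# chance would allow? All numbers below are invented for illustration.
from scipy.stats import fisher_exact

cases_with_variant, cases_without = 38, 8962          # ~9,000 NDD genomes
controls_with_variant, controls_without = 2, 78998    # unaffected genomes

table = [[cases_with_variant, cases_without],
         [controls_with_variant, controls_without]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio ~ {odds_ratio:.0f}, one-sided p = {p_value:.1e}")
```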

These authors used a trove of data- the Genomics England 100,000 genomes project, focusing on the ~9,000 genomes in this collection from people with NDD syndromes. (Plus additional genomes collected elsewhere.) (We can note in passing that Britain's nationalized health system remains at the forefront of innovative research and care.) What they found was an unusually high incidence of a particular mutation in a non-protein-coding gene called RNU4-2. The product of this gene is an RNA called U4, which is an important part of the spliceosome, where it pairs RNA-to-RNA with another RNA, U6, in a key step of selecting the first (5-prime) side of an intron that is to be spliced out of mRNA messages. This gene would never have come up in exome analysis, being non-protein-coding. Yet it is critically important, as splicing happens to the vast majority of human genes. Additionally, differential splicing- the selection of alternative exons and splice sites in a regulated way- happens frequently in developmental programs and neurological cell types. There is a class of syndromes, called spliceosomopathies, that are caused by defects in mRNA splicing and tend to show up as disruptions of just these processes.

As shown in the images (all based on a large corpus of other work on spliceosomes), RNU4-2/U4 pairs intimately with the U6 spliceosomal RNA, and the mutation found by the current group (which is a single nucleotide insertion) causes a bulge in this pairing, as marked. Meanwhile, the U6 RNA pairs at the same time with the exon-intron junction of the target mRNA (bottom image), at a site that is very close to the U4 pairing region (top image). The upshot is that this single base insertion into U4 causes some portion of the target mRNAs to be mis-spliced, using non-natural 5-prime splice sites and thus altering their encoded proteins. This may cause minor problems in the protein, but more often will cause a shift in translation frame, a premature stop codon, and total loss of the functional protein. So this tiny mutation can have severe effects and is indeed genetically dominant- that is, one copy overrides a second wild-type copy to generate the NDD diseases that were studied.

The U4 RNA (teal) paired with the U6 RNA (gray), within an early spliceosome complex. The mutation studied here is pointed out in black (n.64_65insT - i.e. insertion of a T). Note how it would cause a bulge in the pairing. Importantly, the location in the U6 RNA that pairs with the mRNA (see below) is right next door, at the ACAGAGA (light gray). The authors use this structural work from others to suggest how the mutation they found can alter selected splicing sites and thus lead to disease. Other single nucleotide insertions that cause similar syndromes are marked with black arrows, while single nucleotide substitutions that cause less severe syndromes are marked with orange RNA segments.

The U6 RNA (pink) paired with its mRNA target to be spliced. It binds right at the intron (gray) exon (black) boundary, where the cut will eventually be made to remove the intron. The bump from the mis-paired mutant U4 RNA (see above) distorts this binding, sending U6 to select wrong locations for splicing.
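
To make the frameshift logic concrete, here is a toy Python sketch with made-up sequences (nothing from the paper), showing why shifting a 5-prime splice site by even one base is so destructive: the downstream exon is read in the wrong frame, and a premature stop codon soon appears.

```python
STOPS = {"TAA", "TAG", "TGA"}

def splice_and_scan(exon1, intron, exon2, five_prime_cut):
    """Join the exons using a 5-prime splice site 'five_prime_cut' bases into
    the pre-mRNA, keeping the (fixed) 3-prime site at the start of exon2, and
    report the first stop codon in the resulting reading frame."""
    pre_mrna = exon1 + intron + exon2
    mrna = pre_mrna[:five_prime_cut] + exon2
    codons = [mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3)]
    stop_at = next((i for i, c in enumerate(codons) if c in STOPS), None)
    return mrna, stop_at

exon1  = "ATGGCTGCT"          # 9 bases, ends in frame
intron = "GTAAGTTTTTCTCAG"    # toy intron with canonical GT...AG ends
exon2  = "GCTAGCGCTGCTTAA"    # natural frame: GCT AGC GCT GCT TAA (stop only at the end)

# Correct 5-prime site, right at the exon1/intron boundary (after base 9):
_, stop_ok  = splice_and_scan(exon1, intron, exon2, 9)
# Mis-selected site one base downstream, as a bulged U4/U6 pairing might allow:
_, stop_bad = splice_and_scan(exon1, intron, exon2, 10)

print("correct splice:  first stop at codon", stop_ok)   # 7 -- the natural terminator
print("shifted splice:  first stop at codon", stop_bad)  # 4 -- premature stop, truncated protein
```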


The researchers went on to survey this and other spliceosomal RNA genes for similar mutations, and found few to none outside the region marked in the diagram above. For example, there is a highly similar gene called RNU4-1. But this gene is expressed about 100-fold less in brain and other tissues, making RNU4-2 the principal source of U4 RNA, and much more significant as a causal factor for NDD. It appears that other locations in RNU4-2 (and other spliceosomal RNA genes) are even more important than the one mutated location found here; mutations there are presumably lethal and so heavily selected against that they are never observed in this highly conserved gene.

They also noted that, while this RNU4-2 mutation is severe, and thus must happen spontaneously (i.e. not inherited from parents), it occurs only on the maternal allele, not the paternal allele, in the affected children. They speculate that this may be due to effects this gene may have in male gametogenesis, killing affected sperm preferentially, but not affected oocytes. Lastly, this set of mutations (in the small region shown in the first figure above) appears to account for, in their estimation, about 0.4% of all NDD seen in Britain. This is a remarkably high rate for such a particular mutation that is not heritable. They speculate that some mutation hotspot kind of process may be causing these events, above the general mutation rate. What this all says about so-called "intelligent design", one may be reluctant to explore too deeply. On the other hand, this still leaves plenty of room to hunt for additional variations that cause these syndromes.

In this research, we see that clinically critical variations can pop up in many places, not just among the "usual suspects", genetically and genomically speaking. While much of the human genome is junk, most of it is also expressed (as RNA) and all of it is fair game for clinically important (if tragic) effects. The NDD syndromes caused by the mutation studied here are very severe- far more so than the ADD or mild autism diagnoses that make up most of the NDD spectrum. Understanding the causal nexus between the genome and human biology and its pathologies remains an ongoing and complicated scientific adventure.


  • Playing the heel. Being the heel
  • It sure is great to be the victim.
  • Oh, right.. now we really know what is going on.
  • More spiritual warfare.
  • Another grift.

Saturday, April 5, 2025

Psilocybin and the Normie Network

Psychedelic mushrooms desynchronize parts of the brain, especially the default mode network, disrupting our normal sense of reality.

Default mode network- sounds rather dull, doesn't it? But somewhere within that pattern of brain activity lies at least some of our consciousness. It is the hum of the brain while we are resting without focus. When we are intensely focused on something outside, in contrast, it is turned down, as we "lose ourselves" in the flow of other activities. This network remains active during light sleep and has altered patterns during REM sleep and dreaming, as one might expect. A recent paper tracked its behavior during exposure to the psychedelic drug psilocybin.

These researchers measured the level of active connectivity between brain regions (by functional MRI) in human subjects given a high dose of psilocybin, or Ritalin- which stands in as another psychoactive stimulant- or nothing. Compared with the control of no treatment, Ritalin caused a slight loss of connectivity (or desynchronization) (below, b), while psilocybin (a) caused a huge loss of connectivity, clearly correlated with the subjective intensity of the trip. They also found that if, while on psilocybin, they gave their subjects some task to focus on, their connectivity increased again, once more tracking directly with the subjective experiences of the subjects.

The researchers show a metric of connectivity between distinct brain regions, under three conditions. FC stands for functional connectivity, where high numbers (and brighter colors) stand for more distance, i.e. less connectivity/synchrony. Methylphenidate is Ritalin. Synchrony is heavily degraded under psilocybin.
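
Operationally, functional connectivity of this kind boils down to correlating the activity time courses of pairs of brain regions. A minimal sketch with synthetic data (not the paper's actual pipeline, which involves parcellation, preprocessing, and a specific FC-distance metric):

```python
import numpy as np

rng = np.random.default_rng(0)
n_timepoints, n_regions = 300, 5

# A shared slow "network" signal plus independent noise for each region.
shared = np.sin(np.linspace(0, 20 * np.pi, n_timepoints))
regions = np.array([shared + rng.normal(scale=0.7, size=n_timepoints)
                    for _ in range(n_regions)])

# Functional connectivity here is just the matrix of pairwise correlations.
fc = np.corrcoef(regions)
off_diag = ~np.eye(n_regions, dtype=bool)
print("mean pairwise connectivity:", round(fc[off_diag].mean(), 2))

# "Desynchronization" shows up as that mean correlation dropping, e.g. if the
# shared component is weakened relative to each region's own noise.
weakened = np.array([0.3 * shared + rng.normal(scale=0.7, size=n_timepoints)
                     for _ in range(n_regions)])
print("with a weakened shared signal:",
      round(np.corrcoef(weakened)[off_diag].mean(), 2))
```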

Of all the networks they analyzed, the default mode network (DMN) was most affected. This network runs between the prefrontal cortex, the posterior cingulate cortex, the hippocampus, the angular gyrus, and the temporoparietal junction, among others. These are key areas for awareness, memory, time, social relations, and much else that is highly relevant to conscious awareness and personhood. There is a long thread of work in the same vein that shows that psychedelic drugs have these highly correlated effects on subjective experience and brain patterns. A recent review suggested that while subnetworks like the DMN are weakened, the brain as a whole experiences higher synchrony as well as higher receptivity to outside influence in an edgy process that increases its level of (to put it in terms of complexity theory) chaos or criticality.

So that is what is happening! But that is not all. The effects of psychedelics, even from one dose, can be long-lasting, even life-changing. The current researchers note that the DMN desynchronization they see persists, at weaker levels, for weeks. This correlates with the subjective experience of changed senses that can result from a significant drug trip. And that trip, as noted above regarding receptivity and chaos, is a delicate thing, highly subject to the environment and mood the subject is experiencing at the time.

But when a task is being done, the subjects come back down towards normalcy.

These researchers note that brain plasticity comes into play, evidently as a homeostatic response to the wildly novel patterns of brain activation that took the subject out of their rut of DMN self-hood. Synapses change, genes are activated, and the brain reshapes itself, usually in a way that increases receptivity to novel experiences and lifts mood. For example, depression correlates with strong DMN coherence, while successful treatment correlates with less connectivity.

"We propose that psychedelics induce a mode of brain function that is more dynamically flexible, diverse, integrated, and tuned for information sharing, consistent with greater criticality."

So, consider the mind blown. Psychedelics appear to disrupt the usual day-to-day, which is apparently a strongly positive thing to do, both subjectively and clinically. That raises the question of why. Why do we settle into such durable and often negative images of the self and ways of thinking? And why does shaking things up have such positive effects? Jung had a deep conviction that our brains are healing machines, with deep wisdom and powers that are exposed during unusual events, like intense dreams. While there are bad trips, and people who go over the edge from excessive psychedelic use, with a modicum of care, it appears that positive trips, taken in a mood of physical and social safety, let the mind reset in important ways, back to its natural openness.


Saturday, March 29, 2025

What Causes Cancer? What is Cancer?

There is some frustration in the literature.

Fifty years into the war on cancer, what have we learned and gained? We do not have a general cure, though we have a few cures and a lot of treatments. We have a lot of understanding, but no comprehensive theory or guide to practice. While some treatments are pin-point specific to certain proteins and even certain mutated forms of those proteins, most treatments remain empirical, even crude, and few provide more than a temporary respite. Cancer remains an enormous challenge, clinically and intellectually.

Recently, a prominent journal ran a provocative commentary about the origins of cancer, trashing the reigning model of "Somatic Mutation Theory", or SMT- the proposition that cancer is caused by mutations that "drive" cell proliferation, and thus tumor growth. I was surprised at the cavalier insinuations being thrown around by these authors, their level of trash talk, and the lack of either compelling evidence or a coherent alternative model. Some of their critiques have a fair basis, as discussed below, but to say, as the title does, that this is "The End of the Genetic Paradigm of Cancer" is simply wrong.

"It is said that the wise only believe in what they can see, and the fools only see what they can believe in. The latter attitude cements paradigms, and paradigms are amplified by any new-looking glass that puts one’s way of seeing the world on steroids. In cancer research, such a self-fulfilling prophecy has been fueled by next-generation DNA sequencing."

"However, in the quest for predictive biomarkers and molecular targets, the cancer research community has abandoned deep thinking for deep sequencing, interpreting data through the lens of clinical translation detached from fundamental biology."

Whew!

The main critique, once the gratuitous insults and obligatory references to Kuhn and Feynman are cleared away, is that cancer does not resemble other truly clonal disease / population processes, like viral or bacterial infections. In such processes (which have become widely familiar after the COVID and HIV pandemics), a founder genotype can be identified, and its descendants clearly derive from that founder, while accumulating additional mutations that may respond to Darwinian pressures, such as the immune system and other host defenses. While many cancers are clearly driven by some founding mutation, when treatments against that particular "driver" protein are given, resistance emerges, indicating that the cancer is a more diverse population with a very active mutation and adaptation process.

Additionally, tumors are not just clones of the driving cell, but have complex structure and genetic variety. Part of this is due to the mutator phenotypes that arise during carcinogenesis, which blow up the genome and cause large numbers of additional mutations- many deleterious, but some carrying advantages. More significantly, tumors arise from and continue to exist in the context of organs and tissues. They cannot just grow wildly as though they were on a petri plate, but must generate, for example, vascular structures and a "microenvironment" including other cells that facilitate their life. Similarly, metastasis is highly context-dependent and selective- only very few of the cells released by a tumor land in a place they find conducive to new growth. This indicates, again, that the organ setting of cancer cells is critically important, and accounts in large part for this overall difference between cancers and more straightforward clonal processes.

Schematic of cancer development, from a much more conventional and thorough review of the field.

Cancer cells need to work with the developmental paradigms of the organism. For instance, the notorious "EMT", or epithelial-mesenchymal transition, is a hallmark of de-differentiation of many cancer cells. They frequently regress in developmental terms to recover some of the proliferative and self-repair potential of stem cells. What developmental program is available or allowed in a particular tissue will vary tremendously. Thus cancer is not caused by each and every oncogenic mutation, and each organ has particular and distinct mutations that tend to cause cancers within it. Indeed, some organs hardly foster any cancers at all, while other organs with more active (and perhaps evolutionarily recent) patterns of proliferation (such as breast tissue, or prostate tissue) show high rates of cancer. Given the organ setting, cancer "driver" mutations need not only unleash the cell's own proliferation, but re-engineer its relations with other cells to remove their inhibition of its over-growth, and persuade them to provide the environment it needs- nutritionally, by direct contact, by growth factors, vascular formation, immune interactions, etc., in a sort of para-organ formation process. It is a complicated job, and one mutation is, empirically, rarely enough.

"Instead, cancer can be broadly understood as “development gone awry”. Within this perspective, the tissue organization field theory is based on two principles that unite phylogenesis and ontogenesis."

"The organicist perspective is based on the interdependency of the organism and its organs. It recognizes a circular causal regimen by closure of constraints that makes parts interdependent, wherein these constraints are not only molecules, but also biophysical force."

As an argument or alternative theory, this leaves quite a bit to be desired, and does not obviate the role of initiating mutations in the process.

It remains true, however, that oncogenic mutations cause cancer, and treatments that address those root causes have time and again shown themselves to be effective cancer treatments, if tragically incomplete. The rise of shockingly effective immunotherapies for cancer has shown, however, that the immune system takes a more holistic approach to attacking disease than such "precision" single-target therapies, and can make up for the vagaries of the tissue environment and the inflammatory, developmental, and mutational derangements that happen later in cancer development.

In one egregious citation, the authors hail an observation that certain cancers need both a mutation and a chemical treatment to get started, and that the order of these events is not set in stone. Traditionally, the mutation is induced first, and then the chemical treatment, which causes inflammation, comes second. They state: 

"The qualitative dichotomy between a mutagenic initiator that creates ’cancer cells’ and the non-genetic, tissue-perturbing promoter that expands them may not be as clear-cut. Indeed, the reverse experiment (first treatment with the promoter followed by the initiator) equally produces tumors. This result refutes the classical model that requires that the mutagenic (alleged) initiator must act first."

The citation is to a paper entitled "The reverse experiment in two-stage skin carcinogenesis. It cannot be genuinely performed, but when approximated, it is not innocuous". This paper dates from 1993, long before sequencing was capable of evaluating the mutation profiles of cancer cells. Additionally, the authors of this paper themselves point out (in the quote below) a significant asymmetry in the treatments. Their results are not "equal":

"The two substances showed a reciprocal enhancing effect, which was sometimes weak, sometimes additive, and sometimes even synergistic, and was statistically most significant when the results were assessed from the time of DMBA application. Although the reverse experiment was not in any way innocuous it always resulted in a lower tumor crop than the classical sequence of DMBA followed by a course of TPA treatment. 

However, the lower tumor crop in the reverse experiment cannot be used to prove a qualitative difference between initiators and promoters."

(DMBA is the mutagen, while TPA is the inflammatory accelerant.)

So chemical treatment can prepare the ground for subsequent mutant generation in forming cancers in this system, while being much less efficient than the traditional order of events. This is not a surprise, given that this chemical (TPA) treatment causes relatively long-term inflammation and cell proliferation on its own.

"An epistemic shift towards a biological theory of cancer may still be an uphill battle in the current climate of thought created by the ease of data collection and a culture of research that discourages ’disruptive science’. Here, we have made an argument for dropping the SMT and its epicycles. We presented new and old but sidelined theoretical alternatives to the SMT that embrace theory and organismal biology and can guide experiments and data interpretation. We expect that the diminishing returns from the ceaselessly growing databases of somatic mutations, the equivalent to Darwin’s gravel pit, may soon reach a pivot point."

One rarely reads such grandiloquent summaries (or mixed metaphors) in scientific papers! But here they are truly beating up on straw men. In the end, it is true that cancer is quite unlike clonal infectious diseases, and for this, as for many other reasons, has had scientists scratching their heads for decades, if not centuries. But rest assured that this chest-thumping condescension is quite unnecessary, since those in the field are quite aware of these difficulties. The various nebulous alternatives these authors offer, whether the "epigenetic landscape", the "tissue organization field theory", or the "biological theory of cancer" all have kernels of logic, but the SMT remains the foundation-stone of cancer study and treatment, while being, for all the reasons enumerated above and by these authors, only part of the edifice, not the whole truth.


Saturday, March 15, 2025

Eccentricity, Obliquity, Precession, and Glaciation

The glacial cycles of the last few million years were highly determined by earth's orbital mechanics.

Naturalism as a philosophy came into its own when Newton explained the heavens as a machine, not a pantheon. It was stunning to realize that age-old mysteries were thoroughly explicable and that, if we kept at it with a bit of diligence and intellectual openness, we could attain ever-widening vistas of understanding, which now reach to the farthest reaches of the universe. 

In our current day, the mechanics of Earth's climate have become another example of this expansion of understanding, and, sadly, another example of resistance to naturalism, to scientific understanding, and ultimately to the stewardship of our environment. It has dawned on the scientific community (and anyone else willing to look) over the last few decades that our industrial production of CO2 is heating the climate, and that it needs to stop if the biosphere is to be saved from an ever-more degrading crisis. But countervailing excuses and interests abound, and we are now ruled by an administration in the US whose values run toward lies and greed, and which naturally cannot abide moral responsibility.

The Cenozoic, our present age after the demise of the dinosaurs, has been characterized by falling levels of CO2 in the atmosphere. This has led to a progression from very warm climates 50 mya (million years ago) to ice ages beginning roughly 3 mya. The reasons for this are not completely clear. There has been a marked lack of volcanism, which is one of the main ways CO2 gets back into the atmosphere. This contrasts strongly with ages of extreme volcanism like the Permian-Triassic boundary and extinction events, about 250 mya. It makes one think that the earth may be storing up a mega-volcanic event for the future. Yet plate tectonics has kept plugging along, and has sent continents to the poles, where they previously hung out in more equatorial locations. That makes ice ages possible, giving glaciers something to glaciate, rather than letting ocean circulation keep the poles temperate. Additionally, the uplift of the Himalayas has dramatically increased rock exposure and weathering, which is the main driver of CO2 burial, by carbonate formation. And on top of all that has been the continued evolution of plant life, particularly the grasses, which have extra mechanisms to extract CO2 out of the atmosphere.

CO2 in the atmosphere has been falling through most of the Cenozoic.

All this has led to the very low levels of CO2 in the atmosphere, which have cycled between roughly 180 and 300 ppm over the last million years, after very gradually declining prior to that time. Now we are pushing 420 ppm and beyond, which the biosphere has not seen for ten million years or more, and doing so at speeds that no amount of evolution can accommodate. The problem is clear enough, once the facts are laid out.

But what about those glaciations, which have been such a dramatic and influential feature of Earth's climate over the last few million years? They have followed a curious periodicity, advancing and retreating repeatedly over this time. Does that have anything to do with CO2? It turns out that it does not, and we have to turn our eyes to the heavens again for an explanation. It was Milankovitch, a century ago, who first solidified the theory that the changing orbital parameters of Earth, particularly the intensity of sunlight in the Northern hemisphere, where most of the land surface of Earth lies, cause this repetitive climatic behavior.

Cycles of orbital parameters and glaciation, over a million years.

It was in 1976 that a more refined analysis put a mathematical model and better data behind the Milankovitch cycles, showing that one major element of our orbit around the sun- the variation of eccentricity- had the greatest overall effect on the 100,000 year periodicity of recent glacial cycles. Eccentricity is how skewed our orbit is from roundness, which varies slightly over time, due to interactions with other planets. Secondly, the position of the Earth's tilt at various points of this elliptical orbit, whether closer to the sun in northern summer, or farther away, has critical effects on net solar input and on glaciation. The combined measure is called the precessional index, expressing the earth-sun distance in June. The eccentricity itself has a period of about 93,000 years, and the precessional index has a periodicity of 21,000 years. As glacial cycles over the last 800,000 years have had a strong 100,000 year periodicity, it is clearly the eccentricity alone that has the strongest single effect.

Lastly, there is also the tilt of the Earth, called obliquity, which varies slightly with a 40,000 year cycle. A recent paper made a major claim that it had finally solved the whole glaciation cycle in more detail than previously, by integrating all these cycles into a master algorithm for when glaciations start/end. They were curious about exactly what drives the deglaciation phase, within the large eccentricity-driven energetic cycle. The rule they came up with, again using better data and more complicated algorithms, is that deglaciation reaches its maximum rate when, after a minimum of eccentricity, the precession parameter (the purple line, below) has reached a peak, and the obliquity parameter (the green line, below) is rising. That is, when the Earth's degree of tilt and closeness to the sun in Northern summer are mutually reinforcing. There are also lags built into this, since it takes one or two thousand years for these orbital effects to build heat up in the climate system, a bit like spring happening annually well after the equinox.

"We find that the set of precession peaks (minima) responsible for terminations since 0.9 million years ago is a subset of those peaks that begin (i.e., the precession parameter starts decreasing) while obliquity is increasing. Specifically, termination occurs with the first of these candidate peaks to occur after each eccentricity minimum."

 

 

Summary diagram from Barker, et al. At the very top is a synopsis of the orbital variables. At bottom are the glacial cycles, marked with yellow dots (maximum slope of deglaciation), red dots (maximum extent of deglaciation) and blue dots (maximum slope of reglaciation, also called inception). Above this graph is an analysis of the time spans between the yellow and red dots, showing the strength of each deglaciation (gray double arrows). They claim that this strength is proportional to an orbital parameter illustrated above with the T-designation of each glacial cycle. This parameter is precession lagged by obliquity. Finally, in the upper graph, the orbital cycles are shown directly, especially including eccentricity in gray, and the time points of the yellow nodes are matched here with purple nodes, lagged with the preceding (by ~2,000 years) rising obliquity as an orange node. The green vertical bars were added by me to highlight the correlation of eccentricity maxima with deglaciation maxima.

I have to say that the communication of this paper is not crystal clear, and the data a bit iffy. The T5 deglaciation, for instance, which is relatively huge, comes after a tiny minimum of eccentricity and at a tiny peak of precession, making the scale of the effect hard to understand from the scale of the inputs. T3 shows the opposite, with large inputs yielding a modest, if extended, deglacial cycle. And the obliquity values that are supposed to drive the deglaciation events are quite scattered over their respective cycle. But I take their point that ultimately, it is slight variations in the solar inputs that drive these cycles, and we just need to tease out / model the details to figure out how it works.

There is another question in the field, which is that, prior to 800,000 years ago, glacial cycles were much less dramatic, and had a faster cadence of about 40,000 years. This is clearly more lined up with the obliquity parameter as a driver. So while obliquity is part of the equation in the recent period, involved in triggering deglaciation, it was the primary driver a million years ago, when CO2 levels were perhaps slightly higher and the system didn't need the extra push from eccentricity to cycle milder glaciations. Lastly, why are the recent glacial cycles so pronounced, when the orbital forcing effects are so small and take thousands of years to build up? Glaciation is self-reinforcing, in that higher reflectivity from snow / ice drives down warming. Conversely, retreat of glaciers can release large amounts of built-up methane and other forms of carbon from permafrost, continental shelves, the deep ocean, etc. So there may be some additional cycle, such as a smaller CO2 or methane cycle, that halts glaciation at its farthest extent- that aspect remains a bit unclear.

Overall, the earlier paper of Hays et al. found that summer insolation varies by at most 10% over Earth's various orbital cycles. That is not much, yet it drives glaciation of ice sheets thousands of feet thick, and reversals back to deglaciation that uncovers bare rock all over the far north. It shows that Earth's climate is extremely sensitive to small effects. The last time CO2 was as high as it is now, (~16 mya), Greenland was free of ice. We are heading in that direction very rapidly now, in geological terms. Earth has experienced plenty of catastrophes in the past, even some caused biologically, such as the oxygenation of the atmosphere. But this, what we are doing to the biosphere now, is something quite new.


  • That new world order we were working on...
  • Degradation and corruption at FAA.. what could go wrong?
  • Better air.
  • Congress has the power, should it choose to use it.
  • Ongoing destruction, degradation.
  • Oh, Canada!

Saturday, March 1, 2025

The Train Tracks of Synapsis

Structures that align and tether the chromosomes in meiosis are now understood in some molecular detail.

It has been one of the wonders of biology- the synaptonemal complex that aligns homologous chromosomes during meiosis. While chromosomes regularly line up in the middle of the cell during mitosis, so that they can be evenly divided between the daughter cells, in this process they only have to join at their centromeres, which get dragged to the midline of the cell, and then pulled back apart at cell division. In meiosis, on the other hand, not only do the sister chromosomes that have just replicated stick together at their centromeres, but the homologous chromosomes, which have never bothered about each other since sperm fused with egg, suddenly seek each other out and pair up in an elaborate dance of DNA breakage, alignment, cross-over, and repair. Then in the first division, these cross-over-joined homologs line up at the midline and get pulled apart as their crossovers are repaired. The second division follows, much more like mitosis, where the duplicated sister chromosomes line up at the midline based on their centromere attachments, and then separate into haploid gametes.

Comparison of mitosis vs meiosis, which goes through an extra division and alternate chromosome pairing and separation processes in the first division.

The two divisions are fundamentally different, with the first involving novel chromosome pairings and attachments. The opening act of all this, which I won't go into further, is a sprinkling of ~400 DNA strand breaks induced specifically all over the genome, which sets up a repair process at each site, where the chromosomes (using Rad51) seek out good copies of the damaged DNA- that is, another, matching, DNA molecule. There are specific processes that appear to prevent use of the recently replicated "sister", which would be the most closely identical copy that could be used. Instead, there is a bias to use the "homologous" copy from the other parent. But these homologous chromosomes have just been replicated as well. How is all this organized so that the chromosomes line up neatly and separate cleanly during the first meiotic division? The answer is the synaptonemal complex.

Schematic of the synaptonemal complex joining two homologous chromosomes. The lateral elements are on each side, and the central element runs down the middle. Crossing the gap are the transverse elements, now known to be composed of the SYCP1 protein. At bottom is a diagram, from its atomic structure, of how SYCP1 coils together, and how its ends join to zip up the synaptonemal gap.

This is a train track of connecting proteins between the homologous chromosomes. It is evident that the DNA breaks come first, followed by the search for matching homologs, followed by the radiating and progressive assembly of the synaptonemal complex out from the break repair sites. The components of its major structures have been mostly characterized- the lateral element, where the DNA loops line up; the transverse element, that spans the gap between the homologous chromosomes; and the central element, proteins at the midline that help the transverse elements assemble. A paper from 2023 characterized the transverse element protein, SYCP1, which is a long coil of a protein that dimerizes to make a strong coil, and then dimerizes again head-to-head to create the symmetric bridge over the whole width of the synaptonemal complex, which is about 100 nanometers.

These authors then focus on a series of experiments using key mutations at the dimer-dimer head-to-head interaction area, to demonstrate how this head-to-head zippering works in detail. Mutating just two amino acids in this contact region eliminates the head-to-head interaction, making synapsis impossible. In these cases, the homologous chromosomes (from mice) remain in proximity, especially at crossover sites, but are no longer zippered up and closely aligned.

Spreads of mouse meiotic chromosomes, labeled as shown with antibodies against two synaptonemal proteins. From the top, wild-type SYCP1, then single individual mutations in the end-joining region, and at bottom SYCP1 with two point mutations that eliminate its function entirely. The chromosomes at the bottom are aligned only by virtue of their crossover points, but not by a zippered up synaptonemal complex. Needless to say, mice like this are not fertile.


Thus what was once a hazy mystery in the highest power microscopes has been defined in molecular terms, highlighting once again the power of curiosity, and the essentially moral aim of truth-seeking- to reveal what is true, rather than dictate it. But who cares about all that? Truth, knowledge, science... these values are now not only in question, but under active attack. Who is making America great, and who is diminishing it? Those in our institutions of power who have a voice will hopefully see the consequences and act on them, before our history and values are entirely corrupted.


  • Sociopaths at work.
  • Evidently the model is that we become a version of China/Russia, and make a tripolar world. Not a little Orwellian. And who knows, perhaps we will offer Russia a deal to partition Canada. That is, after we get done partitioning Ukraine.
  • A black day.
  • Oh, wait, the next day was even worse.
  • Shades of Stalin, with a sad sartorial hat-tip to Steve Jobs.
  • Unlawful and vindictive destruction at the NIH, and of biological research in general.
  • And all for love.

Saturday, February 15, 2025

Cloudy, With a Chance of RNA

Long RNAs play structural and functional roles in regulation of chromosome replication and expression.

One of the wonderful properties of the fruit fly as a model system of genetics and molecular biology has been its polytene chromosomes. These are hugely expanded bundles of chromosomes, replicated thousands of times, which have been observed microscopically since the late 1800's. They exist in the larval salivary gland, where huge amounts of gene expression are needed, thus the curious evolutionary solution of expanding the number of templates, not only of the gene needed, but of the entire genome. 

These chromosomes were closely mapped and investigated, almost like runic keys to the biology of the fly, especially in the days before molecular biology. Genetic translocations, loops, and other structural variations could be directly observed. The banding patterns of light, dark, expanded, and compressed regions were charted in excruciating detail, and mapped to genetic correlates and later to gene expression patterns. These chromosomes provided some of the first suggestions of heterochromatin- areas of the genome whose expression is shut down (repressed). They may have genes that are shut off, but they may also be structural components, such as centromeres and telomeres. These latter areas tend to have very repetitive DNA sequences, inherited from old transposons and other junk.

A diagram of polytene chromosomes, bunched up by binding at the centromeres. The banding pattern is reproducible and represents differences in proteins bound to various areas of the genome, and gene activity.

It has become apparent that RNA plays a big role in managing these areas of our chromosomes. The classic case is the XIST RNA, which is a long (17,000 bases) non-coding RNA that forms a scaffold by binding to lots of "heterogeneous" RNA-binding proteins, and most importantly, stays bound near the site of its creation, on the X chromosome. Through a regulatory cascade that is only partly understood, the XIST RNA is turned off on one of the X chromosomes, and turned on on the other (in females), leading the XIST molecule to glue itself to its chromosome of origin, and then progressively coat the rest of that chromosome and turn it off. That is, one entire X is turned into heterochromatin by a process that requires XIST scaffolding all along its length. That results in "dosage compensation" in females, where one X is turned off in all their cells, allowing the dosage (that is, the gene expression) of its expressed genes to approximate that of males, despite the presence of a second X chromosome. Dosage is very important, as shown by Down Syndrome, which originates from an extra copy of chromosome 21, one of the smallest human chromosomes, creating imbalanced gene dosage.

A recent paper described work on "ASAR" RNAs, which similarly arise from highly repetitive areas of human chromosomes, are extremely long (180,000 bases), and control expression and chromosome replication in an allele-specific way on (at least) several non-X chromosomes. These RNAs, again, like XIST, specifically bind a bunch of heterogeneous nuclear RNA-binding proteins, which is presumably central to their function. Indeed, these researchers dissected out the 7,000 base segment of ASAR6 that is densest in protein binding sites, and found that, when transplanted into a new location, this segment has dramatic effects on chromosome condensation and replication, as shown below.

The intact 7,000 base core of ASAR6 was transplanted into chromosome 5, and mitotic chromosomes were spread and stained. The blue is a general DNA stain. The green is a stain for newly synthesized DNA, and the red is a specific probe for the ASAR6 sequence. One can see on the left that this chromosome 5 is replicating more than any other chromosome, and shows delayed condensation. In contrast, the right frame shows a control experiment where an anti-sense version of the ASAR6 7,000 base core was transplanted to chromosome 5. The antisense sequence not only does not have the wild-type function, but also inhibits any molecule that does by tightly binding to it. Here, the chromosome it resides on (arrows) is splendidly condensed, and hardly replicating at all (no green color).


Why RNA? It has become clear over the last two decades that our cells, and particularly our nuclei, are swimming with RNAs. Most of the genome is transcribed in some way or other, despite only a tiny proportion of it coding for anything. 95% of the RNAs that are transcribed never get out of the nucleus. There has been a growing zoo of different kinds of non-coding RNAs functioning in translational control, ribosomal maturation, enhancer function, and here, in chromosome management. While proteins tend to be compact bundles, RNAs can be (as these ASARs are) huge, especially in one dimension, and thus capable of physically scaffolding the kinds of structures that can control large regions of chromosomes.

Chromosomes are sort of cloudy regions in our cells, long a focus of observation and clearly also a focus of countless proteins and now RNAs that bind, wind, disentangle, transcribe, replicate, and congregate around them. What all these RNAs and especially the various heteronuclear proteins actually do remains pretty unclear. But they form a sort of organelle that, while it protects and manages our DNA, remarkably also allows access to it for sequence-specific binding proteins and the many processes that plow through it.

"In addition, recent studies have proposed that abundant nuclear proteins such as HNRNPU nonspecifically interact with ‘RNA debris’ that creates a dynamic nuclear mesh that regulates interphase chromatin structure."


Saturday, February 8, 2025

Sugar is the Enemy

Diabetes, cardiovascular health, and blood glucose monitoring.

Christmas brought a book titled "Outlive: The Science and Art of Longevity". Great, I thought- something light and quick, in the mode of Gwyneth Paltrow or Deepak Chopra. I have never been into self-help or health fad and diet books. Much to my surprise, however, it turned out to be a rather rigorous program of preventative medicine, with a side of critical commentary on our current medical system. A system that puts various thresholds, such as blood sugar and blood pressure, at levels that represent serious disease, and cares little about what led up to them. Among the many recommendations and areas of focus, blood glucose levels stand out, both for their pervasive impact on health and aging, and also because there are new technologies and science that can bring their dangers out of the shadows.

Reading: 

Where do cardiovascular problems, the biggest source of mortality, come from? Largely from metabolic problems in the control of blood sugar. Diabetics know that uncontrolled blood sugar is lethal, in both the acute and the long term. But the rest of us need to realize that the damage done by swings in blood sugar is more insidious and pervasive than commonly appreciated. Both microvascular damage (what is commonly associated with diabetes, in the form of problems with the small vessels of the kidneys, legs, and eyes) and macrovascular damage (atherosclerosis) are due to high and variable blood sugar. The molecular biology of this was impressively unified in 2005 in the paper above, which argues that excess glucose clogs the mitochondrial respiration mechanisms. Their membrane voltage maxes out, reactive forms of oxygen accumulate, and glucose intermediates pile up in the cell. This leads to at least four different and very damaging consequences for the cell, including glucose modification (glycation) of miscellaneous proteins, a reduction of redox damage repair capacity, inflammation, and increased fatty acid export from adipocytes to endothelial (blood vessel) cells. Not good!

Continuously monitored glucose concentrations from three representative subjects, over one day. These exemplify the low, moderate, and severe variability classes, as defined by the Stanford group. Line segments are individually classed as to whether they fall into those same categories. There were 57 subjects in the study, of all ages, none with an existing diagnosis of diabetes. Yet five of them had diabetes by traditional criteria, and fourteen had pre-diabetes by those criteria. By this scheme, 25 had severe variability as their "glucotype", 25 had moderate variability, and only 7 had low variability. As these were otherwise random subjects selected to not have diabetes, this is not great news about our general public health, or the health system.

Additionally, a revolution has occurred in blood glucose monitoring, where anyone can now buy a relatively simple device (called a CGM) that streams continuous blood glucose readings to a cell phone and its associated analytical software. This means that the traditional test, a single fasting blood glucose level, is obsolete. The recent paper from Stanford (and the literature it cites) suggests, indeed, that it is variability in blood glucose that is damaging to our tissues, more so than sustained high levels.
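
As a rough illustration of what a CGM trace makes visible, here is a minimal Python sketch on synthetic data. The Stanford "glucotype" classification is more elaborate (it clusters windowed traces); this just computes a few summary variability statistics, with all numbers invented.

```python
import numpy as np

rng = np.random.default_rng(1)
minutes = np.arange(0, 24 * 60, 5)                      # one day, a reading every 5 minutes

baseline = 95.0                                          # mg/dL
meals = [(8 * 60, 40), (13 * 60, 70), (19 * 60, 55)]     # (time of meal, spike height)
glucose = np.full(minutes.shape, baseline) + rng.normal(0, 4, minutes.size)
for t_meal, height in meals:
    dt = np.maximum(minutes - t_meal, 0.0)               # minutes since this meal
    glucose += height * (dt / 45) * np.exp(1 - dt / 45)  # pulse peaking ~45 min after eating

mean, sd = glucose.mean(), glucose.std()
cv = sd / mean                                           # coefficient of variation
time_above_140 = (glucose > 140).mean() * 100            # percent of the day above 140 mg/dL

print(f"mean {mean:.0f} mg/dL, SD {sd:.0f}, CV {cv:.0%}, time above 140: {time_above_140:.0f}%")
```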

One might ask why, if blood glucose is such a damaging and important mechanism of aging, evolution hasn't developed tighter control over it. Other ions and metabolites are kept within much tighter ranges. Sodium ranges between 135 and 145 mM, and calcium from 8.8 to 10.7 mg/dL. Well, glucose is our food, and our need for glucose internally is highly variable. Our livers are tiny brains that try very hard to predict what we need, based on our circadian rhythms, our stress levels, our activity both current and expected. It is a difficult job, especially now that stress rarely means physical activity, nor does travel, in our automobiles. But mainly, this is a problem of old age, so evolution cares little about it. Getting a bigger spurt of energy for a stressful event when we, in our youth, are in crisis may, in the larger scheme of things, outweigh the slow decay of the cardiovascular system in old age. Not to mention that traditional diets were not very generous at all, certainly not in sugar and refined carbohydrates.


Saturday, February 1, 2025

Proving Evolution the Hard Way

Using genomes and codon ratios to estimate selective pressures was so easy... why is it not working?

The fruits of evolution surround us in abundance, from the tallest tree to the tiniest bacterium, and the viruses of that bacterium. But the process behind it is not immediately evident. It was relatively late in the Enlightenment before Darwin came up with the stroke of insight that explained it all. Yet that mechanism of natural selection remains an abstract concept requiring an analytical mind and due respect for the very inhuman scales of time and space in play. Many people remain dumbfounded, and in denial, while evolutionary biology has forged ahead, powered by new discoveries in geology and molecular biology.

A recent paper (with review) offered a fascinating perspective, both critical and productive, on the study of evolutionary biology. It deals with the opsin protein that hosts the visual pigment 11-cis-retinal, by which we see. The retinal molecule is the same across all opsins, but different opsin proteins can "tune" the light wavelength of greatest sensitivity, creating the various retinal-opsin combinations for all visual needs, across the cone cells and rod cells. This paper considered the rhodopsin version of opsin, which we use in rod cells to perceive dim light. They observed that in fish species, the sensitivity of rhodopsin has been repeatedly adjusted to accommodate light at different depths of the water column. At shallow levels, sunlight is similar to what we see, and rhodopsin is tuned to about 500 nm, while deeper down, when the light is more blue-ish, rhodopsin is tuned towards about 480 nm maximum sensitivity. There are also special super-deep fish who see by their own red-tinged bioluminescence, and their rhodopsins are tuned to 526 nm. 

This "spectrum" of sensitivities of rhodopsin has a variety of useful scientific properties. First, the evolutionary logic is clear enough, matching the fish's vision to its environment. Second, the molecular structure of these opsins is well-understood, the genes are sequenced, and the history can be reconstructed. Third, the opsin properties can be objectively measured, unlike many sequence variations which affect more qualitative, difficult-to-observe, or impossible-to-observe biological properties. The authors used all this to carefully reconstruct exactly which amino acids in these rhodopsins were the important ones that changed between major fish lineages, going back about 500 million years.

The authors' phylogenetic tree of fish and other species they analyzed rhodopsin molecules from. Note how mammals occupy the bottom small branch, indicating how deeply the rest of the tree reaches. The numbers in the nodes indicate the wavelength sensitivity of each (current or imputed) rhodopsin. Many branches carry the authors' inference, from a reconstructed and measured protein molecule, of what precise changes happened, via positive selection, to get that lineage.

An alternative approach to evolutionary inference is a second target of these authors. That is a codon-based method, which evaluates the rate of change of DNA sites under selection versus sites not under selection. In protein coding genes (such as rhodopsin), every amino acid is encoded by a triplet of DNA nucleotides, per the genetic code. With 64 codons for ~20 amino acids, it is a redundant code where many DNA changes do not change the protein sequence. These changes are called "synonymous". If one studies the rate of change of synonymous sites in the DNA (which form a sort of control in the experiment), compared with the rate of change of non-synonymous sites, one can get a sense of evolution at work. Changing the protein sequence is something that is "seen" by natural selection, and especially at important positions in the protein, some of which are "conserved" over billions of years. Such sites are subject to "negative" selection, which is to say rapid elimination due to the deleterious effect of that DNA and protein change.

Mutations in protein coding sequence can be synonymous, (bottom), with no effect, or non-synonymous (middle two cases), changing the resulting protein sequence and having some effect that may be biologically significant, thus visible to natural selection.


This analysis has been developed into a high art, also being harnessed to reveal "positive" selection. In this scenario, if the rate of change of the non-synonymous DNA sites is higher than that of the synonymous sites, or even just higher than one would expect by random chance, one can conclude that these non-synonymous sites were not just not being selected against, but were being selected for, an instance of evolution establishing change for the sake of improvement, instead of avoiding change, as usual.
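
For the flavor of it, here is a deliberately crude Python sketch of the synonymous / non-synonymous comparison (the ratio often written dN/dS) on two toy aligned sequences. Real codon-model methods estimate this per branch or per site with proper substitution models and site counting; this only counts observed differences and normalizes by a rough average of synonymous vs non-synonymous "sites".

```python
# A crude sketch (my own, hypothetical sequences): count observed synonymous
# vs non-synonymous codon differences between two aligned coding sequences and
# form a rough dN/dS-like ratio.
from itertools import product

BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): AAS[i] for i, c in enumerate(product(BASES, repeat=3))}

def crude_dn_ds(seq1, seq2):
    assert len(seq1) == len(seq2) and len(seq1) % 3 == 0
    n_codons = len(seq1) // 3
    syn = nonsyn = 0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if c1 == c2:
            continue
        if CODON[c1] == CODON[c2]:
            syn += 1        # DNA changed, protein did not
        else:
            nonsyn += 1     # DNA change altered the protein
    # Rough site counts: on average only ~25% of single-base changes are
    # synonymous, i.e. ~0.75 synonymous and ~2.25 non-synonymous "sites" per codon.
    pn = nonsyn / (2.25 * n_codons)
    ps = syn / (0.75 * n_codons)
    return pn / ps if ps > 0 else float("inf")

# Toy aligned sequences differing at two codons: one synonymous (TTT -> TTC),
# one non-synonymous (GGT -> GAT, Gly -> Asp).
seq_a = "ATGTTTGCTAGCGGTCTGAAA"
seq_b = "ATGTTCGCTAGCGATCTGAAA"
print("crude dN/dS ~", round(crude_dn_ds(seq_a, seq_b), 2))   # < 1 suggests purifying selection
```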

Now back to the rhodopsin study. These authors found that a very small number of amino acids in this protein, only 15, were the ones that influenced changes to the spectral sensitivity of these protein complexes over evolutionary time. Typically only two or three changes occurred over a shift in sensitivity in a particular lineage, and would have been the ones subject to natural selection, with all the other changes seen in the sequence being unrelated, either neutral or selected for other purposes. It is a tour de force of structural analysis, biochemical measurement, and historical reconstruction to come up with this fully explanatory model of the history of piscine rhodopsins.

But then they went on to compare what they found with what the codon-based methods had said about the matter. And they found that there was no overlap whatsoever. The amino acids identified by the "positive selection" codon based methods were completely different than the ones they had found by spectral analysis and phylogenetic reconstruction over the history of fish rhodopsins. The accompanying review is particularly harsh about the pseudoscientific nature of this codon analysis, rubbishing the entire field. There have been other, less drastic, critiques as well.

But there is method to all this madness. The codon based methods were originally conceived in the analysis of closely related lineages. Specifically, various Drosophila (fly) species that might have diverged over a few million years. On this time scale, positive selection has two effects. One is that a desirable amino acid (or other) variation is selected for, and thus swept to fixation in the population. The other, and corresponding effect, is that all the other variations surrounding this desirable variation (that is, which are nearby on the same chromosome) are likewise swept to fixation (as part of what is called a haplotype). That dramatically reduces the neutral variation in this region of the genome. Indeed, the effect on neutral alleles (over millions of nearby base pairs) is going to vastly overwhelm the effect from the newly established single variant that was the object of positive selection, and this imbalance will be stronger the stronger the positive selection. In the limit case, the entire genomes of those without the new positive trait/allele will be eliminated, leaving no variation at all.
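
A toy Wright-Fisher simulation (my own sketch, not from either paper) shows this hitchhiking effect: a beneficial allele sweeping to fixation drags its linked neutral background along and collapses nearby neutral diversity. No recombination is modeled, which makes the effect maximal.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1000                                    # haploid population size
s = 0.05                                    # selection coefficient

# 20 neutral haplotype "backgrounds"; the beneficial allele arose on background 0
# and is already at 5% frequency here, to keep the toy short.
neutral = rng.integers(0, 20, size=N)
neutral[:50] = 0
beneficial = np.zeros(N, dtype=bool)
beneficial[:50] = True

def heterozygosity(labels):
    freqs = np.bincount(labels) / labels.size
    return 1.0 - np.sum(freqs ** 2)

print("neutral diversity before sweep:", round(heterozygosity(neutral), 2))

for generation in range(2000):
    fitness = np.where(beneficial, 1.0 + s, 1.0)
    parents = rng.choice(N, size=N, p=fitness / fitness.sum())
    neutral, beneficial = neutral[parents], beneficial[parents]
    if beneficial.all():
        break

if beneficial.all():
    print(f"beneficial allele fixed by generation {generation}")
else:
    print("beneficial allele was lost or still segregating")
print("neutral diversity after sweep:", round(heterozygosity(neutral), 2))
```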

Yet, on the longer time scale, over hundreds of millions of years, as was the scope of visual variation in fish, all these effects on the neutral variation level wash out, as mutation and variation processes resume, after the positively selected allele is fixed in the population. So my view of this tempest in an evolutionary teapot is that these recent authors (and whatever other authors were deploying codon analysis against this rhodopsin problem) are barking up the wrong tree, mistaking the proper scope of these analyses. Which, after all, focus on the ratio between synonymous and non-synonymous change in the genome, and thus intrinsically on recent change, not deep change in genomes.


  • That all-American mix of religion, grift, and greed.
  • Christians are now in charge.
  • Mechanisms of control by the IMF and the old economic order.
  • A new pain med, thanks to people who know what they are doing.

Saturday, January 18, 2025

Eking Out a Living on Ammonia

Some archaeal microorganisms have developed sophisticated nano-structures to capture their food: ammonia.

The earth's nitrogen cycle is a bit unheralded, but critical to life nonetheless. Gaseous nitrogen (N2) is all around us, but inert, given its extraordinary chemical stability. It can be broken down by lightning, but little else. It must have been very early in the history of life that the nascent chemical-biological life forms tapped out the geologically available forms of nitrogen, despite being dependent on nitrogen for countless critical aspects of organic chemistry, particularly of nucleic acids, proteins, and nucleotide cofactors. The race was then on to establish a way to capture it from the abundant, if tenaciously bound, dinitrogen of the air. It was thus very early bacteria that developed a way (heavily dependent, unsurprisingly, on catalytic metals like molybdenum and iron) to fix nitrogen, meaning breaking the N≡N triple bond and making ammonia, NH3 (or ammonium, NH4+). From there, the geochemical cycle of nitrogen is all downhill, with organic nitrogen being oxidized to nitric oxide (NO), nitrite (NO2-), and nitrate (NO3-), and finally denitrified back to N2. Microorganisms obtain energy from these steps, some living exclusively on nitrite or nitrate, much as we obtain energy by oxidizing carbon with oxygen to make CO2. 

Nitrosopumilus, as imaged by the authors, showing its corrugated exterior, a layer entirely composed of ammonia-collecting elements (which can be hexameric or pentameric). Insets show an individual hexagonal complex, in face-on and transverse views. Note also the amazing resolution of other molecules, such as the ribosomes floating about.

A recent paper looked at one of these denizens beneath our feet, an archaeal species that lives on ammonia, converting it to nitrite, NO2-. It is a dominant microbe in its field, in the oceans, in soils, and in sewage treatment plants. The irony is that after we spend prodigious amounts of fossil fuel fixing huge amounts of nitrogen for fertilizer- most of which is wasted, and which today exceeds the entire global budget of naturally fixed nitrogen- we are faced with excess and damaging amounts of nitrogen in our effluent, which our friends the microbes then process in complex treatment plants, down the chain of oxidation states and back to gaseous N2. 

Calculated structure of the ammonia-attracting pore. At right are various close-up views including the negatively charged amino acids (D, E) concentrated at the grooves of the structure, and the pores where ammonium can transit to the cell surface. 

The Nitrosopumilus genus is so successful because it has a remarkable way to capture ammonia from the environment, a way that is roughly two hundred times more efficient than that of its bacterial competitors. Its surface is covered by a curious array of hexagons, which turn out to be ammonia capture sites. In effect, its skin is a (relatively) enormous chemical antenna for ammonia, which is naturally at low concentration in seawater. The authors carried out a structural study, using the new methods of cryo-electron microscopy, to show that these hexagons have intensely negatively charged grooves and pores, to which positively charged ammonium ions are attracted. Within this outer shell, but still outside the cell membrane, enzymes at the cell surface transform the captured ammonium into other species such as hydroxylamine; this keeps the ammonium concentration gradient pointed toward the cell surface, and the products are then pumped inside.
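
For a feel of why a negatively charged surface works as an ammonium antenna, a rough Boltzmann estimate is instructive. The sketch below is not from the paper: the surface potentials and the bulk concentration are hypothetical, and the calculation ignores ionic screening, geometry, and the enzymatic sink that pulls the gradient further.

```python
import math

# Boltzmann enrichment of a +1 ion near a negatively charged surface:
# c_surface / c_bulk = exp(-z * e * psi / (k_B * T))
k_B = 1.380649e-23   # J/K
e   = 1.602177e-19   # C
T   = 288.0          # K, roughly cold seawater
z   = +1             # charge of NH4+

c_bulk_nM = 100.0    # hypothetical bulk ammonium concentration (nM scale)

for psi_mV in (-25, -50, -100):     # hypothetical local surface potentials
    enrichment = math.exp(-z * e * (psi_mV * 1e-3) / (k_B * T))
    print(f"psi = {psi_mV:4d} mV  ->  local enrichment ~ {enrichment:6.1f}x "
          f"({c_bulk_nM * enrichment:8.1f} nM near the surface)")
# Even modest negative potentials concentrate NH4+ severalfold near the
# grooves, before the surface enzymes pull the equilibrium further by
# consuming what arrives.
```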

Cartoon model of the ammonium attraction and transit mechanisms of this cell wall. 

It is a clever nanomaterial and micro-energetic system for concentrating a specific chemical- one that might inspire human applications for other chemicals we need, chemicals whose isolation demands excessive energy, or whose geologic abundance may not last forever.


Saturday, January 4, 2025

Drilling Into the Transcriptional Core

Machine learning helps to tease out the patterns of DNA at promoters that initiate transcription.

One of the holy grails of molecular biology is the study of transcriptional initiation. While there are many levels of regulation in cells, the initiation of transcription is perhaps the most powerful of them all. An organism's ability to keep the transcription of most genes off, turn on the genes needed to build particular tissues, and regulate still others in response to urgent needs is the very soul of how multicellular organisms operate. The decision to transcribe a gene into its RNA message (mRNA) represents a large investment, as that transcript can last hours or more and during that time be translated into a great many protein copies. Additionally, this process identifies where, in the otherwise featureless landscape of genomic DNA, genes are located- itself a significant problem, and one that took molecular biologists a long time to figure out.

Control over transcription is generally divided into two conceptual and physical regions- enhancers and promoters. Enhancers are typically far from the start site of transcription, and are modules of DNA sequence that bind innumerable regulatory proteins, which collectively tune initiation in both fine and rough ways. Promoters, in contrast, are at the core and straddle the start site of transcription (TSS, for short). They feature a much more limited set of motifs in the DNA sequence. The promoter is the site where the proteins bound at the various enhancers converge and encourage the formation of a "preinitiation complex", which includes the RNA polymerase that actually carries out transcription, plus a lot of ancillary proteins. The RNA polymerase cannot initiate, or even find a promoter, on its own; it requires direction by the regulatory proteins and their promoter targets before finding its proper landing place. So the study of promoter initiation and regulation has a very long history, as a critical part of the central flow of information in molecular biology, from DNA to protein.

A schematic of a promoter, where initiation of transcription of Gene A happens, with the start site (+1) right at the boundary of the orange and green colors. At this location, the RNA polymerase will melt the DNA strands and start synthesizing an RNA strand using the (bottom) template strand of the DNA. Regulatory proteins bound to enhancers far away in the genomic DNA loop through space to activate proteins bound at the core promoter, load the polymerase, and initiate this process.

A recent paper provided a novel analysis of promoter sequences, using machine learning to derive a relatively comprehensive account of the relevant sequences. Heretofore, many promoters had been dissected in detail and several key features found. But many human promoters had none of them, showing that our knowledge was incomplete. This new approach started strictly from empirical data- the genome sequence, plus large experimental compilations of nascent RNAs as they are expressed in various cells, mapped to the precise base from which they initiated- that is, their respective TSS. These were all loaded into a machine learning model that was supplemented with explanatory capabilities. That is, it was not just a black box, but gave interpretable results useful to science, in the form of small sequence signatures that it found to be needed to make particular promoters work. These signatures presumably bind particular proteins that are the operational engines of regulatory integration and promoter function.
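
The general flavor of such an interpretable model can be imitated with a much simpler stand-in: a logistic regression over k-mer counts, in which the learned weights point directly at sequence signatures. This is emphatically not the architecture the authors used- just a minimal, self-contained sketch of the idea that a model trained on sequence-versus-TSS data can hand back human-readable motifs. The "promoter" sequences below are synthetic, with a TATAAA-like signal planted in them.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
IDX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_counts(seq):
    """Count overlapping k-mers of a sequence into a fixed-length vector."""
    v = np.zeros(len(KMERS))
    for i in range(len(seq) - K + 1):
        v[IDX[seq[i:i+K]]] += 1
    return v

def random_seq(n):
    return "".join(rng.choice(list("ACGT"), size=n))

def make_promoter(n=80):
    """Toy 'promoter': random sequence with a planted TATAAA-like signal."""
    s = random_seq(n)
    pos = rng.integers(0, n - 6)
    return s[:pos] + "TATAAA" + s[pos+6:]

# Synthetic training data: promoters (label 1) vs background (label 0).
X = np.array([kmer_counts(make_promoter()) for _ in range(500)] +
             [kmer_counts(random_seq(80)) for _ in range(500)])
y = np.array([1]*500 + [0]*500)

# Plain logistic regression by gradient descent (numpy only).
w = np.zeros(X.shape[1]); b = 0.0
for _ in range(5000):
    p = 1 / (1 + np.exp(-(X @ w + b)))
    w -= 0.02 * (X.T @ (p - y) / len(y))
    b -= 0.02 * np.mean(p - y)

# The most positively weighted k-mers are the model's "discovered" signatures.
top = np.argsort(w)[::-1][:5]
print("top k-mers:", [KMERS[i] for i in top])
```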

The TATA motif, found about 30 base pairs upstream of the transcription start site in many promoters. This is a motif view, where the statistical prevalence of the base is reflected in the height of the letter (top, in color) and its converse is reflected below in gray. Regular patterns like this found in DNA usually mean that some protein typically binds to this site, in this case TFIID.


For example, the grand-daddy of them all is the TATA box, which dates back to the bacteria and archaea and was easily dug up by this machine learning system. The composition of the TATA box is shown above in graphical form, where the probability of occurrence (of a base in the DNA) is reflected in the height of the base over the axis line. A few G/C bases surround a central motif of T/A, and the TSS is typically 30 base pairs downstream. What happens here is that one of the central proteins of the RNA polymerase positioning complex, TFIID, binds strongly to this sequence and bends the DNA by ninety degrees, forming a launchpad of sorts for the polymerase, which later finds and opens the DNA at the transcription start site. TFIID and the TATA box are well known, so it is certainly reassuring that this algorithmic method recovered them. TATA boxes are common at regulated promoters, being highly receptive to regulation by enhancer protein complexes. This is in contrast to more uniformly expressed (housekeeping) genes, which typically use other promoter DNA motifs and, incidentally, tend to have much less precise TSS positions; they might have start sites that range over a hundred base pairs, more or less stochastically.
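
Computationally, a motif like this is used as a position weight matrix: each column gives base probabilities, and a log-odds score against the background composition is slid along the sequence. The matrix below is a rough, hand-written TATA-like approximation for illustration, not the parameters learned in the paper.

```python
import math

# Rough, hand-written TATA-like position probability matrix (8 columns).
# Rows: probability of each base at each position. Illustrative only.
PPM = {
    "A": [0.05, 0.85, 0.05, 0.85, 0.60, 0.85, 0.55, 0.45],
    "C": [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
    "G": [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.45],
    "T": [0.85, 0.05, 0.85, 0.05, 0.30, 0.05, 0.35, 0.05],
}
BG = 0.25   # uniform background base frequency
WIDTH = len(PPM["A"])

def log_odds(window):
    """Log-odds score of one window against the uniform background."""
    return sum(math.log2(PPM[base][i] / BG) for i, base in enumerate(window))

def scan(seq):
    """Score every window; the best-scoring position is the putative TATA box."""
    scores = [(log_odds(seq[i:i+WIDTH]), i) for i in range(len(seq) - WIDTH + 1)]
    return max(scores)

# Hypothetical promoter fragment with a TATA-like element embedded in it.
seq = "GGCGCGCTATAAAAGGCCGCGTGCGC"
best_score, best_pos = scan(seq)
print(f"best match at position {best_pos}, score {best_score:.1f} bits,"
      f" sequence {seq[best_pos:best_pos+WIDTH]}")
```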

The main advance of this paper was to find more DNA sites, and new types of sites, which collectively account for the positioning and activation of all promoters in humans. Instead of the previously known three or four factors, they found nine major DNA sequence elements, plus a smattering of weaker patterns, which they combine into a predictive model that matches empirical data. Most of these DNA sequences were previously known, but not as parts of core promoters. For example, one is called YY1, because it binds the YY1 protein, which has long been appreciated as a transcriptional repressor acting from enhancer positions. But now it turns out to also be a core promoter participant, identifying and turning on a class of promoters that, as with most of the newly found sequence elements, tend to drive genes that are not heavily regulated, but rather universally expressed, with delocalized start sites. 

Motifs and initiator elements found by the current work. Each motif, presumably matched by a protein that binds it, gets its own graph relating the motif location (at 0 on the X axis) to the start sites of transcription it directs, which for TATA are about 30 base pairs downstream. Most of the newly discovered motifs are bi-directional, directing start sites and transcription both upstream and downstream. This wastes a lot of effort, as the upstream transcripts are typically quickly discarded. The NFY motif has an interesting 10.5 bp periodicity in its directed start sites, which suggests that the protein complex binding this site hugs one side of the DNA quite closely, setting up start sites on that face of the helix.

Secondly, these authors find that most of the new sequences they identify have bidirectional effects. That is, they set up promoters that fire in both directions, typically about forty base pairs both downstream and upstream of their binding site. This explains a great deal of transcription data derived from new sequencing technologies, which show that many promoters fire in both directions, even though the "upstream", or non-gene-side, transcript tends to be short-lived.


Overview of the new results, summarized by type of DNA sequence pattern. The total machine learning prediction was composed of predictions for larger motifs, which were the dominant pattern, plus a small contribution from "initiators", which comprise a few patterns right at the start site, plus a large but diffuse contribution from tiny trinucleotide patterns, such as the CG pattern known to mark active genes and to carry regulatory DNA methylation marks.


A third finding was the set of trinucleotide motifs that serve as a sort of fudge factor for the machine learning model, filling in details to make the match to empirical data come out better. The motif length was set more or less arbitrarily, but these small patterns play a big part in the model fit. They note that one common example is the CG pattern, which is among the stronger of these small motifs. This pattern is known as CpG, and is the target of chemical methylation of DNA by regulatory enzymes, which helps to mark and regulate genes. The current work suggests that there may be more systems of this kind yet to be discovered, which play a modulating role in gene/promoter selection and activation.
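
One standard way such a dinucleotide signal is quantified is the CpG observed-to-expected ratio, which compares how often CG actually occurs in a window with how often it would occur given the window's C and G content. A minimal sketch, with invented example sequences; the classic island thresholds cited in the comment are the old Gardiner-Garden and Frommer heuristics, not anything from the current paper.

```python
def cpg_obs_exp(seq):
    """CpG observed/expected ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    n_cg = seq.count("CG")
    if n_c == 0 or n_g == 0:
        return 0.0
    return (n_cg * len(seq)) / (n_c * n_g)

def gc_content(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

# Hypothetical windows: a CpG-island-like stretch vs. a CpG-depleted one.
island_like = "CGCGGCGCTCGCGAGCGCGGCCGCGTACGCGGCGCGAT" * 6
depleted    = "CATGGATCAAGTTTGACCATTGGAACTTGATCCAAGGT" * 6

for name, seq in [("island-like", island_like), ("depleted", depleted)]:
    print(f"{name:12s}  GC {gc_content(seq):.2f}  CpG obs/exp {cpg_obs_exp(seq):.2f}")
# The classic heuristic calls a CpG island at >200 bp, GC > 0.5, obs/exp > 0.6;
# bulk genomic DNA typically scores far below 1, because methylated CpGs
# tend to mutate away over evolutionary time.
```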

The accuracy of this new learning and modeling system exemplifies some of the strengths of AI, of which machine learning is a sub-discipline. When a lot of data is available and the problem is well defined and on the verge of solution (like the protein folding problem), AI, or these machine learning methods, can push the field over the edge to a solution. AI/ML are powerful ways to explore a defined solution space for optimal results. They are not "intelligent" in the normal sense of the word (at least not yet), which would imply having generalized world models that would allow them to range over large areas of knowledge, solve undefined problems, and exercise common sense.