Showing posts with label developmental biology. Show all posts

Saturday, January 4, 2025

Drilling Into the Transcriptional Core

Machine learning helps to tease out the patterns of DNA at promoters that initiate transcription.

One of the holy grails of molecular biology is the study of transcriptional initiation. While there are many levels of regulation in cells, the initiation of transcription is perhaps, of all of them, the most powerful. An organism's ability to keep the transcription of most genes off, and turn on genes that are needed to build particular tissues, and regulate others in response to other urgent needs, is the very soul of how multicellular organisms operate. The decision to transcribe a gene into its RNA message (mRNA) represents a large investment, as that transcript can last hours or more and during that time be translated into a great many protein copies. Additionally, this process identifies where, in the otherwise featureless landscape of genomic DNA, genes are located, which is another significant process, one that it took molecular biologists a long time to figure out.

Control over transcription is generally divided into two conceptual and physical regions- enhancers and promoters. Enhancers are typically far from the start site of transcription, and are modules of DNA sequences that bind innumerable regulatory proteins which collectively tune, in fine and rough ways, initiation. Promoters, in contrast, are at the core and straddle the start site of transcription (TSS, for short). They feature a much more limited set of motifs in the DNA sequence. The promoter is the site where the proteins bound to the various enhancers converge and encourage the formation of a "preinitiation complex", which includes the RNA polymerase that actually carries out transcription, plus a lot of ancillary proteins. The RNA polymerase cannot initiate on its own or find a promoter on its own. It requires direction by the regulatory proteins and their promoter targets before finding its proper landing place. So the study of promoter initiation and regulation has a very long history, as a critical part of the central flow of information in molecular biology, from DNA to protein.

A schematic of a promoter, where transcription of Gene A initiates, with the start site (+1) right at the boundary of the orange and green colors. At this location, the RNA polymerase will melt the DNA strands, and start synthesizing an RNA strand using the (bottom) template strand of the DNA. Regulatory proteins bound to enhancers far away in the genomic DNA bend through space to activate proteins bound at the core promoter to load the polymerase and initiate this process.

A recent paper provided a novel analysis of promoter sequences, using machine learning to derive a relatively comprehensive account of the relevant sequences. Heretofore, many promoters had been dissected in detail and several key features found. But many human promoters had none of them, showing that our knowledge was incomplete. This new approach started strictly from empirical data- the genome sequence, plus large experimental compilations of nascent RNAs, as they are expressed in various cells, and mapped to the precise base where they initiated from- that is, their respective TSS. These were all loaded into a machine learning model that was supplemented with explanatory capabilities. That is, it was not just a black box, but gave interpretable results useful to science, in the form of small sequence signatures that it found are needed to make particular promoters work. These signatures presumably bind particular proteins that are the operational engines of regulatory integration and promoter function.

The TATA motif, found about 30 base pairs upstream of the transcription start site in many promoters. This is a motif view, where the statistical prevalence of the base is reflected in the height of the letter (top, in color) and its converse is reflected below in gray. Regular patterns like this found in DNA usually mean that some protein typically binds to this site, in this case TFIID.


For example, the grand-daddy of them all is the TATA box, which dates back to bacteria / archaea and was easily dug up by this machine learning system. The composition of the TATA box is shown above in a graphical form, where the probability of occurrence (of a base in the DNA) is reflected in the height of the base over the axis line. A few G/C bases surround a central motif of T/A, and the TSS is typically 30 base pairs downstream. What happens here is that one of the central proteins of the RNA polymerase positioning complex, TFIID, binds strongly to this sequence, and bends the DNA here by ninety degrees, forming a launchpad of sorts for the polymerase, which later finds and opens DNA at the transcription start site. TFIID and the TATA box are well known, so it certainly is reassuring that this algorithmic method recovered it. TATA boxes are common at regulated promoters, being highly receptive to regulation by enhancer protein complexes. This is in contrast to more uniformly expressed (housekeeping) genes, which typically use other promoter DNA motifs, and incidentally tend to have much less precise TSS positions. They might have start sites that range over a hundred base pairs, more or less stochastically.
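For readers who like to see the arithmetic, here is a minimal Python sketch of how a motif view like the one above can be computed from aligned binding sites. It assumes the common sequence-logo convention of scaling each letter by base frequency times the information content (in bits) of its position; the paper's own figure may use a different scaling. The aligned sites and function names are invented for illustration, not taken from the paper.

```python
import math

def position_frequencies(sites):
    """Per-position base frequencies from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    freqs = []
    for i in range(length):
        column = [s[i] for s in sites]
        freqs.append({b: column.count(b) / len(column) for b in "ACGT"})
    return freqs

def information_content(freq):
    """Bits of information at one position: 2 minus the Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in freq.values() if p > 0)
    return 2.0 - entropy

def letter_heights(freq):
    """Logo letter heights: base frequency scaled by the position's bits."""
    bits = information_content(freq)
    return {base: p * bits for base, p in freq.items()}

# Hypothetical aligned sites, made up for this example only.
toy_sites = ["TATAAA", "TATATA", "TATAAA", "TATAAG"]

for pos, freq in enumerate(position_frequencies(toy_sites)):
    heights = letter_heights(freq)
    tallest = max(heights, key=heights.get)
    print(pos, tallest, round(information_content(freq), 2))
```

A position where every site agrees carries the full two bits, while a position where the bases are evenly mixed carries none, which is why the conserved T/A core towers over its flanks in such plots.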

The main advance of this paper was to find more DNA sites, and new types of sites, which collectively account for the positioning and activation of all promoters in humans. Instead of the previously known three or four factors, they found nine major DNA sequences, and a smattering of weaker patterns, which they combine into a predictive model that matches empirical data. Most of these DNA sequences were previously known, but not as part of core promoters. For example, one is called YY1, because it binds the YY1 protein, which has long been appreciated to be a transcriptional repressor, from enhancer positions. But now it turns out to also be a core promoter participant, identifying and turning on a class of promoters that, as for most of the new-found sequence elements, tend to operate genes that are not heavily regulated, but rather universally expressed and with delocalized start sites.

Motifs and initiator elements found by the current work. Each motif, presumably matched by a protein that binds it, gets its own graph of relation of the motif location (at 0 on the X axis) vs the start site of transcription that it directs, which for TATA is about 30 base pairs downstream. Most of the newly discovered motifs are bi-directional, directing start sites and transcription both upstream and downstream. This wastes a lot of effort, as the upstream transcripts are typically quickly discarded. The NFY motif has an interesting pattern of 10.5 bp periodicity of its directed start sites, which suggests that the protein complex that binds this site hugs one side of the DNA quite closely, setting up start sites on that side of the helix.
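The reasoning behind the periodicity remark can be checked with a little arithmetic. The sketch below assumes a helical repeat of 10.5 bp per turn (the figure quoted in the caption) and treats two positions as being on the same face of the helix when their rotational phases are within an arbitrary tolerance; the function names and the 40 degree tolerance are illustrative choices, not anything from the paper.

```python
HELICAL_REPEAT_BP = 10.5  # base pairs per helical turn, as quoted in the caption

def helical_phase_degrees(offset_bp, repeat=HELICAL_REPEAT_BP):
    """Rotational position around the helix axis, in degrees, for an offset in bp."""
    return (offset_bp % repeat) / repeat * 360.0

def same_face(offset_a, offset_b, tolerance_deg=40.0):
    """True if two offsets sit on roughly the same side of the helix.

    The tolerance is an arbitrary illustrative cutoff.
    """
    diff = abs(helical_phase_degrees(offset_a) - helical_phase_degrees(offset_b))
    diff = min(diff, 360.0 - diff)  # shortest way around the circle
    return diff <= tolerance_deg

# Offsets one, two, or three helical turns from the motif share its face;
# an offset of about half a turn lands on the opposite side.
for offset in (10.5, 21, 31.5, 5):
    print(offset, round(helical_phase_degrees(offset), 1), same_face(0, offset))
```

Each base pair rotates the helix by about 34 degrees (360 / 10.5), so start sites spaced at multiples of 10.5 bp all face the same way, which is what one would expect if the bound protein complex only reaches one side of the DNA.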

Secondly, these authors find that most of the new sequences they identify have bidirectional effects. That is, they set up promoters to fire in both directions, both typically about forty base pairs downstream and also upstream from their binding site. This explains a great deal of transcription data derived from new sequencing technologies, which shows that many promoters fire in both directions, even though the "upstream" or non-gene side transcript tends to be short-lived.


Overview of the new results, summarized by type of DNA sequence pattern. The total machine learning prediction was composed of predictions for larger motifs, which were the dominant pattern, plus a small contribution from "initiators", which comprise a few patterns right at the start site, plus a large but diffuse contribution from tiny trinucleotide patterns, such as the CG pattern known to mark active genes and carry activating DNA methylation marks.


A third finding was the set of trinucleotide motifs that serve as the sort of fudge factor for their machine learning model, filling in details to make the match to empirical data come out better. The length was set more or less arbitrarily, but they play a big part in the model fit. They note that one common example is the CG pattern, which is one of the stronger trinucleotide motifs. This pattern is known as CpG, and is the target of chemical methylation of DNA by regulatory enzymes, which helps to mark and regulate genes. The current work suggests that there may be more systems of this kind yet to be discovered, which play a modulating role in gene/promoter selection and activation.
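To make the idea of trinucleotide features concrete, here is a minimal sketch that tallies every overlapping three-base window in a stretch of DNA and picks out those containing the CG dinucleotide. This is only a counting illustration under my own assumptions; it is not the paper's model, which learns a weighted contribution for such patterns, and the example sequence is made up.

```python
from collections import Counter

def trinucleotide_counts(seq):
    """Count every overlapping 3-base window in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(len(seq) - 2))

def cpg_containing(counts):
    """Keep only the trinucleotides that contain the CG dinucleotide."""
    return {tri: n for tri, n in counts.items() if "CG" in tri}

# Hypothetical sequence, invented for illustration.
example = "ACGCGTTT"
counts = trinucleotide_counts(example)
print(dict(counts))
print(cpg_containing(counts))
```

In a real analysis, counts like these would be taken in windows positioned relative to the start site and weighted by the model, rather than simply tallied.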

The accuracy of this new learning and modeling system exemplifies some of the strengths of AI, of which machine learning is a sub-discipline. When there is a lot of data available, and a problem that is well defined and on the verge of solution (like the protein folding problem), then AI, or these machine learning methods, can push the field over the edge to a solution. AI / ML are powerful ways to explore a defined solution space for optimal results. They are not "intelligent" in the normal sense of the word (at least not yet), which would imply having generalized world models that would allow them to range over large areas of knowledge, solve undefined problems, and exercise common sense.


Saturday, September 28, 2024

Dangerous Memories

Some memory formation involves extracellular structures, DNA damage, and immune component activation / inflammation.

The physical nature of memories in the brain is under intensive scrutiny. The leading general theory is that of positive reinforcement, where neurons that are co-activated strengthen their connections, enhancing their ability to co-fire and thus to express the same pattern again in the future. The nature of these connections has been somewhat nebulous, assumed to just be the size and stability of their synaptic touch-points. But it turns out that there is a great deal more going on.

A recent paper started with a fishing expedition, looking at changes in gene expression in neurons at various time points after mice were subjected to a fear learning regimen. They took this out to much longer time points (up to a month) than had been contemplated previously. At short times, a bunch of well-known signals and growth-oriented gene expression happened. At the longest time points, organization of a structure called the perineuronal net (PNN) was read out of the gene expression signals. This is an extracellular matrix sheath that appears to stabilize neuronal connections and play a role in long-term memory and learning.

But the real shocker came at the intermediate time point of about four days. Here, there was overexpression of TLR9, which is an immune system detector of broken / bacterial DNA, and inducer in turn of inflammatory responses. This led the authors down a long rabbit hole of investigating what kind of DNA fragmentation is activating this signal, how common this is, how influential it is for learning, and what the downstream pathways are. Apparently, neuronal excitation, particularly over-excitation that might be experienced under intense fear conditions, isn't just stressful in a semiotic sense, but is highly stressful to the participating neurons. There are signs of mitochondrial over-activity and oxidative stress, which lead to DNA breakage in the nucleus, and even nuclear perforation. It is a shocking situation for cells that need to survive for the lifetime of the animal. Granted, these are not germ cells that prioritize genomic stability above all else, but getting your DNA broken just for the purpose of signaling a stress response that feeds into memory formation? That is weird.

Some neuronal cell bodies after fear learning. The red dye is against a marker of DNA repair proteins, which form tight dots around broken DNA. The blue is a general DNA stain, and the green is against a component of the nuclear envelope, showing here that nuclear envelopes have broken in many of these cells.

The researchers found that there are classic signs of DNA breakage, which are what is turning on the TLR9 protein, such as seeing concentrated double-strand DNA repair complexes. All this stress also turned on proteases called caspases, though not the cell suicide program that these caspases typically initiate. Many of the DNA break and repair complexes were, thanks to nuclear perforation, located diffusely at the centrosome, not in the nucleus. TLR9 turns on an inflammatory response via NFKB / RELA. This is clearly a huge event for these cells, not sending them into suicide, but all the alarms short of that are going off.

The interesting part was when the researchers asked whether, by deleting the TLR9 or related genes in the pathway, they could affect learning. Yes, indeed- the fear memory was dependent on the expression of this gene in neurons, and on this cell stress pathway, which appears to be the precondition of setting up the perineuronal net structures and overall stabilization. Additionally, the DNA damage still happened, but was not properly recognized and repaired in the absence of TLR9, creating an even more dangerous situation for the affected neurons- of genomic instability amidst unrepaired DNA.

When TLR9 is knocked out, DNA repair is cancelled. At bottom are wild-type cells, and at top are mouse neurons after fear learning that have had the gene TLR9 deleted. The red dye is against DNA repair proteins, as is the blue dye in the right-most frames. The top row is devoid of these repair activities.

This paper and its antecedent literature are making the case that memory formation (at least under these somewhat traumatic conditions- whether this is true for all kinds of memory formation remains to be seen) has commandeered ancient, diverse, and quite dangerous forms of cell stress response. It is no picnic in the park with madeleines. It is an all-hands-on-deck disaster scene that puts the cell into a permanently altered trajectory, and carries a variety of long-term risks, such as cancer formation from all the DNA breakage and end-joining repair, which is not very accurate. They mention in passing that some drugs have been recently developed against TLR9, which are being used to dampen inflammatory activities in the brain. But this new work indicates that such drugs are likely double-edged swords, that could impair both learning and the long-term health of treated neurons and brains.

Saturday, July 27, 2024

Putting Body Parts in Their Places

How HOX genes run development, on butterfly wings.

I have written about the HOX complex of genes several times, because they constitute a grail of developmental genetics- genes that specify the identity of body parts. They occupy the middle of a body plan cascade of gene regulation, downstream from broader specifiers for anterior/posterior orientation, regional and segment specification, and in turn upstream of many more genes that specify the details of organ and tissue construction. Each of the HOX genes encodes a transcriptional regulator, and the name of one says it all- antennapedia. In fruit flies, where all this was first discovered, loss of antennapedia converts some legs into antennae, and extra expression of antennapedia converts antennae on the head into legs.

The HOX complex (named for the homeobox DNA binding motif of the proteins they encode) is linear, arranged from head-affecting genes (labial, proboscipedia) to abdomen-affecting genes (abdominal A, abdominal B; evidently the geneticist's flair for naming ran out by this point). This arrangement is almost universally conserved, and turns out to reflect molecular mechanisms operating on the complex. That is, it "opens" in a progressive manner during development, on the chromosome. Repression of chromatin is a very common and sturdy way to turn genes off, and tends to affect nearby genes, in a spreading effect. So it turns out to be easy, in some sense, to set up the HOX complex to have this chromatin repression lifted in a segmental fashion, by upstream regulators, whereby only the head sections are allowed to be expressed in head tissues, but all the genes are allowed to be expressed in the final abdominal segment. That is why the unexpected expression of antennapedia, which is the fifth of eight HOX genes, in the head, leads to a thoracic tissue (legs) forming on the head.

A recent paper delved a little more deeply into this story, using butterflies, which have a normal linearly conserved HOX cluster and are easy to diagnose for certain body part transformations (called homeotic) on their beautiful wings. The main thing these researchers were interested in is the genetic elements that separate one part of the HOX cluster from other parts. These are boundary or "insulator" elements that separate topologically associated domains (called TADs). Each HOX gene is surrounded by various regulatory enhancer and inhibitor sites in the DNA that are bound by regulatory proteins. And it is imperative that these sites be directed only to the intended gene, not neighboring genes. That is why such TADs exist, to isolate the regulation of genes from others nearby. There are now a variety of methods to map such TADs, by looking at where chromatin (histones) is open or closed, or where DNA can be cut by enzymes in the native chromatin, or where crosslinks can be formed between DNA molecules, and others.

The question posed here was whether a boundary element, if deleted, would cause a homeotic transformation in the butterflies they were studying. They found, unfortunately, that it was impossible to generate whole animals with the deletions and other mutations they were engineering, so they settled for injecting the CRISPR mutational molecules into larval tissues and watching how they affected the adults in mosaic form, with some mutant tissues, some wild-type. The boundary they focused on was between antennapedia (Antp) and ultrabithorax (Ubx), and the tissues were the forewings, where Ubx is normally off, and hindwings, where Ubx is normally on. Using methods to look at the open state of chromatin, they found that the Ubx gene is dramatically opened in hindwings, relative to forewings. Nevertheless, the boundary remains in place throughout, showing that there is a pretty strong isolation from Antp to Ubx, though they are next door and a couple hundred thousand base pairs apart. Which in genomic terms is not terribly far, while it leaves plenty of space for enhancers, promoters, introns, boundary elements, and other regulatory paraphernalia.

Analysis of the site-to-site chromosomal closeness and accessibility across the HOX locus of the butterfly Junonia coenia. The genetic loci are noted at the bottom, and the site-to-site hit rates are noted in the top panels, with blue for low rates of contact, and orange/red for high rates of contact. At top is the forewing, and at bottom is the hindwing, where Ubx is expressed, thus the high open-ness and intra-site contact within its topological domain (TAD). Yet the boundary between Ubx and Antp to its left (dotted lines at bottom) remains very strong in both tissues. In green is a measure of transcription from this DNA, in differential terms hindwing minus forewing, showing the strong repression of Ubx in the forewing, top panel.

The researchers naturally wanted to mutate the boundary element (Antp-Ubx_BE), which they deduced lay at a set of binding sites (featuring CCCTC) for the protein CTCF, a well-known insulating boundary regulator. Note, interestingly, that in the image above, the last exon (blue) of Ubx (transcription goes right to left) lies across the boundary element, and in the topological domain of the Antp gene. This means that while all the regulatory apparatus of Ubx is located in its own domain, on the right side, it is OK for transcription to leak across- that has no regulatory implications.
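As a rough illustration of how one might look for such sites, the sketch below scans a DNA sequence on both strands for the five-base CCCTC core mentioned above. This is an assumption-laden toy: an exact five-base match is far too permissive for calling real CTCF sites, the paper does not describe its method this way, and the function names and example sequence are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_core_sites(seq, core="CCCTC"):
    """List (position, strand) for exact matches to the core on either strand."""
    seq = seq.upper()
    rc_core = reverse_complement(core)
    hits = []
    for i in range(len(seq) - len(core) + 1):
        window = seq[i:i + len(core)]
        if window == core:
            hits.append((i, "+"))
        elif window == rc_core:
            hits.append((i, "-"))
    return hits

# Hypothetical sequence, invented for illustration.
print(find_core_sites("AACCCTCTTGAGGGAA"))
```

Checking both strands matters because the orientation of a binding site is part of what a scan like this reports; here the minus-strand hit is simply the reverse complement, GAGGG, read on the top strand.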

Effects of removing the boundary element between Ubx and Antp. Detailed description is in the text below. 

Removal of this boundary element, using CRISPR technology in portions of the larval tissues, had the expected partial effects on the larval, and later adult, wings of this butterfly. First, note that in panel D insets, the wild type larval forewing shows no expression of Ubx (green), while the wild type hindwing shows widespread expression. This is the core role of the HOX locus and the Ubx gene- locate its expression in the correct body parts to then induce the correct tissues to develop. The larval wing tissue of the mosaic mutant, also in D, shows, in the forewing, extensive patchy expression of Ubx. This is then reflected in the adult (different animals) in the upper panels, in the mangled eyespot of the fully formed wing (center panel, compared to wild-type forewing and hindwing to each side). It is a small effect, but then these are small mutations, done in only a fraction of the larval cells, as well.

So here we are, getting into the nuts and bolts of how body parts are positioned and encoded. There are large regions around these genes devoted to regulatory affairs, including the management of chromatin repression, the insulation of one region from another, the enhancer and repressor sites that integrate myriad upstream signals (i.e. other DNA binding proteins) to come up with the detailed pattern of expression of these HOX genes. Which in turn control hundreds of other genes to execute the genetic program. This program can hardly be thought of as a blueprint, nor a "design" in anyone's eye, divine or otherwise. It resembles much more a vast pile of computer code that has accreted over time with occasional additions of subroutines, hacks, duplicated bits, and accidental losses, adding up to a method for making a body that is robust in some respects to the slings and arrows of fortune, but naturally not to mutations in its own code.