Sunday, July 23, 2023

Many Ways There are to Read a Genome

New methods to unravel the code of transcriptional regulators.

When we deciphered the human genome, we came up with three billion letters of its linear code- nice and tidy. But that is not how it is read inside our cells. Sure, it is replicated linearly, but the DNA polymerases don't care about the sequence- they are not "reading" the book, they are merely copying machines trying to get it to the next generation with as few errors as possible. The book is read in an entirely different way, by a herd of proteins that recognize specific sequences of the DNA- the transcription regulators (also commonly called transcription factors [TF], in the classic scientific sense of some "factor" that one is looking for). These regulators- and there are, by one recent estimate, 1,639 of them encoded in the human genome- constitute an enormously complex network of proteins and RNAs that regulate each other, and regulate "downstream" genes that encode everything else in the cell. They are made in various proportions to specify each cell type, to conduct every step in development, and to respond to every eventuality that evolution has met and mastered over the eons.

Loops occur in the DNA between site of regulator binding, in order to turn genes on (enhancer, E, and transcription regulator/factor, TF).

Once sufficient transcription regulators bind to a given gene, it assembles a transcription complex at its start site, including the RNA polymerase that then generates an RNA copy that can float off to be made into a protein, (such as a transcription regulator), or perhaps function in its RNA form as part of zoo of rRNA, tRNA, miRNA, piRNA, and many more that also help run the cell. Some regulators can repress transcription, and many cooperate with each other. There are also diverse regions of control for any given target gene in its nearby non-coding DNA- cassettes (called enhancers) that can be bound by different regulators and thus activated at different stages for different reasons. 

These binding sites in the DNA that transcription regulators bind to are typically quite small. A classic regulator SP1 (itself 785 amino acids long and bearing three consecutive DNA binding motifs coordinated by a zinc ions) binds to a sequence resembling (G/T)GGGCGG(G/A)(G/A)(C/T). So only ten bases are specified at all, and four of those positions are degenerate. By chance, a genome of three billion bases will have such a sequence about 45,769 times. So this kind of binding is not very strictly specified, and such sites tend to appear and disappear frequently in evolution. That is one of the big secrets of evolution- while some changes are hard, others are easy, and it there is constant variation and selection going on in the regulatory regions of genes, refining and defining where / when they are expressed.

Anyhow, researchers naturally have the question- what is the regulatory landscape of a given gene under some conditions of interest, or of an entire genome? What regulators bind, and which ones are most important? Can we understand, given our technical means, what is going on in a cell from our knowledge of transcription regulators? Can we read the genome like the cell itself does? Well the answer to that is, obviously no and not yet. But there are some remarkable technical capabilities. For example, for any given regulator, scientists can determine where it binds all over the genome in any given cell, by chemical crosslinking methods. The prediction of binding sites for all known regulators has been a long-standing hobby as well, though given the sparseness of this code and the lability of the proteins/sites, one that gives only statistical, which is to say approximate, results. Also, scientists can determine across whole genomes where genes are "open" and active, vs where they are closed. Chromatin (DNA bound with histones in the nucleus) tends to be closed up on repressed and inactive genes, while transcription regulators start their work by opening chromatin to make it accessible to other regulators, on active genes.

This last method offers the prospect of truly global analysis, and was the focus of a recent paper. The idea was to merge a detailed library predicted binding sites for all known regulators all over the genome, with experimental mapping of open chromatin regions in a particular cell or tissue of interest. And then combine all that with existing knowledge about what each of the target genes near the predicted binding sites do. The researchers clustered the putative regulators binding across all open regions by this functional gene annotation to come up with statistically over-represented transcription regulators and functions. This is part of a movement across bioinformatics to fold in more sources of data to improve predictions when individual methods each produce sketchy, unsatisfying results.

In this case, mapping open chromatin by itself is not very helpful, but becomes much more helpful when combined with assessments of which genes these open regions are close to, and what those genes do. This kind of analysis can quickly determine whether you are looking at an immune cell or a neuron, as the open chromatin is a snapshot of all the active genes at a particular moment. In this recent work, the analysis was extended to say that if some regulators are consistently bound near genes participating in some key cellular function, then we can surmise that that regulator may be causal for that cell type, or at least part of the program specific to that cell. The point for these researchers is that this multi-source analysis performs better in finding cell-type specific, and function-specific, regulators than is the more common approach of just adding up the prevalence of regulators occupying open chromatin all over a given genome, regardless of the local gene functions. That kind of approach tends to yield common regulators, rather than cell-type specific ones. 

To validate, they do rather half-hearted comparisons with other pre-existing techniques, without blinding, and with validation of only their own results. So it is hardly a fair comparison. They look at the condition systemic lupus (SLE), and find different predictions coming from their current technique (called WhichTF) vs one prior method (MEME-ChIP).  MEME-ChIP just finds predicted regulator binding sites for genomic regions (i.e. open chromatin regions) given by the experimenter, and will do a statistical analysis for prevalence, regardless of the functions of either the regulator or the genes it binds to. So you get absolute prevalence of each regulator in open (active) regions vs the genome as a whole. 

Different regulators are identified from the same data by different statistical methods. But both sets are relevant.


What to make of these results? The MEME-ChIP method finds regulators like SP1, SP2, SP4, and ZFX/Y. SP1 et al. are very common regulators, but that doesn't mean they are unimportant, or not involved in disease processes. SP1 has been observed as likely to be involved in autoimmune encephalitis in mice, a model of multiple sclerosis, and naturally not so far from lupus in pathology. ZFX is also a prominent regulator in the progenitor cells of the immune system. So while these authors think little of the competing methods, those methods seem to do a very good job of identifying significant regulators, as do their own methods. 

There is another problem with the author's WhatTF method, which is that gene annotation is in its infancy. Users are unlikely to find new functions using existing annotations. Many genes have no known function yet, and new functions are being found all the time for those already assigned functions. So if one's goal is classification of a cell or of transcription regulators according to existing schemes, this method is fine. But if one has a research goal to find new cell types, or new processes, this method will channel you into existing ones instead.

This kind of statistical refinement is unlikely to give us what we seek in any case- a strong predictive model of how the human genome is read and activatated by the herd of gene regulators. For that, we will need new methods for specific interaction detection, with a better appreciation for complexes between different regulators, (which will be afforded by the new AI-driven structural techniques), and more appreciation for the many other operators on chromatin, like the various histone modifying enzymes that generate another whole code of locks and keys that do the detailed regulation of chromatin accessibility. Reading the genome is likely to be a somewhat stochastic process, but we have not yet arrived at the right level of detail, or the right statistics, to do it justice.


  • Unconscious messaging and control. How the dark side operates.
  • Solzhenitsyn on evil.
  • Come watch a little Russian TV.
  • "Ruthless beekeeping practices"
  • The medical literature is a disaster.

No comments: