Billions of years have created some weird tricks in DNA search.
Search is all around us, as we increasingly rely on search engines to find everything we need, want to watch, or want to buy on the internet. Search looks into databases, which hold the sought-after information. All our accounts, all the domain names, all the products... everything is held in databases of one kind or another, and those databases are indexed in clever ways to provide virtually instant pointers from the question we ask to the answer held online. AI merely puts a linguistic gloss on this, and most people are still encountering AI first as a feature of search, such as the summaries at the top of current Google search results.
Well, our genomes are databases as well: rich and ancient storehouses of jewels that encode the body and its doings. How does search work there, and what is search even for? At any moment, each cell of the body has certain needs, stresses, and goals, as expressed in its DNA programming. The tools available are proteins and RNAs, which carry out the cell's functions. The needs may arise from signals coming from previously expressed receptors, say, for insulin, which may be triggered and tell the cell to take up glucose from the blood. The receptor turns on a kinase, which may turn on another kinase, which turns on a transcription regulator, which goes into the nucleus and ... does a search. This regulator is searching for places (specific sequences) in the genomic DNA where it can bind, after which it helps to turn on (or off) the nearby gene, executing the desired function or tool.
| General introduction to transcription regulators (or "factors") and their role in gene activation and the whole process of gene expression. |
This is obviously a very different kind of search than what Google does on our behalf across documents, but there are similarities. Internet search depends on patterns, matching the user's input with the vast corpus of the internet, which is also held as text symbols. Transcription regulators match patterns too, in this case patterns of DNA that they like to bind, which may occur only once in the genome, or occur tens of thousands of times. The pattern here is a complementary physical/electrochemical shape, rather than an abstract same-symbol match. The genome is, to a protein, truly vast. By mass, our three billion-base genome is roughly forty million times larger than an average regulatory protein of, say, fifty kilodaltons (kDa). Search is therefore a difficult problem here, one that researchers have been wondering about for decades. Several recent papers discuss different aspects of the problem and shed some modern light on it.
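The size comparison above can be checked with a little arithmetic. The sketch below assumes an average mass of about 650 daltons per DNA base pair, a common rule of thumb that is not stated in the text, and compares the genome's mass to that of a 50 kDa protein.

```python
# Rough mass comparison between the human genome and a typical regulator.
# Assumption (not from the post): an average DNA base pair weighs ~650 daltons.
GENOME_BASE_PAIRS = 3_000_000_000   # three billion bases, as in the text
DALTONS_PER_BASE_PAIR = 650         # assumed rule-of-thumb average
REGULATOR_DALTONS = 50_000          # 50 kDa, as in the text

genome_daltons = GENOME_BASE_PAIRS * DALTONS_PER_BASE_PAIR
ratio = genome_daltons / REGULATOR_DALTONS

print(f"Genome is about {ratio:,.0f} times heavier than the regulator")
```

Under that assumption the ratio comes out at 39 million, in line with the "forty million" figure.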
We have roughly 1600 transcription regulators in our genomes, so there is something going on all the time. DNA is always being queried. And what it replies with is RNA, a transcript issued from a gene, which either goes off to instruct creation of a protein, or is itself functional in some way. So, how do proteins bind to DNA, executing their search? It was transformative when the first atomic structures of such proteins were solved. They were clearly complementary with their DNA targets, with nicely positioned positive charges to mate with the backbone of the DNA and amino acid fingers reaching into the helix to feel the shapes of the nucleotides they wanted to bind. All very neat, and paradigmatic for bacteria, whose genomes are quite small. But there is more to the story. Binding sites in human genes tend to be quite short: five to seven bases. That really isn't enough to be very specific across a vast genome. Eukaryotes have developed several weird tricks, as it were, to encourage efficient search over much larger genomes and at the same time increase precision while maintaining evolvability and flexibility.
Eukaryotes have nucleosomes, chromatin, and packaging. The DNA is not just splayed out randomly, but wound up on protein spools. One would think that this would impair search by regulators. Paradoxically, it appears to help. There is a fine balance between hunting around on a given piece of DNA for a preferred site (one-dimensional search, 1D), and jumping off, letting go, and trying somewhere else (by diffusion; three-dimensional search, 3D). The compaction of genomic DNA into nucleosomes, which wind up most of the DNA while leaving the linking DNA in between free, appears to provide a nice balance of landing spots that allow searching regulators to jump very long distances (in linear DNA terms) while not going very far in absolute terms. Regulators vary in how aggressively they can plow through nucleosomes to try out their internal DNA sites, but many (called pioneer factors) can do so.
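To see why five to seven bases is not very specific, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that the genome is a random sequence with all four bases equally likely and that only one strand is counted. Real genomes are not random, so the numbers are only an order-of-magnitude guide. The function name `expected_random_sites` is made up for this example.

```python
def expected_random_sites(genome_length, motif_length):
    """Expected number of exact matches to one fixed motif in a random genome.

    Assumes equal base frequencies (1/4 each) and counts a single strand.
    The number of start positions is approximated by the genome length.
    """
    return genome_length / 4 ** motif_length


GENOME_LENGTH = 3_000_000_000  # three billion bases, as in the text

for k in (5, 6, 7):
    hits = expected_random_sites(GENOME_LENGTH, k)
    print(f"{k}-base site: about {hits:,.0f} chance matches")
```

Even a seven-base site would be expected to turn up well over a hundred thousand times by chance alone, which is why the extra tricks matter.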
Secondly, transcription regulators cooperate with other proteins to create longer, more complex DNA sites for precise gene identification and higher binding affinity. As biologists have characterized the enhancers and promoters of important developmental genes, they have found that DNA binding sites occur in bunches, and have much weaker effects when broken down and separated. Sometimes there is direct side-to-side cooperation between two regulators that bind the DNA. At other times, they combine with other non-DNA-binding proteins to create complexes at such sites. The DNA recognition sequences of these combinatorial sites can be changed significantly, even beyond (our) recognition, by the addition of cooperative proteins. This is something that makes prediction of where a given regulator binds particularly perilous.
Thirdly, many regulators contain not only DNA binding domains, but also extra disordered domains that facilitate DNA search and binding. This has been a recent realization that accounts for some of the speed and flexibility of regulator search and DNA interaction in eukaryotes. The stable crystal structures of paradigmatic bacterial regulators are not the whole story, and indeed are insufficient to explain what is happening in the much larger setting of our own cells. The authors note that eighty percent of human gene regulators have large disordered domains (called IDRs, for intrinsically disordered regions), upwards of 500 amino acids long. These never showed up in crystal structures, naturally. Being disordered, they are also poorly conserved. So, they have been difficult to study.
| Comparison of binding by one regulator, MSN2, which has a large IDR, to its genomic sites. At top is its native binding pattern, across a whole genome. At bottom are mapped its core motif occurrences on that DNA. Second from top is the MSN2 protein mutated to contain only its core motif-binding domain, and third from top is the MSN2 protein mutated to remove that domain and retain everything else. Note how different the patterns of binding by each of these proteins are, though how each approximates to some degree the wild-type pattern. |
In related work, researchers have divided up such proteins into the core binding site part and the IDR part. They find that both parts work partially, directing binding to some of the native sites around the genome. In fact, the IDR part does a more statistically accurate job than the core DNA binding motif. This is fascinating, showing that in eukaryotes, a new search mechanism arose, supplementing discrete and precise binding with a floppy / fuzzy code in the IDR and its binding sites. It turns out that regions of hundreds of bases around core target sites (which in one case amount to only the motif AGGGG) are preferentially bound by the respective IDR protein domain, with multiple weak interactions that remain structurally uncharacterized.
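For a sense of what mapping "core motif occurrences" involves, here is a minimal sketch of an exact-match scan for AGGGG on both DNA strands. The helper names and the example sequence are made up for illustration. Real motif mapping typically uses weighted matrices rather than exact strings, and this is not the method used in the work described.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def motif_hits(seq, motif="AGGGG"):
    """Return sorted (position, strand) pairs for exact motif matches.

    Positions are 0-based starts on the given sequence. A '-' hit means the
    motif lies on the opposite strand. Palindromic motifs would be reported
    twice; AGGGG is not palindromic.
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start, strand))
            start = seq.find(pattern, start + 1)  # allow overlapping matches
    return sorted(hits)


example = "TTAGGGGCATCCCCTAA"  # made-up sequence, not real genomic DNA
print(motif_hits(example))
```

On this toy sequence the scan finds one site on each strand. On a real genome, such a short motif matches far more places than the regulator actually binds, which is the gap the IDR helps to close.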
| Relationship between the size of the IDR binding region and the time spent in 1D vs 3D search, by simulation. The X axis is the size of the IDR binding region, and the Y axis is the time taken for search. Total time (yellow) reaches a minimum at an optimum size, because 1D search is slowed by longer IDR-binding regions, while 3D search is strongly accelerated by them. |
The combination of core binding and loose, unstructured binding in one regulatory / search protein provides powerful benefits. In dimensionality terms, if the effective landing site is expanded from five to five hundred bases, then the time required for 3D search through the space of the nucleus is dramatically shortened. Secondly, loose binding by the IDR then promotes an "octopus"-like 1D search along the local DNA, resulting in efficient settling on the core binding site to get ultimately precise positioning. The ultimate affinity of the regulator for the local DNA is also enhanced compared to what it could manage over a five base pair site. The researchers conclude that with these domains, the search problem is, in net terms, reduced by one dimension, from 3D to 2D. The surrounding areas of DNA that have marginal affinity for the IDR domain are called "antenna" regions, and the authors' simulations show how they alter search behavior.
| Schematic explanation of the current work, describing how IDR domains help to speed up the transition from 3D search through space to 1D search across the DNA, and how they then facilitate 1D search by preventing full detachment from the DNA while the core binding motif continues to search by diffusion for its binding site (yellow). |
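The trade-off between 3D and 1D search can be illustrated with a toy model. This is not the researchers' simulation. It simply assumes that 3D search time falls in inverse proportion to the size of the antenna region (a bigger target is found sooner), while 1D search time grows with the square of that size (as diffusive scanning does). The constants `k3d` and `d1d` are arbitrary, made-up values chosen only to make the curve visible.

```python
def search_times(antenna_bp, k3d=1.0e6, d1d=100.0):
    """Return (t_3d, t_1d, total) in arbitrary time units for a toy model.

    Assumptions (illustrative only, not from the cited work):
      t_3d = k3d / antenna_bp            -> larger landing site, faster 3D search
      t_1d = antenna_bp**2 / (2 * d1d)   -> diffusive scan over the antenna
    """
    t_3d = k3d / antenna_bp
    t_1d = antenna_bp ** 2 / (2.0 * d1d)
    return t_3d, t_1d, t_3d + t_1d


# Scan antenna sizes from 5 to 2000 bases and pick the one with least total time.
sizes = range(5, 2001, 5)
best = min(sizes, key=lambda a: search_times(a)[2])

# For this toy model the optimum can also be derived by calculus:
# d/da (k3d/a + a^2/(2*d1d)) = 0  gives  a = (k3d * d1d) ** (1/3)
analytic = (1.0e6 * 100.0) ** (1.0 / 3.0)

print(f"best antenna size on grid: {best} bases")
print(f"analytic optimum:          {analytic:.1f} bases")
```

With these made-up constants the optimum lands at a few hundred bases, but that number carries no biological meaning. The point is only the shape of the curve: total time is high for very small and very large antenna regions, with a minimum in between.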
For computers and databases, search is a huge problem that has led to technical innovation, as well as large drains on resources. Every search engine combs the internet, gobbling up all available information, creating indexes, and updating them constantly in order to give us the instant access we want. This infrastructure has been raised to a new level by AI, which transforms search into a new form, combining it with language translation and prediction methods that allow a search for corkscrew to bring back results for wine. It is unlikely that AI understands anything, but the desire to upgrade search from a simply determinative process to one that is more fuzzy and richly interpretive, and thus more useful, is not a new phenomenon.