Large-scale studies of what binds to DNA over whole genomes.
Biology is full of codes. There is the genetic code, but there are many others. There are protein localization codes- short sequences on many proteins that tell them where they should be transported to, such as to the mitochondrion, lysosome, the exterior, or the nucleus. There are kinase codes- the positions on many proteins where modification by phosphate changes their behavior. There is a histone code, which is the set of acetylations and methylations on histone tails which have wide-ranging influence on transcription of DNA to RNA. There is a sugar code- the many glycosyl modifications of proteins displayed externally on cells, which affect how they are seen and work in that space. Lastly, there is a code of short sites on DNA where specialized proteins bind, and by which transcription and other processes are regulated. Humans have roughly 1600 loci, out of their 20,000 genes, which appear to encode such proteins, and each binds somewhere and does something in our biology.
The study of how and where these proteins bind has a long history, with many such proteins now exceedingly well characterized, to the atomic level. But at the genomic scale, it is still something of a crapshoot to guess where and whether some DNA site binds a regulatory protein. Such proteins have rather flexible requirements, which researchers express in "motifs". These motifs are short and typically variable, or "degenerate". That is, each position in such a motif can be one of the four bases, and frequently more than one base is allowed. In the motif shown below, for the protein ZBTB33, only one G is absolutely required. The other positions are variable to some degree. Outer areas of a binding site tend to be less selective, naturally, as they are less strongly bound by the protein. Some proteins can bind to two different motifs, and some can be accessorized by partners of various kinds to bind yet other sites. Evolution is the great tinkerer, and in this system, interactions are frequently kept rather loose and fluid, enabling precision where needed (partly by complexing numerous regulatory factors and binding sites with each other in large cassettes), but also flexibility and adaptability elsewhere.
A representation of what DNA sequences the zinc finger protein ZBTB33 binds to. Each position along the DNA site is shown as the collection of possible bases seen in functional sites, with each shown in proportion to its frequency of occurrence. The central G is the only absolutely required base, though several others are nearly invariant.
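To make the idea of a degenerate motif concrete, here is a minimal Python sketch of how such a motif is commonly turned into numbers: count the bases at each position of a set of aligned sites, convert the counts to frequencies, and score any stretch of DNA by its log-odds against a uniform background. The sites listed are invented for illustration and are not real ZBTB33 data, and the function names, the pseudocount and the 25% background are all assumptions of this sketch, not anything taken from a published method.

```python
import math

BASES = "ACGT"

# Hypothetical aligned binding sites, invented for illustration only.
# They are NOT real ZBTB33 sites. Positions 1, 3 and 4 are invariant here.
SITES = [
    "TCGCGA",
    "TCTCGA",
    "ACGCGC",
    "TCGCGA",
    "CCTCGG",
]


def frequency_matrix(sites, pseudocount=0.5):
    """Per-position base frequencies, one dict per motif position.

    The pseudocount keeps unseen bases from getting a frequency of zero,
    which would make the log-odds score undefined.
    """
    matrix = []
    for i in range(len(sites[0])):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        matrix.append({base: counts[base] / total for base in BASES})
    return matrix


def log_odds_score(matrix, window, background=0.25):
    """Sum of log2(frequency / background) over the positions of a window."""
    return sum(
        math.log2(matrix[i][base] / background)
        for i, base in enumerate(window)
    )


def scan(matrix, sequence):
    """Score every window of motif width along one strand of a sequence."""
    width = len(matrix)
    return [
        (start, log_odds_score(matrix, sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]


if __name__ == "__main__":
    motif = frequency_matrix(SITES)
    hits = scan(motif, "AAAATCGCGAAAAA")
    best_start, best_score = max(hits, key=lambda hit: hit[1])
    print(best_start, round(best_score, 2))
```

An invariant position contributes a large positive score when matched and a large penalty when missed, while a variable position contributes little either way. That is the numerical counterpart of the tall and short letters in the logo. A real scan would also cover the reverse strand and would need a score threshold, and choosing that threshold is where much of the guesswork lives.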
So the question of what binds where is not an easy one to answer, just going from the sequence of the genome. Naturally, this has been the subject of recent advances in large-scale biology, enabling researchers to, for instance, identify all the binding sites of a given protein across the genome (in a given cell type and culture condition). Alternatively, they can identify all the "accessible" sites across a genome (again, in a given cell type and culture condition), which are locations where chromatin is "opened" up by the binding of regulatory proteins. This latter style of experiment naturally leads to the question- what is doing all that binding?
A recent paper comes from that field, deploying the latest machine learning and convolutional neural nets to find the answer, at least on a statistical basis. They combine a series of bulk open-chromatin experiments with a database of known transcription regulator motifs to match genomic sites with plausible proteins that bind there. In usual machine learning fashion, they reserve some of the training data for testing and validation, enabling the production of ROC statistics for accuracy and for comparison with other methods, of which there are many. But what they do not do is actually test the accuracy of their data in the lab, with actual cells and proteins. That would be hard for a bioinformatics lab! So their talk of "accuracy" is rather untethered from reality, though fine enough for the journal they published in, which is Public Library of Science, Computational Biology.
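For readers unfamiliar with how such held-out "accuracy" is computed, here is a rough Python sketch of the general recipe: set aside a fraction of the labeled examples, score them with the trained model, and summarize with the area under the ROC curve. This is not the paper's pipeline. The function names and the 20% test fraction are assumptions made for illustration, and the AUC is computed by its rank definition (the chance that a randomly chosen positive outscores a randomly chosen negative) rather than by any particular library.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle examples and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for a site labeled as bound, 0 for unbound.
    scores: model outputs, higher meaning more likely bound.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy labels and scores, invented for illustration.
    labels = [1, 1, 0, 1, 0, 0]
    scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
    print(roc_auc(labels, scores))
```

An AUC of 1.0 means every labeled positive outranks every labeled negative, and 0.5 is no better than a coin flip. Either way, the number only measures agreement with the held-out labels, which come from the same kind of experiment as the training data, so it says nothing about what a bench test would find.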
All that said, this is a code that is going to be very difficult to crack, since regulatory proteins are not just highly diverse and their sites degenerate, but they are themselves regulated in many ways, by phosphorylation, sumoylation, ubiquitination, methylation, complexing with partners, the generation of variable isoforms through transcription, and cleavage, among others. The same protein that activates transcription here may repress it there. So the "motif" is a bit of a chimera, as is its effect on gene expression. The great tinkerer has gone so far down the rabbit hole that even "Deep Mind" is going to have a hard time following it down, without further empirical advances ... such as massive upgrades in methods to identify specific protein binding sites across the genome.