Saturday, August 2, 2014

91.8% of the Human Genome is Junk

In other words, 8.2% is selectively constrained. What is in there?

Even after we know all the letters, the human genome remains a very mysterious place, encoding in runic digital form unimaginably complex developmental processes and subtle behavioral traits, on top of the bread & butter cells, organs, metabolism, etc. Much has been learned, but much remains unknown. One perennial question about the genome is ... how much of it has any function at all? We know there is a great deal of free-riding junk: old transposons, repetitive recombination errors, dead genes, and other unidentifiable regions.

We also know there are only about 22,333 protein-encoding genes, which to experts seems an unimaginably small number. I mean, yeast cells have 6,275 genes, and do we seem a mere 3.5 times as complicated? This coding portion takes up about 1.5% of the genome. We have another maybe 2,000 non-protein-coding genes like RNA regulators, which come in numerous types. Then there is a lot of other material, like regulatory regions that control whether nearby genes get turned on, functional parts of introns, centromere and telomere structures, etc. There are also a few mysterious ultra-conserved elements with no known function. One might imagine that "function" is a graded concept, with some vague regulatory tweaks having little function, in contrast to ribosomal components that have immutable must-have function. So any number we put on this is going to be fuzzy.

Perhaps the best way to define what is functional and what isn't is to go back to Darwin ... whatever is conserved in evolution is important, and whatever changes rapidly is not (with the exception of the occasional positively selected mutations, but these are, in the bulk of the DNA, extremely rare). A recent paper has pursued this analysis over many mammalian lineages to come up with the estimate of functional human DNA in the title above. Their technical approach is to compare the occurrence of alterations in sequence, especially insertions / deletions (indels), in actual genomes versus a calculated control genome with no conserved elements.

Over time, in a pair of imaginary (control) genomes with no important elements at all, mutations accumulate all over, and the indel-free stretches between them shrink to progressively smaller average lengths. Actual genomes have far more regions of DNA where such indels are not allowed, a sign of purifying selection that eliminates mutations and preserves statistically unusual information. This leads to their statistic of constraint, meaning the portion of the actual genomes constrained from random change by purifying selection, which removes organisms with mutations in those important nucleotides.
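The contrast between a control genome and an actual one can be illustrated with a toy simulation. This is only a sketch under loud assumptions, not the paper's method: indels are taken to hit each site independently at a fixed rate, and the function name, genome length, rate, and protected block below are all hypothetical values chosen for illustration. In the neutral case the indel-free stretches average roughly one over the indel rate; adding a block where indels are forbidden leaves an unusually long stretch behind.

```python
import random


def segment_lengths(n_sites, indel_rate, protected=(), seed=0):
    """Return the lengths of indel-free stretches along a toy genome.

    Hypothetical model: every site is hit by an indel independently with
    probability `indel_rate`, except sites inside `protected` half-open
    (start, end) ranges, where purifying selection removes every indel.
    """
    rng = random.Random(seed)
    protected_sites = set()
    for start, end in protected:
        protected_sites.update(range(start, end))

    lengths, current = [], 0
    for site in range(n_sites):
        # Draw for every site so neutral and constrained runs share one
        # random stream and differ only inside the protected ranges.
        hit = rng.random() < indel_rate
        if hit and site not in protected_sites:
            lengths.append(current)  # an indel closes the current stretch
            current = 0
        else:
            current += 1
    lengths.append(current)  # final stretch after the last indel
    return lengths


if __name__ == "__main__":
    neutral = segment_lengths(100_000, 0.01)
    constrained = segment_lengths(100_000, 0.01, protected=[(40_000, 42_000)])
    print("neutral mean stretch:", sum(neutral) / len(neutral))
    print("neutral longest stretch:", max(neutral))
    print("constrained longest stretch:", max(constrained))
```

The excess of long indel-free stretches in the second run, relative to the first, is the kind of signal that a constraint statistic would pick up.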

Portion of the human genome under various constraints. The X axis refers to divergence of sequences in the authors' neutral model, where 1.0 would be total divergence. The axis can be thought of as time going backwards to the right, though the relationship might not be linear. The Y axis is their measure of selective constraint which is the fraction of nucleotide sites (here displayed as the actual megabases of the 3.23 gigabase human genome) that are under purifying selection. Coding regions are blue and hardly vary with time or with the divergence of less constrained areas, while in red are shown non-coding regions that have some constraint. As divergence goes on, less of the latter are detectable.

Further breakdown of constrained regions by type. The Y-axis here is in terms of turnover during evolution, which is another measure of constraint. Protein coding regions turn over the least, then sites that are DNase sensitive (promoters, protein binding sites on DNA, etc.), then the average genome, then Promoters overall, then UnTranslated Regions, which are transcribed near genes on both sides, then Transcription Factor Binding Sites, then regulatory enhancers, which can be much farther away from genes than promoters, and lastly long noncoding RNA species, which have modest function at best. "TE" refers to transposable elements, which have no constraint to lose or turn over.

As shown above, different types of known genetic elements obviously have different levels of conservation and, in the second diagram, different probabilities of loss over time, from conserved status to unconserved. Indeed the authors make the interesting point that of all this more-or-less conserved material, we only share a minority of it, about 23%, with mice. There is no surprise that coding regions are far more constrained than others. What is interesting is that these authors were willing and able to put a number on the global level of constraint, saying that, while the 91.8% of the genome with little evolutionary constraint may not be entirely without function, it can fairly be called "junk".

"If we make the assumption that the exponential decay model of functional sequence applies outside of the range of divergences we examined, then by extrapolating back to zero divergence we can estimate the total proportion of human genomes that is under present-day purifying selection with respect to indels. ... We therefore estimate that 8.2% of the human genome (253 Mb; 95% CI 7.1%–9.2%, 220–286 Mb) is presently under purifying selection with respect to indels."
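The extrapolation step in that quote can be sketched as a curve fit. Assuming the detectable constrained fraction decays as c(d) = c0 * exp(-k * d) with divergence d, taking logs gives a straight line whose intercept at d = 0 is log(c0). The code below fits that line by ordinary least squares; the function name is hypothetical, and the fitting procedure is an assumption for illustration, since the paper's actual statistical method is not described here. The example points are synthetic, generated from the model itself with c0 set to the quoted 8.2% and an arbitrary decay rate, so recovering 8.2% only checks the arithmetic of the fit, not the paper's result.

```python
import math


def fit_exponential_decay(divergences, constrained):
    """Least-squares fit of log(constrained) = log(c0) - k * divergence.

    Returns (c0, k), where c0 is the value extrapolated to zero divergence.
    """
    logs = [math.log(c) for c in constrained]
    n = len(divergences)
    mean_d = sum(divergences) / n
    mean_l = sum(logs) / n
    sxx = sum((d - mean_d) ** 2 for d in divergences)
    sxy = sum((d - mean_d) * (l - mean_l) for d, l in zip(divergences, logs))
    slope = sxy / sxx
    intercept = mean_l - slope * mean_d
    return math.exp(intercept), -slope


if __name__ == "__main__":
    # Synthetic illustration only: NOT data from the paper.
    true_c0, true_k = 0.082, 1.5  # k is an arbitrary, made-up decay rate
    divergences = [0.1, 0.2, 0.4, 0.6]
    constrained = [true_c0 * math.exp(-true_k * d) for d in divergences]

    c0, k = fit_exponential_decay(divergences, constrained)
    print(f"extrapolated fraction at zero divergence: {c0:.3f}")
    print(f"fitted decay rate: {k:.3f}")
```

With real measurements the points would scatter around the line, which is where a confidence interval like the quoted 7.1%–9.2% would come from.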

This conclusion is controversial, because many labs have done functional analyses of what is actually transcribed from the genome, and find that practically all of it gets transcribed at some level. So most of the genome is "active" in some sense. The problem is that just because a stretch of DNA is transcribed to RNA doesn't make it a gene, or particularly important. The RNA may not get translated or have any further effect. All processes in the cell are messy, and it is quite likely that transcription is the same way, creating lots of noise and waste that signifies relatively little. That little may have activities of some modest degree that escape strong selective constraint and bear investigation, but on the whole, I agree with these authors that evolutionary conservation is the best measure we have of importance vs junk.


  • GOP knows power, even if it know-nothings all else.
  • Pakistan fires artillery into Afghanistan ... sounds a little like the Russia-Ukraine situation.
  • Federal reserve president (Texas) keeps being wrong about inflation. This is simple class warfare.
  • "This report provides little evidence of any pick-up in wage growth. ... While a tightening labor market should eventually allow workers to see some gains in real wages, the economy does not appear to be at this point yet."
  • Towards a maximum wage.
  • Skills shortage? No way, Jose. It looks more like a slave shortage.
  • Microsoft apparently not so desperate for skilled workers as its H1B propaganda would have you believe; fires 18,000.
  • Autopsy of NAFTA ... not so great for either side, but not so bad either.
  • And generally, globalization vs democracy.
  • The fables of Reagan, and his children.
  • In Ukraine, the Russian operatives had a hard time gaining local support: "The people who supported this were marginal people, communists, lumpen, some of the Orthodox priests."
  • This week in the WSJ: GOP candidate for California governor wonders where the jobs are.
"I walked for hours and hours in search of a job, giving me a lot of time to think. Five days into my search, hungry, tired and hot, I asked myself: What would solve my problems? Food stamps? Welfare? An increased minimum wage? 
No. I needed a job. Period. Like others, I have often said the best social program in the world is a good job. Even though my homeless trek was only for a week, with a defined endpoint, that statement became much more real for me. A job was the one thing that could have solved my food, housing and transportation problems."
[He might ask his GOP brethren in Washington, who slept at the fiscal wheel during the recession and do everything in their power to make America worse off, especially the worst off. And he might also ask why having a job leaves many workers still on food stamps and other public assistance.]
