Saturday, November 17, 2012

Humans find some scraps in the genome junkyard

The human genome has a great deal of junk in it, but some junk may be better than other junk.

Humans have been shocked to find out that we have only about 21,000 protein-coding genes, the workhorses of developing and running our tissues. These cover only 2% of the DNA of the genome, so what is going on in the rest of it? Is it just parasitic junk like old retrotransposons, repetitive stutterings, and duplications decayed into pseudogenes?

A large genome evaluation project (ENCODE) has been in the news lately, claiming that perhaps 80% of the genome is actually functional. But that is quite a stretch of interpretation. What they actually found is that this large proportion of the genome is transcribed, typically at very low levels (detectable only with very sensitive technology), or carries some other marker of activity, such as sites where regulatory proteins bind or signs of activated chromatin.

It is sort of like saying that cable TV (or Roku, for that matter) has 300 channels, instead of the four or five channels we used to know about in an earlier age. Yet how many of those channels actually count for anything? Perhaps they amount to a lot of garbage, not influencing the larger media and social/political landscape. Perhaps, like FOX, they are putting out a lot of chatter and fluff, but in the end do not help build a greater future...

In biology, regulatory binding and transcription are only the opening steps on the way to gene expression. Only when all the pieces are put together, combining on a real gene whose transcribed RNA gets processed to an mRNA that encodes a protein on the ribosome, and gets transported to the right location, and ... do these regulatory events have effects in the cell and organism.

I think it is likely that, while the regulation of known genes is doubtless quite a bit more complex and distributed than previously realized (generating the incredibly fine-grained and endless variation that we see among humans), there remains a great deal of junk in the genome, some of which gives rise to biological noise (i.e. transcription and protein binding) whose effect may be minimal.

This was all by way of introduction to a paper that uses this ENCODE data to analyze the genome for signs of recent human evolution, especially of sites which have been newly drafted into use in the human lineage, from what was junk in our ancestors (and remains junk in chimpanzees and other contemporary fellow-species). It is an intriguing story of how genes and regulatory functions may have been fashioned by evolution out of the miscellaneous scraps lying around in the DNA.

An important problem these researchers face is that humans are virtual clones. We have much less genetic variation than other species. We evidently went through narrow population bottlenecks in the recent past (think African "Eve"). The current population of Africa has roughly 1.3 times the genetic variation of all humans outside Africa, due to the extra bottleneck of small groups migrating out of Africa. But human variation is low in any case (a variant every 153 bases in humans, vs every 0.2 bases in other mammals- truly a remarkable difference).

So the researchers turned to mass sequencing of many genomes to assemble enough genetic variation (the 1000 genomes project). This allowed them to map where, across the population, humans have DNA variants. In important areas, there will tend to be fewer variants, and in junky areas, there will be more, since there is no selection for important function keeping mutations at bay (i.e. killing and reducing the reproduction of people carrying variants there).
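The logic of that comparison can be sketched in a few lines of Python. This is only a toy illustration, not the paper's pipeline: the function name `snp_density`, the region coordinates, and the variant positions are all made up, and the regions are assumed not to overlap.

```python
def snp_density(variant_positions, regions):
    """Variants per base across a set of half-open [start, end) regions.

    Assumes the regions do not overlap one another (hypothetical toy input).
    """
    total_bases = sum(end - start for start, end in regions)
    hits = sum(
        1
        for pos in variant_positions
        if any(start <= pos < end for start, end in regions)
    )
    return hits / total_bases


# Made-up positions where some individual in the population carries a variant.
variants = [5, 12, 40, 47, 55, 61, 78, 90]

# Made-up annotation: two "active" stretches and one "inactive" stretch.
active = [(0, 30), (100, 120)]   # 50 bases in total
inactive = [(30, 100)]           # 70 bases in total

active_density = snp_density(variants, active)
inactive_density = snp_density(variants, inactive)

# Fewer variants per base hints at selection weeding mutations out.
print(f"active:   {active_density:.3f} variants per base")
print(f"inactive: {inactive_density:.3f} variants per base")
```

In this invented example the "active" stretches carry fewer variants per base than the "inactive" one, which is the kind of signal the researchers were looking for, only averaged over vastly larger spans of real DNA.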

The second piece of data they use is maps of genome conservation between many mammalian species (26 species, including humans, indicated in blue, below). This allows them to see what has been conserved for a long time. Known genes like hemoglobin will have been around a very long time, be easy to recognize, and be highly conserved, since they do critical work. Most changes in their code are going to be lethal. But elsewhere, most of the genome allows far more leeway, and the question is just how much. Are the regions that the ENCODE project identified (indicated in red, below) as being transcribed and sort-of-active important to humans? Or are they still just junk?

Venn diagram of the human genome (the whole black box). The ENCODE project found that most of it (red) is somehow active, either being transcribed to RNA at some low level, or bound by proteins that regulate active genes, etc. "Mb" means million base pairs of DNA. The blue part is that which is conserved among mammals, marking it as functionally significant not just in humans, but over a far longer time. Half of this conserved amount was in the "inactive" portion of the ENCODE data, which is certainly odd, and leads to questions about just what the ENCODE folks were looking at.

The answer is a partial one. They found that, on average, the newly found "active" areas of the genome outside known genes and outside areas known to be conserved in other mammals still carried significantly fewer variants than "inactive" areas (non-red). So they seem to have separated quasi-junk from the honest-to-goodness junk.

(Wonk note: But it is important to note that what they regard as conserved among mammals must be a very small proportion of what is actually active and functional in all these species, since the same ENCODE proportion of activity and function would be found in all these other species as well. So I believe they are comparing incommensurate metrics in this paper, and can not really conclude that the selective constraints on ENCODE-specific areas of the human genome are really human-specific rather than long-standing among mammals, if more variable than what is readily captured by typical measures of conservation.)

Graph of human variation, categorized by type of genome element. The axes are two different measures of genome variation. The X-axis is a metric of the density of SNPs, which are single nucleotide (or base) variants in the DNA, each equivalent to a single base mutation. The Y-axis is a metric of derived allele frequency (DAF), which cranks the SNP data through an additional analysis to focus on new variants versus ancestral ones.
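For the curious, here is a rough sketch of what a derived allele frequency is, in Python. It is a simplified illustration and not necessarily the exact metric computed in the paper: the helper names, the allele counts, and the ancestral calls (which in practice would be inferred from a relative such as the chimpanzee) are all invented for the example.

```python
def derived_allele_frequency(allele_counts, ancestral):
    """Fraction of sampled chromosomes carrying a non-ancestral allele."""
    total = sum(allele_counts.values())
    derived = total - allele_counts.get(ancestral, 0)
    return derived / total


def mean_daf(snps):
    """Average DAF over a list of (allele_counts, ancestral_allele) pairs."""
    freqs = [derived_allele_frequency(counts, anc) for counts, anc in snps]
    return sum(freqs) / len(freqs)


# Made-up SNPs: allele counts in a sample of 100 chromosomes, plus the
# allele assumed to be ancestral at that site.
snps = [
    ({"A": 90, "G": 10}, "A"),  # new G allele is fairly rare
    ({"C": 20, "T": 80}, "C"),  # new T allele has become common
    ({"G": 99, "T": 1}, "G"),   # new T allele is very rare
]

print(f"mean DAF: {mean_daf(snps):.3f}")
```

The intuition is that where new mutations are mildly harmful, selection keeps them rare, so a class of genome elements under constraint should show a lower average DAF than a class of honest junk.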

I have added red arrows, which point to the ENCODE included and excluded sets among the non-conserved areas of the human genome. The data of the paper essentially boil down to the difference between these two points on the graph, indicating that the "active" designation by the ENCODE project has some functional significance, reflected in lower-than-average rates of variation (i.e. mutation) in human populations, which in turn reflect a small degree of intra-species conservation. They term this the "constraint" these areas of the genome are under, from natural selection.

Other genomic features mentioned in this graph include: "Non-degenerate coding", codes for protein products, and specifically restricted to bases of the DNA that are not in the synonymous part of the triplet genetic code; "UTR", untranslated region, typically immediately leading or following a coding region; "CDS", coding sequence, the portion of a gene that codes for protein; "Annotated", previously included in atlases of functional genomic elements; "Active chromatin", regions bound by few histones or special histones chemically marked as permissive for transcription; "Intron", interrupting portion of genes that lies between the coding pieces and is specially spliced out of the transcribed RNA; "Exon", the coding pieces of a gene that lie between the introns and the UTRs; "Mappable", means pretty much everything- the whole enchilada, whole ball of genome wax.

Everything in this paper is done in bulk: averages drawn over huge areas and over crudely summarized features of the genome. What actually lies within the ENCODE areas that leads to these rather slight findings of selective constraint (and thus presumed biological function) is hard to say, without consulting the much more detailed work done elsewhere in the project. It could be a few important new genes, perhaps coding for the non-protein-coding RNAs that have become the focus of so much interest recently and which regulate other genes. Or it could be a large cloud of regulatory protein binding sites that tweak the activities of genes lying far, far away, weakening the idea of the gene as a local object on the DNA. Or some new aspect of biology waiting to be discovered. In any case, it reinforces the idea that it isn't how many genes you have, but how you use them that counts- that humans are beneficiaries of an extremely long process of gene-regulatory tinkering, both recently in our own lineage, and through the deep reaches of evolutionary time.

