Saturday, January 23, 2016

Where do Genes Come From?

How did DNA odds and ends become new genes in the human lineage?

Every new genome is like a jigsaw puzzle made from extremely modern, sometimes unrecognizable, art. Fitting the physical pieces together, out of millions of short DNA sequencing reads, is the relatively straightforward task and is done entirely by computers these days. Identifying the genes encoded in that DNA, however, is a bit more difficult. This can be done partly by computer, by looking for conserved sequences similar to genes or other genomic features known in other species. Yet there are always a bunch of leftover pieces: sequences, and possibly genes, that are novel in each species. A significant problem is finding such genes, which have no reliable structural signature, and may only become apparent through intensive functional studies of what they do.

How do such genes arise from nothing, which is to say from the large amount of junk DNA lying about in the genome? Most genes arise by duplication and specialization, like the profusion of hundreds of olfactory receptor genes. But not all. A recent paper trawled through several mammalian genomes to look at the process of entirely new gene creation. The question is: how does random DNA get turned into a useful, transcribed, and translated gene?

The transcribed part is not so difficult, actually, since it has been found that most of the genome of any eukaryote is typically transcribed at a low rate. Not only genes, but all sorts of junk and useless DNA are transcribed into messages that are quickly discarded. Apparently, the cost of low-level promiscuous RNA production is less than that of tightening down the controls to restrict transcription only to bona fide, gold-plated genes. And such noise may be an important evolutionary resource as well.

The authors gathered up lots of this transcribed RNA from human, chimpanzee, macaque, and mouse tissues, sequenced it in bulk, and filtered for a minimum length (300 nt). Running all this through a computer, which condensed duplicate reads and compiled the distinct RNAs, they came up with about 100,000 candidate transcripts from about 35,000 candidate coding regions per species, substantially higher than the roughly 22,000 genes known to exist. They note that the noisy excess they find accounts for about 2% of transcriptional production, compared with RNA production from known genes, so the extra transcription is low level, and pretty low cost. True gene regulation causes much higher rates of transcription, at least where and when expression is really needed.
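To make the filtering step concrete, here is a toy sketch of the idea: drop anything shorter than the 300 nt cutoff and collapse exact duplicates. The function name and the (name, sequence) input format are my own assumptions for illustration. The paper's actual pipeline assembles transcripts from short reads and is far more involved than this.

```python
MIN_LENGTH = 300  # nt cutoff mentioned in the text


def filter_transcripts(transcripts, min_length=MIN_LENGTH):
    """Keep sequences of at least min_length nt, collapsing exact duplicates.

    `transcripts` is an iterable of (name, sequence) pairs (a hypothetical
    format). The first name seen for each distinct sequence is kept.
    """
    distinct = {}
    for name, seq in transcripts:
        seq = seq.upper()
        if len(seq) < min_length:
            continue  # too short to count as a candidate transcript
        distinct.setdefault(seq, name)  # ignore later exact duplicates
    return [(name, seq) for seq, name in distinct.items()]
```

Given two identical 300 nt sequences and one 200 nt sequence, this returns a single entry. Real duplicate collapsing also has to handle partial overlaps and sequencing errors, which exact matching ignores.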

The next step was to try to identify novel genes within this mess of noise. This involved discarding anything previously recognized as a gene, or related to genes in other species, by comparison to various sequence databases. This whittled the collection down to 634 in humans specifically, and 2,714 genes in humans, chimpanzees, or both (not shared with macaque or mouse). These are genes that seem to be regularly transcribed, at some length, but have not been previously recognized or annotated and are unique to their respective species. That is quite a lot, actually, for a few million years of evolution. What are they and where have they come from?

The researchers work quite hard to check whether these genes are expressed as proteins, and find evidence, for the human species, that only 21 seem to be translated. That does not mean that the others are not, but it is a disappointing rate. They also find evidence for natural selection, in that the mutation rate in these putative genes (though only in the few that were validated above as being translated to protein) is lower than for junk DNA, indicating that they have adopted some kind of usefulness.

One theory for the origin of such novel genes is that they may come from existing gene regulatory regions that fire in both directions. Firing bidirectionally is quite common, but typically only the downstream (sense) direction is conserved and useful (i.e., the conserved gene), while the upstream sequences vary quickly through evolution and do not encode genes or anything else useful. In this study, they did not find any enrichment for close opposite-strand positioning of their de novo genes with existing genes, so they concluded that conversion of such divergent transcripts to something useful was not a common mechanism of gene creation.

Occurrence of selected regulatory protein binding sites upstream of the putative genes. TSS denotes the transcription start site, and negative coordinates count the bases upstream on the X axis. Each regulator is noted at right, with a different color. While elevated, the frequencies hardly crack 1%, so again, this evidence suggests that only a small proportion of the collection of putative genes are actually regulated by these proteins.

What they did find was that certain DNA-binding protein regulator sites came up quite frequently upstream of the novel genes. These sites were for CREB, JUN, RFX, and M1/M2(TFIIB), rather common regulators. They also show that these sites are new just as the genes are, being typically absent from the same regions in the macaque genome, compared to the novel gene regions in chimpanzee and human. This leads to the theory that the random generation of such very short binding sites might have been the spark that originated these genes from unprepossessing DNA, after which they became more highly transcribed and attracted other regulator sites, some kind of function based on possible protein translation, and so forth into the Darwinian light.
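For a feel of what "a binding site upstream of a gene" means computationally, here is a minimal sketch that looks for exact matches to a short motif in a window of sequence ending at the TSS, and reports positions as negative coordinates like those on the figure's X axis. The function names are hypothetical, and the motif in the usage note, TGACGTCA, is the textbook consensus for the CREB site, not something taken from the paper. Real analyses typically score sites with position weight matrices on both strands, not exact single-strand matches.

```python
def motif_positions(upstream, motif):
    """Return TSS-relative start coordinates of exact motif matches.

    `upstream` is the sequence immediately before the TSS, so its last
    base sits at coordinate -1 and all returned coordinates are negative.
    """
    upstream = upstream.upper()
    motif = motif.upper()
    positions = []
    start = upstream.find(motif)
    while start != -1:
        positions.append(start - len(upstream))
        start = upstream.find(motif, start + 1)  # allow overlapping hits
    return positions


def fraction_with_motif(upstream_seqs, motif):
    """Fraction of upstream windows containing at least one exact match."""
    if not upstream_seqs:
        return 0.0
    hits = sum(1 for seq in upstream_seqs if motif_positions(seq, motif))
    return hits / len(upstream_seqs)
```

For example, `motif_positions("GGGGTGACGTCAGG", "TGACGTCA")` returns `[-10]`. Because the sites are so short, matches will turn up by chance in random DNA at some background rate, which is exactly why such sites are plausible sparks for new transcription.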

Binding sites of the respective regulatory proteins, in logos format. This format indicates by the letter size how essential and selective a particular base is at that position of the protein binding site.
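The letter sizes in a logo come from a simple calculation. For DNA, each position can carry at most 2 bits of information, and the standard measure is 2 minus the Shannon entropy of the base frequencies at that position. Each letter's height is then its frequency times that information value. A minimal sketch follows, with function names of my own choosing and without the small-sample correction that logo software usually applies:

```python
import math


def column_information(counts):
    """Information content (bits) of one site position, from base counts.

    `counts` maps bases to how often each was observed at this position.
    Result is 2.0 for a fully conserved base and 0.0 for four equally
    frequent bases. No small-sample correction is applied.
    """
    total = sum(counts.values())
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy


def letter_heights(counts):
    """Height of each base's letter in the logo: frequency times information."""
    total = sum(counts.values())
    info = column_information(counts)
    return {base: (n / total) * info for base, n in counts.items()}
```

A position that is always A gets a full-height A of 2 bits. A position split evenly between A and G carries 1 bit, drawn as two letters of 0.5 bits each. A position where all four bases are equally likely carries nothing and shows up as blank.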
 
Unfortunately, none of their genes or proteins are more definitively assigned a function, and they dejectedly point out that the gene complements of organisms and lineages tend to be relatively stable, with lots of conserved genes, so that these novel (and putative) genes are rarely brought up to important/essential function, but rather keep being refreshed in a treadmill of molecular birth and death.

  • Another paper on the translation of promiscuous transcripts.
  • We need to fix the drug industry, with more public research, fewer patent protections, more negotiation.
  • Theories of low growth.
  • Median income in the US peaked back in 2000.
  • Are we responsible for Egypt?
  • Pakistan is a terrorist state.
  • What good did Hillary do in foreign policy? "In the case of Clinton there hasn’t been a major foreign policy decision in the Middle East she pushed for that didn’t end up being a disaster both at home and the countries she advocated meddling in."
  • Does realism mean that Hillary concedes all ideals in advance, or does idealism offer nothing but false hope because Republicans will never move? Which side is more astute?
  • David Bowie has a bit of fun, miming other singers.
  • Themes in evolution and economics.
  • Annals of stuttering while black ...