Saturday, March 5, 2016

Diamonds in the Junk: Medically Important Sites in Intergenic Regions

Finding disease-relevant human genomic variants in regulatory regions.

The human genome remains a complicated place, and to simplify its study, many researchers and medical screeners stay in the shallow end: the coding regions or "exome", which comprises the protein-coding portions of genes. But there is much more in there, from the introns and other critical elements inside and near genes, to regulatory elements spread over millions of basepairs around their target genes, to junk DNA with no known function, at least not yet.

A recent paper offered an improved analysis of these large intergenic regions, and claimed to find medically interesting mutations. The issue is that the vast regions surrounding genes are largely what could be called junk DNA, with occasional quite small and variable regulatory sites that are difficult to find. So mutations are frequent, yet most have no effect. Figuring out which might have medical significance is one of those finding-needles-in-haystacks problems. And doing it over a whole genome to an informative level is unprecedented.

Example of a mammalian genomic region (A) featuring 700,000 bases between neighboring genes. The genes are grey: Nom1, Lmbr1, Rnf32, Shh, Rbm33, Cnpy1, and En2. In color are various regulatory sites (enhancers), coded by their location and the region (B) where they activate gene expression of Shh in a mouse embryo. It is noteworthy that Shh is driven by enhancers lying throughout the intergenic region, and even within the Lmbr1 gene, 850 kbp away.

The researchers start with the genomes of five people who have allowed their data, both genomic and medical, to be used in the public domain. The other starting material is knowledge about 657 DNA-binding transcription regulator proteins, which is about half of all the regulators known in humans, and especially about the kinds of DNA sequences they tend to bind. Then they add in a phylogenetic conservation analysis of 33 other mammals, which allows them to find putative regulatory regions and sites in their human genomes.

The procedure is then to comb each genome, asking which probable sites in the vast non-coding areas differ from the average or reference human genome. Of those, they then ask which sites are conserved in chimpanzee and other species, indicating that the site has been significant over evolutionary time but was mutated in the person whose genome is being studied.
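
The two-step filter described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `CandidateSite` record, the toy sequences, the exact-match test for conservation, and the 90% threshold are all hypothetical stand-ins, not the paper's actual method, which works from predicted binding strength rather than exact sequence identity.

```python
from dataclasses import dataclass

@dataclass
class CandidateSite:
    """Hypothetical record for one predicted binding site (illustration only)."""
    name: str
    reference: str   # sequence in the reference human genome
    personal: str    # sequence in the individual's genome
    orthologs: dict  # species name -> aligned sequence in that species

def is_conserved(site, min_fraction=0.9):
    """Assumed criterion: most aligned species carry the reference sequence."""
    if not site.orthologs:
        return False
    matches = sum(seq == site.reference for seq in site.orthologs.values())
    return matches / len(site.orthologs) >= min_fraction

def flag_sites(sites, min_fraction=0.9):
    """Keep sites that are mutated in the person but conserved across species."""
    return [
        s for s in sites
        if s.personal != s.reference and is_conserved(s, min_fraction)
    ]

# Toy data, invented for this sketch
sites = [
    # mutated in the person, conserved in every species: flagged
    CandidateSite("site_a", "TGACTCA", "TGACGCA",
                  {"chimp": "TGACTCA", "mouse": "TGACTCA", "dog": "TGACTCA"}),
    # mutated in the person, but not conserved: ignored
    CandidateSite("site_b", "CACGTG", "CACGTT",
                  {"chimp": "CACGTG", "mouse": "CATGTG", "dog": "CTCGAG"}),
    # conserved, but not mutated in the person: ignored
    CandidateSite("site_c", "GGGCGG", "GGGCGG",
                  {"chimp": "GGGCGG", "mouse": "GGGCGG", "dog": "GGGCGG"}),
]

flagged_names = [s.name for s in flag_sites(sites)]
print(flagged_names)
```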

The innovative part of their analysis, aside from the scale being attempted in terms of numbers of sites, merging of evolutionary and regulatory binding pattern methods, and whole genome coverage, is in their statistical treatment. The object at this point is to describe the various mutated sites that were found in terms of possible medical or biological significance. This routinely depends on the annotation of the nearest gene, which would be the regulatory target and confer whatever ultimate function the site has. It has been customary to address only regulatory sites close to genes, since the demarcation from one gene's regulatory region to its neighbor's is not yet easily predictable from the genome sequence alone.

Parenthetically, one can note that this demarcation is performed by "insulators". These nucleo-skeletal-tethered sites in the intergenic DNA are bound by specific proteins that form boundaries which keep nearby genes regulated independently. They are under study and will be the subject of a future post.

A second problem is that, if one does range widely over the intergenic spaces to look for relevant mutations and biologically significant sites, it can be statistically hazardous to compare the sum of annotations from huge regions (which tend to surround developmentally important genes, for instance) with the necessarily smaller sets that would come from regions where genes happen to be close together. A correction for the size of the sampled area is called for. These researchers have constructed tools that specifically correct for these and other issues, and sample up to two megabases of intergenic DNA per gene. Their rule is that a gene's core region is 5,000 bases upstream and 1,000 bases downstream of the site where its transcription starts, and everything up to a megabase on either side, or to the core region of the next gene, is fair game for finding associated regulatory sites. This is a rather broad zone, but can only be improved once the insulator sites are better defined.
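
To make the domain rule concrete, here is a minimal Python sketch of how such regulatory domains might be assigned along one chromosome. It is not the authors' tool: the function name, the toy genes, and the choice to measure the one-megabase extension from the transcription start site are assumptions made for illustration.

```python
def regulatory_domains(genes, chrom_length,
                       upstream=5_000, downstream=1_000,
                       max_extension=1_000_000):
    """Assign each gene a regulatory domain on a single chromosome.

    genes: list of (name, tss, strand) tuples, strand being "+" or "-".
    Returns {name: (start, end)}.
    """
    # Core region around each transcription start site (TSS), strand-aware:
    # "upstream" lies at lower coordinates for "+" genes, higher for "-" genes.
    cores = []
    for name, tss, strand in genes:
        if strand == "+":
            start, end = tss - upstream, tss + downstream
        else:
            start, end = tss - downstream, tss + upstream
        cores.append((name, tss, max(0, start), min(chrom_length, end)))
    cores.sort(key=lambda core: core[1])  # order genes by TSS position

    domains = {}
    for i, (name, tss, core_start, core_end) in enumerate(cores):
        # Extension stops at the neighbor's core region or the chromosome end
        left_limit = cores[i - 1][3] if i > 0 else 0
        right_limit = cores[i + 1][2] if i < len(cores) - 1 else chrom_length
        # ...and never reaches further than max_extension from the TSS.
        # The core region itself is always kept, even if neighbors overlap.
        start = min(core_start, max(tss - max_extension, left_limit))
        end = max(core_end, min(tss + max_extension, right_limit))
        domains[name] = (start, end)
    return domains

# Toy chromosome with three invented genes
toy_genes = [("A", 1_000_000, "+"), ("B", 1_200_000, "-"), ("C", 4_000_000, "+")]
domains = regulatory_domains(toy_genes, chrom_length=5_000_000)
for gene, (start, end) in domains.items():
    print(gene, start, end)
```

In this toy layout, genes A and B sit close together, so their domains are cut off at each other's core regions, while the isolated gene C gets the full megabase on each side. That difference in domain size is exactly what the size correction has to account for.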

Since the three billion base human genome contains roughly 22,000 genes, there is an average of 136 kilobases per gene. So a two megabase bound per gene, while large and perhaps sometimes not capturing the full extent of a heavily and diversely regulated gene, should easily capture most regulatory regions and cover most of the genome.
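
The arithmetic behind that estimate is easy to check. The snippet below uses the round numbers from the paragraph above; the variable names are just for illustration.

```python
genome_size_bp = 3_000_000_000   # roughly three billion bases
gene_count = 22_000              # rough count of human genes
window_bp = 2_000_000            # one megabase on either side of a gene

avg_bp_per_gene = genome_size_bp / gene_count
window_to_avg_ratio = window_bp / avg_bp_per_gene

print(f"average spacing: {avg_bp_per_gene / 1000:.0f} kb per gene")
print(f"the two-megabase bound is {window_to_avg_ratio:.1f}x the average spacing")
```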

Cranking through the whole analysis, they come up with specifically enriched annotations tied to genes whose conserved regulatory sites have suffered detectable mutations (sites the authors acronymize as CoBELs) in each of the five people. The question is whether these annotations match the respective medical histories and problems. With only about 5,000 to 6,000 sites found for each person, these researchers are probably only scratching the surface of genomic variation, picking the very easiest fruit off this tree. It is thus important that their data come up with significant matches, since if these data are not significant, more and deeper data are unlikely to be any better. And significant matches are what they claim to find.

Outline of results from five individuals, correlating regulatory variants from this whole-genome analysis with their medical syndromes.
The total numbers of variant sites that were flagged by this analysis were:
  • Stephen Quake: 6,321
  • George Church: 5,291
  • Misha Angrist: 5,775
  • Rosalynn Gill: 5,861
  • James Lupski: 6,447
Thus they come up with statements like:
"Prominent in Stephen Quake’s medical records is a family history of arrhythmogenic right ventricular dysplasia/cardiomyopathy, including a possible case of sudden cardiac death. Strikingly, when Quake’s set of CoBELs is analyzed using GREAT, the top phenotype enrichment (using default parameter settings, optimized for inference power in the original GREAT paper) is “abnormal cardiac output” (57 CoBELs, false discovery rate Q = 1.69 x 10−4). This enrichment is suggestive of susceptibility to heart diseases responsible for reduced cardiac output. Meaningful associations between CoBELs and personal medical records are in fact observed for all five genomes" 
"For example, 33 genes in the human genome are annotated for “abnormal cardiac output”. Their GREAT assigned regulatory domains cover 0.45% of the genome. Of the 6,321 Quake CoBELs, 28 (0.45%) are expected in the regulatory domains of these 33 genes by chance, but 57 CoBELs, over twice as many, are in fact observed." 
"The top enrichment for George Church, who suffers from narcolepsy, is “preganglionic parasympathetic nervous system development” (23 CoBELs, Q = 1.18 x 10−4). The autonomic nervous system is strongly suspected to be involved in narcolepsy. Misha Angrist, whose personal reporting indicates possible keratosis pilaris, a follicular condition manifested by the appearance of rough, slightly red, bumps on the skin, has “epithelial cell morphogenesis” as his top biological process enrichment (60 CoBELs, Q = 1.38 x 10−5)."

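The "abnormal cardiac output" numbers quoted above can be reproduced with a simple binomial calculation. This is a sketch of that reasoning, not the GREAT implementation: it treats each of the 6,321 sites as landing in the annotated regulatory domains with probability 0.45%, and the tail probability it prints is an uncorrected p-value, so it should not be expected to equal the false discovery rate Q reported in the paper, which is adjusted for the many annotation terms tested.

```python
import math

def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1)
                    - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        total += math.exp(log_term)  # negligible terms underflow to 0.0
    return total

n_sites = 6321             # Quake's flagged sites
genome_fraction = 0.0045   # share of genome in the 33 genes' regulatory domains
observed = 57              # sites actually found in those domains

expected = n_sites * genome_fraction
fold_enrichment = observed / expected
p_value = binomial_upper_tail(n_sites, observed, genome_fraction)

print(f"expected by chance: {expected:.1f}")
print(f"fold enrichment:    {fold_enrichment:.2f}")
print(f"uncorrected binomial p-value: {p_value:.2e}")
```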
This analysis provides something that has been a dream up till now: a way to rigorously evaluate whole personal genomes, not just the coding areas, for medically relevant mutations. There is far more to do, since the 657 DNA-binding regulators and their sites are far from the only action in town. There are more functions hidden in the intergenic DNA, like epigenetic marks and non-gene transcription units. But this is a very promising start that can be scaled up and added to retail-scale analysis of anyone's fully sequenced genome, not only finding syndromes and weaknesses before they appear for individuals, but also helping the medical research enterprise find disease-causing mutations and pathways.