Saturday, April 17, 2021

Zooming In On The Genome

Better sequencing methods bring the human genome to higher resolution and accuracy.

Most of the progress in DNA sequencing over the last two decades has come in what is known as "short read" sequencing. The dominant company, Illumina, produces massive machines that crank out huge amounts of data, but in the form of tiny individual reads, only about 90 bases long. That means that there is a lot of work on the back end for data analysis to piece everything together. And given the frequent occurrence in our genomes of repeats and repetitive sequences in many forms and sizes, these reads are simply too short to fully make sense of them. No amount of assembly magic can make up for a lack of long-range information.
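To make the repeat problem concrete, here is a toy sketch in Python. The sequences and the function name `count_placements` are made up for illustration and do not come from any real genome or assembly tool. It simply counts how many places a read could have come from in a reference that contains two copies of the same repeat.

```python
def count_placements(reference: str, read: str) -> int:
    """Count every position where the read matches the reference exactly.

    Overlapping matches are counted, since each is a distinct possible origin.
    """
    count = 0
    start = reference.find(read)
    while start != -1:
        count += 1
        start = reference.find(read, start + 1)
    return count


# Toy reference: two copies of one repeat, each with different flanking sequence.
REPEAT = "ACACACACAC"
reference = "GGTTA" + REPEAT + "CGTAG" + REPEAT + "TTCCA"

short_read = "ACACAC"               # lies entirely inside the repeat
long_read = "TTA" + REPEAT + "CGT"  # spans the repeat plus unique flanks

print("short read placements:", count_placements(reference, short_read))
print("long read placements:", count_placements(reference, long_read))
```

The short read fits in several places and cannot be assigned to either copy, while the read that reaches into the unique flanking sequence fits in exactly one place.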

So there has been a second movement of "next generation" sequencing methods, pursuing long reads of tens of thousands of bases. Several methods exist, but the leader is Pacific Biosciences (PacBio), whose method tacks down a single polymerase into a special optical well and then uses fluorescence to detect each individual nucleobase incorporation as the polymerase chugs away at the given template. This is not a foolproof process, being at the single molecule level. While the Illumina system greatly amplifies the DNA and thus raises the signal (which is also ultimately fluorescence-based) to a high and reliable level, these long-read methods tend to have lower reliability. A recent paper described a way around this, featuring a long read system which was used to analyze 34 human genomes to collect new information about large scale structure and variation.

The "weird trick" that PacBio uses is to circularize templates of about 15,000 bases, and then drive the polymerase reaction described above around those circles upwards of fifty times. This allows multiple passes around the same DNA to make up (in volume/repetition) for the inherent error rate of each individual pass. Reads of this size are big enough to span most forms of repetition and low complexity in our genomes, or at least cover enough landmarks/variants that one repeat can be distinguished from others. Indeed, these researchers could even figure out, based on allelic variants peppered through the genome, which parent each sequence came from, assembling each of the subject's two copies of each chromosome separately. All this makes it possible to assemble whole genomes with unprecedented accuracy (~1 error in a half-million bases), especially in terms of long-range features and variations.
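The idea of repeated passes outvoting random errors can be sketched in a few lines of Python. This is a deliberately simplified illustration, not PacBio's actual consensus algorithm: it assumes the passes are already aligned, are all the same length, and contain only substitution errors, whereas real single-molecule reads also have insertions and deletions. The function name `consensus` and the toy sequences are hypothetical.

```python
from collections import Counter


def consensus(passes: list[str]) -> str:
    """Majority vote at each position across pre-aligned, equal-length passes."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("this toy version assumes all passes have equal length")
    # zip(*passes) yields one column of base calls per template position.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


template = "GATTACAGATTACA"
passes = [
    "GATTACAGATTACA",
    "GATTGCAGATTACA",  # substitution error at position 4
    "GATTACAGATAACA",  # substitution error at position 10
    "CATTACAGATTACA",  # substitution error at position 0
    "GATTACAGATTACA",
]

print("consensus matches template:", consensus(passes) == template)
```

Because the errors fall at different positions in different passes, each one is outvoted by the correct calls from the other passes. With dozens of passes instead of five, the chance that a majority of them share the same error at the same position becomes very small.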

And that has been a growing area of interest in human genetics: variations in structure that lead to extra copies of genes, insertions, deletions, and altered regulatory contexts. It is a frontier that has had to wait for these new techniques, while millions of single nucleotide variants have been piling up. Cancers are notorious, for instance, for being caused by accidental fusions of two distant genes whereby some developmental or cell cycle gene function is put under novel and (usually) high activation by some other gene regulatory region. Down Syndrome is caused by a whole-chromosome duplication to trisomy. Smaller deletions and duplications have significant effects as well, naturally.

The new paper digs up twice as many structural variants as previous analyses (and does so from only 37 human genomes, compared to the >2000 genomes used by other analyses): 107,590 insertions/deletions over 50 bp in size; 316 inversions; 2.3 million insertions/deletions under 50 bp in size; and 15.8 million single nucleotide variants. Many of these count as normal alleles of long standing in the human genome, just difficult to piece together previously. In non-gene regions, these variants may have little effect. Per individual, and in comparison to the current reference human genome, they claim to see 24,653 large structural variants, 794,406 small insertions/deletions under 50 bases, and 3,895,274 single nucleotide variants. This is quite a lot in a genome of 3 billion bases: about 0.1% of all individual positions differ from the reference in any one person, on top of roughly 800,000 other rearrangements, insertions, and deletions.
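The per-individual arithmetic is easy to check. The short Python snippet below uses only the counts quoted in this post and the round figure of 3 billion bases for the genome size (an approximation, not an exact value); the variable names are just for illustration.

```python
GENOME_SIZE = 3_000_000_000  # approximate genome size, as used in the post

# Per-individual counts relative to the reference genome, as quoted above.
large_svs = 24_653
small_indels = 794_406
snvs = 3_895_274

snv_fraction = snvs / GENOME_SIZE
other_variants = large_svs + small_indels

print(f"Single nucleotide variants per position: {snv_fraction:.2%}")
print(f"Large structural variants plus small indels: {other_variants:,}")
```

So roughly one position in every 770 differs from the reference by a single base, and the structural variants and small insertions/deletions together come to a bit over 800,000 per person.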

An example of one transposon (top) that, as the current paper discovered, has jumped several times in succession: from chromosome 3 to chromosome 1, then from that landing spot to two other locations on chromosome 1 and one spot on chromosome 17. Each jump brought along a bit of extra DNA from the originating locus.

The vast majority of these mutations arose from repair events, where the DNA broke and was then fixed either by repair using the other homolog sequence for reference (~65% of cases) or by simple blunt end rejoining. A few percent came from errors that happen during replication, particularly of repetitive sequences. Another source of mutation is the movement of mobile genetic elements, which encode their own apparatus of transposition to new locations. These researchers found ~10,000 that were not present or not identified in the human reference genome (because this is what is generally called "junk"). Their detailed data, in comparison to outside references like the chimpanzee genome, allowed them to assess the ages and relationships of these mobile elements. Most are old fossils, no longer active due to mutation. But others have clear and recent lineages, and are still giving rise to mutations, even causing cancers. One can imagine that genome editing could eventually turn these off permanently, reducing one source of cancer and birth defect risk.

Close-up view of one part of chromosome 3, cytological band q29. Even in this small population sample (individual haplotypes listed down the left side, bottom), there is a flurry of structural variations, including inversions and duplications. (CNP = copy number polymorphism.) At top left is a map of genes located here in the reference sequence (hg38). The light arrow shows the direction of transcription, and the heavy vertical lines are the exons. For example, TNK2 is a protein kinase that relays signals inside cells, is active during brain development, and can be an oncogene when activated or overexpressed, as well as having a role in cell entry by some viruses.

An additional analysis looked for trait loci associated with the newly found structural variants. As can be surmised from the sample genomic location diagrammed above, this kind of jumbling of the genome is likely to have functional consequences, either by breaking genes apart, joining them to foreign regulatory regions, or by duplicating or deleting them, in part or whole. The researchers found that roughly half of the structural variants that map to known trait loci (called quantitative trait loci, or QTLs) were newly found in this study. So while the accuracy increase may not seem like a lot, it can have profound consequences.

The count of structural variants that differ by population. Superpopulation (region) count in light color, and population-specific in dark color.

Lastly, this new fund of variation data allows another look at human ancestry. As we know, the bulk of human variation remains in Africa, and that is reflected in structural variation as well as other forms of variation. Populations elsewhere are far less diverse, due to the small groups that founded those populations from the mother-continent, and perhaps also through the new selective pressures that swept those populations, either positively or negatively. Twenty years after the original human genome was published, it continues to be a clinical and research goldmine, but also requires ongoing work to bring to complete accuracy, something this work gets us much closer to.

