Saturday, March 26, 2022

A Brief History of DNA Sequencing

Technical revolutions that got us to modern DNA sequencing.

DNA is an incredibly elegant molecule- that much was apparent as soon as its structure came out. It is structurally tough, and its principles of information storage and replication are easy to understand. It is one instance where evolution came up with, not a messy hack, but brilliant simplicity, one that remains universal across all the life we know. But while the modeled structure was immediately informative, it didn't help to determine DNA's most important property- its sequence. Methods to sequence DNA have gone through an interesting evolution of their own. First came rather brutal chemical methods that preferentially cut DNA at certain nucleotides. Combined with the hot new methods of labeling DNA with radioactive 32P, and of separating DNA fragments by size by electrically pushing them (electrophoresing) through a jello-like gel, this could yield a few base pairs of information.

A set of Maxam-Gilbert reactions, with the DNA labeled with 32P and exposed to X-ray film after being separated by size by electrophoresis through a gel. The smallest fragments are on the bottom, the biggest on the top. Each of the four reactions cleaves at certain bases, as noted at the top. The interpretation of the sequence is on the right. PvuII is a bacterial enzyme that cleaves DNA, and the (palindromic) sequence noted at the bottom is the site where it does so.

Next came the revolution led by Fred Sanger, who harnessed a natural enzyme that polymerizes DNA in order to sequence it. By providing it with a mixture of natural nucleotides and defective ones that terminate the extension process, he could easily generate far larger assortments of DNAs of various lengths (that is, longer reads), as well as much higher accuracy of base calling- the chemistry of the Maxam-Gilbert process was quite poor in base discrimination. This polymerase method also eventually switched to a different isotope to trace the synthesized DNAs, 35S, which is less powerful than 32P and gave sharper signals on film, which was how the DNA fragments were visualized after being laid out and ordered by size by electrophoresis.

The Sanger sequencing method. Note the much longer read length, and cleaner reactions, with fully distinct base specificity. dITP was used in place of dGTP to help clarify G/C-rich regions of sequence, which are hard to read due to polymerase pausing and odd behavior in gel electrophoresis. 
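The logic of reading a chain-termination gel can be sketched in a few lines of code. This is a toy model (it ignores the fact that the polymerase actually synthesizes the complementary strand): each of the four reactions yields fragments ending at every occurrence of its base, and pooling them sorted by length, just as the gel does from bottom to top, spells out the sequence.

```python
def sanger_reaction(template, ddntp):
    """One termination reaction: with enough molecules, chains end at
    every position where the dideoxy analog of `ddntp` is incorporated."""
    return [i + 1 for i, base in enumerate(template) if base == ddntp]

def read_gel(template):
    """Pool all four reactions and read the gel from the bottom
    (shortest fragment) up, noting which reaction each band came from."""
    ladder = []
    for ddntp in "ACGT":
        ladder += [(length, ddntp) for length in sanger_reaction(template, ddntp)]
    return "".join(base for _, base in sorted(ladder))

print(read_gel("GAATTC"))  # prints GAATTC: the ladder spells the sequence
```

The point of the exercise: no single fragment reveals much, but the nested set of fragments, ordered by size, carries the whole sequence.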

There have been many technological improvements and other revolutions since then, though none have won Nobel prizes. One was the use of fluorescent terminating nucleotides in place of radioactive ones. In addition to improving safety in the lab, this obviated the need to generate four different reactions and run them in separate lanes on the electrophoretic gel. Now, everything could be mixed into one reaction, with four different terminating fluorescent nucleotides in different colors. Plus, the mix of synthesized DNA products could now be run through a short stretch of gel held in a machine, and a light meter could see them come off the end, in marching order, all in an automated process. This was a very significant advance in capacity, automatability, and cost savings.

Fluorescent terminating nucleotides facilitate combined reactions and automation.

After that came the silicon chip revolution- the marriage between Silicon Valley and Biotech. Someone discovered that silicon chips made a good substrate to attach DNA, making possible large-scale matrix experiments. For instance, DNA corresponding to each gene from an organism could be placed at individual positions across such a chip, and then experiments run to hybridize those to bulk mRNA expressed from some organ or cell type. The readout would then be fluorescent signals indicating the level of expression of each gene- a huge technical advance in the field. For sequencing, something similar was attempted: laying down all possible 8- or 9-mers across such a chip, hybridizing the sample to them, and thereby trying to figure out all the component sequences of the sample. The sequences were so short, however, that this never worked well. Assembling a complete sequence from such short snippets is nearly impossible.
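The failure mode is easy to demonstrate. In this toy sketch (with 2-mers standing in for the chip's 8-mers), the chip only reports which short words are present in the sample, and two entirely different sequences can contain exactly the same words:

```python
def kmers(seq, k):
    """All k-letter windows of seq -- essentially the information a
    hybridization chip reports about a sample."""
    return sorted(seq[i:i + k] for i in range(len(seq) - k + 1))

# Two distinct sequences light up exactly the same spots on the chip,
# so no amount of downstream computation can tell them apart.
print(kmers("ATAGA", 2))  # ['AG', 'AT', 'GA', 'TA']
print(kmers("AGATA", 2))  # ['AG', 'AT', 'GA', 'TA']
```

Longer words shrink the ambiguity but never eliminate it, especially in repetitive stretches of a genome.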

What worked better was a variation of this method, where the magic of DNA synthesis was once again harnessed, together with the matrix layout. Millions of positions on a chip or other substrate have short DNA primers attached. The target DNA of interest, such as someone's genome, is chopped up and attached to matching primers, then hybridized to this substrate. Now a few amplification steps are done to copy this DNA a bunch of times, all still attached in place to the substrate. Finally, complementary strands are all melted off and the remaining single DNA strands are put through a laborious step-by-step synthesis process across the whole apparatus, with chemicals successively washed through: in each cycle, a polymerase adds one fluorescent, reversibly blocked nucleotide to each growing strand, somewhat reminiscent of how artificial DNA is made to order. Each step ends with a fluorescent signal that says which base just got added at that position, and a giant camera or scanner reads the plate after each pass, adding +1 to the sequence of each position. The best chemical systems of this kind can go to 150 or even 300 rounds (i.e. base pairs), which, over millions of different DNA fragments from the same source, is enough to then later re-assemble most DNA sequences, using a lot of computer power. This is currently the leading method of bulk DNA sequencing.
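The re-assembly step can be illustrated with a minimal greedy sketch (real assemblers use far more sophisticated graph-based methods): repeatedly find the two reads with the largest suffix-prefix overlap and merge them.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads):
    """Merge the best-overlapping pair of reads until none overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_a, best_b = 0, None, None
        for a, b in permutations(reads, 2):
            k = overlap(a, b)
            if k > best_k:
                best_k, best_a, best_b = k, a, b
        if best_k == 0:
            break  # repeats or coverage gaps leave the assembly in pieces
        reads.remove(best_a)
        reads.remove(best_b)
        reads.append(best_a + best_b[best_k:])
    return reads

print(greedy_assemble(["ATTAGAC", "GACCTA", "CTAGGC"]))
# prints ['ATTAGACCTAGGC']
```

Even this toy version hints at the trouble with repetitive DNA: when a repeat is longer than a read, multiple merges tie for "best" and the true order of the pieces is lost.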

A single DNA molecule being sequenced by detecting its progressive transit through a tiny (i.e. nano) pore, with corresponding electrical readout of which base is being wedged through.

Unfortunately, our DNA has lots of repetitive and junky areas which read lengths of even 300 bases cannot do justice to. We have thousands of derelict transposons and retroviruses, for instance, presenting impossible conundrums to programs trying to assemble a complete genome out of, say, ~200 bp pieces. This limitation of mass-sequencing technologies has led to a niche market for long-read DNA sequencing methods, the most interesting of which is nanopore sequencing. It is almost incredible that this works, but it is capable of reading the sequence of a single molecule of single-stranded DNA at a rate of 500 bases per second, for reads extending to millions of bases. This is done by threading the single strand through a biological (or artificial) pore just big enough to accommodate it, situated in an artificial membrane. With an electrical field set across the membrane, subtle fluctuations in the current are detectable as each base slips through, and these differ for each of the four bases. Such is the sensitivity of modern electronics that this can be picked up reliably enough to read the single thread of DNA going through the pore, making possible hand-held devices that can perform such sequencing at reasonable cost.
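The basecalling idea can be caricatured as follows. The current levels here are made up for illustration, and a real pore senses several bases at once, so actual basecalling relies on machine-learning models rather than a simple lookup- but the principle of mapping an electrical trace back to bases is the same:

```python
# Hypothetical characteristic current levels (picoamps) for each base.
LEVELS = {"A": 90.0, "C": 80.0, "G": 70.0, "T": 60.0}

def basecall(trace):
    """Assign each measured current dwell to the base with the
    nearest characteristic level."""
    return "".join(
        min(LEVELS, key=lambda base: abs(LEVELS[base] - sample))
        for sample in trace
    )

print(basecall([89.5, 61.2, 79.0, 70.4]))  # prints ATCG
```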

All this is predicated on DNA being an extremely tough molecule, able to carry our inheritance over the decades, withstand rough chemical handling, and get stuffed through narrow passages, while keeping its composure. We thought we were done when we sequenced the human genome, but the uses of DNA sequencing keep ramifying, from forensics to diagnostics of every tumor and tissue biopsy, to wastewater surveillance of the pandemic, and on to liquid biopsies that promise to read our health and our future from a drop of blood.

