Saturday, February 7, 2009

How to read DNA

A review of DNA sequencing technologies, from the paleolithic to the bleeding edge.

While one of the greatest discoveries of the last century, indeed of all time, was the role and structure of DNA, it did not amount to much in practical terms until methods were devised to read its code- its sequence. There has been a fascinating evolution in technologies to read DNA, and I have experienced a good share of it. Most methods are dependent on harnessing nature's own enzymes that replicate DNA in increasingly clever ways. The resulting flood of information will serve the age-old project of "know thyself".

DNA exists in almost endless lengths (bacterial genomes are typically circular, and the average human chromosome is 1.3E8 base pairs in length). So the first step in sequencing, in typical reductive fashion, is to break this linear structure into small pieces, place them into bacterial mini-genomic circles with independent replicative ability (plasmids or their relatives), and replicate/amplify them to large amounts that can be handled, sampled, sequenced, filed, bar-coded and stored.
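Just to make the shearing step concrete, here is a toy Python sketch of it (the cloning into plasmids and the amplification are not modeled, and the genome, fragment sizes, and function names are all invented for illustration):

import random

def shotgun_fragments(genome, n_fragments=200, min_len=300, max_len=800):
    # Toy model of shearing a genome into random, overlapping pieces,
    # the inserts one would then clone into plasmids and amplify.
    fragments = []
    for _ in range(n_fragments):
        length = random.randint(min_len, max_len)
        start = random.randint(0, len(genome) - length)
        fragments.append(genome[start:start + length])
    return fragments

# A made-up 10 kb "genome" for demonstration
genome = "".join(random.choice("ACGT") for _ in range(10000))
library = shotgun_fragments(genome)
print(len(library), "inserts; the first one starts with", library[0][:30])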

The paleolithic method of sequencing (I've used it a few times) is based on chemistry rather than on enzymes, and is called the Maxam-Gilbert method, after its developers. First, one cuts a large batch of DNA at a specific sequence site with what is called a "restriction" enzyme- a pair of molecular scissors. Its ends are then labeled with radioactive phosphorus (P32), one of the two ends is removed with yet another restriction cut, and the remaining DNA is split into several pools, each of which is treated lightly with quite hazardous chemicals that modify the DNA at certain bases (hydrazine at T and C, dimethyl sulfate at G, and formic acid at G and A). The individual units of DNA are called nucleotides, and their key parts are called bases- the A, G, C, and T of the genetic code, named for their basic pH.

These chemical reactions are only roughly base-specific, and hit other bases as well, so the whole thing is woefully inefficient. The DNA is then further chemically processed to break the backbones at the modified bases, and the mixtures are electrophoretically separated on a gel that allows fragments differing in length by a single nucleotide base to be distinguished. The radioactive label on one end ensures that only those fragments spanning from the radioactive label to the randomly cut point appear on the X-ray film that is exposed to the gel.
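To see the logic of the readout (setting the messy chemistry aside), a toy sketch helps: assume each lane cleaves cleanly after its target bases, tabulate the labeled-fragment lengths per lane, and read the "gel" from the shortest band up. I have added the standard fourth, C-only lane (hydrazine plus salt) so the calls come out unambiguous; everything else below is invented for illustration.

LANES = {
    "G":   {"G"},          # dimethyl sulfate
    "G+A": {"G", "A"},     # formic acid
    "T+C": {"T", "C"},     # hydrazine
    "C":   {"C"},          # hydrazine plus salt (the standard fourth lane)
}

def gel_bands(labeled_fragment):
    # For each lane, list the lengths of end-labeled fragments produced by
    # cleavage just after each susceptible base (idealized, complete cleavage).
    return {lane: sorted(i + 1 for i, base in enumerate(labeled_fragment)
                         if base in targets)
            for lane, targets in LANES.items()}

def read_gel(bands, length):
    # Walk up the gel one band size at a time and call each base from the
    # pattern of lanes that band appears in.
    calls = []
    for size in range(1, length + 1):
        lanes = {lane for lane, sizes in bands.items() if size in sizes}
        if "G" in lanes:
            calls.append("G")
        elif "G+A" in lanes:
            calls.append("A")
        elif "C" in lanes:
            calls.append("C")
        else:
            calls.append("T")   # a band in the T+C lane only
    return "".join(calls)

fragment = "GATCCTAG"
bands = gel_bands(fragment)
print(bands)
print(read_gel(bands, len(fragment)))   # reconstructs GATCCTAG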

All the other methods to sequence DNA use the magic of DNA replication enzymes (polymerases) to read sequence, using methods devised by Fred Sanger (one of only three people to have received two Nobel prizes in science). They do this by getting the enzyme to incorporate occasional bases with some special property- the nucleotides either stop chain elongation at random positions, allowing fragments like the ones described above to be produced directly by the polymerase, or they have other complex modifications to be described below. The enzyme does the work of reading along the DNA, and the experimenter coaxes it to tell which nucleotide base it is seeing as it goes along.

The original Sanger method used radioactive tracers such as P32 or S35 to detect the resulting DNA fragments, but advances in fluorescence technology have revolutionized this aspect of biology, as so many others (one of the latest Nobel prizes went to fluorescence labeling technologies for proteins).

How do these enzymes know where to start? The DNA is continuous, but just like in a book or a chapter or a page, you have to start somewhere. And since the text in this case is A, T, G, and C with no further punctuation, the problem of knowing where you are is quite a bit more difficult than in a book. Usually a "primer" is used to start off the DNA polymerase- a short DNA fragment that can be made by pure chemistry, perhaps 20 nucleotides long, which hybridizes to its complementary sequence in the target DNA (after it has been heated up to melting temperature). If the cloning was done in clever fashion (abutting the DNA fragment to be sequenced right up to a known part of the cloning plasmid), then the same primer can be used for an entire sequencing project.
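As a small illustration of what "hybridizes to its complementary sequence" means in practice, here is a sketch that finds where a primer would anneal on a template strand. The sequences are invented, and real primer design also has to worry about melting temperature and secondary structure.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    # The two strands of DNA pair antiparallel, so a primer binds where the
    # template contains the reverse complement of the primer's sequence.
    return seq.translate(COMPLEMENT)[::-1]

def primer_site(template, primer):
    # Position where the primer anneals on the template, or -1 if nowhere.
    return template.find(reverse_complement(primer))

template = "TTGACCATGAAACGTTAGCCGGATCCATTGCA"
primer = "GGATCCGGCTAACG"   # a hypothetical 14-mer, for illustration
print(primer_site(template, primer))   # prints 12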


The original human genome project used a variation of this method, where primed DNA polymerases on templates are fed a low ratio of nucleotides that have both chain-terminating capacity (di-deoxy, as opposed to DNA's single deoxy) and fluorescent labels (different for each of the four bases). Then the full four-label reactions, with their resulting synthesized fragments, are run through an extremely tiny (capillary) electrophoretic gel, at the end of which a fluorescence detector reads off the labels from the size-sorted fragments as they travel past. This is done with expensive machines, using miniaturized reactions that attain large scales of operation, taking all this work out of the hands of regular bench scientists.
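A toy simulation captures the essential trick: each synthesized molecule grows until a chain-terminating dideoxy nucleotide happens to be incorporated, and sorting the resulting fragments by size recovers the sequence one color at a time. The terminator fraction, the strand-complementarity bookkeeping, and all the names below are simplifications made up for illustration.

import random

def terminated_fragments(sequence, ddntp_fraction=0.05, n_molecules=5000):
    # Toy dideoxy model: synthesis on each molecule proceeds base by base
    # until, by chance, a chain-terminating dideoxy nucleotide is used,
    # leaving a fragment whose terminal base carries that base's color.
    # ("sequence" is treated directly as the read; complementarity is glossed over.)
    fragments = []
    for _ in range(n_molecules):
        for position, base in enumerate(sequence, start=1):
            if random.random() < ddntp_fraction:
                fragments.append((position, base))
                break
    return fragments

def read_trace(fragments):
    # The capillary sorts fragments by size; reading one color per length,
    # from shortest to longest, spells out the sequence.
    color_at_length = {}
    for length, base in fragments:
        color_at_length[length] = base
    return "".join(color_at_length[size] for size in sorted(color_at_length))

sequence = "ATGCGTACCGTTAGC"
print(read_trace(terminated_fragments(sequence)))   # reproduces the sequence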

More recent technologies are the 454 and Illumina systems (named for the companies that offer them), which finally dispense altogether with the electrophoretic separation step that has been such a painful bottleneck.

These systems lay single molecules of template on tiny islands on a glass slide (or a bead), and do an in-place PCR amplification step to park a lot of copies at that location. Then the sequencing step is performed, with A, G, C, and T successively washed over all the template islands, and a luminous flash registered wherever a single step of incorporation takes place, before the next wash and the next step of polymerization are performed, and so on.
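Schematically, the wash-and-flash cycle looks like the sketch below. It glosses over the real chemical differences between the 454 and Illumina approaches, and over strand complementarity; the cycle count and the example sequence are arbitrary.

def sequencing_by_synthesis(template, n_cycles=40):
    # Each base is flowed over the template islands in turn; a flash is
    # recorded whenever the flowed base is the next one the polymerase can
    # add. (Real platforms differ in how they handle runs of the same base.)
    read = []
    position = 0
    for _ in range(n_cycles):
        for flowed_base in "ACGT":
            if position < len(template) and template[position] == flowed_base:
                read.append(flowed_base)   # flash detected at this island
                position += 1
            # no match: no incorporation, no flash; wash and flow the next base
    return "".join(read)

print(sequencing_by_synthesis("GATTACAGGCT"))   # prints GATTACAGGCT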

The virtue of this system is its extreme miniaturization and large parallelism- many different molecules can be laid down, amplified, and sequenced in one experiment. However, the read length is paltry- only about 35 (Illumina) or 300 (454) nucleotides, compared to the 800 nucleotides regularly attainable with the gel-sorting methods above.

Read length is critically important, since the next step for all these technologies is the reverse of reductionism: the re-assembly of the sequence from all the individual sequence reads, like doing a jigsaw puzzle. The reads (for a whole genome, say) are all poured into a computer program which lines up sequences that overlap, building back up to the sequence of the entire source DNA as best it can. As with jigsaw puzzles, the bigger the pieces you start with, the easier the puzzle is to solve, to an almost exponential degree.
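A minimal greedy version of that jigsaw step, assuming error-free reads and no repeats (real assemblers are vastly more sophisticated), might look like this; the reads and the overlap threshold are invented for illustration.

def overlap(a, b, min_overlap=3):
    # Length of the longest suffix of a that matches a prefix of b.
    for size in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads, min_overlap=3):
    # Repeatedly merge the pair of reads with the biggest overlap,
    # a toy version of the jigsaw-puzzle step described above.
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            break   # nothing overlaps any more; leave the remaining contigs
        merged = reads[i] + reads[j][size:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

reads = ["ATGCGTAC", "CGTACCGT", "CCGTTAGC", "TAGCAAAT"]
print(greedy_assemble(reads))   # -> ['ATGCGTACCGTTAGCAAAT']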

Last, and most amazing, a recent report in Science introduces what is sure to be the next iteration- monitoring the production of a single strand of DNA on a single polymerase from a single template strand with an extremely miniaturized apparatus. Originating in the labs of Watt Webb (of which I am an exceedingly minor alumnus), and Harold Craighead at Cornell, this technique uses an odd optical property to peek into extremely tiny volumes of solution (one zeptoliter ~1E-21 liter).

It turns out that if you shine light through holes made in a conductor whose diameters are less than half the light's wavelength, the light does not get very far. If a solution is put into those holes, you can look at the fluorescent properties of the super-tiny volume right at the floor of the hole (containing in this case a DNA polymerase with template) without being distracted by the rest of the solution, which may contain a high concentration of other fluorescent compounds (nucleotides). The fluorescence system looking into the bottom of the hole essentially just "sees" the occasional one or two fluorescent molecules bouncing along the bottom, or binding to the polymerase located there.
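A quick back-of-the-envelope calculation shows why such a tiny volume matters: even at micromolar concentrations of labeled nucleotide (levels a conventional fluorescence setup could never tolerate), a zeptoliter contains far less than one molecule on average. The concentrations below are illustrative, not taken from the paper.

AVOGADRO = 6.022e23    # molecules per mole
ZEPTOLITER = 1e-21     # liters

def mean_occupancy(concentration_molar, volume_liters=ZEPTOLITER):
    # Average number of molecules of a species inside the observation volume.
    return concentration_molar * AVOGADRO * volume_liters

for conc in (1e-6, 1e-7):   # 1 uM and 0.1 uM, illustrative values only
    print(f"{conc:.0e} M -> {mean_occupancy(conc):.4f} molecules per zeptoliter")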

The sequencing method is then to add a solution of four different fluorescent nucleotides, each carrying a color label on its outermost phosphate- the part that gets clipped off as the nucleotide is added to the growing chain. The polymerase attached to the bottom of the view-hole can use and incorporate these nucleotides with no problem, and fluorescence from the incoming nucleotide appears transiently as it is positioned in the enzyme's active site, but before the reaction takes place that clips off the label and incorporates the rest of the nucleotide into the growing chain.

Thus the detector sees a parade of distinct fluorescence signals, one by one, as the lone polymerase does its work synthesizing a new DNA strand along the template. The tricky part is that this process happens stochastically. One incorporation event may go fast, the next slow, as diffusion of the nucleotides and even quantum effects come into play. Several incorporations of the same nucleotide may occur in succession on the template, requiring the observers to make sure they are tracking the pauses in fluorescence that occur between each step of the elongation reaction. Much of this uncertainty can be resolved technically, and also by doing a few replicates.
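The bookkeeping problem can be sketched in a few lines: simulate a train of colored pulses separated by dark gaps, then "call" bases with a detector that cannot resolve gaps below some minimum, so that whenever two identical bases are separated by too short a pause, one of them gets missed. All the time constants here are invented for illustration.

import random

def simulate_trace(template, mean_pulse=0.2, mean_gap=0.3):
    # Toy pulse train from one polymerase: each incorporation gives a colored
    # pulse of random duration followed by a dark gap of random duration.
    events = []
    for base in template:
        events.append((base, random.expovariate(1 / mean_pulse)))   # pulse
        events.append((None, random.expovariate(1 / mean_gap)))     # dark gap
    return events

def call_bases(events, min_gap=0.05):
    # Call one base per pulse, but merge two same-colored pulses into a single
    # call if the dark gap between them is too short to resolve (the
    # homopolymer undercall problem described above).
    calls = []
    previous_base, previous_gap = None, float("inf")
    for base, duration in events:
        if base is None:
            previous_gap = duration
            continue
        if base == previous_base and previous_gap < min_gap:
            pass   # unresolved gap: looks like one long pulse, a base is missed
        else:
            calls.append(base)
        previous_base = base
    return "".join(calls)

template = "GGCCGGGC"
print(call_bases(simulate_trace(template)))   # usually GGCCGGGC; occasionally a base is dropped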

One advantage of this method is that read lengths are substantially increased. The researchers (who have now duly set up shop in Silicon Valley) show an experiment using a circular template with alternating G and C halves to run off a potentially infinitely long sequencing read. They report a rate of ~3 bases incorporated per second under their conditions, with clear alternation of C and G signals, up to 4,000 nucleotides in an hour's time. This is very promising for problems in genomic sequencing like repetitive regions, which are very difficult to piece together from short sequencing reads, and one may hope that these lengths can be extended and the polymerization times sped up as the technique is further optimized.


All these advances mean that it will not be long before individuals can get their entire genomes sequenced at a reasonable price. The information will allow divination of the future, in the form of improved personal medical prognoses as we slowly learn more about how the genome works. And also divination of the past, since complete genomes will allow genealogical analysis of unprecedented detail and depth. Our long evolutionary inheritances reside in these ~3 billion base pairs, and bringing them into the light will generate great benefits, individually and collectively.

Incidental links:
Steven Pinker on his own genome.
Very basic TED talk on genomes by Barry Schuler.
Dire warnings about privacy issues.