Saturday, March 8, 2014

What is the oldest cell?

Some comparisons of the most ancient lineages of life- Archaea, Bacteria, and Eukaryotes.

Take your mind back.. way back.. four billion years back. Now fast-forward over chemical evolution, or whatever happened to cause the origin of life, a few hundred million years, on to the first cellular life. Now stop- what was that? A recent paper argues from a novel analysis of protein domain lineages that, of the major domains of life, the Archaea (also called archaebacteria) are in some respects closest to that original form, and that the other domains- Bacteria (also called eubacteria) and Eukaryotes- are more distant. (Apologies that biologists use the term "domain" in these two very different senses.) This is an interesting hypothesis, since up till now, it has been indeterminate which of the two prokaryotic lineages (Archaea or Bacteria) came first, or at least, most resembles the ur-life form, also called the progenote.

Tree of life, deep edition. Note that while eukaryotes arose from Archaea (with plenty of later additions from Bacteria by engulfment / symbiosis, but that is another story), the root between Archaea and Bacteria as shown here is indeterminate. Which was really first, or is that even a reasonable question to ask? The current paper also disputes that Eukaryotes derived from Archaea as diagrammed, and puts Archaea at the root of the tree.

Non-biologists may not get excited about the distinction between Archaea and Bacteria, but molecular biologists regard it as the most fundamental division of life, far more consequential than vertebrates / invertebrates, plants / animals, etc. All of the latter you can see in the little brown stubs far to the right of the diagram above. In molecular and deep phylogenetic terms, they don't contribute much to the diversity of life.

The Archaea / Bacteria division was only recognized relatively recently, however, since the nature of Archaea was not appreciated until the 1970s, when ribosomal RNA began to be sequenced. It provided the first primitive molecular sequence that was common to every single form of life and thus provided a metric of diversity and genealogy. The great American microbiologist Carl Woese labored to gather these sequences from obscure organisms and bacteria of all sorts. He made the shocking discovery that there were "bacteria" out there that were very, very different from the usual run of laboratory bacteria- the E. coli and various other disease-causing and easily-cultured bacteria that were the staff of biology since Pasteur. When he plotted out the sequences, these "bacteria" had ribosomal RNA that was a little more like animal sequences than bacterial, but not terribly similar to either. They weren't from another planet, but they were different enough that he took the very bold step of claiming an entirely new domain of life, co-equal with the heretofore only domains of life, Bacteria and Eukaryotes.

He named them Archaebacteria, on a hunch that they had something important to say about the origin of life. This name has subsequently been shortened to Archaea, so that the traditional bacteria can just be called Bacteria. These Archaea look like Bacteria, however- they are the same tiny cells whose wonders are not apparent from their looks. They are super-diverse, living in all sorts of environments from the coldest to the hottest known. They have phenomenal metabolic diversity, creating the methane in our guts and living off rocks, sulfur, and other obscure chemicals. Some have a primitive form of photosynthesis. They are typically sensitive to oxygen, a sign of their preference for a world predating the oxygenation of the atmosphere about 2 billion years ago (and making them very difficult to culture).

They also share most of their informational machinery (transcription, translation) with Eukaryotes, indicating strongly that Eukaryotes derived from Archaea that later engulfed bacteria (which eventually turned into mitochondria and chloroplasts) that provided some of the remarkable resources, both genetic and metabolic, for the eukaryotic triumph over the macroscopic world.

But ribosomal RNA, as convenient and informative as it is, has some problems. It is only a single, if large, molecule, among the thousands of other genes an organism has, and its sequence is somewhat inaccurate as a "clock" for molecular evolution. Few other sequences, such as those encoding proteins, are as completely universal among all life forms, however. The authors of a recent paper take a broader approach to the question of sharpening the universal genealogy (or, tree of life) by treating whole complements of proteins and their domain, or "fold", sub-sections as genealogical markers, testing which protein domains arose when, and which were lost in various lineages.

This gets around the issue of aligning individual sequences, to some extent, taking a wider lens view of the evolutionary process, a view that is well-suited to this question of the ultimate priority of the most ancient life forms. Protein domains / folds have been generated and lost quite frequently on this time scale, though there are core domains that are universal over all life forms. Eukaryotes are particularly prolific in generating new protein domains. About 3% of protein domains are unique to primates, for instance, though this may have as much to do with sampling & investigation bias as with reality.
"In fact, recruitment of ancient domains to perform new functions is a recurrent phenomenon in metabolism."

A protein with two domains. This one binds to DNA. The domains fold independently, have structure that is distinct from other domains, and can be easily linked, making them easy to re-shuffle in evolution, hooking functions together, Lego-like.

The authors assembled a compendium of about 2400 protein domain, or fold "families" from 420 sequenced organisms of all kinds, and used well-known methods to arrange them into trees based on their occurrence in the individual organisms (though sometimes a fold might be missed even if present, if its sequence diverged from its family consensus pattern too far). The gain and loss of such folds is a particularly powerful method of lineage analysis, giving more information than the comparison of sequences can, if those sequences are distant, with all the problems of alignment, assumed modes of mutational change, etc. Thirteen of their folds were present in every single organism, and 62 more were recognizably present in 95% or more.
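To make the bookkeeping concrete, here is a minimal sketch of how one might tally which folds are universal or nearly so, given a list of fold families per organism. The organism names, fold labels, and thresholds below are invented toy values, and the helper functions are my own illustration, not the authors' pipeline.

```python
# Toy occurrence data: proteome name -> set of fold families (FFs) detected in it.
# All names and memberships here are made up for illustration.
proteomes = {
    "archaeon_1": {"FF1", "FF2", "FF3"},
    "archaeon_2": {"FF1", "FF2"},
    "bacterium_1": {"FF1", "FF2", "FF4"},
    "bacterium_2": {"FF1", "FF2", "FF4", "FF5"},
    "eukaryote_1": {"FF1", "FF2", "FF3", "FF4", "FF6"},
}


def occurrence_fraction(proteomes):
    """Return {FF: fraction of proteomes in which that FF occurs}."""
    counts = {}
    for folds in proteomes.values():
        for ff in folds:
            counts[ff] = counts.get(ff, 0) + 1
    n = len(proteomes)
    return {ff: c / n for ff, c in counts.items()}


def folds_at_least(proteomes, threshold):
    """Return the sorted FFs present in at least `threshold` of all proteomes."""
    fractions = occurrence_fraction(proteomes)
    return sorted(ff for ff, f in fractions.items() if f >= threshold)


universal = folds_at_least(proteomes, 1.0)   # present in every proteome
widespread = folds_at_least(proteomes, 0.6)  # present in 60% or more
print(universal, widespread)
```

With real data the threshold of interest would be 1.0 and 0.95, matching the "every single organism" and "95% or more" counts quoted above.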

A Venn diagram showing the distribution of fold families among the three domains of life, whether shared or not. Note that a large core is shared by all life forms, while Eukaryotes take the prize for the development of new protein domains, despite originating after the divergence of Bacteria and Archaea.
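The seven regions of such a Venn diagram (A, B, E, AB, AE, BE, ABE) fall out of simple set membership. A rough sketch, assuming we already have one pooled set of fold families per domain of life; the sets here are placeholders, not the paper's data:

```python
# Hypothetical input: for each domain of life, the set of FFs seen in at least
# one of its proteomes. "A" = Archaea, "B" = Bacteria, "E" = Eukaryotes.
domain_folds = {
    "A": {"FF1", "FF2", "FF3", "FF7"},
    "B": {"FF1", "FF2", "FF4", "FF5"},
    "E": {"FF1", "FF2", "FF3", "FF4", "FF6"},
}


def venn_groups(domain_folds):
    """Group FFs by the combination of domains they occur in, e.g. 'ABE' or 'BE'."""
    groups = {}
    all_folds = set().union(*domain_folds.values())
    for ff in all_folds:
        # Build the label from the domains (in alphabetical order) containing this FF.
        label = "".join(d for d in sorted(domain_folds) if ff in domain_folds[d])
        groups.setdefault(label, set()).add(ff)
    return groups


groups = venn_groups(domain_folds)
for label in sorted(groups):
    print(label, sorted(groups[label]))
```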

"To determine the relative age of FF [fold family] domains in our dataset, we reconstructed trees of domains (ToDs) from the abundance and occurrence matrices used in the reconstruction of ToLs [trees of life]. The matrices were transposed, treating FFs as taxa and proteomes as characters. The reconstructed ToDs described the evolution of domains grouped into FFs and identified the most ancient and derived FFs. ... Specifically, it considers that abundance and diversity of individual FFs increases progressively in nature by gene duplication (and associated processes of subfunctionalization and neofunctionalization) and de novo gene creation, even in the presence of loss, lateral transfer or evolutionary constraints in individual lineages. Consequently, ancient domains have more time to accumulate and increase their abundance in proteomes. In comparison, domains originating recently are less popular and are specific to fewer lineages."
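The transposition step in that quote is easy to picture in code. The sketch below flips a toy abundance matrix so that fold families become the rows, then orders them by how widespread and abundant they are. To be clear, that ordering is only a crude stand-in of my own for the idea that older domains have had more time to spread; the authors build actual trees from the transposed matrix, which this does not attempt. All names and counts are invented.

```python
# Toy abundance matrix: rows are proteomes, columns are FFs, entries are copy counts.
ff_names = ["FF1", "FF2", "FF3", "FF4"]
abundance = {
    "archaeon_1": [4, 2, 0, 0],
    "bacterium_1": [6, 3, 1, 0],
    "eukaryote_1": [9, 5, 2, 1],
}


def transpose(abundance, ff_names):
    """Return {FF: [count in each proteome]}, so FFs play the role of 'taxa'."""
    rows = list(abundance.values())
    return {ff: [row[i] for row in rows] for i, ff in enumerate(ff_names)}


def rank_by_spread(by_ff):
    """Order FFs from most to least widespread, breaking ties by total abundance."""
    def key(ff):
        counts = by_ff[ff]
        present = sum(1 for c in counts if c > 0)
        return (-present, -sum(counts), ff)
    return sorted(by_ff, key=key)


by_ff = transpose(abundance, ff_names)
ranking = rank_by_spread(by_ff)
print(ranking)
```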

The next diagram shows the phylogenetic tree they deduce from all this data, with time along the horizontal axis, and species ordered up the side. The two trees were created from the same data by slightly different methods. Note how in both of these trees, the Eukaryotes (green) split from the Bacteria (blue) only a short time after the Bacteria split from the Archaea (red). The lavender arrows are mine. Both trees also show (via the numbers, which give the percentage of the time their simulations came out the same way) that this split is relatively less clearly supported than some of the other major divergences.

Authors' phylogenetic trees.

Returning to the Venn diagram, the Archaea-only group of folds is tiny, and does not seem particularly ancient, even though their trees put Archaea first. The hypothesis is that the other groups (the BE and AB (or AE) groups) generated far more protein diversity later on, whereas the Archaea did not, indeed losing quite a bit of the original complement of protein domains. In this way, Archaea end up resembling the progenote somewhat more than the Bacteria that diverged from the progenote simultaneously, but were more active in later evolution, in molecular terms.

Both the Bacteria and Archaea took the streamlining route in evolution, casting off quite a bit of machinery, focusing on small-ness of size and specialization of metabolism. The Eukaryotes, in contrast, branched off from the Archaea after the Bacteria did, and retained a good deal of the transcriptional, replicational, and translational machinery that the bacteria particularly lost or reduced (at least, by the conventional theory). And Eukaryotes in general took the opposite route with respect to streamlining, retaining molecular diversity & sloppiness, metabolic generalization, great physical size, gaining sex as a means to more effective evolution, and gaining the final upper hand with the endosymbiosis of two different Bacteria- the proto-mitochondrion, and the proto-chloroplast. These properties led eventually to multicellularity and the invasion of land. It was (depending on what one values!) a triumph of complexity and cooperation over brutal, cost-cutting competition.

The authors plot their organisms in an "economic" space. This is based on two scores- the number of protein folds occurring that are unique (economy), and the redundancy of protein folds occurring in each organism (flexibility), with the ratio between them serving as the last measure (robustness), which is, frankly, sort of an amplification of the flexibility score. Obviously, Eukaryotes will do very well in these measures.
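As I read these scores, they can be computed from nothing more than a table of domain copy numbers per organism. The sketch below reflects my interpretation (economy as the count of distinct folds, flexibility as the total count of domain copies, robustness as flexibility divided by economy); the exact definitions and the direction of the ratio are assumptions on my part, and the numbers are invented.

```python
def economic_scores(domain_counts):
    """Compute (economy, flexibility, robustness) for one proteome.

    domain_counts: {FF: number of copies of that fold in the proteome}.
    economy     = number of distinct FFs present (assumed definition)
    flexibility = total number of domain copies, redundancy included (assumed)
    robustness  = flexibility / economy (assumed direction of the ratio)
    """
    present = {ff: n for ff, n in domain_counts.items() if n > 0}
    economy = len(present)
    flexibility = sum(present.values())
    robustness = flexibility / economy if economy else 0.0
    return economy, flexibility, robustness


# Two made-up proteomes: a streamlined microbe and a redundant eukaryote.
microbe = {"FF1": 2, "FF2": 1, "FF3": 1}
eukaryote = {"FF1": 10, "FF2": 6, "FF3": 4, "FF4": 8, "FF5": 2}

print(economic_scores(microbe))
print(economic_scores(eukaryote))
```

Under these definitions, any organism with many copies of each fold scores high on both flexibility and robustness, which is why robustness looks like an amplified version of flexibility.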

Since in the authors' scheme the AE group of domains appeared very late, and the BE group was the first to branch off from the universal ancestors, they hypothesize that Eukaryotes branched off from Bacteria, and their informational-class resemblance to Archaea is due either to later lateral transfer, or to comprehensive loss in many Bacterial lineages (though the latter is very unlikely). To me this seems hard to swallow, as this class of functions is particularly unlikely to be transferred wholesale between organisms.
"Informational FFs were significantly over-represented in the AE taxonomic group and appeared during the late evolutionary epoch. This suggested that both Archaea and Eukarya work with a very similar apparatus for decoding their genetic information, which is different from Bacteria. However, as we explained above, all these innovations occurred in the late epoch (nd>0.55), highlighting ongoing secondary adaptations in the superkingdoms. In comparison, the BE taxonomic group was enriched in metabolic FFs (Figure 2A). This toolkit was probably acquired via HGT [horizontal gene transfer] during endosymbiosis of primordial microbes rich in diverse metabolic functions."

This idea would significantly alter / extend the well-known endosymbiotic hypothesis, in that the Eukaryotic precursor would presumably have to acquire not only the proto-mitochondrial cell, but also the proto-nuclear cell that provided these informational functions, from Archaea. It is hard to know what would characterize this original precursor at all ... why not just take the crucial Archaeal additions as the benchmark of the whole lineage? Wouldn't the large protein repertoire commonality between Bacteria and Eukaryotes be better accounted for by the known endosymbiosis than by this proposed lineage derivation? The authors have very little to say about what this early Eukaryotic stem organism might be, other than that it was quite advanced and had escaped the brutal streamlining that characterizes both the Archaeal and Bacterial lineages. Thus it, whatever it was, might represent the closest thing to the progenote, in some respects, before the vast elaborations that have been added in that line since, and the massive losses that took place in the other two domains.
"Thus, the primordial stem line, which was already structurally and functionally quite complex, generated organismal biodiversity first by streamlining the structural make up in Archaea (at nd = 0.15), then by generating novelty in Bacteria (nd = 0.26), and finally by generating novelty and co-opting bacterial lineages as organelles in Eukarya (nd less than 0.55)". [nd is their measure of time, from beginning (0) to now (1).]

In the end, the progenote is heavily veiled from our view. The repertoire of sequences common to all cells is small (484 fold families in this analysis), and not enough to model what it may have been like, other than to say it had a membrane, functioning metabolism, and informational / genetic system likely similar to what archaebacteria have today. It may have been a good deal more complex, depending on how one interprets the intervening events- as ones primarily of loss, or ones of gain.