Saturday, May 15, 2010

Protein theology

I critique a scholarly article from the "intelligent design" camp.


One of the wonders at the heart of modern biology is our knowledge of protein function and structure. The linear genetic code produces protein molecules that spontaneously fold into complicated shapes, wander off to various corners of the cell, and then spontaneously perform their complicated functions: metabolic reactions, holding things together, filtering ions, replicating DNA, and so on. It is mind-boggling, and a little magical.

But it all had to come from somewhere, and I ran across a recent paper that takes up the issue. This is from a creationism house journal, BIO-Complexity, put out by the Discovery Institute. They have their own peers, thus this is darn well a peer-reviewed journal! Accompanying it is an honest-to-goodness experimental article, where the experimenters strive not to observe a phenomenon that they theorize doesn't happen... and succeed!

Alright, sarcasm aside, the article, "The case against a Darwinian origin of protein folds", by Douglas Axe, is well-written and mildly interesting, though I take a highly critical attitude. It is a review- no new theories are proposed, let alone tested. He lays out (using the royal "we" throughout) the difficulty of protein domains arising from nothing, and claims to make a strong case that this was impossible within the Darwinian paradigm (presumably including early chemical evolution in the broader Darwinian theory).

Domains are the structural units of proteins- the smallest portions that fold by themselves and sometimes function by themselves in doing whatever, like catalyzing a reaction. Typically, proteins are composed of several domains, whose connection is critical in, for example, activating the catalytic activity of one domain in response to binding a regulatory chemical by another. The definition can be vague, since very small proteins (called peptides) may have important functions (like hormones) despite being unable to maintain a coherent fold/shape on their own. One might say they are protean! And there are large proteins that amount to very large domains, not easily broken down conceptually into independent folding / functioning sub-domains.

Axe spends quite a bit of time using average protein size statistics to marvel at the improbability of any protein-sized, or even domain-sized, unit of protein sequence arising de novo. If we have 20 amino acids, and the average protein is 300 amino acids long, that amounts to roughly 1E390 possible combinations. Finding the one of these that is a modern protein is like, well, it is simply impossible, there being only 1E150 particles in the entire universe, 4E20 milliseconds since the big bang, etc. He likens it to finding a gemstone in a Sahara desert to the Nth power, and other elaborate comparisons.

Even if one cuts the search space down to the size of a domain (average modern size ~100 amino acids), the numbers remain astronomical, though Axe does not go into this correction in detail. But obviously, the origin of proteins at the dawn of life has never been hypothesized to involve the sudden appearance of 300 or even 100 amino acid-long enzymes for oxidative phosphorylation. This is a straw man from top to bottom.
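The sequence-space figures quoted above are easy to check for oneself. Here is a minimal Python sketch, assuming a uniform alphabet of 20 amino acids and the round lengths used in this post (300 for an average protein, 100 for an average domain). The function names are my own, purely for illustration. It works in logarithms, since the raw numbers are far too large for ordinary floating point.

```python
import math

ALPHABET_SIZE = 20  # the 20 standard amino acids


def log10_sequences(length, alphabet=ALPHABET_SIZE):
    """Base-10 logarithm of the number of possible chains of this length."""
    return length * math.log10(alphabet)


def scientific(length, alphabet=ALPHABET_SIZE):
    """Format alphabet**length in the 'xEy' shorthand used in this post."""
    exponent = log10_sequences(length, alphabet)
    whole = math.floor(exponent)
    mantissa = 10 ** (exponent - whole)
    return f"{mantissa:.1f}E{whole}"


# Round lengths taken from the text: an average domain and an average protein.
for length in (100, 300):
    print(f"{length} residues: about {scientific(length)} possible sequences")
```

So a 300-residue chain has on the order of 1E390 possible sequences, and a 100-residue domain on the order of 1E130. Still astronomical either way, which is exactly why nobody proposes that evolution searched that space blindly.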

Why are he and his community fixated on it? It seems to follow a model of a deity trying desperately to design our proteins from above, stabbing away over billions of years and uncounted quadrillions of organisms before getting it just right in the case of Homo sapiens. If one lays out the premises explicitly, they dissolve before one's eyes. But with the ID community, all this is implicit as a religio-political "wedge" project, not a serious intellectual endeavor.

Over in the actual scientific community, protein coding capacity is commonly assumed to have had extremely modest beginnings, as an extension of the RNA world, when RNA had the primary replicative and catalytic ability. This modest catalytic ability might then have been abetted by tiny peptides, painfully assembled by a set of primitive RNA enzymes, and then extended to slightly longer protein chains, which eventually and competitively, through their vastly superior chemical abilities, relegated RNA to what is now its mostly informational role. Indeed, the protein-translating ribosome remains a thoroughly RNA machine, using strings of mRNA as the template code, tRNA-mounted amino acids as the building blocks, and a catalytic core of rRNA for polymerization. This sort of gives the game away right there, if one cares to look.

The original genetic code is also likely to have had fewer than the 20 amino acids that are universal today. The messiness of this genetic code, with some amino acids encoded by only one of the 64 codons, and others encoded by six, indicates some late additions and jerry-rigging to the system. And since the code's establishment, more amino acids have come into use through chemical modifications, either before the amino acid is incorporated (selenocysteine), or afterwards (hypusine).
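The unevenness of the code can be tallied directly. Below is a small Python sketch that builds the standard genetic code (the familiar 64-codon table, written in the usual compact TCAG ordering) and counts how many codons specify each amino acid. The variable names are my own. Stop codons are marked with "*" and left out of the tally.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, in TCAG order for each of the
# three positions (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each three-letter codon to its one-letter amino acid.
codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Count codons per amino acid, ignoring the stop codons.
degeneracy = Counter(aa for aa in codon_table.values() if aa != "*")

# Group amino acids by how many codons encode them.
by_count = {}
for aa, n in sorted(degeneracy.items()):
    by_count.setdefault(n, []).append(aa)

for n in sorted(by_count):
    print(f"{n} codon(s): {' '.join(by_count[n])}")
```

Methionine (M) and tryptophan (W) get a single codon each, while leucine (L), arginine (R), and serine (S) get six apiece, with everything else falling in between.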

Axe never recognizes such realistic accounts of the primitive origin of proteins, however. He also assumes that successful proteins have to approach modern levels of efficiency, making any path from one folded form to another folded form (there are an estimated 2000 classified folds) impossible, since nothing in between is likely to have a well-honed function. Here again he fails to recognize the wider spectrum of hypotheses available. Many proteins have unstructured regions- floppy areas that may adopt structure only when binding some other partner, or adopt alternate structures under different conditions. Such alternate folding lies at the heart of Alzheimer's disease and prion diseases.

While one optimized folding structure is unlikely to turn into something completely different and coherent through direct evolutionary selection, there are many other resources for the emergence of novel domains, such as these floppy sections of working proteins, or random DNA segments that do not, for the moment, code for genes, or various fusions of working proteins (a fertile source of cancer), or duplicated coding genes with no selective constraints at the moment. Many accidents have happened to genomes over time.

Axe cites experiments showing that proteins can switch readily between quasi-stable folds, an important precursor to these innovations. But he dismisses such cases as not competitive in the modern Darwinian landscape. When a novel function is at issue, however, how primitive is too primitive? Some function is doubtless better than none, and that is how new functions (and structures) gain a toehold in the Darwinian paradigm. The starting point for any evolutionary optimization path is not an already-optimized functional protein, but one with any function at all, however glimmeringly small, compared to the lack of that function in competing organisms. This will often be an off-beat mutation of an existing protein conferring a novel, if weak, activity.

The idea that one sequence out of all 1E390 sequences is the one that evolution must find, and in a hurry, is fallacious in another way as well. All of phylogenetic analysis is based on the wide variation of sequences, to the point that functionally and structurally similar proteins may have no detectable similarity in their linear sequences. We have essentially no idea how big a swath of sequence space any function might require, even when it is optimized. Evolution certainly never samples all of it- that we can agree on.

But does it have to? No, it obviously does not. One of the wonders of molecular biology is that, as sequences were accumulated in databases, many of them turned out to be related to each other, elegantly recapitulating, to extraordinary depth and detail, the phylogenetic tree that Charles Darwin had first sketched out so tentatively. In addition to clearly tracking the divergence of species by the divergence of their homologous genes, this method also found countless deeper relationships- families of proteins that had diverged at ancient times from single ancestors through duplication, first sharing functions, but often diverging in function as well. Unfortunately, as indicated above, relationships between linear sequences go back only so far before becoming unrecognizable, despite being truly ancestral, so the full story of ancient protein domain diversification cannot be revealed in this way.

Axe notes that every organism harbors, in addition to critical genes that are highly conserved, a population of others with no detectable relationships. A bold hypothesis from his perspective would be that it is these proteins that are the most important, showing that god remains at work, creating new protein structures for critical cellular functions, as has been his habit through the ages.

Unfortunately, another hypothesis is quite a bit more likely. These proteins are, in point of fact, the least important ones of the organism, prone to rapid mutation and divergence to the point of unrecognizability. These, in turn, might be exactly the kinds of proteins that generate new structures, folds, and functions, if they can outrun complete inactivation through mutation, yielding up the novel folds that the author seems so perplexed by.

Indeed, I'd suggest that the known collection of protein folds is reasonably definitive and represents the limited number of ways that small domains can fold coherently. The vast remaining unexplored sequence space is unlikely to add much, just as it is unlikely to add new secondary structures to the venerable alpha helix and beta sheet, due to basic physico-chemical constraints.

If Axe and his peers were interested in doing a real service, they would help save the huge fund of bio-diversity (including protein diversity) we are squandering by the day, rather than pursuing faux-science whose philosophical destination would indicate that god might be happy to reverse our degradation of the natural world with a wave of his magic wand. So, no worries!