Saturday, June 15, 2024

The Quest for the Perfect Message, in E. coli

Translation efficiency has some weird rules, and a tortured history.

One would think we know everything there is to know about the workhorse of bacterial molecular biology, Escherichia coli. And that would be especially true for its technological applications, like the expression of engineered genes, which is at the very heart of molecular biology and much of biotechnology. Getting genes you put into E. coli expressed at high levels is critical for making drugs, and for making enough protein for structural and biochemical studies. For decades, the wisdom of the field was to design introduced genes using the codon adaptation index (CAI). This index measures how closely a gene's codons (the three-letter codes of the genetic code) match those used in highly expressed genes, which tend to correspond to tRNAs that are more abundant in the cell. So, for example, the amino acid leucine is encoded by six different codons, any of which can be chosen at a leucine position in the intended protein. In E. coli, however, CTG is used over ten times more frequently than CTA. Thus, even though they code for the same amino acid, one is more common, perhaps because its cognate tRNA is more common and more easily used during translation. This is basically a diffusion-based argument, that translation will be easier if the tRNA that carries the next amino acid is easier to find.
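To make the index concrete, here is a minimal sketch of how a CAI-style score is usually computed: each codon gets a weight equal to its usage divided by the usage of the most common synonymous codon in a reference set of highly expressed genes, and the score for a gene is the geometric mean of those weights. The reference counts below are invented for illustration (only leucine and lysine are included), and the names `REFERENCE_COUNTS`, `relative_adaptiveness`, and `cai` are my own, not from the paper.

```python
import math

# Hypothetical codon counts from a reference set of highly expressed genes.
# Illustrative numbers only, not measured E. coli values.
REFERENCE_COUNTS = {
    "L": {"CTG": 550, "CTC": 100, "CTT": 100, "TTA": 120, "TTG": 130, "CTA": 40},
    "K": {"AAA": 350, "AAG": 100},
}


def relative_adaptiveness(reference_counts):
    """Weight each codon by its count relative to the most-used synonymous codon."""
    weights = {}
    for codon_counts in reference_counts.values():
        top = max(codon_counts.values())
        for codon, count in codon_counts.items():
            weights[codon] = count / top
    return weights


def cai(sequence, weights):
    """Geometric mean of codon weights; codons missing from the table are skipped."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


weights = relative_adaptiveness(REFERENCE_COUNTS)
common = cai("CTGCTGAAA", weights)  # Leu-Leu-Lys with the most-used codons
rare = cai("CTACTAAAG", weights)    # same peptide with the rarer codons
print(common, rare)
```

With this toy table, the version built from the most-used codons scores at the top of the scale and the version built from rarer codons scores much lower, even though both encode the same three amino acids.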

A recent paper provides a remarkable review of this field. For one thing, it turns out that use of the CAI has virtually no effect on translation efficiency. Whether using rare or common codons, translation is equally efficient for introduced genes. Needless to say, this is quite surprising. It seems as though the role of common vs uncommon tRNAs/codons is more to manage the health of the cell by relieving bottlenecks to translation in a global sense and managing the free pool of ribosomes, rather than regulating the efficiency of translation of any particular mRNA message. tRNAs are highly abundant generally, so there are significant savings possible by managing their levels judiciously, and reducing investment in some versus others.

So what does affect the efficiency of translation? Some messages are better translated than others, after all. The authors point to a completely different mechanism, which is the melting stability of the first ten codons of the mRNA message. RNA can form hairpins and other secondary structures, and this can apparently strongly affect the ability of ribosomes to find initiation sites. While eukaryotic ribosomes scan in from the 5-prime cap of the mRNA, bacterial ribosomes bind directly to a sequence slightly upstream of the initiating AUG codon (the Shine-Dalgarno sequence). And this binding can be inhibited by mRNAs that are not neatly ironed out, but knotted up in hairpins and loops.

Ratio of occurrence of nucleotides in the third codon position of the first ten codons of high versus low expressing genes in E. coli. This was not run on native E. coli genes, but on a large panel of transgenes introduced from outside. The strong bias towards A at this position in high expressing genes shows a preference for initiating sequences to have weak secondary structure, allowing better ribosome access.
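The kind of tally behind a figure like this can be sketched in a few lines: count which base sits at the third position of each of the first ten codons, separately for high and low expressers, and take the ratio of the two frequencies. This is only an illustration, not the authors' pipeline. The two toy sequences are made up, the function names are hypothetical, and I assume DNA-alphabet coding sequences that begin with the start codon, which is skipped so that its fixed G does not distort the counts.

```python
from collections import Counter


def third_position_fractions(sequences, n_codons=10, skip_start=True):
    """Fraction of each base at codon position 3 within the first n_codons codons."""
    counts = Counter()
    offset = 3 if skip_start else 0  # assumption: sequences begin with the start codon
    for seq in sequences:
        window = seq[offset:offset + 3 * n_codons]
        for i in range(2, len(window), 3):
            counts[window[i]] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no codons to count")
    return {base: counts[base] / total for base in "ACGT"}


def enrichment(high, low, n_codons=10):
    """Ratio of third-position base frequencies, high vs low expressers.

    Returns None for a base that never occurs in the low-expressing set.
    """
    f_high = third_position_fractions(high, n_codons)
    f_low = third_position_fractions(low, n_codons)
    return {b: (f_high[b] / f_low[b] if f_low[b] > 0 else None) for b in "ACGT"}


# Made-up toy inputs; a real analysis would use a large panel of transgenes.
high_expressers = ["ATGAAAGCAACG"]
low_expressers = ["ATGAAGGCAACC"]
print(enrichment(high_expressers, low_expressers))
```

A ratio above one for A, as in the figure, would mean A-ending codons are over-represented near the start of the well-expressed genes.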


Use of A-rich sequences around the ribosomal initiation sites and the first ten codons, then, dramatically increases the translation efficiency of introduced genes (via the initiation efficiency), and provides a much more robust method to control their expression. But then the authors make another observation, which is that the bacteria themselves do not seem to use this mechanism for their own genes. In a massive analysis of data from other labs (below), there is actually a negative correlation between the quality of the initiation region (X-axis) and the abundance of the respective protein (Y-axis). Again, quite a surprising result, which the authors can only speculate about.

There is a negative correlation between the quality of the initiation region (X-axis), as shown above, and the native E. coli gene expression level (Y-axis). So these cells are not optimizing their translation at all in accordance with the findings above.

The picture that they paint is that highly expressed genes in E. coli benefit from consistent, smooth translation. This depends less on maximal initiation speed than on the holistic picture of translation. The CAI optimal codons (called translationally optimal in this paper, or TO) tend to be poor at initiation, but have good codon-anticodon pairing and thus low A content. So there are conflicting pressures at work, in basic chemical terms, where some codons are intrinsically good for initiation, and complementary ones are good for elongation. The obvious solution is to use the initiation-optimal codons for the first ten codons, and translationally optimal codons the rest of the way. But that is not what is found either. The authors claim that, for native proteins, lower levels of initiation are actually beneficial for smoother protein production, with less noise over time and from cell to cell.

Additionally, lower initiation rates preserve free ribosome levels globally, another important goal for the cell, enforced by evolutionary selection. The authors find, for instance, a correlation between low variability of initiation (low noise) and low initiation rate. This is a bit mystifying, since ribosomes should always be present in excess, and it is not immediately apparent why holdups to translation initiation would lend themselves to more even initiation. Perhaps the search process by which ribosomes find free mRNAs is inefficient, so that those with slower initiation sequences have a constant backlog of incoming, bound and poised ribosomes, while after they get past the initiation region, those ribosomes progress rapidly and rejoin the free pool. That would be one way of setting up a smooth production process, suitable for essential protein products, that is relatively insensitive to the free ribosome concentration and other variations in the cell.

Technologists trying to express some drug-associated protein in bacteria don't care about smoothness and noise, but just want to maximize production while not killing the cell (or at least before killing it). So all these subtle considerations that go into the evolution of the native gene complement of E. coli and its high or low expression levels don't apply. But for researchers trying to predict the expression level of a given natural gene, it is maddening, since it currently seems impossible to predict the expression level (via translation) of a gene from its sequence. It is one more case where modeling of what is going on inside cells is surprisingly difficult, even for a system we had thought we understood, in one of the simplest and best-studied bacteria. As researchers never tire of saying ... more research is needed.