Saturday, January 7, 2023

A New Way of Doing Biology

Structure prediction of proteins is now so good that computers can do a lot of the work of molecular biology.

There are several royal roads to knowledge in molecular biology. First, and most traditional, is purification and reconstitution of biological molecules and the processes they carry out, in the test tube. Another is genetics, where mutational defects, observed in whole-body phenotypes or individually reconstituted molecules, can tell us about what those gene products do. Over the years, genetic mapping and genomic sequencing allowed genetic mutations to be mapped to precise locations, making them increasingly informative. Likewise, reverse genetics became possible, where mutations are not generated randomly by chemical or radiation treatment of organisms, but are precisely engineered to find out what a chosen mutation in a chosen molecule could reveal. Lastly, structural biology contributed the essential ground truth of biology, showing how detailed atomic interactions and conformations lead to the observations made at higher levels, such as metabolic pathways, cellular events, and diseases. The paradigmatic example is DNA, whose structure immediately illuminated its role in genetic coding and inheritance.

Now the protein structure problem has been largely solved by the newest generations of artificial intelligence, allowing protein sequences to be confidently modeled into the three-dimensional structures they adopt when mature. A recent paper makes it clear that this represents not just a convenience for those interested in particular molecular structures, but a revolutionary new way to do biology, using computers to dig up the partners that participate in biological processes. The model system these authors chose to demonstrate the method is the bacterial protein export process, which was briefly discussed in a recent post. They are able to find and portray this multi-step process in astonishing detail by relying on a lot of past research, including existing structures, and on the new AI searching and structure generation methods, all without dipping their toes into an actual lab.

The structure revolution has had two ingredients. First is a large corpus of already-solved structures of proteins of all kinds, together with oceans of sequence data of related proteins from all sorts of organisms, which provide a library of variations on each structural theme. Second are the modern neural networks from Google and other institutions that have solved so many other data-intensive problems, like language translation and image matching / searching. They are perfectly suited to this problem of "this thing is like something else, but not identical". The result was the AlphaFold program, which has pretty much solved the problem of determining the 3D structure of novel protein sequences.

"We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14), demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods."

The current authors realized that the determination of protein structures is not very different from the determination of complex structures, that is, the structure of interfaces and combinations between different proteins. Many already-solved structures are complexes of several proteins, and more fundamentally, the way two proteins interact is pretty much the same as the way that a protein folds on itself: the same kinds of detailed secondary motif and atomic complementarity take place. So they used the exact AlphaFold core to create AF2Complex, which searches specifically through a corpus of protein sequences for those that interact in real life.

This turned out to be a very successful project (though a supercomputer was required), and they now demonstrate it for the relatively simple case of bacterial protein export. The corpus they are working with is about 1500 E. coli periplasmic and membrane proteins. They proceed step by step, asking what interacts with the first protein in the sequence, then what interacts with the next one, and so on, until they hit the exporter on the outer membrane. While this sequence has been heavily studied and several structures were already known, they reveal several new structures and interactions as they go along.
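
To make that step-by-step search concrete, here is a minimal Python sketch. It is not the authors' AF2Complex code: the function names, the cutoff, and the toy scores are all assumptions invented for illustration, and in a real run the lookup table would be replaced by an expensive structure prediction for every candidate pair. The authors also chose which hit to follow at each step by hand, whereas this sketch simply takes the top-scoring one.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical interface scores in [0, 1]. The pairs are ones mentioned in
# this post; the numbers are made up and are NOT results from the paper.
TOY_SCORES: Dict[FrozenSet[str], float] = {
    frozenset({"SecY", "PpiD"}): 0.80,
    frozenset({"PpiD", "YfgM"}): 0.90,
    frozenset({"PpiD", "DsbA"}): 0.60,
    frozenset({"SurA", "BamA"}): 0.70,
    frozenset({"BamA", "BepA"}): 0.65,
}


def toy_score(a: str, b: str) -> float:
    """Stand-in for a structure-based interaction predictor (assumption)."""
    return TOY_SCORES.get(frozenset({a, b}), 0.0)


def walk_pathway(
    start: str,
    pool: List[str],
    score: Callable[[str, str], float],
    cutoff: float = 0.5,
    max_steps: int = 10,
) -> List[str]:
    """Greedily follow the best-scoring unvisited partner, starting at `start`."""
    path = [start]
    current = start
    for _ in range(max_steps):
        candidates = [p for p in pool if p not in path]
        if not candidates:
            break
        best = max(candidates, key=lambda p: score(current, p))
        if score(current, best) < cutoff:
            break  # no confident partner left, so the walk ends here
        path.append(best)
        current = best
    return path


POOL = ["SecY", "PpiD", "YfgM", "DsbA", "SurA", "BamA", "BepA"]
print(walk_pathway("SurA", POOL, toy_score))
```

The cutoff is what stops the walk: once no remaining protein scores confidently against the current one, the chain ends. A greedy walk like this follows only one partner per step, which is why a human choosing among the ranked hits, as the authors did, can recover branches that the sketch would miss.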

Getting proteins from inside the cell to outside is quite complicated, since they have to traverse two membranes and the intermembrane space (the periplasm), all without getting fouled up or misdirected. This is done by an organized sequence of chaperone and transport proteins that hand the new proteins off to each other. Proteins are recognized by this machinery by virtue of sequence-encoded signals, typically at their front/leading ends. This "export signal" is, in some instances, recognized right as it comes out of the ribosome and captured by the SecA/B/E/Y/G machinery at the inner bacterial membrane. Most exported proteins, however, are recognized not right away, but only after they are fully synthesized.

The steps of bacterial protein export to the outer membrane, with the inner membrane (IM) below and the outer membrane (OM) above. The target protein being transported (OmpA) is the yellow thread, and the various exporting machines are shown in other colors, either in cartoon form or in ribbon structures from the authors' computer predictions. Notably, SurA is the main chaperone that carries OmpA in partially unfolded form across the periplasm to the outer membrane.

SecA is the ATP-using pump that forces the new protein through the SecY channel, which has several other accessory partners. SecB, for example, is thought to be mostly responsible for recognizing the export signal on the target protein. The authors start with a couple of accessory chaperones, PpiD and YfgM, which were strongly suspected to be part of the SecA/B/E/Y/G complex, and which their program easily identifies as interacting with each other, and for which it gives new structures. PpiD is an important chaperone, a proline isomerase, that helps proline amino acids twist around, which they do not naturally do, helping the exporting proteins fold correctly as they emerge. It also interacts with SecY, providing chaperone assistance (that is, helping proteins fold correctly) right as proteins pass out of SecY and into the periplasm. The second step the authors take is to ask what interacts with PpiD, and they find DsbA, along with its structure. This is a disulfide isomerase, which performs another vital function: shuffling the cysteine bonds of proteins coming into the periplasmic space, which is less reducing than the cytoplasm and so allows stable cysteine bonds to form. Helping those bonds form at the right places is the role of DsbA, which transiently docks right at the exit port from SecY. This is one more essential chaperone-like function needed for relatively complicated secreted proteins.

The authors' (computers) generate structures for the interactions of the Sec complex with PpiD, YfgM, and the disulfide isomerase DsbA, illuminating their interactions and respective roles. DsbA helps refold proteins right when they come out of the transporter pore, from the cytoplasm.

Once the target protein has all been pumped through the SecY complex pore, it sticks to PpiD, which does its thing and then dissociates, allowing two other proteins to approach: first the signal peptidase LepB, which cleaves off the export signal, and then SurA, the transporting chaperone that wraps the new protein around itself for the trip across the periplasm. Specific complex structures and contacts are revealed by the authors for all these interactions. Proteins destined for the outer membrane are characterized by a high proportion of hydrophobic amino acids, some of which seem to be specifically recognized by SurA, to distinguish them from other proteins whose destination is simply to swim around in the periplasm, such as the DsbA protein mentioned above.

The authors' (computers) spit out a ranking of predicted interactions using SurA as a query, and find SurA itself as one interacting protein (it forms a dimer), along with BamA, which is the central part of the outer membrane transporting pore. Nothing was said about the other high-scoring interacting proteins identified, which may not have been of immediate interest.
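
A ranking of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' pipeline: the function names and the 0.5 cutoff are assumptions, and the scores are invented placeholders rather than values from the paper. Note that the query itself is left in the candidate pool, which is how a self-interaction (a dimer) can show up among the hits.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical scores, invented for illustration; NOT values from the paper.
TOY_INTERFACE_SCORES: Dict[Tuple[str, ...], float] = {
    ("BamA", "SurA"): 0.70,
    ("SurA", "SurA"): 0.65,  # query paired with itself, i.e. a homodimer
}


def lookup_score(a: str, b: str) -> float:
    """Stand-in for a structure-based predictor; unknown pairs score low."""
    return TOY_INTERFACE_SCORES.get(tuple(sorted((a, b))), 0.05)


def rank_partners(
    query: str,
    pool: Iterable[str],
    score: Callable[[str, str], float],
    cutoff: float = 0.5,
) -> List[Tuple[str, float]]:
    """Score every protein in `pool` against `query`, best first."""
    scored = [(name, score(query, name)) for name in pool]
    confident = [pair for pair in scored if pair[1] >= cutoff]
    return sorted(confident, key=lambda pair: pair[1], reverse=True)


hits = rank_partners("SurA", ["SurA", "BamA", "DsbA", "YfgM"], lookup_score)
for name, value in hits:
    print(f"{name}\t{value:.2f}")
```

In a real screen the pool would be the full set of roughly 1500 sequences, and each score would cost a full structure prediction, which is where the supercomputer comes in.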

"In the presence of SurA, the periplasmic domain [of transported target protein OmpA] maintains the same fold, but remarkably, the non-native β-barrel region completely unravels and wraps around SurA ... the SurA/OmpA models appear physical and provide a hypothetical basis for how the chaperone SurA could prevent a polypeptide chain from aggregating and present an unfolded polypeptide to BAM for its final assembly."

At the other end of the journey, at the outer membrane, there is another channel protein called BamA, where SurA docks, as was also found by the authors' interaction-hunting program. BamA is part of a large channel complex that evidently receives many other proteins via its other periplasmic-facing subunits, BamB, C, and D. The authors went on to do a search for proteins that interact with BamA, finding BepA, a previously unsuspected partner, which, by their model, wedges itself in between BamC and BamB. BepA turns out to have a crucial function in quality control. Conduction of target proteins through the Bam complex seems to be powered only by diffusion, not by ATP or ion gradients, so things can get fouled up and stuck pretty easily. BepA is a protease, and appears, from its structure, to have a finger that gets flipped and turns the protease on when a protein transiting through the pore goes awry / sideways.


The authors' (computers) provide structures of the outer membrane Bam complex, where SurA binds with its cargo. The cargo, being unstructured, is not shown here, but some of the detailed interface between SurA and BamA is shown at bottom left. The beta-barrel of BamA provides the obvious route out of the cell, or in some cases sideways into the membrane.

While filling in some new details of the outer membrane protein export system is interesting, what was really exciting about this paper was the ease with which this new way of doing biology went forth. Intimate physical interactions among proteins and other molecules are absolutely central to molecular biology, as this example illustrates. To have a new method that not only reveals such interactions in a reliable way, from sequences of novel proteins, but also presents structurally detailed views of them, is astonishing. Extending this to bigger genomes and larger collections of targets, versus the relatively small set of about 1500 periplasmic and membrane proteins tested here, remains a challenge, but doubtless one that more effort and more computers will be able to solve.