Saturday, November 2, 2019

To Model is to Know

Getting signal out of the noise of living systems, by network modeling.

Biology is complex. That is as true on the molecular level as it is on the organismal and ecological levels. So despite all the physics envy, even something as elegant as the structure of DNA rapidly gets wound up in innumerable complexities as it meets the real world and needs reading, winding, cutting, packaging, synapsing, recombining, repairing, etc. This is particularly true of interaction networks- the pathways, typically of protein-protein interactions, that regulate what a cell is and does.

An article from several years ago discussed an interesting and influential way to learn about these interactions. The advent of "big data" in biology allowed us to do things like tabulate all the interactions of individual proteins in a cell, or sample the abundance of every transcript or protein in the cells of a tissue. But it turned out that these alone did not lead directly to the elucidation of how things work. Where the genome was a parts-ordering list, offering one catalog number and description for each part, these experiments provided the actual parts list- how many screws, how many manifolds, how many fans, etc., and occasionally, what plugs into what else. These were all big steps ahead, but hardly enough to figure out how complicated machinery works. We still lack the blueprint, ideally one that is animated to show how the machine runs. And that is never going to happen unless we build it ourselves. We need to build a model, and we need more information to do so.

These authors added one more dimension to the equation- time, via engineered perturbations to the system. Geneticists and biochemists have been doing this forever (aka experiments- mutations, reconstitutions, titrations), including in gene expression panels and other omics data collections. But employing a perturbation method in a systematic and informative way at the big data level in biology remains a significant advance. The problem is called network inference- figuring out how a complex system works (which is to say, making a model), given some, but not all, of the important information about its composition and activities. And the problem is difficult because biological networks are frequently very large and can be modeled in an astronomical number of ways, given that we have scanty information about key internal aspects. For instance, even if many individual interactions are known from experimental data, not all are known, many key conditions (like tissue, cell type, phase of the cell cycle, local nutrient conditions, etc.) are unknown, and quantitative data is very rare.
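To get a feel for just how astronomical, here is a back-of-the-envelope count of my own (not from the paper): if each ordered pair of n proteins can be connected by activation, inhibition, or nothing at all, the number of candidate signed, directed networks is 3 raised to n(n-1).

```python
# Back-of-the-envelope count of possible signed, directed networks
# among n proteins, assuming each ordered pair (i, j), i != j, can be
# linked by activation, inhibition, or nothing (3 choices per pair).
# An illustration of scale, not the paper's formal model space.

def network_count(n):
    """Number of signed directed networks on n nodes (no self-loops)."""
    pairs = n * (n - 1)          # ordered pairs, excluding self-loops
    return 3 ** pairs

for n in (5, 10, 20):
    count = network_count(n)
    digits = len(str(count))
    print(f"n = {n:2d} proteins -> 3^{n * (n - 1)} possible networks "
          f"(~10^{digits - 1})")
```

Twenty proteins already allow roughly 10^181 possible wirings, which is why brute-force enumeration is hopeless and clever search is the whole game.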

One way to get around some of these unknowns is to poke the system somewhere and track what happens thereafter. It is a lot like epistasis analysis in genetics: if you, say, mutate a set of genes acting in a linear pathway, the ones towards the end cannot be cured by supplying chemical intermediates that are made upstream- the later genes are "epistatic" to those earlier in the process. Such logic needs to be expanded exponentially to address inference over realistic biological networks, and gets the authors into some abstruse statistics and mathematics. Their goal is to simplify the modeling and search problem enough to make it computationally tractable, while still exploring the most promising parts of their parameter space to come up with reasonably accurate models. They also seek to iterate- to bring in new perturbation information and use it to update the model. The first step is to discretize the parameters, rather than exploring continuous space. The second is to use preliminary calculations to get near-optimal values for the model parameters, and thereafter explore those approximated local spaces, rather than all possible space.
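As a toy illustration of that discretize-then-refine strategy (my own sketch, not the authors' belief-propagation machinery), here a single interaction-strength parameter is fit by first scanning a coarse grid, then searching a fine grid only around the coarse winner:

```python
# Toy illustration of "discretize, then explore locally": instead of
# searching a continuous parameter space, score a coarse grid of
# candidate values, then refine only around the best coarse guess.
# A sketch of the general idea, not the paper's algorithm.

import numpy as np

rng = np.random.default_rng(0)

# Pretend ground truth: response = w_true * stimulus + noise
w_true = 1.7
stimulus = rng.uniform(-1, 1, size=50)
response = w_true * stimulus + rng.normal(scale=0.1, size=50)

def score(w):
    """Sum-of-squares error of a candidate interaction strength w."""
    return np.sum((response - w * stimulus) ** 2)

# Step 1: coarse discretization of the parameter (steps of 1.0).
coarse_grid = np.arange(-3, 4, 1.0)
w_coarse = min(coarse_grid, key=score)

# Step 2: fine search only in the neighborhood of the coarse optimum,
# rather than over the whole continuous range.
fine_grid = np.arange(w_coarse - 1, w_coarse + 1, 0.05)
w_fine = min(fine_grid, key=score)

print(f"coarse estimate: {w_coarse:.2f}, refined: {w_fine:.2f}, "
      f"true: {w_true}")
```

The savings compound: with p parameters, a full grid costs (grid size)^p evaluations, while coarse-then-local search keeps the exponent's base small.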

Speed of this article's method (teal) compared with a conventional method for the same task (green).

All this is then repeated for experimental cases, where some perturbation has been introduced into the real system, generating new data for input to the model. Their improved speed of calculation is critical here, enabling as many iterations and alterations as needed to update and refine the model. If the model is correct, it will respond accurately to the alteration, giving the output that is also observed in real life. Such a model then makes it possible to perform virtual perturbations, such as simulating the effect of a drug on the model, which then predicts entirely in silico what the effects will be on the biological network.
"It is also useful, as an exercise, to evaluate the overall performance of the BP algorithm on data sets engineered from completely known networks. With such toy datasets we achieve the following: (i) demonstrate that BP converges quickly and correctly; (ii) compare BP network models to a known data-generating network; and (iii) evaluate performance in biologically realistic conditions of noisy data from sparse perturbations."
...
"Each model is then a set of differential equations describing the behavior of the system in response to perturbations."

The upshot of all this is models that are roughly correct, and influential on later work. The figure below shows (A) the smattering of false positive and missed (false negative) interactions, but (B) accounts for most of this error as shortcuts of various kinds- inferred regulation that is globally correct, but may be missing a step here or there. So they suggest that the scoring is actually better than the roughly 50-70% correct rate that they report.
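Scoring of this kind usually boils down to set comparisons between inferred and known edges. A generic sketch (with invented protein names and edges, not the paper's benchmark) shows how a "shortcut" inflates the error counts even when the overall regulatory logic is right:

```python
# Generic sketch of scoring an inferred network against a known one:
# each directed edge is a prediction, so counting true/false positives
# and false negatives gives precision and recall. The edge sets below
# are invented for illustration.

true_edges = {("EGFR", "MEK"), ("MEK", "ERK"), ("ERK", "MYC")}
inferred   = {("EGFR", "MEK"), ("EGFR", "ERK"),  # EGFR->ERK skips MEK:
              ("ERK", "MYC")}                    # a "shortcut" error

tp = len(inferred & true_edges)    # correctly recovered edges
fp = len(inferred - true_edges)    # spurious (false positive) edges
fn = len(true_edges - inferred)    # missed (false negative) edges

precision = tp / (tp + fp)
recall    = tp / (tp + fn)
print(f"TP={tp} FP={fp} FN={fn}  "
      f"precision={precision:.2f} recall={recall:.2f}")
```

Here the shortcut edge costs one false positive and one false negative, dragging both scores down to 0.67 even though the inferred regulation is globally right- the authors' point exactly.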

An example pair of interconnecting pathways inferred from experimental protein abundance data and perturbed abundance data, with the protein molecules as nodes and their interactions as arrows. Where would a drug have the most effect if it inhibited one of these proteins?

They offer one pathway as an example, with an inferred pattern of activity (above), and a few predictions about what proteins would be good drug targets. For example, PLK1 in this diagram is a key node, with dramatic effects if perturbed. This came up automatically from their analysis, but PLK1 happens to already be an anticancer drug target, with two drugs under development. Any biologist in the field could have told them about this target, but they went ahead with proof-of-principle experiments to show that yes, indeed, treating their RAF-inhibitor-resistant melanoma cells (the subject of the modeling) with an experimental anti-PLK1 drug results in dramatic cell killing at quite low concentrations. They had used other drugs as perturbation agents in the physical experiments behind the model, but not this one, so at least in their terms this is a novel finding, arrived at through their modeling work.
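Continuing the toy ODE sketch from above (and again purely illustrative, not the authors' procedure), picking out a "key node" like this can be mimicked by inhibiting each node in turn and ranking the nodes by how far the network's steady state moves:

```python
# Ranking candidate drug targets in the toy model above: inhibit each
# node in turn and measure how much the network's steady state shifts.
# The node with the largest effect is the analog of a "key node" such
# as PLK1 in the paper's pathway. Reuses W, basal, baseline, and
# steady_state from the earlier ODE sketch; illustrative only.

import numpy as np

effects = {}
for i in range(3):
    drug = np.zeros(3)
    drug[i] = -2.0                       # inhibit node i
    shifted = steady_state(u=basal + drug)
    # Total displacement of the whole network from baseline:
    effects[i] = float(np.sum(np.abs(shifted - baseline)))

for node, effect in sorted(effects.items(), key=lambda kv: -kv[1]):
    print(f"inhibit node {node}: network shift = {effect:.2f}")
```

In this toy wiring the most upstream node wins, since inhibiting it displaces everything downstream- the same logic by which a highly connected node in an inferred pathway stands out as a high-leverage drug target.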

Given that these authors were working from scratch, not starting with manually curated pathway models that incorporate a lot of known individual interactions, this is impressive work (and they note parenthetically that using such curated data would be a big help to their modeling). Having computationally tractable ways to generate and refine large molecular networks based on typical experimentation is a recipe for advancement in the field- I hope these tools become more popular.



  • Ruminations on PGE. Minimally, the PUC needs to be publicly elected. Maximally, the state needs to take over PGE entirely and take responsibility.
  • Study on the effect of automation on labor power ... which is minor.
  • What has happened to the Supreme Court?
  • Will bribery help?