Saturday, June 4, 2022

Cracking the Kinome Code

Attempts to figure out what causes phosphorylation events in our cells.

Continuing with biological codes, this week's topic is protein phosphorylation sites. Phosphate groups are negatively charged, so they have dramatic electrostatic as well as steric effects when attached to a protein. Though many other forms of protein modification are known, phosphorylation is an extremely common route of biological regulation in cells, a way to supplement binding, complex formation, and allosteric interactions between proteins for regulatory purposes. The human genome is estimated to encode 518 protein kinases, that is, proteins that phosphorylate other proteins. Because each one can have hundreds of substrates or targets, this large set gives rise to complex networks of regulation. It is called the "kinome", in analogy with the genome, microbiome, proteome, etc. Kinases are roughly divided between those that target tyrosines (Y) in substrate proteins, and those that target the chemically similar serine and threonine (S/T).

These are images of all proteins from rat neurons, spread out on a 2-dimensional gel, one dimension (vertical) by molecular weight, and the other dimension (horizontal) by isoelectric point. At top, the gel is stained for overall protein. At bottom, it is labeled for all the phosphotyrosines that exist, that is, all proteins that have been, under these conditions, phosphorylated by the minority of kinases that are tyrosine-targeting. There is clearly a lot going on.


The typical sequence of events is that some upstream signal, such as binding a hormone at the cell membrane, will turn on a kinase, which then phosphorylates a target protein. This will cause the target protein to interact with new partners, perhaps to be degraded, perhaps to be transported to the nucleus, or perhaps to phosphorylate yet other targets in an amplifying cascade of regulatory events. While not as fast as neuronal action, this regulation is much faster than the usual gene expression route, where a signal activates transcription of some gene, which is transcribed to mRNA, which is spliced and processed, and eventually translated to make the target proteins. Thus regulation by phosphorylation is critical for all sorts of rapid biological responses, like metabolic tuning and hormonal actions.

One key problem in the field is mapping exactly what each kinase does, and which kinase is responsible for each phosphorylated target site. Massively parallel methods are now able to identify all the phosphorylated proteins in a cell (after killing it, naturally). But knowing which process and individual kinase were responsible for each of those events... we are much farther away from mastering that level of knowledge.

Similarly to the transcription regulator problem a couple of weeks ago, the sites that kinases act at are characterized by motifs that can be illustrated by a diagram of colored probabilities (below). In this case, the kinase (AKT, one of the most influential in the cell) is a serine/threonine-targeting enzyme, so the center of its site must have one of those two amino acids. Then there are a couple of arginines (R) among the -3, -4, and -5 positions, a hydrophobic amino acid at +1, and otherwise there are few restrictions.

A probabilistic view of the AKT kinase target sequence, where this serine/threonine kinase attaches phosphate groups on other proteins that it regulates.
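
To make the looseness of such a motif concrete, here is a minimal sketch of a rule-based scan for AKT-like sites. It is only an illustration of the description above, not a published consensus or anyone's actual predictor: the function name, the set of residues counted as hydrophobic, and the "at least two arginines" threshold are all assumptions.

```python
# Assumption: one common, but not definitive, grouping of hydrophobic residues.
HYDROPHOBIC = set("AVLIMFWY")


def akt_like_sites(seq, min_arginines=2):
    """Return 0-based positions of S/T residues that fit the loose motif.

    Rules (illustrative assumptions drawn from the text above):
      - the center residue is S or T
      - at least `min_arginines` of positions -5, -4, -3 are R
      - position +1 is hydrophobic
    """
    hits = []
    for i, aa in enumerate(seq):
        if aa not in "ST":
            continue
        # Skip sites too close to either end to have a full window.
        if i < 5 or i + 1 >= len(seq):
            continue
        upstream = seq[i - 5:i - 2]  # residues at -5, -4, -3
        if upstream.count("R") >= min_arginines and seq[i + 1] in HYDROPHOBIC:
            hits.append(i)
    return hits


# A short example peptide: only the serine at index 8 satisfies all rules.
print(akt_like_sites("MSGRPRTTSFAESC"))  # → [8]
```

A hard yes/no rule like this throws away the probabilities shown in the diagram. A fuller treatment would score each position by how often each amino acid appears there, which is exactly why the field leans on probabilistic methods.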

Like in the transcription regulator case, the targeting code is pretty loose and degenerate. That drives researchers to probabilistic methods to wring as much mapping as they can out of current data, which was the topic of a recent paper. The title is "Accurate, high-coverage assignment of in vivo protein kinases to phosphosites from in vitro phosphoproteomic specificity data". But "accurate" is a relative word. The graph below plots recall and precision, which are standard measures of classifier performance, computed on reserved portions of the data that were held out for testing. It shows a maximum of ~65% and 70%, respectively. That means that about 65% of the true kinase-site assignments are successfully recovered from the underlying data, and 70% of the assignments the method makes are actually true. That may be best of class, but one wouldn't want to stake one's life, or even one's drug development program, on it.

Measures of accuracy of various methods of guessing which kinase is responsible for a given phosphorylated target site. Precision and recall are computed against reserved (non-training) test data. The current author's method is IV-KAPhE, in yellow.
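
For readers who want the two terms pinned down, here is a small sketch of how precision and recall are computed from a set of predicted kinase-site pairs and a set of known true pairs. The function name and the toy numbers are made up for illustration; they are chosen only so the result lands on the 70% and 65% figures quoted above, and are not taken from the paper.

```python
def precision_recall(predicted, truth):
    """Compute (precision, recall) for predicted vs. true assignments.

    precision = fraction of predictions that are correct
    recall    = fraction of true assignments that were predicted
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy data: integers stand in for (kinase, site) pairs.
truth = range(140)          # 140 true assignments in the held-out set
predicted = range(49, 179)  # 130 predictions, 91 of which are true
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
# → precision=0.70 recall=0.65
```

In this toy case, 39 of the 130 predictions are wrong and 49 of the 140 true assignments are missed, which gives a feel for what numbers like these mean in practice.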


This researcher set up an extensive pipeline to add together numerous sources of information. First is PhosphoSite, a database of kinase target sites and other interactions gathered both by hand from the literature and from private mass-scale data sources. Then he added co-expression data, which can hint that a kinase and target are present in the same cell, and are thus candidates for interaction. Then came semantic data from general gene classifications, which can hint that a kinase and target work in the same process, and are thus again likely to act in concert. With a few additional databases, he could, in classic Bayesian fashion, assemble a new resource that outperforms any of the individual ones in predicting which kinase is responsible for any given phosphoprotein that one has dug up in some mass-spec experiment. All that said, the method only covers kinases about which something is known, which currently runs to 349 kinases, about two-thirds of the 518 mentioned above. So both in coverage and accuracy, we have a great deal to learn.
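
The general idea of stacking up evidence "in classic Bayesian fashion" can be sketched in a few lines. What follows is a generic naive-Bayes combination, not the paper's actual pipeline: it assumes each evidence source is independent and can be summarized as a likelihood ratio (how much more likely that evidence is if the kinase truly targets the site than if it does not). The function name, the prior, and the ratios are all invented for illustration.

```python
import math


def posterior_probability(prior, likelihood_ratios):
    """Combine a prior with independent likelihood ratios (naive Bayes).

    Works in log-odds space: posterior odds = prior odds x product of ratios.
    Assumes 0 < prior < 1 and every ratio > 0.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios:
        log_odds += math.log(ratio)
    # Convert log-odds back to a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical numbers: a 1% prior that a given kinase hits a given site,
# then evidence from a motif match (8x), co-expression (3x), and a shared
# functional annotation (2x).
p = posterior_probability(0.01, [8.0, 3.0, 2.0])
print(round(p, 3))  # → 0.327
```

No single source here is convincing on its own, but together they move a 1% guess to roughly one in three. That is the appeal of the approach, and also its limit: the result is only as good as the independence assumption and the underlying databases.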


  • Silence is golden, and healthy.
  • Please throw out your halogen lamps.
  • Watch Lucy pull the football away again.
  • Allerdings, a lesson in German.