
Saturday, January 16, 2021

Hunting for Lost Height

Progress in sequencing technologies and genetic analysis nails down the genetic sources of variability in the trait of human height.

PBS has an excellent program about eugenics- the push by some scientists and social reformers in the early 1900's to fix social problems by fixing problematic people. Both the science and the social ethics fell into disrepute, however, and were completely done in by the Nazis' version. While the stigma and ethical taint of eugenics remain, human genetics has advanced immeasurably, putting the science on much firmer footing. One example is a recent announcement that one research group has found all the sources of genetic variation that relate to human height.

Height is obviously genetic, and twin studies show that it is 80% heritable. There is an interesting literature on environmental effects on height, to the extent that whole populations of malnourished immigrants find that, after they move to the US, their children grow substantially taller. So genetic influences are fully apparent (as indicated by the 80% figure) only in the absence of overriding environmental constraints.
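The 80% figure comes from comparing identical and fraternal twins. A classic back-of-the-envelope estimator is Falconer's formula, which doubles the difference between the two twin-type trait correlations. A minimal sketch, with illustrative correlations not taken from any particular study:

```python
# Falconer's formula: heritability estimated from twin correlations,
# h2 = 2 * (r_mz - r_dz), where r_mz is the trait correlation between
# identical (monozygotic) twins and r_dz that between fraternal
# (dizygotic) twins. The numbers below are illustrative only.

def falconer_h2(r_mz: float, r_dz: float) -> float:
    """Estimate heritability from twin trait correlations."""
    return 2.0 * (r_mz - r_dz)

print(f"estimated heritability: {falconer_h2(0.9, 0.5):.2f}")  # 0.80
```

With an identical-twin correlation of 0.9 and a fraternal-twin correlation of 0.5, the estimate comes out to the 80% cited above.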

The first attempts to find the genetic loci associated with height took off after the human genome was sequenced, in the form of genome-wide association studies (GWAS). In this era it was easier to probe short oligonucleotide sequences against sampled genomic DNA than to sequence the whole genomes of many people. So a typical GWAS took a large sample of about 500,000 variant locations across the human genome, and tested which of those variants each person in a set of human populations carried. A massive correlation analysis was then done against the traits of those people, say their height, weight, or health, to see which markers (i.e. variants) correlated with the trait of interest.
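The logic of that correlation analysis can be sketched in a few lines. The marker count, effect size, and sample size below are toy numbers for illustration, not those of any actual study:

```python
import random

random.seed(1)

# Toy GWAS: 500 people genotyped at 100 biallelic markers (0, 1, or 2
# copies of the minor allele); marker 42 truly influences the trait.
n_people, n_markers, causal = 500, 100, 42
genotypes = [[random.choice([0, 1, 2]) for _ in range(n_markers)]
             for _ in range(n_people)]
trait = [person[causal] * 2.0 + random.gauss(0, 1) for person in genotypes]

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Correlate every marker with the trait, as a GWAS does at scale.
r = [correlation([person[m] for person in genotypes], trait)
     for m in range(n_markers)]
best = max(range(n_markers), key=lambda m: abs(r[m]))
print("strongest marker:", best)
```

With a genuinely causal marker and a few hundred people, the causal marker's correlation with the trait stands far above the background noise of the other markers.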

Such studies only found about 5% to 25% of the heritability of height, perplexing researchers. They were sampling the entire genome, if sparsely. The 500,000 markers corresponded to about one every 6,000 base pairs, so they should be near enough to most genes that have significant effects on the trait of interest. And since most human genome regions are inherited as relatively large blocks (haplotypes), due to our near-clonal genetic history, the idea was that sampling a sparse set of markers was sufficient to detect any significant effect from any gene. Later work could then focus in on particular regions to find the actual genes and variations responsible for the trait in question.

But there was a big problem: the variants selected to go into the marker pool came from a very small population of a few hundred people. Recall that sequencing whole genomes was very expensive at this time, so researchers were trying to wring as much analysis as possible out of as little data as possible. By 2018, GWAS-type studies were still finding genetic causes for only about 25% of the variability in height, clearly short of what was known from simple genetic analysis of the trait. Not only that, but the number of genes implicated was rising into the thousands, each with infinitesimal effect. The first 40 genes found in these studies accounted for only about 5% of the variation in height.

The large effect of rare alleles. MAF (minor allele frequency) in the human population, plotted against the trait variance it accounts for. The color code (LD, or linkage disequilibrium) indicates selection against the locus (if high), along with other predicted characteristics of the variant. It is very rare protein-altering variants (blue) that have the strongest individual effects.

The current work (review, review) takes a new approach, by virtue of new technologies. The researchers sequenced the full genomes of over 20,000 people, finding a plethora of rare alleles that had not been included in the original marker studies- alleles that have significant effects on height. They find variations that account for 79% of height heritability, which is to say, all of it. It turns out that the whole premise of the GWAS approach- that common markers are sufficient to analyze diverse populations- is incorrect. The common markers are not as widely distributed, or as well-linked to rare variants, as was originally assumed. The new technologies allow vastly more depth of analysis (full genome sequencing) and broader sampling (20,000 people vs a few hundred) to find rare and influential variants. We had previously learned that using common variants confines GWAS analysis to uninteresting variants- those that are not being selected against. This may not be an enormous issue for the height trait, (though these researchers find that many of their new, rare loci are being selected against), but it was a big issue in the analysis of disease-linked genetic loci, like those for diabetes or alcoholism. While these traits may be common, the most influential genetic variants that cause them are not, for good reason.

One can imagine that over time, everyone will have their genome sequenced, and that this data will lead to a fuller, if not complete, understanding of trait genetics. But what are the genes responsible for the traits? All this is still an abstract mapping of locations of variability (what used to be called mutation) correlated with variations of a trait. This newest data identifies thousands of influential variants covering one third of the genome. This means that, like most interesting traits, the genetics of human height are dispersed- a genetic fog. All sorts of defects or changes can influence this trait to infinitesimal degrees, making it a fool's errand to look for a gene for height.


  • Guns are a key element of this volatile moment.
  • Stories, data, and emotion.
  • God, guns, and lunacy ... a match made in heaven.

Saturday, November 21, 2020

Stem Cell Asymmetry Originates at the Centrosome

At least sometimes... how replication of the centrosome creates asymmetry of cell division as a whole, and what makes that asymmetry happen.

Cell fates can hinge on very simple distinctions. The orientation of dividing cells in a tight compartment may force one out of the cozy home, and into a different environment, which induces differentiation. Stem cells, those notorious objects of awe and research, are progenitor cells that stay undifferentiated themselves, but divide to produce progeny that differentiate into one, or many, different cell types. At the root of this capacity of stem cells is some kind of asymmetric cell division, whether enforced physically by an environmental micro-niche, or internally by molecular means. And a dominant way for cells to have intrinsic asymmetry is for their spindle apparatus to lead the way. Our previous overview of the centrosome (or spindle pole body) described its physical structure and ability to organize the microtubules of the cell, particularly during cell division. A recent paper discussed how the centrosome itself divides and originates a basic asymmetry of all eukaryotic cells.

The centrosome is a complicated structure that replicates in tandem with the rest of the cell cycle. Centrosomes do not divide down the middle or by fission. Rather, the daughter develops off to the side of the mother. Centrosomes are embedded in the nuclear envelope, and the mother develops a short extension off its side, called a bridge or half-bridge, from which the daughter develops, also anchored in the nuclear envelope. Though there are hundreds of associated proteins, the key components in this story are Nud1, which forms part of the core of the centrosome, and Spc72, which binds to Nud1 and also to the microtubules (made of the protein tubulin) that it is the centrosome's job to organize. In yeast cells, which divide into very distinct mother and daughter (bud) cells, the mother centrosome (called the spindle pole body) leads the way into division and always goes into the daughter cell, while the daughter centrosome stays in the mother cell.

The deduced structure of some members of the centrosome/spindle pole in yeast cells. Everything below the nuclear envelope is inside the nucleus, while everything above is in the cytoplasm. The proteins most significant in this study are gamma tubulin (yTC), Spc72, and Nud1. OP stands for outer plaque, CP central plaque, IP inner plaque, as these structures look like separate dense layers in electron microscopy. To the right side of the central plaque is a dark bit called the half-bridge, on the other side of which the daughter centrosome develops, during cell division.

The authors asked why this difference exists- why do mother centrosomes act first to go to the outside of the cell where the bud forms? Is it simply a matter of immaturity, that the daughter centrosome is not complete at this point, (and if so, why), or is there more specific regulation involved that enforces this behavior? They use a combined approach in yeast cells combining advanced fluorescence microscopy with genetics to find the connection between the cell cycle and the progressive development of the daughter centrosome.

Yeast cells with three centrosome proteins, each engineered as a fusion to a fluorescent protein of a different color, were used to show the relative positions of Kar1 (green), which lies in the half-bridge between the mother and daughter centrosomes; Spc42 (blue), at the core of the centrosome; and gamma tubulin (red; Tub4, or alternately Spc72, which lies just inside Tub4), which is at the outside and mediates between the centrosome and the tubulin-containing microtubules. Three successive cell cycle stages are shown. Note that the addition of gamma tubulin is a late event, after Spc42 appears in the daughter. The bottom series is oriented essentially upside down vs the top two series.

What they find, looking at cells going through all stages of cell division, is that the assembly of the daughter centrosome is stepwise, with inner components added before outer ones. In particular, the final structural elements, Spc72 and gamma tubulin, wait till the start of anaphase, when the cells are just about to divide, to be added to the daughter centrosome. The authors then bring in key cell cycle mutants to show that the central controller of the cell cycle, the cyclin-dependent kinase CDK, is causing the hold-up. This kinase (a protein that phosphorylates other proteins, as a means of regulation) orchestrates much of the yeast cell cycle, as it does in all eukaryotic cells, subject to a blizzard of other regulatory influences. They observed that conditional mutations of CDK (temperature-sensitive versions of the protein that shut off at elevated temperature) would stop this spindle assembly process, suggesting that some component was being phosphorylated by CDK at the key time of the cell cycle. Then, after systematically mutating possible CDK target phosphorylation sites on likely proteins of the centrosome, they came up with Nud1 as the probable target of CDK control. This makes complete sense, since Spc72 assembles on top of Nud1 in the structure, as diagrammed at top. They go on to show direct phosphorylation of Nud1 by CDK, as well as direct binding between Nud1 and Spc72.

Final model from the article shows how the mechanics they revealed relate to the cell cycle. A daughter centrosome slowly develops off the side of the mother centrosome, but its "licensing" by CDK to nucleate microtubules (black rods anchored by the blue cones) only comes later on in M phase, just as the final steps of division need to take place. This gives the mother centrosome the jump, allowing it to migrate to the bud (daughter cell) and nucleate the microtubules needed to drive half of the replicated DNA/chromosomes into the bud. GammaTC is nucleating gamma tubulin, "P" stands for activating phosphorylation sites on Nud1.

This is a nice example of the power of a model system like yeast, whose rich set of mutants, ease of genetic and physical manipulation, complete genome sequence and associated bioinformatics, and many other technologies make it a gold mine of basic research. The only hard part was the microscopy, since yeast cells are substantially smaller than human cells, making that part of the study a tour de force.

Saturday, November 7, 2020

Why We Have Sex

Eukaryotes had to raise their game in the sex department.

Sex is very costly. On a biological level, not only does one have to put up with all the searching, displaying, courting, sharing, commitment, etc., but one gives up the ability to have children alone, by simple division or parthenogenesis. Sex seems to have been a fundamental development in the earliest stages of eukaryotic evolution, along with so many other innovative features that set us apart from bacteria. But sex is one of the oddest innovations, and demands its own explanation.

Sex and other forms of mating confer one enormous genetic benefit: they allow good and bad mutations to be mixed up, separated, and redistributed, so that offspring with a high proportion of bad genes die off, while offspring with a better collection can flourish. Organisms with no form of sex (that are clonal) cannot get rid of bad mutations. Whatever mutations they have are passed to their offspring, and since most mutations are bad, not good, this leads to a downward spiral of genetic decline, called Muller's ratchet.
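The ratchet is easy to demonstrate with a toy simulation (all parameters here are made up for illustration): in an asexual population where offspring inherit every mutation of their parent, the least-mutated class can only shrink and disappear, generation by generation:

```python
import random

random.seed(0)

# Muller's ratchet, minimally: an asexual population of 100 genomes.
# Each generation, offspring copy a random parent (weighted by fitness,
# (1 - s) per mutation carried) and gain a new deleterious mutation
# with probability u. With no recombination and no back-mutation, the
# minimum mutation load in the population can only ratchet upward.
POP, GENS = 100, 300
U, S = 0.3, 0.02   # mutation probability per generation; cost per mutation

loads = [0] * POP  # deleterious mutation count per individual
min_load_history = []
for _ in range(GENS):
    weights = [(1 - S) ** k for k in loads]
    parents = random.choices(loads, weights=weights, k=POP)
    loads = [k + (1 if random.random() < U else 0) for k in parents]
    min_load_history.append(min(loads))

print("least-loaded genome, start:", min_load_history[0])
print("least-loaded genome, end:  ", min_load_history[-1])
```

Because nothing can reconstitute a lost low-mutation class, each loss is an irreversible "click" of the ratchet, and the best genome in the population steadily degrades.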

It turns out that non-eukaryotes like bacteria do mate and recombine their genomes, and thus escape this fate. But their process is not nearly as organized or comprehensive as the whole-genome re-shuffling and mating that eukaryotes practice. What bacteria do is called lateral gene transfer (LGT), because it typically involves short regions of their genomes (a few genes), and they can accept DNA from any sort of partner- the partners do not have to be the same species, though specific surface structures can promote mating within a species. Thus bacteria have frequently picked up genes from other species- the major limitation comes when the DNA arrives in the recipient cell and needs to find a homologous region of the recipient's DNA. If it is too dissimilar, then no genetic recombination happens. (An exception is for small autonomous DNA elements like plasmids, which can be transferred wholesale without needing a homologous target in the recipient's genome. Antibiotic resistance genes are frequently passed around this way, for emergency selective adaptation!) This practice has a built-in virtue, in that the most populous bacteria locally will contribute most of the donor DNA, so a recipient bacterium seeking to adapt to local conditions could do worse than try out some local DNA. On the other hand, there is also no going back. Once a foreign piece of DNA replaces the recipient's copy, there are no other copies to return to. If that DNA is bad, death ensues.

Two bacteria, connected by a sex pilus, which can conduct small amounts of DNA. This method is generally used to transfer autonomous genetic elements like plasmids, whereas environmental DNA is typically taken up during stress.

A recent paper modeled why this haphazard process was so thoroughly transformed by eukaryotes into the far more involved process we know and love. The authors argue that it was fundamentally a question of genome size: as eukaryotes transcended the energetic and size constraints of bacteria, their genomes grew as well- to a size that made the catch-as-catch-can mating strategy unable to keep up with the mutation rate. Greater size had another effect, that of making populations smaller. Even with our modern billions, we are nothing in population terms compared to any respectable bacterium. This means that the value of positive mutations is higher, and the cost of negative mutations more severe, since each one counts for more of the whole population. Finding a way to reshuffle genes, preserving the best and discarding the worst, becomes imperative as populations get smaller.

Sex does several related things. First, the genes of each partner recombine during meiosis, at a rate of a few crossover events per chromosome, thereby shuffling the haploid chromosomes that were received from that partner's own parents. Second, each chromosome pair assorts randomly at meiosis, again shuffling the parental genomes. Lastly, mating combines the genomes of two different partners (though inbreeding happens as well). All this results in a moderately thorough mixing of the genetic material at each generation. The resulting offspring are then a sampling of the two parental (and four grand-parental) genomes, succeeding if they get mostly the better genes, and not (frequently dying in utero) if they do not.
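The first of these shuffling steps can be sketched for a single chromosome pair. This is a deliberately cartoonish model- one crossover per chromosome, with letter labels standing in for the alleles inherited from each grandparent:

```python
import random

random.seed(2)

# One chromosome pair in meiosis: a single crossover at a random point
# recombines the two homologs, and one recombinant product goes into
# the gamete. 'A'/'a' mark alleles from the two maternal grandparents,
# 'B'/'b' those from the two paternal grandparents.

def make_gamete(homolog1: str, homolog2: str) -> str:
    """Recombine two homologs with one crossover; return one product."""
    x = random.randrange(1, len(homolog1))        # crossover position
    recombinants = [homolog1[:x] + homolog2[x:],
                    homolog2[:x] + homolog1[x:]]
    return random.choice(recombinants)

mom_gamete = make_gamete("AAAAAAAAAA", "aaaaaaaaaa")
dad_gamete = make_gamete("BBBBBBBBBB", "bbbbbbbbbb")
child = list(zip(mom_gamete, dad_gamete))   # diploid child, locus by locus
print(mom_gamete, dad_gamete)
```

Each gamete is a mosaic of its grandparental chromosomes, so every child samples a different mixture of the four grandparental genomes.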

Additionally, eukaryotic sex gave rise to the diploid organism, with two copies of each gene, rather than the one copy that bacteria have. While some eukaryotes spend most of their lives in the haploid phase, and only briefly go through a diploid mated state, (yeasts are a good example of this lifestyle), most spend the bulk of their time as diploids, generating haploid gametes for an extremely brief haploid existence. The diploid provides the advantage of being able to ignore many deleterious genes, being a "carrier" for all those bad (recessive) mutations that are covered by a good allele. Mutations do not need to be eliminated immediately, taking a substantial load off the mating system to bring in replacements. (Indeed, some bacteria respond to stress by increasing promiscuity, taking in more DNA in case a genetic correction is needed, in addition to increasing their internal mutation rate.) A large fund of defective alleles can even become grist for evolutionary innovation. Still, for the species to persist, truly bad alleles need to be culled eventually- at a rate faster than that with which they appear.

The authors run a series of simulations with different genome sizes, mutation rates, and lengths (of transferred DNA) and rates of lateral gene transfer. Unfortunately, their figures are not very informative, but the logic is clear enough. The larger the genome, the higher the mutation load, assuming constant mutation rates. But LGT is a sporadic process, so correcting those mutations takes not just a linearly higher rate of LGT but an exponentially higher one- a rate that remains insufficient to address all the mutations, yet is high enough to be impractical and to call into question what it means to be an individual of such a species. In their models, only when the length of LGT segments is a fair fraction of the whole genome size (20%), and the rate quite high, like 10% of all individuals experiencing LGT once in their lifetimes, do organisms have a chance of escaping the ratchet of deleterious mutations.

" We considered a recombination length L = 0.2g [genome size], which is equivalent to 500 genes for a species with genome size of 2,500 genes – two orders of magnitude above the average estimated eDNA length in extant bacteria (Croucher et al., 2012). Recombination events of this magnitude are unknown among prokaryotes, possibly because of physical constraints on eDNA [environmental DNA] acquisition. ... In short, we show that LGT as actually practised by bacteria cannot prevent the degeneration of larger genomes. ... We suggest that systematic recombination across the entire bacterial genomes was a necessary development to preserve the integrity of the larger genomes that arose with the emergence of eukaryotes, giving a compelling explanation for the origin of meiotic sex."

But the authors argue that this scale of DNA length and frequency of uptake is quite unrealistic for actual bacteria. Bacterial LGT is constrained by the available DNA in the environment, and typically takes up only a few genes-worth of DNA. So as far as we know, this is not a process that would or could have scaled up to genomes of ten- or one hundred-fold larger size. Unfortunately, this is pretty much where the authors leave the work, without entering into an analysis of how meiotic recombination and re-assortment would function, in these same terms, to forestall the accumulation of deleterious mutations. They promise such insights in future work! But it is obvious that eukaryotic sex is, in these terms, an entirely different affair from bacterial LGT. Quite apart from featuring exchange and recombination across the entire length of the expanded genomes, it also ensures that only viable partners engage in genetic exchange, and simultaneously insulates them from any damage to their own genomes, instead placing the risk on their (presumably profuse) offspring. It buffers the effect of mutations by establishing a diploid state, and, most importantly, shuffles loci all over these recombined genomes so that deleterious mutations can be concentrated and eliminated in some offspring while others benefit from more fortunate combinations.

Saturday, September 12, 2020

Genetics and the Shahnameh

We have very archetypal ideas about genetics.

Reading a recent translation of the Persian epic, the Shahnameh, I was impressed by two things, amid all the formulaic focus on war and kingship. First was what it did not say, and second was its attitude, shared with all sorts of traditional societies, toward blood, nobility, and what we now understand as genetics. This epic, which transitions from pure myth in its first half to quasi-history in the second, stops abruptly at the Arab conquest. Not a word is uttered past the overthrow of the last pre-Islamic Persian ruler. Not a word about Islam, not a word about the well over three hundred years of Persian history under the Arab yoke by the time this was written, circa 1000 AD. That says a lot about what the author, Abolqasem Ferdowsi, regarded as the significant boundaries of Persian history. Not that he was opposed to narrations of decline and suffering. The final era of the Sassanian Empire was one of chaos and decline, with regicides and civil war. But apparently that history was still Persian, while the Arab epoch was something else entirely- something that Iran is still grappling with.

The epic is full of physical descriptions- kings are always tall as cypresses and brave as lions, women are always thin as cypresses, their faces like full moons and their hair musky. True kings radiate farr- a glowing splendor that they show from a young age, which marks them as destined rulers. But farr can also be lost, if they turn to the dark side and lose popular support. The Chinese have a similar concept in the mandate of heaven, which, however, is not portrayed as a sort of physical charisma or halo. Children are generally assumed to take after their parents, for good or ill. The concept of the bad seed comes up in the Shahnameh, especially in the saga of Afrasyab, king of Turan and long-time antagonist of the Persian kings and their champion, Rostam. The Persian king Kavus has fathered a great (and handsome) champion, Sayavash. Through several plot twists, Sayavash must leave Persia and is adopted by Afrasyab, even marrying his daughter. But then the drama turns again and Afrasyab kills Sayavash. Thankfully, Sayavash had already fathered a future king of Persia, whom the Persians suspect of bad lineage due to his descent from Afrasyab- a suspicion they are slow to overcome.

"By the time the boy was seven years old, his lineage began to show. He fashioned a bow from a branch and strung it with gut; then he made a featherless arrow and went off to the plains to hunt. When we was ten he was a fierce fighter and confronted bears, wild boar, and wolves. ... Seeing the boy's noble stature, he dismounted and kissed his hand. Then he gazed at him, taking in the signs of kingly glory in his face."


Ancient peoples have generally taken nobility and bloodlines very seriously, for several reasons. First, obviously, is that children do take after their parents, for good and ill, just as ethnic groups similarly have some distinctive characteristics. Second is that, for practical as well as psychological reasons, people always seek good rulers and stable ruling systems, which in the aristocratic, patriarchal setting means an orderly transition from king to prince. The fairy tale (archetypal) ending is that the prince and princess take over the kingdom, and everyone lives happily ever after. Third is that hierarchy of some sort seems to be part of our cultural DNA. Someone, or some group, is always up, others down. Whatever the group or organization and whatever its professed principles, hierarchy re-asserts itself. Those on top naturally want to stay on top, and to bequeath that position to their future replicas, i.e. their offspring. To do that they will generate all the practical advantages they can, and into the bargain foster a mythos of just distinction, based on their glorious bloodline, if not outright divine sanction. Thus genealogical trees, heraldry, etc.

The Shah is not like you or me...

The ruling houses of Europe over the whole post-Roman era were infested with these archetypes and mythologies. Marrying "up" or "down" was a vast game carried out across the continent. And what has it gotten us? Prince Charles. It is obvious that something went awry in this genetic exercise of assortative mating, as it did ultimately in the tragedies of the Shahnameh as well. The behavior of royals generally fails to select for all the positive traits that are ultimately needed. Their training frequently fails as well to bring out those good traits that do exist. But most of all, genetics is far more of a crapshoot than the archetypes allow.

Children do take after their parents, but there are stringent and interesting limits. A child gets only half of each parent's genes, and those genes may be from either copy in each parent. That copy might have been totally silent- recessive vs the other, dominant allele. Two brown-eyed parents can have a blue-eyed child, if they are heterozygous for eye color. Multiply this over thousands of loci, and the possibilities are endless. This is why traits of the grandparents, or entirely novel traits, sometimes seem to surface unexpectedly. The genetic mixing that takes place on the way to new life is carefully engineered to replicate, but with wide variation on the theme, such that any child is as much a child of its wider lineage and environment as of its particular parents. Genetic defects remix during this process as well, concentrating in some children, and leaving others fortunately free to realize greater potentials. The obsessive concentration of lineages that characterizes royalty systems, taken to an incestuous extreme in Egypt, leads to inbreeding, which means the exposure of defective alleles due to excessive homozygosity. We all carry defective gene alleles, which are typically recessive, and thus get exposed only when they pair up with an equally defective partner. Thus an extreme focus on lineage and purity leads to its own destruction.
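The brown-eyed-parents example can be worked out explicitly with the textbook single-locus model (real eye color involves several loci, so this is a simplification for illustration):

```python
from itertools import product

# Punnett square for two heterozygous (Bb) brown-eyed parents.
# B = brown allele (dominant), b = blue allele (recessive).
def eye_color(genotype):
    return "brown" if "B" in genotype else "blue"

# Each parent contributes B or b with equal chance, giving four
# equally likely allele pairings in the offspring.
offspring = list(product("Bb", "Bb"))
blue = sum(1 for g in offspring if eye_color(g) == "blue")
print(f"blue-eyed offspring: {blue} of {len(offspring)}")  # 1 of 4
```

Only the bb pairing, one of the four equally likely combinations, exposes the recessive allele- the same arithmetic that makes inbreeding so dangerous when the recessive allele is not eye color but a serious defect.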

The differences between ethnicities are far less than those between families. Human lineages may have some strongly selected and differentiated traits, such as skin color, but such traits are exceedingly rare. Otherwise, our genetics are a cloud of variation that crosses all ethnic lines. Humans were a single lineage only a few hundred thousand years ago or less, so, broadly speaking, we are all the same. Indeed, compared to most species, such as chimpanzees, we have much less genetic variation overall, and are virtually clones, due to relatively recent bottlenecks of extremely low population that reduced our genetic diversity. Our current population size relative to those of the other great apes certainly does not reflect conditions in the past!

Education was another ingredient in the traditional systems of nobility and aristocracy. Only the rich could afford an education, so only the upper crust were educated, thus gaining one more credential, in addition to their genetic credentials, over the middle and lower classes. Such notions of aristocracy died perhaps hardest in military circles, where officers were long an aristocratic class, selected for their connections, not their ability. It was one of the great American innovations to establish a national military college whose admission was distributed liberally to deserving candidates, (at the same time as a similar academy was set up in revolutionary France by Napoleon). It is obvious that the capacity for education was far more widespread than originally conceived, and we benefit today from the very active diffusion of education for everyone. Yet not all are college material. Some children are bright, some less so. Genetics and early development still count for a great deal- but good (and bad) genes can come up anywhere. That means that in the end, the American system of meritocracy, for all its defects, of which there are many, and despite its significantly unfulfilled promise to many, is a huge advance over the hidebound traditions, archetypes, and injustice of aristocracy.

But back to genetics- what are we finally to make of genetics, eugenics, and noble bloodlines? It is clear that humans can be selectively bred, just like any other animal. Twins and twin studies make it abundantly clear that all sorts of traits- physical and mental- are gene-based and heritable, to striking extents. It is also clear that historical attempts at eugenics have not turned out well, whether through systems of nobility or more modern episodes of eugenics. The former were largely self-indulgent and self-serving ideologies designed to keep power and status among an elite, within which poor choices in mates and inbreeding consistently led to genetic doom. The latter were ideological exercises in frank racism, no more anchored in positive values, genetic or otherwise, than the aristocracies of yore. There have been occasional successful genetic experiments in human breeding, such as the Bach family, Yao Ming, and Stephen Curry, which show what can be done when one puts one's mind to it! (The Trump brood may also be cited as another, if negative, example.)

But generally, selective breeding implies a single set of values that constitute its goal. Our values, as a society, are, however, diverse in the extreme. We celebrate some people more than others at a political or social level, but have been heading in the direction, since our country's founding, of recognizing the dignity and worth of every person without exception, along with their freedom to form and express their own values. We can neither agree on a society-wide set of specific values that would shape any form of selective eugenics, nor allow individuals to go beyond the bounds of normal mate selection to plunge into cloning, genetic alterations, and the like, to inordinately expand their genetic influence on succeeding generations. All that would strike at the heart of the social project that is America- to foster individual opportunity and merit, while at the same time respecting the rights and worth of each individual- indeed, each way of life. It is likely that, given the technology, we might come to a general consensus to eradicate certain genetic diseases and syndromes. But beyond that lies a frontier of genetic engineering that the US is particularly poorly suited to cross, at least until we have made America great again, so to speak, and become another society entirely.

  • Some calming piano.
  • Oh, yeah- remember the tax cut? That went to the rich.
  • Some people are prepping for war.
  • Maybe low-dose infection is one way to approach Covid-19.
  • International fisheries are not just environmental disasters, but human rights disasters.
  • The difference between being a con man and being a president.
  • Some possible futures for Earth. RCP 8.5 takes us (in a matter of 80 years) to conditions last seen 40 million years ago.
  • Followup quote from Frederick Douglass:

"Color prejudice is not the only prejudice against which a Republic like ours should guard. The spirit of caste is malignant and dangerous everywhere. There is the prejudice of the rich against the poor, the pride and prejudice of the idle dandy against the hard-handed workingman. There is, worst of all, religious prejudice, a prejudice which has stained whole continents with blood. It is, in fact, a spirit infernal, against which every enlightened man should wage perpetual war."

Sunday, September 6, 2020

Why Are Cells So Small?

Or, why are they one size, and not another?

One significant conundrum in biology is how cells know what size they are, and what size they are supposed to be. Bacteria are tiny, while eukaryotic cells are huge in comparison. And eukaryotic cells vary tremendously in size, from small yeast cells to peripheral nerves that span much of your body, even on to ostrich eggs. Outside of yeasts, not much is known about how these cells judge what size is right and when to divide. A recent paper proposed that the protein Rb plays an important role in setting cell size, at least for some eukaryotic cell types.

Rb is named for retinoblastoma, the form of cancer it is most directly responsible for, and is a well-known gene. Many other cancers also have mutations in Rb, since it is what is called a "tumor suppressor gene"- that is, the opposite of an oncogene. Rb interacts with hundreds of proteins in our cells, but its most important partner is the transcription activator E2F1, which drives cell cycle progression. Rb binds to and inhibits the activity of E2F1 (and a family of related proteins), halting cell division until something relieves that repression, such as a regulatory phosphorylation that shuts Rb off, or too little Rb remaining in the cell.

The researchers took a clue from yeast, whose Whi5 gene inhibits the cell cycle much as Rb does, and is known to regulate the size of cells at division. So this work was not a big surprise. The interesting aspect is that Rb now has one more role, which logically integrates with its other known roles in the cell cycle. The authors used cells that over- or under-express Rb to show that the copy number of Rb has a significant, if not overwhelming, effect on cell size.

Amount of Rb correlates with the size of cell. The authors set up an inducible genetic construct to drive Rb expression, from zero to four times normal amounts.


So how do they imagine this mechanism working? First, Rb is a durable, stable protein, with a half-life (29 hours) almost twice as long as the cell division cycle under the conditions the experimenters used. Second, nearly all Rb is in the nucleus, attached to DNA. So at cell division, roughly equal amounts necessarily partition to each daughter cell, even if the daughters' volumes are very different. Thereafter, each cell synthesizes Rb at a low rate that does not keep up with cell growth, especially during the G1 phase of the cell cycle- the period prior to DNA replication and commitment to division. In fact, very little Rb is made in that period, allowing it to serve as a limiting factor through dilution as the cell grows. When it becomes sufficiently dilute, it contributes to the decision to start a new cell cycle by letting go of its repression of E2F1.
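This dilution logic is simple enough to sketch in code. Below is a toy simulation of one daughter lineage; all parameter names, units, and values (threshold, synthesis amount, growth factor) are invented for illustration and are not taken from the paper:

```python
def divide_and_grow(rb_birth, vol_birth, threshold=1.0, synthesis=100.0,
                    post_g1_growth=1.6, n_cycles=12):
    """Follow one daughter lineage under a toy Rb-dilution sizer.
    Returns the cell volume at each birth (arbitrary units)."""
    rb, vol = rb_birth, vol_birth
    birth_volumes = []
    for _ in range(n_cycles):
        birth_volumes.append(vol)
        # G1: almost no Rb is made, so growth dilutes it. The cell grows
        # until Rb concentration falls below the de-repression threshold
        # (if born large, that condition is already met at birth).
        vol = max(vol, rb / threshold)
        # S/G2/M: some further growth; Rb synthesis partially catches up.
        vol *= post_g1_growth
        rb += synthesis
        # Division: Rb amounts split equally between daughters; so does volume.
        rb /= 2.0
        vol /= 2.0
    return birth_volumes

small = divide_and_grow(rb_birth=100.0, vol_birth=10.0)
large = divide_and_grow(rb_birth=100.0, vol_birth=500.0)
# Both lineages converge on the same birth volume (80 units here), because
# the G1 exit size is set by the inherited Rb amount, not by birth size.
```

The point of the sketch is the homeostasis: lineages started at wildly different sizes converge, because the amount of inherited Rb, not the birth volume, dictates how big a cell must grow before it can commit to division.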

How several proteins accumulate during the cell cycle. Rb is shown in dark blue, and hardly accumulates at all in G1, the growth phase of the cell cycle before DNA replication (S phase) and division (M phase). For comparison, nuclear volume and a generic translation protein (EF1) rise monotonically with cell growth. Cdt1 is a key licensing factor for DNA replication. It accumulates during G1, and after the DNA replication origins fire, is destroyed by the end of S phase. Conversely, Geminin is a protein that binds to and represses Cdt1, preventing re-replication of DNA that has already replicated once. It accumulates during S phase and stays high until after division. After S phase, more Rb is made, partially catching up to the current cell size. 

That is the theory, at least, backed by pretty good evidence. But its effect is not proportional, and not uniform among cell types. There are clearly other controls over cell size in play- this is only one. Indeed, there are a couple of siblings of Rb (in a family termed "pocket proteins") which also regulate the cell cycle, and a vast network of other controls and stimuli that impinge on it. So finding even one regulator of this kind, and finding conditions where it has strong effects on cell size, is quite significant. As for the ultimate rationale of cell size, Rb regulation is only a mechanism enforcing a logic arrived at over evolutionary time, about the practical limits and ideal proportions of cells- in this case, in the human body. Smaller cells have the virtue of being more easily disposable, like the countless skin and gut epithelial cells that are sacrificed daily. Our long peripheral nerves are much more difficult to replace.

Beyond the cell cycle, Rb has many other roles in the cell, as suggested by the vast number of its interaction partners, diagrammed below by functional classification.


Functional classification of the many proteins that interact directly with Rb. It also has about 15 phosphorylation sites that can be regulated by various kinases.


  • The Fed goes all MMT, behind the scenes. No more reserve requirements, no more market-based interest manipulation.
  • We are increasingly at risk of civil war.
  • Guess who recommends illegal voter fraud?
  • Yet another effective Chinese vaccine.
  • Bob Cringely on the pandemic loan program, and other misguided incentives.
  • How the virus disarms and shuts down the host cell.

Saturday, May 16, 2020

Origin of Life- RNA Only, or Proteins Too?

Proteins as Participants in the Origin of Life

After going through some momentous epochs in the history of life in the last two weeks (the origin of eukaryotes, and the nature of the original metabolism), we are only part of the way to the origin of life itself. The last common ancestor of all life (LUCA), rooted at the divergence between bacteria and archaea, was a fully fledged cell, with many genes, a replication system and translation system, membranes, and a robust metabolism based on a putative locale at hydrothermal vents. This was a stage long after the origination of life itself- an event at the chemical interface of the early Earth, about which our concepts remain far hazier.

A recent paper (and a prior one) takes a few more speculative shots at this question, invoking what it calls the initial Darwinian ancestor (IDA), and making the observation that proteins are probably as fundamental as RNA to this origination event. One popular model has been the "RNA world", based on the discovery that RNA has some catalytic capability, making it in principle capable of being the Ur-genetic code as well as the Ur-enzyme that replicated that same code into active, catalytic molecules, i.e., itself. But not only has such a polymathic molecule been impossible to construct in the lab, the theory is also problematic.

First, RNA has some catalytic ability, but not nearly as much as it needs to make a running start at evolution. Second, there is a great symmetry in the mechanisms of life- proteins make RNA and other nucleic acids, as polymerases, while RNA makes proteins, via the great machine of the ribosome. This seems like a deep truth and reflection of our origins. It is probable that proteins would, in theory, be quite capable of forming the ribosomal core machinery- and much more efficiently- with the exception of the tRNA codon interpretation system that interacts closely with the mRNA template. But they haven't and don't. We have ended up with a byzantine and inefficient ribosome, which centers on an RNA-based catalytic mechanism and soaks up a huge proportion of cellular resources, due to what looks like a historical event of great significance. In a similar vein, the authors also find it hard to understand how, if RNA had managed to replicate itself in a fully RNA-based world, it later handed off that replication function to proteins, when the translation function never was handed off. (It is worth noting that the spliceosome is another RNA-based machine that is large and inefficient.)

The basic pathways of information in biology. We are currently under siege by an organism that uses an RNA-dependent RNA polymerase to make, not nonsense RNA, but copies of itself and other messages by which it blows apart our lung cells. Reverse transcriptases, copying RNA into DNA, characterize retro-viruses like HIV, which burrow into our genomes.

This thinking leads to a modified version of the RNA world concept, suggesting that RNA is not sufficient by itself, though it was clearly central. It also leads to questions about nascent systems for making proteins. The ribosome has an active site that lines up three tRNAs in a production line over the mRNA template, so that the amino acids attached on their other ends can be lined up likewise and progressively linked into a protein chain. One can imagine this process originating in much simpler RNA-amino acid complexes that were lined up haphazardly on short RNA templates to make tiny proteins, given conducive chemical conditions. (Though conditions may not have been very conducive.) Even a slight bias in the coding for these peptides would have led to a selective cycle that increased fidelity, assuming that the products provided some function, however marginal. This is far from making a polymerase for RNA, however, so the origin and propagation mechanisms for the RNA remain open.

"The second important demonstration will be that a short peptide is able to act as an RNA-dependent RNA polymerase."
- The authors, making in passing what is a rather demanding statement.

The point is that at the inception of life, to have any hope of establishing the overall cycle of selection between an information store and some function it embodies or encodes, proteins, however short, needed to participate both as functional components and as products of encoding- founding a cycle that remains at the heart of life today. The fact that RNA has any catalytic ability at all is a testament to the general promiscuity of chemistry- that tinkering tends to be rewarded. Proteins, even in exceedingly short forms, provide a far greater, and code-able, chemical functionality that is not available from either RNA (poor chemical functionality) or ambient minerals (which have important, but also limited, chemical functionality, and are difficult to envision as useful in polymeric, coded form). Very little of the relevant early chemistry needed to be coded originally, however. The early setting of life seems to have been rich with chemical energy and diverse minerals and carbon compounds, so the trick was to unite the simplest possible code with simple coded functions. Unfortunately, despite decades of work and thought, the nature of this system- or even a firm idea of what would be minimally required- remains a work in progress.


  • Thus, god did it.
  • Health care workers can be asymptomatic, and keep spreading virus over a week after symptoms abate.
  • Choir practices are a spreading setting.

Saturday, April 18, 2020

Birth of a Gene

Where do genes come from? Well, lots of them rise right out of the muck- the junk of the genome, according to one paper.

Can genes arise out of nothing? The intelligent design folks spent a lot of sweat and pseudo-math showing that that was absolutely impossible. But here we are anyhow; they got their physics and math wrong. New genes arise all the time, mostly from pre-existing genes, by duplication events, which are rampant given the capacity of biological systems to replicate their constituent molecules. The human genome carries vast fleets of genes whose origin is duplication over evolutionary time- hundreds of zinc finger transcription factors, hundreds of odorant receptors, not to mention tens of thousands of duplicated transposons and viral remnants. And yet, can genes arise from nothing at all?

A recent paper says that yes, many functional genes have come from completely non-functional DNA, rather than from pre-existing genes. While not the same as assembling a gene from the primordial soup- an event that remains difficult to reconstruct, while singular in its global impact- this claim does suggest that the long-term plasticity of our genomes and of biological functions is even higher than many biologists appreciate. These researchers use synteny as their touchstone- the tendency of genes to stay in the same place on chromosomes through time- to conclude that most genes lacking homologs in other species did not arise by duplication, but by the conversion of some junk DNA to a functional state.

Syntenic relations of some of the human chromosomes, with those of chimpanzee. Lines indicate concordant / homologous positions. Note several massive inversions, and a few smaller segments that have jumped from one location to another. But on the whole, our genomes are highly similar in gross structure.

Humans and chimpanzees have strongly syntenic chromosomes, since we are so closely related. Most chromosomes line up precisely, with a few dramatic inversions (places where a portion of a chromosome in one lineage flipped orientation by recombination), and a few gaps and migrations of segments to new locations. This means that it is easy to trace which gene is ancestrally related to which gene in the other species. And not just genes: all nearby portions of the DNA are similarly lineally related, even if they are not well-conserved, as the cores of genes typically are. The researchers used human, fly, and yeast lineage tracing, benefiting from the large numbers of genomes that have now been sequenced from closely related species. This allowed them to determine the origin of novel genes lacking homologs among other species, but situated between normal, and normally homologous, genes. Either that novel gene arose in place, from the materials available, or else it came from elsewhere as a duplication or gene conversion event, with recognizable antecedents.

At a gene with no recognizable homolog (green), synteny helps to tell us that its origin was from a pre-existing gene, not from junk DNA.

Given all that information, one can then ask- did this gene decay from some known gene that is homologous to others among many species, and if so, how long did that decay take? At this point we need to define gene similarity. Software programs can typically give quantitative answers to how similar two protein sequences are, or two nucleotide sequences. But there is a twilight zone where similarity is so low that it can not be computationally recognized- like a game of telephone after too many transfers. That does not mean, however, that the two sequences are not lineally related, or even that they don't have the same function. There are many examples of protein pairs with no discernible sequence similarity, but very similar structures and functions. So evolution can go places our computers can not quite follow, though that may change once we solve the protein folding problem.

The authors portray the estimated time to gene degradation for orphan genes they studied, based on their presumed ancestors identified by synteny analysis. A very long time, in any case, but even longer for humans. my = millions of years; the Y axis is the proportion of the genome, going up to 10%.

The researchers show that this time to gene decay is much faster in flies and yeast than it is in humans. What takes 200 million years in yeast or 400 million years in flies (for 10% of lineage-ancestral, syntenic genes to decay to unrecognizable similarity) takes an extrapolated 2 billion years in human genes. This may be due to the vastly different generation times of these species, considering that meiosis may be the most likely time for genome rearrangements.

An example (MNE1) of a large protein-coding gene (in single-letter code) that has essentially no recognizable sequence similarity, but still has synteny and functional homology with its relative (here, from S. cerevisiae to K. lactis, both yeasts).

The next question was- how many of the novel genes across the genome came from that decay process acting on pre-existing genes, and how many instead (by default) arose de novo out of the local DNA segment? It is a complicated question, a function of how one calculates similarity and models synteny across related species. Do lineages where the matching syntenic DNA disappeared rather than decayed count towards the de novo origin hypothesis, or do they count as similar DNA that supports the decayed gene hypothesis? Since one partner in the homology pair is absent, the analysis depends on having enough other lineages fully sequenced to figure out what happened in detail. The authors' conclusion is that, on the whole, only one-third of novel genes arose from decay processes, and the rest arose de novo. That is a stunning conclusion, and it is somewhat buried in the paper, which focuses on the decay processes that are easier to analyze, and which comprise all the figures.

Unfortunately, their logic breaks down when it comes to this conclusion. Yes, genes degrade to various degrees over time when they fail to see strong selection for function. That is a given. But their key assumption is that their derived rate of gene decay at syntenic positions (let us say X) can be extrapolated over the entire genome. They thus claim that since, from separate analysis, Y is the number of genes in the entire genome that are novel (or orphan, lacking recognizable relatives), Y - X is then the proportion that did not degrade from pre-existing genes, but rather arose de novo from other non-gene genetic elements. From this, they offer an estimate of roughly Y = 3X, leaving 2/3 of Y coming from somewhere else, presumably de novo formation. The problem is that degradation of a gene at a syntenic position is a special case, compared to the also quite frequent duplication of genes and other sequences to distant locations, which is another source of pseudogenes, and ultimately of gene degradation and novel or orphan sequences. The mutation rates that apply to these cases are likely to be different, because the syntenic case never involves gene duplication, at least not in the recent past, by definition. Duplication is far more likely than degradation in a syntenic location to lead to an immediate loss of function and selection.
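For concreteness, the subtraction at issue can be written out, with X and Y as in the text (the counts below are placeholders, not the paper's actual numbers):

```python
def de_novo_fraction(total_orphans, decay_explained):
    """The paper's bookkeeping: orphan genes not traced to decayed syntenic
    ancestors are attributed, by default, to de novo origin."""
    return (total_orphans - decay_explained) / total_orphans

X = 1.0        # orphan genes traced to decay at syntenic positions
Y = 3.0 * X    # all orphan genes, per the paper's rough estimate Y = 3X
claimed = de_novo_fraction(Y, X)          # 2/3 attributed to de novo origin

# The objection: decayed *duplicates* sit at non-syntenic positions and are
# never counted in X. If, say, an equal number of orphans arose that way,
# the de novo remainder shrinks considerably:
adjusted = de_novo_fraction(Y, X + 1.0)   # only 1/3 left unexplained
```

The arithmetic makes the weakness plain: the "de novo" estimate is whatever is left over after subtraction, so any undercounted decay pathway inflates it directly.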

So I do not think we can conclude what this paper (and an accompanying review) claim. They have not demonstrated at all the de novo origin of novel genes, but only suggested such origins from highly questionable negative evidence. Nevertheless, the topic is an interesting one, and someone is likely to study it with more care than was done here. Many tiny open reading frames and other stray genetic proto-elements litter our genomes, and other studies have shown that practically all of them are expressed at some level, at least as RNA, if not as proteins. So the question remains- whether and at what rate any of them gain an actual selected function, rising to the level of a gene of significance to the organism.

  • I wonder what a psychopathic clown melt-down looks like.
  • Never waste a chance to be utterly corrupt.
  • Making America number 1 ... in disorganized health care and coronavirus deaths.
  • Government operates mostly in an economy of blame, not of wealth or effectiveness.

Saturday, December 21, 2019

We Are All Special

A study in yeast finds that rare mutations have outsize influence on traits.

The word "mutation" is frowned upon in these politically correct days. While we may have a human reference genome sequence derived from some guy from Buffalo, New York, all genomes are equal, and thus differences between them and the reference are now termed "variations" rather than mutations.

After the first human genome was cranked out, the natural question was- How do we differ, and what do those differences say about our medical futures and our evolutionary past? This led to the 1000 genomes project, and much more, to the point that whole genome sequencing is creeping into normal medical practice, along with even more common panels of a smattering of genes analyzed for specific issues, principally cancers. Well, we differ a lot, and this data has been richly rewarding, especially in forensic and evolutionary terms, but only somewhat in medical terms. The ambition for medical studies has been extremely high- to interpret from our code why exactly we are the way we are, and what our medical fate will be. And this ambition remains unfulfilled. It will take decades to get there, and our code is far from controlling everything- even complete knowledge of our sequences and their impact on our development and medical issues will leave a great deal to accidents of fate.

The first approach to mining the new genomic information, especially variations among humans, for medically useful information was the GWAS study. These put the 1000 genomes (or some other laboriously accumulated set of sequences, which came tagged with medical information) into a blender and asked which variations between people's sequences correlated with variations in their diseases. Did diabetes correlate with particular genes being defective or altered? Despite huge resources and high hopes, these studies yielded very little.

The reason was that the notion of variation (or mutation), and especially the intricate field of evolutionary population genetics, was in a somewhat primitive state among these researchers. They only accepted variations that occurred a few times in their samples, so that they could be sure these were not just random sequencing mistakes. In a population of, say, 1000, any variation that occurs a few times has a particular nature: it must be somewhat stable in the population and have a long history, to allow it to rise to such a (modest, but significant) level of prevalence. This in turn means that it can not have a severe effect, in evolutionary terms, which would otherwise have cut its history in the population rather short. So it turned out that these researchers were studying the variations least likely to have any effect, and for all the powerful statistics they brought to bear, little fruit turned up. It was a very frustrating experience for all concerned.
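That inverse relation between frequency and effect can be demonstrated with a bare-bones Wright-Fisher simulation. The population size, selection coefficients, and run lengths below are arbitrary choices of mine, picked only to make the contrast visible:

```python
import random

def peak_frequency(pop_size, s, generations, seed):
    """Highest frequency a single new mutant allele reaches in a haploid
    Wright-Fisher population, with selection coefficient s (negative =
    deleterious), before loss, fixation, or the end of the run."""
    rng = random.Random(seed)
    count, peak = 1, 1
    for _ in range(generations):
        if count in (0, pop_size):
            break
        p = count / pop_size
        # Selection reweights the chance that a sampled offspring is mutant.
        p_sel = p * (1 + s) / (p * (1 + s) + (1 - p))
        count = sum(rng.random() < p_sel for _ in range(pop_size))
        peak = max(peak, count)
    return peak / pop_size

neutral = [peak_frequency(500, 0.0, 300, seed) for seed in range(150)]
harmful = [peak_frequency(500, -0.2, 300, seed) for seed in range(150)]
# Neutral alleles occasionally drift up to frequencies a GWAS would accept;
# strongly deleterious ones are purged while still rare.
```

Alleles with severe effects almost never linger long enough to reach the few-percent frequencies the early GWAS pipelines required, which is exactly the selection bias described above.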

A recent paper recapitulated some of these arguments in the setting of yeast genetics. The topic remains difficult to approach in humans, because rare variations are, by definition, rare, and hard to link to diseases or traits. Doing so in a clinical study requires statistical power, which arises from the number of times the linkage is seen- a catch-22, unless one can find an obscure family pedigree or a Turkish village where such a trait is rampant. In yeast, one can generate lineages of infinite size at will, and the sequencing is a breeze, with a genome 1/250 the size of ours. The only problem is that the phenotypic range of yeast is slightly impoverished compared to ours(!) Yet what variety yeast can display is often quantifiable, via growth assays. The researchers used 16 yeast strains from diverse backgrounds as parents (presumably containing a wide variety of distinctive variations), generated and sequenced 28,000 progeny, and subjected them to 38 growth conditions to elicit various phenotypes.

The major result, graphing the frequency of variations against their phenotypic effect. The effect goes up quite strongly as the frequency goes down.

These researchers claim that they can account for 73% of phenotypic variation from their genetic studies- far higher than the rate seen for any complex human trait. They see on average 120 loci affecting each trait across the study, and 12 loci affecting each trait in any one mating. Based on past work with libraries of yeast strains, they could also classify the mutations, er, variations they saw coming from these diverse parents as either common (similar to what was analyzed in the classic GWAS experiments in humans, occurring at 1% or more) or rare. Sure enough, the rarer the allele, the bigger its effect on phenotype, as the accompanying figure shows. In rough terms, the rare variants accounted for half the phenotypic variation, despite comprising only a quarter of the genetic variation.
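The standard additive model of quantitative genetics makes that bookkeeping concrete: a biallelic locus with allele frequency p and effect size a contributes about 2p(1-p)a² to trait variance. With illustrative numbers (mine, not the paper's):

```python
def additive_variance(p, effect):
    """Variance contributed by a biallelic locus under the standard
    additive model of quantitative genetics: 2p(1-p)a^2."""
    return 2.0 * p * (1.0 - p) * effect ** 2

# A common allele of small effect versus a rare allele of large effect:
common = additive_variance(p=0.30, effect=0.1)   # 2 * 0.3 * 0.7 * 0.01  = 0.0042
rare = additive_variance(p=0.005, effect=1.0)    # 2 * 0.005 * 0.995 * 1 = 0.00995

# The rare allele contributes over twice the variance, even though it is
# present in only ~1% of individuals -- too few copies for a GWAS to score.
```

This is how rare variants can carry half the phenotypic variation while making up a small share of the segregating sites: their large effect sizes enter the variance as a square, while their low frequencies penalize it only linearly.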

In an additional analysis, they compared all these variants to the corresponding sequences in a close relative of this yeast species, in order to judge which allele (the variant / mutant, or the reference / normal version) was ancestral, i.e. older. As expected, the rare variations that led to phenotypic effects were mostly of recent origin, and more so the stronger their effect.
"Strikingly, no ancient variant decreased fitness by more than 0.5 SD units, whereas 41 recent variants did."

The upshot is that to learn about the connection between genotype and phenotype, one needs significant (and typically deleterious) mutations, as geneticists have known since the time of Mendel and Morgan. Thus the use of common variants (with small effects) to analyze human syndromes and diseases has yielded very little, either medically or scientifically, while the study of rare variants has been a gold mine. And we all have numerous rare variants- they come up all the time, and are likewise selected out of existence all the time, due to their significant effects.

The scale of the experiments done here in yeast allow high precision genetic mapping. Here, one trait (growth in caffeine) is mapped against correlating genomic variations. The correlations home in on variations in the TOR1 gene, a known target of caffeine and a master regulator of cell growth and metabolism.

  • Stiglitz on neoliberalism.
  • Thoughts about Britain's (and our) first past the post voting system.
  • Economists have no idea what they are talking about- Phillips curve edition.
  • Hayek and the veneration of prices.
  • Real trial or show trial?
  • The case for Justice Roberts.
  • Winning vs success.
  • Psychotic.
  • Lead from gasoline is being remobilized by wildfires.
  • Winter has been averted.

Saturday, November 30, 2019

Metrics of Muscles

How do the microstructures of the muscle work, and how do they develop to uniform sizes?

Muscles are gaining stature of late- from humble servants for getting around, to a core metabolic organ with a part to play even in mental health. They are one of the most interesting tissues from a cell biology and cytoarchitectural standpoint, with their electrically-activated activity, and their complex and intensely regimented organization. How does that organization happen? There has been a lot of progress on this question, one example of which is a recent paper on the regulation of Z-disc formation, using flies as a model system.

A section of muscle, showing its regimented structure. Wedged in the left middle is a cell nucleus. The rest of these cells are given over to sarcomeres- the repeating structure of muscles, with dark myosin central zones, and the sharp Z-lines in the light regions that anchor actin and separate adjacent sarcomeres.

The basic repeating unit of muscle is the sarcomere. Sarcomeres occur end-to-end within myofibrils, which are bundled together into muscle fibers, which constitute single muscle cells. Those cells are then in turn bundled into fascicles, bundles, and whole muscles. The sarcomere is bounded by end-plates called Z-disks, which attach actin filaments that travel lengthwise into the sarcomere (to variable distances, depending on contraction). In the center of the sarcomere, interdigitated with the actin filaments, are myosin filaments, which look much thicker in the microscope. Myosin contains the ATP-driven motor which pulls along the actin, causing the whole sarcomere to contract. The two assemblies contact each other like two combs with interdigitated teeth.

Some molecular details of the sarcomere. Myosin is in green, actin in red. Titin is in blue, and nebulin in teal. The Z-disks are in light blue at the sides, where actin and titin attach. Note how the titin molecules extend from the Z-disks right through the myosin bundles, meeting in the middle. Titin is highly elastic, unfolding like an accordion, and also has stress sensitivity, containing a protein kinase domain (located in the central M-band region) that can transmit mechanical stress signals. The diagram at bottom shows the domain structure of nebulin, which has the significant role of metering the length of the actin bundles. It is also typical in containing various domains that interact with numerous other proteins, in addition to repetitive elements that contribute to its length.

There are over a hundred other molecules involved in this structure, but some of the more notable ones are huge structural proteins, the biggest in the genome, which provide key guides for the sizes of some sarcomeric dimensions. Nebulin is a ~800 kDa protein that wraps around the actin filaments as they are assembled out from the Z-disk, and sets the length of the actin polymer. The sizes of all the components of the sarcomere are critical, so that the actin filaments don't run into each other during contraction, the myosins don't run into the Z-disk wall, etc. Everything naturally has to be carefully engineered. Titin, in turn, is a protein of ~4,000 kDa (over 34,000 amino acids long) that is highly elastic and spans from the Z-disk, through the myosin bundles, to a pairing site at the M-line. In addition to forming the core around which the myosin motors cluster, thus determining the length of the myosin region, it appears to set the size of the whole sarcomere, and forms a spring that stores elastic force, among much else.
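As a rough sanity check on those sizes, one can use the rule-of-thumb average amino acid residue mass of ~110 daltons (the specific residue counts below follow from the figures quoted in the text, not from a sequence database):

```python
MEAN_RESIDUE_MASS_DA = 110      # rule-of-thumb average mass per residue

titin_residues = 34_000         # "over 34,000 amino acids", per the text
titin_kda = titin_residues * MEAN_RESIDUE_MASS_DA / 1_000
# ~3,740 kDa, consistent with the quoted ~4,000 kDa for titin

nebulin_kda = 800               # ~800 kDa, per the text
nebulin_residues = nebulin_kda * 1_000 / MEAN_RESIDUE_MASS_DA
# ~7,300 residues for nebulin -- still enormous by ordinary protein standards
```

The two quoted figures hang together: a 34,000-residue chain does come out near four megadaltons, and nebulin's 800 kDa implies a chain of several thousand residues.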

Many of these proteins come together at the Z-disk. Actin attaches there to alpha-actinin, a key protein that anchors actin filaments, and to numerous other proteins. One of these is ZASP, the subject of the current paper. ZASP joins the Z-disk very early, and contains a domain (PDZ) that binds alpha-actinin, plus other domains (ZM and LIM) that bind each other. To make things interesting, ZASP comes in several forms, from a couple of gene duplications and also from alternative splicing that includes or discards various exons during the processing of transcripts from these genes. In humans, ZASP has 14 exons and at least 12 differently spliced forms. Some of these forms include more or fewer of the self-interacting LIM domains. These authors figured that if the ZASP protein plays an early and guiding role in controlling Z-disk size, it may do so by arriving in its full-length, fully interlocking version early in development, and then later arriving in shorter "blocking" versions, lacking self-interacting domains, thereby terminating growth of the Z-disks.
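The authors' proposal amounts to a capping model, familiar from actin biology: growth continues while full-length, self-interlocking units join, and stops when a truncated "blocking" isoform caps the assembly. A toy version (the blocking probabilities and sample sizes are mine, for illustration only):

```python
import random

def grow_disk(p_block, rng, max_steps=100_000):
    """Toy capping model: a Z-disk assembly extends by one full-length,
    self-interlocking unit per step, until a truncated 'blocking' isoform
    (probability p_block per addition) joins and terminates growth."""
    size = 0
    while size < max_steps:
        size += 1
        if rng.random() < p_block:   # truncated form lacks the LIM interlock
            return size
    return size

rng = random.Random(0)
early = [grow_disk(0.02, rng) for _ in range(2000)]  # few blocking forms early
late = [grow_disk(0.20, rng) for _ in range(2000)]   # more blocking forms later
# Mean capped size is ~1/p_block, so raising the blocker fraction shrinks
# the disks -- mirroring the truncation-overexpression result.
```

The design choice worth noting is that size control falls out of a ratio of isoforms rather than any measuring stick: shifting splicing toward the short forms later in development is enough, in this sketch, to set and then freeze disk size.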

Overexpression of the ZASP protein (bottom panels) causes visibly larger, yet also somewhat disorganized, Z-disks in fly muscles. Note how beautifully regular the control muscle tissue is, top. Left sides show fluorescence labels for both actin and ZASP, while right sides show fluorescence only from ZASP for the same field.

The authors show (above) that overexpressing ZASP makes Z-disks grow larger and somewhat disorganized, while conversely, overexpressing truncated versions of ZASP leads to smaller Z-disks. They then show (below) that in the wild-type state, the truncated forms (from a couple of diverged gene duplicates) tend to reside at the outsides of the Z-disks, relative to the full length forms. They also show in connection with this that the truncated forms are also expressed later in development in flies, in concordance with the theory.

Images of Z-disks, end-on. These were not mutant, but are expressing fluorescently labelled ZASP proteins from the major full length form (Zasp52, c and d), or from endogenous gene duplicates that express "blocking" shortened forms (Zasp66 and Zasp67, panels in d). They claim by their merged image analysis (right) to find that full length ZASP resides with higher probability near the centers of the disks, while the shorter forms reside more towards the outsides.

Compared with what else is known (and unknown), this is a tiny step. It also raises many questions. Could gene expression be so finely controlled as to create the extremely regimented Z-disk pattern? (Unlikely.) And if so, what controls all this gene expression and alternative splicing, both in normal development and in wound repair and other times when muscle needs to be rebuilt? Such control cannot be solely time-dependent, but appears, from the regularity of the pattern, to follow some independent metric of ideal Z-disk size. There is likely far more to this story that will emerge from further analysis.

It is notable that the Z-disk is a hotbed of genes that cause myopathies of various sorts when mutated. The study of these structures, while fascinating in its own right and a window into the wonders of biology and our own bodies, is thus also informative in medical terms. While unlikely to lead to significant treatments until the advent of gene therapy, it may at least provide understanding of syndromes that might otherwise be thought of as acts of a cruel god.


Saturday, November 2, 2019

To Model is to Know

Getting signal out of the noise of living systems, by network modeling.

Biology is complex. That is as true on the molecular level as it is on the organismal and ecological levels. So despite all the physics envy, even something as elegant as the structure of DNA rapidly gets wound up in innumerable complexities as it meets the real world and needs reading, winding, cutting, packaging, synapsing, recombining, repairing, etc. This is particularly true of networks of interactions- the pathways of (typically) protein interactions that regulate what a cell is and does.

An article from several years ago discussed an interesting and influential way to learn about these interactions. The advent of "big data" in biology allowed us to do things like tabulate all the interactions of individual proteins in a cell, or sample the abundance of every transcript or protein in cells of a tissue. But it turned out that these alone did not lead directly to the elucidation of how things work. Where the genome was a parts-ordering list, offering one catalog number and description for each part, these experiments provided the actual parts list- how many screws, how many manifolds, how many fans, etc., and occasionally, what plugs into what else. These were all big steps ahead, but hardly enough to figure out how complicated machinery works. We still lack the blueprint, one that ideally is animated to show how the machine runs. But that is never going to happen unless we build it ourselves. We need to build a model, and we need more information to do so.

These authors added one more dimension to the equation- time, via engineered perturbations to the system. Geneticists and biochemists have been doing this (aka experiments- such as mutations, reconstitutions, and titrations) forever, including in gene expression panels and other omics data collections. But employing a perturbation method in a systematic and informative way on the big data level in biology remains a significant advance. The problem is called network inference- figuring out how a complex system works, (which is to say, making a model), now that we are given some but not all important information of its composition and activities. And the problem is difficult because biological networks are frequently very large and can be modeled in an astronomical number of ways, given that we have scanty information about key internal aspects. For instance, even if many individual interactions are known from experimental data, not all are known, many key conditions (like tissue, cell type, phase of the cell cycle, local nutrient conditions etc.) are unknown, and quantitative data is very rare.

One way to get around some of these specifics is to poke the system somewhere and track what happens thereafter. It is a lot like epistasis analysis in genetics, where if you, say, mutate a set of genes acting in a linear process, the ones towards the end cannot be cured by supplying chemical intermediates that are made upstream- the later genes are "epistatic" to those earlier in the process. Such logic needs to be expanded exponentially to address inference over realistic biological networks, which gets the authors into some abstruse statistics and mathematics. Their goal is to simplify the modeling and search problem enough to make it computationally tractable, while still exploring the most promising parts of the parameter space to come up with reasonably accurate models. They also seek to iterate- to bring in new perturbation information and use it to update the model. The first step is to discretize the parameters rather than exploring continuous space. The second is to use preliminary calculations to get near-optimal values for the model parameters, and thereafter explore those approximate local neighborhoods rather than all possible space.
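The tractability trick described above- discretize the parameters, start from a cheap preliminary estimate, then search only its local neighborhood- can be sketched in a few lines of Python. Everything here (the toy three-edge network, the observed values, the fit function) is invented for illustration; it is not the paper's actual model or algorithm.

```python
import itertools

# Toy "model fit" objective: squared error between candidate edge weights
# and observed interaction strengths for a hypothetical 3-node network.
OBSERVED = {(0, 1): 0.8, (1, 2): -0.4, (0, 2): 0.3}

def fit_error(weights):
    """Sum of squared deviations from the observed interaction strengths."""
    return sum((weights[e] - OBSERVED[e]) ** 2 for e in OBSERVED)

# Step 1: discretize -- each edge weight is restricted to a small grid,
# collapsing a continuous search space into a finite one.
GRID = [-1.0, -0.5, 0.0, 0.5, 1.0]

# Step 2: a cheap preliminary estimate (here: snap each observation to
# the nearest grid point) gives a starting point near the optimum.
def preliminary(obs):
    return min(GRID, key=lambda g: abs(g - obs))

start = {e: preliminary(OBSERVED[e]) for e in OBSERVED}

# Step 3: explore only the neighborhood of the preliminary estimate
# (adjacent grid points), not the full exponential grid.
def local_candidates(w):
    i = GRID.index(w)
    return GRID[max(0, i - 1):i + 2]

best, best_err = start, fit_error(start)
for combo in itertools.product(*(local_candidates(start[e]) for e in OBSERVED)):
    cand = dict(zip(OBSERVED, combo))
    err = fit_error(cand)
    if err < best_err:
        best, best_err = cand, err

print(best, round(best_err, 3))
```

The point of the sketch is the combinatorics: the local search above evaluates 18 candidates instead of the 125 in the full grid, and that gap widens astronomically as networks grow to realistic size.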

Speed of this article's method (teal) compared with a conventional method for the same task (green).

All of this is repeated for experimental cases, where some perturbation has been introduced into the real system, generating new data for input to the model. Their improved speed of calculation is critical here, enabling as many iterations and alterations as needed to update and refine the model. If the model is correct, it will respond accurately to the alteration, giving the output that is also observed in real life. Such a model then makes it possible to perform virtual perturbations, such as simulating the effect of a drug on the model, which then predicts entirely in silico what the effects will be on the biological network.
"It is also useful, as an exercise, to evaluate the overall performance of the BP algorithm on data sets engineered from completely known networks. With such toy datasets we achieve the following: (i) demonstrate that BP converges quickly and correctly; (ii) compare BP network models to a known data-generating network; and (iii) evaluate performance in biologically realistic conditions of noisy data from sparse perturbations."
...
"Each model is then a set of differential equations describing the behavior of the system in response to perturbations."
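The quoted idea- a model as a set of differential equations whose solution responds to perturbations- can be illustrated with a minimal sketch. The three-node cascade, interaction weights, and the tanh response function below are all assumptions made up for this example, not the paper's melanoma network.

```python
import numpy as np

# Hypothetical 3-node signaling cascade: node activities x evolve as
#   dx/dt = tanh(W @ x + u) - x
# where W holds (inferred) interaction strengths and u is an external
# perturbation, e.g. a stimulus or an inhibitory drug.
W = np.array([[0.0, 0.0, 0.0],    # node 0: receives only external input
              [1.5, 0.0, 0.0],    # node 1: activated by node 0
              [0.0, 1.2, 0.0]])   # node 2: activated by node 1

def steady_state(u, steps=2000, dt=0.05):
    """Integrate by forward Euler until (approximately) steady state."""
    x = np.zeros(3)
    for _ in range(steps):
        x += dt * (np.tanh(W @ x + u) - x)
    return x

baseline = steady_state(u=np.array([1.0, 0.0, 0.0]))    # stimulus on node 0
perturbed = steady_state(u=np.array([1.0, -5.0, 0.0]))  # plus a drug inhibiting node 1

# The virtual perturbation propagates downstream: node 2's activity
# collapses even though the drug never touched it directly.
print(baseline.round(2), perturbed.round(2))
```

This is exactly the kind of in silico experiment the authors describe: once the equations are fit, a drug's network-wide effect can be predicted before any cells are treated.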

The upshot of all this is models that are roughly correct, and influential on later work. The figure below shows (A) the smattering of false positives and missing interactions (false negatives), but (B) accounts for most of this error as shortcuts of various kinds- inferences of regulation that are globally correct but may be missing a step here or there. So they suggest that the scoring is actually better than the roughly 50-70% correct rate that they report.

An example pair of interconnecting pathways inferred from experimental protein abundance data and perturbed abundance data, with the protein molecules as nodes and their interactions as arrows. Where would a drug have the most effect if it inhibited one of these proteins?

They offer one pathway as an example, with an inferred pattern of activity (above), and a few predictions about what proteins would be good drug targets. For example, PLK1 in this diagram is a key node, with dramatic effects if perturbed. This came up automatically from their analysis, but PLK1 happens already to be an anticancer drug target, with two drugs under development. Any biologist in the field could have told them about this target, but they went ahead with proof-of-principle experiments. Indeed, treating their RAF-inhibitor-resistant melanoma cells (the subject of the modeling) with an experimental anti-PLK1 drug resulted in dramatic cell killing at quite low concentrations. They had used other drugs as perturbation agents in the physical experiments that developed this model, but not this one, so at least in their terms this is a novel finding, arrived at through their modeling work.
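The logic of identifying a "key node" like PLK1 can be sketched as a crude in silico screen: knock out each node in turn and measure how much of the network goes dark. The graph below is hypothetical (node names and edges invented for illustration; "PLK1-like" merely echoes the target discussed above and is not the paper's network).

```python
# Crude virtual "drug screen" on a hypothetical signaling graph:
# knock out each node and count how many nodes lose their path from
# the input. High-impact nodes are candidate drug targets.
EDGES = {
    "input": ["A", "PLK1-like"],
    "A": ["B"],
    "PLK1-like": ["B", "C", "D"],
    "B": ["E"],
    "C": ["E"],
    "D": [],
    "E": [],
}

def reachable(graph, start, removed=None):
    """Depth-first search: nodes reachable from start, with one knocked out."""
    seen, stack = set(), [start]
    while stack:
        n = stack.pop()
        if n in seen or n == removed:
            continue
        seen.add(n)
        stack.extend(graph.get(n, []))
    return seen

full = reachable(EDGES, "input")
impact = {n: len(full) - len(reachable(EDGES, "input", removed=n))
          for n in full if n != "input"}
ranking = sorted(impact, key=impact.get, reverse=True)
print(ranking[0], impact)
```

In a real model, impact would be scored on quantitative activities (as in the differential-equation framework above) rather than simple reachability, but the ranking principle is the same.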

Given that these authors were working from scratch, not starting with manually curated pathway models that incorporate many known individual interactions, this is impressive work (and they note parenthetically that using such curated data would be a big help to their modeling). Having computationally tractable ways to generate and refine large molecular-network models from typical experimentation is a recipe for advancement in the field- I hope these tools become more popular.



  • Ruminations on PGE. Minimally, the PUC needs to be publicly elected. Maximally, the state needs to take over PGE entirely and take responsibility.
  • Study on the effect of automation on labor power ... which is minor.
  • What has happened to the Supreme Court?
  • Will bribery help?