Recent research shows that the difference between humans and mice arises mostly from differences in the DNA of transcriptional control regions.
It has long been known that we are genetically not very different from other mammals: 98% identical to chimpanzees, 85% identical to mice. Indeed, there are many genes that we share to a recognizable extent with bacteria. So how do the evident differences in our characters come about? Experts in evolution and molecular biology have long suspected that differences in encoded proteins are less likely to be the leading cause than differences in how those proteins are controlled. A new piece of work illustrates this concept nicely.
The article by Wilson et al. in Science (with a review; subscription needed for full access) demonstrates what happens when you place human chromosome #21 into a mouse and ask whether its patterns of gene control and expression resemble those of the similar regions of the mouse genome (mostly chromosome #16), those of the human chromosome in human cells, or neither.
If mouse proteins have changed significantly from human proteins in the ~80 to 100 million years since our divergence, then a human chromosome placed in mouse cells should show a unique pattern of proteins binding to it, certainly different from that in human cells, and probably different from the pattern on the similar (homologous) mouse genes. Recall that DNA is inert in cells until proteins bind to it: proteins that locate genes and tell those genes to turn on or off at specific times and places. We already know that the DNA sequences in between genes are far more variable over evolutionary time than the DNA in the protein-coding regions of genes. That is a key empirical finding of bioinformatics. Coding regions are far denser with information, and changing a protein sequence is more damaging (in selective terms) than tinkering with upstream control regions, let alone non-gene regions that neither code for anything nor control the expression of anything.
What the authors found was that the pattern of protein binding to the human DNA in mouse cells was ~90% the same as it was in human cells. Likewise, the pattern of gene expression was highly correlated (R~0.9) between the same chromosome in mouse cells (red graphs in the drawing, from Coller and Kruglyak) and in human cells (purple graph). The message is that the mouse proteins that bind to human chromosome 21 and control its gene expression have changed very little in the tens of millions of years since our divergence from mice. Rather, the differences between us arise from differences in the control DNA itself, whose pattern of control recapitulates roughly as well in mouse cells as in human cells when directed by the human DNA.
DNA sequence patterns used for gene control are very distinct from those used to code for proteins. The coding region is a linear triplet progression of codons, each coding for one amino acid. If any one is out of register, the whole resulting protein sequence is thrown off, since ribosomes read off the code in strict three-by-three steps. And if the identity of one codon is off, the function of the resulting protein in whatever it does may be changed, often disastrously.
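To make the register problem concrete, here is a minimal Python sketch that translates a short coding sequence and then compares a single-base substitution with a single-base deletion. The sequence is made up purely for illustration and the function name `translate` is hypothetical; the codon table is the standard genetic code, built from its usual compact TCAG ordering.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG order
# (first base, then second, then third). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Read the DNA three bases at a time, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# A made-up coding sequence (illustrative only, not a real gene).
original = "ATGGCCAAGCTGGAATTCTGA"

# Substitution: change one base in the third codon (AAG -> GAG).
substituted = original[:6] + "G" + original[7:]

# Deletion: drop one base from the second codon, shifting the reading frame.
frameshifted = original[:3] + original[4:]

print(translate(original))      # MAKLEF
print(translate(substituted))   # MAELEF
print(translate(frameshifted))  # MPSWNS
```

The substitution changes a single amino acid, while the deletion changes every amino acid downstream of it and, in this example, also reads past the original stop codon.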
In contrast, DNA sequences used to control genes (typically within a few thousand bases of the coding sequence) are small, degenerate, and modular. They are typically only six to ten bases long, like CCCAGCCCC, which binds the famous regulatory protein SP1. Variations are common (indeed, it is often extremely difficult to determine what the optimal binding sequence for such proteins actually is) and have subtle effects on the binding and activity of the regulatory protein. These sequences (also called binding sites) can also be relocated, mixed, and matched in the gene control region (typically upstream relative to the coding region) with relatively little effect. One gene is often regulated by multiple control regions, each composed of several individual binding sites and each with a different role, such as activating the gene in separate organs or at different times of development.
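A rough way to picture this degeneracy is a scan that accepts near matches to a binding site as well as exact ones. The Python sketch below is only an illustration under loud assumptions: the upstream sequence is invented, the function name `find_sites` is hypothetical, and counting mismatches against one consensus string is a crude stand-in for how binding preferences are actually modeled (usually with weighted, position-specific scores, and on both DNA strands).

```python
def find_sites(sequence, consensus, max_mismatches=1):
    """Return (position, matched text, mismatch count) for every window
    of the sequence that differs from the consensus by at most
    max_mismatches bases. Only the given strand is scanned."""
    hits = []
    width = len(consensus)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = sum(a != b for a, b in zip(window, consensus))
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# The site quoted in the text above.
CONSENSUS = "CCCAGCCCC"

# A made-up upstream region containing one exact site and one variant.
upstream = "TTGACCCAGCCCCATTAGGCCCTGCCCCAATG"

for position, site, mismatches in find_sites(upstream, CONSENSUS):
    print(position, site, mismatches)
# 4 CCCAGCCCC 0
# 19 CCCTGCCCC 1
```

Unlike the coding-sequence case, a one-base change here does not destroy the site; under this toy model it simply becomes a slightly worse match, and the two sites could sit at other positions in the region and still be found.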
The upshot is that gene control regions are eminently "evolvable". Duplications of control regions have minimal immediate effects and allow the generation of new patterns of control. Alterations of individual binding sites or alterations of site arrangements are likely to alter only a small aspect of gene expression, such as in one stage of development or an uptick in amount produced, in contrast to protein mutations, which affect the action of the encoded protein at all times and everywhere.
It was already well known that most proteins are very well conserved between mouse and man. Indeed, of the 25,000 or so proteins encoded by each species, only about 100 to 200 fail to have detectable homologs in the other species. It is routine to express proteins from one species (human) in the other (mouse) to study their native function, and indeed to express specifically mutated forms to create models of human diseases in transgenic mice. What was not fully appreciated was the scale of conservation, such that these authors find that huge swaths of one human chromosome are handled in mouse cells essentially as they would be in human cells.
A metaphor for the genome might be a giant pipe organ, where each gene is a key. Over evolutionary time, the keys change very little, but the music played changes more dramatically, programmed as it is by the highly mutable control elements.
This picture indicates that solving the very difficult problem of predicting gene control from known DNA sequence (given knowledge about the binding preferences of control proteins and their activities in regulating gene expression) is even more important than previously suspected, since it would not only allow us to model the gene control circuitry of cells and organisms accurately, but would allow us to model evolutionary history with unprecedented detail and insight as well.