There is something dissatisfying about our current explanations of how the genome exerts its effects on the cell. This is particularly true of the non-protein-coding regulatory regions of the genome, which, as we all know, make up a substantially larger fraction of the genome than those DNA sequences that encode proteins.
So what is it that we don’t understand? Rather than give a wordy and abstract explanation, let’s go with an analogy: our poor understanding of how the genome operates is like my poor understanding of how a CD player works.
Let’s start with what I do know about CD players (with a little help from Wikipedia, which I hate but still refer to dozens of times per day). The data on a CD is encoded as little pits in a polycarbonate surface. Behind the polycarbonate surface is the shiny layer of the CD, and so the pattern of pits can be scanned by using a photodiode to detect laser light that is reflected off the CD. The pits change how the light is reflected, which changes the electrical signal that is emitted by the photodiode. Those output electrical signals are amplified, passed to a loudspeaker and finally to my ears and slightly buzzed brain. (Obviously I’m talking about listening to music after work.)
I’ve just listed the parts of my CD player and stereo system, and roughly outlined the sequence of events that occurs as the signal is read from the CD and passed to the speaker. But at a deep level, this doesn’t really tell me how information on a CD gets converted into sound. I really have no idea how realistic, high-fidelity musical sounds emerge from a certain pattern of etchings in plastic that are converted into a very specific pattern of electrical signals, which in turn are transformed into a series of vibrations of a plastic cone that looks nothing like a guitar, banjo, or the human vocal apparatus. If you gave me a laser, a photodiode, some electrical wiring, and a plastic cone, I could go into my garage and rig up something that looked like a CD player, but if it produced any sound at all, it would just be noise.
Why? Because I don’t know anything about the logical principles of Pulse Code Modulation, which is a way of digitally representing analog signals such as sound waves. Knowing the parts of the CD player, and even knowing the sequence of events involved in reading a CD, tells me nothing about the underlying logical principles by which the system operates. Pulse Code Modulation is the more fundamental, principled answer to the question of how CD players and other audio systems work.
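For the curious, here is a minimal sketch of the basic idea behind Pulse Code Modulation: measure the sound wave at regular intervals, then round each measurement to the nearest integer code. The function names (`pcm_encode`, `pcm_decode`, `tone`) and the 440 Hz test tone are my own illustrative choices, not part of any real audio library. The 44,100 samples per second and 16 bits per sample match the CD audio format, but everything else a real CD does (error correction, stereo channels, and so on) is left out.

```python
import math

SAMPLE_RATE = 44100  # samples per second, as used for CD audio
BITS = 16            # bits per sample, as used for CD audio


def tone(t):
    """Hypothetical 'analog' input: a 440 Hz sine wave with amplitude 1."""
    return math.sin(2 * math.pi * 440.0 * t)


def pcm_encode(signal, sample_rate, num_samples, bits):
    """Sample the signal at regular intervals and round each sample to an integer code."""
    max_code = 2 ** (bits - 1) - 1  # largest positive code for a signed integer
    return [round(signal(i / sample_rate) * max_code) for i in range(num_samples)]


def pcm_decode(codes, bits):
    """Map the integer codes back to amplitudes between -1 and 1."""
    max_code = 2 ** (bits - 1) - 1
    return [c / max_code for c in codes]


# Encode 441 samples (one hundredth of a second) and decode them again.
codes = pcm_encode(tone, SAMPLE_RATE, 441, BITS)
decoded = pcm_decode(codes, BITS)

# The only information lost at the sampled points is the rounding error.
max_error = max(abs(tone(i / SAMPLE_RATE) - d) for i, d in enumerate(decoded))
print(codes[:5])
print(max_error)
```

The list of integers is the kind of thing the pits on the disc ultimately represent. The point of the sketch is that the parts list never mentions this step, yet without it the pits mean nothing.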
This is like our current understanding of how the genome works. We know many of the parts, and we know the sequence of events that occurs during gene expression, but we’re missing the underlying logical principles, what we sometimes call the cis-regulatory logic of gene expression.
Here are three big questions about genome operation that still largely remain unanswered for eukaryotic genomes:
1) How does a functional sequence of non-coding DNA compete for regulatory proteins with the incredibly vast excess of non-functional sites in the genome? Or in other words, what physical and informational properties make a sequence functional? The DNA sequence specificity of regulatory proteins does not seem to be sufficient to explain gene regulation. Maybe the answer is nucleosomes, combinatorial binding, something else, or all of the above.
2) What is the regulatory grammar of transcription factor binding sites? Certain combinations of transcription factor sites will give you one gene expression pattern, while a slightly different combination can give you something dramatically different. Why? What aspect of the machinery of transcription is being affected by these different combinations?
3) How does an individual genome or cell cope with stochastic fluctuations and small numbers of molecules? Our ordinary, macroscopic expectations of thermodynamics and kinetics, those staples of our chemistry classes, don’t quite apply when we’re looking at the behavior of just a few molecules. How is information encoded in the genome to cope with this challenge?
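To give a feel for the small-numbers problem in question 3, here is a toy simulation under a loudly stated assumption: when a cell divides, each copy of some molecule lands in a given daughter cell with probability one half, independently of the others. The function name `daughter_count_cv`, the molecule counts, and the number of trials are all hypothetical choices for illustration, not a model of any particular gene or organism.

```python
import math
import random
import statistics


def daughter_count_cv(num_molecules, num_trials, rng):
    """Relative spread (standard deviation / mean) of the number of molecules
    one daughter cell inherits, assuming each molecule goes to that daughter
    with probability 0.5 (a toy assumption)."""
    counts = [
        sum(rng.random() < 0.5 for _ in range(num_molecules))
        for _ in range(num_trials)
    ]
    return statistics.pstdev(counts) / statistics.mean(counts)


rng = random.Random(0)  # fixed seed so the run is repeatable
cv_small = daughter_count_cv(10, 2000, rng)    # a handful of molecules
cv_large = daughter_count_cv(1000, 2000, rng)  # a more 'macroscopic' number

# For this toy model the expected relative spread is 1 / sqrt(num_molecules).
print(cv_small, 1 / math.sqrt(10))
print(cv_large, 1 / math.sqrt(1000))
```

In this sketch the relative spread with ten molecules is roughly ten times larger than with a thousand, which is the sense in which our bulk chemistry intuitions stop applying when only a few molecules are involved.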
All of these questions reflect the fact that we’re missing an underlying physical and logical model of genome operation, the biological equivalent of Pulse Code Modulation. Until we figure this out, our ability to predict the consequences when our genomes go awry, or to engineer genomes to do new and useful things, will be limited.