The complexity of the machinery by which our cells run is so extreme that one of the key questions in biological research is, why doesn’t the whole thing just collapse like a house of cards in a tornado? Another way of phrasing this question is to ask, where does the information come from to keep everything running smoothly?
Consider this: the crucial task of gene regulation is carried out in large part by transcription factors, regulatory proteins that recognize and bind very short, degenerate DNA sequences located somewhere in the rough (sometimes very rough) vicinity of genes. Once they bind, transcription factors recruit the machinery that activates their target genes. (You can also have transcription factors that repress target genes.) This is all fine, until you consider the fact that a human transcription factor has to find its target sequences among the 3 billion base pairs of the human genome. Some plant and fish transcription factors have to search through genomes of more than 100 billion base pairs. So the question is, why don’t transcription factors get lost? Where are they asking for directions?
On finding needles in the genomic haystack
Let’s run some numbers to get a feel for just how bad this information problem is. A typical mammalian cell has anywhere from a few thousand to a few hundred thousand copies of any given transcription factor. (Check out Bionumbers and type in p53 for an example.) Those thousands of protein molecules need to find the regulatory sequences for roughly a few dozen to a few hundred (or, very rarely, a few thousand) target genes. In other words, those proteins have to find a few specific sites among the roughly 3 billion non-specific sites in the genome. That’s a serious needle-in-the-haystack problem.
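To make the odds concrete, here is a minimal back-of-envelope sketch. The specific numbers (10,000 protein copies, 300 functional target sites) are illustrative assumptions chosen from the ranges quoted above, not measured values:

```python
# Back-of-envelope arithmetic for the transcription factor search problem.
# All numbers are rough, order-of-magnitude assumptions from the text.

genome_bp = 3e9       # human genome size in base pairs
tf_copies = 10_000    # assumed copies of one TF per nucleus (thousands to ~100k)
target_sites = 300    # assumed functional target sites (dozens to a few thousand)

# Essentially every position in the genome is a potential non-specific
# binding site, so the ratio of wrong places to right places is enormous:
ratio = genome_bp / target_sites
print(f"non-specific : specific sites ~ {ratio:,.0f} : 1")  # ~10,000,000 : 1
```

Ten million decoy positions for every real one – and that is before we even ask how tightly the protein binds each kind of site.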
The result is that the vast majority of transcription factor molecules are very probably bound non-specifically. Harvard biochemist Kevin Struhl estimates that, in the relatively small yeast genome, 90% of RNA polymerase (part of the gene expression machinery) is bound to non-specific sites (“Transcriptional noise and the fidelity of initiation by RNA polymerase II”). And this happens even though RNA polymerase is pretty damn good at recognizing specific sites – it is 10,000 times more likely to bind a specific site than a non-specific one. Think about how accurate that is: an individual protein with that kind of specificity rejects a non-specific site 99.99% of the time it encounters one, and still 90% of the protein population ends up bound non-specifically. The genome is like a giant sponge that non-specifically sucks up transcription factor molecules. The problem only gets worse as genomes get larger.
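The "genomic sponge" effect falls straight out of a simple equilibrium weighting. This sketch assumes a yeast-sized genome, a couple hundred true target sites (an illustrative guess), and the 10,000-fold specificity quoted above; it is not a reproduction of Struhl's actual calculation:

```python
# Sketch of the "genomic sponge" arithmetic: even a highly specific protein
# ends up mostly non-specifically bound, because non-specific sites are so
# numerous. The site counts below are illustrative assumptions.

nonspecific_sites = 1.2e7   # roughly the yeast genome, in base pairs
specific_sites = 200        # assumed number of true target sites
specificity = 1e4           # relative binding weight, specific vs. non-specific

# Each site contributes its binding weight to the equilibrium:
specific_weight = specific_sites * specificity
nonspecific_weight = nonspecific_sites * 1.0

frac_nonspecific = nonspecific_weight / (specific_weight + nonspecific_weight)
print(f"fraction bound non-specifically ~ {frac_nonspecific:.0%}")
```

With these assumed numbers the non-specific fraction comes out around 85% – the same ballpark as the ~90% estimate, despite the 10,000-fold specificity. Scale the genome up to 3 billion base pairs and the sponge soaks up nearly everything.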
There is more bad news. So far we’ve been discussing how a protein discriminates between specific and non-specific sites. But there are also too many specific sites. In a three billion base pair genome, there are millions of very good matches to the recognition sequence of any given transcription factor with an 8 base pair recognition site. Of those millions, only a few thousand, at the very most, will have any functional role in gene regulation. So how does a transcription factor distinguish between the right specific sequence and the wrong specific sequence? On its face, this is a serious information problem: there simply is not enough sequence information in the genome for gene regulation to work. (No, this is not an invitation to invoke intelligent design creationism as the answer.)
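Where do those millions come from? The expected number of chance matches to an exact k-mer is about L / 4^k, and the degeneracy of real recognition sites multiplies that further. The degeneracy factor below (two tolerated bases at five of the eight positions) is an illustrative assumption:

```python
# Expected chance occurrences of a short recognition sequence in a big genome.
# For an exact k-mer: about L / 4**k matches per strand.

genome_bp = 3e9
k = 8
exact_per_strand = genome_bp / 4**k
both_strands = 2 * exact_per_strand
print(f"exact 8-mer matches: ~{both_strands:,.0f}")  # roughly 90,000

# Real TF sites are degenerate. A motif that tolerates, say, 2 bases at
# 5 of its 8 positions (an assumed, illustrative degeneracy) matches
# 2**5 = 32 times as many sequences:
degenerate_matches = both_strands * 2**5
print(f"degenerate-motif matches: ~{degenerate_matches:,.0f}")  # ~3 million
```

So a genome-sized stretch of random sequence already contains millions of "perfectly good" binding sites, of which only a tiny fraction can be functional.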
Bacteria solve the problem by being small
How do bacteria solve this problem? In the 1980s, Otto Berg and Peter von Hippel wrote a series of papers addressing this problem (start with this one). In bacteria, genomes are small, transcription factor recognition sites tend to be large, and specific recognition sites occur very rarely in the genome. Basically, every specific recognition site has a high probability of being a functional one. So it all works out very nicely – the haystack is small, the needle is large, and there are no fake copies of the needle lying around. This solution is impossible in eukaryotes, which means that eukaryotes must have a fundamentally different approach to the genome information problem. The answer almost certainly involves chromatin structure, and probably spatial organization in the nucleus.
Levinthal’s paradox in the genome
This kind of information problem is not an unfamiliar one in biology. If you’ve taken a biochemistry course, you’ve probably heard of Levinthal’s paradox. Proteins are synthesized as linear chains of amino acids; these chains have to fold up into the correct three-dimensional structure. And they have to do it quickly. The problem is that there are too many degrees of freedom – an unfolded chain of amino acids would have to sample a much-more-than-astronomical number of 3D configurations before hitting on the correct one, which obviously can’t happen within the very short time it takes proteins to fold. It turns out that proteins fold quickly not by sampling all possible states, but by moving along defined folding pathways that involve intermediate, lower-energy states. These are essentially way stations on the way to the final destination.
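Just how "much-more-than-astronomical" is the search? A standard classroom version of Levinthal's estimate – using the common simplifying assumptions of ~100 residues, ~3 accessible conformations per residue, and a generous sampling rate of 10^13 conformations per second – runs like this:

```python
import math

# Levinthal's back-of-envelope, with standard illustrative assumptions.
residues = 100
conformations = 3 ** residues   # ~5e47 possible chain shapes
rate = 1e13                     # assumed conformations sampled per second

seconds = conformations / rate
years = seconds / (3600 * 24 * 365)
print(f"~10^{math.log10(conformations):.0f} conformations; "
      f"exhaustive search would take ~10^{math.log10(years):.0f} years")
```

The exhaustive search would take on the order of 10^27 years, yet real proteins fold in milliseconds to seconds – which is exactly why folding pathways, not random sampling, must be doing the work.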
We have a Levinthal’s paradox in the genome. We need to identify the other sources of information that guide transcription factors to their targets. Given sequence information alone, it should be impossible for transcription factors to find their destinations in a large genome, and yet they do. Gene regulation shouldn’t work, and yet it does.