How to find your way in E. coli without stopping for directions

One of the keys to success in life is to regulate your genes properly. Genes are regulated by transcription factor proteins, which have to navigate their way around the genome and bind particular DNA targets. The problem is that there are only a few correct targets and the genome is large. So an obvious question is, why don’t transcription factors get lost? Do they stop and ask for directions? Where is the information for genome navigation coming from?

The answer to this question is still being worked out for eukaryotes, but it has been solved for E. coli. Peter von Hippel and Otto Berg largely figured out the answer in their classic 1986 paper “On the specificity of DNA-protein interactions.” E. coli’s solution for making gene regulation manageable is simple and elegant, because this bacterium has the virtue of possessing a small genome. Let’s take a look at how genome navigation works in a bacterium:

The lac operon
Following von Hippel and Berg, we’ll take as our example everyone’s favorite test case in E. coli: the lac operon. The lac operon (or at least the part of the operon that we’re concerned with) is a set of three genes strung together, all regulated by a stretch of non-protein-coding DNA immediately upstream of the coding genes. The three lac operon genes are crucial for metabolizing lactose, and so E. coli’s gene regulation task is this:

1) In the presence of lactose, turn these genes on.
2) In the absence of lactose, keep these genes off.

Simple, eh?

The machinery by which this works is fairly straightforward as far as molecular machines go. In the absence of lactose, the lac repressor protein binds to a particular stretch of DNA in the regulatory region near the lac genes and prevents these genes from being expressed by keeping the gene activation machinery off the DNA. In the presence of lactose, the repressor finds itself bent out of shape around a lactose molecule, and is thus rendered incapable of binding DNA. The lac genes are then transcribed, because the repressor is no longer blocking access to the DNA.

In other words, gene regulation depends on whether the regulatory DNA site is bound by repressor molecules. For the visually oriented, the following is my standard figure for illustrating a repressor-regulated gene, boiled down to its essentials (ORF stands for open reading frame, the coding region of the gene):

May the odds be ever in your favor
Our task is to figure out how the repressor manages to home in on its target and keep the lac operon shut off in the absence of lactose. How does the repressor distinguish its correct binding site in a genome that is about 4.6 million base pairs long, and thus has about 4.6 million potential incorrect binding sites? Finding a needle in a haystack is literally orders of magnitude easier (PDF).

Defining the task more specifically, von Hippel and Berg begin by noting just how good the lac repressor is at navigating the genome. In the fully repressed state, the odds that the correct target site is bound by repressor are 1000 to 1. In other words, the probability that the operon is bound by repressor is 1000 times the probability that the operon is unbound. That’s fairly good repression, and that’s the number we have to explain: 1000, which von Hippel and Berg denote as x.
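To make the odds concrete, here is a minimal sketch that converts odds of bound to unbound into a probability of being bound. The function name `bound_probability` is my own, not something from the paper.

```python
def bound_probability(odds):
    """Convert bound:unbound odds into the probability of being bound."""
    return odds / (1.0 + odds)

x = 1000.0  # the repression odds quoted above
p_bound = bound_probability(x)
print(f"P(bound) = {p_bound:.4f}, P(unbound) = {1 - p_bound:.4f}")
```

Odds of 1000 to 1 mean the operator is occupied about 99.9% of the time.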

Next, von Hippel and Berg derive the following (approximate) equation:

$x \approx \frac{K_s [R_T - D_s - D_{ps}]}{1 + K_{ns}D_{ns}}$

I’ve already told you what x is. $K_s$ is the affinity of lac repressor for its target site, $R_T$ is the total repressor concentration, $D_s$ is the concentration of correct target sites, $D_{ps}$ is the concentration of pseudosites (imitators of a correct site that have occurred by chance in the genome), $K_{ns}$ is the non-specific affinity of repressor for your average stretch of non-target DNA, and $D_{ns}$ is the concentration of incorrect, non-specific sites.
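The equation is easy to play with in code. The sketch below is a direct transcription of it; the function name and every numeric value are placeholders I picked to show the trend, not numbers from von Hippel and Berg. Affinities are assumed to be in 1/M and concentrations in M.

```python
def repression_odds(K_s, R_T, D_s, D_ps, K_ns, D_ns):
    """Approximate odds x that the target site is bound.

    Affinities K_s and K_ns are in 1/M; all concentrations are in M.
    """
    available = R_T - D_s - D_ps      # bracketed term in the numerator
    competition = 1.0 + K_ns * D_ns   # denominator: non-specific binding
    return K_s * available / competition

# Placeholder values chosen only to illustrate the trend (assumptions).
params = dict(K_s=1e12, R_T=2e-8, D_s=2e-9, D_ps=0.0, D_ns=1e-2)
for K_ns in (0.0, 1e2, 1e3, 1e4):
    x = repression_odds(K_ns=K_ns, **params)
    print(f"K_ns = {K_ns:8.0e} /M  ->  x = {x:10.1f}")
```

With everything else held fixed, raising the non-specific affinity inflates the denominator and drags x down.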

Do the math
This simple thermodynamic equation is pretty easy to parse. We’ll start with the numerator. Obviously, $K_s$ should be pretty high, so that the numerator becomes much greater than the denominator. $D_s$ by default will be small, since there is only one correct target site, but $D_{ps}$ could be a problem in a large genome: the bigger your genome, the more likely it is that imitation target sites will crop up by chance. Mammalian genomes have hundreds of thousands or more of these imitation sites, but fortunately for E. coli, it has fewer than 5. This is partly due to small genome size, but also possibly because bacterial transcription factors seem to recognize longer, less degenerate sites than eukaryotic transcription factors.
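To see why both genome size and site length matter, here is a toy calculation. It assumes a random genome with independent, equally likely bases, and it counts only exact matches to one fixed sequence on either strand. Real pseudosites include near matches, so real counts will be much higher. The function name and the 3-billion-bp comparison genome are illustrative assumptions.

```python
def expected_chance_sites(genome_bp, site_length):
    """Expected exact matches to one fixed site in random DNA.

    Assumes independent, equally likely bases and counts both strands.
    """
    start_positions = 2 * genome_bp  # a site can start on either strand
    return start_positions * 0.25 ** site_length

for name, size in (("E. coli", 4.6e6), ("3-billion-bp genome", 3e9)):
    for length in (8, 12, 16):
        n = expected_chance_sites(size, length)
        print(f"{name:>20}, {length:2d}-bp site: {n:12.3f} expected")
```

In this toy model the expected count grows linearly with genome size and falls fourfold with every extra base of specificity, which is the qualitative point: a small genome and a long recognition site both keep imitators rare.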

And now for the denominator: again, small genomes help here, because $D_{ns}$, the concentration of non-specific sites, depends simply on the number of base pairs in the genome (any given base pair could be the start of a new non-specific site). Fortunately for E. coli, the small genome proves its utility again.
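For a rough feel of the magnitude, a count per cell can be turned into a molar concentration. The sketch below assumes a cell volume of about 1 femtoliter, which is my assumption and not a figure from the paper; the helper name is also hypothetical.

```python
AVOGADRO = 6.02214076e23  # molecules per mole

def molar_concentration(count, cell_volume_liters):
    """Convert a per-cell count into a concentration in mol/L."""
    return count / (AVOGADRO * cell_volume_liters)

CELL_VOLUME_L = 1e-15  # assumed ~1 femtoliter cell (not from the source)
GENOME_BP = 4.6e6      # one potential non-specific site per base pair

D_ns = molar_concentration(GENOME_BP, CELL_VOLUME_L)
print(f"D_ns ~ {D_ns:.1e} M")
```

Under that assumed volume, one molecule per cell is about 1.7 nM, so 4.6 million sites come out in the millimolar range. A genome a thousand times larger would push this term up a thousandfold.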

With reasonable biological values for protein-DNA affinities, it is easy for the numerator to stay large and the denominator to stay small, which is just what you want in order to arrive easily at your genome destination. von Hippel and Berg crunch the numbers and show that, from simple thermodynamic considerations, 10-20 copies of the lac repressor (roughly 10 nM), with a not unreasonable affinity for DNA (on the order of $10^{11}$ M$^{-1}$), can keep the lac operon sufficiently off in the repressed state.
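As an order-of-magnitude check on those figures, suppose the corrections in the equation are negligible. This is an idealization of mine, not a claim from the paper: the target and pseudosite concentrations are small next to $R_T$, and $K_{ns}D_{ns} \ll 1$. Reading the quoted affinity as $10^{11}$ M$^{-1}$ and the repressor concentration as $10^{-8}$ M, the equation reduces to:

$x \approx K_s R_T \approx 10^{11}\,\mathrm{M^{-1}} \times 10^{-8}\,\mathrm{M} = 10^{3}$

That is the 1000-to-1 figure. Any appreciable non-specific binding enlarges the denominator and pushes x below this ceiling, which is why the size of $D_{ns}$ matters so much.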

In other words, you don’t need to ask for directions in E. coli, because it’s really not that hard to find your way around with simple thermodynamic considerations. Where does the information come from? Sequence specificity, pure and simple.

Living dangerously

Unfortunately, this strategy completely fails for eukaryotes. The huge genome sizes (billions, instead of millions, of base pairs) cause the number of pseudosites and non-specific sites to explode, and you can no longer plug realistic K values into the above equation to get a satisfactory x. Furthermore, this approximate model breaks down, and you have to turn to something fancier like this one. This is a conversation for another day, but what it seems to mean is that, while the E. coli genome may be easy to navigate, you probably do need to ask for directions in eukaryotes. Transcription factor sequence specificity can’t get you where you need to go without help. Where those directions come from is a topic of intense debate, but part of the answer must be that chromatin structure shrinks the effective genome size, so that many imitation target sites are taken out of play. Just don’t ask me where the information for chromatin structure comes from.