On Saturday, my former Center for Genome Sciences colleague Sean Eddy brought up the idea of a Random Genome Project: let’s create a random genome to serve as a null model of genome function. With this random genome, we can determine how much supposedly functional biochemical activity we expect to see just by chance, and, among other things, we might use a random genome to explore how new functions evolve by “repurposing” (Eddy’s great term) non-functional DNA. In the comments to that post, you can read some discussion of how you might go about making a random genome.
An easier task would be to implement the random genome computationally, an idea I’ve been exploring recently, using a genome-wide binding model along the lines of the one by Wasson and Hartemink.
Why do this? Because we could explore two kinds of null models – the random genome described by Sean Eddy, and the naked genome.
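As a rough illustration of the computational version, here is a minimal sketch of the simplest possible random genome: bases drawn independently at a fixed GC content. Everything in it is an assumption made for illustration (the function names, the 42% GC value chosen as roughly mouse-like, and the use of exact matches to CTAATCCC as a stand-in for a scored binding model). It is not the Wasson and Hartemink model, and a more careful null might also preserve dinucleotide frequencies or local composition.

```python
import random

BASES = "ACGT"

def base_probs(gc_content):
    """Per-base probabilities for an i.i.d. genome with the given GC content."""
    at, gc = (1 - gc_content) / 2, gc_content / 2
    return {"A": at, "C": gc, "G": gc, "T": at}

def random_genome(length, gc_content=0.42, seed=0):
    """Draw `length` independent bases (hypothetical null; no higher-order structure)."""
    probs = base_probs(gc_content)
    rng = random.Random(seed)
    return "".join(rng.choices(BASES, weights=[probs[b] for b in BASES], k=length))

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def count_sites(genome, motif):
    """Count overlapping exact matches to the motif on both strands."""
    total = 0
    for pattern in {motif, revcomp(motif)}:  # set avoids double-counting palindromes
        start = genome.find(pattern)
        while start != -1:
            total += 1
            start = genome.find(pattern, start + 1)
    return total

def expected_sites(length, motif, gc_content=0.42):
    """Expected number of exact matches (both strands) under the i.i.d. model."""
    probs = base_probs(gc_content)
    positions = length - len(motif) + 1
    expected = 0.0
    for pattern in {motif, revcomp(motif)}:
        p = 1.0
        for base in pattern:
            p *= probs[base]
        expected += positions * p
    return expected

# Example: one megabase of random sequence (assumed size, chosen to run quickly)
genome = random_genome(1_000_000)
print("observed:", count_sites(genome, "CTAATCCC"))
print("expected:", round(expected_sites(len(genome), "CTAATCCC"), 1))
```

Comparing the observed count in a real chromosome against the count in sequence like this is one way to ask whether recognition sites are more or less common than chance alone would predict.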
To understand why these null models would be useful, let’s step back and look at what we don’t understand about transcription factor (TF) binding. My Washington U. colleague Joe Corbo studies the transcription factor CRX, which is a master regulator of genes in photoreceptor cells. His lab performed a ChIP-seq experiment on CRX, and identified thousands of places in the mouse genome where CRX binds inside rod photoreceptor cells.
A Levinthal’s Paradox in the genome
If we just look at the CRX results in mouse chromosome 1 (which includes ~197 million base pairs, or ~7% of the mouse genome), we find 328 ChIP-seq peaks, or regions to which CRX binds reproducibly. Why those 328 regions and not elsewhere? We don’t know.
CRX recognizes variants of the sequence CTAATCCC. How many times do CRX recognition sites crop up in chromosome 1? Depending on your cutoff score (which I set to biologically reasonable values – even weak sites can without question be important), the answer is this:
Only 328 sites are bound out of 30,000 to 150,000 possible sites. Depending on the cutoff, that is roughly 0.2% to 1% of all CRX recognition sites actually bound reproducibly.
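The fraction bound follows directly from those counts. A quick check in Python, treating each ChIP-seq peak as one bound site (a simplifying assumption, since a single peak can contain several recognition sites); the function name is just for illustration:

```python
def bound_fraction(n_bound, n_candidates):
    """Fraction of candidate recognition sites that are bound."""
    return n_bound / n_candidates

PEAKS = 328  # reproducible CRX ChIP-seq peaks on chromosome 1

# Candidate site counts at the strict and lenient ends of the cutoff range
for n_candidates in (30_000, 150_000):
    frac = bound_fraction(PEAKS, n_candidates)
    print(f"{n_candidates:,} candidate sites: {frac:.2%} bound")
```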
Why? Where is the information for specific, functional binding coming from? (How many of those 328 sites are actually functional is another question.)
This is where the naked genome model comes in. CRX, like most transcription factors, faces a sort of Levinthal’s Paradox of the genome: it needs to find its functional targets in a genome where the number of possible recognition sites often exceeds the number of TF molecules present in the cell, and, at first glance, there don’t seem to be many guideposts marking the path to target binding sites.
The guideposts could come in two possible forms, not mutually exclusive (and somewhat oversimplified, by leaving out the role of TFs in creating open chromatin):
1) TFs seek out recognition sites in pre-existing regions of open/marked chromatin.
2) TFs only bind in highly specific combinations with other TFs, and those specific combinations are present only at functional sites.
A naked, null genome model
The naked genome is a genome in which nucleosomes are completely absent and the entire genome is therefore accessible. It’s a null model which allows us to quantify the potential contributions of chromatin structure and combinatorial binding, because we can ask, what CRX binding do we expect in the total absence of nucleosomes and cooperation with other TFs? Is the genome a giant sponge, soaking up TFs non-specifically? Weak recognition sites vastly outnumber high-affinity recognition sites – who wins out in the naked genome? If we computationally simulate combinatorial binding with other TFs, do we achieve the specificity that we observe in the cell? The naked genome provides a baseline against which we can compare our hypotheses about TF binding.
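To make the "who wins out" question concrete, here is a toy equilibrium sketch of a naked genome with a limited pool of TF molecules. It is a much simpler stand-in than a genome-wide binding model such as Wasson and Hartemink's: sites bind independently, there are no nucleosomes or partner TFs, and every number (site counts, relative affinities, TF copy number) is a hypothetical value chosen only for illustration.

```python
def occupancy(free_tf, affinity):
    """Probability that a site is bound, given free TF level and association constant."""
    return free_tf * affinity / (1.0 + free_tf * affinity)

def solve_free_tf(total_tf, site_classes, iterations=200):
    """Find the free TF level at which free + bound molecules equals total_tf.

    site_classes is a list of (number_of_sites, affinity) pairs. The left-hand
    side increases monotonically with free TF, so bisection on [0, total_tf] works.
    """
    lo, hi = 0.0, float(total_tf)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        bound = sum(n * occupancy(mid, k) for n, k in site_classes)
        if mid + bound > total_tf:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def bound_per_class(total_tf, site_classes):
    """Expected number of bound TF molecules in each class of sites."""
    free_tf = solve_free_tf(total_tf, site_classes)
    return [n * occupancy(free_tf, k) for n, k in site_classes]

# Hypothetical naked chromosome: 1,000 strong sites and 150,000 weak sites
# with 100-fold lower affinity, competing for 10,000 TF molecules.
classes = [(1_000, 1.0), (150_000, 0.01)]
strong, weak = bound_per_class(10_000, classes)
print(f"per-site occupancy: strong {strong / 1_000:.2f}, weak {weak / 150_000:.3f}")
print(f"share of bound TF on weak sites: {weak / (strong + weak):.0%}")
```

With these made-up numbers, each strong site is far more likely to be occupied than any single weak site, yet the weak sites collectively hold most of the bound TF. Whether a real genome behaves like this sponge depends on the actual affinity distribution and TF copy number, which is what a proper naked genome model would need to estimate.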
We could also ask, how does nucleosome coverage tune TF binding? We have a Goldilocks problem: Is there too much inaccessible chromatin, too little, or is it just right?
You can imagine a situation in which too much of the genome is wrapped up in chromatin, resulting in too few accessible, specific recognition sites, and therefore TFs tend to pack non-specifically into any region of open chromatin. Is this what we’re seeing in the ENCODE data?
Or maybe the genome is too open, leaving too many ‘decoy’, non-target recognition sites accessible, and TFs find their targets primarily by sequence information, and not via chromatin guideposts. (The naked genome is the extreme version of this model.) In this case, if we try to experimentally test the ability of these decoy sites to drive transcription, we would predict little or no activity.
Maybe the nucleosome coverage is just right (or TF concentrations are tuned to the correct level), covering up just enough DNA so that TFs aren’t exposed to too many decoy sites, while leaving enough sites open so that TFs aren’t driven non-specifically into any available region of open chromatin.
And then there is the case of so-called pioneer TFs, which induce sites of open chromatin – how do they sort out the target sites from the decoys?
At this point, these kinds of questions are poorly developed, but they illustrate the need for a good null model as we try to understand how genome information is used to direct gene regulation.