Yesterday I wrote about why negative controls are important in a genome-scale search for functional DNA. Today, I’ll discuss the main focus of our recent work: understanding what makes a piece of DNA functional.
The particular DNA I’m interested in is known by the not very functional term ‘cis-regulatory’ DNA – a term that requires six syllables, an italicized Latin prefix, and a hyphen. This is DNA that is crucial in gene decisions: cis-regulatory DNA helps to control when, where, and how much genes are expressed. This happens because cis-regulatory DNA serves as a landing pad for ‘transcription factors’, proteins that land on cis-regulatory DNA and control the expression of nearby (or sometimes not so nearby) genes.
The question that haunts me is this: why don’t transcription factors get lost? My worry follows from these three observations:
1. Transcription factors recognize very short segments of DNA. To give you an example, the transcription factor I study, the eye development factor Cone-rod homeobox (Crx), recognizes slight variations of the 8-base sequence CTAATCCC.
2. Large eukaryotic genomes are packed with millions of copies of these short sequences. The 8-base Crx recognition site occurs more than 6.6 million times in the mouse genome.
3. Only a small fraction of all potential transcription factor recognition sequences are actually bound by transcription factors. In the case of Crx, only about 14,000 of 6.6 million 8-base recognition sequences are bound by Crx in the genome.
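To make the scale of the problem concrete, here is a minimal Python sketch of the arithmetic behind observation 3, along with a toy motif scanner. Only the 8-base sequence and the two counts come from the observations above. The helper names (`revcomp`, `hamming`, `count_sites`) and the one-mismatch tolerance standing in for ‘slight variations’ are assumptions for illustration, not the criteria used in our paper.

```python
# Illustrative sketch: helper names and the mismatch tolerance are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def count_sites(genome, motif="CTAATCCC", max_mismatches=1):
    """Count windows that match the motif on either strand,
    allowing up to max_mismatches differences (an assumed tolerance)."""
    k = len(motif)
    rc = revcomp(motif)
    count = 0
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        if (hamming(window, motif) <= max_mismatches
                or hamming(window, rc) <= max_mismatches):
            count += 1
    return count

# Counts quoted in the observations above.
bound_sites = 14_000
total_sites = 6_600_000
fraction_bound = bound_sites / total_sites
print(f"{fraction_bound:.2%} of recognition sites are bound")
```

Run on a whole genome, a scanner like this would be far too slow; the point is only that a short, degenerate motif matches a great many windows, and that roughly two in every thousand of those matches are actually bound.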
I’ve illustrated the problem below. On the left is an image of the transit of Venus across the Sun. On the right, the blue circle shows the small fraction of Crx recognition sites that are bound.
So yes, I’m serious – why don’t transcription factors get lost in giant genomes?
There are two primary answers (not mutually exclusive) that people typically turn to:
Hypothesis A: Chromatin context is everything. In any given cell, most of the genome is inaccessible, wrapped up into large, compact regions of dense chromatin. This reduces the transcription factors’ search space, so they don’t get lost.
But if context is the answer, how do the right parts of the genome get left exposed?
Hypothesis B: DNA grammar provides specificity. Short 8-base pair recognition sequences are not enough; true functional sites consist of rare, highly specific combinations of short recognition sites. The millions of spurious sites in the genome do not have the right DNA grammar, and they are not bound.
But if grammar is the answer, why do transcription factors seem to non-specifically bind all over the place?
We set out to resolve this dilemma by testing the cis-regulatory function of two types of genomic DNA:
Bound DNA: DNA segments with Crx recognition sites that are bound by Crx in the genome (i.e., ChIP-seq peaks).
Unbound DNA: DNA segments with equal numbers of good Crx recognition sites, but which are not bound by Crx in the genome.
Importantly, we tested these DNA pieces outside of their native genomic context, in the very permissive context of plasmids.
So, under Hypothesis A, both bound DNA and unbound DNA should be functional, because they have both been removed from any genomic context and placed on plasmids. The context is now the same for both classes of DNA.
Under Hypothesis B, bound DNA should be more functional than unbound DNA, regardless of context.
We tested large numbers of bound and unbound DNAs in our massively parallel[1] functional (i.e., reporter gene) assay. (We did this in the dissected whole retinas of baby mice.) The result came as a surprise to me, because I was betting on Hypothesis A. But it turns out that bound DNA differed from our random DNA distribution (showing function), while the unbound DNA largely resembled the random DNA distribution (showing non-function).
You can see this in these histograms from Fig. 1 of our paper. In the first panel, bound DNA distribution is in blue, the random DNA distribution is in green, and the level of gene expression is shown on the x-axis:
In the next panel, unbound DNA is in blue, and random DNA is in green – notice that these two largely overlap (except in the left tail):
What this means is that the distinction between functional (bound) and non-functional (unbound) DNA is independent of context, at least to a large degree. The information that distinguishes function from non-function is therefore locally encoded in the short DNA regions (84 bases) that we tested.
So the answer must be Hypothesis B, DNA grammar, right? Well, maybe, but so far we have been unable to find any obvious grammar. We’re now looking at more subtle structural DNA features that could account for the difference between function and non-function.
One final point – without our random DNA negative controls, our empirical null distribution, we would have drawn a very different conclusion. Unbound DNA would have looked very functional, just not quite as functional as bound DNA. But in fact the unbound DNA, which has lots of Crx recognition motifs, behaves much like our random DNA, which has very few Crx recognition motifs. That means most of what the unbound DNA is doing is non-specific, perhaps the result of biological noise.
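The logic of judging each class against an empirical null can be sketched in a few lines of Python. The expression values below are made-up toy numbers, not our data. The function name `fraction_outside_null` and the 5th/95th percentile cutoffs are assumptions chosen for illustration, and the statistics in the paper may differ.

```python
import statistics

def fraction_outside_null(test_values, null_values, lower_pct=5, upper_pct=95):
    """Fraction of test values falling outside the central range of the
    null distribution (percentile cutoffs are an assumed choice)."""
    cuts = statistics.quantiles(null_values, n=100)  # 99 percentile cut points
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    outside = sum(v < low or v > high for v in test_values)
    return outside / len(test_values)

# Toy expression levels (hypothetical, for illustration only).
random_dna = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 0.95, 1.05, 0.85, 1.15]
bound_dna = [0.2, 0.3, 2.5, 3.0, 1.0, 0.1]
unbound_dna = [0.9, 1.0, 1.1, 0.3, 1.05, 0.95]

bound_frac = fraction_outside_null(bound_dna, random_dna)
unbound_frac = fraction_outside_null(unbound_dna, random_dna)
print(f"bound outside null:   {bound_frac:.2f}")
print(f"unbound outside null: {unbound_frac:.2f}")
```

Without the `random_dna` list there is nothing to compare against, and any sequence that drives some expression looks functional. With it, a class counts as functional only to the extent that it departs from what random DNA does.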
Our paper: “Massively parallel in vivo enhancer assay reveals that highly local features determine the cis-regulatory function of ChIP-seq peaks”, PNAS July 16, 2013 vol. 110 no. 29 11952-11957.
1. Yes, ‘massively parallel’ is a technical term, one that has a different meaning from ‘high throughput’.