I’m ready to drink myself into a stupor, and not because it’s my birthday. This week we’re seeing a science reporting fail on a massive scale. And just to be clear, I’m not only (or even mostly) blaming reporters.
We’ve known for a long time that protein-coding genes are regulated by non-coding DNA sequences, ‘gene switches’, if you will. We’ve known for decades that the genome contains many ‘gene switches’. (See the references in this review.) That’s uncontested.
ENCODE is significant because they’ve provided a very useful data set, not because they’ve shown that a) non-coding DNA is important (we knew that), b) most of the genome has phenotypically important regulatory function (it does not), or c) most of the genome is evolutionarily conserved (not true either). What they have shown is that much of the genome is covered by introns, and that it is hard to find biochemically inert DNA, which those of us who’ve tried to generate random, ‘neutral’ DNA sequences (for, say, spacers in synthetic promoter experiments) will agree with.
Now, let’s see how major media stories are handling the significance of ENCODE (h/t to Ryan Gregory for compiling the list of stories):
The human genome is packed with at least four million gene switches that reside in bits of DNA that once were dismissed as “junk” but that turn out to play critical roles in controlling how cells, organs and other tissues behave. The discovery, considered a major medical and scientific breakthrough, has enormous implications for human health because many complex diseases appear to be caused by tiny changes in hundreds of gene switches…
As scientists delved into the “junk” — parts of the DNA that are not actual genes containing instructions for proteins — they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed.
Indeed, the vast majority of human DNA seems to be involved in maintaining individuals’ well being — a view radically at odds with what biologists have thought for the past three decades…
The project’s chief discovery is the identification of about 4 million sites involved in regulating gene activity. Previously, only a few thousand such sites were known. In all, at least 80 percent of the genome appears to be active at least sometime in our lives. Further research may reveal that virtually all of the DNA passed down from generation to generation has been kept for a reason.
International research teams have junked the notion of “junk” DNA, reporting that at least 80% of the human genetic blueprint contains gene switches, once thought useless, that control the genes that make us healthy or sick.
I can’t blame NPR for this one, because they quote a scientist saying something I guarantee you few of my colleagues believe. This statement is overblown, and I doubt it would be made at, say, the Biology of Genomes conference, because it would be ripped to shreds:
“Most of the human genome is out there mainly to control the genes,” said John Stamatoyannopoulos, a geneticist at the University of Washington School of Medicine, who also participated in the project.
The Guardian, while more respectable in the numbers they throw around, falls for the typical trap of stuffing all non-coding DNA, regulatory DNA and the detritus of repetitive elements into a giant straw man:
For years, the vast stretches of DNA between our 20,000 or so protein-coding genes – more than 98% of the genetic sequence inside each of our cells – was written off as “junk” DNA. Already falling out of favour in recent years, this concept will now, with Encode’s work, be consigned to the history books.
Encode is the largest single update to the data from the human genome since its final draft was published in 2003 and the first systematic attempt to work out what the DNA outside protein-coding genes does. The researchers found that it is far from useless: within these regions they have identified more than 10,000 new “genes” that code for components that control how the more familiar protein-coding genes work.
Scientists have once and for all swept away any notion of “junk DNA” by showing that the vast majority of the human genome does after all have a vital function by regulating the genes that build and maintain the body.
Junk DNA was a term coined 40 years ago to describe the part of the genome that does not contain any genes, the individual instructions for making the body’s vital proteins. Now, this vast genetic landscape could hold hidden clues to eradicating human disease, scientists said.
In the new project, named Encode, scientists found that 80 per cent of the “junk” region helps dictate how and where proteins are produced….
In particular, they found four million areas which act like dimmer switches for individual genes, dictating how active or inactive each one is.
It’s news to me that regulatory sequences were considered less important:
Molecules that didn’t form protein-coding genes were mostly overlooked, partly because they were considered less important [WTF?!], but also because new tools and techniques were needed to study them. Like someone who knows a box is full of hardware but doesn’t know if it contains nails or screws or something else, scientists knew the genome was full of other molecules, but didn’t know what they were.
Pseudogenes perform other functions?
In the ENCODE data are thousands of newly identified structures known as pseudogenes, fossil genes and dead genes, which look like protein-coding genes but perform other functions
I’m picking on the misleading stuff. Many of these articles have reasonable parts to them as well, but the reasonable parts are couched in so much breathless hype that the end result is massive distortion.
And finally, I have a major beef with the scientists who are quoted in these pieces. They are making broad claims for function using an absurdly loose definition of function (reproducible biochemical activity). We need to remind ourselves that demonstrating function is not easy. And, more importantly, they are operating without a serious null hypothesis. What exactly do we expect non-functional DNA to look like?
Well, it’s not going to be inert.
Nucleosomes have low sequence specificity, and so we expect, in a large genome, many regions that, just by chance, have a random piece of DNA that reproducibly positions nucleosomes. Transcription factors recognize short, degenerate sequences that occur, again, just by chance, all over the genome. And so again, in a large genome, we expect plenty of reproducible but functionally irrelevant TF binding. That’s going to lead to pervasive, tissue-specific transcription at low levels, along with various chromatin marks. Transcription factor binding sites turn over fairly rapidly in evolution, and so we expect dense, complicated networks just by chance.
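To put a rough number on "just by chance", here is a back-of-the-envelope sketch of how often a short, degenerate motif should turn up in a genome-sized stretch of random sequence. Everything in it is an assumption made for illustration: the genome length (roughly 3.1 billion bases), independent and equally frequent bases, and the two motif strings, which are examples and not claims about any particular factor. Real genomes have skewed base composition and repeats, so treat the output as an order-of-magnitude null expectation, not a prediction.

```python
from math import prod

# IUPAC nucleotide codes mapped to the number of bases each code permits.
IUPAC_DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def match_probability(motif: str) -> float:
    """Probability that a random site matches the motif.

    Assumes independent positions and equal base frequencies (1/4 each).
    """
    return prod(IUPAC_DEGENERACY[base] / 4 for base in motif.upper())


def expected_chance_sites(motif: str, genome_length: int,
                          both_strands: bool = True) -> float:
    """Expected number of chance matches in a random sequence.

    Counting both strands simply doubles the number of positions; this
    slightly overcounts motifs that are their own reverse complement.
    """
    positions = max(genome_length - len(motif) + 1, 0)
    strands = 2 if both_strands else 1
    return strands * positions * match_probability(motif)


# Assumption: approximate haploid human genome length in bases.
GENOME_LENGTH = 3_100_000_000

# Illustrative motifs only: one exact 8-mer, one degenerate 6-mer.
for motif in ("TGACGTCA", "WGATAR"):
    expected = expected_chance_sites(motif, GENOME_LENGTH)
    print(f"{motif}: ~{expected:,.0f} chance matches expected")
```

Under these assumptions even the exact 8-mer is expected tens of thousands of times, and the degenerate 6-mer millions of times, before any selection for function enters the picture. That is the kind of baseline a claim about functional binding sites has to beat.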
So as you read the hype, keep the null hypothesis in mind, and always ask: given everything we knew about transcriptional regulation in, say, 2006, what did we expect the genome to look like? Have the ENCODE results defied those expectations?