ENCODE Media FAIL (or, Where’s the Null Hypothesis?)

I’m ready to drink myself into a stupor, and not because it’s my birthday. This week we’re seeing a science reporting fail on a massive scale. And just to be clear, I’m not only (or even mostly) blaming reporters.

We’ve known for a long time that protein-coding genes are regulated by non-coding DNA sequences, ‘gene switches’, if you will. We’ve known for decades that the genome contains many ‘gene switches’. (See the references in this review.) That’s uncontested.

ENCODE is significant because they’ve provided a very useful data set, and not because they’ve shown that a) non-coding DNA is important (we knew that), b) most of the genome has phenotypically important regulatory function (it does not), or c) most of the genome is evolutionarily conserved (not true either). What they have shown is that much of the genome is covered by introns, and that it is hard to find biochemically inert DNA, which those of us who’ve tried to generate random, ‘neutral’ DNA sequences (for, say, spacers in synthetic promoter experiments) will agree with.

Now, let’s see how major media stories are handling the significance of ENCODE (h/t to Ryan Gregory for compiling the list of stories):

The NY Times confuses activity with necessity:

The human genome is packed with at least four million gene switches that reside in bits of DNA that once were dismissed as “junk” but that turn out to play critical roles in controlling how cells, organs and other tissues behave. The discovery, considered a major medical and scientific breakthrough, has enormous implications for human health because many complex diseases appear to be caused by tiny changes in hundreds of gene switches…

As scientists delved into the “junk” — parts of the DNA that are not actual genes containing instructions for proteins — they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed.

The Washington Post suggests the whole genome is conserved:

Indeed, the vast majority of human DNA seems to be involved in maintaining individuals’ well being — a view radically at odds with what biologists have thought for the past three decades…

The project’s chief discovery is the identification of about 4 million sites involved in regulating gene activity. Previously, only a few thousand such sites were known. In all, at least 80 percent of the genome appears to be active at least sometime in our lives. Further research may reveal that virtually all of the DNA passed down from generation to generation has been kept for a reason.

USA Today claims that 80% of the genome is composed of promoters or enhancers:

International research teams have junked the notion of “junk” DNA, reporting that at least 80% of the human genetic blueprint contains gene switches, once thought useless, that control the genes that make us healthy or sick.

I can’t blame NPR for this one, because they quote a scientist saying something I guarantee you few of my colleagues believe. This statement is overblown, and I doubt it would be made at say, the Biology of Genomes conference because it would be ripped to shreds:

“Most of the human genome is out there mainly to control the genes,” said John Stamatoyannopoulos, a geneticist at the University of Washington School of Medicine, who also participated in the project.

The Guardian, while more respectable in the numbers they throw around, falls for the typical trap of stuffing all non-coding DNA, regulatory DNA and the detritus of repetitive elements into a giant straw man:

For years, the vast stretches of DNA between our 20,000 or so protein-coding genes – more than 98% of the genetic sequence inside each of our cells – was written off as “junk” DNA. Already falling out of favour in recent years, this concept will now, with Encode’s work, be consigned to the history books.

Encode is the largest single update to the data from the human genome since its final draft was published in 2003 and the first systematic attempt to work out what the DNA outside protein-coding genes does. The researchers found that it is far from useless: within these regions they have identified more than 10,000 new “genes” that code for components that control how the more familiar protein-coding genes work.

Same thing at The Independent:

Scientists have once and for all swept away any notion of “junk DNA” by showing that that the vast majority of the human genome does after all have a vital function by regulating the genes that build and maintain the body.

Junk DNA was a term coined 40 years ago to describe the part of the genome that does not contain any genes, the individual instructions for making the body’s vital proteins. Now, this vast genetic landscape could hold hidden clues to eradicating human disease, scientists said.

The Telegraph:

In the new project, named Encode, scientists found that 80 per cent of the “junk” region helps dictate how and where proteins are produced….

In particular, they found four million areas which act like dimmer switches for individual genes, dictating how active or inactive each one is.

Wired is so confused it just makes me want to cry:

It’s news to me that regulatory sequences were considered less important:

Molecules that didn’t form protein-coding genes were mostly overlooked, partly because they were considered less important [WTF?!], but also because new tools and techniques were needed to study them. Like someone who knows a box is full of hardware but doesn’t know if it contains nails or screws or something else, scientists knew the genome was full of other molecules, but didn’t know what they were.

Pseudogenes perform other functions?

In the ENCODE data are thousands of newly identified structures known as pseudogenes, fossil genes and dead genes, which look like protein-coding genes but perform other functions

I’m picking on the misleading stuff. Many of these articles have reasonable parts to them as well, but the reasonable parts are couched in so much breathless hype that the end result is massive distortion.

And finally, I have a major beef with the scientists who are quoted in these pieces. They are making broad claims for function using an absurdly loose definition of function (reproducible biochemical activity). We need to remind ourselves that demonstrating function is not easy. And, more importantly, they are operating without a serious null hypothesis. What exactly do we expect non-functional DNA to look like?

Well, it’s not going to be inert. Nucleosomes have low sequence specificity, and so we expect, in a large genome, many regions that, just by chance, have a random piece of DNA that reproducibly positions nucleosomes. Transcription factors recognize short, degenerate sequences that occur, again, just by chance, all over the genome. And so again, in a large genome, we expect plenty of reproducible but functionally irrelevant TF binding. That’s going to lead to pervasive, tissue-specific transcription at low levels, along with various chromatin marks. Transcription factor binding sites turn over fairly rapidly in evolution, and so we expect dense, complicated networks just by chance.
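To put a rough number on “just by chance”, here is a back-of-the-envelope sketch in Python. It is only an illustration under loud assumptions that are mine, not ENCODE’s: a genome of random sequence with all four bases equally likely, a haploid genome length of about 3.2 billion bp, and a hypothetical 8 bp motif in which a “degenerate” position accepts any base. The function name and parameters are made up for this example.

```python
def expected_chance_sites(genome_length, motif_length,
                          degenerate_positions=0, both_strands=True):
    """Expected number of chance motif matches in random DNA.

    Assumes each base is independent and equally likely (p = 0.25),
    and that a degenerate position matches any base. Both are
    simplifications chosen for illustration.
    """
    if not 0 <= degenerate_positions <= motif_length:
        raise ValueError("degenerate_positions must be between 0 and motif_length")
    if genome_length < motif_length:
        return 0.0

    # Only the non-degenerate positions constrain the match.
    specific_positions = motif_length - degenerate_positions
    match_probability = 0.25 ** specific_positions

    # Number of windows of motif_length in a linear sequence.
    windows = genome_length - motif_length + 1

    # A site can sit on either strand; this ignores palindromic motifs.
    strands = 2 if both_strands else 1
    return windows * match_probability * strands


if __name__ == "__main__":
    GENOME_LENGTH = 3_200_000_000  # assumed approximate human genome size
    for degenerate in (0, 2):
        n = expected_chance_sites(GENOME_LENGTH, 8, degenerate)
        print(f"8 bp motif, {degenerate} degenerate positions: ~{n:,.0f} chance sites")
```

Under these assumptions, even a fully specified 8 bp motif is expected to turn up on the order of a hundred thousand times by chance alone, and allowing a couple of degenerate positions pushes that past a million. Real genomes are not uniform random sequence, so treat this as an order-of-magnitude null expectation, not a prediction.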

So as you read the hype, keep the null hypothesis in mind, and always ask, given everything we knew about transcriptional regulation in say, 2006, what did we expect the genome to look like? Have the ENCODE results defied those expectations?

Author: Mike White

Genomes, Books, and Science Fiction

20 thoughts on “ENCODE Media FAIL (or, Where’s the Null Hypothesis?)”

  1. I have pity for the science journalists. This is very complex, even for people with years of experience in the field. Most trained scientists don’t understand many of the issues. And, the journalists were confronted with well coordinated and unified PR by people with a vested interest in presenting ENCODE as a huge deal.

    1. When you present research to the media you get just one big punchline. ENCODE chose to make that punchline about the easily distortable theme of junk DNA, unfortunately. Since ENCODE stands for ENCyclopedia Of DNA Elements, they should have run with the encyclopedia theme, e.g. “we’ve inventoried the (putatively) functional portions of the genome, and identified x million…”

      Of course I blame this mysterious K Varley whose personal communications keep getting cited in the main ENCODE paper. I’m sure she’s unreliable.

  2. The reaction by establishment evolutionists is more interesting than what we learn from ENCODE. Inability to accept new information comes to mind.

    1. What new information is that? Perhaps you didn’t read this post, where I describe what non-functional DNA looks like biochemically. The ENCODE results are not surprising; they’re what we expect the genome to look like biochemically.

  3. Keerney: “The reaction by establishment evolutionists… Inability to accept new information”

    Wait, Keerney is a creationist who thinks 6000 years ago, dirt turned into the human genome by sorcery. And he’s accusing others of the inability to accept new information.

    The belief in the “Death of Junk DNA” goes back to 1989, and has been repeated every few months since then. The “Death of Junk DNA” is an idea that’s 23 years old. T. Ryan Gregory has collected quotes from scientists desperate for publicity, pretending to kill Junk DNA over the last 23 years: [http://www.genomicron.evolverzone.com/2008/02/quotes-of-interest-beware-single/]

    The ENCODE project itself announced the “Death of Junk DNA” back in 2007, then they announced it again this year. They keep announcing it’s dead because they never get any EVIDENCE with which to kill it!

    The “Death of Junk DNA” is an ancient idea, like ghosts or vampires. It must die in the light of knowledge and evidence.

    You bring a knife to a gunfight. If you want to kill the Junk DNA hypothesis, come back next time with some evidence.

    Your genome is mostly junk. That’s new information and you’re unable to accept it.
