The ENCODE media fail was epic enough that it totally dominated the discussion when the results were released to the public. Now that our collective fury has abated1, I’d like to talk about, not what ENCODE did, but what it might mean for how we conduct genomic research in the future.
ENCODE produced an unprecedented amount of data with unprecedented levels of reproducibility between labs. This data will be useful to researchers around the world for years to come. To do so, however, it commanded tremendous resources and marginalized the concerns of independent researchers. Can we harness the data collection power of these collective projects without destroying the creativity and risk-taking of individual scientists in the crucible of collaborative compromise?
The homogeneity of purpose and compromise that characterized ENCODE are great for methods development, massive data collection, and getting reproducible results, but they are not particularly good for creative, innovative, high-risk research. Genome sequencing projects aspired to similar characteristics, though with widely varying success (current communication/sharing technology does make this process easier now).
“Big Science” projects, like ENCODE and the Human Genome Project, are, in many ways, an evolution of the timeless collaborative process in biological research. Researchers collaborate to bring skills and experience to a project that they do not have in their own lab. Sometimes this means sharing technical skills or samples. Sometimes this means sharing the workload so more samples can be assayed. The ENCODE consortium may have taken this process to 11, but it was not reinventing the wheel. Discussion, review, argument – these are important for critical and creative analysis of data. Compromise? Not so much.
At its heart, ENCODE was a data collection project, like genome sequencing projects. When it comes time to make their results public and get well-deserved credit for all the people who worked hard to make that data collection possible, they are forced to spin those data into conclusions. The structure of publishing peer-reviewed research honors complete “stories” not the individual elements that build our knowledge. The results of big, expensive data collection projects are newsworthy and practically demand big, splashy conclusions. This leads to over-simplified, nuance-free howlers like “80% of the genome is functional2” and “the platypus is half mammal, half reptile”.
Currently, data must be linked to conclusions. This may have made sense before the era of genomics (or, really, the -omics era) and pervasive collaborations. We now have Big Biology and Big Data. What happens if we separately give credit for generating data and for interpreting the data?
For genomics, we would be able to let our technology development wizards focus on building better toys. We would be able to let those groups that are experts in churning out assay after reproducible assay get on to the next sample. And, we could let those researchers who have a special talent for looking at data from a new angle focus on sorting out what it all means. Instead of maintaining the myth of the Renaissance scientist in our fantastically complex world of modern biology3, we can let people focus on that voodoo that they do so well.
This specialization would also create incentives to share data more freely and with accessible formats4. If producing the data is your thing, then you don’t need to jealously guard it to stop someone with better analytical chops from scooping you (or making your analysis look dumb).
It is not a perfect analogy, but it seems to me5 that we can learn something about how to do this from astronomy. Astronomy has always dealt with the problem of having very expensive data collection infrastructure. For much of astronomy’s history, access to the valuable data collection infrastructure (ie, telescopes) was limited to the wealthy or well-connected. In the past century, astronomers have developed better methods for sharing these resources more equitably.
Astronomy lacks the funds to build a new, large telescope (or launch one into space) for every clever researcher. Similarly, the infrastructure needed to reliably do massive amounts of high-throughput sequencing for genomics, and to develop new methods, is relatively specialized and expensive. We have made major advancements that have made these costs “acceptable” and “possible”, rather than the stuff of dreams. That does not make it “easy” or “cheap”.
Astronomy, in part and oversimplified by an outsider, deals with this problem in a relatively democratic manner. Astronomers get together, vote on a prioritized list of major data collection projects (eg, telescopes, space probes, etc) to try to fund, and set up systems for individual researchers to “pitch” uses of those data collection facilities. The Big Science data collection happens and the individual researcher is not left out in the cold.
We won’t be able to build an instrument for every specific project this way. That limit, however, may foster greater creativity in how we use the data. Astronomers are constantly finding new ways to use equipment built to older specifications and techniques. We often don’t do this in genomics6 because the focus is on building the next great technology. We don’t want to lose that focus. Yet, we are now running into real limitations on generating massive amounts of sequencing, like data storage and data transfers7.
If we fail to separate credit for data collection from the credit for insightful analysis, if we continue to keep data locked up and privileged, especially if we continue the trend toward bigger science and bigger collaborations in biology, then we run the risk of permanently disadvantaging or losing the small independent researchers. When we lose them, we lose variety and creativity.
The future I’m imagining involves partnerships between technical specialists, who are dedicated to pushing the technology of science as far as it can go, analysts that interpret the data and suggest new directions for data collection, and theorists that help us decide where to look next. A great deal of astronomy already operates on a similar model and it has done pretty well for itself despite very little funding help from the rest of us.
Maybe certain fields of biology should embrace the approach of modern astronomy, or at least my straw man view of modern astronomy. Back in the day, biology learned a lot from physicists – giving birth to modern molecular biology. Maybe it is time we learn from the stargazers…
1. Members of the Eisen clan are exempt from this general statement.
2. For certain, ludicrous definitions of “functional”.
3. It used to be a lot easier to know everything, because everything used to be a lot smaller.
4. Even when genomics data are made publicly available, the formatting standards are poorly implemented. It is, understandably, a low priority for the publishing researchers. Buy me a beer and I can regale you with hours of horror stories about parsing another lab’s customized formatting to avoid wasting the taxpayers’ money on redoing preliminary experiments.
5. I did actually bother to ask a few astronomers if my view of this was more-or-less correct. They said “yes”, but they had also been drinking. Draw your own conclusions.
6. There are many exceptions to this rule, such as labs using Sanger (capillary, or old-style) sequencing and chemical modifications to solve RNA structures.
7. Genome sequences are so large that the state-of-the-art in long distance transmission is to FedEx an external hard drive.
2 thoughts on “ENCODE, Astronomy, & the Future of Genomics”
It’s a tricky one. Data collection is arguably more important in the long run than interpretation because you can always re-interpret the data. At the same time, though, you cannot completely separate data collection from the biological questions and analysis, or you collect the wrong data. That said, I think the kind of partnerships you propose are already happening – the focus these days is definitely on *interdisciplinary* collaborations, not just “big science”. (In the UK anyway.) I totally agree about the formatting thing, though. As a bioinformatician, the amount of time I have wasted over the years trying to integrate different sources of essentially the same data is not funny! (It’s even worse when people don’t make their data available to download at all, though.)
I agree, but I think we need to move toward something more systematic and modular. Current partnerships are either ad hoc or driven by big projects from “on high”. What we want is to have more voice from individual researchers in the decisions about priorities. Over the past decade (at least), a lot of that power has devolved into the hands of a small number of people in key positions (eg, heads of major sequencing centers, NHGRI directors, etc). I will grant them their great successes, but those decisions (like the Human Genome Project) were often pretty obvious ones. Others, like the huge investment in Genome-Wide Association Studies, are much more debatable.