Nightmare data

The lovely and affable Tyler Dukes* has successfully pitched a session for the Science Writers 2012 meeting in October on dealing with “nightmare documents”:

Investigative science writing like this isn’t unique — but it’s a lot more rare than it should be…it’s expensive and time consuming. And more and more often, it’s becoming an unavailable option to news organizations looking to cut costs…In late March, I issued a broad-based call for what I called “nightmare documents,” the sorts of opaque public records that can be a real pain for journalists trying to use them in their reporting…Impossible-to-analyze databases. Government records hidden behind clunky Web interfaces. Unsearchable public reports digitized on ancient scanners.

I’ve encountered the same problem, not as a journalist, but as a researcher – datasets that are “shared” or “publicly available” but almost unusable due to poor formatting and annotation. Although many journals require datasets to be made available, the requirements for useful formatting and annotation, even at public data repository sites, are usually laughable. And most busy researchers can only be bothered to meet those minimal standards (e.g., “Do you think that is good enough for them to let us publish? Cause I got a grant due.”).

I am happy to say that this is an issue of which Open Data advocates are well aware and are taking concrete steps to address.

*We say nice things about people who want to interview us; and by “us” I mean “me”. Mike says positively horrid things about everyone he talks to.


Author: Josh Witten

http://www.thefinchandpea.com

5 thoughts on “Nightmare data”

  1. And I even say horrid things about people I don’t talk to.

    Unwillingness to put in the effort to make data public is certainly a problem. But I also think there is another cause of ‘nightmare data’ – the fact that much of the time, scientists see themselves as doing something new with a research tool, as opposed to following some pre-specified, standard data acquisition and processing protocol. And so the data doesn’t always come out in a form that meets the requirements of various repositories.

    Although much of the time it probably is better just to use standard methods – there are many bad microarray, RNA-seq, etc. experiments out there.

    1. Absolutely. A primary problem is that scientists (and I imagine many organizations) (1) just tick the box “shared”, (2) don’t have a vested interest in sharing data, and (3) don’t realize that the formatting that works for them “in house” is indecipherable to the outside world.

      Not sure people realize how few of those impact-building references to a research paper actually involve reusing/reanalyzing existing data. A bit sad really.

      1. What is especially sad is when data from large-scale projects, whose mere existence is justified by the claim that they are producing data we can all use (ahem, ENCODE), are difficult to parse even by people who know how to perform and analyze those types of experiments.

        But it also illustrates how hard it is to get things into a standard, usable format. If ENCODE has trouble with that, god help us all.

  2. From the latest issue of Nature Reviews Genetics:

    “Today, one often hears that life sciences are faced with the ‘big data problem’. However, data are just a small facet of a much bigger challenge. The true difficulty is that most biomedical researchers have no capacity to carry out analyses of modern data sets using appropriate tools and computational infrastructure in a way that can be fully understood and reused by others.”
