Science for the People: Dataclysm

This week Science for the People is looking at how powerful computers and massive data sets are changing the way we study each other, scientifically and socially. We’re joined by machine learning researcher Hannah Wallach to talk about the definition of “big data” and social science research techniques that use data about individual people to model patterns in human behavior. Then, we speak to Christian Rudder, co-founder of OkCupid and author of the OkTrends blog, about his book Dataclysm: Who We Are (When We Think No One’s Looking).

*Josh provides research & social media help to Science for the People and is, therefore, completely biased.


Always leave a paper trail

Lab notebooks are one of the less glamorous parts of being a scientist. You must meticulously record what you do each day so that someday in the future, someone could read it and replicate that day’s work. Or, when you realize you have discovered something you would like to patent, the notebook is how you prove that you indeed thought of it on a particular day.

Confession: I am particularly terrible at maintaining my lab notebook.

Nightmare data

The lovely and affable Tyler Dukes* has successfully pitched a session for the Science Writers 2012 meeting in October on dealing with “nightmare documents”:

Investigative science writing like this isn’t unique — but it’s a lot more rare than it should be…it’s expensive and time consuming. And more and more often, it’s becoming an unavailable option to news organizations looking to cut costs…In late March, I issued a broad-based call for what I called “nightmare documents,” the sorts of opaque public records that can be a real pain for journalists trying to use them in their reporting…Impossible-to-analyze databases. Government records hidden behind clunky Web interfaces. Unsearchable public reports digitized on ancient scanners.

I’ve encountered the same problem, not as a journalist, but as a researcher – datasets that are “shared” or “publicly available” but are almost unusable due to poor formatting and annotation. Although many journals require datasets to be made available, the requirements for useful formatting and annotation, even at public data repository sites, are usually laughable. And most busy researchers can only be bothered to meet those minimal standards (e.g., “Do you think that is good enough for them to let us publish? Cause I got a grant due.”).

I am happy to say that this is an issue of which Open Data advocates are well aware and are taking concrete steps to address.

*We say nice things about people who want to interview us; and by “us” I mean “me”. Mike says positively horrid things about everyone he talks to.