# V is for Correlation

Correlation is not causation.

Who hasn’t heard that old chestnut? No one. How many people actually disagree? No one. Who actually uses the phrase “old chestnut”? No one. See the theme? No one. Wait, what? Stupid pattern loving human brain. Where was I? Argh, stupid forgetful human brain. No one.

Deep breath. Start again. While no one may believe that correlation is the same as causation, it is a tired line that is frequently used, not as an incisive critique of analytical errors, but as a casual dismissal of potentially inconvenient results without any serious consideration. Casual dismissal requires some style, a touch of pithiness. I find the related aphorism to be more aurally attractive:

Correlation does not imply causation.

It has more words, but I think it is the rhythm that I like. The accuracy of this phrase is, however, up for debate. And, it is debated, usually in a rather uneducated manner. The debate centers around the word “imply”. In its original, more mathematical usage, “does not imply” essentially had the same meaning as “is not”. In modern, common usage (eg, internet comment sections and dictionaries), “imply” has a much broader meaning more along the lines of “suggest”. This underscores a major problem with distinguishing correlation and causation – we use the same language and tools to discuss them both.

We are not very good at expressing correlation, whereas talking about causation is very natural. Basic sentence structure works this way. You have a subject (independent variable, x), an active verb (playing the role of “slope” in this labored metaphor), and the direct object (dependent variable, y) – a clear expression of directionality and causation.

We often use the same basic framework for talking about correlation and causation. There is a good reason for this. When variation in one variable causes the variation in another variable, the variables are correlated. The correlation between variables helps us understand how much of the variation in the second, dependent variable is caused by the first, independent variable.

In the absence of evidence for causation, we may find that two variables are correlated. They may have a causative relationship (correlation implying causation, for certain definitions of imply), but correlation provides no information about the direction of causation (ie, which variable is the independent variable). It is also perfectly possible to have correlated variables without causation (eg, both variables have a causative relationship with the same third, unknown variable).
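That third-variable case is easy to demonstrate in code. Here is a hedged sketch (pure Python, invented numbers, not from the original post) in which x and y never influence each other, yet come out strongly correlated because both follow a shared driver z – think of z as rainfall:

```python
# Two variables correlated only because both depend on a hidden common cause.
import random

random.seed(42)

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# z is the unknown third variable; x and y each track z plus their own noise.
z = [random.gauss(0, 1) for _ in range(1000)]
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

# Strong correlation, zero causation between x and y (r is about 0.8 here).
print(round(pearson_r(x, y), 2))
```

Note that `pearson_r(x, y)` and `pearson_r(y, x)` are identical – the statistic itself carries no information about direction, which is the whole point.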

Correlation tells us how much of the variation in one variable can be predicted by the variation in another variable. Causation tells us about the directionality of that relationship.

A two-dimensional scatter plot is a great tool for presenting the relationship between two variables. Yet, it can also lead to a lot of confusion when we are trying to parse this tricky correlation/causation issue. To my mind, this is not a fundamental problem with the plot itself, but a result of our education.

Maybe you were taught differently, but I first learned about two-dimensional plots in geometry when we were taught the basic equation for a line:

y = mx + b

The change in x causes the change in y.
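That causal reading of the equation maps directly onto code: x is the input you choose, y is wholly determined by it. A minimal sketch (the slope and intercept values are arbitrary):

```python
# The causal reading of y = mx + b: pick an x, out comes a y.
def line(x, m=2.0, b=1.0):
    # x is the independent variable; y is entirely determined by it
    return m * x + b

print(line(0))  # 1.0
print(line(3))  # 7.0
```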

Later, this was reemphasized when I was taught a straw man version of the scientific method in introductory biology, with great focus on the terminology: the independent variable goes on the x, or horizontal, axis and the dependent variable goes on the y, or vertical, axis. Woe be unto those who confuse the two on their Science Fair poster boards.

So, we know two things:

1. Two-dimensional plots are good tools for representing both correlation and causation.
2. We are predisposed to interpret two-dimensional plots as representing causation.

How can we embed the distinction between correlative and causative relationships without eliminating the benefits of the two-dimensional plot? I have a pretty simple solution that works for me; but does it work for you? Let’s start with a traditional plot of data. We’ll plot the quantity of apples produced in the United States between 2004 and 2010 (metric tonnes)[1] on the x-axis and the quantity of oranges[1] on the y-axis.

The correlation between apple and orange production over these years is r = 0.85. The variation in apple production predicts about 72% of the variation in orange production, and vice versa. In our Science Fair understanding of this plot, it is implied that apples (x) cause oranges (y).

> Based on the results of a recent analysis of agricultural data, we have determined that the best way to increase orange harvests will be to invest significantly in research into apple tree health and picker technology.

Yet, I think most of us with a passing familiarity with fruit farming would agree that apples do not cause oranges, no matter how tightly correlated the two are.
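For the curious, the r and r² figures above come from the standard Pearson calculation, and r² is symmetric – swapping which fruit sits on which axis changes nothing. A sketch with invented placeholder figures (not the actual FAO data from the notes):

```python
# Hedged sketch of the r / r² arithmetic, with made-up production figures.
import numpy as np

apples  = np.array([4.1, 4.4, 4.3, 4.2, 4.4, 4.2, 4.2])  # hypothetical values
oranges = np.array([8.3, 8.9, 8.2, 8.1, 9.1, 8.0, 7.5])  # hypothetical values

r = np.corrcoef(apples, oranges)[0, 1]

# r (and therefore r², the share of variance either series predicts in the
# other) is unchanged if we swap the axes -- correlation has no direction.
assert np.isclose(r, np.corrcoef(oranges, apples)[0, 1])
print(f"r = {r:.2f}, r^2 = {r * r:.2f}")
```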

Knowing that apples and oranges are highly correlated is not worthless information either. It is quite likely that apples and oranges are correlated because at least some of the variation in both is caused by the same factors, such as rainfall. So, we want to keep the correlation information (it gives us hints), but lose the implication of causation. What if we disrupted the linear association between the axes that we are used to – that we have been trained to associate with causality? What if we twisted our plot, say 45°?
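The twist itself is just a coordinate rotation applied to the points before plotting. A sketch using numpy and synthetic data (the actual plotting call is left to whatever library you prefer); because rotation preserves distances, the shape of the cloud – and the correlation it displays – is untouched:

```python
# Rotate a scatter cloud 45 degrees so neither variable owns the x-axis.
import math
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.6, size=200)  # synthetic correlated data

theta = math.radians(45)
R = np.array([[math.cos(theta), -math.sin(theta)],
              [math.sin(theta),  math.cos(theta)]])

pts = np.column_stack([x, y])  # n x 2 array of (x, y) points
rotated = pts @ R.T            # each point rotated 45 degrees about the origin

# Rotation is rigid: distances from the origin are preserved, so the cloud's
# shape (and the correlation structure it shows) is unchanged -- only its
# orientation on the page moves.
```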

The rotated correlation plot maintains the information about the correlation between apples and oranges. For me, at least, the unusual angle prevents me from going back to that ingrained causal interpretation. And, I think that was what we were going for; but, of course it works for me – I’m biased. How about you?

NOTES
1. Production statistics courtesy of the Food and Agriculture Organization of the United Nations. They also provide such helpful facts as that the #19 agricultural product of Afghanistan in 2010, in terms of economic value, was watermelon. #1 was wheat. I don’t think they keep track of all of a country’s agricultural products.