
Big Data: Correlation is no substitute for causation (or a decent model)

I have been working with “Big Data” for the past seven or so years, across financial markets data, credit card transactions, retail transactions and cancer genomics. I have never really liked the term “Big Data”: when I first stumbled across it, I found it rather vacuous, but the last couple of years seem to have settled on the definition that “Big Data is where the size of the data becomes part of the problem”.

It was with great trepidation that I downloaded “Big Data: A Revolution That Will Transform How We Live, Work and Think” to my Kindle, but with 106 reviews and a 4.3-star rating I felt I might as well give it a go.

I definitely agree with the book when it says that “At its core, big data is about predictions”. In each of the areas I have worked in the key motivation for analysis has been to answer questions that can be acted upon. For the most part, this means making predictions – what customer segment is most likely to respond to direct mail, what drug is a patient’s tumour most likely to respond to, etc.

However, the book also makes an assertion that I do not agree with, that “In a big-data world, by contrast, we won’t have to be fixated on causality, instead we can discover patterns and correlations in the data that offer us novel and invaluable insights”. Later, an example is given: “If millions of electronic medical records reveal that cancer sufferers who take a certain combination of aspirin and orange juice see their disease go into remission, then the exact cause of their improvement in health may be less important than the fact they lived”.

I could not disagree with this statement more.

Nate Silver’s book “The Signal and the Noise”, and pretty much any statistician worth hiring, make it pretty clear that this type of reasoning is wrong and dangerous. What we are attempting to do when analyzing data is to build a model of how a system works so that we can then make accurate predictions.

The most important aspect of this is determining what is signal and what is noise. Just because you see a pattern or correlation in a set of data does not mean that the pattern or correlation will continue – it is likely just an artifact of something else: noise or bias. Only when you can build and test a model (e.g. by cross-validation, or against an additional independent data set) can you discover whether the relationship is likely to be real.
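The value of testing against independent data can be sketched with a toy simulation (Python standard library only; the variable names and sizes are purely illustrative). We screen a thousand pure-noise “features” for the one best correlated with a pure-noise outcome, then check that same kind of feature against fresh data:

```python
import random
import statistics

random.seed(42)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

n_samples, n_features = 100, 1000

# Outcome and features are ALL independent noise, so any correlation
# we find between them is spurious by construction.
outcome = [random.gauss(0, 1) for _ in range(n_samples)]
features = [[random.gauss(0, 1) for _ in range(n_samples)]
            for _ in range(n_features)]

# "Discovery": pick the feature that best correlates with the outcome.
best = max(range(n_features), key=lambda i: abs(pearson(features[i], outcome)))
in_sample_r = abs(pearson(features[best], outcome))

# Independent "replication" data: fresh noise of the same shape.
test_outcome = [random.gauss(0, 1) for _ in range(n_samples)]
test_feature = [random.gauss(0, 1) for _ in range(n_samples)]
out_sample_r = abs(pearson(test_feature, test_outcome))

print(f"in-sample |r| of best feature: {in_sample_r:.2f}")
print(f"out-of-sample |r|:             {out_sample_r:.2f}")
```

Because we searched a thousand candidates, the in-sample correlation of the “best” feature looks impressive; on the independent data it collapses back towards zero. That gap is exactly what validation is there to expose.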

The problem is only compounded by big data, as large volumes of data mean more noise. A phenomenon that occurs by chance only 0.1% of the time will still occur around 1,000 times if you have a million samples. This means that the more data you have, the more “patterns” and “correlations” you are likely to discover that are spurious.
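The arithmetic behind this is worth spelling out (a trivial sketch with illustrative numbers): the expected number of chance occurrences is simply the per-sample probability times the number of samples, so even very rare flukes show up repeatedly at scale.

```python
# Expected number of chance occurrences: with per-sample probability p
# and n independent samples, the expectation is simply n * p.
n = 1_000_000                    # a million samples

for p in (0.001, 0.00001):       # a 0.1% fluke and a 0.001% fluke
    print(f"p = {p:.5%}: expect about {round(n * p)} occurrences")
```

Even a one-in-a-hundred-thousand fluke is expected ten times in a million samples; a one-in-a-thousand fluke, a thousand times.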

This problem has long been known about, and is especially true for medical research involving microarrays. A microarray is a device with hundreds of thousands of probes that lets you determine whether a frequently occurring DNA mutation exists within a particular person. With 500,000 – 1 million probes, this lets you look for a lot of mutations. A study design known as a genome-wide association study (GWAS) was therefore created: take a large group of people affected by a disease (e.g. 10,000 people) and another large group of controls (e.g. 20,000 people), and look for DNA mutation markers that are more common in the affected population than in the control population, in an attempt to discover which areas of the genome may be playing a role in disease susceptibility and prognosis.

The problem is that there are so many probes that it is trivial to find one (out of, say, 500,000) that is more common in one group than the other. In fact, at a significance threshold of p < 0.05 you would expect around 25,000 spurious hits by chance alone. Even when you apply a Bonferroni correction, so that your p-value threshold drops to 0.0000001 (0.05 / 500,000), you still end up with a lot of spurious results. This happened so often that journals such as Nature Genetics began demanding that all GWAS be replicated with an independent study. Even with these stringent checks, in areas such as blood pressure, GWAS have provided little insight into the effect of genetics on the disease, as summarized by Manolio et al. in their paper “Finding the missing heritability of complex diseases”:
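The multiple-testing problem above can be illustrated with a quick simulation (a toy sketch, not real GWAS data: it assumes every probe is a true null, in which case each test’s p-value is uniformly distributed on [0, 1]):

```python
import random

random.seed(0)

n_tests = 500_000          # probes on the array
alpha = 0.05

# Under the null hypothesis (no real association), a test's p-value is
# uniform on [0, 1], so a GWAS in which NO probe is truly associated
# can be simulated by drawing uniform p-values.
p_values = [random.random() for _ in range(n_tests)]

naive_hits = sum(p < alpha for p in p_values)

# Bonferroni correction: divide the threshold by the number of tests.
bonferroni = alpha / n_tests          # 0.0000001
corrected_hits = sum(p < bonferroni for p in p_values)

print(f"naive threshold (p < 0.05): {naive_hits} 'significant' probes")
print(f"Bonferroni threshold:       {corrected_hits} 'significant' probes")
```

On a typical run the naive threshold flags roughly 25,000 probes even though none is truly associated, while the Bonferroni threshold flags essentially none. Real studies are messier – biases such as population stratification can survive even a Bonferroni correction – which is why journals demand replication on top of it.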

Genome-wide association studies have identified hundreds of genetic variants associated with complex human diseases and traits, and have provided valuable insights into their genetic architecture. Most variants identified so far confer relatively small increments in risk, and explain only a small proportion of familial clustering, leading many to question how the remaining, ‘missing’ heritability can be explained

Nature 461, 747-753 (8 October 2009) | doi:10.1038/nature08494; Received 25 June 2009; Accepted 11 September 2009

The failure of simple correlation-hunting in medical research, and in cancer research in particular, makes the tenet that having more data somehow means causation can be replaced with correlation laughable. That said, correlations are often a pretty good place to start looking (which was always the intention of GWAS), but making predictions based solely on correlations is unlikely to get you very far as an analyst.

Any serious work a data scientist, analyst or statistician is likely to be employed to do is rooted in the real world: “Will this drug work better than this other drug? Will a set of lost customers return if we offer them a discount?” Thus a data scientist must be able to explain the answers generated by their model not only in terms of statistics, but also in terms of the real world. Because no matter how good your model is, you are always going to have someone in a nicer suit than yours ask the question, “Why?” – and discussing the intricacies of your Support Vector Machine / Regression / Classification Tree model is not going to impress them.