How Big Data Can Ruin True Statistics

February 13

Big Data presents a lot of opportunities for information discovery. The world now creates enormous volumes of data, which can be analyzed and put to use for everything from marketing to scientific research. But as the saying often attributed to Voltaire goes, with great power comes great responsibility.

People have always been able to manipulate data and create true, yet absurd, statistics. But Big Data makes it even easier. According to a recent Wired article by Nassim Taleb, a professor of risk engineering at NYU, Big Data has brought cherry picking to an industrial level. Researchers and Big Data analysts can now understand and use information in new ways, but they can also misuse it in new ways. In a vast data set, statistical correlations can turn up simply because there is so much data to search through, not because the correlations are genuinely valid.

Taleb explains that, “despite claims to advance knowledge, you can hardly trust statistically oriented sciences or empirical studies these days.” It’s far too easy to manipulate statistics in general, let alone large data sets. Take this example, used in an article by William Burns:

First gather data from all the fires in San Francisco for the last ten years. Now, correlate the number of fire engines at each fire and the damages in dollars at each fire. There will be a significant relationship between the number of fire engines and the amount of damage (bigger fire means more damage and more trucks to put out the fire). Now, you could conclude that the fire engines caused the damage because more trucks always mean more damage.

Of course, to anyone really looking at the data, this conclusion is absurd. Correlations like this are called spurious or illusory: the data is statistically related, but not causally linked. Clearly, the fire caused the damage, not the fire trucks. But it can’t be denied that the fire trucks were present and that there was always more damage when more trucks were present.
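To make the fire engine example concrete, here is a minimal simulation sketch. All of the numbers are invented for illustration (the fire sizes, the noise levels, the dollar figures), and the check it uses, correlating what is left over after removing the effect of fire size, is my own addition rather than anything from Burns's article. The point is only that trucks and damage can be strongly correlated even though, by construction, trucks have no effect on damage at all.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: fire size drives BOTH the trucks sent and the damage done.
fire_size = [rng.uniform(1, 10) for _ in range(2000)]
trucks = [size + rng.gauss(0, 1) for size in fire_size]
damage = [50_000 * size + rng.gauss(0, 50_000) for size in fire_size]

# Naive view: trucks and damage move together.
raw_r = pearson(trucks, damage)

# Controlled view: compare trucks and damage among fires of the same size.
controlled_r = pearson(residuals(trucks, fire_size), residuals(damage, fire_size))

print(f"trucks vs. damage:                 r = {raw_r:.2f}")
print(f"same, after removing fire size:    r = {controlled_r:.2f}")
```

The first correlation should come out strong and the second close to zero, which is exactly the pattern a hidden common cause produces.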

Taleb argues that anyone can find false statistics in Big Data because the spurious rises to the surface. In short, falsity grows faster than information: as the number of variables grows, the number of possible pairings between them grows much faster, and more of those pairings will look correlated by chance alone. In a large enough data set, false correlations can outnumber true ones. Large data sets contain a lot of variables and, with them, a lot of bad correlative information.
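A rough way to see this effect is to generate columns of pure random noise, where no real relationship exists, and count how many pairs still look correlated. The sketch below is my own illustration, not Taleb's analysis. It assumes 50 observations per variable and flags any pair whose correlation exceeds 0.28 in absolute value, which is about the conventional 5 percent significance cutoff for a sample of that size.

```python
import random
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def count_spurious(num_vars, num_obs=50, threshold=0.28, seed=0):
    """Count pairs of independent noise variables that look correlated anyway."""
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(num_obs)]
               for _ in range(num_vars)]
    return sum(1 for a, b in combinations(columns, 2)
               if abs(pearson(a, b)) > threshold)

# Every "hit" below is false by construction: the columns are unrelated noise.
for num_vars in (10, 20, 40, 80):
    pairs = num_vars * (num_vars - 1) // 2
    hits = count_spurious(num_vars)
    print(f"{num_vars:3d} variables, {pairs:5d} pairs, {hits:4d} spurious correlations")
```

Doubling the number of variables roughly quadruples the number of pairs, so the count of chance "discoveries" climbs quickly even though the amount of real information in the data is zero.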

The example I mentioned used a relatively small set of data, but imagine if you had a hundred times as many variables in your data set. Suppose you conducted a survey that yielded three thousand responses, but only one was from a woman, and she replied that she liked your product. Based on those results, you could say that 100 percent of women surveyed like your product. It’s true, but also ridiculous. Given enough data, you could make hundreds, even thousands, of spurious connections. These results can seem legitimate, though they may not be causally connected at all.
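One way to show how little the "100 percent of women" figure says is to put an uncertainty range around it. The sketch below uses a Wilson score interval, a standard textbook formula that I am bringing in for illustration and that the survey example itself does not mention. The second call, with 1,000 women all answering yes, is a hypothetical comparison and not part of the original example.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion (z = 1.96)."""
    if n <= 0:
        raise ValueError("need at least one response")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)

# The survey from the text: one woman responded, and she liked the product.
low, high = wilson_interval(successes=1, n=1)
print(f"1 of 1 women:       plausible range {low:.0%} to {high:.0%}")

# Hypothetical comparison: the same 100 percent, backed by 1,000 responses.
big_low, big_high = wilson_interval(successes=1000, n=1000)
print(f"1000 of 1000 women: plausible range {big_low:.1%} to {big_high:.1%}")
```

Both samples report "100 percent," but only the larger one pins the true figure down. With a single respondent, the range of plausible values is so wide that the headline number is close to meaningless.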

Spurious correlations make it easier for researchers to point a study in a particular direction, which can allow those who fund research to get the results they want, even if the correlations are illusory. Researchers could also stop collecting data as soon as the numbers look the way they want, leaving out the rest of the data set for fear that the statistics will no longer show the outcome they initially favored.
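The cost of stopping as soon as the data looks right can be simulated. In this sketch, which is my own illustration under invented settings, there is no real effect at all: every observation is random noise with a true average of zero. One researcher checks the results after every 10 observations and would stop at the first "significant" reading. The other looks only once, after all 500 observations are in. The function name and its parameters are hypothetical.

```python
import random

def false_positive_rates(trials=1000, max_n=500, peek_every=10,
                         z_crit=1.96, seed=0):
    """Return (rate with repeated peeking, rate with a single final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(trials):
        total = 0.0
        would_have_stopped = False
        for n in range(1, max_n + 1):
            total += rng.gauss(0, 1)  # pure noise: the true mean is zero
            if n % peek_every == 0:
                # z-score of the running mean, assuming a known std. dev. of 1
                z = total / n ** 0.5
                if abs(z) > z_crit:
                    would_have_stopped = True
        if would_have_stopped:
            peeking_hits += 1
        # The honest analysis: one test on the full data set.
        if abs(total / max_n ** 0.5) > z_crit:
            final_hits += 1
    return peeking_hits / trials, final_hits / trials

peeking_rate, final_rate = false_positive_rates()
print(f"false positives when stopping at the first good look: {peeking_rate:.0%}")
print(f"false positives when testing once on all the data:    {final_rate:.0%}")
```

The single final test should produce false alarms at roughly the nominal 5 percent rate, while the peeking strategy finds a "result" several times as often, despite there being nothing to find.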

Data manipulation has always been an issue in research, but as Big Data gets bigger, the opportunities for manipulation and error multiply, whether the research is conducted by a large enterprise or a small to medium-sized business. It’s more crucial now than ever to analyze not only the published results but also the methods used to gather the data, so you can be sure that your (or someone else’s) facts are legitimate.
