When Too Much Data Obfuscates the Obvious

An overwhelming deluge of data is undermining the credibility of science, says Gary Smith, an economics professor and author of a couple of books on data science.

At a time when organizations around the world are turning to data-driven insights and investing in analytics solutions, Smith made this dire assertion in a contributed opinion piece on Bloomberg, citing an array of studies that were published in reputable journals and subsequently debunked.

More data than ever

But first, some context for readers who might be less versed in the state of data science and big data: While the topic of “big data” is hardly new, there has arguably been a massive uptick over the last few years in the volume of data that organizations collect, manage, and analyze.

From surveillance video and sales records to manufacturing metrics and production schedules, businesses are doing their utmost to leverage their data. They are also increasingly turning to technologies such as IoT to extract even more data from retail shops, manufacturing environments, and farms.

The idea is that hidden within this vast repository of data are nuggets of valuable insight that can either unlock outsized rewards or improve efficiency by a few percentage points and tangibly impact the bottom line. One might argue that data has become the engine of digital transformation.

Of course, this burgeoning cache of data must be stored somewhere, and tools that make it easier to store, manipulate, or analyze data will give organizations an edge over their rivals. Unsurprisingly, some of the biggest names in the data space, such as Snowflake and Databricks, are known for their ability to process and analyze data faster and more flexibly than traditional solutions.

For now, the incredible growth shows no sign of abating. Indeed, recent reports predict that the big data market will exceed US$100 billion over the next five years.

Drawing inferences from random data

But back to the story about why too much data could end up undermining science. According to Smith, an overly lenient threshold for statistical significance – arbitrarily pegged more than a hundred years ago at a five percent probability that an outcome could have happened by chance – has culminated in the publication of untold studies that drew eyebrow-raising and subsequently debunked conclusions.

“Suppose that a hapless researcher calculates the correlations among hundreds of variables, blissfully unaware that the data are all, in fact, random numbers. On average, one out of 20 correlations will be statistically significant, even though every correlation is nothing more than coincidence.”
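Smith's "hapless researcher" scenario is easy to reproduce. The sketch below, using illustrative sample sizes and variable counts of my own choosing (not from the article), correlates columns of pure noise and counts how many clear the conventional five percent bar. Roughly one in twenty do, exactly as he describes:

```python
import math
import random

# Illustrative sketch: correlate many columns of pure random noise
# and count how many correlations look "statistically significant".
# The sample size and variable count are arbitrary assumptions.

random.seed(42)
n_obs, n_vars = 500, 40

data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Under the null hypothesis, r * sqrt(n) is approximately standard
# normal, so |r| > 1.96 / sqrt(n) corresponds to p < 0.05 (two-tailed).
threshold = 1.96 / math.sqrt(n_obs)

pairs = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)]
significant = sum(
    1 for i, j in pairs if abs(pearson(data[i], data[j])) > threshold
)
fraction = significant / len(pairs)
print(f"{significant} of {len(pairs)} correlations pass p < 0.05 ({fraction:.1%})")
```

Every one of those "discoveries" is coincidence, since the columns are independent random numbers by construction.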

But how does that impact the typical enterprise, which, after all, is not interested in publishing studies?

Smith explained: “It is tempting to believe that more data means more knowledge. However, the explosion in the number of things that are measured and recorded has magnified beyond belief the number of coincidental patterns and bogus statistical relationships waiting to deceive us.”

In a nutshell, there is a real risk that organizations end up using data for its own sake, trawling through randomly chosen variables until they eventually find correlations that may well be coincidental, unlikely to be useful, or outright erroneous.

Eyes wide open

While I do believe in the value of data, collecting data for data’s sake is pointless. What’s more, organizations tend to think of data purely in terms of on-premises or cloud storage costs, while neglecting the compliance and regulatory overheads of managing all that extraneous data.

As noted by Smith, statistical significance should not supersede common sense. Given enough time and effort, any large data set can be mined to reveal utterly useless patterns. Organizations should not take controversial insights at face value without further validation; conclusions should ideally be tested through pilots or checked by a different team.
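One cheap form of the validation described above is an out-of-sample check: hold back part of the data, mine the rest for the strongest pattern, and see whether it survives in the holdout. The sketch below (again on pure noise, with sizes and seed chosen arbitrarily for illustration) shows how the most impressive correlation found by data mining typically shrinks sharply when re-measured on fresh data:

```python
import math
import random

# Minimal sketch of out-of-sample validation on pure noise: mine the
# "training" half for the strongest correlation, then re-measure that
# same pair on the holdout half. Sizes and seed are arbitrary.

random.seed(7)
n_obs, n_vars = 500, 40

data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
train = [col[: n_obs // 2] for col in data]
holdout = [col[n_obs // 2 :] for col in data]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# "Mine" the training half for the most impressive-looking pair.
best_pair = max(
    ((i, j) for i in range(n_vars) for j in range(i + 1, n_vars)),
    key=lambda p: abs(pearson(train[p[0]], train[p[1]])),
)
train_r = pearson(train[best_pair[0]], train[best_pair[1]])
holdout_r = pearson(holdout[best_pair[0]], holdout[best_pair[1]])

print(f"best pair {best_pair}: train r={train_r:.3f}, holdout r={holdout_r:.3f}")
```

A pattern that evaporates on the holdout half was almost certainly coincidence, which is exactly why a pilot or an independent team should re-test before anyone acts on it.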

When all is said and done, it would be foolish for any organization to ignore data today. The inherent challenges of too much data can be overcome with the right approaches, while various techniques exist to simplify existing data sets and ascertain their quality.

Ultimately, with the right data culture, tools, and people, it is possible to build an organization to win with data.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/wildpixel