When Too Much Data Obfuscates the Obvious

An overwhelming deluge of data is undermining the credibility of science, says Gary Smith, an economics professor and author of a couple of books on data science.

At a time when organizations around the world are turning to data-driven insights and investing in analytics solutions, Smith made this dire assertion in a contributed opinion piece on Bloomberg, citing an array of studies that were published in reputable journals and subsequently debunked.

More data than ever

But first, some context for readers who might be less versed in the state of data science and big data: While the topic of “big data” is hardly new, there has arguably been a massive uptick over the last few years in the volume of data that organizations collect, manage, and analyze.

From surveillance video and sales records to manufacturing metrics and production schedules, businesses are doing their utmost to leverage their data. They are also increasingly turning to technologies such as IoT to extract even more data from retail shops, manufacturing environments, and farms.

The idea is that hidden within this vast repository of data are nuggets of valuable insight that can either unlock outsized rewards or improve efficiency by a few percentage points and tangibly impact the bottom line. One might argue that data has become the engine of digital transformation.

Of course, this burgeoning cache of data must be stored somewhere, and tools that make it easier to store, manipulate, or analyze data will give organizations an edge over their rivals. Unsurprisingly, some of the biggest names in the data space, such as Snowflake and Databricks, are known for their ability to process and analyze data faster and more flexibly than traditional solutions.

For now, the incredible growth shows no sign of abating. Indeed, recent reports predict that the big data market will exceed US$100 billion over the next five years.

Drawing inferences from random data

But back to the story about why too much data could end up undermining science. According to Smith, an overly lenient threshold for statistical significance – arbitrarily pegged more than a hundred years ago at a five percent probability that an outcome could have happened by chance – has culminated in the publication of untold studies that drew eyebrow-raising and subsequently debunked conclusions.

“Suppose that a hapless researcher calculates the correlations among hundreds of variables, blissfully unaware that the data are all, in fact, random numbers. On average, one out of 20 correlations will be statistically significant, even though every correlation is nothing more than coincidence.”
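Smith's "hapless researcher" scenario is easy to reproduce. The sketch below, using illustrative sample sizes and variable counts of my own choosing (not from the article), correlates columns of pure noise and counts how many clear the conventional five percent bar. Roughly one in twenty do, exactly as he describes:

```python
import math
import random

# Illustrative sketch: correlate many columns of pure random noise
# and count how many correlations look "statistically significant".
# The sample size and variable count are arbitrary assumptions.

random.seed(42)
n_obs, n_vars = 500, 40

data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Under the null hypothesis, r * sqrt(n) is approximately standard
# normal, so |r| > 1.96 / sqrt(n) corresponds to p < 0.05 (two-tailed).
threshold = 1.96 / math.sqrt(n_obs)

pairs = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)]
significant = sum(
    1 for i, j in pairs if abs(pearson(data[i], data[j])) > threshold
)
fraction = significant / len(pairs)
print(f"{significant} of {len(pairs)} correlations pass p < 0.05 ({fraction:.1%})")
```

Every one of those "discoveries" is coincidence, since the columns are independent random numbers by construction.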

But how does that impact the typical enterprise, which, after all, is not interested in publishing studies?

Smith explained: “It is tempting to believe that more data means more knowledge. However, the explosion in the number of things that are measured and recorded has magnified beyond belief the number of coincidental patterns and bogus statistical relationships waiting to deceive us.”

In a nutshell, there is a real risk that organizations end up using data for its own sake, trawling through randomly chosen variables until they eventually find correlations that may well be coincidental, unlikely to be useful, or outright erroneous.

Eyes wide open

While I do believe in the value of data, collecting data for data’s sake is pointless. What’s more, organizations tend to think of data purely in terms of on-premises or cloud storage costs, while neglecting the compliance and regulatory overheads of managing all that extraneous data.

As noted by Smith, statistical significance should not supersede common sense. Given enough time and effort, any large data set can be mined to reveal utterly useless patterns. Organizations should not take controversial insights at face value without further validation; conclusions should ideally be tested through pilots or checked by a different team.
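One cheap form of the validation described above is an out-of-sample check: hold back part of the data, mine the rest for the strongest pattern, and see whether it survives in the holdout. The sketch below (again on pure noise, with sizes and seed chosen arbitrarily for illustration) shows how the most impressive correlation found by data mining typically shrinks sharply when re-measured on fresh data:

```python
import math
import random

# Minimal sketch of out-of-sample validation on pure noise: mine the
# "training" half for the strongest correlation, then re-measure that
# same pair on the holdout half. Sizes and seed are arbitrary.

random.seed(7)
n_obs, n_vars = 500, 40

data = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]
train = [col[: n_obs // 2] for col in data]
holdout = [col[n_obs // 2 :] for col in data]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# "Mine" the training half for the most impressive-looking pair.
best_pair = max(
    ((i, j) for i in range(n_vars) for j in range(i + 1, n_vars)),
    key=lambda p: abs(pearson(train[p[0]], train[p[1]])),
)
train_r = pearson(train[best_pair[0]], train[best_pair[1]])
holdout_r = pearson(holdout[best_pair[0]], holdout[best_pair[1]])

print(f"best pair {best_pair}: train r={train_r:.3f}, holdout r={holdout_r:.3f}")
```

A pattern that evaporates on the holdout half was almost certainly coincidence, which is exactly why a pilot or an independent team should re-test before anyone acts on it.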

When all is said and done, it would be foolish for any organization to ignore data today. The inherent challenges of too much data can be overcome with the right approaches, while various techniques exist to simplify existing data sets and ascertain their quality.

Ultimately, with the right data culture, tools, and people, it is possible to build an organization to win with data.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/wildpixel