At the Limits of Data

The power of data has never been more evident than in 2021, as data-centric capabilities are rolled out across industry verticals and organizations to enhance the customer experience, better understand shopping preferences, find fresh customer leads, and reduce waste in manufacturing.

More data, faster insights

If there is one constant, it is that more businesses than ever are attempting to leverage data. They are turning to advanced data tools and new AI-powered systems to comb through their voluminous data for the insights that will let them outmaneuver competitors.

Indeed, just earlier this month, Databricks announced that it had set a new world record on the TPC-DS benchmark used to evaluate the performance of data warehouse systems. This is a big deal for the data firm, valued at USD38 billion, as it seeks to bring the performance of data warehouses to data lakes, effectively allowing the same pool of data to serve both data scientists and traditional business intelligence (BI) tools.

Elsewhere, self-service data labeling platforms are also gaining attention. By letting businesses create and curate data sets with human assistance, these platforms allow organizations to accomplish tasks such as data collection and annotation in a self-directed way.

But data is static and of limited use sitting in a data repository. This explains why many organizations are also seeking to democratize data, turning data over to citizen data scientists who work in parallel with business analysts to unlock additional insights and improve the bottom line.

The limitations of data

While the increased use of data is generally a good thing, conversations about data typically sidestep the inconvenient elephant in the room: that identifying and weeding out inherent bias in data is as big a challenge as performing data analysis or building complex machine learning (ML) models.

The global scramble to collect and analyze real-time data at the start of the ongoing pandemic cast this issue into the spotlight. As previously noted on CDOTrends, rudimentary errors in how AI models were trained, combined with poor-quality data, meant that none of the AI tools created to fight COVID made a real difference.

Writing on the Tableau blog, a senior policy analyst highlighted the importance of considering and understanding the limits of existing data. “It’s important to also explore the roots of those limitations and scrutinize the reasons certain data are not available or are not robust enough — why are certain data not collected, not reported, or inconsistent across data sources?” wrote Rabah Kamal.

“Who was involved in collecting data, and who was not? How might data collection and reporting itself be perpetuating inequities? And — very importantly in a time of so much widespread misinformation — how do we explore these issues without discounting the useful and credible information we do have?”

When perfect data fails to deliver

But sometimes, even apparently perfect data and highly tuned AI models still aren’t good enough. Consider the case of American online real estate marketplace Zillow. CNN Business reported that the company recently announced the shuttering of a business, Zillow Offers, barely eight months after it launched in February.

At the heart of it is the “Zestimate” metric, which has been an integral part of the Zillow brand since its launch in 2006. Computed using at least 500,000 unique valuation models and terabytes of U.S. real-estate data, a Zestimate is effectively an ML-assisted estimate of a home’s market value.
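For readers curious what an ML-assisted valuation estimate looks like in code, here is a minimal sketch that fits a gradient-boosted regression to synthetic listing data. The feature names and figures are made up purely for illustration, and the model is a single off-the-shelf scikit-learn regressor; it is not Zillow’s actual Zestimate pipeline, which draws on far more models and data.

```python
# Minimal, hypothetical sketch of an ML-assisted home-valuation estimate.
# Synthetic data and made-up features; NOT Zillow's Zestimate pipeline.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical listing features: square footage, bedrooms, age, lot size.
sqft = rng.normal(1_800, 500, n).clip(500, 6_000)
beds = rng.integers(1, 6, n)
age = rng.integers(0, 80, n)
lot = rng.normal(7_000, 2_000, n).clip(1_000, 20_000)
X = np.column_stack([sqft, beds, age, lot])

# Synthetic "market value", with noise standing in for everything the
# features cannot capture (condition, neighborhood, buyer sentiment).
price = 120 * sqft + 15_000 * beds - 800 * age + 3 * lot + rng.normal(0, 40_000, n)

X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

print(f"Holdout R^2: {model.score(X_test, y_test):.2f}")
print(f"Estimated value of one home: {model.predict(X_test[:1])[0]:,.0f}")
```

Even in this toy setting, the model can only learn from what is in the feature columns, which is precisely the limitation the Zillow story illustrates.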

After spending years improving the algorithm internally, including by running a multi-year data science competition to tap external expertise, Zillow decided to rely on it, for certain homes, as the basis of an initial cash offer to purchase the property through Zillow Offers.

Unfortunately, it didn’t pan out, culminating in the closure of Zillow Offers. Zillow has since announced a USD304 million inventory write-down for Q3 and plans to ax 2,000 jobs, or a quarter of its workforce. The company attributed the stunning failure to an inability to accurately forecast price trends.

Said a Zillow spokesperson to CNN Business: “The challenge we faced in Zillow Offers was the ability to accurately forecast the future price of inventory three to six months out, in a market where there were larger and more rapid changes in home values than ever before.”

Not in the data?

What went wrong? It turns out that certain tasks, such as putting a valuation on a house, are both an art and a science. Apart from vital considerations that an experienced real estate agent might catch immediately, hidden problems such as structural defects can dramatically skew the price in the other direction.

Moreover, many aspects of putting a price tag on a home are unquantifiable. Someone trying to buy a house down the street from their parents, or who grew up in that neighborhood, is likely to pay more to secure it. No doubt, the failure of Zillow Offers will be the subject of study for years.

As organizations embark on their data journey, they must also recognize that data might not hold the answer to every challenge. In a fluid, imperfect world, the correct way forward may sometimes only be found by tempering clinical data points with a good dose of industry experience and human intuition.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: iStockphoto/ktsimage