
One Key Mantra for CDOs: Source Data is Not the Master

By Andy Palmer, Tamr
July 21, 2021

“The source is not the master.”

That statement hit me like a ton of bricks when I first heard it.

It came from a conversation with Elena Alikhachkina, who at the time was leading a significant digital transformation initiative at Johnson & Johnson.

But it should be a key mantra for any chief data officer or data executive charged with delivering the data so desperately needed to support ground-breaking analytics, more efficient operations, and dramatically better data-driven decision making.

When practicing data mastering, it is too easy to think that the original data source (or the most accessible source) is the master — and act accordingly. It’s human nature. Logically it is intuitive (and easier) to give more weight to the source closest at hand. But it turns out this behavioral thinking is dead wrong when it comes to data and, as a consequence, has knocked more enterprise data projects off the rails than I can count.

Four facts about enterprise data mastering

As always, it’s best to start with the facts. Here are four facts about enterprise data mastering to know before starting a mastering project.

Fact #1: No single data source will give the complete picture required across the analytic and operational consumption of data.

Any single data source is only a fraction of a company’s overall data. It’s one of dozens, hundreds, or even thousands of important function-, user-, or geography-specific data sources that have sprung up over the lifetime of the business: a vast, heterogeneous legacy ecosystem of siloed data sources. Each source serves a purpose or audience. Each is constantly evolving. And each has hidden value for the business at a macro, holistic level.

But each of these data sources is hamstrung by bad data, missing data, unauthorized duplicates, usage idiosyncrasies, and other problems. This complicates creating (1) a consistent and complete set of data from all possible sources and (2) a dynamic master based on dataset endpoints versioned and organized for key logical data entity types such as “customers,” “products,” and “employees.”

Fact #2: No data source is perfect, and it never will be.

Any original data source worth its salt has been getting hammered on (some for decades) by humans, business analysts, application owners, IT practitioners, and company events. There’s a fraught history of additions, edits, nulls, duplicate records, and structural and application-driven changes of varying quality and unknown provenance. These data sources need to be titrated for value, not used as the solution.

Today, the explosion in data and the need to make data usable across the entire enterprise have maxed out traditional mastering approaches that rely on top-down, rules-based master data management. The only way to integrate and master data broadly across a large or even medium-sized enterprise is to leverage the power of the machine with the careful and thoughtful engagement of data experts to tune the machine-based models.
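To make the division of labor between machine and expert concrete, here is a minimal sketch, not Tamr's implementation. It assumes a scoring function that rates how likely two records are to describe the same entity. Confident scores are decided automatically, and only the ambiguous middle band goes to a data expert. The `similarity` function (plain string similarity from Python's standard library) is a stand-in for a trained model, and the two thresholds are hypothetical values chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; in practice they would be tuned from expert feedback.
AUTO_MATCH = 0.90
AUTO_REJECT = 0.50


def similarity(a: str, b: str) -> float:
    """Stand-in for a trained matching model: string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def triage(pairs):
    """Split candidate record pairs into machine-decided and expert-review buckets."""
    decided, needs_review = [], []
    for a, b in pairs:
        score = similarity(a, b)
        if score >= AUTO_MATCH:
            decided.append((a, b, "match"))
        elif score <= AUTO_REJECT:
            decided.append((a, b, "no match"))
        else:
            # Ambiguous: keep the score so an expert can see how close it was.
            needs_review.append((a, b, round(score, 2)))
    return decided, needs_review


decided, needs_review = triage([
    ("Acme Corp", "acme corp "),
    ("Acme Corp", "Zyx"),
    ("Acme Corp", "Acme Corporation"),
])
```

In this toy run the first two pairs are settled by the machine and only the third lands in the review queue. The expert's answers on that queue are what you would feed back to tune the model and the thresholds.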

Fact #3: You can have a successful enterprise mastering project with imperfect source data.

This one is usually a big surprise and a hard one to grok because it’s counter-intuitive. All data will need treatment when creating your master data source. But that doesn’t mean it needs fixing now.

My conversations with CDOs are peppered with the same horror story: armies of staff or expensive consultants hammering away at creating the perfect data source for mastering or the target schema that is the “One Schema to Rule Them All!” That perfect source or schema rarely materializes (see above), or it arrives too late to matter, and the attempt almost always comes at a tremendous cost.

You can’t afford to wait.

A modern approach to enterprise data mastering assumes imperfect source data and acts accordingly. This is similar to Google’s search indexing infrastructure, which assumes that all data on the web is imperfect. In response, Google puts in place infrastructure to insulate search users from those idiosyncrasies, ensuring they can find what they want as easily as possible. (The PageRank algorithm is one of the most important components of that infrastructure.)

By applying machine learning strategically to enterprise data mastering, you can create trustworthy, automated models for the mastered entities that matter to the business: customers, products, employees, suppliers, and so on. Furthermore, a unique and persistent ID creates a connected, functional unified view of key entities across systems and insulates the data consumer from the complexity and idiosyncrasy of any specific system.
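The persistent ID idea can be sketched in a few lines. This is an illustration under stated assumptions, not a description of any product. It assumes an upstream matching step has already grouped source records into clusters that represent one real-world entity. The class name `MasterIdRegistry`, the `(system, local_id)` record keys, and the `M000001` ID format are all hypothetical.

```python
from itertools import count


class MasterIdRegistry:
    """Keeps one stable master ID per entity across (system, local_id) records."""

    def __init__(self):
        self._ids = {}          # (system, local_id) -> master ID
        self._next = count(1)   # hypothetical ID scheme: M000001, M000002, ...

    def assign(self, cluster):
        """Give every record in a matched cluster the same master ID.

        If any record already has an ID, the cluster reuses the lowest one,
        so consumers do not see the ID change when new sources are added.
        """
        existing = sorted({self._ids[r] for r in cluster if r in self._ids})
        master_id = existing[0] if existing else f"M{next(self._next):06d}"
        # If the cluster bridges several earlier IDs, fold them into the survivor.
        for key, value in self._ids.items():
            if value in existing[1:]:
                self._ids[key] = master_id
        for record in cluster:
            self._ids[record] = master_id
        return master_id

    def lookup(self, system, local_id):
        """Return the master ID for a source record, or None if unknown."""
        return self._ids.get((system, local_id))


registry = MasterIdRegistry()
first = registry.assign([("crm", "17"), ("billing", "A-9")])
# A later run discovers a third system's record for the same customer.
second = registry.assign([("billing", "A-9"), ("support", "x42")])
```

Both calls return the same ID, so a consumer who stored it after the first run still resolves to the right entity after the support system is added. That stability is what insulates consumers from changes in any one source.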

Over time, as data consumers validate the best data, you as a data organization can go back and remediate the data in the original sources. Even better, you can use low-latency autosuggest and autofill services from your data mastering system to ensure that users are not creating duplicates at the point of data entry and are indeed using the best values for new records, essentially improving data quality at the point of data creation.

My long-time friend and collaborator Nabil Hachem calls this “lazy data mastering.” And it’s a concept that works. (In fact, experience tells us it’s the only way enterprise-scale data mastering can work efficiently.)

This brings up the fourth fact.

Fact #4: The more data sources, the better.

The long-prevailing wisdom has been that the fewer data sources you deal with, the better. This belief dates back to the days when data management was in its relative infancy.

With the explosion in the amount of data and ever-rising executive expectations of data, this thinking is too limited. It’s essential to use data as a strategic competitive weapon in this next generation of what my friend Tom Davenport calls “Competing on Analytics.” Essentially: the more data you have, and the more current the data is, the stronger you become.

Flip your thinking about what data deserves time and attention

It’s time to accept source data imperfection as an organic problem, one that’s never “going away,” but that can be overcome by presumptive enterprise data mastering. Unfortunately, many data executives don’t recognize or admit that data-quality issues exist until those issues impact a vital business initiative, often one they’re tied to or responsible for.

Kick that thinking to the curb. Expect and embrace the fact that the quality of most data sources sucks, and that the only way to improve the data is to combine it with other data as continuously and broadly as possible.

Just like you wouldn’t drink water directly from a stream, acknowledge you can no longer drink your data from the source level (or even the data lake level) without ramifications. In today’s hyper-competitive, data-driven enterprises, data sources (like water) need continuous treatment for quality.

First and foremost, data executives should flip their thinking to focus on data quality at the consumption level versus the creation level. Consumption is what truly matters, and it sets the right context for enterprise data mastering (a business problem vs. a technical problem). Obviously, you can’t embrace and deploy a new philosophy overnight, so pick and prioritize your spots.

Many Tamr customers start their enterprise mastering mission with customer data because it’s on the critical path of generating revenue and building market share. Customer data is also all over the place, constantly being tinkered with by sales reps and other business users, and thus often a huge mess (dirty, duplicate, and incomplete data is the rule, not the exception).

Business essentials like knowing exactly how many customers you have and how to reach them can be a nightmare. Deploying modern enterprise data mastering for this kind of problem has delivered astounding results for Tamr clients ranging from energy leader Hess to electronic components distributor Littelfuse to financial services leader Santander.

Here’s an overview of a framework you can implement:

(1) Identify and solve the downstream data consumption problem first. Start with the business question being answered.

For Littelfuse, the question was, “What is the distribution of our customers by size?” The bottom-up answer using all data sources was very different from the answer in the primary source system. The source is not the master.

(2) Work backward to define the necessary reconciliation in source data and their systems over time.

Tamr can help here. Our machine-learning-driven, human-guided mastering takes a bottom-up approach, with machine learning capable of doing 90%+ of the heavy lifting of mapping attributes, matching records, and mapping classifications. As you add new data sources to the Tamr master, the underlying models get smarter over time, so the total cost of adding data sources grows linearly at worst and, more often, significantly sub-linearly. This creates the opportunity to quickly add lots of new data sources and build a broad set of highly accurate data for your organization to use as a competitive weapon. More data is better.

(3) Deploy continuous, machine-driven data-quality improvement at the source level.

As described earlier, you can eventually start to improve data quality at the point of data creation. For example: by using low-latency data mastering, you can populate autofill forms and power autosuggest on source data to guide the creation and conformance of new data. This is a common practice on the modern commercial web that, for some reason, we don’t expect from applications inside the enterprise. It’s time to close that gap and deliver enterprise-class autosuggest and autofill services for every single operational enterprise application that creates data.
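A rough sketch of what such an autosuggest service does at the point of data entry follows. It assumes the mastering system can supply a list of canonical entity names. The `suggest` function and the company names are made up for illustration, and the fuzzy matching uses `difflib.get_close_matches` from Python's standard library as a simple stand-in for a real low-latency matching service.

```python
from difflib import get_close_matches


def suggest(typed: str, mastered_names: list[str], limit: int = 3) -> list[str]:
    """Return mastered names that closely resemble what the user is typing.

    An empty list means nothing similar exists yet, so creating a new
    record is unlikely to introduce a duplicate.
    """
    # Compare case-insensitively, but return the canonical spelling.
    by_key = {name.lower(): name for name in mastered_names}
    hits = get_close_matches(typed.lower().strip(), list(by_key), n=limit, cutoff=0.6)
    return [by_key[h] for h in hits]


# Hypothetical mastered customer names.
mastered = ["Acme Corporation", "Globex Industries", "Initech LLC"]

existing = suggest("acme corp", mastered)   # likely the same customer
unknown = suggest("zzz", mastered)          # nothing similar on file
```

A data-entry form would show `existing` to the sales rep before saving, nudging them to pick the mastered record instead of typing a new variant of the same customer.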

We need to get to the point where enterprise data creation and evolution start to look more like Google or Wikipedia, or other Internet-age applications. And less like a data-management sweatshop or a never-ending series of dull enterprise application upgrade projects.

Don’t let bad source data keep you from acting

Source data will always naturally trend toward messy, whether from legacy idiosyncrasies, human error, duplication of effort, growth in number and volumes, changes over time, system additions or modifications, or all of the above. So expect this, plan for it, and fully embrace it.

You can now quickly and relatively easily create a master version of your company’s data. The result is data that is continuously updated, versioned, and broadly consumable: by the average corporate citizen via a simple “enterprise datapedia-like” application, by data analysts and data scientists via machine-readable tabular datasets published as simple spreadsheets or database tables, and by developers building data-driven applications via RESTful interfaces and JSON.

But even if you don’t do the above, stop thinking of your source system(s) as the data master for your downstream data.

Andy Palmer, chief executive officer and co-founder of Tamr, wrote this article. The original article is here.

The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends.