
Use DataOps to Dodge These Five Common Data Pitfalls

By Ethan Peck, Tamr
May 26, 2021

Over the last 25 years, DevOps has revolutionized business process automation by changing how software is developed and delivered. In the next 25 years, DataOps will have a similarly revolutionary effect on business data by changing how it’s developed and delivered. This is already under way. DataOps will eventually do for the Chief Data Officer what accounting and ERP software did for CFOs: provide automated methods for managing assets with unprecedented efficiency and effectiveness.

Back in May 2017, The Economist famously declared data the world’s most valuable resource — more valuable than oil. While that article was mostly about the need to regulate use of that data (particularly by the wave of big tech companies built on using OPD, or other people’s data), the metaphor reemphasized what most businesses were already thinking. Business data was the new “black gold,” sitting underneath your company just waiting to be mined, pipelined and put to better use in generating more revenue, greater market share or better profit margins.

At its essence, DataOps helps you build efficient pipelines to get data where it needs to go: right time, right place, right users, and right format for immediate use in business operations and analytics. However, just as you can’t pipeline unrefined crude oil directly into automobiles or heating systems, you can’t (or shouldn’t) pipeline raw business data directly into strategic business processes and expect to get the desired results.

Here are the five common data problems that we most often see stalling successful DataOps, with some ways to fix them. Are any of them keeping you awake at night?

Disparate data: Meet my friend Ted, a data scientist and bioinformatics whiz. He built a killer analytic in Spotfire. So did his fellow data scientist/bioinformatics whiz Beth. Same analytic. Same data. Same results, right? WRONG.
What do you tell your business users if you have two analytics that presumably are using the same data but don’t yield the same results? Answer: They’re *not* the same data. Ted took his data from one source. Beth took her data from another source. They appeared to be the same data, but they weren’t. For one thing, Ted’s source was one size; Beth’s was larger. Beth’s might have been made from three datasets, Ted’s from two datasets. Which is the “right” one?

The two datasets didn’t match because there was no version control, they lived in two different places, and they had been curated in two different ways, all of which produced divergent, disparate data. This problem can lead to completely different analytic results as well as downstream problems. These might include untrustworthy data (or at least concerns that the data is untrustworthy) and duplicated data processing, which means duplicated effort and conflicting business decisions.

Bad data: Sara, a senior business analyst, spots erroneous data. Common sense says she should fix it. But fixing it for herself doesn’t solve the problem. The next person will just have to fix it again. This could go on forever. A data scientist may eventually want to clean the data, then share it with others. This just makes it worse, creating primary source confusion. The further one is from the primary source, the more modifications (e.g., cleaning that may or may not have been done) and assumptions have been applied to the data, biasing the analytics and decisions that rely on it. Ideally, business users should be able to get as close to a data source as possible while still having data that’s useful. What’s needed here is a viable user-feedback mechanism for capturing and correcting erroneous data, which can un-bias the process as well as fix general errors in the data.

Outdated data: As an IT manager, Sumit needs to know that everyone has access to everything they need, and only what they need. Because both security and trust are required, a dashboard of curated data is important: he doesn’t want to revoke access from someone who needs it. At the same time, if Sumit learns that someone is hacking in through a given user’s account at that exact moment, he’ll need the latest information fast to shut off access and do damage control. Using DataOps principles, you can avoid business fallout through a careful orchestration of versioning and updating. In this case, you could pipeline a live version of the data, not yet curated, to the dashboard for firefighting while pipelining accurate, curated data for more routine operations management like audit trails or general role and access management. You’re using the same data but in different ways.

Unknown data: Companies and CDOs need to know what data they have. This is easier said than done, and is often complicated by internal policies and politics, habits like “data hoarding” and other, non-technical issues. Data resides in different ERP, CRM and warehouse systems at both corporate and divisional levels, often representing the same entity (customer, vendor, product, SKU, biological assay, oil well) in different ways. All data sources are constantly being updated during operation, with none of them fully informing the other systems. With effective, modern data cataloging, a well-designed DataOps pipeline will identify the data, tag it, and eventually help put it where everyone can use it with confidence.

Surprise data: This is important data that no one knows exists (often archived). Some organizations have multiple CRM and ERP systems in a single division that don’t talk to each other, making them fairly invisible. Other organizations have systems that *no one* (alive or still on staff) knows about. Data cataloging can similarly help here. This beats having people on laptops change-directory-ing their way through giant, arcane file servers to locate valuable archived data, although this may be unavoidable in some situations. But (hopefully) you’ll have to do this only once. (And yes, we’ve seen this actual situation.)
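
To make the disparate-data pitfall concrete, here is a minimal Python sketch of one way to check whether two people are really working from the same data: compute a row count and an order-independent content fingerprint for each copy, then compare them. The `fingerprint` function and the sample rows are hypothetical illustrations, not part of any particular DataOps tool, and a real pipeline would likely record such fingerprints alongside dataset versions.

```python
import hashlib
from typing import Iterable, Sequence, Tuple


def fingerprint(rows: Iterable[Sequence[str]]) -> Tuple[int, str]:
    """Return (row_count, digest) for a dataset, ignoring row order."""
    # Hash each row separately, then sort so that row order does not matter.
    row_hashes = sorted(
        hashlib.sha256("\x1f".join(row).encode("utf-8")).hexdigest()
        for row in rows
    )
    # Hash the sorted row hashes to get one digest for the whole dataset.
    digest = hashlib.sha256("\n".join(row_hashes).encode("utf-8")).hexdigest()
    return len(row_hashes), digest


# Hypothetical extracts: Beth's copy carries one extra record.
ted_rows = [("A-1", "assay", "0.42"), ("A-2", "assay", "0.57")]
beth_rows = [
    ("A-2", "assay", "0.57"),
    ("A-1", "assay", "0.42"),
    ("A-3", "assay", "0.61"),
]

ted_fp = fingerprint(ted_rows)
beth_fp = fingerprint(beth_rows)
same_data = ted_fp == beth_fp

if same_data:
    print("Both analytics use the same data")
else:
    print(f"Datasets differ: {ted_fp[0]} rows vs {beth_fp[0]} rows")
```

A check like this does not say which copy is the “right” one, but it flags the divergence before two analytics produce conflicting results.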

Successful DataOps is a process that critically depends on clean, organized, findable, trustworthy and usable data. DataOps won’t succeed until this step is incorporated and codified so that it can operate at scale with as much intelligent automation and as little human busywork as possible.
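
As a rough illustration of what “codified” can mean in practice, the Python sketch below expresses data-quality expectations as named rules that a pipeline could run automatically on every record. The rule names, the field names (`customer_id`, `country`) and the sample records are assumptions made up for this example; real rules would come from the people who own and use the data.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]

# Hypothetical rules: each maps a readable name to a check on one record.
RULES: Dict[str, Callable[[Record], bool]] = {
    "customer_id is present": lambda r: bool(r.get("customer_id", "").strip()),
    "country is a 2-letter code": lambda r: (
        len(r.get("country", "")) == 2 and r["country"].isalpha()
    ),
}


def failed_rules(record: Record) -> List[str]:
    """Return the names of all rules that the record violates."""
    return [name for name, check in RULES.items() if not check(record)]


# Hypothetical sample records: the second one violates both rules.
records = [
    {"customer_id": "C-100", "country": "US"},
    {"customer_id": "", "country": "USA"},
]

# Map each record's position to the rules it failed.
report = {index: failed_rules(record) for index, record in enumerate(records)}

for index, failures in report.items():
    status = "ok" if not failures else "failed: " + "; ".join(failures)
    print(f"record {index}: {status}")
```

Keeping the rules in one shared place means a fix made once applies to every downstream user, rather than each analyst repairing the same record privately.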

With this codification, you can prime the DataOps pump (back to our original metaphor) with:

Quality Data: Identify your most valuable data, create dynamic masters for key entities, and then continuously curate it, automatically and at scale.

Holistic Data: Share your best data with everyone, pulled from different sources with the best possible visibility, authority, accountability and usability.

Trusted Data: Create a single version of truth for key data that’s curated, complete and trustworthy.

If you have one or more of these five data challenges, you’re not alone. Data management methods like modern schema mapping, data mastering and entity resolution, along with architectural technologies such as machine learning, AI and cloud, can help achieve DataOps success. There’s also a growing DataOps ecosystem of best-of-breed data-enabling technologies, new roles (for example, data stewards), professional services and evolving best practices, and no shortage of advice from peers, consultants, industry analysts, press and technology vendors.

The original article by Ethan Peck, head of data and technical operations at Tamr, is here.

The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends. Image credit: iStockphoto/fizkes