Additional Principles To Build a DataOps Ecosystem

By Andy Palmer, Tamr
January 20, 2021

In my last piece, I wrote about three principles of a successful DataOps ecosystem that we see at work every day within large enterprises. While an open, best-of-breed approach is more difficult initially, it is much more effective in the long term than a single-vendor, single-platform approach.

As enterprises put their DataOps ecosystems in place, they need to weigh some additional considerations to keep up with the dramatic pace of change in enterprise data. This article looks at those considerations.

Maintain the lineage of your data

As data flows through a next-generation data ecosystem, it is of paramount importance to manage its lineage metadata properly to ensure reproducible data production for analytics and machine learning. Having as much provenance for data as possible enables the reproducibility that is essential for data science teams working at any significant scale.

Ideally, each version of a tabular input and output to a processing step is registered. In addition to tracking inputs and outputs to data processing steps, some metadata about what the processing steps are doing is essential. With a focus on data lineage and processing tracking in place across the data ecosystem, reproducibility goes up and confidence in data increases.
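
As a rough illustration of registering versioned inputs and outputs, here is a minimal Python sketch. It assumes small in-memory tables (lists of dictionaries) and uses a content hash as the version identifier. The names `fingerprint`, `LineageRecord`, `LineageLog` and `drop_missing_emails` are hypothetical and do not come from any particular lineage product.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


def fingerprint(rows):
    """Content hash of a small table (list of dicts); serves as its version ID."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class LineageRecord:
    step: str          # name of the processing step
    description: str   # metadata about what the step does
    inputs: dict       # dataset name -> version fingerprint
    outputs: dict      # dataset name -> version fingerprint
    ran_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class LineageLog:
    """Hypothetical in-memory registry of processing steps."""

    def __init__(self):
        self.records = []

    def run(self, step, description, func, inputs):
        """Run func on the named inputs and register input and output versions."""
        outputs = func(**inputs)
        self.records.append(
            LineageRecord(
                step=step,
                description=description,
                inputs={name: fingerprint(rows) for name, rows in inputs.items()},
                outputs={name: fingerprint(rows) for name, rows in outputs.items()},
            )
        )
        return outputs

    def upstream(self, version):
        """Return the records that produced a given output version."""
        return [r for r in self.records if version in r.outputs.values()]


def drop_missing_emails(customers):
    """Example processing step: keep only rows that have an email address."""
    return {"clean_customers": [c for c in customers if c.get("email")]}


log = LineageLog()
raw = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
result = log.run(
    "drop_missing_emails",
    "Remove customer rows with no email address",
    drop_missing_emails,
    {"customers": raw},
)
clean_version = fingerprint(result["clean_customers"])
producers = log.upstream(clean_version)
```

Given an output version, `upstream` answers the basic reproducibility question: which step, run on which input versions, produced this table. A production system would persist these records rather than hold them in memory.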

Establish bi-directional feedback

There is a massive gap in the enterprise: few methods or infrastructure exist to collect feedback directly from data consumers, and to organize and prioritize how data professionals act on that feedback, so that data consumers’ issues are addressed broadly across the entire organization.

Currently, data flows from sources through various intermediaries, such as data warehouses, data lakes, or spreadsheets, to consumers. However, there are no fundamental methods to collect feedback from data consumers broadly. What is needed is, essentially, “feedback services” that are embedded in all analytical consumption tools to create a “Jira for Data” that becomes more intelligent and automated over time.
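
To make the idea of a feedback service slightly more concrete, here is a minimal Python sketch of a queue that consumption tools could post issues to. It is only an illustration: the `Feedback` and `FeedbackQueue` names are hypothetical, and ranking issues by the number of distinct reporters is an assumed prioritization rule, not one prescribed in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    dataset: str    # dataset the consumer was looking at
    column: str     # column the issue refers to
    reporter: str   # data consumer who raised the issue
    message: str    # free-text description of the problem


class FeedbackQueue:
    """Hypothetical collection point for feedback from analytical tools."""

    def __init__(self):
        self.items = []

    def submit(self, dataset, column, reporter, message):
        self.items.append(Feedback(dataset, column, reporter, message))

    def prioritized(self):
        """Rank (dataset, column) pairs by how many distinct consumers reported them."""
        reporters = {}
        for item in self.items:
            key = (item.dataset, item.column)
            reporters.setdefault(key, set()).add(item.reporter)
        ranked = [(key, len(who)) for key, who in reporters.items()]
        # Most widely reported first; ties broken alphabetically for stable output.
        return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


queue = FeedbackQueue()
queue.submit("sales", "region", "ana", "EMEA rows are labeled APAC")
queue.submit("sales", "region", "ben", "Region codes look wrong")
queue.submit("customers", "email", "ana", "Duplicate email addresses")
queue.submit("sales", "region", "ana", "Still wrong this week")
backlog = queue.prioritized()
```

Here the `sales.region` issue rises to the top because two different consumers reported it, even though one of them reported it twice.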

Batch and streaming processing

The success of Kafka and similar design patterns has validated that a healthy next-gen data ecosystem includes the ability to process data from source to consumption in both batch and streaming modes simultaneously. With all the usual caveats about consistency, these design patterns can give you the best of both worlds: the ability to process batches of data as required, and to process streams of data for real-time consumption.
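
One common way to support both modes is to write the per-record logic once and reuse it in a batch path and a streaming path. The Python sketch below assumes that approach. It uses a plain iterator as a stand-in for a real message stream and does not use Kafka or any other streaming library; `enrich`, `process_batch` and `process_stream` are hypothetical names.

```python
from typing import Iterable, Iterator


def enrich(event: dict) -> dict:
    """Per-record logic shared by both modes."""
    return {**event, "total": event["quantity"] * event["unit_price"]}


def process_batch(events: list) -> list:
    """Batch mode: process a complete, bounded set of records at once."""
    return [enrich(event) for event in events]


def process_stream(events: Iterable[dict]) -> Iterator[dict]:
    """Streaming mode: yield each result as soon as its record arrives."""
    for event in events:
        yield enrich(event)


orders = [
    {"order_id": 1, "quantity": 2, "unit_price": 5},
    {"order_id": 2, "quantity": 1, "unit_price": 30},
]

batch_result = process_batch(orders)

# An iterator stands in for an unbounded source such as a message topic.
stream_result = list(process_stream(iter(orders)))
```

Because both paths call the same `enrich` function, they produce the same results for the same records, which addresses part of the consistency caveat mentioned above.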

Data integration at scale

When bringing data together from disparate silos, it’s tempting to rely on traditional deterministic approaches to engineer the alignment of data with rules or ETL. We believe that the only viable method of bringing data together at scale is to combine machine-based models (probabilistic), rules (deterministic), and human feedback (humanistic) to bind the schema and records together as appropriate. This should be done in the context of how the data is generated and how the data is consumed.
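
The Python sketch below shows one way the three signals could be combined to decide whether two records refer to the same entity. It is a toy under stated assumptions: string similarity from the standard library's `difflib` stands in for a trained model, the tax ID rule and the 0.85 threshold are invented for illustration, and the precedence order (human label, then rule, then model) is a design assumption rather than a method described in this article.

```python
from difflib import SequenceMatcher


def model_score(a, b):
    """Probabilistic stand-in: similarity of the two names, between 0 and 1."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def rule_decision(a, b):
    """Deterministic rule: identical, non-empty tax IDs always match."""
    if a.get("tax_id") and a.get("tax_id") == b.get("tax_id"):
        return True
    return None  # the rule does not apply to this pair


def is_match(a, b, human_labels, threshold=0.85):
    """Combine human feedback, rules, and the model score, in that order."""
    pair = frozenset((a["id"], b["id"]))
    if pair in human_labels:      # an expert has already reviewed this pair
        return human_labels[pair]
    ruled = rule_decision(a, b)
    if ruled is not None:
        return ruled
    return model_score(a, b) >= threshold


r1 = {"id": 1, "name": "Acme Corp", "tax_id": "12-345"}
r2 = {"id": 2, "name": "ACME Corporation", "tax_id": "12-345"}
r3 = {"id": 3, "name": "Acme Corp.", "tax_id": None}
r4 = {"id": 4, "name": "Zenith Ltd", "tax_id": None}

no_labels = {}
reviewed = {frozenset((1, 3)): False}  # a reviewer said records 1 and 3 differ

rule_match = is_match(r1, r2, no_labels)       # decided by the tax ID rule
model_match = is_match(r1, r3, no_labels)      # decided by name similarity
model_reject = is_match(r1, r4, no_labels)     # names are too different
human_override = is_match(r1, r3, reviewed)    # the human label wins
```

In a real system, the human labels would also be fed back as training data so that the model improves over time.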

Aggregated and federated storage

A healthy next-generation data ecosystem embraces data that is both aggregated and federated. Over the past few decades, the industry has gone back and forth between federated and aggregated approaches for integrating data. The modern enterprise requires an overall architecture in which sources and intermediate storage of data are a combination of both aggregated and federated data. This adds a layer of complexity that was previously challenging to manage but is entirely possible now with modern design patterns.

There are always tradeoffs of performance and control when you aggregate versus federate. But over and over, workloads across an enterprise require both aggregated and federated storage. In your modern DataOps ecosystem, cloud storage methods can make this much easier. In fact, when correctly configured as a primary storage mechanism, Amazon S3 and Google Cloud Storage can give you the benefit of both aggregated and federated methods.
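
To illustrate the tradeoff in miniature, the Python sketch below puts aggregated and federated datasets behind one read interface. It is a simplified, in-memory illustration: the `Catalog` class and its methods are hypothetical, and a real deployment would sit on top of object storage and source systems rather than Python lists.

```python
class Catalog:
    """Hypothetical catalog that serves both aggregated and federated data."""

    def __init__(self):
        self._aggregated = {}  # name -> rows copied into central storage
        self._federated = {}   # name -> callable that queries the source in place

    def register_aggregated(self, name, rows):
        # Snapshot: fast and centrally controlled, but it can go stale.
        self._aggregated[name] = list(rows)

    def register_federated(self, name, fetch):
        # Live: always current, but performance depends on the source system.
        self._federated[name] = fetch

    def read(self, name):
        if name in self._aggregated:
            return self._aggregated[name]
        if name in self._federated:
            return self._federated[name]()
        raise KeyError(name)


crm = [{"id": 1}]  # stands in for a source system

catalog = Catalog()
catalog.register_aggregated("crm_snapshot", crm)
catalog.register_federated("crm_live", lambda: list(crm))

crm.append({"id": 2})  # the source changes after registration

snapshot_rows = catalog.read("crm_snapshot")  # still the copied data
live_rows = catalog.read("crm_live")          # reflects the new row
```

Consumers call `read` the same way in both cases, while the catalog decides whether the answer comes from a central copy or from the source itself.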

Conclusion

The future is inevitable: more data, more technology advancements and more vendors, alongside an increasing need to implement DataOps successfully. The DataOps principles outlined here and in my previous piece are a high-level overview with an infinite number of technical caveats.

After hundreds of implementations at large and small companies, I can say it is entirely possible to apply all the principles I discussed within an enterprise, but not without embracing an open, best-of-breed approach.

Andy Palmer, chief executive officer at Tamr, wrote this article.

The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends. Image credit: iStockphoto/taa22