Understanding the Data Mesh
- By Paul Mah
- June 16, 2021
Much has been written about the data mesh, from whether it is a good idea to how enterprises can implement a data mesh architecture. The concept was originally developed by Zhamak Dehghani at ThoughtWorks and outlined in a lengthy article titled “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”, in which she posited it as the next-gen data lake for data users such as data scientists and analysts.
In a follow-up piece a year later, Dehghani suggests four core underlying principles for getting value from analytical data at scale: domain-oriented decentralized data ownership and architecture, data as a product, self-serve data infrastructure as a platform, and federated computational governance.
But a couple of years after its introduction, and despite two lengthy expositions spanning more than 10,000 words, much confusion remains.
Treating data as a product
In a fresh take on this topic, software engineer Chris Riccomini argues that much of the confusion stems from putting the four data mesh principles on an equal footing. The key to understanding it, he says, lies in treating data as a product first, and then using a modern service stack to understand the remaining principles.
“Application data has customers just like any other product. Data scientists, business analysts, finance, sales operations, product managers, and engineers all use application data. Machine learning models, charts, graphs, reports, and even other web services are all built on top of application data,” writes Riccomini.
Unfortunately, organizations do not intuitively think of data as a product. While much effort goes into documenting and refactoring web services and APIs, data is largely ignored. No wonder data engineers face such difficulties as they attempt to interpret poorly documented internal schemas or “chase” data sources that get migrated at the drop of a hat.
To succeed, organizations must first treat data with the same care as they would their public-facing APIs, says Riccomini. This calls for documentation, well-defined schemas, versioning, and enforced compatibility guarantees. In a nutshell, do not treat data as a resource that requires zero maintenance.
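To make the idea of a compatibility guarantee concrete, here is a minimal sketch in Python of a check that a team could run before publishing a new version of a dataset schema. It is not taken from Riccomini's article: the `Field` type, the `breaking_changes` function, the example `orders` schemas, and the deliberately simple policy (no removed fields, no changed types, new fields must be optional) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Field:
    """One column in a published dataset schema (hypothetical model)."""
    name: str
    type: str
    required: bool = True


def breaking_changes(old: List[Field], new: List[Field]) -> List[str]:
    """List the ways `new` breaks the simplified policy relative to `old`."""
    new_by_name = {f.name: f for f in new}
    old_names = {f.name for f in old}
    problems = []
    # Existing consumers rely on every old field keeping its name and type.
    for f in old:
        candidate = new_by_name.get(f.name)
        if candidate is None:
            problems.append(f"field '{f.name}' was removed")
        elif candidate.type != f.type:
            problems.append(
                f"field '{f.name}' changed type from {f.type} to {candidate.type}"
            )
    # Added fields must be optional so that older records remain valid.
    for f in new:
        if f.name not in old_names and f.required:
            problems.append(f"new field '{f.name}' is required")
    return problems


# Hypothetical versions of an "orders" data product.
orders_v1 = [Field("order_id", "string"), Field("amount", "double")]
orders_v2 = orders_v1 + [Field("currency", "string", required=False)]
orders_v3 = [Field("order_id", "string"), Field("amount", "string")]

print(breaking_changes(orders_v1, orders_v2))  # compatible: nothing reported
print(breaking_changes(orders_v1, orders_v3))  # reports the type change
```

A check along these lines could run in the owning team's build pipeline, in much the same way that API contract tests guard a web service. Real schema registries apply richer rules than this sketch does.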
Decentralized data and self-service infrastructure
Once data is treated as a product, it quickly becomes evident that a centralized model will never work. Instead, data products must be built in a decentralized way by the teams that own the data.
A modern service stack is decentralized, with hundreds or even thousands of services owned by multiple teams building their own APIs. Riccomini thinks a similar model can make a data mesh architecture a reality.
“A data mesh takes service stack best-practices and applies them to the data layer. Not only should application development teams define APIs for their business logic (in the form of web services); they should do so for their data as well. The infrastructure and culture needed for the two are remarkably similar,” he writes.
Of course, not every data engineer will have the skills to develop full-fledged data products or troubleshoot intricate deployment issues. This is where federated governance comes into play, with other data professionals lending their strength as a counterweight to decentralization. Crucially, centralized teams can define standard data formats or compatibility rules as part of this federated governance structure.
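One way to picture this division of labor is a small set of rules that a central team maintains and every domain team's data product is checked against. The Python sketch below is an illustration under stated assumptions, not a description of any particular platform: `DataProduct`, `APPROVED_FORMATS`, `GOVERNANCE_RULES`, and the three rules themselves are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataProduct:
    """Metadata a domain team publishes about its dataset (hypothetical)."""
    name: str
    owner_team: str
    data_format: str
    schema_version: str
    documented: bool


Rule = Callable[[DataProduct], bool]

# Defined once by the central governance group (assumed values).
APPROVED_FORMATS = {"parquet", "avro"}

GOVERNANCE_RULES: Dict[str, Rule] = {
    "uses an approved format": lambda p: p.data_format in APPROVED_FORMATS,
    "declares a schema version": lambda p: bool(p.schema_version),
    "is documented": lambda p: p.documented,
}


def violations(product: DataProduct) -> List[str]:
    """Return the description of every central rule the product fails."""
    return [desc for desc, rule in GOVERNANCE_RULES.items() if not rule(product)]


# Two hypothetical products owned by different domain teams.
orders = DataProduct("orders", "checkout", "parquet", "1.2.0", True)
clicks = DataProduct("clicks", "web", "csv", "", True)

print(violations(orders))  # passes every rule
print(violations(clicks))  # fails the format and schema-version rules
```

The domain teams still own and build their products. The central rules only set the minimum bar that each product has to clear.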
Moving ahead with the data journey
This is not to say that a data mesh architecture will solve all data-related problems. Riccomini observes that a lack of data standards or the reverse – an obsessiveness over details that probably do not matter – can hinder an organization adopting a data mesh.
And as with any IT initiative, technology alone isn’t adequate to guarantee success. Riccomini says a successful shift to decentralized data pipelines and data warehouses requires a cultural change within the organization.
Even with a DataOps culture in place, however, good data products or data pipelines will need to be built by experienced data infrastructure experts and data engineers. In a nutshell, we are just getting started on the data journey.
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].
Image credit: iStockphoto/Ohobbs