Around the world, the scale and scope of data in the enterprise have surpassed the ability of individual human effort to catalog, move, and organize it. Automating one’s data infrastructure using the principles of highly engineered systems is crucial to keeping up with the dramatic pace of change in enterprise data. Indeed, the principles at work in automating the flow of data from sources to consumption are very similar to those that drove the automation of software development in the form of DevOps, which is one of the key reasons why we call the approach “DataOps”.
Amongst the Global 2000 organizations, there is a consistent pattern of key principles of a DataOps ecosystem that stands in stark contrast to a traditional single-vendor, single-platform approach. An open, best-of-breed approach is more difficult, but also much more effective in the long term. Adopting these principles is a winning strategy to maximize the reuse of quality data in the enterprise and avoid the over-simplified trap of writing a massive check to a single vendor.
In this first piece, we look at three principles of a successful DataOps ecosystem that we see at work every day within large enterprises.
Adopting a cloud-first approach
The center of gravity for enterprise data has shifted to the cloud. Though the full transition will take decades, most of the large companies we work with on a regular basis prefer to start big data projects natively on the cloud. This is a significant improvement, primarily because cloud-native infrastructure shortens project times, in my experience by at least 50%.
Additionally, modern cloud database systems are designed to scale out natively and massively simplify operations and maintenance of large quantities of data. Finally, the core compute services available on the large cloud data platforms are incredibly powerful and easy to scale out quickly. Replicating these services on-premises would cost more than enterprises can afford, while the elastic environments of the cloud make it easy to scale resources as required with little to no capital investment.
Build highly automated and agile data infrastructure
Data changes constantly. Enterprise data sources should be treated as dynamic objects rather than static objects, and next-gen infrastructure should enable data to flow dynamically and treat data updates as the norm rather than the exception. As the enterprise begins to embrace the dynamic nature of data and manage the flow of data from many diverse sources to all potential consumption endpoints, it is vital to build an infrastructure that supports a continuous flow of data.
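As a minimal sketch of what treating updates as the norm might look like, the Python example below applies a stream of change events to a keyed dataset instead of reloading a static snapshot. The `ChangeEvent` type, the `apply_changes` function, and the sample customer records are hypothetical names invented for illustration; they are not taken from any particular product.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One change from a source system (hypothetical shape, for illustration)."""
    key: str
    op: str  # "upsert" or "delete"
    record: Optional[Dict[str, Any]] = None


def apply_changes(state: Dict[str, Dict[str, Any]],
                  events: Iterable[ChangeEvent]) -> Dict[str, Dict[str, Any]]:
    """Return a new dataset with each event applied in arrival order."""
    result = dict(state)  # copy so the caller's snapshot is left untouched
    for event in events:
        if event.op == "upsert":
            # Merge the new fields into the existing record, or create it.
            merged = dict(result.get(event.key, {}))
            merged.update(event.record or {})
            result[event.key] = merged
        elif event.op == "delete":
            result.pop(event.key, None)
        else:
            raise ValueError(f"unknown operation: {event.op}")
    return result


# Assumed sample data: one existing customer and three incoming changes.
customers = {"c1": {"name": "Acme", "city": "Boston"}}
events = [
    ChangeEvent("c1", "upsert", {"city": "Cambridge"}),
    ChangeEvent("c2", "upsert", {"name": "Globex"}),
    ChangeEvent("c2", "delete"),
]
current = apply_changes(customers, events)
print(current)
```

The point of the sketch is the shape of the interface: downstream consumers always read the latest state, and a change to a source is just another event rather than a reason to rebuild the pipeline.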
Rather than adopting yesteryear’s “boil the ocean” approach, the next generation of data management infrastructure should enable a more agile approach to organizing, aligning, and mastering data. The emergence of “data wrangling” and self-service data preparation is a move in the right direction to support a more agile approach to data management. However, enabling consumers to customize the way they would like to prepare the data is necessary but not sufficient to solve the broader problem of data reuse. Reuse also requires some collaborative unification, alignment, and mastering of data across the entire organization.
The key to success in the long term is to empower users to shape the data to suit their needs while also broadly organizing and mastering the data to ensure adequate consistency, quality, and integrity as it is used across an organization.
The rise of best-of-breed solutions
The primary characteristic of a modern DataOps ecosystem is that it is not a single proprietary software artifact or even a small collection of artifacts from a single vendor. In the next phase of data management in the enterprise, it would be a waste of time for an organization to “sell their data souls” to single vendors that promote proprietary platforms.
The ecosystem in DataOps should resemble DevOps ecosystems, where there are many best-of-breed FOSS and proprietary tools that are expected to interoperate via APIs. An open ecosystem results in better software being adopted broadly, and offers the flexibility to replace any one component with minimal disruption to your business.
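To make the idea of interoperating via APIs a little more concrete, here is a small Python sketch in which pipeline code depends only on an agreed interface, so one tool can be swapped for another without touching the caller. The `Catalog` interface, both implementations, the `publish_dataset` function, and the example bucket path are all hypothetical and stand in for whatever real components an organization chooses.

```python
from typing import Dict, List, Protocol


class Catalog(Protocol):
    """Hypothetical interface that every catalog tool is assumed to honor."""

    def register(self, name: str, location: str) -> None: ...

    def lookup(self, name: str) -> str: ...


class InMemoryCatalog:
    """Simplest possible implementation, standing in for one tool."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}

    def register(self, name: str, location: str) -> None:
        self._entries[name] = location

    def lookup(self, name: str) -> str:
        return self._entries[name]


class LoggingCatalog:
    """A second, unrelated implementation that also records each call."""

    def __init__(self) -> None:
        self._entries: Dict[str, str] = {}
        self.log: List[str] = []

    def register(self, name: str, location: str) -> None:
        self.log.append(f"register {name}")
        self._entries[name] = location

    def lookup(self, name: str) -> str:
        self.log.append(f"lookup {name}")
        return self._entries[name]


def publish_dataset(catalog: Catalog, name: str, location: str) -> str:
    """Pipeline code that knows only the interface, not the vendor."""
    catalog.register(name, location)
    return catalog.lookup(name)


# Swapping implementations requires no change to publish_dataset.
first = publish_dataset(InMemoryCatalog(), "orders", "s3://example-bucket/orders")
audited = LoggingCatalog()
second = publish_dataset(audited, "orders", "s3://example-bucket/orders")
print(first == second, audited.log)
```

In a real ecosystem the interface would more likely be a network API than a Python class, but the design choice is the same: the contract is shared and each component behind it is replaceable.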
Closely related to having an open ecosystem is embracing technologies and tools that are best-of-breed solutions, in which each key component of the system is built for purpose and provides a function that is the best available at a reasonable cost. As the tools and technology that the large internet companies built to manage their data go mainstream, the enterprise has been flooded with a set of tools that are both powerful and intimidating.
Selecting the right tools for your data workloads is difficult because of the massive heterogeneity of data in the enterprise, and because of the dysfunction introduced by organizations that over-promote their own capabilities. It all sounds the same on the surface, so the only way to really figure out what systems are capable of is to try them or take the word of a real customer.
In my next post, I’ll talk more about four other principles of a DataOps ecosystem.
The original article by Andy Palmer, chief executive officer and co-founder of Tamr, is here.
The views and opinions expressed in this article are those of the author and do not necessarily reflect those of CDOTrends.