When a Data Engineer Becomes a Chef

Image credit: iStockphoto/Maxim Zarya

What’s cooking in your Kitchen? Yes, the DataOps Kitchen is now a real thing, and it is becoming a hot trend in the constantly evolving dataops world.

Kitchens, or self-service sandboxes as they are otherwise called, aim to do what data scientists and data engineers have secretly wished for: provide a platform designed to stop each team from annoying the other.

From the data scientist's side, the problem is essentially about time. When a data scientist is tasked with a problem statement, it can take close to two months to set up a development environment. That includes the excruciating wait for suitable systems, data and, yes, the long string of approvals. A DataKitchen survey puts the period at 10 to 20 weeks for Fortune 500 financial firms.

Meanwhile, data engineers are experiencing massive burnout. Dealing with unrealistic demands, getting blamed when data pipelines leak or break, facing governance policies that complicate life, and handling frequent disruptions from unplanned work are creating immense frustration. A Wakefield research report found that 78% of respondents (all data engineers) wished their jobs came with a therapist.

So you can imagine how data scientists feel when they are waiting on data engineers who are overloaded with endless requests. It’s a pipeline problem that all types of engineers are familiar with, and it is one reason why dataops itself is becoming popular. The idea of automating end-to-end data lifecycle workflows, from removing errors to observability and governance, makes sense for data engineers (well, that’s the dream anyway).

Understanding the Kitchen

The Kitchen takes the dataops concept a step further. It creates an on-demand environment with all the components a data scientist or analyst needs to start working on the problem statement. Essentially, a complete Kitchen will have reusable microservices (the ingredients), a complete toolchain (the chef’s toolkit), workflow integration, observability, governance (the cookbook) and more.

You can also have different Kitchens for different purposes. For example, Eran Strod’s DataKitchen blog explains that a development Kitchen can pass its analytics workload to a production Kitchen. Both Kitchens draw on the same underlying technical environment, which means that no non-portable references are carried over from development to production.
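To make the portability point concrete, here is a minimal sketch in Python. It is not DataKitchen's API: the `Kitchen` class, the `sales_table` logical name, and the schema names are all hypothetical. The idea it illustrates is that the workload only ever asks for logical names, and each Kitchen decides what those names point to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kitchen:
    """Hypothetical Kitchen: a named workspace with its own resource bindings."""
    name: str
    bindings: dict  # logical name -> environment-specific location (assumed layout)

    def resolve(self, logical_name: str) -> str:
        # Fail loudly if the workload asks for something this Kitchen does not provide.
        if logical_name not in self.bindings:
            raise KeyError(f"{self.name} Kitchen has no binding for {logical_name!r}")
        return self.bindings[logical_name]


def build_report_query(kitchen: Kitchen) -> str:
    # The workload refers only to the logical name, never to a hard-coded schema.
    table = kitchen.resolve("sales_table")
    return f"SELECT region, SUM(amount) FROM {table} GROUP BY region"


# Two hypothetical Kitchens that bind the same logical name to different tables.
dev = Kitchen("dev", {"sales_table": "dev_schema.sales_sample"})
prod = Kitchen("prod", {"sales_table": "prod_schema.sales"})

dev_query = build_report_query(dev)
prod_query = build_report_query(prod)
```

Because `build_report_query` never mentions a schema directly, the same function can move from the development Kitchen to the production Kitchen unchanged.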

Kitchens can also be flexible. A Kitchen can be a persistent workspace or a temporary one, and it can be tied to a single project or to several. Policy enforcement is also more efficient because it can be built into the automated workflow.

A godsend for data governance

Underpinning the development of the Kitchen concept is DataGovOps. The latter’s ability to automate data governance, ensuring that it does not stifle data engineering or hinder faster innovation, can be a boon for data engineers.

One Medium blog post made a great analogy: “If manual governance is like handing out speeding tickets, then self-service sandboxes are like purpose-built race tracks. The track enforces where you can go and what you can do, and is built specifically to enable you to go really fast.”

One key advantage of DataGovOps is that it focuses on process lineage, not just data lineage. This is vital for data engineers, who no longer have to spend additional time determining which processes are creating issues, a significant burden with today’s complex pipelines.
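As a rough illustration of what process lineage buys you, the Python sketch below records which step produced each data set and then walks backwards from a suspect data set to the steps that fed it. The `LineageLog` class and the step and data set names are invented for this example and do not come from any particular DataGovOps product.

```python
from collections import deque


class LineageLog:
    """Hypothetical process-lineage log: which step produced which data set."""

    def __init__(self):
        self.produced_by = {}  # data set name -> (step name, list of input data sets)

    def record(self, step, inputs, output):
        self.produced_by[output] = (step, list(inputs))

    def upstream_steps(self, dataset):
        """Return the steps that contributed to `dataset`, nearest first."""
        steps, seen = [], set()
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            # Skip data sets already visited and external sources with no recorded step.
            if current in seen or current not in self.produced_by:
                continue
            seen.add(current)
            step, inputs = self.produced_by[current]
            steps.append(step)
            queue.extend(inputs)
        return steps


log = LineageLog()
log.record("ingest_orders", [], "raw_orders")
log.record("clean_orders", ["raw_orders"], "clean_orders")
log.record("ingest_customers", [], "raw_customers")
log.record("join_orders_customers", ["clean_orders", "raw_customers"], "orders_enriched")

suspects = log.upstream_steps("orders_enriched")
```

If `orders_enriched` turns out to be wrong, `suspects` lists the processes to inspect, starting with the join and working back to the ingestion steps.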

By making DataGovOps work in the background, Kitchens can flag policy violations and maintain a thorough audit trail. This also addresses a major pain point for many companies that work with test data and have to wait a long time for clean, accurate, privacy-aware data sets. A Kitchen can be set up to provide test data on demand.
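A toy version of on-demand, privacy-aware test data might look like the Python sketch below. Everything in it is an assumption made for illustration: the `SENSITIVE_FIELDS` policy, the `provision_test_data` helper, and the sample records. Truncated hashing is used here only as a simple stand-in for masking; a real deployment would need a properly reviewed anonymization approach.

```python
import hashlib

# Hypothetical policy: fields that must never reach a sandbox unmasked.
SENSITIVE_FIELDS = {"name", "email"}


def mask(value: str) -> str:
    # Replace the value with a short, stable token (illustrative only).
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def provision_test_data(records, sensitive_fields=SENSITIVE_FIELDS):
    """Return masked copies of `records` plus an audit trail of what was masked."""
    masked, audit = [], []
    for index, record in enumerate(records):
        clean = {}
        for field, value in record.items():
            if field in sensitive_fields:
                clean[field] = mask(str(value))
                audit.append((index, field))  # record which field was masked, not its value
            else:
                clean[field] = value
        masked.append(clean)
    return masked, audit


source = [
    {"name": "Ada Example", "email": "ada@example.com", "amount": 120},
    {"name": "Bob Example", "email": "bob@example.com", "amount": 75},
]

test_data, audit_trail = provision_test_data(source)
```

The policy is applied inside the provisioning step itself, so nobody has to request or approve a masked extract by hand, and the audit trail is produced as a side effect.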

Easing clogged pipelines

At the end of the day, Kitchens are about speeding up data-driven decisions. In the past, the speed at which data scientists and analysts worked was directly proportional to how effective the data engineering team or toolchain was.

A Kitchen looks to use recent advancements in automation and DataGovOps so that the data scientists can start their projects faster (and get to their analyses more quickly) while unburdening the data engineering team and allowing them to focus on building complex and reliable pipelines from clean data sources.

So, the business case is clear. Of course, in a dataops world where many decision-makers are still coming to terms with the fresh onslaught of new concepts and vocabulary, it remains to be seen whether the concept will stick or be discarded as a fad.

Winston Thomas is the editor-in-chief of CDOTrends and DigitalWorkforceTrends. He’s a singularity believer and a blockchain enthusiast, and he believes we already live in a metaverse. You can reach him at [email protected].
