Should You Build or Buy Your Data Stack in 2023?
- By Paul Mah
- January 28, 2023
Data is crucial for decision-making and informed action in today's fast-paced business and technological landscape. To forge ahead, organizations must leverage data to gain valuable insights into their operations, customers, and markets.
But what tools should businesses use to manage and manipulate their data to derive these insights?
Last year, I wrote about whether an organization should train or hire to develop its data competency. As we kick off the new year, I felt it made sense to take a closer look at whether businesses should build or buy their data stack.
An explosion of solutions
According to Nishith Agarwal, the head of data and machine learning platforms at Lyra Health, building a data stack has never been easier – yet never more nuanced. He should know a thing or two about that, having spent years building data platforms for teams at organizations such as Walmart Labs and Uber before joining Lyra Health.
Building a data stack is so much easier today because of an explosion of solutions for managing data and a rapidly maturing data tooling ecosystem. From a handful of nascent tools a few years ago, the ecosystem has grown to offer a significant number of impactful commercial and open-source solutions.
Agarwal highlighted a handful of them, such as dbt, Apache Airflow, and Apache Hudi; he was part of the team at Uber that created Apache Hudi, a transactional data lake platform designed for low-latency data ingestion that now powers a massive 100PB data lake there.
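For a flavor of what this tooling looks like in practice, the sketch below upserts a few records into a Hudi table with PySpark. It is only illustrative: the table name, fields, paths, and bundle version are assumptions, not details from Agarwal's deployments.

```python
# Minimal, illustrative Hudi upsert via PySpark.
# All names, paths, and the bundle version below are assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    # The Hudi Spark bundle must be on the classpath; version is illustrative.
    .config("spark.jars.packages",
            "org.apache.hudi:hudi-spark3.3-bundle_2.12:0.12.2")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical event records; the precombine field (ts) resolves duplicates.
events = spark.createDataFrame(
    [("e1", "signup", "2023-01-28 10:00:00"),
     ("e2", "login", "2023-01-28 10:05:00")],
    ["event_id", "event_type", "ts"],
)

(events.write.format("hudi")
    .option("hoodie.table.name", "events")                          # illustrative
    .option("hoodie.datasource.write.recordkey.field", "event_id")  # record key
    .option("hoodie.datasource.write.partitionpath.field", "event_type")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("/tmp/hudi/events"))                                      # illustrative path
```

Re-running the job with the same event_id values updates the existing rows instead of appending duplicates – the transactional behavior that distinguishes a platform like Hudi from writing plain Parquet files.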
Cost is one of the main considerations when deciding whether to build or buy, says Agarwal. And while much boils down to budget, organizations need to look beyond the sticker price.
Specifically, they must weigh the cost of hiring or training the data engineers needed to build the data stack against the cost of purchasing an off-the-shelf solution. Agarwal cautioned that employee costs could add up quickly – and that is before factoring in the opportunity cost.
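As a crude illustration of how that comparison might be framed, here is a back-of-envelope calculation in Python. Every figure is a made-up assumption to be replaced with your own estimates.

```python
# Back-of-envelope build-vs-buy comparison.
# Every figure below is a made-up assumption; substitute your own estimates.

ENGINEER_COST_PER_YEAR = 180_000     # fully loaded cost per data engineer
ENGINEERS_TO_BUILD = 3               # headcount to build and operate the stack
VENDOR_LICENSE_PER_YEAR = 250_000    # sticker price of a managed solution
OPPORTUNITY_COST_PER_YEAR = 100_000  # product work those engineers forgo

build_cost = ENGINEERS_TO_BUILD * ENGINEER_COST_PER_YEAR + OPPORTUNITY_COST_PER_YEAR
buy_cost = VENDOR_LICENSE_PER_YEAR

print(f"Build: ${build_cost:,}/yr vs buy: ${buy_cost:,}/yr")
# Build: $640,000/yr vs buy: $250,000/yr
```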
The interoperability conundrum
Moreover, modern data stacks are more fragmented than ever, making interoperability a top consideration – and one that, unfortunately, the large and rapidly growing data ecosystem does little to help.
“Your home-built or open-source solution may work with Snowflake today, but will it work once they release a major feature update? And does it work with the 20 other solutions you need to plug and play together?” he wrote.
Should businesses then simply stick with off-the-shelf solutions? Not necessarily, says Agarwal. “As companies scale beyond a certain point and begin to build in-house tooling to meet more nuanced data needs, that complexity can grow beyond what managed plug-and-play solutions can support.”
Plug-and-play solutions, he says, may only address 70% of use cases, leaving the engineering team to devise solutions for the remaining 30%. And the need for tech talent to bridge that gap raises the thorny issue of hiring the right people.
Ironically, a heavy focus on in-house tooling can adversely affect talent retention. Data engineers typically choose opportunities that let them deepen their experience with industry-standard tooling; a lack of standard tools can therefore hurt an organization’s ability to staff and retain top talent over the long term.
The most important consideration
When all is said and done, there is no point in developing a unique solution if it doesn’t give the organization a competitive advantage.
Of course, how long it takes to develop the solution matters, too. As Agarwal noted, an in-house tool not only takes more time to build than customizing an existing solution but also requires more engineering time to support. Any advantages conferred by the solution must therefore be balanced against the time to value.
Building typically entails a significant investment and should not be taken lightly. But while the natural inclination of some might be to lean towards “safe” preexisting solutions, forcing them to support capabilities they were never designed for could result in kludgy, suboptimal systems – and take almost as much effort as building from scratch.
If there is one takeaway, it would be that there is no one-size-fits-all solution when it comes to the data stack.
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].
Image credit: iStockphoto/katerinasergeevna