Hybrid Lakehouses and Iceberg Tables: Wrangling the AI Data Stampede
- By Winston Thomas
- September 30, 2024
The GenAI gold rush has flipped the tech landscape on its head. It's a full-blown data rodeo out there: GPUs are the new mustangs, mothballed data is suddenly the hottest commodity, and storage needs are stampeding skyward.
But amidst all this chaos, one overlooked nugget catches the eye of AI wranglers: the dusty world of data formats. While big players are busy building foundational models, the rest of the pack is jumping on the AI bandwagon through inference and Retrieval Augmented Generation (RAG). And guess what's fueling this new wave? Open data formats and those architectural mavericks, the lakehouses.
Reinventing the modern AI backbone
Inference, the bread and butter of AI, is where models flex their predictive muscles on real-world data. Traditionally, getting there meant training models on massive datasets, fine-tuning them, and only then putting them to work on inference. But, as Justin Borgman, the co-founder and chief executive officer of Starburst, bluntly puts it: “The economics of AI are forcing organizations to rethink their data strategies.”
For many, ballooning costs and a lack of in-house AI gunslingers make fine-tuning open-source models the more economically sound strategy. This is where the showdown over open data formats and data architectures takes center stage.
Enter the lakehouse, a hybrid beast that marries the untamed sprawl of a data lake with the structure and performance of a data warehouse. It's the perfect corral for the massive, messy datasets that generative AI thrives on.
Think of Apache Iceberg tables as the sturdy wagons of this new data frontier. They're an open table format for handling massive datasets with ACID (atomicity, consistency, isolation, and durability) transactions, schema evolution, and time travel features. They're like the sheriff of your data town, ensuring consistency and reliability in a world where data quality can make or break your AI models. And with heavy hitters like Snowflake and DataStax backing them, Iceberg tables are fast becoming the gold standard.
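To make those features less abstract, here is a minimal PySpark sketch of the three headline capabilities, assuming a Spark session already configured with an Iceberg catalog; the catalog, table, and column names are made up for illustration.

```python
# Minimal PySpark sketch of the Iceberg features mentioned above: transactional
# writes, schema evolution, and time travel. Assumes Spark is already configured
# with an Iceberg catalog named "lake"; table, view, and column names are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-sketch").getOrCreate()

# Transactional (ACID) append: readers only ever see complete snapshots.
spark.sql("INSERT INTO lake.sales.orders SELECT * FROM staging_orders")

# Schema evolution: add a column without rewriting existing data files.
spark.sql("ALTER TABLE lake.sales.orders ADD COLUMNS (discount_pct double)")

# Time travel: query the table as it looked at an earlier point in time.
spark.sql(
    "SELECT * FROM lake.sales.orders TIMESTAMP AS OF '2024-09-29 00:00:00'"
).show(5)
```

The point is that every commit is an atomic snapshot, so schema changes and historical reads are cheap metadata operations rather than expensive rewrites of the underlying files.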
Starburst: Blazing the open source trail
Starburst, fueled by the Trino query engine, is taking it further. It is breaking down data silos, enabling you to query data across hybrid and multi-cloud environments like a true data Jedi. Its support for open formats like Iceberg is helping companies build a truly open and flexible data architecture that can support the demands of generative AI.
It also allows AI teams to blend cloud and on-premises infrastructure strategies as they find better ways to avoid cloud sticker shock.
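To picture what that looks like in practice, here is a rough sketch using the Trino Python client to run a single federated query that joins an on-premises Iceberg catalog with a cloud-hosted Postgres catalog. The host, user, and catalog and table names are placeholders, not anything Starburst prescribes.

```python
# Rough sketch of a federated query: one SQL statement joining an on-premises
# Iceberg catalog with a cloud Postgres catalog through Trino. Host, user, and
# all catalog/schema/table names below are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.internal.example.com",
    port=8080,
    user="analyst",
    catalog="iceberg_onprem",  # default catalog; others are referenced explicitly
    schema="sales",
)

cur = conn.cursor()
cur.execute("""
    SELECT c.region, sum(o.amount) AS revenue
    FROM iceberg_onprem.sales.orders o
    JOIN postgres_cloud.crm.customers c ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY revenue DESC
""")
for region, revenue in cur.fetchall():
    print(region, revenue)
```

The join happens inside the engine, so nobody has to copy the cloud CRM data into the on-premises lake first just to answer the question.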
“We're seeing a renewed interest in on-premises infrastructure driven by GenAI,” says Borgman. “The economics of AI are forcing organizations to rethink their data strategies and explore hybrid models that leverage the best of both on-premises and cloud environments.”
Borgman sees a future filled with hybrid lakehouses and is positioning his company to take advantage of the paradigm shift.
The unasked questions: The high cost of late realization
Despite the clear advantages of hybrid lakehouses and open formats, many companies are still wrangling with how to manage and optimize their data infrastructure for AI workloads.
Why? Part of the problem is that many new tech advancements require maturity. “One question we think they're just starting to discover is how to manage Iceberg tables the way they manage tables in a traditional data warehouse,” observes Borgman. “How do I do the data management side of things? How do I ingest the data? How do I apply governance to it?”
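One illustrative (and assumed, not Starburst-prescribed) answer to the table-management question is to schedule routine housekeeping through the Iceberg table procedures Trino exposes. Connection details, table names, and retention windows below are placeholders.

```python
# Routine Iceberg housekeeping via Trino's table procedures, the kind of chore a
# traditional warehouse handles for you. Connection details and names are
# placeholders; the retention window is an assumption.
import trino

conn = trino.dbapi.connect(host="trino.internal.example.com", port=8080,
                           user="data_engineer", catalog="iceberg_onprem",
                           schema="sales")
cur = conn.cursor()

# Compact the small files left behind by frequent ingestion into larger ones.
cur.execute("ALTER TABLE orders EXECUTE optimize")
cur.fetchall()  # drain the result so the statement runs to completion

# Expire old snapshots so storage does not grow without bound, while keeping
# a week of history for time travel and rollback.
cur.execute("ALTER TABLE orders EXECUTE expire_snapshots(retention_threshold => '7d')")
cur.fetchall()
```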
The problem is that these are critical questions companies need to address early in their AI journey, not later. Failing to do so can lead to the significant delays, cost overruns, and missed opportunities that many are already experiencing.
Another critical question often overlooked is ensuring data freshness in AI workflows, especially as many companies embrace RAG. “To do RAG right, you want to have access to super, super fresh data,” emphasizes Borgman.
Data freshness requires companies to go directly to the source of data capture, bypassing traditional ETL (extract, transform, load) processes that can introduce latency and compromise data accuracy. The advantages are clear; breaking with established schools of thought is the hard part.
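As a deliberately simplified sketch of that idea, the retrieval step of a RAG flow can query the lakehouse directly for the last few minutes of events instead of waiting on a batch pipeline. The connection details, table, and column names here are invented, and the final prompt is handed off to whichever LLM client a team actually uses.

```python
# Simplified "fresh data" RAG sketch: retrieval queries the live Iceberg table
# directly rather than a stale, batch-loaded copy. All names are placeholders.
import trino

def fetch_fresh_context(conn, customer_id: int) -> str:
    """Pull the most recent events for a customer straight from the live table."""
    cur = conn.cursor()
    cur.execute(
        "SELECT event_time, description FROM events "
        "WHERE customer_id = ? AND event_time > now() - INTERVAL '15' MINUTE "
        "ORDER BY event_time DESC LIMIT 20",
        (customer_id,),
    )
    return "\n".join(f"{ts}: {desc}" for ts, desc in cur.fetchall())

def build_prompt(question: str, context: str) -> str:
    # Hand this string to whichever LLM client the team actually uses.
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

conn = trino.dbapi.connect(host="trino.internal.example.com", port=8080,
                           user="rag_service", catalog="iceberg_onprem", schema="ops")
prompt = build_prompt("What changed for this customer in the last 15 minutes?",
                      fetch_fresh_context(conn, customer_id=42))
```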
Conclusion: Agility is the new currency
GenAI is creating a tectonic shift that's cracking the foundations of traditional data management. Hybrid lakehouses and Iceberg tables offer blueprints for survival in this new AI-driven ecosystem. They help you wrangle the unruly beast of big data, turning it from a liability into a launchpad for AI-driven innovation.
Here, Starburst is carving its niche. It champions open source, data democratization, and a hybrid approach that blends on-premises muscle with cloud flexibility. More importantly, it is making these capabilities available now.
Yes, the choice is yours. But remember, in the fast-paced world of AI, your choices today will determine your agility and room to maneuver for years to come. Bootstrapping or taking a step backward is no longer an option.
Image credit: iStockphoto/hkuchera
Winston Thomas
Winston Thomas is the editor-in-chief of CDOTrends. He likes to piece together the weird and wondrous tech puzzle for readers and identify groundbreaking business models led by tech while waiting for the singularity.