An Open Conversation About Open Data Lakehouses
- By Tim Meehan and Girish Baliga, Presto Foundation
- March 20, 2023
Open data lakehouses are the next evolution of modern data stacks, combining the best of the data lake — the ability to handle all types of data formats — with data warehouse-level processing capabilities.
We work at Meta and Uber, respectively, two of the world’s most prominent internet-scale companies. We have direct experience building data analytics at scale using open lakehouse technologies such as HDFS, Hive, ORC, Parquet, and Hudi. Our high-performance, open data lakehouse implementations have proven not only to scale but also to do so reliably. And these technologies aren’t just for internet-scale organizations. Smaller organizations can easily adopt them, confident they will scale reliably and affordably with their growing needs.
In addition to our corporate responsibilities, we are also members of the Presto Foundation. We believe that Presto, an open distributed SQL query engine, is a crucial component of modern data architectures and a key technology for getting the most from data lakehouses.
We recently talked about some of the data stack characteristics and technology choices we’ve made at our companies, and about how a Presto-powered open data lakehouse can help organizations of any size, in any kind of business, unlock the full potential of their data.
Here are some excerpts from our conversation.
What are the challenges companies face as they look to modernize?
Tim Meehan (TM): Some of the common challenges involved in what can be called modern data architecture are centered around cost and scalability. Because no one can predict the future, building a data architecture means considering current data requirements and what might be needed later. At a high level, you want to unlock the full potential of your data and use it for future purposes that you can’t plan for, which makes horizontal scalability vital. And, of course, cost drives a lot of the decision-making.
Girish Baliga (GB): In addition to scalability and cost, reliability becomes increasingly important as you scale. The more machines you have, the higher the likelihood that at least one of them will fail. If every machine has the same independent probability of failure, do the math: once you have thousands of machines, failures become a routine event rather than a rare one. So it’s very important to have a reliable system that can handle failures, even as it scales.
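GB’s point about failure rates can be made concrete with a few lines of arithmetic. The sketch below computes the probability that at least one of n machines fails in a given window, assuming independent failures; the per-machine rate is made up for illustration and is not a figure from Meta or Uber:

```python
# Probability that at least one of n machines fails in a given window,
# assuming independent failures with the same probability p per machine.
# P(at least one failure) = 1 - (1 - p)^n.
# Illustrative numbers only, not actual fleet statistics.

def p_any_failure(p: float, n: int) -> float:
    """Chance that at least one of n machines fails."""
    return 1.0 - (1.0 - p) ** n

# Even a small per-machine failure rate makes failures routine at scale.
for n in (10, 1_000, 10_000):
    print(f"{n:>6} machines -> {p_any_failure(0.001, n):.1%} chance of at least one failure")
```

With a 0.1% per-machine rate, the chance of seeing a failure goes from negligible at ten machines to near certainty at ten thousand, which is why fault tolerance has to be designed in rather than bolted on.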
What do the data stacks look like at Meta and Uber?
TM: Condensing it to its essence, the Meta data stack looks a lot like a standard open data lakehouse architecture. Our reasoning for using a data lakehouse approach is simple: it’s a very scalable architecture that allows you to consolidate investments and scale with your data needs and organizational growth, with multiple options for growing along both dimensions.
Meta’s data stack has a scalable storage layer that’s used for whatever purpose is needed. And it all revolves around getting the data in there and getting data out. We use tools to ingest data into this storage layer as well as tools to query it, notably the Presto distributed SQL query engine. We can use Presto to do interactive analytics, exploratory analytics, dashboards, experimentation, etc.
Presto is our tool of choice for the bread-and-butter use cases of exploratory analytics, dashboards, and data analysis within the data lake. We also use Presto extensively for batch at a very high scale, in conjunction with Spark. In terms of volume, most of the queries we execute in batch go to Presto, which we achieve through a combination of techniques. We have dedicated clusters optimized purely for batch throughput, and we’ve invested heavily in batch-oriented technologies within Presto itself, including a large batch mode and Presto on Spark.
GB: At Uber, our data stack is structured very similarly to the architecture that Tim talked about. Our data stack is also mainly based on open source technologies. We have data coming into ingestion pipelines, which largely route them to Kafka. Then we have the data landing in HDFS, which is what we use for our open data lake.
On top of that, we have a transactional layer called Hudi, which was built in-house at Uber and gives us functionality like upserts and data freshness. We also have Spark jobs that do ETL (we use Hive on Spark, an innovation that came out of Meta) to process our data for querying. Then we have Presto on Spark for our internal data consumers.
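To make the upsert idea concrete, here is a minimal in-memory sketch of the semantics a transactional layer like Hudi provides over a data lake: incoming records update existing rows by key or insert new ones. The `upsert` helper, the `trips` table, and its fields are illustrative inventions, not Hudi’s actual API:

```python
# Sketch of upsert semantics: merge incoming records into a table by key.
# This is an in-memory illustration of the behavior a transactional layer
# like Hudi provides on lake storage, not Hudi's API.

def upsert(table: dict, records: list, key: str = "id") -> dict:
    for rec in records:
        # Merge onto any existing row with the same key, else insert.
        table[rec[key]] = {**table.get(rec[key], {}), **rec}
    return table

# Hypothetical table of trips keyed by trip id.
trips = {1: {"id": 1, "city": "SF", "fare": 12.0}}
upsert(trips, [
    {"id": 1, "fare": 14.5},                # update: fare corrected in place
    {"id": 2, "city": "NYC", "fare": 9.0},  # insert: brand-new trip
])
```

Without an upsert primitive, correcting a single row in an append-only lake means rewriting whole files; a transactional layer does that bookkeeping for you while keeping the data fresh for readers.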
As for numbers at Uber, we’re talking about thousands of machines across our east and west clusters. We have at least 6,000 to 7,000 weekly active users, and many more monthly. At least half the company uses Presto once a month to do their job, generating at least half a million queries, many of them very complex. These are people trying to get reports, do analysis, prep data, and run all kinds of workloads.
One of the other interesting things about Uber is that we are a very operations-driven company. We have many business groups across many cities, each set up as an independent business unit that works with local regulatory authorities and partners. Those groups rely on the insights they get from our data to conduct business, so they're all consumers of Presto to run daily, weekly, and monthly reports, do analysis and get product insights. This is where Presto is key to business success.
Trends in the use of Presto and open data lakehouses
TM: From my experience at Meta and in discussions with other companies, people are getting value out of Presto as a high-speed SQL processing layer on top of very scalable data lake storage. Presto is a way to unlock the value of this scalable data lake.
For those unfamiliar with Presto, it’s a fast, open source distributed SQL query engine that accesses data through pluggable connectors. One of Presto’s defining features is its interactive nature and very fast execution: submitting a query through Presto is quick, both in startup time and in time to first byte.
Another defining feature of Presto is that it’s open source. This is important for the open data lake because it removes vendor bias and keeps the engine agnostic to your data format and table layout.
Presto is already good at reporting and dashboarding, but at the Presto Foundation, we are constantly working to improve Presto so that it continues to scale as the data lake scales. At Meta, we’re running a very large-scale deployment and always pushing the envelope, enabling more and more use cases. Presto is the first tool we reach for in this area.
GB: A significant value proposition of Presto is that it lets you query data where it lives, without moving or copying your data over. Its use cases include business intelligence (BI), dashboarding, ad hoc queries, and data lakehouse analytics.
One of the interesting things we see at Uber is that people love Presto because it’s simple. And I think that is a very underrated aspect. Especially when you’re hitting scale, it’s really hard to build programs that can process data in the volumes we want to analyze.
For instance, if you’re talking about gigabytes, you can probably write a program to do it. If you’re talking about terabytes, you’ll probably start hitting limits. Then when you talk about petabytes and thousands of users accessing your system simultaneously, it really boils down to how simple you can make an interface for your users.
SQL is a proven simple interface. Because it’s simple, many people can write SQL queries; you don’t need to be a software engineer. Most data scientists know how to do it. And that’s the power that the Presto distributed SQL query engine brings. You can replace your complex MapReduce pipelines with a few SQL queries, and it will run just fine — on small data sets, big data sets, all the storage formats you can support, and at different companies.
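GB’s point can be illustrated with a toy comparison of a hand-written aggregation loop against the equivalent single SQL statement. SQLite stands in for a distributed engine like Presto here, and the table and values are made up:

```python
# Toy comparison: an imperative aggregation "pipeline" vs. one SQL query.
# SQLite is a stand-in for a distributed SQL engine like Presto;
# the trips table and its values are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)",
                 [("SF", 12.0), ("SF", 8.0), ("NYC", 9.5)])

# Imperative version: fetch rows and group/sum by hand.
totals = {}
for city, fare in conn.execute("SELECT city, fare FROM trips"):
    totals[city] = totals.get(city, 0.0) + fare

# Declarative version: the same result in one SQL statement.
rows = conn.execute(
    "SELECT city, SUM(fare) FROM trips GROUP BY city ORDER BY city"
).fetchall()
```

Both produce the same per-city totals, but the SQL statement stays this short whether the engine is SQLite on a laptop or a distributed engine over petabytes, which is exactly the simplicity being described.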
As folks move among different use cases and companies, the skill sets they take with them mature, and the ability to work with Presto SQL becomes more and more valuable. Using Presto means adding cost-effectiveness and ease of use to your data lakehouse at any scale.
Tim Meehan is chair of the Presto Foundation technical steering committee and software engineer at Meta. Girish Baliga, Ph.D., is chair of the Presto Foundation governing board and director of engineering at Uber.
The views and opinions expressed in this article are those of the authors and do not necessarily reflect those of CDOTrends. Image credit: iStockphoto/cagkansayin