Data Science Transformation

Why Snowflake Built a Data Warehouse in the Cloud

By Paul Mah
November 22, 2019

How do you give businesses the power and flexibility that come with being data-driven? For Snowflake, the answer boils down to ensuring that organizations have a quick way to build an effective and affordable data warehouse that is accessible to those who need it.

To understand more about Snowflake and why data scientists can benefit from building a data warehouse in the cloud, CDOTrends spoke to Geoff Soon, the newly appointed managing director of Snowflake in South Asia.

Built for the cloud

“We are one of the new organizations that take advantage of the elastic nature of the cloud. Where we are different is the ability to separate compute from storage so that you don’t have any of the concurrency issues you have with traditional warehouses,” explained Soon.

One of the most challenging aspects of database systems is the unpredictability of the workloads, he observed. To ensure that systems run smoothly, database administrators or IT managers typically size their infrastructure with periods of peak workloads in mind, which leaves capacity sitting idle for the rest of the year.

Being cloud-based, Snowflake avoids that waste while offering the added advantage of working with more than one public cloud platform. “Go to our website, define a database and pick an availability zone from various cloud platforms and locations,” said Soon.

Snowflake is currently available in most Amazon Web Services (AWS) and Microsoft Azure regions, he said, and support for Google Cloud Platform (GCP) will be added later.

Written from scratch to support SQL natively, the Snowflake engine manages all aspects of software installation and updates. This means administrators don’t have to worry about software updates or patches. Snowflake is, however, not available to run on private cloud infrastructure.

The happy data scientist

But surely the ability to deploy across different public cloud platforms and in different regions isn’t much of an advantage for data scientists, given that they are usually more concerned with computing oomph for their data crunching needs?

According to Soon, Snowflake is an excellent fit for the needs of data scientists: “The traditional data scientist is very frustrated. He needs access to multiple data sets that vary significantly in size. He spends half his life in [the IT department] trying to persuade them to free up more resources to run his models.”

“Because we have separated compute from storage with Snowflake, the data scientist has his pool of resources and can scale it accordingly. He has a small compute cluster to run his models on a subset of the data. When they find the algorithm that they like, they will dramatically increase the size of their cluster, and run it across the entire dataset.”

“A task that may have taken 128 hours on one machine can take just an hour on 128 machines – and cost the same,” said Soon, who noted that Snowflake currently supports up to 128 nodes within a single cluster.
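
The arithmetic behind that claim can be sketched in a few lines of Python. This is an illustration only, not Snowflake's pricing model: it assumes the job scales perfectly linearly across nodes and that billing is a flat rate per node-hour. The function name `job_cost` and the rate of 2.0 per node-hour are hypothetical values chosen for the example.

```python
from typing import Tuple


def job_cost(work_node_hours: float, nodes: int, rate_per_node_hour: float) -> Tuple[float, float]:
    """Return (wall_clock_hours, total_cost) for a job.

    Assumptions (not from the article): the work divides evenly across
    nodes with no overhead, and every node is billed at the same flat
    hourly rate for the time the job runs.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    wall_clock_hours = work_node_hours / nodes
    total_cost = wall_clock_hours * nodes * rate_per_node_hour
    return wall_clock_hours, total_cost


# Hypothetical rate, used only to compare the two configurations.
RATE = 2.0

single_node = job_cost(128, nodes=1, rate_per_node_hour=RATE)
full_cluster = job_cost(128, nodes=128, rate_per_node_hour=RATE)

print(single_node)   # (128.0, 256.0)
print(full_cluster)  # (1.0, 256.0)
```

Under these assumptions the total number of node-hours billed is identical in both cases, so only the waiting time changes. Real workloads rarely scale perfectly, so the actual speed-up and cost may differ.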

Start small

Any advice for organizations just starting with their data warehousing initiatives?

“Many organizations adopt an ‘all or nothing’ approach to data warehousing, when in fact it is very possible to take an incremental approach to it. Focus on the use cases that are currently causing the most stress and the most problems in your existing environment.”

“Look at your existing IT infrastructure. There’s always a couple of users or reports that are chewing up a disproportionate amount of resources. These are the stuff that you can consider leaving to Snowflake.”

In closing, Soon observed that it is impossible to do any form of AI or ML without a solid foundation of data – which can only begin with establishing a data warehouse. This means organizations should get started sooner rather than later.

“Once you [set up a data warehouse], you can start to report on the past. The next step is trying to predict the future by drawing inferences through predictive modeling,” he said. “Automate the predictive model using ML to validate and create the model. The final stage is AI. But none of these options are possible without the foundational datasets.”

“With an on-premises infrastructure, you have to spend months in front of a crystal ball to size it right. With Snowflake you can get started in a matter of hours. You don’t have to get everything right up front.”

Paul Mah

Paul Mah is the editor of DSAITrends, where he reports on the latest developments in data science and AI. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose.