Deploying a Data Warehouse With PostgreSQL

Is it possible to deploy a data warehouse using the popular open-source PostgreSQL database management system (DBMS)? According to Cedric Dussud, cofounder of analytics platform, you can – if you adhere to some commonsense guidelines and make some configuration tweaks.

To be clear, traditional database systems such as PostgreSQL are row-oriented and work best with large numbers of separate queries.

On the other hand, analytical use cases typically process a substantial number of rows at a time, can take seconds to minutes to complete – compared to milliseconds non-analytical queries, and likely only access a limited number of columns at a time.

Using PostgreSQL for analytics

Dussud, formerly the engineering director at WeWork and an ex-senior software engineer at Microsoft, says PostgreSQL can comfortably service analytical queries with his tweaks.

However, he acknowledged that a cloud-based warehouse such as Snowflake would likely be easier to manage and maintain in the long run.

Many of the tips offered by Dussud will also benefit data scientists who might be using a traditional DBMS for their work:

  • Avoid running analytics workloads on production systems
  • Upgrade to PostgreSQL to version 12 or newer if you use ‘WITH’ in your queries.
  • If possible, remove ‘WITH’ entirely – queries become less readable, but performance can go up by a factor of 4.
  • Go easy on Indexes, since most analytics workloads do not benefit from it

Disk performance matters

While no enterprises will run a DBMS on a server with less than stellar specifications, the table scans common to analytic workloads mean that I/O (Input/Output) performance is more crucial than ever.

On this front, Dussud suggests using SSDs over hard disk drives, though the feasibility will probably depend on the quantity of data and available budget. In addition, the savvy data scientist should ensure adequate memory to cache the commonly assessed tables – or consider trimming them down.

One tip: Check the “pg_stat_activity” table to check whether a long-running query is reading from the storage drive.

It is worth noting that some cloud hosting providers will automatically throttle I/O beyond a predefined threshold. For data warehouses built on virtual servers in the cloud, this is a vital consideration to take note of.

PostgreSQL-specific tips

Finally, use fewer columns where possible. Unlike dedicated data warehouse solutions, PostgreSQL uses row-based storage and writes data from each row sequentially to disk. This means PostgreSQL must fully load each row into memory to access the desired column, resulting in a little less performance for tables with more columns.

One advantage of PostgreSQL is how it tries to run queries in parallel whenever possible. Despite a slight overhead to spawn new workers, this benefits analytical workloads and is unlikely to be noticeable. Adjusting the number of workers by tweaking the max_parallel_workers and max_parallel_workers_per_gather configuration setting can be advantageous.

Of course, there comes a time when a cloud-based data warehouse makes more sense for extremely large deployments. But where does one draw the line? Dussud says his informal tests show that PostgreSQL performs well with a table of between 50 to 100 million rows, though he stressed that this is a generalization.

Dussud suggests the free Citus PostgreSQL extension should the necessity arise to build a PostgreSQL cluster to “scale into billions of rows”. So, it seems possible to use a DBMS solution for your analytic needs but cloud-based data warehouses remain more popular for now.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].​

Image credit: iStockphoto/gorodenkoff