A Closer Look at Apache Arrow for Data Analytics

By Paul Mah
October 07, 2020

Whether you have noticed it or not, one open-source project has been a quiet champion of data processing and machine learning. Under the hood, its efficient data interchange format is used by well-known projects such as Apache Spark, Google BigQuery, and TensorFlow.

Apache Arrow

Specifically, Apache Arrow is used by the various open-source projects above, as well as “many” commercial or closed-source services, according to software engineer and data expert Maximilian Michels.

At its core, Arrow was designed for high-performance analytics and supports efficient analytic operations on modern hardware like CPUs and GPUs with lightning-fast data access. It offers a standardized, language-agnostic specification for representing structured, table-like datasets in-memory.

To support cross-language development, Arrow currently works with popular languages such as C, Go, Java, JavaScript, Python, R, and Ruby, among others. The project also includes DataFusion, a query engine written in Rust that works with Arrow data.

Great for data analytics

In a blog entry, Michels pointed to Arrow’s support for in-memory computing, its standardized columnar storage format, and its IPC and RPC framework for data exchange as its outstanding features.

In-memory computing: Because Arrow understands how to read and operate on disparate data sources such as Spark or Cassandra directly, there is no need to convert the data or load the entire chunk into memory first. Arrow keeps as much data in memory as possible and uses memory-mapping techniques to bring the rest into RAM as needed. This means Arrow can work with datasets larger than the available physical memory.
Columnar storage format: Much of the magic around the Arrow format lies in how it lays out columnar data. By arranging values from the same column together, Arrow takes advantage of modern CPU architectures, namely their specialized instructions and processor-level caching and pipelining, explained Michels. The result is a tremendous performance boost that “enables new applications which were not feasible before”.
Efficient data exchange: Finally, the Arrow library offers interfaces for communication between processes on the same machine or across multiple nodes. For instance, a Python process and a Java process can exchange data efficiently without having to make additional copies, which can be a lifesaver when working with a lot of data. When it comes to transfers between nodes, sending parallel streams of Arrow data can result in up to 50 times better performance compared to ODBC over TCP.
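The first two points can be illustrated with a rough sketch that uses only the Python standard library. It is not Arrow's actual format or API (the real specification adds metadata, validity bitmaps, and more); the record fields, file name, and helper functions are all hypothetical. It only shows why keeping a column's values contiguous helps, and how memory mapping lets a program read a column from disk without loading the whole file up front.

```python
import mmap
import os
import struct
import tempfile
from array import array

# Hypothetical records: (id, price) pairs.
rows = [(1, 9.5), (2, 20.0), (3, 0.5), (4, 12.0)]

# Row-oriented layout: each record's fields sit next to each other (16 bytes per record).
row_bytes = b"".join(struct.pack("<qd", i, p) for i, p in rows)

# Column-oriented layout: all values of one column are contiguous.
ids = array("q", (i for i, _ in rows))
prices = array("d", (p for _, p in rows))


def sum_prices_row_layout(buf):
    """Sum prices from the row layout; every id is decoded along the way."""
    record_size = struct.calcsize("<qd")
    return sum(
        struct.unpack_from("<qd", buf, offset)[1]
        for offset in range(0, len(buf), record_size)
    )


def sum_prices_column_layout(column):
    """Sum prices from the column layout; only price values are touched."""
    return sum(column)


def write_column(path, column):
    """Write a column's raw values to a file."""
    with open(path, "wb") as f:
        column.tofile(f)


def sum_mapped_column(path):
    """Memory-map the file and sum it as doubles; the OS pages data in on demand."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            with memoryview(mapped) as raw, raw.cast("d") as view:
                return sum(view)


if __name__ == "__main__":
    print(sum_prices_row_layout(row_bytes))
    print(sum_prices_column_layout(prices))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prices.bin")
        write_column(path, prices)
        print(sum_mapped_column(path))
```

All three functions compute the same total, but only the row-oriented version has to step over the id field to reach each price. With millions of rows and many columns, that difference is where the cache and vectorization benefits described above come from.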

So, if you are facing efficiency issues with manipulating or transferring data for your data science project, you might want to check out Apache Arrow.

Photo credit: iStockphoto/persistent

Paul Mah

Paul Mah is the editor of DSAITrends, where he reports on the latest developments in data science and AI. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose.