Pandas 2.0 Finally Released
- By Paul Mah
- April 11, 2023
Last week saw the release of the long-awaited Pandas 2.0, three years after Pandas 1.0, its first major release, arrived in 2020.
Pandas is a popular open-source library used in data science for data manipulation and analysis. Built on top of the Python programming language, it enables data scientists to perform tasks such as data cleaning, data transformation, and data aggregation, among others.
Significant performance improvements
Pandas 2.0 brings a series of performance improvements and optimizations to the library.
Key updates include stabilized extension arrays (EAs) for users to define their own data types, better integration with Apache Arrow, and Copy-on-Write optimizations.
EAs now support missing values for custom data types, improving flexibility and performance.
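To make the idea of missing values in an extension array concrete, here is a minimal sketch using pandas' built-in nullable `Int64` type. It is only one example of an extension type, chosen here for illustration; the article does not single it out.

```python
import pandas as pd

# A nullable integer column: the missing entry is kept as pd.NA
# instead of forcing the whole column to float.
ints = pd.Series([1, None, 3], dtype="Int64")

print(ints.dtype)            # the extension dtype backing the column
print(ints.isna().tolist())  # which entries are missing
print(ints.sum())            # missing values are skipped by default
```

The same pattern applies to other extension types, though the exact behavior depends on how each type is implemented.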
Copy-on-write, which remains opt-in in this release, now comes with a lazy copy mechanism that defers copying until an object is modified, resulting in significant performance gains. The DataFrame constructor also uses lazy copies for columns when built from a dictionary of Series objects.
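The lazy copy behavior can be sketched as follows, assuming pandas 2.0 or later with copy-on-write switched on through the `mode.copy_on_write` option. The column name and values are made up for illustration.

```python
import pandas as pd

# Opt in to copy-on-write (it is off by default in pandas 2.0).
pd.set_option("mode.copy_on_write", True)

prices = pd.Series([10.0, 20.0, 30.0], name="price")

# Built from a dictionary of Series: with copy-on-write enabled, the
# actual copy of the column data can be deferred until a write happens.
df = pd.DataFrame({"price": prices})

# Selecting a column is also lazy; nothing is copied at this point.
col = df["price"]

# The first write triggers the copy, so only `col` is modified.
col.iloc[0] = 99.0

print(col.tolist())
print(df["price"].tolist())
print(prices.tolist())
```

After the write, `df` and `prices` still hold their original values, while `col` carries the change.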
Crucially, more functions now respect the copy-on-write mechanism, including DataFrame.from_records(), DataFrame.replace(), DataFrame.transpose(), and arithmetic operations that can be performed in place. This broader coverage ensures more consistent compliance with copy-on-write rules and improves overall efficiency in data manipulation.
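As a rough illustration of two of the operations named above, the sketch below assumes pandas 2.0 or later with copy-on-write enabled. The frame and its values are hypothetical.

```python
import pandas as pd

# Copy-on-write must be enabled explicitly in pandas 2.0.
pd.set_option("mode.copy_on_write", True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# In-place arithmetic on a column taken from the frame:
# the column is updated, the parent frame is left alone.
col = df["a"]
col *= 2

# Writing into a transposed frame likewise does not leak
# back into the original.
flipped = df.transpose()
flipped.iloc[0, 0] = -1

print(col.tolist())
print(flipped.iloc[0, 0])
print(df["a"].tolist())
```

With copy-on-write disabled, either write could propagate to `df`, which is the kind of surprise the mechanism is meant to rule out.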
Another noteworthy update in Pandas 2.0 is support for lower-resolution timestamps (seconds, milliseconds, and microseconds), addressing the limited date range that comes with nanosecond precision, previously the only option.
This change benefits researchers working with historical or geological timeframes: data spanning millennia or millions of years can now be analyzed without running into out-of-range errors.
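A short sketch of what this looks like in practice, assuming pandas 2.0 or later and a NumPy array with second resolution as input. The dates are arbitrary examples.

```python
import numpy as np
import pandas as pd

# Nanosecond timestamps only cover roughly the years 1677 to 2262,
# so the year 1000 falls outside that range.
dates = np.array(["1000-01-01", "2023-04-03"], dtype="datetime64[s]")

# pandas 2.0 keeps the second resolution instead of converting
# to nanoseconds.
ser = pd.Series(dates)

print(ser.dtype)
print(ser.dt.year.tolist())
```

Earlier versions would attempt the conversion to nanoseconds and fail on the out-of-range date.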
The list of changes is long: like Pandas 1.0, Pandas 2.0 brings many new APIs and deprecates a host of older ones. The full list can be viewed on the official website.
Image credit: iStockphoto/leungchopan
Paul Mah
Paul Mah is the editor of DSAITrends, where he reports on the latest developments in data science and AI. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose.