|Developer(s)||Apache Software Foundation|
|Initial release||October 10, 2016|
|Stable release||7.0.0 / 8 February 2022|
|Type||Data format, algorithms|
|License||Apache License 2.0|
Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware. This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.
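The idea of a column-oriented memory layout can be illustrated with a minimal sketch that uses only the Python standard library. It is not Arrow's implementation: the helper names `to_columnar` and `column_sum` and the sample records are hypothetical, and the layout is simplified to one contiguous values buffer per field plus a validity bitmap that marks which slots are non-null (bits numbered from the least significant bit, as in the Arrow format).

```python
from array import array

# Row-oriented input: one Python object per record (hypothetical sample data).
rows = [
    {"id": 1, "score": 9.5},
    {"id": 2, "score": None},
    {"id": 3, "score": 7.0},
]


def to_columnar(rows, name, typecode):
    """Pack one field into a contiguous values buffer plus a validity bitmap."""
    values = array(typecode)
    validity = bytearray((len(rows) + 7) // 8)  # one bit per row
    for i, row in enumerate(rows):
        value = row[name]
        if value is None:
            values.append(0)  # placeholder slot, masked out by the bitmap
        else:
            values.append(value)
            validity[i // 8] |= 1 << (i % 8)  # mark row i as non-null
    return values, validity


def column_sum(values, validity):
    """Sum only the slots whose validity bit is set."""
    return sum(
        v for i, v in enumerate(values) if (validity[i // 8] >> (i % 8)) & 1
    )


ids, ids_valid = to_columnar(rows, "id", "q")          # 64-bit integers
scores, scores_valid = to_columnar(rows, "score", "d")  # 64-bit floats
total = column_sum(scores, scores_valid)
```

Because each column is stored contiguously, an aggregate such as `column_sum` scans a single buffer rather than visiting every record, which is the access pattern that columnar formats are designed to make cheap.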
Arrow has been used in diverse domains, including analytics, genomics, and cloud computing.
Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in memory. The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage. The Arrow and Parquet projects include libraries for reading and writing data between the two formats.
Apache Arrow was announced by The Apache Software Foundation on February 17, 2016, with development led by a coalition of developers from other open source data analytics projects. The initial codebase and Java library were seeded by code from Apache Drill.