Apache Arrow (Python)¶
Arrow is a columnar in-memory analytics layer designed to accelerate big data. It houses a set of canonical in-memory representations of flat and hierarchical data along with multiple language bindings for structure manipulation. It also provides IPC and common algorithm implementations.
This is the documentation of the Python API of Apache Arrow. For more details on the format and other language bindings see the main page for Arrow. Here we will only detail the usage of the Python API for Arrow and the leaf libraries that add additional functionality such as reading Apache Parquet files into Arrow structures.
Install PyArrow¶
Conda¶
To install the latest version of PyArrow from conda-forge using conda:
conda install -c conda-forge pyarrow
Pip¶
Install the latest version from PyPI:
pip install pyarrow
Note
Currently there are only binary artifacts available for Linux and macOS. Otherwise this will only pull the Python sources and assumes an existing installation of the C++ part of Arrow. To retrieve the binary artifacts, you’ll need a recent pip version that supports features like the manylinux1 tag.
Installing from source¶
See Development.
Development¶
Developing with conda¶
Linux and macOS¶
System Requirements¶
On macOS, any modern Xcode (6.4 or higher; the current version is 8.3.1) is sufficient.
On Linux, for this guide, we recommend using gcc 4.8 or 4.9, or clang 3.7 or higher. You can check your version by running:
$ gcc --version
On Ubuntu 16.04 and higher, you can obtain gcc 4.9 with:
$ sudo apt-get install g++-4.9
Finally, set gcc 4.9 as the active compiler using:
export CC=gcc-4.9
export CXX=g++-4.9
Environment Setup and Build¶
First, let’s create a conda environment with all the C++ build and Python dependencies from conda-forge:
conda create -y -q -n pyarrow-dev \
python=3.6 numpy six setuptools cython pandas pytest \
cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \
brotli jemalloc -c conda-forge
source activate pyarrow-dev
Now, let’s clone the Arrow and Parquet git repositories:
mkdir repos
cd repos
git clone https://github.com/apache/arrow.git
git clone https://github.com/apache/parquet-cpp.git
You should now see
$ ls -l
total 8
drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 arrow/
drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 parquet-cpp/
We need to set some environment variables to let Arrow’s build system know about our build toolchain:
export ARROW_BUILD_TYPE=release
export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX
export PARQUET_BUILD_TOOLCHAIN=$CONDA_PREFIX
export ARROW_HOME=$CONDA_PREFIX
export PARQUET_HOME=$CONDA_PREFIX
Now build and install the Arrow C++ libraries:
mkdir arrow/cpp/build
pushd arrow/cpp/build
cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
-DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
-DARROW_PYTHON=on \
-DARROW_BUILD_TESTS=OFF \
..
make -j4
make install
popd
Now, optionally build and install the Apache Parquet libraries in your toolchain:
mkdir parquet-cpp/build
pushd parquet-cpp/build
cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \
-DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \
-DPARQUET_BUILD_BENCHMARKS=off \
-DPARQUET_BUILD_EXECUTABLES=off \
-DPARQUET_ZLIB_VENDORED=off \
-DPARQUET_BUILD_TESTS=off \
..
make -j4
make install
popd
Now, build pyarrow:
cd arrow/python
python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \
--with-parquet --with-jemalloc --inplace
If you did not build parquet-cpp, you can omit --with-parquet.
You should be able to run the unit tests with:
$ py.test pyarrow
================================ test session starts ====================
platform linux -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0
rootdir: /home/wesm/arrow-clone/python, inifile:
collected 198 items
pyarrow/tests/test_array.py ...........
pyarrow/tests/test_convert_builtin.py .....................
pyarrow/tests/test_convert_pandas.py .............................
pyarrow/tests/test_feather.py ..........................
pyarrow/tests/test_hdfs.py sssssssssssssss
pyarrow/tests/test_io.py ..................
pyarrow/tests/test_ipc.py ........
pyarrow/tests/test_jemalloc.py ss
pyarrow/tests/test_parquet.py ....................
pyarrow/tests/test_scalars.py ..........
pyarrow/tests/test_schema.py .........
pyarrow/tests/test_table.py .............
pyarrow/tests/test_tensor.py ................
====================== 181 passed, 17 skipped in 0.98 seconds ===========
Windows¶
First, make sure you can build the C++ library.
Now, we need to build and install the C++ libraries someplace.
mkdir cpp\build
cd cpp\build
set ARROW_HOME=C:\thirdparty
cmake -G "Visual Studio 14 2015 Win64" ^
-DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^
-DCMAKE_BUILD_TYPE=Release ^
-DARROW_BUILD_TESTS=off ^
-DARROW_PYTHON=on ..
cmake --build . --target INSTALL --config Release
cd ..\..
After that, we must put the install directory’s bin path in our %PATH%:
set PATH=%ARROW_HOME%\bin;%PATH%
Now, we can build pyarrow:
cd python
python setup.py build_ext --inplace
Running C++ unit tests with Python¶
Getting python-test.exe to run is a bit tricky because your %PYTHONPATH% must be configured given the active conda environment:
set CONDA_ENV=C:\Users\wesm\Miniconda\envs\arrow-test
set PYTHONPATH=%CONDA_ENV%\Lib;%CONDA_ENV%\Lib\site-packages;%CONDA_ENV%\python35.zip;%CONDA_ENV%\DLLs;%CONDA_ENV%
Now python-test.exe or simply ctest (to run all tests) should work.
Pandas Interface¶
To interface with Pandas, PyArrow provides various conversion routines to consume Pandas structures and convert back to them.
DataFrames¶
The equivalent of a Pandas DataFrame in Arrow is a pyarrow.table.Table. Both consist of a set of named columns of equal length. While Pandas only supports flat columns, a Table also provides nested columns, so it can represent more data than a DataFrame; a full conversion back to a DataFrame is therefore not always possible.
Conversion from a Table to a DataFrame is done by calling pyarrow.table.Table.to_pandas(). The inverse is achieved by using pyarrow.Table.from_pandas(). This conversion routine provides the convenience parameter timestamps_to_ms. Although Arrow supports timestamps of different resolutions, Pandas only supports nanosecond timestamps and most other systems (e.g. Parquet) only work with millisecond timestamps. This parameter can be used to perform the time conversion already during the Pandas-to-Arrow conversion.
import pyarrow as pa
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
# Convert from Pandas to Arrow
table = pa.Table.from_pandas(df)
# Convert back to Pandas
df_new = table.to_pandas()
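For example, to downcast nanosecond timestamps to millisecond resolution during the Pandas-to-Arrow conversion, pass the timestamps_to_ms parameter described above (a minimal sketch; the exact parameter name and behavior may differ between pyarrow versions):
import pandas as pd
import pyarrow as pa

# DataFrame with a nanosecond-resolution timestamp column
df_ts = pd.DataFrame({"when": pd.to_datetime(["2017-04-01", "2017-04-02"])})

# Convert the timestamps to millisecond resolution while building the Table
table_ms = pa.Table.from_pandas(df_ts, timestamps_to_ms=True)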
Series¶
In Arrow, the most similar structure to a Pandas Series is an Array. It is a vector that contains data of the same type stored in linear memory. You can convert a Pandas Series to an Arrow Array using pyarrow.array.from_pandas_series(). As Arrow Arrays are always nullable, you can supply an optional mask using the mask parameter to mark all null entries.
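A minimal sketch of converting a Series with missing values, assuming the from_pandas_series() entry point and mask parameter described above (in later pyarrow versions the equivalent call is pa.array(s, mask=mask)):
import numpy as np
import pandas as pd
import pyarrow as pa

s = pd.Series([1.0, 2.0, 3.0])
# True marks entries that should become null in the resulting Arrow Array
mask = np.array([False, True, False])
# Function as documented above; exact location may vary between versions
arr = pa.array.from_pandas_series(s, mask=mask)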
Type differences¶
With the current design of Pandas and Arrow, it is not possible to convert all column types unmodified. One of the main issues here is that Pandas has no support for nullable columns of arbitrary type. Also, datetime64 is currently fixed to nanosecond resolution. On the other hand, Arrow may still be missing support for some types.
Pandas -> Arrow Conversion¶
| Source Type (Pandas) | Destination Type (Arrow) |
|---|---|
| bool | BOOL |
| (u)int{8,16,32,64} | (U)INT{8,16,32,64} |
| float32 | FLOAT |
| float64 | DOUBLE |
| str / unicode | STRING |
| pd.Categorical | DICTIONARY |
| pd.Timestamp | TIMESTAMP(unit=ns) |
| datetime.date | DATE |
Arrow -> Pandas Conversion¶
| Source Type (Arrow) | Destination Type (Pandas) |
|---|---|
| BOOL | bool |
| BOOL with nulls | object (with values True, False, None) |
| (U)INT{8,16,32,64} | (u)int{8,16,32,64} |
| (U)INT{8,16,32,64} with nulls | float64 |
| FLOAT | float32 |
| DOUBLE | float64 |
| STRING | str |
| DICTIONARY | pd.Categorical |
| TIMESTAMP(unit=*) | pd.Timestamp (np.datetime64[ns]) |
| DATE | pd.Timestamp (np.datetime64[ns]) |
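For instance, an integer Arrow array that contains nulls comes back to Pandas as float64. A minimal sketch, assuming Array.to_pandas() mirrors the Table.to_pandas() call used earlier:
import pyarrow as pa

# An INT64 Arrow array containing a null value
arr = pa.array([1, 2, None])

# With nulls present, the integer data is converted to float64 (see table above)
result = arr.to_pandas()
print(result.dtype)  # float64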
File interfaces and Memory Maps¶
PyArrow features a number of file-like interfaces.
Hadoop File System (HDFS)¶
PyArrow comes with bindings to a C++-based interface to the Hadoop File System. You connect like so:
import pyarrow as pa
hdfs = pa.HdfsClient(host, port, user=user, kerb_ticket=ticket_cache_path)
By default, pyarrow.HdfsClient
uses libhdfs, a JNI-based interface to the
Java Hadoop client. This library is loaded at runtime (rather than at link
/ library load time, since the library may not be in your LD_LIBRARY_PATH), and
relies on some environment variables.
- HADOOP_HOME: the root of your installed Hadoop distribution. Often has lib/native/libhdfs.so.
- JAVA_HOME: the location of your Java SDK installation.
- ARROW_LIBHDFS_DIR (optional): explicit location of libhdfs.so if it is installed somewhere other than $HADOOP_HOME/lib/native.
- CLASSPATH: must contain the Hadoop jars. You can set these using:
export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob`
You can also use libhdfs3, a third-party C++ library for HDFS from Pivotal Labs:
hdfs3 = pa.HdfsClient(host, port, user=user, kerb_ticket=ticket_cache_path,
driver='libhdfs3')
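Once connected, the client exposes a file-system-like interface. A minimal sketch, assuming the ls() and open() methods and hypothetical host, port, and paths:
import pyarrow as pa

hdfs = pa.HdfsClient('localhost', 8020, user='hdfs')

# List the contents of a directory (hypothetical path)
print(hdfs.ls('/user/hdfs'))

# Read a file; open() is assumed to return a file-like object that can be
# used as a context manager
with hdfs.open('/user/hdfs/example.dat') as f:
    data = f.read()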
Reading/Writing Parquet files¶
If you have built pyarrow with Parquet support, i.e. parquet-cpp was found during the build, you can read files in the Parquet format to/from Arrow memory structures. The Parquet support code is located in the pyarrow.parquet module and your package needs to be built with the --with-parquet flag for build_ext.
Reading Parquet¶
To read a Parquet file into Arrow memory, you can use the following code snippet. It will read the whole Parquet file into memory as a Table.
import pyarrow.parquet as pq
table = pq.read_table('<filename>')
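read_table() also accepts a columns argument (see the API reference below) to read only a subset of the columns; for example, with a hypothetical file and column names:
import pyarrow.parquet as pq

# Read only the listed columns instead of the whole file
table = pq.read_table('<filename>', columns=['col1', 'col2'])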
As DataFrames stored as Parquet are often stored in multiple files, a convenience method read_multiple_files() is provided.
If you already have the Parquet data available in memory or get it via a non-file source, you can utilize pyarrow.io.BufferReader to read it from memory. As input to the BufferReader you can either supply a Python bytes object or a pyarrow.io.Buffer.
import pyarrow.io as paio
import pyarrow.parquet as pq
buf = ... # either bytes or paio.Buffer
reader = paio.BufferReader(buf)
table = pq.read_table(reader)
Writing Parquet¶
Given an instance of pyarrow.table.Table, the simplest way to persist it to Parquet is by using the pyarrow.parquet.write_table() method.
import pyarrow as pa
import pyarrow.parquet as pq
table = ...  # an existing pyarrow.Table instance
pq.write_table(table, '<filename>')
By default this will write the Table as a single RowGroup using DICTIONARY encoding. To increase the potential for parallelism when a query engine processes the Parquet file, set the chunk_size to a fraction of the total number of rows. If you also want to compress the columns, you can select a compression method using the compression argument. Typically, GZIP is the choice if you want to minimize size and SNAPPY for performance.
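A minimal sketch combining these options; '<filename>' is a placeholder, the values are only illustrative, and the row group size argument appears as chunk_size in the prose above but as row_group_size in the API reference below:
import pyarrow.parquet as pq

table = ...  # an existing pyarrow.Table instance

# Write multiple row groups of 10000 rows each and compress with Snappy
pq.write_table(table, '<filename>',
               row_group_size=10000,
               compression='snappy')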
Instead of writing to a file, you can also write to Python bytes by utilizing a pyarrow.io.InMemoryOutputStream():
import pyarrow.io as paio
import pyarrow.parquet as pq
table = ...
output = paio.InMemoryOutputStream()
pq.write_table(table, output)
pybytes = output.get_result().to_pybytes()
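The resulting bytes can be fed straight back into the BufferReader shown in the reading section:
import pyarrow.io as paio
import pyarrow.parquet as pq

# pybytes holds the Parquet data produced above
reader = paio.BufferReader(pybytes)
table_again = pq.read_table(reader)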
API Reference¶
Type and Schema Factory Functions¶
| Function | Description |
|---|---|
| null() | |
| bool_() | |
| int8() | |
| int16() | |
| int32() | |
| int64() | |
| uint8() | |
| uint16() | |
| uint32() | |
| uint64() | |
| float16() | |
| float32() | |
| float64() | |
| time32(unit_str) | |
| time64(unit_str) | |
| timestamp(unit_str[, tz]) | |
| date32() | |
| date64() | |
| binary(int length=-1) | Binary (PyBytes-like) type |
| string() | UTF8 string |
| decimal(int precision, int scale=0) -> DataType | |
| list_(DataType value_type) | |
| struct(fields) | |
| dictionary(DataType index_type, Array dictionary) | Dictionary (categorical, or simply encoded) type |
| field(name, DataType type, ...) | Create a pyarrow.Field instance |
| schema(fields) | Construct pyarrow.Schema from collection of fields |
| from_numpy_dtype(dtype) | Convert NumPy dtype to pyarrow.DataType |
Scalar Value Types¶
Array Types and Constructors¶
| Function / Class | Description |
|---|---|
| array(sequence, DataType type=None, ...) | Create pyarrow.Array instance from a Python sequence |
| Array | |
| BooleanArray | |
| DictionaryArray | |
| FloatingPointArray | |
| IntegerArray | |
| Int8Array | |
| Int16Array | |
| Int32Array | |
| Int64Array | |
| NullArray | |
| NumericArray | |
| UInt8Array | |
| UInt16Array | |
| UInt32Array | |
| UInt64Array | |
| BinaryArray | |
| FixedSizeBinaryArray | |
| StringArray | |
| Time32Array | |
| Time64Array | |
| Date32Array | |
| Date64Array | |
| TimestampArray | |
| DecimalArray | |
| ListArray | |
Tables and Record Batches¶
| Class / Function | Description |
|---|---|
| ChunkedArray | Array backed via one or more memory chunks. |
| Column | Named vector of elements of equal type. |
| RecordBatch | Batch of rows of columns of equal length |
| Table | A collection of top-level named, equal length Arrow arrays. |
| get_record_batch_size(RecordBatch batch) | Return total size of serialized RecordBatch including metadata and padding |
Tensor type and Functions¶
| Class / Function | Description |
|---|---|
| Tensor | |
| write_tensor(Tensor tensor, NativeFile dest) | Write pyarrow.Tensor to pyarrow.NativeFile object at its current position |
| get_tensor_size(Tensor tensor) | Return total size of serialized Tensor including metadata and padding |
| read_tensor(NativeFile source) | Read pyarrow.Tensor from pyarrow.NativeFile object from current position. |
Interprocess Communication and Messaging¶
| Class | Description |
|---|---|
| FileReader(source[, footer_offset]) | Class for reading Arrow record batch data from the Arrow binary file format |
| FileWriter(sink, schema) | Writer to create the Arrow binary file format |
| StreamReader(source) | Reader for the Arrow streaming binary format |
| StreamWriter(sink, schema) | Writer for the Arrow streaming binary format |
Memory Pools¶
| Class / Function | Description |
|---|---|
| MemoryPool | |
| default_memory_pool() | |
| jemalloc_memory_pool() | Returns a jemalloc-based memory allocator, which can be passed to |
| total_allocated_bytes() | |
| set_memory_pool(MemoryPool pool) | |
Type Classes¶
| Class | Description |
|---|---|
| DataType | |
| DecimalType | |
| DictionaryType | |
| FixedSizeBinaryType | |
| Time32Type | |
| Time64Type | |
| TimestampType | |
| Field | Represents a named field, with a data type, nullability, and optional |
| Schema | |
Apache Parquet¶
| Class / Function | Description |
|---|---|
| ParquetDataset(path_or_paths[, filesystem, ...]) | Encapsulates details of reading a complete Parquet dataset possibly |
| ParquetFile(source[, metadata]) | Reader interface for a single Parquet file |
| read_table(source[, columns, nthreads, metadata]) | Read a Table from Parquet format |
| write_metadata(schema, where[, version]) | Write metadata-only Parquet file from schema |
| write_table(table, where[, row_group_size, ...]) | Write a Table to Parquet format |
Getting Involved¶
Right now the primary audience for Apache Arrow is developers of data systems; most people will use Apache Arrow indirectly through systems that use it for internal data handling and for interoperating with other Arrow-enabled systems.
Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we’d be happy to have you involved:
- Join the mailing list: send an email to dev-subscribe@arrow.apache.org. Share your ideas and use cases for the project or read through the Archive.
- Follow our activity on JIRA
- Learn the Format / Specification
- Chat with us on Slack
jemalloc MemoryPool¶
Arrow’s default MemoryPool uses the system’s allocator through the POSIX APIs. Although this already provides aligned allocation, the POSIX interface doesn’t support aligned reallocation. The default reallocation strategy is to allocate a new region, copy over the old data and free the previous region. Using jemalloc, we can simply extend the existing memory allocation to the requested size. While this may still be linear in the size of the allocated memory, it is orders of magnitude faster as only the page mapping in the kernel is touched, not the actual data.
The jemalloc allocator is not enabled by default to allow the use of the system allocator and/or other allocators like tcmalloc. You can either explicitly make it the default allocator or pass it only to single operations.
import pyarrow as pa

jemalloc_pool = pa.jemalloc_memory_pool()

# Explicitly use jemalloc for allocating memory for an Arrow Table object
array = pa.Array.from_pylist([1, 2, 3], memory_pool=jemalloc_pool)

# Set the global pool
pa.set_memory_pool(jemalloc_pool)

# This operation has no explicit MemoryPool specified and will thus
# also use jemalloc for its allocations.
array = pa.Array.from_pylist([1, 2, 3])