ExArrow.DataFrame
(ex_arrow v0.6.0)
Ergonomic conversion between Explorer DataFrames and Arrow data.
This module provides the from_arrow/1 and to_arrow/1 API requested by
users who think in DataFrame-first terms. It delegates to
ExArrow.Explorer for the actual IPC round-trip.
Requires {:explorer, "~> 0.11"} in your mix.exs dependencies. When
Explorer is absent, every function returns {:error, "Explorer is not available..."}.
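Every function returns a tagged tuple instead of raising, including when Explorer is absent, so calling code has to branch on the result. The following is a minimal sketch of one way to do that; `ArrowResult` and `unwrap!/1` are hypothetical names used for illustration and are not part of ex_arrow.

```elixir
defmodule ArrowResult do
  # Hypothetical helper, not part of ex_arrow: turns the documented
  # {:ok, value} | {:error, message} shape into a value or an exception.
  def unwrap!({:ok, value}), do: value

  # The error message is a string, such as the "Explorer is not available..."
  # message returned when the optional dependency is missing.
  def unwrap!({:error, message}) when is_binary(message) do
    raise RuntimeError, message
  end
end
```

With this in place, `df |> ExArrow.DataFrame.to_arrow() |> ArrowResult.unwrap!()` would return the batch or raise with the library's message. Whether raising is appropriate depends on the caller; a plain `case` expression works just as well when the error should be handled locally.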
Arrow hierarchy
An Arrow RecordBatch is a collection of column arrays with a shared schema
and row count. A Stream is a sequence of batches. Both carry the same
columnar data; from_arrow/1 accepts either.
Examples
# DataFrame → Arrow
df = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
{:ok, batch} = ExArrow.DataFrame.to_arrow(df)
# Arrow → DataFrame
{:ok, df2} = ExArrow.DataFrame.from_arrow(batch)
Explorer.DataFrame.n_rows(df2) #=> 3
Summary
Functions
from_arrow/1
Convert Arrow data to an Explorer.DataFrame.
to_arrow/1
Convert an Explorer.DataFrame to a single ExArrow.RecordBatch.
Functions
@spec from_arrow(ExArrow.RecordBatch.t() | ExArrow.Stream.t()) :: {:ok, Explorer.DataFrame.t()} | {:error, String.t()}
Convert Arrow data to an Explorer.DataFrame.
Accepts either an ExArrow.RecordBatch or an ExArrow.Stream. Streams
are consumed entirely (all batches collected) before conversion.
Returns {:ok, dataframe} or {:error, message}.
Examples
{:ok, stream} = ExArrow.IPC.Reader.from_file("/data/events.arrow")
{:ok, df} = ExArrow.DataFrame.from_arrow(stream)
Explorer.DataFrame.n_rows(df) #=> 1_000_000
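# Round trip starting from a DataFrame with columns x and y
df = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
{:ok, batch} = ExArrow.DataFrame.to_arrow(df)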
{:ok, df2} = ExArrow.DataFrame.from_arrow(batch)
Explorer.DataFrame.names(df2) #=> ["x", "y"]
@spec to_arrow(Explorer.DataFrame.t()) :: {:ok, ExArrow.RecordBatch.t()} | {:error, String.t()}
Convert an Explorer.DataFrame to a single ExArrow.RecordBatch.
The dataframe is serialised to Arrow IPC via
Explorer.DataFrame.dump_ipc_stream!/1, then read back as native Arrow
batches. Explorer may split a large dataframe into multiple IPC batches;
these are concatenated into a single RecordBatch so that the full row
count and all values are preserved.
Returns {:ok, batch} or {:error, message}.
Examples
df = Explorer.DataFrame.new(x: [1, 2, 3], y: ["a", "b", "c"])
{:ok, batch} = ExArrow.DataFrame.to_arrow(df)
ExArrow.RecordBatch.num_rows(batch) #=> 3
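The two functions compose into a round trip, and because both return tagged tuples, a `with` expression passes the first error through unchanged. The sketch below assumes a hypothetical `RoundTrip` module that is not part of ex_arrow. The conversion functions are taken as arguments, defaulting to the ones documented here, purely so the control flow can be exercised without Explorer installed.

```elixir
defmodule RoundTrip do
  # Hypothetical helper, not part of ex_arrow. The defaults point at the
  # documented functions; passing other functions is an illustration-only
  # convenience for trying the control flow in isolation.
  def run(
        df,
        to_arrow \\ &ExArrow.DataFrame.to_arrow/1,
        from_arrow \\ &ExArrow.DataFrame.from_arrow/1
      ) do
    # Each step must match {:ok, _}; the first {:error, message} is
    # returned as-is because there is no `else` clause.
    with {:ok, batch} <- to_arrow.(df),
         {:ok, df2} <- from_arrow.(batch) do
      {:ok, df2}
    end
  end
end
```

Called as `RoundTrip.run(df)`, this would return `{:ok, df2}` on success or the `{:error, message}` tuple from whichever step failed first.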