ExArrow.Batch (ex_arrow v0.7.0)

Lightweight ExArrow.RecordBatch transformations.

This module provides a small, Arrow-native set of column and row operations that preserve the underlying native batch handle. It is not a dataframe implementation and not a replacement for Explorer — use Explorer for analytics and ExArrow.Batch for in-flight pipeline transformations where keeping data in Arrow memory matters.

Every function returns either {:ok, batch} / {:ok, schema} or {:error, message}. Column data is never converted to row maps.
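
Because every call returns an ok/error tuple, transformations chain naturally and can stop at the first failure. The following is a minimal sketch of that pattern, not part of ExArrow: BatchPipelineSketch, run/2, and slim_steps/0 are hypothetical names, and the steps call only functions documented on this page.

```elixir
defmodule BatchPipelineSketch do
  @moduledoc false

  # Runs each step on the previous result and stops at the first {:error, _}.
  # A step is any 1-arity function returning {:ok, value} or {:error, message}.
  def run(initial, steps) when is_list(steps) do
    Enum.reduce_while(steps, {:ok, initial}, fn step, {:ok, acc} ->
      case step.(acc) do
        {:ok, next} -> {:cont, {:ok, next}}
        {:error, _message} = error -> {:halt, error}
      end
    end)
  end

  # The moduledoc example expressed as a list of steps (hypothetical helper).
  def slim_steps do
    [
      &ExArrow.Batch.select(&1, ["user_id", "score"]),
      &ExArrow.Batch.rename(&1, %{"user_id" => "id"}),
      &ExArrow.Batch.take(&1, 10)
    ]
  end
end
```

With a batch in hand, BatchPipelineSketch.run(batch, BatchPipelineSketch.slim_steps()) would return either {:ok, batch} from the last step or the first {:error, message} encountered. A plain with expression works equally well when the steps are fixed at compile time.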

Column-wise implementation

  • select/2, drop/2, and filter/2 delegate to the native compute kernels (ExArrow.Compute) and work for all Arrow types ExArrow supports.
  • take/2 builds a boolean mask and filters through ExArrow.Compute.filter/2, so it also works for all Arrow types ExArrow supports.
  • rename/2 rebuilds a batch from raw column buffers. The buffer-extraction NIF supports the fixed-width numeric and boolean types (s8 through s64, u8 through u64, f32, f64, bool). Columns of other types (utf8, binary, timestamps, dates, durations) return {:error, "unsupported column type..."}. For workloads that need to rename string columns, round-trip through ExArrow.Explorer (which exposes its own rename) or project to a numeric-only batch first.

Schema and metadata preservation

Field order, field types, and nullability are preserved across select/2, drop/2, and filter/2. rename/2 preserves types and order, changing only the field names supplied in the mapping. Arrow schema metadata is not currently exposed by the NIF layer and is therefore not modified.

Examples

{:ok, stream} = ExArrow.Stream.from_parquet("/data/events.parquet")
batch = ExArrow.Stream.next(stream)

{:ok, slim}    = ExArrow.Batch.select(batch, ["user_id", "score"])
{:ok, renamed} = ExArrow.Batch.rename(slim, %{"user_id" => "id"})
{:ok, first10} = ExArrow.Batch.take(renamed, 10)

Summary

Functions

drop(batch, columns)

Return a batch with columns removed.

filter(batch, predicate)

Filter rows of batch using the first (boolean) column of predicate_batch.

rename(batch, mapping)

Rename one or more columns of batch.

schema(batch)

Return the ExArrow.Schema handle for batch.

select(batch, columns)

Project a subset of columns from batch by name.

take(batch, n)

Select a subset of rows.

Functions

drop(batch, columns)

@spec drop(ExArrow.RecordBatch.t(), [String.t()]) ::
  {:ok, ExArrow.RecordBatch.t()} | {:error, String.t()}

Return a batch with columns removed.

All remaining columns keep their original relative order. Delegates to ExArrow.Compute.project/2 over the complement of columns.

Returns {:ok, batch} or {:error, message}.

Examples

{:ok, rest} = ExArrow.Batch.drop(batch, ["internal_flag", "debug"])

# Dropping an unknown column is an error.
{:error, _} = ExArrow.Batch.drop(batch, ["no_such_column"])
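
The "complement of columns" idea can be modelled on plain lists of names. This is a sketch of the described behaviour only, not ExArrow's implementation: DropSketch and remaining_columns/2 are hypothetical, and the error text is illustrative rather than the library's actual message.

```elixir
defmodule DropSketch do
  @moduledoc false

  # Plain-list model of drop/2: returns the names that remain, in their
  # original relative order, or an error if any requested name is unknown.
  # The error text is an assumption, not the library's real message.
  def remaining_columns(all_columns, to_drop) do
    case Enum.find(to_drop, &(&1 not in all_columns)) do
      nil -> {:ok, Enum.reject(all_columns, &(&1 in to_drop))}
      unknown -> {:error, "unknown column: " <> unknown}
    end
  end
end
```

The order of the names in the drop list does not matter here; only the batch's own column order determines the result.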

filter(batch, predicate)

@spec filter(ExArrow.RecordBatch.t(), ExArrow.RecordBatch.t()) ::
  {:ok, ExArrow.RecordBatch.t()} | {:error, String.t()}

Filter rows of batch using the first (boolean) column of predicate_batch.

Delegates directly to ExArrow.Compute.filter/2. Rows where the predicate is true are kept; rows where it is false or null are dropped. The predicate's first column must be a boolean Arrow array with the same row count as batch.

Returns {:ok, filtered_batch} or {:error, message}.

Example

{:ok, mask}     = ExArrow.Compute.project(batch, ["is_active"])
{:ok, filtered} = ExArrow.Batch.filter(batch, mask)
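
The keep/drop rule for true, false, and null can be illustrated on plain Elixir lists, with nil standing in for an Arrow null. This is a sketch under those assumptions: FilterSketch and apply_mask/2 are hypothetical, operate on lists rather than Arrow buffers, and use an illustrative error message.

```elixir
defmodule FilterSketch do
  @moduledoc false

  # Plain-list model of the predicate semantics: keep a row only when its
  # mask value is exactly true; false and nil (null) both drop the row.
  def apply_mask(rows, mask) when length(rows) == length(mask) do
    kept = for {row, true} <- Enum.zip(rows, mask), do: row
    {:ok, kept}
  end

  # Mismatched lengths are rejected, mirroring the row-count requirement.
  def apply_mask(_rows, _mask), do: {:error, "mask length must match row count"}
end
```

For example, FilterSketch.apply_mask([:a, :b, :c, :d], [true, false, nil, true]) keeps only :a and :d.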

rename(batch, mapping)

@spec rename(
  ExArrow.RecordBatch.t(),
  %{required(String.t()) => String.t()} | keyword()
) ::
  {:ok, ExArrow.RecordBatch.t()} | {:error, String.t()}

Rename one or more columns of batch.

mapping is a map of %{old_name => new_name} or a keyword list of {atom, new_name} where the atom is the old column name. Columns not present in mapping keep their names. Column order and types are preserved.

Rebuilds the batch from raw column buffers, so only the buffer-extractable fixed-width numeric and boolean types are supported (see the moduledoc). Returns {:ok, batch} or {:error, message}.

Examples

{:ok, renamed} = ExArrow.Batch.rename(batch, %{"user_id" => "id"})

{:ok, renamed} = ExArrow.Batch.rename(batch, %{"a" => "x", "b" => "y"})

# Unknown source column is an error.
{:error, _} = ExArrow.Batch.rename(batch, %{"missing" => "x"})
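
The two accepted mapping shapes can be reduced to one before any names are changed. The sketch below models that on a plain list of field names; RenameSketch, normalize/1, and rename_fields/2 are hypothetical, and the error text is an assumption rather than the library's message.

```elixir
defmodule RenameSketch do
  @moduledoc false

  # Accepts a map of %{old => new} or a keyword list of {old_atom, new};
  # atom keys are converted to strings so both forms look the same.
  def normalize(mapping) when is_map(mapping) or is_list(mapping) do
    Map.new(mapping, fn {old, new} -> {to_string(old), new} end)
  end

  # Plain-list model of rename/2 applied to field names only.
  # Names missing from the mapping are kept; order never changes.
  def rename_fields(names, mapping) do
    mapping = normalize(mapping)

    case Enum.find(Map.keys(mapping), &(&1 not in names)) do
      nil -> {:ok, Enum.map(names, &Map.get(mapping, &1, &1))}
      missing -> {:error, "unknown column: " <> missing}
    end
  end
end
```

Under this model, RenameSketch.rename_fields(["a", "b", "c"], a: "x", b: "y") and the equivalent string-keyed map produce the same result.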

schema(batch)

Return the ExArrow.Schema handle for batch.

Equivalent to ExArrow.RecordBatch.schema/1. Provided here so callers can stay within the ExArrow.Batch API for inspection.

select(batch, columns)

@spec select(ExArrow.RecordBatch.t(), [String.t()]) ::
  {:ok, ExArrow.RecordBatch.t()} | {:error, String.t()}

Project a subset of columns from batch by name.

Columns appear in the result in the order given. Delegates to ExArrow.Compute.project/2 and works for every Arrow type ExArrow supports.

Returns {:ok, projected_batch} or {:error, message}.

Examples

{:ok, two} = ExArrow.Batch.select(batch, ["user_id", "score"])

{:ok, reordered} = ExArrow.Batch.select(batch, ["score", "user_id"])

{:error, "column 'missing' not found"} = ExArrow.Batch.select(batch, ["missing"])

take(batch, n)

@spec take(ExArrow.RecordBatch.t(), non_neg_integer() | [non_neg_integer()]) ::
  {:ok, ExArrow.RecordBatch.t()} | {:error, String.t()}

Select a subset of rows.

The second argument may be:

  • an integer n: keep the first n rows (n >= 0). If n exceeds the batch row count, the batch is returned unchanged.
  • a list of zero-based row indices: keep the rows at the given positions. Rows are returned in their original row order, because the boolean-mask filter kernel preserves row order and does not reorder by the index list. Out-of-range indices are an error.

Implemented by building a boolean mask and filtering through ExArrow.Compute.filter/2, so it works for every Arrow type ExArrow supports. Returns {:ok, batch} or {:error, message}.

Examples

{:ok, first10} = ExArrow.Batch.take(batch, 10)

{:ok, picked} = ExArrow.Batch.take(batch, [0, 2, 4])
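
The mask-building step described above can be sketched in plain Elixir. This is an illustration of the technique, not ExArrow's code: TakeSketch and build_mask/2 are hypothetical, the row count is passed in explicitly, and the error text is an assumption. The stepped range syntax requires Elixir 1.12 or later.

```elixir
defmodule TakeSketch do
  @moduledoc false

  # Integer form: true for the first n rows. When n exceeds row_count,
  # every position is true, so the whole batch would be kept.
  def build_mask(row_count, n) when is_integer(n) and n >= 0 do
    {:ok, for(i <- 0..(row_count - 1)//1, do: i < n)}
  end

  # Index-list form: true at each listed zero-based position.
  # Any index outside 0..row_count-1 is rejected.
  def build_mask(row_count, indices) when is_list(indices) do
    if Enum.all?(indices, &(is_integer(&1) and &1 >= 0 and &1 < row_count)) do
      wanted = MapSet.new(indices)
      {:ok, for(i <- 0..(row_count - 1)//1, do: MapSet.member?(wanted, i))}
    else
      {:error, "row index out of range"}
    end
  end
end
```

Because the mask is positional, the index lists [4, 0, 2] and [0, 2, 4] produce the same mask, which is why rows come back in their original order regardless of how the indices are listed.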