Arrow Pipelines Overview
View SourceExArrow v0.7.0 introduces Arrow-native pipelines for the BEAM. This guide is a short orientation to the pipeline modules and how they fit together; the subsequent guides cover each in depth.
The principle
The central architectural principle is:
Operate on Arrow RecordBatch values. Not list(map()). Not Explorer.DataFrame. Not Nx.Tensor.
Explorer and Nx remain downstream consumers. ExArrow is the Arrow layer.
The modules
| Module | Role | Guide |
|---|---|---|
ExArrow.Stream | Open a stream of RecordBatch values from any source | 06 Arrow streams |
ExArrow.Batch | Lightweight column/row transforms on a batch | (see module docs) |
ExArrow.Pipeline | Lazy, composable DSL for transforming and sinking | 10 Pipeline patterns |
ExArrow.Flow | Parallel batch processing via Flow | 07 Arrow and Flow |
ExArrow.GenStage | Demand-driven producers with backpressure | 08 Arrow and GenStage |
ExArrow.Broadway | Ingestion pipelines (Kafka, SQS, ...) | 09 Arrow and Broadway |
ExArrow.Sink.* | Standard destinations (Parquet, Flight, DataFrame, Nx) | (see module docs) |
ExArrow.Telemetry | Events for every transport and pipeline operation | (see module docs) |
Quick example
ExArrow.Stream.from_parquet("/data/events.parquet")
|> ExArrow.Pipeline.map_batches(fn batch ->
{:ok, slim} = ExArrow.Batch.select(batch, ["id", "score"])
slim
end)
|> ExArrow.Pipeline.write_parquet("/data/slim.parquet")Where to start
- Read 06 Arrow streams to understand the streaming
abstraction and the
from_*/1constructors. - Read 10 Pipeline patterns for the
ExArrow.PipelineDSL and sinks. - Read the integration guides (07–09) when you need parallelism, backpressure, or ingestion.