Arrow Pipelines Overview

View Source

ExArrow v0.7.0 introduces Arrow-native pipelines for the BEAM. This guide is a short orientation to the pipeline modules and how they fit together; the subsequent guides cover each in depth.

The principle

The central architectural principle is:

Operate on Arrow RecordBatch values. Not list(map()). Not Explorer.DataFrame. Not Nx.Tensor.

Explorer and Nx remain downstream consumers. ExArrow is the Arrow layer.

The modules

ModuleRoleGuide
ExArrow.StreamOpen a stream of RecordBatch values from any source06 Arrow streams
ExArrow.BatchLightweight column/row transforms on a batch(see module docs)
ExArrow.PipelineLazy, composable DSL for transforming and sinking10 Pipeline patterns
ExArrow.FlowParallel batch processing via Flow07 Arrow and Flow
ExArrow.GenStageDemand-driven producers with backpressure08 Arrow and GenStage
ExArrow.BroadwayIngestion pipelines (Kafka, SQS, ...)09 Arrow and Broadway
ExArrow.Sink.*Standard destinations (Parquet, Flight, DataFrame, Nx)(see module docs)
ExArrow.TelemetryEvents for every transport and pipeline operation(see module docs)

Quick example

ExArrow.Stream.from_parquet("/data/events.parquet")
|> ExArrow.Pipeline.map_batches(fn batch ->
  {:ok, slim} = ExArrow.Batch.select(batch, ["id", "score"])
  slim
end)
|> ExArrow.Pipeline.write_parquet("/data/slim.parquet")

Where to start

  1. Read 06 Arrow streams to understand the streaming abstraction and the from_*/1 constructors.
  2. Read 10 Pipeline patterns for the ExArrow.Pipeline DSL and sinks.
  3. Read the integration guides (07–09) when you need parallelism, backpressure, or ingestion.