ExZarr Gap Analysis: v1.1.0

Comparison Matrix

| Capability | Python Zarr | TensorStore | TileDB | ExZarr v1.0 | ExZarr v1.1 |
|---|---|---|---|---|---|
| Lazy chunk iteration | Yes | Yes | Yes | Partial (chunk_stream) | Yes (stream_chunks) |
| Parallel reads | Thread pool | Async I/O | Thread pool | Task.async_stream | Task + Flow + GenStage |
| Write streaming | Yes | Yes | Yes | No | Yes (write_stream) |
| Backpressure | Limited | Yes | Yes | No | GenStage + Flow |
| Fault-tolerant pipelines | External (Dask) | Limited | Yes | No | Broadway integration |
| Cloud storage | fsspec | Native GCS/S3 | S3 | S3/GCS/Azure | Hardened patterns doc |
| Nx/torch integration | Via Dask/Xarray | TF/JAX | Limited | ExZarr.Nx | Streaming tensors |
| Distributed processing | Dask | Limited | Yes | No | Stretch goal |
| Telemetry | Limited | Yes | Yes | Documented only | Implemented |

BEAM Unique Advantages

  1. Process isolation: Each chunk read runs in an isolated process. A failing chunk decode does not crash the entire pipeline when :on_error is configured.

  2. Preemptive scheduling: CPU-bound decompression scales across cores without a GIL, unlike Python threading.

  3. Supervision: Broadway pipelines restart failed stages without losing the entire array processing job.

  4. Cheap concurrency: Spawning 100+ concurrent chunk reads is practical on the BEAM because its processes are lightweight, whereas an OS thread pool of the same size would be expensive.

  5. Hot code upgrades: Long-running streaming pipelines can be upgraded in place on production nodes.
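
Points 1 and 4 can be illustrated with the Elixir standard library alone. The sketch below is not ExZarr's implementation: the module `ChunkIsolationSketch` and the function `decode_chunk/1` are hypothetical stand-ins for a real chunk decoder, and `Task.Supervisor.async_stream_nolink/4` is simply one way to get per-chunk isolation. It shows a failing decode being reported as a value instead of crashing the caller.

```elixir
defmodule ChunkIsolationSketch do
  # Hypothetical stand-in for a chunk decoder; not part of ExZarr's API.
  # Every chunk whose index leaves remainder 3 when divided by 4 is
  # treated as corrupt and raises.
  def decode_chunk(index) when rem(index, 4) == 3 do
    raise "corrupt chunk #{index}"
  end

  def decode_chunk(index), do: {index, index * index}

  # Decodes each chunk in its own unlinked process. A crash in one task
  # surfaces as {:exit, reason} in the result stream instead of taking
  # down the caller or the other tasks.
  def read_all(indices, max_concurrency \\ 8) do
    {:ok, sup} = Task.Supervisor.start_link()

    result =
      sup
      |> Task.Supervisor.async_stream_nolink(indices, &decode_chunk/1,
        max_concurrency: max_concurrency,
        ordered: true
      )
      # ordered: true keeps results aligned with the input indices.
      |> Enum.zip(indices)
      |> Enum.reduce(%{ok: [], failed: []}, fn
        {{:ok, value}, _index}, acc -> %{acc | ok: [value | acc.ok]}
        {{:exit, _reason}, index}, acc -> %{acc | failed: [index | acc.failed]}
      end)

    %{ok: Enum.reverse(result.ok), failed: Enum.reverse(result.failed)}
  end
end
```

Calling `ChunkIsolationSketch.read_all(0..7)` collects the six successful decodes under `:ok` and the indices of the two failed chunks under `:failed`, so the caller can decide whether to retry, skip, or abort.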

Remaining Gaps (Post v1.1.0)

  • Multi-node distributed chunk processing (Horde/Swarm)
  • Explorer direct streaming integration
  • Async codec pipeline (overlap I/O and decode)
  • Zarr v3 async store interface alignment

Opportunities

  • Elixir/Phoenix data pipelines that need Zarr streaming without leaving the BEAM
  • Livebook-first education for the scientific Elixir community
  • Cloud-native deployments on Fly.io/Gigalixir with Broadway pipelines