ExDataSketch (ExDataSketch v0.9.0)

Production-grade streaming data sketching algorithms for Elixir.

ExDataSketch provides probabilistic data structures for approximate counting and frequency estimation on streaming data. All sketch state is stored as Elixir-owned binaries, enabling straightforward serialization, distribution, and persistence.

Sketch Families

Architecture

  • Binary state: All sketch state is held in canonical Elixir binaries, with no opaque NIF resources.
  • Backend system: Computation is dispatched through backend modules. ExDataSketch.Backend.Pure (pure Elixir) is always available. ExDataSketch.Backend.Rust is optional and adds NIF acceleration; precompiled binaries are provided.
  • Serialization: ExDataSketch-native format (EXSK) for all sketches, plus Apache DataSketches interop for Theta CompactSketch.
  • Deterministic hashing: ExDataSketch.Hash provides a stable 64-bit hash interface for reproducible results.

Quick Example

# Cardinality estimation with HLL
sketch = ExDataSketch.HLL.new(p: 14)
sketch = ExDataSketch.update_many(sketch, ["alice", "bob", "alice"])
ExDataSketch.HLL.estimate(sketch)

# Frequency estimation with CMS
sketch = ExDataSketch.CMS.new(width: 2048, depth: 5)
sketch = ExDataSketch.update_many(sketch, ["page_a", "page_a", "page_b"])
ExDataSketch.CMS.estimate(sketch, "page_a")

Integration Patterns

Each sketch module provides convenience functions for ecosystem integration:

  • from_enumerable/2 — build a sketch from any Enumerable in one call.
  • merge_many/1 — merge a collection of sketches (e.g. from parallel workers).
  • reducer/1 — returns a 2-arity function for use with Enum.reduce/3, Flow, etc.
  • merger/1 — returns a 2-arity function for merging sketches in reduce operations.
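
The shape of the functions returned by reducer/1 and merger/1 can be illustrated without the library. The sketch below is a stand-in under stated assumptions: ReducerDemo and its functions are hypothetical, MapSet plays the role of a sketch, and the zero-arity helpers do not mirror the real arguments of reducer/1 or merger/1, which this page does not specify.

```elixir
defmodule ReducerDemo do
  # Hypothetical stand-ins, not ExDataSketch functions. MapSet plays the sketch.

  # Enum.reduce/3 calls its function as fun.(element, accumulator),
  # so a reducer takes the item first and the running summary second.
  def reducer, do: fn item, set -> MapSet.put(set, item) end

  # A merger takes two summaries and returns their combination.
  def merger, do: fn left, right -> MapSet.union(left, right) end

  # Reduce each batch to a local summary, then fold the summaries together.
  def run(batches) do
    batches
    |> Enum.map(fn batch -> Enum.reduce(batch, MapSet.new(), reducer()) end)
    |> Enum.reduce(MapSet.new(), merger())
  end
end

# Two batches sharing "b" produce a summary with three distinct items.
ReducerDemo.run([["a", "b"], ["b", "c"]]) |> MapSet.size()
```

Because both functions are plain 2-arity closures, the same values can be handed to any API that accepts a reduce function.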

Stream Integration

ExDataSketch.Stream provides terminal stream consumers that build sketches from lazy enumerables without buffering the entire input:

1..100_000
|> Stream.map(&to_string/1)
|> ExDataSketch.Stream.hll(p: 14)
|> ExDataSketch.HLL.estimate()

For partition-local reduction:

1..1_000_000
|> ExDataSketch.Stream.reduce_partitioned(ExDataSketch.HLL, partitions: 8, p: 14)
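
This page does not say how reduce_partitioned splits its input, so the following is only a conceptual sketch of partition-local reduction in general. PartitionDemo is hypothetical, MapSet stands in for a mergeable sketch, and chunking by a fixed size is an assumption made for illustration.

```elixir
defmodule PartitionDemo do
  # Hypothetical helper, not part of ExDataSketch. MapSet stands in for a
  # mergeable sketch: MapSet.put/2 plays "update", MapSet.union/2 plays "merge".
  def distinct_count(enumerable, chunk_size) do
    enumerable
    # Split the input lazily into partitions of at most chunk_size items.
    |> Stream.chunk_every(chunk_size)
    # Reduce each partition to its own local summary in a separate task.
    |> Task.async_stream(fn chunk ->
      Enum.reduce(chunk, MapSet.new(), fn item, acc -> MapSet.put(acc, item) end)
    end)
    # Merge the local summaries into a single result.
    |> Enum.reduce(MapSet.new(), fn {:ok, local}, acc -> MapSet.union(acc, local) end)
    |> MapSet.size()
  end
end

PartitionDemo.distinct_count(1..1000, 100)
```

Only the merged summary and the in-flight partitions are held in memory, which is the property that makes this pattern attractive for large inputs.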

Collectable

All mergeable sketches implement the Collectable protocol, enabling Enum.into/2 usage:

sketch = Enum.into(1..1000, ExDataSketch.HLL.new(p: 14))
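
For readers unfamiliar with the protocol, here is a minimal sketch of what a Collectable implementation looks like. CollectDemo.Distinct is a hypothetical stand-in struct, not an ExDataSketch module, and the library's own implementations may differ.

```elixir
defmodule CollectDemo.Distinct do
  # Hypothetical stand-in for a sketch struct; not part of ExDataSketch.
  defstruct seen: MapSet.new()

  def new, do: %__MODULE__{}
end

defimpl Collectable, for: CollectDemo.Distinct do
  # into/1 returns the initial accumulator and a collector function.
  # Enum.into/2 calls the collector with {:cont, item} for each element,
  # then with :done on success (or :halt if collection is aborted).
  def into(sketch) do
    collector = fn
      acc, {:cont, item} -> %{acc | seen: MapSet.put(acc.seen, item)}
      acc, :done -> acc
      _acc, :halt -> :ok
    end

    {sketch, collector}
  end
end

# Collect items into the stand-in struct, just like any other Collectable.
result = Enum.into(["a", "b", "a"], CollectDemo.Distinct.new())
MapSet.size(result.seen)
```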

See the Integration Guide for examples with Flow, Broadway, Explorer, Nx, and other ecosystem libraries.

See the Quick Start guide for more examples.

Functions

update_many(sketch, items)

Updates a sketch with multiple items in a single pass.

Delegates to the appropriate sketch module's update_many/2 based on the struct type.

Examples

iex> sketch = ExDataSketch.HLL.new(p: 10)
iex> sketch = ExDataSketch.update_many(sketch, ["a", "b"])
iex> ExDataSketch.HLL.estimate(sketch) > 0.0
true
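
Delegation based on the struct type can be sketched in plain Elixir by pattern matching on the struct's module. The modules below (Demo, Demo.Counter, Demo.Distinct) are hypothetical and exist only for illustration; the library's actual dispatch code is not shown on this page and may be written differently.

```elixir
# Hypothetical stand-in sketch modules; these are NOT part of ExDataSketch.
defmodule Demo.Counter do
  defstruct count: 0

  def new, do: %__MODULE__{}

  # Counts every item seen, duplicates included.
  def update_many(%__MODULE__{} = sketch, items) do
    %{sketch | count: sketch.count + Enum.count(items)}
  end
end

defmodule Demo.Distinct do
  defstruct seen: MapSet.new()

  def new, do: %__MODULE__{}

  # Tracks the exact set of distinct items.
  def update_many(%__MODULE__{} = sketch, items) do
    %{sketch | seen: MapSet.union(sketch.seen, MapSet.new(items))}
  end
end

defmodule Demo do
  # `%module{}` binds the struct's module name, so the call is routed to
  # whichever module defined the struct.
  def update_many(%module{} = sketch, items), do: module.update_many(sketch, items)
end

# The same entry point works for both struct types.
Demo.update_many(Demo.Counter.new(), ["a", "b", "a"]).count
Demo.update_many(Demo.Distinct.new(), ["a", "b", "a"]).seen |> MapSet.size()
```

A single top-level function like this lets callers update any sketch without knowing which module produced it.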