This document is the consolidated architectural overview of
ex_data_sketch v0.8.0 ("Deterministic Foundations"). It is intended
for contributors who want to understand the whole-system design after
five phases of focused work, and for downstream maintainers who need to
reason about the library's guarantees.
For per-phase detail see the individual plan documents (linked at the
bottom). For risk tracking see plans/0.8.0-risks.md. For the user-
facing change log see CHANGELOG.md.
The thesis behind v0.8.0
ex_data_sketch v0.1.0 through v0.7.1 grew the surface area of the
library: HLL, then CMS, then Theta, then KLL, then DDSketch,
FrequentItems, Bloom, Cuckoo, Quotient, CQF, XorFilter, IBLT, REQ,
MisraGries, ULL — 15 sketch families in 7 releases. Each release added
a sketch; none invested in the substrate.
v0.8.0 inverts the priority. It adds zero new sketch families and instead establishes the production substrate they all share:
- Deterministic hashing. A documented, validated, byte-stable hash layer used by every sketch.
- Binary stability. A versioned, corruption-checked wire format that every sketch agrees on.
- Hot-path performance. In-Rust hashing for every cardinality sketch, across both XXH3 and Murmur3.
- Installation reliability. Precompiled NIFs across an 8-target matrix.
- Probabilistic correctness validation. Property-based locking of the algebraic and probabilistic guarantees every sketch claims.
The thesis: a library can ship 15 algorithms and still be hobby-grade if its substrate is undocumented. v0.8.0 turns ex_data_sketch from "a collection of sketches" into "a probabilistic runtime on the BEAM".
Layered architecture
┌─────────────────────────────────────────────┐
│ User code: Stream.transform / Broadway / │
│ Phoenix / etc. │
└────────────────┬────────────────────────────┘
│
┌────────────────▼────────────────────────────┐
│ Sketch modules (15) │
│ HLL, ULL, KLL, REQ, Theta, CMS, │
│ DDSketch, FrequentItems, MisraGries, │
│ Bloom, Cuckoo, Quotient, CQF, XorFilter, │
│ IBLT, FilterChain │
└────────────────┬────────────────────────────┘
│
┌────────────────▼────────────────────────────┐
│ Binary facade │
│ ExDataSketch.Binary │
│ encode/3, decode/1, peek_version/1, │
│ build_payload/2, metadata_from_opts/3 │
└─────────┬─────────────────────┬─────────────┘
│ │
┌───────────▼────────────┐ ┌────▼────────────────┐
│ Binary v2 (default) │ │ Codec v1 (legacy) │
│ Header / Validator/CRC │ │ ExDataSketch.Codec │
└───────────┬────────────┘ └─────────────────────┘
│
┌───────────▼────────────┐
│ Hash.Metadata block │
│ (algorithm, seed, │
│ family, family_ver, │
│ backend, extension) │
└───────────┬────────────┘
│
┌───────────▼────────────────────────────────────┐
│ Hash registry and validators │
│ ExDataSketch.Hash │
│ default_algorithm/0, algorithm_info/1, │
│ resolve_strategy/1 │
│ ExDataSketch.Hash.Validation │
│ validate_options!/3, validate_metadata!/3 │
└───────────┬────────────────────────────────────┘
│
┌───────────▼─────────────────────────────────────┐
│ Hash implementations │
│ ExDataSketch.Hash.XXH3 (NIF only) │
│ ExDataSketch.Hash.Murmur3 (Pure + NIF) │
│ ExDataSketch.Hash.* (phash2 + mix64 fallback) │
└───────────┬─────────────────────────────────────┘
│
┌───────────▼─────────────────────────────────────┐
│ Backends │
│ ExDataSketch.Backend.Pure (always present) │
│ ExDataSketch.Backend.Rust (NIF dispatcher) │
└───────────┬─────────────────────────────────────┘
│
┌───────────▼─────────────────────────────────────┐
│ Rust NIF (native/ex_data_sketch_nif/src/) │
│ hash.rs — xxhash3, murmur3 │
│ crc.rs — crc32c │
│ hll.rs — register update + raw_h dispatch │
│ ull.rs — register update + raw_h dispatch │
│ theta.rs — BTreeSet ops + raw_h dispatch │
│ cms.rs — counter update + raw_h dispatch │
│ {bloom, cuckoo, quotient, cqf, xor, iblt, fi, │
│ kll, ddsketch}.rs — sketch-specific │
└─────────────────────────────────────────────────┘
Layer responsibilities (rules)
- Sketch modules own the algorithm. They know their state layout, their parameter semantics, and their estimator math. They do NOT know about wire format or hash algorithm details.
- The Binary facade owns the wire format. Sketch modules call Binary.encode/3 and Binary.decode/1. They never write magic bytes, version bytes, or CRCs themselves.
- The Hash registry owns hash identity. Sketch modules call Hash.resolve_strategy/1 at construction and pass the resulting :hash_strategy opt through their operations. They never compare hash strategies themselves; that is Hash.Validation's job.
- Backends own execution. Sketch modules dispatch operations through Backend.Pure or Backend.Rust. They do not invoke NIFs directly.
- The Rust NIF owns hot loops. Everything inside a NIF is stateless and operates on input bytes. Sketch state is BEAM-owned; the NIF receives a binary, computes a new binary, and returns it.
These rules are the architectural invariants that v0.8.0 establishes. They are enforced by structure (not lint) — violating them produces visible code-review smell.
Phase-by-phase contribution
Phase 1 — Deterministic Hashing
What it added. Four new submodules (Hash.XXH3, Hash.Murmur3,
Hash.Metadata, Hash.Validation) plus the registry API on
ExDataSketch.Hash and the byte-identical pure-Elixir/Rust Murmur3
parity.
Why it matters. Every probabilistic merge depends on hash identity. Without a documented, validated, versioned hash layer, the library cannot promise that a sketch produced on Node A will merge correctly with a sketch produced on Node B six months later. Phase 1 is the foundation everything else stands on.
Key invariant. Two sketches may be merged only when their hash algorithm, hash seed, sketch family, and sketch family version agree. Backend (Pure vs Rust) is intentionally NOT part of this equivalence — the parity tests guarantee both backends produce byte-identical output.
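The merge rule can be pictured as a field-by-field comparison that deliberately skips the backend. The following is a minimal sketch only: the struct, the field types, and the names HashMetadata, MergeError, and merge_compatible are hypothetical illustrations, not the API of Hash.Validation.

```rust
/// Hypothetical in-memory view of the hash metadata. Field names follow
/// the architecture diagram; the concrete types are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HashMetadata {
    pub algorithm: u8,
    pub seed: u64,
    pub family: u8,
    pub family_version: u8,
    /// Recorded for diagnostics; intentionally ignored by `merge_compatible`.
    pub backend: u8,
}

/// Which part of the hash identity disagreed.
#[derive(Debug, PartialEq, Eq)]
pub enum MergeError {
    Algorithm,
    Seed,
    Family,
    FamilyVersion,
}

/// Returns Ok(()) only when algorithm, seed, family, and family version
/// all agree. The backend field is never inspected.
pub fn merge_compatible(a: &HashMetadata, b: &HashMetadata) -> Result<(), MergeError> {
    if a.algorithm != b.algorithm {
        return Err(MergeError::Algorithm);
    }
    if a.seed != b.seed {
        return Err(MergeError::Seed);
    }
    if a.family != b.family {
        return Err(MergeError::Family);
    }
    if a.family_version != b.family_version {
        return Err(MergeError::FamilyVersion);
    }
    Ok(())
}
```

Under these assumptions, two sketches that differ only in backend pass the check, while a seed mismatch is rejected before any register or counter is touched.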
Phase 2 — Binary Stability & Corruption Detection
What it added. The ExDataSketch.Binary facade and three
submodules (Binary.Header, Binary.Validator, Binary.CRC). EXSK
v2 wire format. v1 reader backward compatibility. Regenerated golden
vectors with test/vectors_v1/ preserved as a regression corpus.
60+ new tests including a 200-mutation bit-flip fuzz suite.
Why it matters. Pre-v0.8 EXSK had no checksum. A bit-flip in persisted state would silently corrupt the next merge or estimate. v2 closes that gap with CRC32C (Castagnoli, hardware-accelerated). It also embeds the Phase 1 hash metadata into every frame so the merge invariant from Phase 1 has somewhere to live on the wire.
Key invariant. Every persisted sketch carries its own hash identity. The serializer cannot lie about it; the deserializer cannot ignore it.
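To make the corruption check concrete, here is a minimal sketch of CRC32C in its slow bitwise form, plus a pair of helpers that append and verify a trailer. The helper names seal and open and the little-endian trailer placement are assumptions for illustration; they are not the EXSK v2 layout, and a production implementation would normally use hardware instructions or a table-driven routine.

```rust
/// Bitwise CRC32C (Castagnoli), using the reflected polynomial 0x82F63B78.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Hypothetical framing: payload followed by its CRC32C, little-endian.
pub fn seal(payload: &[u8]) -> Vec<u8> {
    let mut frame = payload.to_vec();
    frame.extend_from_slice(&crc32c(payload).to_le_bytes());
    frame
}

/// Returns the payload only when the stored checksum matches.
pub fn open(frame: &[u8]) -> Option<&[u8]> {
    if frame.len() < 4 {
        return None;
    }
    let (payload, t) = frame.split_at(frame.len() - 4);
    let stored = u32::from_le_bytes([t[0], t[1], t[2], t[3]]);
    if stored == crc32c(payload) {
        Some(payload)
    } else {
        None
    }
}
```

The standard check value for the ASCII string "123456789" is 0xE3069283, which is a convenient way to confirm that an implementation matches other CRC32C libraries. Any single flipped bit in a sealed frame, whether in the payload or in the trailer, makes open return None.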
Phase 3 — HLL Hot-Path Optimization
What it added. 8 new Rust NIFs (_raw_h_nif family) for HLL,
ULL, Theta, CMS. Each accepts an algorithm: u8 parameter (XXH3 or
Murmur3) and dispatches at the per-NIF-call boundary, not per-item.
Hash.resolve_strategy/1 opens the :hash_strategy opt to user
selection. bench/hll_hot_path_bench.exs measures all four paths
across three batch sizes.
Why it matters. v0.7.1 introduced in-Rust hashing for XXH3.
Phase 3 generalizes to Murmur3 so the new :murmur3 strategy from
Phase 1 doesn't fall off the fast path. Net effect: ~15x throughput
over Pure Elixir, ~8% slowdown for Murmur3 vs XXH3 (intrinsic to
the algorithm).
Key invariant. Sketch state is BEAM-owned. The NIF receives a binary, returns a binary. Per-item allocation crosses zero Elixir references in steady state.
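The shape of "dispatch once per call, then loop" can be sketched as follows. Everything here is illustrative: hll_update_many is a hypothetical name, the two hash functions are simple stand-ins rather than XXH3 or Murmur3, and the wire byte 2 for Murmur3 is an assumption (the text only implies that algo=1 selects XXH3). The function takes the register bytes as input and returns a new byte vector, mirroring the binary-in, binary-out rule.

```rust
/// Signature shared by every hash the batch loop can call.
type HashFn = fn(&[u8], u64) -> u64;

/// 64-bit finalizer (the Murmur3 fmix64 mixing step), used to spread bits.
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 33;
    x = x.wrapping_mul(0xff51_afd7_ed55_8ccd);
    x ^= x >> 33;
    x = x.wrapping_mul(0xc4ce_b9fe_1a85_ec53);
    x ^ (x >> 33)
}

/// Stand-in hash A: seeded FNV-1a plus a finalizer. NOT XXH3.
fn standin_hash_a(data: &[u8], seed: u64) -> u64 {
    let mut h = 0xcbf2_9ce4_8422_2325u64 ^ seed;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    mix64(h)
}

/// Stand-in hash B: a different mixing of the same input. NOT Murmur3.
fn standin_hash_b(data: &[u8], seed: u64) -> u64 {
    mix64(standin_hash_a(data, !seed))
}

/// Resolve the algorithm byte to a function pointer. Byte 2 is an assumption.
fn select_hash(algorithm: u8) -> Option<HashFn> {
    match algorithm {
        1 => Some(standin_hash_a as HashFn),
        2 => Some(standin_hash_b as HashFn),
        _ => None,
    }
}

/// Update HLL registers (one byte each) for a batch of items.
/// The input slice is never mutated; a fresh vector is returned.
pub fn hll_update_many(
    registers: &[u8],
    p: u32,
    algorithm: u8,
    seed: u64,
    items: &[&[u8]],
) -> Option<Vec<u8>> {
    if !(4..=18).contains(&p) || registers.len() != 1usize << p {
        return None;
    }
    let hash = select_hash(algorithm)?; // dispatched once, outside the loop
    let mut out = registers.to_vec();
    for item in items {
        let h = hash(item, seed);
        let idx = (h >> (64 - p)) as usize; // top p bits pick the register
        let rest = h << p; // remaining 64 - p bits
        // Rank is the position of the first 1 bit, capped when `rest` is zero.
        let rank = (rest.leading_zeros().min(64 - p) + 1) as u8;
        if rank > out[idx] {
            out[idx] = rank;
        }
    }
    Some(out)
}
```

Because registers only ever take the maximum rank seen, feeding the same batch twice leaves the output unchanged, which is the same idempotence that makes HLL merges safe.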
Phase 4 — Precompiled NIF Validation
What it added. Two Windows targets (x86_64-pc-windows-msvc,
aarch64-pc-windows-msvc) bringing the matrix to 8 × 2 = 16
artifacts per release. mix test.nif_on / mix test.nif_off aliases
for local NIF mode flips. 18 NIF-availability contract tests.
Why it matters. Adoption friction. Pre-v0.8 Windows users had to install Rust to build the NIF. Phase 4 removes that step. Apple Silicon, Linux glibc/musl, Linux ARM64, Windows x86_64, and Windows ARM64 all install via Hex with no toolchain.
Key invariant. EX_DATA_SKETCH_SKIP_NIF=true (NIF stubs only)
and EX_DATA_SKETCH_BUILD=true (source build) are independent
escape hatches. The default precompiled-download path is the user-
visible recommended install.
Phase 5 — Property-Based Validation
What it added. test/property_guarantees_test.exs with 14 new
StreamData properties locking the algebraic and probabilistic
guarantees enumerated in the release brief (prompts/0.8.0_prompt.md):
HLL/ULL monotonicity and RSE bounds, KLL/REQ rank consistency and
quantile inversion, CMS overestimation-only, Bloom/XOR/Cuckoo
no-false-negative, and Binary v2 bit-flip corruption never silently
propagating.
Why it matters. Example-based tests check one trajectory. The substrate built in Phases 1-4 is only worth the investment if its guarantees hold across the whole distribution of inputs, not just hand-picked examples. Property-based testing closes that gap.
Key invariant. Coverage ≥ 70% (current: 92.7%). Property suite runs in < 1 s on top of the example suite. Every property carries prose justification of its tolerance / slack.
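As an illustration of what one such property looks like, the sketch below checks the CMS overestimation-only guarantee with a hand-rolled generator loop. It is not the project's StreamData suite: the CountMin struct, its hashing scheme, and the function cms_never_underestimates are hypothetical, and a SplitMix64 generator stands in for a property-testing framework.

```rust
use std::collections::HashMap;

/// SplitMix64: a small deterministic generator, reused here as a row hash.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Toy Count-Min Sketch: `depth` rows of `width` counters.
struct CountMin {
    width: usize,
    depth: usize,
    counters: Vec<u64>,
}

impl CountMin {
    fn new(width: usize, depth: usize) -> Self {
        CountMin { width, depth, counters: vec![0; width * depth] }
    }

    /// Counter index for `item` in `row`, using a per-row salt.
    fn slot(&self, row: usize, item: u64) -> usize {
        let mut s = item ^ (row as u64 + 1).wrapping_mul(0xA24B_AED4_963E_E407);
        row * self.width + (splitmix64(&mut s) as usize % self.width)
    }

    fn add(&mut self, item: u64) {
        for row in 0..self.depth {
            let i = self.slot(row, item);
            self.counters[i] += 1;
        }
    }

    /// Minimum over rows: collisions can only inflate it.
    fn estimate(&self, item: u64) -> u64 {
        (0..self.depth)
            .map(|row| self.counters[self.slot(row, item)])
            .min()
            .unwrap_or(0)
    }
}

/// Property: for random streams, the estimate is never below the true count.
pub fn cms_never_underestimates(seed: u64, trials: usize) -> bool {
    let mut rng = seed;
    for _ in 0..trials {
        let mut cms = CountMin::new(32, 4);
        let mut exact: HashMap<u64, u64> = HashMap::new();
        let n = splitmix64(&mut rng) % 500;
        for _ in 0..n {
            let item = splitmix64(&mut rng) % 200;
            cms.add(item);
            *exact.entry(item).or_insert(0) += 1;
        }
        if exact.iter().any(|(&item, &count)| cms.estimate(item) < count) {
            return false;
        }
    }
    true
}
```

The property needs no tolerance because it is one-sided and exact. Properties with statistical slack, such as the RSE bounds, are the ones that need the prose justification mentioned above.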
Numbers that matter
Test suite
| Metric | v0.7.1 | v0.8.0 | Delta |
|---|---|---|---|
| Tests (NIF on) | 1,186 | 1,317 | +131 |
| Tests (NIF off) | ~1,000 | 1,088 | +88 |
| Doctests | 169 | 202 | +33 |
| Properties (NIF on) | 152 | 171 | +19 |
| Properties (NIF off) | 116 | 128 | +12 |
| Line coverage | 88% | 92.7% | +4.7 pp |
| mix credo --strict issues | 0 | 0 | 0 |
Performance
| Path (HLL p=14) | v0.7.1 throughput | v0.8.0 throughput | Notes |
|---|---|---|---|
| Pure phash2 | ~1.7 M items/sec | ~1.7 M items/sec | unchanged |
| Pure xxhash3 | ~1.9 M items/sec | ~1.9 M items/sec | unchanged |
| Rust raw XXH3 | ~30 M items/sec | ~30 M items/sec | unchanged |
| Rust raw_h Murmur3 | — | ~28 M items/sec | new in v0.8.0 |
Code surface
| Asset | v0.7.1 | v0.8.0 | Delta |
|---|---|---|---|
| Elixir modules in lib/ | ~30 | ~40 | +10 |
| Rust NIF functions | 47 | 58 | +11 |
| Plans / design docs | 47 | 62 | +15 |
| Precompiled NIF targets | 6 | 8 | +2 (Windows MSVC) |
| Artifacts per release | 12 | 16 | +4 |
Wire format
| Sketch (empty) | v1 size | v2 size | Overhead |
|---|---|---|---|
| HLL p=4 | 18 bytes | 50 bytes | +32 (2.8x) |
| HLL p=14 | 16,398 bytes | 16,430 bytes | +32 (0.2%) |
| KLL k=200 (populated) | ~3-5 KB | ~3-5 KB | +32 (~1%) |
Design decisions worth re-reading
A handful of decisions shaped the v0.8.0 architecture and deserve explicit documentation here so future maintainers don't relitigate them.
Why two-layer versioning (frame + metadata block)?
The EXSK v2 frame has its own serialization_version byte. The
embedded Hash.Metadata block has its own block_version byte. Two
independent axes.
The rationale is fine-grained evolution:
- Adding a new hash algorithm: claim a new wire byte. No version bump on either axis.
- Adding a new metadata field: append to the metadata block's extension trailer. v1 readers preserve unknown extension bytes verbatim on re-encode. No version bump.
- Restructuring the metadata block layout itself: bump block_version. Frame version stays at 2.
- Restructuring the frame layout (e.g., changing the magic, the CRC algorithm, the header field order): bump serialization_version to 3.
Single-axis versioning would force every change to either be backward-incompatible or to crowd into a single ever-larger version namespace. Two axes give us 16+ years of additive evolution before either runs out of room.
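One way to picture the two axes is as a reader that checks them independently and fails at the narrowest possible scope. This is a sketch under stated assumptions: the current block_version value of 1, the algorithm wire bytes 1 and 2, and the names ReadOutcome and classify are all hypothetical; only the frame version 2 comes from the text.

```rust
/// Frame version this reader understands (EXSK v2, per the text).
pub const FRAME_VERSION: u8 = 2;
/// Metadata block version this reader understands. Value is an assumption.
pub const BLOCK_VERSION: u8 = 1;
/// Algorithm wire bytes this reader knows. Values are assumptions.
pub const KNOWN_ALGORITHMS: [u8; 2] = [1, 2];

#[derive(Debug, PartialEq, Eq)]
pub enum ReadOutcome {
    /// Frame, metadata block, and algorithm are all understood.
    Readable,
    /// Frame and block parse fine; only the hash algorithm is new.
    UnknownAlgorithm(u8),
    /// Frame parses, but the metadata block layout is newer than the reader.
    UnsupportedBlock(u8),
    /// The frame layout itself is not understood.
    UnsupportedFrame(u8),
}

/// Check the outer axis first, then the inner one, then the algorithm byte.
pub fn classify(serialization_version: u8, block_version: u8, algorithm: u8) -> ReadOutcome {
    if serialization_version != FRAME_VERSION {
        return ReadOutcome::UnsupportedFrame(serialization_version);
    }
    if block_version != BLOCK_VERSION {
        return ReadOutcome::UnsupportedBlock(block_version);
    }
    if !KNOWN_ALGORITHMS.contains(&algorithm) {
        return ReadOutcome::UnknownAlgorithm(algorithm);
    }
    ReadOutcome::Readable
}
```

A new hash algorithm surfaces as UnknownAlgorithm without touching either version byte, which matches the first case in the list above. Only a frame restructuring produces UnsupportedFrame.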
Why CRC32C (Castagnoli), not CRC32 (IEEE) or xxhash3-32?
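- CRC32C has dedicated instructions on mainstream CPUs (the SSE 4.2 crc32 instruction on x86-64; the CRC extension on ARMv8, mandatory from ARMv8.1). In software it is in the same speed class as CRC32 IEEE; where those instructions are available it is substantially faster than table-driven code.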
- CRC32C is the standard checksum in iSCSI, Btrfs, SCTP, Snappy frame format. The algorithm is settled; the wire bytes are stable across implementations. Cross-language interop is trivial.
- xxhash3-32 is faster but is NOT a CRC. It has different error-detection guarantees. For storage integrity (the primary use case) the CRC32 family is the right tool.
Full rationale in plans/corruption_detection.md.
Why preserve the legacy _raw_nif family alongside _raw_h_nif?
The v0.7.1 _raw_nif family hardcoded XXH3. Phase 3 added the
generalized _raw_h_nif family with an algorithm byte. The two
families are now functionally equivalent for XXH3.
We preserve the legacy NIFs because:
- They are part of the v0.7.x ABI. Removing them is a breaking change reserved for v1.0.
- They serve as a regression baseline: _raw_nif and _raw_h_nif (algo=1) are property-tested for byte-identical output, locking the equivalence and catching any drift.
A v1.0 deprecation could remove the legacy family.
Why a 16-byte fixed metadata block when sketches differ in family?
The block could vary per sketch family. We chose fixed for three reasons:
- Cross-family validation. The merge validator can compare two metadata blocks without knowing which sketch family they belong to. Useful for generic tooling.
- Forward compatibility. The fixed length means a v0.8 reader can skip a future metadata block of unknown internal structure and still successfully parse the surrounding frame.
- Smallest worst-case. For HLL p=4 the overhead is 32 bytes. At one byte per register, the 32 bytes drop below 1% of the frame from p >= 12 (4,096 registers) upward; at p=14 they are about 0.2%.
Variable-length metadata was rejected as a premature optimization that would have made the binary contract harder to validate.
Why is Backend.default/0 still Pure?
The "no silent default change" guarantee from v0.7.x. Users who
benchmarked the library and chose Pure for some reason should not
have their default flip under them on a minor-version upgrade.
This is locked by test/ex_data_sketch/nif_availability_test.exs
and documented in precompiled_nifs.md.
The trade-off: users adopting the library for the first time may benchmark with the wrong backend. We accept that as the smaller risk.
A future major-version bump (v1.0) is the appropriate moment to revisit.
What v0.8.0 does NOT do
Explicit non-goals to prevent scope creep in maintenance and to document the boundary for v0.9.0 planning:
- No new sketch families. CPC, Tuple, MinHash, VarOpt — all deferred (v0.11+).
- No Apache DataSketches binary interop beyond Theta CompactSketch which already existed. KLL and HLL interop deferred (v0.10).
- No streaming integrations. Broadway, Flow, GenStage — deferred (v0.9).
- No persistence layers. ETS, DETS, CubDB — deferred (v0.9).
- No telemetry / OpenTelemetry. Deferred (v0.9).
- No SIMD intrinsics. The HLL hot path uses scalar Rust; hyperloglog-rs uses SIMD and is 2-3x faster. Deferred (v1.0).
- No 6-bit register packing. HLL stores 1 byte per register, wasting 25%. Deferred (v1.0).
- No raw-NIF path for membership filters. Bloom, Cuckoo, Quotient, CQF, XorFilter, IBLT still hash in Elixir. Deferred (v0.9 candidate).
- No SBOM / SLSA / reproducible builds. Deferred (v1.0).
See also
- prompts/0.8.0_prompt.md: original release brief.
- plans/next_steps.md: strategic roadmap (v0.8.0 through v1.0).
- plans/0.8.0_implementation_plan.md: master tracker.
- hash_strategies.md, plans/hash_binary_contract.md: Phase 1 deep dives.
- plans/binary_contract.md, plans/corruption_detection.md: Phase 2 deep dives.
- hll_performance.md, plans/hll_scheduler_safety.md: Phase 3 deep dives.
- precompiled_nifs.md: Phase 4 deep dive.
- plans/property_testing.md: Phase 5 deep dive.
- Phase 1-5 reviewer checklists: per-phase checklists.
- v0.8.0_migration_notes.md: downstream upgrade guide.
- serialization_compatibility.md: wire-format stability contract.
- roadmap.md: next release preview.
- plans/0.8.0-risks.md: open risks at release time.
- CHANGELOG.md: full v0.8.0 change log.