v0.8.0 Architectural Summary


This document is the consolidated architectural overview of ex_data_sketch v0.8.0 ("Deterministic Foundations"). It is intended for contributors who want to understand the whole-system design after five phases of focused work, and for downstream maintainers who need to reason about the library's guarantees.

For per-phase detail see the individual plan documents (linked at the bottom). For risk tracking see plans/0.8.0-risks.md. For the user-facing change log see CHANGELOG.md.

The thesis behind v0.8.0

ex_data_sketch v0.1.0 through v0.7.1 grew the surface area of the library: HLL, then CMS, then Theta, then KLL, then DDSketch, FrequentItems, Bloom, Cuckoo, Quotient, CQF, XorFilter, IBLT, REQ, MisraGries, ULL — 15 sketch families in 7 releases. Each release added a sketch; none invested in the substrate.

v0.8.0 inverts the priority. It adds zero new sketch families and instead establishes the production substrate they all share:

  1. Deterministic hashing. A documented, validated, byte-stable hash layer used by every sketch.
  2. Binary stability. A versioned, corruption-checked wire format that every sketch agrees on.
  3. Hot-path performance. In-Rust hashing for every cardinality sketch, across both XXH3 and Murmur3.
  4. Installation reliability. Precompiled NIFs across an 8-target matrix.
  5. Probabilistic correctness validation. Property-based locking of the algebraic and probabilistic guarantees every sketch claims.

The thesis: a library can ship 15 algorithms and still be hobby-grade if its substrate is undocumented. v0.8.0 turns ex_data_sketch from "a collection of sketches" into "a probabilistic runtime on the BEAM".

Layered architecture

                
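    +--------------------------------------------------+
    | User code: Stream.transform / Broadway /         |
    | Phoenix / etc.                                   |
    +--------------------------------------------------+
                             |
                             v
    +--------------------------------------------------+
    | Sketch modules (15)                              |
    | HLL, ULL, KLL, REQ, Theta, CMS,                  |
    | DDSketch, FrequentItems, MisraGries,             |
    | Bloom, Cuckoo, Quotient, CQF, XorFilter,         |
    | IBLT, FilterChain                                |
    +--------------------------------------------------+
                             |
                             v
    +--------------------------------------------------+
    | Binary facade                                    |
    | ExDataSketch.Binary                              |
    |   encode/3, decode/1, peek_version/1,            |
    |   build_payload/2, metadata_from_opts/3          |
    +--------------------------------------------------+
                |                           |
                v                           v
    +------------------------+    +--------------------+
    | Binary v2 (default)    |    | Codec v1 (legacy)  |
    | Header / Validator/CRC |    | ExDataSketch.Codec |
    +------------------------+    +--------------------+
                |
                v
    +------------------------+
    | Hash.Metadata block    |
    | (algorithm, seed,      |
    |  family, family_ver,   |
    |  backend, extension)   |
    +------------------------+
                |
                v
    +--------------------------------------------------+
    | Hash registry and validators                     |
    | ExDataSketch.Hash                                |
    |   default_algorithm/0, algorithm_info/1,         |
    |   resolve_strategy/1                             |
    | ExDataSketch.Hash.Validation                     |
    |   validate_options!/3, validate_metadata!/3      |
    +--------------------------------------------------+
                |
                v
    +--------------------------------------------------+
    | Hash implementations                             |
    |   ExDataSketch.Hash.XXH3       (NIF only)        |
    |   ExDataSketch.Hash.Murmur3    (Pure + NIF)      |
    |   ExDataSketch.Hash.* (phash2 + mix64 fallback)  |
    +--------------------------------------------------+
                |
                v
    +--------------------------------------------------+
    | Backends                                         |
    |   ExDataSketch.Backend.Pure   (always present)   |
    |   ExDataSketch.Backend.Rust   (NIF dispatcher)   |
    +--------------------------------------------------+
                |
                v
    +--------------------------------------------------+
    | Rust NIF (native/ex_data_sketch_nif/src/)        |
    |   hash.rs    xxhash3, murmur3                    |
    |   crc.rs     crc32c                              |
    |   hll.rs     register update + raw_h dispatch    |
    |   ull.rs     register update + raw_h dispatch    |
    |   theta.rs   BTreeSet ops    + raw_h dispatch    |
    |   cms.rs     counter update  + raw_h dispatch    |
    |   {bloom, cuckoo, quotient, cqf, xor, iblt, fi,  |
    |    kll, ddsketch}.rs  sketch-specific            |
    +--------------------------------------------------+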
              

Layer responsibilities (rules)

  1. Sketch modules own the algorithm. They know their state layout, their parameter semantics, and their estimator math. They do NOT know about wire format or hash algorithm details.
  2. The Binary facade owns the wire format. Sketch modules call Binary.encode/3 and Binary.decode/1. They never write magic bytes, version bytes, or CRCs themselves.
  3. The Hash registry owns hash identity. Sketch modules call Hash.resolve_strategy/1 at construction and pass the resulting :hash_strategy opt through their operations. They never compare hash strategies themselves; that is Hash.Validation's job.
  4. Backends own execution. Sketch modules dispatch operations through Backend.Pure or Backend.Rust. They do not invoke NIFs directly.
  5. The Rust NIF owns hot loops. Everything inside a NIF is stateless and operates on input bytes. Sketch state is BEAM-owned; the NIF receives a binary, computes a new binary, and returns it.

These rules are the architectural invariants that v0.8.0 establishes. They are enforced by structure rather than by lint: a violation shows up as an obvious smell in code review.

Phase-by-phase contribution

Phase 1 — Deterministic Hashing

What it added. Four new submodules (Hash.XXH3, Hash.Murmur3, Hash.Metadata, Hash.Validation), the registry API on ExDataSketch.Hash, and byte-identical Murmur3 output between the pure-Elixir and Rust implementations.

Why it matters. Every probabilistic merge depends on hash identity. Without a documented, validated, versioned hash layer, the library cannot promise that a sketch produced on Node A will merge correctly with a sketch produced on Node B six months later. Phase 1 is the foundation everything else stands on.

Key invariant. Two sketches may be merged only when their hash algorithm, hash seed, sketch family, and sketch family version agree. Backend (Pure vs Rust) is intentionally NOT part of this equivalence — the parity tests guarantee both backends produce byte-identical output.
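The merge rule above can be sketched as a small compatibility check. This is an illustrative model only: the struct, field types, and error names are assumptions made for the example and are not taken from Hash.Validation. The one point it is meant to show is that the backend field is carried along but never compared.

```rust
/// Hypothetical stand-in for the backend recorded with a sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    Pure,
    Rust,
}

/// Hypothetical mirror of the hash identity a sketch carries.
/// Field names and widths are assumptions for this example.
#[derive(Debug, Clone, Copy)]
pub struct HashIdentity {
    pub algorithm: u8,
    pub seed: u64,
    pub family: u8,
    pub family_version: u8,
    pub backend: Backend,
}

#[derive(Debug, PartialEq, Eq)]
pub enum MergeError {
    AlgorithmMismatch,
    SeedMismatch,
    FamilyMismatch,
    FamilyVersionMismatch,
}

/// Two sketches are mergeable when algorithm, seed, family, and family
/// version agree. The backend is deliberately left out of the comparison.
pub fn check_mergeable(a: &HashIdentity, b: &HashIdentity) -> Result<(), MergeError> {
    if a.algorithm != b.algorithm {
        return Err(MergeError::AlgorithmMismatch);
    }
    if a.seed != b.seed {
        return Err(MergeError::SeedMismatch);
    }
    if a.family != b.family {
        return Err(MergeError::FamilyMismatch);
    }
    if a.family_version != b.family_version {
        return Err(MergeError::FamilyVersionMismatch);
    }
    Ok(())
}
```

Under this model, a sketch built on the Pure backend and one built on the Rust backend pass the check as long as the other four fields match, while a differing seed is rejected.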

Phase 2 — Binary Stability & Corruption Detection

What it added. The ExDataSketch.Binary facade and three submodules (Binary.Header, Binary.Validator, Binary.CRC). EXSK v2 wire format. v1 reader backward compatibility. Regenerated golden vectors with test/vectors_v1/ preserved as a regression corpus. 60+ new tests including a 200-mutation bit-flip fuzz suite.

Why it matters. Pre-v0.8 EXSK had no checksum. A bit-flip in persisted state would silently corrupt the next merge or estimate. v2 closes that gap with CRC32C (Castagnoli, hardware-accelerated). It also embeds the Phase 1 hash metadata into every frame so the merge invariant from Phase 1 has somewhere to live on the wire.

Key invariant. Every persisted sketch carries its own hash identity. The serializer cannot lie about it; the deserializer cannot ignore it.
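To make the corruption check concrete, here is a minimal sketch of CRC32C protecting a payload. The checksum routine is the standard bitwise Castagnoli algorithm; the framing (payload followed by a 4-byte little-endian trailer) and the names seal and open are assumptions for illustration and do not describe the actual EXSK v2 layout.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
/// Slow but dependency-free; production code would use hardware
/// instructions or a table-driven variant.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // mask is all ones when the low bit is set, zero otherwise
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Hypothetical framing: payload followed by a little-endian CRC trailer.
pub fn seal(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out
}

/// Returns the payload only when the trailer matches the recomputed CRC.
pub fn open(frame: &[u8]) -> Option<&[u8]> {
    if frame.len() < 4 {
        return None;
    }
    let (payload, trailer) = frame.split_at(frame.len() - 4);
    let stored = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if crc32c(payload) == stored {
        Some(payload)
    } else {
        None
    }
}
```

Because a CRC detects every single-bit error, flipping any one bit of a sealed frame (payload or trailer) makes open return None instead of handing back corrupted state.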

Phase 3 — HLL Hot-Path Optimization

What it added. 8 new Rust NIFs (_raw_h_nif family) for HLL, ULL, Theta, CMS. Each accepts an algorithm: u8 parameter (XXH3 or Murmur3) and dispatches at the per-NIF-call boundary, not per-item. Hash.resolve_strategy/1 opens the :hash_strategy opt to user selection. bench/hll_hot_path_bench.exs measures all four paths across three batch sizes.

Why it matters. v0.7.1 introduced in-Rust hashing for XXH3. Phase 3 generalizes to Murmur3 so the new :murmur3 strategy from Phase 1 doesn't fall off the fast path. Net effect: ~15x throughput over Pure Elixir, ~8% slowdown for Murmur3 vs XXH3 (intrinsic to the algorithm).

Key invariant. Sketch state is BEAM-owned. The NIF receives a binary and returns a binary. In steady state, per-item work inside the NIF creates no Elixir terms.
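The two ideas in this phase, resolving the hash algorithm once per call and treating state as bytes in, bytes out, can be sketched as follows. This is not the library's NIF code. The hash functions are simple FNV-1a-based stand-ins rather than real XXH3 or Murmur3, the function name is invented, and the value 2 for the Murmur3 byte is an assumption (the document only states that algo=1 corresponds to XXH3).

```rust
pub const ALGO_XXH3: u8 = 1;
pub const ALGO_MURMUR3: u8 = 2; // assumed value, for illustration only

/// splitmix64 finalizer, used here only to spread bits.
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

/// FNV-1a style loop with a caller-supplied basis, then a final mix.
fn fnv1a64(data: &[u8], basis: u64) -> u64 {
    let mut h = basis;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    mix64(h)
}

// Stand-ins only: these are NOT XXH3 or Murmur3.
fn stand_in_xxh3(data: &[u8]) -> u64 {
    fnv1a64(data, 0xcbf2_9ce4_8422_2325)
}
fn stand_in_murmur3(data: &[u8]) -> u64 {
    fnv1a64(data, 0x8422_2325_cbf2_9ce4)
}

/// Takes the current register bytes and returns a new register array.
/// The algorithm byte is matched once, before the per-item loop.
pub fn hll_update_many(
    registers: &[u8],
    p: u32,
    algorithm: u8,
    items: &[&[u8]],
) -> Result<Vec<u8>, String> {
    if !(4..=16).contains(&p) || registers.len() != 1usize << p {
        return Err("register array does not match precision p".to_string());
    }
    let hash: fn(&[u8]) -> u64 = match algorithm {
        ALGO_XXH3 => stand_in_xxh3,
        ALGO_MURMUR3 => stand_in_murmur3,
        other => return Err(format!("unknown algorithm byte {}", other)),
    };
    let mut out = registers.to_vec();
    for item in items {
        let h = hash(item);
        let idx = (h >> (64 - p)) as usize; // top p bits pick the register
        let rest = h << p; // remaining bits feed the rank
        let rank = (rest.leading_zeros().min(64 - p) + 1) as u8;
        if rank > out[idx] {
            out[idx] = rank;
        }
    }
    Ok(out)
}
```

The caller keeps ownership of the input bytes, so the same empty register array can be reused across calls, and feeding the same items twice yields identical output because each register only ever takes the maximum rank.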

Phase 4 — Precompiled NIF Validation

What it added. Two Windows targets (x86_64-pc-windows-msvc, aarch64-pc-windows-msvc) bringing the matrix to 8 × 2 = 16 artifacts per release. mix test.nif_on / mix test.nif_off aliases for local NIF mode flips. 18 NIF-availability contract tests.

Why it matters. Adoption friction. Pre-v0.8 Windows users had to install Rust to build the NIF. Phase 4 removes that step. Apple Silicon, Linux glibc/musl, Linux ARM64, Windows x86_64, and Windows ARM64 all install via Hex with no toolchain.

Key invariant. EX_DATA_SKETCH_SKIP_NIF=true (NIF stubs only) and EX_DATA_SKETCH_BUILD=true (source build) are independent escape hatches. The default precompiled-download path is the user-visible recommended install.

Phase 5 — Property-Based Validation

What it added. test/property_guarantees_test.exs with 14 new StreamData properties locking the algebraic and probabilistic guarantees each sketch claims: HLL/ULL monotonicity and RSE bounds, KLL/REQ rank consistency and quantile inversion, CMS overestimation-only, Bloom/XOR/Cuckoo no-false-negative, and Binary v2 bit-flip corruption never propagating silently.
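As an illustration of what one of these properties checks, the following sketch exercises the "CMS overestimation-only" guarantee on a toy Count-Min sketch. It is not the library's implementation or its StreamData test: the CountMin type, its hashing, and the hand-rolled xorshift generator are all assumptions standing in for the real sketch and the real property-testing framework.

```rust
use std::collections::HashMap;

/// splitmix64 finalizer, used as a cheap per-row hash.
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

/// Toy Count-Min sketch: depth rows of width counters each.
pub struct CountMin {
    width: usize,
    depth: usize,
    counters: Vec<u64>,
}

impl CountMin {
    pub fn new(width: usize, depth: usize) -> Self {
        CountMin { width, depth, counters: vec![0; width * depth] }
    }

    fn slot(&self, row: usize, key: u64) -> usize {
        let h = mix64(key ^ (row as u64).wrapping_mul(0x9e37_79b9_7f4a_7c15));
        row * self.width + (h % self.width as u64) as usize
    }

    pub fn add(&mut self, key: u64) {
        for row in 0..self.depth {
            let s = self.slot(row, key);
            self.counters[s] += 1;
        }
    }

    /// Minimum over rows: collisions can only inflate a counter.
    pub fn estimate(&self, key: u64) -> u64 {
        (0..self.depth)
            .map(|row| self.counters[self.slot(row, key)])
            .min()
            .unwrap_or(0)
    }
}

/// Property: for a pseudo-random stream, no estimate is below the true count.
pub fn never_underestimates(seed: u64, n: usize) -> bool {
    let mut state = seed | 1; // xorshift64 needs a nonzero state
    let mut cms = CountMin::new(64, 4);
    let mut truth: HashMap<u64, u64> = HashMap::new();
    for _ in 0..n {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let key = state % 200;
        cms.add(key);
        *truth.entry(key).or_insert(0) += 1;
    }
    truth.iter().all(|(key, count)| cms.estimate(*key) >= *count)
}
```

Running never_underestimates over many seeds plays the role that generated inputs play in a property test: the assertion is about every stream, not one hand-picked trajectory.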

Why it matters. Example-based tests check one trajectory. The substrate that Phases 1-4 build is only worth the effort if its guarantees hold across the whole distribution of inputs. Property-based testing closes that gap.

Key invariant. Coverage ≥ 70% (current: 92.7%). Property suite runs in < 1 s on top of the example suite. Every property carries prose justification of its tolerance / slack.

Numbers that matter

Test suite

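| Metric | v0.7.1 | v0.8.0 | Delta |
| --- | --- | --- | --- |
| Tests (NIF on) | 1,186 | 1,317 | +131 |
| Tests (NIF off) | ~1,000 | 1,088 | +88 |
| Doctests | 169 | 202 | +33 |
| Properties (NIF on) | 152 | 171 | +19 |
| Properties (NIF off) | 116 | 128 | +12 |
| Line coverage | 88% | 92.7% | +4.7 pp |
| mix credo --strict issues | 0 | 0 | |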

Performance

| Path (HLL p=14) | v0.7.1 throughput | v0.8.0 throughput | Notes |
| --- | --- | --- | --- |
| Pure phash2 | ~1.7 M items/sec | ~1.7 M items/sec | unchanged |
| Pure xxhash3 | ~1.9 M items/sec | ~1.9 M items/sec | unchanged |
| Rust raw XXH3 | ~30 M items/sec | ~30 M items/sec | unchanged |
| Rust raw_h Murmur3 | n/a | ~28 M items/sec | new in v0.8.0 |

Code surface

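| Asset | v0.7.1 | v0.8.0 | Delta |
| --- | --- | --- | --- |
| Elixir modules in lib/ | ~30 | ~40 | +10 |
| Rust NIF functions | 47 | 58 | +11 |
| Plans / design docs | 47 | 62 | +15 |
| Precompiled NIF targets | 6 | 8 | +2 (Windows MSVC) |
| Artifacts per release | 12 | 16 | +4 |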

Wire format

| Sketch (empty) | v1 size | v2 size | Overhead |
| --- | --- | --- | --- |
| HLL p=4 | 18 bytes | 50 bytes | +32 (2.8x) |
| HLL p=14 | 16,398 bytes | 16,430 bytes | +32 (0.2%) |
| KLL k=200 (populated) | ~3-5 KB | ~3-5 KB | +32 (~1%) |
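The overhead column mixes two readings of the same 32 added bytes: for HLL p=4 the "2.8x" is the total size ratio (50 / 18), while the percentages are the relative increase over the v1 size. A small helper, hypothetical and not part of the library, makes the arithmetic explicit:

```rust
/// Returns (bytes added, percentage increase over the v1 size).
pub fn overhead(v1_bytes: u64, v2_bytes: u64) -> (u64, f64) {
    let added = v2_bytes - v1_bytes;
    let pct = added as f64 / v1_bytes as f64 * 100.0;
    (added, pct)
}

/// Total size ratio v2 / v1, the "2.8x" style figure.
pub fn size_ratio(v1_bytes: u64, v2_bytes: u64) -> f64 {
    v2_bytes as f64 / v1_bytes as f64
}
```

For the two HLL rows this gives 32 bytes added in both cases, a ratio of about 2.78 for p=4, and an increase of about 0.195% for p=14, matching the rounded figures in the table.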

Design decisions worth re-reading

A handful of decisions shaped the v0.8.0 architecture and deserve explicit documentation here so future maintainers don't relitigate them.

Why two-layer versioning (frame + metadata block)?

The EXSK v2 frame has its own serialization_version byte. The embedded Hash.Metadata block has its own block_version byte. Two independent axes.

The rationale is fine-grained evolution:

  • Adding a new hash algorithm: claim a new wire byte. No version bump on either axis.
  • Adding a new metadata field: append to the metadata block's extension trailer. v1 readers preserve unknown extension bytes verbatim on re-encode. No version bump.
  • Restructuring the metadata block layout itself: bump block_version. Frame version stays at 2.
  • Restructuring the frame layout (e.g., changing the magic, the CRC algorithm, the header field order): bump serialization_version to 3.

Single-axis versioning would force every change to either be backward-incompatible or to crowd into a single ever-larger version namespace. Two axes give us 16+ years of additive evolution before either runs out of room.
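A reader that honours both axes checks them independently. The sketch below models only that decision. The constant values, the Decode enum, and the assumption that a v1 frame carries no metadata block are illustrative guesses, not the library's actual decoder.

```rust
// Assumed version values, for illustration only.
pub const FRAME_V1: u8 = 1;
pub const FRAME_V2: u8 = 2;
pub const BLOCK_V1: u8 = 1;

#[derive(Debug, PartialEq, Eq)]
pub enum Decode {
    /// Legacy frame: handled by the v1 codec, no metadata block expected.
    LegacyV1,
    /// Current frame with a metadata block layout this reader understands.
    V2,
    /// Frame layout is understood, but the metadata block is newer.
    UnknownBlock(u8),
    /// A v2 frame must carry a metadata block.
    MissingBlock,
    /// Frame layout itself is newer than this reader.
    UnknownFrame(u8),
}

/// The frame axis is examined first; the block axis only matters
/// once the frame layout is known.
pub fn classify(serialization_version: u8, block_version: Option<u8>) -> Decode {
    match (serialization_version, block_version) {
        (FRAME_V1, _) => Decode::LegacyV1,
        (FRAME_V2, Some(BLOCK_V1)) => Decode::V2,
        (FRAME_V2, Some(other)) => Decode::UnknownBlock(other),
        (FRAME_V2, None) => Decode::MissingBlock,
        (other, _) => Decode::UnknownFrame(other),
    }
}
```

The point of the separation is visible in the third arm: a bumped block_version is reported as a metadata problem while the surrounding frame remains parseable, whereas a bumped serialization_version stops decoding at the frame level.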

Why CRC32C (Castagnoli), not CRC32 (IEEE) or xxhash3-32?

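  • CRC32C has hardware acceleration on every modern CPU (Intel SSE 4.2, ARMv8.1+). Where the hardware also accelerates CRC32 (IEEE), as the ARMv8 CRC instructions do, the two are in the same speed class; where the dedicated instruction covers only CRC32C, as with SSE 4.2, CRC32C is substantially faster.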
  • CRC32C is the standard checksum in iSCSI, Btrfs, SCTP, Snappy frame format. The algorithm is settled; the wire bytes are stable across implementations. Cross-language interop is trivial.
  • xxhash3-32 is faster but is NOT a CRC. It has different error-detection guarantees. For storage integrity (the primary use case) the CRC32 family is the right tool.

Full rationale in plans/corruption_detection.md.

Why preserve the legacy _raw_nif family alongside _raw_h_nif?

The v0.7.1 _raw_nif family hardcoded XXH3. Phase 3 added the generalized _raw_h_nif family with an algorithm byte. The two families are now functionally equivalent for XXH3.

We preserve the legacy NIFs because:

  1. They are part of the v0.7.x ABI. Removing them is a breaking change reserved for v1.0.
  2. They serve as a regression baseline: _raw_nif and _raw_h_nif (algo=1) are property-tested for byte-identical output, locking the equivalence and catching any drift.

A v1.0 deprecation could remove the legacy family.

Why a 16-byte fixed metadata block when sketches differ in family?

The block could vary per sketch family. We chose fixed for three reasons:

  1. Cross-family validation. The merge validator can compare two metadata blocks without knowing which sketch family they belong to. Useful for generic tooling.
  2. Forward compatibility. The fixed length means a v0.8 reader can skip a future metadata block of unknown internal structure and still successfully parse the surrounding frame.
  3. Smallest worst-case. For HLL p=4 the overhead is 32 bytes. For any production-sized sketch (p >= 12, at one byte per register) the overhead is < 1%.

Variable-length metadata was rejected as a premature optimization that would have made the binary contract harder to validate.

Why is Backend.default/0 still Pure?

Because of the "no silent default change" guarantee from v0.7.x. Users who benchmarked the library and chose Pure, for whatever reason, should not have the default flip under them on a minor-version upgrade.

This is locked by test/ex_data_sketch/nif_availability_test.exs and documented in precompiled_nifs.md.

The trade-off: users adopting the library for the first time may benchmark with the wrong backend. We accept that as the smaller risk.

A future major-version bump (v1.0) is the appropriate moment to revisit.

What v0.8.0 does NOT do

Explicit non-goals to prevent scope creep in maintenance and to document the boundary for v0.9.0 planning:

  • No new sketch families. CPC, Tuple, MinHash, VarOpt — all deferred (v0.11+).
  • No Apache DataSketches binary interop beyond Theta CompactSketch which already existed. KLL and HLL interop deferred (v0.10).
  • No streaming integrations. Broadway, Flow, GenStage — deferred (v0.9).
  • No persistence layers. ETS, DETS, CubDB — deferred (v0.9).
  • No telemetry / OpenTelemetry. Deferred (v0.9).
  • No SIMD intrinsics. The HLL hot path uses scalar Rust; hyperloglog-rs uses SIMD and is 2-3x faster. Deferred (v1.0).
  • No 6-bit register packing. HLL stores 1 byte per register, wasting 25%. Deferred (v1.0).
  • No raw-NIF path for membership filters. Bloom, Cuckoo, Quotient, CQF, XorFilter, IBLT still hash in Elixir. Deferred (v0.9 candidate).
  • No SBOM / SLSA / reproducible builds. Deferred (v1.0).

See also