Migrating to ex_data_sketch v0.8.0


This document explains what changes between v0.7.x and v0.8.0, who is affected, and how to roll out the upgrade safely.

TL;DR

  • No code changes required for most users. All sketch public APIs are source-compatible with v0.7.x.
  • EXSK serialization bumped from v1 to v2 (adds CRC32C trailer and embedded hash metadata block). v0.8.0 still reads v1 binaries. v0.7.x cannot read v2 binaries.
  • Two Windows precompiled NIF targets added (x86_64-pc-windows-msvc, aarch64-pc-windows-msvc); installation on Windows no longer requires a local Rust toolchain.
  • :murmur3 is now a user-selectable hash strategy for HLL / ULL / Theta / CMS, enabling Apache DataSketches interop. Default remains :xxhash3.
  • Upgrade order for distributed deployments: readers first, producers second.

Who is affected and how

| User profile | Action required |
| --- | --- |
| Pure in-process user, no persistence | None. mix deps.update ex_data_sketch and continue. |
| Persists sketches to disk / Redis / etc. | Plan the rollout: deploy v0.8.0 readers everywhere first; once stable, deploy producers. v0.8.0 readers handle both v1 and v2 binaries; v0.7.x readers cannot handle v2. |
| Multi-node distributed (sketches travel between nodes) | Same staged rollout as above. Mixed v0.7.x and v0.8.0 nodes work as long as producers stay on v0.7.x until all readers are upgraded. |
| Linux glibc/musl x86_64/aarch64, macOS | Precompiled NIF available; no Rust toolchain needed. |
| Windows x86_64 or ARM64 | NEW: precompiled NIF available (was source-build only). |
| FreeBSD / NetBSD / RISC-V | Source build still required: EX_DATA_SKETCH_BUILD=1 mix deps.compile. |
| Custom :hash_fn callers | None. The :hash_fn path is preserved verbatim. |
| Apache DataSketches interop | Optional: switch to hash_strategy: :murmur3 for native compatibility. |

What's new in v0.8.0

Hash strategy selection

HLL.new/1, ULL.new/1, Theta.new/1, and CMS.new/1 now honor a user-supplied :hash_strategy. v0.7.x silently overrode this option and always used :xxhash3.

# v0.7.x — silently used :xxhash3 regardless of the :hash_strategy opt.
sketch = ExDataSketch.HLL.new(p: 14, hash_strategy: :murmur3)
# sketch.opts[:hash_strategy] was :xxhash3 (BUG).

# v0.8.0 — honors the requested strategy.
sketch = ExDataSketch.HLL.new(p: 14, hash_strategy: :murmur3)
# sketch.opts[:hash_strategy] == :murmur3.

Supported strategies:

  • :xxhash3 (default when the Rust NIF is loaded) — fastest, stable across platforms.
  • :murmur3 — Apache DataSketches compatibility. ~8% slower than XXH3 in the Rust hot path. Pure Elixir fallback bundled.
  • :phash2 — BEAM-only. Not portable across OTP major versions; use only for offline / single-OTP workloads.
  • :custom — pass :hash_fn for a caller-supplied closure. Sketches built with :custom are NEVER merge-compatible with any other sketch.

Cross-strategy merges are rejected with ExDataSketch.Errors.IncompatibleSketchesError. The check itself dates from v0.7.1; v0.8.0 keeps it and makes it stricter, so a v0.8.0 reader catches more mismatch cases.

EXSK v2 binary format

Every sketch's serialize/1 now produces an EXSK v2 frame. The new layout adds:

  • A 16-byte Hash.Metadata block recording the exact hash identity used to produce the sketch.
  • A family_version byte for per-sketch internal-state evolution.
  • A flags byte (reserved in v2).
  • A trailing CRC32C checksum over the entire preceding frame.

Empty HLL sketch grows from ~18 bytes (v1) to ~50 bytes (v2). For any sketch larger than ~1 KB the overhead is negligible (< 5%).

# v0.7.x produced an EXSK v1 frame:
v1 = HLL.serialize(sketch)
<<"EXSK", 1, _rest::binary>> = v1  # version byte = 1

# v0.8.0 produces an EXSK v2 frame:
v2 = HLL.serialize(sketch)
<<"EXSK", 2, _rest::binary>> = v2  # version byte = 2

v0.8.0's deserialize/1 accepts both:

# Both succeed in v0.8.0:
{:ok, _} = HLL.deserialize(v1)
{:ok, _} = HLL.deserialize(v2)

# Only v1 succeeds in v0.7.x:
# v0.7.x will return {:error, %DeserializationError{}} on v2 input.
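
During a staged rollout it can help to know which frame version a stored binary carries before handing it to a reader. The sketch below relies only on the "EXSK" magic and the version byte shown above. The module name FrameSniffer and its return values are hypothetical and not part of ex_data_sketch.

```elixir
defmodule FrameSniffer do
  @moduledoc """
  Hypothetical helper (not part of ex_data_sketch): classifies a binary by
  the EXSK magic and version byte, without decoding the rest of the frame.
  """

  # The first four bytes are the "EXSK" magic, the fifth is the version.
  def version(<<"EXSK", version, _rest::binary>>), do: {:ok, version}
  def version(_other), do: {:error, :not_exsk}

  # Assumption taken from this guide: a v0.7.x reader accepts only v1 frames.
  def readable_by_v07?(binary), do: version(binary) == {:ok, 1}
end

# Synthetic frames: only the header bytes matter for this check.
{:ok, 1} = FrameSniffer.version(<<"EXSK", 1, 0, 0>>)
{:ok, 2} = FrameSniffer.version(<<"EXSK", 2, 0, 0>>)
{:error, :not_exsk} = FrameSniffer.version(<<"nope">>)
```

A check like this could run over a sample of persisted binaries to confirm that no v2 frames exist before any v0.7.x reader is expected to consume them.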

If you need to write v1 frames from v0.8.0 during a staged rollout, the legacy codec is still available:

# Reads v2:
{:ok, decoded} = ExDataSketch.Binary.decode(v2_binary)

# Writes v1 (legacy, for backward-compat producers only):
v1_binary = ExDataSketch.Codec.encode(
  ExDataSketch.Codec.sketch_id_hll(),
  1,                                    # version 1
  <<14, 1>>,                            # params (p + hash strategy)
  sketch.state                          # state binary
)

No public sketch API exposes the v1 writer; use Codec.encode/4 directly only as an interim staged-rollout escape hatch.

Corruption detection

EXSK v2 frames carry a trailing CRC32C (Castagnoli, hardware-accelerated on modern x86 and ARM). A CRC detects every single-bit error in the frame, and random multi-bit corruption is detected with probability > 99.99999%. A failed check surfaces as a structured error:

case HLL.deserialize(possibly_corrupted) do
  {:ok, sketch} -> use_it(sketch)
  {:error, %ExDataSketch.Errors.DeserializationError{message: msg}} ->
    Logger.warning("corrupt HLL: #{msg}")
end

v0.7.x had no corruption detection — bit-flips in the sketch state would silently produce wrong estimates.
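
To make the checksum concrete, here is a minimal bitwise CRC32C in pure Elixir. It is an illustration of the Castagnoli CRC only, not the library's implementation: the module name Crc32cSketch is hypothetical, and how ex_data_sketch lays out the trailer (byte order, exact byte range covered) is not assumed here.

```elixir
defmodule Crc32cSketch do
  @moduledoc """
  Illustrative bitwise CRC32C (Castagnoli). Not the library's implementation.
  """
  import Bitwise

  # Reflected form of the Castagnoli polynomial 0x1EDC6F41.
  @poly 0x82F63B78

  def checksum(data) when is_binary(data) do
    crc =
      for <<byte <- data>>, reduce: 0xFFFFFFFF do
        acc -> step(bxor(acc, byte), 8)
      end

    # Final inversion, as required by the CRC32C definition.
    bxor(crc, 0xFFFFFFFF)
  end

  # Process one bit per call, least significant bit first.
  defp step(crc, 0), do: crc
  defp step(crc, n) when band(crc, 1) == 1, do: step(bxor(bsr(crc, 1), @poly), n - 1)
  defp step(crc, n), do: step(bsr(crc, 1), n - 1)
end

# Standard CRC32C check value for the ASCII string "123456789".
0xE3069283 = Crc32cSketch.checksum("123456789")

# Flip one bit of a synthetic payload; the checksum no longer matches.
payload = <<"EXSK", 2, 1, 2, 3, 4>>
<<first, rest::binary>> = payload
flipped = <<Bitwise.bxor(first, 1), rest::binary>>
true = Crc32cSketch.checksum(payload) != Crc32cSketch.checksum(flipped)
```

A bit-at-a-time loop is slow compared with a table-driven or hardware-accelerated version; it is used here only because it is short enough to verify by eye.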

Precompiled NIF: Windows support

The precompiled NIF matrix now covers 8 target triples (was 6):

  • Linux glibc x86_64 / ARM64
  • Linux musl x86_64 / ARM64
  • macOS Intel / Apple Silicon
  • Windows MSVC x86_64 / ARM64 (NEW in v0.8.0)

Each target has 2 NIF versions (2.16, 2.17), so 16 artifacts per release. Hex installation on any of these platforms requires no Rust toolchain.

For FreeBSD / NetBSD / RISC-V or other unsupported platforms:

EX_DATA_SKETCH_BUILD=1 mix deps.compile ex_data_sketch

(Requires a local Rust toolchain.)

Upgrade procedure

Step 1: Local dev environment

# Bump the version in your mix.exs:
{:ex_data_sketch, "~> 0.8.0"}

# Update and compile:
mix deps.update ex_data_sketch
mix deps.compile ex_data_sketch

The precompiled NIF will be downloaded automatically. If you are on an unsupported platform, set EX_DATA_SKETCH_BUILD=1.

Step 2: Verify your test suite

Run your existing tests. All sketch public APIs are source-compatible with v0.7.x. The only behavior differences visible from sketch APIs are:

  1. Serialized binaries have a new version byte (2 instead of 1).
  2. Sketches built with hash_strategy: :murmur3 (or :phash2) actually use that strategy now. v0.7.x silently used :xxhash3.

If your tests assert byte-identical equality with hardcoded v1 frames (unlikely), update them to expect v2 frames.

Step 3: Plan the staged rollout (if you persist sketches)

For deployments that share persisted sketches across process / node / machine boundaries:

  1. Deploy v0.8.0 to all readers first. v0.8.0 reads both v1 and v2 frames. v0.7.x readers cannot read v2.
  2. Verify reader stability for at least one deploy cycle.
  3. Deploy v0.8.0 to producers. They now emit v2 frames.
  4. Optional rollback drill. If you need to roll back a producer to v0.7.x while v2 frames are in flight, you can:
    • Re-serialize affected sketches with Codec.encode/4 (the escape hatch shown above), or
    • Accept temporary data loss for the v2-only sketches.
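
The readers-first ordering above can be expressed as a small pre-deploy gate. This is a sketch under stated assumptions: the module name RolloutGate is hypothetical, and the list of reader versions is assumed to come from your own deployment tooling. Only Elixir's standard Version module is used.

```elixir
defmodule RolloutGate do
  @moduledoc """
  Hypothetical pre-deploy check (not part of ex_data_sketch): producers may
  move to v0.8.0 only once every reader can parse EXSK v2 frames.
  """

  # Readers at or above this requirement are treated as v2-capable.
  @v2_capable ">= 0.8.0"

  # `reader_versions` is a list of version strings reported by reader nodes.
  def producers_may_upgrade?(reader_versions) do
    Enum.all?(reader_versions, &Version.match?(&1, @v2_capable))
  end

  # Readers that still block the producer rollout.
  def lagging_readers(reader_versions) do
    Enum.reject(reader_versions, &Version.match?(&1, @v2_capable))
  end
end

false = RolloutGate.producers_may_upgrade?(["0.8.0", "0.7.1"])
["0.7.1"] = RolloutGate.lagging_readers(["0.8.0", "0.7.1"])
true = RolloutGate.producers_may_upgrade?(["0.8.0", "0.8.0"])
```

Note that Version.match?/2 does not match pre-release versions by default, so a reader on a release candidate would be reported as lagging.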

Step 4: Adopt new features (optional)

  • Hash strategy. Switch high-throughput callers to hash_strategy: :xxhash3 (default; no change required). Switch Apache DataSketches interop callers to hash_strategy: :murmur3.
  • Custom hash. The :hash_fn opt is unchanged.

Behavior changes that may surprise

HLL.new(p: 14, hash_strategy: :murmur3) now actually uses Murmur3

In v0.7.x, this option was silently overridden to :xxhash3. If your v0.7.x code relied on this silent override (e.g., by passing :hash_strategy from a config that was never validated), the v0.8.0 behavior may produce DIFFERENT estimates than v0.7.x for the same input.

The estimates are still mathematically correct. The difference is that two sketches that used to hash identically (both XXH3 in practice) now hash differently if one is built with :xxhash3 and another with :murmur3.

Mitigation. Audit any hash_strategy: opts in your codebase. If you intended :xxhash3 (the most common case), you can drop the option in v0.8.0, because :xxhash3 is already the default when the NIF is loaded. If you intended :murmur3 for interop, you now have a real interop path, but those sketches cannot be merged with sketches built by v0.7.x, which always hashed with XXH3.

Merge of sketches built with different hash strategies now fails fast

v0.7.1 introduced merge validation. v0.8.0 inherits and extends it. A merge between two sketches with mismatched :hash_strategy or :seed raises IncompatibleSketchesError:

xxh3_sketch = HLL.from_enumerable(items, hash_strategy: :xxhash3)
murm_sketch = HLL.from_enumerable(items, hash_strategy: :murmur3)

HLL.merge(xxh3_sketch, murm_sketch)
# ** (ExDataSketch.Errors.IncompatibleSketchesError)
#    HLL hash strategy mismatch: xxhash3 vs murmur3

This is intended; merging would produce a corrupt result.
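
When merging a batch of sketches from mixed sources, one way to avoid the error is to group them by hash identity first and merge within each group. The sketch below is a hypothetical helper, not part of ex_data_sketch. It assumes each sketch exposes opts[:hash_strategy] (as shown earlier in this guide) and opts[:seed] (an assumption), and it uses plain maps as stand-ins for real sketch structs.

```elixir
defmodule MergeGroups do
  @moduledoc """
  Hypothetical helper (not part of ex_data_sketch): partitions sketches so
  that only sketches with the same hash strategy and seed share a group.
  """

  def by_hash_identity(sketches) do
    Enum.group_by(sketches, fn s -> {s.opts[:hash_strategy], s.opts[:seed]} end)
  end
end

# Plain maps stand in for real sketch structs in this illustration.
a = %{id: :a, opts: [hash_strategy: :xxhash3, seed: 0]}
b = %{id: :b, opts: [hash_strategy: :murmur3, seed: 0]}
c = %{id: :c, opts: [hash_strategy: :xxhash3, seed: 0]}

groups = MergeGroups.by_hash_identity([a, b, c])
[^a, ^c] = groups[{:xxhash3, 0}]
[^b] = groups[{:murmur3, 0}]
```

Each group can then be reduced with the sketch family's merge function, and groups with different keys are reported separately rather than combined.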

v2 frame size grows for tiny sketches

Empty HLL: 18 bytes -> 50 bytes (~2.8x). For any production-sized sketch (p ≥ 8, KLL k ≥ 100) the overhead is < 5%.

If you persist millions of tiny sketches, audit storage. The trade-off is documented in plans/binary_contract.md.
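
For a rough storage audit, the arithmetic can be sketched as follows. It assumes the per-frame growth is a fixed 32 bytes, taken from the empty-HLL figures above (50 - 18). The actual growth for other sketch families may differ, and the module name FrameOverhead is hypothetical.

```elixir
defmodule FrameOverhead do
  # Assumed fixed per-frame growth, from the empty-HLL figures (50 - 18).
  @extra_bytes 32

  # Extra bytes needed to store `count` sketches as v2 instead of v1.
  def extra_storage(count), do: count * @extra_bytes

  # Relative growth, in percent, for a sketch whose v1 frame was `v1_bytes` long.
  def growth_percent(v1_bytes), do: @extra_bytes * 100 / v1_bytes
end

# Ten million tiny sketches need about 320 MB of additional storage.
320_000_000 = FrameOverhead.extra_storage(10_000_000)

# A 1 KB frame grows by less than 5%; an 18-byte frame more than doubles.
true = FrameOverhead.growth_percent(1024) < 5.0
true = FrameOverhead.growth_percent(18) > 100.0
```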

Backend.default/0 is still Pure

Even when the Rust NIF is loaded, the default backend is ExDataSketch.Backend.Pure. To opt into the Rust hot paths:

# Per-sketch:
sketch = HLL.new(p: 14, backend: ExDataSketch.Backend.Rust)

# Or application-wide:
config :ex_data_sketch, backend: ExDataSketch.Backend.Rust

This is unchanged from v0.7.x and is intentional. The "no silent default change" guarantee is documented in precompiled_nifs.md.

Test-side changes

If you maintain a test suite that exercises ex_data_sketch:

NIF mode switching

If you flip EX_DATA_SKETCH_BUILD between local test runs, use the new aliases:

EX_DATA_SKETCH_BUILD=1 mix test.nif_on
EX_DATA_SKETCH_SKIP_NIF=true mix test.nif_off

These automatically reset the per-env rustler_precompiled state. The bare mix test invocation works for single-mode CI but trips the compile-vs-runtime env check when used to flip modes locally.

Hardcoded v1 frame assertions

If your tests do something like:

# v0.7.x style:
assert <<"EXSK", 1, _rest::binary>> = HLL.serialize(sketch)

…update to v2:

# v0.8.0 style:
assert <<"EXSK", 2, _rest::binary>> = HLL.serialize(sketch)

Three sketch-internal tests in this repo were updated this way (KLL, REQ, MisraGries). External users likely don't have such assertions.

Rolling back from v0.8.0 to v0.7.x

Possible but requires care:

  1. Stop producers, drain the v2-frame queue.
  2. Roll producers back to v0.7.x.
  3. Any v2 frame that survives the drain will fail to deserialize on v0.7.x with DeserializationError. The persisted state binary inside the v2 frame is recoverable by hand-parsing the EXSK v2 layout (see plans/binary_contract.md).

In practice, rolling back a binary-format upgrade is painful. Plan the v0.8.0 upgrade as one-way and test it thoroughly in staging.

Known issues at release time

Tracked in plans/0.8.0-risks.md. Highlights:

  • ULL accuracy at low p and high cardinality (5-R1). ULL at p < 12 produces large over-estimates when n / 2^p exceeds ~2. Production guidance: use p ≥ 12. Investigation tracked as a follow-up issue.
  • HLL memory profile at very high cardinality (X-R1). Streaming 10M items into a single HLL allocates ~1.86 GB of transient Elixir state due to Stream.chunk_every/2 lifecycle. Investigation tracked as a follow-up issue. Workaround: smaller chunk sizes or custom enumerable batching.

See also

  • plans/binary_contract.md — full EXSK v2 layout specification.
  • plans/corruption_detection.md — CRC32C rationale and error taxonomy.
  • hash_strategies.md — hash algorithm selection guide.
  • hll_performance.md — performance characteristics of each path.
  • precompiled_nifs.md — platform support details.
  • plans/0.8.0-risks.md — open risk register at release time.
  • CHANGELOG.md — full v0.8.0 change log.