This document explains what changes between v0.7.x and v0.8.0, who is affected, and how to roll out the upgrade safely.
TL;DR
- No code changes required for most users. All sketch public APIs are source-compatible with v0.7.x.
- EXSK serialization bumped from v1 to v2 (adds CRC32C trailer and embedded hash metadata block). v0.8.0 still reads v1 binaries. v0.7.x cannot read v2 binaries.
- Two Windows precompiled NIF targets added (x86_64-pc-windows-msvc, aarch64-pc-windows-msvc); installation on Windows no longer requires a local Rust toolchain.
- :murmur3 is now a user-selectable hash strategy for HLL / ULL / Theta / CMS, enabling Apache DataSketches interop. Default remains :xxhash3.
- Upgrade order for distributed deployments: readers first, producers second.
Who is affected and how
| User profile | Action required |
|---|---|
| Pure in-process user, no persistence | None. Run mix deps.update ex_data_sketch and continue. |
| Persists sketches to disk / Redis / etc. | Plan the rollout: deploy v0.8.0 readers everywhere first; once stable, deploy producers. v0.8.0 readers handle both v1 and v2 binaries; v0.7.x readers cannot handle v2. |
| Multi-node distributed (sketches travel between nodes) | Same staged rollout as above. Mixed v0.7.x and v0.8.0 nodes work as long as producers stay on v0.7.x until all readers are upgraded. |
| Linux glibc/musl x86_64/aarch64, macOS | Precompiled NIF available; no Rust toolchain needed. |
| Windows x86_64 or ARM64 | NEW: precompiled NIF available (was source-build only). |
| FreeBSD / NetBSD / RISC-V | Source build still required: EX_DATA_SKETCH_BUILD=1 mix deps.compile. |
| Custom :hash_fn callers | None. The :hash_fn path is preserved verbatim. |
| Apache DataSketches interop | Optional: switch to hash_strategy: :murmur3 for native compatibility. |
What's new in v0.8.0
Hash strategy selection
HLL.new/1, ULL.new/1, Theta.new/1, and CMS.new/1 now honor a
user-supplied :hash_strategy. In v0.7.x this option was silently
overridden.
# v0.7.x — silently used :xxhash3 regardless of the :hash_strategy opt.
sketch = ExDataSketch.HLL.new(p: 14, hash_strategy: :murmur3)
# sketch.opts[:hash_strategy] was :xxhash3 (BUG).
# v0.8.0 — honors the requested strategy.
sketch = ExDataSketch.HLL.new(p: 14, hash_strategy: :murmur3)
# sketch.opts[:hash_strategy] == :murmur3.
Supported strategies:
- :xxhash3 (default when the Rust NIF is loaded) - fastest, stable across platforms.
- :murmur3 - Apache DataSketches compatibility. ~8% slower than XXH3 in the Rust hot path. Pure Elixir fallback bundled.
- :phash2 - BEAM-only. Not portable across OTP major versions; use only for offline / single-OTP workloads.
- :custom - pass :hash_fn for a caller-supplied closure. Sketches built with :custom are NEVER merge-compatible with any other sketch.
Cross-strategy merges are rejected with
ExDataSketch.Errors.IncompatibleSketchesError. The rejection itself
dates from v0.7.1; v0.8.0 is stricter and catches more mismatch
cases.
EXSK v2 binary format
Every sketch's serialize/1 now produces an EXSK v2 frame. The new
layout adds:
- A 16-byte Hash.Metadata block recording the exact hash identity used to produce the sketch.
- A family_version byte for per-sketch internal-state evolution.
- A flags byte (reserved in v2).
- A trailing CRC32C checksum over the entire preceding frame.
Empty HLL sketch grows from ~18 bytes (v1) to ~50 bytes (v2). For any sketch larger than ~1 KB the overhead is negligible (< 5%).
# v0.7.x produced an EXSK v1 frame:
v1 = HLL.serialize(sketch)
<<"EXSK", 1, _rest::binary>> = v1 # version byte = 1
# v0.8.0 produces an EXSK v2 frame:
v2 = HLL.serialize(sketch)
<<"EXSK", 2, _rest::binary>> = v2 # version byte = 2
v0.8.0's deserialize/1 accepts both:
# Both succeed in v0.8.0:
{:ok, _} = HLL.deserialize(v1)
{:ok, _} = HLL.deserialize(v2)
# Only v1 succeeds in v0.7.x:
# v0.7.x will return {:error, %DeserializationError{}} on v2 input.
If you need to write v1 frames from v0.8.0 during a staged rollout, the legacy codec is still available:
# Reads v2:
{:ok, decoded} = ExDataSketch.Binary.decode(v2_binary)
# Writes v1 (legacy, for backward-compat producers only):
v1_binary = ExDataSketch.Codec.encode(
ExDataSketch.Codec.sketch_id_hll(),
1, # version 1
<<14, 1>>, # params (p + hash strategy)
sketch.state # state binary
)
No public sketch API exposes the v1 writer; use Codec.encode/4
directly only as an interim staged-rollout escape hatch.
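During a staged rollout it can help to check which frame version a stored binary carries without deserializing it. The following is a minimal sketch that assumes only what the examples above show: a frame starts with the 4-byte magic "EXSK" followed by one version byte. The module name ExskPeek and its functions are hypothetical and not part of ex_data_sketch.

```elixir
defmodule ExskPeek do
  # Hypothetical helper, not part of ex_data_sketch.
  # Assumption: a frame begins with the magic "EXSK" and a single version byte,
  # as in the pattern matches shown in this guide.

  # Returns the frame version without decoding the rest of the binary.
  def version(<<"EXSK", v, _rest::binary>>) when v in [1, 2], do: {:ok, v}
  def version(<<"EXSK", _v, _rest::binary>>), do: {:error, :unknown_version}
  def version(_other), do: {:error, :not_exsk}

  # True only for v1 frames, which are the ones a v0.7.x reader accepts.
  def readable_by_v07?(binary), do: version(binary) == {:ok, 1}
end
```

A scan over persisted binaries with readable_by_v07?/1 gives a rough count of frames that would break if a reader were still on v0.7.x.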
Corruption detection
EXSK v2 frames carry a trailing CRC32C (Castagnoli,
hardware-accelerated on modern x86 and ARM). A CRC detects every
single-bit error in the frame, and random multi-bit corruption is
detected with probability > 99.99999%. Detected corruption surfaces
as a structured error:
case HLL.deserialize(possibly_corrupted) do
{:ok, sketch} -> use_it(sketch)
{:error, %ExDataSketch.Errors.DeserializationError{message: msg}} ->
Logger.warning("corrupt HLL: #{msg}")
end
v0.7.x had no corruption detection: bit-flips in the sketch state would silently produce wrong estimates.
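To make the trailer mechanism concrete, here is a dependency-free sketch of a CRC-32C checksum appended to a payload and checked on read. It is an illustration only: the module Crc32cDemo is hypothetical, the 4-byte big-endian trailer is an assumed layout rather than the documented EXSK v2 one, and the library itself presumably uses a hardware-accelerated implementation instead of this bitwise loop.

```elixir
defmodule Crc32cDemo do
  import Bitwise

  # Reflected form of the Castagnoli polynomial used by CRC-32C.
  @poly 0x82F63B78

  # Bitwise CRC-32C: slow, but needs nothing beyond the standard library.
  def crc32c(data) when is_binary(data) do
    crc =
      for <<byte <- data>>, reduce: 0xFFFFFFFF do
        acc -> Enum.reduce(1..8, bxor(acc, byte), fn _, c -> step(c) end)
      end

    bxor(crc, 0xFFFFFFFF)
  end

  # One shift of the CRC register, applying the polynomial when the low bit is set.
  defp step(c) do
    if band(c, 1) == 1, do: bxor(c >>> 1, @poly), else: c >>> 1
  end

  # Appends the checksum as a trailer (assumed layout: 32-bit big-endian).
  def seal(payload), do: payload <> <<crc32c(payload)::unsigned-big-32>>

  # Recomputes the checksum over everything before the trailer.
  def verify(frame) when is_binary(frame) and byte_size(frame) >= 4 do
    body_size = byte_size(frame) - 4
    <<body::binary-size(body_size), stored::unsigned-big-32>> = frame
    if crc32c(body) == stored, do: {:ok, body}, else: {:error, :crc_mismatch}
  end

  def verify(_frame), do: {:error, :too_short}
end
```

Flipping any single bit of a sealed frame, in the body or in the trailer, makes verify/1 return {:error, :crc_mismatch}.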
Precompiled NIF: Windows support
The precompiled NIF matrix now covers 8 target triples (was 6):
- Linux glibc x86_64 / ARM64
- Linux musl x86_64 / ARM64
- macOS Intel / Apple Silicon
- Windows MSVC x86_64 / ARM64 (NEW in v0.8.0)
Each target has 2 NIF versions (2.16, 2.17), so 16 artifacts per release. Hex installation on any of these platforms requires no Rust toolchain.
For FreeBSD / NetBSD / RISC-V or other unsupported platforms:
EX_DATA_SKETCH_BUILD=1 mix deps.compile ex_data_sketch
(Requires a local Rust toolchain.)
Upgrade procedure
Step 1: Local dev environment
# Bump the version in your mix.exs:
{:ex_data_sketch, "~> 0.8.0"}
# Update and compile:
mix deps.update ex_data_sketch
mix deps.compile ex_data_sketch
The precompiled NIF will be downloaded automatically. If you are on
an unsupported platform, set EX_DATA_SKETCH_BUILD=1.
Step 2: Verify your test suite
Run your existing tests. All sketch public APIs are source-compatible with v0.7.x. The only behavior differences visible from sketch APIs are:
- Serialized binaries have a new version byte (2 instead of 1).
- Sketches built with hash_strategy: :murmur3 (or :phash2) actually use that strategy now. v0.7.x silently used :xxhash3.
If your tests assert byte-identical equality with hardcoded v1 frames (unlikely), update them to expect v2 frames.
Step 3: Plan the staged rollout (if you persist sketches)
For deployments that share persisted sketches across process / node / machine boundaries:
- Deploy v0.8.0 to all readers first. v0.8.0 reads both v1 and v2 frames. v0.7.x readers cannot read v2.
- Verify reader stability for at least one deploy cycle.
- Deploy v0.8.0 to producers. They now emit v2 frames.
- Optional rollback drill. If you need to roll back a producer
to v0.7.x while v2 frames are in flight, you can:
  - Re-serialize affected sketches with Codec.encode/4 (the escape hatch shown above), or
  - Accept temporary data loss for the v2-only sketches.
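The "readers first, producers second" rule can be expressed as a small deploy gate. This is a sketch under assumptions: it supposes you can collect the ex_data_sketch version running on each reader as a version string, and the module RolloutGate is hypothetical. It uses only Version.compare/2 from the Elixir standard library.

```elixir
defmodule RolloutGate do
  # Hypothetical deploy-gate helper, not part of ex_data_sketch.
  # Readers must be at least this version before producers emit v2 frames.
  @min_reader "0.8.0"

  # True when every reader can already decode v2 frames.
  # Note: an empty list passes vacuously, so callers should check it is non-empty.
  def producers_may_upgrade?(reader_versions) when is_list(reader_versions) do
    Enum.all?(reader_versions, fn v -> Version.compare(v, @min_reader) != :lt end)
  end

  # Lists the reader versions that still block the producer rollout.
  def lagging_readers(reader_versions) when is_list(reader_versions) do
    Enum.filter(reader_versions, fn v -> Version.compare(v, @min_reader) == :lt end)
  end
end
```

For example, RolloutGate.lagging_readers(["0.8.0", "0.7.4"]) returns ["0.7.4"], which signals that producers should stay on v0.7.x for now.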
Step 4: Adopt new features (optional)
- Hash strategy. Switch high-throughput callers to hash_strategy: :xxhash3 (default; no change required). Switch Apache DataSketches interop callers to hash_strategy: :murmur3.
- Custom hash. The :hash_fn opt is unchanged.
Behavior changes that may surprise
HLL.new(p: 14, hash_strategy: :murmur3) now actually uses Murmur3
In v0.7.x, this option was silently overridden to :xxhash3. If your
v0.7.x code relied on this silent override (e.g., by passing
:hash_strategy from a config that was never validated), the v0.8.0
behavior may produce DIFFERENT estimates than v0.7.x for the same
input.
The estimates are still mathematically correct. The difference is
that two sketches that used to hash identically (both XXH3 in
practice) now hash differently if one is built with :xxhash3 and
another with :murmur3.
Mitigation. Audit any hash_strategy: opts in your codebase. If
you intended :xxhash3 (the most common case), you can drop the
option in v0.8.0 (the default already is :xxhash3 when the NIF
is loaded). If you intended :murmur3 for interop, you now have a
real interop path, but such sketches cannot be merged with sketches
built under v0.7.x, which were hashed with XXH3.
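One way to run the audit is to validate option lists before they reach a sketch constructor. The sketch below is hypothetical: StrategyAudit is not a library module, the default is simplified to :xxhash3 regardless of whether the NIF is loaded, and the arity-1 requirement on :hash_fn is an assumption rather than a documented contract.

```elixir
defmodule StrategyAudit do
  # Hypothetical pre-flight check, not part of ex_data_sketch.
  # The strategy names come from the "Supported strategies" list in this guide.
  @known [:xxhash3, :murmur3, :phash2, :custom]

  # Returns {:ok, strategy} for a usable option list, or a tagged error.
  def check(opts) when is_list(opts) do
    # Simplification: treats :xxhash3 as the default in all cases.
    strategy = Keyword.get(opts, :hash_strategy, :xxhash3)

    cond do
      strategy not in @known ->
        {:error, {:unknown_strategy, strategy}}

      # Assumption: a custom hash function takes exactly one argument.
      strategy == :custom and not is_function(opts[:hash_fn], 1) ->
        {:error, :missing_hash_fn}

      true ->
        {:ok, strategy}
    end
  end
end
```

Running such a check over configuration-derived options surfaces values that v0.7.x used to override silently.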
Merge of sketches built with different hash strategies now fails fast
v0.7.1 introduced merge validation. v0.8.0 inherits and extends it.
A merge between two sketches with mismatched :hash_strategy or
:seed raises IncompatibleSketchesError:
xxh3_sketch = HLL.from_enumerable(items, hash_strategy: :xxhash3)
murm_sketch = HLL.from_enumerable(items, hash_strategy: :murmur3)
HLL.merge(xxh3_sketch, murm_sketch)
# ** (ExDataSketch.Errors.IncompatibleSketchesError)
# HLL hash strategy mismatch: xxhash3 vs murmur3
This is intended; merging would produce a corrupt result.
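The fail-fast idea can be illustrated with a toy structure. This is not the library's implementation: ToySketch, its fields, and the use of ArgumentError are all hypothetical stand-ins, and the register-wise maximum is used only as a typical HLL-style merge.

```elixir
defmodule ToySketch do
  # Toy stand-in for a register-based sketch; not the ex_data_sketch struct.
  defstruct hash_strategy: :xxhash3, seed: 0, registers: []

  # Validates hash identity first, then merges registers by taking the maximum.
  def merge(%__MODULE__{} = a, %__MODULE__{} = b) do
    cond do
      a.hash_strategy != b.hash_strategy ->
        raise ArgumentError,
              "hash strategy mismatch: #{a.hash_strategy} vs #{b.hash_strategy}"

      a.seed != b.seed ->
        raise ArgumentError, "seed mismatch: #{a.seed} vs #{b.seed}"

      length(a.registers) != length(b.registers) ->
        raise ArgumentError, "register count mismatch"

      true ->
        %__MODULE__{a | registers: Enum.zip_with(a.registers, b.registers, &max/2)}
    end
  end
end
```

Checking before touching any register matters because a register-wise maximum over values produced by different hash functions would still return a well-formed sketch, just a meaningless one.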
v2 frame size grows for tiny sketches
Empty HLL: 18 bytes -> 50 bytes (~2.8x). For any production-sized sketch (p ≥ 8, KLL k ≥ 100) the overhead is < 5%.
If you persist millions of tiny sketches, audit storage. The
trade-off is documented in plans/binary_contract.md.
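For a quick storage audit, the figures above can be turned into arithmetic. The sketch assumes the per-frame overhead is a fixed 32 bytes, taken from the approximate empty-HLL growth (50 - 18); the actual overhead may differ by sketch family, and FrameOverhead is a hypothetical module.

```elixir
defmodule FrameOverhead do
  # Assumption: every v2 frame is larger than its v1 equivalent by the same
  # fixed amount, taken from the approximate empty-HLL figures in this guide.
  @extra_bytes 50 - 18

  # Relative growth of one frame, in percent of its v1 size.
  def percent(v1_frame_bytes) when v1_frame_bytes > 0 do
    @extra_bytes / v1_frame_bytes * 100
  end

  # Additional bytes needed to store `sketch_count` frames as v2.
  def extra_storage_bytes(sketch_count) when sketch_count >= 0 do
    sketch_count * @extra_bytes
  end
end
```

Under this assumption a 1 KiB v1 frame grows by about 3.1%, and one million persisted sketches need roughly 32 MB of extra storage.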
Backend.default/0 is still Pure
Even when the Rust NIF is loaded, the default backend is
ExDataSketch.Backend.Pure. To opt into the Rust hot paths:
# Per-sketch:
sketch = HLL.new(p: 14, backend: ExDataSketch.Backend.Rust)
# Or application-wide:
config :ex_data_sketch, backend: ExDataSketch.Backend.Rust
This is unchanged from v0.7.x and is intentional. The "no silent
default change" guarantee is documented in precompiled_nifs.md.
Test-side changes
If you maintain a test suite that exercises ex_data_sketch:
NIF mode switching
If you flip EX_DATA_SKETCH_BUILD between local test runs, use the
new aliases:
EX_DATA_SKETCH_BUILD=1 mix test.nif_on
EX_DATA_SKETCH_SKIP_NIF=true mix test.nif_off
These automatically reset the per-env rustler_precompiled state.
The bare mix test invocation works for single-mode CI but trips
the compile-vs-runtime env check when used to flip modes locally.
Hardcoded v1 frame assertions
If your tests do something like:
# v0.7.x style:
assert <<"EXSK", 1, _rest::binary>> = HLL.serialize(sketch)
…update to v2:
# v0.8.0 style:
assert <<"EXSK", 2, _rest::binary>> = HLL.serialize(sketch)
Three sketch-internal tests in this repo were updated this way (KLL, REQ, MisraGries). External users likely don't have such assertions.
Rolling back from v0.8.0 to v0.7.x
Possible but requires care:
- Stop producers, drain the v2-frame queue.
- Roll producers back to v0.7.x.
- Any v2 frame that survives the drain will fail to deserialize on
v0.7.x with DeserializationError. The persisted state binary inside
the v2 frame is recoverable by hand-parsing the EXSK v2 layout (see
plans/binary_contract.md).
In practice, rolling back a binary-format upgrade is painful. Plan the v0.8.0 upgrade as one-way and test it thoroughly in staging.
Known issues at release time
Tracked in plans/0.8.0-risks.md. Highlights:
- ULL accuracy at low p and high cardinality (5-R1). ULL at p < 12 produces large over-estimates when n / 2^p exceeds ~2. Production guidance: use p ≥ 12. Investigation tracked as a follow-up issue.
- HLL memory profile at very high cardinality (X-R1). Streaming
10M items into a single HLL allocates ~1.86 GB of transient
Elixir state due to Stream.chunk_every/2 lifecycle. Investigation
tracked as a follow-up issue. Workaround: smaller chunk sizes or
custom enumerable batching.
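The batching workaround for X-R1 might look like the sketch below, which folds an enumerable into an accumulator one bounded batch at a time. BatchedIngest is hypothetical, and update_many stands in for whatever bulk-update function your sketch exposes; no ex_data_sketch call is assumed. Whether this lowers the transient allocation in practice is something to measure on your workload.

```elixir
defmodule BatchedIngest do
  # Hypothetical helper, not part of ex_data_sketch.
  # Folds `enumerable` into `acc` in batches of at most `batch_size` items,
  # so only one batch is materialized as a list at any time.
  def ingest(enumerable, acc, batch_size, update_many)
      when is_integer(batch_size) and batch_size > 0 and is_function(update_many, 2) do
    enumerable
    |> Stream.chunk_every(batch_size)
    |> Enum.reduce(acc, fn batch, acc -> update_many.(acc, batch) end)
  end
end
```

As a stand-in for a sketch, BatchedIngest.ingest(1..10, MapSet.new(), 3, fn set, batch -> Enum.into(batch, set) end) builds the same set as a single bulk insert, while never holding more than three items in a batch.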
See also
- plans/binary_contract.md - full EXSK v2 layout specification.
- plans/corruption_detection.md - CRC32C rationale and error taxonomy.
- hash_strategies.md - hash algorithm selection guide.
- hll_performance.md - performance characteristics of each path.
- precompiled_nifs.md - platform support details.
- plans/0.8.0-risks.md - open risk register at release time.
- CHANGELOG.md - full v0.8.0 change log.