This document explains the HLL hot-path architecture introduced (and
extended) by ex_data_sketch v0.8.0 Phase 3.
Why HLL throughput matters
HyperLogLog is the workhorse cardinality sketch. In production it is
called once per stream event, so its update path runs on every record
in the system. A 10x improvement in update_many/2 directly translates
into a 10x improvement in the upstream pipeline's ceiling. CMS, ULL,
and Theta share the same hot-path structure and benefit from the same
optimizations.
Why the BEAM-side hot path is slow
For an HLL with p = 14 (16,384 registers), the steady-state per-item
work is:
- Encode the item as a binary (
term_to_binaryif not already binary). - Compute a 64-bit hash.
- Split the hash into bucket index (top
pbits) + remaining bits. - Count leading zeros in the remaining bits.
- Conditionally update one byte in the register array.
Steps 1–4 produce per-item heap garbage on the BEAM. Even the "optimal" pure Elixir path allocates a binary for the hash, a tuple for the bucket+rank pair, and a few intermediate small integers per item. At 1 M items the GC pressure dominates everything else, and the bench numbers in this document confirm it: Pure HLL caps at ~2 M items/sec while Rust HLL exceeds 30 M items/sec on the same hardware.
v0.8.0 hot-path architecture
There are four hot paths exposed by v0.8.0:
| # | Path | Backend | Hashing | Used by |
|---|---|---|---|---|
| 1 | Pure Elixir | Backend.Pure | Elixir (phash2 / xxh3 / murmur3) | NIF-less builds, BEAM-only test envs |
| 2 | Rust non-raw NIF | Backend.Rust | Elixir → pass hashes_bin to Rust | Pre-hashed inputs, :phash2, :custom |
| 3 | Rust raw NIF (XXH3) | Backend.Rust | Rust XXH3 inside the NIF | Default :xxhash3 path |
| 4 | Rust raw_h NIF (Murmur3) | Backend.Rust | Rust Murmur3 inside the NIF | :murmur3 path (new in v0.8.0) |
The four sketches that benefit from this architecture are HLL, ULL, Theta, CMS — the cardinality / point-count family that hashes every input item.
Path dispatch logic
Each of the four sketches' update_many/2 runs this decision tree:
if backend == Backend.Pure:
# Path 1: pure Elixir, batched
hashes = chunk |> Enum.map(&hash_item/2)
Backend.Pure.<sketch>_update_many(state, hashes, opts)
elif backend == Backend.Rust:
if opts[:hash_fn] != nil:
# Custom closure: must run on the BEAM. Path 2.
hashes = chunk |> Enum.map(&hash_item/2)
Backend.Rust.<sketch>_update_many(state, hashes, opts)
elif opts[:hash_strategy] == :phash2:
# phash2 not implemented in Rust. Path 2.
hashes = chunk |> Enum.map(&hash_item/2)
Backend.Rust.<sketch>_update_many(state, hashes, opts)
else:
# :xxhash3 or :murmur3 → Rust hashing. Path 3 or 4.
Backend.Rust.<sketch>_update_many_raw(state, chunk, opts)The <sketch>_update_many_raw/3 Elixir function then dispatches at the
Rust-call level:
:xxhash3→ legacy_raw_nif(binary-compatible with v0.7.1).:murmur3→ new_raw_h_nifwithalgorithm=2.- Both gain a Dirty-CPU variant once the chunk size exceeds the configured threshold (default 10,000 items).
Why chunking matters
Items arrive at update_many/2 as an arbitrary enumerable. The
implementation chunks them at 10,000 items per chunk. This serves three
purposes:
- Bounded per-call cost — even with 100M items, each NIF call stays bounded to ~30ms wall-clock, well within Erlang's 1 ms reduction budget for the dirty scheduler.
- Steady allocation — a 10,000-item
ListIteratoris small enough to fit in L2 cache on most platforms. - Predictable scheduler behavior — the chunk size is the unit compared against the configured dirty threshold to choose normal vs dirty NIF.
Why update_many is the only acceleration target
The update/2 (single-item) path is intentionally NOT NIF-accelerated.
A single NIF call has fixed overhead of ~200 ns; for a single item
update this dominates the actual register write (~30 ns). Callers that
want per-item throughput either batch with update_many/2 or accept
the BEAM-side single-item path. This matches Apache DataSketches' own
single-item-vs-batch trade-off.
v0.8.0 Phase 3 additions
The Phase 3 work in v0.8.0 is:
- Generalized in-Rust hashing. v0.7.1 shipped a
_raw_niffamily that hardcoded XXH3. Phase 3 adds a parallel_raw_h_niffamily that accepts an algorithm byte (1 = XXH3,2 = Murmur3). The legacy XXH3-only NIFs stay so v0.7.x callers see zero regression. - End-to-end Murmur3 support.
HLL.new(hash_strategy: :murmur3)(and the same for ULL/Theta/CMS) now wires through to the new_raw_h_nifRust path. Murmur3 callers no longer fall off the fast path. Cross-language interop with Apache DataSketches becomes a one-line configuration change at sketch creation time. ExDataSketch.Hash.resolve_strategy/1— single source of truth for sketch constructors. Centralizes the user-opt → strategy mapping that all four sketches previously open-coded inconsistently.
What Phase 3 explicitly does NOT do (per the v0.8.0 prompt's non-goals):
- No SIMD intrinsics (deferred to v0.9 / v1.0).
- No 6-bit register packing (deferred — would require a sketch-family version bump and is incompatible with the existing 8-bit-register state binary).
- No sparse/dense layout redesign.
- No ARM-specific tuning.
- No raw-NIF migration for membership filters (Bloom/Cuckoo/Quotient/
CQF/XorFilter/IBLT) — those already cross the NIF boundary with
pre-hashed integers via
put_manyand would need a separate design pass.
Scheduler safety
Every hot-path NIF is paired with a Dirty-CPU variant:
hll_update_many_nif (normal) small batches
hll_update_many_dirty_nif (dirty) batches above threshold
hll_update_many_raw_nif (normal) small batches, XXH3 inside Rust
hll_update_many_raw_dirty_nif (dirty) batches above threshold
hll_update_many_raw_h_nif (normal) small batches, dispatched hash
hll_update_many_raw_h_dirty_nif (dirty) batches above thresholdThe dispatcher in Backend.Rust picks dirty when length(chunk) > threshold. Defaults are listed in lib/ex_data_sketch/backend/rust.ex
under @default_thresholds. They can be overridden per-call (opts[:dirty_threshold])
or globally (config :ex_data_sketch, :dirty_thresholds, %{...}).
See plans/hll_scheduler_safety.md for the full scheduler discussion.
Measured throughput (Phase 3 benchmark)
The full benchmark is bench/hll_hot_path_bench.exs. Below is a
representative run on Apple Silicon (M1 / OTP 27 / Elixir 1.19); your
hardware will vary. Units are items per second computed from the
ips * batch_size column.
| Path | 10,000 items | 100,000 items | 1,000,000 items |
|---|---|---|---|
| Pure Elixir phash2 | ~1.67 M/s | ~1.81 M/s | ~2.02 M/s |
| Pure Elixir xxhash3 | ~1.72 M/s | ~1.89 M/s | ~2.10 M/s |
| Rust raw XXH3 | ~25.8 M/s | ~29.4 M/s | ~34.3 M/s |
| Rust raw_h Murmur3 | ~23.8 M/s | ~27.8 M/s | ~31.1 M/s |
Headline numbers:
- Rust XXH3 vs Pure xxhash3: ~15× throughput.
- Rust XXH3 vs Pure phash2: ~17× throughput.
- Rust Murmur3 vs Rust XXH3: ~92% throughput. The 8% slowdown is the cost of Murmur3's full finalizer arithmetic, which is acceptable given Murmur3's interop value.
- Memory: Rust paths allocate ~10% less than Pure paths because no intermediate hashes list is materialized in Elixir.
How these numbers compare to external HLL implementations
We deliberately do NOT re-benchmark external HLL implementations in this release. Throughput numbers from upstream documentation:
- Apache DataSketches HLL (Java) — ~10–20 M items/sec for
p=14with Java's MurmurHash3. Reference: their JMH suite. hyperloglog-rs(Rust) — ~80–120 M items/sec forp=14with XxHash3 and SIMD. Reference: crate README.axiomhq/hyperloglog-go— ~25 M items/sec forp=14. Reference: the Axiom blog post that accompanied the library release.- RedisBloom HLL — single-thread bound by the Redis event loop; upstream throughput is on the order of 1 M items/sec including network round-trip.
Our position in the stack: v0.8.0's Rust HLL is within 2–3× of the fastest in-process Rust implementation (which uses SIMD and a custom packed register layout), substantially ahead of Java and Go for the single-threaded case, and ~30× ahead of RedisBloom when network is included.
The 2–3× headroom against hyperloglog-rs is deliberate: closing
that gap requires SIMD-tuned register updates and 6-bit packing, both
of which are deferred to v0.9 / v1.0 per the v0.8.0 prompt's scope.
Reproducing these numbers
EX_DATA_SKETCH_BUILD=1 MIX_ENV=dev mix run bench/hll_hot_path_bench.exs
Outputs JSON to bench/output/hll_hot_path_bench.json for further
analysis.
References
lib/ex_data_sketch/hll.ex,lib/ex_data_sketch/ull.ex,lib/ex_data_sketch/theta.ex,lib/ex_data_sketch/cms.ex— high- levelupdate_many/2dispatch.lib/ex_data_sketch/backend/rust.ex— Rust-side dispatch (hll_update_many_raw/3and friends).native/ex_data_sketch_nif/src/hll.rs— Rust HLL implementation.native/ex_data_sketch_nif/src/ull.rs,native/ex_data_sketch_nif/src/theta.rs,native/ex_data_sketch_nif/src/cms.rs— sibling sketches.bench/hll_hot_path_bench.exs— this document's measurement source.plans/hll_scheduler_safety.md— companion scheduler discussion.- Heule et al. "HyperLogLog in Practice." Google, 2013.