HLL Hot-Path Optimization

Copy Markdown View Source

This document explains the HLL hot-path architecture introduced (and extended) by ex_data_sketch v0.8.0 Phase 3.

Why HLL throughput matters

HyperLogLog is the workhorse cardinality sketch. In production it is called once per stream event, so its update path runs on every record in the system. A 10x improvement in update_many/2 directly translates into a 10x improvement in the upstream pipeline's ceiling. CMS, ULL, and Theta share the same hot-path structure and benefit from the same optimizations.

Why the BEAM-side hot path is slow

For an HLL with p = 14 (16,384 registers), the steady-state per-item work is:

  1. Encode the item as a binary (term_to_binary if not already binary).
  2. Compute a 64-bit hash.
  3. Split the hash into bucket index (top p bits) + remaining bits.
  4. Count leading zeros in the remaining bits.
  5. Conditionally update one byte in the register array.

Steps 1–4 produce per-item heap garbage on the BEAM. Even the "optimal" pure Elixir path allocates a binary for the hash, a tuple for the bucket+rank pair, and a few intermediate small integers per item. At 1 M items the GC pressure dominates everything else, and the bench numbers in this document confirm it: Pure HLL caps at ~2 M items/sec while Rust HLL exceeds 30 M items/sec on the same hardware.

v0.8.0 hot-path architecture

There are four hot paths exposed by v0.8.0:

#PathBackendHashingUsed by
1Pure ElixirBackend.PureElixir (phash2 / xxh3 / murmur3)NIF-less builds, BEAM-only test envs
2Rust non-raw NIFBackend.RustElixir → pass hashes_bin to RustPre-hashed inputs, :phash2, :custom
3Rust raw NIF (XXH3)Backend.RustRust XXH3 inside the NIFDefault :xxhash3 path
4Rust raw_h NIF (Murmur3)Backend.RustRust Murmur3 inside the NIF:murmur3 path (new in v0.8.0)

The four sketches that benefit from this architecture are HLL, ULL, Theta, CMS — the cardinality / point-count family that hashes every input item.

Path dispatch logic

Each of the four sketches' update_many/2 runs this decision tree:

if backend == Backend.Pure:
    # Path 1: pure Elixir, batched
    hashes = chunk |> Enum.map(&hash_item/2)
    Backend.Pure.<sketch>_update_many(state, hashes, opts)

elif backend == Backend.Rust:
    if opts[:hash_fn] != nil:
        # Custom closure: must run on the BEAM. Path 2.
        hashes = chunk |> Enum.map(&hash_item/2)
        Backend.Rust.<sketch>_update_many(state, hashes, opts)

    elif opts[:hash_strategy] == :phash2:
        # phash2 not implemented in Rust. Path 2.
        hashes = chunk |> Enum.map(&hash_item/2)
        Backend.Rust.<sketch>_update_many(state, hashes, opts)

    else:
        # :xxhash3 or :murmur3 → Rust hashing. Path 3 or 4.
        Backend.Rust.<sketch>_update_many_raw(state, chunk, opts)

The <sketch>_update_many_raw/3 Elixir function then dispatches at the Rust-call level:

  • :xxhash3 → legacy _raw_nif (binary-compatible with v0.7.1).
  • :murmur3 → new _raw_h_nif with algorithm=2.
  • Both gain a Dirty-CPU variant once the chunk size exceeds the configured threshold (default 10,000 items).

Why chunking matters

Items arrive at update_many/2 as an arbitrary enumerable. The implementation chunks them at 10,000 items per chunk. This serves three purposes:

  1. Bounded per-call cost — even with 100M items, each NIF call stays bounded to ~30ms wall-clock, well within Erlang's 1 ms reduction budget for the dirty scheduler.
  2. Steady allocation — a 10,000-item ListIterator is small enough to fit in L2 cache on most platforms.
  3. Predictable scheduler behavior — the chunk size is the unit compared against the configured dirty threshold to choose normal vs dirty NIF.

Why update_many is the only acceleration target

The update/2 (single-item) path is intentionally NOT NIF-accelerated. A single NIF call has fixed overhead of ~200 ns; for a single item update this dominates the actual register write (~30 ns). Callers that want per-item throughput either batch with update_many/2 or accept the BEAM-side single-item path. This matches Apache DataSketches' own single-item-vs-batch trade-off.

v0.8.0 Phase 3 additions

The Phase 3 work in v0.8.0 is:

  1. Generalized in-Rust hashing. v0.7.1 shipped a _raw_nif family that hardcoded XXH3. Phase 3 adds a parallel _raw_h_nif family that accepts an algorithm byte (1 = XXH3, 2 = Murmur3). The legacy XXH3-only NIFs stay so v0.7.x callers see zero regression.
  2. End-to-end Murmur3 support. HLL.new(hash_strategy: :murmur3) (and the same for ULL/Theta/CMS) now wires through to the new _raw_h_nif Rust path. Murmur3 callers no longer fall off the fast path. Cross-language interop with Apache DataSketches becomes a one-line configuration change at sketch creation time.
  3. ExDataSketch.Hash.resolve_strategy/1 — single source of truth for sketch constructors. Centralizes the user-opt → strategy mapping that all four sketches previously open-coded inconsistently.

What Phase 3 explicitly does NOT do (per the v0.8.0 prompt's non-goals):

  • No SIMD intrinsics (deferred to v0.9 / v1.0).
  • No 6-bit register packing (deferred — would require a sketch-family version bump and is incompatible with the existing 8-bit-register state binary).
  • No sparse/dense layout redesign.
  • No ARM-specific tuning.
  • No raw-NIF migration for membership filters (Bloom/Cuckoo/Quotient/ CQF/XorFilter/IBLT) — those already cross the NIF boundary with pre-hashed integers via put_many and would need a separate design pass.

Scheduler safety

Every hot-path NIF is paired with a Dirty-CPU variant:

hll_update_many_nif            (normal)   small batches
hll_update_many_dirty_nif      (dirty)    batches above threshold

hll_update_many_raw_nif        (normal)   small batches, XXH3 inside Rust
hll_update_many_raw_dirty_nif  (dirty)    batches above threshold

hll_update_many_raw_h_nif      (normal)   small batches, dispatched hash
hll_update_many_raw_h_dirty_nif (dirty)   batches above threshold

The dispatcher in Backend.Rust picks dirty when length(chunk) > threshold. Defaults are listed in lib/ex_data_sketch/backend/rust.ex under @default_thresholds. They can be overridden per-call (opts[:dirty_threshold]) or globally (config :ex_data_sketch, :dirty_thresholds, %{...}).

See plans/hll_scheduler_safety.md for the full scheduler discussion.

Measured throughput (Phase 3 benchmark)

The full benchmark is bench/hll_hot_path_bench.exs. Below is a representative run on Apple Silicon (M1 / OTP 27 / Elixir 1.19); your hardware will vary. Units are items per second computed from the ips * batch_size column.

Path10,000 items100,000 items1,000,000 items
Pure Elixir phash2~1.67 M/s~1.81 M/s~2.02 M/s
Pure Elixir xxhash3~1.72 M/s~1.89 M/s~2.10 M/s
Rust raw XXH3~25.8 M/s~29.4 M/s~34.3 M/s
Rust raw_h Murmur3~23.8 M/s~27.8 M/s~31.1 M/s

Headline numbers:

  • Rust XXH3 vs Pure xxhash3: ~15× throughput.
  • Rust XXH3 vs Pure phash2: ~17× throughput.
  • Rust Murmur3 vs Rust XXH3: ~92% throughput. The 8% slowdown is the cost of Murmur3's full finalizer arithmetic, which is acceptable given Murmur3's interop value.
  • Memory: Rust paths allocate ~10% less than Pure paths because no intermediate hashes list is materialized in Elixir.

How these numbers compare to external HLL implementations

We deliberately do NOT re-benchmark external HLL implementations in this release. Throughput numbers from upstream documentation:

  • Apache DataSketches HLL (Java) — ~10–20 M items/sec for p=14 with Java's MurmurHash3. Reference: their JMH suite.
  • hyperloglog-rs (Rust) — ~80–120 M items/sec for p=14 with XxHash3 and SIMD. Reference: crate README.
  • axiomhq/hyperloglog-go — ~25 M items/sec for p=14. Reference: the Axiom blog post that accompanied the library release.
  • RedisBloom HLL — single-thread bound by the Redis event loop; upstream throughput is on the order of 1 M items/sec including network round-trip.

Our position in the stack: v0.8.0's Rust HLL is within 2–3× of the fastest in-process Rust implementation (which uses SIMD and a custom packed register layout), substantially ahead of Java and Go for the single-threaded case, and ~30× ahead of RedisBloom when network is included.

The 2–3× headroom against hyperloglog-rs is deliberate: closing that gap requires SIMD-tuned register updates and 6-bit packing, both of which are deferred to v0.9 / v1.0 per the v0.8.0 prompt's scope.

Reproducing these numbers

EX_DATA_SKETCH_BUILD=1 MIX_ENV=dev mix run bench/hll_hot_path_bench.exs

Outputs JSON to bench/output/hll_hot_path_bench.json for further analysis.

References

  • lib/ex_data_sketch/hll.ex, lib/ex_data_sketch/ull.ex, lib/ex_data_sketch/theta.ex, lib/ex_data_sketch/cms.ex — high- level update_many/2 dispatch.
  • lib/ex_data_sketch/backend/rust.ex — Rust-side dispatch (hll_update_many_raw/3 and friends).
  • native/ex_data_sketch_nif/src/hll.rs — Rust HLL implementation.
  • native/ex_data_sketch_nif/src/ull.rs, native/ex_data_sketch_nif/src/theta.rs, native/ex_data_sketch_nif/src/cms.rs — sibling sketches.
  • bench/hll_hot_path_bench.exs — this document's measurement source.
  • plans/hll_scheduler_safety.md — companion scheduler discussion.
  • Heule et al. "HyperLogLog in Practice." Google, 2013.