All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

0.1.0 - 2026-05-15

First public release. Implements the immutable UltraLogLog sketch with all three estimators from the paper, bit-exact validated against the Hash4j v0.17.0 Java reference.

Added

  • Core encoding from Ertl 2024 §3 (pack / unpack / encode / merge_registers), ported bit-exact from Hash4j v0.17.0 — the version dynatrace-research/ultraloglog-paper pins as its reproducibility ground truth.
  • FGRA estimator (UltraLogLog.Estimator.FGRA) — Algorithm 6 from the paper, paper-faithful inline implementation (no lookup tables). Computes g(r), λₚ, and the small/large-range corrections (σ, ψ, φ) directly from the paper's equations.
  • MLE estimator (UltraLogLog.Estimator.MLE) — secant solver from Ertl 2017 Algorithm 8 over the log-likelihood, Jensen lower-bound initial guess, inline Taylor + doubling recurrence for h(x) (no :math.exp in the inner loop), first-order bias correction per paper eq. (11).
  • Martingale estimator (UltraLogLog.Estimator.Martingale) — HIP estimation per Algorithm 2; constant-time μ ← μ − h(r) + h(r') update per insert. The martingale field of %UltraLogLog{} carries {estimate, μ} for active sketches and nil after invalidation by merge/2 or from_binary/1.
  • CRDT merge (UltraLogLog.merge/2) — element-wise on registers under the ULL partial order; commutative, associative, idempotent.
  • Binary serialization (UltraLogLog.to_binary/1 / UltraLogLog.from_binary/1) — versioned compact format (<<"ULL1", precision::8, registers::binary>>).
  • Downsize (UltraLogLog.downsize/2) — accepts target_p == p (returns the sketch unchanged); for target_p < p raises pending v0.2 implementation (paper §5).

Validation

  • 16 byte-for-byte reference vectors against Hash4j v0.17.0 (p ∈ {8, 10, 12, 14} × n ∈ {100, 1k, 10k, 100k}).
  • 222 encoding-vector cases at p ∈ {3, 8, 10, 12, 14, 26}.
  • FGRA / MLE / martingale spot-check vs Hash4j within IEEE 754 noise: max relative error 8.4e-16, 1.3e-16, and 0.0 respectively.
  • Statistical correctness at p ∈ {10, 12, 14}, N ∈ {100, 1k, 10k, 100k, 1M}, 30 trials per cell; worst-case empirical bias 0.566% against a 3σ bound of 1.127% (martingale, p=10, N=10⁶).
  • Property tests for the algebraic laws of merge_registers/2 (commutativity, associativity, idempotence) on the reachable subset of register bytes.
  • Convergence-speed tests for the MLE secant solver: mean 3.05–3.79 iterations per query across 300 random sketches per precision, max 6 iterations.
  • Full empirical baselines committed under docs/measurements/ in the repository.

Notes

  • Paper section numbering: this implementation cites the conference PVLDB PDF. The arXiv extended version uses different numbers for the same content (FGRA is §3.3 / Alg. 6 in the conference PDF; §4.1 in the arXiv extended version).
  • Register encoding: nonzero registers carry a +4p − 8 shift relative to the paper's bare encoding so byte values saturate at 255 regardless of precision. Documented in Ertl 2024 §4 as a Hash4j choice; reproduced here for bit-exact compatibility.
  • Hashing: UltraLogLog.Hash.hash64/1 uses a :erlang.phash2/2 derivation. Adequate for well-distributed inputs; production workloads with adversarial or skewed keyspaces should pass pre-computed hashes from a quality 64-bit function. A native xxhash3 NIF is planned for v0.2.
  • Dialyzer runs under MIX_ENV=test to avoid a pre-existing OTP 27+ float-match warning in the :hyper benchmark comparison dependency (planned for v0.2 benchmarks; not yet exercised in v0.1). The warning has no effect on this package's code.
  • No CI is configured in v0.1. v0.2 will add GitHub Actions.
  • Concurrent and cluster paths are deferred: a lock-free :atomics-backed insert path is planned for v0.2, and a PartitionSupervisor-sharded cluster-wide merge is planned for v0.3. No skeleton modules ship in v0.1 — these will land as complete commits when their respective releases are ready.