All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Unreleased
0.2.0 - 2026-05-17
Adds the first concurrent insert path for UltraLogLog: a lock-free,
:atomics-backed sketch safe to insert into from any process or
scheduler. The immutable UltraLogLog from v0.1 is unchanged.
Added
UltraLogLog.Concurrent— lock-free,:atomics-backed concurrent insert path. One 64-bit:atomicscell per register; inserts run a CAS retry loop applyingEncoding.merge_registers/2to the observed cell value. Safe to call from any process or scheduler with no GenServer, no lock, and no message passing.add/2returns:ok(side-effecting on shared state, not value transformation).UltraLogLog.Concurrent.snapshot/1— materializes the immutable%UltraLogLog{}from a concurrent sketch for estimation, serialization, or merge. Snapshot reads each cell independently rather than taking a global lock; a snapshot taken during active writes is a valid intermediate sketch by monotonicity of register merge. The returned%UltraLogLog{}hasmartingale: nilbecause the concurrent path does not maintain Algorithm 2's per-insert update history; FGRA and MLE work normally.- Benchee benchmark suite at
bench/concurrent_bench.exs: single-process throughput, multi-process scaling at N ∈ {1, 2, 4, 8, 16, 32}, immutable vs concurrent single-process comparison, andsnapshot/1cost at p ∈ {10, 12, 14, 16}.
Validation
- Equivalence under contention — a concurrent sketch built by N
parallel processes is bit-for-bit identical to the serial sketch
of the same elements, asserted on the register state via exact
binary equality at
p ∈ {10, 12, 14}×N ∈ {1k, 10k, 100k}. The CAS loop is correct becausemerge_registers/2is monotone, commutative, associative, and idempotent. - Stress — the highest-contention case (p=10, N=10k) was run 20× with independent seeds, 20/20 green.
- Idempotence under contention — double-inserting an element set in parallel produces the same registers as inserting it once (CRDT idempotence under concurrency).
- StreamData property test for arbitrary
(p, N, seed, procs)combinations, tagged:statistical. - Baseline — full Benchee output captured at
docs/measurements/concurrent-v0.2.txt.
Notes
- Single-process performance: the concurrent path is ~1.08×
faster than the immutable path even with one process (25.4 ips
vs 23.6 ips on 100k inserts at p=12, Apple M5). The immutable
add/2rebuilds a binary on every register-changing insert (an O(m) head/byte/tail copy); the concurrent path does two constant-time:atomicsoperations. Use the immutable form for value semantics (snapshots, serialization, CRDT merge); use the concurrent form for shared mutable sketches at high insert rates. - Multi-process scaling: monotone from 1 → ~4.35× at 32 processes on a 10-core machine, flattening past core count as expected for a CPU-bound workload. See the README for the full table.
- Snapshot semantics:
snapshot/1is not globally atomic across cells. By design — register merge is monotone in the UltraLogLog partial order, so a "torn" snapshot is always a valid intermediate sketch, never an invalid one. Quiesce writers (e.g.Task.await_many/2on insert tasks) for a globally consistent snapshot. - Memory: the active concurrent structure costs 8× the
immutable form (one 64-bit
:atomicscell per register; 128 KB at p=14 vs 16 KB). Deliberate — concurrent sketches are long-lived shared objects, not millions-of-instances. Packing multiple registers per cell would create logical false sharing and amplify contention. - Build environment:
bencheewas moved toonly: [:dev, :test]to route around the same OTP 28:hypercompile failure that affectsmix dialyzer,mix docs, andmix hex.publish. Benchmarks therefore run underMIX_ENV=test mix run bench/concurrent_bench.exs. Tracked in issue #2. - Not in v0.2: no native hash NIF, no
PartitionSupervisorsharding, no changes to the v0.1 immutable sketch, encoding, or estimators. - Downsize: precision-reducing
downsize/2(target_p < p) remains unimplemented; it was tentatively scoped to v0.2 in the v0.1.0 notes but deferred. No timeline is promised. - CI: still not configured. The v0.1.0 notes anticipated GitHub Actions for v0.2; deferred. Tracked for a future release.
0.1.0 - 2026-05-15
First public release. Implements the immutable UltraLogLog sketch with all three estimators from the paper, bit-exact validated against the Hash4j v0.17.0 Java reference.
Added
- Core encoding from Ertl 2024 §3 (
pack/unpack/encode/merge_registers), ported bit-exact from Hash4j v0.17.0 — the versiondynatrace-research/ultraloglog-paperpins as its reproducibility ground truth. - FGRA estimator (
UltraLogLog.Estimator.FGRA) — Algorithm 6 from the paper, paper-faithful inline implementation (no lookup tables). Computesg(r),λₚ, and the small/large-range corrections (σ,ψ,φ) directly from the paper's equations. - MLE estimator (
UltraLogLog.Estimator.MLE) — secant solver from Ertl 2017 Algorithm 8 over the log-likelihood, Jensen lower-bound initial guess, inline Taylor + doubling recurrence forh(x)(no:math.expin the inner loop), first-order bias correction per paper eq. (11). - Martingale estimator (
UltraLogLog.Estimator.Martingale) — HIP estimation per Algorithm 2; constant-timeμ ← μ − h(r) + h(r')update per insert. Themartingalefield of%UltraLogLog{}carries{estimate, μ}for active sketches andnilafter invalidation bymerge/2orfrom_binary/1. - CRDT merge (
UltraLogLog.merge/2) — element-wise on registers under the ULL partial order; commutative, associative, idempotent. - Binary serialization (
UltraLogLog.to_binary/1/UltraLogLog.from_binary/1) — versioned compact format (<<"ULL1", precision::8, registers::binary>>). - Downsize (
UltraLogLog.downsize/2) — acceptstarget_p == p(returns the sketch unchanged); fortarget_p < praises pending v0.2 implementation (paper §5).
Validation
- 16 byte-for-byte reference vectors against Hash4j v0.17.0 (p ∈ {8, 10, 12, 14} × n ∈ {100, 1k, 10k, 100k}).
- 222 encoding-vector cases at p ∈ {3, 8, 10, 12, 14, 26}.
- FGRA / MLE / martingale spot-check vs Hash4j within IEEE 754 noise: max relative error 8.4e-16, 1.3e-16, and 0.0 respectively.
- Statistical correctness at p ∈ {10, 12, 14}, N ∈ {100, 1k, 10k, 100k, 1M}, 30 trials per cell; worst-case empirical bias 0.566% against a 3σ bound of 1.127% (martingale, p=10, N=10⁶).
- Property tests for the algebraic laws of
merge_registers/2(commutativity, associativity, idempotence) on the reachable subset of register bytes. - Convergence-speed tests for the MLE secant solver: mean 3.05–3.79 iterations per query across 300 random sketches per precision, max 6 iterations.
- Full empirical baselines committed under
docs/measurements/in the repository.
Notes
- Paper section numbering: this implementation cites the conference PVLDB PDF. The arXiv extended version uses different numbers for the same content (FGRA is §3.3 / Alg. 6 in the conference PDF; §4.1 in the arXiv extended version).
- Register encoding: nonzero registers carry a
+4p − 8shift relative to the paper's bare encoding so byte values saturate at 255 regardless of precision. Documented in Ertl 2024 §4 as a Hash4j choice; reproduced here for bit-exact compatibility. - Hashing:
UltraLogLog.Hash.hash64/1uses a:erlang.phash2/2derivation. Adequate for well-distributed inputs; production workloads with adversarial or skewed keyspaces should pass pre-computed hashes from a quality 64-bit function. A native xxhash3 NIF is planned for v0.2. - Dialyzer runs under
MIX_ENV=testto avoid a pre-existing OTP 27+ float-match warning in the:hyperbenchmark comparison dependency (planned for v0.2 benchmarks; not yet exercised in v0.1). The warning has no effect on this package's code. - No CI is configured in v0.1. v0.2 will add GitHub Actions.
- Concurrent and cluster paths are deferred: a lock-free
:atomics-backed insert path is planned for v0.2, and aPartitionSupervisor-sharded cluster-wide merge is planned for v0.3. No skeleton modules ship in v0.1 — these will land as complete commits when their respective releases are ready.