UltraLogLog.Estimator.MLE (UltraLogLog v0.1.0)

Copy Markdown View Source

Maximum-likelihood cardinality estimator for UltraLogLog.

Asymptotic storage factor 8 · ln(2)/ζ(2, 5/4) ≈ 4.631, relative standard error √(ln(2)/ζ(2, 5/4)) ≈ 0.761/√m. Gives the full ~28% space reduction vs HLL (compared to FGRA's ~24%) at the cost of one iterative root-find per call.

What is being maximized

The log-likelihood under the Poisson model (Ertl 2024 §3.1, line 474):

ln L = -(n/m)·α + Σ_{u=1}^{w-1} β_u · ln(1  e^(n/(m·2^u)))

where α and β_u are closed-form weighted sums of register-value counts c_j (paper lines 480–492). Setting d/dn ln L = 0 gives the ML equation, which has the same shape as the corresponding HLL one — so the paper (line 503) reuses the secant solver from Ertl 2017 (arXiv:1702.01284, Algorithm 8).

After reparameterization that solver maximizes

e^(x·a) · _{k=0}^{n} (1  e^(x/2^k))^{b[k]}

i.e. finds the root of

g(x) := Σb[k] + a·x + Σ_{k} b[k] · h(x/2^k)

with h(x) := 1 − x/(e^x − 1). h is concave and monotonically increasing on (0, ∞), so g is monotonically increasing and has a unique root for any non-degenerate sketch.

Algorithm shape (paper-faithful, mirroring Hash4j v0.17.0)

  1. Map registers → (a, b[]): each register contributes to a and to up to three b[k] slots. The mapping (contribute/3 below) ports UltraLogLog.MaximumLikelihoodEstimator.contribute and follows the paper's α, β_u definitions.

  2. Initial guess: closed-form Jensen lower bound on the root (derivation in DistinctCountUtil.java lines 83–104). Provably ≤ root, so the secant method converges monotonically from below. We deliberately avoid using FGRA.estimate/1 as the initial guess even though the brief originally proposed it — Jensen's bound is coherent with the solver's state (no inter-estimator coupling) and monotonicity is a stronger guarantee.

  3. Secant iteration: at each step, evaluate g(x) and use the secant ratio (g − Σb) / (g_prev − g) to update the step size. Convergence threshold scales with precision: stop when |Δx|/x ≤ 0.001 · √v_ML / √m. Hash4j-tight tolerance — output matches MAXIMUM_LIKELIHOOD_ESTIMATOR to ~1e-12 relative.

  4. Inner-loop h(x) evaluation: avoid :math.exp and :math.log by computing h(2·x') for small x' ∈ [0, 0.25] via a degree-6 Taylor polynomial (x' − x'²/3 + x'⁴/45 − x'⁶/472.5, the Bernoulli expansion of 1 − 2x'/(e^(2x') − 1)) and walking up by the doubling recurrence h(2z) = (z/2 + h(z)(1−h(z))) / (z/2 + (1−h(z))) to the target x/2^k. No transcendental evaluations inside the loop.

  5. Bias correction: divide the secant output by 1 + 0.481.../m (paper eq. 11; the second-order delta-method correction).

Verification

Spot-checked against UltraLogLog.MAXIMUM_LIKELIHOOD_ESTIMATOR v0.17.0 on the same 16 register snapshots used for FGRA — see test/mle_test.exs for the tolerance and statistical-correctness battery, and test/mle_test.exs again for the convergence-speed test (mean iterations < 10, max < 50 across 100 random sketches).

Summary

Functions

Estimate cardinality from the current sketch state, per Ertl 2024 §3.1 (conference numbering; equivalent to arXiv §4.2).

Same as estimate/1, but also returns the secant iteration count. Used by test/mle_test.exs to assert convergence speed; not intended for production use.

Functions

estimate(ull)

@spec estimate(UltraLogLog.t()) :: float()

Estimate cardinality from the current sketch state, per Ertl 2024 §3.1 (conference numbering; equivalent to arXiv §4.2).

estimate_with_iterations(ultra_log_log)

@spec estimate_with_iterations(UltraLogLog.t()) :: {float(), non_neg_integer()}

Same as estimate/1, but also returns the secant iteration count. Used by test/mle_test.exs to assert convergence speed; not intended for production use.