Ferricstore.TDigest.Core (ferricstore v0.3.4)

Copy Markdown View Source

Pure functional t-digest data structure for accurate on-line accumulation of rank-based statistics such as quantiles, trimmed means, and cumulative distribution values.

A t-digest compresses a potentially unbounded stream of floating-point observations into a compact set of centroids -- (mean, weight) pairs -- that can answer percentile queries with bounded memory.

Key properties:

  • Accuracy is best at the tails. P99, P99.9, and P0.1 estimates are more accurate than P50 estimates.
  • Constant memory. A t-digest with compression=100 holds roughly 100-300 centroids regardless of how many samples are added.
  • Mergeable. Two t-digests can be merged in O(c log c) time.
  • Streaming. Samples can be added one at a time with amortized O(1) cost per sample.

Scale function

Uses the k1 scale function from Dunning's paper:

k(q) = (delta / (2 * pi)) * arcsin(2q - 1)

Centroids near the tails (q near 0 or 1) are forced to be small (few observations), while centroids near the median can be large.

Usage

digest = Core.new(100)
digest = Core.add(digest, 42.0)
digest = Core.add_many(digest, [1.0, 2.0, 3.0])
Core.quantile(digest, 0.5)
Core.cdf(digest, 2.5)

Summary

Functions

Adds a single floating-point value to the digest.

Adds multiple floating-point values to the digest.

Returns the estimated value at a given rank position (0-based).

Returns the estimated value at a given reverse-rank position (0-based).

Estimates the cumulative distribution function value for a given observation.

Forces a compression pass, flushing the buffer into the centroids.

Returns metadata about the digest as a map.

Merges another t-digest into this one.

Merges multiple t-digests into a single digest with the given compression.

Creates a new empty t-digest with the given compression parameter.

Estimates the value at a given quantile position.

Estimates the number of observations less than the given value.

Resets the digest to empty, preserving its compression setting.

Estimates the number of observations greater than the given value.

Computes the trimmed mean between two quantile boundaries.

Types

centroid()

@type centroid() :: {float(), float()}

t()

@type t() :: %Ferricstore.TDigest.Core{
  buffer: [float()],
  buffer_size: non_neg_integer(),
  centroids: [centroid()],
  compression: pos_integer(),
  count: non_neg_integer(),
  max: float() | nil,
  min: float() | nil,
  total_compressions: non_neg_integer()
}

Functions

add(digest, value)

@spec add(t(), number()) :: t()

Adds a single floating-point value to the digest.

The value is buffered internally. When the buffer fills (capacity = ceil(compression * 3)), a compression pass runs automatically.

Parameters

  • digest - the t-digest struct
  • value - a float or integer value to add

Returns

The updated t-digest struct.

add_many(digest, values)

@spec add_many(t(), [number()]) :: t()

Adds multiple floating-point values to the digest.

More efficient than calling add/2 repeatedly as it batches the values into the buffer before triggering compression.

Parameters

  • digest - the t-digest struct
  • values - list of float or integer values

Returns

The updated t-digest struct.

by_rank(digest, r)

@spec by_rank(t(), integer()) :: float() | :inf | :"-inf" | :nan

Returns the estimated value at a given rank position (0-based).

Parameters

  • digest - the t-digest struct
  • r - the rank (0-based non-negative integer)

Returns

  • The estimated value at rank r
  • :"-inf" for rank < 0
  • :inf for rank >= count
  • :nan for empty digest

by_rev_rank(digest, r)

@spec by_rev_rank(t(), integer()) :: float() | :inf | :"-inf" | :nan

Returns the estimated value at a given reverse-rank position (0-based).

Rank 0 is the largest observation.

Parameters

  • digest - the t-digest struct
  • r - the reverse rank (0-based)

Returns

  • The estimated value at reverse rank r
  • :inf for r < 0
  • :"-inf" for r >= count
  • :nan for empty digest

cdf(digest, value)

@spec cdf(t(), float()) :: float() | :nan

Estimates the cumulative distribution function value for a given observation.

Returns the estimated fraction of observations less than or equal to value.

Parameters

  • digest - the t-digest struct
  • value - the query value

Returns

  • A float in [0.0, 1.0]
  • :nan if the digest is empty

compress(digest)

@spec compress(t()) :: t()

Forces a compression pass, flushing the buffer into the centroids.

This is called automatically when the buffer fills or when a query is issued. You can call it manually to force the merge.

Returns

The compressed t-digest struct with an empty buffer.

info(digest)

@spec info(t()) :: map()

Returns metadata about the digest as a map.

Returns

A map with keys: :compression, :capacity, :merged_nodes, :unmerged_nodes, :merged_weight, :unmerged_weight, :total_compressions, :memory_usage.

merge(a, b)

@spec merge(t(), t()) :: t()

Merges another t-digest into this one.

The resulting digest contains all observations from both digests.

Parameters

  • a - the destination t-digest
  • b - the source t-digest to merge in

Returns

A new t-digest containing all observations from both.

merge_many(digests, compression)

@spec merge_many([t()], pos_integer()) :: t()

Merges multiple t-digests into a single digest with the given compression.

Parameters

  • digests - list of t-digest structs to merge
  • compression - compression parameter for the result

Returns

A new merged t-digest.

new(compression \\ 100)

@spec new(pos_integer()) :: t()

Creates a new empty t-digest with the given compression parameter.

The compression parameter (delta) controls the accuracy/memory tradeoff. Higher values retain more centroids. Typical values: 50-500.

Parameters

  • compression - positive integer, default 100

Examples

iex> Ferricstore.TDigest.Core.new()
%Ferricstore.TDigest.Core{compression: 100}

iex> Ferricstore.TDigest.Core.new(200)
%Ferricstore.TDigest.Core{compression: 200}

quantile(digest, q)

@spec quantile(t(), float()) :: float() | :nan

Estimates the value at a given quantile position.

Parameters

  • digest - the t-digest struct
  • q - quantile in [0.0, 1.0]

Returns

  • The estimated value at quantile q
  • :nan if the digest is empty

rank(digest, value)

@spec rank(t(), number()) :: integer()

Estimates the number of observations less than the given value.

Parameters

  • digest - the t-digest struct
  • value - the query value

Returns

An integer estimate of the rank (number of values below value). Returns -1 if value is below the minimum observation. Returns -2 for empty digest.

reset(core)

@spec reset(t()) :: t()

Resets the digest to empty, preserving its compression setting.

Returns

A fresh empty t-digest with the same compression parameter.

rev_rank(digest, value)

@spec rev_rank(t(), number()) :: integer()

Estimates the number of observations greater than the given value.

Parameters

  • digest - the t-digest struct
  • value - the query value

Returns

An integer estimate of the reverse rank.

trimmed_mean(digest, lo, hi)

@spec trimmed_mean(t(), float(), float()) :: float() | :nan

Computes the trimmed mean between two quantile boundaries.

Returns the mean of observations between lo and hi quantiles, weighted by the fraction of each centroid that falls within the range.

Parameters

  • digest - the t-digest struct
  • lo - lower quantile boundary in [0.0, 1.0)
  • hi - upper quantile boundary in (0.0, 1.0]

Returns

  • The trimmed mean as a float
  • :nan if the digest is empty