ExDataSketch.DataSketches.CompactSketch (ExDataSketch v0.8.0)


Apache DataSketches CompactSketch binary codec for Theta sketches.

This module encodes and decodes the CompactSketch binary format used by Apache DataSketches (Java, C++, Python) for cross-language interoperability.

Hash Semantics

ExDataSketch uses ExDataSketch.Hash.hash64/1 (:erlang.phash2 + Murmur finalization), while DataSketches uses MurmurHash3_x64_128. These hash functions are not cross-compatible: the same input string produces different hash values under each. Interop works at the binary level: a serialized sketch carries pre-computed hash values, so it can be deserialized and merged regardless of which hash function originally produced those values.

Seed Hash

The seed hash is a 16-bit checksum derived from the hash function's seed. It guards against merging sketches that were built with different seeds. The default seed is 9001, matching the DataSketches default. The seed hash is computed using MurmurHash3_x64_128 (see ExDataSketch.DataSketches.Murmur3).
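As a rough illustration of the check described above, the sketch below reads the stored seed hash from a serialized binary and compares it with a value the caller expects. The module name `SeedHashCheckExample`, its function names, and the error atoms are hypothetical and not part of ExDataSketch. The byte position (offset 6, u16 little-endian) is taken from the Binary Layout section. Computing the expected value from a seed is left to the library.

```elixir
defmodule SeedHashCheckExample do
  @moduledoc false
  # Hypothetical helper, not the ExDataSketch implementation.

  # The seed hash occupies bytes 6..7 of the preamble (u16 little-endian).
  def stored_seed_hash(<<_::binary-size(6), seed_hash::unsigned-little-16, _::binary>>),
    do: {:ok, seed_hash}

  def stored_seed_hash(_too_short), do: {:error, :truncated_preamble}

  # Passing nil skips the comparison and only checks that the field is present.
  def check(binary, nil) do
    with {:ok, _stored} <- stored_seed_hash(binary), do: :ok
  end

  # `expected` is the 16-bit seed hash the caller derived from its own seed.
  def check(binary, expected) when is_integer(expected) do
    case stored_seed_hash(binary) do
      {:ok, ^expected} -> :ok
      {:ok, found} -> {:error, {:seed_hash_mismatch, [expected: expected, found: found]}}
      {:error, _reason} = error -> error
    end
  end
end
```

A mismatch is reported before any entries are read, which is the point of the checksum: entries from sketches built with different seeds are not comparable, so merging them should fail early.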

Supported Features

  • Compact format only: This codec reads and writes the compact, ordered representation. Non-compact (hash table) sketches are rejected.
  • Little-endian only: Big-endian sketches (flag bit 0 set) are rejected.
  • All modes: empty, single-item, exact, and estimation modes are supported.
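The restrictions above can be pictured as a small check on the flags byte. This is a minimal sketch, not the codec's actual code: the module `CompactFlagsExample` and its return atoms are hypothetical, and apart from bit 0 (big-endian, stated above) the bit positions are assumed from the Apache DataSketches preamble definition rather than taken from ExDataSketch.

```elixir
defmodule CompactFlagsExample do
  @moduledoc false
  import Bitwise

  # Bit masks assumed from Apache DataSketches. The read-only (0x02) and
  # ordered (0x10) bits also exist there but are not inspected in this sketch.
  @big_endian 0x01
  @empty 0x04
  @compact 0x08
  @single_item 0x20

  # Classifies a flags byte, rejecting the layouts this codec does not support.
  def classify(flags) when is_integer(flags) and flags in 0..255 do
    cond do
      (flags &&& @big_endian) != 0 -> {:error, :big_endian_not_supported}
      (flags &&& @compact) == 0 -> {:error, :not_compact}
      (flags &&& @empty) != 0 -> {:ok, :empty}
      (flags &&& @single_item) != 0 -> {:ok, :single_item}
      # Exact and estimation modes share the same flags; they differ in
      # whether a theta value is stored in the preamble.
      true -> {:ok, :exact_or_estimation}
    end
  end
end
```

For example, `CompactFlagsExample.classify(0x19)` is rejected because bit 0 is set, while `CompactFlagsExample.classify(0x1A)` passes as a compact, little-endian sketch.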

Binary Layout

Variable-length preamble (1, 2, or 3 longs of 8 bytes):

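| Offset (bytes) | Size (bytes) | Field |
|----------------|--------------|-------|
| 0 | 1 | Preamble longs (1, 2, or 3) |
| 1 | 1 | Serial version (3) |
| 2 | 1 | Family ID (3 = CompactSketch) |
| 3 | 1 | lgNomLongs (log2 of k) |
| 4 | 1 | lgArrLongs (0 for compact) |
| 5 | 1 | Flags |
| 6 | 2 | Seed hash (u16-le) |
| 8 | 4 | Retained entry count (u32-le, preamble ≥ 2) |
| 12 | 4 | Padding (preamble ≥ 2) |
| 16 | 8 | Theta (u64-le, preamble == 3) |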

After preamble: entries as u64 little-endian values.
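The layout maps naturally onto Elixir binary pattern matching. The following is a minimal sketch that follows the field offsets listed above; it is not the library's decoder. The module `CompactPreambleExample`, the keys of the returned map, and the error atoms are all hypothetical, and it performs no flag or seed hash validation. When theta is not stored (preamble of 1 or 2 longs) it is returned as `nil` rather than guessing the implied value.

```elixir
defmodule CompactPreambleExample do
  @moduledoc false
  # Hypothetical parser for illustration only.

  # Bytes 0..7 are present in every sketch.
  def parse(<<pre_longs, ser_ver, family, lg_nom, lg_arr, flags,
              seed_hash::unsigned-little-16, rest::binary>>)
      when pre_longs in 1..3 do
    header = %{
      preamble_longs: pre_longs,
      serial_version: ser_ver,
      family_id: family,
      lg_nom_longs: lg_nom,
      lg_arr_longs: lg_arr,
      flags: flags,
      seed_hash: seed_hash
    }

    parse_rest(pre_longs, rest, header)
  end

  def parse(_other), do: {:error, :invalid_preamble}

  # 1 long: neither an entry count nor theta is stored.
  defp parse_rest(1, rest, header), do: finish(header, nil, nil, rest)

  # 2 longs: entry count (u32-le) followed by 4 bytes of padding.
  defp parse_rest(2, <<count::unsigned-little-32, _pad::32, rest::binary>>, header),
    do: finish(header, count, nil, rest)

  # 3 longs: entry count, padding, then theta (u64-le).
  defp parse_rest(3, <<count::unsigned-little-32, _pad::32,
                       theta::unsigned-little-64, rest::binary>>, header),
    do: finish(header, count, theta, rest)

  defp parse_rest(_pre_longs, _rest, _header), do: {:error, :truncated_preamble}

  # Everything after the preamble is a run of u64 little-endian entries.
  defp finish(header, count, theta, rest) do
    entries = for <<entry::unsigned-little-64 <- rest>>, do: entry

    cond do
      rem(byte_size(rest), 8) != 0 ->
        {:error, :trailing_bytes}

      count != nil and length(entries) != count ->
        {:error, :entry_count_mismatch}

      true ->
        {:ok, Map.merge(header, %{retained: length(entries), theta: theta, entries: entries})}
    end
  end
end
```

Matching on the preamble length first keeps each layout variant in its own function clause, so a truncated or inconsistent binary falls through to an error instead of being partially read.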

Summary

Functions

decode(binary, opts \\ [])
Decodes a DataSketches CompactSketch binary into sketch components.

encode(theta, encode_opts \\ [])
Encodes a Theta sketch into the DataSketches CompactSketch binary format.

Functions

decode(binary, opts \\ [])

@spec decode(
  binary(),
  keyword()
) :: {:ok, map()} | {:error, Exception.t()}

Decodes a DataSketches CompactSketch binary into sketch components.

Returns {:ok, %{k: k, theta: theta, entries: [u64], seed_hash: u16}} or {:error, %DeserializationError{}}.

Options

  • :seed - expected seed for seed hash verification (default: 9001). Pass nil to skip seed hash verification.
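A caller typically branches on the two documented result shapes. The helper below is a hypothetical sketch of that handling: `DecodeResultExample` and `describe/1` are illustrative names, not part of ExDataSketch, and the only thing assumed about the error value is that it is an exception struct, as the spec states.

```elixir
defmodule DecodeResultExample do
  @moduledoc false
  # Hypothetical helper that consumes the documented return values of decode/2.

  # Success: summarize the decoded components.
  def describe({:ok, %{k: k, theta: theta, entries: entries, seed_hash: seed_hash}}) do
    {:ok, %{k: k, theta: theta, retained: length(entries), seed_hash: seed_hash}}
  end

  # Failure: turn the exception struct into a readable message.
  def describe({:error, error}) when is_exception(error) do
    {:error, Exception.message(error)}
  end
end

# Intended use with the documented API (requires ExDataSketch as a dependency):
#
#   binary
#   |> ExDataSketch.DataSketches.CompactSketch.decode(seed: nil)
#   |> DecodeResultExample.describe()
```

Passing `seed: nil` as in the commented pipeline skips seed hash verification, which can be useful when inspecting a sketch whose seed is unknown. For merging, keeping the default verification is the safer choice.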

encode(theta, encode_opts \\ [])

@spec encode(
  ExDataSketch.Theta.t(),
  keyword()
) :: binary()

Encodes a Theta sketch into the DataSketches CompactSketch binary format.

Options

  • :seed - seed for seed hash computation (default: 9001)