Apache DataSketches CompactSketch binary codec for Theta sketches.
This module encodes and decodes the CompactSketch binary format used by Apache DataSketches (Java, C++, Python) for cross-language interoperability.
Hash Semantics
ExDataSketch uses ExDataSketch.Hash.hash64/1 (:erlang.phash2 + Murmur
finalization), while DataSketches uses MurmurHash3_x64_128. The two hash
functions are not compatible: the same input string produces different
hash values under each. Interop therefore works at the binary level: serialized
sketches contain pre-computed hash values, so they can be deserialized and
merged regardless of which hash function originally produced them.
Seed Hash
The seed hash is a 16-bit checksum derived from the hash function's seed.
It prevents merging sketches that used different hash functions/seeds.
The default seed is 9001, matching the DataSketches default. The seed hash
is computed using MurmurHash3_x64_128 (see ExDataSketch.DataSketches.Murmur3).
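As a rough illustration of how a 16-bit seed hash can be derived, the sketch below hashes the seed as a single 8-byte little-endian value with MurmurHash3_x64_128 (hash seed 0) and keeps the low 16 bits of the first 64-bit word. That derivation is an assumption based on the upstream DataSketches libraries, and the module name `SeedHashSketch` is hypothetical; ExDataSketch.DataSketches.Murmur3 remains the authoritative implementation.

```elixir
defmodule SeedHashSketch do
  # Hypothetical helper for illustration; not part of ExDataSketch.
  import Bitwise

  @mask64 0xFFFFFFFFFFFFFFFF
  @c1 0x87C37B91114253D5
  @c2 0x4CF5AD432745937F

  # Assumption: the seed is hashed as one 8-byte value with hash seed 0,
  # and the seed hash is the low 16 bits of the first output word.
  def seed_hash(seed) when is_integer(seed) and seed >= 0 do
    # Tail mixing: an 8-byte input has no full 16-byte block, only k1.
    k1 = (seed &&& @mask64) |> mul(@c1) |> rotl(31) |> mul(@c2)
    h1 = bxor(0, k1)
    h2 = 0

    # Finalization: fold in the input length (8 bytes), then mix.
    h1 = bxor(h1, 8)
    h2 = bxor(h2, 8)
    h1 = add(h1, h2)
    h2 = add(h2, h1)
    h1 = fmix(h1)
    h2 = fmix(h2)
    h1 = add(h1, h2)

    h1 &&& 0xFFFF
  end

  # 64-bit wrapping arithmetic on Elixir's arbitrary-precision integers.
  defp mul(a, b), do: (a * b) &&& @mask64
  defp add(a, b), do: (a + b) &&& @mask64
  defp rotl(x, r), do: ((x <<< r) ||| (x >>> (64 - r))) &&& @mask64

  # MurmurHash3 64-bit finalizer.
  defp fmix(k) do
    k = bxor(k, k >>> 33)
    k = mul(k, 0xFF51AFD7ED558CCD)
    k = bxor(k, k >>> 33)
    k = mul(k, 0xC4CEB9FE1A85EC53)
    bxor(k, k >>> 33)
  end
end

# Seed hash for the default seed, as a 16-bit integer.
IO.inspect(SeedHashSketch.seed_hash(9001))
```

Two sketches built with different seeds will almost always carry different values in this field, which is what allows a decoder to reject a mismatch before merging.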
Supported Features
- Compact format only: This codec reads and writes the compact, ordered representation. Non-compact (hash table) sketches are rejected.
- Little-endian only: Big-endian sketches (flag bit 0 set) are rejected.
- All modes: empty, single-item, exact, and estimation modes are supported.
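The two rejection rules above can be sketched as a check on the flags byte. Bit 0 as the big-endian marker comes from the list above; treating bit 3 as the compact marker is an assumption taken from the upstream DataSketches flag layout. The module name `FlagCheckSketch` and the error atoms are hypothetical.

```elixir
defmodule FlagCheckSketch do
  # Hypothetical helper for illustration; not part of ExDataSketch.
  import Bitwise

  # Bit 0 marks a big-endian sketch (stated in the feature list).
  @big_endian 0x01
  # Assumption: bit 3 marks the compact form, as in upstream DataSketches.
  @compact 0x08

  def check(flags) when is_integer(flags) do
    cond do
      (flags &&& @big_endian) != 0 -> {:error, :big_endian_unsupported}
      (flags &&& @compact) == 0 -> {:error, :not_compact}
      true -> :ok
    end
  end
end

IO.inspect(FlagCheckSketch.check(0x1A))  # prints :ok
IO.inspect(FlagCheckSketch.check(0x1B))  # prints {:error, :big_endian_unsupported}
```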
Binary Layout
Variable-length preamble (1, 2, or 3 longs of 8 bytes):
| Offset | Size | Field |
|---|---|---|
| 0 | 1 | Preamble longs (1, 2, or 3) |
| 1 | 1 | Serial version (3) |
| 2 | 1 | Family ID (3 = CompactSketch) |
| 3 | 1 | lgNomLongs (log2 of k) |
| 4 | 1 | lgArrLongs (0 for compact) |
| 5 | 1 | Flags |
| 6 | 2 | Seed hash (u16-le) |
| 8 | 4 | Retained entry count (u32-le, preamble ≥ 2) |
| 12 | 4 | Padding (preamble ≥ 2) |
| 16 | 8 | Theta (u64-le, preamble == 3) |
After preamble: entries as u64 little-endian values.
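The layout above maps naturally onto Elixir binary pattern matching. The following is a minimal sketch of such a parser, not the codec's actual implementation: the module name `PreambleSketch`, the result keys, and the error atoms are hypothetical. For a 1-long preamble the table gives no entry count, so the sketch assumes any trailing 8-byte values are the entries, and it returns `theta: nil` whenever the theta field is absent.

```elixir
defmodule PreambleSketch do
  # Hypothetical helper for illustration; not part of ExDataSketch.

  # Preamble longs = 1: only the 8-byte header.
  # Assumption: the entry count is implied by the remaining bytes.
  def parse(<<1, header::binary-size(7), rest::binary>>) do
    build(header, nil, div(byte_size(rest), 8), rest)
  end

  # Preamble longs = 2: header, entry count at offset 8, padding at 12.
  def parse(<<2, header::binary-size(7), count::little-32, _pad::32, rest::binary>>) do
    build(header, nil, count, rest)
  end

  # Preamble longs = 3: as above, plus theta at offset 16.
  def parse(
        <<3, header::binary-size(7), count::little-32, _pad::32, theta::little-64,
          rest::binary>>
      ) do
    build(header, theta, count, rest)
  end

  def parse(_other), do: {:error, :bad_preamble}

  # Bytes 1..7 of the header, followed by `count` little-endian u64 entries.
  defp build(
         <<serial_version, family_id, lg_nom_longs, lg_arr_longs, flags,
           seed_hash::little-16>>,
         theta,
         count,
         rest
       )
       when byte_size(rest) == count * 8 do
    entries = for <<entry::little-64 <- rest>>, do: entry

    {:ok,
     %{
       serial_version: serial_version,
       family_id: family_id,
       lg_nom_longs: lg_nom_longs,
       lg_arr_longs: lg_arr_longs,
       flags: flags,
       seed_hash: seed_hash,
       theta: theta,
       entries: entries
     }}
  end

  defp build(_header, _theta, _count, _rest), do: {:error, :truncated_entries}
end

# A 2-long preamble with two entries; all field values are sample data.
bin =
  <<2, 3, 3, 12, 0, 0x1A, 0x93CC::little-16, 2::little-32, 0::32,
    10::little-64, 20::little-64>>

{:ok, parsed} = PreambleSketch.parse(bin)
IO.inspect(parsed.entries)  # prints [10, 20]
```

Matching on the first byte selects the preamble shape up front, so each clause only has to describe the fields that actually exist for that size.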
Functions
@spec decode(binary(), keyword()) :: {:ok, map()} | {:error, Exception.t()}
Decodes a DataSketches CompactSketch binary into sketch components.
Returns {:ok, %{k: k, theta: theta, entries: [u64], seed_hash: u16}}
or {:error, %DeserializationError{}}.
Options
- :seed - expected seed for seed hash verification (default: 9001). Pass nil to skip seed hash verification.
@spec encode(ExDataSketch.Theta.t(), keyword()) :: binary()
Encodes a Theta sketch into the DataSketches CompactSketch binary format.
Options
- :seed - seed for seed hash computation (default: 9001)
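To make the output format concrete, here is a minimal sketch of assembling a 3-long (estimation mode) CompactSketch binary from its parts, following the Binary Layout table. It is not the library's `encode/2`: the module name `EncodeSketch`, the input map, and its keys are hypothetical, the flags byte and seed hash are supplied by the caller rather than computed, and sorting entries in ascending order is an assumption about what "ordered" means.

```elixir
defmodule EncodeSketch do
  # Hypothetical helper for illustration; not part of ExDataSketch.

  def encode_estimation(%{
        lg_k: lg_k,
        flags: flags,
        seed_hash: seed_hash,
        theta: theta,
        entries: entries
      }) do
    # Assumption: the ordered representation stores entries ascending.
    sorted = Enum.sort(entries)
    body = for entry <- sorted, into: <<>>, do: <<entry::little-64>>

    <<
      # Preamble longs, serial version, family ID (CompactSketch)
      3, 3, 3,
      # lgNomLongs, lgArrLongs (0 for compact), flags
      lg_k, 0, flags,
      seed_hash::little-16,
      # Retained entry count, then 4 bytes of padding
      length(sorted)::little-32, 0::32,
      theta::little-64,
      body::binary
    >>
  end
end

# All field values below are sample data.
bin =
  EncodeSketch.encode_estimation(%{
    lg_k: 12,
    flags: 0x1A,
    seed_hash: 0x93CC,
    theta: 1_000_000,
    entries: [20, 10]
  })

IO.inspect(byte_size(bin))  # prints 40
```

The result is 24 bytes of preamble followed by 8 bytes per retained entry.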