Caching guide

erllama's KV cache turns a multi-second prefill into a millisecond restore. This guide is the operator's-eye view: what it does, when it kicks in, and which knobs to touch.

The mental model

A transformer's "KV state" is the per-layer key/value tensors produced while reading the prompt. Once you have them, generating the next token costs one forward pass. Without them, you have to re-read every token of the prompt from scratch.

erllama's cache stores those tensors keyed on the exact tokens that produced them:

key = sha256(model_fingerprint || quant || ctx_params || tokens_le32)

Same tokens → same key → guaranteed-correct restore. There is no fuzzy matching layer; "close enough" is not allowed at this level.
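
For concreteness, a minimal sketch of that derivation in Erlang, assuming || is plain byte concatenation and tokens_le32 encodes each token id as a 32-bit little-endian integer (names and field encoding here are illustrative, not erllama's actual internals):

%% Illustrative only: byte-concatenate the fields and hash.
cache_key(Fingerprint, Quant, CtxParams, Tokens) ->
    TokensLE32 = << <<T:32/little-unsigned>> || T <- Tokens >>,
    crypto:hash(sha256, [Fingerprint, Quant, CtxParams, TokensLE32]).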

Three tiers

ram       ETS slabs in BEAM heap. Lowest latency, smallest budget.
ram_file  Files on /dev/shm. Fast, capped only by tmpfs size.
disk      Files on persistent storage. Survives restarts.

Each tier is an independently supervised gen_server with its own byte quota and its own LRU. A save is written to the tier you configure on the model; reads consult an in-memory index that fans out to the right tier.

The disk tier is a first-class citizen: when a model is too large to leave room in RAM for a working set of warm KV state, the disk tier can hold most of the cache and still warm-restore in milliseconds when a hit comes in.
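
Something like the following routes a model's saves to the disk tier. The tier option name follows the "load the model with tier => ram" phrasing later in this guide, but the load function and option shape are assumptions; check your build's API:

%% Hypothetical load call; only the `tier` option name is taken from
%% this guide. The function name and map shape may differ in your build.
{ok, Model} = erllama:load("qwen-72b-q4.gguf", #{tier => disk}).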

When does a save happen?

The per-model gen_statem fires saves at five well-defined moments, each with its own save_reason:

Reason     When                                                          Sync?
cold       Right after a cold prefill, at the trimmed-prefix boundary.   Async — the writer pool does the work.
continued  Every continued_interval tokens during generation.            Async.
finish     At the end of a completion, capturing prompt + reply.         Async.
evict      When a holder is asked to release its slab.                   Sync (pause decode, pack, release).
shutdown   On prep_stop or unload/1.                                     Sync, capped by evict_save_timeout_ms.

Async saves go through erllama_cache_writer (a poolboy pool of dirty-IO workers). Sync saves block the calling process until the file is on stable storage.

When does a hit happen?

Three lookup paths, in order of preference:

  1. Exact key. Caller passes the exact parent_key from the previous turn. Cheapest. Used by Erlang-native multi-turn flows.
  2. Resume. Caller passes a parent_key from an earlier turn, and the new prompt strictly extends the cached prefix.
  3. Longest-prefix walk. No parent_key supplied. The cache walks the new prompt's tokens backward by the configured stride (boundary_align_tokens) and probes the index for each alignment. The longest cached prefix wins.

For stateless callers — OpenAI/Anthropic-shaped HTTP APIs that resend the full conversation each turn — option 3 is what you want. You don't have to do anything; just call erllama:complete/2.
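
For Erlang-native callers, a sketch of threading the key across turns. The option and return-map keys below are illustrative, since the exact erllama:complete/2 contract isn't shown in this guide:

%% Turn 1: cold prefill; keep the key the cache hands back.
%% (The cache_key/parent_key map keys are assumptions, not the verified API.)
{ok, #{text := Reply1, cache_key := Key1}} =
    erllama:complete(Model, #{prompt => Prompt1}),

%% Turn 2: hits the exact-key path, or the resume path if the new
%% prompt strictly extends the cached prefix.
{ok, #{text := Reply2}} =
    erllama:complete(Model, #{prompt => Prompt2, parent_key => Key1}).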

Save policy gates

Saving every prefix would flood the writer pool. erllama gates saves behind a few thresholds, all overridable per-model.

Gate                     Default  What it does
min_tokens               512      Skip saves shorter than this. Prefills under 512 tokens are usually cheaper than the round-trip to disk.
cold_min_tokens          512      Don't fire a cold save for shorter prefills.
cold_max_tokens          30 000   Cap on cold-save size. Protects against pathological prompts.
continued_interval       2048     Fire a continued save every N generated tokens.
boundary_trim_tokens     32       Drop the last N tokens before saving. Mid-token, mid-sentence boundaries make poor resume points; trim to a safe alignment.
boundary_align_tokens    2048     Round the trim down to a multiple of this. Sets the longest-prefix walk's stride.
session_resume_wait_ms   500      When a parent_key is supplied and the cache sees an in-flight finish save, wait up to this long for it to publish before falling through to a fresh prefill.

Bigger boundary_align_tokens = fewer probes per longest-prefix walk but coarser hit alignment. 2048 is the default; 256 makes hits more likely on shorter prompts at the cost of more probes.
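
Worked example with the defaults: a 5 000-token prefill is trimmed by 32 to 4 968, then rounded down to the 2 048-token alignment, so the save covers 4 096 tokens. To trade probes for hit rate, override the gates. The min_tokens application-env key is shown under "Disabling the cache" below; assuming the other gates follow the same pattern (confirm against your build):

{erllama, [
  {boundary_align_tokens, 256},  %% finer stride: more probes, likelier hits
  {min_tokens, 256}              %% save shorter prefills too
]}.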

Memory-pressure-driven eviction

erllama_scheduler is a polling gen_server that watches a pluggable pressure source and evicts cache slabs when pressure crosses a watermark. Off by default. Enable in sys.config:

{erllama, [
  {scheduler, #{
    enabled         => true,
    pressure_source => system,        %% portable, memsup-backed
    interval_ms     => 5000,
    high_watermark  => 0.85,
    low_watermark   => 0.75,
    evict_tiers     => [ram, ram_file] %% disk fills to its own quota
  }}
]}.

Sources shipped:

  • noop — always reports zero pressure.
  • system — OTP memsup. Linux, macOS, BSD, Windows.
  • nvidia_smi — sums VRAM across all visible NVIDIA GPUs.
  • {module, M} — calls M:sample/0. Implement -behaviour(erllama_pressure) to write your own; a sketch follows this list.
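
A minimal custom source, assuming sample/0 returns the used/total ratio as a float in [0.0, 1.0]; the exact return contract is an assumption, so check the behaviour module's docs. This one reads cgroup-v2 memory accounting:

-module(my_cgroup_pressure).
-behaviour(erllama_pressure).
-export([sample/0]).

%% Report cgroup-v2 memory usage as a 0.0..1.0 ratio.
sample() ->
    {ok, UsedBin} = file:read_file("/sys/fs/cgroup/memory.current"),
    {ok, MaxBin}  = file:read_file("/sys/fs/cgroup/memory.max"),
    Used = binary_to_integer(string:trim(UsedBin)),
    case string:trim(MaxBin) of
        <<"max">> -> 0.0;   %% no limit configured: treat as no pressure
        MaxInt    -> Used / binary_to_integer(MaxInt)
    end.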

When the source reports Used / Total >= high_watermark, the scheduler asks the cache to evict enough bytes to drop the ratio below low_watermark, scoped to the configured tiers.
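
The manual equivalent of one scheduler pass, using erllama_cache:evict_bytes/2 from the inspection section below (the totals are illustrative):

GiB    = 1024 * 1024 * 1024,
Total  = 64 * GiB,
Used   = 56 * GiB,                        %% ratio 0.875 >= high_watermark 0.85
ToFree = Used - trunc(0.75 * Total),      %% freeing 8 GiB drops below low_watermark
{evicted, _Slabs, _Bytes} = erllama_cache:evict_bytes(ToFree, [ram, ram_file]).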

Inspecting the cache

%% Hit/miss/save counters and per-path latency totals.
erllama_cache:get_counters().

%% Every row in the index, raw tuples:
%%   {Key, Tier, Size, LastUsedNs, Refcount, Status, HeaderBin,
%%    Location, TokensRef, Hits}
erllama_cache_meta_srv:dump().

%% Synchronous full eviction pass: returns {evicted, N}.
erllama_cache:gc().

%% Free at least N bytes, oldest LRU first: returns {evicted, N, BytesFreed}.
erllama_cache:evict_bytes(256 * 1024 * 1024).
erllama_cache:evict_bytes(256 * 1024 * 1024, [ram, ram_file]).

The counter map is documented inline on erllama_cache:get_counters/0 — call it from a shell to see the keys for your build.

Disabling the cache

For benchmarking or sanity checks: load the model with tier => ram and a min_tokens large enough that no prefill ever clears the gate, or set the application env to disable all saves at the policy level:

{erllama, [
  {min_tokens, infinity}       %% nothing ever clears the gate
]}.

There is no global "off switch" — disabling was an explicit non-goal. The cache is the product.

See also