erllama_model (erllama v0.6.2)
Per-model gen_statem that drives the request flow and wires the cache subsystem into the model lifecycle.
State machine
                ┌──── admit (idle_seq_ids non-empty)
                │  ┌──── admit (queues in pending when idle_seq_ids empty)
                │  │  ┌──── cast tick (self)
                ▼  ▼  │
   idle ──────▶ running ─────┐
    ▲                        │
    └────── all reqs finished (req_table = #{} AND pending = [])

Two states only:
- idle/3: req_table is empty AND pending is empty. Accepts admit events (complete, prefill_only, infer) and transitions to running. Verify calls and read-only ops are allowed.
- running/3: one or more #req{} are in flight. Accepts further admit events (allocate seq_id from idle_seq_ids or enqueue in pending), cancel casts, and the internal tick cast. Verify is refused with {error, busy} because it mutates the context.
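The admit rule shared by both states can be sketched as a pure function. This is a minimal illustration, not the module's code: the module name admit_sketch and the map-shaped state (keys idle_seq_ids, pending, req_table) are assumptions made here for readability, and the real implementation presumably keeps this data in records.

```erlang
-module(admit_sketch).
-export([admit/2]).

%% Hypothetical state shape (assumption of this sketch):
%%   #{idle_seq_ids := [SeqId], pending := [Req], req_table := #{SeqId => Req}}
admit(Req, #{idle_seq_ids := [SeqId | Rest], req_table := Table} = State) ->
    %% A seq_id is free: the request enters req_table immediately.
    {running, State#{idle_seq_ids := Rest, req_table := Table#{SeqId => Req}}};
admit(Req, #{idle_seq_ids := [], pending := Pending} = State) ->
    %% No free seq_id: queue FIFO until a running request releases one.
    {running, State#{pending := Pending ++ [Req]}}.
```

Either branch leaves the machine in running, which matches the diagram: an admit never returns the model to idle.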
Per-request lifecycle
 admit          step_tick          step_tick         finish_req
─────▶ req_table ──────▶ prefilling ──────▶ decoding ────────▶
       (new #req,        (prefill_cursor    (prefill_cursor
        seq_id popped     non-empty,         undefined,
        from              backend:step       backend:step
        idle_seq_ids,     pushes slice       samples one
        sampler built,    to KV)             token + decodes)
        warm/cold path
        chosen)

Each #req records its own seq_id, sampler_ref, prompt_tokens,
context_tokens, prefill_cursor, generated, response_target,
cache_hit_kind, and finishing flag. The req_table map is keyed
by seq_id.
step_tick driver
Every tick (one llama_decode call) builds a co-batched op list:
- For each #req with prefill_cursor =/= undefined, append {seq_id, {prefill, Slice}}.
- For each #req with prefill_cursor =:= undefined and a sampler_ref, append {seq_id, {decode, sampler_ref}}.
The op list is bounded by total_batch_budget (context_opts.n_batch):
decode rows are kept whole, prefill rows are sliced head-first
until the sum fits. The NIF returns {seq_id, prefilled} or
{seq_id, {token, T, EogFlag}} per row; results land back in the
respective #req. Reqs that reach response_target tokens, an
eog flag, or a cancel get finishing = true and finalise in the
post-step finisher walk (finish_marked_reqs/2).
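The op-list construction described above could look roughly like the following. It is a sketch under stated assumptions: op_list_sketch and the map representation of a #req are hypothetical, and prefill_cursor is modelled as the list of prompt tokens still to be prefilled. Budget slicing is left out here.

```erlang
-module(op_list_sketch).
-export([build_ops/1]).

%% ReqTable: #{SeqId => Req}, where Req is a map standing in for #req{}.
build_ops(ReqTable) ->
    maps:fold(fun op_for/3, [], ReqTable).

%% Decoding request: prefill is done and a sampler exists.
op_for(SeqId, #{prefill_cursor := undefined, sampler_ref := Ref}, Acc)
  when Ref =/= undefined ->
    [{SeqId, {decode, Ref}} | Acc];
%% Prefilling request: hand the remaining prompt tokens to the batch.
op_for(SeqId, #{prefill_cursor := Remaining}, Acc) when is_list(Remaining) ->
    [{SeqId, {prefill, Remaining}} | Acc];
%% Anything else contributes no row this tick.
op_for(_SeqId, _Req, Acc) ->
    Acc.
```

The fold order over a map is not specified, so callers that care about row order should sort the result.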
Per-tick batch budget
step_tick/1 enforces total tokens ≤ total_batch_budget (mirrors
context_opts.n_batch, default 512). If total_batch_budget is
smaller than the number of in-flight decoders, the gen_statem
crashes deliberately so that the supervisor restarts it and the
operator can fix n_batch / n_seq_max. Otherwise prefill rows are
sliced head-first to fit; truncated tails resume on the next tick.
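A minimal sketch of the budget rule, assuming ops shaped as in the step_tick section. The module name budget_sketch and the error term are hypothetical; the real module may signal the crash differently.

```erlang
-module(budget_sketch).
-export([fit/2]).

%% Ops are {SeqId, {decode, Ref}} or {SeqId, {prefill, Tokens}}.
%% Returns the rows that fit into Budget tokens for one tick.
fit(Ops, Budget) ->
    {Decodes, Prefills} =
        lists:partition(fun({_, {Kind, _}}) -> Kind =:= decode end, Ops),
    case Budget - length(Decodes) of
        Left when Left < 0 ->
            %% More decoders than batch rows: fail loudly (hypothetical term).
            error({batch_budget_too_small, Budget, length(Decodes)});
        Left ->
            %% Decode rows cost one token each and are kept whole.
            Decodes ++ slice_prefills(Prefills, Left)
    end.

slice_prefills(_Prefills, 0) ->
    [];
slice_prefills([], _Left) ->
    [];
slice_prefills([{SeqId, {prefill, Tokens}} | Rest], Left) ->
    %% Take the head of the prompt; the tail waits for a later tick.
    Slice = lists:sublist(Tokens, Left),
    [{SeqId, {prefill, Slice}} | slice_prefills(Rest, Left - length(Slice))].
```

With a budget of 5, one decoder, and prefill rows of 3 and 2 tokens, the second prefill row is cut to a single token and its remaining token is deferred.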
Chunked prefill
Each prefill row is additionally capped by the prefill_chunk_size
policy knob (default max(64, n_batch div 4), or infinity to
disable). The effective slice per prefill row is
min(length(remaining), prefill_chunk_size, available_budget). A
long prompt is therefore sliced across several ticks even when the
batch budget alone would have accommodated it in one, leaving room
for concurrent decoders to make progress between chunks.
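The slice formula can be written out directly. The sketch below assumes nothing beyond the formula and default stated above; chunk_sketch and its function names are hypothetical.

```erlang
-module(chunk_sketch).
-export([default_chunk_size/1, slice_len/3]).

%% Default policy value: max(64, n_batch div 4).
default_chunk_size(NBatch) ->
    max(64, NBatch div 4).

%% Effective slice length for one prefill row in one tick.
slice_len(Remaining, infinity, Available) ->
    %% Chunking disabled: only the batch budget limits the slice.
    min(length(Remaining), Available);
slice_len(Remaining, ChunkSize, Available) when is_integer(ChunkSize) ->
    lists:min([length(Remaining), ChunkSize, Available]).
```

With the default n_batch of 512 the chunk size is 128, so a 1000-token prompt needs at least 8 ticks to prefill even when the whole budget is free.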
Cache save reasons
- cold: fired right after a fresh prompt's prefill completes, before any decoding. Saves the trimmed prefix the policy produces from cold_save_split/2. Per-#req: each new admit fires its own cold save at most once.
- continued: fired every continued_interval tokens during decode. Per-#req, gated on the request's last_save_at.
- finish: fired when the request finishes (success, length limit, eog, or cancel). Saves the full context_tokens.
- evict / shutdown: fired by external triggers and walks every in-flight #req, firing one save per non-empty context_tokens.
All saves go through fire_save_if/5 which calls
backend:kv_pack/3 against the request's seq_id and hands the
binary off to erllama_cache_writer.
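One plausible reading of the continued gate is sketched below. The source does not spell out how last_save_at is compared, so this is an assumption: last_save_at is taken to be the context length at the previous save, and a save is due once at least continued_interval tokens have accumulated since then. The module name save_gate_sketch is hypothetical.

```erlang
-module(save_gate_sketch).
-export([continued_due/3]).

%% ContextLen: tokens currently in the request's context.
%% LastSaveAt: context length at the previous save (assumed meaning).
%% Interval:   the continued_interval policy value.
continued_due(ContextLen, LastSaveAt, Interval)
  when is_integer(Interval), Interval > 0 ->
    ContextLen - LastSaveAt >= Interval.
```

Because the gate is evaluated per request, two concurrent requests with different last_save_at values save on independent schedules.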
Concurrency contract
The gen_statem is the sole writer of the context's KV cells: every
backend:step/2 and kv_pack / kv_unpack / seq_rm call runs
inside a state callback, so the AGENTS.md paused-context invariant
holds — kv_pack only runs between ticks when no llama_decode
is in flight. Default n_seq_max => 1 collapses this to the v0.2
single-tenant flow bit-identically; opting in via
context_opts.n_seq_max > 1 lets up to N requests run
concurrently through one decode call per tick.
Backwards compatibility
- Public API (complete/2,3, prefill_only/2, infer/4, cancel/1, status/1, model_info/1, verify/4, etc.) is unchanged.
- Default n_seq_max => 1 keeps single-tenant behaviour bit-identical to v0.2; multi-tenancy is opt-in.
- phase on the obs row and in model_info/1 is still idle | prefilling | generating, computed from the dominant phase across in-flight reqs (dominant_phase/1).
Types
-type cache_hit_kind() :: exact | partial | cold | sticky | continuation.
-type completion_result() :: #{reply := binary(), generated := [non_neg_integer()], context_tokens := [non_neg_integer()], committed_tokens := non_neg_integer(), finish_key := binary() | undefined, cache_hit_kind := cache_hit_kind(), finish_reason := finish_reason(), cache_delta := #{read := non_neg_integer(), created := non_neg_integer()}, stats := stats(), stop_sequence => binary()}.
-type finish_reason() :: stop | length | cancelled.
-type infer_params() :: #{response_tokens => pos_integer(), parent_key => term(), temperature => float(), top_p => float(), top_k => pos_integer(), min_p => float(), repetition_penalty => float(), seed => non_neg_integer(), stop_sequences => [binary()], grammar => binary(), thinking => enabled | disabled, thinking_budget_tokens => pos_integer(), session_id => term(), _ => _}.
-type model() :: erllama_registry:model_id() | pid().
-type model_info() :: #{id := binary(), model_id := binary(), pid := pid(), status := idle | prefilling | generating, backend := module(), context_size := non_neg_integer(), quant_type := atom(), quant_bits := non_neg_integer(), quant_tag := binary(), tier := disk | ram_file, fingerprint := binary(), loaded_at_monotonic := integer(), vram_estimate_b := non_neg_integer()}.
-type prefill_result() :: #{context_tokens := [non_neg_integer()], committed_tokens := non_neg_integer(), finish_key := binary() | undefined, cache_hit_kind := cache_hit_kind(), cache_delta := #{read := non_neg_integer(), created := non_neg_integer()}}.
-type stats() :: #{prompt_tokens := non_neg_integer(), completion_tokens := non_neg_integer(), prefill_ms := non_neg_integer(), generation_ms := non_neg_integer(), cache_hit_kind := cache_hit_kind(), finish_reason := finish_reason(), cancelled := boolean(), finish_key := binary() | undefined, committed_tokens := non_neg_integer(), cache_delta := #{read := non_neg_integer(), created := non_neg_integer()}, stop_sequence => binary()}.
Functions
-spec apply_chat_template(model(), erllama_model_backend:chat_request()) -> {ok, [non_neg_integer()]} | {error, term()}.
Render a normalised chat request through the model's chat template
and tokenise in one step. The Request map carries messages,
system, and tools; the per-model template decides where each
field lands in the prompt.
Returns {error, not_supported} if the backend does not implement
chat templating.
-spec cancel(reference()) -> ok.
Cancel an in-flight streaming inference. Idempotent and fire-and-
forget: returns ok even if the ref is unknown (already finished or
never existed). The cancellation is observed at the next
inter-token boundary; the model emits a final {erllama_done, Ref, Stats} with cancelled => true after the running decode step
completes.
-spec complete(model(), binary()) -> {ok, completion_result()} | {error, term()}.
-spec complete(model(), binary(), map()) -> {ok, completion_result()} | {error, term()}.
-spec continue(model(), [non_neg_integer()], map()) -> {ok, reference()} | {error, no_session | sticky_busy | term()}.
Streaming inference that extends a pinned sticky session by prefilling
SuffixTokens directly onto the session's already-resident KV cells,
without re-rendering or re-tokenising the prior turns and without the
prompt prefix-equality check infer/4 performs.
Opts must carry session_id (identifying a previously pinned
session) and caller_pid (where streaming events are sent). All other
infer_params() keys are honoured (response_tokens, temperature,
top_p, grammar, thinking, thinking_budget_tokens,
stop_sequences, ...). parent_key is ignored on this path because
no cache lookup runs.
Contract: the caller asserts that SuffixTokens is the exact
tokenised tail to append on top of the session's stored prefix. The
engine performs no prefix equality check. If the suffix is wrong the
generation will be garbage tokens, but engine state stays consistent.
Use this primitive when the chat template renders different role
markers depending on history length, so the retokenised prefix on
turn N would not equal the stored tokens from turn N-1 and the
sticky path in infer/4 would fall through to a cold admission.
Errors:
- {error, no_session}: session_id is unknown (no prior turn pinned this session, or it was released).
- {error, sticky_busy}: the session's seq is currently in flight with an earlier request.
Streaming wire shape and Stats are identical to infer/4. On
success Stats.cache_hit_kind is continuation,
Stats.cache_delta.read equals the length of the session's stored
tokens before the call, and Stats.cache_delta.created equals
length(SuffixTokens) + completion_tokens.
See the "Sticky sessions" example in the guides for the recommended
caller pattern (slice the rendered prompt at the prior turn's
committed_tokens count and pass the tail).
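The slicing step of that caller pattern can be sketched as a small helper. The module name suffix_sketch and the error atom are hypothetical; only the idea of dropping the first committed_tokens ids from the freshly rendered prompt comes from the text above.

```erlang
-module(suffix_sketch).
-export([suffix/2]).

%% Rendered:  token ids for the whole conversation as rendered this turn.
%% Committed: committed_tokens reported by the previous turn.
suffix(Rendered, Committed)
  when is_integer(Committed), Committed >= 0,
       Committed =< length(Rendered) ->
    %% Drop the part already resident in the session's KV cells.
    {ok, lists:nthtail(Committed, Rendered)};
suffix(_Rendered, _Committed) ->
    %% The stored prefix is longer than the new prompt: do not guess.
    {error, prefix_longer_than_prompt}.
```

The resulting tail would then be passed as SuffixTokens, for example continue(Model, Suffix, #{session_id => SessionId, caller_pid => self()}). The helper checks lengths only; as the contract says, correctness of the tail remains the caller's responsibility.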
-spec detokenize(model(), [non_neg_integer()]) -> {ok, binary()} | {error, term()}.
Detokenise a list of token IDs back to a string. Safe to call
concurrently with complete/2,3.
-spec embed(model(), [non_neg_integer()]) -> {ok, [float()]} | {error, term()}.
Compute an embedding vector for the given prompt tokens.
-spec evict(model()) -> ok.
Request that the model evict its current state. Fires an evict
save synchronously if there is anything in the context. Called by
erllama_scheduler (future) when GPU memory pressure requires this
model to release its context handle. No-op when the model is idle
with no live context.
-spec infer(model(), [non_neg_integer()], infer_params(), pid()) -> {ok, reference()} | {error, term()}.
Streaming inference. Admits a request and immediately returns a
unique reference(); tokens are delivered to CallerPid via
asynchronous messages:
- {erllama_token, Ref, binary()} per generated token (text fragment; suppressed when the detokenized binary is empty)
- {erllama_token_id, Ref, integer()} per generated token (always delivered, including for tokens whose text fragment is empty; used by speculative-decoding collectors)
- {erllama_done, Ref, stats()} on normal completion
- {erllama_error, Ref, term()} on failure
Tokens is the prompt as a list of token ids - tokenisation is the
caller's responsibility (use tokenize/2 or apply a chat template
first). Params is an infer_params() map.
Calls that arrive while a previous request is in flight are queued
FIFO. The reply {ok, Ref} is sent as soon as the call is admitted;
streaming events follow once the queue head advances to this
request.
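A caller-side receive loop for these messages might look like the sketch below. The message shapes are the ones listed above; the module name collect_sketch and the per-message Timeout argument are assumptions of this example, not part of the API.

```erlang
-module(collect_sketch).
-export([collect/2]).

%% Gather text fragments for Ref until a terminal message arrives.
%% Timeout (ms) applies to each wait, not to the whole stream.
collect(Ref, Timeout) ->
    collect(Ref, Timeout, []).

collect(Ref, Timeout, Acc) ->
    receive
        {erllama_token, Ref, Fragment} ->
            collect(Ref, Timeout, [Fragment | Acc]);
        {erllama_token_id, Ref, _Id} ->
            %% Ids are ignored here; a speculative-decoding collector
            %% would keep them instead of the text fragments.
            collect(Ref, Timeout, Acc);
        {erllama_done, Ref, Stats} ->
            {ok, iolist_to_binary(lists:reverse(Acc)), Stats};
        {erllama_error, Ref, Reason} ->
            {error, Reason}
    after Timeout ->
        {error, timeout}
    end.
```

Matching on Ref in every clause keeps streams from concurrent requests apart when one process has several inferences in flight. A cancelled request still ends with erllama_done, so the caller can inspect cancelled in the returned Stats.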
List currently attached adapters as [#{handle => H, scale => F}].
The handle is the same opaque value load_adapter/2 returned.
-spec load_adapter(model(), file:filename_all()) -> {ok, term()} | {error, term()}.
Load a LoRA adapter from a GGUF file and attach it to the model
with scale 1.0. Returns an opaque handle the caller threads into
unload_adapter/2 and set_adapter_scale/3. The adapter's sha256 is
folded into the effective fingerprint so cache rows produced under
this adapter never collide with rows from a different adapter set.
-spec model_info(model()) -> model_info().
Snapshot of the model's metadata.
Returns a model_info() map with status, context size, quantisation,
backend, fingerprint, and tier. Safe to call from any state - the
gen_statem handles it as a common event without disrupting in-flight
inference.
-spec prefill_only(model(), [non_neg_integer()]) -> {ok, prefill_result()} | {error, term()}.
Decode a prompt into KV state and fire a finish save, without
sampling any output tokens. Returns the finish_key so the caller
can hand it as parent_key to a subsequent complete/3 or
infer/4 for token-exact warm restore.
PromptTokens is the prompt as a list of token ids. Tokenisation
is the caller's responsibility (use tokenize/2 or apply a chat
template first). The cache behaviour mirrors complete/3: an exact
or longest-prefix warm restore is taken when available, otherwise
the prompt is prefilled cold.
finish_key is undefined if the finish save was suppressed
because the token count is below the configured min_tokens.
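Since finish_key may be undefined, a caller that chains prefill_only into a later request probably wants to set parent_key only when a key exists. A minimal sketch follows; warm_sketch and with_parent/2 are hypothetical names.

```erlang
-module(warm_sketch).
-export([with_parent/2]).

%% PrefillResult: the map returned inside {ok, _} by prefill_only.
%% Params:        an infer_params() map for the follow-up request.
with_parent(#{finish_key := undefined}, Params) ->
    %% The save was suppressed: leave Params alone, lookup proceeds as usual.
    Params;
with_parent(#{finish_key := Key}, Params) when is_binary(Key) ->
    Params#{parent_key => Key}.
```

The returned map can then be handed to complete/3 or infer/4 as the parameter map.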
-spec prefill_only(model(), [non_neg_integer()], map()) -> {ok, prefill_result()} | {error, term()}.
set_adapter_scale/3: change an attached adapter's scale. Re-applies the full adapter set on the underlying context.
-spec shutdown(model()) -> ok.
Fire a shutdown save synchronously and return. Called from the
application's prep_stop hook so live state survives a graceful
restart.
-spec status(model()) -> idle | prefilling | generating.
-spec stop(model()) -> ok.
-spec tokenize(model(), binary()) -> {ok, [non_neg_integer()]} | {error, term()}.
Tokenise a string using the model's tokenizer. Returns a list of
token IDs. Safe to call concurrently with complete/2,3; tokenisation
runs against the model's static vocabulary, not the live KV cache.
unload_adapter/2: detach and free a previously loaded adapter. Idempotent: a second call
on the same handle returns ok.
-spec verify(model(), [erllama_nif:token_id()], [erllama_nif:token_id()], pos_integer()) -> {ok, non_neg_integer(), erllama_nif:token_id() | eos} | {error, term()}.