erllama_model (erllama v0.2.0)

View Source

Per-model gen_statem that drives the request flow and wires the cache subsystem into the model lifecycle.

State machine

   admit (idle_seq_ids non-empty)
        admit (queues in pending when idle_seq_ids empty)
             cast tick (self)
            
 idle  running 
                          
   all reqs finished (req_table = #{} AND pending = [])

Two states only:

  • idle/3req_table is empty AND pending is empty. Accepts admit events (complete, prefill_only, infer) and transitions to running. Verifies and other read-only ops are allowed.
  • running/3 — one or more #req{} are in flight. Accepts further admit events (allocate seq_id from idle_seq_ids or enqueue in pending), cancel casts, and the internal tick cast. Verify is refused with {error, busy} because it mutates the context.

Per-request lifecycle

  admit                  step_tick                step_tick               finish_req
   req_table        prefilling        decoding        
         (new #req,              (prefill_cursor          (prefill_cursor
          seq_id popped           non-empty,              undefined,
          from                    backend:step             backend:step
          idle_seq_ids,           pushes slice             samples one
          sampler built,          to KV)                   token + decodes)
          warm/cold path
          chosen)

Each #req records its own seq_id, sampler_ref, prompt_tokens, context_tokens, prefill_cursor, generated, response_target, cache_hit_kind, and finishing flag. The req_table map is keyed by seq_id.

step_tick driver

Every tick (one llama_decode call) builds a co-batched op list:

  • For each #req with prefill_cursor =/= undefined, append {seq_id, {prefill, Slice}}.
  • For each #req with prefill_cursor =:= undefined and a sampler_ref, append {seq_id, {decode, sampler_ref}}.

The op list is bounded by total_batch_budget (context_opts.n_batch): decode rows are kept whole, prefill rows are sliced head-first until the sum fits. The NIF returns {seq_id, prefilled} or {seq_id, {token, T, EogFlag}} per row; results land back in the respective #req. Reqs that reach response_target tokens, an eog flag, or a cancel get finishing = true and finalise in the post-step finisher walk (finish_marked_reqs/2).

Per-tick batch budget

step_tick/1 enforces total tokens ≤ total_batch_budget (mirrors context_opts.n_batch, default 512). If total_batch_budget is smaller than the number of in-flight decoders, the gen_statem crashes deliberately so the supervisor restarts and the operator fixes n_batch / n_seq_max. Otherwise prefill rows are sliced head-first to fit; truncated tails resume next tick.

Chunked prefill

Each prefill row is additionally capped by the prefill_chunk_size policy knob (default max(64, n_batch div 4), or infinity to disable). The effective slice per prefill row is min(length(remaining), prefill_chunk_size, available_budget). A long prompt is therefore sliced across several ticks even when the batch budget alone would have accommodated it in one, leaving room for concurrent decoders to make progress between chunks.

Cache save reasons

  • cold: fired right after a fresh prompt's prefill completes, before any decoding. Saves the trimmed prefix the policy produces from cold_save_split/2. Per-#req — each new admit fires its own cold save at most once.
  • continued: fired every continued_interval tokens during decode. Per-#req, gated on the request's last_save_at.
  • finish: fired when the request finishes (success, length limit, eog, or cancel). Saves the full context_tokens.
  • evict / shutdown: fired by external triggers and walks every in-flight #req, firing one save per non-empty context_tokens.

All saves go through fire_save_if/5 which calls backend:kv_pack/3 against the request's seq_id and hands the binary off to erllama_cache_writer.

Concurrency contract

The gen_statem is the sole writer of the context's KV cells: every backend:step/2 and kv_pack / kv_unpack / seq_rm call runs inside a state callback, so the AGENTS.md paused-context invariant holds — kv_pack only runs between ticks when no llama_decode is in flight. Default n_seq_max => 1 collapses this to the v0.2 single-tenant flow bit-identically; opting in via context_opts.n_seq_max > 1 lets up to N requests run concurrently through one decode call per tick.

Backwards compatibility

  • Public API (complete/2,3, prefill_only/2, infer/4, cancel/1, status/1, model_info/1, verify/4, etc.) is unchanged.
  • Default n_seq_max => 1 keeps single-tenant behaviour bit-identical to v0.2; multi-tenancy is opt-in.
  • phase on the obs row and in model_info/1 is still idle | prefilling | generating, computed from the dominant phase across in-flight reqs (dominant_phase/1).

Summary

Functions

Render a normalised chat request through the model's chat template and tokenise in one step. The Request map carries messages, system, and tools; the per-model template decides where each field lands in the prompt.

Cancel an in-flight streaming inference. Idempotent and fire-and- forget: returns ok even if the ref is unknown (already finished or never existed). The cancellation is observed at the next inter-token boundary; the model emits a final {erllama_done, Ref, Stats} with cancelled => true after the running decode step completes.

Detokenise a list of token IDs back to a string. Safe to call concurrently with complete/2,3.

Compute an embedding vector for the given prompt tokens.

Request that the model evict its current state. Fires an evict save synchronously if there is anything in the context. Called by erllama_scheduler (future) when GPU memory pressure requires this model to release its context handle. No-op when the model is idle with no live context.

Streaming inference. Admits a request and immediately returns a unique reference(); tokens are delivered to CallerPid via asynchronous messages

List currently attached adapters as [#{handle => H, scale => F}]. The handle is the same opaque value load_adapter/2 returned.

Load a LoRA adapter from a GGUF file and attach it to the model with scale 1.0. Returns an opaque handle the caller threads into unload_adapter/2 and set_adapter_scale/3. The adapter's sha256 is folded into the effective fingerprint so cache rows produced under this adapter never collide with rows from a different adapter set.

Snapshot of the model's metadata.

Decode a prompt into KV state and fire a finish save, without sampling any output tokens. Returns the finish_key so the caller can hand it as parent_key to a subsequent complete/3 or infer/4 for token-exact warm restore.

Change an attached adapter's scale. Re-applies the full set on the underlying context.

Fire a shutdown save synchronously and return. Called from the application's prep_stop hook so live state survives a graceful restart.

Tokenise a string using the model's tokenizer. Returns a list of token IDs. Safe to call concurrently with complete/2,3; tokenisation runs against the model's static vocabulary, not the live KV cache.

Detach + free a previously loaded adapter. Idempotent: a second call on the same handle returns ok.

Types

cache_hit_kind()

-type cache_hit_kind() :: exact | partial | cold.

completion_result()

-type completion_result() ::
          #{reply := binary(),
            generated := [non_neg_integer()],
            context_tokens := [non_neg_integer()],
            committed_tokens := non_neg_integer(),
            finish_key := binary() | undefined,
            cache_hit_kind := cache_hit_kind(),
            finish_reason := finish_reason(),
            stats := stats()}.

finish_reason()

-type finish_reason() :: stop | length | cancelled.

infer_params()

-type infer_params() ::
          #{response_tokens => pos_integer(),
            parent_key => term(),
            temperature => float(),
            top_p => float(),
            top_k => pos_integer(),
            min_p => float(),
            repetition_penalty => float(),
            seed => non_neg_integer(),
            stop => [binary()],
            grammar => binary(),
            _ => _}.

model()

-type model() :: erllama_registry:model_id() | pid().

model_info()

-type model_info() ::
          #{id := binary(),
            model_id := binary(),
            pid := pid(),
            status := idle | prefilling | generating,
            backend := module(),
            context_size := non_neg_integer(),
            quant_type := atom(),
            quant_bits := non_neg_integer(),
            quant_tag := binary(),
            tier := disk | ram_file,
            fingerprint := binary(),
            loaded_at_monotonic := integer(),
            vram_estimate_b := non_neg_integer()}.

prefill_result()

-type prefill_result() ::
          #{context_tokens := [non_neg_integer()],
            committed_tokens := non_neg_integer(),
            finish_key := binary() | undefined,
            cache_hit_kind := cache_hit_kind()}.

stats()

-type stats() ::
          #{prompt_tokens := non_neg_integer(),
            completion_tokens := non_neg_integer(),
            prefill_ms := non_neg_integer(),
            generation_ms := non_neg_integer(),
            cache_hit_kind := cache_hit_kind(),
            finish_reason := finish_reason(),
            cancelled := boolean(),
            finish_key := binary() | undefined,
            committed_tokens := non_neg_integer()}.

Functions

apply_chat_template(Model, Request)

-spec apply_chat_template(model(), erllama_model_backend:chat_request()) ->
                             {ok, [non_neg_integer()]} | {error, term()}.

Render a normalised chat request through the model's chat template and tokenise in one step. The Request map carries messages, system, and tools; the per-model template decides where each field lands in the prompt.

Returns {error, not_supported} if the backend does not implement chat templating.

cache_key_meta(Model)

-spec cache_key_meta(model()) ->
                        #{fingerprint := binary(), quant_type := atom(), ctx_params_hash := binary()}.

callback_mode()

cancel(Ref)

-spec cancel(reference()) -> ok.

Cancel an in-flight streaming inference. Idempotent and fire-and- forget: returns ok even if the ref is unknown (already finished or never existed). The cancellation is observed at the next inter-token boundary; the model emits a final {erllama_done, Ref, Stats} with cancelled => true after the running decode step completes.

complete(Model, Prompt)

-spec complete(model(), binary()) -> {ok, completion_result()} | {error, term()}.

complete(Model, Prompt, Opts)

-spec complete(model(), binary(), map()) -> {ok, completion_result()} | {error, term()}.

detokenize(Model, Tokens)

-spec detokenize(model(), [non_neg_integer()]) -> {ok, binary()} | {error, term()}.

Detokenise a list of token IDs back to a string. Safe to call concurrently with complete/2,3.

embed(Model, Tokens)

-spec embed(model(), [non_neg_integer()]) -> {ok, [float()]} | {error, term()}.

Compute an embedding vector for the given prompt tokens.

evict(Model)

-spec evict(model()) -> ok.

Request that the model evict its current state. Fires an evict save synchronously if there is anything in the context. Called by erllama_scheduler (future) when GPU memory pressure requires this model to release its context handle. No-op when the model is idle with no live context.

idle/3

infer(Model, Tokens, Params, CallerPid)

-spec infer(model(), [non_neg_integer()], infer_params(), pid()) -> {ok, reference()} | {error, term()}.

Streaming inference. Admits a request and immediately returns a unique reference(); tokens are delivered to CallerPid via asynchronous messages:

  • {erllama_token, Ref, binary()} per generated token (text fragment; suppressed when the detokenized binary is empty)
  • {erllama_token_id, Ref, integer()} per generated token (always delivered, including for tokens whose text fragment is empty; used by speculative-decoding collectors)
  • {erllama_done, Ref, stats()} on normal completion
  • {erllama_error, Ref, term()} on failure

Tokens is the prompt as a list of token ids - tokenisation is the caller's responsibility (use tokenize/2 or apply a chat template first). Params is an infer_params() map.

Calls that arrive while a previous request is in flight are queued FIFO. The reply {ok, Ref} is sent as soon as the call is admitted; streaming events follow once the queue head advances to this request.

init/1

list_adapters(Model)

-spec list_adapters(model()) -> [#{handle := term(), scale := float()}].

List currently attached adapters as [#{handle => H, scale => F}]. The handle is the same opaque value load_adapter/2 returned.

load_adapter(Model, Path)

-spec load_adapter(model(), file:filename_all()) -> {ok, term()} | {error, term()}.

Load a LoRA adapter from a GGUF file and attach it to the model with scale 1.0. Returns an opaque handle the caller threads into unload_adapter/2 and set_adapter_scale/3. The adapter's sha256 is folded into the effective fingerprint so cache rows produced under this adapter never collide with rows from a different adapter set.

model_info(Model)

-spec model_info(model()) -> model_info().

Snapshot of the model's metadata.

Returns a model_info() map with status, context size, quantisation, backend, fingerprint, and tier. Safe to call from any state - the gen_statem handles it as a common event without disrupting in-flight inference.

prefill_only(Model, PromptTokens)

-spec prefill_only(model(), [non_neg_integer()]) -> {ok, prefill_result()} | {error, term()}.

Decode a prompt into KV state and fire a finish save, without sampling any output tokens. Returns the finish_key so the caller can hand it as parent_key to a subsequent complete/3 or infer/4 for token-exact warm restore.

PromptTokens is the prompt as a list of token ids. Tokenisation is the caller's responsibility (use tokenize/2 or apply a chat template first). The cache behaviour mirrors complete/3: an exact or longest-prefix warm restore is taken when available, otherwise the prompt is prefilled cold.

finish_key is undefined if the finish save was suppressed because the token count is below the configured min_tokens.

running/3

set_adapter_scale(Model, Handle, Scale)

-spec set_adapter_scale(model(), term(), float()) -> ok | {error, term()}.

Change an attached adapter's scale. Re-applies the full set on the underlying context.

shutdown(Model)

-spec shutdown(model()) -> ok.

Fire a shutdown save synchronously and return. Called from the application's prep_stop hook so live state survives a graceful restart.

start_link(ModelId, Config)

-spec start_link(binary(), map()) -> {ok, pid()} | {error, term()}.

status(Model)

-spec status(model()) -> idle | prefilling | generating.

stop(Model)

-spec stop(model()) -> ok.

terminate/3

tokenize(Model, Text)

-spec tokenize(model(), binary()) -> {ok, [non_neg_integer()]} | {error, term()}.

Tokenise a string using the model's tokenizer. Returns a list of token IDs. Safe to call concurrently with complete/2,3; tokenisation runs against the model's static vocabulary, not the live KV cache.

unload_adapter(Model, Handle)

-spec unload_adapter(model(), term()) -> ok | {error, term()}.

Detach + free a previously loaded adapter. Idempotent: a second call on the same handle returns ok.

verify(Model, PrefixTokens, Candidates, K)

-spec verify(model(), [erllama_nif:token_id()], [erllama_nif:token_id()], pos_integer()) ->
                {ok, non_neg_integer(), erllama_nif:token_id() | eos} | {error, term()}.