erllama_model (erllama v0.1.0)


Per-model gen_statem that drives the request flow and wires the cache subsystem into the model lifecycle.

State machine (v0.1):

  idle --(complete)--> prefilling --(prefill_done)--> generating --(finish)--> idle

On the prefilling → generating transition the model fires a cold save (boundary-trimmed prefix, async). Inside generating it fires a finish save (full live token list, async) just before returning to idle.

The continued save reason (every N tokens of new generation), the evict save reason (driven by an external scheduler), and the shutdown save reason (driven by application:prep_stop) are defined in erllama_cache_policy but not yet wired here; they land in follow-up steps.

Model operations (tokenize, prefill, decode, kv_pack, kv_unpack) are stubbed — the gen_statem's tokens field IS the "context". When step 2b lands the real erllama_nif for llama.cpp, those stubs get replaced; the cache integration is unaffected.
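
As a rough sketch of where the cold save sits in that flow (the erllama_cache:save_async/2 call, the #state{} fields, and boundary_trim/1 are illustrative placeholders, not the module's actual internals):

    %% Sketch only: the prefilling -> generating hop fires the cold save
    %% asynchronously so it never delays the first generated token.
    prefilling(internal, prefill_done, #state{tokens = Tokens} = Data) ->
        Prefix = boundary_trim(Tokens),           %% boundary-trimmed prefix
        erllama_cache:save_async(cold, Prefix),   %% hypothetical async save call
        {next_state, generating, Data,
         [{next_event, internal, decode}]}.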

Summary

Functions

apply_chat_template(Model, Request)

Render a normalised chat request through the model's chat template and tokenise in one step. The Request map carries messages, system, and tools; the per-model template decides where each field lands in the prompt.

cancel(Ref)

Cancel an in-flight streaming inference. Idempotent and fire-and-forget: returns ok even if the ref is unknown (already finished or never existed). The cancellation is observed at the next inter-token boundary; the model emits a final {erllama_done, Ref, Stats} with cancelled => true after the running decode step completes.

detokenize(Model, Tokens)

Detokenise a list of token IDs back to a string. Safe to call concurrently with complete/2,3.

embed(Model, Tokens)

Compute an embedding vector for the given prompt tokens.

evict(Model)

Request that the model evict its current state. Fires an evict save synchronously if there is anything in the context. Called by erllama_scheduler (future) when GPU memory pressure requires this model to release its context handle. No-op when the model is idle with no live context.

infer(Model, Tokens, Params, CallerPid)

Streaming inference. Admits a request and immediately returns a unique reference(); tokens are delivered to CallerPid via asynchronous messages.

list_adapters(Model)

List currently attached adapters as [#{handle => H, scale => F}]. The handle is the same opaque value load_adapter/2 returned.

load_adapter(Model, Path)

Load a LoRA adapter from a GGUF file and attach it to the model with scale 1.0. Returns an opaque handle the caller threads into unload_adapter/2 and set_adapter_scale/3. The adapter's sha256 is folded into the effective fingerprint so cache rows produced under this adapter never collide with rows from a different adapter set.

model_info(Model)

Snapshot of the model's metadata.

set_adapter_scale(Model, Handle, Scale)

Change an attached adapter's scale. Re-applies the full set on the underlying context.

shutdown(Model)

Fire a shutdown save synchronously and return. Called from the application's prep_stop hook so live state survives a graceful restart.

tokenize(Model, Text)

Tokenise a string using the model's tokenizer. Returns a list of token IDs. Safe to call concurrently with complete/2,3; tokenisation runs against the model's static vocabulary, not the live KV cache.

unload_adapter(Model, Handle)

Detach + free a previously loaded adapter. Idempotent: a second call on the same handle returns ok.

Types

cache_hit_kind()

-type cache_hit_kind() :: exact | partial | cold.

finish_reason()

-type finish_reason() :: stop | length | cancelled.

infer_params()

-type infer_params() ::
          #{response_tokens => pos_integer(),
            parent_key => term(),
            temperature => float(),
            top_p => float(),
            top_k => pos_integer(),
            min_p => float(),
            repetition_penalty => float(),
            seed => non_neg_integer(),
            stop => [binary()],
            grammar => binary(),
            _ => _}.

model()

-type model() :: erllama_registry:model_id() | pid().

model_info()

-type model_info() ::
          #{id := binary(),
            pid := pid(),
            status := idle | prefilling | generating,
            backend := module(),
            context_size := non_neg_integer(),
            quant_type := atom(),
            quant_bits := non_neg_integer(),
            tier := disk | ram_file,
            fingerprint := binary()}.

stats()

-type stats() ::
          #{prompt_tokens := non_neg_integer(),
            completion_tokens := non_neg_integer(),
            prefill_ms := non_neg_integer(),
            generation_ms := non_neg_integer(),
            cache_hit_kind := cache_hit_kind(),
            finish_reason := finish_reason(),
            cancelled := boolean()}.

Functions

apply_chat_template(Model, Request)

-spec apply_chat_template(model(), erllama_model_backend:chat_request()) ->
                             {ok, [non_neg_integer()]} | {error, term()}.

Render a normalised chat request through the model's chat template and tokenise in one step. The Request map carries messages, system, and tools; the per-model template decides where each field lands in the prompt.

Returns {error, not_supported} if the backend does not implement chat templating.
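
An illustrative call; the exact request shape is defined by erllama_model_backend:chat_request(), so the keys shown here are an assumption, and Model stands for a registered model id or pid:

    Request = #{system   => <<"You are a terse assistant.">>,
                messages => [#{role => user, content => <<"What is OTP?">>}],
                tools    => []},
    {ok, PromptTokens} = erllama_model:apply_chat_template(Model, Request),
    {ok, Ref} = erllama_model:infer(Model, PromptTokens, #{}, self()).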

callback_mode()

cancel(Ref)

-spec cancel(reference()) -> ok.

Cancel an in-flight streaming inference. Idempotent and fire-and-forget: returns ok even if the ref is unknown (already finished or never existed). The cancellation is observed at the next inter-token boundary; the model emits a final {erllama_done, Ref, Stats} with cancelled => true after the running decode step completes.
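
A usage sketch: fire a request, cancel it, and wait for the final done message (the params and timeout are illustrative):

    {ok, Ref} = erllama_model:infer(Model, PromptTokens,
                                    #{response_tokens => 256}, self()),
    ok = erllama_model:cancel(Ref),
    %% The running decode step finishes first; the final done message
    %% arrives flagged cancelled => true.
    receive
        {erllama_done, Ref, #{cancelled := true} = Stats} -> Stats
    after 5000 ->
        timeout
    end.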

complete(Model, Prompt)

-spec complete(model(), binary()) -> {ok, binary(), [non_neg_integer()]} | {error, term()}.

complete(Model, Prompt, Opts)

-spec complete(model(), binary(), map()) -> {ok, binary(), [non_neg_integer()]} | {error, term()}.
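
A minimal call sketch for both arities, following the specs above; the prompt text and the option keys are illustrative (the options are assumed to mirror infer_params()):

    {ok, Text, CompletionTokens} =
        erllama_model:complete(Model, <<"The capital of France is">>),
    {ok, Text2, _Tokens2} =
        erllama_model:complete(Model, <<"The capital of France is">>,
                               #{temperature => 0.2, response_tokens => 32}).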

detokenize(Model, Tokens)

-spec detokenize(model(), [non_neg_integer()]) -> {ok, binary()} | {error, term()}.

Detokenise a list of token IDs back to a string. Safe to call concurrently with complete/2,3.

embed(Model, Tokens)

-spec embed(model(), [non_neg_integer()]) -> {ok, [float()]} | {error, term()}.

Compute an embedding vector for the given prompt tokens.

evict(Model)

-spec evict(model()) -> ok.

Request that the model evict its current state. Fires an evict save synchronously if there is anything in the context. Called by erllama_scheduler (future) when GPU memory pressure requires this model to release its context handle. No-op when the model is idle with no live context.

generating/3

idle/3

infer(Model, Tokens, Params, CallerPid)

-spec infer(model(), [non_neg_integer()], infer_params(), pid()) -> {ok, reference()} | {error, term()}.

Streaming inference. Admits a request and immediately returns a unique reference(); tokens are delivered to CallerPid via asynchronous messages:

  • {erllama_token, Ref, binary()} per generated token (text fragment)
  • {erllama_done, Ref, stats()} on normal completion
  • {erllama_error, Ref, term()} on failure

Tokens is the prompt as a list of token ids - tokenisation is the caller's responsibility (use tokenize/2 or apply a chat template first). Params is an infer_params() map.

Calls that arrive while a previous request is in flight are queued FIFO. The reply {ok, Ref} is sent as soon as the call is admitted; streaming events follow once the queue head advances to this request.
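
A caller-side receive loop collecting the stream, using the message shapes listed above (error handling kept minimal):

    collect(Ref, Acc) ->
        receive
            {erllama_token, Ref, Fragment} ->
                collect(Ref, [Fragment | Acc]);
            {erllama_done, Ref, Stats} ->
                {ok, iolist_to_binary(lists:reverse(Acc)), Stats};
            {erllama_error, Ref, Reason} ->
                {error, Reason}
        end.

    %% {ok, Tokens}      = erllama_model:tokenize(Model, <<"Hello">>),
    %% {ok, Ref}         = erllama_model:infer(Model, Tokens, #{response_tokens => 64}, self()),
    %% {ok, Text, Stats} = collect(Ref, []).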

init/1

list_adapters(Model)

-spec list_adapters(model()) -> [#{handle := term(), scale := float()}].

List currently attached adapters as [#{handle => H, scale => F}]. The handle is the same opaque value load_adapter/2 returned.

load_adapter(Model, Path)

-spec load_adapter(model(), file:filename_all()) -> {ok, term()} | {error, term()}.

Load a LoRA adapter from a GGUF file and attach it to the model with scale 1.0. Returns an opaque handle the caller threads into unload_adapter/2 and set_adapter_scale/3. The adapter's sha256 is folded into the effective fingerprint so cache rows produced under this adapter never collide with rows from a different adapter set.
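
A sketch of the adapter lifecycle end to end (the file path and the 0.5 scale are illustrative; the example assumes no other adapters are attached):

    {ok, Handle} = erllama_model:load_adapter(Model, "adapters/my-lora.gguf"),
    %% Attached at scale 1.0, per the description above.
    [#{handle := Handle, scale := 1.0}] = erllama_model:list_adapters(Model),
    ok = erllama_model:set_adapter_scale(Model, Handle, 0.5),
    ok = erllama_model:unload_adapter(Model, Handle),
    ok = erllama_model:unload_adapter(Model, Handle).   %% idempotent: second call also returns ok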

model_info(Model)

-spec model_info(model()) -> model_info().

Snapshot of the model's metadata.

Returns a model_info() map with status, context size, quantisation, backend, fingerprint, and tier. Safe to call from any state - the gen_statem handles it as a common event without disrupting in-flight inference.
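
A caller might pull just a few fields, for example:

    #{status := Status, tier := Tier, context_size := CtxSize} =
        erllama_model:model_info(Model).
    %% Status  :: idle | prefilling | generating
    %% Tier    :: disk | ram_file
    %% CtxSize :: non_neg_integer()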

prefilling/3

set_adapter_scale(Model, Handle, Scale)

-spec set_adapter_scale(model(), term(), float()) -> ok | {error, term()}.

Change an attached adapter's scale. Re-applies the full set on the underlying context.

shutdown(Model)

-spec shutdown(model()) -> ok.

Fire a shutdown save synchronously and return. Called from the application's prep_stop hook so live state survives a graceful restart.

start_link(ModelId, Config)

-spec start_link(binary(), map()) -> {ok, pid()} | {error, term()}.

status(Model)

-spec status(model()) -> idle | prefilling | generating.

stop(Model)

-spec stop(model()) -> ok.

terminate/3

tokenize(Model, Text)

-spec tokenize(model(), binary()) -> {ok, [non_neg_integer()]} | {error, term()}.

Tokenise a string using the model's tokenizer. Returns a list of token IDs. Safe to call concurrently with complete/2,3; tokenisation runs against the model's static vocabulary, not the live KV cache.
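
A round-trip sketch with detokenize/2 (the input text is illustrative; the ids depend on the model's vocabulary):

    {ok, Ids}  = erllama_model:tokenize(Model, <<"hello world">>),
    {ok, Text} = erllama_model:detokenize(Model, Ids).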

unload_adapter(Model, Handle)

-spec unload_adapter(model(), term()) -> ok | {error, term()}.

Detach + free a previously loaded adapter. Idempotent: a second call on the same handle returns ok.