erllama (erllama v0.3.0)

Public façade for the erllama application.

The cache subsystem (erllama_cache) is independent. This module is the user-facing surface for loading and running models.

Typical usage:

  {ok, _Started} = application:ensure_all_started(erllama).
  {ok, Bin} = file:read_file("/srv/models/tinyllama-1.1b-q4_k_m.gguf").
  {ok, Model} = erllama:load_model(#{
      backend => erllama_model_llama,
      model_path => "/srv/models/tinyllama-1.1b-q4_k_m.gguf",
      fingerprint => crypto:hash(sha256, Bin)
  }).
  {ok, #{reply := Reply, finish_key := FK}} =
      erllama:complete(Model, <<"hello">>).
  %% On the next turn, pass FK as parent_key for token-exact warm
  %% restore:
  {ok, #{reply := Reply2}} =
      erllama:complete(Model, <<"hello world">>, #{parent_key => FK}).
  ok = erllama:unload(Model).

Extra cache parameters (tier, tier_srv, quant_type, ctx_params_hash, policy, ...) are optional; the defaults route saves to the RAM tier (erllama_cache_ram). See the loading guide for the full option map and instructions to wire up ram_file / disk tier servers.

Models are dynamic children of erllama_model_sup (simple_one_for_one). A registered name is auto-generated when the caller does not provide an explicit model_id in the config map.

Types

model()

-type model() :: erllama_model:model().

model_id()

-type model_id() :: erllama_registry:model_id().

model_info()

-type model_info() :: erllama_model:model_info().

Functions

apply_chat_template(Model, Request)

-spec apply_chat_template(model(), erllama_model_backend:chat_request()) ->
                             {ok, [erllama_nif:token_id()]} | {error, term()}.

Render a chat request through the model's chat template and tokenise. The Request map carries messages, system, and tools.

cancel(Ref)

-spec cancel(reference()) -> ok.

Cancel an in-flight streaming inference. Idempotent and fire-and-forget; cancellation is observed at the next inter-token boundary. The caller still receives a final {erllama_done, Ref, Stats} with cancelled => true.
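Because cancellation is fire-and-forget, the caller still has to wait for the terminal message. The following is a minimal sketch of a deadline wrapper, not part of erllama itself: the module name `cancel_example` and the function `await/3` are hypothetical, and the cancel function is passed in (normally `fun erllama:cancel/1`) so the loop can be exercised without a loaded model.

```erlang
-module(cancel_example).
-export([await/3]).

%% Wait for the terminal message of a streaming request identified by Ref.
%% If none arrives within DeadlineMs, call CancelFun(Ref) once and keep
%% waiting: the docs state that {erllama_done, Ref, Stats} is still
%% delivered after a cancel.
await(Ref, DeadlineMs, CancelFun) ->
    Deadline = erlang:monotonic_time(millisecond) + DeadlineMs,
    loop(Ref, Deadline, CancelFun).

loop(Ref, Deadline, CancelFun) ->
    Remaining =
        case Deadline of
            infinity -> infinity;
            _ -> max(0, Deadline - erlang:monotonic_time(millisecond))
        end,
    receive
        %% Streamed output is discarded here; this sketch only tracks the end.
        {erllama_token, Ref, _Fragment} -> loop(Ref, Deadline, CancelFun);
        {erllama_thinking_end, Ref, _Sig} -> loop(Ref, Deadline, CancelFun);
        {erllama_done, Ref, Stats} -> {done, Stats};
        {erllama_error, Ref, Reason} -> {error, Reason}
    after Remaining ->
        ok = CancelFun(Ref),
        %% Deadline already fired: wait without a timeout for the final message.
        loop(Ref, infinity, CancelFun)
    end.
```

A call such as `cancel_example:await(Ref, 5000, fun erllama:cancel/1)` would then return `{done, Stats}` either way. After the cancel the loop waits indefinitely, so a production version would probably add a monitor on the model process.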

complete(Model, Prompt)

-spec complete(model(), binary()) -> {ok, erllama_model:completion_result()} | {error, term()}.

Run a completion against a loaded model.

Returns {ok, Result} where Result is an erllama_model:completion_result() map carrying the detokenised reply, the generated token list, the full context tokens, the cache finish_key to use as parent_key on the next turn, and per-request stats.

complete(Model, Prompt, Opts)

-spec complete(model(), binary(), map()) -> {ok, erllama_model:completion_result()} | {error, term()}.

Run a completion against a loaded model with options.

Recognised keys in Opts:

  • response_tokens (non_neg_integer()) — cap on the number of tokens generated. Defaults to the model's n_ctx minus prompt length.
  • parent_key (erllama_cache:cache_key()) — the previous turn's finish_key. Skips the longest-prefix walk and resumes directly from that row.
  • stop_sequences ([binary()]) — caller-supplied stop strings. Generation halts on the first occurrence (by list order) of any element in the accumulated detokenised output; the matched string is trimmed from reply and reported as stop_sequence.

Returns {ok, Result} where Result is a completion_result() map carrying:

  • reply — detokenised reply text (trimmed at the matched stop string when one fired)
  • generated — tokens produced by this request
  • context_tokens — full token list (prompt ++ generated)
  • committed_tokens — length(context_tokens)
  • finish_key — cache key for the full context, or undefined if the finish save was suppressed
  • cache_hit_kind — exact | partial | cold
  • finish_reason — stop | length | cancelled

  • stop_sequence — only present when a stop_sequences entry fired; the binary of the matched stop string
  • stats — per-request timing and cache stats
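As an illustration of reading the result map, here is a small sketch that reduces a completion result to a tagged tuple. The module `completion_summary` and its function names are hypothetical; only keys listed above are read, and the option values in `run/2` are arbitrary examples.

```erlang
-module(completion_summary).
-export([summarise/1, run/2]).

%% Reduce a completion_result() map to a small tagged tuple.
%% stop_sequence is only present when a stop string fired, so it is
%% checked first, without assuming which finish_reason accompanies it.
summarise(#{stop_sequence := Stop, reply := Reply}) ->
    {stopped_on, Stop, Reply};
summarise(#{finish_reason := stop, reply := Reply}) ->
    {finished, Reply};
summarise(#{finish_reason := length, reply := Reply}) ->
    {truncated, Reply};
summarise(#{finish_reason := cancelled, reply := Reply}) ->
    {cancelled, Reply}.

%% Run one completion with a token cap and a stop string, then summarise it.
run(Model, Prompt) ->
    Opts = #{response_tokens => 64, stop_sequences => [<<"\n\n">>]},
    case erllama:complete(Model, Prompt, Opts) of
        {ok, Result} -> {ok, summarise(Result)};
        {error, _} = Err -> Err
    end.
```

Keeping `summarise/1` free of calls into erllama means it can be unit-tested with hand-built maps.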

counters()

-spec counters() -> #{atom() => non_neg_integer()}.

Snapshot of the cache subsystem operational counters.

detokenize(Model, Tokens)

-spec detokenize(model(), [erllama_nif:token_id()]) -> {ok, binary()} | {error, term()}.

Detokenise a list of token ids back to text.

draft_tokens/3

-spec draft_tokens(model_id(), [erllama_nif:token_id()], #{max => pos_integer(), atom() => term()}) ->
                      {ok, [erllama_nif:token_id()]} | {error, term()}.

Synchronous speculative draft. Generates up to max next-token ids from the model given the supplied prefix and returns them as a list. The list may be shorter than max if the model hits EOS or its response_tokens limit first; an empty list is valid.

The implementation reuses infer/4 and collects the {erllama_token_id, Ref, Id} messages it emits, so the path is identical to ordinary streaming inference apart from the synchronous reply. If the 30 s default timeout expires, the underlying request is cancelled and any pending messages are drained so they do not leak into the caller's mailbox.

Used by the upcoming erllama_cluster speculative-decoding strategy to produce K candidate tokens for verification.

embed(Model, Tokens)

-spec embed(model(), [erllama_nif:token_id()]) -> {ok, [float()]} | {error, term()}.

Compute an embedding vector for the given prompt tokens.

evict(Model)

-spec evict(model()) -> ok.

Fire an evict save synchronously and release the model's live KV state. Used by an external memory-pressure scheduler when it wants this model's working set off the heap without unloading the model.

infer(Model, Tokens, Params, CallerPid)

-spec infer(model(), [erllama_nif:token_id()], erllama_model:infer_params(), pid()) ->
               {ok, reference()} | {error, term()}.

Streaming inference. Returns immediately with a reference() that identifies this request; tokens are delivered to CallerPid via async messages:

  • {erllama_token, Ref, Bin :: binary()} — text fragment
  • {erllama_token, Ref, {thinking_delta, Bin :: binary()}} — fragment of an extended-thinking block; only emitted when Params carries thinking => enabled and the backend supports it
  • {erllama_thinking_end, Ref, Sig :: binary()} — close marker for a thinking block, carrying an opaque integrity signature; emitted exactly once per closed block before any subsequent {erllama_token, _, _} message. Sig is <<>> when no signature is available
  • {erllama_done, Ref, Stats} — normal completion
  • {erllama_error, Ref, Reason} — failure

Tokens is the prompt as a list of token ids; tokenisation is the caller's responsibility (use tokenize/2 or apply a chat template first).

When Params carries stop_sequences => [binary()] and one of the strings appears in the accumulated detokenised output, generation halts. The match is trimmed from the streamed {erllama_token, _, _} chunks and the matched value is reported as stop_sequence in the final {erllama_done, _, Stats}.
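The message protocol above can be consumed with a plain selective receive. The sketch below gathers a whole stream into binaries; the module name `stream_collect`, the 60 s idle timeout, and the return shape are assumptions made for illustration, not part of the erllama API.

```erlang
-module(stream_collect).
-export([run/3, collect/2]).

%% Start a streaming request and collect its output synchronously.
%% The 60 s idle timeout is an arbitrary value chosen for this sketch.
run(Model, Tokens, Params) ->
    case erllama:infer(Model, Tokens, Params, self()) of
        {ok, Ref} -> collect(Ref, 60000);
        {error, _} = Err -> Err
    end.

%% Gather every message tagged with Ref until a terminal one arrives.
%% Returns {ok, Text, Thinking, Stats}, {error, Reason}, or
%% {error, idle_timeout} if no message arrives for IdleTimeoutMs.
collect(Ref, IdleTimeoutMs) ->
    collect(Ref, IdleTimeoutMs, [], []).

collect(Ref, T, Text, Think) ->
    receive
        {erllama_token, Ref, {thinking_delta, Bin}} ->
            collect(Ref, T, Text, [Bin | Think]);
        {erllama_token, Ref, Bin} when is_binary(Bin) ->
            collect(Ref, T, [Bin | Text], Think);
        {erllama_thinking_end, Ref, _Sig} ->
            collect(Ref, T, Text, Think);
        {erllama_done, Ref, Stats} ->
            {ok, join(Text), join(Think), Stats};
        {erllama_error, Ref, Reason} ->
            {error, Reason}
    after T ->
        {error, idle_timeout}
    end.

%% Chunks are accumulated in reverse; restore order and flatten.
join(RevChunks) ->
    iolist_to_binary(lists:reverse(RevChunks)).
```

On `{error, idle_timeout}` the request may still be running, so a caller would normally follow up with `erllama:cancel(Ref)`.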

last_cache_hit(ModelId)

-spec last_cache_hit(model_id()) ->
                        #{kind := exact | partial | cold, prefix_len := non_neg_integer()} | undefined.

Lock-free snapshot of the model's most recent cache-hit summary: the kind (exact | partial | cold) and the warm prefix token count. Returns undefined if the model has not admitted any request yet or is not loaded.

A cold kind with prefix_len = 0 means the previous admission took the full cold path; an exact kind means token-exact warm restore; partial means a longest-prefix walk hit at prefix_len tokens.

Used by cache-affinity routers to bias new requests toward the node whose last admission for this model produced the longest warm prefix.

list_adapters(Model)

-spec list_adapters(model()) -> [#{handle := term(), scale := float()}].

List currently attached adapters with their scales.

list_cached_prefixes/2

-spec list_cached_prefixes(model_id(), [erllama_nif:token_id()]) ->
                              {ok, non_neg_integer()} | {error, term()}.

Probe how much of PromptTokens is already cached for ModelId on this node. Returns {ok, MatchLen} where MatchLen is the length of the longest cached prefix of PromptTokens (across all tiers: RAM, ram_file, disk). Returns {ok, 0} if no prefix is cached or the prompt is empty. Returns {error, model_not_loaded} if ModelId is not registered locally.

Lookup uses the model's effective fingerprint, so attached LoRA adapters are honoured: cached rows produced under one adapter set will not match a probe taken under a different adapter set.

Used by the erllama_cluster cache-affinity router to route prompts to the node with the longest matching cached prefix.
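A rough sketch of how a caller might use the probe for routing, assuming each candidate node runs erllama and is reachable over Erlang distribution. This is not the erllama_cluster router; the module `affinity_example` and its functions are hypothetical, and `erpc:call/4` is the standard OTP (23 or later) remote call.

```erlang
-module(affinity_example).
-export([probe/3, best_node/1]).

%% Ask each node how long a cached prefix it holds for PromptTokens.
%% A node that fails the call is reported as {Node, {error, _}} rather
%% than crashing the whole probe.
probe(Nodes, ModelId, PromptTokens) ->
    [{N, safe_call(N, ModelId, PromptTokens)} || N <- Nodes].

safe_call(Node, ModelId, PromptTokens) ->
    try
        erpc:call(Node, erllama, list_cached_prefixes, [ModelId, PromptTokens])
    catch
        Class:Reason -> {error, {Class, Reason}}
    end.

%% Pick the node with the longest match; nodes that errored are ignored.
%% Ties are broken by term order on the node name. Returns none when no
%% node produced an {ok, Len} answer.
best_node(Results) ->
    case [{Len, Node} || {Node, {ok, Len}} <- Results] of
        [] ->
            none;
        Scored ->
            {BestLen, BestNode} = lists:max(Scored),
            {BestNode, BestLen}
    end.
```

For example, `affinity_example:best_node(affinity_example:probe(nodes(), ModelId, Tokens))` would yield `{Node, MatchLen}` or `none`.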

list_models()

-spec list_models() -> [model_info()].

List currently-loaded models as model_info() maps. Each entry includes the model id, status, backend, context size, and quantisation.

load_adapter(Model, Path)

-spec load_adapter(model(), file:filename_all()) -> {ok, term()} | {error, term()}.

Load a LoRA adapter from a GGUF file and attach it to the model with scale 1.0. Returns an opaque handle to pass to set_adapter_scale/3 and unload_adapter/2.

The adapter's file sha256 is folded into the model's effective fingerprint so cache rows produced with the adapter attached never collide with rows from a different attachment set. In-flight requests keep their original fingerprint snapshot; the new value takes effect from the next request.

load_model(Config)

-spec load_model(map()) -> {ok, model_id()} | {error, term()}.

Load a model with an auto-generated id.

load_model(ModelId, Config)

-spec load_model(model_id(), map()) -> {ok, model_id()} | {error, term()}.

Load a model with an explicit id.

model_info(Model)

-spec model_info(model()) -> model_info().

Inspect a single loaded model. Returns the same map shape list_models/0 produces. Crashes with noproc if the model is not loaded.

models()

-spec models() -> [pid()].

List currently-loaded model pids (low-level supervisor view). Most callers want list_models/0, which returns metadata maps.

pending_len(ModelId)

-spec pending_len(model_id()) -> non_neg_integer().

Lock-free per-model snapshot of the gen_statem's pending FIFO length — i.e. how many complete/2,3, prefill_only/2, and infer/4 calls are queued behind whatever the model is currently running. Returns 0 if the model is idle or not loaded.

Reads a named public ETS row written by the model on every queue mutation; the call does not cross the model gen_statem, so it returns instantly even while a decode step is in flight. That matters: the whole point of asking "is this model busy?" is to answer without serialising behind the work you are probing.

Used by erllama_cluster routers to bin-pack requests onto the least-loaded node.

phase(ModelId)

-spec phase(model_id()) -> idle | prefilling | generating.

Lock-free per-model phase snapshot. Returns idle, prefilling, or generating; falls back to idle if the model is not loaded.

Like pending_len/1, this reads a public ETS row without crossing the model gen_statem.

prefill_only(Model, PromptTokens)

-spec prefill_only(model(), [erllama_nif:token_id()]) ->
                      {ok, erllama_model:prefill_result()} | {error, term()}.

Decode a prompt into KV state and fire a finish save without sampling any output tokens. Returns the finish_key so the caller can hand it as parent_key on a subsequent complete/3 or infer/4 for token-exact warm restore.

PromptTokens is the prompt as a list of token ids. Tokenisation is the caller's responsibility (use tokenize/2 or apply a chat template first).

finish_key is undefined if the finish save was suppressed because the token count is below the configured min_tokens.
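One possible use is pre-warming a shared system prompt before the first user turn. The sketch below assumes that prefill_result() is a map with a finish_key entry; the text above says the key is returned but does not show the container shape, so treat that pattern as a guess. The module `prewarm_example` and its functions are hypothetical.

```erlang
-module(prewarm_example).
-export([prewarm/2, opts_from/1]).

%% Tokenise a shared prompt prefix and decode it into the cache.
%% ASSUMPTION: prefill_result() is a map containing finish_key.
prewarm(Model, PrefixText) ->
    case erllama:tokenize(Model, PrefixText) of
        {ok, Tokens} ->
            case erllama:prefill_only(Model, Tokens) of
                {ok, #{finish_key := Key}} -> {ok, Key};
                {error, _} = Err -> Err
            end;
        {error, _} = Err ->
            Err
    end.

%% Build complete/3 options from a finish key. A suppressed save yields
%% undefined, in which case no parent_key is passed and the normal
%% longest-prefix walk applies.
opts_from(undefined) -> #{};
opts_from(Key) -> #{parent_key => Key}.
```

A follow-up turn could then look like `erllama:complete(Model, <<Prefix/binary, UserTurn/binary>>, prewarm_example:opts_from(Key))`.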

queue_depth()

-spec queue_depth() -> non_neg_integer().

O(1) snapshot of currently-admitted streaming inference requests across all loaded models. Counts only rows registered in erllama_inflight from the infer/4 admission path; pending requests queued inside an individual model gen_statem are not included.

Used by the erllama_cluster load balancer (least_loaded, power_of_two strategies) as a more accurate alternative to client-side outgoing-request counters.
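To make the idea concrete, here is an independent sketch of a power-of-two-choices pick driven by a per-node depth function. It is not the erllama_cluster implementation; the module `p2c_example` and all of its functions are hypothetical, and the remote call assumes `erpc` (OTP 23 or later) and an erllama instance on every node.

```erlang
-module(p2c_example).
-export([pick/2, choose/3, remote_depth/1]).

%% Sample two distinct nodes uniformly at random and keep the one with
%% the smaller depth. DepthFun maps a node to a non-negative integer.
pick([Only], _DepthFun) ->
    Only;
pick(Nodes, DepthFun) when length(Nodes) >= 2 ->
    N = length(Nodes),
    I = rand:uniform(N),
    %% Draw the second index from the remaining N - 1 slots, skipping I.
    J0 = rand:uniform(N - 1),
    J = if J0 >= I -> J0 + 1; true -> J0 end,
    choose(lists:nth(I, Nodes), lists:nth(J, Nodes), DepthFun).

%% Keep the less loaded of two nodes; the first wins ties.
choose(A, B, DepthFun) ->
    case DepthFun(A) =< DepthFun(B) of
        true -> A;
        false -> B
    end.

%% Depth function backed by queue_depth/0 on the remote node.
%% Raises if the node is unreachable; callers may want to catch that.
remote_depth(Node) ->
    erpc:call(Node, erllama, queue_depth, []).
```

A caller would write something like `p2c_example:pick(Nodes, fun p2c_example:remote_depth/1)`. Only two remote calls are made per decision, which is the usual reason to prefer this over scanning every node.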

queue_depth(ModelId)

-spec queue_depth(model_id()) -> non_neg_integer().

Per-model inflight count. Counts only admitted streaming requests (infer/4); for pending FIFO depth (calls queued inside the model gen_statem behind an in-flight request) use pending_len/1.

Returns 0 if the model is not loaded.

set_adapter_scale(Model, Handle, Scale)

-spec set_adapter_scale(model(), term(), float()) -> ok | {error, term()}.

Change an attached adapter's scale. The scale is folded into the effective fingerprint, so changes split the cache namespace.

shutdown(Model)

-spec shutdown(model()) -> ok.

Fire a shutdown save synchronously and return. Called from a release stop hook; bounded by evict_save_timeout_ms.

status(Model)

-spec status(model()) -> idle | prefilling | generating.

Current model state. idle means no request is in flight; prefilling and generating are the two active phases.

tokenize(Model, Text)

-spec tokenize(model(), binary()) -> {ok, [erllama_nif:token_id()]} | {error, term()}.

Tokenise text against a loaded model's tokenizer. Safe to call concurrently with complete/2,3.

unload(Model)

-spec unload(model()) -> ok | {error, term()}.

Unload a model. Terminates the gen_statem cleanly.

unload_adapter(Model, Handle)

-spec unload_adapter(model(), term()) -> ok | {error, term()}.

Detach and free a previously loaded adapter. Idempotent.

unload_model(Model)

-spec unload_model(model()) -> ok | {error, term()}.

Alias for unload/1. Provided for API symmetry with load_model/1,2 and the OpenAI/Ollama-style naming used by downstream HTTP servers.

verify(ModelId, PrefixTokens, Candidates, K)

-spec verify(model_id(), [erllama_nif:token_id()], [erllama_nif:token_id()], pos_integer()) ->
                {ok, non_neg_integer(), erllama_nif:token_id() | eos} | {error, term()}.

Speculative-decoding verifier. Runs PrefixTokens ++ Candidates (truncated to K candidates) through the model in a single forward pass with per-position argmax, and returns the length of the longest accepted candidate prefix together with the model's own next token after it.

Behaviour:

  • The verifier model gen_statem is locked for the duration of the call; concurrent infer/4 requests on the same model return {error, busy}. Verify only proceeds when the model is idle.
  • The context's KV cells are mutated during the forward pass but restored before return: post-call the seq_id=0 KV ends at the same length the caller had before, with logits buffered for the last prefix token (so a follow-up decode_one is immediately valid). The caller's pre-call decode_ready flag is not preserved; after verify the context is always ready to sample.
  • An empty PrefixTokens returns {error, empty_prefix} because the acceptance and NextToken indexing both require at least one prefix token.
  • NextToken may be the atom eos if the verifier's argmax at the relevant position is an end-of-generation token; treat it as the signal to terminate the decode loop.

Used by the upcoming erllama_cluster speculative-decoding strategy after draft_tokens/3.
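A single draft-then-verify round could be sketched as follows. This is an illustration under assumptions, not the erllama_cluster strategy: the module `spec_step` and its functions are hypothetical, and the sketch assumes the accepted length counts leading elements of Candidates, as the description above suggests.

```erlang
-module(spec_step).
-export([merge/3, step/4]).

%% Combine a verify/4 result with the candidates that were checked.
%% The first Accepted candidates are kept, followed by the verifier's
%% own next token unless that token is the atom eos.
merge(Candidates, Accepted, eos) ->
    {eos, lists:sublist(Candidates, Accepted)};
merge(Candidates, Accepted, Next) ->
    {more, lists:sublist(Candidates, Accepted) ++ [Next]}.

%% One round: draft up to K tokens on DraftId, verify them on VerifierId.
%% Returns {ok, {more | eos, NewTokens}} or the first error encountered.
step(DraftId, VerifierId, Prefix, K) ->
    case erllama:draft_tokens(DraftId, Prefix, #{max => K}) of
        {ok, Candidates} ->
            case erllama:verify(VerifierId, Prefix, Candidates, K) of
                {ok, Accepted, Next} ->
                    {ok, merge(Candidates, Accepted, Next)};
                {error, _} = Err ->
                    Err
            end;
        {error, _} = Err ->
            Err
    end.
```

A decode loop would append `NewTokens` to `Prefix` and call `step/4` again while the tag is `more`. Each round yields at least one token when the tag is `more`, because the verifier's own next token is always appended.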

vram_info()

-spec vram_info() ->
                   {ok,
                    #{total_b := non_neg_integer(),
                      free_b := non_neg_integer(),
                      used_b := non_neg_integer()}} |
                   {error, atom()}.

VRAM probe across all loaded ggml backends. Sums free / total bytes across non-CPU devices (GPU, integrated GPU, accelerator). Returns {error, no_gpu} on a CPU-only build rather than reporting a fake number; the caller should fall back to a system memory probe of its own choosing in that case.

Used by the erllama_cluster scheduler for bin-packing model placement.
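As a small sketch of consuming the probe, the helper below decides whether a required number of bytes fits into free VRAM and surfaces the CPU-only case explicitly. The module `vram_fit`, its function names, and the returned atoms are hypothetical.

```erlang
-module(vram_fit).
-export([fits/2, check/1]).

%% Classify a vram_info/0 result against a byte requirement.
fits({ok, #{free_b := Free}}, Required) when Free >= Required ->
    yes;
fits({ok, #{free_b := _}}, _Required) ->
    no;
fits({error, no_gpu}, _Required) ->
    %% CPU-only build: the caller should consult its own system memory probe.
    cpu_only;
fits({error, Other}, _Required) ->
    {unknown, Other}.

%% Probe the local node and classify the result.
check(RequiredBytes) ->
    fits(erllama:vram_info(), RequiredBytes).
```

Separating `fits/2` from the probe keeps the decision logic testable on machines without a GPU.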