erllama (erllama v0.1.2)
Public façade for the erllama application.
The cache subsystem (erllama_cache) is independent. This module
is the user-facing surface for loading and running models.
Typical usage:
{ok, _Started} = application:ensure_all_started(erllama).
{ok, Bin} = file:read_file("/srv/models/tinyllama-1.1b-q4_k_m.gguf").
{ok, Model} = erllama:load_model(#{
    backend => erllama_model_llama,
    model_path => "/srv/models/tinyllama-1.1b-q4_k_m.gguf",
    fingerprint => crypto:hash(sha256, Bin)
}).
{ok, Reply, _Tokens} = erllama:complete(Model, <<"hello">>).
ok = erllama:unload(Model).
Extra cache parameters (tier, tier_srv, quant_type,
ctx_params_hash, policy, ...) are optional; the defaults route
saves to the RAM tier (erllama_cache_ram). See the loading guide
for the full option map and instructions to wire up
ram_file / disk tier servers.
Models are dynamic children of erllama_model_sup (simple_one_for_one).
A registered name is auto-generated when the caller does not provide
an explicit model_id in the config map.
Types
-type model() :: erllama_model:model().
-type model_id() :: erllama_registry:model_id().
-type model_info() :: erllama_model:model_info().
Functions
-spec apply_chat_template(model(), erllama_model_backend:chat_request()) -> {ok, [erllama_nif:token_id()]} | {error, term()}.
Render a chat request through the model's chat template and
tokenise. The Request map carries messages, system, and tools.
-spec cancel(reference()) -> ok.
Cancel an in-flight streaming inference. Idempotent and
fire-and-forget; cancellation is observed at the next inter-token
boundary. The caller still receives a final {erllama_done, Ref, Stats} with cancelled => true.
-spec complete(model(), binary()) -> {ok, binary(), [erllama_nif:token_id()]} | {error, term()}.
Run a completion against a loaded model.
-spec complete(model(), binary(), map()) -> {ok, binary(), [erllama_nif:token_id()]} | {error, term()}.
Run a completion against a loaded model with options.
Recognised keys in Opts:
- response_tokens (non_neg_integer()): cap on the number of tokens generated. Defaults to the model's n_ctx minus the prompt length.
- parent_key (erllama_cache:cache_key()): the previous turn's finish-save key. Skips the longest-prefix walk and resumes directly from that row.
Returns {ok, ReplyText, FullTokenList} on success.
-spec counters() -> #{atom() => non_neg_integer()}.
Snapshot of the cache subsystem operational counters.
-spec detokenize(model(), [erllama_nif:token_id()]) -> {ok, binary()} | {error, term()}.
Detokenise a list of token ids back to text.
-spec draft_tokens(model_id(), [erllama_nif:token_id()], #{max => pos_integer(), atom() => term()}) -> {ok, [erllama_nif:token_id()]} | {error, term()}.
Synchronous speculative draft. Generates up to max next-token
ids from the model given the supplied prefix and returns them as
a list. The list may be shorter than max if the model hits EOS
or its response_tokens limit first; an empty list is valid.
The implementation reuses infer/4 and collects the
{erllama_token_id, Ref, Id} messages it emits, so the path is
identical to ordinary streaming inference apart from the
synchronous reply. The default timeout is 30 s; when it expires,
the underlying request is cancelled and any pending messages are
drained so they do not leak into the caller's mailbox.
Used by the upcoming erllama_cluster speculative-decoding strategy to produce K candidate tokens for verification.
-spec embed(model(), [erllama_nif:token_id()]) -> {ok, [float()]} | {error, term()}.
Compute an embedding vector for the given prompt tokens.
-spec evict(model()) -> ok.
Fire an evict save synchronously and release the model's live KV
state. Used by an external memory-pressure scheduler when it wants
this model's working set off the heap without unloading the model.
-spec infer(model(), [erllama_nif:token_id()], erllama_model:infer_params(), pid()) -> {ok, reference()} | {error, term()}.
Streaming inference. Returns immediately with a reference() that
identifies this request; tokens are delivered to CallerPid via
async messages:
- {erllama_token, Ref, Bin :: binary()}: text fragment
- {erllama_done, Ref, Stats}: normal completion
- {erllama_error, Ref, Reason}: failure
Tokens is the prompt as a list of token ids; tokenisation is the
caller's responsibility (use tokenize/2 or apply a chat template
first).
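The message protocol above can be consumed with an ordinary selective receive. The following is a minimal sketch, not part of erllama: the module name erllama_stream_example, the per-message idle timeout, and the return shapes are assumptions made for illustration. Params is passed through unchanged because the contents of erllama_model:infer_params() are not described here.

```erlang
-module(erllama_stream_example).
-export([stream/4, collect/2]).

%% Tokenise Prompt, start a streaming request, and block until it ends.
%% Params is handed to infer/4 untouched; its shape is not assumed here.
stream(Model, Prompt, Params, IdleTimeout) ->
    case erllama:tokenize(Model, Prompt) of
        {ok, Tokens} ->
            case erllama:infer(Model, Tokens, Params, self()) of
                {ok, Ref} -> collect(Ref, IdleTimeout);
                {error, _} = InferErr -> InferErr
            end;
        {error, _} = TokErr ->
            TokErr
    end.

%% Gather the text fragments tagged with Ref until the request completes.
collect(Ref, IdleTimeout) ->
    collect(Ref, IdleTimeout, []).

collect(Ref, IdleTimeout, Acc) ->
    receive
        {erllama_token, Ref, Bin} when is_binary(Bin) ->
            collect(Ref, IdleTimeout, [Bin | Acc]);
        {erllama_done, Ref, Stats} ->
            {ok, iolist_to_binary(lists:reverse(Acc)), Stats};
        {erllama_error, Ref, Reason} ->
            {error, Reason}
    after IdleTimeout ->
        %% IdleTimeout bounds the wait for each message, not the whole reply.
        {error, {idle_timeout, iolist_to_binary(lists:reverse(Acc))}}
    end.
```

On idle_timeout the request is still running. A caller would typically follow up with cancel/1 and keep receiving until the final {erllama_done, Ref, Stats} arrives, so that no messages are left in its mailbox.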
List currently attached adapters with their scales.
-spec list_cached_prefixes(model_id(), [erllama_nif:token_id()]) -> {ok, non_neg_integer()} | {error, term()}.
Probe how much of PromptTokens is already cached for ModelId
on this node. Returns {ok, MatchLen} where MatchLen is the
length of the longest cached prefix of PromptTokens (across all
tiers: RAM, ram_file, disk). Returns {ok, 0} if no prefix is
cached or the prompt is empty. Returns {error, model_not_loaded}
if ModelId is not registered locally.
Lookup uses the model's effective fingerprint, so attached LoRA adapters are honoured: cached rows produced under one adapter set will not match a probe taken under a different adapter set.
Used by the erllama_cluster cache-affinity router to route
prompts to the node with the longest matching cached prefix.
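As an illustration of cache-affinity routing, the probe can be fanned out with erpc:multicall/5 and the node with the longest match selected. This is a sketch of the general idea only, not the erllama_cluster implementation; the module name, the 5 s timeout, and the tie-breaking rule (first node wins) are assumptions.

```erlang
-module(erllama_affinity_example).
-export([pick_node/3, best/2]).

%% Ask every node how long a prefix of PromptTokens it already has cached.
pick_node(Nodes, ModelId, PromptTokens) ->
    Replies = erpc:multicall(Nodes, erllama, list_cached_prefixes,
                             [ModelId, PromptTokens], 5000),
    best(Nodes, Replies).

%% erpc:multicall/5 wraps each successful return value in {ok, Value}, so a
%% successful probe appears as {ok, {ok, MatchLen}}. Anything else (node
%% unreachable, {error, model_not_loaded}, ...) is skipped.
best(Nodes, Replies) ->
    Scored = [{L, N} || {N, {ok, {ok, L}}} <- lists:zip(Nodes, Replies)],
    case Scored of
        [] ->
            none;
        [First | Rest] ->
            {Len, Node} = lists:foldl(fun longer/2, First, Rest),
            {ok, Node, Len}
    end.

%% Keep the current best unless the new entry is strictly longer.
longer({Len, _} = New, {BestLen, _}) when Len > BestLen -> New;
longer(_, Best) -> Best.
```

A result of {ok, Node, 0} means no node holds a useful prefix; a router would then likely fall back to a load-based choice.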
-spec list_models() -> [model_info()].
List currently-loaded models as model_info() maps. Each entry
includes the model id, status, backend, context size, and
quantisation.
-spec load_adapter(model(), file:filename_all()) -> {ok, term()} | {error, term()}.
Load a LoRA adapter from a GGUF file and attach it to the model with
scale 1.0. Returns an opaque handle to pass to set_adapter_scale/3
and unload_adapter/2.
The adapter's file sha256 is folded into the model's effective fingerprint so cache rows produced with the adapter attached never collide with rows from a different attachment set. In-flight requests keep their original fingerprint snapshot; the new value takes effect from the next request.
Load a model with an auto-generated id.
Load a model with an explicit id.
-spec model_info(model()) -> model_info().
Inspect a single loaded model. Returns the same map shape
list_models/0 produces. Crashes with noproc if the model is not
loaded.
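Callers that cannot tolerate the crash can wrap the call. The sketch below is an assumption-laden illustration: the exact exit reason is not specified above, so it handles both the bare atom noproc and the {noproc, _} tuple that OTP behaviours commonly produce. The module and function names are hypothetical.

```erlang
-module(erllama_info_example).
-export([safe_info/1, guard_noproc/1]).

%% Return {ok, Info} for a loaded model, {error, not_loaded} otherwise.
safe_info(Model) ->
    guard_noproc(fun() -> erllama:model_info(Model) end).

%% Run Fun and translate a noproc exit into an error tuple.
%% Other exits and errors are left to propagate.
guard_noproc(Fun) ->
    try
        {ok, Fun()}
    catch
        exit:noproc -> {error, not_loaded};
        exit:{noproc, _} -> {error, not_loaded}
    end.
```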
-spec models() -> [pid()].
List currently-loaded model pids (low-level supervisor view). Most
callers want list_models/0, which returns metadata maps.
-spec queue_depth() -> non_neg_integer().
O(1) snapshot of currently-admitted streaming inference requests
across all loaded models. Counts only rows registered in
erllama_inflight from the infer/4 admission path; pending
requests queued inside an individual model gen_statem are not
included.
Used by the erllama_cluster load balancer (least_loaded,
power_of_two strategies) as a more accurate alternative to
client-side outgoing-request counters.
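To make the power_of_two idea concrete, here is a sketch of a generic power-of-two-choices picker built on queue_depth/0. It is not the erllama_cluster code; the module name, the 1 s probe timeout, and the policy of treating a failed probe as infinitely loaded are assumptions.

```erlang
-module(erllama_p2c_example).
-export([pick/1, less_loaded/2]).

%% Sample two distinct nodes at random, probe both, and return the one
%% with fewer admitted requests.
pick([Node]) ->
    Node;
pick(Nodes) when length(Nodes) >= 2 ->
    [A, B | _] = shuffle(Nodes),
    less_loaded({A, depth(A)}, {B, depth(B)}).

%% A failed probe yields the atom infinity, which sorts after every
%% integer in Erlang term order, so that node loses the comparison.
depth(Node) ->
    try
        erpc:call(Node, erllama, queue_depth, [], 1000)
    catch
        _:_ -> infinity
    end.

%% Ties go to the first node.
less_loaded({A, DepthA}, {_B, DepthB}) when DepthA =< DepthB -> A;
less_loaded(_, {B, _}) -> B.

%% Shuffle by sorting on random keys.
shuffle(List) ->
    [N || {_, N} <- lists:sort([{rand:uniform(), M} || M <- List])].
```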
Change an attached adapter's scale. The scale is folded into the effective fingerprint, so changes split the cache namespace.
-spec shutdown(model()) -> ok.
Fire a shutdown save synchronously and return. Called from a
release stop hook; bounded by evict_save_timeout_ms.
-spec status(model()) -> idle | prefilling | generating.
Current model state. idle means no request is in flight;
prefilling and generating are the two active phases.
-spec tokenize(model(), binary()) -> {ok, [erllama_nif:token_id()]} | {error, term()}.
Tokenise text against a loaded model's tokenizer. Safe to call
concurrently with complete/2,3.
Unload a model. Terminates the gen_statem cleanly.
Detach and free a previously loaded adapter. Idempotent.
Alias for unload/1. Provided for API symmetry with load_model/1,2
and the OpenAI/Ollama-style naming used by downstream HTTP servers.
-spec verify(model_id(), [erllama_nif:token_id()], [erllama_nif:token_id()], pos_integer()) -> {ok, non_neg_integer(), erllama_nif:token_id() | eos} | {error, term()}.
Speculative-decoding verifier. Runs PrefixTokens ++ Candidates
(truncated to K candidates) through the model in a single
forward pass with per-position argmax, returns the longest
accepted prefix length and the model's own next token after it.
Behaviour:
- The verifier model gen_statem is locked for the duration of
  the call; concurrent infer/4 requests on the same model return
  {error, busy}. Verify only proceeds when the model is idle.
- The context's KV cells are mutated during the forward pass
  but restored before return: post-call the seq_id=0 KV ends at
  the same length the caller had before, with logits buffered
  for the last prefix token (so a follow-up decode_one is
  immediately valid). The caller's pre-call decode_ready flag is
  not preserved; after verify the context is always ready to sample.
- An empty PrefixTokens returns {error, empty_prefix} because the
  acceptance and NextToken indexing both require at least one
  prefix token.
- NextToken may be the atom eos if the verifier's argmax at the
  relevant position is an end-of-generation token; map it to
  terminate the decode loop.
Used by the upcoming erllama_cluster speculative-decoding
strategy after draft_tokens/3.
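One draft-then-verify round might be wired together as follows. This is a sketch under stated assumptions rather than the erllama_cluster strategy: it assumes the accepted length counts candidates (not prefix tokens), that two models are already loaded under the hypothetical ids DraftId and VerifierId, and that an empty draft should simply be reported back to the caller.

```erlang
-module(erllama_spec_example).
-export([step/4, apply_verdict/4]).

%% Draft up to K tokens with one model, then verify them with another.
step(DraftId, VerifierId, Prefix, K) when Prefix =/= [], K > 0 ->
    case erllama:draft_tokens(DraftId, Prefix, #{max => K}) of
        {ok, []} ->
            %% Nothing was proposed; the next move is left to the caller.
            {stalled, Prefix};
        {ok, Candidates} ->
            case erllama:verify(VerifierId, Prefix, Candidates, K) of
                {ok, Accepted, Next} ->
                    apply_verdict(Prefix, Candidates, Accepted, Next);
                {error, _} = VerifyErr ->
                    %% Includes {error, busy} when the verifier is not idle.
                    VerifyErr
            end;
        {error, _} = DraftErr ->
            DraftErr
    end.

%% Pure bookkeeping: keep the first Accepted candidates, then either stop
%% on eos or append the verifier's own next token.
apply_verdict(Prefix, Candidates, Accepted, eos) ->
    {done, Prefix ++ lists:sublist(Candidates, Accepted)};
apply_verdict(Prefix, Candidates, Accepted, Next) ->
    {continue, Prefix ++ lists:sublist(Candidates, Accepted) ++ [Next]}.
```

A decode loop would call step/4 repeatedly with the returned token list while the result is {continue, Tokens}. Even when no candidate is accepted, each round still advances by the verifier's own token.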
-spec vram_info() -> {ok, #{total_b := non_neg_integer(), free_b := non_neg_integer(), used_b := non_neg_integer()}} | {error, atom()}.
VRAM probe across all loaded ggml backends. Sums free / total bytes
across non-CPU devices (GPU, integrated GPU, accelerator). Returns
{error, no_gpu} on a CPU-only build rather than reporting a fake
number; the caller should fall back to a system memory probe of its
own choosing in that case.
Used by the erllama_cluster scheduler for bin-packing model
placement.
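A placement check built on this probe could look like the sketch below. The module name, the result atoms, and the notion of a caller-supplied NeededBytes estimate are assumptions for illustration; the CPU-only branch deliberately reports the condition and leaves the choice of system memory probe to the caller.

```erlang
-module(erllama_vram_example).
-export([fits/1, decide/2]).

%% Check whether NeededBytes of VRAM are currently free on this node.
fits(NeededBytes) ->
    decide(erllama:vram_info(), NeededBytes).

%% Pure decision over the vram_info/0 result shape given in the spec.
decide({ok, #{free_b := Free}}, Needed) when Free >= Needed ->
    gpu;
decide({ok, #{free_b := Free}}, Needed) ->
    {insufficient_vram, Needed - Free};
decide({error, no_gpu}, _Needed) ->
    %% CPU-only build: the caller picks its own system memory probe.
    cpu_only;
decide({error, Other}, _Needed) ->
    {error, Other}.
```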