erllama (erllama v0.1.2)

Public façade for the erllama application.

The cache subsystem (erllama_cache) is independent. This module is the user-facing surface for loading and running models.

Typical usage:

  {ok, _Started} = application:ensure_all_started(erllama).
  {ok, Bin} = file:read_file("/srv/models/tinyllama-1.1b-q4_k_m.gguf").
  {ok, Model} = erllama:load_model(#{
      backend => erllama_model_llama,
      model_path => "/srv/models/tinyllama-1.1b-q4_k_m.gguf",
      fingerprint => crypto:hash(sha256, Bin)
  }).
  {ok, Reply, _Tokens} = erllama:complete(Model, <<"hello">>).
  ok = erllama:unload(Model).
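
The example above reads the whole GGUF file into memory just to hash it, which can be costly for multi-gigabyte models. The following is a minimal sketch of computing the same SHA-256 digest in fixed-size chunks with the standard crypto streaming API. The module name gguf_fingerprint, the function sha256_file/1, and the 1 MiB chunk size are assumptions of this sketch and are not part of erllama.

```erlang
-module(gguf_fingerprint).
-export([sha256_file/1]).

-define(CHUNK, 1048576). % read 1 MiB at a time (arbitrary choice)

%% Stream a file through SHA-256 without loading it all into memory.
%% Hypothetical helper, not part of the erllama API.
-spec sha256_file(file:filename_all()) -> {ok, binary()} | {error, term()}.
sha256_file(Path) ->
    case file:open(Path, [read, raw, binary]) of
        {ok, Fd} ->
            try
                hash_loop(Fd, crypto:hash_init(sha256))
            after
                file:close(Fd)
            end;
        {error, _} = Err ->
            Err
    end.

%% Feed successive chunks into the hash state until end of file.
hash_loop(Fd, State) ->
    case file:read(Fd, ?CHUNK) of
        {ok, Data} -> hash_loop(Fd, crypto:hash_update(State, Data));
        eof -> {ok, crypto:hash_final(State)};
        {error, _} = Err -> Err
    end.
```

The resulting binary equals crypto:hash(sha256, Bin) over the full file contents, so it could be passed as the fingerprint value in the config map.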

Extra cache parameters (tier, tier_srv, quant_type, ctx_params_hash, policy, ...) are optional; the defaults route saves to the RAM tier (erllama_cache_ram). See the loading guide for the full option map and for instructions on wiring up the ram_file and disk tier servers.

Models are dynamic children of erllama_model_sup (simple_one_for_one). A registered name is auto-generated when the caller does not provide an explicit model_id in the config map.

Types

model()

-type model() :: erllama_model:model().

model_id()

-type model_id() :: erllama_registry:model_id().

model_info()

-type model_info() :: erllama_model:model_info().

Functions

apply_chat_template(Model, Request)

-spec apply_chat_template(model(), erllama_model_backend:chat_request()) ->
                             {ok, [erllama_nif:token_id()]} | {error, term()}.

Render a chat request through the model's chat template and tokenise. The Request map carries messages, system, and tools.

cancel(Ref)

-spec cancel(reference()) -> ok.

Cancel an in-flight streaming inference. Idempotent and fire-and-forget; cancellation is observed at the next inter-token boundary. The caller still receives a final {erllama_done, Ref, Stats} with cancelled => true.
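
Because cancellation is fire-and-forget, a caller that wants a clean mailbox still has to consume whatever was already sent for Ref. The sketch below shows one way to do that after calling cancel/1, using the message shapes documented for infer/4. The module name stream_drain, the function drain/2, and the per-message timeout are assumptions of this sketch.

```erlang
-module(stream_drain).
-export([drain/2]).

%% Intended to be called right after erllama:cancel(Ref).
%% Discards text fragments already in flight for Ref and returns the
%% final Stats map. Timeout (ms) applies per message and is an
%% assumption of this sketch, not an erllama setting.
-spec drain(reference(), timeout()) -> {ok, term()} | {error, term()}.
drain(Ref, Timeout) ->
    receive
        {erllama_done, Ref, Stats} ->
            {ok, Stats};
        {erllama_error, Ref, Reason} ->
            {error, Reason};
        {erllama_token, Ref, _Bin} ->
            drain(Ref, Timeout)
    after Timeout ->
        {error, timeout}
    end.
```

A caller would typically check that the returned Stats carries cancelled => true before treating the request as cancelled rather than completed.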

complete(Model, Prompt)

-spec complete(model(), binary()) -> {ok, binary(), [erllama_nif:token_id()]} | {error, term()}.

Run a completion against a loaded model.

complete(Model, Prompt, Opts)

-spec complete(model(), binary(), map()) -> {ok, binary(), [erllama_nif:token_id()]} | {error, term()}.

Run a completion against a loaded model with options.

Recognised keys in Opts:

  • response_tokens (non_neg_integer()) — cap on the number of tokens generated. Defaults to the model's n_ctx minus prompt length.
  • parent_key (erllama_cache:cache_key()) — the previous turn's finish-save key. Skips the longest-prefix walk and resumes directly from that row.

Returns {ok, ReplyText, FullTokenList} on success.

counters()

-spec counters() -> #{atom() => non_neg_integer()}.

Snapshot of the cache subsystem operational counters.

detokenize(Model, Tokens)

-spec detokenize(model(), [erllama_nif:token_id()]) -> {ok, binary()} | {error, term()}.

Detokenise a list of token ids back to text.

draft_tokens(ModelId, PrefixTokens, Opts)

-spec draft_tokens(model_id(), [erllama_nif:token_id()], #{max => pos_integer(), atom() => term()}) ->
                      {ok, [erllama_nif:token_id()]} | {error, term()}.

Synchronous speculative draft. Generates up to max next-token ids from the model given the supplied prefix and returns them as a list. The list may be shorter than max if the model hits EOS or its response_tokens limit first; an empty list is valid.

The implementation reuses infer/4 and collects the {erllama_token_id, Ref, Id} messages it emits, so the path is identical to ordinary streaming inference apart from the synchronous reply. If the default 30 s timeout expires, the underlying request is cancelled and any pending messages are drained so they do not leak into the caller's mailbox.

Used by the upcoming erllama_cluster speculative-decoding strategy to produce K candidate tokens for verification.

embed(Model, Tokens)

-spec embed(model(), [erllama_nif:token_id()]) -> {ok, [float()]} | {error, term()}.

Compute an embedding vector for the given prompt tokens.

evict(Model)

-spec evict(model()) -> ok.

Fire an evict save synchronously and release the model's live KV state. Used by an external memory-pressure scheduler when it wants this model's working set off the heap without unloading the model.

infer(Model, Tokens, Params, CallerPid)

-spec infer(model(), [erllama_nif:token_id()], erllama_model:infer_params(), pid()) ->
               {ok, reference()} | {error, term()}.

Streaming inference. Returns immediately with a reference() that identifies this request; tokens are delivered to CallerPid via async messages:

  • {erllama_token, Ref, Bin :: binary()} — text fragment
  • {erllama_done, Ref, Stats} — normal completion
  • {erllama_error, Ref, Reason} — failure

Tokens is the prompt as a list of token ids; tokenisation is the caller's responsibility (use tokenize/2 or apply a chat template first).
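
The message protocol above can be consumed with a plain selective receive. The following is a minimal sketch of a collector that joins the text fragments for one request into a single binary. The module name stream_collect, the function collect/2, and the per-message timeout are assumptions of this sketch; only the three message shapes come from the description above.

```erlang
-module(stream_collect).
-export([collect/2]).

%% Ref is the reference returned by erllama:infer/4.
%% Timeout (ms) is the longest wait for the next message; it is an
%% assumption of this sketch, not an erllama setting.
-spec collect(reference(), timeout()) ->
          {ok, binary(), term()} | {error, term()}.
collect(Ref, Timeout) ->
    collect(Ref, Timeout, []).

%% Accumulate fragments in reverse and join them once on completion.
collect(Ref, Timeout, Acc) ->
    receive
        {erllama_token, Ref, Bin} when is_binary(Bin) ->
            collect(Ref, Timeout, [Bin | Acc]);
        {erllama_done, Ref, Stats} ->
            {ok, iolist_to_binary(lists:reverse(Acc)), Stats};
        {erllama_error, Ref, Reason} ->
            {error, Reason}
    after Timeout ->
        {error, timeout}
    end.
```

Matching on Ref in every clause keeps messages from other concurrent requests in the mailbox untouched. On {error, timeout} the request is still running, so a caller would normally follow up with cancel/1.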

list_adapters(Model)

-spec list_adapters(model()) -> [#{handle := term(), scale := float()}].

List currently attached adapters with their scales.

list_cached_prefixes(ModelId, PromptTokens)

-spec list_cached_prefixes(model_id(), [erllama_nif:token_id()]) ->
                              {ok, non_neg_integer()} | {error, term()}.

Probe how much of PromptTokens is already cached for ModelId on this node. Returns {ok, MatchLen} where MatchLen is the length of the longest cached prefix of PromptTokens (across all tiers: RAM, ram_file, disk). Returns {ok, 0} if no prefix is cached or the prompt is empty. Returns {error, model_not_loaded} if ModelId is not registered locally.

Lookup uses the model's effective fingerprint, so attached LoRA adapters are honoured: cached rows produced under one adapter set will not match a probe taken under a different adapter set.

Used by the erllama_cluster cache-affinity router to route prompts to the node with the longest matching cached prefix.
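
To illustrate the routing idea, here is a minimal sketch that picks the node with the longest cached prefix from a list of probe results. It is not the erllama_cluster implementation; the module name prefix_affinity and the function best_node/1 are hypothetical. The probe results are assumed to have been gathered beforehand, for example by calling list_cached_prefixes/2 on each node through erpc:call/4.

```erlang
-module(prefix_affinity).
-export([best_node/1]).

%% Probes is a list of {Node, Result}, where Result is whatever
%% list_cached_prefixes/2 returned on that node. Nodes that returned
%% an error (for example model_not_loaded) are skipped.
-spec best_node([{node(), {ok, non_neg_integer()} | {error, term()}}]) ->
          {ok, node(), non_neg_integer()} | none.
best_node(Probes) ->
    Hits = [{Len, Node} || {Node, {ok, Len}} <- Probes],
    case Hits of
        [] ->
            none;
        _ ->
            %% Tuples compare on Len first; ties fall back to the
            %% term order of the node name.
            {Len, Node} = lists:max(Hits),
            {ok, Node, Len}
    end.
```

A result of {ok, Node, 0} means no node has any prefix cached, in which case a router would presumably fall back to another strategy such as load-based placement.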

list_models()

-spec list_models() -> [model_info()].

List currently-loaded models as model_info() maps. Each entry includes the model id, status, backend, context size, and quantisation.

load_adapter(Model, Path)

-spec load_adapter(model(), file:filename_all()) -> {ok, term()} | {error, term()}.

Load a LoRA adapter from a GGUF file and attach it to the model with scale 1.0. Returns an opaque handle to pass to set_adapter_scale/3 and unload_adapter/2.

The adapter's file sha256 is folded into the model's effective fingerprint so cache rows produced with the adapter attached never collide with rows from a different attachment set. In-flight requests keep their original fingerprint snapshot; the new value takes effect from the next request.

load_model(Config)

-spec load_model(map()) -> {ok, model_id()} | {error, term()}.

Load a model with an auto-generated id.

load_model(ModelId, Config)

-spec load_model(model_id(), map()) -> {ok, model_id()} | {error, term()}.

Load a model with an explicit id.

model_info(Model)

-spec model_info(model()) -> model_info().

Inspect a single loaded model. Returns the same map shape list_models/0 produces. Crashes with noproc if the model is not loaded.

models()

-spec models() -> [pid()].

List currently-loaded model pids (low-level supervisor view). Most callers want list_models/0, which returns metadata maps.

queue_depth()

-spec queue_depth() -> non_neg_integer().

O(1) snapshot of currently-admitted streaming inference requests across all loaded models. Counts only rows registered in erllama_inflight from the infer/4 admission path; pending requests queued inside an individual model gen_statem are not included.

Used by the erllama_cluster load balancer (least_loaded, power_of_two strategies) as a more accurate alternative to client-side outgoing-request counters.
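
As an illustration of how a power-of-two-choices strategy could consume this counter, the sketch below samples two distinct nodes at random and keeps the one with the smaller depth. It is not the erllama_cluster implementation; the module name p2c_pick, the function pick/2, and the DepthFun callback are hypothetical. In practice DepthFun might wrap a remote call to queue_depth/0.

```erlang
-module(p2c_pick).
-export([pick/2]).

%% Nodes is a non-empty list; DepthFun(Node) returns that node's
%% queue depth. Both are assumptions of this sketch.
-spec pick([Node, ...], fun((Node) -> non_neg_integer())) -> Node.
pick([Only], _DepthFun) ->
    Only;
pick(Nodes, DepthFun) when length(Nodes) >= 2 ->
    N = length(Nodes),
    %% Draw two distinct 1-based positions uniformly at random.
    I = rand:uniform(N),
    J0 = rand:uniform(N - 1),
    J = if J0 >= I -> J0 + 1; true -> J0 end,
    A = lists:nth(I, Nodes),
    B = lists:nth(J, Nodes),
    %% Keep the less loaded of the two; ties go to the first draw.
    case DepthFun(A) =< DepthFun(B) of
        true -> A;
        false -> B
    end.
```

Sampling only two nodes keeps the number of depth probes constant per request while still steering traffic away from the busiest node.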

set_adapter_scale(Model, Handle, Scale)

-spec set_adapter_scale(model(), term(), float()) -> ok | {error, term()}.

Change an attached adapter's scale. The scale is folded into the effective fingerprint, so changes split the cache namespace.

shutdown(Model)

-spec shutdown(model()) -> ok.

Fire a shutdown save synchronously and return. Called from a release stop hook; bounded by evict_save_timeout_ms.

status(Model)

-spec status(model()) -> idle | prefilling | generating.

Current model state. idle means no request is in flight; prefilling and generating are the two active phases.
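
A caller that needs the model to be idle can poll this state. The sketch below retries a status probe a bounded number of times. The module name wait_idle, the function until_idle/3, and the StatusFun callback are hypothetical; StatusFun would typically be fun() -> erllama:status(Model) end.

```erlang
-module(wait_idle).
-export([until_idle/3]).

%% StatusFun returns idle | prefilling | generating.
%% IntervalMs is the pause between probes; Retries is how many more
%% probes are allowed after the first one. All three are assumptions
%% of this sketch.
-spec until_idle(fun(() -> atom()), non_neg_integer(), non_neg_integer()) ->
          ok | {error, timeout}.
until_idle(StatusFun, IntervalMs, Retries)
  when is_integer(Retries), Retries >= 0 ->
    case StatusFun() of
        idle ->
            ok;
        _Busy when Retries =:= 0 ->
            {error, timeout};
        _Busy ->
            timer:sleep(IntervalMs),
            until_idle(StatusFun, IntervalMs, Retries - 1)
    end.
```

Polling is inherently racy: another process may submit a request between the probe and the caller's next call, so the follow-up call should still handle an error reply.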

tokenize(Model, Text)

-spec tokenize(model(), binary()) -> {ok, [erllama_nif:token_id()]} | {error, term()}.

Tokenise text against a loaded model's tokenizer. Safe to call concurrently with complete/2,3.

unload(Model)

-spec unload(model()) -> ok | {error, term()}.

Unload a model. Terminates the gen_statem cleanly.

unload_adapter(Model, Handle)

-spec unload_adapter(model(), term()) -> ok | {error, term()}.

Detach and free a previously loaded adapter. Idempotent.

unload_model(Model)

-spec unload_model(model()) -> ok | {error, term()}.

Alias for unload/1. Provided for API symmetry with load_model/1,2 and the OpenAI/Ollama-style naming used by downstream HTTP servers.

verify(ModelId, PrefixTokens, Candidates, K)

-spec verify(model_id(), [erllama_nif:token_id()], [erllama_nif:token_id()], pos_integer()) ->
                {ok, non_neg_integer(), erllama_nif:token_id() | eos} | {error, term()}.

Speculative-decoding verifier. Runs PrefixTokens ++ Candidates (truncated to K candidates) through the model in a single forward pass with per-position argmax, then returns the length of the longest accepted prefix and the model's own next token after it.

Behaviour:

  • The verifier model gen_statem is locked for the duration of the call; concurrent infer/4 requests on the same model return {error, busy}. Verify only proceeds when the model is idle.
  • The context's KV cells are mutated during the forward pass but restored before return: post-call the seq_id=0 KV ends at the same length the caller had before, with logits buffered for the last prefix token (so a follow-up decode_one is immediately valid). The caller's pre-call decode_ready flag is not preserved; after verify the context is always ready to sample.
  • An empty PrefixTokens returns {error, empty_prefix} because the acceptance and NextToken indexing both require at least one prefix token.
  • NextToken may be the atom eos if the verifier's argmax at the relevant position is an end-of-generation token; treat it as the signal to terminate the decode loop.

Used by the upcoming erllama_cluster speculative-decoding strategy after draft_tokens/3.
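
The return value can be folded back into a decode loop as sketched below. The sketch assumes the accepted length counts leading tokens of Candidates, which is how the description above reads; the module name spec_step and the function apply_verify/2 are hypothetical and not part of erllama.

```erlang
-module(spec_step).
-export([apply_verify/2]).

%% Candidates are the draft tokens that were passed to verify/4.
%% Result is the value verify/4 returned.
%% Returns the tokens to append to the running sequence and whether
%% the decode loop should go on.
-spec apply_verify([integer()], {ok, non_neg_integer(), integer() | eos}
                                | {error, term()}) ->
          {continue, [integer()]} | {stop, [integer()]} | {error, term()}.
apply_verify(Candidates, {ok, Accepted, Next})
  when is_integer(Accepted), Accepted >= 0 ->
    %% Keep only the draft tokens the verifier agreed with.
    Kept = lists:sublist(Candidates, Accepted),
    case Next of
        eos -> {stop, Kept};
        _ -> {continue, Kept ++ [Next]}
    end;
apply_verify(_Candidates, {error, Reason}) ->
    {error, Reason}.
```

Even when no candidate is accepted, the step still yields one token (the verifier's own next token), so each round makes progress unless eos is reached.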

vram_info()

-spec vram_info() ->
                   {ok,
                    #{total_b := non_neg_integer(),
                      free_b := non_neg_integer(),
                      used_b := non_neg_integer()}} |
                   {error, atom()}.

VRAM probe across all loaded ggml backends. Sums free / total bytes across non-CPU devices (GPU, integrated GPU, accelerator). Returns {error, no_gpu} on a CPU-only build rather than reporting a fake number; the caller should fall back to a system memory probe of its own choosing in that case.

Used by the erllama_cluster scheduler for bin-packing model placement.
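
The fallback described above might be handled as in the following sketch, which turns a vram_info/0 result into a tagged free-byte figure. The module name vram_budget, the function free_bytes/2, and the Fallback callback (a zero-arity fun returning free system memory in bytes) are hypothetical.

```erlang
-module(vram_budget).
-export([free_bytes/2]).

%% Result is the value returned by erllama:vram_info/0.
%% Fallback is a caller-chosen system memory probe, used only on a
%% CPU-only build. The gpu | system tags are an assumption of this
%% sketch so callers can tell which figure they received.
-spec free_bytes({ok, map()} | {error, atom()}, fun(() -> non_neg_integer())) ->
          {gpu, non_neg_integer()} | {system, non_neg_integer()} | {error, atom()}.
free_bytes({ok, #{free_b := Free}}, _Fallback) ->
    {gpu, Free};
free_bytes({error, no_gpu}, Fallback) when is_function(Fallback, 0) ->
    {system, Fallback()};
free_bytes({error, Reason}, _Fallback) ->
    {error, Reason}.
```

Usage would look like vram_budget:free_bytes(erllama:vram_info(), Probe), where Probe is whatever system memory probe the caller prefers.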