erllama (erllama v0.1.0)

Public façade for the erllama application.

The cache subsystem (erllama_cache) operates independently of this module. This module is the user-facing surface for loading and running models.

Typical usage:

  {ok, _Started} = application:ensure_all_started(erllama).
  {ok, Bin} = file:read_file("/srv/models/tinyllama-1.1b-q4_k_m.gguf").
  {ok, Model} = erllama:load_model(#{
      backend => erllama_model_llama,
      model_path => "/srv/models/tinyllama-1.1b-q4_k_m.gguf",
      fingerprint => crypto:hash(sha256, Bin)
  }).
  {ok, Reply, _Tokens} = erllama:complete(Model, <<"hello">>).
  ok = erllama:unload(Model).

Extra cache parameters (tier, tier_srv, quant_type, ctx_params_hash, policy, ...) are optional; by default, cache saves are routed to the RAM tier (erllama_cache_ram). See the loading guide for the full option map and for wiring up ram_file / disk tier servers.
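
A minimal sketch of passing cache routing keys in the config map; the tier atom shown is an assumption, so consult the loading guide for the authoritative values:

  {ok, Model2} = erllama:load_model(#{
      backend => erllama_model_llama,
      model_path => "/srv/models/tinyllama-1.1b-q4_k_m.gguf",
      fingerprint => crypto:hash(sha256, Bin),
      tier => ram,                   %% assumed tier atom
      tier_srv => erllama_cache_ram  %% RAM tier server named above
  }).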

Models are dynamic children of erllama_model_sup (simple_one_for_one). A registered name is auto-generated when the caller does not provide an explicit model_id in the config map.

Summary

Functions

apply_chat_template(Model, Request)
  Render a chat request through the model's chat template and tokenise.

cancel(Ref)
  Cancel an in-flight streaming inference.

complete(Model, Prompt)
  Run a completion against a loaded model.

complete(Model, Prompt, Opts)
  Run a completion against a loaded model with options.

counters()
  Snapshot of the cache subsystem operational counters.

detokenize(Model, Tokens)
  Detokenise a list of token ids back to text.

embed(Model, Tokens)
  Compute an embedding vector for the given prompt tokens.

evict(Model)
  Fire an evict save synchronously and release the model's live KV state.

infer(Model, Tokens, Params, CallerPid)
  Streaming inference; returns immediately with a reference() and delivers tokens to CallerPid via async messages.

list_adapters(Model)
  List currently attached adapters with their scales.

list_models()
  List currently loaded models as model_info() maps.

load_adapter(Model, Path)
  Load a LoRA adapter from a GGUF file and attach it to the model with scale 1.0.

load_model(Config)
  Load a model with an auto-generated id.

load_model(ModelId, Config)
  Load a model with an explicit id.

model_info(Model)
  Inspect a single loaded model.

models()
  List currently loaded model pids (low-level supervisor view).

set_adapter_scale(Model, Handle, Scale)
  Change an attached adapter's scale.

shutdown(Model)
  Fire a shutdown save synchronously and return.

status(Model)
  Current model state.

tokenize(Model, Text)
  Tokenise text against a loaded model's tokenizer.

unload(Model)
  Unload a model.

unload_adapter(Model, Handle)
  Detach and free a previously loaded adapter.

unload_model(Model)
  Alias for unload/1.

Types

model()

-type model() :: erllama_model:model().

model_id()

-type model_id() :: erllama_registry:model_id().

model_info()

-type model_info() :: erllama_model:model_info().

Functions

apply_chat_template(Model, Request)

-spec apply_chat_template(model(), erllama_model_backend:chat_request()) ->
                             {ok, [erllama_nif:token_id()]} | {error, term()}.

Render a chat request through the model's chat template and tokenise. The Request map carries messages, system, and tools.
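
A sketch, assuming the chat_request() map carries role/content maps under messages (the exact shape is defined by erllama_model_backend):

  Request = #{
      system   => <<"You are a terse assistant.">>,
      messages => [#{role => user, content => <<"What is OTP?">>}],
      tools    => []
  },
  {ok, PromptTokens} = erllama:apply_chat_template(Model, Request).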

cancel(Ref)

-spec cancel(reference()) -> ok.

Cancel an in-flight streaming inference. Idempotent and fire-and-forget; cancellation is observed at the next inter-token boundary. The caller still receives a final {erllama_done, Ref, Stats} with cancelled => true.
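
For example, assuming Stats is a map (as the cancelled => true notation above suggests):

  {ok, Ref} = erllama:infer(Model, Tokens, Params, self()),
  ok = erllama:cancel(Ref),
  %% The final done message still arrives, flagged as cancelled.
  receive
      {erllama_done, Ref, #{cancelled := true}} -> ok
  end.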

complete(Model, Prompt)

-spec complete(model(), binary()) -> {ok, binary(), [erllama_nif:token_id()]} | {error, term()}.

Run a completion against a loaded model.

complete(Model, Prompt, Opts)

-spec complete(model(), binary(), map()) -> {ok, binary(), [erllama_nif:token_id()]} | {error, term()}.

Run a completion against a loaded model with options.

Recognised keys in Opts:

  • response_tokens (non_neg_integer()) — cap on the number of tokens generated. Defaults to the model's n_ctx minus prompt length.
  • parent_key (erllama_cache:cache_key()) — the previous turn's finish-save key. Skips the longest-prefix walk and resumes directly from that row.

Returns {ok, ReplyText, FullTokenList} on success.
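
For instance, capping the reply at 128 tokens and resuming from a prior turn (ParentKey is a cache key assumed to have been retained from that turn's finish save):

  Opts = #{response_tokens => 128, parent_key => ParentKey},
  {ok, Reply, AllTokens} = erllama:complete(Model, <<"and then?">>, Opts).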

counters()

-spec counters() -> #{atom() => non_neg_integer()}.

Snapshot of the cache subsystem operational counters.

detokenize(Model, Tokens)

-spec detokenize(model(), [erllama_nif:token_id()]) -> {ok, binary()} | {error, term()}.

Detokenise a list of token ids back to text.
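
A quick round trip through tokenize/2:

  {ok, Ids} = erllama:tokenize(Model, <<"hello world">>),
  %% Text should equal the input up to the tokenizer's normalisation.
  {ok, Text} = erllama:detokenize(Model, Ids).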

embed(Model, Tokens)

-spec embed(model(), [erllama_nif:token_id()]) -> {ok, [float()]} | {error, term()}.

Compute an embedding vector for the given prompt tokens.
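
Typically paired with tokenize/2. A sketch of comparing two prompts by cosine similarity (the similarity arithmetic is plain Erlang, not part of this API):

  {ok, TA} = erllama:tokenize(Model, <<"cats">>),
  {ok, TB} = erllama:tokenize(Model, <<"dogs">>),
  {ok, VA} = erllama:embed(Model, TA),
  {ok, VB} = erllama:embed(Model, TB),
  Dot  = lists:sum(lists:zipwith(fun(X, Y) -> X * Y end, VA, VB)),
  Norm = fun(V) -> math:sqrt(lists:sum([X * X || X <- V])) end,
  Cos  = Dot / (Norm(VA) * Norm(VB)).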

evict(Model)

-spec evict(model()) -> ok.

Fire an evict save synchronously and release the model's live KV state. Used by an external memory-pressure scheduler when it wants this model's working set off the heap without unloading the model.

infer(Model, Tokens, Params, CallerPid)

-spec infer(model(), [erllama_nif:token_id()], erllama_model:infer_params(), pid()) ->
               {ok, reference()} | {error, term()}.

Streaming inference. Returns immediately with a reference() that identifies this request; tokens are delivered to CallerPid via async messages:

  • {erllama_token, Ref, Bin :: binary()} — text fragment
  • {erllama_done, Ref, Stats} — normal completion
  • {erllama_error, Ref, Reason} — failure

Tokens is the prompt as a list of token ids; tokenisation is the caller's responsibility (use tokenize/2 or apply a chat template first).
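
A minimal collection loop (PromptTokens from tokenize/2 or apply_chat_template/2; Params is an infer_params() map, its contents defined by erllama_model):

  {ok, Ref} = erllama:infer(Model, PromptTokens, Params, self()),
  Collect =
      fun Loop(Acc) ->
          receive
              {erllama_token, Ref, Bin}    -> Loop([Bin | Acc]);
              {erllama_done, Ref, _Stats}  -> {ok, lists:reverse(Acc)};
              {erllama_error, Ref, Reason} -> {error, Reason}
          end
      end,
  {ok, Fragments} = Collect([]),
  ReplyText = iolist_to_binary(Fragments).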

list_adapters(Model)

-spec list_adapters(model()) -> [#{handle := term(), scale := float()}].

List currently attached adapters with their scales.

list_models()

-spec list_models() -> [model_info()].

List currently loaded models as model_info() maps. Each entry includes the model id, status, backend, context size, and quantisation.

load_adapter(Model, Path)

-spec load_adapter(model(), file:filename_all()) -> {ok, term()} | {error, term()}.

Load a LoRA adapter from a GGUF file and attach it to the model with scale 1.0. Returns an opaque handle to pass to set_adapter_scale/3 and unload_adapter/2.

The adapter's file sha256 is folded into the model's effective fingerprint so cache rows produced with the adapter attached never collide with rows from a different attachment set. In-flight requests keep their original fingerprint snapshot; the new value takes effect from the next request.
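
The whole lifecycle, end to end (the adapter path is illustrative, and the match assumes a single attached adapter):

  {ok, Adapter} = erllama:load_adapter(Model, "/srv/adapters/style.gguf"),
  ok = erllama:set_adapter_scale(Model, Adapter, 0.5),
  [#{handle := Adapter, scale := 0.5}] = erllama:list_adapters(Model),
  ok = erllama:unload_adapter(Model, Adapter).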

load_model(Config)

-spec load_model(map()) -> {ok, model_id()} | {error, term()}.

Load a model with an auto-generated id.

load_model(ModelId, Config)

-spec load_model(model_id(), map()) -> {ok, model_id()} | {error, term()}.

Load a model with an explicit id.
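
For example, assuming atoms are valid erllama_registry:model_id() values (Config as in load_model/1):

  {ok, tinyllama} = erllama:load_model(tinyllama, Config).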

model_info(Model)

-spec model_info(model()) -> model_info().

Inspect a single loaded model. Returns the same map shape list_models/0 produces. Crashes with noproc if the model is not loaded.

models()

-spec models() -> [pid()].

List currently loaded model pids (low-level supervisor view). Most callers want list_models/0, which returns metadata maps.

set_adapter_scale(Model, Handle, Scale)

-spec set_adapter_scale(model(), term(), float()) -> ok | {error, term()}.

Change an attached adapter's scale. The scale is folded into the effective fingerprint, so changes split the cache namespace.

shutdown(Model)

-spec shutdown(model()) -> ok.

Fire a shutdown save synchronously and return. Called from a release stop hook; bounded by evict_save_timeout_ms.

status(Model)

-spec status(model()) -> idle | prefilling | generating.

Current model state. idle means no request is in flight; prefilling and generating are the two active phases.

tokenize(Model, Text)

-spec tokenize(model(), binary()) -> {ok, [erllama_nif:token_id()]} | {error, term()}.

Tokenise text against a loaded model's tokenizer. Safe to call concurrently with complete/2,3.

unload(Model)

-spec unload(model()) -> ok | {error, term()}.

Unload a model. Terminates the gen_statem cleanly.

unload_adapter(Model, Handle)

-spec unload_adapter(model(), term()) -> ok | {error, term()}.

Detach and free a previously loaded adapter. Idempotent.

unload_model(Model)

-spec unload_model(model()) -> ok | {error, term()}.

Alias for unload/1. Provided for API symmetry with load_model/1,2 and the OpenAI/Ollama-style naming used by downstream HTTP servers.