LlamaCppEx.ModelManager (LlamaCppEx v0.8.23)


Holds multiple models resident and routes requests to them by id.

The manager is a node-wide singleton GenServer that owns an ETS table of loaded models. Following the otp-thinking ETS pattern, lifecycle writes serialize through the GenServer, while inference-time lookups read the ETS table directly from the caller — so the manager never becomes a throughput bottleneck for generate/3, stream/3, chat/3, or embed/3.

It is a singleton by design: the client API targets the manager by its module name, and the backing Registry/DynamicSupervisor use fixed names. Start at most one per node — init/1 refuses a second instance with a clear error.

Because the slow parts of load/3 (Hub download and native model load) run in a supervised Task rather than in the GenServer process, a long load does not block other lifecycle calls (unload/1, set_default/1, or concurrent load/3 calls). The memory-budget reservation and the ETS commit are still serialized on the GenServer, so resident models are always accounted for. The budget remains advisory, however: a model's footprint is reserved only once its size is known, which is after the resolve step, so two models downloading at the same time may briefly under-count each other.

Start it as part of LlamaCppEx.ModelSupervisor, which also starts the Registry and DynamicSupervisor that server-backed models need:

children = [
  {LlamaCppEx.ModelSupervisor,
   memory_budget: :auto,
   models: [
     {"chat", {:hub, "Qwen/Qwen3-0.6B-GGUF", "Qwen3-0.6B-Q8_0.gguf"}, n_gpu_layers: -1},
     {"embed", {:path, "/models/nomic-embed.gguf"}, capabilities: [:embed]}
   ]}
]

Backing modes

  • :server (default for generation/chat) — backs the model with a supervised LlamaCppEx.Server, giving continuous batching, streaming, prefix cache, and telemetry.
  • :direct (auto-selected when :embed is in :capabilities) — holds the %LlamaCppEx.Model{} and runs stateless LlamaCppEx.generate/3 / LlamaCppEx.Embedding.embed/2. Mandatory for embeddings, since the server has no embedding path.

Routing

  • Explicit id: generate("chat", prompt).
  • Default model: generate(:default, prompt) routes to the model marked default: true at load, or set via set_default/1.
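As an illustration of the two routing styles, here is a minimal sketch. `MyApp.Haiku` and its prompt text are hypothetical; only `LlamaCppEx.ModelManager.generate/3` and the `:default` id come from this page.

```elixir
defmodule MyApp.Haiku do
  # Hypothetical wrapper module, used only for illustration.
  alias LlamaCppEx.ModelManager

  # Routes to an explicit id when one is given, otherwise to :default.
  def write(topic, id \\ :default) do
    case ModelManager.generate(id, "Write a haiku about #{topic}.") do
      {:ok, text} -> {:ok, String.trim(text)}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

`MyApp.Haiku.write("ETS", "chat")` targets the model loaded as "chat", while `MyApp.Haiku.write("ETS")` follows whichever model is currently the default.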

Unloading and memory

Model cleanup is GC-based: unload/1 stops the backing server (dropping its context and model refs) and removes the ETS entry, then forces a GC. Because reclamation is by garbage collection, any caller still holding a %Model{} obtained via fetch_model/1 keeps the underlying model alive past unload/1. Prefer id-based dispatch and avoid holding raw refs.

Loads are checked against an advisory memory budget (see LlamaCppEx.ModelManager.Budget); over-budget loads are refused with {:error, {:insufficient_memory, ...}}.
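A caller may want to tell a budget refusal apart from other load failures. The sketch below is one way to do that, under assumptions: `MyApp.ModelLoader` and the `:over_budget` tag are hypothetical, and because the payload after `:insufficient_memory` is not spelled out on this page, the guard checks only the first tuple element.

```elixir
defmodule MyApp.ModelLoader do
  # Hypothetical helper; only ModelManager.load/3 comes from the docs.
  alias LlamaCppEx.ModelManager

  def ensure_loaded(id, source, opts \\ []) do
    case ModelManager.load(id, source, opts) do
      {:ok, _id} ->
        :ok

      # Match any tuple tagged :insufficient_memory without assuming its size.
      {:error, reason} when is_tuple(reason) and elem(reason, 0) == :insufficient_memory ->
        {:error, :over_budget}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```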

Types

id()

@type id() :: term()

A model identifier. Any term works as a key; strings (e.g. "chat") or atoms are conventional. Ids flow through as raw terms and are never converted to atoms, so user-supplied strings are safe.

source()

Functions

chat(id, messages, opts \\ [])

@spec chat(id(), [LlamaCppEx.Chat.message()], keyword()) ::
  {:ok, String.t()} | {:error, term()}

Routes a chat request to model id (or :default).
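A minimal usage sketch follows. It assumes that `LlamaCppEx.Chat.message()` is a map with `:role` and `:content` string fields; that shape is not shown on this page, so check the `LlamaCppEx.Chat` docs before relying on it. `MyApp.Assistant` is a hypothetical module.

```elixir
defmodule MyApp.Assistant do
  # Hypothetical wrapper module, used only for illustration.
  alias LlamaCppEx.ModelManager

  def ask(question, id \\ :default) do
    # Assumed message shape: %{role: String.t(), content: String.t()}.
    messages = [
      %{role: "system", content: "You are a concise assistant."},
      %{role: "user", content: question}
    ]

    # Returns {:ok, reply} or {:error, reason}, per the spec above.
    ModelManager.chat(id, messages)
  end
end
```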

child_spec(init_arg)

Returns a specification to start this module under a supervisor.

See Supervisor.

default()

@spec default() :: id() | nil

Returns the current default model id, or nil.

embed(id, text, opts \\ [])

@spec embed(id(), String.t(), keyword()) :: {:ok, [float()]} | {:error, term()}

Routes an embedding request to model id (or :default).

The model must have been loaded with :embed in its :capabilities (which forces :direct mode).
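As a sketch of how the returned vectors might be used, the hypothetical module below embeds two texts through the manager and compares them by cosine similarity. The similarity helper is plain Elixir and is not part of LlamaCppEx.

```elixir
defmodule MyApp.Similarity do
  # Hypothetical helper module, used only for illustration.
  alias LlamaCppEx.ModelManager

  # Embeds both texts with the same model and returns {:ok, cosine};
  # the first {:error, reason} from embed/2 is passed through unchanged.
  def similarity(id, a, b) do
    with {:ok, va} <- ModelManager.embed(id, a),
         {:ok, vb} <- ModelManager.embed(id, b) do
      {:ok, cosine(va, vb)}
    end
  end

  # Cosine similarity of two equal-length, non-zero vectors.
  def cosine(va, vb) do
    dot = va |> Enum.zip(vb) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    dot / (norm(va) * norm(vb))
  end

  defp norm(v), do: v |> Enum.reduce(0.0, fn x, acc -> acc + x * x end) |> :math.sqrt()
end
```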

fetch_model(id)

@spec fetch_model(id()) :: {:ok, LlamaCppEx.Model.t()} | {:error, term()}

Returns the raw %LlamaCppEx.Model{} for advanced use.

Holding the returned ref keeps the model alive past unload/1 — prefer id-based dispatch where possible.

generate(id, prompt, opts \\ [])

@spec generate(id(), String.t(), keyword()) :: {:ok, String.t()} | {:error, term()}

Routes a generation request to model id (or :default).

Dispatches to LlamaCppEx.Server.generate/3 (:server mode) or LlamaCppEx.generate/3 (:direct mode).

info(id)

@spec info(id()) :: {:ok, map()} | {:error, :not_loaded}

Returns a sanitized view of one model, or {:error, :not_loaded}.

list()

@spec list() :: [map()]

Lists resident models as sanitized maps (no raw refs).

load(id, source, opts \\ [])

@spec load(id(), source(), keyword()) :: {:ok, id()} | {:error, term()}

Loads a model and keeps it resident under id.

Options

  • :mode - :server or :direct. Defaults to :direct when :capabilities includes :embed, otherwise :server.
  • :capabilities - List of :generate, :chat, :embed. Defaults to [:generate, :chat].
  • :default - When true, mark this model as the default route.
  • Hub options (:cache_dir, :token, :revision, :force) when source is {:hub, repo, file}.
  • Any LlamaCppEx.Model.load/2 or LlamaCppEx.Server.start_link/1 options (e.g. :n_gpu_layers, :n_ctx, :n_parallel).
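The options above can be combined as in this sketch, which loads at runtime the same two models shown in the supervisor example at the top of the page. `MyApp.Models` is a hypothetical module; the sources and option values are taken from that example.

```elixir
defmodule MyApp.Models do
  # Hypothetical module, used only for illustration.
  alias LlamaCppEx.ModelManager

  # Server-backed chat model, marked as the :default route.
  def load_chat do
    ModelManager.load(
      "chat",
      {:hub, "Qwen/Qwen3-0.6B-GGUF", "Qwen3-0.6B-Q8_0.gguf"},
      default: true,
      n_gpu_layers: -1
    )
  end

  # :embed in :capabilities selects :direct mode, so :mode is omitted.
  def load_embedder do
    ModelManager.load("embed", {:path, "/models/nomic-embed.gguf"}, capabilities: [:embed])
  end
end
```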

loaded?(id)

@spec loaded?(id()) :: boolean()

Returns whether a model is loaded and :ready.

route(id)

@spec route(id()) ::
  {:ok, {:server, pid(), LlamaCppEx.ModelManager.Entry.t()}}
  | {:ok, {:direct, LlamaCppEx.Model.t(), LlamaCppEx.ModelManager.Entry.t()}}
  | {:error, term()}

Resolves id (or :default) to its dispatch target.

Returns {:ok, {:server, pid, entry}}, {:ok, {:direct, model, entry}}, or {:error, :not_loaded | {:not_ready, status}}. Primarily for dispatch and testing.
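The return shapes listed above can be handled with a single `case`, as in this sketch. `MyApp.RouteInfo` is a hypothetical module and the atoms it returns are illustrative.

```elixir
defmodule MyApp.RouteInfo do
  # Hypothetical helper module, used only for illustration.
  alias LlamaCppEx.ModelManager

  # Summarizes how a model id is currently backed.
  def backing(id) do
    case ModelManager.route(id) do
      {:ok, {:server, pid, _entry}} when is_pid(pid) -> {:server, pid}
      {:ok, {:direct, _model, _entry}} -> :direct
      {:error, :not_loaded} -> :not_loaded
      {:error, {:not_ready, status}} -> {:not_ready, status}
    end
  end
end
```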

set_default(id)

@spec set_default(id()) :: :ok | {:error, :not_loaded}

Sets the default model used by :default routing.

start_link(opts)

@spec start_link(keyword()) :: GenServer.on_start()

Starts the manager. Normally started by LlamaCppEx.ModelSupervisor.

Options

stream(id, prompt, opts \\ [])

@spec stream(id(), String.t(), keyword()) :: Enumerable.t()

Routes a streaming generation request to model id (or :default).

Raises ArgumentError if the model is not loaded and ready (a lazy stream cannot carry an error tuple).
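Since stream/3 raises instead of returning an error tuple, a caller that prefers tuples can wrap it. The sketch below makes two assumptions: the stream yields text chunks that can be joined, and the `ArgumentError` may surface either when the stream is built or when it is consumed, so both steps sit inside the rescued function body. `MyApp.Streaming` is a hypothetical module.

```elixir
defmodule MyApp.Streaming do
  # Hypothetical wrapper module, used only for illustration.
  alias LlamaCppEx.ModelManager

  # Consumes the whole stream and returns {:ok, text}, or {:error, message}
  # when the model is not loaded and ready.
  def collect(id, prompt, opts \\ []) do
    text = id |> ModelManager.stream(prompt, opts) |> Enum.join()
    {:ok, text}
  rescue
    e in ArgumentError -> {:error, Exception.message(e)}
  end
end
```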

unload(id, timeout \\ 30000)

@spec unload(id(), timeout()) :: :ok | {:error, :not_loaded}

Unloads a model and frees its backing resources (GC-based).

Stopping a backing server can take a moment for large models, so this accepts an optional timeout (default 30s) rather than the 5s GenServer default.
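For example, an idempotent unload with a longer timeout could look like the sketch below. `MyApp.ModelSwap` and the 60-second value are hypothetical; the return shapes follow the spec above.

```elixir
defmodule MyApp.ModelSwap do
  # Hypothetical helper module, used only for illustration.
  alias LlamaCppEx.ModelManager

  # Unloads `id`, treating "already gone" as success.
  # The timeout is in milliseconds, as with GenServer.call/3.
  def unload_if_loaded(id, timeout \\ 60_000) do
    case ModelManager.unload(id, timeout) do
      :ok -> :ok
      {:error, :not_loaded} -> :ok
    end
  end
end
```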