Holds multiple models resident and routes requests to them by id.
The manager is a node-wide singleton GenServer that owns an ETS table of
loaded models. Following the otp-thinking ETS pattern, lifecycle writes
serialize through the GenServer, while inference-time lookups read the ETS
table directly from the caller — so the manager never becomes a throughput
bottleneck for generate/3, stream/3, chat/3, or embed/3.
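The split between serialized writes and direct reads can be sketched with nothing but the standard library. This is a generic illustration of the pattern, not the manager's actual implementation; the module name RegistrySketch, the table name, and the put/lookup functions are all hypothetical.

```elixir
defmodule RegistrySketch do
  # Hypothetical stand-in for the manager: a GenServer that owns an ETS table.
  use GenServer

  @table :registry_sketch_models

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  # Lifecycle write: serialized through the GenServer process.
  def put(id, entry), do: GenServer.call(__MODULE__, {:put, id, entry})

  # Hot-path read: runs in the caller's process and sends no message
  # to the GenServer.
  def lookup(id) do
    case :ets.lookup(@table, id) do
      [{^id, entry}] -> {:ok, entry}
      [] -> {:error, :not_loaded}
    end
  end

  @impl true
  def init(_opts) do
    # :protected lets any process read while only the owner may write.
    table = :ets.new(@table, [:named_table, :set, :protected, read_concurrency: true])
    {:ok, table}
  end

  @impl true
  def handle_call({:put, id, entry}, _from, table) do
    :ets.insert(table, {id, entry})
    {:reply, :ok, table}
  end
end
```

Because lookup/1 never calls into the GenServer, any number of callers can resolve an id concurrently while a write is in progress.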
It is a singleton by design: the client API targets the manager by its module
name, and the backing Registry/DynamicSupervisor use fixed names. Start at
most one per node — init/1 refuses a second instance with a clear error.
Because the slow parts of load/3 (Hub download and native model load) run in a
supervised Task rather than in the GenServer process, a long load does not
block other lifecycle calls (unload/1, set_default/1, or concurrent load/3
calls). The memory-budget reservation and the ETS commit are still serialized
on the GenServer, so resident models are always accounted for. The budget
remains advisory, however: a model's footprint is reserved only once its size
is known (after resolve), so two models downloading at once may momentarily
under-count each other.
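One common way to keep a GenServer responsive during slow work is to start a supervised task, return {:noreply, state}, and reply to the caller when the task's result message arrives. The sketch below shows that general technique under stated assumptions; LoadSketch, its state shape, and the slow_fun argument are hypothetical and do not reflect the manager's internals.

```elixir
defmodule LoadSketch do
  # Hypothetical illustration: slow work runs in a Task, commit runs here.
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, opts)

  def load(server, id, slow_fun), do: GenServer.call(server, {:load, id, slow_fun}, :infinity)
  def loaded(server), do: GenServer.call(server, :loaded)

  @impl true
  def init(_opts) do
    {:ok, task_sup} = Task.Supervisor.start_link()
    {:ok, %{task_sup: task_sup, pending: %{}, loaded: %{}}}
  end

  @impl true
  def handle_call({:load, id, slow_fun}, from, state) do
    # Start the slow step outside this process and defer the reply.
    task = Task.Supervisor.async_nolink(state.task_sup, slow_fun)
    pending = Map.put(state.pending, task.ref, {id, from})
    {:noreply, %{state | pending: pending}}
  end

  def handle_call(:loaded, _from, state) do
    {:reply, Map.keys(state.loaded), state}
  end

  @impl true
  def handle_info({ref, result}, state) when is_reference(ref) do
    case Map.pop(state.pending, ref) do
      {nil, _pending} ->
        {:noreply, state}

      {{id, from}, pending} ->
        # The task finished: drop its monitor, then commit and reply.
        # Both steps run serialized on the GenServer.
        Process.demonitor(ref, [:flush])
        GenServer.reply(from, :ok)
        {:noreply, %{state | pending: pending, loaded: Map.put(state.loaded, id, result)}}
    end
  end

  def handle_info({:DOWN, ref, :process, _pid, reason}, state) do
    case Map.pop(state.pending, ref) do
      {nil, _pending} ->
        {:noreply, state}

      {{_id, from}, pending} ->
        # The task crashed before delivering a result.
        GenServer.reply(from, {:error, reason})
        {:noreply, %{state | pending: pending}}
    end
  end
end
```

While a load is in flight, a call such as LoadSketch.loaded/1 is answered immediately, which mirrors the claim that other lifecycle calls are not blocked.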
Start it as part of LlamaCppEx.ModelSupervisor, which also starts the
Registry and DynamicSupervisor that server-backed models need:

    children = [
      {LlamaCppEx.ModelSupervisor,
       memory_budget: :auto,
       models: [
         {"chat", {:hub, "Qwen/Qwen3-0.6B-GGUF", "Qwen3-0.6B-Q8_0.gguf"}, n_gpu_layers: -1},
         {"embed", {:path, "/models/nomic-embed.gguf"}, capabilities: [:embed]}
       ]}
    ]

Backing modes
- :server (default for generation/chat) - backs the model with a supervised LlamaCppEx.Server, giving continuous batching, streaming, prefix cache, and telemetry.
- :direct (auto-selected when :embed is in :capabilities) - holds the %LlamaCppEx.Model{} and runs stateless LlamaCppEx.generate/3 / LlamaCppEx.Embedding.embed/2. Mandatory for embeddings, since the server has no embedding path.
Routing
- Explicit id: generate("chat", prompt).
- Default model: generate(:default, prompt) routes to the model marked default: true at load, or set via set_default/1.
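A caller might wrap both routing forms in one helper. The sketch below takes the manager module as an argument so that it stays self-contained; RoutingSketch and answer/3 are hypothetical names, and the {:ok, text} | {:error, reason} return shape for generate/3 is an assumption modeled on the documented chat/3 spec.

```elixir
defmodule RoutingSketch do
  # `manager` is any module exposing generate/3. In an application this would
  # be LlamaCppEx.ModelManager; passing it in keeps the sketch self-contained.
  # ASSUMPTION: generate/3 returns {:ok, text} | {:error, reason}.
  def answer(manager, id, prompt) do
    case manager.generate(id, prompt, []) do
      {:ok, text} -> text
      {:error, reason} -> "model unavailable: " <> inspect(reason)
    end
  end
end

# Explicit id, then the default route:
# RoutingSketch.answer(LlamaCppEx.ModelManager, "chat", "Hello")
# RoutingSketch.answer(LlamaCppEx.ModelManager, :default, "Hello")
```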
Unloading and memory
Model cleanup is GC-based: unload/1 stops the backing server (dropping its
context and model refs) and removes the ETS entry, then forces a GC. Because
reclamation is by garbage collection, any caller still holding a %Model{}
obtained via fetch_model/1 keeps the underlying model alive past unload/1.
Prefer id-based dispatch and avoid holding raw refs.
Loads are checked against an advisory memory budget (see
LlamaCppEx.ModelManager.Budget); over-budget loads are refused with
{:error, {:insufficient_memory, ...}}.
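One possible reaction to a refused load is to unload another model and retry once. The sketch below is an illustration under assumptions, not a recommended policy: LoadRetrySketch, load_with_eviction/4, and the choice of which id to evict are hypothetical, the contents of the :insufficient_memory tuple are left unmatched because they are not specified here, and any non-refusal result from load/3 is passed through untouched.

```elixir
defmodule LoadRetrySketch do
  # `manager` is any module exposing load/3 and unload/1 (for example
  # LlamaCppEx.ModelManager). `evict_id` is a caller-chosen model to drop.
  def load_with_eviction(manager, id, source, evict_id) do
    case manager.load(id, source, []) do
      {:error, {:insufficient_memory, _details}} ->
        # Free one resident model, then retry exactly once.
        manager.unload(evict_id)
        manager.load(id, source, [])

      other ->
        # Success or any other error is returned unchanged.
        other
    end
  end
end
```

Since the budget is advisory, a retry can still be refused; the second result is returned as-is so the caller can decide what to do next.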
Types
@type id() :: term()
A model identifier. Any term works as a key; strings (e.g. "chat") or atoms
are conventional. Ids flow through as raw terms and are never converted to
atoms, so user-supplied strings are safe.
@type source() :: LlamaCppEx.ModelManager.Entry.source()
Functions
@spec chat(id(), [LlamaCppEx.Chat.message()], keyword()) :: {:ok, String.t()} | {:error, term()}
Routes a chat request to model id (or :default).
Returns a specification to start this module under a supervisor.
See Supervisor.
@spec default() :: id() | nil
Returns the current default model id, or nil.
Routes an embedding request to model id (or :default).
The model must have been loaded with :embed in its :capabilities (which
forces :direct mode).
@spec fetch_model(id()) :: {:ok, LlamaCppEx.Model.t()} | {:error, term()}
Returns the raw %LlamaCppEx.Model{} for advanced use.
Holding the returned ref keeps the model alive past unload/1 — prefer
id-based dispatch where possible.
Routes a generation request to model id (or :default).
Dispatches to LlamaCppEx.Server.generate/3 (:server mode) or
LlamaCppEx.generate/3 (:direct mode).
Returns a sanitized view of one model, or {:error, :not_loaded}.
@spec list() :: [map()]
Lists resident models as sanitized maps (no raw refs).
Loads a model and keeps it resident under id.
Options
- :mode - :server or :direct. Defaults to :direct when :capabilities includes :embed, otherwise :server.
- :capabilities - List of :generate, :chat, :embed. Defaults to [:generate, :chat].
- :default - When true, mark this model as the default route.
- Hub options (:cache_dir, :token, :revision, :force) when source is {:hub, repo, file}.
- Any LlamaCppEx.Model.load/2 or LlamaCppEx.Server.start_link/1 options (e.g. :n_gpu_layers, :n_ctx, :n_parallel).
Returns whether a model is loaded and :ready.
@spec route(id()) :: {:ok, {:server, pid(), LlamaCppEx.ModelManager.Entry.t()}} | {:ok, {:direct, LlamaCppEx.Model.t(), LlamaCppEx.ModelManager.Entry.t()}} | {:error, term()}
Resolves id (or :default) to its dispatch target.
Returns {:ok, {:server, pid, entry}}, {:ok, {:direct, model, entry}}, or
{:error, :not_loaded | {:not_ready, status}}. Primarily for dispatch and
testing.
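The four result shapes listed above can be handled with one function clause each. RouteSketch and describe/1 are hypothetical names used only for illustration; the tuples matched are the ones given in the spec.

```elixir
defmodule RouteSketch do
  # Classifies the result of route/1 by pattern matching on its shape.
  def describe({:ok, {:server, pid, _entry}}) when is_pid(pid), do: {:server, pid}
  def describe({:ok, {:direct, _model, _entry}}), do: :direct
  def describe({:error, :not_loaded}), do: :not_loaded
  def describe({:error, {:not_ready, status}}), do: {:not_ready, status}
end

# Possible use against a running manager:
# "chat" |> LlamaCppEx.ModelManager.route() |> RouteSketch.describe()
```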
@spec set_default(id()) :: :ok | {:error, :not_loaded}
Sets the default model used by :default routing.
@spec start_link(keyword()) :: GenServer.on_start()
Starts the manager. Normally started by LlamaCppEx.ModelSupervisor.
Options
- :memory_budget - :infinity (default), :auto (~80% of system RAM), or a byte limit.
- :models - List of {id, source} or {id, source, opts} to auto-load after start. source is {:path, p} or {:hub, repo, file}.
- :io - Backend module (LlamaCppEx.ModelManager.Backend). Defaults to LlamaCppEx.ModelManager.ModelIO; overridden in tests.
- :name - GenServer name. Defaults to LlamaCppEx.ModelManager.
@spec stream(id(), String.t(), keyword()) :: Enumerable.t()
Routes a streaming generation request to model id (or :default).
Raises ArgumentError if the model is not loaded and ready (a lazy stream
cannot carry an error tuple).
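Callers that prefer a tagged tuple over an exception can convert the ArgumentError at the call site. In the sketch below, StreamSketch and collect/3 are hypothetical names, the manager module is passed in to keep the example self-contained, and the stream elements are assumed to be text chunks (binaries).

```elixir
defmodule StreamSketch do
  # `manager` is any module exposing stream/3 (for example
  # LlamaCppEx.ModelManager).
  # ASSUMPTION: the stream yields text chunks that can be joined.
  def collect(manager, id, prompt) do
    text = manager.stream(id, prompt, []) |> Enum.join()
    {:ok, text}
  rescue
    # Covers a raise at call time as well as one during enumeration.
    e in ArgumentError -> {:error, Exception.message(e)}
  end
end
```

Collecting the whole stream gives up incremental output, so this shape suits callers that only need the final text.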
Unloads a model and frees its backing resources (GC-based).
Stopping a backing server can take a moment for large models, so this accepts
an optional timeout (default 30s) rather than the 5s GenServer default.