Advisory, placement-aware memory budgeting for LlamaCppEx.ModelManager.
Given a budget, the footprint a new model would occupy across devices, and what resident models already use, decide whether the new model fits. The manager refuses over-budget loads.
Budget shapes
- :infinity / nil - no limit.
- an integer - a single combined pool: RAM plus all VRAM counts against one number (backward-compatible with the original single-pool budget).
- :auto - RAM ≈ 80% of system memory, and per-GPU VRAM from each device's free memory (via LlamaCppEx.devices/0).
- a map %{ram: ram, vram: vram} - explicit per-device budget. ram and each VRAM entry may be a byte limit, :infinity, or (for ram/vram) :auto. vram may be a list [b0, b1, ...] (indexed by GPU) or a map %{gpu_index => bytes}.
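As an illustration, the shapes listed above could be written as the following values. This is only a sketch: the module name BudgetShapes and the byte figures are made up for the example, and nothing here calls into LlamaCppEx.

```elixir
defmodule BudgetShapes do
  # Hypothetical module: example values only, not part of LlamaCppEx.
  @gib 1024 * 1024 * 1024

  def examples do
    [
      # No limit.
      :infinity,
      nil,
      # Single combined pool: RAM plus all VRAM share 24 GiB.
      24 * @gib,
      # RAM and per-GPU VRAM derived automatically.
      :auto,
      # Explicit per-device budget, VRAM as a list indexed by GPU.
      %{ram: 32 * @gib, vram: [24 * @gib, 24 * @gib]},
      # VRAM as a map keyed by GPU index; entries may be :infinity.
      %{ram: :auto, vram: %{0 => 24 * @gib, 1 => :infinity}},
      # Explicit RAM limit with automatic VRAM.
      %{ram: 16 * @gib, vram: :auto}
    ]
  end
end
```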
Placement across GPUs is derived from a model's :n_gpu_layers, :split_mode,
:tensor_split, and :main_gpu. It is advisory — quantization, compute
buffers, fragmentation, and exact KV growth are approximated, and partial
offload (0 < n_gpu_layers < n_layers) is treated coarsely as fully offloaded.
Summary
Functions
Folds a placement into a usage accumulator.
Decides whether placement fits within budget given current used.
Estimates how a model's footprint is distributed across RAM and GPUs.
An empty usage accumulator.
Resolves a :memory_budget option into an internal budget.
Types
@type budget() ::
  %{mode: :unlimited}
  | %{mode: :combined, limit: non_neg_integer()}
  | %{
      mode: :per_device,
      ram: limit(),
      vram: :infinity | %{required(non_neg_integer()) => limit()}
    }
@type limit() :: non_neg_integer() | :infinity
@type placement() :: %{
  ram: non_neg_integer(),
  vram: %{required(non_neg_integer()) => non_neg_integer()}
}
Functions
Folds a placement into a usage accumulator.
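A minimal sketch of what such a fold might look like, assuming the accumulator has the same shape as placement() and that an empty accumulator is %{ram: 0, vram: %{}}. The module UsageSketch is hypothetical and is not the library's implementation.

```elixir
defmodule UsageSketch do
  # Hypothetical helper: adds a placement's bytes to a usage accumulator
  # of the same shape.
  def add(used, placement) do
    %{
      ram: used.ram + placement.ram,
      # Sum per-GPU bytes; a GPU present in only one map keeps its value.
      vram: Map.merge(used.vram, placement.vram, fn _gpu, a, b -> a + b end)
    }
  end
end

used = %{ram: 0, vram: %{}}
used = UsageSketch.add(used, %{ram: 1_000, vram: %{0 => 4_000}})
used = UsageSketch.add(used, %{ram: 500, vram: %{0 => 1_000, 1 => 2_000}})
# used is now %{ram: 1_500, vram: %{0 => 5_000, 1 => 2_000}}
```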
@spec check(budget(), placement(), placement()) :: :ok | {:error, {:insufficient_memory, keyword()}}
Decides whether placement fits within budget given current used.
Returns :ok or
{:error, {:insufficient_memory, device: device, required: r, available: a}},
where device is :total (combined budget), :ram, or {:gpu, index}.
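The decision can be sketched as follows. CheckSketch is a hypothetical re-implementation written only to illustrate the return shapes. It assumes that available means the limit minus what resident models already use, and that a GPU missing from a per-device VRAM map has a budget of 0; the source confirms neither.

```elixir
defmodule CheckSketch do
  # Hypothetical re-implementation for illustration only.
  def check(%{mode: :unlimited}, _placement, _used), do: :ok

  def check(%{mode: :combined, limit: limit}, placement, used) do
    # One pool: compare total bytes against what is left of the limit.
    fits(:total, total(placement), remaining(limit, total(used)))
  end

  def check(%{mode: :per_device, ram: ram, vram: vram}, placement, used) do
    with :ok <- fits(:ram, placement.ram, remaining(ram, used.ram)) do
      # Check GPUs in index order and stop at the first one that does not fit.
      Enum.reduce_while(Enum.sort(placement.vram), :ok, fn {gpu, required}, :ok ->
        # Assumption: a GPU with no entry in the budget has a limit of 0.
        limit = if vram == :infinity, do: :infinity, else: Map.get(vram, gpu, 0)
        in_use = Map.get(used.vram, gpu, 0)

        case fits({:gpu, gpu}, required, remaining(limit, in_use)) do
          :ok -> {:cont, :ok}
          error -> {:halt, error}
        end
      end)
    end
  end

  defp total(%{ram: ram, vram: vram}), do: ram + Enum.sum(Map.values(vram))

  defp remaining(:infinity, _used), do: :infinity
  defp remaining(limit, used), do: max(limit - used, 0)

  defp fits(_device, _required, :infinity), do: :ok
  defp fits(_device, required, available) when required <= available, do: :ok

  defp fits(device, required, available),
    do: {:error, {:insufficient_memory, device: device, required: required, available: available}}
end

budget = %{mode: :per_device, ram: 8_000, vram: %{0 => 10_000}}
used = %{ram: 1_000, vram: %{0 => 7_000}}

# Fits: GPU 0 has 3_000 bytes left.
CheckSketch.check(budget, %{ram: 500, vram: %{0 => 2_000}}, used)

# Does not fit: {:gpu, 0} requires 4_000 with 3_000 available.
CheckSketch.check(budget, %{ram: 500, vram: %{0 => 4_000}}, used)
```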
@spec distribute(non_neg_integer(), keyword(), non_neg_integer()) :: placement()
Estimates how a model's footprint is distributed across RAM and GPUs.
Returns %{ram: bytes, vram: %{gpu_index => bytes}}.
Options (from the load opts)
- :mode - :server adds a coarse KV-cache estimate; :direct adds none.
- :n_gpu_layers - 0 keeps the model in RAM; anything else (incl. -1) is treated as fully offloaded to GPU(s).
- :split_mode - :none (default) places everything on :main_gpu; :layer / :row splits by :tensor_split.
- :tensor_split - per-GPU weights; when empty, an equal split across the n_gpus detected devices.
- :main_gpu - target device for :none (default 0).
- :offload_kqv - whether the KV cache lives on GPU (default true).
- :n_ctx, :n_parallel - feed the KV-cache estimate.
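The placement rules for the weights could be sketched roughly as below. DistributeSketch is hypothetical and deliberately incomplete: it places model bytes only and omits the KV-cache estimate, since the source does not give that formula. It also assumes a missing :n_gpu_layers means 0, that a host with no detected GPUs keeps everything in RAM, and that shares are rounded up.

```elixir
defmodule DistributeSketch do
  # Hypothetical sketch: places `bytes` of weights only; no KV-cache estimate.
  def distribute(bytes, opts, n_gpus) do
    # Assumption: an absent :n_gpu_layers is treated as 0 in this sketch.
    n_gpu_layers = Keyword.get(opts, :n_gpu_layers, 0)

    cond do
      # No offload (or, by assumption, no GPUs): everything stays in RAM.
      n_gpu_layers == 0 or n_gpus == 0 ->
        %{ram: bytes, vram: %{}}

      # :none puts the whole model on :main_gpu.
      Keyword.get(opts, :split_mode, :none) == :none ->
        %{ram: 0, vram: %{Keyword.get(opts, :main_gpu, 0) => bytes}}

      # :layer / :row split by :tensor_split.
      true ->
        %{ram: 0, vram: split(bytes, weights(opts, n_gpus))}
    end
  end

  # An empty :tensor_split means an equal split across the detected GPUs.
  defp weights(opts, n_gpus) do
    case Keyword.get(opts, :tensor_split, []) do
      [] -> List.duplicate(1, n_gpus)
      list -> list
    end
  end

  # Proportional shares, rounded up; GPUs with weight 0 get no entry.
  defp split(bytes, weights) do
    total = Enum.sum(weights)

    weights
    |> Enum.with_index()
    |> Enum.reject(fn {w, _gpu} -> w == 0 end)
    |> Map.new(fn {w, gpu} -> {gpu, ceil(bytes * w / total)} end)
  end
end

# A 3:1 row split of 1_000 bytes across two GPUs.
DistributeSketch.distribute(1_000, [n_gpu_layers: -1, split_mode: :row, tensor_split: [3, 1]], 2)
```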
@spec empty_usage() :: placement()
An empty usage accumulator.
Resolves a :memory_budget option into an internal budget.
gpu_devices is the GPU subset of LlamaCppEx.devices/0 (maps with
:gpu_index, :memory_free); it is only consulted for :auto VRAM.
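To make the mapping from option to internal budget concrete, here is a hedged sketch. ResolveSketch is hypothetical; it takes total system RAM as an explicit third argument because the source does not say how system memory is measured, and it assumes a map budget always carries both :ram and :vram.

```elixir
defmodule ResolveSketch do
  # Hypothetical sketch, not the library's function. `system_ram` is total
  # system memory in bytes, supplied by the caller.
  def resolve(nil, _gpus, _system_ram), do: %{mode: :unlimited}
  def resolve(:infinity, _gpus, _system_ram), do: %{mode: :unlimited}

  def resolve(limit, _gpus, _system_ram) when is_integer(limit) and limit >= 0,
    do: %{mode: :combined, limit: limit}

  def resolve(:auto, gpus, system_ram),
    do: resolve(%{ram: :auto, vram: :auto}, gpus, system_ram)

  def resolve(%{ram: ram, vram: vram}, gpus, system_ram) do
    %{mode: :per_device, ram: resolve_ram(ram, system_ram), vram: resolve_vram(vram, gpus)}
  end

  # :auto RAM is roughly 80% of system memory.
  defp resolve_ram(:auto, system_ram), do: div(system_ram * 8, 10)
  defp resolve_ram(limit, _system_ram), do: limit

  defp resolve_vram(:infinity, _gpus), do: :infinity

  # :auto VRAM takes each device's free memory; only this case reads `gpus`.
  defp resolve_vram(:auto, gpus), do: Map.new(gpus, &{&1.gpu_index, &1.memory_free})

  # A list is indexed by GPU: [b0, b1, ...] becomes %{0 => b0, 1 => b1, ...}.
  defp resolve_vram(list, _gpus) when is_list(list) do
    list |> Enum.with_index() |> Map.new(fn {limit, gpu} -> {gpu, limit} end)
  end

  defp resolve_vram(map, _gpus) when is_map(map), do: map
end

gpus = [%{gpu_index: 0, memory_free: 8_000}, %{gpu_index: 1, memory_free: 6_000}]

ResolveSketch.resolve(:auto, gpus, 10_000)
# per-device budget: ram 8_000, vram %{0 => 8_000, 1 => 6_000}
```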