LlamaCppEx.ModelManager.Budget (LlamaCppEx v0.8.24)

Advisory, placement-aware memory budgeting for LlamaCppEx.ModelManager.

Given a budget, the footprint a new model would occupy across devices, and what resident models already use, decide whether the new model fits. The manager refuses over-budget loads.

Budget shapes

  • :infinity / nil: no limit.
  • an integer: a single combined pool, where RAM plus all VRAM count against one number (backward-compatible with the original single-pool budget).
  • :auto: RAM ≈ 80% of system memory, and per-GPU VRAM from each device's free memory (via LlamaCppEx.devices/0).
  • a map %{ram: ram, vram: vram}: an explicit per-device budget. ram and each VRAM entry may be a byte limit or :infinity; ram and vram themselves may also be :auto. vram may be a list [b0, b1, ...] (indexed by GPU) or a map %{gpu_index => bytes}.
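
To make the shapes concrete, the sketch below lists one value of each kind and shows how a list-form vram could be rewritten as the index-keyed map form. BudgetShapesSketch and normalize_vram/1 are hypothetical names used only for illustration, not part of LlamaCppEx, and the byte figures are arbitrary.

```elixir
defmodule BudgetShapesSketch do
  @gib 1024 * 1024 * 1024

  # One example value per documented shape (byte figures are arbitrary).
  def examples do
    [
      :infinity,
      24 * @gib,
      :auto,
      %{ram: 32 * @gib, vram: [16 * @gib, 8 * @gib]},
      %{ram: :auto, vram: %{0 => 16 * @gib, 1 => :infinity}}
    ]
  end

  # Assumed normalisation: a list is indexed by GPU position,
  # a map is already keyed by GPU index and is kept as is.
  def normalize_vram(list) when is_list(list) do
    list
    |> Enum.with_index()
    |> Map.new(fn {limit, index} -> {index, limit} end)
  end

  def normalize_vram(map) when is_map(map), do: map
end
```

Under this assumption, normalize_vram([b0, b1]) and %{0 => b0, 1 => b1} describe the same budget.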

Placement across GPUs is derived from a model's :n_gpu_layers, :split_mode, :tensor_split, and :main_gpu. It is advisory — quantization, compute buffers, fragmentation, and exact KV growth are approximated, and partial offload (0 < n_gpu_layers < n_layers) is treated coarsely as fully offloaded.

Summary

Functions

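  • add_usage(used, placement) - Folds a placement into a usage accumulator.
  • check(map, placement, used) - Decides whether placement fits within budget given current used.
  • distribute(file_bytes, opts, n_gpus) - Estimates how a model's footprint is distributed across RAM and GPUs.
  • empty_usage() - An empty usage accumulator.
  • resolve(opt, gpu_devices \\ []) - Resolves a :memory_budget option into an internal budget.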

Types

budget()

@type budget() ::
  %{mode: :unlimited}
  | %{mode: :combined, limit: non_neg_integer()}
  | %{
      mode: :per_device,
      ram: limit(),
      vram: :infinity | %{required(non_neg_integer()) => limit()}
    }

limit()

@type limit() :: non_neg_integer() | :infinity

placement()

@type placement() :: %{
  ram: non_neg_integer(),
  vram: %{required(non_neg_integer()) => non_neg_integer()}
}

Functions

add_usage(used, placement)

@spec add_usage(placement(), placement()) :: placement()

Folds a placement into a usage accumulator.
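
A minimal sketch of what folding could look like, assuming RAM is summed and VRAM is summed per GPU index. UsageSketch is a hypothetical stand-in and not the library's implementation; the shape of the empty accumulator is likewise an assumption based on the placement() type.

```elixir
defmodule UsageSketch do
  # Assumed shape of an empty accumulator: no RAM and no GPU entries.
  def empty, do: %{ram: 0, vram: %{}}

  # Sum RAM, and sum VRAM per GPU index; GPUs seen for the first time
  # are simply added to the map.
  def add(used, placement) do
    %{
      ram: used.ram + placement.ram,
      vram: Map.merge(used.vram, placement.vram, fn _gpu, a, b -> a + b end)
    }
  end
end

placements = [
  %{ram: 100, vram: %{0 => 4_000}},
  %{ram: 50, vram: %{0 => 1_000, 1 => 2_000}}
]

used =
  Enum.reduce(placements, UsageSketch.empty(), fn placement, acc ->
    UsageSketch.add(acc, placement)
  end)

# used is %{ram: 150, vram: %{0 => 5_000, 1 => 2_000}}
```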

check(map, placement, used)

@spec check(budget(), placement(), placement()) ::
  :ok | {:error, {:insufficient_memory, keyword()}}

Decides whether placement fits within the budget passed as the first argument, given the usage already recorded in used.

Returns :ok or {:error, {:insufficient_memory, device: device, required: r, available: a}}, where device is :total (combined budget), :ram, or {:gpu, index}.
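
The decision can be sketched roughly as follows. CheckSketch is a hypothetical module, not the library's code. It assumes that "available" means the limit minus what is already used, and that a GPU missing from a per-device VRAM map is unlimited; the real implementation may treat both differently.

```elixir
defmodule CheckSketch do
  def check(%{mode: :unlimited}, _placement, _used), do: :ok

  # Combined pool: RAM plus all VRAM count against one number.
  def check(%{mode: :combined, limit: limit}, placement, used) do
    fits(:total, total(placement), limit, total(used))
  end

  # Per-device: RAM first, then each GPU the placement touches.
  def check(%{mode: :per_device, ram: ram, vram: vram}, placement, used) do
    with :ok <- fits(:ram, placement.ram, ram, used.ram) do
      Enum.reduce_while(placement.vram, :ok, fn {gpu, bytes}, :ok ->
        in_use = Map.get(used.vram, gpu, 0)

        case fits({:gpu, gpu}, bytes, vram_limit(vram, gpu), in_use) do
          :ok -> {:cont, :ok}
          error -> {:halt, error}
        end
      end)
    end
  end

  defp total(%{ram: ram, vram: vram}), do: ram + Enum.sum(Map.values(vram))

  defp vram_limit(:infinity, _gpu), do: :infinity
  # Assumption: a GPU missing from the map has no limit.
  defp vram_limit(map, gpu), do: Map.get(map, gpu, :infinity)

  defp fits(_device, _required, :infinity, _used), do: :ok

  defp fits(device, required, limit, used) do
    # Assumption: "available" is the limit minus what is already in use.
    available = max(limit - used, 0)

    if required <= available do
      :ok
    else
      {:error, {:insufficient_memory, device: device, required: required, available: available}}
    end
  end
end
```

With a per-device budget of 8_000 bytes on GPU 0, 6_000 already in use, and a placement that needs 3_000 there, this sketch returns an error whose device is {:gpu, 0}, with required: 3_000 and available: 2_000.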

distribute(file_bytes, opts, n_gpus)

@spec distribute(non_neg_integer(), keyword(), non_neg_integer()) :: placement()

Estimates how a model's footprint is distributed across RAM and GPUs.

Returns %{ram: bytes, vram: %{gpu_index => bytes}}.

Options (from the load opts)

  • :mode - :server adds a coarse KV-cache estimate; :direct adds none.
  • :n_gpu_layers - 0 keeps the model in RAM; anything else (incl. -1) is treated as fully offloaded to GPU(s).
  • :split_mode - :none (default) places everything on :main_gpu; :layer/:row splits by :tensor_split.
  • :tensor_split - per-GPU weights; when empty, an equal split across the n_gpus detected devices.
  • :main_gpu - target device for :none (default 0).
  • :offload_kqv - whether the KV cache lives on GPU (default true).
  • :n_ctx, :n_parallel - feed the KV-cache estimate.
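
The placement rules above can be illustrated for the model weights alone. DistributeSketch is a hypothetical module: it leaves out the KV-cache estimate entirely (so :mode, :n_ctx, :n_parallel, and :offload_kqv are ignored), requires :n_gpu_layers to be given, assumes a model stays in RAM when no GPU is detected, and assumes :tensor_split has a positive sum.

```elixir
defmodule DistributeSketch do
  # Weights-only sketch; the KV-cache estimate is deliberately omitted.
  def distribute(file_bytes, opts, n_gpus) do
    cond do
      # 0 layers on GPU (or, by assumption, no GPU at all): everything in RAM.
      Keyword.fetch!(opts, :n_gpu_layers) == 0 or n_gpus == 0 ->
        %{ram: file_bytes, vram: %{}}

      # :none places everything on :main_gpu (default 0).
      Keyword.get(opts, :split_mode, :none) == :none ->
        %{ram: 0, vram: %{Keyword.get(opts, :main_gpu, 0) => file_bytes}}

      # :layer / :row split proportionally to :tensor_split.
      true ->
        %{ram: 0, vram: split(file_bytes, weights(opts, n_gpus))}
    end
  end

  # An empty :tensor_split means an equal split across n_gpus devices.
  defp weights(opts, n_gpus) do
    case Keyword.get(opts, :tensor_split, []) do
      [] -> List.duplicate(1.0, n_gpus)
      list -> list
    end
  end

  defp split(file_bytes, weights) do
    total = Enum.sum(weights)

    weights
    |> Enum.with_index()
    |> Map.new(fn {weight, gpu} -> {gpu, round(file_bytes * weight / total)} end)
  end
end
```

For example, 1_000 bytes with split_mode: :layer and tensor_split: [3, 1] on two GPUs gives %{ram: 0, vram: %{0 => 750, 1 => 250}} in this sketch. Because each share is rounded independently, the shares may differ from the file size by a few bytes in total.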

empty_usage()

@spec empty_usage() :: placement()

An empty usage accumulator.

resolve(opt, gpu_devices \\ [])

@spec resolve(term(), [map()]) :: budget()

Resolves a :memory_budget option into an internal budget.

gpu_devices is the GPU subset of LlamaCppEx.devices/0 (maps with :gpu_index, :memory_free); it is only consulted for :auto VRAM.
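
As a rough illustration of the :auto case, the sketch below builds a per-device budget from a device list shaped like the one described above. ResolveSketch and its total_ram_bytes argument are hypothetical: how the library obtains the system memory figure is not documented here, so the total is passed in explicitly.

```elixir
defmodule ResolveSketch do
  # total_ram_bytes is a hypothetical argument standing in for however
  # the system memory total is actually obtained.
  def auto(total_ram_bytes, gpu_devices) do
    %{
      mode: :per_device,
      # RAM budget of roughly 80% of system memory, in whole bytes.
      ram: div(total_ram_bytes * 8, 10),
      # Per-GPU VRAM budget taken from each device's free memory.
      vram:
        Map.new(gpu_devices, fn %{gpu_index: index, memory_free: free} ->
          {index, free}
        end)
    }
  end
end
```

Called with a total of 1_000 bytes and a single device %{gpu_index: 0, memory_free: 500}, this yields %{mode: :per_device, ram: 800, vram: %{0 => 500}}.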