LlamaCppEx.Context (LlamaCppEx v0.8.5)


Inference context with KV cache.

Summary

Functions

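  • clear(context) - Clears the KV cache.
  • create(model, opts \\ []) - Creates a new inference context for the given model.
  • decode(context, tokens) - Decodes a list of tokens through the model.
  • generate(context, sampler, tokens, opts \\ []) - Runs the generation loop: decodes prompt tokens and generates up to :max_tokens new tokens.
  • n_ctx(context) - Returns the context size.
  • n_seq_max(context) - Returns the max number of sequences.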

Types

t()

@type t() :: %LlamaCppEx.Context{model: LlamaCppEx.Model.t(), ref: reference()}

Functions

clear(context)

@spec clear(t()) :: :ok

Clears the KV cache.

create(model, opts \\ [])

@spec create(
  LlamaCppEx.Model.t(),
  keyword()
) :: {:ok, t()} | {:error, String.t()}

Creates a new inference context for the given model.

Options

Core

  • :n_ctx - Context size (max tokens). Defaults to 2048.
  • :n_batch - Max tokens per decode batch. Defaults to :n_ctx.
  • :n_ubatch - Max tokens per micro-batch. Defaults to 512.
  • :n_threads - Number of threads for generation. Defaults to system CPU count.
  • :n_threads_batch - Number of threads for prompt processing. Defaults to :n_threads.
  • :n_seq_max - Max number of concurrent sequences. Defaults to 1.
  • :embeddings - Enable embedding extraction. Defaults to false.
  • :pooling_type - Pooling type for embeddings: :unspecified, :none, :mean, :cls, :last, :rank. Defaults to :unspecified.
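As a minimal sketch of how the core options might be combined, the helper below merges caller overrides into a set of defaults and passes the result to create/2. The module name ContextSetup and the default values are illustrative assumptions, not part of LlamaCppEx; model is assumed to be an already loaded LlamaCppEx.Model.t().

```elixir
defmodule ContextSetup do
  # Hypothetical helper module, not part of LlamaCppEx.

  # Illustrative defaults for a single-sequence workload.
  def default_opts do
    [
      n_ctx: 4096,
      n_batch: 512,
      n_threads: System.schedulers_online(),
      n_seq_max: 1
    ]
  end

  # `model` is assumed to be an already loaded LlamaCppEx.Model.t().
  # Overrides win over the defaults above.
  def open(model, overrides \\ []) do
    opts = Keyword.merge(default_opts(), overrides)

    case LlamaCppEx.Context.create(model, opts) do
      {:ok, context} -> {:ok, context}
      {:error, reason} -> {:error, "context creation failed: " <> reason}
    end
  end
end
```

A call such as ContextSetup.open(model, n_ctx: 8192) would keep the other defaults and only enlarge the context window.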

KV Cache Quantization

  • :type_k - Data type for K cache. Reduces memory at the cost of precision. Values: :f16 (default), :f32, :q8_0, :q4_0, :q4_1, :q5_0, :q5_1, :bf16.
  • :type_v - Data type for V cache. Same values as :type_k. Defaults to :f16.
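To make the memory trade-off concrete, the sketch below estimates KV cache size from the ggml block layouts of each type (for example, :q8_0 stores 32 values in a 34-byte block). The module KvCacheEstimate and the model dimensions in the usage note are hypothetical; the estimate assumes every layer caches the full context and ignores any padding or per-backend overhead, so treat it as a rough guide only.

```elixir
defmodule KvCacheEstimate do
  # Hypothetical helper module, not part of LlamaCppEx.

  # {bytes per block, elements per block} for each cache type.
  @layouts %{
    f32: {4, 1},
    f16: {2, 1},
    bf16: {2, 1},
    q8_0: {34, 32},
    q4_0: {18, 32},
    q4_1: {20, 32},
    q5_0: {22, 32},
    q5_1: {24, 32}
  }

  # Bytes needed to store `n_elements` values of the given type.
  # Assumes `n_elements` is a multiple of the block size.
  def bytes(type, n_elements) do
    {block_bytes, block_elems} = Map.fetch!(@layouts, type)
    div(n_elements, block_elems) * block_bytes
  end

  # n_embd_kv is the number of KV heads times the head dimension.
  def total_bytes(n_layer, n_ctx, n_embd_kv, type_k, type_v) do
    n_elements = n_layer * n_ctx * n_embd_kv
    bytes(type_k, n_elements) + bytes(type_v, n_elements)
  end
end
```

For a hypothetical model with 32 layers and n_embd_kv = 1024 at n_ctx = 2048, KvCacheEstimate.total_bytes(32, 2048, 1024, :f16, :f16) gives 268,435,456 bytes (256 MiB), while :q8_0 for both caches gives 142,606,336 bytes (136 MiB).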

Flash Attention & GPU Offload

  • :flash_attn - Flash Attention mode: :auto (default), :enabled, :disabled.
  • :offload_kqv - Offload KQV ops and KV cache to GPU. Defaults to true.
  • :op_offload - Offload host tensor operations to device. Defaults to true.

RoPE Scaling (Context Extension)

  • :rope_scaling_type - RoPE scaling mode: :unspecified (default), :none, :linear, :yarn, :longrope.
  • :rope_freq_base - RoPE base frequency. 0.0 uses model default.
  • :rope_freq_scale - RoPE frequency scale. 0.0 uses model default.
  • :yarn_ext_factor - YaRN extrapolation mix factor. -1.0 to disable.
  • :yarn_attn_factor - YaRN magnitude scaling. -1.0 to disable.
  • :yarn_beta_fast - YaRN low correction dimension. -1.0 to disable.
  • :yarn_beta_slow - YaRN high correction dimension. -1.0 to disable.
  • :yarn_orig_ctx - YaRN original context length. 0 to disable.
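The sketch below shows one way the RoPE options might be assembled when extending a model beyond its training context. It follows the llama.cpp convention that the frequency scale is the inverse of the extension factor (extending 4096 to 16384 gives 0.25) and assumes this binding passes the values through unchanged. RopeScalingOpts is a hypothetical helper, and orig_ctx must come from the model's own metadata.

```elixir
defmodule RopeScalingOpts do
  # Hypothetical helper module, not part of LlamaCppEx.
  # orig_ctx: context length the model was trained with.
  # target_ctx: desired extended context length.

  # Linear (position interpolation) scaling.
  def linear(orig_ctx, target_ctx) when orig_ctx > 0 and target_ctx >= orig_ctx do
    [
      n_ctx: target_ctx,
      rope_scaling_type: :linear,
      rope_freq_scale: orig_ctx / target_ctx
    ]
  end

  # YaRN scaling; the remaining :yarn_* options keep their defaults.
  def yarn(orig_ctx, target_ctx) when orig_ctx > 0 and target_ctx >= orig_ctx do
    [
      n_ctx: target_ctx,
      rope_scaling_type: :yarn,
      rope_freq_scale: orig_ctx / target_ctx,
      yarn_orig_ctx: orig_ctx
    ]
  end
end
```

The resulting keyword list can be passed as the opts argument of create/2, for example LlamaCppEx.Context.create(model, RopeScalingOpts.yarn(4096, 16384)).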

Misc

  • :attention_type - Attention type: :unspecified (default), :causal, :non_causal. Use :non_causal for embedding models.
  • :no_perf - Disable performance timing. Defaults to true.
  • :swa_full - Use full-size sliding window attention cache. Defaults to true.
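For embedding workloads, the options above are typically set together. The following is a minimal sketch using only option names from this page; the module name EmbeddingContextOpts, the choice of :mean pooling, and the default n_ctx of 512 are illustrative assumptions that depend on the model in use.

```elixir
defmodule EmbeddingContextOpts do
  # Hypothetical helper module, not part of LlamaCppEx.
  # Builds create/2 options for an embedding model.
  def build(n_ctx \\ 512) do
    [
      n_ctx: n_ctx,
      embeddings: true,
      pooling_type: :mean,
      attention_type: :non_causal
    ]
  end
end

# Usage, assuming `model` is a loaded embedding model:
# {:ok, context} = LlamaCppEx.Context.create(model, EmbeddingContextOpts.build())
```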

decode(context, tokens)

@spec decode(t(), [integer()]) :: :ok | {:error, String.t()}

Decodes a list of tokens through the model.
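When a prompt is longer than :n_batch, one option is to feed it in chunks and stop at the first error. The sketch below does this with a hypothetical ChunkedDecode module. It assumes that successive decode/2 calls extend the same KV cache, which this page does not state, so verify that behaviour before relying on it. The decode function is injectable so the chunking logic can be exercised without a model.

```elixir
defmodule ChunkedDecode do
  # Hypothetical helper module, not part of LlamaCppEx.
  # Assumption: each decode/2 call appends to the existing KV cache.
  def run(context, tokens, chunk_size, decode_fun \\ &LlamaCppEx.Context.decode/2)
      when is_integer(chunk_size) and chunk_size > 0 do
    tokens
    |> Enum.chunk_every(chunk_size)
    |> Enum.reduce_while(:ok, fn chunk, :ok ->
      case decode_fun.(context, chunk) do
        # Chunk accepted, continue with the next one.
        :ok -> {:cont, :ok}
        # Stop immediately and return the error tuple unchanged.
        {:error, _reason} = error -> {:halt, error}
      end
    end)
  end
end
```

The return value matches decode/2 itself: :ok when every chunk succeeds, or the first {:error, reason} encountered.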

generate(context, sampler, tokens, opts \\ [])

@spec generate(t(), LlamaCppEx.Sampler.t(), [integer()], keyword()) ::
  {:ok, String.t()} | {:error, String.t()}

Runs the generation loop: decodes the prompt tokens, then generates up to :max_tokens new tokens.

Returns the generated text (not including the prompt).

Options

  • :max_tokens - Maximum tokens to generate. Defaults to 256.
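Because the prompt and the generated tokens share one context window, it can be useful to check the budget before generating. The sketch below is one way to do that with n_ctx/1 and generate/4. BoundedGenerate is a hypothetical module; sampler is assumed to be an existing LlamaCppEx.Sampler.t() and tokens an already tokenized prompt. Whether generate/4 performs a similar check internally is not stated on this page.

```elixir
defmodule BoundedGenerate do
  # Hypothetical helper module, not part of LlamaCppEx.

  # True when the prompt plus the requested output fits in the context window.
  def fits?(n_ctx, n_prompt, max_tokens), do: n_prompt + max_tokens <= n_ctx

  # `sampler` is assumed to be an existing LlamaCppEx.Sampler.t() and
  # `tokens` an already tokenized prompt.
  def run(context, sampler, tokens, max_tokens \\ 256) do
    n_ctx = LlamaCppEx.Context.n_ctx(context)
    n_prompt = length(tokens)

    if fits?(n_ctx, n_prompt, max_tokens) do
      LlamaCppEx.Context.generate(context, sampler, tokens, max_tokens: max_tokens)
    else
      {:error,
       "prompt (#{n_prompt} tokens) plus max_tokens (#{max_tokens}) exceeds n_ctx (#{n_ctx})"}
    end
  end
end
```

With the default n_ctx of 2048 and max_tokens of 256, prompts of up to 1792 tokens pass the check.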

n_ctx(context)

@spec n_ctx(t()) :: integer()

Returns the context size.

n_seq_max(context)

@spec n_seq_max(t()) :: integer()

Returns the max number of sequences.