Inference context with KV cache.
Summary
Functions
Clears the KV cache.
Creates a new inference context for the given model.
Decodes a list of tokens through the model.
Runs the generation loop: decodes prompt tokens and generates up to max_tokens new tokens.
Returns the context size.
Returns the max number of sequences.
Types
@type t() :: %LlamaCppEx.Context{model: LlamaCppEx.Model.t(), ref: reference()}
Functions
@spec clear(t()) :: :ok
Clears the KV cache.
@spec create(LlamaCppEx.Model.t(), keyword()) :: {:ok, t()} | {:error, String.t()}
Creates a new inference context for the given model.
Options
Core
:n_ctx - Context size (max tokens). Defaults to 2048.
:n_batch - Max tokens per decode batch. Defaults to n_ctx.
:n_ubatch - Max tokens per micro-batch. Defaults to 512.
:n_threads - Number of threads for generation. Defaults to system CPU count.
:n_threads_batch - Number of threads for prompt processing. Defaults to :n_threads.
:n_seq_max - Max number of concurrent sequences. Defaults to 1.
:embeddings - Enable embedding extraction. Defaults to false.
:pooling_type - Pooling type for embeddings: :unspecified, :none, :mean, :cls, :last, :rank. Defaults to :unspecified.
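As a minimal sketch of how the core options might be passed to create/2: the ContextSketch module, its helper names, and the chosen default values are hypothetical and not part of LlamaCppEx. The model argument is assumed to be an already-loaded LlamaCppEx.Model.t(); how it is loaded is outside this section.

```elixir
defmodule ContextSketch do
  # Hypothetical helper, not part of LlamaCppEx: merges caller overrides
  # onto a few of the core options documented above.
  def core_opts(overrides \\ []) do
    defaults = [n_ctx: 4096, n_batch: 512, n_seq_max: 1]
    Keyword.merge(defaults, overrides)
  end

  # `model` is assumed to be an already-loaded LlamaCppEx.Model.t().
  def open(model, overrides \\ []) do
    case LlamaCppEx.Context.create(model, core_opts(overrides)) do
      {:ok, ctx} -> {:ok, ctx}
      {:error, reason} -> {:error, "context creation failed: " <> reason}
    end
  end
end
```

Keeping the option list in a pure function makes it easy to inspect or test the configuration without loading a model.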
KV Cache Quantization
:type_k - Data type for K cache. Reduces memory at the cost of precision. Values: :f16 (default), :f32, :q8_0, :q4_0, :q4_1, :q5_0, :q5_1, :bf16.
:type_v - Data type for V cache. Same values as :type_k. Defaults to :f16.
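To get a feel for the memory trade-off, the following sketch estimates KV cache size for a given pair of cache types. The KvCacheEstimate module is hypothetical. The bytes-per-element figures follow the ggml block layouts (quantized types pack 32 elements per block plus scale metadata), and the model shape in the usage note is an assumed example. The memory actually allocated may differ because of padding or other implementation details.

```elixir
defmodule KvCacheEstimate do
  # Bytes per element for each cache type. Quantized ggml types store
  # 32 elements per block plus scale metadata, hence the fractions.
  @bytes_per_element %{
    f32: 4.0,
    f16: 2.0,
    bf16: 2.0,
    q8_0: 34 / 32,
    q5_1: 24 / 32,
    q5_0: 22 / 32,
    q4_1: 20 / 32,
    q4_0: 18 / 32
  }

  # Estimated bytes for one of the two caches (K or V).
  def cache_bytes(type, n_ctx, n_layers, n_kv_heads, head_dim) do
    n_ctx * n_layers * n_kv_heads * head_dim * Map.fetch!(@bytes_per_element, type)
  end

  # Estimated total for K plus V, in MiB. `opts` uses the same :type_k and
  # :type_v keys as the context options; `shape` is a hypothetical map
  # describing the model.
  def total_mib(opts, shape) do
    %{n_ctx: n_ctx, n_layers: l, n_kv_heads: h, head_dim: d} = shape
    k = cache_bytes(Keyword.get(opts, :type_k, :f16), n_ctx, l, h, d)
    v = cache_bytes(Keyword.get(opts, :type_v, :f16), n_ctx, l, h, d)
    (k + v) / (1024 * 1024)
  end
end
```

For an assumed shape of 32 layers, 8 KV heads, head dimension 128, and n_ctx of 2048, this estimate gives 256 MiB with the :f16 defaults and 136 MiB with both caches set to :q8_0.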
Flash Attention & GPU Offload
:flash_attn - Flash Attention mode: :auto (default), :enabled, :disabled.
:offload_kqv - Offload KQV ops and KV cache to GPU. Defaults to true.
:op_offload - Offload host tensor operations to device. Defaults to true.
RoPE Scaling (Context Extension)
:rope_scaling_type - RoPE scaling mode: :unspecified (default), :none, :linear, :yarn, :longrope.
:rope_freq_base - RoPE base frequency. 0.0 uses model default.
:rope_freq_scale - RoPE frequency scale. 0.0 uses model default.
:yarn_ext_factor - YaRN extrapolation mix factor. -1.0 to disable.
:yarn_attn_factor - YaRN magnitude scaling. -1.0 to disable.
:yarn_beta_fast - YaRN low correction dimension. -1.0 to disable.
:yarn_beta_slow - YaRN high correction dimension. -1.0 to disable.
:yarn_orig_ctx - YaRN original context length. 0 to disable.
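As an illustration of how these options might combine for context extension, the sketch below builds options for linear RoPE scaling. The RopeSketch module is hypothetical, and it assumes that :rope_freq_scale follows the llama.cpp convention of being the reciprocal of the extension factor (trained context divided by target context). Check this against the model and library version in use.

```elixir
defmodule RopeSketch do
  # Hypothetical helper, not part of LlamaCppEx: options for linear RoPE
  # scaling from the context length the model was trained with to a
  # longer target context.
  def linear_opts(trained_ctx, target_ctx)
      when trained_ctx > 0 and target_ctx >= trained_ctx do
    [
      n_ctx: target_ctx,
      rope_scaling_type: :linear,
      # Assumption: the scale is trained/target, i.e. 1 / extension factor.
      rope_freq_scale: trained_ctx / target_ctx
    ]
  end
end
```

For example, RopeSketch.linear_opts(4096, 8192) yields a :rope_freq_scale of 0.5, and the resulting keyword list can be passed as the options argument of create/2.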
Misc
:attention_type - Attention type: :unspecified (default), :causal, :non_causal. Use :non_causal for embedding models.
:no_perf - Disable performance timing. Defaults to true.
:swa_full - Use full-size sliding window attention cache. Defaults to true.
Decodes a list of tokens through the model.
@spec generate(t(), LlamaCppEx.Sampler.t(), [integer()], keyword()) :: {:ok, String.t()} | {:error, String.t()}
Runs the generation loop: decodes prompt tokens and generates up to max_tokens new tokens.
Returns the generated text (not including the prompt).
Options
:max_tokens - Maximum tokens to generate. Defaults to 256.
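A minimal sketch of calling generate/4 with error handling follows. The GenerateSketch module is hypothetical. The context, sampler, and prompt tokens are assumed to have been created elsewhere, since tokenization and sampler setup are not covered in this section. Clearing the KV cache before each fresh prompt is an assumption about typical usage, not a documented requirement.

```elixir
defmodule GenerateSketch do
  alias LlamaCppEx.Context

  # `ctx`, `sampler` and `prompt_tokens` are assumed to exist already;
  # this wrapper only uses clear/1 and generate/4 as specified above.
  def run(ctx, sampler, prompt_tokens, max_tokens \\ 128) do
    # Assumption: start from an empty KV cache for an unrelated prompt.
    :ok = Context.clear(ctx)

    case Context.generate(ctx, sampler, prompt_tokens, max_tokens: max_tokens) do
      # The returned text excludes the prompt; trim surrounding whitespace.
      {:ok, text} -> {:ok, String.trim(text)}
      {:error, reason} -> {:error, reason}
    end
  end
end
```

Both return shapes match the documented spec, so callers can pattern match on {:ok, text} and {:error, reason} directly.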
Returns the context size.
Returns the max number of sequences.