v0.8.3
Changed
- llama.cpp submodule — Updated from b97ebdc98 to e48034dfc (14 commits).
- common: determine generation prompt using longest common prefix (#22657)
- convert: Mistral format yarn `apply_scale` support (#22612); apply Q/K RoPE permutation in NVFP4 repack path (#22611); disable uint types (#18908)
- CUDA: fix device PCI bus ID de-dupe OOMing (ignoring other 3 GPUs entirely) (#22533)
- server: avoid checkpoint data host copies (#22558)
- ggml-virtgpu: fix circular dependency in headers (#22557)
- opencl: Adreno optimization for MoE - MxFP4 (#22301)
- hexagon: HMX flash attention (#22347)
- ggml: bump version to 0.10.2; sync ggml; try fix win32 build
v0.8.2
Changed
- llama.cpp submodule — Updated from d77599234 to b97ebdc98 (18 commits).
- llama-quant: fix `--tensor-type` when default `qtype` is overridden (#22572); add fast matmul iquants (#22504)
- CUDA: fix tile FA kernel on Pascal (#22541)
- vulkan: support asymmetric FA in coopmat2 path (#21753); add get/set tensor 2d functions (#22514)
- ggml-webgpu: fix vectorized handling in mul-mat and mul-mat-id (#22578); add the upscale shader (#22419); improve performance of mat-vec and mat-mat for `MUL_MAT_ID` (#22464)
- hexagon: enable non-contiguous row tensor support for unary ops (#22574)
- llama-mmap: use `ftello/fseeko` (#22497)
- spec: fix draft model checkpoints (#22521); fix vocab compat checks in spec example (#22426); fix argument typo (#22552)
- common: check for null `getpwuid` in hf-cache (#22550)
- webui: Spring Cleaning Refactor v1 (#22505)
- vendor: update cpp-httplib to 0.43.2 (#22548)
- ci: bump ty to 0.0.33 (#22535)
- scripts: add `wc2wt.sh` - create worktree from current HEAD (#22513)
v0.8.1
Changed
- llama.cpp submodule — Updated from 98dc1418e to d77599234 (49 commits).
- server: use `pos_next` instead of `n_tokens` for m-rope (#22439); (router) forward form-data to model server (#22118)
- CUDA: fuse SSM_CONV + ADD(bias) + SILU (#22478); refactor fusion code (#22468); Blackwell native NVFP4 support (#22196); flash-attn support for DKQ=320/DV=256 with `ncols2=32` (#22286); better coalesce data-access for contiguous concat (#22330)
- ggml-cpu: disable tiled matmul on AIX to fix page boundary segfault (#22293); append `xsmtvdotii` march for SpacemiT IME (#22317); re-enable fast `gelu_quick_f16` (#22339); optimize avx2 q6_k (#22345); SVE-tuned `gemm_q8_0_4x8_q8_0` kernel (#21916)
- ggml-webgpu: fix FlashAttention support check (#22492); fix buffer aliasing for `ssm_scan` (#22456); add Q1_0 support (#22374)
- vulkan: coalesce Q4_K/Q5_K scale loads (#21751); add barrier after `writetimestamp` (#21865)
- ggml: bump version to 0.10.1; use 64-byte aligned tile buffers (#21058); skip already-registered backends and devices (#22296); revert to `-lm` linking instead of `find_library` (#22355); improve SPIR-V headers detection with `__has_include` (#21918)
- hexagon: make vmem and buffer-size configurable (#22487); guard HMX clock request for v75+ platforms (#22377)
- spec: discard last drafted token with low prob (#22506); refactor params (#22397)
- common: do not pass prompt tokens to reasoning budget sampler (#22488); re-arm reasoning budget after DONE on new `<think>` (#22323); intentionally leak logger instance to fix hanging on Windows (#22273); fix missing exports in `llama-common` (#22340)
- chat: fix handling of space in reasoning markers (#22353); handle gemma4 parsing edge cases (#22420)
- convert: add support for Nemotron Nano 3 Omni (#22481); remove `input_scale` for dequantized fp8 modelopt (#22356)
- model: remove duplicate `wo_s` scale after `build_attn` (Qwen3, LLaMA) (#22421)
- opencl: add iq4_nl support (#22272)
- CANN: add new ops, optimize existing ops (#21204)
- TP: fix delayed AllReduce + zero-sized slices (#22489)
- rpc: fix rpc-server cache on Windows (#22394)
- download: prefer q8_0 when q4_k not available (#22428)
- webui: fix slow mic stop and WAV encode (#22480); add Server tools (#21237)
v0.8.0
Changed
- llama.cpp submodule — Updated from 550d684bd to 98dc1418e (30 commits).
- server: fix swa-full logic (#22288); rename debug tags to match `--cache-idle-slots` (#22292); `convert_anthropic_to_oai` also copy `chat_template_kwargs` (#22154); fix heap-buffer-overflow from negative `n_discard` (CVE-2026-21869) (#22267); (anthropic API) fix prefix caching (#21793)
- CUDA: reduce MMQ stream-k overhead (#22298)
- metal: optimize Metal Tensor API usage for `GGML_OP_MUL_MAT` (#20962); print GPU description (#22318)
- SYCL: optimize Q4_0 `mul_mat` for Arc770, add scripts (#22291); fix build number for SYCL release (#22283)
- hexagon: bump HMX frequency to max corner (#22334); use DIRID 13 in `libggml-htp.inf` for modern InfVerif (#22306); add SOLVE_TRI op (#21974); add basic and extended op profiling (#22269)
- ggml-webgpu: support for SSM_SCAN and disable `set_rows` error checking (#22327); enable `FLASH_ATTN_EXT` on browser without subgroup matrix (#22199)
- llama-quant: default ftype param `Q5_1` → `Q8_0` (#20828)
- spec: fix vocab compat checks (#22358)
- parser: fix structured output bug (#22302)
- common: fix jinja warnings with clang 21 (#22313)
- vendor: update LibreSSL to 4.3.1 (#22285)
v0.7.9
Changed
- llama.cpp submodule — Updated from 45cac7ca7 to 550d684bd (69 commits).
- server: Enable transcriptions API for LFM2-Audio (#22000); ignore reasoning content from transcription api (#21905); allow cancel loading model (#21814); fix hardcoded proxy connection timeout in router mode (#22003)
- metal: fix event synchronization (#22260); workaround macOS GPU interactivity watchdog (#22216)
- ggml-base: use `MATH_LIBRARY` variable instead of hardcoded `m` (#22239)
- ggml: bump version to 0.10.0
- SYCL: update oneapi 2025.3.3, separate SYCL build, release Ubuntu 24 package (#22078); fused MoE `mul_mat_vec_q` for TG (#21920); improve `mul_mat_id` memory efficiency and add BF16 fast path (#22119)
- CUDA: fuse relu + sqr (#22249); flush legacy pool on OOM and retry (#22155)
- HIP: flip `GGML_HIP_GRAPHS` to default on (#22254)
- ggml-webgpu: add support for im2col (#22259); implement async tensor api and event api (#22099); fused RMS_NORM + MUL (#21983); conv2d kernels (#21964); reset CPU/GPU profiling time when freeing context (#22050)
- vulkan: Support F16 OP_FILL (#22177)
- hexagon: add support for FILL op (#22198); DIAG op (#22195); fix missing v79 entry in `libggml-htp.inf` (#22194)
- mtmd: also support `LLAMA_ROPE_TYPE_NONE` (#22242); update HunyuanVL vision-language model support (#22037); correct `mtmd_decode_use_mrope()` (#22188); add support for Reka Edge 2603 (#21616)
- chat: fix `parallel_tool_calls` default setting based on model capabilities, add tests for parallel tool calls and structured outputs (#22217)
- common: refactoring sampler parameters (#22233); refactor, move all conversion functions to common, add tests (#20690)
- speculative: add checkpoint support (#22227); reset `i_last` when low acceptance streak occurs (#22168); `--spec-default` arg (#22223)
- convert: handle ModelOpt produced mixed precision model during convert to GGUF (#22247)
- openvino: driver setup, CI split, thread safety, and NPU optimizations (#21944)
- llama-ext: fix exports (#22202)
- vendor: update cpp-httplib to 0.43.1 (#22143)
Fixed
- build: Added `-DLLAMA_OPENSSL=OFF` to suppress upstream HTTPS dependency pulled in by the new `LLAMA_OPENSSL=ON` default.
v0.7.8
Changed
- llama.cpp submodule — Updated from 30dce2cf2 to 45cac7ca7 (7 commits).
- model: Gemma4 model type detection (#22027)
- mtmd: add missing struct tag (#22023)
- libs: rename `libcommon` → `libllama-common` (#21936)
- CUDA: use LRU based eviction for cuda graphs (#21611)
- OpenCL: refactor q8_0 `set_tensor` and `mul_mat` host side dispatch for Adreno (#21938)
- ggml-webgpu: fix compiler warnings and refactor FlashAttention encoding (#21052)
- ci: add android arm64 build and release (#21647)
v0.7.7
Changed
- llama.cpp submodule — Updated from 408225bb1 to 30dce2cf2 (18 commits).
- model: using single llm_build per arch (#21970), refactor QKV into common `build_qkv` and `create_tensor_qkv` helpers (#21245), support NVFP4 tensors for Gemma4 (#21971)
- cli: use `get_media_marker` (#22017)
- server: tests fetch random media marker via `/apply-template` (#21980)
- convert: fix NemotronH config parsing (#21664)
- ggml: add `graph_reused` (#21764)
- ggml-cpu: 128-bit RVV implementation for Quantization Vector Dot (#20633), SIMD gemm kernel for RISC-V vector extension (#20627)
- Metal: implement ROLL op (#21946)
- OpenCL: add q5_K gemm and gemv kernels for Adreno (#21595)
- SYCL: fix Q8_0 reorder garbage on 2nd prompt + crash on full VRAM (#21638)
- hexagon: optimize HMX matmul operations (#21071)
- ggml-webgpu: compute pass batching and remove profiling overhead (#21873)
- cmake: use glob to collect `src/models` sources (#22005)
- ci: use ggml-org/ccache-action on RISC-V (#21632)
- devops: add spirv-headers to nix (#21965)
v0.7.6
Changed
- llama.cpp submodule — Updated from a8bad3842 to 408225bb1 (28 commits).
- server: use random media marker (#21962), support OAI `/v1/audio/transcriptions` API (#21863)
- chat: dedicated DeepSeek v3.2 parser + "official" template (#21785)
- autoparser: support case of JSON_NATIVE with per-call markers (test case: Reka-Edge) (#21892)
- common: handle gemma4 parsing edge cases (#21760), skip reasoning budget sampler when no budget is requested (#21870)
- mtmd: add `mtmd_image_tokens_get_decoder_pos()` API (#21851)
- llama: read `n_ctx` back after making `llama_context` (#21939)
- CUDA: Q1_0 initial backend (#21629), require explicit opt-in for P2P access (#21910), manage NCCL communicators in context (#21891)
- Metal: fix FA support logic (#21898), add XIELU unary op (#20802)
- Vulkan: optimize im2col (#21713), support GGML_TYPE_NVFP4 (#21455), programmatically add RoundingModeRTE to all shaders when the device supports it (#21572)
- ggml-webgpu: fix dequantization helpers to not pass in pointers (#21872), update register tiling matmul to use f32 accumulation (#21644)
- ggml: remove `ggml-ext.h` (#21869), fix ARM NEON nvfp4 dot product on non-dotprod targets (#21559)
- hexagon: optimization for HMX mat_mul (#21554)
- rpc: add native RDMA transport for RPC backend (RoCEv2) (#20590)
- vendor: update BoringSSL to 0.20260413.0 (#21881)
- cmake: fix CMP0194 warning on Windows with MSVC (#21630)
- ci: re-enable mac workflows (#21894), disable test-backend-ops on Vulkan llvmpipe run and restore default timeout (#21901)
v0.7.5
Changed
- llama.cpp submodule — Updated from 073bb2c20 to a8bad3842 (18 commits).
- mtmd: add Gemma 4 audio conformer encoder support (#21421), qwen3 audio support (qwen3-omni and qwen3-asr) (#19441), use causal attn for gemma 4 audio (#21824), fix crash when sending image under 2x2 pixels (#21711)
- Vulkan: Flash Attention DP4A shader for quantized KV cache (#20797)
- CUDA: limit DeviceSegmentedSort to immediate mode (#21718), skip compilation of superfluous FA kernels (#21768)
- common: add download cancellation and temp file cleanup (#21813)
- server: expose build_info in router mode (#21835)
- convert: force f16 or f32 on step3-vl conv weights (#21646)
v0.7.4
Changed
- llama.cpp submodule — Updated from d12cc3d1c to 073bb2c20 (42 commits).
- model: make Gemma 4 shared-KV tail attn_k tensors optional on load (#21739), fix multimodal padding token for gemma3n/gemma4 (#21625)
- mtmd: add MERaLiON-2 multimodal audio support (#21756), support dots.ocr (#17575)
- common: better align to the updated official gemma4 template (#21704), enable reasoning budget sampler for gemma4 (#21697), add callback interface for download progress (#21735), fix when loading cached HF models with unavailable API (#21670), mark `--split-mode tensor` as experimental (#21684), add fluidity to the progress bar (#21671), fix ambiguous grammar rule in gemma4 (#21661), simplify autoparser tagged parser rules (#21216), skip non-primary GGUF split files when selecting model (#21633)
- server: ignore `--alias` when using `--models-preset` (#21380), fix grammar commandline args (#21543)
- jinja: support `ensure_ascii=true`, string repetition and int/float self-filtering (#21623)
- vocab: add gemma4 tokenizer tests, fix edge case (#21534)
- structured output: fix broken structured output when using `$refs` in json_schema (#21699)
- ggml: backend-agnostic tensor parallelism (experimental) (#19378), fix missing GGML_TYPE_Q1_0 cases (#21716), check return value of CUB calls in argsort and top-k (#21676)
- CUDA: fuse muls (#21665), also store `node->src` ne/nb for graph equality (#21736)
- Metal: add missing mm-id specializations for q1_0 (#21662)
- Vulkan: support Q1_0 (#21539), unify type macros to use Vx instead of _VECx (#21605)
- SYCL: add flash-attn support for head size 512 (#21654)
- HIP: add CDNA4 (gfx950) architecture support for MI350X/MI355X (#21570)
- OpenCL: add basic support for q5_k (#21593)
- WebGPU: support non-square subgroup matrix configs for Intel GPUs (#21669), address quantization precision and backend lifecycle management (#21521)
- hexagon: add support for linux on snapdragon (#21707), improved Op queuing, buffer and cache management (#21705)
- TP: fix Qwen 3 Next data split (#21732)
- webui: static build output improvements (#21667), add "Send message on Enter" setting (#21577), add option to pre-encode conversation for faster next turns (#21034), fix Model Selector choice sync (#21628)
v0.7.3
Changed
- llama.cpp submodule — Updated from b8635075f to d12cc3d1c (55 commits).
- model: add HunyuanOCR support (#21395), support step3-vl-10b (#21287)
- llama: remove per-arch tensor name lists (#21531), correct platform-independent loading of BOOL metadata (#21428)
- server: respect the ignore eos flag (#21203), fix model params not propagated (#21509), fix restore for checkpoints with `pos_min == 0` (#21510), handle unsuccessful sink.write in chunked stream provider (#21478), fix logging of build + system info (#21460)
- kv-cache: extend cache quantization checks (#21586), support attention rotation for heterogeneous iSWA (#21513)
- vocab: remove `</s>` eog token for gemma4 (#21492), add byte token handling to BPE detokenizer for Gemma4 (#21488)
- gemma: perform per-layer projections in the first layer (#21612)
- unicode: add custom Qwen2 regex handler to fix segfault on long input (#21257)
- parser: fix MiniMax handling (#21573)
- convert: set `add bos == True` for Gemma 4 (#21500), fix `block_ff_dim` retrieval for lfm2 (#21508)
- ggml: add Q1_0 1-bit quantization support (CPU) (#21273), deprecate `GGML_OP_ADD1` (#21363), free `ctx_copy` in `ggml_opt_free` to plug per-training-session leak (#21592)
- metal: Q1_0 backend (#21528)
- CUDA: also store `node->src->data` ptrs for equality check (#21635), check for buffer overlap before fusing (#21566), make cuda graphs props check faster (#21472), write an optimized `flash_attn_stream_k_fixup` kernel (#21159), `ds_read_b128` for q4_0 and q4_1 mmq kernels (#21168), fix CDNA2 compute capability constant for gfx90a/MI210 (#21519)
- SYCL: Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527), handle other FA case (#21377)
- Vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (#21029), Linux output error string for errno on fork failure (#20904)
- WebGPU: query for adapter support when registering backend (#21579), parameterize submission size and add iOS specific limits (#21533), add support of `MUL_MAT_ID` (#21147)
- hexagon: slight optimization for argsort output init (#21463)
- webui: store reasoning_content so it is sent back in subsequent requests (#21249), fix syntax highlighting lost after streaming (#21206), detect streaming state in reasoning content blocks (#21549), fix RTL text rendering (#21382), send both `backend_sampling == false/true` (#18781)
- cli: fix stripping of `\n` in multiline input (#21485)
- llama-bench: add `-fitc` and `-fitt` arguments (#21304)
- devops/ci: provide KleidiAI-enabled ARM release artifact (#21259), lower cuda12 floor to 12.8.1 for broader host compatibility (#21438), fix vulkan workflow referencing non-existent action (#21442), use default RISE RISC-V Runners (#21263)
v0.7.2
Fixed
- NIF signature mismatch on precompiled builds — When `LLAMA_BACKEND` is set, the build now forces compilation from source instead of downloading a precompiled NIF that may have a stale function signature. (#23)
- Precompile workflow CI failures — The CI Checks job in the precompile workflow used a stale cached NIF (arity 9 vs 10 for `model_load`) because the cache key didn't include C source hashes and `mix compile` ran under the wrong `MIX_ENV`. Aligned with `ci.yml` by adding `c_src/**` to the cache key, compiling for `MIX_ENV=test`, and running `mix clean` before compile.
- Precompile archive version mismatch — The precompile and checksum jobs now set `@version` from the git tag (via `sed`), matching what the publish job already did. Previously, archives were named with the old version from `mix.exs`, causing the publish job to fail when looking for archives matching the tag version.
v0.7.1
Added
- Full llama.cpp optimization parameters — Exposed 17 new context parameters and 1 model parameter:
  - KV cache quantization: `type_k`, `type_v` (f16, q8_0, q4_0, etc.) for 2-4x memory savings
  - Flash attention & GPU offload: `flash_attn`, `offload_kqv`, `op_offload`
  - RoPE scaling: `rope_scaling_type`, `rope_freq_base`, `rope_freq_scale`, YaRN parameters
  - Misc: `attention_type`, `no_perf`, `swa_full`, `check_tensors`
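The 2-4x figure for KV cache quantization can be sanity-checked with a little arithmetic. The sketch below assumes ggml's block layouts (a q8_0 block stores 32 values in 34 bytes, a q4_0 block stores 32 values in 18 bytes, f16 uses 2 bytes per value) and a hypothetical model shape. The function name `kv_cache_bytes` and the dimensions are illustrative only and are not part of this library's API.

```python
# Bytes needed to store 32 cache elements in each ggml type
# (a quantized block holds 32 values plus one fp16 scale).
BYTES_PER_32 = {"f16": 64, "q8_0": 34, "q4_0": 18}

def kv_cache_bytes(n_layer, n_ctx, n_embd_kv, cache_type):
    """Approximate size of the K and V caches, assuming both use cache_type."""
    elements = 2 * n_layer * n_ctx * n_embd_kv  # 2 = one K and one V tensor per layer
    return elements * BYTES_PER_32[cache_type] // 32

# Hypothetical model shape: 32 layers, 8192-token context, 1024-wide K/V rows.
shape = dict(n_layer=32, n_ctx=8192, n_embd_kv=1024)
baseline = kv_cache_bytes(cache_type="f16", **shape)
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(cache_type=cache_type, **shape)
    print(f"{cache_type}: {size / 2**20:.0f} MiB, {baseline / size:.2f}x smaller than f16")
```

Under these assumptions q8_0 lands slightly under 2x and q4_0 at roughly 3.5x relative to f16, which is broadly in line with the range quoted above. Real savings also depend on whether `type_k` and `type_v` are set to the same type.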
v0.7.0
Added
- Prefix caching — Same-slot KV cache reuse for multi-turn chat. When a new request shares a prefix with the slot's previous request, the common prefix is skipped during prefill. 1.23x faster for multi-turn conversations. Controlled by the `cache_prompt` option (default `false`, opt-in). Includes prefix-affinity slot selection. See ADR 007.
- Pluggable batching strategies — Extracted batch building into `BatchStrategy` behaviour with three built-in strategies: `DecodeMaximal` (default, generation-latency optimized), `PrefillPriority` (throughput optimized), `Balanced` (fair split). Custom strategies can implement the behaviour. See ADR 008.
- Pre-tokenized API — `Server.generate_tokens/3`, `Server.stream_tokens/3`, and `Server.get_model/1` allow callers to tokenize outside the GenServer, reducing mailbox contention under concurrent load.
- HuggingFace Hub integration — New `LlamaCppEx.Hub` module with `search/2` (find GGUF models), `list_gguf_files/2` (with file sizes via tree API), `download/3` (with local caching, ETag support, offline mode via `LLAMA_OFFLINE=1`), and `get_model_info/2`. Authentication via `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN` env vars. New `LlamaCppEx.load_model_from_hub/3` convenience wrapper. Requires optional `:req` dependency.
- Performance guide — New `docs/performance.md` with server tuning, prefix caching patterns, strategy selection guide, and optimization recipes.
- Benchee benchmarks — New `bench/prefix_cache.exs`, `bench/strategies.exs`, `bench/tokenize_overhead.exs` for measuring prefix cache impact, strategy comparison, and tokenization overhead.
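The prefix caching entry comes down to a longest-common-prefix comparison between the tokens already held in a slot and the tokens of the new request. A minimal sketch of that idea follows. `common_prefix_len` and `tokens_to_prefill` are hypothetical helpers written for illustration, not functions from this library or from llama.cpp, and the "always evaluate at least one token" rule is an assumption of the sketch.

```python
def common_prefix_len(cached, incoming):
    """Number of leading tokens shared by the cached and incoming sequences."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

def tokens_to_prefill(cached, incoming):
    """Tokens that still need evaluating once the shared prefix is reused."""
    keep = common_prefix_len(cached, incoming)
    # Assumption: at least one token is always evaluated so the model
    # produces fresh logits even when the whole prompt is already cached.
    keep = min(keep, len(incoming) - 1) if incoming else 0
    return incoming[keep:]

# Turn 1 prompt, then turn 2 = turn 1 plus a reply and a new user message.
turn_1 = [1, 15, 27, 42, 8]
turn_2 = turn_1 + [99, 100, 3, 15, 61]
print(common_prefix_len(turn_1, turn_2))   # 5
print(tokens_to_prefill(turn_1, turn_2))   # [99, 100, 3, 15, 61]
```

In a multi-turn chat each request repeats the whole conversation so far, so the shared prefix grows with every turn while the part that needs prefill stays about the size of the newest messages.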
Changed
- Graceful batch_eval error handling — The server now fails active slots with error replies instead of crashing the GenServer when `batch_eval` returns an error (e.g., KV cache overflow).
Fixed
- CI warning suppression — Suppress `-Wunused-function` warnings from vendored llama.cpp jinja headers (`runtime.h`, `lexer.h`).
v0.6.14
Changed
- llama.cpp submodule — Updated from 50e0ad08f to b8635075f (7 commits).
- common: add Gemma 4 specialized parser (#21418), respect specified tag fallback when tag is empty (#21413)
- llama-model: read `final_logit_softcapping` for Gemma 4 (#21390)
- llama: add custom newline split for Gemma 4 (#21406)
- server: fix undefined timing measurement errors in server context (#21201)
- ggml-webgpu: move from parameter buffer pool to single buffer with offsets (#21278)
- ci: add Windows Vulkan backend testing on Intel (#21292)
v0.6.13
Changed
- llama.cpp submodule — Updated from 95a6ebabb to 50e0ad08f (32 commits).
- server: save and clear idle slots on new task (`--clear-idle`) (#20993)
- common/parser: fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers (#21230)
- common: fix tool call type detection for nullable and enum schemas (#21327), add commentary rules for gpt-oss-20b (#21286)
- chat: avoid including json in chat.h (#21306), add Granite 4.0 chat template (#20804), Gemma4 tool response support
- jinja: coerce input for string-specific filters (#21370)
- vocab: fix Gemma4 tokenizer (#21343)
- ggml: bump to 0.9.11 (ggml/1456)
- ggml-webgpu: add vectorized flash attention (#20709)
- ggml-zendnn: add MUL_MAT_ID op support for MoE models (#21315)
- rpc: reuse compute graph buffers (#21299)
- kv-cache: do not quantize SWA KV cache (#21277)
- SYCL: fix llama_kv_cache hang when kv_cache is huge: 5GB (#21283)
- hexagon: add cumsum op support (#21246)
- model/mtmd: fix gguf conversion for audio/vision mmproj (#21309)
- tests: add unit test coverage for llama_tensor_get_type (#20112), allow exporting graph ops from HF file without downloading weights (#21182)
- fix: remove stale assert (#21369), fix gemma 4 template (#21326)
v0.6.12
Changed
- llama.cpp submodule — Updated from 08f21453a to 95a6ebabb (37 commits).
- CUDA: add FA support for head dim 512 (#20998), fix FA kernel selection logic (#21271), add generic NVFP4 MMQ kernel (#21074), fix kernel selection for mmvq mmid kernel (#21238)
- opencl: fix leak in Adreno q8_0 path (#21212)
- ggml: bump to 0.9.10 (ggml/1454), fix RWKV ops thread assignment (#21226)
- ggml-cpu: fix fallback for RVV kernels without zvfh (#21157)
- ggml-webgpu: quantized buffers to u32 + wider browser/device support (#21046), port AOT operators to JIT (#20728)
- kleidiai: add CPU feature detection to CI run script (#20394)
- hexagon: improve RMS_NORM and DIV accuracy (#21251)
- SYCL: support nvfp4 in mul_mat (#21227), enhance fattn perf (#21185)
- CANN: fix multi-thread set_tensor race conditions (#20151)
- memory: respect unified KV cache in hybrid memory for eval tasks (#21224)
- llama: rotate activations for better quantization (#21038), refactor llama_model_quantize_params to pure C interface (#20346)
- common: gpt-oss handle builtin/unsolicited tool calls (#21213), cleanup logs and modernize progress bar (#21215), disable backend sampling if reasoning budget enabled (#21209), add bounds check to prevent segfault on failed model load (#21082), move up common_init() and fix Windows UTF-8 logs (#21176)
- server: bypass API key validation for WebUI static assets (#21269), no more gzip compression for webui (#21073), cleanup dual representation to openai-compat (#21090)
- fix: tool call parsing for LFM2/LFM2.5 (#21242), correct misspellings (#21217), use lower-case proxy headers (#21235), include API key in CORS proxy for MCP (#21193)
- vendor: update BoringSSL to 0.20260327.0 (#21211)
v0.6.11
Changed
- llama.cpp submodule — Updated from 82b703f8b to 08f21453a (21 commits).
- opencl: add q4_K gemm and gemv kernels for Adreno (#20919)
- CUDA: fix CUB's argsort when nrows % block_size == 0 (#21181), optimize MOE GEMV kernel for BS > 1 (#20905)
- jinja: handle empty expressions correctly (#20913)
- common/parser: fix handling of tool definition with missing properties key (#21128), add reasoning_format = none support to gpt-oss (#21094)
- common/json-schema: fix non-capturing groups in pattern converter (#21124)
- common: add character class support to glob_match (#21111)
- server: wrap headers for mcp proxy (#21072), fix processing of multiple back-to-back mtmd chunks (#21107)
- model: add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150)
- llama-model-loader: print warning when using overrides with mmap (#20978)
- hexagon: dma optimizations (#21137)
- SYCL: enhance build script to use half cores to avoid OS hang (#21093)
- rpc: fix misleading error log (#21184)
v0.6.10
Changed
- llama.cpp submodule — Updated from 5c1a7b835 to 82b703f8b (7 commits).
- vendor: update cpp-httplib to 0.40.0 (#21100)
- vulkan: add noncontiguous GLU support (#21081)
- common/parser: fix reasoning whitespace bugs + extra parser tests (#21085)
- cli: add /glob command (#21084)
- webui: conversation forking + branching improvements (#21021)
- docker: fix and enable ARM64 image build (#20929)
v0.6.9
Changed
- llama.cpp submodule — Updated from 9f102a140 to 1743d9805 (38 commits).
- model: F2LLM-v2 support, allow causal_attn and pooling_type on all architectures (#20973)
- convert: register Qwen3Model architecture (#20967), support Qwen3.5/Qwen3.5 Moe NVFP4 and add input scales (#20505), add RuGPT3XL support (#21011)
- ggml-cuda: add NVFP4 dp4a kernel (#20644), support F32 kernel type for CONV_TRANSPOSE_2D (#17094)
- hip: use fnuz fp8 for conversion on CDNA3 (#21040)
- opencl: allow large buffer for Adreno (#20997)
- jinja: fix macro with kwargs (#20960)
- common: make LLAMA_CACHE the one cache for everything (#21009), fix split model migration (#21019), fix verbosity setup (#20989), add getpwuid fallback for HF cache (#21035), filter out imatrix when finding models (#21023)
- llama: fix llama-model-saver (#20503)
- mtmd: add DeepSeekOCR support (#17400), refactor image preprocessing (#21031), fix quant and im2col ops on Metal for deepseek-ocr (#21027)
- imatrix: fix crash with --show-statistics and zero counts (#19532)
v0.6.8
Changed
- llama.cpp submodule — Updated from 1772701f9 to 9f102a140 (15 commits).
- models: move the token embedding norms to the first layer (#20943)
- ggml-backend: re-enable graph reuse with pipeline parallelism (#20927)
- metal: add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930), add FA instantiations for HSK=512, HSV=512 (#20902)
- common: add standard Hugging Face cache support (#20775), add a WARNING for HF cache migration (#20935), fix get_gguf_split_info (#20946), replace wrap_for_generation with a prefix convenience function (#20912)
- hexagon: general DMA and Binary Op fixes for large strides (#20918)
- llama-fit: fix regex pattern for gate_up tensors (#20910)
- vendor: update cpp-httplib to 0.39.0 (#20933)
v0.6.7
Changed
- llama.cpp submodule — Updated from eac9c6ea8 to 1772701f9 (30 commits).
- rpc: RCE patch (#20908), prevent division by zero in deserialize_tensor (#20712)
- memory: fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887)
- server: use httplib dynamic threads (#20817), allow router to report child instances sleep status (#20849), fix Host header (#20843)
- metal: add CONV_3D (#19927)
- common/autoparser: detect reasoning markers when enable_thinking changes system prompt (#20859)
- common/grammar: fix grammar parsing issues to prevent stack overflow and hangs (#18604)
- context: use n_embd_out for pooled embedding extraction (#20840)
- jinja: refactor token advancement (#20864)
- CUDA: fix BF16 FA compilation (#20865), native bf16 flash attention for vec kernel (#20525), increase output elements per-thread block for small K-dimension (#20635)
- CANN: add RoPE cache preload before ACL graph capture (#20747)
- opencl: add q6_K gemm and gemv kernels for Adreno (#20089), add flattened Q4_K mv and general Q4_K mm (#20773)
- openvino: explicit memset in buffer_context allocation (#20857)
- mtmd: add dynamic high-resolution image preprocessing for InternVL model (#20847), fix LightOnOCR image preprocessing (#20877)
- ggml: support bf16 and quantized type (#20803)
- webui: improve chat form positioning (#20901), fix --webui-config-file settings not applied on load (#20823)
v0.6.6
Changed
- llama.cpp submodule — Updated from 6729d4920 to eac9c6ea8 (47 commits).
- context: zero output buffer on allocation (#20781)
- model: assert nextn_predict_layers to prevent underflow (#20783), fix Granite Hybrid type check for 7B.A1B (#20795)
- jinja: fix heap OOB read in value equality comparison (#20782)
- common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825), fix out_of_range crash in throw path (#20777), add proper reasoning tag prefill reading (#20424), fix gpt-oss content removal (#20745)
- chat: handle tool calls with no required args in TAG_WITH_TAGGED format (#20764)
- server: fix router mode deadlock on child crash and TOCTOU race (#20763), add cached_tokens info to oaicompat responses (#19361), improve mtmd ctx checkpoints (#20726), become source of truth for sampling defaults (#20558)
- vulkan: change gated_delta_net to shard across subgroup (#20662), dequantize iq4_xs 4 at a time (#20657)
- hip: avoid compiler bug in RDNA code generation during debug builds on Windows (#20655)
- hexagon: add Matrix Extensions (HMX) for NPU backend (#20693)
- CANN: add BF16 support for core operators (#20152), handle in-place ROPE on non-contiguous f32 tensors (#20274), support flash attention for head dim not multiple of 16 (#20031)
- ggml-cpu: add always_inline to tinyBLAS_PPC accumulator saves (#20791)
- ggml-webgpu: ops support for qwen3.5 (SET, TRI_SOLVE, SSM_CONV, GATED_DELTA_NET) (#20687), add DIAG/TRI ops (#20664), update RMS_NORM/L2_NORM (#20665)
- vocab: assert array size of scores and toktypes (#20737)
- convert: support is_causal hyperparameter (#20746), make NVFP4/MXFP4 say correct type (#20730)
- cmake: fix build warning when kleidiai is enabled (#20457), guard KleidiAI DOWNLOAD_EXTRACT_TIMESTAMP for cmake < 3.24 (#20767)
v0.6.5
Changed
- llama.cpp submodule — Updated from b6c83aad5 to 6729d4920 (26 commits).
- model: add control vector support where missing (#20653)
- ggml: bump version to 0.9.8 (ggml/1442), restore ggml_type_sizef() to avoid major version bump (ggml/1441)
- ggml-cpu: fix RVV checks in quants and repacking (#20682), fix unused changemask warning in repack (#20692)
- ggml-blas: set MKL threads from thread context (#20602)
- Vulkan: async and event fixes (#20518), disable MMVQ on Intel Windows driver (#20672), allow graphics queue only through env var (#20599)
- HIP: ignore return of hipMemAdvise (#20696)
- hexagon: add neg, exp, sigmoid, softplus, cont, repeat ops (#20701)
- kleidiai: fix MUL_MAT support for batched (3D) inputs (#20620)
- server: fix ctx checkpoint invalidation (#20671)
- context: fix graph not resetting when control vector changes (#20381)
- llama: re-enable manual LoRA adapter free (#19983)
- common: rework gpt-oss parser (#20393), add `--skip-chat-parsing` to force pure content parser (#20289)
- webui: fix duplicated messages on q param (#20715), improve tooltip wording for attachment requirements (#20688)
- OpenCL: no timeout for WaitAny in graph submission to avoid deadlocks on llvm-pipe backends (#20618)
v0.6.4
Changed
- llama.cpp submodule — Updated from 463b6a963 to b6c83aad5 (56 commits).
- model: Mistral Small 4 support (#20649), Nemotron-H NVFP4 tensors (#20561), Qwen3.5/Qwen3.5MoE NVFP4 tensors (#20506)
- ggml: OpenVINO backend (#15307), native AVX512-FP16 support for F16 operations (#20529), extend im2col f16 (#1434), guard against sumq2 being 0 in IQ4_NL (#20460)
- CUDA: GDN shared mem latency hiding (#20537), limit FA stream-k block count (#20586), RDNA4-specific MMVQ for bs=1 decode (#19478), FP32 cuBLAS for V100 to avoid overflows (#19959), fix data race in cpy kernel (#20507), avoid creating CUDA context during device init (#20595)
- metal: FA specialization for HSK=320, HSV=256 (#20549)
- Vulkan: fix flash attention dot product precision (#20589), use graphics queue on AMD (#20551)
- HIP: APU compatibility — soft error handling for hipMemAdviseSetCoarseGrain (#20536)
- SYCL: fix untransposed GDA recurrent state (#20583), enhance UPSCALE to support all UT cases (#20637)
- OpenCL: fix l2_norm (#20480)
- server: support refusal content for Responses API (#20285), fix wait in test_cancel_requests() (#20601), fix model selector locked to first loaded model (#20580)
- tools/cli: fix disable reasoning (#20606)
- convert: support mixed-precision ModelOpt NVFP4/FP8 quantization (#20539), support contiguous method on lora tensors (#20489)
- kv-cache: fix reading llama_kv_cell_ext during state read (#20273)
- common: fix iterator::end() dereference (#20445)
- vendor: cpp-httplib 0.37.2 → 0.38.0 (#20484, #20578)
- webui: model information dialog (#20600), MCP CORS proxy detection (#20167), code preview iframe isolation (#20477)
- hexagon: Q4_0 and MXFP4 repack fixes (#20527)
v0.6.3
Added
- CI workflow — New `.github/workflows/ci.yml` runs `mix compile --warnings-as-errors`, `mix format --check-formatted`, `mix test`, and `mix dialyzer` on push/PR to master.
- Dialyzer — Added `dialyxir` dependency for static analysis. All modules pass with zero warnings.
- Example scripts — New `examples/` directory with 6 runnable scripts: `basic_generation.exs`, `streaming.exs`, `chat.exs`, `structured_output.exs`, `embeddings.exs`, and `server.exs`.
- Expanded test coverage — New `test/schema_test.exs` covering `embeds_one`, `embeds_many`, additional Ecto types (`:date`, `:utc_datetime`, `:decimal`, `:map`), empty schemas, and end-to-end nested schema to GBNF conversion. Added edge case tests to `test/thinking_test.exs` for unicode content, nested/malformed tags, and very long content.
Fixed
- `Chat.apply_template/3` — Now accepts string-keyed message maps (`%{"role" => ..., "content" => ...}`) in addition to atom-keyed maps and tuples.
- `Schema.to_json_schema/1` — Fixed Dialyzer opaque type warning (replaced `MapSet.member?/2` with `in` operator).
- GitHub Actions Node.js 20 deprecation — Updated `actions/checkout` to v5 and added `FORCE_JAVASCRIPT_ACTIONS_TO_NODE24` env to precompile workflow, preparing for the June 2026 Node.js 24 migration.
- Stream test reliability — Fixed `stream with early halt` test to use a prompt compatible with instruction-tuned models.
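The message shapes that `Chat.apply_template/3` now accepts can be pictured with a small normalizer. This is a minimal sketch in plain Elixir, not the library's code: the module name `MessageShapeSketch` is hypothetical, and the `{role, content}` tuple layout is an assumption, since the entry above only says "tuples".

```elixir
defmodule MessageShapeSketch do
  # Hypothetical helper: reduces each supported message shape to a
  # {role_string, content} pair. Not part of LlamaCppEx.

  # String-keyed map, e.g. decoded from JSON.
  def normalize(%{"role" => role, "content" => content}), do: {to_string(role), content}

  # Atom-keyed map.
  def normalize(%{role: role, content: content}), do: {to_string(role), content}

  # Two-element tuple (assumed layout: {role, content}).
  def normalize({role, content}), do: {to_string(role), content}
end

messages = [
  %{"role" => "user", "content" => "hi"},
  %{role: :assistant, content: "hello"},
  {:user, "bye"}
]

IO.inspect(Enum.map(messages, &MessageShapeSketch.normalize/1))
```

Accepting string keys matters mostly for messages that arrive as decoded JSON, where keys are strings by default.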
Changed
- llama.cpp submodule — Updated from fdb17643d to 463b6a963 (31 commits).
- tools: enable kvu in perplexity for hellaswag, winogrande, multiple-choice (#19954)
- graph: remove redundant GDN state transposes (#20443)
- llama: fix pooling assertion crash in chunked GDN detection path (#20468), disable graph reuse with pipeline parallelism (#20463)
- metal: fix l2 norm scale (#20493), avoid divisions in bin kernel (#20426)
- Vulkan: add GATED_DELTA_NET op support (#20334), fix l2_norm epsilon handling (#20350), fix OOB check in flash_attn_mask_opt (#20296), fix ErrorOutOfHostMemory on Intel GPU with --no-mmap (#20059)
- OpenCL: add cumsum op (#18981), use larger workgroup size for get_rows (#20316)
- HIP: compile debug builds with -O2 to avoid compiler bug (#20392)
- ggml-cpu: add RVV vec dot kernels for quantization types (#18859)
- server: reset counter related to kill-switch on client error (#20513), auto-select first loaded model for new conversations (#20403)
- common/parser: gracefully handle undetected tool parser (#20286), add GigaChatV3/3.1 models support (#19931)
- grammar: fix root symbol check (#19761)
- vendor: update cpp-httplib to 0.37.1 (#20390)
- convert: better mtp check and fix return (#20419)
v0.6.1
Changed
- llama.cpp submodule — Updated from c5a778891 to fdb17643d (70 commits).
- model: add support for Phi4ForCausalLMV, Nemotron 3 Super, Qwen3VL reranker text
- ggml: add NVFP4 quantization type support
- llama: chunked fused GDN path, dynamic head_dim and n_rot for SWA
- metal: extend mul_mv_ext to BF16/Q2_K/Q3_K, fix q5_k register spill, add upscale, handle command buffer failures gracefully
- CUDA/HIP: GDN shared mem for HIP, fix loop unrolling in ssm-conv, display VRAM capacity on init
- Vulkan: add SGN and ELU ops, fix data races in coopmat1, skip zero size tensors in copies
- SYCL: Flash Attention support for fp32/fp16/Q4/Q5/Q8
- WebGPU: add REPEAT op, faster quant matrix operations
- KleidiAI: concurrent SME and NEON kernel execution
- ggml-cpu: add RVV repack GEMM/GEMV for quantization types
- server: kill switch when stuck, fix checkpoints and OAI completion stream index
- common: fix --n-cpu-moe/--cpu-moe for fused gate+up models, gracefully handle incomplete output
- vendor: update cpp-httplib to 0.37.0, miniaudio to 0.11.25
- llama-quant: fail early on missing imatrix, refactor type selection
v0.6.0
Added
- Qwen 3.5 support — llama.cpp updated to c5a778891 (35 commits since v0.5.0).
- `reasoning_content` in ChatCompletion — `chat_completion/3` now splits `<think>...</think>` blocks from the response when `enable_thinking: true`. The choice message includes `reasoning_content` (the thinking text) and `content` (the final answer). Returns `nil` when thinking is not enabled or no thinking block is present.
- `reasoning_content` in ChatCompletionChunk — `stream_chat_completion/3` emits chunks with `reasoning_content` in the delta while the model is thinking, then switches to `content` after `</think>`.
- `LlamaCppEx.Thinking` — New module with `parse/1` for one-shot parsing and `stream_parser/1` + `feed/2` for streaming token-boundary-safe parsing of think blocks. Handles the real-world Qwen3/3.5 template behavior where `<think>` is opened by the template itself.
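To make the split concrete, here is a rough one-shot sketch in plain Elixir. It is not the implementation of `LlamaCppEx.Thinking.parse/1`, whose return shape is not given here. The module name `ThinkSplitSketch` and the returned map are assumptions that only mirror the `reasoning_content` and `content` field names above. It also covers the case where the template has already opened `<think>`, so the text contains only the closing tag.

```elixir
defmodule ThinkSplitSketch do
  # Hypothetical helper, for illustration only.
  @open "<think>"
  @close "</think>"

  # Splits raw model text into thinking text and final answer.
  def split(text) when is_binary(text) do
    case String.split(text, @close, parts: 2) do
      [thinking, answer] ->
        reasoning =
          thinking
          |> String.trim_leading()
          # The opening tag may be absent when the template emitted it.
          |> String.replace_prefix(@open, "")
          |> String.trim()

        %{reasoning_content: reasoning, content: String.trim_leading(answer)}

      [_no_closing_tag] ->
        # No think block found: everything is content.
        %{reasoning_content: nil, content: text}
    end
  end
end

IO.inspect(ThinkSplitSketch.split("<think>2+2</think>\n4"))
IO.inspect(ThinkSplitSketch.split("plan</think>ok"))
IO.inspect(ThinkSplitSketch.split("hello"))
```

A streaming parser needs more care than this, because a tag such as `</think>` can be split across two tokens. That is the case the `stream_parser/1` + `feed/2` entry describes as token-boundary-safe.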
Changed
- llama.cpp submodule — Updated from 7f5ee54 to c5a778891.
- ggml: add GATED_DELTA_NET op for Qwen 3.5 hybrid architecture
- model: update Qwen 3.5 model type detection
- convert: register Qwen 3.5 ForCausalLM for text only
- CUDA: use shared mem for ssm_conv, improve performance via fewer synchronizations
- Hexagon: add f32 ssm_conv, fp16 binary ops, Flash Attention optimizations
- OpenCL: add l2_norm, neg, exp, diag ops
- CPU: skip redundant ROPE cache updates, fix data race for debug asserts
- quants: add memsets and other fixes for IQ quants
- kv-cache: fix M-RoPE checkpoints, checkpoint every n tokens
- server: preserve Anthropic thinking blocks in conversion
Unchanged
- `chat/3` and `stream_chat/3` continue returning raw text (no breaking change).
v0.5.0
Added
- Structured output via JSON Schema — New `:json_schema` option on `generate/3`, `stream/3`, `chat/3`, `stream_chat/3`, `chat_completion/3`, and `stream_chat_completion/3`. Pass a JSON Schema map and the model output is automatically constrained to valid JSON matching the schema. Uses llama.cpp's built-in `json_schema_to_grammar()` under the hood.

      schema = %{
        "type" => "object",
        "properties" => %{"name" => %{"type" => "string"}, "age" => %{"type" => "integer"}},
        "required" => ["name", "age"],
        "additionalProperties" => false
      }

      {:ok, json} = LlamaCppEx.chat(model, messages, json_schema: schema, temp: 0.0)

- `LlamaCppEx.Grammar` — New module for JSON Schema to GBNF conversion.
  - `from_json_schema/1` — returns `{:ok, gbnf_string}` or `{:error, reason}`
  - `from_json_schema!/1` — returns the GBNF string or raises
- `LlamaCppEx.Schema` — New module for converting Ecto schema modules to JSON Schema maps. Maps all standard Ecto types (`:string`, `:integer`, `:float`, `:boolean`, `:date`, `{:array, inner}`, etc.) and supports nested `embeds_one`/`embeds_many`. Automatically excludes `:id` and timestamp fields.
- NIF: `json_schema_to_grammar_nif/1` — Exposes llama.cpp's `json_schema_to_grammar()` via `nlohmann::ordered_json`.
Changed
- Elixir requirement bumped to `~> 1.18` (for built-in `JSON.encode!/1`).
- Dependencies — added `{:ecto, "~> 3.0", optional: true}` for optional Ecto schema integration.
v0.4.4
Changed
- llama.cpp submodule — Updated to latest upstream (b8198).
- ggml: fix `ggml_is_contiguous_n` for ne == 1
- ggml: use simple `std::thread` in AMX without OpenMP
- KleidiAI: add SME fp16 compute path for q4_0 GEMM on aarch64
- OpenCL: add optimized q4_1 mm kernel for Adreno
- Vulkan: tune MMVQ for Intel Windows
- WebGPU: fix workgroup dispatch limit for large batch sizes
- Fix locale-dependent float printing in GGUF metadata
v0.4.3
Changed
- llama.cpp submodule — Updated to latest upstream (b8185).
- Vulkan: improve partial offloading performance on AMD
- CUDA: cap grid.y at 65535 in non-contiguous dequantize/convert kernels
- ggml-cpu: optimise s390x multiply extend instructions
- Vendors: update cpp-httplib to 0.35.0, miniaudio to 0.11.24
v0.4.2
Changed
- llama.cpp submodule — Updated to latest upstream (b8179).
v0.4.1
Improved
- Error handling — `Chat.apply_template/3`, `Tokenizer.encode/3`, and `Tokenizer.decode/2` now return `{:error, reason}` instead of crashing when NIFs raise.
- Telemetry documentation — Server moduledoc documents all telemetry events, measurements, and metadata.
- Typespecs — Added `@spec` to `Server.start_link/1`.
Changed
- llama.cpp submodule — Updated to latest upstream (b8157).
v0.4.0
Added
- Full model loading params — `main_gpu`, `split_mode`, `tensor_split` for multi-GPU placement; `use_mlock` and `use_direct_io` for memory control; `vocab_only` for cheap model introspection without loading weights.
- Server GPU forwarding — `Server.start_link/1` now forwards `main_gpu`, `split_mode`, `tensor_split`, `use_mlock`, and `use_direct_io` to `Model.load/2`.
v0.3.0
Added
- Jinja chat templates — switched from `llama_chat_apply_template()` C API to the full Jinja-based `common_chat_templates_apply()` engine from llama.cpp's common library.
- `enable_thinking` option — pass `enable_thinking: false` to `Chat.apply_template/3`, `chat/3`, `stream_chat/3`, `chat_completion/3`, and `stream_chat_completion/3` to disable CoT reasoning for models like Qwen3/3.5.
- `chat_template_kwargs` option — pass arbitrary key-value pairs to the Jinja template engine.
- Penalty parameters — `penalty_repeat`, `penalty_freq`, and `penalty_present` options for repetition/frequency/presence penalties in sampling.
- OpenAI-compatible response format — `chat_completion/3` and `stream_chat_completion/3` return `ChatCompletion` and `ChatCompletionChunk` structs.
- Qwen3.5 benchmark results in README — Qwen3.5-27B and Qwen3.5-35B-A3B on Apple M4 Max.
Changed
- `Chat.apply_template/3` now uses the Jinja engine and takes the model ref directly (no longer accepts `:template` option for raw template strings).
- Linked `libcommon.a` from llama.cpp build (previously excluded).
- `LlamaModel` RAII wrapper now caches `common_chat_templates` at model load time.
v0.2.0
Added
- Continuous batching server (`LlamaCppEx.Server`) — GenServer with slot pool for concurrent multi-sequence inference. One forward pass per tick with decode tokens and prefill chunks mixed in a single batch.
- Embeddings (`LlamaCppEx.Embedding`) — `embed/3` and `embed_batch/3` with L2 normalization and configurable pooling type.
- Grammar-constrained generation — GBNF grammar support via `grammar` and `grammar_root` options in `Sampler.create/2` and `generate/3`.
- Batched inference primitives — `prefill/3`, `decode_batch/3`, `decode_token/4`, `batch_eval/2`, `sampler_sample_at/3` NIFs for building custom inference loops.
- Streaming via Server — `LlamaCppEx.Server.stream/3` for token-by-token streaming through the batched server.
- Telemetry events — `[:llama_cpp_ex, :server, :tick]` and `[:llama_cpp_ex, :server, :request, :done]` for observability.
- Benchmark suite (`bench/`) — Benchee-based benchmarks for single-sequence and server generation, plus a custom continuous batching harness measuring throughput scaling.
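For readers unfamiliar with the L2 normalization mentioned in the embeddings entry, the operation itself is small. The sketch below is plain Elixir on a list of floats and only illustrates the math. It is not taken from `LlamaCppEx.Embedding`, and the module name `L2Sketch` is hypothetical.

```elixir
defmodule L2Sketch do
  # Scales a vector so its Euclidean (L2) length is 1.0.
  # A zero vector is returned unchanged to avoid dividing by zero.
  def normalize(vec) when is_list(vec) do
    norm =
      vec
      |> Enum.reduce(0.0, fn x, acc -> acc + x * x end)
      |> :math.sqrt()

    if norm == 0.0 do
      vec
    else
      Enum.map(vec, fn x -> x / norm end)
    end
  end
end

IO.inspect(L2Sketch.normalize([3.0, 4.0]))
```

With unit-length vectors, cosine similarity between two embeddings reduces to a plain dot product, which is why normalized output is convenient for similarity search.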
Changed
- `Sampler.create/1` now requires the model as the first argument: `Sampler.create(model, opts)`.
- `Context.create/2` accepts new options: `:embeddings`, `:pooling_type`, `:n_seq_max`.
v0.1.0
Initial release.
- Model loading and introspection
- Text generation with configurable sampling
- Streaming token generation via `Stream.resource/3`
- Chat template support
- Tokenization and detokenization
- Metal, CUDA, Vulkan, and CPU backends
- RAII resource management via `fine`