Changelog
All notable changes to erllama are documented here. The format follows Keep a Changelog and this project adheres to Semantic Versioning.
Unreleased
0.1.2 - 2026-05-12
Cluster-routing primitives, speculative-decoding verifier, a cold-path correctness fix, and C-safety CI tooling. All additions are backwards-compatible; existing API call sites unchanged.
Added
Cluster routing and load-balancing (#13)
- erllama:queue_depth/0 returns O(1) inflight count via an atomics counter parked in persistent_term, readable cross-node via erpc. Used by the upcoming erllama_cluster load balancer (least_loaded, power_of_two strategies).
- erllama:list_cached_prefixes/2 returns the longest cached prefix length of a token list for a given model on this node, across all cache tiers. Used by the cluster cache-affinity router.
- erllama_nif:vram_info/0 walks every loaded ggml backend and sums free + total memory across non-CPU devices; returns {error, no_gpu} on a CPU-only build. Used by the cluster scheduler for bin-packing model placement.
- erllama:list_models/0 map gains model_id, quant_tag, loaded_at_monotonic, and vram_estimate_b keys. Existing keys (id, pid, status, etc.) are unchanged.
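The atomics-in-persistent_term pattern behind queue_depth/0 can be sketched as follows. This is a minimal illustration, not erllama's actual code: the module name, the persistent_term key, and the function names are all hypothetical.

```erlang
-module(inflight_counter_sketch).
-export([init/0, incr/0, decr/0, depth/0]).

%% Hypothetical persistent_term key; the key erllama really uses is not
%% stated in this changelog.
-define(KEY, {?MODULE, counter}).

%% Create a one-slot signed atomics array and park the reference in
%% persistent_term so any process on the node can reach it without
%% messaging a server.
init() ->
    Ref = atomics:new(1, [{signed, true}]),
    persistent_term:put(?KEY, Ref).

incr() -> atomics:add(persistent_term:get(?KEY), 1, 1).
decr() -> atomics:sub(persistent_term:get(?KEY), 1, 1).

%% O(1) read: one persistent_term lookup plus one atomic load.
depth() -> atomics:get(persistent_term:get(?KEY), 1).
```

A remote node could read the same value with erpc:call(Node, inflight_counter_sketch, depth, []), which is presumably what "readable cross-node via erpc" refers to.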
Speculative decoding (#13)
- erllama:draft_tokens/3 synchronously generates up to Max next-token ids for a prefix. Times out at 30 s with a clean cancel + drain so the caller's mailbox stays clean. Empty prefix is rejected as {error, empty_prefix}.
- erllama:verify/4 runs PrefixTokens ++ Candidates through the model in one forward pass and returns the longest accepted prefix length plus the verifier's own next token. Acceptance walks Argmax[P + i - 1] == c_i and stops at the first mismatch. End-of-generation tokens map to the atom eos. Snapshot + restore protocol leaves the caller's pre-call context view unchanged. Allowed only from the model gen_statem's idle state; non-idle callers receive {error, busy}.
- Token-id streaming: erllama_model:stream_emit now also sends {erllama_token_id, Ref, Id} on every produced token, in addition to the existing {erllama_token, Ref, Bin}. Empty-text tokens (special tokens, BPE merges with no visible bytes) still produce an id message. Existing consumers ignore the new tag.
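The acceptance rule for verify/4 can be illustrated in pure Erlang. This sketch assumes the per-position argmax ids have already been sliced to start at index P - 1, so the first element is the model's prediction for c_1 and the list has one more element than the candidate list. The module and function names are hypothetical, and the eos mapping is left out.

```erlang
-module(spec_accept_sketch).
-export([accept/2]).

%% Preds: argmax ids starting at position P - 1 (assumed pre-sliced).
%% Candidates: the drafted token ids c_1..c_n.
%% Returns {AcceptedCount, NextToken}, where NextToken is the verifier's
%% own prediction at the first mismatch, or after c_n if all matched.
accept(Preds, Candidates) ->
    accept(Preds, Candidates, 0).

%% The repeated variable Pred makes this clause match only when the
%% prediction equals the next candidate.
accept([Pred | Preds], [Pred | Cands], N) ->
    accept(Preds, Cands, N + 1);
%% First mismatch, or candidates exhausted: report the verifier's token.
accept([Pred | _], _Cands, N) ->
    {N, Pred}.
```

For example, accept([5, 6, 9, 4], [5, 6, 7]) accepts the first two candidates and returns the verifier's replacement for the third.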
Backend behaviour
- Optional callbacks extra_metadata/1 (vram-related model metadata) and verify/4 (speculative verifier) on erllama_model_backend. Backends that omit either get a graceful {error, not_supported} fallback (#13).
- Optional callback seq_clear/1 on erllama_model_backend. Llama backend implements it as llama_state_seq_rm(0, 0, -1). Called by the model layer at the top of enter_prefilling; see the Fixed section below (#16).
NIFs (#13)
- nif_model_size/1, nif_model_n_layer/1, nif_forward_with_argmax/2, nif_vram_info/0. Previously unreachable from Erlang.
Fixed
- Cold-path prefill KV-state leak: erllama_model:enter_prefilling did not reset the llama_context's KV cache before the new prefill. llama_batch_get_one auto-positions at n_past, so a second cold request on the same model wrote its prompt KV at [previous_n_past..] instead of [0..], producing different output for the same prompt + seed across calls. The new seq_clear/1 callback wipes seq 0 before the cold prefill. Warm restores via kv_unpack were already correct (#16).
Changed
- Vendored c_src/llama.cpp/ bumped from b9093 to b9119 (16 files, mostly Metal/CUDA tweaks plus a new ggml-cuda/allreduce kernel pair). Public llama.h API used by the NIF is unchanged (#15).
- erllama_inflight:register/2 and unregister/1 switched to ets:insert_new / ets:take so the new atomics counter sees only true admissions and true removals; double-register or double-unregister become observable no-ops (#13).
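The reason ets:insert_new and ets:take keep the counter honest can be shown with a small sketch. The module, table name, and the admit/release function names are hypothetical; only the ETS and atomics calls are real OTP APIs.

```erlang
-module(inflight_table_sketch).
-export([new/0, admit/3, release/2, count/1]).

new() ->
    Tab = ets:new(inflight_sketch, [set, public]),
    Counter = atomics:new(1, [{signed, true}]),
    {Tab, Counter}.

%% insert_new returns false when the key already exists, so a duplicate
%% registration never bumps the counter.
admit({Tab, Counter}, Ref, Pid) ->
    case ets:insert_new(Tab, {Ref, Pid}) of
        true -> atomics:add(Counter, 1, 1);
        false -> ok
    end.

%% take atomically removes and returns the row; an empty list means the
%% key was already gone, so a repeated removal is a no-op.
release({Tab, Counter}, Ref) ->
    case ets:take(Tab, Ref) of
        [_Row] -> atomics:sub(Counter, 1, 1);
        [] -> ok
    end.

count({_Tab, Counter}) -> atomics:get(Counter, 1).
```

With plain ets:insert and ets:delete the caller cannot tell whether a row was actually added or removed, so the counter would drift on repeated calls.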
Internal
- New CI jobs: sanitizers (ASan+UBSan against tinyllamas/stories260K.gguf under LD_PRELOAD'd libasan), clang-tidy (NIF sources only), and scan-build (Clang Static Analyzer with --status-bugs) (#14).
- New c_src/CMakeLists.txt options: ENABLE_ASAN, ENABLE_TSAN, ENABLE_UBSAN, ENABLE_CLANG_TIDY, scoped to the erllama_nif target; default OFF.
- .clang-tidy config at repo root. Vendored llama.cpp is not linted.
ROADMAP
- Pipeline parallelism deferred; blocked on upstream llama.cpp adding layer-range execution to llama_decode. The cluster degrades gracefully via function_exported.
- Verify context isolation: the current snapshot/restore protocol does not preserve the caller's pre-call decode_ready flag; callers are assumed to issue decode_one imminently. A v2 extension to cover decode_ready is documented for the next contributor.
0.1.1 - 2026-05-12
NIF safety and SIGSEGV hardening. No public API additions; new
error tuples surface paths that previously crashed the BEAM or
raised badarg across a dirty scheduler.
Fixed
- Adapter use-after-free / double-free when free_model/1 ran while adapter wrappers still referenced the model. The model resource now tracks active_adapters alongside active_contexts and defers llama_model_free until both reach zero (#10).
- Race in set_adapters/2 where a concurrent adapter_free could null the underlying pointer between the per-adapter mutex release and the llama_set_adapters_lora call. Locks are now held across the llama call in pointer-sorted order to defeat AB-BA between concurrent callers (#10).
- Per-message memory leak in apply_chat_template when the message list was malformed or allocation failed mid-build. The helper now releases its own role/content allocations on every error path (#10).
- prefill/2 and embed/2 walked past the KV slab when the prompt size reached n_ctx, and produced undefined behaviour when it exceeded n_batch. Both now bounds-check against the live context before touching state, returning {error, context_overflow} or {error, batch_overflow} (#11).
- apply_chat_template/2 raised badarg across the dirty scheduler when content was a list-of-maps (Anthropic-style content blocks) instead of a binary. Returns {error, invalid_content} (#11).
- load_model/2 surfaced a generic {error, load_failed} for malformed GGUF files. Now returns {error, malformed_gguf} on a best-effort basis when the captured llama log line contains GGML_ASSERT. Best-effort only: llama_log_set is process global and concurrent loads can mis-attribute classification. A GGML_ASSERT that hits abort() still terminates the BEAM process; subprocess isolation would be required for a complete fix and is intentionally out of scope (#11).
Added
- New error atoms returned by the NIF: context_overflow, batch_overflow, invalid_content, malformed_gguf. Callers matching {error, _} are unaffected; callers that care about the specific reason should match the new atoms.
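A caller that wants to distinguish the new reasons might match them explicitly and fall through for everything else. The sketch below is illustrative only: the module name and the {reject, _} categories are hypothetical, while the four error atoms are the ones listed above.

```erlang
-module(nif_error_sketch).
-export([classify/1]).

%% Map the 0.1.1 error atoms onto hypothetical caller-side categories.
classify({error, context_overflow}) -> {reject, prompt_too_long};
classify({error, batch_overflow})   -> {reject, prompt_exceeds_batch};
classify({error, invalid_content})  -> {reject, bad_request};
classify({error, malformed_gguf})   -> {reject, bad_model_file};
%% Any other error reason passes through untouched, which is how code
%% that only matches {error, _} keeps working.
classify({error, _} = Other)        -> Other;
classify(Ok)                        -> Ok.
```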
0.1.0 - 2026-05-11
Initial public release.
Added
Public API
- Native Erlang/OTP wrapper around llama.cpp via a single dirty-scheduler NIF (erllama_nif) covering model load, context construction, tokenisation, prefill, single-token decode, and KV pack/unpack.
- Models are identified by binary() on the public API. erllama:load_model/2, complete/2,3, unload/1, status/1, evict/1, shutdown/1 take binary() | pid(). Internal registration uses {via, erllama_registry, BinaryId} so user-supplied ids cannot exhaust the atom table.
- erllama:list_models/0 returning [model_info()] and erllama:model_info/1 keyed on a model id.
- Public erllama:tokenize/2 and erllama:detokenize/2 keyed on a model id. The low-level erllama_nif:tokenize/3 and erllama_nif:detokenize/2 remain available.
- erllama:unload_model/1 as an alias for erllama:unload/1 matching the OpenAI/Ollama-style naming downstream HTTP servers use.
- erllama:infer/4 streaming inference. Returns {ok, Ref}; tokens are delivered to the caller as {erllama_token, Ref, _}, {erllama_done, Ref, Stats}, {erllama_error, Ref, Reason}.
- erllama:cancel/1. Idempotent and fire-and-forget; observed between tokens.
- erllama:apply_chat_template/2. Renders a normalised chat request (messages, system, tools) through the model's GGUF chat template and tokenises. Backed by llama_chat_apply_template.
- erllama:embed/2. Per-sequence pooled embedding via llama_get_embeddings_seq with last-token fallback.
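A consumer of the infer/4 message protocol could collect a full response with a receive loop like the one below. This is a sketch under assumptions: it takes the Ref returned by infer/4 (whose arguments are not shown here), treats the token payload as a binary, and the module name and timeout handling are hypothetical.

```erlang
-module(stream_collect_sketch).
-export([collect/2]).

%% Gather streamed tokens for Ref until done, error, or Timeout ms of
%% silence between messages.
collect(Ref, Timeout) ->
    collect(Ref, Timeout, []).

collect(Ref, Timeout, Acc) ->
    receive
        {erllama_token, Ref, Bin} ->
            collect(Ref, Timeout, [Bin | Acc]);
        %% Sent alongside text tokens from 0.1.2 onward; dropped here so
        %% the messages do not accumulate in the mailbox.
        {erllama_token_id, Ref, _Id} ->
            collect(Ref, Timeout, Acc);
        {erllama_done, Ref, Stats} ->
            {ok, iolist_to_binary(lists:reverse(Acc)), Stats};
        {erllama_error, Ref, Reason} ->
            {error, Reason}
    after Timeout ->
        {error, timeout}
    end.
```

Matching on the bound Ref in every clause keeps the loop from consuming messages that belong to another in-flight request in the same process.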
Sampling
- Sampler parameters: complete/3 and infer/4 honour temperature, top_k, top_p, min_p, repetition_penalty, seed, and grammar via one combined chain builder (erllama_nif:configure_sampler/2). Chain order: grammar -> repetition_penalty -> top_k -> top_p -> min_p -> (temperature > 0 ? temp -> dist(seed) : greedy). set_grammar/2 retained as a backwards-compatible alias.
- Grammar-constrained sampling: pass grammar => GBNF in the complete/3 Opts or infer/4 Params; the per-model sampler chain is rebuilt as grammar then greedy for the duration of the request and reset on completion or cancellation.
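The documented chain order can be made concrete with a small function that lists the stages for a given parameter map. This is only an illustration of the ordering: it assumes a stage is included when its key is present in the map, which the changelog does not state, and the module name and stage atoms are hypothetical.

```erlang
-module(sampler_chain_sketch).
-export([stages/1]).

%% Return the sampler stages, in chain order, for a params map.
%% Assumption: a stage appears only if its key is present.
stages(Params) ->
    Optional = [S || S <- [grammar, repetition_penalty, top_k, top_p, min_p],
                     maps:is_key(S, Params)],
    %% temperature > 0 selects temp followed by seeded sampling;
    %% anything else ends the chain with greedy selection.
    Tail = case maps:get(temperature, Params, 0.0) of
               T when is_number(T), T > 0 -> [temp, dist];
               _ -> [greedy]
           end,
    Optional ++ Tail.
```

Under this reading, a request that passes only grammar yields [grammar, greedy], which matches the "grammar then greedy" chain described for grammar-constrained sampling.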
LoRA adapters
- erllama:load_adapter/2, unload_adapter/2, set_adapter_scale/3, list_adapters/1. Per-adapter sha256 + scale fold into the cache via erllama_cache_key:effective_fingerprint/2, so rows produced with adapter A never collide with rows from adapter B. Snapshot-at-admission semantics keep in-flight requests on their original fingerprint even if an adapter mutation arrives mid-generation.
Concurrency model
- Concurrent request queue: a second complete/3 or infer/4 arriving while one is in flight is queued FIFO instead of getting {error, busy}. The reply {ok, Ref} is sent as soon as the call is admitted; streaming events follow once the queue head advances to the request.
- Decode loop schedules each step via gen_statem:cast(self(), decode_step) instead of next_event so cancel, evict, status, and queued requests interleave fairly between tokens.
- Seq-aware NIFs (infrastructure for 0.2 multi-seq batching): nif_kv_pack/4 accepts an explicit seq_id; new erllama_sampler_t resource owning a standalone llama_sampler*; nif_sampler_new/2, nif_sampler_free/1. Cache rows stay seq-id-free: seq_id is a save/load call argument, never row metadata.
Internals
- erllama_registry module: ETS-backed via callback for binary model ids.
- erllama_inflight module: Ref -> ModelPid table so cancel/1 routes to the right gen_statem.
- erllama_model_backend optional callbacks: apply_chat_template/2, embed/2, set_grammar/2, configure_sampler/2, clear_sampler/1, load_adapter/2, unload_adapter/2, apply_adapters/2. Backends that omit them surface {error, not_supported} from the public API.
Cache subsystem
- Token-exact KV cache with three independently-supervised tiers: RAM (ETS slabs), ram_file (/dev/shm), and disk (plain read I/O).
- Sole-writer arbitration through erllama_cache_meta_srv; reads on the hot path go to ETS directly. lookup_exact/1 is a single atomic ets:lookup (no two-call race) and the meta server cancels waiter timers when an early reply lands so the mailbox doesn't bloat under load.
- The disk tier reads files via plain file:read_file/1 into a fresh BEAM heap binary; mmap is deliberately not used. The process already mmaps multi-GB GGUF weights, and a region binary that outlived its closing NIF call would have exposed the BEAM to SIGBUS from any external truncation.
- Crash-safe save publish protocol: reserve, write_tmp, check, link(2), mark_published; two-stage TTL cleanup with orphan adoption.
- Five save reasons (cold, continued, finish, evict, shutdown) with async/sync semantics matching their use.
- saves_dropped counter: bumps whenever a back-pressured writer pool refuses a save the model wanted to fire.
- Multi-turn warmth via explicit parent_key resume and stateless longest-prefix walk for OpenAI/Anthropic-shaped clients.
- erllama_scheduler memory-pressure poller with pluggable sources (memsup, nvidia-smi, custom callback). Off by default. Sweep timer is cancelled on terminate/2 so a supervisor restart never leaves a zombie firing into a fresh server.
- erllama_cache_writer dirty-IO writer pool with a leak-proof reservation semaphore. pin_and_load/2 wraps load + unpack in try/after so the holder is always checked in.
- Persisted hit counters (u32 in disk header) so popular prefixes survive an LRU walk after restart.
- End-to-end metrics: hits/misses/saves/evictions plus per-path latency totals (pack_total_ns, load_total_ns, longest_prefix_ns, longest_prefix_probes).
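The write_tmp and link(2) steps of the publish protocol follow a common pattern that can be sketched with the standard file module. This is not erllama's implementation: the module name and temp-file naming are hypothetical, and the reserve, check, and mark_published steps are omitted.

```erlang
-module(publish_sketch).
-export([publish/2]).

%% Write Bin so that Final either does not exist or holds complete data.
%% Final is expected to be a string (character list) path.
publish(Final, Bin) ->
    Suffix = integer_to_list(erlang:unique_integer([positive])),
    Tmp = Final ++ "." ++ Suffix ++ ".tmp",
    case file:write_file(Tmp, Bin) of
        ok ->
            %% A hard link fails with eexist if Final is already there,
            %% so racing writers cannot overwrite a published file.
            Result = file:make_link(Tmp, Final),
            _ = file:delete(Tmp),
            Result;
        {error, _} = Err ->
            _ = file:delete(Tmp),
            Err
    end.
```

A crash between the write and the link leaves only a stray temp file behind, which is the kind of orphan a TTL cleanup pass can later remove.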
NIF safety
- Per-resource pthread_mutex and two-resource lifetime pattern for safe concurrent free_*/1 plus dirty NIF ops.
- extern "C" noexcept shim catching every llama.cpp C++ exception at the boundary; decode_one defensive guard against GGML_ASSERT aborts.
- llama_backend_init is deferred to the first nif_load_model via pthread_once, so cache-only and unit-test workloads do not pay the ggml_backend_load_all cost at NIF load.
- nif_tokenize and nif_detokenize honour release_pending, so a model returned by free_model/1 as {ok, deferred} cannot be reused via tokenize.
- nif_detokenize fails closed on n_vocab <= 0 (matches nif_prefill).
- make_errno_atom maps FreeBSD's EINTEGRITY to eintegrity.
Tooling
- FindErlang.cmake (adopted from erlang-rocksdb) detects ERTS_INCLUDE_DIR via the standard CMake find-module contract.
- Bench harness (bench/run.sh) with cold-vs-warm matrix and a 4-agent shared-prefix scenario. TinyLlama and LLaMA-3 8B presets.
CI
- actions/checkout and actions/cache bumped to @v5 (Node.js 24).
- xref, dialyzer, erlfmt, elvis promoted to gate jobs; build, eunit, proper, ct, freebsd depend on them.
- macOS matrix is macos-14, macos-15.
- FreeBSD matrix added: release: ['14.2', '14.4']. Inside the VM: refresh pcre2 so git can run, install git, set git config --global --add safe.directory '*' so llama.cpp's build-info git rev-parse succeeds.
- erllama_nif_tests:load_model_rejects_non_existent_path_test is now a generator with a 60 s timeout to absorb the lazy Metal init on macOS.
Tests
- 211 EUnit + PropEr property tests + 7 stub Common Test cases. Real-model Common Test suite gated on LLAMA_TEST_MODEL (14 cases including seed determinism, grammar+sampler, apply_chat_template, embeddings, KV pack/unpack round-trip).
- New stub-backed coverage: sampler params (erllama_sampler_tests), LoRA adapters + cache identity (erllama_lora_tests), FIFO queueing of concurrent infers (erllama_streaming_tests).
- Multi-platform CI: Ubuntu 24.04 amd64, Ubuntu 24.04 arm64, macOS 14 + 15 (Apple Silicon), FreeBSD 14.2 + 14.4. OTP 28 across the matrix.
Documentation
- README rewritten as a friendly entry point with snippets.
- User guides: loading, caching, configuration, building, examples.
- Internal design notes: cache design, publish protocol, NIF safety.
- ex_doc-friendly module documentation throughout.
- ROADMAP.md: what 0.1 doesn't do yet (multi-seq concurrent decoding, speculative decoding, vision, audio, ONNX/safetensors, stop-sequences, telemetry hooks, multi-GPU pressure, KV compression, cluster).
- README closes with a teaser for the upcoming erllama_cluster application: a separate OTP project that coordinates a fleet of erllama nodes (request distribution, cross-node speculative decoding, pipeline parallelism over QUIC).
Acknowledgements
Same idea as antirez/ds4.