Changelog
All notable changes to erllama are documented here. The format follows Keep a Changelog and this project adheres to Semantic Versioning.
Unreleased
[0.3.0] - 2026-05-16
Anthropic-Messages compatibility additions on top of 0.2.0: caller-supplied stop sequences with trimmed output, an opt-in extended-thinking message surface with per-block integrity signatures, and a round of NIF safety hardening on the C/C++ side.
Added
- stop_sequences :: [binary()] on infer/4 Params and complete/3 Opts. Generation halts on the first occurrence of any element in the accumulated detokenised output. The match is trimmed from the streamed {erllama_token, _, _} chunks and the synchronous reply, and the matched binary is reported as stop_sequence on the result map (complete/3) and stats map (infer/4 done message). The key is absent when generation hit length, was cancelled, or reached EOG without a match. The previously reserved stop placeholder is renamed to stop_sequences; it was never wired up so this is not a breaking change (#32).
- thinking => enabled | disabled on infer/4 Params (default disabled). When enabled against a thinking-capable backend, streaming requests receive {erllama_token, Ref, {thinking_delta, Bin}} fragments and a single {erllama_thinking_end, Ref, Sig} close marker before any subsequent token. Sig is an opaque integrity signature the downstream forwards verbatim into the Anthropic signature_delta SSE event, or <<>> when no signature is available (#33).
- erllama_model_backend gains {thinking_token, token_id()} and thinking_end variants on step_result() plus an optional thinking_signature/2 callback. Backends without extended thinking emit neither variant and require no changes (#33).
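The stop-sequence rule can be pictured with a small pure-Erlang sketch. This is not erllama's implementation: the module and function names are hypothetical, and it only illustrates "halt at the first occurrence of any element, trim the match" using the standard `binary:match/2`.

```erlang
-module(stop_seq_sketch).
-export([trim/2]).

%% Hypothetical helper, not part of erllama.
%% Finds the leftmost occurrence of any stop sequence in the
%% accumulated output and returns the text before it plus the match.
-spec trim(binary(), [binary()]) ->
    {stopped, binary(), binary()} | {continue, binary()}.
trim(Acc, []) ->
    %% Nothing to look for; avoid calling binary:match/2 with no patterns.
    {continue, Acc};
trim(Acc, StopSeqs) ->
    case binary:match(Acc, StopSeqs) of
        {Pos, Len} ->
            Trimmed = binary:part(Acc, 0, Pos),
            Matched = binary:part(Acc, Pos, Len),
            {stopped, Trimmed, Matched};
        nomatch ->
            {continue, Acc}
    end.
```

A streaming implementation would also need to hold back a tail that could still grow into a match before emitting it; the sketch above skips that and works on the whole accumulated binary.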
Changed
- llama_batch_init, llama_batch_free, and llama_batch_get_one are now routed through erllama_safe_batch_* noexcept shims so a C++ exception cannot unwind through the C NIF frame (#30).
- Per-thread thread_local storage replaces the process-global log buffer used by the malformed-GGUF classifier; concurrent model loads no longer scramble each other's GGML_ASSERT text on a NULL return (#31).
- NIF unload no longer calls llama_backend_free (avoids a pthread_once wedge on .so reload paths) and clears the llama_log_set callback so a post-unload log emission cannot dispatch into freed memory (#31).
0.2.0 - 2026-05-15
Multi-sequence batched scheduling, map-shaped completion results, chunked prefill, per-model observability, and direct passthrough of the llama.cpp multi-GPU / flash-attention / KV-quant params.
Changed (breaking)
- erllama:complete/2,3 and erllama_model:complete/2,3 now return {ok, completion_result()} instead of the legacy {ok, ReplyBinary, GeneratedTokens} tuple. The map carries:
  - reply :: binary()
  - generated :: [token_id()]
  - context_tokens :: [token_id()] (prompt ++ generated)
  - committed_tokens :: non_neg_integer() (length(context_tokens))
  - finish_key :: cache_key() | undefined: token-exact key for the full context, suitable as parent_key on the next turn; undefined if the finish save was suppressed
  - cache_hit_kind :: exact | partial | cold
  - finish_reason :: stop | length | cancelled
  - stats :: stats()
Mechanical migration:
    %% before
    {ok, Reply, _Tokens} = erllama:complete(Model, Prompt).
    %% after
    {ok, #{reply := Reply, finish_key := FK}} = erllama:complete(Model, Prompt).

- Streaming {erllama_done, Ref, Stats} (infer/4) Stats map gains two additive keys: finish_key and committed_tokens. No shape break for existing consumers (#21).
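As a rough illustration of how the new result map might feed a follow-up turn, the sketch below turns a completion_result() into options for the next call. The module name is hypothetical, and passing the key as `parent_key` in the Opts map is an assumption based on the wording of the `finish_key` entry, not a confirmed signature.

```erlang
-module(turn_sketch).
-export([next_opts/1]).

%% Hypothetical helper, not part of erllama.
%% Builds options for the next turn from a completion_result() map.
%% Assumption: the next call accepts `parent_key` in its Opts map.
-spec next_opts(map()) -> map().
next_opts(#{finish_key := undefined}) ->
    %% The finish save was suppressed, so there is nothing to resume from.
    #{};
next_opts(#{finish_key := FK}) ->
    #{parent_key => FK}.
```

Under that assumption, a second turn would look roughly like `erllama:complete(Model, NextPrompt, turn_sketch:next_opts(Result))`, where `Result` is the map returned by the first call.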
Added
Multi-sequence batched scheduler (#24, #25, #26, #27)
- erllama_nif:step/2 is the new multi-sequence batched decode primitive. One llama_decode call mixes prefill and decode rows freely (SARATHI-style co-batching), bounded by the live context's n_batch. Returns {error, batch_overflow} cleanly so a budget-aware scheduler can shrink and retry, and {error, no_logits} when a decode row has no prefill yet on its seq.
- Per-context per_seq[] tracking (last_logits_idx, next_pos) with ERLLAMA_N_SEQ_MAX_CAP = 256. kv_unpack / kv_seq_rm refresh the per-seq position so subsequent step calls see correct state.
- erllama_model_backend behaviour gains optional callbacks step/2, sampler_new/2, sampler_free/1, seq_rm/2, seq_rm_last/3, plus seq-aware kv_pack/3 and kv_unpack/3. All optional; existing backends keep compiling.
- The model gen_statem now runs a multi-tenant scheduler. With context_opts.n_seq_max => 1 (the default), behaviour is bit-identical to 0.1: exactly one request runs at a time. Setting n_seq_max > 1 lets up to N requests prefill and decode concurrently through one llama_decode per tick. State collapses from idle / prefilling / generating to idle / running; admissions past the seq-id capacity queue FIFO in pending.
- Each in-flight request owns its own sampler chain (built at admission via backend:sampler_new/2, freed at finish), so concurrent requests with different temperature / seed / grammar settings never share sampler state.
- Cache save reasons (cold, continued, finish, evict, shutdown) all thread through the request's seq_id and remain token-exact per-sequence.
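The "shrink and retry" reaction to {error, batch_overflow} might be sketched as below. Everything here is illustrative: `StepFun` is a hypothetical stand-in for a call such as erllama_nif:step/2 with its context already bound, rows are treated as an opaque list, and halving the row count is just one possible policy (a real scheduler would more likely budget by token count).

```erlang
-module(step_retry_sketch).
-export([step_with_shrink/2]).

%% Hypothetical helper, not part of erllama.
%% StepFun :: fun((Rows) -> Result) stands in for the real step call.
%% Returns {Result, RowsActuallySubmitted}, or {error, batch_overflow}
%% once there is nothing left to drop.
step_with_shrink(_StepFun, []) ->
    {error, batch_overflow};
step_with_shrink(StepFun, Rows) ->
    case StepFun(Rows) of
        {error, batch_overflow} ->
            %% Keep the first half (rounded down) and try again;
            %% the dropped rows would be re-queued for a later tick.
            Keep = length(Rows) div 2,
            {Kept, _Deferred} = lists:split(Keep, Rows),
            step_with_shrink(StepFun, Kept);
        Other ->
            {Other, Rows}
    end.
```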
Chunked prefill (#28)
- prefill_chunk_size policy knob caps how many tokens a single prefill row contributes to one tick. Default max(64, n_batch div 4); pass infinity to disable. A long prompt is sliced across multiple ticks so it never monopolises the batch and concurrent decoders keep making progress between chunks. Layered on top of the n_batch per-tick budget.
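The default and the slicing behaviour can be worked through with a small standalone sketch. The module and function names are hypothetical; only the formula max(64, n_batch div 4) and the `infinity` setting come from the entry above.

```erlang
-module(prefill_chunk_sketch).
-export([default_chunk_size/1, slice/2]).

%% Default stated in the changelog: max(64, n_batch div 4).
default_chunk_size(NBatch) when is_integer(NBatch), NBatch > 0 ->
    max(64, NBatch div 4).

%% Split a prompt's token list into per-tick chunks.
%% `infinity` disables chunking, so the prompt stays in one piece.
slice([], _ChunkSize) ->
    [];
slice(Tokens, infinity) ->
    [Tokens];
slice(Tokens, ChunkSize) when is_integer(ChunkSize), ChunkSize > 0 ->
    case length(Tokens) =< ChunkSize of
        true ->
            [Tokens];
        false ->
            {Chunk, Rest} = lists:split(ChunkSize, Tokens),
            [Chunk | slice(Rest, ChunkSize)]
    end.
```

With n_batch = 512 the default chunk is 128 tokens, so a 1000-token prompt would be spread over eight ticks (seven full chunks and one of 104 tokens), leaving room in each tick for decode rows from other requests.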
prefill_only/2 (#21)
- erllama:prefill_only/2 and erllama_model:prefill_only/2 decode a prompt into KV state and fire a finish save without sampling any output tokens. Returns a prefill_result() map carrying context_tokens, committed_tokens, finish_key, and cache_hit_kind. Useful for priming the cache before a burst of short follow-ups, or for holding a warm session across long pauses without consuming generation budget.
Per-model observability (#22)
- New public ETS table erllama_model_obs, owned by erllama_inflight, written by each model gen_statem on every state transition and read lock-free from any process (including remote nodes via erpc).
- Four new accessors on erllama:
  - phase/1: idle | prefilling | generating for one model id.
  - pending_len/1: gen_statem pending FIFO depth (calls queued behind whatever is currently running).
  - last_cache_hit/1: #{kind, prefix_len} of the most recent admission, or undefined if the model has never admitted.
  - queue_depth/1: per-model variant of the existing global queue_depth/0; counts admitted streaming infer/4 rows.
- model_info/1 map gains phase, pending_len, and last_cache_hit keys. Additive: existing keys preserved.
llama.cpp option passthrough (#23)
- erllama_nif:load_model/2 now reads three additional keys from model_opts:
  - split_mode :: none | layer | row: multi-GPU split policy.
  - main_gpu :: non_neg_integer(): GPU index when split_mode = none.
  - tensor_split :: [float()]: per-device proportions (up to 16 entries; shorter lists zero-fill).
- erllama_nif:new_context/2 reads three more from context_opts:
  - flash_attn :: boolean() | auto: enable, disable, or defer to llama.cpp.
  - type_k, type_v :: f16 | f32 | bf16 | q4_0 | q5_0 | q5_1 | q8_0: KV cache element type for keys and values.
- Bad atoms raise badarg before the load runs.
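For orientation, the new keys might be combined as in the fragment below. The values are arbitrary examples (a two-GPU even split with a quantised KV cache), not recommendations, and how the two maps reach load_model/2 and new_context/2 is not shown because the surrounding call shape is not described here.

```erlang
%% Illustrative option maps only; key names and allowed atoms are
%% taken from the changelog entry, the values are made up.
ModelOpts = #{
    split_mode => layer,
    main_gpu => 0,
    tensor_split => [0.5, 0.5]
}.

ContextOpts = #{
    flash_attn => auto,
    type_k => q8_0,
    type_v => q8_0
}.
```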
Fixed
- warm_restore_primer passed 1 instead of the current cell count to the prefill primer, so warm restores that ran the primer at a non-zero offset wrote KV at the wrong position. The primer now takes the live cell count from the per-seq tracker.
- The cold prefill path could fire the cold save inside the remainder prefill rather than between the trim prefix and the remainder, leading to a save row that did not match the trim-aligned boundary. The cold save now fires at the cursor-emptied transition between the trim and the remainder.
Internal
- New types exported from erllama_model: completion_result/0, prefill_result/0.
- erllama_model_t now owns the tensor_split buffer; the vendored llama.cpp aliases the pointer rather than copying it, so its storage must outlive the model.
- erllama_model_stub derives per-seq tokens from phash2({decode_step_stub, SeqId, Sampler}) so a scheduler bug that swaps samplers between seqs becomes observable in tests.
0.1.2 - 2026-05-12
Cluster-routing primitives, speculative-decoding verifier, a cold-path correctness fix, and C-safety CI tooling. All additions are backwards-compatible; existing API call sites unchanged.
Added
Cluster routing and load-balancing (#13)
- erllama:queue_depth/0 returns O(1) inflight count via an atomics counter parked in persistent_term, readable cross-node via erpc. Used by the upcoming erllama_cluster load balancer (least_loaded, power_of_two strategies).
- erllama:list_cached_prefixes/2 returns the longest cached prefix length of a token list for a given model on this node, across all cache tiers. Used by the cluster cache-affinity router.
- erllama_nif:vram_info/0 walks every loaded ggml backend and sums free + total memory across non-CPU devices; returns {error, no_gpu} on a CPU-only build. Used by the cluster scheduler for bin-packing model placement.
- erllama:list_models/0 map gains model_id, quant_tag, loaded_at_monotonic, and vram_estimate_b keys. Existing keys (id, pid, status, etc.) are unchanged.
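The "atomics counter parked in persistent_term" pattern behind queue_depth/0 can be sketched with the standard `atomics` and `persistent_term` modules. The module name, key, and function names below are hypothetical and say nothing about how erllama_inflight is actually laid out.

```erlang
-module(inflight_counter_sketch).
-export([init/0, admit/0, release/0, depth/0]).

%% Hypothetical persistent_term key for the shared counter.
-define(KEY, {?MODULE, counter}).

%% Create a one-slot signed counter and park the reference where
%% any process can fetch it without a message round-trip.
init() ->
    Ref = atomics:new(1, [{signed, true}]),
    persistent_term:put(?KEY, Ref).

%% Bump on a true admission, drop on a true removal.
admit() -> atomics:add(counter(), 1, 1).
release() -> atomics:sub(counter(), 1, 1).

%% O(1) read: one persistent_term lookup plus one atomic load.
depth() -> atomics:get(counter(), 1).

counter() -> persistent_term:get(?KEY).
```

Because the read touches no process mailbox, a remote node can call the accessor through erpc without contending with the request path.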
Speculative decoding (#13)
- erllama:draft_tokens/3 synchronously generates up to Max next-token ids for a prefix. Times out at 30 s with a clean cancel + drain so the caller's mailbox stays clean. Empty prefix is rejected as {error, empty_prefix}.
- erllama:verify/4 runs PrefixTokens ++ Candidates through the model in one forward pass and returns the longest accepted prefix length plus the verifier's own next token. Acceptance walks Argmax[P + i - 1] == c_i and stops at the first mismatch. End-of-generation tokens map to the atom eos. Snapshot + restore protocol leaves the caller's pre-call context view unchanged. Allowed only from the model gen_statem's idle state; non-idle callers receive {error, busy}.
- Token-id streaming: erllama_model:stream_emit now also sends {erllama_token_id, Ref, Id} on every produced token, in addition to the existing {erllama_token, Ref, Bin}. Empty-text tokens (special tokens, BPE merges with no visible bytes) still produce an id message. Existing consumers ignore the new tag.
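The acceptance walk in verify/4 can be illustrated in isolation. The sketch below is not the NIF's code; it assumes `Argmax` is a list of per-position greedy predictions for PrefixTokens ++ Candidates, read as 1-based so that element P is the prediction following the last prefix token, and that candidates are numbered from 1. The module name is hypothetical, and the mapping of end-of-generation tokens to eos is left out.

```erlang
-module(verify_sketch).
-export([accept/3]).

%% Hypothetical helper, not part of erllama.
%% P is the prefix length (P >= 1) and
%% length(Argmax) =:= P + length(Candidates) is assumed.
%% Returns {AcceptedCount, VerifierNextToken}.
-spec accept(pos_integer(), [term()], [term()]) ->
    {non_neg_integer(), term()}.
accept(P, Argmax, Candidates) ->
    %% Drop the first P - 1 predictions so the head is Argmax[P],
    %% the prediction that the first candidate is checked against.
    walk(lists:nthtail(P - 1, Argmax), Candidates, 0).

%% A candidate is accepted while it equals the prediction at its slot.
walk([Pred | Preds], [Pred | Cands], N) ->
    walk(Preds, Cands, N + 1);
%% First mismatch, or candidates exhausted: the prediction at this
%% slot is the verifier's own next token.
walk([Pred | _], _Cands, N) ->
    {N, Pred}.
```

For a two-token prefix with predictions [x, 10, 11, 12] and candidates [10, 11], both candidates are accepted and 12 comes back as the next token; change the third prediction to 99 and only the first candidate survives, with 99 returned instead.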
Backend behaviour
- Optional callbacks extra_metadata/1 (vram-related model metadata) and verify/4 (speculative verifier) on erllama_model_backend. Backends that omit either get a graceful {error, not_supported} fallback (#13).
- Optional callback seq_clear/1 on erllama_model_backend. Llama backend implements it as llama_state_seq_rm(0, 0, -1). Called by the model layer at the top of enter_prefilling; see the Fixed section below (#16).
NIFs (#13)
- nif_model_size/1, nif_model_n_layer/1, nif_forward_with_argmax/2, nif_vram_info/0. Previously unreachable from Erlang.
Fixed
- Cold-path prefill KV-state leak: erllama_model:enter_prefilling did not reset the llama_context's KV cache before the new prefill. llama_batch_get_one auto-positions at n_past, so a second cold request on the same model wrote its prompt KV at [previous_n_past..] instead of [0..], producing different output for the same prompt + seed across calls. The new seq_clear/1 callback wipes seq 0 before the cold prefill. Warm restores via kv_unpack were already correct (#16).
Changed
- Vendored c_src/llama.cpp/ bumped from b9093 to b9119 (16 files, mostly Metal/CUDA tweaks plus a new ggml-cuda/allreduce kernel pair). Public llama.h API used by the NIF is unchanged (#15).
- erllama_inflight:register/2 and unregister/1 switched to ets:insert_new / ets:take so the new atomics counter sees only true admissions and true removals; double-register or double-unregister become observable no-ops (#13).
Internal
- New CI jobs: sanitizers (ASan+UBSan against tinyllamas/stories260K.gguf under LD_PRELOAD'd libasan), clang-tidy (NIF sources only), and scan-build (Clang Static Analyzer with --status-bugs) (#14).
- New c_src/CMakeLists.txt options: ENABLE_ASAN, ENABLE_TSAN, ENABLE_UBSAN, ENABLE_CLANG_TIDY, scoped to the erllama_nif target; default OFF.
- .clang-tidy config at repo root. Vendored llama.cpp is not linted.
ROADMAP
- Pipeline parallelism deferred; blocked on upstream llama.cpp adding layer-range execution to llama_decode. The cluster degrades gracefully via function_exported.
- Verify context isolation: the current snapshot/restore protocol does not preserve the caller's pre-call decode_ready flag; callers are assumed to issue decode_one imminently. A v2 extension to cover decode_ready is documented for the next contributor.
0.1.1 - 2026-05-12
NIF safety and SIGSEGV hardening. No public API additions; new
error tuples surface paths that previously crashed the BEAM or
raised badarg across a dirty scheduler.
Fixed
- Adapter use-after-free / double-free when free_model/1 ran while adapter wrappers still referenced the model. The model resource now tracks active_adapters alongside active_contexts and defers llama_model_free until both reach zero (#10).
- Race in set_adapters/2 where a concurrent adapter_free could null the underlying pointer between the per-adapter mutex release and the llama_set_adapters_lora call. Locks are now held across the llama call in pointer-sorted order to defeat AB-BA between concurrent callers (#10).
- Per-message memory leak in apply_chat_template when the message list was malformed or allocation failed mid-build. The helper now releases its own role/content allocations on every error path (#10).
- prefill/2 and embed/2 walked past the KV slab when the prompt size reached n_ctx, and produced undefined behaviour when it exceeded n_batch. Both now bounds-check against the live context before touching state, returning {error, context_overflow} or {error, batch_overflow} (#11).
- apply_chat_template/2 raised badarg across the dirty scheduler when content was a list-of-maps (Anthropic-style content blocks) instead of a binary. Returns {error, invalid_content} (#11).
- load_model/2 surfaced a generic {error, load_failed} for malformed GGUF files. Now returns {error, malformed_gguf} on a best-effort basis when the captured llama log line contains GGML_ASSERT. Best-effort only: llama_log_set is process global and concurrent loads can mis-attribute classification. A GGML_ASSERT that hits abort() still terminates the BEAM process; subprocess isolation would be required for a complete fix and is intentionally out of scope (#11).
Added
- New error atoms returned by the NIF: context_overflow, batch_overflow, invalid_content, malformed_gguf. Callers matching {error, _} are unaffected; callers that care about the specific reason should match the new atoms.
0.1.0 - 2026-05-11
Initial public release.
Added
Public API
- Native Erlang/OTP wrapper around llama.cpp via a single dirty-scheduler NIF (erllama_nif) covering model load, context construction, tokenisation, prefill, single-token decode, and KV pack/unpack.
- Models are identified by binary() on the public API. erllama:load_model/2, complete/2,3, unload/1, status/1, evict/1, shutdown/1 take binary() | pid(). Internal registration uses {via, erllama_registry, BinaryId} so user-supplied ids cannot exhaust the atom table.
- erllama:list_models/0 returning [model_info()] and erllama:model_info/1 keyed on a model id.
- Public erllama:tokenize/2 and erllama:detokenize/2 keyed on a model id. The low-level erllama_nif:tokenize/3 and erllama_nif:detokenize/2 remain available.
- erllama:unload_model/1 as an alias for erllama:unload/1 matching the OpenAI/Ollama-style naming downstream HTTP servers use.
- erllama:infer/4 streaming inference. Returns {ok, Ref}; tokens are delivered to the caller as {erllama_token, Ref, _}, {erllama_done, Ref, Stats}, {erllama_error, Ref, Reason}.
- erllama:cancel/1. Idempotent and fire-and-forget; observed between tokens.
- erllama:apply_chat_template/2. Renders a normalised chat request (messages, system, tools) through the model's GGUF chat template and tokenises. Backed by llama_chat_apply_template.
- erllama:embed/2. Per-sequence pooled embedding via llama_get_embeddings_seq with last-token fallback.
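A caller of the streaming API would typically sit in a receive loop keyed on the returned Ref. The sketch below shows one way to collect the three message shapes listed for infer/4; the module name is hypothetical, the token payload is assumed to be a binary, and the per-message timeout is an arbitrary choice.

```erlang
-module(stream_sketch).
-export([collect/2]).

%% Hypothetical helper, not part of erllama.
%% Gathers streamed chunks for Ref until a done or error message.
%% Timeout applies to the gap between messages, not the whole stream.
collect(Ref, Timeout) ->
    collect(Ref, Timeout, []).

collect(Ref, Timeout, Acc) ->
    receive
        {erllama_token, Ref, Bin} when is_binary(Bin) ->
            collect(Ref, Timeout, [Bin | Acc]);
        {erllama_done, Ref, Stats} ->
            {ok, iolist_to_binary(lists:reverse(Acc)), Stats};
        {erllama_error, Ref, Reason} ->
            {error, Reason}
    after Timeout ->
        {error, timeout}
    end.
```

Matching on the bound `Ref` in every clause keeps messages from other in-flight requests in the mailbox for their own collectors.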
Sampling
- Sampler parameters: complete/3 and infer/4 honour temperature, top_k, top_p, min_p, repetition_penalty, seed, and grammar via one combined chain builder (erllama_nif:configure_sampler/2). Chain order: grammar -> repetition_penalty -> top_k -> top_p -> min_p -> (temperature > 0 ? temp -> dist(seed) : greedy). set_grammar/2 retained as a backwards-compatible alias.
- Grammar-constrained sampling: pass grammar => GBNF in the complete/3 Opts or infer/4 Params; the per-model sampler chain is rebuilt as grammar then greedy for the duration of the request and reset on completion or cancellation.
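The chain order can be made concrete with a toy function that lists the stages a given parameter map would produce. This is a reading aid only: the module name is hypothetical, and two behaviours are assumptions rather than documented facts, namely that a stage is skipped when its parameter is absent and that a missing temperature means greedy.

```erlang
-module(chain_sketch).
-export([stages/1]).

%% Hypothetical helper, not part of erllama.
%% Returns stage names in the documented order:
%% grammar -> repetition_penalty -> top_k -> top_p -> min_p -> tail.
-spec stages(map()) -> [atom()].
stages(Params) ->
    Ordered = [grammar, repetition_penalty, top_k, top_p, min_p],
    %% Assumption: absent parameters contribute no stage.
    Present = [S || S <- Ordered, maps:is_key(S, Params)],
    %% temperature > 0 ? temp -> dist(seed) : greedy
    Tail =
        case maps:get(temperature, Params, 0) of
            T when is_number(T), T > 0 -> [temp, dist];
            _ -> [greedy]
        end,
    Present ++ Tail.
```

The grammar-constrained case described above falls out directly: a map holding only a `grammar` key yields `[grammar, greedy]`.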
LoRA adapters
- erllama:load_adapter/2, unload_adapter/2, set_adapter_scale/3, list_adapters/1. Per-adapter sha256 + scale fold into the cache via erllama_cache_key:effective_fingerprint/2, so rows produced with adapter A never collide with rows from adapter B. Snapshot-at-admission semantics keep in-flight requests on their original fingerprint even if an adapter mutation arrives mid-generation.
Concurrency model
- Concurrent request queue: a second complete/3 or infer/4 arriving while one is in flight is queued FIFO instead of getting {error, busy}. The reply {ok, Ref} is sent as soon as the call is admitted; streaming events follow once the queue head advances to the request.
- Decode loop schedules each step via gen_statem:cast(self(), decode_step) instead of next_event so cancel, evict, status, and queued requests interleave fairly between tokens.
- Seq-aware NIFs (infrastructure for 0.2 multi-seq batching): nif_kv_pack/4 accepts an explicit seq_id; new erllama_sampler_t resource owning a standalone llama_sampler*; nif_sampler_new/2, nif_sampler_free/1. Cache rows stay seq-id-free: seq_id is a save/load call argument, never row metadata.
Internals
- erllama_registry module: ETS-backed via callback for binary model ids.
- erllama_inflight module: Ref -> ModelPid table so cancel/1 routes to the right gen_statem.
- erllama_model_backend optional callbacks: apply_chat_template/2, embed/2, set_grammar/2, configure_sampler/2, clear_sampler/1, load_adapter/2, unload_adapter/2, apply_adapters/2. Backends that omit them surface {error, not_supported} from the public API.
Cache subsystem
- Token-exact KV cache with three independently-supervised tiers: RAM (ETS slabs), ram_file (/dev/shm), and disk (plain read I/O).
- Sole-writer arbitration through erllama_cache_meta_srv; reads on the hot path go to ETS directly. lookup_exact/1 is a single atomic ets:lookup (no two-call race) and the meta server cancels waiter timers when an early reply lands so the mailbox doesn't bloat under load.
- The disk tier reads files via plain file:read_file/1 into a fresh BEAM heap binary; mmap is deliberately not used. The process already mmaps multi-GB GGUF weights, and a region binary that outlived its closing NIF call would have exposed the BEAM to SIGBUS from any external truncation.
- Crash-safe save publish protocol: reserve, write_tmp, check, link(2), mark_published; two-stage TTL cleanup with orphan adoption.
- Five save reasons (cold, continued, finish, evict, shutdown) with async/sync semantics matching their use.
- saves_dropped counter: bumps whenever a back-pressured writer pool refuses a save the model wanted to fire.
- Multi-turn warmth via explicit parent_key resume and stateless longest-prefix walk for OpenAI/Anthropic-shaped clients.
- erllama_scheduler memory-pressure poller with pluggable sources (memsup, nvidia-smi, custom callback). Off by default. Sweep timer is cancelled on terminate/2 so a supervisor restart never leaves a zombie firing into a fresh server.
- erllama_cache_writer dirty-IO writer pool with a leak-proof reservation semaphore. pin_and_load/2 wraps load + unpack in try/after so the holder is always checked in.
- Persisted hit counters (u32 in disk header) so popular prefixes survive an LRU walk after restart.
- End-to-end metrics: hits/misses/saves/evictions plus per-path latency totals (pack_total_ns, load_total_ns, longest_prefix_ns, longest_prefix_probes).
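The write_tmp and link(2) steps of the publish protocol listed in this section rely on a property of hard links: creating one fails if the target name already exists, so a finished file is never overwritten by a late writer. A minimal sketch of just those two steps, using `file:make_link/2`, could look like this. The module name and the ".tmp" suffix are hypothetical, and the reserve, check, mark_published, and TTL-cleanup stages are not modelled.

```erlang
-module(publish_sketch).
-export([publish/2]).

%% Hypothetical helper, not part of erllama.
%% Final must be a string path on a filesystem with hard links.
-spec publish(string(), binary()) ->
    published | already_published | {error, term()}.
publish(Final, Bin) ->
    %% Assumption: one writer per Final, so a fixed temp name is safe.
    Tmp = Final ++ ".tmp",
    ok = file:write_file(Tmp, Bin),
    Result =
        case file:make_link(Tmp, Final) of
            ok -> published;
            %% Someone published first; their bytes stay untouched.
            {error, eexist} -> already_published;
            {error, Reason} -> {error, Reason}
        end,
    %% The temp name is no longer needed either way.
    _ = file:delete(Tmp),
    Result.
```

A reader that opens the final path therefore sees either no file or a complete one, never a partially written row.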
NIF safety
- Per-resource pthread_mutex and two-resource lifetime pattern for safe concurrent free_*/1 plus dirty NIF ops.
- extern "C" noexcept shim catching every llama.cpp C++ exception at the boundary; decode_one defensive guard against GGML_ASSERT aborts.
- llama_backend_init is deferred to the first nif_load_model via pthread_once, so cache-only and unit-test workloads do not pay the ggml_backend_load_all cost at NIF load.
- nif_tokenize and nif_detokenize honour release_pending, so a model returned by free_model/1 as {ok, deferred} cannot be reused via tokenize.
- nif_detokenize fails closed on n_vocab <= 0 (matches nif_prefill).
- make_errno_atom maps FreeBSD's EINTEGRITY to eintegrity.
Tooling
- FindErlang.cmake (adopted from erlang-rocksdb) detects ERTS_INCLUDE_DIR via the standard CMake find-module contract.
- Bench harness (bench/run.sh) with cold-vs-warm matrix and a 4-agent shared-prefix scenario. TinyLlama and LLaMA-3 8B presets.
CI
- actions/checkout and actions/cache bumped to @v5 (Node.js 24).
- xref, dialyzer, erlfmt, elvis promoted to gate jobs; build, eunit, proper, ct, freebsd depend on them.
- macOS matrix is macos-14, macos-15.
- FreeBSD matrix added: release: ['14.2', '14.4']. Inside the VM: refresh pcre2 so git can run, install git, set git config --global --add safe.directory '*' so llama.cpp's build-info git rev-parse succeeds.
- erllama_nif_tests:load_model_rejects_non_existent_path_test is now a generator with a 60 s timeout to absorb the lazy Metal init on macOS.
Tests
- 211 EUnit + PropEr property tests + 7 stub Common Test cases. Real-model Common Test suite gated on LLAMA_TEST_MODEL (14 cases including seed determinism, grammar+sampler, apply_chat_template, embeddings, KV pack/unpack round-trip).
- New stub-backed coverage: sampler params (erllama_sampler_tests), LoRA adapters + cache identity (erllama_lora_tests), FIFO queueing of concurrent infers (erllama_streaming_tests).
- Multi-platform CI: Ubuntu 24.04 amd64, Ubuntu 24.04 arm64, macOS 14 + 15 (Apple Silicon), FreeBSD 14.2 + 14.4. OTP 28 across the matrix.
Documentation
- README rewritten as a friendly entry point with snippets.
- User guides: loading, caching, configuration, building, examples.
- Internal design notes: cache design, publish protocol, NIF safety.
- ex_doc-friendly module documentation throughout.
- ROADMAP.md: what 0.1 doesn't do yet (multi-seq concurrent decoding, speculative decoding, vision, audio, ONNX/safetensors, stop-sequences, telemetry hooks, multi-GPU pressure, KV compression, cluster).
- README closes with a teaser for the upcoming erllama_cluster application: a separate OTP project that coordinates a fleet of erllama nodes (request distribution, cross-node speculative decoding, pipeline parallelism over QUIC).
Acknowledgements
Same idea as antirez/ds4.