Changelog
All notable changes to erllama are documented here. The format follows Keep a Changelog and this project adheres to Semantic Versioning.
Unreleased
0.1.1 - 2026-05-12
NIF safety and SIGSEGV hardening. No public API additions; new
error tuples surface paths that previously crashed the BEAM or
raised badarg across a dirty scheduler.
Fixed
- Adapter use-after-free / double-free when free_model/1 ran while adapter wrappers still referenced the model. The model resource now tracks active_adapters alongside active_contexts and defers llama_model_free until both reach zero (#10).
- Race in set_adapters/2 where a concurrent adapter_free could null the underlying pointer between the per-adapter mutex release and the llama_set_adapters_lora call. Locks are now held across the llama call in pointer-sorted order to defeat AB-BA between concurrent callers (#10).
- Per-message memory leak in apply_chat_template when the message list was malformed or allocation failed mid-build. The helper now releases its own role/content allocations on every error path (#10).
- prefill/2 and embed/2 walked past the KV slab when the prompt size reached n_ctx, and produced undefined behaviour when it exceeded n_batch. Both now bounds-check against the live context before touching state, returning {error, context_overflow} or {error, batch_overflow} (#11).
- apply_chat_template/2 raised badarg across the dirty scheduler when content was a list-of-maps (Anthropic-style content blocks) instead of a binary. Returns {error, invalid_content} (#11).
- load_model/2 surfaced a generic {error, load_failed} for malformed GGUF files. Now returns {error, malformed_gguf} on a best-effort basis when the captured llama log line contains GGML_ASSERT. Best-effort only: llama_log_set is process global and concurrent loads can mis-attribute classification. A GGML_ASSERT that hits abort() still terminates the BEAM process; subprocess isolation would be required for a complete fix and is intentionally out of scope (#11).
Added
- New error atoms returned by the NIF: context_overflow, batch_overflow, invalid_content, malformed_gguf. Callers matching {error, _} are unaffected; callers that care about the specific reason should match the new atoms.
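As a rough illustration of matching the new atoms, the sketch below maps each one onto a caller-side decision. The module name, the function name, and the returned hint atoms are hypothetical and not part of erllama; only the four error atoms come from this release.

```erlang
-module(erllama_error_example).
-export([classify/1]).

%% Hypothetical helper: turn a result tuple into a caller-side decision.
%% Only the four error atoms are taken from the changelog; the hint
%% atoms on the right-hand side are made up for illustration.
classify({ok, _Value}) ->
    ok;
classify({error, context_overflow}) ->
    {reject, prompt_too_long_for_context};
classify({error, batch_overflow}) ->
    {reject, prompt_too_long_for_batch};
classify({error, invalid_content}) ->
    {reject, content_must_be_binary};
classify({error, malformed_gguf}) ->
    {reject, model_file_unreadable};
classify({error, Other}) ->
    %% Older reasons such as load_failed still fall through here.
    {unknown, Other}.
```

Code that only matches {error, _} keeps working unchanged; the final clause shows the same catch-all shape.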
0.1.0 - 2026-05-11
Initial public release.
Added
Public API
- Native Erlang/OTP wrapper around llama.cpp via a single dirty-scheduler NIF (erllama_nif) covering model load, context construction, tokenisation, prefill, single-token decode, and KV pack/unpack.
- Models are identified by binary() on the public API. erllama:load_model/2, complete/2,3, unload/1, status/1, evict/1, shutdown/1 take binary() | pid(). Internal registration uses {via, erllama_registry, BinaryId} so user-supplied ids cannot exhaust the atom table.
- erllama:list_models/0 returning [model_info()] and erllama:model_info/1 keyed on a model id.
- Public erllama:tokenize/2 and erllama:detokenize/2 keyed on a model id. The low-level erllama_nif:tokenize/3 and erllama_nif:detokenize/2 remain available.
- erllama:unload_model/1 as an alias for erllama:unload/1 matching the OpenAI/Ollama-style naming downstream HTTP servers use.
- erllama:infer/4 streaming inference. Returns {ok, Ref}; tokens are delivered to the caller as {erllama_token, Ref, _}, {erllama_done, Ref, Stats}, {erllama_error, Ref, Reason}.
- erllama:cancel/1. Idempotent and fire-and-forget; observed between tokens.
- erllama:apply_chat_template/2. Renders a normalised chat request (messages, system, tools) through the model's GGUF chat template and tokenises. Backed by llama_chat_apply_template.
- erllama:embed/2. Per-sequence pooled embedding via llama_get_embeddings_seq with last-token fallback.
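The three streaming messages listed for infer/4 can be consumed with a plain selective receive. The following is a minimal sketch, not erllama's own code: the module is hypothetical, the token payload is treated as an opaque term because the changelog leaves it unspecified, and demo/0 uses a stand-in process instead of a real model so the example runs without erllama loaded.

```erlang
-module(erllama_stream_example).
-export([collect/2, demo/0]).

%% Gather every token tagged with Ref until a terminal message arrives
%% or Timeout milliseconds pass without any message for this Ref.
collect(Ref, Timeout) ->
    collect(Ref, Timeout, []).

collect(Ref, Timeout, Acc) ->
    receive
        {erllama_token, Ref, Token} ->
            collect(Ref, Timeout, [Token | Acc]);
        {erllama_done, Ref, Stats} ->
            {ok, lists:reverse(Acc), Stats};
        {erllama_error, Ref, Reason} ->
            {error, Reason, lists:reverse(Acc)}
    after Timeout ->
        {error, timeout, lists:reverse(Acc)}
    end.

%% Stand-in producer (assumption): a spawned process plays the role of
%% the model and sends the same message shapes. The binary tokens and
%% the Stats map are invented for the demo.
demo() ->
    Ref = make_ref(),
    Caller = self(),
    spawn(fun() ->
        Caller ! {erllama_token, Ref, <<"Hel">>},
        Caller ! {erllama_token, Ref, <<"lo">>},
        Caller ! {erllama_done, Ref, #{tokens => 2}}
    end),
    collect(Ref, 1000).
```

Matching on the bound Ref keeps messages from other in-flight requests in the mailbox. In real use a caller that hits the timeout branch would probably also want to call erllama:cancel/1.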
Sampling
- Sampler parameters: complete/3 and infer/4 honour temperature, top_k, top_p, min_p, repetition_penalty, seed, and grammar via one combined chain builder (erllama_nif:configure_sampler/2). Chain order: grammar -> repetition_penalty -> top_k -> top_p -> min_p -> (temperature > 0 ? temp -> dist(seed) : greedy). set_grammar/2 retained as a backwards-compatible alias.
- Grammar-constrained sampling: pass grammar => GBNF in the complete/3 Opts or infer/4 Params; the per-model sampler chain is rebuilt as grammar then greedy for the duration of the request and reset on completion or cancellation.
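The chain order above can be modelled as a pure function from a params map to a list of stages. This is only a sketch of the ordering rule: the real builder lives in the NIF, the module below is hypothetical, and two behaviours are assumptions rather than documented facts, namely that stages whose key is absent are skipped and that a missing temperature is treated as 0.

```erlang
-module(erllama_chain_example).
-export([chain/1]).

%% Optional stages, in the order given in the changelog.
-define(OPTIONAL, [grammar, repetition_penalty, top_k, top_p, min_p]).

%% Return the stage list for a params map.
%% Assumption: a stage is included only when its key is present.
chain(Params) when is_map(Params) ->
    Head = [K || K <- ?OPTIONAL, maps:is_key(K, Params)],
    Head ++ tail(Params).

%% temperature > 0 selects temp -> dist(seed); anything else is greedy.
%% Assumption: a missing temperature defaults to 0.
tail(Params) ->
    case maps:get(temperature, Params, 0) of
        T when is_number(T), T > 0 ->
            [{temp, T}, {dist, maps:get(seed, Params, undefined)}];
        _ ->
            [greedy]
    end.
```

For example, a map with only grammar set yields [grammar, greedy], which matches the grammar-then-greedy chain described for grammar-constrained requests.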
LoRA adapters
- erllama:load_adapter/2, unload_adapter/2, set_adapter_scale/3, list_adapters/1. Per-adapter sha256 + scale fold into the cache via erllama_cache_key:effective_fingerprint/2, so rows produced with adapter A never collide with rows from adapter B. Snapshot-at-admission semantics keep in-flight requests on their original fingerprint even if an adapter mutation arrives mid-generation.
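One way to picture folding adapter identity into a cache fingerprint is sketched below. It is not the implementation of erllama_cache_key:effective_fingerprint/2, whose internals the changelog does not describe. The module name, the byte layout, and the choice to sort adapters so that order does not matter are all assumptions made for the example.

```erlang
-module(erllama_fingerprint_example).
-export([effective/2]).

%% Base is the model fingerprint; Adapters is a list of
%% {Sha256 :: binary(), Scale :: number()} pairs.
%% With no adapters the base fingerprint is returned unchanged.
effective(Base, []) when is_binary(Base) ->
    Base;
effective(Base, Adapters) when is_binary(Base), is_list(Adapters) ->
    %% Assumption: adapter order is irrelevant, so sort first.
    Sorted = lists:sort(Adapters),
    %% Each adapter contributes its digest plus its scale as a
    %% 64-bit float, so changing the scale changes the result.
    Folded = << <<Sha/binary, Scale:64/float>> || {Sha, Scale} <- Sorted >>,
    crypto:hash(sha256, <<Base/binary, Folded/binary>>).
```

Under these assumptions, the same adapter at two different scales produces two different fingerprints, which is the property that keeps cache rows from colliding.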
Concurrency model
- Concurrent request queue: a second complete/3 or infer/4 arriving while one is in flight is queued FIFO instead of getting {error, busy}. The reply {ok, Ref} is sent as soon as the call is admitted; streaming events follow once the queue head advances to the request.
- Decode loop schedules each step via gen_statem:cast(self(), decode_step) instead of next_event so cancel, evict, status, and queued requests interleave fairly between tokens.
- Seq-aware NIFs (infrastructure for 0.2 multi-seq batching): nif_kv_pack/4 accepts an explicit seq_id; new erllama_sampler_t resource owning a standalone llama_sampler*; nif_sampler_new/2, nif_sampler_free/1. Cache rows stay seq-id-free: seq_id is a save/load call argument, never row metadata.
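The self-cast scheduling idea can be shown with a toy gen_statem. A cast to self() lands at the back of the mailbox, so a status call that is already waiting is handled before the next step, whereas a next_event action would run ahead of every external message. The module below is a hypothetical sketch, not erllama's state machine: the counter stands in for real decoding, and postponing a second decode request is a simplified stand-in for the FIFO queue described above.

```erlang
-module(erllama_decode_loop_example).
-behaviour(gen_statem).
-export([start_link/0, decode/2, status/1]).
-export([init/1, callback_mode/0, handle_event/4]).

start_link() -> gen_statem:start_link(?MODULE, [], []).
decode(Pid, Steps) -> gen_statem:call(Pid, {decode, Steps}).
status(Pid) -> gen_statem:call(Pid, status).

callback_mode() -> handle_event_function.

%% Data is the number of decode steps still to run.
init([]) -> {ok, idle, 0}.

%% Admit a request: reply at once, then post the first step to our
%% own mailbox instead of using a next_event action.
handle_event({call, From}, {decode, Steps}, idle, _Left)
  when is_integer(Steps), Steps > 0 ->
    gen_statem:cast(self(), decode_step),
    {next_state, decoding, Steps, [{reply, From, ok}]};
%% A request arriving mid-decode waits until the state changes.
handle_event({call, _From}, {decode, _Steps}, decoding, _Left) ->
    {keep_state_and_data, [postpone]};
%% Last step: return to idle, which retries postponed requests.
handle_event(cast, decode_step, decoding, 1) ->
    {next_state, idle, 0};
%% One step done; queue the next one behind any waiting messages.
handle_event(cast, decode_step, decoding, Left) ->
    gen_statem:cast(self(), decode_step),
    {keep_state, Left - 1};
%% Status is answered in any state, between steps.
handle_event({call, From}, status, State, Left) ->
    {keep_state_and_data, [{reply, From, {State, Left}}]}.
```

Calling status/1 while a long decode is running returns a {decoding, Left} snapshot rather than blocking until the loop finishes.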
Internals
- erllama_registry module: ETS-backed via callback for binary model ids.
- erllama_inflight module: Ref -> ModelPid table so cancel/1 routes to the right gen_statem.
- erllama_model_backend optional callbacks: apply_chat_template/2, embed/2, set_grammar/2, configure_sampler/2, clear_sampler/1, load_adapter/2, unload_adapter/2, apply_adapters/2. Backends that omit them surface {error, not_supported} from the public API.
Cache subsystem
- Token-exact KV cache with three independently-supervised tiers: RAM (ETS slabs), ram_file (/dev/shm), and disk (plain read I/O).
- Sole-writer arbitration through erllama_cache_meta_srv; reads on the hot path go to ETS directly. lookup_exact/1 is a single atomic ets:lookup (no two-call race) and the meta server cancels waiter timers when an early reply lands so the mailbox doesn't bloat under load.
- The disk tier reads files via plain file:read_file/1 into a fresh BEAM heap binary; mmap is deliberately not used. The process already mmaps multi-GB GGUF weights, and a region binary that outlived its closing NIF call would have exposed the BEAM to SIGBUS from any external truncation.
- Crash-safe save publish protocol: reserve, write_tmp, check, link(2), mark_published; two-stage TTL cleanup with orphan adoption.
- Five save reasons (cold, continued, finish, evict, shutdown) with async/sync semantics matching their use.
- saves_dropped counter: bumps whenever a back-pressured writer pool refuses a save the model wanted to fire.
- Multi-turn warmth via explicit parent_key resume and stateless longest-prefix walk for OpenAI/Anthropic-shaped clients.
- erllama_scheduler memory-pressure poller with pluggable sources (memsup, nvidia-smi, custom callback). Off by default. Sweep timer is cancelled on terminate/2 so a supervisor restart never leaves a zombie firing into a fresh server.
- erllama_cache_writer dirty-IO writer pool with a leak-proof reservation semaphore. pin_and_load/2 wraps load + unpack in try/after so the holder is always checked in.
- Persisted hit counters (u32 in disk header) so popular prefixes survive an LRU walk after restart.
- End-to-end metrics: hits/misses/saves/evictions plus per-path latency totals (pack_total_ns, load_total_ns, longest_prefix_ns, longest_prefix_probes).
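The write_tmp, check, and link(2) steps of the publish protocol follow a well-known pattern: write to a private temporary name, verify it, then hard-link it to the final name so readers never see a partial file. The sketch below shows that pattern with standard file calls only. It is an illustration under assumptions, not erllama's code: the module name and temp-file naming are hypothetical, the size comparison is a stand-in for whatever the real check does, and the reserve and mark_published steps, which involve the meta server, are left out.

```erlang
-module(erllama_publish_example).
-export([publish/2]).

%% Publish Bin under Final (a string path) via a temp file and a
%% hard link. Returns ok, {error, already_published}, or another
%% {error, Reason} from the file operations.
publish(Final, Bin) when is_list(Final), is_binary(Bin) ->
    Tmp = Final ++ ".tmp." ++
        integer_to_list(erlang:unique_integer([positive])),
    case file:write_file(Tmp, Bin) of
        ok ->
            Result = check_and_link(Tmp, Final, byte_size(Bin)),
            %% The temp name is removed whether or not the link worked.
            _ = file:delete(Tmp),
            Result;
        {error, _} = WriteErr ->
            WriteErr
    end.

%% Stand-in check: the size on disk must match what was written.
check_and_link(Tmp, Final, Size) ->
    case filelib:file_size(Tmp) of
        Size ->
            %% A hard link fails with eexist if Final already exists,
            %% so a concurrent writer cannot be overwritten.
            case file:make_link(Tmp, Final) of
                ok -> ok;
                {error, eexist} -> {error, already_published};
                {error, _} = LinkErr -> LinkErr
            end;
        _Other ->
            {error, short_write}
    end.
```

A crash between the write and the link leaves only a temp file behind, which is the kind of leftover a TTL cleanup pass can collect.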
NIF safety
- Per-resource pthread_mutex and two-resource lifetime pattern for safe concurrent free_*/1 plus dirty NIF ops.
- extern "C" noexcept shim catching every llama.cpp C++ exception at the boundary; decode_one defensive guard against GGML_ASSERT aborts.
- llama_backend_init is deferred to the first nif_load_model via pthread_once, so cache-only and unit-test workloads do not pay the ggml_backend_load_all cost at NIF load.
- nif_tokenize and nif_detokenize honour release_pending, so a model returned by free_model/1 as {ok, deferred} cannot be reused via tokenize.
- nif_detokenize fails closed on n_vocab <= 0 (matches nif_prefill).
- make_errno_atom maps FreeBSD's EINTEGRITY to eintegrity.
Tooling
- FindErlang.cmake (adopted from erlang-rocksdb) detects ERTS_INCLUDE_DIR via the standard CMake find-module contract.
- Bench harness (bench/run.sh) with cold-vs-warm matrix and a 4-agent shared-prefix scenario. TinyLlama and LLaMA-3 8B presets.
CI
- actions/checkout and actions/cache bumped to @v5 (Node.js 24).
- xref, dialyzer, erlfmt, elvis promoted to gate jobs; build, eunit, proper, ct, freebsd depend on them.
- macOS matrix is macos-14, macos-15.
- FreeBSD matrix added: release: ['14.2', '14.4']. Inside the VM: refresh pcre2 so git can run, install git, set git config --global --add safe.directory '*' so llama.cpp's build-info git rev-parse succeeds.
- erllama_nif_tests:load_model_rejects_non_existent_path_test is now a generator with a 60 s timeout to absorb the lazy Metal init on macOS.
Tests
- 211 EUnit + PropEr property tests + 7 stub Common Test cases. Real-model Common Test suite gated on LLAMA_TEST_MODEL (14 cases including seed determinism, grammar+sampler, apply_chat_template, embeddings, KV pack/unpack round-trip).
- New stub-backed coverage: sampler params (erllama_sampler_tests), LoRA adapters + cache identity (erllama_lora_tests), FIFO queueing of concurrent infers (erllama_streaming_tests).
- Multi-platform CI: Ubuntu 24.04 amd64, Ubuntu 24.04 arm64, macOS 14 + 15 (Apple Silicon), FreeBSD 14.2 + 14.4. OTP 28 across the matrix.
Documentation
- README rewritten as a friendly entry point with snippets.
- User guides: loading, caching, configuration, building, examples.
- Internal design notes: cache design, publish protocol, NIF safety.
- ex_doc-friendly module documentation throughout.
- ROADMAP.md: what 0.1 doesn't do yet (multi-seq concurrent decoding, speculative decoding, vision, audio, ONNX/safetensors, stop-sequences, telemetry hooks, multi-GPU pressure, KV compression, cluster).
- README closes with a teaser for the upcoming erllama_cluster application: a separate OTP project that coordinates a fleet of erllama nodes (request distribution, cross-node speculative decoding, pipeline parallelism over QUIC).
Acknowledgements
Same idea as antirez/ds4.