erllama
Run llama.cpp from Erlang. Keep prompts warm. Stay inside OTP.
erllama is a native Erlang/OTP runtime for llama.cpp with supervised
model processes, OpenAI-shaped completion APIs, and a token-exact KV
cache that turns repeated prompt prefill from seconds into milliseconds.
If your app sends the same system prompt, agent scaffold, or conversation prefix again and again, erllama saves the model state once and restores it on the next request. No fuzzy matching. No hidden session server. Just exact tokens, exact cache keys, and OTP supervision around the whole path.
Why erllama?
- Fast repeat prompts. Cache hits restore KV state instead of recomputing prompt prefill.
- Native OTP shape. Each loaded model is a supervised process with a clear lifecycle: load, complete, stream, observe, unload.
- Bigger-than-RAM warmth. Hot prefixes can live in RAM, warm prefixes in tmpfs, and large working sets on disk.
- Stateless-server friendly. Resend the full conversation every turn and still get longest-prefix cache hits.
- Multi-model safe. Cache rows include the model fingerprint and context shape, so different models never collide on identical prompts.
- Observable by default. Hit/miss counters and per-model state probes are cheap enough to call from routers.
- Built on llama.cpp. Local GGUF inference with the platform support you expect: Metal, BLAS, CUDA toggles, and plain CPU fallback.
Quick taste
1> {ok, _} = application:ensure_all_started(erllama).
2> Path = "/srv/models/tinyllama-1.1b-chat.Q4_K_M.gguf".
3> {ok, Bin} = file:read_file(Path).
4> {ok, Model} = erllama:load_model(#{
       backend => erllama_model_llama,
       model_path => Path,
       fingerprint => crypto:hash(sha256, Bin)
   }).
{ok, <<"erllama_model_2375">>}
5> {ok, #{reply := Reply, finish_key := Key}} =
       erllama:complete(Model, <<"Once upon a time">>).
%% First call: cold prefill, async save.
6> {ok, #{reply := Reply2}} =
       erllama:complete(Model, <<"Once upon a time">>).
%% Same prompt: KV cache restore.
7> {ok, #{reply := Reply3}} =
       erllama:complete(Model,
                        <<"Once upon a time, in a quiet village">>).
%% Longer prompt: longest cached prefix wins.
8> {ok, #{reply := Reply4}} =
       erllama:complete(Model, <<"and they lived happily ever after">>,
                        #{parent_key => Key}).
%% Stateful resume from the previous finish save.
load_model/1 returns a binary model id. Pass it to complete/2,3,
infer/4, tokenize/2, unload/1, and the rest of the public API.
Install
erllama targets Erlang/OTP 28 and rebar3 3.25+.
Add it to rebar.config:
{deps, [
{erllama, "~> 0.5"}
]}.
Then start the application before loading models:
{ok, _} = application:ensure_all_started(erllama).
The first compile builds the vendored llama.cpp. See
Building for platform notes and CUDA/Metal options.
Common patterns
Stateless HTTP completion
OpenAI/Anthropic-shaped servers usually resend the whole conversation on each turn. That is fine. erllama walks the prompt backward and restores the longest exact prefix it has already saved.
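To make the backward walk concrete, here is a minimal sketch in plain Erlang. It is not erllama's implementation: the module name `prefix_walk_sketch` is hypothetical, a `sets` set of token lists stands in for the real cache index, and the walk drops one token at a time, which the real lookup may not do.

```erlang
-module(prefix_walk_sketch).
-export([longest_saved_prefix/2]).

%% Saved is a set of token lists standing in for saved cache rows
%% (an assumption of this sketch; a real cache would key on hashes).
-spec longest_saved_prefix([integer()], sets:set([integer()])) ->
    {hit, non_neg_integer(), [integer()]} | miss.
longest_saved_prefix(Tokens, Saved) ->
    %% Work on the reversed list so that dropping the last prompt token
    %% is a cheap tl/1.
    walk(lists:reverse(Tokens), length(Tokens), Saved).

%% No tokens left: nothing saved matches any prefix of the prompt.
walk([], 0, _Saved) ->
    miss;
walk(RevTokens, Len, Saved) ->
    Prefix = lists:reverse(RevTokens),
    case sets:is_element(Prefix, Saved) of
        %% The first match found while shrinking is the longest one,
        %% and it is an exact token-for-token match.
        true -> {hit, Len, Prefix};
        false -> walk(tl(RevTokens), Len - 1, Saved)
    end.
```

With saved prefixes `[1,2]` and `[1,2,3]`, the prompt `[1,2,3,4,5]` resolves to `{hit, 3, [1,2,3]}`, while `[9,1,2]` is a `miss` because no saved row is a prefix of it. From application code the walk is invisible; a stateless handler only calls `erllama:complete/3`: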
handle_completion(ModelId, Prompt) ->
    {ok, #{reply := Reply}} =
        erllama:complete(ModelId, Prompt, #{response_tokens => 256}),
    Reply.
Stateful Erlang session
If your session process already tracks turns, keep the returned
finish_key and pass it as parent_key on the next request. That skips
the longest-prefix walk and resumes directly from the saved row.
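One way a session process might carry the key between turns is sketched below. The module `session_sketch` and its `ask/2` function are hypothetical, and the completion function is passed in as a fun with the shape of `erllama:complete/3` (an assumption made so the process can be exercised without a loaded model).

```erlang
-module(session_sketch).
-behaviour(gen_server).

-export([start_link/2, ask/2]).
-export([init/1, handle_call/3, handle_cast/2]).

%% CompleteFun is expected to behave like erllama:complete/3:
%%   fun(ModelId, Prompt, Opts) -> {ok, #{reply := _, finish_key := _}}
%% Injecting it is a choice of this sketch, not an erllama requirement.
start_link(ModelId, CompleteFun) ->
    gen_server:start_link(?MODULE, {ModelId, CompleteFun}, []).

ask(Pid, Prompt) ->
    gen_server:call(Pid, {ask, Prompt}, infinity).

init({ModelId, CompleteFun}) ->
    {ok, #{model => ModelId, complete => CompleteFun, key => undefined}}.

handle_call({ask, Prompt}, _From,
            #{model := Model, complete := Complete, key := Key} = State) ->
    %% First turn has no saved row yet; later turns resume from the
    %% finish_key returned by the previous turn.
    Opts = case Key of
        undefined -> #{};
        _ -> #{parent_key => Key}
    end,
    case Complete(Model, Prompt, Opts) of
        {ok, #{reply := Reply, finish_key := NewKey}} ->
            {reply, {ok, Reply}, State#{key := NewKey}};
        Other ->
            %% Any other result is passed through and the old key is kept,
            %% so a failed turn does not discard the session.
            {reply, Other, State}
    end.

handle_cast(_Msg, State) ->
    {noreply, State}.
```

Against a real model this would be started as `session_sketch:start_link(ModelId, fun erllama:complete/3)`. Without a wrapping process, the same hand-off is just two calls: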
{ok, #{reply := R1, finish_key := K1}} =
    erllama:complete(ModelId, Prompt1),
{ok, #{reply := R2, finish_key := K2}} =
    erllama:complete(ModelId, Prompt2, #{parent_key => K1}).
Many models in one BEAM
Each loaded model is its own supervised process. The cache is shared, but rows are fingerprint-segregated.
{ok, _} = erllama:load_model(<<"tiny">>, TinyConfig),
{ok, _} = erllama:load_model(<<"big">>, BigConfig),
{ok, #{reply := R1}} = erllama:complete(<<"tiny">>, <<"summarise: ...">>),
{ok, #{reply := R2}} = erllama:complete(<<"big">>, <<"deep analysis: ...">>),
ok = erllama:unload(<<"tiny">>).
Inspect live state
1> erllama_cache:get_counters().
#{hits_exact => 142, hits_resume => 17, hits_longest_prefix => 89,
misses => 12, saves_cold => 12, saves_finish => 31, ...}
2> erllama:phase(<<"big">>).
generating
3> erllama:pending_len(<<"big">>).
3
4> erllama:last_cache_hit(<<"big">>).
#{kind => partial, prefix_len => 1024}
Documentation
| Need | Read |
|---|---|
| Load a model | Loading a model |
| Configure cache tiers and save policy | Caching |
| Configure sys.config and per-model options | Configuration |
| Build from source | Building |
| Copy working snippets | Examples |
| Stream tool calls while preserving cache hits | Tool calls |
| Understand cache design tradeoffs | Cache design |
| Understand crash-safe save publication | Publish protocol |
| Understand request admission and decode flow | Request lifecycle |
| Understand NIF lifetime safety | NIF safety |
API reference for erllama, erllama_cache, erllama_scheduler, and
erllama_nif is published on
HexDocs. You can also build it locally:
rebar3 ex_doc
Architecture
erllama_sup
├── erllama_cache_sup
│ ├── erllama_cache_meta_srv
│ ├── erllama_cache_ram
│ └── erllama_cache_writer
├── erllama_registry
├── erllama_inflight
├── erllama_model_sup
│ └── erllama_model one supervised gen_statem per loaded model
└── erllama_scheduler memory-pressure poller, off by default
Disk and ram_file tier servers are started by the operator, one per
root directory, then referenced by loaded models through tier_srv and
tier.
The important invariant is simple: cache hits are token-exact. A key is derived from the model fingerprint, quantization, context shape, and full token list. erllama may find a shorter saved prefix for a longer prompt, but it never returns an approximate match.
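As an illustration of why exact keys rule out collisions, the sketch below hashes the four ingredients named above into one key. It is an assumption-laden stand-in, not erllama's key format: the module `cache_key_sketch`, the tuple used for the context shape, and the choice of SHA-256 over `term_to_binary/1` are all hypothetical.

```erlang
-module(cache_key_sketch).
-export([key/4]).

%% Fingerprint: binary, e.g. the sha256 of the GGUF file.
%% Quant:       atom naming the quantization (hypothetical encoding).
%% CtxShape:    tuple describing the context, e.g. {4096} (hypothetical).
%% Tokens:      full list of integer token ids.
-spec key(binary(), atom(), tuple(), [integer()]) -> binary().
key(Fingerprint, Quant, CtxShape, Tokens) ->
    %% term_to_binary/1 encodes each field with its type and length, so
    %% two different inputs cannot serialize to the same byte string.
    Payload = term_to_binary({Fingerprint, Quant, CtxShape, Tokens}),
    crypto:hash(sha256, Payload).
```

The same four inputs always yield the same 32-byte key, while changing the fingerprint, the context shape, or a single token yields a different one. Under this scheme a shorter saved prefix is found by computing the key of that exact shorter token list, never by comparing keys for similarity.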
Requirements
- Erlang/OTP 28
- rebar3 3.25+
- C++17 toolchain
- cmake >= 3.20
- Apple Silicon: Metal + Accelerate are auto-detected
- Linux: BLAS is auto-detected; CUDA is enabled with ERLLAMA_OPTS=-DGGML_CUDA=ON
- FreeBSD: erlang-runtime28 plus cmake bash gmake
Status
erllama is pre-release. The cache, scheduler, and NIF safety wrappers have
unit, property, and Common Test coverage. The real-model Common Test suite
is gated by LLAMA_TEST_MODEL so normal CI can run without a GGUF file.
See CHANGELOG.md for release notes.
Contributing
The contributor guide is AGENTS.md. The short version:
rebar3 fmt
rebar3 compile
rebar3 eunit
rebar3 proper
rebar3 ct
rebar3 lint
rebar3 dialyzer
rebar3 xref
Run the real-model suite when you have a GGUF available:
LLAMA_TEST_MODEL=/path/to/tinyllama-1.1b-chat.Q4_K_M.gguf \
rebar3 ct --suite=test/erllama_real_model_SUITE
Bumping the vendored llama.cpp is covered in
UPDATE_LLAMA.md.
Related projects
erllama_cluster is planned as a separate OTP application for routing,
cache-aware placement, speculative decoding, and distributed inference
across erllama nodes.
Repository: https://github.com/erllama/erllama_cluster
Acknowledgements
Same idea as antirez/ds4.
License
MIT. Copyright (c) 2026 Benoit Chesneau. See LICENSE.
The vendored c_src/llama.cpp/ retains its upstream MIT license; see
c_src/llama.cpp/LICENSE.