Run llama.cpp from Erlang. Keep prompts warm. Stay inside OTP.

erllama is a native Erlang/OTP runtime for llama.cpp with supervised model processes, OpenAI-shaped completion APIs, and a token-exact KV cache that turns repeated prompt prefill from seconds into milliseconds.

If your app sends the same system prompt, agent scaffold, or conversation prefix again and again, erllama saves the model state once and restores it on the next request. No fuzzy matching. No hidden session server. Just exact tokens, exact cache keys, and OTP supervision around the whole path.

Why erllama?

  • Fast repeat prompts. Cache hits restore KV state instead of recomputing prompt prefill.
  • Native OTP shape. Each loaded model is a supervised process with a clear lifecycle: load, complete, stream, observe, unload.
  • Bigger-than-RAM warmth. Hot prefixes can live in RAM, warm prefixes in tmpfs, and large working sets on disk.
  • Stateless-server friendly. Resend the full conversation every turn and still get longest-prefix cache hits.
  • Multi-model safe. Cache rows include the model fingerprint and context shape, so different models never collide on identical prompts.
  • Observable by default. Hit/miss counters and per-model state probes are cheap enough to call from routers.
  • Built on llama.cpp. Local GGUF inference with the platform support you expect: Metal, BLAS, CUDA toggles, and plain CPU fallback.

Quick taste

1> {ok, _} = application:ensure_all_started(erllama).
2> Path = "/srv/models/tinyllama-1.1b-chat.Q4_K_M.gguf".
3> {ok, Bin} = file:read_file(Path).
4> {ok, Model} = erllama:load_model(#{
       backend => erllama_model_llama,
       model_path => Path,
       fingerprint => crypto:hash(sha256, Bin)
   }).
{ok, <<"erllama_model_2375">>}

5> {ok, #{reply := Reply, finish_key := Key}} =
       erllama:complete(Model, <<"Once upon a time">>).
%% First call: cold prefill, async save.

6> {ok, #{reply := Reply2}} =
       erllama:complete(Model, <<"Once upon a time">>).
%% Same prompt: KV cache restore.

7> {ok, #{reply := Reply3}} =
       erllama:complete(Model,
                        <<"Once upon a time, in a quiet village">>).
%% Longer prompt: longest cached prefix wins.

8> {ok, #{reply := Reply4}} =
       erllama:complete(Model, <<"and they lived happily ever after">>,
                        #{parent_key => Key}).
%% Stateful resume from the previous finish save.

load_model/1 returns a binary model id. Pass it to complete/2,3, infer/4, tokenize/2, unload/1, and the rest of the public API.
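
The Quick taste above hashes the model by reading the whole GGUF into one binary, which is fine for a tiny model but heavy for a multi-gigabyte file. A minimal sketch of a streamed alternative follows, assuming the fingerprint is simply the SHA-256 of the file bytes as in that example. The module name model_fingerprint and the 1 MiB chunk size are hypothetical choices, not part of erllama.

```erlang
-module(model_fingerprint).
-export([sha256_file/1]).

%% Chunk size is an arbitrary choice for this sketch.
-define(CHUNK, 1048576).

%% Hash a file incrementally so the whole model never sits in one binary.
-spec sha256_file(file:name_all()) -> {ok, binary()} | {error, term()}.
sha256_file(Path) ->
    case file:open(Path, [read, raw, binary]) of
        {ok, Fd} ->
            try
                loop(Fd, crypto:hash_init(sha256))
            after
                file:close(Fd)
            end;
        {error, _} = Err ->
            Err
    end.

%% Read chunk by chunk, folding each chunk into the running hash context.
loop(Fd, Ctx) ->
    case file:read(Fd, ?CHUNK) of
        {ok, Data} -> loop(Fd, crypto:hash_update(Ctx, Data));
        eof -> {ok, crypto:hash_final(Ctx)};
        {error, _} = Err -> Err
    end.
```

The resulting binary can be passed as the fingerprint value in the load_model/1 map in place of crypto:hash(sha256, Bin).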

Install

erllama targets Erlang/OTP 28 and rebar3 3.25+.

Add it to rebar.config:

{deps, [
    {erllama, "~> 0.5"}
]}.

Then start the application before loading models:

{ok, _} = application:ensure_all_started(erllama).

The first compile builds the vendored llama.cpp. See Building for platform notes and CUDA/Metal options.

Common patterns

Stateless HTTP completion

Clients of OpenAI/Anthropic-shaped APIs usually resend the whole conversation on each turn, so a stateless server sees the full prompt every time. That is fine. erllama walks the prompt backward and restores the longest exact prefix it has already saved.

handle_completion(ModelId, Prompt) ->
    {ok, #{reply := Reply}} =
        erllama:complete(ModelId, Prompt, #{response_tokens => 256}),
    Reply.
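
To make the backward walk concrete, here is a toy model of it. This is a sketch of the idea only, not erllama's implementation: the cache is a plain map keyed by exact token lists (the real cache uses derived keys and tiered storage), and the module name prefix_walk is hypothetical.

```erlang
-module(prefix_walk).
-export([longest_prefix/2]).

%% Cache is a map keyed by exact token lists (a stand-in for hashed keys).
%% Returns the longest saved prefix of Tokens plus the tokens that would
%% still need prefill, or miss when no prefix has been saved.
-spec longest_prefix([integer()], #{[integer()] => term()}) ->
    {hit, State :: term(), Rest :: [integer()]} | miss.
longest_prefix(Tokens, Cache) ->
    walk(length(Tokens), Tokens, Cache).

%% Try the full prompt first, then drop one token from the end per step.
walk(0, _Tokens, _Cache) ->
    miss;
walk(N, Tokens, Cache) ->
    {Prefix, Rest} = lists:split(N, Tokens),
    case Cache of
        #{Prefix := State} -> {hit, State, Rest};
        #{} -> walk(N - 1, Tokens, Cache)
    end.
```

Every lookup is an exact match on a token list, so a hit is never approximate. This toy version checks every possible prefix length; a real cache would more likely probe only the lengths at which it saved state.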

Stateful Erlang session

If your session process already tracks turns, keep the returned finish_key and pass it as parent_key on the next request. That skips the longest-prefix walk and resumes directly from the saved row.

{ok, #{reply := R1, finish_key := K1}} =
    erllama:complete(ModelId, Prompt1),

{ok, #{reply := R2, finish_key := K2}} =
    erllama:complete(ModelId, Prompt2, #{parent_key => K1}).
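
For more than two turns, the same threading can be written as a fold. The sketch below is illustrative: the module name session_turns is hypothetical, and it assumes that erllama:complete/3 called with an empty options map behaves like complete/2. The completion function is injectable so the logic can be exercised without a loaded model.

```erlang
-module(session_turns).
-export([run/2, run/3]).

%% Feed prompts in order, threading each finish_key into the next call
%% as parent_key. Returns the replies in prompt order.
run(ModelId, Prompts) ->
    %% Assumption: complete/3 with #{} is equivalent to complete/2.
    run(ModelId, Prompts, fun erllama:complete/3).

%% CompleteFun has the shape of erllama:complete/3; pass a stub to test offline.
run(ModelId, Prompts, CompleteFun) ->
    Step = fun(Prompt, {Opts, Acc}) ->
        {ok, #{reply := Reply, finish_key := Key}} =
            CompleteFun(ModelId, Prompt, Opts),
        {#{parent_key => Key}, [Reply | Acc]}
    end,
    {_LastOpts, Replies} = lists:foldl(Step, {#{}, []}, Prompts),
    lists:reverse(Replies).
```

A long-lived session process would typically keep the latest key in its state instead of folding over a fixed list, but the data flow is the same.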

Many models in one BEAM

Each loaded model is its own supervised process. The cache is shared, but rows are fingerprint-segregated.

{ok, _} = erllama:load_model(<<"tiny">>, TinyConfig),
{ok, _} = erllama:load_model(<<"big">>, BigConfig),

{ok, #{reply := R1}} = erllama:complete(<<"tiny">>, <<"summarise: ...">>),
{ok, #{reply := R2}} = erllama:complete(<<"big">>, <<"deep analysis: ...">>),

ok = erllama:unload(<<"tiny">>).

Inspect live state

1> erllama_cache:get_counters().
#{hits_exact => 142, hits_resume => 17, hits_longest_prefix => 89,
  misses => 12, saves_cold => 12, saves_finish => 31, ...}

2> erllama:phase(<<"big">>).
generating
3> erllama:pending_len(<<"big">>).
3
4> erllama:last_cache_hit(<<"big">>).
#{kind => partial, prefix_len => 1024}
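
A router or dashboard may want a single hit ratio rather than raw counters. The helper below is a sketch that works on a map shaped like the get_counters() output above. It assumes the three hits_* keys and misses together account for all lookups, which the elided part of that output may or may not confirm. The module name cache_stats is hypothetical.

```erlang
-module(cache_stats).
-export([hit_ratio/1]).

%% Counters is a map shaped like the erllama_cache:get_counters() output.
%% Missing keys count as zero. Returns a float between 0.0 and 1.0, or
%% undefined when no lookups have been recorded yet.
-spec hit_ratio(map()) -> float() | undefined.
hit_ratio(Counters) ->
    Get = fun(K) -> maps:get(K, Counters, 0) end,
    Hits = Get(hits_exact) + Get(hits_resume) + Get(hits_longest_prefix),
    case Hits + Get(misses) of
        0 -> undefined;
        Total -> Hits / Total
    end.
```

Usage would be cache_stats:hit_ratio(erllama_cache:get_counters()).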

Documentation

Need                                              Read
Load a model                                      Loading a model
Configure cache tiers and save policy             Caching
Configure sys.config and per-model options        Configuration
Build from source                                 Building
Copy working snippets                             Examples
Stream tool calls while preserving cache hits     Tool calls
Understand cache design tradeoffs                 Cache design
Understand crash-safe save publication            Publish protocol
Understand request admission and decode flow      Request lifecycle
Understand NIF lifetime safety                    NIF safety

API reference for erllama, erllama_cache, erllama_scheduler, and erllama_nif is published on HexDocs. You can also build it locally:

rebar3 ex_doc

Architecture

erllama_sup
├── erllama_cache_sup
│   ├── erllama_cache_meta_srv
│   ├── erllama_cache_ram
│   └── erllama_cache_writer
├── erllama_registry
├── erllama_inflight
├── erllama_model_sup
│   └── erllama_model      one supervised gen_statem per loaded model
└── erllama_scheduler      memory-pressure poller, off by default

Disk and ram_file tier servers are started by the operator, one per root directory, then referenced by loaded models through tier_srv and tier.

The important invariant is simple: cache hits are token-exact. A key is derived from the model fingerprint, quantization, context shape, and full token list. erllama may find a shorter saved prefix for a longer prompt, but it never returns an approximate match.
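
One way to picture such a key is a hash over all four inputs. The encoding below is purely illustrative: erllama's real key layout is not described here and may differ. The module name cache_key_sketch, the {NCtx, NBatch} context shape, and the field order are all assumptions.

```erlang
-module(cache_key_sketch).
-export([key/4]).

%% Hypothetical encoding, not erllama's actual key format.
%% Variable-length fields are length-prefixed so that different inputs
%% cannot produce the same byte stream.
-spec key(binary(), atom(), {pos_integer(), pos_integer()},
          [non_neg_integer()]) -> binary().
key(Fingerprint, Quant, {NCtx, NBatch}, Tokens) ->
    QuantBin = atom_to_binary(Quant, utf8),
    %% Each token id is packed as a fixed-width 32-bit integer.
    TokenBin = << <<T:32/little>> || T <- Tokens >>,
    crypto:hash(sha256, [
        <<(byte_size(Fingerprint)):32>>, Fingerprint,
        <<(byte_size(QuantBin)):32>>, QuantBin,
        <<NCtx:32, NBatch:32>>,
        TokenBin
    ]).
```

Because the fingerprint and context shape feed the hash, two models given the same prompt produce different keys, and changing a single token changes the key as well.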

Requirements

  • Erlang/OTP 28
  • rebar3 3.25+
  • C++17 toolchain
  • cmake >= 3.20
  • Apple Silicon: Metal + Accelerate are auto-detected
  • Linux: BLAS is auto-detected; CUDA is enabled with ERLLAMA_OPTS=-DGGML_CUDA=ON
  • FreeBSD: the erlang-runtime28 package plus cmake, bash, and gmake

Status

erllama is pre-release. The cache, scheduler, and NIF safety wrappers have unit, property, and Common Test coverage. The real-model Common Test suite is gated by LLAMA_TEST_MODEL so normal CI can run without a GGUF file.
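
Environment gating of this kind is commonly done in init_per_suite/1. The suite below is a generic sketch of that pattern, not the project's actual erllama_real_model_SUITE. The module name gated_SUITE and its single test case are hypothetical.

```erlang
-module(gated_SUITE).
-export([all/0, init_per_suite/1, end_per_suite/1, model_path_is_set/1]).

all() -> [model_path_is_set].

%% Skip the whole suite unless LLAMA_TEST_MODEL is set, so CI machines
%% without a GGUF file report a skip instead of a failure.
init_per_suite(Config) ->
    case os:getenv("LLAMA_TEST_MODEL") of
        false -> {skip, "LLAMA_TEST_MODEL not set"};
        Path -> [{model_path, Path} | Config]
    end.

end_per_suite(_Config) -> ok.

%% Placeholder case: only checks that the configured path is a regular file.
model_path_is_set(Config) ->
    Path = proplists:get_value(model_path, Config),
    true = filelib:is_regular(Path).
```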

See CHANGELOG.md for release notes.

Contributing

The contributor guide is AGENTS.md. The short version:

rebar3 fmt
rebar3 compile
rebar3 eunit
rebar3 proper
rebar3 ct
rebar3 lint
rebar3 dialyzer
rebar3 xref

Run the real-model suite when you have a GGUF available:

LLAMA_TEST_MODEL=/path/to/tinyllama-1.1b-chat.Q4_K_M.gguf \
    rebar3 ct --suite=test/erllama_real_model_SUITE

Bumping the vendored llama.cpp is covered in UPDATE_LLAMA.md.

erllama_cluster is planned as a separate OTP application for routing, cache-aware placement, speculative decoding, and distributed inference across erllama nodes.

Repository: https://github.com/erllama/erllama_cluster

Acknowledgements

Same idea as antirez/ds4.

License

MIT. Copyright (c) 2026 Benoit Chesneau. See LICENSE.

The vendored c_src/llama.cpp/ retains its upstream MIT license; see c_src/llama.cpp/LICENSE.