Tool calls

View Source

Some chat templates emit structured tool calls inline with normal model text. qwen3-style <tool_call>...</tool_call>, DSML, and XML-shaped templates all have the same practical problem: the HTTP layer needs to parse a tool call for the client, then put the exact same bytes back into the next prompt so the KV cache can still hit.

erllama handles the model-facing half of that problem. It detects configured tool-call boundaries, streams the exact bytes the model sampled, and can make the syntactic wrapper deterministic while leaving payload text free to sample normally.

It does not mint tool ids, parse SDK JSON, or canonicalize tool arguments. Those belong in the server layer above erllama.

Declaring markers

Markers are per model because every chat template is different:

erllama:load_model(<<"qwen3-tool">>, #{
    backend => erllama_model_llama,
    model_path => "/models/qwen3-7b.gguf",
    tool_call_markers => #{
        start => <<"<tool_call>">>,
        end => <<"</tool_call>">>
    }
}).

The backend tokenizes the marker strings through the model's own vocabulary at load time. Multi-token markers work because detection is by token sequence, not by a byte scan over detokenized text.

Omitting tool_call_markers keeps the model on the normal text path: no tool-call messages and no sampler swap.

Streaming shape

A streaming infer/4 against a tool-aware model receives:

{erllama_token, Ref, {tool_call_delta, Bin}}
{erllama_tool_call_end, Ref, FullBin}

tool_call_delta contains the bytes inside the tool-call span. At the end of the span, FullBin is the concatenation of every delta. Store that binary under whatever tool id your HTTP layer minted.

On the next turn, splice the stored binary back into the rendered prompt. Do not pretty-print it. Do not canonicalize it if you still have the exact bytes. Exact replay is what lets the next prompt match the cached token prefix.

Deterministic syntax

When tool_call_markers are configured, erllama builds a second sampler chain for the request with temperature = 0. When the start marker is sampled, the scheduler swaps the request onto that greedy chain so the tool-call syntax stays stable from a fixed prefix.

If the template has inner markers for payload bodies, declare them too:

tool_call_markers => #{
    start => <<"<tool_call>">>,
    end => <<"</tool_call>">>,
    payload_start => <<"<arg>">>,
    payload_end => <<"</arg>">>
}

With payload markers set, erllama uses the greedy sampler for the wrapper syntax and the request's normal sampler for payload bytes. That keeps large string arguments, file contents, or generated edits from becoming unnecessarily deterministic.

Without payload markers, the whole tool-call span uses the greedy sampler. That is simpler and works well for short literal arguments.

Server responsibilities

The downstream server is responsible for:

  • Minting a tool id.
  • Storing FullBin under that id.
  • Parsing FullBin into the SDK's tool_use or tool-call JSON shape.
  • Falling back to deterministic rendering when the id is unknown.
  • Replaying stored bytes verbatim into later prompts when the id is known.

erllama provides adjacent primitives that help compose the full flow:

  • erllama:prefill_only/3 with parent_key can warm a prompt prefix before generation.
  • session_id on infer/4 and complete/3 can pin a live sequence across turns.
  • erllama:continue/3 extends a pinned sequence with a caller-asserted token tail, for chat templates that render differently across turns.
  • erllama:end_session/2 releases a pinned sequence explicitly.

See Examples for copyable streaming snippets.