This is the through-line that ties the agent, persistence, compaction, and rendering together. The core rule: the agent holds the bounded working set; the log holds everything; the renderer reads the log for scrollback. Conflating these is what would quietly turn every agent into a memory hog.
Three stores, two of them not in the agent
- The working set (in memory, bounded). What the agent keeps: the small fsm_state; the rendered context for the active turn (already bounded, since it is what compaction produces: latest summary + verbatim tail + retained/stubbed tool results, under budget); a short tail of recent finalized messages so it can assemble the next turn without a DB round-trip; and the in-progress assistant text it accumulates mid-stream (it finalizes the message from this buffer and serves it to mid-stream mounts; see 06). The agent never holds message 1 through 4,000 in memory.
- The log (on disk, unbounded). The events table is the only unbounded store and the source of truth: every message, the real tool results (not the stubs), every suspension and resolution. The agent reads from it lazily: on revival it reconstructs the working set from "latest summary + events since that summary's span," not by loading everything.
- Scrollback (renderer, reads the log). When a user scrolls up to a message long since compacted out of the working set, the headless layer paginates events backward. The UI shows the whole conversation; the agent only ever holds the working set.
There is exactly one unbounded store and one bounded store. No third copy, no "agent's idea of history" drifting from "the log's idea of history." The rendered context the agent keeps and the thing it sends the model are the same bounded object, a projection of the log.
Consequence: agent footprint is flat in conversation length and linear in the active agent count (managed by eviction), not in total history.
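The revival and scrollback reads described above can be sketched in a few lines. This is a minimal illustration, not the actual implementation: an in-memory list stands in for the events table, and the `Event` fields (`seq`, `kind`, `covers_through`) and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    seq: int                               # position in the append-only log (assumed field)
    kind: str                              # "message", "tool_result", or "summary"
    body: str
    covers_through: Optional[int] = None   # summaries only: last seq folded into them


def rebuild_working_set(log: list[Event]) -> list[Event]:
    """Revival: latest summary plus every non-summary event after its span."""
    summaries = [e for e in log if e.kind == "summary"]
    if not summaries:
        return list(log)  # nothing compacted yet, so the whole log is the tail
    latest = max(summaries, key=lambda e: e.seq)
    tail = [e for e in log
            if e.kind != "summary" and e.seq > latest.covers_through]
    return [latest] + tail


def page_backward(log: list[Event], before_seq: int, limit: int) -> list[Event]:
    """Scrollback: up to `limit` events just before `before_seq`, oldest first.

    Assumes `log` is ordered by seq. Reads the log only, never the working set.
    """
    older = [e for e in log if e.kind != "summary" and e.seq < before_seq]
    return older[max(0, len(older) - limit):]
```

Both functions read the same log, so the agent's working set and the user's scrollback cannot drift apart. They differ only in how much of the log they project.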
User sees the log; the model sees the projection
A deliberate, correct divergence: scrolling up shows the true historical message
from the log; the model gets the compacted projection. They answer different
questions ("what was said" vs "what fits the budget"). The practical upshot is a
debugging rule: when chasing "why did it forget X," pull from model_calls (what
was actually rendered to the model that turn), not from what's on screen. That's
what the optional audit table is for.
Sizing: bound by tokens, not message count
"Keep N messages" is the wrong knob: at the conversion below, one 50 KB tool result (~12.5K tokens) weighs as much as ~60 average chat messages. The working set is bounded by a token budget, and message count floats under it.
Conversion at ~4 bytes/token and a mixed average of ~200 tokens/message:
| Working budget | ≈ messages |
|---|---|
| ~16K tokens | ~80 |
| ~32K tokens | ~160 |
| ~64K tokens | ~320 |
Default planning number: ~30K-token working window ≈ 100–150 chat messages. Tool-heavy conversations hold fewer distinct events but more tokens each — except the retention reducer stubs old tool results down to ~10 tokens, so many stubbed pairs cost almost nothing and the count can run higher.
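The conversions behind the table can be written out as a small calculator. The constants are the rough planning figures from the text (~4 bytes/token, ~200 tokens/message, ~10 tokens per stub); the function names are hypothetical, and none of this replaces a real tokenizer.

```python
BYTES_PER_TOKEN = 4        # rough conversion from the text
TOKENS_PER_MESSAGE = 200   # mixed chat average from the text
STUB_TOKENS = 10           # approximate cost of a stubbed tool result


def tokens_from_bytes(n_bytes: int) -> int:
    """Approximate token count for a payload of n_bytes."""
    return n_bytes // BYTES_PER_TOKEN


def messages_under_budget(budget_tokens: int,
                          tokens_per_message: int = TOKENS_PER_MESSAGE) -> int:
    """How many average chat messages fit under a token budget."""
    return budget_tokens // tokens_per_message


def tokens_in_working_set(chat_messages: int,
                          full_tool_result_bytes: list[int],
                          stubbed_results: int) -> int:
    """Estimated tokens for a mix of chat, full tool results, and stubs."""
    chat = chat_messages * TOKENS_PER_MESSAGE
    full = sum(tokens_from_bytes(b) for b in full_tool_result_bytes)
    stubs = stubbed_results * STUB_TOKENS
    return chat + full + stubs
```

With these figures, `messages_under_budget(32_000)` gives 160, matching the table, and forty stubbed tool results add only 400 tokens, which is why the event count can run higher in tool-heavy conversations.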
When compaction triggers
Two triggers; for a tool-centric agent the cheap one dominates:
- The deterministic reducers (tool-result retention, sliding window) run on every assembly — they're free, so they're never gated.
- Expensive summarization fires only when the verbatim tail exceeds the working budget. That budget is a cost knob set well below the model ceiling, since every context token is re-billed every turn. Default working budget: ~25–50K tokens. Hard ceiling: ~70–75% of the model window, the never-exceed guard; the headroom above it is the safety margin that absorbs approximate token counting.
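The two triggers and the hard ceiling can be sketched as a small decision function. This is an illustration under assumptions: the `Budget` type, the step names, and the 200K window in the usage note are hypothetical, and token counts are taken as given rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    working_tokens: int        # cost knob, e.g. 25_000 to 50_000
    model_window: int          # the model's full context window
    ceiling_percent: int = 70  # hard guard as a share of the window

    @property
    def hard_ceiling(self) -> int:
        return self.model_window * self.ceiling_percent // 100


def plan_assembly(tail_tokens: int, budget: Budget) -> list[str]:
    """Which compaction steps run for this assembly."""
    # Deterministic reducers are free, so they run on every assembly.
    steps = ["tool_result_retention", "sliding_window"]
    # Summarization is the expensive step: only when the tail is over budget.
    if tail_tokens > budget.working_tokens:
        steps.append("summarize")
    return steps


def fits_hard_ceiling(rendered_tokens: int, budget: Budget) -> bool:
    """Never-exceed guard, checked on the final rendered context."""
    return rendered_tokens <= budget.hard_ceiling
```

For example, with `Budget(working_tokens=32_000, model_window=200_000)`, a 20K-token tail runs only the two reducers, a 40K-token tail also schedules summarization, and the guard sits at 140,000 tokens.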
Memory per gen_statem
Text content lives in refc binaries: anything over 64 bytes (essentially every
message) goes to the shared binary heap as a ref-counted binary, so the process
heap holds a ~56-byte pointer, not the bytes. The dominant cost is binary, and if
the working set is kept as a ReqLLM.Context of message structs (not a
pre-concatenated prompt string), the same binaries are shared between the message
list and the assembled context rather than duplicated.
For a ~32K-token chat working set (~160 messages):
| Component | ≈ size |
|---|---|
| Text binaries (refc, shared heap) | ~128 KB |
| Struct skeletons (~160 × ~300 B) | ~50 KB |
| fsm_state + gen_statem baseline | ~5–10 KB |
| Live data | ~180 KB |
Typical 150–250 KB for an active chat agent. Between GC sweeps RSS runs ~1.5–2×, so budget ~300–400 KB. Tool-heavy with a few large recent results still in full (say 3 × 6K tokens before they age out) adds ~70 KB plus cheap stubs → ~300–600 KB.
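The per-agent estimate is simple enough to reproduce as arithmetic. The sketch below uses the planning figures from the table (~4 bytes/token, ~300 B per struct skeleton, a baseline taken here as 8 KB, and a 1.5–2× GC multiplier); the function names are hypothetical and the results are estimates, not measurements of a running process.

```python
BYTES_PER_TOKEN = 4      # rough conversion from the text
STRUCT_BYTES = 300       # approximate per-message struct skeleton
BASELINE_BYTES = 8_000   # assumed midpoint of the 5-10 KB baseline


def live_bytes(working_tokens: int, messages: int,
               baseline: int = BASELINE_BYTES) -> int:
    """Estimated live data for one agent: text + skeletons + baseline."""
    text = working_tokens * BYTES_PER_TOKEN
    skeletons = messages * STRUCT_BYTES
    return text + skeletons + baseline


def budget_range(live: int, gc_low: float = 1.5,
                 gc_high: float = 2.0) -> tuple[int, int]:
    """Memory to budget per agent once GC slack between sweeps is included."""
    return int(live * gc_low), int(live * gc_high)
```

For a 32K-token, 160-message working set this gives 184,000 bytes live and a budget of roughly 276,000 to 368,000 bytes. Three full 6K-token tool results add another 72,000 bytes of text.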
Scaling sanity check (text isn't shared across conversations, so it's linear in the active set):
| Active agents | ≈ memory |
|---|---|
| 1,000 | ~200 MB |
| 10,000 | ~2 GB |
This is exactly why eviction to the log earns its keep, and why per-agent footprint must stay flat in conversation length.
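The scaling table is a straight multiplication, and inverting it gives a rough cap on the active set for a given memory budget. This is a sketch: the ~200 KB per-agent default comes from the text, while the idea of using the result as an eviction threshold and both function names are assumptions.

```python
def fleet_bytes(active_agents: int, per_agent_bytes: int = 200_000) -> int:
    """Total memory for the active set; text is not shared across agents."""
    return active_agents * per_agent_bytes


def max_active_agents(memory_cap_bytes: int,
                      per_agent_bytes: int = 200_000) -> int:
    """How many agents fit under a cap before eviction has to kick in."""
    return memory_cap_bytes // per_agent_bytes
```

Planning with the GC-inclusive figure is the more conservative choice: `max_active_agents(8_000_000_000, 400_000)` allows about 20,000 active agents in 8 GB.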
Constraints this imposes (honor from day one)
- Keep the working set as ReqLLM.Context, never a concatenated prompt string. Concatenating doubles the footprint and breaks binary sharing.
- Keep the in-memory message struct lean. Per-turn usage, latency, and audit go to model_calls on disk, not into the in-memory struct; otherwise every process carries data only the debugger wants.
- Hibernate before evicting. See 01: hibernate compacts the heap for agents parked in awaiting_input; full idle terminates and persists to the log.