Principle

Compaction operates on the rendered context, never on the canonical log. The log is immutable history; compaction is a projection over it — a function from the full event log to "the subset/summary that fits the budget." This rules out the tempting but wrong implementation of editing the message history in place, which would corrupt replay and conflate "what happened" with "what we sent."

Compaction and injection are independent

Reduction (compaction) and injection (pre-hooks, see 02) are separate subsystems with separate interfaces, triggers, and data flow. They are not two halves of one loop. The only thing they share is the token budget, computed once over the final rendered context after both have run. Nothing else connects them. The library must not couple them.

Two rules at the seam (stated from the hook side in 02):

  • Overflow — reduction targets budget − injection_reserve; an injection that blows its reserve is a loud per-hook error. Compaction is never re-entered after hooks run.
  • Layout of the rendered context — cache-prefix discipline applies to the whole assembly, not just compaction: stable prefix (system prompt + latest summary) → verbatim history tail → per-turn injected content adjacent to the user message at the very end. Injected content placed before the history would invalidate the provider cache prefix every turn.

Reduction is a pipeline of reducers

Compaction is not one strategy; it is a pipeline of independent reducers run cheap-to-expensive, each with its own config and trigger. The trigger logic is: run the free deterministic reducers always; if still over budget, summarize the overflow. This is cost-minimizing by construction — you never pay for a summarization call you didn't need, and in most tool-heavy conversations the tool-result reducer alone keeps you under budget so summarization never runs.

1. Tool-result retention (first-class)

The fattest target and the first thing to cut — a single large API response dwarfs the whole dialogue. This is a first-class reducer, not a sub-case of summarization, with two retention modes that map to different tool shapes:

  • Age-based — keep the full result for K turns, then it expires. Good default.
  • Count-based — keep the last N results from this tool, older ones expire. The right mode for repeatedly-called tools (search, read_file) where only the recent calls matter.

Declared per-tool with a global default (a weather lookup and a 50KB document have different lifespans), plus a never_evict flag for results that are durable facts. Deterministic, no model call, runs every assembly. (Unit: turns rather than messages — a turn is the natural span a result stays relevant.)

Constraint — pairing. A tool call and its result are a pair in the canonical format, and providers reject a history with an orphaned call or result. So eviction must not drop a result and leave the call dangling. It stubs the result content ([result expired]) while keeping the pairing intact. Stubbing also signals to the model that the call happened, so it doesn't re-issue it. (All of this is on the rendered context; the log keeps the real result forever.)

2. Sliding window on dialogue

A turn or token cap on plain dialogue. Deterministic, no model call.

3. Summarization

The only expensive, lossy, model-calling reducer. It fires only if the free reducers above didn't reach budget. Two rules:

  • Off the critical path. Don't block the next user turn on a summarization call. Summarize asynchronously between turns and write the result as a derived artifact keyed by the span of events it covers (the summaries table — schema in 04). Context assembly then always just reads "latest applicable summary + verbatim tail," and a revived agent reconstructs context from the log plus those artifacts — no special compacting state, and compaction stays entirely out of the suspend/resume path. (If you ever do inline blocking summarization as a fallback, then you need a state and persisted progress — which is itself the argument against doing it inline.)
  • Prefix-ward, for cache safety. Providers cache on a stable prefix. Rewriting the middle of the context invalidates the cache for everything after the edit and makes you pay full price on the recent tokens you were trying to keep cheap. So only ever collapse the oldest span into a summary at the front, leave the tail byte-stable, and recompact in discrete chunks rather than trimming continuously. Continuous trimming of recent context is the worst case for cost; chunked prefix collapse is the best.

Reducer interface (sketch)

reduce(context, budget, state) -> {context, state}

The pipeline threads a shrinking budget through reducers in order. Each returns the reduced context and any state it needs to carry (e.g. the summarization reducer's "latest summary covers events up to seq N").

Token counting

Pre-send counting is approximate — exact counts only come back after the call, and tokenization is model-specific. Trigger on an estimate (a real tokenizer like tiktoken for OpenAI, a char/4 heuristic elsewhere) with a safety margin. Budget conservatively: an over-budget request is a hard failure, an over-eager compaction is just mild waste.

Observability

The optional model_calls audit record (see 04) carries which summary version and which evictions applied to each turn. Since compaction is lossy and the classic cause of "the agent forgot," this record is how you prove whether something was compacted out or the model just ignored it. Cheap to record now, miserable to add later.

Resolved: budget shape (hybrid, phased)

The model is a hybrid, not a single-vs-per-tier either/or:

  • A single total budget is the hard ceiling — it guarantees you never overflow, full stop. This is the v0 surface.
  • Optional per-tier caps (e.g. "tool results ≤ 30% of the window") are guards that trigger earlier eviction so one fat tool result can't starve dialogue. For a tool-centric library this guard earns its config surface — but it is deferred to v0.1, added when a real workload misbehaves. Single-total-only is a defensible v0.

The reducer interface must stay forward-compatible with per-tier caps: the budget threaded through reducers is a value that can later carry per-tier sub-limits without changing the reduce/3 signature. That is the one thing to get right now so adding caps in v0.1 is not a breaking change.

If a real tool-heavy workload starves dialogue before v0.1, promote the per-tier cap early — it is additive, not a redesign.