Sagents.Middleware.Summarization (Sagents v0.8.0-rc.3)

Middleware that automatically manages conversation length through intelligent summarization.

This middleware monitors token usage and automatically summarizes older messages when a threshold is exceeded, preserving recent messages for context continuity.

Purpose

Long conversations present several problems:

  • Increased API costs
  • Slower response times
  • Risk of exceeding model context limits
  • Potential API errors

This middleware solves these problems by:

  • Monitoring total token count
  • Summarizing older messages when threshold is exceeded
  • Preserving recent messages for continuity
  • Protecting AI/Tool message pairs from separation

Configuration

# Alias assumed by the snippets in this section
alias Sagents.Middleware.Summarization

# Default configuration
{Summarization, []}

# Custom configuration
{Summarization, [
  model: custom_model,                    # Model for summarization (defaults to agent model)
  max_tokens_before_summary: 170_000,    # Token threshold (default: 170k)
  messages_to_keep: 6,                   # Recent messages to preserve (default: 6)
  summary_prompt: custom_prompt,         # Custom summarization prompt
  token_counter: &custom_counter/1       # Custom token counting function
]}

Configuration Options

  • :model - LLM to use for summarization. Defaults to the agent's model.
  • :max_tokens_before_summary - Token threshold that triggers summarization. Default: 170,000
  • :messages_to_keep - Number of recent messages to preserve intact. Default: 6
  • :summary_prompt - Custom prompt for summarization. Uses intelligent default.
  • :token_counter - Function to count tokens. Defaults to approximate counting.
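As an illustration of the :token_counter option, here is a minimal sketch of a custom counter. It assumes the function receives the list of messages and returns an integer, and that each message exposes its text under a :content key. Both are assumptions inferred from the `&custom_counter/1` arity above, not confirmed details of the library. The module name MyApp.TokenCounter is hypothetical.

```elixir
defmodule MyApp.TokenCounter do
  # Hypothetical module: rough heuristic of about 4 characters per token.
  @chars_per_token 4

  # Assumes messages are maps (or structs) with a binary :content field.
  # This shape is an assumption, not the library's documented message type.
  def count(messages) when is_list(messages) do
    messages
    |> Enum.map(&content_length/1)
    |> Enum.sum()
    |> div(@chars_per_token)
  end

  defp content_length(%{content: content}) when is_binary(content),
    do: String.length(content)

  # Messages without textual content contribute nothing to the estimate.
  defp content_length(_other), do: 0
end
```

Under those assumptions it would be wired in as `token_counter: &MyApp.TokenCounter.count/1`.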

Position in Middleware Stack

This middleware should run relatively early in the before_model phase: after the middleware that generates messages, but before any middleware that expects a specific message structure:

  1. TodoListMiddleware
  2. FilesystemMiddleware
  3. SubAgentMiddleware
  4. SummarizationMiddleware ← Position
  5. AnthropicPromptCachingMiddleware
  6. PatchToolCallsMiddleware
  7. HumanInTheLoopMiddleware

How It Works

1. Token Monitoring

Before each model call, the middleware counts the total tokens in the message history.

2. Threshold Check

If the token count exceeds the threshold, summarization is triggered.
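Steps 1 and 2 can be sketched as a single predicate. This is an illustrative sketch only: ThresholdSketch, summarize?/3, and word_count/1 are hypothetical names, the message shape (a map with a :content string) is assumed, and the strict comparison is one reading of "exceed".

```elixir
defmodule ThresholdSketch do
  # Illustrative only. `token_counter` is any one-argument function that
  # returns an integer token estimate for a list of messages.
  def summarize?(messages, token_counter, max_tokens) do
    # Strictly greater: a history sitting exactly at the limit is left alone.
    token_counter.(messages) > max_tokens
  end

  # Crude stand-in counter for the example: one token per
  # whitespace-separated word.
  def word_count(messages) do
    messages
    |> Enum.map(fn %{content: content} -> length(String.split(content)) end)
    |> Enum.sum()
  end
end
```

For example, `ThresholdSketch.summarize?(messages, &ThresholdSketch.word_count/1, 170_000)` returns a boolean that a caller could use to decide whether to proceed to the cutoff search.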

3. Safe Cutoff Detection

Finds a cutoff point in the conversation that does not separate:

  • Assistant messages with tool_calls from their corresponding tool results
  • Related message pairs

4. Message Partitioning

  • To summarize: Older messages before cutoff point
  • To preserve: Recent messages after cutoff point
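Given a cutoff index, the partition itself is a plain list split. A minimal sketch, assuming the history is an ordered list (oldest first) and the cutoff is the number of messages to summarize; PartitionSketch is a hypothetical name:

```elixir
defmodule PartitionSketch do
  # Splits the history at `cutoff`: the first `cutoff` messages are
  # summarized, the remainder is preserved untouched.
  def partition(messages, cutoff) when is_integer(cutoff) and cutoff >= 0 do
    {to_summarize, to_preserve} = Enum.split(messages, cutoff)
    %{to_summarize: to_summarize, to_preserve: to_preserve}
  end
end
```

A cutoff of 0 leaves `to_summarize` empty, which matches the "summarize nothing" fallback.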

5. Summary Generation

Uses the configured model (:model, defaulting to the agent's model) to generate a concise summary of the older messages.

6. State Update

Replaces older messages with summary messages, preserving recent messages.
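The state update can be pictured as putting a summary message in front of the preserved tail. This is only a sketch: the role used for the summary, the wording of its prefix, and the use of a single summary message are all assumptions, and StateUpdateSketch is a hypothetical name.

```elixir
defmodule StateUpdateSketch do
  # Builds the new history: one summary message followed by the recent
  # messages that were kept intact. The :user role and the prefix text
  # are assumptions made for illustration.
  def apply_summary(summary_text, to_preserve) when is_binary(summary_text) do
    summary = %{
      role: :user,
      content: "Summary of earlier conversation:\n" <> summary_text
    }

    [summary | to_preserve]
  end
end
```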

Example

# Aliases assumed by this example
alias Sagents.Agent
alias Sagents.Middleware.Summarization

# Create agent with summarization
{:ok, agent} = Agent.new(
  model: model,
  middleware: [
    {Summarization, [
      max_tokens_before_summary: 150_000,
      messages_to_keep: 8
    ]}
  ]
)

# Summarization happens automatically during execution
{:ok, state} = Agent.execute(agent, state)

Safe Cutoff Algorithm

The middleware protects AI/Tool message pairs from separation:

  1. Calculate target cutoff: message_count - messages_to_keep
  2. Search backwards from target to find safe cutoff point
  3. A point is safe if:
    • It's not an assistant message with tool_calls
    • The next message isn't a tool result for this assistant
  4. If no safe point is found, nothing is summarized and all messages are kept
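The algorithm above can be sketched as follows. This is one interpretation, not the library's code: a cutoff index is treated as safe when the last summarized message is not an assistant message with pending tool_calls and the first preserved message is not a tool result. Messages are assumed to be maps with a :role key and an optional :tool_calls list; SafeCutoffSketch is a hypothetical name.

```elixir
defmodule SafeCutoffSketch do
  # Returns the number of leading messages that can be summarized.
  # 0 means "summarize nothing".
  def find_cutoff(messages, messages_to_keep) do
    target = length(messages) - messages_to_keep

    if target <= 0 do
      0
    else
      # Walk backwards from the target; fall back to 0 if nothing is safe.
      Enum.find(target..1//-1, 0, &safe?(messages, &1))
    end
  end

  defp safe?(messages, cutoff) do
    last_summarized = Enum.at(messages, cutoff - 1)
    first_kept = Enum.at(messages, cutoff)

    not pending_tool_call?(last_summarized) and not tool_result?(first_kept)
  end

  # An assistant message whose tool_calls list is non-empty.
  defp pending_tool_call?(%{role: :assistant, tool_calls: [_ | _]}), do: true
  defp pending_tool_call?(_), do: false

  defp tool_result?(%{role: :tool}), do: true
  defp tool_result?(_), do: false
end
```

Checking both sides of the cut also handles an assistant message followed by several tool results: a cut between two of those results is rejected because the first preserved message is still a tool result.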

Error Handling

  • Falls back to keeping all messages if summarization fails
  • Logs errors but doesn't halt agent execution
  • Graceful degradation ensures agent continues working
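The fallback behaviour can be sketched like this. It assumes the summarization step returns `{:ok, new_messages}` or `{:error, reason}`; that return shape, the FallbackSketch module, and the `summarize_fun` argument are hypothetical and used only for illustration.

```elixir
defmodule FallbackSketch do
  require Logger

  # `summarize_fun` stands in for the real summarization step and is
  # assumed to return {:ok, new_messages} or {:error, reason}.
  def maybe_summarize(messages, summarize_fun) do
    case summarize_fun.(messages) do
      {:ok, new_messages} ->
        new_messages

      {:error, reason} ->
        # Log and keep the original history so the agent can continue.
        Logger.error("Summarization failed: #{inspect(reason)}")
        messages
    end
  end
end
```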

Performance Considerations

  • Token counting is approximate (fast estimation)
  • Summarization only runs when the threshold is exceeded
  • Summary generation is async-compatible
  • Minimal overhead when under threshold