5. Agent ergonomics: dry-run, self-describing help, structured errors, I/O discipline

Copy Markdown View Source

Date: 2026-05-29 Status: Accepted

Context

Pixir is built by agents as much as it is for them. The model drives tools, and agents (including a fresh one with no prior context) drive the CLI and the dev scripts. When a command's behaviour, arguments, or failure mode is opaque, the agent guesses — it runs destructive actions to "see what happens", invents flags, or pastes raw, unbounded output back into the model where it burns tokens and derails the Turn.

Two recurring failure modes motivate this decision:

  1. No safe preview. A tool/command with side effects can only be understood by running it. There is no way to ask "what would you do?" first.
  2. Ambiguous I/O. It is unclear what text is meant for the model, what is meant for the human at the terminal, and what is pure diagnostics. Everything collapses onto stdout, so the model ingests prompts/spinners/log noise and the human sees raw JSON.

This is a cross-cutting contract over the Tool behaviour (ADR 0001's Executor), the Event/Log envelope (ADR 0004), and the CLI (build step 8). It costs us nothing to adopt now and is expensive to retrofit.

Decision

Every Pixir tool, CLI command, and dev script obeys four rules.

1. Dry-runnable

Anything with side effects (writes, shell, network, file mutation) supports a dry-run that reports the planned action and the resources it would touch, performs no mutation, and returns success.

  • CLI / scripts: accept --dry-run.
  • LLM-facing tools: the Executor honours a dry_run flag (per-call arg and/or a Turn-level mode). In dry-run a tool returns a structured plan (e.g. %{would: :write, path: ..., bytes: ...}) instead of acting.
  • Read-only tools are trivially dry-run (they have no effect to suppress).

2. Self-describing for a fresh agent

A caller with zero context can discover capabilities without reading source.

  • CLI: pixir help and pixir <cmd> --help print usage, every flag, and at least one example. Subcommands are individually discoverable.
  • Tools: __tool__/0 is the help — a complete name, description, and JSON schema (required/optional args, types). Descriptions are written for a fresh agent.
  • Help text is plain and example-led; no hidden flags.

3. Structured errors

Failures are data, not just prose. Internally functions return {:error, reason}; at every boundary that an agent or the model reads, the error serialises to a stable envelope:

%{ok: false, error: %{kind: atom_or_string, message: human_string, details: map}}

kind is a small, stable, documented vocabulary (e.g. :outside_workspace, :not_found, :invalid_args, :timeout) so callers can branch on it. The canonical enumeration is Pixir.Tool.kind/0 — the single source of truth; adding a kind is a deliberate change there. CLI commands exit non-zero on error and emit the same envelope on the diagnostic channel; successful machine output stays clean.

Deliberate divergence — a nonzero shell exit is not an error. A bash command that runs and exits nonzero (e.g. grep finding nothing → exit 1) returns a successful tool result %{"output", "exit_code", "ok" => false}, not an error envelope — the tool did its job, and the model needs the output and code to reason (treating a no-match as a failure would derail Turns). So there is no :nonzero_exit kind; branch on exit_code/ok within the result instead.

4. I/O channel discipline — be deliberate about what reaches the LLM

Three distinct channels, never conflated:

  • Model channel — the tool_result payload folded into History and sent to the model (ADR 0004). It is curated and token-bounded: large output is truncated with an explicit marker (e.g. …[truncated N of M lines]), never raw-dumped. No ANSI, spinners, or prompts.
  • Human channel (stdout) — rendered, human-facing output produced by the Renderer (build step 9) from bus events. Formatting lives here.
  • Diagnostic channel (stderr / Logger) — debug and operational noise. Never seen by the model and not part of any tool result.

Input is equally explicit: a one-shot prompt comes from argv or, when piped, from stdin; a tool never silently reads stdin. Each tool documents which streams it consumes and produces.

Consequences

  • Safer autonomy. Agents can preview destructive actions; dry-run is the natural first probe and pairs with the deferred :ask permission gate when it lands.
  • Faster cold starts. A fresh agent orients via --help / __tool__/0 instead of reading the codebase or trial-and-error.
  • Branchable failures + cheaper Turns. Stable kinds let the loop react programmatically; bounded model-channel output keeps Turns from blowing the context window on log spew.
  • Cost / obligations.
    • Every new tool and command must implement dry-run, help, the error envelope, and declare its streams — enforced in review (and ideally a shared test helper).
    • The kind vocabulary is a curated, documented set; adding one is a deliberate change.
    • Truncation policy (limits, marker format) is a tunable knob, decided in code alongside the Executor.
  • The Pixir.Tool behaviour and Executor (build step 6) and the CLI (step 8) are the enforcement points; this ADR is their acceptance criteria.