mix pixir.bench.real_subagents (pixir v0.1.0)


Runs small, real-network Subagents gates through the local T3 Code harnesses.

This is the first executable adapter for docs/benchmarks/real-network-subagents.md. It intentionally measures only provider/model capability:

  • does the provider path accept the requested model?
  • does T3 observe Subagent lifecycle for that provider/model?
  • how long did the smoke run take?

In common_model_gate, it first runs tiny no-subagent provider probes for each candidate model on both provider paths. It then runs smoke_real_n2 only for the first model that both paths accept. A model or lifecycle divergence is recorded as an explicit non-comparable abort, not as a failed head-to-head comparison.
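The gate logic described above can be sketched as follows. This is a minimal illustration, not the task's actual implementation: the function names, the `accepted_by_provider` mapping, and the result dictionary keys are all hypothetical, and the probe results are assumed to have been collected already.

```python
def first_common_model(candidates, accepted_by_provider):
    """Return the first candidate accepted on every provider path, else None.

    `accepted_by_provider` is a hypothetical mapping such as
    {"pixir": {"gpt-5.5"}, "codex": {"default", "gpt-5.5"}}.
    """
    for model in candidates:
        if all(model in accepted for accepted in accepted_by_provider.values()):
            return model
    return None


def gate(candidates, accepted_by_provider):
    """Decide whether to run the smoke scenario or abort as non-comparable."""
    model = first_common_model(candidates, accepted_by_provider)
    if model is None:
        # No model was accepted on both paths: record an explicit abort
        # rather than treating it as a failed comparison.
        return {"status": "non_comparable_abort", "reason": "no_common_model"}
    return {"status": "run_smoke", "scenario": "smoke_real_n2", "model": model}


if __name__ == "__main__":
    accepted = {"pixir": {"gpt-5.5"}, "codex": {"default", "gpt-5.5"}}
    print(gate(["default", "gpt-5.5"], accepted))
```

Candidate order matters in this sketch: the first candidate that appears in every provider's accepted set wins, which mirrors the "first commonly accepted model" wording.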

In representative_review_n3, it generates the seeded fixture, writes benchctl, injects a fixed three-child scenario prompt, and scores strict final JSON. In scaling_lifecycle, it generates one shard assignment per requested child and scores lifecycle plus mechanical assignment completion for N=10+ fan-out runs. Usage reconciliation is not yet implemented.
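To make "one shard assignment per requested child" and "mechanical assignment completion" concrete, here is a rough sketch. The shard naming scheme, the record shape, and the score definition (fraction of assigned shards reported complete) are assumptions made for illustration; the real fixture format is not specified here.

```python
def make_assignments(n):
    """Build one hypothetical shard assignment per requested child (1..n)."""
    return [{"child": i, "shard": f"shard-{i:02d}"} for i in range(1, n + 1)]


def completion_score(assignments, completed_shards):
    """Fraction of assigned shards that were reported as completed.

    Shards that were never assigned are ignored, so stray output from a
    child cannot inflate the score.
    """
    expected = {a["shard"] for a in assignments}
    if not expected:
        return 0.0
    done = expected & set(completed_shards)
    return len(done) / len(expected)


if __name__ == "__main__":
    assignments = make_assignments(10)
    reported = [a["shard"] for a in assignments[:5]] + ["shard-99"]
    print(completion_score(assignments, reported))
```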

The task shells into the paired local T3 Code checkout and runs local-only harnesses:

  • scripts/pixir-subagents-benchmark.ts
  • scripts/codex-subagents-observability-probe.ts

The default run is deliberately small:

mix pixir.bench.real_subagents

Useful variants:

mix pixir.bench.real_subagents --dry-run
mix pixir.bench.real_subagents --scenario common_model_gate --dry-run --json
mix pixir.bench.real_subagents --scenario common_model_gate --models gpt-5.5 --reasoning-effort low
mix pixir.bench.real_subagents --scenario probe --models gpt-5.5
mix pixir.bench.real_subagents --scenario smoke_real_n2 --models gpt-5.5
mix pixir.bench.real_subagents --scenario representative_review_n3 --dry-run
mix pixir.bench.real_subagents --scenario scaling_lifecycle --models gpt-5.5 --reasoning-effort low --n 10
mix pixir.bench.real_subagents --providers pixir --pixir-models gpt-5.3-codex-spark
mix pixir.bench.real_subagents --providers codex --codex-models default,gpt-5.5
mix pixir.bench.real_subagents --models gpt-5.5 --reasoning-effort low
mix pixir.bench.real_subagents --models gpt-5.5 --reasoning-effort low --n-values 1,3,5 --repetitions 3 --include-baseline
mix pixir.bench.real_subagents --n 2 --output .pixir/benchmarks/real-subagents/manual

Options:

  • --scenario - capability_matrix, probe, smoke_real_n2, common_model_gate, representative_review_n3, or scaling_lifecycle. Default: capability_matrix.
  • --providers - comma-separated pixir,codex subset. Default: pixir,codex.
  • --pixir-models - comma-separated models for Pixir. Default: gpt-5.3-codex-spark.
  • --codex-models - comma-separated models for Codex. Default: default,gpt-5.3-codex-spark.
  • --models / --model-candidates - comma-separated models for both providers. default means "do not pass --model".
  • --reasoning-effort - effort knob passed to both provider paths. Default: low.
  • --n - child count for the smoke probe. Default: 1.
  • --n-values - comma-separated child counts for a scaling suite.
  • --repetitions - repetitions per provider/model/N. Default: 1.
  • --include-baseline - add no-network T3 harness baselines per provider/model/repetition.
  • --json - emit machine-readable result or dry-run output on stdout.
  • --output - output directory. Default: .pixir/benchmarks/real-subagents/<run-id>.
  • --t3-code-path - paired T3 Code checkout. Default: T3_CODE_PATH, or ../t3code relative to the Pixir repo.
  • --dry-run - print planned commands without running or writing artifacts.
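The options above combine into a run matrix: one run per provider, model, child count, and repetition. The sketch below shows one plausible expansion, including the rule that the model name default means no --model argument is passed. The function names and the run record shape are hypothetical, and the real task may order or filter runs differently.

```python
from itertools import product


def model_args(model):
    """Return the harness model arguments; 'default' means pass no --model."""
    return [] if model == "default" else ["--model", model]


def expand_runs(models_by_provider, n_values, repetitions):
    """Expand provider/model/N/repetition into a flat list of planned runs."""
    runs = []
    for provider, models in models_by_provider.items():
        for model, n, rep in product(models, n_values, range(1, repetitions + 1)):
            runs.append(
                {
                    "provider": provider,
                    "model": model,
                    "n": n,
                    "repetition": rep,
                    "model_args": model_args(model),
                }
            )
    return runs


if __name__ == "__main__":
    plan = expand_runs(
        {"pixir": ["gpt-5.3-codex-spark"], "codex": ["default", "gpt-5.5"]},
        n_values=[1, 3, 5],
        repetitions=3,
    )
    # 3 provider/model pairs x 3 child counts x 3 repetitions
    print(len(plan))
```

Because every entry is a real provider call, the matrix grows multiplicatively with each added model, child count, or repetition; the count is worth checking before a non-dry run.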

Artifacts:

runs.jsonl
summary.json
report.md
fixtures/<provider>/<model>/
provider-artifacts/<provider>/<model>/

Each provider artifact also includes memory-samples.txt, a sampled process-tree RSS trace for the T3 harness and its descendants.

This task hits real providers through T3 Code. Keep --n and model lists small.