Runs small, real-network Subagents gates through the local T3 Code harnesses.
This is the first executable adapter for
docs/benchmarks/real-network-subagents.md. It intentionally measures only
provider/model capability:
- does the provider path accept the requested model?
- does T3 observe Subagent lifecycle for that provider/model?
- how long did the smoke run take?
In common_model_gate, it first runs tiny no-subagent provider probes for each
candidate model on both provider paths. It then runs smoke_real_n2 only for the
first commonly accepted model. A model or lifecycle divergence is recorded as an
explicit non-comparable abort, not a failed head-to-head.
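
The gate decision described above can be sketched as follows. This is a minimal illustration, not the task's actual implementation: the function names, the shape of the probe results, and the status strings are all hypothetical.

```python
from typing import Dict, List, Optional


def first_common_model(
    candidates: List[str], accepted: Dict[str, Dict[str, bool]]
) -> Optional[str]:
    """Return the first candidate accepted on every provider path, else None.

    `accepted` maps provider -> model -> probe result (hypothetical shape).
    """
    if not accepted:
        return None  # no provider was probed, so nothing is comparable
    for model in candidates:
        if all(per_model.get(model, False) for per_model in accepted.values()):
            return model
    return None


def gate(candidates: List[str], accepted: Dict[str, Dict[str, bool]]) -> dict:
    """Decide whether smoke_real_n2 runs, or the gate aborts as non-comparable."""
    model = first_common_model(candidates, accepted)
    if model is None:
        # Recorded as an explicit abort, not as a failed head-to-head.
        return {"status": "non_comparable_abort", "reason": "no commonly accepted model"}
    return {"status": "run_smoke", "scenario": "smoke_real_n2", "model": model}


# Hypothetical probe results for two candidates on both provider paths.
probes = {
    "pixir": {"default": False, "gpt-5.5": True},
    "codex": {"default": True, "gpt-5.5": True},
}
decision = gate(["default", "gpt-5.5"], probes)
```

Here `default` is skipped because only one provider path accepted it, so the smoke run would use `gpt-5.5`.
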
In representative_review_n3, it generates the seeded fixture, writes benchctl,
injects a fixed three-child scenario prompt, and scores strict final JSON. In
scaling_lifecycle, it generates one shard assignment per requested child and
scores lifecycle plus mechanical assignment completion for N=10+ fan-out runs. Usage
reconciliation is still not implemented.
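
As a rough illustration of "one shard assignment per requested child" and mechanical completion scoring, the sketch below assumes a shard is identified by a simple label and that each child reports back the label it finished. Both the label format and the scoring formula are assumptions made for this example; the real harness may define them differently.

```python
from typing import Dict, List, Set


def make_assignments(n: int) -> List[Dict[str, object]]:
    """Create exactly one shard assignment per requested child (labels are hypothetical)."""
    return [{"child": i, "shard": f"shard-{i:02d}"} for i in range(1, n + 1)]


def assignment_completion(
    assignments: List[Dict[str, object]], reported: Set[str]
) -> float:
    """Fraction of assigned shards that some child reported as finished."""
    if not assignments:
        return 0.0
    done = sum(1 for a in assignments if a["shard"] in reported)
    return done / len(assignments)


assignments = make_assignments(10)
score = assignment_completion(assignments, {"shard-01", "shard-02"})
```
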
The task shells into the paired local T3 Code checkout and runs local-only harnesses:
- scripts/pixir-subagents-benchmark.ts
- scripts/codex-subagents-observability-probe.ts
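
The paired checkout location is documented as `--t3-code-path`, falling back to `T3_CODE_PATH`, then to `../t3code` relative to the Pixir repo. A minimal sketch of that lookup is below. The function name is hypothetical, and the assumption that an explicit flag takes precedence over the environment variable is not confirmed by this document.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_t3_code_path(
    cli_value: Optional[str],
    repo_root: str,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the T3 Code checkout: flag first (assumed), then env var, then ../t3code."""
    env = os.environ if env is None else env
    if cli_value:
        return Path(cli_value)
    if env.get("T3_CODE_PATH"):
        return Path(env["T3_CODE_PATH"])
    # Documented default: ../t3code relative to the Pixir repo.
    return Path(repo_root) / ".." / "t3code"


default_path = resolve_t3_code_path(None, "/work/pixir", env={})
```
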
Default run is deliberately small:
mix pixir.bench.real_subagents
Useful variants:
mix pixir.bench.real_subagents --dry-run
mix pixir.bench.real_subagents --scenario common_model_gate --dry-run --json
mix pixir.bench.real_subagents --scenario common_model_gate --models gpt-5.5 --reasoning-effort low
mix pixir.bench.real_subagents --scenario probe --models gpt-5.5
mix pixir.bench.real_subagents --scenario smoke_real_n2 --models gpt-5.5
mix pixir.bench.real_subagents --scenario representative_review_n3 --dry-run
mix pixir.bench.real_subagents --scenario scaling_lifecycle --models gpt-5.5 --reasoning-effort low --n 10
mix pixir.bench.real_subagents --providers pixir --pixir-models gpt-5.3-codex-spark
mix pixir.bench.real_subagents --providers codex --codex-models default,gpt-5.5
mix pixir.bench.real_subagents --models gpt-5.5 --reasoning-effort low
mix pixir.bench.real_subagents --models gpt-5.5 --reasoning-effort low --n-values 1,3,5 --repetitions 3 --include-baseline
mix pixir.bench.real_subagents --n 2 --output .pixir/benchmarks/real-subagents/manual
Options:
- --scenario - capability_matrix, probe, smoke_real_n2, common_model_gate, representative_review_n3, or scaling_lifecycle. Default: capability_matrix.
- --providers - comma-separated pixir,codex subset. Default: pixir,codex.
- --pixir-models - comma-separated models for Pixir. Default: gpt-5.3-codex-spark.
- --codex-models - comma-separated models for Codex. Default: default,gpt-5.3-codex-spark.
- --models / --model-candidates - comma-separated models for both providers. default means "do not pass --model".
- --reasoning-effort - effort knob passed to both provider paths. Default: low.
- --n - child count for the smoke probe. Default: 1.
- --n-values - comma-separated child counts for a scaling suite.
- --repetitions - repetitions per provider/model/N. Default: 1.
- --include-baseline - add no-network T3 harness baselines per provider/model/repetition.
- --json - emit machine-readable result or dry-run output on stdout.
- --output - output directory. Default: .pixir/benchmarks/real-subagents/<run-id>.
- --t3-code-path - paired T3 Code checkout. Default: T3_CODE_PATH, or ../t3code relative to the Pixir repo.
- --dry-run - print planned commands without running or writing artifacts.
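
Because every real run hits a provider, it can help to estimate the size of a scaling suite before launching it. The sketch below expands the documented dimensions (repetitions per provider/model/N, baselines per provider/model/repetition) into a run list. It is an assumption about how the options combine, with hypothetical names, and is not the task's actual planner; `--dry-run` remains the authoritative way to see planned commands.

```python
from typing import Dict, List


def planned_runs(
    models_by_provider: Dict[str, List[str]],
    n_values: List[int],
    repetitions: int,
    include_baseline: bool = False,
) -> Dict[str, List[dict]]:
    """Expand provider/model/N/repetition into real runs, plus optional baselines."""
    reps = range(1, repetitions + 1)
    real = [
        {"provider": p, "model": m, "n": n, "repetition": r}
        for p, models in models_by_provider.items()
        for m in models
        for n in n_values
        for r in reps
    ]
    # Baselines are per provider/model/repetition, so they do not multiply by N.
    baseline = [
        {"provider": p, "model": m, "repetition": r}
        for p, models in models_by_provider.items()
        for m in models
        for r in reps
    ] if include_baseline else []
    return {"real": real, "baseline": baseline}


# Mirrors: --models gpt-5.5 --n-values 1,3,5 --repetitions 3 --include-baseline
plan = planned_runs(
    {"pixir": ["gpt-5.5"], "codex": ["gpt-5.5"]}, [1, 3, 5], 3, include_baseline=True
)
print(len(plan["real"]), len(plan["baseline"]))  # 18 6
```

Under these assumptions, one model on two providers already yields 18 real-network runs, which is why small model lists and child counts matter.
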
Artifacts:
runs.jsonl
summary.json
report.md
fixtures/<provider>/<model>/
provider-artifacts/<provider>/<model>/
Each provider artifact also includes memory-samples.txt, a sampled process-tree
RSS trace for the T3 harness and its descendants.
This task hits real providers through T3 Code. Keep --n and model lists small.