Merges a first test run with an automatic retry run to distinguish flaky failures from confirmed ones.
mix test.json re-runs the previously failed tests (ExUnit's native
--failed) after a run with failures, then calls merge/2 to overlay the
two result documents. The goal is to never block an AI agent on a failure
that heals on re-run, while never hiding a failure that doesn't.
Classification
Tests are matched across runs by their {module, name} identity (ExUnit's
canonical manifest key — file:line can shift between runs).
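A minimal Elixir sketch of identity matching. It assumes each decoded test entry stores its module and test name under "module" and "name" keys; those field names and the RetrySketch.Identity module are hypothetical. The example shows two entries whose line shifted between runs still resolving to the same key.

```elixir
defmodule RetrySketch.Identity do
  # Hypothetical helper: "module" and "name" are assumed field names.
  # File and line are ignored on purpose because they can shift between runs.
  def key(%{"module" => module, "name" => name}), do: {module, name}

  # Indexes a list of test entries by identity for lookup across runs.
  def index(tests), do: Map.new(tests, fn t -> {key(t), t} end)
end

run1 = %{"module" => "MyApp.FooTest", "name" => "test bar", "line" => 10}
run2 = %{"module" => "MyApp.FooTest", "name" => "test bar", "line" => 12}

RetrySketch.Identity.key(run1) == RetrySketch.Identity.key(run2)
# => true
```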
- confirmed: failed run 1 and did not pass run 2. Stays in tests and keeps run 1's failure detail. Counts toward summary.failed.
- flaky: failed run 1 but passed run 2. Moved out of tests into a top-level flaky array (run 1's failure detail is preserved so the agent sees what flaked). Counted in summary.flaky, never in summary.failed.
A test that failed run 1 but was not re-verified as passing (e.g. --failed
could not re-run it) stays confirmed. This is conservative by design: no test
is marked flaky unless it was observed passing in run 2.
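The classification rules, including the conservative default, can be sketched as a single split. This is a hypothetical sketch, not the module's API: it assumes entries carry "module", "name", and "state" keys and that a passing test has state "passed". Anything not seen passing in run 2 (failed again, missing, not re-run) stays confirmed.

```elixir
defmodule RetrySketch.Classify do
  # Assumed schema: entries have "module", "name", and "state" keys.
  defp key(t), do: {t["module"], t["name"]}

  # Splits run 1's failed tests into {confirmed, flaky}. A test is flaky only
  # if run 2 contains it with state "passed"; everything else stays confirmed.
  def split(run1_failed, run2_tests) do
    passed_in_run2 =
      for t <- run2_tests, t["state"] == "passed", into: MapSet.new(), do: key(t)

    {flaky, confirmed} =
      Enum.split_with(run1_failed, fn t -> MapSet.member?(passed_in_run2, key(t)) end)

    # Both lists keep run 1's entries, i.e. run 1's failure detail.
    {confirmed, flaky}
  end
end

run1_failed = [
  %{"module" => "A", "name" => "test x", "state" => "failed"},
  %{"module" => "A", "name" => "test y", "state" => "failed"},
  %{"module" => "B", "name" => "test z", "state" => "failed"}
]

run2 = [
  %{"module" => "A", "name" => "test x", "state" => "passed"},
  %{"module" => "A", "name" => "test y", "state" => "failed"}
  # "test z" was not re-run, so it stays confirmed
]

{confirmed, flaky} = RetrySketch.Classify.split(run1_failed, run2)
Enum.map(confirmed, & &1["name"])
# => ["test y", "test z"]
Enum.map(flaky, & &1["name"])
# => ["test x"]
```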
setup_all failures (module_failures) are classified the same way by module
name: recurs in run 2 → stays confirmed; cleared → moves to flaky tagged
"scope" => "module".
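Module failures follow the same shape, keyed only by module name. A hypothetical sketch, assuming each module_failures entry has a "module" key and treating any module absent from run 2's module_failures as cleared; RetrySketch.Modules is an illustrative name.

```elixir
defmodule RetrySketch.Modules do
  # Returns {confirmed, flaky}. Assumes each entry carries a "module" key.
  # Flaky (cleared) entries are tagged "scope" => "module".
  def split(run1_module_failures, run2_module_failures) do
    recurring = MapSet.new(run2_module_failures, & &1["module"])

    {confirmed, cleared} =
      Enum.split_with(run1_module_failures, &MapSet.member?(recurring, &1["module"]))

    {confirmed, Enum.map(cleared, &Map.put(&1, "scope", "module"))}
  end
end

run1 = [%{"module" => "DbTest"}, %{"module" => "CacheTest"}]
run2 = [%{"module" => "DbTest"}]

{confirmed, flaky} = RetrySketch.Modules.split(run1, run2)
# confirmed => [%{"module" => "DbTest"}]
# flaky     => [%{"module" => "CacheTest", "scope" => "module"}]
```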
Invalid tests (setup_all casualties)
When a setup_all fails, ExUnit marks the module's tests "invalid". ExUnit's
failure manifest records invalid tests like failures, so the retry re-runs them.
The merge resolves each run-1 invalid test against its run-2 state:
- passed run 2 → healed: replaced by its run-2 (passing) entry and moved from summary.invalid into summary.passed.
- failed run 2 → confirmed failure: replaced by its run-2 entry (with run-2 failure detail). Counts toward summary.failed.
- still invalid / not re-run → stays as-is; its recurring module failure keeps the run red.
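Resolving one run-1 invalid test against run 2 reduces to a three-way case on its run-2 state. A sketch under the same assumed keys ("module", "name", "state"); RetrySketch.Invalid is hypothetical and leaves the summary bookkeeping out.

```elixir
defmodule RetrySketch.Invalid do
  # Resolves one run-1 invalid entry against a run-2 index keyed by
  # {module, name}. The "passed"/"failed" state strings are assumptions.
  def resolve(invalid, run2_index) do
    case Map.get(run2_index, {invalid["module"], invalid["name"]}) do
      %{"state" => "passed"} = t2 -> {:healed, t2}
      %{"state" => "failed"} = t2 -> {:confirmed, t2}
      # Still invalid in run 2, or not re-run at all: keep the run-1 entry.
      _still_invalid_or_missing -> {:unchanged, invalid}
    end
  end
end

index = %{
  {"DbTest", "test a"} => %{"module" => "DbTest", "name" => "test a", "state" => "passed"},
  {"DbTest", "test b"} => %{"module" => "DbTest", "name" => "test b", "state" => "failed"}
}

inv = fn name -> %{"module" => "DbTest", "name" => name, "state" => "invalid"} end

RetrySketch.Invalid.resolve(inv.("test a"), index) |> elem(0)
# => :healed
RetrySketch.Invalid.resolve(inv.("test b"), index) |> elem(0)
# => :confirmed
RetrySketch.Invalid.resolve(inv.("test c"), index) |> elem(0)
# => :unchanged
```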
In failures-only output (the default), run-1 invalid tests are not present in
tests. Run-2 failures of those tests are surfaced into tests so a real
failure is never hidden, and summary.invalid goes to zero only when no
module failure recurred (invalid tests cannot outlive their setup_all
failure).
Output shape (additive to schema v1)
The merged document is run 1's document with:
- tests: flaky entries removed; healed or re-failed invalid entries replaced by their run-2 selves (passing and confirmed entries are preserved, so --all runs keep their passing tests)
- module_failures: only the recurring (confirmed) ones; omitted if none
- flaky: flaky tests and modules; omitted when empty
- summary.failed: confirmed test count
- summary.flaky: flaky count
- summary.passed / summary.invalid: adjusted for healed invalid tests
- summary.result: "passed" iff nothing is confirmed and nothing is still invalid
- retry: %{"ran" => true, "passes" => 1, "retried" => N, "confirmed" => X, "flaky" => Y}, where retried counts the failed tests, invalid tests, and failed modules that were re-run
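The omission and summary rules of the output shape can be expressed as small helpers. These are hypothetical (RetrySketch.Output is not part of the module), and the counts they take are assumed to be computed by the caller; the sketch returns a boolean rather than guessing the non-"passed" value of summary.result.

```elixir
defmodule RetrySketch.Output do
  # Puts key => list into the document, or drops the key when the list is
  # empty, matching "omitted if none" / "omitted when empty".
  def put_nonempty(doc, key, []), do: Map.delete(doc, key)
  def put_nonempty(doc, key, list), do: Map.put(doc, key, list)

  # summary.result is "passed" iff nothing is confirmed and nothing is still invalid.
  def passed?(confirmed_count, still_invalid_count),
    do: confirmed_count == 0 and still_invalid_count == 0

  # The retry map; `retried` counts failed tests, invalid tests, and failed
  # modules that were re-run.
  def retry_info(retried, confirmed, flaky) do
    %{"ran" => true, "passes" => 1, "retried" => retried, "confirmed" => confirmed, "flaky" => flaky}
  end
end

doc =
  %{"tests" => [], "module_failures" => [%{"module" => "CacheTest"}]}
  |> RetrySketch.Output.put_nonempty("module_failures", [])
  |> RetrySketch.Output.put_nonempty("flaky", [%{"module" => "CacheTest", "scope" => "module"}])

# doc => %{"tests" => [], "flaky" => [%{"module" => "CacheTest", "scope" => "module"}]}
```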
All functions are pure — the orchestration (subprocess spawn, file IO, exit
code) lives in Mix.Tasks.Test.Json.
Summary
Types
A decoded (string-keyed) JSON result document.
Functions
Merges run 1 and run 2 result documents into a single document distinguishing confirmed failures from flaky ones.
Both arguments are string-keyed maps (as produced by :json.decode/1 on a
buffered run). Run 2 is expected to have been produced with --all so every
re-run test carries its "state".
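Presumably run 2 needs --all because in failures-only output a test that passed on retry would simply be absent, and under the conservative rule an absent test stays confirmed. A hypothetical precondition check on the decoded run-2 map, assuming a top-level "tests" list whose entries carry string "state" values; RetrySketch.Precondition and its return shapes are illustrative only.

```elixir
defmodule RetrySketch.Precondition do
  # Checks that every run-2 test entry carries a string "state".
  def check_run2(%{"tests" => tests}) when is_list(tests) do
    case Enum.reject(tests, &is_binary(&1["state"])) do
      [] -> :ok
      missing -> {:error, {:missing_state, length(missing)}}
    end
  end

  def check_run2(_), do: {:error, :no_tests}
end

RetrySketch.Precondition.check_run2(%{"tests" => [%{"module" => "A", "name" => "t", "state" => "passed"}]})
# => :ok
RetrySketch.Precondition.check_run2(%{"tests" => [%{"module" => "A", "name" => "t"}]})
# => {:error, {:missing_state, 1}}
```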