Using Lockstep from AI agents

Copy Markdown View Source

Lockstep is designed to be invokable from AI coding agents (Claude Code, Cursor, GitHub Copilot, etc.) to verify concurrency properties of BEAM code. This file documents the AI-agent contract.

What an agent can do with Lockstep

TaskHow
Verify a hypothesized race existsGenerate a Lockstep test, run it, parse the result
Reproduce a flaky test deterministicallyWrap the test body in ctest, capture seed
Replay a saved counterexamplemix lockstep.replay --trace <path>
Minimize a counterexamplemix lockstep.shrink --trace <path>
Verify a fix doesn't regressRe-run the same test with same seed

CLI commands (Bash-callable)

These are the primary entry points for AI agents. All run via the mix command in the project root:

mix test <file> — run a Lockstep test

mix test test/path/to/race_test.exs --seed 1

Exit code 0 = no bug found. Non-zero = test failed.

If a Lockstep test detected a bug, output includes:

  • iteration: N — the iteration that triggered
  • seed: M — top-level seed
  • strategy: :pct/:pos/... — strategy used
  • trace path: traces/<name>.lockstep — saved trace file

The trace file path is the key handoff: subsequent commands operate on that file.

mix lockstep.replay --trace <path> — print a saved trace

mix lockstep.replay --trace traces/foo-iter4-seed1.lockstep

Prints metadata + human-readable schedule with pid aliases (P0, P1, ...) and step numbers. Includes <-- FAILED HERE marker.

mix lockstep.replay --trace <path> --run "Mod.fn" --file <test_file>

Re-execute a specific test function under the recorded schedule. Deterministic — same trace + same code = same outcome.

mix lockstep.shrink --trace <path> — minimize a counterexample

mix lockstep.shrink --trace traces/foo-iter4-seed1.lockstep

Outputs a smaller trace file that reproduces the same bug. Useful before filing an issue.

mix lockstep.dump_trace --trace <path> — print trace as terms

For programmatic parsing.

Generating a Lockstep test from a hypothesis

The simplest pattern an agent should produce when it wants to verify a race hypothesis:

defmodule Lockstep.Mimic.MyRaceTest do
  use Lockstep.Test

  ctest "verify hypothesis: <plain English description>",
        iterations: 1000, strategy: :pos, seed: 1 do
    # Build minimal mimic of the production code path.
    # Use Lockstep.* wrappers everywhere instead of OTP defaults.

    # 1. Setup state
    {:ok, pid} = Lockstep.GenServer.start_link(MyServer, init_state)

    # 2. Spawn concurrent actors
    parent = self()
    for i <- 1..3 do
      Lockstep.spawn(fn ->
        # ... code that exercises the hypothesized race ...
        Lockstep.send(parent, {:done, i})
      end)
    end

    # 3. Collect results
    for _ <- 1..3 do
      Lockstep.recv_first(fn {:done, _} -> true; _ -> false end)
    end

    # 4. Assert invariant
    if not invariant_holds?(...) do
      raise "invariant violated: ..."
    end
  end
end

Key rules for agent-generated tests:

  1. Use Lockstep.* wrappers for every send, spawn, GenServer.call, ETS op, atomics op. Bare OTP calls won't be intercepted.
  2. Use Lockstep.send/2 in the test body (not send/2).
  3. Use Lockstep.recv_first/1 (not receive).
  4. The test function's body raises on invariant violation — that's how Lockstep detects the bug.

Strategy selection heuristic

Bug shapeStrategy
Tight read-modify-write on atomics/ETS:pos
Multi-process message-ordering races:pct (default)
Spin-loop / liveness scenarios:fair_pct
Unknown shape, exploring:random

For deep races (>5 sync points between cause and effect), bump iterations to 10_000+ and try multiple strategies.

Parsing Lockstep output

A Lockstep failure exits with code 1 and prints an exception block that's parseable. Key fields:

** (Lockstep.BugFound)
Lockstep found a concurrency bug on iteration <ITER>.
  seed:        <TOP_SEED>
  iter seed:   <ITER_SEED>
  strategy:    :<STRATEGY>
  trace path:  traces/<NAME>.lockstep

Reason:
  <REASON>

Schedule:
  step 1  <EVENT>
  ...
  step N  <EVENT>  <-- FAILED HERE

Regex for parsing (Python-flavoured):

import re
m = re.search(r"iteration (\d+).*seed:\s+(\d+).*strategy:\s+:(\w+).*trace path:\s+(\S+)",
              output, re.DOTALL)
if m:
    iteration, seed, strategy, trace_path = m.groups()

Common pitfalls when agents generate tests

  1. Forgetting to use Lockstep wrappers. Agent writes GenServer.start_link(...) and the GenServer is unmanaged. Lockstep can't schedule it.
  2. Pre-constructed bugs. Agent writes a test that always raises (no scheduling needed). The "bug found at iteration 1" is not informative — restructure to be a real race.
  3. Drain too short. Many test bodies need drain(100)+ ticks to let timers fire and processes settle.
  4. Bad invariant. Agent's invariant should fail ONLY in the bug case, not during normal flow.

Ground truth: what to trust

  • Lockstep's deterministic finding (same seed → same iteration → same trace) means the bug is REAL given the test setup.
  • Lockstep's negative result (10000 iterations, no bug found) means the test setup didn't surface a bug under the strategies used. It does NOT mean the code is bug-free.
  • <-- FAILED HERE marker points to the exact event where the assertion fired.
  • Causal slice (default on) shows ~5-20% of trace events causally related to the failure.

Skill bundle

The Anthropic Skill format file is at skill/lockstep/SKILL.md. Agents using the Anthropic Agent SDK can load this via Skill.load("lockstep").

CI integration

Recommended pattern for CI:

# .github/workflows/concurrency-tests.yml
- run: mix test --include lockstep --seed 1 --max-failures 1

The --seed 1 ensures the same iteration counts hit the same schedule across runs. --max-failures 1 stops on the first failure to keep the trace artifact small.

Save traces/ as an artifact so failures are debuggable.

Filing a bug found by an agent

When an AI agent finds a bug in third-party code via Lockstep, the filable artifact is:

  1. Lockstep test file
  2. Trace file (traces/<bug>.lockstep)
  3. Output of mix lockstep.replay --trace <path> (the human-readable schedule)
  4. Suggested fix + the mix lockstep.replay --run command that verifies the fix

Reproducibility is the value-add — the bug isn't "we saw it once", it's "seed N + strategy X + iteration M reproduces every time."