Coyote-style controlled concurrency testing for BEAM.
See Lockstep.Test for the ExUnit integration. The functions in this
module are the runtime API used inside controlled tests:
    Lockstep.spawn(fn -> ... end)
    Lockstep.send(target, message)
    Lockstep.recv()

All three are sync points: each call hands control back to the controller, which picks the next process to run per the configured strategy.
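A hypothetical sketch of the three calls working together inside a controlled test body (`parent` and `child` are illustrative names, not part of the API):

```elixir
# Two managed processes exchanging a message. Each Lockstep call is a
# sync point, so the controller may interleave other processes between them.
parent = self()

child =
  Lockstep.spawn(fn ->
    msg = Lockstep.recv()               # blocks until the strategy picks us
    Lockstep.send(parent, {:echo, msg}) # reply goes through the controller
  end)

Lockstep.send(child, :ping)
{:echo, :ping} = Lockstep.recv()
```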
Patterns and gotchas
Real-world usage of Lockstep — especially against distributed systems like Phoenix.PubSub, Phoenix.Tracker, or hand-rolled Raft-style protocols — surfaces a few patterns worth calling out up front.
Choosing a strategy
See Lockstep.Strategy for the full discussion. Short version: the
default :pct strategy is best at finding partial-order races where
any of several priority swaps work. For races that require the
scheduler to consistently pick a specific proc over several
consecutive sync points, :pos works better — PCT's priority
shuffle under-explores those sequences. Real example from Lockstep's
own test suite: test/leader_follower_register_test.exs async-
replication staleness wasn't found in 100 PCT iterations but was
found at iteration 1 under POS. Rule of thumb: if PCT can't find
your race in ~50 iterations, try POS before increasing iterations.
Driving timed protocols: explicit triggers, not send_after
Distributed protocols often have time-driven elements: election
timeouts, heartbeats, retry timers. There's a temptation to model
them in tests with Lockstep.send_after, hoping the strategy will
explore "what if the timeout fires before the response arrives"
schedules. This usually doesn't work as intended.
Lockstep's controller fires timers ONLY when every alive proc is
blocked on recv_match. So timer fires get serialized with proc
execution: timer 1 fires → recipient becomes ready → strategy picks
it → it processes → returns to recv → all blocked again → timer 2
fires. Multiple timers at the same fire_at don't actually fire
concurrently — they're explored one at a time.
For tests where you want to surface concurrent-trigger races,
drive the triggering events explicitly via Lockstep.send from
the test body. test/raft_election_test.exs uses this pattern
successfully:
    # Trigger an election on every node concurrently. They'll all
    # become candidates for term 1 and race for the majority.
    for {_id, pid} <- nodes do
      Lockstep.spawn(fn ->
        Lockstep.send(pid, :trigger_election)
      end)
    end

This way the strategy interleaves the per-node election handlers freely — the very interleaving the bug needs.
Avoiding timer pile-up
When a long-running test does need timers (heartbeats, retries), every handler that schedules a fresh timer should cancel or invalidate the previous one. Naive code:
    def handle_info(:tick, state) do
      Lockstep.send_after(self(), :tick, 100)  # leak!
      {:noreply, work(state)}
    end

Each tick adds another pending timer to the queue. After a few
hundred ticks, virtual-time advancement is dominated by trivial
timer fires that consume max_steps budget. Two clean fixes:
Cancel and re-schedule:
    def handle_info(:tick, state) do
      if state.timer, do: Lockstep.cancel_timer(state.timer)
      ref = Lockstep.send_after(self(), :tick, 100)
      {:noreply, %{state | timer: ref} |> work()}
    end

Epoch-tagged messages (no need to cancel; stale fires are ignored on receipt):
    def handle_info({:tick, epoch}, %{epoch: epoch} = state) do
      new_epoch = epoch + 1
      Lockstep.send_after(self(), {:tick, new_epoch}, 100)
      {:noreply, %{state | epoch: new_epoch} |> work()}
    end

    def handle_info({:tick, _stale}, state), do: {:noreply, state}
The Raft demo uses pattern #2.
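The epoch check itself is ordinary selective receive, so its filtering behavior can be sketched and run without Lockstep at all (plain send/receive stand in for the controller here; `EpochTick` is an illustrative module, not part of the library):

```elixir
defmodule EpochTick do
  # A stale tick carries an old epoch and is dropped on receipt;
  # only the current-epoch tick counts as a fire.
  def run do
    send(self(), {:tick, 1}) # stale timer from a previous epoch
    send(self(), {:tick, 2}) # current timer
    count_fires(2, 0)
  end

  defp count_fires(epoch, fired) do
    receive do
      {:tick, ^epoch} -> count_fires(epoch, fired + 1)
      {:tick, _stale} -> count_fires(epoch, fired)
    after
      0 -> fired
    end
  end
end

EpochTick.run() # → 1: the stale tick is silently dropped
```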
Multi-step sync chains
Operations like Lockstep.GenServer.call/2 go through several sync
points: a monitor, a send, and a selective receive. A test that
performs N gen_server calls puts ~3N controller calls on each iteration's
step budget. For long workloads or chatty libraries (Phoenix.Tracker
is a notorious example — every heartbeat triggers ~10 sync points
through Registry / persistent_term / ETS), max_steps may need to
be set substantially higher than for simple race-hunt scenarios.
Start with 5_000 for tight micro-races and scale to 50_000+ for
full Phoenix integration tests.
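Assuming ctest accepts per-test options (the option names below are an assumption — check Lockstep.Test.ctest/3 for the real signature), that guidance might look like:

```elixir
# Tight micro-race: a small budget keeps iterations fast.
ctest "two writers race on a register", max_steps: 5_000 do
  # ...
end

# Chatty integration test (Phoenix.Tracker-style): each heartbeat burns
# ~10 sync points, so give the iteration much more headroom.
ctest "tracker heartbeats converge", max_steps: 50_000 do
  # ...
end
```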
v0.1 → v1.0 progression
v0.1 supported bare send/receive/spawn only. v0.5 added
GenServer, Task, Registry, Supervisor, GenStatem wrappers, virtual
clock, monitors, links, trap_exit. v1.0 (current) adds distributed-
cluster simulation (Lockstep.Cluster), per-node state isolation
(Phase D), and Jepsen-level checker infrastructure
(Lockstep.History, Lockstep.Checker.{Linearizable, SequentialConsistency, Causal}, Lockstep.Generator,
Lockstep.Model.Register).
Summary
Functions
Same shape as Process.alive?/1 but consults the controller's view.
Cancel a previously scheduled timer. Returns the number of ms that
remained on the timer if it was cancelled before firing, or false
if the timer had already fired or never existed.
Stop monitoring. Same shape as Process.demonitor/2. The :flush
option removes any already-delivered :DOWN for this ref from the
caller's mailbox.
Same shape as Process.flag/2. Currently only :trap_exit is
modeled at the controller level; other flags are accepted and
return a placeholder previous value but don't change semantics.
Same shape as Process.link/1. Establishes a bidirectional link.
Linking a dead managed process delivers {:EXIT, target, :noproc}
immediately (if trap_exit is on) or kills the caller (if not).
Monitor target_pid. Returns a reference; when target_pid exits,
the calling process receives {:DOWN, ref, :process, target_pid, reason}
in its controller-side mailbox. Same shape as Process.monitor/1,
except delivery happens through Lockstep's mailbox (so it's
observable via Lockstep.recv/recv_first) instead of BEAM's.
Read the controller's virtual clock. Time only advances when the controller would otherwise deadlock (everyone blocked on receive); at that point virtual time jumps to the next pending timer's fire_at. Returns milliseconds since iteration start (0 at the first call).
Receive the next message in the calling process's controller-side
mailbox. Blocks (in the controller) until the strategy picks this
process to receive. No pattern matching: you get the next message in
delivery order — for selective receive use recv_first/1.
Selective receive: scan the controller-side mailbox in delivery order
and return the first message for which predicate returns true.
Other messages stay in the mailbox in their original order.
Run a controlled test body N times. Used by Lockstep.Test.ctest/3;
most users do not call this directly.
Send a message to another managed process. The send is recorded by the
controller and the message is queued in the controller-side mailbox of
the target. Returns :ok.
Schedule message to be delivered to target after delay_ms
milliseconds of virtual time. Returns a timer reference that can be
passed to cancel_timer/1.
Virtual-time sleep. Same shape as Process.sleep/1. Implemented as
send_after(self(), sentinel, ms) followed by recv_first waiting
for the sentinel — the controller advances virtual time forward to
fire the timer, which yields control to other managed processes
while we "sleep."
Spawn a new managed process. The function runs under the controller.
Spawn a managed child process and link to it. Same as Lockstep.spawn/1
followed by Lockstep.link/1, but atomic — there's no window where
the child has been spawned but not yet linked.
Same shape as Process.unlink/1.
Functions
Same shape as Process.alive?/1 but consults the controller's view.
Returns true if target_pid is a managed process that has not yet
exited under Lockstep's controller. For pids the controller doesn't
know about (e.g., processes from outside the iteration), falls back
to vanilla Process.alive?/1.
Calling alive?/1 is a sync point — the strategy may interleave
another process between this check and any subsequent action. That's
the point: TOCTOU bugs (if Process.alive?(pid), do: GenServer.call(pid, ...))
surface here.
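A minimal sketch of that TOCTOU shape (`worker` is a hypothetical managed pid):

```elixir
if Lockstep.alive?(worker) do
  # alive?/1 was a sync point, so the strategy may have scheduled
  # `worker` and let it exit between the check and this send.
  Lockstep.send(worker, :work)
end
```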
@spec cancel_timer(reference()) :: non_neg_integer() | false
Cancel a previously scheduled timer. Returns the number of ms that
remained on the timer if it was cancelled before firing, or false
if the timer had already fired or never existed.
Stop monitoring. Same shape as Process.demonitor/2. The :flush
option removes any already-delivered :DOWN for this ref from the
caller's mailbox.
Same shape as Process.flag/2. Currently only :trap_exit is
modeled at the controller level; other flags are accepted and
return a placeholder previous value but don't change semantics.
@spec link(pid()) :: true
Same shape as Process.link/1. Establishes a bidirectional link.
Linking a dead managed process delivers {:EXIT, target, :noproc}
immediately (if trap_exit is on) or kills the caller (if not).
@spec monitor(pid() | atom() | {atom(), node()} | {:via, module(), term()} | {:global, term()}) :: reference()
Monitor target_pid. Returns a reference; when target_pid exits,
the calling process receives {:DOWN, ref, :process, target_pid, reason}
in its controller-side mailbox. Same shape as Process.monitor/1,
except delivery happens through Lockstep's mailbox (so it's
observable via Lockstep.recv/recv_first) instead of BEAM's.
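For example, waiting for a managed child to exit and observing the :DOWN via recv_first/1 (a hedged sketch; the spawned body is illustrative):

```elixir
child = Lockstep.spawn(fn -> :ok end) # exits normally once scheduled
ref = Lockstep.monitor(child)

# The :DOWN lands in the controller-side mailbox, so we can
# selectively receive it like any other message.
{:DOWN, ^ref, :process, _pid, _reason} =
  Lockstep.recv_first(fn
    {:DOWN, ^ref, :process, _pid, _reason} -> true
    _ -> false
  end)
```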
@spec now() :: non_neg_integer()
Read the controller's virtual clock. Time only advances when the controller would otherwise deadlock (everyone blocked on receive); at that point virtual time jumps to the next pending timer's fire_at. Returns milliseconds since iteration start (0 at the first call).
@spec recv() :: any()
Receive the next message in the calling process's controller-side
mailbox. Blocks (in the controller) until the strategy picks this
process to receive. No pattern matching: you get the next message in
delivery order — for selective receive use recv_first/1.
Selective receive: scan the controller-side mailbox in delivery order
and return the first message for which predicate returns true.
Other messages stay in the mailbox in their original order.
Equivalent to BEAM's receive with a pattern, except the patterns are
expressed as a predicate function:
    msg = Lockstep.recv_first(fn
      {^ref, _reply} -> true
      _ -> false
    end)

Blocks (in the controller) until a message matching predicate is
available. Predicate failures (raising/throwing inside it) count as
"no match" so a buggy predicate cannot trip the controller.
Run a controlled test body N times. Used by Lockstep.Test.ctest/3;
most users do not call this directly.
Send a message to another managed process. The send is recorded by the
controller and the message is queued in the controller-side mailbox of
the target. Returns :ok.
@spec send_after(pid(), any(), non_neg_integer()) :: reference()
Schedule message to be delivered to target after delay_ms
milliseconds of virtual time. Returns a timer reference that can be
passed to cancel_timer/1.
Same shape as Process.send_after/3. The timer fires when the
controller advances virtual time, which happens automatically as soon
as no managed process is ready and the next timer is the only way to
make progress.
@spec sleep(non_neg_integer() | :infinity) :: :ok
Virtual-time sleep. Same shape as Process.sleep/1. Implemented as
send_after(self(), sentinel, ms) followed by recv_first waiting
for the sentinel — the controller advances virtual time forward to
fire the timer, which yields control to other managed processes
while we "sleep."
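In other words, sleep/1 could be approximated with the public API like this (a sketch, not the actual implementation; the sentinel message shape is an assumption):

```elixir
defmodule SleepSketch do
  def sleep(ms) do
    sentinel = make_ref()
    # Schedule a wake-up in virtual time, then block on it. While we're
    # blocked, the controller is free to run other managed processes.
    Lockstep.send_after(self(), {:wake, sentinel}, ms)

    Lockstep.recv_first(fn
      {:wake, ^sentinel} -> true
      _ -> false
    end)

    :ok
  end
end
```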
Spawn a new managed process. The function runs under the controller.
Spawn a managed child process and link to it. Same as Lockstep.spawn/1
followed by Lockstep.link/1, but atomic — there's no window where
the child has been spawned but not yet linked.
When the linked process exits abnormally, the caller dies too unless
it has set flag(:trap_exit, true). Trapping converts the death into
a {:EXIT, child, reason} message in the caller's mailbox.
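A hedged sketch of that trap_exit interaction (the child body is illustrative):

```elixir
Lockstep.flag(:trap_exit, true)

child = Lockstep.spawn_link(fn -> exit(:boom) end)

# With trap_exit on, the abnormal exit arrives as a message
# instead of killing us.
{:EXIT, ^child, :boom} = Lockstep.recv()
```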
@spec unlink(pid()) :: true
Same shape as Process.unlink/1.