Lockstep (Lockstep v0.1.0)

Coyote-style controlled concurrency testing for BEAM.

See Lockstep.Test for the ExUnit integration. The functions in this module are the runtime API used inside controlled tests:

Lockstep.spawn(fn -> ... end)
Lockstep.send(target, message)
Lockstep.recv()

All three are sync points: each call hands control back to the controller, which picks the next process to run per the configured strategy.
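
A minimal sketch of how the three primitives combine, assuming the test body itself runs as a managed process (the names `:a`/`:b` are illustrative):

```elixir
parent = self()

# Two managed writers racing to deliver first. Each Lockstep call is
# a sync point, so the controller can interleave them either way.
Lockstep.spawn(fn -> Lockstep.send(parent, :a) end)
Lockstep.spawn(fn -> Lockstep.send(parent, :b) end)

# Delivery order depends on the schedule the strategy picked;
# across iterations both :a-then-:b and :b-then-:a are explored.
first = Lockstep.recv()
second = Lockstep.recv()
```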

Patterns and gotchas

Real-world usage of Lockstep — especially against distributed systems like Phoenix.PubSub, Phoenix.Tracker, or hand-rolled Raft-style protocols — surfaces a few patterns worth calling out up front.

Choosing a strategy

See Lockstep.Strategy for the full discussion. Short version: the default :pct strategy is best at finding partial-order races where any of several priority swaps works. For races that require the scheduler to consistently pick a specific proc over several consecutive sync points, :pos works better — PCT's priority shuffle under-explores those sequences. A real example from Lockstep's own test suite: the async-replication staleness bug in test/leader_follower_register_test.exs wasn't found in 100 PCT iterations but was found at iteration 1 under POS. Rule of thumb: if PCT can't find your race in ~50 iterations, try POS before increasing iterations.
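
A sketch of switching strategies, assuming ctest accepts :strategy and :iterations options (check Lockstep.Test.ctest/3 for the real signature; the option names here are illustrative):

```elixir
# Hypothetical option names (:strategy, :iterations).
ctest "stale read under async replication", strategy: :pos, iterations: 100 do
  # test body using Lockstep.spawn / Lockstep.send / Lockstep.recv
  :ok
end
```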

Driving timed protocols: explicit triggers, not send_after

Distributed protocols often have time-driven elements: election timeouts, heartbeats, retry timers. There's a temptation to model them in tests with Lockstep.send_after, hoping the strategy will explore "what if the timeout fires before the response arrives" schedules. This usually doesn't work as intended.

Lockstep's controller fires timers ONLY when every alive proc is blocked on recv_match. So timer fires get serialized with proc execution: timer 1 fires → recipient becomes ready → strategy picks it → it processes → returns to recv → all blocked again → timer 2 fires. Multiple timers at the same fire_at don't actually fire concurrently — they're explored one at a time.

For tests where you want to surface concurrent-trigger races, drive the triggering events explicitly via Lockstep.send from the test body. test/raft_election_test.exs uses this pattern successfully:

# Trigger an election on every node concurrently. They'll all
# become candidates for term 1 and race for the majority.
for {_id, pid} <- nodes do
  Lockstep.spawn(fn ->
    Lockstep.send(pid, :trigger_election)
  end)
end

This way the strategy interleaves the per-node election handlers freely — the very interleaving the bug needs.

Avoiding timer pile-up

When a long-running test does need timers (heartbeats, retries), every handler that schedules a fresh timer should cancel or invalidate the previous one. Naive code:

def handle_info(:tick, state) do
  Lockstep.send_after(self(), :tick, 100)  # leak!
  {:noreply, work(state)}
end

Each tick adds another pending timer to the queue. After a few hundred ticks, virtual-time advancement is dominated by trivial timer fires that consume max_steps budget. Two clean fixes:

  1. Cancel and re-schedule:

    def handle_info(:tick, state) do
      if state.timer, do: Lockstep.cancel_timer(state.timer)
      ref = Lockstep.send_after(self(), :tick, 100)
      {:noreply, %{state | timer: ref} |> work()}
    end
  2. Epoch-tagged messages (no need to cancel; stale fires are ignored on receipt):

    def handle_info({:tick, epoch}, %{epoch: epoch} = state) do
      new_epoch = epoch + 1
      Lockstep.send_after(self(), {:tick, new_epoch}, 100)
      {:noreply, %{state | epoch: new_epoch} |> work()}
    end
    
    def handle_info({:tick, _stale}, state), do: {:noreply, state}

The Raft demo uses pattern #2.

Multi-step sync chains

Operations like Lockstep.GenServer.call/2 go through several sync points: a monitor, a send, and a selective receive. A test that performs N GenServer calls therefore puts ~3N controller calls on each iteration's step budget. For long workloads or chatty libraries (Phoenix.Tracker is a notorious example — every heartbeat triggers ~10 sync points through Registry / persistent_term / ETS), max_steps may need to be set substantially higher than for simple race-hunt scenarios. Start with 5_000 for tight micro-races and scale to 50_000+ for full Phoenix integration tests.
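
A budget-sizing sketch, assuming ctest accepts a :max_steps option (the option name is taken from the prose above; the real ctest/3 signature may differ):

```elixir
# ~1_000 GenServer calls alone cost roughly 3_000 controller steps,
# before any library-internal sync points are counted.
ctest "tracker converges", max_steps: 50_000 do
  # long workload: many GenServer calls, each ~3 sync points
  :ok
end
```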

v0.1 → v0.5 progression

v0.1 supported bare send/receive/spawn only. v0.5 added GenServer, Task, Registry, Supervisor, GenStatem wrappers, virtual clock, monitors, links, trap_exit. v1.0 (current) adds distributed-cluster simulation (Lockstep.Cluster), per-node state isolation (Phase D), and Jepsen-level checker infrastructure (Lockstep.History, Lockstep.Checker.{Linearizable, SequentialConsistency, Causal}, Lockstep.Generator, Lockstep.Model.Register).


Functions

alive?(target_pid)

@spec alive?(pid()) :: boolean()

Same shape as Process.alive?/1 but consults the controller's view.

Returns true if target_pid is a managed process that has not yet exited under Lockstep's controller. For pids the controller doesn't know about (e.g., processes from outside the iteration), falls back to vanilla Process.alive?/1.

Calling alive?/1 is a sync point — the strategy may interleave another process between this check and any subsequent action. That's the point: TOCTOU bugs (if Process.alive?(pid), do: GenServer.call(pid, ...)) surface here.
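
A sketch of the TOCTOU window (the :work message is illustrative):

```elixir
# A short-lived managed process that may finish at any sync point.
target = Lockstep.spawn(fn -> :ok end)

if Lockstep.alive?(target) do
  # alive?/1 yielded to the controller above, so the strategy may
  # schedule `target` to run and exit before this send executes.
  Lockstep.send(target, :work)
end
```

Under enough iterations the strategy will explore both the "still alive" and "died between check and send" schedules.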

cancel_timer(ref)

@spec cancel_timer(reference()) :: non_neg_integer() | false

Cancel a previously scheduled timer. Returns the number of ms that remained on the timer if it was cancelled before firing, or false if the timer had already fired or never existed.
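
A usage sketch built from the documented return values:

```elixir
ref = Lockstep.send_after(self(), :timeout, 150)

case Lockstep.cancel_timer(ref) do
  # Already fired (or unknown ref): a stale :timeout may still be
  # sitting in the mailbox and should be drained or ignored.
  false -> :already_fired
  # Cancelled in time; ms_left is the remaining virtual-time budget.
  ms_left -> {:cancelled, ms_left}
end
```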

demonitor(ref, opts \\ [])

@spec demonitor(reference(), [atom()]) :: true

Stop monitoring. Same shape as Process.demonitor/2. The :flush option removes any already-delivered :DOWN for this ref from the caller's mailbox.

flag(flag, value)

@spec flag(atom(), any()) :: any()

Same shape as Process.flag/2. Currently only :trap_exit is modeled at the controller level; other flags are accepted and return a placeholder previous value but don't change semantics.

link(target)

@spec link(pid()) :: true

Same shape as Process.link/1. Establishes a bidirectional link. Linking a dead managed process delivers {:EXIT, target, :noproc} immediately (if trap_exit is on) or kills the caller (if not).

monitor(target_pid)

@spec monitor(
  pid()
  | atom()
  | {atom(), node()}
  | {:via, module(), term()}
  | {:global, term()}
) ::
  reference()

Monitor target_pid. Returns a reference; when target_pid exits, the calling process receives {:DOWN, ref, :process, target_pid, reason} in its controller-side mailbox. Same shape as Process.monitor/1, except delivery happens through Lockstep's mailbox (so it's observable via Lockstep.recv/recv_first) instead of BEAM's.

now()

@spec now() :: non_neg_integer()

Read the controller's virtual clock. Time only advances when the controller would otherwise deadlock (everyone blocked on receive); at that point virtual time jumps to the next pending timer's fire_at. Returns milliseconds since iteration start (0 at the first call).
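
For example:

```elixir
t0 = Lockstep.now()   # 0 at the first call of an iteration
Lockstep.sleep(100)   # yields until virtual time jumps past the timer
Lockstep.now() - t0   # at least 100 virtual ms; no wall-clock delay
```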

recv()

@spec recv() :: any()

Receive the next message in the calling process's controller-side mailbox. Blocks (in the controller) until the strategy picks this process to receive. No pattern matching: you get the next message in delivery order — for selective receive use recv_first/1.

recv_first(predicate)

@spec recv_first((any() -> boolean())) :: any()

Selective receive: scan the controller-side mailbox in delivery order and return the first message for which predicate returns true. Other messages stay in the mailbox in their original order.

Equivalent to BEAM's receive with a pattern, except the patterns are expressed as a predicate function:

msg = Lockstep.recv_first(fn
  {^ref, _reply} -> true
  _              -> false
end)

Blocks (in the controller) until a message matching predicate is available. Predicate failures (raising/throwing inside it) count as "no match" so a buggy predicate cannot trip the controller.

run(test_fun, opts)

Run a controlled test body N times. Used by Lockstep.Test.ctest/3; most users do not call this directly.

send(target, message)

@spec send(pid(), any()) :: :ok

Send a message to another managed process. The send is recorded by the controller and the message is queued in the controller-side mailbox of the target. Returns :ok.

send_after(target, message, delay_ms)

@spec send_after(pid(), any(), non_neg_integer()) :: reference()

Schedule message to be delivered to target after delay_ms milliseconds of virtual time. Returns a timer reference that can be passed to cancel_timer/1.

Same shape as Process.send_after/3. The timer fires when the controller advances virtual time, which happens automatically as soon as no managed process is ready and the next timer is the only way to make progress.

sleep(ms)

@spec sleep(non_neg_integer() | :infinity) :: :ok

Virtual-time sleep. Same shape as Process.sleep/1. Implemented as send_after(self(), sentinel, ms) followed by recv_first waiting for the sentinel — the controller advances virtual time forward to fire the timer, which yields control to other managed processes while we "sleep."

spawn(fun)

@spec spawn((-> any())) :: pid()

Spawn a new managed process. The function runs under the controller.

spawn_link(fun)

@spec spawn_link((-> any())) :: pid()

Spawn a managed child process and link to it. Same as Lockstep.spawn/1 followed by Lockstep.link/1, but atomic — there's no window where the child has been spawned but not yet linked.

When the linked process exits abnormally, the caller dies too unless it has set flag(:trap_exit, true). Trapping converts the death into a {:EXIT, child, reason} message in the caller's mailbox.
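
Putting the two behaviors together (the :boom reason is illustrative):

```elixir
# Trap exits so the abnormal child death becomes a message
# instead of killing the caller.
Lockstep.flag(:trap_exit, true)

child = Lockstep.spawn_link(fn -> exit(:boom) end)

# The exit arrives in the controller-side mailbox as an :EXIT tuple.
{:EXIT, ^child, :boom} =
  Lockstep.recv_first(fn
    {:EXIT, _pid, _reason} -> true
    _other -> false
  end)
```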

unlink(target)

@spec unlink(pid()) :: true

Same shape as Process.unlink/1.