Lockstep testing methodology — the playbook

How to apply Lockstep to a real BEAM library to find (or rule out) concurrency bugs. This is the field guide we wrote while testing Cachex, ConCache, lud/mutex, nimble_pool, and Hammer.

The bug-finding rate so far: 1 in 5 libraries. Hammer's Hammer.Atomic.FixWindow.inc/4 had a real lost-update race against hit/5. The other four are concurrency-correct under their public APIs — which is itself useful information.

This doc is the recipe for getting to that signal.

Contents

  1. When to reach for Lockstep
  2. Two paths: compile-rewrite vs test-only dep
  3. The model-based linearizability template
  4. Writing the model
  5. Picking a strategy
  6. Iteration counts and tuning
  7. Beyond linearizability: fault injection, time-skew
  8. What Lockstep can and cannot reach
  9. Anti-patterns
  10. Case studies

When to reach for Lockstep

Lockstep makes sense when you're testing code where:

  • Multiple processes interact through shared mutable state (ETS, atomics, persistent_term, registered names).
  • You need a reproducible failing schedule — vanilla concurrent ExUnit tests find races but throw away the trace.
  • You suspect a race exists at non-trivial depth (3+ scheduling decisions away from a clear "first send wins").

It's overkill when:

  • The code is sequential or has only one writer.
  • You just want stress testing — Task.async_stream in a loop is cheaper.
  • The race is already easy to catch (tight :erlang.spawn loops catch lost-updates 99% of the time, see the README's "Why bother").
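
When cheap stress is all you need, a stdlib-only read-modify-write loop usually demonstrates the point. A minimal sketch (generic, not tied to any library above): 50 tasks each do 200 unsynchronized lookup-then-insert cycles on one ETS key, so interleavings between the read and the write drop increments.

```elixir
# 50 tasks x 200 racy read-modify-write cycles on one ETS key.
# Lost updates show up as a final count below the expected 10_000.
:ets.new(:stress, [:named_table, :public])
:ets.insert(:stress, {:count, 0})

1..50
|> Task.async_stream(
  fn _ ->
    for _ <- 1..200 do
      [{:count, n}] = :ets.lookup(:stress, :count)   # read
      :ets.insert(:stress, {:count, n + 1})          # modify-write (racy)
    end
  end,
  max_concurrency: 50,
  ordered: false
)
|> Stream.run()

[{:count, final}] = :ets.lookup(:stress, :count)
IO.puts("expected 10000, got #{final}")
```

Swapping the lookup/insert pair for `:ets.update_counter/3` makes the same workload atomic, which is the usual fix for this class of bug.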

Two paths: compile-rewrite vs test-only dep

There are two ways to point Lockstep at a library, with very different tradeoffs.

Path A: test-only Hex dep (workers controlled, library raw)

Add the library to mix.exs as only: :test. Workers run as Lockstep.Task.async/1 (controlled), call into the library, and record into Lockstep.History. The library itself runs un-rewritten on the real BEAM scheduler.
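
The dep entry itself is ordinary; a fragment for mix.exs (package name and version requirement are illustrative):

```elixir
# mix.exs -- the library under test is only fetched for the :test env,
# so it never ships with your application.
defp deps do
  [
    {:cachex, "~> 3.6", only: :test}
  ]
end
```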

You get:

  • Real-world concurrent behavior of the library (real scheduler).
  • Lockstep manages worker scheduling and history recording.
  • Linearizability checking against your model.

You don't get:

  • PCT scheduling into the library's GenServer mailbox dispatch. Internal races aren't actively explored — you're hoping the BEAM scheduler hits them naturally.

Use this when the library is too heavy to compile-rewrite (heavy macros, many transitive deps, kernel-level features). Cachex and Phoenix.PubSub fit here.

Path B: compile-rewrite via Lockstep.MixCompiler (full pipeline)

Clone the source, run it through Lockstep.MixCompiler.compile/1, and Code.compile_file/1 the rewritten output. The library's internal :ets, :atomics, Process.send_after, GenServer.call, Process.monitor, and friends all become Lockstep sync points.

You get:

  • PCT (or any other strategy) scheduling into the library's internals — handle_call dispatch ordering, monitor cleanup ordering, message arrival sequencing.
  • A genuinely stronger correctness signal: bugs at depth 3-6 inside the SUT are findable.

You pay for it with:

  • Source must be cloneable and compilable. Heavy macro libraries (Cachex's record-style state, Hammer's __before_compile__) may need partial workarounds.
  • Some OTP idioms aren't reachable: {:via, Registry, ...} was blocked until we added rewriting for the via tuple itself; :"$ancestors" propagation is also needed for libs that read Process.get(:"$ancestors").

Use this when the library is small-medium (≤ 2k LOC, few transitive deps) and you actually want to find concurrency bugs in its logic.

The model-based linearizability template

The reusable shape for any read/write API:

defmodule Lockstep.MyLibTest do
  use ExUnit.Case, async: false

  setup_all do
    # Compile through Lockstep.MixCompiler if you want path B.
    # Otherwise add the lib as a test dep and skip this.
    ...
  end

  defmodule MyModel do
    @behaviour Lockstep.Model
    alias Lockstep.History.Event

    @impl true
    def init, do: %{}  # whatever shape your reference state is

    @impl true
    def step(state, %Event{f: :get, value: k}, %Event{type: :ok, value: v}) do
      if Map.get(state, k) == v do
        {:ok, state}
      else
        {:error, "stale read"}
      end
    end

    def step(state, %Event{f: :put, value: {k, v}}, %Event{type: :ok}) do
      {:ok, Map.put(state, k, v)}
    end

    # ... one clause per (op kind, completion type)

    # Generic completion catch-alls (always include these):
    def step(state, _, %Event{type: :fail}), do: {:ok, state}
    def step(state, _, %Event{type: :info}), do: {:ok, state}
  end

  test "linearizability" do
    Lockstep.Runner.run(
      &body/0,
      iterations: 200,
      strategy: :pct,
      max_steps: 5_000,
      seed: 1,
      iter_timeout: 30_000,
      progress: 25,
      suite: "mylib_lin"
    )
  end

  defp body do
    # 1. Set up a fresh SUT instance (unique name per iteration).
    {:ok, sut} = MyLib.start_link(...)

    # 2. Start a History recorder.
    history = Lockstep.History.start_link!()

    # 3. Spawn workers, each pulling from a Generator.
    n_workers = 4
    ops_per_worker = 6

    workers =
      for w <- 1..n_workers do
        Lockstep.Task.async(fn ->
          gen =
            Lockstep.Generator.weighted(
              [
                {{:get, "k"}, 4},
                {{:put, "k", w * 100}, 2},
                # ... your op pool ...
              ],
              seed: w * 1_000_003
            )
            |> Lockstep.Generator.take(ops_per_worker)

          Lockstep.Generator.each(gen, fn op -> apply_op(history, sut, op) end)
        end)
      end

    Lockstep.Task.await_many(workers, 30_000)

    # 4. Check the history against the model.
    events = Lockstep.History.events(history)

    case Lockstep.Checker.Linearizable.check(events, MyModel) do
      {:ok, _} -> :ok
      {:error, info} ->
        msg = Lockstep.Checker.Linearizable.format_violation(info)
        raise "non-linearizable history found:\n\n#{msg}"
    end
  end

  defp apply_op(history, sut, {:get, k}) do
    Lockstep.History.op(history, :get, k, fn -> MyLib.get(sut, k) end)
  end

  defp apply_op(history, sut, {:put, k, v}) do
    Lockstep.History.op(history, :put, {k, v}, fn -> MyLib.put(sut, k, v) end)
  end
end

Lockstep.History.op/4 records an :invoke before the function runs, an :ok (carrying the return value) on normal return, or an :info (carrying the exception message) on raise. Lockstep.Checker.Linearizable then searches for a serial order that's consistent with both real-time precedence and the model.

Writing the model

A model is a sequential reference implementation. Its job is to answer: "given the recorded operations applied in some serial order, is this completion consistent?"

One clause per (op kind, completion shape) pair

def step(state, %Event{f: :put, value: {k, v}}, %Event{type: :ok, value: :ok}) do
  {:ok, Map.put(state, k, v)}
end

def step(state, %Event{f: :insert_new, value: {k, v}}, %Event{type: :ok, value: :ok}) do
  if Map.has_key?(state, k) do
    {:error, "insert_new returned :ok but key already exists"}
  else
    {:ok, Map.put(state, k, v)}
  end
end

def step(state, %Event{f: :insert_new, value: {k, _v}}, %Event{type: :ok, value: {:error, :already_exists}}) do
  if Map.has_key?(state, k) do
    {:ok, state}
  else
    {:error, "insert_new returned :already_exists but key didn't exist"}
  end
end

Each clause encodes both directions: how the state updates when the completion is consistent, and what proves it inconsistent. The linearizability checker uses these verdicts to backtrack through candidate serial orderings.

Always include the catch-alls

def step(state, _invoke, %Event{type: :fail}), do: {:ok, state}
def step(state, _invoke, %Event{type: :info}), do: {:ok, state}

:fail is an explicit "didn't apply" — the op had no effect on state. :info is an unknown outcome (the worker crashed mid-call) — for now, treat as no-apply, which is the conservative default. (A more sophisticated checker would branch and try both.)

Be careful with return values

Probe the library's actual return shapes before writing the model. We had several "compile fails" or "function clause" errors traced back to incorrect assumptions:

  • Cachex.get/2 returns {:ok, value}, not value | nil.
  • Cachex.get_and_update/3 returns {:commit, new_v} directly, not {:ok, {:commit, new_v}}.
  • Mutex.lock/2 returns {:ok, %Mutex.Lock{}} or {:error, :busy}.
  • Mutex.await/3 returns %Mutex.Lock{} directly (or raises).

Run a quick mix run --eval 'IO.inspect(MyLib.foo(...))' first.

State scope: single-key first, multi-key second

Start with all ops on the same key. The contention is highest, the model is simplest, and most race patterns surface there. Once that passes, expand to multi-key — the model is just %{key => value} instead of value.

Picking a strategy

Lockstep ships five strategies. Rule of thumb:

  • :pct (default): classic data races (lost-update, TOCTOU) at depth ≤ 3. Skip it when you need long runs of one process picked consecutively (POS is better).
  • :pos: bugs that need the strategy to consistently pick the same process over many sync points (e.g., async replication patterns). Skip it when PCT's depth-bounded probability is what you need.
  • :fair_pct: long-running tests where infrastructure (heartbeats, timers) would otherwise dominate. Skip it when you need full PCT for the whole run.
  • :idpct: unknown bug depth plus many iterations (≥ 1000); cycles depth in [d_min, d_max] per iteration. Skip it when the iteration count is small and you want predictable depth.
  • :random: sanity check or cheap stress. Skip it when you want adversarial scheduling.

Workflow: start with :pct (default depth 3). If 200-500 iterations don't surface the bug, try :pos. If neither, switch to :idpct and run thousands of iterations. If still nothing, the suspected bug probably doesn't exist.

The Hammer bug was found by :pct at depth 3 with 5000 iterations on a 4-worker 1 inc + 3 hit workload. ConCache's lock manager and nimble_pool's checkout queue have been tested to ≥100 iterations of mixed workloads under :pct with no findings.

Iteration counts and tuning

PCT's bug-finding probability per iteration is at least 1 / (n · k^(d-1)), where n is the process count, k is the step count, and d is the bug depth.

A rough sizing table for a typical 5-worker test:

  • Depth 2: ~1% per iteration, ~500 iterations for 99% confidence.
  • Depth 3: ~0.05% per iteration, ~10,000 iterations.
  • Depth 4: ~0.0025% per iteration, ~200,000 iterations.
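
The sizing numbers follow directly from the bound. A quick stdlib-only check, assuming 5 workers and roughly 20 sync points per iteration (those two inputs are the assumption that reproduces the table; they aren't Lockstep defaults):

```elixir
# Per-iteration probability from the PCT bound, and the iteration count
# needed for 1 - (1 - p)^i >= confidence.
pct_prob = fn procs, steps, depth ->
  1.0 / (procs * :math.pow(steps, depth - 1))
end

iters_for = fn p, confidence ->
  ceil(:math.log(1.0 - confidence) / :math.log(1.0 - p))
end

for depth <- 2..4 do
  p = pct_prob.(5, 20, depth)
  IO.puts("depth #{depth}: #{p * 100}% per iter, ~#{iters_for.(p, 0.99)} iters")
end
```

Depth 2 works out to exactly 1/(5 · 20) = 1% per iteration; the confidence column is just the geometric-trial inversion, which is why the table's numbers are round approximations.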

In practice:

  • Start at 100 iterations during development. If your test is flaky, fix the test before scaling iterations.
  • Push to 1,000 iterations for a real correctness check.
  • For confidence at depth 3+, 5,000-10,000 iterations. Each takes ~1ms wall-time for typical workloads, so 10k is ~10s.
  • If you've gone past 50,000 and still nothing, the bug almost certainly isn't in your concurrency surface. Move on.

Beyond linearizability: fault injection, time-skew

The Linearizability template covers ~80% of useful tests. The remaining 20% need targeted directed tests.

Fault injection (kill a process holding a resource)

defp fault_body do
  {:ok, mutex} = Mutex.start_link([])
  parent = self()
  ref = make_ref()

  # Worker A: acquires the lock, signals, then exits without releasing.
  Lockstep.spawn(fn ->
    {:ok, _lock} = Mutex.lock(mutex, "k")
    Lockstep.send(parent, {ref, :a_acquired})
    # No release. fn returns → A exits :normal → :DOWN fires.
  end)

  Lockstep.recv_first(fn {^ref, :a_acquired} -> true; _ -> false end)

  # B (us) awaits — should block until A's :DOWN, then acquire.
  lock = Mutex.await(mutex, "k", :infinity)
  :ok = Mutex.release(mutex, lock)
end

Lockstep's controller delivers :DOWN automatically when a managed process exits, AND treats monitor-observed deaths as expected (no :bug flag). PCT explores orderings of: A's exit, the :DOWN delivery to B, B's pending await recv, B's retry.

Time-skew (force a timer to fire mid-operation)

Lockstep's virtual time advances when no process is runnable. Lockstep.sleep/1 parks the calling process and lets virtual time tick. Combined with aggressive timer config in the SUT (e.g., expiry: 0, cleanup_interval: 1), you can force timer events to interleave with concurrent ops:

defp time_skew_body do
  # Example: a circuit-breaker style state machine with periodic cleanup.
  {:ok, _} = MyBreaker.start_link(cleanup_interval: 1)
  {:ok, _} = MyBreaker.register("api", max_attempts: 3, expiry: 0)

  # Auto-open with 3 failures.
  MyBreaker.failure("api")
  MyBreaker.failure("api")
  MyBreaker.failure("api")

  # Concurrent: cleanup tick fires + worker does another failure.
  Lockstep.spawn(fn -> MyBreaker.failure("api") end)
  Lockstep.sleep(10)  # advances virtual time, fires cleanup tick

  # Final state must be valid (state machine consistent under interleaving).
  {:ok, info} = MyBreaker.status("api")
  if info.state not in [:open, :half_open], do: raise(...)
end

Note: System.system_time/1 is not rewritten by Lockstep. Real wall-clock time still ticks. If your library's expiry comparison uses System.system_time, you need very low expiry values for the comparison to fire under fast iterations.

What Lockstep can and cannot reach

Reaches

  • All Lockstep.* wrapped primitives: :ets.*, :atomics.*, :persistent_term.*, GenServer.{call, cast}, Process.{monitor, demonitor, send_after, link, unlink, flag, alive?}, spawn, spawn_link, Lockstep.send, Lockstep.recv*.
  • OTP {:via, Registry, ...} registration (rewriter transforms the via tuple to Lockstep.Registry).
  • Process.get(:"$ancestors") / :"$ancestors" is propagated on every spawn.

Doesn't reach

  • BEAM kernel messages: :ETS-TRANSFER (heir mechanism), :nodedown, port messages. These bypass the controller's mailbox. Libraries that depend on these (e.g., Eternal's ETS HEIR) won't work cleanly under Lockstep without additional wrapping.
  • Macro-generated code in user modules. The rewriter operates on source files passed to Lockstep.MixCompiler. If a library uses __before_compile__ to inject functions into the user's module (Hammer does this), the user's module isn't rewritten by default. Workaround: bypass the macro and call the underlying algorithm modules directly (which are rewritten).
  • System.system_time/1, :erlang.system_time/1, :os.timestamp/0. Real wall-clock is real wall-clock. If a library's correctness depends on time progression, it interacts oddly with Lockstep's fast iteration loop.
  • NIFs other than the wrapped set. If a lib calls into a custom NIF, Lockstep treats it as opaque.

Library-side compile quirks we've hit

  • Compile-time File.read!("README.md"): nimble_pool's @moduledoc reads README at compile time. Rewritten file lives in a different cwd, so File.read! fails. Pre-process the rewritten source to replace the line.
  • Macro-heavy state records: Cachex uses Erlang record syntax via macros. Compiling all of Cachex through Lockstep.MixCompiler is a rabbit hole; we use Path A (test-only Hex dep) instead.
  • Compile-time Code.ensure_loaded!: some libraries assert another module is loaded at compile time. The rewritten file is loaded fresh, so this can fail unless dependencies load first. Order rewritten files by leaf-to-root in your setup_all.

Anti-patterns

Strict equality assertions where the API returns floats

We've seen rate-limiter smoke tests assert {:allow, 3.0} = ... — which fails ~25% of the time because (now - last) / interval * refill produces 3.0000001 if even a microsecond of real time elapsed. Use assert_in_delta or is_number/1 for float-shaped returns.
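
A tolerant comparison as a stdlib-only sketch (inside an ExUnit test you'd reach for assert_in_delta/3 with the same tolerance; the reply tuple here is a stand-in, not a real library's return):

```elixir
# Stand-in for a rate-limiter reply whose float drifted with elapsed time.
reply = {:allow, 3.0 + 1.0e-7}

# The strict match {:allow, 3.0} = reply would fail on the drifted float.
{:allow, remaining} = reply
true = is_number(remaining)
true = abs(remaining - 3.0) < 1.0e-3
```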

Single-iteration tests for known-deep bugs

Running with iterations: 1 means PCT picks one schedule and stops. That's fine for a directed test of a specific known interleaving (e.g., shrinking output) but not for hunting unknown races. Default to iterations: 200 minimum.

Ignoring :info completions in the model

The catch-all def step(state, _, %Event{type: :info}), do: {:ok, state} treats unknown outcomes as no-applies. This is fine for correctness (conservative), but if you want to detect bugs that manifest only under crashed operations (e.g., the operation applied but the caller never saw the response), you need to branch.

Sharing state across iterations

If you call Cachex.start_link(:my_cache, ...) once in setup_all, every iteration writes to the same cache and state leaks between runs. Use a per-iteration unique name: name = String.to_atom("c_#{:erlang.unique_integer([:positive])}").

Case studies

Cachex (1.5M downloads): linearizability confirmed

Test: 200 iterations × 6 workers × 8 mixed ops + 500 iterations of get_and_update R-M-W + 200 iterations multi-key. All linearizable. Compiled un-rewritten (Path A) due to 5 transitive deps. Cachex's basic ops are ETS-atomic; get_and_update serializes through a per-cache locksmith Queue GenServer.

lud/mutex: linearizability + fault-injection both pass

Test: 200 iterations linearizability on 5 workers × 4 lock/release cycles + 100 iterations fault injection (kill holder, verify monitor cleanup). Full pipeline. The fault test exercises the Process.monitor cleanup path — Lockstep delivers :DOWN automatically when a managed process exits.

ConCache (~600k downloads): linearizability confirmed

Test: 100 iterations × 4 workers × 6 mixed ops on single key, with PCT scheduling INTO ConCache.Lock's GenServer (the lock manager + waiter queue + monitor-based cleanup). All linearizable. Required two Lockstep core fixes:

  1. AST rewriter handles {:via, Registry, term} → {:via, Lockstep.Registry, term} (OTP's :via dispatch otherwise routes to the real Registry, which isn't started under Lockstep).
  2. Lockstep.spawn propagates :"$ancestors" (ConCache.Owner reads it via hd(Process.get(:"$ancestors")) to find its parent).

nimble_pool (used by Finch, Postgrex): clean

Test: 100 iterations × 5 client tasks × 3 checkout cycles on pool_size 2. Full pipeline. All cycles complete. Tests the checkout! queueing, Process.monitor on clients, and worker selection logic.

Hammer (rate limiter): bug found

Test: 4-worker 1 inc + 3 hit workload, 5000 iterations under PCT. Found a real lost-update race in Hammer.Atomic.FixWindow.inc/4:

case :ets.lookup(table, full_key) do
  [{_, atomic}] -> ...
  [] ->
    :ets.insert(table, {full_key, :atomics.new(2, signed: false)})  # OVERWRITES
    inc(table, key, scale, increment)
end

:ets.insert overwrites whereas hit/5 correctly uses :ets.insert_new. Under interleaved hit+inc on a fresh key, increments to the first atomic can be clobbered when the second process's inc overwrites with a fresh atomic.
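
The semantic gap is easy to see in isolation. A stdlib-only sketch of why insert_new is the safe create-if-absent (the FixWindow shape above is paraphrased, not Hammer's exact source):

```elixir
t = :ets.new(:demo, [:set, :public])

# First writer creates the atomic and lands some increments on it.
a1 = :atomics.new(1, signed: false)
true = :ets.insert_new(t, {"k", a1})
:atomics.add(a1, 1, 5)

# A racing second writer using insert_new loses cleanly: a1 survives.
a2 = :atomics.new(1, signed: false)
false = :ets.insert_new(t, {"k", a2})
[{_, kept}] = :ets.lookup(t, "k")
5 = :atomics.get(kept, 1)        # increments intact

# Plain insert clobbers a1 -- this is the lost update.
true = :ets.insert(t, {"k", a2})
[{_, clobbered}] = :ets.lookup(t, "k")
0 = :atomics.get(clobbered, 1)   # a1's increments are gone
```

When insert_new returns false, the loser re-reads the table and increments the winner's atomic, which is exactly what hit/5 does and inc/4 doesn't.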

This was the fifth library tested with the template — the first to surface a real concurrency bug. The other four are correctly designed; Hammer's mismatch between hit/5 and inc/4 is an implementation gap, not a documented design choice.


Appendix: minimal test scaffolding for compile-rewrite (Path B)

defmodule Lockstep.MyLibTest do
  use ExUnit.Case, async: false

  @src "/tmp/mylib"
  @rewritten_dir "/tmp/mylib_lockstep"

  setup_all do
    cond do
      not File.exists?(@src) ->
        IO.puts(:stderr, "\n[mylib] skipped: clone to #{@src}")
        {:ok, available: false}

      true ->
        File.rm_rf!(@rewritten_dir)

        source_paths = [
          Path.join(@src, "lib/mylib.ex")
          # ... add all .ex files in dependency order
        ]

        case Lockstep.MixCompiler.compile(%{paths: source_paths, output: @rewritten_dir}) do
          {:ok, rewritten} ->
            # Sort leaf-to-root to satisfy module-level deps.
            ordered = Enum.sort_by(rewritten, &order_key/1)

            for path <- ordered do
              try do
                Code.compile_file(path)
                IO.puts(:stderr, "[mylib] compiled #{Path.basename(path)}")
              rescue
                e -> IO.puts(:stderr, "compile FAILED: #{Exception.message(e)}")
              end
            end

            {:ok, available: Code.ensure_loaded?(MyLib)}

          _ ->
            {:ok, available: false}
        end
    end
  end

  defp order_key(path) do
    case Path.basename(path) do
      "leaf_module.ex" -> 0
      "mid_module.ex" -> 1
      "main_module.ex" -> 2
      _ -> 99
    end
  end

  test "lin", ctx do
    if not ctx.available, do: assert(true), else: lin_test()
  end

  defp lin_test do
    Lockstep.Runner.run(&body/0,
      iterations: 200,
      strategy: :pct,
      max_steps: 5_000,
      seed: 1,
      iter_timeout: 30_000,
      progress: 25,
      suite: "mylib_lin"
    )
  end

  defp body, do: ...
end