LetItCrash (let_it_crash v0.6.0)

A testing library for crash recovery and OTP supervision behavior.

LetItCrash helps you test that your GenServers and supervised processes recover correctly after crashes, embracing Elixir's "let it crash" philosophy.

Usage

use LetItCrash

test "genserver recovers after crash" do
  {:ok, pid} = MyGenServer.start_link([])

  LetItCrash.crash(pid)

  assert LetItCrash.recovered?(MyGenServer)
end

Crash Functions

The library provides a crash/2 function that allows you to specify the type of exit signal. It follows the same convention as Process.exit/2 for consistency and piping support:

  • crash(pid) - Sends a :shutdown exit signal (default, can be trapped)
  • crash(pid, :shutdown) - Explicitly sends :shutdown signal
  • crash(pid, :kill) - Sends a :kill exit signal (cannot be trapped, guarantees termination)

Use :kill when testing processes that trap exits, such as GenServers that need to perform cleanup operations.
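
The difference between the two signals can be seen with plain Process.exit/2, whose convention crash/2 is documented to follow. The sketch below is an illustration only: ExitSignalDemo and its functions are hypothetical names, and it deliberately does not call LetItCrash itself.

```elixir
defmodule ExitSignalDemo do
  # Returns true if a process that traps exits is still alive after
  # receiving `signal`, and false if the signal terminated it.
  def survives?(signal) do
    parent = self()

    # Unlinked process that traps exits and reports every trapped signal.
    pid =
      spawn(fn ->
        Process.flag(:trap_exit, true)
        send(parent, :ready)
        trap_loop(parent)
      end)

    # Wait until the flag is set so the signal cannot arrive too early.
    receive do
      :ready -> :ok
    end

    ref = Process.monitor(pid)
    Process.exit(pid, signal)

    receive do
      {:trapped, ^signal} ->
        # The signal was converted into a message; clean up the survivor.
        Process.demonitor(ref, [:flush])
        Process.exit(pid, :kill)
        true

      {:DOWN, ^ref, :process, ^pid, _reason} ->
        false
    after
      1_000 -> raise "no response from the demo process"
    end
  end

  defp trap_loop(parent) do
    receive do
      {:EXIT, _from, reason} ->
        send(parent, {:trapped, reason})
        trap_loop(parent)
    end
  end
end
```

Calling ExitSignalDemo.survives?(:shutdown) shows the trapping process outliving the signal, while ExitSignalDemo.survives?(:kill) shows it being terminated regardless of the flag.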

Async work

Supervisor recovery is only part of the story. Production failures often hide in fire-and-forget Tasks, Oban jobs, and LiveView handle_async/3 callbacks — work that runs outside the caller's stack frame and can fail silently. For that, see LetItCrash.Async, which provides:

  • LetItCrash.Async.observe_async/1,2 — wrap a block of test code and collect telemetry about every async exception inside it
  • LetItCrash.Async.assert_no_silent_swallow/1,2 — fail when a Task raised but nobody noticed
  • LetItCrash.Async.assert_all_completed/2 — fail when async work didn't finish within a wall-clock budget
  • LetItCrash.Async.assert_idempotent/2 — fail when running the same operation twice produces different observable state

Both surfaces are imported by use LetItCrash.
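
To make "fail silently" concrete, the following sketch uses only the standard Task and Process modules. It is not how LetItCrash.Async is implemented, and SilentFailureDemo is a hypothetical name. It shows that a task started with Task.start/1 can raise without the caller noticing unless something, here a monitor, is watching for it.

```elixir
defmodule SilentFailureDemo do
  # Runs `fun` in an unlinked Task and reports how the task ended.
  def outcome(fun, timeout \\ 1_000) do
    {:ok, pid} =
      Task.start(fn ->
        # Hold the work until the caller has attached its monitor.
        receive do
          :go -> fun.()
        end
      end)

    ref = Process.monitor(pid)
    send(pid, :go)

    receive do
      {:DOWN, ^ref, :process, ^pid, :normal} -> :completed
      {:DOWN, ^ref, :process, ^pid, reason} -> {:crashed, reason}
    after
      timeout ->
        Process.demonitor(ref, [:flush])
        :timeout
    end
  end
end
```

Without the monitor, the caller of Task.start/1 would carry on unaffected when the function raises; the only trace would be a log entry.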

Summary

Functions

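__using__(opts)

Imports LetItCrash testing functions into the current module.

assert_clean_registry(registry, key, opts \\ [])

Asserts that a process properly cleans up its Registry entries on crash and recovery.

assert_supervision_impact(supervisor, target, opts)

Asserts the impact of crashing a child process on its siblings within a supervision tree.

crash(process, type \\ :shutdown)

Crashes a process by sending it an exit signal.

recovered?(process_name, original_pid_or_opts \\ [])

Checks if a registered process has recovered (restarted) after a crash.

test_restart(process, test_fn, opts \\ [])

Tests that a process can recover from a crash by executing a test function before and after the crash.

verify_ets_cleanup(table, key, opts \\ [])

Verifies that ETS table entries are properly cleaned up when a process crashes.

wait_for_exit(pid, opts \\ [])

Waits for a process to exit (terminate).

wait_for_process(process_name, opts \\ [])

Waits for a registered process to exist and be alive.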

Functions

__using__(opts)

(macro)

Imports LetItCrash testing functions into the current module.

This imports both the top-level helpers (crash/2, recovered?/2,3, wait_for_process/2, wait_for_exit/2, test_restart/3, assert_clean_registry/3, verify_ets_cleanup/3, assert_supervision_impact/3) and the async helpers from LetItCrash.Async (observe_async/1,2, assert_no_silent_swallow/1,2, assert_all_completed/2, assert_idempotent/2).

assert_clean_registry(registry, key, opts \\ [])

@spec assert_clean_registry(module(), term(), keyword()) :: :ok | {:error, term()}

Asserts that a process properly cleans up its Registry entries on crash and recovery.

This function verifies that:

  1. The old Registry entry is removed when the process crashes
  2. A new Registry entry is created when the process recovers
  3. The new entry points to the new PID

Parameters

  • registry - The Registry module to monitor
  • key - The Registry key under which the process is registered
  • opts - Options for the verification
    • :timeout - Maximum time to wait for cleanup and re-registration (default: 2000ms)

Examples

test "process cleans up registry on restart" do
  {:ok, _pid} = MyServer.start_link(name: :my_server)
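  # Note: Registry.register/3 registers the calling process (here, the test process).
  # A server would normally register itself, for example from init/1.
  Registry.register(MyApp.Registry, :my_server, %{status: :active})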

  LetItCrash.crash(:my_server)
  LetItCrash.assert_clean_registry(MyApp.Registry, :my_server)
end
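
The first point in the list above relies on standard Registry behavior: an entry is removed once the process that registered it dies. The sketch below demonstrates that with the standard library alone. RegistryCleanupDemo and its registry name are hypothetical, and the polling helper is an assumption of this example, not the library's implementation.

```elixir
defmodule RegistryCleanupDemo do
  @registry RegistryCleanupDemo.Registry

  # Returns {entry_present_before_kill?, lookup_after_kill}.
  def run do
    {:ok, _} = Registry.start_link(keys: :unique, name: @registry)
    parent = self()

    owner =
      spawn(fn ->
        # Registry.register/3 always registers the calling process.
        {:ok, _} = Registry.register(@registry, :my_server, %{status: :active})
        send(parent, :registered)
        Process.sleep(:infinity)
      end)

    receive do
      :registered -> :ok
    end

    entries_before = Registry.lookup(@registry, :my_server)
    Process.exit(owner, :kill)

    # Removal follows the exit but is not instantaneous, so poll for it.
    :ok = wait_until(fn -> Registry.lookup(@registry, :my_server) == [] end, 100)

    {entries_before == [{owner, %{status: :active}}],
     Registry.lookup(@registry, :my_server)}
  end

  defp wait_until(_check, 0), do: {:error, :timeout}

  defp wait_until(check, attempts) do
    if check.() do
      :ok
    else
      Process.sleep(10)
      wait_until(check, attempts - 1)
    end
  end
end
```

Re-registration after recovery (points 2 and 3) additionally requires the restarted process to register itself again, which this sketch does not cover.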

assert_supervision_impact(supervisor, target, opts)

@spec assert_supervision_impact(pid() | atom(), atom(), keyword()) ::
  :ok | {:error, term()}

Asserts the impact of crashing a child process on its siblings within a supervision tree.

Crashes the target process and verifies that each process listed in :expect reaches the expected status (:restarted, :alive, or :stopped).

This function validates that your chosen supervision strategy (:one_for_one, :one_for_all, :rest_for_one) behaves as expected for your specific tree.

Parameters

  • supervisor - PID or registered name of the supervisor
  • target - Registered name of the child to crash
  • opts - Options:
    • :expect (required) - Keyword list of {child_name, expected_status}
    • :signal - Exit signal: :shutdown (default) or :kill
    • :timeout - Max wait time in ms (default: 2000)
    • :interval - Polling interval in ms (default: 50)

Expected Statuses

  • :restarted - Process is alive with a different PID than before the crash
  • :alive - Process is alive with the same PID (unaffected)
  • :stopped - Process is no longer registered or alive

Each status can also be paired with an assertion function that verifies application-level behavior after the status is confirmed:

  • {:restarted, fn -> ... end} - Verify behavior after restart
  • {:alive, fn -> ... end} - Verify the unaffected process is still functional
  • {:stopped, fn -> ... end} - Run assertions after confirming the process stopped

The assertion function runs only after the status is confirmed. If the status does not match, the function is never called. If the status matches but the function raises, the error propagates as a test failure.

Examples

# Verify one_for_all restarts all children
LetItCrash.assert_supervision_impact(:my_sup, :worker_a,
  expect: [
    worker_a: :restarted,
    worker_b: :restarted,
    worker_c: :restarted
  ]
)

# Verify one_for_one only restarts the crashed child
LetItCrash.assert_supervision_impact(:my_sup, :worker_a,
  expect: [
    worker_a: :restarted,
    worker_b: :alive,
    worker_c: :alive
  ]
)

# Verify application behavior, not just OTP mechanics
LetItCrash.assert_supervision_impact(:my_sup, :coordinator,
  expect: [
    coordinator: {:restarted, fn ->
      assert MyCoordinator.get_state() == :idle
    end},
    worker_a: {:alive, fn ->
      assert MyWorker.get_status(:worker_a) == :waiting
    end},
    worker_b: :restarted
  ]
)
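
The :restarted and :alive statuses come down to comparing each child's PID before and after the crash. The following sketch reproduces that comparison with a plain Supervisor and two named Agents. StrategyDemo, the worker names, and the polling helper are hypothetical and only illustrate the idea; they are not taken from the library's source.

```elixir
defmodule StrategyDemo do
  @names [:demo_worker_a, :demo_worker_b]

  # Kills :demo_worker_a under the given strategy and reports, per child,
  # whether its PID changed (:restarted) or stayed the same (:alive).
  def impact(strategy) do
    children =
      for name <- @names do
        %{id: name, start: {Agent, :start_link, [fn -> 0 end, [name: name]]}}
      end

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    pids_before = Map.new(@names, fn name -> {name, Process.whereis(name)} end)

    Process.exit(pids_before.demo_worker_a, :kill)
    :ok = wait_until(fn -> settled?(pids_before) end, 100)

    result =
      Map.new(@names, fn name ->
        status = if Process.whereis(name) == pids_before[name], do: :alive, else: :restarted
        {name, status}
      end)

    Supervisor.stop(sup)
    result
  end

  # Settled: the target has a new PID and every child is registered again.
  defp settled?(pids_before) do
    target = Process.whereis(:demo_worker_a)

    is_pid(target) and target != pids_before.demo_worker_a and
      Enum.all?(@names, fn name -> is_pid(Process.whereis(name)) end)
  end

  defp wait_until(_check, 0), do: {:error, :timeout}

  defp wait_until(check, attempts) do
    if check.() do
      :ok
    else
      Process.sleep(10)
      wait_until(check, attempts - 1)
    end
  end
end
```

With :one_for_one only the killed worker changes PID; with :one_for_all both workers do, which is the distinction the :expect option is meant to pin down.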

crash(process, type \\ :shutdown)

@spec crash(process :: pid() | atom(), type :: :shutdown | :kill) ::
  :ok | {:error, term()}

Crashes a process by sending it an exit signal.

Follows the same convention as Process.exit/2, with the process as the first argument to enable easy piping.

Parameters

  • process - A PID or registered process name to crash
  • type - The type of exit signal: :shutdown (default) or :kill

The :shutdown signal can be trapped by processes with Process.flag(:trap_exit, true), while :kill cannot be trapped and guarantees termination.

Examples

# Default :shutdown signal:
{:ok, pid} = MyGenServer.start_link([])
LetItCrash.crash(pid)

# Explicitly specifying :shutdown:
LetItCrash.crash(pid, :shutdown)

# Piping support:
Process.whereis(:my_process)
|> LetItCrash.crash(:kill)

# For processes with trap_exit, use :kill:
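defmodule ScoreCoordinator do
  use GenServer

  # use GenServer does not define start_link/1, so define it explicitly
  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_) do
    # Trap exits so the server can run cleanup instead of dying immediately
    Process.flag(:trap_exit, true)
    {:ok, %{}}
  end
end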

{:ok, pid} = ScoreCoordinator.start_link([])
LetItCrash.crash(pid, :kill)  # Guarantees termination

recovered?(process_name, original_pid_or_opts \\ [])

@spec recovered?(atom(), pid() | keyword()) :: boolean()

Checks if a registered process has recovered (restarted) after a crash.

This function works by comparing the current PID of a registered process with a previously stored PID. If they differ, it means the process was restarted.

Parameters

  • process_name - The registered name of the process to check
  • original_pid - The PID before the crash (optional, will be retrieved if not provided)
  • opts - Options for recovery checking
    • :timeout - Maximum time to wait for recovery (default: 1000ms)
    • :interval - Polling interval (default: 50ms)

Examples

test "process recovers after crash" do
  original_pid = Process.whereis(MyGenServer)
  LetItCrash.crash(MyGenServer)
  assert LetItCrash.recovered?(MyGenServer, original_pid)
end
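
The PID comparison with :timeout and :interval can be pictured as a small polling loop. The sketch below is one possible shape for such a loop, assuming the defaults listed above; RecoveryPoll is a hypothetical module and is not the library's actual implementation.

```elixir
defmodule RecoveryPoll do
  # Polls until `name` resolves to a PID other than `original_pid`,
  # or returns false once the timeout has elapsed.
  def restarted?(name, original_pid, opts \\ []) do
    timeout = Keyword.get(opts, :timeout, 1_000)
    interval = Keyword.get(opts, :interval, 50)
    deadline = System.monotonic_time(:millisecond) + timeout
    poll(name, original_pid, interval, deadline)
  end

  defp poll(name, original_pid, interval, deadline) do
    case Process.whereis(name) do
      pid when is_pid(pid) and pid != original_pid ->
        true

      _same_or_missing ->
        if System.monotonic_time(:millisecond) >= deadline do
          false
        else
          Process.sleep(interval)
          poll(name, original_pid, interval, deadline)
        end
    end
  end
end
```

Capturing the original PID before the crash matters: once the process has died, the name briefly resolves to nil, and a PID read at that point cannot serve as a baseline.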

recovered?(process_name, original_pid, opts)

@spec recovered?(atom(), pid(), keyword()) :: boolean()

start_tracking()

test_restart(process, test_fn, opts \\ [])

@spec test_restart(pid() | atom(), function(), keyword()) :: :ok | {:error, term()}

Tests that a process can recover from a crash by executing a test function before and after the crash.

Parameters

  • process - PID or registered name of the process to test
  • test_fn - Function to execute before and after crash
  • opts - Options for the test
    • :timeout - Maximum time to wait for recovery (default: 1000ms)

Examples

test "starts from initial state after restart" do
  LetItCrash.test_restart(MyStatefulServer, fn ->
    assert MyStatefulServer.get_count() == 0
    MyStatefulServer.increment()
    assert MyStatefulServer.get_count() == 1
  end)
end

verify_ets_cleanup(table, key, opts \\ [])

@spec verify_ets_cleanup(atom() | :ets.tid(), term(), keyword()) ::
  :ok | {:error, term()}

Verifies that ETS table entries are properly cleaned up when a process crashes.

This function monitors specific ETS table entries and ensures they are cleaned up appropriately during process restart.

Parameters

  • table - The ETS table name or reference to monitor
  • key - The key to monitor in the ETS table
  • opts - Options for the verification
    • :timeout - Maximum time to wait for cleanup (default: 1000ms)
    • :expect_cleanup - Whether to expect the entry to be cleaned up (default: true)
    • :expect_recreate - Whether to expect the entry to be recreated (default: false)

Examples

test "cleans up ETS entries on crash" do
  :ets.insert(:my_cache, {:server_data, "important"})

  LetItCrash.crash(:my_server)
  LetItCrash.verify_ets_cleanup(:my_cache, :server_data)
end

test "recreates ETS entries after recovery" do
  LetItCrash.crash(:my_server)
  LetItCrash.verify_ets_cleanup(:my_cache, :server_data,
    expect_cleanup: true, expect_recreate: true)
end
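
One common reason entries disappear on a crash is ETS ownership: a table is deleted when its owning process exits, unless an heir is configured. The sketch below shows that behavior with :ets directly. EtsOwnershipDemo is a hypothetical name, and the example makes no claim about how verify_ets_cleanup/3 checks entries internally.

```elixir
defmodule EtsOwnershipDemo do
  # Returns true when the table created by a killed owner is gone.
  def table_removed_with_owner? do
    parent = self()

    owner =
      spawn(fn ->
        # The spawned process owns the table it creates.
        table = :ets.new(:demo_cache, [:set, :public])
        :ets.insert(table, {:server_data, "important"})
        send(parent, {:table, table})
        Process.sleep(:infinity)
      end)

    table =
      receive do
        {:table, table} -> table
      end

    # The entry is readable while the owner is alive.
    [{:server_data, "important"}] = :ets.lookup(table, :server_data)

    ref = Process.monitor(owner)
    Process.exit(owner, :kill)

    receive do
      {:DOWN, ^ref, :process, ^owner, _reason} -> :ok
    end

    gone?(table, 100)
  end

  defp gone?(_table, 0), do: false

  defp gone?(table, attempts) do
    if :ets.info(table) == :undefined do
      true
    else
      Process.sleep(10)
      gone?(table, attempts - 1)
    end
  end
end
```

If the table is owned by a different, longer-lived process, entries written by the crashed process stay in place, which is the case where an explicit cleanup check is most useful.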

wait_for_exit(pid, opts \\ [])

@spec wait_for_exit(
  pid(),
  keyword()
) :: :ok | {:error, :timeout}

Waits for a process to exit (terminate).

Uses Process.monitor/1 to block until the given PID is no longer alive, or until the timeout elapses. This is the deterministic replacement for Process.sleep/1 calls that wait "long enough" for a process to die.

If the process is already dead when the call is made, the monitor fires immediately with :noproc and the function returns :ok.

Parameters

  • pid - The PID of the process to wait on
  • opts - Options for waiting
    • :timeout - Maximum time to wait (default: 1000ms)

Returns

  • :ok - The process is no longer alive
  • {:error, :timeout} - Process is still alive after the timeout

Examples

test "process dies after crash" do
  {:ok, pid} = Agent.start_link(fn -> 0 end)
  Process.unlink(pid)

  LetItCrash.crash(pid)

  :ok = LetItCrash.wait_for_exit(pid)
  refute Process.alive?(pid)
end

# With custom timeout for slow-terminating processes
:ok = LetItCrash.wait_for_exit(pid, timeout: 5000)
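
The monitor-based wait described above can be sketched in a few lines of plain Elixir. MonitorWait is a hypothetical module written for illustration; it mirrors the documented return values but is not copied from the library.

```elixir
defmodule MonitorWait do
  # Blocks until `pid` is down or `timeout` milliseconds have passed.
  def await_exit(pid, timeout \\ 1_000) do
    ref = Process.monitor(pid)

    receive do
      # Also matches the immediate :noproc message for an already dead PID.
      {:DOWN, ^ref, :process, ^pid, _reason} -> :ok
    after
      timeout ->
        # Drop the monitor so a late :DOWN does not land in the mailbox.
        Process.demonitor(ref, [:flush])
        {:error, :timeout}
    end
  end
end
```

Because the monitor delivers a message at the moment of exit, the wait ends as soon as the process is gone instead of after a fixed sleep.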

wait_for_process(process_name, opts \\ [])

@spec wait_for_process(
  atom(),
  keyword()
) :: :ok | {:error, :timeout}

Waits for a registered process to exist and be alive.

This function is useful in test setup when you need to ensure a process is available before interacting with it, particularly after starting supervisors or during async initialization.

Parameters

  • process_name - The registered name of the process to wait for
  • opts - Options for waiting
    • :timeout - Maximum time to wait (default: 1000ms)
    • :interval - Polling interval (default: 50ms)

Returns

  • :ok - Process exists and is alive
  • {:error, :timeout} - Process did not appear within timeout

Examples

test "worker is available after supervisor starts" do
  {:ok, _sup} = MySupervisor.start_link()

  # Wait for the worker to be ready
  :ok = LetItCrash.wait_for_process(:my_worker)

  # Now safe to interact with it
  assert MyWorker.get_status() == :ready
end

# With custom timeout for slow-starting processes
:ok = LetItCrash.wait_for_process(:heavy_worker, timeout: 5000)