gen_durable — performance

Copy Markdown View Source

Performance is a first-class constraint for this engine, not an afterthought. This document is the cost model: what one step costs, the EXPLAIN plans of every hot-path statement, the indexes that back them, throughput estimates, and the known limits.

All EXPLAIN (ANALYZE, BUFFERS) output below is real, captured on Postgres 17 (the devcontainer image) against a seeded dataset of 1,000,000 gen_durable rows (295k runnable, 5k executing, 700k terminal) and 50k signals, over a local Unix socket. Reproduction steps are at the end. Numbers are warm-cache; absolute ms will differ on your hardware, but the plan shapes and row counts are what matter.


1. The cost model of one step

A step runs between two short database transactions, with the user's code in the middle, outside any transaction (see the architecture notes in gen_durable_plan.md):

 pick (claim+lease)         user step/2 (no DB, no txn)        outcome (commit)
   ····························  
 TX1: short, batched        side effects live here            TX2: short
 amortized over the batch    (idempotency is yours)           consume + update

The database work per step, and where it goes:

PhaseStatementsRound-tripsNotes
pick1 (UPDATE … RETURNING over a window-dedup claim)~1/B per stepone pick claims a batch of B; the feeder amortizes it
load signals1 SELECT1always-on today; gated by an FSM flag is the remaining F4 step
load childs1 SELECT1same
user step/200runs outside any transaction
outcome1 CTE (consumed DELETE + outcome UPDATE [+ parent-join])1folded from a 4–5-stmt transaction into one statement (F4)

So a plain :next step is now ~3 round-trips (2 loads + 1 outcome), with the pick amortized to near-zero by batching. The outcome used to be a 4-statement BEGIN/consume/UPDATE/COMMIT transaction (+1 for :done/:stop's parent-join); folding it into a single data-modifying CTE made it 1 — measured at exactly one statement (test/perf_test.exs), and consistently faster than the transaction form on wall-clock (the round-trips are the cost). Gating the two loads behind an FSM capability flag is the last F4 step → ~1 round-trip/step. See §7.

Round-trips, not query execution time, dominate: every statement below executes in well under 1 ms at the database, but each client↔Postgres hop is a network round-trip (~0.3–1 ms across hosts). The cost is the count of hops, which is why collapsing the outcome from 4 hops to 1 matters more than any single statement's plan.


2. The picker (the one query that must scale)

The picker runs on every poll and every completion-driven refill, so it is the query that most has to stay cheap as the table grows. It does three things in one statement: claim a batch atomically, serialize partitions, and dedup partition keys (spec §6).

2.1 The decisive detail: queue = $1, never ANY

Each scheduler owns exactly one queue, so the picker filters queue = $1 (equality). This is not cosmetic — it is what lets the gen_durable_pick (queue, priority, eligible_at) WHERE status='runnable' index supply rows already ordered, so the LIMIT stops after a handful. With queue = ANY($1) the planner cannot trust the index order and falls back to scanning the entire runnable set and top-N sorting it:

-- queue = ANY('{default}')  
->  Index Scan using gen_durable_pick on gen_durable g_2 (rows=280000)    whole runnable set
      Filter: ((partition_key IS NULL) OR (NOT ...))
->  Sort  Sort Method: top-N heapsort
Execution Time: 613.578 ms
-- queue = 'default'  
->  Limit (rows=50)
      ->  Index Scan using gen_durable_pick on gen_durable g_2 (rows=50)   stops at the batch
            Index Cond: ((queue = 'default') AND (eligible_at <= now()))
Execution Time: 0.725 ms

~850× faster, and O(batch) instead of O(runnable backlog). Same data, one operator.

2.2 Full plan of the picker (batch = 50)

The picker is the canonical Postgres claim — one SELECT … FOR UPDATE SKIP LOCKED LIMIT, then one UPDATE — with the partition dedup folded in as a window function over the locked set, so there is exactly one nested loop (the UPDATE join):

Update on gen_durable g (actual time=0.400..0.983 rows=50 loops=1)
  Buffers: shared hit=868
  CTE locked
    ->  WindowAgg (rows=50)                                      dedup: row_number() per key
          ->  Sort  Sort Key: COALESCE(partition_key, id::text), priority, eligible_at  (27kB)
                ->  Limit (rows=50)
                      ->  LockRows (rows=50)                     FOR UPDATE SKIP LOCKED, in-scan
                            ->  Index Scan using gen_durable_pick (rows=50)
                                  Index Cond: ((queue = 'default') AND (eligible_at <= now()))
                                  Filter: ((partition_key IS NULL) OR (NOT ( hashed SubPlan 2)))
                                  SubPlan 2
                                    ->  Index Scan using gen_durable_lease (never executed)
  ->  Nested Loop (rows=50)                                      the one, optimal join
        ->  CTE Scan on locked l (rows=50)  Filter: (rn = 1)
        ->  Index Scan using gen_durable_pkey on gen_durable g (rows=1 loops=50)
Planning Time: 0.582 ms
Execution Time: 1.122 ms

What to read here:

  • gen_durable_pick index scan stops at the LIMIT — the scan touches ~batch rows, not the 295k runnable rows. This is the whole game.
  • The lock happens in that one scan (LockRows), and row_number() dedups the already-locked set — so there is no separate re-lock pass. The earlier design (DISTINCT ON → re-lock → update) had a second nested loop that scaled per-row with the batch; folding the dedup into a window function removed it (measured −19% at batch 5000, §2.4).
  • SubPlan 2 (the partition NOT EXISTS) was never executed here because the top 50 rows were non-partitioned (partition_key IS NULL short-circuits the guard). A non-partitioned queue pays nothing for the partition machinery. When partitioned rows are in the window, the guard probes gen_durable_partition_active — a partial index over only the executing rows with a non-null partition_key, so a non-partitioned claim never even writes to it.
  • The single Nested Loop is the UPDATE join, and it is optimal — outer = batch rows, inner = one primary-key point lookup each (loops=50, rows=1). That is O(batch) point updates, the textbook-best way to update N rows by id; see §2.5 for the proof that forcing it off is ~10× slower. The per-key losers (rn > 1) were locked but not updated, so they stay runnable and their lock releases at commit — no advisory bounce.

2.3 The bounded-window trade-off

The scan is LIMIT $2 (the batch size). A cluster of same-key rows filling that window dedups down to one, so a single pick can return fewer than batch — completion-driven refill closes the gap on the next pick. This keeps the hot path index-cheap at the cost of a little fill latency under heavy single-key load. See §6 for the degenerate case.

2.4 Large batches and the per-row floor

Crank prefetch and the pick claims a big batch in one statement. The cost is linear in the batch — every component (index scan, dedup sort, lock, the UPDATE) is per-row:

batchpick (median of 5, warm)per row
50~1.1 ms~22 µs
100~1.9 ms~19 µs
1000~16 ms~16 µs
5000~57 ms~11 µs

The fixed cost (planning + the NOT EXISTS hash build over the executing set) amortizes away with bigger batches; the ~11–16 µs/row marginal cost does not. That floor is the actual work of claiming: flipping each row runnable → executing is a heap write plus moving the row out of the runnable partial index and into the two executing partial indexes (lease, partition_active) — plus WAL for all of it. You cannot claim a row without writing it, so ~11–16 µs/row is close to the floor for this schema.

Two consequences worth designing around:

  • A pick of 5000 is a 70 ms synchronous statement in the scheduler GenServer and holds FOR UPDATE over 5000 rows. Past ~a few hundred, batch size buys no amortization (the fixed cost is already gone) and only adds blocking. If you push prefetch very high, prefer many medium picks over one giant one.
  • Per-row cost is aggregate DB CPU: at 10k steps/s that floor is ~0.15 s/s of CPU on picking alone. The way to cut it is fewer, larger logical steps — not a faster pick.

2.5 The one nested loop is optimal (proof)

The UPDATE join is a Nested Loop and that is exactly right. "Update these N rows by id" wants N primary-key point lookups — a nested loop with the PK index on the inner side, O(batch). It only looks scary when the inner side is unindexed (then it is O(N·M)); here it is loops=batch, rows=1 against gen_durable_pkey. Forcing the planner off it (SET enable_nestloop = off) makes it fall back to a hash join that Seq Scans the whole million-row table to find the batch:

->  Hash Join  (Hash Cond: g.id = l.id)
      ->  Seq Scan on gen_durable g  (rows=1000000)      scans everything to find `batch` rows
batchnested loop (default)forced off → hash join + seq scan
100017 ms160 ms
500056 ms195 ms

~10× slower. The nested loop is the canonical Postgres-queue claim, and it is the floor: UPDATE cannot LIMIT, so the claim must be SELECT … FOR UPDATE … LIMIT then UPDATE joined by id; and updating N rows means writing N rows regardless of how they are reached.

2.6 What did not help (measured, rejected)

Restructurings prototyped against the seeded dataset and rejected by measurement — recorded so they are not re-attempted blind:

  • Single-scan dedup (correlated NOT EXISTS "I am the most-urgent runnable of my key"
    • FOR UPDATE in the scan, backed by a new (partition_key, priority, eligible_at) partial index). A wash on wall-clock and worse at batch 5000 — the new partial index adds write amplification to every claim and every return-to-runnable. The window function over the locked set (the shipped picker) gets the same dedup with no new index.
  • ctid-join instead of id-join for the UPDATE re-touch (TID scan instead of a PK descent). Cut logical buffer hits ~28% but warm-cache wall-clock did not move (17.0 vs 17.0 ms at 1000) and was slower at 5000 (79 vs 70). Those buffers are nanosecond cache hits; time is dominated by heap writes + WAL, which ctid does not touch — consistent with §2.5 (the cost is the write, not the lookup).

Lesson: on a warm queue the picker is at its floor; fewer logical page touches do not translate to time when they are all cache hits. The real lever is round-trips per step (§7), not the pick query. Beware benchmarking on a bloated table — rolled-back EXPLAIN ANALYZE runs accumulate dead tuples and inflate timings by ~20%; VACUUM and take the median of several runs before believing a delta.


3. The outcome and point queries (all O(1) by primary key)

Every outcome and every signal/insert touches rows by primary key or a covering index — constant work regardless of table size.

The collapsed outcome (F4) — the plan, not the statement count

Each complete_* is one statement (§1): the signal consume rides as a leading consumed CTE, and :done/:stop carry the parent-join as the main UPDATE after a terminal CTE. Counting "one statement" only proves the round-trip — the plan has to prove the single statement is cheap. Here is the heaviest outcome, complete_done on a child (all three parts: consume + done + parent decrement), measured with 50,000 children present so the parent path is real, not a one-row fixture:

Update on gen_durable p                                           the parent decrement
  CTE consumed
    ->  Delete on signals s   (Index Scan signals_target  gen_durable_pkey)
  CTE terminal
    ->  Update on gen_durable  ->  Index Scan using gen_durable_pkey  (id = $1)
  ->  Nested Loop
        ->  Index Scan using gen_durable_pkey on c   (id = $1)             child by PK
        ->  Index Scan using gen_durable_pkey on p   (id = c.parent_id)    parent by PK
Execution Time: 0.268 ms

Every node is a primary-key index scan — no Seq Scan on gen_durable, even with 50k rows in the parent index. The c.id = $1 equality is the selective path so the planner stays on the PK (a one-row fixture can fool it onto the partial parent index with an id filter — harmless there, and it does not happen once stats are real). For a non-child :done the parent-join is a clean PK no-op (p matches 0 rows).

And it is not only fewer round-trips: the single statement is less DB execution time than the old three separate statements — median ~0.25 ms vs ~0.42 ms (EXPLAIN ANALYZE, 300k-row table), before even counting the BEGIN/COMMIT round-trips the transaction form also paid. Consume stays name-scoped (spec §5): on an instance parked on go with {go, other} in its inbox, only go is deleted — other survives.

(The statement-count assertions in test/perf_test.exs guard the round-trip count; this EXPLAIN is the execution-cost evidence. Different claims, both checked.)

Point queries

deliver_signal — wake by id+status+awaits:

Update on gen_durable
  ->  Index Scan using gen_durable_pkey (rows=0)
        Filter: ((status = 'awaiting_signal') AND (awaits = 'go'))
Execution Time: 0.029 ms

insert — single row, dedup via the partial unique index:

Insert on gen_durable
  Conflict Arbiter Indexes: gen_durable_unique
  Tuples Inserted: 1
Execution Time: 0.468 ms

load_signals — the inbox by target:

Sort  Sort Key: id
  ->  Index Scan using signals_target on signals (rows=1)   Index Cond: (target_id = 700001)
Execution Time: 0.072 ms

All sub-millisecond, all index-driven. None of them grow with the table.


4. The reaper (proportional to expired rows only)

The reaper sweeps expired leases via the gen_durable_lease (lease_expires_at) WHERE status='executing' partial index — it never scans live or terminal rows:

Update on gen_durable (rows=5000)
  ->  Index Scan using gen_durable_lease (rows=5000)
        Index Cond: (lease_expires_at < now())
        Filter: (status = 'executing')
Execution Time: 42.002 ms    5000 rows reaped at once (mass-crash scenario)

The index scan itself is ~2 ms; the 42 ms is the heap update of 5000 rows. In steady state almost nothing is expired, so a sweep finds 0 rows and costs an index probe. Cost scales with the size of the expired set (a mass worker death), not the table — exactly what you want.


5. Index coverage

Every hot statement is served by a partial or primary-key index; the big terminal-row bulk (700k done/failed) sits in none of the partial indexes.

QueryIndexKind
pick scangen_durable_pick (queue, priority, eligible_at) WHERE status='runnable'partial, ordered
pick partition guardgen_durable_partition_active (partition_key) WHERE status='executing' AND partition_key IS NOT NULLpartial, index-only probe
pick lock / outcomes / signalgen_durable_pkey (id)primary key
reapgen_durable_lease (lease_expires_at) WHERE status='executing'partial
insert dedupgen_durable_unique (unique_guard) WHERE unique_guard IS NOT NULLpartial unique
child joingen_durable_parent (parent_id) WHERE parent_id IS NOT NULLpartial
load_signalssignals_target (target_id, name)covering

The partial predicates matter: gen_durable_pick indexes only the ~295k runnable rows, not the 700k terminal ones, so the working set stays small as completed work accumulates.


6. Throughput model and known limits

Round-trip-bound throughput

Query execution is sub-millisecond, so steady-state throughput is set by round-trips per step × round-trip latency, divided across concurrency.

Let R = round-trips/step, L = client↔Postgres round-trip latency, U = user step/2 time. One serial chain does 1 / (R·L + U) steps/s; with C concurrent workers (pool and DB permitting), throughput ≈ C / (R·L + U).

round-trips Rper-step DB latency at L=0.2 ms (local)at L=1 ms (cloud)
today (2 loads + 4-stmt outcome txn)~6~1.2 ms~6 ms
F4 (gate loads, single-CTE outcome)~1–2~0.2–0.4 ms~1–2 ms

So a trivial-step workload at C = 50, local, today: 50 / 1.2 ms ≈ 40k steps/s, DB permitting; F4 would roughly triple-to-quintuple that. These are model figures from the measured per-query costs, not an end-to-end benchmark — treat them as order-of- magnitude. The pick is amortized out by the feeder batch (0.7 ms / 50 rows ≈ 14 µs/row).

Tuning levers (feeder knobs — see GenDurable.Scheduler)

  • prefetch — claims a batch ahead into memory; turns many small picks into few fat ones (amortizes the pick round-trip). Buffered rows are heartbeated, so depth is safe w.r.t. lease_ttl. Cost of depth: cross-node fairness, crash blip (bounded by TTL).
  • min_demand — batch gate: fetch fat, not one row per freed slot.
  • poll_interval / max_poll_interval — idle backoff cuts the polling load on an empty queue to near-zero without hurting busy-queue latency.

Known limits / pathologies (honest list)

  1. A single hot partition key with a large runnable backlog. While that key is executing, the picker's NOT EXISTS excludes each of its runnable siblings — one cheap index probe each — but if those siblings dominate the top of the queue by priority, the scan walks past many of them per pick. The window LIMIT bounds the output, not how far it skips. Mitigation (future): picker sharding by hashtext(key) so one key maps to one scheduler, or a per-key "next eligible" side structure. A single key monopolizing a queue is itself a modeling smell.
  2. Partitioned steps pin a connection. partition_key serialization holds a session advisory lock on a checked-out connection for the whole step (user code included), so in-flight partitioned steps consume pool connections 1:1. Size the pool for peak partitioned concurrency. Non-partitioned steps grab/release per statement.
  3. load_signals + load_childs run on every step, even for FSMs that never await or spawn. Two wasted round-trips for the common machine; gate them behind use GenDurable.FSM, awaits: true, childs: true. The last remaining F4 step.

7. Optimization backlog — ordered by payoff

  1. Collapse the outcome transaction into one CTE statement — done (F4): a 4–5-stmt BEGIN … COMMIT became one data-modifying CTE, ~4 round-trips → 1, the biggest single win on the hot path. Asserted single-statement in test/perf_test.exs.
  2. Gate load_signals / load_childs on FSM capability flags — ~2 round-trips → 0 for plain machines. Takes a plain :next step from ~3 round-trips to ~1.
  3. Picker sharding by key hash — removes the hot-key skip cost (limit #1).

Together (1)+(2) take a plain :next step from ~6 round-trips to ~1–2.


8. Reproducing these numbers

The dataset and every EXPLAIN above were produced in the devcontainer Postgres. To regenerate: bring up the stack (make up), then in docker compose … exec db psql -U postgres, create a scratch database, apply the v1 DDL from lib/gen_durable/migration.ex verbatim, seed with generate_series (700k terminal / 200k runnable non-partitioned / 80k distinct-key / 15k hot-key / 5k executing / 50k signals), ANALYZE, and run EXPLAIN (ANALYZE, BUFFERS) on each statement from lib/gen_durable/queries.ex. Wrap mutating statements in BEGIN; … ; ROLLBACK; so the plan executes without changing the dataset. Run each twice and read the second (warm).