Performance is a first-class constraint for this engine, not an afterthought. This document is the cost model: what one step costs, the EXPLAIN plans of every hot-path statement, the indexes that back them, throughput estimates, and the known limits.
All EXPLAIN (ANALYZE, BUFFERS) output below is real, captured on Postgres 17
(the devcontainer image) against a seeded dataset of 1,000,000 gen_durable rows
(295k runnable, 5k executing, 700k terminal) and 50k signals, over a local
Unix socket. Reproduction steps are at the end. Numbers are warm-cache; absolute ms
will differ on your hardware, but the plan shapes and row counts are what matter.
1. The cost model of one step
A step runs between two short database transactions, with the user's code in the
middle, outside any transaction (see the architecture notes in gen_durable_plan.md):
pick (claim+lease) user step/2 (no DB, no txn) outcome (commit)
───────────────────► ···························· ───────────────────►
TX1: short, batched side effects live here TX2: short
amortized over the batch (idempotency is yours) consume + updateThe database work per step, and where it goes:
| Phase | Statements | Round-trips | Notes |
|---|---|---|---|
| pick | 1 (UPDATE … RETURNING over a window-dedup claim) | ~1/B per step | one pick claims a batch of B; the feeder amortizes it |
| load signals | 1 SELECT | 1 | always-on today; gated by an FSM flag is the remaining F4 step |
| load childs | 1 SELECT | 1 | same |
user step/2 | 0 | 0 | runs outside any transaction |
| outcome | 1 CTE (consumed DELETE + outcome UPDATE [+ parent-join]) | 1 | folded from a 4–5-stmt transaction into one statement (F4) |
So a plain :next step is now ~3 round-trips (2 loads + 1 outcome), with the pick
amortized to near-zero by batching. The outcome used to be a 4-statement
BEGIN/consume/UPDATE/COMMIT transaction (+1 for :done/:stop's parent-join);
folding it into a single data-modifying CTE made it 1 — measured at exactly one
statement (test/perf_test.exs), and consistently faster than the transaction form on
wall-clock (the round-trips are the cost). Gating the two loads behind an FSM capability
flag is the last F4 step → ~1 round-trip/step. See §7.
Round-trips, not query execution time, dominate: every statement below executes in well under 1 ms at the database, but each client↔Postgres hop is a network round-trip (~0.3–1 ms across hosts). The cost is the count of hops, which is why collapsing the outcome from 4 hops to 1 matters more than any single statement's plan.
2. The picker (the one query that must scale)
The picker runs on every poll and every completion-driven refill, so it is the query that most has to stay cheap as the table grows. It does three things in one statement: claim a batch atomically, serialize partitions, and dedup partition keys (spec §6).
2.1 The decisive detail: queue = $1, never ANY
Each scheduler owns exactly one queue, so the picker filters queue = $1 (equality).
This is not cosmetic — it is what lets the gen_durable_pick (queue, priority, eligible_at) WHERE status='runnable' index supply rows already ordered, so the LIMIT
stops after a handful. With queue = ANY($1) the planner cannot trust the index order
and falls back to scanning the entire runnable set and top-N sorting it:
-- queue = ANY('{default}') ❌
-> Index Scan using gen_durable_pick on gen_durable g_2 (rows=280000) ← whole runnable set
Filter: ((partition_key IS NULL) OR (NOT ...))
-> Sort Sort Method: top-N heapsort
Execution Time: 613.578 ms-- queue = 'default' ✅
-> Limit (rows=50)
-> Index Scan using gen_durable_pick on gen_durable g_2 (rows=50) ← stops at the batch
Index Cond: ((queue = 'default') AND (eligible_at <= now()))
Execution Time: 0.725 ms~850× faster, and O(batch) instead of O(runnable backlog). Same data, one operator.
2.2 Full plan of the picker (batch = 50)
The picker is the canonical Postgres claim — one SELECT … FOR UPDATE SKIP LOCKED LIMIT,
then one UPDATE — with the partition dedup folded in as a window function over the
locked set, so there is exactly one nested loop (the UPDATE join):
Update on gen_durable g (actual time=0.400..0.983 rows=50 loops=1)
Buffers: shared hit=868
CTE locked
-> WindowAgg (rows=50) ← dedup: row_number() per key
-> Sort Sort Key: COALESCE(partition_key, id::text), priority, eligible_at (27kB)
-> Limit (rows=50)
-> LockRows (rows=50) ← FOR UPDATE SKIP LOCKED, in-scan
-> Index Scan using gen_durable_pick (rows=50)
Index Cond: ((queue = 'default') AND (eligible_at <= now()))
Filter: ((partition_key IS NULL) OR (NOT (… hashed SubPlan 2)))
SubPlan 2
-> Index Scan using gen_durable_lease (never executed)
-> Nested Loop (rows=50) ← the one, optimal join
-> CTE Scan on locked l (rows=50) Filter: (rn = 1)
-> Index Scan using gen_durable_pkey on gen_durable g (rows=1 loops=50)
Planning Time: 0.582 ms
Execution Time: 1.122 msWhat to read here:
gen_durable_pickindex scan stops at theLIMIT— the scan touches ~batchrows, not the 295k runnable rows. This is the whole game.- The lock happens in that one scan (
LockRows), androw_number()dedups the already-locked set — so there is no separate re-lock pass. The earlier design (DISTINCT ON→ re-lock → update) had a second nested loop that scaled per-row with the batch; folding the dedup into a window function removed it (measured −19% at batch 5000, §2.4). SubPlan 2(the partitionNOT EXISTS) wasnever executedhere because the top 50 rows were non-partitioned (partition_key IS NULLshort-circuits the guard). A non-partitioned queue pays nothing for the partition machinery. When partitioned rows are in the window, the guard probesgen_durable_partition_active— a partial index over only the executing rows with a non-nullpartition_key, so a non-partitioned claim never even writes to it.- The single
Nested Loopis theUPDATEjoin, and it is optimal — outer =batchrows, inner = one primary-key point lookup each (loops=50, rows=1). That is O(batch) point updates, the textbook-best way to update N rows by id; see §2.5 for the proof that forcing it off is ~10× slower. The per-key losers (rn > 1) were locked but not updated, so they stayrunnableand their lock releases at commit — no advisory bounce.
2.3 The bounded-window trade-off
The scan is LIMIT $2 (the batch size). A cluster of same-key rows filling that window
dedups down to one, so a single pick can return fewer than batch — completion-driven
refill closes the gap on the next pick. This keeps the hot path index-cheap at the cost
of a little fill latency under heavy single-key load. See §6 for the degenerate case.
2.4 Large batches and the per-row floor
Crank prefetch and the pick claims a big batch in one statement. The cost is linear
in the batch — every component (index scan, dedup sort, lock, the UPDATE) is per-row:
| batch | pick (median of 5, warm) | per row |
|---|---|---|
| 50 | ~1.1 ms | ~22 µs |
| 100 | ~1.9 ms | ~19 µs |
| 1000 | ~16 ms | ~16 µs |
| 5000 | ~57 ms | ~11 µs |
The fixed cost (planning + the NOT EXISTS hash build over the executing set) amortizes
away with bigger batches; the ~11–16 µs/row marginal cost does not. That floor is the
actual work of claiming: flipping each row runnable → executing is a heap write plus
moving the row out of the runnable partial index and into the two executing partial
indexes (lease, partition_active) — plus WAL for all of it. You cannot claim a row
without writing it, so ~11–16 µs/row is close to the floor for this schema.
Two consequences worth designing around:
- A pick of 5000 is a 70 ms synchronous statement in the scheduler GenServer and holds
FOR UPDATEover 5000 rows. Past ~a few hundred, batch size buys no amortization (the fixed cost is already gone) and only adds blocking. If you pushprefetchvery high, prefer many medium picks over one giant one. - Per-row cost is aggregate DB CPU: at 10k steps/s that floor is ~0.15 s/s of CPU on picking alone. The way to cut it is fewer, larger logical steps — not a faster pick.
2.5 The one nested loop is optimal (proof)
The UPDATE join is a Nested Loop and that is exactly right. "Update these N rows by id"
wants N primary-key point lookups — a nested loop with the PK index on the inner side,
O(batch). It only looks scary when the inner side is unindexed (then it is O(N·M)); here
it is loops=batch, rows=1 against gen_durable_pkey. Forcing the planner off it
(SET enable_nestloop = off) makes it fall back to a hash join that Seq Scans the whole
million-row table to find the batch:
-> Hash Join (Hash Cond: g.id = l.id)
-> Seq Scan on gen_durable g (rows=1000000) ← scans everything to find `batch` rows| batch | nested loop (default) | forced off → hash join + seq scan |
|---|---|---|
| 1000 | 17 ms | 160 ms |
| 5000 | 56 ms | 195 ms |
~10× slower. The nested loop is the canonical Postgres-queue claim, and it is the floor:
UPDATE cannot LIMIT, so the claim must be SELECT … FOR UPDATE … LIMIT then UPDATE
joined by id; and updating N rows means writing N rows regardless of how they are reached.
2.6 What did not help (measured, rejected)
Restructurings prototyped against the seeded dataset and rejected by measurement — recorded so they are not re-attempted blind:
- Single-scan dedup (correlated
NOT EXISTS"I am the most-urgent runnable of my key"FOR UPDATEin the scan, backed by a new(partition_key, priority, eligible_at)partial index). A wash on wall-clock and worse at batch 5000 — the new partial index adds write amplification to every claim and every return-to-runnable. The window function over the locked set (the shipped picker) gets the same dedup with no new index.
ctid-join instead ofid-join for theUPDATEre-touch (TID scan instead of a PK descent). Cut logical buffer hits ~28% but warm-cache wall-clock did not move (17.0 vs 17.0 ms at 1000) and was slower at 5000 (79 vs 70). Those buffers are nanosecond cache hits; time is dominated by heap writes + WAL, whichctiddoes not touch — consistent with §2.5 (the cost is the write, not the lookup).
Lesson: on a warm queue the picker is at its floor; fewer logical page touches do not
translate to time when they are all cache hits. The real lever is round-trips per step
(§7), not the pick query. Beware benchmarking on a bloated table — rolled-back
EXPLAIN ANALYZE runs accumulate dead tuples and inflate timings by ~20%; VACUUM and
take the median of several runs before believing a delta.
3. The outcome and point queries (all O(1) by primary key)
Every outcome and every signal/insert touches rows by primary key or a covering index — constant work regardless of table size.
The collapsed outcome (F4) — the plan, not the statement count
Each complete_* is one statement (§1): the signal consume rides as a leading consumed
CTE, and :done/:stop carry the parent-join as the main UPDATE after a terminal CTE.
Counting "one statement" only proves the round-trip — the plan has to prove the single
statement is cheap. Here is the heaviest outcome, complete_done on a child (all
three parts: consume + done + parent decrement), measured with 50,000 children present
so the parent path is real, not a one-row fixture:
Update on gen_durable p ← the parent decrement
CTE consumed
-> Delete on signals s (Index Scan signals_target → gen_durable_pkey)
CTE terminal
-> Update on gen_durable -> Index Scan using gen_durable_pkey (id = $1)
-> Nested Loop
-> Index Scan using gen_durable_pkey on c (id = $1) ← child by PK
-> Index Scan using gen_durable_pkey on p (id = c.parent_id) ← parent by PK
Execution Time: 0.268 msEvery node is a primary-key index scan — no Seq Scan on gen_durable, even with 50k
rows in the parent index. The c.id = $1 equality is the selective path so the planner
stays on the PK (a one-row fixture can fool it onto the partial parent index with an
id filter — harmless there, and it does not happen once stats are real). For a non-child
:done the parent-join is a clean PK no-op (p matches 0 rows).
And it is not only fewer round-trips: the single statement is less DB execution time
than the old three separate statements — median ~0.25 ms vs ~0.42 ms (EXPLAIN ANALYZE,
300k-row table), before even counting the BEGIN/COMMIT round-trips the transaction form
also paid. Consume stays name-scoped (spec §5): on an instance parked on go with {go, other} in its inbox, only go is deleted — other survives.
(The statement-count assertions in test/perf_test.exs guard the round-trip count; this
EXPLAIN is the execution-cost evidence. Different claims, both checked.)
Point queries
deliver_signal — wake by id+status+awaits:
Update on gen_durable
-> Index Scan using gen_durable_pkey (rows=0)
Filter: ((status = 'awaiting_signal') AND (awaits = 'go'))
Execution Time: 0.029 msinsert — single row, dedup via the partial unique index:
Insert on gen_durable
Conflict Arbiter Indexes: gen_durable_unique
Tuples Inserted: 1
Execution Time: 0.468 msload_signals — the inbox by target:
Sort Sort Key: id
-> Index Scan using signals_target on signals (rows=1) Index Cond: (target_id = 700001)
Execution Time: 0.072 msAll sub-millisecond, all index-driven. None of them grow with the table.
4. The reaper (proportional to expired rows only)
The reaper sweeps expired leases via the gen_durable_lease (lease_expires_at) WHERE status='executing' partial index — it never scans live or terminal rows:
Update on gen_durable (rows=5000)
-> Index Scan using gen_durable_lease (rows=5000)
Index Cond: (lease_expires_at < now())
Filter: (status = 'executing')
Execution Time: 42.002 ms ← 5000 rows reaped at once (mass-crash scenario)The index scan itself is ~2 ms; the 42 ms is the heap update of 5000 rows. In steady state almost nothing is expired, so a sweep finds 0 rows and costs an index probe. Cost scales with the size of the expired set (a mass worker death), not the table — exactly what you want.
5. Index coverage
Every hot statement is served by a partial or primary-key index; the big terminal-row
bulk (700k done/failed) sits in none of the partial indexes.
| Query | Index | Kind |
|---|---|---|
pick scan | gen_durable_pick (queue, priority, eligible_at) WHERE status='runnable' | partial, ordered |
pick partition guard | gen_durable_partition_active (partition_key) WHERE status='executing' AND partition_key IS NOT NULL | partial, index-only probe |
pick lock / outcomes / signal | gen_durable_pkey (id) | primary key |
reap | gen_durable_lease (lease_expires_at) WHERE status='executing' | partial |
insert dedup | gen_durable_unique (unique_guard) WHERE unique_guard IS NOT NULL | partial unique |
| child join | gen_durable_parent (parent_id) WHERE parent_id IS NOT NULL | partial |
load_signals | signals_target (target_id, name) | covering |
The partial predicates matter: gen_durable_pick indexes only the ~295k runnable rows,
not the 700k terminal ones, so the working set stays small as completed work accumulates.
6. Throughput model and known limits
Round-trip-bound throughput
Query execution is sub-millisecond, so steady-state throughput is set by round-trips per step × round-trip latency, divided across concurrency.
Let R = round-trips/step, L = client↔Postgres round-trip latency, U = user
step/2 time. One serial chain does 1 / (R·L + U) steps/s; with C concurrent workers
(pool and DB permitting), throughput ≈ C / (R·L + U).
round-trips R | per-step DB latency at L=0.2 ms (local) | at L=1 ms (cloud) | |
|---|---|---|---|
| today (2 loads + 4-stmt outcome txn) | ~6 | ~1.2 ms | ~6 ms |
| F4 (gate loads, single-CTE outcome) | ~1–2 | ~0.2–0.4 ms | ~1–2 ms |
So a trivial-step workload at C = 50, local, today: 50 / 1.2 ms ≈ 40k steps/s,
DB permitting; F4 would roughly triple-to-quintuple that. These are model figures from
the measured per-query costs, not an end-to-end benchmark — treat them as order-of-
magnitude. The pick is amortized out by the feeder batch (0.7 ms / 50 rows ≈ 14 µs/row).
Tuning levers (feeder knobs — see GenDurable.Scheduler)
prefetch— claims a batch ahead into memory; turns many small picks into few fat ones (amortizes the pick round-trip). Buffered rows are heartbeated, so depth is safe w.r.t.lease_ttl. Cost of depth: cross-node fairness, crash blip (bounded by TTL).min_demand— batch gate: fetch fat, not one row per freed slot.poll_interval/max_poll_interval— idle backoff cuts the polling load on an empty queue to near-zero without hurting busy-queue latency.
Known limits / pathologies (honest list)
- A single hot partition key with a large runnable backlog. While that key is
executing, the picker'sNOT EXISTSexcludes each of its runnable siblings — one cheap index probe each — but if those siblings dominate the top of the queue by priority, the scan walks past many of them per pick. The windowLIMITbounds the output, not how far it skips. Mitigation (future): picker sharding byhashtext(key)so one key maps to one scheduler, or a per-key "next eligible" side structure. A single key monopolizing a queue is itself a modeling smell. - Partitioned steps pin a connection.
partition_keyserialization holds a session advisory lock on a checked-out connection for the whole step (user code included), so in-flight partitioned steps consume pool connections 1:1. Size the pool for peak partitioned concurrency. Non-partitioned steps grab/release per statement. load_signals+load_childsrun on every step, even for FSMs that never await or spawn. Two wasted round-trips for the common machine; gate them behinduse GenDurable.FSM, awaits: true, childs: true. The last remaining F4 step.
7. Optimization backlog — ordered by payoff
- ✅ Collapse the outcome transaction into one CTE statement — done (F4): a 4–5-stmt
BEGIN … COMMITbecame one data-modifying CTE, ~4 round-trips → 1, the biggest single win on the hot path. Asserted single-statement intest/perf_test.exs. - Gate
load_signals/load_childson FSM capability flags — ~2 round-trips → 0 for plain machines. Takes a plain:nextstep from ~3 round-trips to ~1. - Picker sharding by key hash — removes the hot-key skip cost (limit #1).
Together (1)+(2) take a plain :next step from ~6 round-trips to ~1–2.
8. Reproducing these numbers
The dataset and every EXPLAIN above were produced in the devcontainer Postgres. To
regenerate: bring up the stack (make up), then in docker compose … exec db psql -U postgres, create a scratch database, apply the v1 DDL from
lib/gen_durable/migration.ex verbatim, seed with generate_series (700k terminal /
200k runnable non-partitioned / 80k distinct-key / 15k hot-key / 5k executing / 50k
signals), ANALYZE, and run EXPLAIN (ANALYZE, BUFFERS) on each statement from
lib/gen_durable/queries.ex. Wrap mutating statements in BEGIN; … ; ROLLBACK; so the
plan executes without changing the dataset. Run each twice and read the second (warm).