:rest_for_one supervisor wrapping the five Temporal children whose
lifetimes are tied to the bridge-handle registry:
1. Hourglass.Runtime: owns the CoreRuntimeResource NIF resource ref. First child so a Runtime crash (notably via the :nif_reloaded exit signal that BridgeHolder raises when it detects an ArgumentError from Bridge.worker_new/2, the symptom of Phoenix.CodeReloader reloading the Bridge module and invalidating every pre-reload resource ref) cascades the entire subtree, dropping every stale handle in a single restart.
2. Hourglass.BridgeHolder: owns every bridge handle in the VM.
3. Hourglass.WorkerRegistry: :via registry mapping task-queue names to Hourglass.Worker GenServer pids.
4. Hourglass.Worker.Supervisor: DynamicSupervisor of per-task-queue Hourglass.Worker GenServers.
5. Hourglass.WorkerLauncher: boot child that registers the default "default"-task-queue Worker against Worker.Supervisor. The worker resolves workflow and activity modules structurally from the Temporal type name (no module-list config required). Inside the Subsystem so a BridgeHolder cascade re-runs it and re-establishes the default Worker against the fresh subtree.
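The shape of such a supervisor can be sketched as follows. This is a minimal illustration, not the Hourglass source: SubsystemSketch and the five child ids are hypothetical names, and each child is a plain Agent stub so the example runs without the Temporal NIF.

```elixir
defmodule SubsystemSketch do
  # Hypothetical stand-in for the subsystem supervisor. The real children are
  # Hourglass modules; Agents are used here only to keep the sketch runnable.
  use Supervisor

  # Start order mirrors the list above: Runtime first, WorkerLauncher last.
  @child_ids [:runtime, :bridge_holder, :worker_registry, :worker_supervisor, :worker_launcher]

  def child_ids, do: @child_ids

  def start_link(opts \\ []) do
    Supervisor.start_link(__MODULE__, opts)
  end

  @impl true
  def init(_opts) do
    children =
      for id <- @child_ids do
        %{id: id, start: {Agent, :start_link, [fn -> id end]}}
      end

    # :rest_for_one restarts the crashed child and every child started after it.
    Supervisor.init(children, strategy: :rest_for_one)
  end
end
```

Starting it with SubsystemSketch.start_link() brings up all five stubs in the declared order.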
Why a dedicated subsystem?
Previously these five were direct children of Hourglass.Application's
top-level :one_for_one supervisor. If BridgeHolder crashed
(rare — Logger.error :task_sup_crashed exit, or other runtime
bug), only BridgeHolder was restarted. Existing
Hourglass.Worker GenServers (under Worker.Supervisor) had
no idea their bridge-handle registrations were lost — their poll
loops entered a 50 ms-sleep retry loop on
{:error, :worker_not_registered} forever, requiring manual
intervention to recover.
Wrapping the five in a :rest_for_one supervisor makes a
BridgeHolder crash cascade through WorkerRegistry,
Worker.Supervisor, and WorkerLauncher. After the cascade, the new
BridgeHolder starts with empty handles, the new WorkerRegistry is
empty, and Worker.Supervisor is a fresh DynamicSupervisor with no
children until WorkerLauncher re-registers the default Worker.
Zombie state is gone.
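The difference between the two strategies can be observed with a small experiment. The sketch below is an assumption-laden stand-in: CascadeDemo is a hypothetical module, and the three children are Agent stubs named after the real ones. It kills the stub holder and reports which child ids came back with a new pid.

```elixir
defmodule CascadeDemo do
  # Hypothetical demo module; the children are Agent stubs, not Hourglass processes.
  @ids [:bridge_holder, :worker_registry, :worker_supervisor]

  # Starts the stubs under the given strategy, kills :bridge_holder, and
  # returns the ids whose pid changed as a result.
  def restarted_ids(strategy) do
    children =
      for id <- @ids, do: %{id: id, start: {Agent, :start_link, [fn -> id end]}}

    {:ok, sup} = Supervisor.start_link(children, strategy: strategy)
    before = pids(sup)

    victim = Map.fetch!(before, :bridge_holder)
    Process.exit(victim, :kill)
    after_restart = wait_for_restart(sup, :bridge_holder, victim)

    Supervisor.stop(sup)
    for id <- @ids, before[id] != after_restart[id], do: id
  end

  defp pids(sup) do
    for {id, pid, _type, _mods} <- Supervisor.which_children(sup), into: %{}, do: {id, pid}
  end

  # Polls until the supervisor reports a new, live pid for the killed child.
  # The supervisor performs the whole restart inside one callback, so once the
  # new pid is visible the cascade (if any) has finished.
  defp wait_for_restart(sup, id, old_pid) do
    case pids(sup) do
      %{^id => pid} = all when is_pid(pid) and pid != old_pid ->
        all

      _ ->
        Process.sleep(10)
        wait_for_restart(sup, id, old_pid)
    end
  end
end
```

Under :one_for_one only :bridge_holder is replaced, which is the zombie scenario described above; under :rest_for_one all three ids are returned.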
In-flight Task survival
In-flight evaluator and activity-executor Tasks live under
sibling DynSups at Hourglass.Application scope:
They are NOT children of this subsystem, so a BridgeHolder crash
does not kill them. They keep running. Their next
BridgeHolder.complete_* call returns
{:error, :worker_not_registered} (the fresh BridgeHolder has
no registration for their old task queue) — per
Hourglass.Worker.WorkflowEvaluator.run/1's error handling, that
error is logged and the Task exits :normal. Core then
redelivers the activation on the next poll on a fresh consumer.
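The survival path can be illustrated with stubs. In the sketch below, InFlightDemo and HolderStub are hypothetical names, the holder is an Agent tracking registered task queues, and complete/1 only imitates the {:error, :worker_not_registered} reply; none of this is the real BridgeHolder API.

```elixir
defmodule InFlightDemo do
  # Hypothetical stand-in for BridgeHolder: an Agent holding registered queues.
  defmodule HolderStub do
    use Agent

    def start_link(_opts), do: Agent.start_link(fn -> MapSet.new() end, name: __MODULE__)
    def register(queue), do: Agent.update(__MODULE__, &MapSet.put(&1, queue))

    def complete(queue) do
      if Agent.get(__MODULE__, &MapSet.member?(&1, queue)) do
        :ok
      else
        {:error, :worker_not_registered}
      end
    end
  end

  def run do
    # The Task supervisor is a sibling of the subsystem, not a child of it.
    {:ok, task_sup} = Task.Supervisor.start_link()
    {:ok, subsystem} = Supervisor.start_link([HolderStub], strategy: :rest_for_one)
    :ok = HolderStub.register("default")

    task =
      Task.Supervisor.async_nolink(task_sup, fn ->
        # Block until told to finish, simulating in-flight work.
        receive do
          :go -> :ok
        end

        case HolderStub.complete("default") do
          :ok ->
            :completed

          {:error, :worker_not_registered} = error ->
            # Stand-in for logging; the Task then returns and exits normally.
            IO.puts("completion dropped: #{inspect(error)}")
            error
        end
      end)

    old = Process.whereis(HolderStub)
    Process.exit(old, :kill)
    wait_for_new_holder(old)

    survived? = Process.alive?(task.pid)
    send(task.pid, :go)
    result = Task.await(task)

    Supervisor.stop(subsystem)
    Supervisor.stop(task_sup)
    {survived?, result}
  end

  # Polls until the restarted holder has re-registered its name.
  defp wait_for_new_holder(old) do
    case Process.whereis(HolderStub) do
      pid when is_pid(pid) and pid != old ->
        :ok

      _ ->
        Process.sleep(10)
        wait_for_new_holder(old)
    end
  end
end
```

The Task outlives the holder crash, then sees the error from the fresh holder because the registration was lost with the old process.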
Recovery semantics — caller responsibility
Worker.Supervisor is a DynamicSupervisor. Its children — the
per-task-queue Worker GenServers — are dynamically registered
via Worker.Supervisor.start_worker/1. Cascade-restarting
Worker.Supervisor therefore terminates every Worker it was
parenting; on restart, Worker.Supervisor is empty.
After a BridgeHolder crash + cascade, the default "default"-task-
queue Worker is re-registered automatically via WorkerLauncher (the
5th child). Test-spawned per-test Workers use unique task queues and
must be re-registered by the test harness — BridgeHolder crashes
inside a test are rare enough that this is acceptable.
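A caller-side view of this can be sketched with stubs. RecoveryDemo, its start_worker/1, and the "per-test-queue" name are hypothetical; the holder and the Workers are Agents, so only the supervision behaviour is representative, not the Hourglass API.

```elixir
defmodule RecoveryDemo do
  # Hypothetical name for the stub DynamicSupervisor.
  @worker_sup RecoveryDemo.WorkerSup

  # Stand-in for Worker.Supervisor.start_worker/1: starts an Agent per queue.
  def start_worker(task_queue) do
    spec = %{id: task_queue, start: {Agent, :start_link, [fn -> task_queue end]}}
    DynamicSupervisor.start_child(@worker_sup, spec)
  end

  # Returns {workers before crash, workers after cascade, workers after re-registration}.
  def run do
    children = [
      %{id: :bridge_holder, start: {Agent, :start_link, [fn -> %{} end]}},
      {DynamicSupervisor, name: @worker_sup, strategy: :one_for_one}
    ]

    {:ok, sup} = Supervisor.start_link(children, strategy: :rest_for_one)
    {:ok, _pid} = start_worker("per-test-queue")
    before = active_workers()

    crash_and_wait(sup, :bridge_holder)
    after_cascade = active_workers()

    # The caller (for example a test harness) must register the Worker again.
    {:ok, _pid} = start_worker("per-test-queue")
    after_reregister = active_workers()

    Supervisor.stop(sup)
    {before, after_cascade, after_reregister}
  end

  defp active_workers, do: DynamicSupervisor.count_children(@worker_sup).active

  defp child_pid(sup, id) do
    Enum.find_value(Supervisor.which_children(sup), fn
      {^id, pid, _type, _mods} -> pid
      _other -> nil
    end)
  end

  defp crash_and_wait(sup, id) do
    old = child_pid(sup, id)
    Process.exit(old, :kill)
    wait_for_restart(sup, id, old)
  end

  # Polls until the supervisor reports a new pid for the crashed child.
  defp wait_for_restart(sup, id, old) do
    case child_pid(sup, id) do
      pid when is_pid(pid) and pid != old ->
        :ok

      _ ->
        Process.sleep(10)
        wait_for_restart(sup, id, old)
    end
  end
end
```

The middle value is zero because a restarted DynamicSupervisor does not remember children that were started dynamically.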
Child ordering rationale
:rest_for_one cascades restarts to every child after the one
that died. Order matters:
- Runtime first: if it dies (notably via BridgeHolder raising :nif_reloaded after detecting an ArgumentError from Bridge.worker_new/2), cascade everything else. Every Bridge resource ref (CoreRuntime, Worker, Client) issued by the pre-reload NIF is now opaque; resetting the whole subtree re-acquires fresh refs against the freshly loaded NIF.
- BridgeHolder second: if it dies, cascade WorkerRegistry (its via-name entries are stale) and Worker.Supervisor (its Workers' bridge handles are gone). WorkerLauncher, as the fifth child, re-runs as part of the same cascade.
- WorkerRegistry third: if it somehow dies independently of BridgeHolder, cascade only the children after it, chiefly Worker.Supervisor (its Workers' via-names are gone). BridgeHolder keeps its handles, but every Worker is a fresh registration anyway.
- Worker.Supervisor fourth: its children are dynamic; if it dies on its own, the three children before it are unaffected, WorkerLauncher re-runs, and the application restarts any other Workers via start_worker/1.
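The ordering rule reduces to a simple list operation, sketched below. RestForOne is a hypothetical helper and the atoms are illustrative ids for the five children; the function only models which children a :rest_for_one supervisor restarts, it does not start any processes.

```elixir
defmodule RestForOne do
  # Illustrative ids in start order; not the real child ids.
  @children [:runtime, :bridge_holder, :worker_registry, :worker_supervisor, :worker_launcher]

  def children, do: @children

  # Under :rest_for_one, the crashed child and every child started after it
  # are restarted; children started before it are left alone.
  def restarted(crashed) when crashed in @children do
    Enum.drop_while(@children, &(&1 != crashed))
  end
end

RestForOne.restarted(:worker_registry)
# => [:worker_registry, :worker_supervisor, :worker_launcher]
```

Calling RestForOne.restarted(:runtime) returns all five ids, which is why the child whose failure invalidates the most state goes first.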
Test-helper parity
test/test_helper.exs boots the app-scope singletons; the
:temporal integration suite brings up this subsystem (under
Hourglass.Application with :start_runtime true, or via
start_supervised/1) so the production-shape tests exercise the same
supervision-tree structure. The cascade-restart regression test
for BridgeHolder lives in
test/hourglass/subsystem_test.exs (async: false because it
deliberately crashes the globally-shared holder) and relies on this
module to assert the correct cascade.
Functions
Returns a specification to start this module under a supervisor.
See Supervisor.
@spec start_link(keyword()) :: Supervisor.on_start()