Migrating from v0.5.1 to v0.6

Copy Markdown View Source

v0.6 is a hardening release: a full-library logic audit fixed ~25 findings across activity liveness, replay-path agreement, the determinism scanner, identity across continue_as_new chains and cluster nodes, and signal/cancel/await consistency. Most fixes are invisible to application code. This guide covers the migration and the observable behavior changes.

Database Migration

v0.6 adds one column:

alter table(:continuum_runs) do
  add :cancel_requested_at, :utc_datetime_usec
end

mix continuum.gen.migration includes it for new installs. No backfill is needed; the column records pending cancel requests for runs whose owning engine was unreachable, and the owner honors it on its next lease heartbeat.

Cancellation Has a Real cancelled State

Cancelled runs previously ended as failed with the error term :cancelled. In v0.6 the run row's state is cancelled, and cancel_run! is the single broadcaster of one canonical {:run_finished, run_id, :cancelled, :cancelled} message — including for cascade-cancelled descendant runs, whose awaiters previously blocked for their full timeout.

What to update:

  • Code matching {:error, %{state: :failed, error: :cancelled}} from Continuum.await/3 now receives {:error, %{state: :cancelled, error: :cancelled}}.
  • Code inspecting run rows directly (state == "failed" plus decoding the error) should match state == "cancelled".
  • Rows written by earlier versions are still recognized: they display, await, and query as cancelled (Continuum.query(state: :cancelled) matches both encodings).
  • A child run that legitimately failed with the user error term :cancelled is now classified as a failure by its parent's await_child/1, not as a cancellation.

Continuum.signal/3,4 Validates Its Target

Signaling a run that does not exist returns {:error, :not_found}, and signaling a terminal run returns {:error, :run_terminal}. Previously both returned :ok while the signal sat in a mailbox nothing could ever consume. If you signal speculatively (for example, fire-and-forget notifications to runs that may have finished), handle or ignore the new error tuples.

Journal Errors Are Structured

Journal write rejections raise Continuum.Runtime.JournalError (with op and a structured reason) instead of RuntimeError with a formatted message. Code rescuing RuntimeError around journal operations — or matching on message substrings such as "lease_mismatch" — must rescue Continuum.Runtime.JournalError and match on error.reason instead.

Relatedly, a transient database failure while journaling a completion or suspension no longer marks the run failed with the DB exception as its error: the engine crashes and crash-and-resume replays and finishes the run.

Cancel Results Are More Specific

Continuum.cancel/2 on a run it cannot cancel locally now distinguishes:

  • {:error, :not_found} — no such run;
  • {:error, :owned_elsewhere} — a live engine on another node owns it. The cancel was forwarded if the node was reachable; otherwise the request was recorded durably and the owner honors it on its next heartbeat — the error tells you cancellation is pending, not failed;
  • {:error, {:run_not_active, state}} — the run is already terminal (previously reported as :not_found).

continue_as_new Chains Are Transparent

Operations addressed to a chain-root run id now act on the live incarnation: signals are delivered to the tip's mailbox, cancel cancels the tip, and Continuum.await/3 follows the chain to the final terminal result (the internal {:continued, run_id} marker is never returned). When a run continues, its undelivered signals, live unawaited children, namespace, and attributes move to the successor — previously children were orphaned from the cancel cascade and tenant scoping silently reset to defaults.

Successors are also stamped with the workflow's currently loaded version instead of the predecessor's pin, so long-running chains pick up deploys.

stuck_unknown_version Is No Longer Produced

A node that claims a run whose (workflow, version_hash) it does not have loaded now releases the lease and leaves the run suspended for a capable node, emitting [:continuum, :run, :unknown_version] per attempt. Runs marked stuck_unknown_version by earlier versions are flipped back to suspended at boot when a matching version registers. If you alerted on the stuck state, alert on the telemetry event (or mix continuum.audit) instead.

Activity Execution Liveness

No action required, but worth knowing operationally:

  • Task leases are heartbeated while the activity executes (TTL 30 seconds, renewed every 10; tune with :task_lease_ttl_seconds and :task_lease_renew_ms). Activities longer than 30 seconds no longer depend on a one-shot lease extension, and a crashed worker's task is rescuable within roughly one TTL.
  • Crash requeues consume an attempt. An activity with the default max_attempts: 1 whose worker or node dies mid-execution now fails with :attempts_exhausted instead of silently re-running its side effects on every recovery. Raise max_attempts (and supply an idempotency_key/1) for crash-resilient activities.
  • mix continuum.audit reports expired_leased_activity_tasks; a persistently non-zero count means workers are dying between claim and completion faster than the sweep rescues them.

side_effect/1 Identity in Helper Modules

Producer fingerprints no longer include per-compilation anonymous-function artifacts, so recompiling a helper module (adding an unrelated function) no longer drifts every in-flight run replaying through a side_effect site in it. One-time caveat: histories journaled through the bare-producer Effect.run/2 form (not the Continuum.side_effect/1 macro, which is what workflow code uses) replay-break once across this upgrade.

Note the documented caveat on Continuum.side_effect/1: command identity includes the call site's line, and helper modules have no version-hash protection — prefer keeping side_effect calls in the workflow module.

Determinism Scanner Coverage

Recompiling against v0.6 may surface new compile errors or warnings in workflow code that previously slipped through — each is a real determinism hazard:

  • piped banned calls (x |> send(:msg)) are checked at their effective arity and rejected;
  • chained dynamic receivers (input.mod.fun(x)) and captures of dynamic modules (&m.f/1) warn as unanalyzable;
  • catch arms in Continuum.Pure helpers warn (same suspend-swallow foot-gun as in workflow clauses);
  • the compensate_all coverage check sees the whole module, so uncompensated activities in other clauses or private helpers now warn — and call sites with non-literal opts no longer warn falsely.

Internal Runtime API Changes

Only relevant if you call Continuum.Runtime.* directly (not a supported surface): Journal.Postgres.retry_activity_task!/5 takes backoff_ms instead of a timestamp, Journal.Postgres.deliver_signal!/4 returns {:ok, delivered_run_id} (it may have chain-hopped) or an error tuple, and Lease.renew/4 can return {:ok, :cancel_requested}, which callers must not treat as an error.