This guide covers the operational boundaries Squid Mesh expects host applications to own.

Runtime Guarantees

Squid Mesh currently guarantees:

  • durable run, step, and attempt state in Postgres
  • durable queued and scheduled work through Oban
  • workflow-level retry, replay, inspection, and cancellation on top of that durable state

Squid Mesh does not currently claim:

  • exactly-once external side effects
  • custom worker leases or heartbeats beyond Oban
  • dynamic cron registration after boot

Idempotent Step Design

Any step that talks to an external system should be idempotent at that boundary.

Recommended patterns:

  • include an application-owned idempotency key in the external request
  • persist enough domain state to detect duplicate delivery
  • treat remote 409 or duplicate acknowledgements as success when appropriate

Avoid:

  • steps that produce irreversible side effects without a duplicate strategy
  • relying on "this step should only run once" as the safety model

Queue Sizing

Squid Mesh runs on the host app's Oban instance, so queue sizing stays a host-app decision.

Recommended starting point:

  • dedicate a :squid_mesh queue
  • isolate higher-cost workflow traffic from unrelated app jobs
  • size concurrency conservatively, then increase based on observed queue depth

If workflows perform mostly I/O:

  • a moderate queue limit is usually fine

If workflows call slow external systems:

  • keep limits lower
  • prefer backoff and queue isolation over large worker counts

Retries And Backoff

Workflow-step retries are owned by Squid Mesh, not by Oban worker max_attempts.

Jido action retries are also disabled at the Squid Mesh runtime boundary so one workflow attempt maps to one persisted step attempt.

Recommended practice:

  • declare retries only on steps that own recoverable work
  • prefer bounded exponential backoff
  • surface structured errors from steps so retry behavior is understandable in inspection

Example:

step(:check_gateway_status, MyApp.Steps.CheckGatewayStatus,
  retry: [max_attempts: 5, backoff: [type: :exponential, min: 1_000, max: 30_000]]
)

Long Waits

Built-in :wait steps are non-blocking because they reschedule continuation through Oban instead of sleeping inside a worker.

Still, long waits have real operational cost:

  • more scheduled jobs
  • longer-lived run records
  • more delayed work to reason about during incidents

Recommended practice:

  • keep :wait for workflow-scale delays, not arbitrary timers everywhere
  • prefer application scheduling or cron triggers when the delay is really about when the workflow should start
  • avoid extremely large waits unless the workflow truly needs to remain in-flight

Cron Activation

Cron triggers are declared in the workflow but activated by the host app through SquidMesh.Plugins.Cron.

Current boundary:

  • activation is static at boot
  • Oban owns recurring scheduling
  • Squid Mesh turns the cron tick into a normal start_run/3 call

Recommended practice:

  • treat cron workflows as deploy-time configuration
  • review cron registrations alongside the host app's Oban setup
  • keep payload defaults complete so cron runs do not rely on manual input

Observability

At minimum, production deployments should capture:

  • run lifecycle telemetry
  • step lifecycle telemetry
  • queue depth and throughput from Oban
  • structured logs with run and step metadata

Recommended reading: