Squid Mesh is observable through durable runtime state first. Host applications inspect the journal-backed read models, graph output, explanation diagnostics, and their own worker logs or metrics.

Squid Mesh does not currently expose a public :telemetry event contract under the [:squid_mesh, ...] prefix. Treat telemetry event names and metric labels as host-app concerns until a dedicated runtime telemetry API exists.

Runtime State Surfaces

Use these public APIs as the stable observability boundary:

list_runs/2 intentionally stays narrow. It exposes lookup and status fields without attempt inputs, outputs, errors, claim metadata, or idempotency keys. Use inspect_run/2 only after selecting a specific run and applying the host app's authorization rules.

Redaction And Field Selection

Treat Squid Mesh observability data as three tiers:

TierExamplesSuggested use
Index-saferun_id, workflow, queue, status, terminal status, indexed timeRun lists, dashboards, queue counters.
Operator detailreason, visible/scheduled attempt counts, next visibility time, manual step, anomaly countSupport views and incident pages after authorization.
Sensitive detailrun input, durable context, attempt input/output/error, idempotency keys, claim IDs, owner IDs, manual metadataPrivileged audit views only, with host redaction.

inspect_run/2 and inspect_run_graph/2 can expose host-domain data because step inputs, outputs, errors, manual metadata, and durable context come from the embedding application. Squid Mesh cannot know which fields are customer data, provider responses, tokens, or internal notes. Apply an allow-list at the HTTP, LiveView, CLI, or API boundary instead of serializing the full snapshot by default.

For example, an operator summary can keep runtime state while dropping step payloads:

def operator_summary(snapshot) do
  manual_state = snapshot.manual_state || %{}

  %{
    run_id: snapshot.run_id,
    workflow: snapshot.workflow,
    queue: snapshot.queue,
    status: snapshot.status,
    reason: snapshot.reason,
    visible_attempt_count: length(snapshot.visible_attempts),
    scheduled_attempt_count: length(snapshot.scheduled_attempts),
    next_visible_at: snapshot.next_visible_at,
    manual_step: Map.get(manual_state, :step) || Map.get(manual_state, "step"),
    anomaly_count: length(snapshot.anomalies)
  }
end

For graph views, prefer inspect_run_graph/2 without include_history: true unless the viewer needs input, output, error, manual-state, or attempt detail. When history is enabled, redact each node's input, output, error, manual_state, and attempts fields before exposing the payload outside a trusted operator surface.

Use the same rule for metrics and logs: record counts, statuses, queues, workflow names, and reason categories. Avoid user-provided payload fields, provider responses, idempotency keys, claim identifiers, and raw errors as labels or log fields.

What To Measure

The read model gives host apps enough durable state to derive useful operational signals:

SignalSourceWhy it matters
Run counts by workflow, queue, and statuslist_runs/2Tracks volume, completion rate, and backlog shape.
Visible attempt depthinspect_run/2.visible_attemptsShows work that workers can claim now.
Scheduled attempt depth and next wakeupscheduled_attempts, next_visible_atShows delayed retries, waits, and future-visible work.
Claimed or expired attemptsattempts, expired_claimsIdentifies workers that are busy, stalled, or recoverable.
Pending dispatch/resultspending_dispatches, pending_resultsDetects journal facts that need runtime reconciliation.
Manual intervention countmanual_state and status :pausedDrives approval queues and operator SLAs.
Terminal outcomesterminal?, terminal_statusTracks completed, failed, cancelled, and replayed work.
Runtime anomaliesanomaliesSurfaces inconsistent or malformed durable facts.

For dashboards, start with list_runs/2, then inspect selected runs with history only when the caller needs detailed attempts or audit evidence.

Operator Explanations

explain_run/2 is the highest-signal surface for support tooling. It condenses the inspection snapshot into:

  • reason - the runtime state category, such as :attempt_visible, :attempt_scheduled_for_later, :manual_intervention_required, :expired_claim, or :terminal.
  • summary and details - a short explanation plus structured state.
  • next_actions - safe host/operator actions, such as waiting for a worker, resolving a manual step, recovering an expired claim, or inspecting a terminal run.
  • evidence - thread revisions, attempt counts, planned/applied runnable keys, manual state, next visibility time, and anomalies.

Use this for incident pages, CLI output, and support views where raw journal facts would be too noisy.

Graph Output

inspect_run_graph/2 presents the same durable state as workflow nodes and edges. It is useful when a host UI needs to show:

  • current nodes
  • completed, pending, retrying, failed, skipped, and paused nodes
  • selected transition edges
  • dependency edges and pending joins
  • manual-state detail when history is included

For JSON or LiveView boundaries, call SquidMesh.Runs.GraphInspection.to_map/1 after applying the host app's authorization and redaction policy. See Graph inspection contract for the stable map shape.

Logs

Squid Mesh emits application logs only for explicit built-in :log workflow steps. It does not currently attach automatic logger metadata such as run_id, workflow, step, or attempt to every runtime log.

If a host app needs correlated logs, wrap worker execution and host boundaries with its own logger metadata:

Logger.metadata(queue: queue, worker: worker_id)
SquidMesh.execute_next(queue: queue, owner_id: worker_id)

For step-specific external calls, prefer logging at the host boundary or inside native SquidMesh.Step modules, and avoid logging secrets, claim tokens, payloads, or raw provider responses.

Host Telemetry

Host applications can still emit their own telemetry around Squid Mesh calls:

:telemetry.span(
  [:my_app, :squid_mesh, :execute_next],
  %{queue: queue, worker: worker_id},
  fn ->
    result = SquidMesh.execute_next(queue: queue, owner_id: worker_id)
    {result, %{result: elem(result, 0)}}
  end
)

Keep host telemetry labels low-cardinality. Good labels include queue, workflow, status, and result category. Avoid run_id, claim tokens, idempotency keys, raw errors, or user-provided payload fields as metric labels.

  • Getting started shows the inspection and explanation APIs in a small runnable workflow.
  • Graph inspection contract documents the node and edge payload for host UIs.
  • Host app integration shows where host apps wrap worker loops, inspection, and manual-control APIs.
  • Operations covers production concerns such as retries, waits, cancellation, and cron activation.