Incident Response Playbook

Copy Markdown View Source

This guide is the operator-facing reference for responding to SAML incidents when a Relyra deployment is live. Every login resolves to a verified trust path or a typed rejection; this playbook is what to do when the rejection recurs, when a metadata refresh suspends, when a signing cert rotates, or when a replay storm hits the ACS endpoint. Relyra owns the typed-rejection contract and the five evidence surfaces (telemetry, audit ledger, LiveView admin, Mix tasks, troubleshooting decoder); the host owns the operational response. The reference table below is the centerpiece — every scenario runbook in this guide points back into it rather than restating where the evidence lives.

Relyra owns / Host owns

Relyra owns

  • Emitting structured [:relyra, :saml, ...] telemetry events at every stage of the trust pipeline (lib/relyra/telemetry.ex).
  • Writing append-only audit rows to the relyra_audit_events table for every trust-state mutation: connection, metadata, certificate, and mapping.
  • Publishing the /relyra/admin LiveView routes for connection / metadata / certificate / mapping inspection and mutation.
  • Shipping the 7 mix relyra.* operator hand-tools.
  • Documenting the canonical Relyra.Error atom taxonomy in ../troubleshooting.md.

Host owns

  • Telemetry storage and alerting (Relyra emits; the host's observability stack retains, indexes, and pages on-call).
  • Audit-row access control and retention (the host's policy decides who reads relyra_audit_events and for how long).
  • Authentication and authorization for the /relyra/admin LiveView surface (Relyra exposes the routes; the host's pipeline gates them).
  • Scheduling Mix-task invocation — for example, a cron entry that runs mix relyra.refresh_due on a cadence appropriate to the deployment.
  • The human operator playbook. This guide is a starting point; institutional procedures (paging, escalation, communication) wrap it.

Evidence surfaces

When an incident hits, walk this table top to bottom. Each row names the exact code anchor — Relyra's strict-defaults posture means the operator should never need to grep the source to find the right evidence.

SurfaceWhat it tells youExact code anchor
Telemetry eventsStructured :start / :stop / :exception spans for every trust-pipeline stage, plus auto-refresh state-transition events and certificate-expiry warningslib/relyra/telemetry.ex
Audit ledgerAppend-only relyra_audit_events rows recording every trust-state mutation (connection / metadata / certificate / mapping)lib/relyra/ecto/audit_event.ex
LiveView admin routesOperator UI for connection, metadata, certificate, and mapping inspection and mutationlib/relyra/live_admin/router.ex
Mix tasks7 operator hand-tools for drift checks, diagnostic bundles, scaffolding, metadata pinning, scheduled refresh, and security-review evidencelib/mix/tasks/relyra.*.ex
Troubleshooting decoderPer-atom decoder with the four-field micro-block (Means / Likely root cause / Operator action / Source) for every Relyra.Error atom../troubleshooting.md

Telemetry event catalog

Cite these event names verbatim when attaching :telemetry.attach/4 handlers. Each span event below is a :start / :stop / :exception triplet — the :stop event carries duration_ms and an :outcome / :error_code pair when the stage failed.

Span events:

  • [:relyra, :saml, :login]
  • [:relyra, :saml, :authn_request]
  • [:relyra, :saml, :response, :decode]
  • [:relyra, :saml, :response, :validate]
  • [:relyra, :saml, :signature, :verify]
  • [:relyra, :saml, :replay, :check]
  • [:relyra, :saml, :user, :map]
  • [:relyra, :saml, :session, :establish]
  • [:relyra, :saml, :metadata, :refresh]
  • [:relyra, :saml, :metadata, :import]
  • [:relyra, :saml, :metadata, :auto_refresh]

Auto-refresh state-transition events (one-shot, not span-bracketed):

  • [:relyra, :saml, :metadata, :auto_refresh, :degraded]
  • [:relyra, :saml, :metadata, :auto_refresh, :suspended]
  • [:relyra, :saml, :metadata, :auto_refresh, :recovered]
  • [:relyra, :saml, :metadata, :auto_refresh, :validity_warning]
  • [:relyra, :saml, :metadata, :auto_refresh, :skipped]

Certificate expiry warning event (one-shot):

  • [:relyra, :saml, :certificate, :expiring]

Audit vocabulary

Filter relyra_audit_events by these atom literals — they are the exact values stored in the :domain and :action columns (lib/relyra/ecto/audit_event.ex:13-26).

@domain_values = [:connection, :metadata, :certificate, :mapping]
@action_values = [:created, :updated, :enabled, :disabled, :applied,
                  :refreshed, :staged, :activated, :retired, :replaced,
                  :deleted]

Trust mutations co-commit an audit row inside the same Ecto transaction — if a connection edit or a metadata apply succeeded, the row is present. Replays do not mutate trust state and therefore write no audit row; see Scenario 3 below.

LiveView admin routes

The route parameter is :connection_id everywhere a connection is referenced — match against this exact name in your host's authorization plug. The path prefix /relyra/admin is configurable when you mount the routes, but the suffix shapes are fixed.

PathModule / action
/relyra/admin/Relyra.LiveAdmin.ConnectionsLive :index
/relyra/admin/connections/newRelyra.LiveAdmin.ConnectionsLive :new
/relyra/admin/connections/:connection_idRelyra.LiveAdmin.ConnectionsLive :show
/relyra/admin/connections/:connection_id/editRelyra.LiveAdmin.ConnectionsLive :edit
/relyra/admin/connections/:connection_id/metadataRelyra.LiveAdmin.ConnectionMetadataLive :metadata
/relyra/admin/diagnostic/bundleRelyra.Phoenix.Controllers.DiagnosticController :download

Mix tasks

These are the 7 Relyra operator hand-tools. hex.audit is a third-party Hex task and is NOT a Relyra hand-tool — do not include it in operator runbooks alongside these.

TaskPurpose
mix relyra.batteries_includedGenerate or drift-check BATTERIES_INCLUDED.md.
mix relyra.conformanceGenerate or drift-check CONFORMANCE.md.
mix relyra.diagnosticGenerate a Relyra diagnostic bundle.
mix relyra.installScaffold the minimal Relyra integration surface.
mix relyra.metadata.pinPin a SHA-256 metadata trust fingerprint on a connection.
mix relyra.refresh_dueRefresh any metadata sources whose schedule is due.
mix relyra.security_reviewGenerate or drift-check SECURITY_REVIEW_EVIDENCE.md.

Scenario 1: Certificate expiry imminent

Symptom: no Relyra.Error atom — Relyra warns proactively before any signature failure occurs. The first signal is telemetry, not a rejection.

  1. Triage — the [:relyra, :saml, :certificate, :expiring] event (see the telemetry catalog in the surface table) fires with :days_until_expiry and :fingerprint_sha256 measurements. Confirm the :connection_id identifies the connection at risk. If your observability stack pages on this event, this is the moment to acknowledge.
  2. Diagnose — open /relyra/admin/connections/:connection_id and review the certificate inventory section for the affected connection. Query the audit ledger for domain = :certificate rows with action in [:staged, :activated] to see prior rollover history on this connection. If you need a redacted snapshot to share with the IdP vendor, run mix relyra.diagnostic.
  3. Recover — stage the new IdP signing certificate via the admin UI (audit row domain = :certificate, action = :staged will record the stage). When the IdP cuts over, activate the staged certificate (action = :activated) and retire the old one (action = :retired). If the IdP publishes the new cert via metadata, run mix relyra.refresh_due first to pick it up.

Scenario 2: Metadata drift after IdP change

Symptom: :metadata_drift_requires_review returns from metadata-apply attempts; auto-refresh telemetry may have transitioned the source to :degraded or :suspended.

  1. Triage — check the auto-refresh state-transition events from the telemetry catalog: [:relyra, :saml, :metadata, :auto_refresh, :degraded] or :suspended indicate the auto-refresh worker has flagged the source. The audit ledger will show domain = :metadata rows with action = :applied carrying a drift flag.
  2. Diagnose — open /relyra/admin/connections/:connection_id/metadata for a side-by-side of the staged-vs-current trust state. Query the audit ledger for domain = :metadata rows with action in [:applied, :refreshed] over the last 24 hours to see what the auto-refresh worker has been doing on this source. Cross-reference the atom decoder at ../troubleshooting.md#metadata_drift_requires_review.
  3. Recover — if the drift is legitimate (the IdP rotated its signing cert or moved its SSO endpoint), run mix relyra.refresh_due to force the worker to retry, then approve the drift via the admin UI (audit row domain = :metadata, action = :applied will record the approval). If the deployment requires byte-exact metadata identity, run mix relyra.metadata.pin with the new SHA-256 fingerprint to harden the connection against future drift.

Scenario 3: Replay storm

Symptom: :replayed_assertion in logs, repeating across many login attempts in a short window.

  1. Triage — the [:relyra, :saml, :replay, :check] telemetry event (see the catalog) fires with repeated :stop events whose :outcome is :error and whose :error_code is :replayed_assertion. Replays do not mutate trust state, so no audit row is writtenlib/relyra/replay_store/ecto.ex and lib/relyra/replay_store/ets.ex contain zero AuditWriter.append_event calls. Operators rely on [:relyra, :saml, :replay, :check] telemetry alone for replay-storm detection; the audit ledger will not corroborate.
  2. Diagnose — there is no admin LiveView surface for replay activity in v1.4. Volume, source-IP analysis, and per-connection grouping must come from your host application's log infrastructure (the telemetry handler that writes structured replay events into your logging stack). Use :connection_id in the telemetry metadata to localize the storm to a single connection if possible.
  3. Recover — apply host-app-level rate limiting on the ACS endpoint; replays are protocol-correct messages that the SAML library cannot refuse to receive (only refuse to accept). If the storm correlates with a single connection, disable that connection via /relyra/admin/connections/:connection_id/edit while you investigate (audit row domain = :connection, action = :disabled will record the disable). Re-enable (domain = :connection, action = :enabled) when the source is confirmed legitimate or rate-limited.

Scenario 4: Signature regression after IdP key rotation

Symptom: :digest_mismatch, :invalid_signature, or :trust_anchor_mismatch returning from every login attempt for a specific connection.

  1. Triage — the [:relyra, :saml, :signature, :verify] telemetry event (see the catalog) fires :stop with :outcome :error and the exact atom in :error_code. Group by :connection_id to confirm the regression is isolated to one connection — a fleet-wide signature failure indicates an algorithm-policy change, not a key rotation.
  2. Diagnose — open /relyra/admin/connections/:connection_id and review the signing certificate inventory; the IdP may have rotated to a key Relyra does not yet trust. Run mix relyra.diagnostic to bundle the connection state plus a redacted copy of the failing response payload for the IdP vendor's support team. Cross-reference the atom decoder at ../troubleshooting.md#digest_mismatch and ../troubleshooting.md#trust_anchor_mismatch for the exact distinction between the three failure modes.
  3. Recover — stage the new IdP signing certificate via the admin UI cert inventory (audit row domain = :certificate, action = :staged); confirm the next login succeeds against the staged cert; activate it (action = :activated) and retire the old one (action = :retired). The signing source is configured IdP certs only — Relyra will never trust a document KeyInfo no matter how convenient it would be at this moment.

Scenario 5: ACS misconfiguration at provisioning

Symptom: :destination_mismatch, :recipient_mismatch, or :in_response_to_mismatch on every login for a new or freshly reconfigured connection.

  1. Triage — the [:relyra, :saml, :response, :validate] telemetry event (see the catalog) fires :stop with :outcome :error and the atom in :error_code. The atom names exactly which field mismatched — :destination_mismatch is the Destination attribute on the response, :recipient_mismatch is the SubjectConfirmationData Recipient, and :in_response_to_mismatch is the response's InResponseTo attribute against the request store.
  2. Diagnose — open /relyra/admin/connections/:connection_id/edit and verify acs_url, sp_entity_id, idp_entity_id, and idp_sso_url match what the IdP's published metadata advertises. The IdP-side error message will reveal the mismatched value. Cross-reference ../troubleshooting.md#destination_mismatch for the exact field semantics.
  3. Recover — update the affected connection field via the admin UI; the audit row domain = :connection, action = :updated records the change. If your deployment uses metadata-driven configuration, fix the canonical value at the source (your metadata generator or IdP admin console) and run mix relyra.refresh_due to pick it up.

Scenario 6: Attribute mapping breakage

Symptom: :invalid_audience or mapping-stage errors surfacing at the host-app boundary (the user mapper rejects the principal, or the host session adapter rejects the attribute set).

  1. Triage — the [:relyra, :saml, :user, :map] telemetry event (see the catalog) fires :stop with :outcome :error. Query the audit ledger for domain = :mapping rows with action in [:created, :updated] to see recent mapping changes on the connection — a recent mapping edit is the most common cause.
  2. Diagnose — open /relyra/admin/connections/:connection_id and review the mapping section plus the audit timeline. Cross-reference ../troubleshooting.md#invalid_audience for the audience-mismatch case. If the issue is in the host's UserMapper callback rather than in the connection's mapping configuration, the fix is in host application code — see ../identity_mapping_and_provisioning.md for the UserMapper contract.
  3. Recover — update the mapping via the admin UI (audit row domain = :mapping, action = :updated); confirm the next login for a representative user succeeds. If the host application owns the failing logic in its UserMapper, ship the fix there and redeploy — Relyra emits the attribute set unchanged, so it never needs a library change for mapping-policy edits.

When in doubt

Every login resolves to a verified trust path or a typed rejection — and when in doubt, the diagnostic bundle is the trace. Run mix relyra.diagnostic to capture a redacted, attributable snapshot of the connection state, recent audit rows, and the failing response payload. Route the bundle to your IdP support contact when the symptom is a signing-cert regression, to your security review queue when an active attack is suspected, or attach it to a Relyra issue when you suspect a library bug. The bundle is designed to be safe to share — Phase 23's redactor strips secret material before the archive is written.