This guide is the operator-facing reference for responding to SAML incidents when a Relyra deployment is live. Every login resolves to a verified trust path or a typed rejection; this playbook is what to do when the rejection recurs, when a metadata refresh suspends, when a signing cert rotates, or when a replay storm hits the ACS endpoint. Relyra owns the typed-rejection contract and the five evidence surfaces (telemetry, audit ledger, LiveView admin, Mix tasks, troubleshooting decoder); the host owns the operational response. The reference table below is the centerpiece — every scenario runbook in this guide points back into it rather than restating where the evidence lives.
Relyra owns / Host owns
Relyra owns
- Emitting structured [:relyra, :saml, ...] telemetry events at every stage of the trust pipeline (lib/relyra/telemetry.ex).
- Writing append-only audit rows to the relyra_audit_events table for every trust-state mutation: connection, metadata, certificate, and mapping.
- Publishing the /relyra/admin LiveView routes for connection / metadata / certificate / mapping inspection and mutation.
- Shipping the 7 mix relyra.* operator hand-tools.
- Documenting the canonical Relyra.Error atom taxonomy in ../troubleshooting.md.
Host owns
- Telemetry storage and alerting (Relyra emits; the host's observability stack retains, indexes, and pages on-call).
- Audit-row access control and retention (the host's policy decides who reads relyra_audit_events and for how long).
- Authentication and authorization for the /relyra/admin LiveView surface (Relyra exposes the routes; the host's pipeline gates them).
- Scheduling Mix-task invocation: for example, a cron entry that runs mix relyra.refresh_due on a cadence appropriate to the deployment.
- The human operator playbook. This guide is a starting point; institutional procedures (paging, escalation, communication) wrap it.
Evidence surfaces
When an incident hits, walk this table top to bottom. Each row names the exact code anchor — Relyra's strict-defaults posture means the operator should never need to grep the source to find the right evidence.
| Surface | What it tells you | Exact code anchor |
|---|---|---|
| Telemetry events | Structured :start / :stop / :exception spans for every trust-pipeline stage, plus auto-refresh state-transition events and certificate-expiry warnings | lib/relyra/telemetry.ex |
| Audit ledger | Append-only relyra_audit_events rows recording every trust-state mutation (connection / metadata / certificate / mapping) | lib/relyra/ecto/audit_event.ex |
| LiveView admin routes | Operator UI for connection, metadata, certificate, and mapping inspection and mutation | lib/relyra/live_admin/router.ex |
| Mix tasks | 7 operator hand-tools for drift checks, diagnostic bundles, scaffolding, metadata pinning, scheduled refresh, and security-review evidence | lib/mix/tasks/relyra.*.ex |
| Troubleshooting decoder | Per-atom decoder with the four-field micro-block (Means / Likely root cause / Operator action / Source) for every Relyra.Error atom | ../troubleshooting.md |
Telemetry event catalog
Cite these event names verbatim when attaching :telemetry.attach/4
handlers. Each span event below is a :start / :stop / :exception
triplet — the :stop event carries duration_ms and an :outcome /
:error_code pair when the stage failed.
Span events:
- [:relyra, :saml, :login]
- [:relyra, :saml, :authn_request]
- [:relyra, :saml, :response, :decode]
- [:relyra, :saml, :response, :validate]
- [:relyra, :saml, :signature, :verify]
- [:relyra, :saml, :replay, :check]
- [:relyra, :saml, :user, :map]
- [:relyra, :saml, :session, :establish]
- [:relyra, :saml, :metadata, :refresh]
- [:relyra, :saml, :metadata, :import]
- [:relyra, :saml, :metadata, :auto_refresh]
Auto-refresh state-transition events (one-shot, not span-bracketed):
- [:relyra, :saml, :metadata, :auto_refresh, :degraded]
- [:relyra, :saml, :metadata, :auto_refresh, :suspended]
- [:relyra, :saml, :metadata, :auto_refresh, :recovered]
- [:relyra, :saml, :metadata, :auto_refresh, :validity_warning]
- [:relyra, :saml, :metadata, :auto_refresh, :skipped]
Certificate expiry warning event (one-shot):
[:relyra, :saml, :certificate, :expiring]
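A minimal sketch of a host-side handler wired to some of the span events listed above. The module name RelyraOps.FailureLogger and the handler id are hypothetical. The catalog says the :stop event carries duration_ms, :outcome, and :error_code but does not say which map holds them, so this sketch assumes duration_ms is a measurement and that :outcome, :error_code, and :connection_id arrive in metadata. Adjust the pattern match if your deployment shows otherwise.

```elixir
defmodule RelyraOps.FailureLogger do
  @moduledoc "Hypothetical host-side handler; not part of Relyra."
  require Logger

  # Span prefixes copied verbatim from the catalog; extend as needed.
  @span_prefixes [
    [:relyra, :saml, :response, :validate],
    [:relyra, :saml, :signature, :verify],
    [:relyra, :saml, :replay, :check],
    [:relyra, :saml, :user, :map]
  ]

  # Each span emits prefix ++ [:stop]; those are the events we attach to.
  def stop_events, do: Enum.map(@span_prefixes, &(&1 ++ [:stop]))

  # Call once at application start. Requires the :telemetry dependency.
  def attach do
    :telemetry.attach_many(
      "relyra-ops-failure-logger",
      stop_events(),
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(event, measurements, metadata, _config) do
    case summarize(event, measurements, metadata) do
      nil -> :ok
      line -> Logger.warning(line)
    end
  end

  # Pure helper: returns a log line for failed stages, nil otherwise.
  # ASSUMPTION: :outcome, :error_code, :connection_id live in metadata.
  def summarize(event, measurements, %{outcome: :error} = metadata) do
    stage = event |> Enum.drop(2) |> Enum.drop(-1) |> Enum.join(".")

    "relyra stage=#{stage} error_code=#{inspect(metadata[:error_code])} " <>
      "connection_id=#{inspect(metadata[:connection_id])} " <>
      "duration_ms=#{inspect(measurements[:duration_ms])}"
  end

  def summarize(_event, _measurements, _metadata), do: nil
end
```

Keeping the formatting in a pure summarize/3 function means the log shape can be checked without a running telemetry pipeline.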
Audit vocabulary
Filter relyra_audit_events by these atom literals. They are the exact
values stored in the :domain and :action columns
(lib/relyra/ecto/audit_event.ex:13-26).

    @domain_values [:connection, :metadata, :certificate, :mapping]
    @action_values [:created, :updated, :enabled, :disabled, :applied,
                    :refreshed, :staged, :activated, :retired, :replaced,
                    :deleted]

Trust mutations co-commit an audit row inside the same Ecto transaction: if a connection edit or a metadata apply succeeded, the row is present. Replays do not mutate trust state and therefore write no audit row; see Scenario 3 below.
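Because a misspelled atom matches zero rows silently, it can help to validate a filter against the vocabulary before querying. The sketch below is a hypothetical host-side helper (RelyraOps.AuditFilter is not a Relyra module). It assumes audit rows have already been loaded as maps with :domain and :action keys holding atoms; how they are loaded is left to the host.

```elixir
defmodule RelyraOps.AuditFilter do
  @moduledoc "Hypothetical host-side helper; not part of Relyra."

  # Vocabulary copied from the guide above.
  @domains [:connection, :metadata, :certificate, :mapping]
  @actions [:created, :updated, :enabled, :disabled, :applied, :refreshed,
            :staged, :activated, :retired, :replaced, :deleted]

  # Validates a filter so a typo such as :certificates fails loudly
  # instead of silently matching nothing.
  def build(domain, actions) when is_atom(domain) and is_list(actions) do
    unknown = Enum.reject(actions, &(&1 in @actions))

    cond do
      domain not in @domains -> {:error, {:unknown_domain, domain}}
      unknown != [] -> {:error, {:unknown_actions, unknown}}
      true -> {:ok, %{domain: domain, actions: actions}}
    end
  end

  # Applies a validated filter to rows already loaded as maps.
  # ASSUMPTION: rows expose :domain and :action keys holding atoms.
  def apply_filter(rows, %{domain: domain, actions: actions}) do
    Enum.filter(rows, &(&1.domain == domain and &1.action in actions))
  end
end
```

For example, RelyraOps.AuditFilter.build(:certificate, [:staged, :activated]) yields the filter used for rollover history in Scenario 1.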
LiveView admin routes
The route parameter is :connection_id everywhere a connection is
referenced — match against this exact name in your host's authorization
plug. The path prefix /relyra/admin is configurable when you mount the
routes, but the suffix shapes are fixed.
| Path | Module / action |
|---|---|
| /relyra/admin/ | Relyra.LiveAdmin.ConnectionsLive :index |
| /relyra/admin/connections/new | Relyra.LiveAdmin.ConnectionsLive :new |
| /relyra/admin/connections/:connection_id | Relyra.LiveAdmin.ConnectionsLive :show |
| /relyra/admin/connections/:connection_id/edit | Relyra.LiveAdmin.ConnectionsLive :edit |
| /relyra/admin/connections/:connection_id/metadata | Relyra.LiveAdmin.ConnectionMetadataLive :metadata |
| /relyra/admin/diagnostic/bundle | Relyra.Phoenix.Controllers.DiagnosticController :download |
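Since the host gates these routes, the decision logic behind its authorization plug might look like the sketch below. Everything here is hypothetical: the RelyraOps.AdminAuthz module, the operator map with :role and :connection_ids, and the policy itself. The only detail taken from the table is the parameter name, which appears as the string key "connection_id" in the path parameters a Phoenix host receives.

```elixir
defmodule RelyraOps.AdminAuthz do
  @moduledoc "Hypothetical host policy; not part of Relyra."

  # `operator` is an illustrative host map with :role and :connection_ids.
  # `path_params` mirrors the string-keyed path parameters of the request.
  def allow?(%{role: :admin}, _path_params), do: true

  # Scoped operators may only touch connections they are assigned to.
  def allow?(%{connection_ids: allowed}, %{"connection_id" => id}), do: id in allowed

  # Routes without :connection_id (index, new, diagnostic bundle)
  # are admin-only under this example policy.
  def allow?(_operator, _path_params), do: false
end
```

A host plug would call allow?/2 and halt the connection with a 403 when it returns false.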
Mix tasks
These are the 7 Relyra operator hand-tools. hex.audit is a third-party
Hex task and is NOT a Relyra hand-tool — do not include it in operator
runbooks alongside these.
| Task | Purpose |
|---|---|
| mix relyra.batteries_included | Generate or drift-check BATTERIES_INCLUDED.md. |
| mix relyra.conformance | Generate or drift-check CONFORMANCE.md. |
| mix relyra.diagnostic | Generate a Relyra diagnostic bundle. |
| mix relyra.install | Scaffold the minimal Relyra integration surface. |
| mix relyra.metadata.pin | Pin a SHA-256 metadata trust fingerprint on a connection. |
| mix relyra.refresh_due | Refresh any metadata sources whose schedule is due. |
| mix relyra.security_review | Generate or drift-check SECURITY_REVIEW_EVIDENCE.md. |
Scenario 1: Certificate expiry imminent
Symptom: no Relyra.Error atom — Relyra warns proactively before any
signature failure occurs. The first signal is telemetry, not a rejection.
- Triage: the [:relyra, :saml, :certificate, :expiring] event (see the telemetry catalog in the surface table) fires with :days_until_expiry and :fingerprint_sha256 measurements. Confirm the :connection_id identifies the connection at risk. If your observability stack pages on this event, this is the moment to acknowledge.
- Diagnose: open /relyra/admin/connections/:connection_id and review the certificate inventory section for the affected connection. Query the audit ledger for domain = :certificate rows with action in [:staged, :activated] to see prior rollover history on this connection. If you need a redacted snapshot to share with the IdP vendor, run mix relyra.diagnostic.
- Recover: stage the new IdP signing certificate via the admin UI (audit row domain = :certificate, action = :staged will record the stage). When the IdP cuts over, activate the staged certificate (action = :activated) and retire the old one (action = :retired). If the IdP publishes the new cert via metadata, run mix relyra.refresh_due first to pick it up.
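The triage step can be turned into a paging decision with a small classifier. This is a sketch under stated assumptions: the thresholds are illustrative host policy rather than Relyra defaults, RelyraOps.CertExpiry is a hypothetical module, and :days_until_expiry is assumed to be an integer measurement with :connection_id in metadata.

```elixir
defmodule RelyraOps.CertExpiry do
  @moduledoc "Hypothetical host-side classifier; not part of Relyra."

  # Illustrative host policy, not Relyra defaults.
  @page_within_days 7
  @ticket_within_days 30

  # ASSUMPTION: :days_until_expiry is an integer measurement and
  # :connection_id is carried in the event metadata.
  def classify(%{days_until_expiry: days}, metadata) when is_integer(days) do
    severity =
      cond do
        days <= @page_within_days -> :page
        days <= @ticket_within_days -> :ticket
        true -> :info
      end

    {severity, Map.get(metadata, :connection_id)}
  end
end
```

A telemetry handler attached to [:relyra, :saml, :certificate, :expiring] can pass its measurements and metadata straight into classify/2.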
Scenario 2: Metadata drift after IdP change
Symptom: :metadata_drift_requires_review
returns from metadata-apply attempts; auto-refresh telemetry may have
transitioned the source to :degraded or :suspended.
- Triage: check the auto-refresh state-transition events from the telemetry catalog. [:relyra, :saml, :metadata, :auto_refresh, :degraded] or :suspended indicate the auto-refresh worker has flagged the source. The audit ledger will show domain = :metadata rows with action = :applied carrying a drift flag.
- Diagnose: open /relyra/admin/connections/:connection_id/metadata for a side-by-side of the staged-vs-current trust state. Query the audit ledger for domain = :metadata rows with action in [:applied, :refreshed] over the last 24 hours to see what the auto-refresh worker has been doing on this source. Cross-reference the atom decoder at ../troubleshooting.md#metadata_drift_requires_review.
- Recover: if the drift is legitimate (the IdP rotated its signing cert or moved its SSO endpoint), run mix relyra.refresh_due to force the worker to retry, then approve the drift via the admin UI (audit row domain = :metadata, action = :applied will record the approval). If the deployment requires byte-exact metadata identity, run mix relyra.metadata.pin with the new SHA-256 fingerprint to harden the connection against future drift.
Scenario 3: Replay storm
Symptom: :replayed_assertion
in logs, repeating across many login attempts in a short window.
- Triage: the [:relyra, :saml, :replay, :check] telemetry event (see the catalog) fires with repeated :stop events whose :outcome is :error and whose :error_code is :replayed_assertion. Replays do not mutate trust state, so no audit row is written: lib/relyra/replay_store/ecto.ex and lib/relyra/replay_store/ets.ex contain zero AuditWriter.append_event calls. Operators rely on [:relyra, :saml, :replay, :check] telemetry alone for replay-storm detection; the audit ledger will not corroborate.
- Diagnose: there is no admin LiveView surface for replay activity in v1.4. Volume, source-IP analysis, and per-connection grouping must come from your host application's log infrastructure (the telemetry handler that writes structured replay events into your logging stack). Use :connection_id in the telemetry metadata to localize the storm to a single connection if possible.
- Recover: apply host-app-level rate limiting on the ACS endpoint; replays are protocol-correct messages that the SAML library cannot refuse to receive (only refuse to accept). If the storm correlates with a single connection, disable that connection via /relyra/admin/connections/:connection_id/edit while you investigate (audit row domain = :connection, action = :disabled will record the disable). Re-enable (domain = :connection, action = :enabled) when the source is confirmed legitimate or rate-limited.
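Because telemetry is the only evidence surface for replays, the host needs its own storm detector. The sketch below is one possible shape: a pure sliding-window counter keyed by connection. RelyraOps.ReplayStorm, the 60-second window, and the threshold of 20 are all hypothetical choices, and the metadata keys are assumed as in the triage step.

```elixir
defmodule RelyraOps.ReplayStorm do
  @moduledoc "Hypothetical sliding-window counter; not part of Relyra."

  # Window and threshold are illustrative; tune them to normal traffic.
  @window_ms 60_000
  @threshold 20

  def new, do: %{}

  # Records one replay rejection at `now_ms` (any monotonic millisecond
  # clock) and returns {updated_state, storm?} for that connection.
  def record(state, connection_id, now_ms) do
    recent =
      [now_ms | Map.get(state, connection_id, [])]
      |> Enum.filter(&(now_ms - &1 < @window_ms))

    {Map.put(state, connection_id, recent), length(recent) >= @threshold}
  end

  # Decides whether a :stop event's metadata describes a replay rejection.
  # ASSUMPTION: :outcome and :error_code are carried in metadata.
  def replay?(%{outcome: :error, error_code: :replayed_assertion}), do: true
  def replay?(_metadata), do: false
end
```

In a running system the state would live in a process such as a GenServer or Agent, fed by a handler on [:relyra, :saml, :replay, :check, :stop]. The pure functions keep the thresholds easy to test.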
Scenario 4: Signature regression after IdP key rotation
Symptom: :digest_mismatch,
:invalid_signature, or
:trust_anchor_mismatch
returning from every login attempt for a specific connection.
- Triage: the [:relyra, :saml, :signature, :verify] telemetry event (see the catalog) fires :stop with :outcome :error and the exact atom in :error_code. Group by :connection_id to confirm the regression is isolated to one connection; a fleet-wide signature failure indicates an algorithm-policy change, not a key rotation.
- Diagnose: open /relyra/admin/connections/:connection_id and review the signing certificate inventory; the IdP may have rotated to a key Relyra does not yet trust. Run mix relyra.diagnostic to bundle the connection state plus a redacted copy of the failing response payload for the IdP vendor's support team. Cross-reference the atom decoder at ../troubleshooting.md#digest_mismatch and ../troubleshooting.md#trust_anchor_mismatch for the exact distinction between the three failure modes.
- Recover: stage the new IdP signing certificate via the admin UI cert inventory (audit row domain = :certificate, action = :staged); confirm the next login succeeds against the staged cert; activate it (action = :activated) and retire the old one (action = :retired). The signing source is configured IdP certs only; Relyra will never trust a document KeyInfo no matter how convenient it would be at this moment.
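The group-by in the triage step can be sketched as a pure function over failure metadata collected by the host. RelyraOps.SignatureTriage is a hypothetical module. The inputs (a list of metadata maps from failed signature-verify :stop events and a count of connections that saw logins in the same window) are assumed to come from the host's telemetry store.

```elixir
defmodule RelyraOps.SignatureTriage do
  @moduledoc "Hypothetical host-side triage helper; not part of Relyra."

  @signature_errors [:digest_mismatch, :invalid_signature, :trust_anchor_mismatch]

  # `failures`: metadata maps from failed signature-verify :stop events.
  # `active_connections`: connections that saw logins in the same window.
  # With a single active connection the result is reported as :isolated.
  def classify(failures, active_connections) when active_connections > 0 do
    by_connection =
      failures
      |> Enum.filter(&(&1[:error_code] in @signature_errors))
      |> Enum.frequencies_by(& &1[:connection_id])

    affected = map_size(by_connection)

    scope =
      cond do
        affected == 0 -> :none
        affected == 1 -> :isolated
        affected == active_connections -> :fleet_wide
        true -> :partial
      end

    {scope, by_connection}
  end
end
```

An :isolated result points toward a key rotation on one IdP, while :fleet_wide points toward the algorithm-policy case described above.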
Scenario 5: ACS misconfiguration at provisioning
Symptom: :destination_mismatch,
:recipient_mismatch, or
:in_response_to_mismatch
on every login for a new or freshly reconfigured connection.
- Triage: the [:relyra, :saml, :response, :validate] telemetry event (see the catalog) fires :stop with :outcome :error and the atom in :error_code. The atom names exactly which field mismatched: :destination_mismatch is the Destination attribute on the response, :recipient_mismatch is the SubjectConfirmationData Recipient, and :in_response_to_mismatch is the response's InResponseTo attribute against the request store.
- Diagnose: open /relyra/admin/connections/:connection_id/edit and verify acs_url, sp_entity_id, idp_entity_id, and idp_sso_url match what the IdP's published metadata advertises. The IdP-side error message will reveal the mismatched value. Cross-reference ../troubleshooting.md#destination_mismatch for the exact field semantics.
- Recover: update the affected connection field via the admin UI; the audit row domain = :connection, action = :updated records the change. If your deployment uses metadata-driven configuration, fix the canonical value at the source (your metadata generator or IdP admin console) and run mix relyra.refresh_due to pick it up.
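The field-by-field check in the diagnose step is easy to get wrong by eye, since a trailing slash or a scheme difference is enough to cause a mismatch. A minimal sketch of an exact comparison, assuming both the configured connection and the expected values have been copied into plain maps by the operator; RelyraOps.ConnectionDiff is a hypothetical helper, not a Relyra module.

```elixir
defmodule RelyraOps.ConnectionDiff do
  @moduledoc "Hypothetical host-side helper; not part of Relyra."

  # Field names taken from the diagnose step above.
  @fields [:acs_url, :sp_entity_id, :idp_entity_id, :idp_sso_url]

  # Returns [{field, configured_value, expected_value}] for every field
  # whose values differ. Comparison is exact, with no URL normalization.
  def mismatches(configured, expected) do
    for field <- @fields,
        Map.get(configured, field) != Map.get(expected, field) do
      {field, Map.get(configured, field), Map.get(expected, field)}
    end
  end
end
```

The comparison is deliberately strict. Normalizing URLs before comparing would hide exactly the kind of difference that triggers :destination_mismatch.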
Scenario 6: Attribute mapping breakage
Symptom: :invalid_audience
or mapping-stage errors surfacing at the host-app boundary (the user
mapper rejects the principal, or the host session adapter rejects the
attribute set).
- Triage: the [:relyra, :saml, :user, :map] telemetry event (see the catalog) fires :stop with :outcome :error. Query the audit ledger for domain = :mapping rows with action in [:created, :updated] to see recent mapping changes on the connection; a recent mapping edit is the most common cause.
- Diagnose: open /relyra/admin/connections/:connection_id and review the mapping section plus the audit timeline. Cross-reference ../troubleshooting.md#invalid_audience for the audience-mismatch case. If the issue is in the host's UserMapper callback rather than in the connection's mapping configuration, the fix is in host application code; see ../identity_mapping_and_provisioning.md for the UserMapper contract.
- Recover: update the mapping via the admin UI (audit row domain = :mapping, action = :updated); confirm the next login for a representative user succeeds. If the host application owns the failing logic in its UserMapper, ship the fix there and redeploy; Relyra emits the attribute set unchanged, so it never needs a library change for mapping-policy edits.
When in doubt
Every login resolves to a verified trust path or a typed rejection — and
when in doubt, the diagnostic bundle is the trace. Run
mix relyra.diagnostic to capture a redacted, attributable snapshot of
the connection state, recent audit rows, and the failing response
payload. Route the bundle to your IdP support contact when the symptom is
a signing-cert regression, to your security review queue when an active
attack is suspected, or attach it to a Relyra issue when you suspect a
library bug. The bundle is designed to be safe to share — Phase 23's
redactor strips secret material before the archive is written.