Production checklist (Threadline)


Use this after the README quickstart and before treating audit capture as production-ready. It complements brownfield-continuity.md for existing data.

For host staging / pooler parity (STG-01 to STG-03), use guides/adoption-pilot-backlog.md as the in-repo matrix and rubric. It provides a fixed-field topology template (STG-HOST-TOPOLOGY-TEMPLATE) and audited HTTP/job paths with honest status columns under STG-AUDITED-PATH-RUBRIC. Copy rows into issues when something fails; keep evidence pointers redacted and link out to integrator-controlled detail.

1. Capture and triggers

Coverage drift visibility

Threadline's strongest production posture comes from making coverage drift impossible to miss. After mounting the operator surface and configuring triggers, verify:

  • [ ] Surface header pill renders on every LiveView: visit any operator-surface page and confirm the badge shows either "All covered" (green-muted) or "{N} uncovered" (amber). The badge links to /audit/coverage.
  • [ ] Coverage dashboard responds at /audit/coverage — the page renders three buckets (covered / uncovered / expected) with a 30-second polling default.
  • [ ] Mix-task parity for capture-only paths: mix threadline.health.coverage prints the same data; use mix threadline.health.coverage --json for machine consumption.
  • [ ] Adopter-declared expected-uncovered set — if you use Oban, vendor add-ons, or non-Threadline bookkeeping tables, declare them in config :threadline, :health, expected_uncovered_tables: [...]. Run Threadline.Health.Policy.validate!/1 at boot to fail loudly on typos.
  • [ ] Telemetry alert on failure — subscribe to [:threadline, :health, :checked, :error] so sustained polling failures (e.g. DB connection issues) page someone instead of silently freezing the dashboard at the last-good count.
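
The telemetry item above could be wired up roughly as follows. This is a minimal sketch, assuming the :telemetry library is available in the host application. The module name MyApp.ThreadlineHealthAlerts, the handler id, and the log wording are hypothetical, and the event's measurements and metadata are logged as opaque terms because their shape is not described in this checklist.

```elixir
defmodule MyApp.ThreadlineHealthAlerts do
  @moduledoc false
  require Logger

  # Event name taken from the checklist item above.
  @event [:threadline, :health, :checked, :error]

  # Call once at application start, for example from Application.start/2.
  def attach do
    :telemetry.attach(
      # Hypothetical handler id; any unique term works.
      "myapp-threadline-health-error",
      @event,
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Standard :telemetry handler arguments: event, measurements, metadata, config.
  def handle_event(@event, measurements, metadata, _config) do
    Logger.error(
      "Threadline coverage check failed: " <>
        inspect(%{measurements: measurements, metadata: metadata})
    )
  end
end
```

Routing the log line to a pager is left to your logging or alerting stack. Keep the handler small: :telemetry detaches a handler that raises, which would silence the alert.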

See also guides/operator-surface.md §"Coverage dashboard".

2. Actor bridge and semantics

  • [ ] Request paths set threadline.actor_ref inside the same Ecto.Multi / Repo.transaction as audited writes (transaction-local GUC; safe under PgBouncer transaction pooling — see README PgBouncer section).
  • [ ] Background jobs use Threadline.Job (or equivalent) so jobs and HTTP requests both attribute actors consistently.
  • [ ] Where you need intent beyond row diffs, Threadline.record_action/2 is called with :repo and a valid ActorRef.
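
One way to keep the actor reference and the audited write in the same transaction is sketched below. It assumes a Postgres-backed Ecto repo named MyApp.Repo (hypothetical) and uses PostgreSQL's set_config/3 with is_local = true, which scopes the setting to the current transaction. The serialized form of the actor reference is not specified in this checklist, so it is passed in as an opaque string. If your Threadline version ships its own helper for this step, prefer that helper.

```elixir
defmodule MyApp.Accounts do
  @moduledoc false
  alias Ecto.Multi
  alias MyApp.Repo

  # actor_ref_value: already-serialized actor reference (format is an assumption).
  def update_user(user_changeset, actor_ref_value) when is_binary(actor_ref_value) do
    Multi.new()
    |> Multi.run(:actor_ref, fn repo, _changes ->
      # The third argument `true` makes the GUC transaction-local, so it is
      # discarded at commit or rollback and never leaks across pooled connections.
      repo.query("SELECT set_config('threadline.actor_ref', $1, true)", [actor_ref_value])
    end)
    # The audited write runs after the GUC is set, inside the same transaction.
    |> Multi.update(:user, user_changeset)
    |> Repo.transaction()
  end
end
```

Putting the set_config call in the first Multi step means a failure to set the actor aborts the write instead of producing an unattributed audit row.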

3. Redaction and sensitive columns

  • [ ] config :threadline, :trigger_capture, tables: %{"users" => [exclude: ..., mask: ...]} reviewed with security; no column in both exclude and mask.
  • [ ] mix threadline.gen.triggers --dry-run used after config changes; migrations applied before relying on new trigger SQL.
  • [ ] Visit /audit/policy/redaction after deploys or config changes; confirm the affected tables land in Config matches deployed, not Drift detected or Could not introspect.
  • [ ] Capture-only path checked too: mix threadline.policy.show for human output, mix threadline.policy.show --json for machine checks or incident tooling.
  • [ ] If any table shows Drift detected or Could not introspect, rerun mix threadline.gen.triggers, apply the generated migration, and re-check before declaring the rollout aligned.
  • [ ] Confirm the redaction viewer stays safe for operator screenshots and incident notes: it should show only column names and placeholder metadata, never sample values.
  • [ ] JSON/JSONB columns: remember masking replaces the whole value (no field-level redaction in current releases).
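
The "no column in both exclude and mask" review item can be checked mechanically. The helper below is a hypothetical, self-contained sketch (it is not part of Threadline) that takes the tables map in the shape shown above and reports overlapping columns per table. It assumes column names are given as strings.

```elixir
defmodule RedactionConfigCheck do
  @moduledoc false

  # Returns a map of table name => columns listed in BOTH :exclude and :mask.
  # An empty map means the configuration has no overlaps.
  def overlaps(tables) when is_map(tables) do
    tables
    |> Enum.map(fn {table, opts} -> {table, overlap(opts)} end)
    |> Enum.reject(fn {_table, columns} -> columns == [] end)
    |> Map.new()
  end

  defp overlap(opts) do
    exclude = opts |> Keyword.get(:exclude, []) |> MapSet.new()
    mask = opts |> Keyword.get(:mask, []) |> MapSet.new()

    exclude
    |> MapSet.intersection(mask)
    |> Enum.sort()
  end
end
```

For example, RedactionConfigCheck.overlaps(%{"users" => [exclude: ["password_hash"], mask: ["email", "password_hash"]]}) returns %{"users" => ["password_hash"]}. In a real project the input could come from Application.get_env(:threadline, :trigger_capture, [])[:tables].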

4. Retention and purge

  • [ ] config :threadline, :retention validated (keep_days or max_age_seconds, not both; positive window).
  • [ ] Destructive purge only with enabled: true after ops sign-off; always mix threadline.retention.purge --dry-run first.
  • [ ] Production: MIX_ENV=prod mix threadline.retention.purge --execute (requires explicit --execute).
  • [ ] Batch size and max_batches tuned so each run finishes under lock/latency budgets; schedule often enough that volume per run stays bounded.
  • [ ] Backups / point-in-time recovery: purges are permanent deletes of audit_changes (and optionally empty audit_transactions); align retention with compliance needs.
  • [ ] Index strategy for audit tables (baseline vs optional btree/GIN) reviewed with your DBA path; see audit-indexing.md for shipped index names, timeline/export join semantics, and evidence-first additive patterns.
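
As an illustration of the retention items above, a configuration fragment might look like the following. Only the enabled and keep_days keys named in this checklist are used; the 90-day window is a placeholder, and placing this in a runtime config file is an assumption about your project layout.

```elixir
import Config

config :threadline, :retention,
  # Purge stays disabled until this is true; flip it only after ops sign-off.
  enabled: true,
  # Set keep_days OR max_age_seconds, never both. 90 is a placeholder value.
  keep_days: 90
```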

Volume, growth, and purge cadence

  • Treat audit_changes (and related storage) as a monotonically growing dataset until retention runs; chart table size and free space alongside application traffic so growth surprises surface before purge latency spikes.
  • Schedule purges often enough that each run finishes well inside the configured max_batches outer loop — if you routinely hit the cap, eligible rows remain until the next run; lowering per-pass volume (smaller --batch-size / batch_size) or running more frequently is safer than silently leaving a long tail of old rows.
  • Start batch_size near 500 (the Threadline.Retention.purge/1 default), then adjust with lock wait, statement duration, and capture concurrency in mind; the Mix task maps --batch-size / --max-batches to the same options.
  • Threadline.Retention.Policy is the validated view of config :threadline, :retention. From automation, call Threadline.Retention.purge/1 with a required repo: keyword (and optional dry_run:, batch_size:, max_batches:, cutoff:), or use mix threadline.retention.purge. Always run --dry-run first, then run MIX_ENV=prod mix threadline.retention.purge --execute in production only after ops sign-off. Until enabled: true is set, programmatic calls return {:error, :disabled} and the Mix task raises.
  • Monitor each run: Mix and library logs include batch indices and cumulative deleted_changes (and empty-transaction counts when applicable); track wall-clock duration per run and whether the final summary shows unused max_batches headroom.
  • Cutoff clock, orphan audit_transactions, and empty-parent semantics stay in domain-reference.md — Retention (Phase 13) — do not fork a second spec in this checklist.
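
A scheduled job could wrap the programmatic purge along these lines. This is a sketch under stated assumptions: MyApp.AuditPurge and MyApp.Repo are hypothetical names, max_batches: 20 is a placeholder, and the success return value of Threadline.Retention.purge/1 is passed through untouched because its shape is not described here. Only the option names and the {:error, :disabled} result come from this checklist.

```elixir
defmodule MyApp.AuditPurge do
  @moduledoc false
  require Logger

  # Dry run by default; batch_size 500 matches the documented purge/1 default.
  @defaults [dry_run: true, batch_size: 500, max_batches: 20]

  # Builds the keyword options for purge/1; caller overrides win over defaults.
  def purge_opts(overrides \\ []) do
    @defaults
    |> Keyword.merge(overrides)
    |> Keyword.put(:repo, MyApp.Repo)
  end

  def run(overrides \\ []) do
    case Threadline.Retention.purge(purge_opts(overrides)) do
      {:error, :disabled} = disabled ->
        # Returned until retention is configured with enabled: true.
        Logger.warning("Threadline retention purge skipped: retention is not enabled")
        disabled

      other ->
        # Success and other error shapes are returned unchanged for the caller to inspect.
        other
    end
  end
end
```

Defaulting to dry_run: true means a destructive run requires an explicit MyApp.AuditPurge.run(dry_run: false), mirroring the Mix task's explicit --execute flag.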

5. Export and investigation

6. Observability

7. Brownfield and continuity

Support incident queries

Incident queries assume the audit rows are still within the retained window. Aggressive purges can make historical answers come back empty, so reconcile timelines with retention and purge schedules before escalating missing data.

Pre-launch: confirm operators can answer the five canonical support questions (see domain-reference.md for full SQL and API notes). For a skimmable “which public API first?” map before diving into playbooks, see domain-reference.md — Exploration API routing.

| Question (1-line) | API / Mix | SQL |
| --- | --- | --- |
| 1. Row history: PK in a time window | Threadline.history/3, Threadline.Query.timeline/2 | Golden query in domain reference |
| 2. Actor window: one actor across tables | Threadline.actor_history/2, timeline/2 + :actor_ref | Golden query |
| 3. Correlation bundle: shared correlation_id | timeline/2, mix threadline.export + :correlation_id | Inner-join SQL + strict semantics |
| 4. Export parity: same filters as timeline | Threadline.Export, mix threadline.export | Filter vocabulary |
| 5. Action ↔ capture: link semantics to rows | Threadline.record_action/2, action_id | Join pattern |
| 6. Single transaction incident drill-down | Start with domain-reference.md (Exploration API routing) | Then use the bundled incident story in incident-playbook.md |

See also