Production checklist (Threadline)


Use this after the README quickstart and before treating audit capture as production-ready. It complements brownfield-continuity.md for existing data.

For host staging / pooler parity (STG-01 to STG-03), use guides/adoption-pilot-backlog.md as the in-repo matrix and rubric: fixed-field topology (STG-HOST-TOPOLOGY-TEMPLATE) plus audited HTTP/job paths with honest status columns under STG-AUDITED-PATH-RUBRIC. Copy rows into issues when something fails; keep evidence pointers redacted and link out to integrator-controlled detail.

Host repo wiring (prerequisite)

  • [ ] config :threadline, ecto_repos: [MyApp.Repo] is set in config/config.exs (getting-started §2 — Configure Threadline). Required for all mix threadline.* tasks that call resolve_repo!/0 and for operator-surface Mix fallbacks when LiveView is denied or not mounted.
  • [ ] Multi-database hosts: list only the repo that holds audit tables first in :ecto_repos; do not mirror your full host :ecto_repos list unless the first entry is the audit database. Pass repo: on mount and on programmatic APIs when using a non-default repo.
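
The wiring above might look like the following in config/config.exs. This is a minimal sketch: MyApp.Repo and MyApp.AuditRepo are placeholder module names, and the commented alternative assumes a host with a dedicated audit database.

```elixir
# config/config.exs
import Config

# Single-database host: the application repo also holds the audit tables.
config :threadline, ecto_repos: [MyApp.Repo]

# Multi-database host (alternative): list the repo that holds the audit
# tables first. MyApp.AuditRepo is a hypothetical module name.
# config :threadline, ecto_repos: [MyApp.AuditRepo]
```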

1. Capture and triggers

Coverage drift visibility

Threadline's strongest production posture comes from making coverage drift impossible to miss. After mounting the operator surface and configuring triggers, verify:

  • [ ] Surface header pill renders on every LV — visit any operator-surface page and confirm the badge shows either "All covered" (green-muted) or "{N} uncovered" (amber). The badge link goes to /audit/coverage.
  • [ ] Coverage dashboard responds at /audit/coverage — the page renders three buckets (covered / uncovered / expected) with a 30-second polling default.
  • [ ] Mix-task parity for capture-only paths: mix threadline.health.coverage prints the same data; use mix threadline.health.coverage --json for machine consumption.
  • [ ] Adopter-declared expected-uncovered set — if you use Oban, vendor add-ons, or non-Threadline bookkeeping tables, declare them in config :threadline, :health, expected_uncovered_tables: [...]. Run Threadline.Health.Policy.validate!/1 at boot to fail loudly on typos.
  • [ ] Telemetry alert on failure — subscribe to [:threadline, :health, :checked, :error] so sustained polling failures (e.g. DB connection issues) page someone instead of silently freezing the dashboard at the last-good count.
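
One way to wire the telemetry alert from the last item is a small handler attached at application start. This is a sketch, not Threadline's own code: the module name and handler id are hypothetical, and the contents of the measurements and metadata maps are not specified in this checklist, so the handler only logs them. Replace the log call with whatever pages your on-call rotation.

```elixir
defmodule MyApp.ThreadlineHealthAlerts do
  @moduledoc "Hypothetical host-side handler for Threadline coverage-check failures."
  require Logger

  # Event name taken from the checklist item above.
  @event [:threadline, :health, :checked, :error]

  # Call once at boot, for example from Application.start/2.
  def attach do
    :telemetry.attach(
      "myapp-threadline-health-error",
      @event,
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # The payload shape is an assumption, so it is logged verbatim.
  def handle_event(@event, measurements, metadata, _config) do
    Logger.error(
      "Threadline coverage check failed: " <>
        "measurements=#{inspect(measurements)} metadata=#{inspect(metadata)}"
    )
  end
end
```

Passing the handler as a remote capture (&__MODULE__.handle_event/4) rather than an anonymous function follows the guidance in the :telemetry documentation.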

See also guides/operator-surface.md §"Coverage dashboard".

2. Actor bridge and semantics

  • [ ] Request paths set threadline.actor_ref inside the same Ecto.Multi / Repo.transaction as audited writes (transaction-local GUC; safe under PgBouncer transaction pooling — see README PgBouncer section).
  • [ ] Background jobs use Threadline.Job (or equivalent) so jobs and HTTP requests both attribute actors consistently.
  • [ ] Where you need intent beyond row diffs, Threadline.record_action/2 is called with :repo and a valid ActorRef.
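
The first item can be sketched as below. It relies on PostgreSQL's set_config(name, value, true), which scopes the setting to the current transaction. Everything else is an assumption for illustration: MyApp.Audited is a hypothetical module, and actor_ref_json stands for an already-encoded actor reference; check the README for the encoding Threadline expects and for any helper it provides.

```elixir
defmodule MyApp.Audited do
  @moduledoc "Hypothetical wrapper that sets the actor GUC next to the audited write."
  alias MyApp.Repo

  # actor_ref_json: an already-encoded actor reference string (format assumed).
  def update_with_actor(changeset, actor_ref_json) do
    Repo.transaction(fn ->
      # The third argument `true` makes the setting transaction-local,
      # so it does not leak to other clients behind a transaction pooler.
      Repo.query!(
        "SELECT set_config('threadline.actor_ref', $1, true)",
        [actor_ref_json]
      )

      # The audited write runs in the same transaction as the GUC.
      case Repo.update(changeset) do
        {:ok, record} -> record
        {:error, failed_changeset} -> Repo.rollback(failed_changeset)
      end
    end)
  end
end
```

The function returns {:ok, record} on success and {:error, changeset} when the update fails and the transaction is rolled back.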

3. Redaction and sensitive columns

  • [ ] config :threadline, :trigger_capture, tables: %{"users" => [exclude: ..., mask: ...]} reviewed with security; no column in both exclude and mask.
  • [ ] mix threadline.gen.triggers --dry-run used after config changes; migrations applied before relying on new trigger SQL.
  • [ ] Visit /audit/policy/redaction after deploys or config changes; confirm the affected tables land in Config matches deployed, not Drift detected or Could not introspect.
  • [ ] Capture-only path checked too: mix threadline.policy.show for human output, mix threadline.policy.show --json for machine checks or incident tooling.
  • [ ] If any table shows Drift detected or Could not introspect, rerun mix threadline.gen.triggers, apply the generated migration, and re-check before declaring the rollout aligned.
  • [ ] Confirm the redaction viewer stays safe for operator screenshots and incident notes: it should show only column names and placeholder metadata, never sample values.
  • [ ] JSON/JSONB columns: remember masking replaces the whole value (no field-level redaction in current releases).
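
The "no column in both exclude and mask" rule from the first item is easy to check mechanically during review. The helper below is a hypothetical host-side sketch, not part of Threadline; it assumes the tables map uses keyword lists with column names as strings, which may differ from the shape your configuration actually uses.

```elixir
defmodule MyApp.RedactionConfigCheck do
  @moduledoc "Hypothetical review helper for :trigger_capture table options."

  @doc """
  Returns a map of table name => columns listed under both :exclude and :mask.
  An empty map means no overlap was found.
  """
  def overlaps(tables) when is_map(tables) do
    for {table, opts} <- tables,
        both = overlap(Keyword.get(opts, :exclude, []), Keyword.get(opts, :mask, [])),
        both != [],
        into: %{} do
      {table, both}
    end
  end

  # Columns from the exclude list that also appear in the mask list.
  defp overlap(exclude, mask), do: Enum.filter(exclude, &(&1 in mask))
end
```

For example, calling overlaps/1 with %{"users" => [exclude: ["password_hash"], mask: ["email", "password_hash"]]} returns %{"users" => ["password_hash"]}, which would fail the review.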

4. Retention and purge

  • [ ] config :threadline, :retention validated (keep_days or max_age_seconds, not both; positive window).
  • [ ] Destructive purge only with enabled: true after ops sign-off; always mix threadline.retention.purge --dry-run first.
  • [ ] Production: MIX_ENV=prod mix threadline.retention.purge --execute (requires explicit --execute).
  • [ ] Batch size and max_batches tuned so each run finishes under lock/latency budgets; schedule often enough that volume per run stays bounded.
  • [ ] Backups / point-in-time recovery: purges are permanent deletes of audit_changes (and optionally empty audit_transactions); align retention with compliance needs.
  • [ ] Index strategy for audit tables (baseline vs optional btree/GIN) reviewed with your DBA path; see audit-indexing.md for shipped index names, timeline/export join semantics, and evidence-first additive patterns.
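
A retention block consistent with the items above might look like this. It is a sketch under assumptions: the 90-day window is a placeholder, and placing enabled: under the :retention key is inferred from this checklist rather than confirmed here.

```elixir
# config/runtime.exs
import Config

config :threadline, :retention,
  # Leave this off until ops has signed off on destructive purges.
  enabled: true,
  # Hypothetical window. Set keep_days OR max_age_seconds, never both.
  keep_days: 90
```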

Volume, growth, and purge cadence

  • Treat audit_changes (and related storage) as a monotonically growing dataset until retention runs; chart table size and free space alongside application traffic so growth surprises surface before purge latency spikes.
  • Schedule purges often enough that each run finishes well inside the configured max_batches outer loop — if you routinely hit the cap, eligible rows remain until the next run; lowering per-pass volume (smaller --batch-size / batch_size) or running more frequently is safer than silently leaving a long tail of old rows.
  • Start batch_size near 500 (the Threadline.Retention.purge/1 default), then adjust with lock wait, statement duration, and capture concurrency in mind; the Mix task maps --batch-size / --max-batches to the same options.
  • Threadline.Retention.Policy is the validated view of config :threadline, :retention. From automation, call Threadline.Retention.purge/1 with the required repo: keyword (optional: dry_run:, batch_size:, max_batches:, cutoff:), or use mix threadline.retention.purge. Always run --dry-run first; in production, run MIX_ENV=prod mix threadline.retention.purge --execute only after ops sign-off. Until enabled: true is set, programmatic calls return {:error, :disabled} and the Mix task raises.
  • Monitor each run: Mix and library logs include batch indices and cumulative deleted_changes (and empty-transaction counts when applicable); track wall-clock duration per run and whether the final summary shows unused max_batches headroom.
  • Cutoff clock, orphan audit_transactions, and empty-parent semantics stay in domain-reference.md (Retention section); do not fork a second spec in this checklist.
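
A programmatic dry run from automation could be wrapped as follows. The option names and the {:error, :disabled} return come from the notes above; the module name, the max_batches value, and the handling of the success value are assumptions, since the success shape is not described in this checklist.

```elixir
defmodule MyApp.AuditPurge do
  @moduledoc "Hypothetical wrapper around Threadline.Retention.purge/1."
  require Logger

  def dry_run do
    # batch_size 500 matches the documented default; max_batches is a placeholder.
    opts = [repo: MyApp.Repo, dry_run: true, batch_size: 500, max_batches: 20]

    case Threadline.Retention.purge(opts) do
      {:error, :disabled} ->
        Logger.warning("Threadline retention is disabled; nothing was purged")
        :disabled

      result ->
        # Success shape is not specified here, so log it as is.
        Logger.info("Threadline purge dry run: #{inspect(result)}")
        result
    end
  end
end
```

Keeping the dry run in its own function makes it easy to schedule ahead of the real purge and to compare the reported counts against the unused max_batches headroom.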

5. Export and investigation

Confirm Host repo wiring (prerequisite) before running export or evidence Mix tasks in CI.

6. Observability

7. Brownfield and continuity

Support incident queries

Incident queries assume the audit rows are still within the retained window. Aggressive purges can make historical answers empty, so reconcile timelines with retention and purge schedules before escalating missing data.

Pre-launch: confirm operators can answer the five canonical support questions (see domain-reference.md for full SQL and API notes). For a skimmable “which public API first?” map before diving into playbooks, see domain-reference.md — Exploration API routing.

| Question (1-line) | API / Mix | SQL |
| --- | --- | --- |
| 1. Row history — PK in a time window | Threadline.history/3, Threadline.Query.timeline/2 | Golden query in domain reference |
| 2. Actor window — one actor across tables | Threadline.actor_history/2, timeline/2 + :actor_ref | Golden query |
| 3. Correlation bundle — shared correlation_id | timeline/2, mix threadline.export + :correlation_id | Inner-join SQL + strict semantics |
| 4. Export parity — same filters as timeline | Threadline.Export, mix threadline.export | Filter vocabulary |
| 5. Action ↔ capture — link semantics to rows | Threadline.record_action/2, action_id | Join pattern |
| 6. Single transaction incident drill-down | Start with domain-reference.md — Exploration API routing | Then use the bundled incident story in incident-playbook.md |

See also