Production checklist (Threadline)


Use this after the README quickstart and before treating audit capture as production-ready. It complements brownfield-continuity.md for existing data.

For host staging / pooler parity (STG-01 to STG-03), use guides/adoption-pilot-backlog.md as the in-repo matrix and rubric: a fixed-field topology template (STG-HOST-TOPOLOGY-TEMPLATE) plus audited HTTP/job paths with honest status columns under STG-AUDITED-PATH-RUBRIC. Copy rows into issues when something fails; keep evidence pointers redacted and link out to integrator-controlled detail.

1. Capture and triggers

2. Actor bridge and semantics

  • [ ] Request paths set threadline.actor_ref inside the same Ecto.Multi / Repo.transaction as audited writes (transaction-local GUC; safe under PgBouncer transaction pooling — see README PgBouncer section).
  • [ ] Background jobs use Threadline.Job (or equivalent) so jobs and HTTP requests both attribute actors consistently.
  • [ ] Where you need intent beyond row diffs, Threadline.record_action/2 is called with :repo and a valid ActorRef.
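The first item above can be sketched as follows. This is a minimal illustration, not Threadline's own bridge: `AuditedWriteSketch` and `with_actor/3` are hypothetical names, the serialized form of the actor reference is an assumption (it is passed in as an opaque string), and `repo` is assumed to be an Ecto repo backed by a SQL adapter so that `transaction/1` and `query!/2` are available. The only PostgreSQL-specific part is `set_config/3` with `true` as its third argument, which scopes the setting to the current transaction.

```elixir
defmodule AuditedWriteSketch do
  @moduledoc """
  Hypothetical helper (not part of Threadline): sets the actor GUC and runs
  the audited write inside one transaction.
  """

  # `actor_ref_string` is an already-serialized actor reference; its exact
  # format is an assumption and should follow whatever Threadline expects.
  def with_actor(repo, actor_ref_string, write_fun)
      when is_binary(actor_ref_string) and is_function(write_fun, 0) do
    repo.transaction(fn ->
      # Third argument `true` makes the setting transaction-local, so it is
      # discarded at commit/rollback and cannot leak to another client that
      # later reuses the same pooled connection.
      repo.query!(
        "SELECT set_config('threadline.actor_ref', $1, true)",
        [actor_ref_string]
      )

      # The audited write runs in the same transaction as the GUC.
      write_fun.()
    end)
  end
end
```

Because the GUC and the write share one transaction, the pattern does not depend on session state, which is why it stays safe under transaction pooling.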

3. Redaction and sensitive columns

  • [ ] config :threadline, :trigger_capture, tables: %{"users" => [exclude: ..., mask: ...]} reviewed with security; no column in both exclude and mask.
  • [ ] mix threadline.gen.triggers --dry-run used after config changes; migrations applied before relying on new trigger SQL.
  • [ ] JSON/JSONB columns: remember masking replaces the whole value (no field-level redaction in current releases).
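The "no column in both exclude and mask" rule is easy to check mechanically before a security review. The sketch below is a hypothetical pre-review helper, not a Threadline API; it assumes the `tables:` map has the shape shown in the checklist and that `exclude` and `mask` hold lists of column-name strings (the column names in the usage example are made up).

```elixir
defmodule CaptureConfigCheck do
  @moduledoc """
  Hypothetical helper (not part of Threadline): reports columns that appear
  in both :exclude and :mask for any table.
  """

  # Returns a map of table name => sorted list of overlapping columns,
  # containing only the tables that actually have an overlap.
  def overlaps(tables) when is_map(tables) do
    tables
    |> Enum.map(fn {table, opts} -> {table, overlap(opts)} end)
    |> Enum.reject(fn {_table, both} -> both == [] end)
    |> Map.new()
  end

  defp overlap(opts) do
    exclude = MapSet.new(Keyword.get(opts, :exclude, []))
    mask = MapSet.new(Keyword.get(opts, :mask, []))

    exclude
    |> MapSet.intersection(mask)
    |> MapSet.to_list()
    |> Enum.sort()
  end
end

# Usage with made-up column names:
CaptureConfigCheck.overlaps(%{
  "users" => [exclude: ["password_hash"], mask: ["email", "password_hash"]]
})
# => %{"users" => ["password_hash"]}
```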

4. Retention and purge

  • [ ] config :threadline, :retention validated (keep_days or max_age_seconds, not both; positive window).
  • [ ] Destructive purge only with enabled: true after ops sign-off; always mix threadline.retention.purge --dry-run first.
  • [ ] Production: MIX_ENV=prod mix threadline.retention.purge --execute (requires explicit --execute).
  • [ ] Batch size and max_batches tuned so each run finishes under lock/latency budgets; schedule often enough that volume per run stays bounded.
  • [ ] Backups / point-in-time recovery: purges are permanent deletes of audit_changes (and optionally empty audit_transactions); align retention with compliance needs.
  • [ ] Index strategy for audit tables (baseline vs optional btree/GIN) reviewed with your DBA; see audit-indexing.md for shipped index names, timeline/export join semantics, and evidence-first additive patterns.
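For orientation, a retention block consistent with the items above might look like the fragment below. Only the `enabled` and `keep_days` keys come from this checklist; the file placement and the 90-day value are illustrative assumptions, and any further keys should be taken from the Threadline documentation rather than from this sketch.

```elixir
# config/runtime.exs (file placement is an assumption)
import Config

config :threadline, :retention,
  # Leave purge disabled until ops has signed off.
  enabled: false,
  # Set exactly one window: keep_days or max_age_seconds, never both.
  # 90 is an illustrative value, not a recommendation.
  keep_days: 90
```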

Volume, growth, and purge cadence

  • Treat audit_changes (and related storage) as a monotonically growing dataset until retention runs; chart table size and free space alongside application traffic so growth surprises surface before purge latency spikes.
  • Schedule purges often enough that each run finishes well inside the configured max_batches outer loop — if you routinely hit the cap, eligible rows remain until the next run; lowering per-pass volume (smaller --batch-size / batch_size) or running more frequently is safer than silently leaving a long tail of old rows.
  • Start batch_size near 500 (the Threadline.Retention.purge/1 default), then adjust with lock wait, statement duration, and capture concurrency in mind; the Mix task maps --batch-size / --max-batches to the same options.
  • Threadline.Retention.Policy is the validated view of config :threadline, :retention. From automation, call Threadline.Retention.purge/1 with the required repo: keyword (and optional dry_run:, batch_size:, max_batches:, cutoff:), or use mix threadline.retention.purge. Always run --dry-run first, then run MIX_ENV=prod mix threadline.retention.purge --execute in production only after ops sign-off. Until enabled: true is set, programmatic calls return {:error, :disabled} and the Mix task raises.
  • Monitor each run: Mix and library logs include batch indices and cumulative deleted_changes (and empty-transaction counts when applicable); track wall-clock duration per run and whether the final summary shows unused max_batches headroom.
  • Cutoff clock, orphan audit_transactions, and empty-parent semantics stay in domain-reference.md — Retention (Phase 13) — do not fork a second spec in this checklist.
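The cadence advice above comes down to simple arithmetic. The sketch below is a hypothetical planning helper, not a Threadline API. It assumes each batch deletes at most `batch_size` rows and that a run stops after `max_batches` batches, so a single run can remove at most `batch_size * max_batches` rows. The row counts in the usage example are made up.

```elixir
defmodule PurgeBudgetSketch do
  @moduledoc """
  Hypothetical planning helper (not part of Threadline): estimates whether
  one purge run can clear the eligible backlog.
  """

  def plan(eligible_rows, batch_size, max_batches)
      when is_integer(eligible_rows) and eligible_rows >= 0 and
             is_integer(batch_size) and batch_size > 0 and
             is_integer(max_batches) and max_batches > 0 do
    # Upper bound on rows a single run can delete under the stated assumption.
    capacity = batch_size * max_batches

    # Ceiling division: batches required to clear everything eligible.
    batches_needed = div(eligible_rows + batch_size - 1, batch_size)

    %{
      batches_needed: batches_needed,
      hits_cap: batches_needed > max_batches,
      rows_left_after_run: max(eligible_rows - capacity, 0)
    }
  end
end

# Usage with made-up numbers: 120_000 eligible rows, default-sized batches.
PurgeBudgetSketch.plan(120_000, 500, 100)
# => %{batches_needed: 240, hits_cap: true, rows_left_after_run: 70000}
```

When `hits_cap` is true, the leftover rows wait for the next run, which is the long tail the cadence advice warns about. Running more often is the fix that does not lengthen individual runs.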

5. Export and investigation

6. Observability

7. Brownfield and continuity

Support incident queries

Incident queries assume the audit rows are still within the retained window. Aggressive purges can make historical answers come back empty, so reconcile timelines with retention and purge schedules before escalating missing data.

Pre-launch: confirm operators can answer the five canonical support questions (see domain-reference.md for full SQL and API notes). For a skimmable “which public API first?” map before diving into playbooks, see domain-reference.md — Exploration API routing.

| Question (1-line) | API / Mix | SQL |
| --- | --- | --- |
| 1. Row history: PK in a time window | Threadline.history/3, Threadline.Query.timeline/2 | Golden query in domain reference |
| 2. Actor window: one actor across tables | Threadline.actor_history/2, timeline/2 + :actor_ref | Golden query |
| 3. Correlation bundle: shared correlation_id | timeline/2, mix threadline.export + :correlation_id | Inner-join SQL + strict semantics |
| 4. Export parity: same filters as timeline | Threadline.Export, mix threadline.export | Filter vocabulary |
| 5. Action ↔ capture: link semantics to rows | Threadline.record_action/2, action_id | Join pattern |
| 6. Single transaction incident drill-down | Start with domain-reference.md (Exploration API routing) | Then use the bundled incident story in incident-playbook.md |

See also