Parapet Milestone History


This document records the development milestones for Parapet v0.1–v0.9. These are planning tranches, not Hex release versions — the package was not published to hex.pm during this period.

For the changelog of published Hex releases (v0.10+), see CHANGELOG.md.


v0.9 — Performance, Scale & DX (2026-05-23)

  • Shipped proactive TSDB cardinality protection: a mix parapet.doctor cardinality static analyzer plus a compile-time Parapet.Metrics.Validator enforcing a 10-label ceiling per metric, applied across all built-in metrics and adapter SLIs.
  • Delivered database scale and pruning: composite indexes for Incident, TimelineEntry, and ToolAudit at over 100k rows, a Parapet.Evidence.Archiver with resolved-only retention, and a mix parapet.archive task plus Oban cron worker that never prunes active investigating work.
  • Made the Operator UI responsive under load with bounded queue paging, index-aware Operator queries, and a 50k+ incident benchmark, and repaired the generated resolve flow so the active-to-resolved lifecycle works correctly again.
  • Unified the Day-1 experience under mix parapet.install, a deterministic Igniter orchestrator that chains spine/prometheus/ui with explicit opt-in extras, backed by severity-aware multi-node mix parapet.doctor checks (for example, Oban uniqueness).
  • Proved multi-node safety with Ecto-backed action claims and circuit breakers under concurrency simulation, plus an environment-conditional peer-node canary that skips cleanly without distributed Erlang.
  • Hardened milestone closure: phases 6–14 backfilled milestone-grade verification surfaces, reconciled planning-artifact drift, tightened archive retention, and added a regression-catching closure-proof chain for the generated operator UI.

Stats: ~20,274 LOC (Elixir/EEx, lib+priv+test) · Phases 1–14 (5 core, 9 closure) · 36 plans · 88 commits · 2026-05-19 → 2026-05-23


v0.8 — Deterministic Escalation & Bounded Mitigation (2026-05-19)

  • Built a durable Oban-backed escalation engine (Parapet.Escalation.Worker) that routes incidents to next tiers unless acknowledged or resolved.
  • Implemented system-identity execution for Bounded Runbooks to safely perform auto-mitigations using the Parapet.Operator API, logging all actions under the :system URN identity.
  • Created an Ecto-backed circuit breaker leveraging ToolAudit histories to prevent mitigation flap-loops: once a mitigation has run N times, the breaker trips and escalates instead.
  • Updated the LiveView Operator UI to visualize escalation chains, style system-executed mitigations distinctly, and provide manual trigger overrides for operator control.

Stats: ~13,900 LOC (Elixir/EEx) · Phases 1–4 · 8 plans · 2026-05-19


v0.7 — Async & Delivery Reliability (2026-05-18)

  • Established safe telemetry contracts for the Mailglass, Chimeway, and Rindle integrations so they emit bounded async and delivery events, with event semantics normalized across diverse external providers.
  • Implemented out-of-the-box provider-first SLOs for async pipeline health and provider delivery states, including multi-burn-rate PromQL alerts.
  • Created explicit fault-domain triage enrichment for async and delivery incidents, relying on durable evidence (triage snapshot chronology) rather than UI-derived heuristics.
  • Added safe, host-wired recovery runbook templates for stalled async work, covering dead-letter handling, provider outage recovery, stalled job cleanup, and callback delay flows.

Stats: ~13,401 LOC (Elixir/EEx) · Phases 4–7 · 12 plans · 2026-05-18


v0.6 — Change Correlation & Audit Trailing (2026-05-17)

  • Implemented OpenTelemetry trace exemplar extraction from events and process dictionaries, appending trace identifiers to generated Prometheus metrics.
  • Added trace identifier storage to Ecto Incident schemas and dynamically formatted trace links within the Operator UI for one-click navigation to external trace backends.
  • Consumed Rulestead feature flag toggles via telemetry, creating durable timeline entries and suspect change markers to instantly correlate feature flag changes with SLO burn rates.
  • Highlighted recent proximate system changes (like flag toggles) on active incidents in the Operator UI, distinguishing them visually from human actions.
  • Implemented Parapet.Integrations.Threadline for compliance sync, mirroring Operator audit actions to Threadline event logs.
  • Added two audit modes (:threadline_deferred and :dual_write) to satisfy strict compliance constraints; the deferred mode bypasses internal Parapet storage entirely.

Stats: 8,968 LOC (Elixir/EEx) · Phases 1–3 · 9 plans · 2026-05-17


v0.5 — Proactive Resilience & Copilot Triage (2026-05-16)

  • Implemented Parapet.Probe for defining and scheduling active synthetic canaries via NativeScheduler and ObanScheduler, enabling proactive health detection before alerts fire.
  • Expanded Sigra and Accrue integrations to emit explicit login, signup, and checkout SLIs for business-critical journey monitoring.
  • Built a Parapet MCP server to allow AI agents to safely read incident data and act as triage copilots, providing structured access without write permissions.
  • Resolved compilation and type warnings across the project, achieving a clean zero-warning compilation state.

Stats: ~8,500 LOC (Elixir/EEx) · Phases 1–3 · 9 plans · 2026-05-16


v0.4 — Scoria AI Integration (2026-05-15)

  • Implemented telemetry translation consuming Scoria.SRE.Telemetry events and producing Parapet Prometheus metrics and durable Ecto Incidents for AI infrastructure observability.
  • Built Parapet.SLO.ScoriaEval to define and alert on Eval-Driven SLOs based on Scoria deterministic evaluation scores, with Grafana visualization for SLO error budget correlation.
  • Added native tracking of AI Config Changes (scorer_version, baseline_version, model) to correlate configuration drift with SLO degradation.
  • Monitored Scoria MCP tool failure modes (timeout, execution_failed, breaker_open, access_denied) as explicit SLIs using bounded atoms to protect Ecto from high-volume telemetry.
  • Monitored Scoria workflow approval pauses as durable human-in-the-loop states, triggering alerts on stale requests, and extended the Operator UI with deep-links to Scoria's durable evidence.

Stats: 7,847 LOC (Elixir/EEx) · Phases 1–4 · 9 plans · 2026-05-15


v0.3 — Runbooks & Alert Routing (2026-05-12)

  • Implemented a webhook receiver endpoint for Prometheus Alertmanager, automatically routing "firing" and "resolved" alerts to the durable Ecto Incident lifecycle with intelligent deduplication and correlation by alert name and labels.
  • Created a structured Parapet.Runbook DSL for defining operator-triggered mitigation steps and attaching them based on SLOs or alert names.
  • Extended the Operator UI to interactively display attached runbooks and execute one-click mitigations with complete ToolAudit logging.
  • Built a modular Parapet.Notifier system with out-of-the-box Slack (Block Kit) and Microsoft Teams (Adaptive Cards) adapters to broadcast incident state changes and record timeline entries.
  • Added UI capabilities for operators to explicitly acknowledge incidents and generate comprehensive Markdown retrospectives automatically.

Stats: 6,667 LOC (Elixir/EEx) · Phases 1–4 · 12 plans · 2026-05-12


v0.2 — Durable Spine and Operator UI (2026-05-11)

  • Implemented the Parapet.Evidence context with Incident, TimelineEntry, and ToolAudit Ecto schemas for durable SRE tracking, separating ephemeral telemetry from low-volume Ecto data.
  • Created the mix parapet.gen.spine generator to scaffold evidence migrations into host applications, keeping them safely separated from high-volume telemetry.
  • Defined the Operator API with transactional audited commands and a WorkbenchContract for safe UI derivations.
  • Created mix parapet.gen.ui to generate an isolated, secure, and visually responsive Phoenix LiveView Operator Workbench inside the host app.
  • Automated structural UI tests to guarantee responsive mobile and desktop layout fidelity without relying on human QA or full browser end-to-end tests.
  • Implemented optional integration adapters for Mailglass, Chimeway, Accrue, Rindle, Threadline, and Rulestead leveraging a new capability registry, with all adapters compiling out cleanly when their sibling libraries are absent.

Stats: 3,164 LOC (Elixir/EEx) · Phases 1–3 · 11 plans · 2026-05-11


v0.1 — Trustworthy Spine (2026-05-10)

  • Established the foundational Parapet telemetry contract, supervisor, and install generator, defining a documented telemetry surface treated as a public API with semver guarantees.
  • Built core metrics instrumentation for HTTP, Ecto, and Oban behind an API that enforces low cardinality by default through explicit label contracts.
  • Created an SLO DSL converting standard Elixir definitions to fully functional Prometheus recording and alerting rules.
  • Delivered a seamless Day-1 experience with mix parapet.doctor health checks and Grafana dashboard generation.

Stats: 1,992 LOC (Elixir) · Phases 1–4 · 15 plans · 2026-05-10