# Search backend operations (SRE view)

This guide is for **platform and SRE** maintainers who run the search process alongside a Phoenix app using Scrypath. It complements [Sync Modes and Visibility](../guides/sync-modes-and-visibility.md) (application semantics) and [Operator Support](operator-support.md) (library maintainer first response).

## Scope

- **Scrypath v1** publicly targets **Meilisearch** first. Runbooks here assume a Meilisearch cluster or single node your app reaches over HTTP. Other engines need different metrics and failure modes; Scrypath does not abstract them on the public path.
- **Goal:** a **small** set of metrics and alerts so teams notice user-visible failure and capacity risk without paging on normal variance.

## Two layers: app vs search engine

| Layer | What you own | What breaks first when users complain |
|-------|----------------|--------------------------------------|
| **Application** | Scrypath sync/search/hydration paths, Oban queues, DB | Wrong or stale results, timeouts, 5xx from your app |
| **Meilisearch** | Process health, disk, RAM, version, task pipeline | Search down, writes stuck, index corruption risk under disk pressure |

Instrument **both**. Do not page on Meilisearch CPU alone; pair it with the **Scrypath search error rate** and **end-to-end latency** measured from the app.

## Scrypath telemetry (application signals)

Scrypath emits **`:telemetry.span/3`**-style events (see [Telemetry](https://hexdocs.pm/telemetry/readme.html) for `start` / `stop` / `exception` and duration measurements on `stop`). Keep dashboards **low cardinality**: use `schema`, `backend`, `index`, `sync_mode` — avoid high-cardinality tags such as raw query text or primary keys on alert rules.

Stable event prefixes (each has `:start`, `:stop`, and on failure `:exception` where applicable):

| Event prefix | When | Useful aggregates |
|--------------|------|-------------------|
| `[:scrypath, :search]` | Common-path search | p95/p99 duration, error rate, `hit_count` from stop metadata |
| `[:scrypath, :hydration]` | Repo batch load after search | Duration vs `hit_count` / `record_count` / `missing_count` (drift indicator when `missing_count` grows) |
| `[:scrypath, :sync, :upsert]` / `[:scrypath, :sync, :delete]` | Document sync | Error rate, `document_count`, `:noop` ratio (noisy if alerted per call) |
| `[:scrypath, :meilisearch, :request]` | HTTP to Meilisearch | `status_code`, `method`, `path` pattern — alert on sustained 5xx / connection errors |
| `[:scrypath, :meilisearch, :task_wait]` | Waiting for Meilisearch task completion | `poll_count`, `final_status` — large `poll_count` or non-`:succeeded` trends |
| `[:scrypath, :reindex, :settings_verified]` | Post-apply settings read-back | Stop metadata `result` tag (`:parity`, `:drift`, etc.) |
| `[:scrypath, :reindex, :verify_skipped]` | Execute path only, when settings verification is skipped by option | Rare spikes may be intentional deploys; correlate with logs |
| `[:scrypath, :operator, :failed_work, :observed]` | Each failed-work row materialized from backend tasks or Oban jobs | Useful for dashboards and structured logs; **high volume** on noisy data — do **not** page on every event; treat as diagnostic signal and aggregate |

**Dashboard-first:** put sync upsert volume, search QPS, the hydration `missing_count` distribution, and Meilisearch request latency on dashboards before you write alert rules for them.
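As a minimal sketch of consuming these events, the handler below attaches to a few `:stop` events and reduces each one to a low-cardinality summary. Only the event prefixes and the tag names (`schema`, `backend`, `index`, `sync_mode`) come from this guide. The module name `MyApp.ScrypathTelemetry`, the handler id, and the assumption that every stop event carries a `duration` measurement in native time units (the `:telemetry.span/3` convention) are illustrative, so check them against the events your Scrypath version actually emits.

```elixir
defmodule MyApp.ScrypathTelemetry do
  @moduledoc "Hypothetical handler; only the event prefixes come from the table above."

  require Logger

  # Stop events for three of the prefixes listed in the table.
  @events [
    [:scrypath, :search, :stop],
    [:scrypath, :hydration, :stop],
    [:scrypath, :meilisearch, :request, :stop]
  ]

  # Low-cardinality keys only. Raw query text and primary keys are dropped on purpose.
  @tag_keys [:schema, :backend, :index, :sync_mode]

  # Call once at application start. Requires the :telemetry dependency.
  def attach do
    :telemetry.attach_many("my-app-scrypath-metrics", @events, &__MODULE__.handle_event/4, nil)
  end

  # Telemetry handler callback: logs the reduced summary.
  def handle_event(event, measurements, metadata, _config) do
    Logger.info(inspect(summarize(event, measurements, metadata)))
  end

  # Pure function so the reduction can be tested without :telemetry loaded.
  def summarize(event, measurements, metadata) do
    duration_ms =
      case measurements do
        # Assumption: duration is in native units, as with :telemetry.span/3.
        %{duration: native} -> System.convert_time_unit(native, :native, :millisecond)
        _ -> nil
      end

    %{
      event: Enum.join(event, "."),
      duration_ms: duration_ms,
      tags: Map.take(metadata, @tag_keys)
    }
  end
end
```

In a real app you would likely forward the summary to your metrics reporter instead of logging it; logging keeps the sketch dependency-free.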

## Meilisearch infrastructure (minimal signals)

Prioritize signals that predict **outage**, **data loss risk**, or **unbounded backlog**. Exact metric names depend on your exporter (Prometheus sidecar, cloud vendor agent, or logs). Map these **concepts** to your stack:

1. **Process up / ready** — HTTP `GET /health` (or vendor equivalent) from the same network path as the app. Page when **unreachable** for longer than a short window (e.g. two failed checks), not on single blips.
2. **Disk free** — Meilisearch persists indexes; **running out of disk** is a top cause of corruption and wedged tasks. Alert on **free space percentage or absolute GB** with headroom for snapshots, dumps, task DB growth, and reindexes.
3. **Memory pressure** — Meilisearch uses LMDB and memory-mapped I/O, so high RSS, or memory that looks like page-cache usage, can be normal. Page on **OOM kills** or **sustained** memory-limit pressure reported by your orchestrator, not on one-off spikes during a planned reindex.
4. **Task failures** — Meilisearch processes index writes through a **task** queue. Sustained **failed** tasks (as opposed to the occasional transient validation error) indicate a bad deploy, schema mismatch, or upstream bug. Alert on a **rate** or a **count over a window**, not on every single failure.
5. **Task backlog age** — A queue with recent work may be healthy; a queue whose oldest task age keeps growing means users may see stale search even while the app is otherwise healthy.
6. **Task DB size / retention** — High-churn apps can create large task histories. Track task count or task DB size where your Meilisearch version/exporter exposes it, and define a retention/compaction policy before disk pressure makes it urgent.
7. **Replication / multi-node** (if used) — **split brain or lag** between nodes is a separate product surface; follow Meilisearch’s own HA docs for your version.
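Several of the signals above reduce to small predicates over samples you already collect. The sketch below illustrates signals 1, 2, 4, and 5. The module name `SearchAlerts`, the input shapes, and every default threshold (two consecutive failures, 20% or 10 GiB free) are assumptions chosen for illustration, not values recommended by Scrypath or Meilisearch.

```elixir
defmodule SearchAlerts do
  @moduledoc "Illustrative alert predicates; names, shapes, and thresholds are assumptions."

  # Signal 1: page only after `threshold` consecutive failed health checks.
  # `checks` is newest-first; each entry is :ok or :error.
  def unreachable?(checks, threshold \\ 2) do
    recent = Enum.take(checks, threshold)
    length(recent) == threshold and Enum.all?(recent, &(&1 == :error))
  end

  # Signal 2: alert when free space falls below a percentage OR an absolute floor.
  def disk_low?(free_bytes, total_bytes, min_free_pct \\ 20.0, min_free_bytes \\ 10 * 1024 * 1024 * 1024)
      when total_bytes > 0 do
    free_pct = free_bytes / total_bytes * 100
    free_pct < min_free_pct or free_bytes < min_free_bytes
  end

  # Signal 4: count failed tasks inside a window instead of reacting to each one.
  # `tasks` are maps with :status (atom) and :finished_at (DateTime).
  def failed_in_window(tasks, now, window_seconds) do
    Enum.count(tasks, fn task ->
      task.status == :failed and DateTime.diff(now, task.finished_at) <= window_seconds
    end)
  end

  # Signal 5: `age_samples` holds the oldest-task age in seconds, oldest sample first.
  # An age that rises on every sample suggests the backlog is not draining.
  def backlog_growing?(age_samples) when length(age_samples) >= 2 do
    age_samples
    |> Enum.chunk_every(2, 1, :discard)
    |> Enum.all?(fn [prev, next] -> next > prev end)
  end

  def backlog_growing?(_age_samples), do: false
end
```

Keeping these as pure functions means the thresholds can be unit-tested against recorded incidents before they are wired into a pager.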

**Avoid alert fatigue:** do **not** page on single slow searches, one failed document in a batch, or Meilisearch `202 Accepted` enqueue latency alone. Those belong on dashboards or SLO burn-rate rules with long windows.
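One way to express a burn-rate rule is sketched below. Burn rate is the observed error ratio divided by the error budget (`1 - SLO target`), and the rule pages only when both a long and a short window exceed the threshold. The module name `BurnRate`, the input shape, and the threshold you pass in are assumptions; 14.4 is a commonly cited threshold for a one-hour window against a 30-day SLO, but pick values that match your own SLO.

```elixir
defmodule BurnRate do
  @moduledoc "Illustrative multi-window burn-rate check; names and thresholds are assumptions."

  # Burn rate = observed error ratio / error budget, where budget = 1 - SLO target.
  def burn_rate(errors, total, slo_target) when total > 0 do
    errors / total / (1.0 - slo_target)
  end

  # No traffic in the window means no budget was burned.
  def burn_rate(_errors, 0, _slo_target), do: 0.0

  # Page only when BOTH windows burn faster than `threshold`.
  # `long` and `short` are maps with :errors and :total request counts.
  def page?(long, short, slo_target, threshold) do
    burn_rate(long.errors, long.total, slo_target) >= threshold and
      burn_rate(short.errors, short.total, slo_target) >= threshold
  end
end
```

Requiring both windows keeps a single slow or failed search from paging, while a sustained failure still trips the rule quickly.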

## Footguns (Meilisearch + Scrypath-shaped)

- **`filter` and `facetFilters` AND together** — Users can think they cleared facets while a base `filter` still narrows results. Document in your UI and ops playbooks; see the faceted search guide appendix.
- **Reindex + disk** — Full reindex can **temporarily double** index footprint until old data is dropped. Plan disk headroom before `Scrypath.reindex/2` on large corpora.
- **Settings verify skipped** — `skip_settings_verification?: true` speeds emergencies but **hides drift** until the next verify. Treat as a **temporary** flag; do not leave it on silently.
- **Sync mode semantics** — `:oban` means **durable enqueue**, not “search is updated.” Paging on queue depth without checking **search visibility** misdiagnoses user impact; see [Sync Modes and Visibility](../guides/sync-modes-and-visibility.md).
- **Version skew** — Meilisearch minor versions change task and index behavior. Pin server **and** client expectations per environment; roll upgrades in a canary before production.
- **Live database on slow/network storage** — Keep the live Meilisearch data path on low-latency local or block storage. Object storage is fine for dump/snapshot artifacts, not for the active database.
- **Too many tiny writes** — Per-record or high-churn updates can create task debt. Batch, debounce, or omit fields that do not affect search.
- **Master key sprawl** — The master key is an administrative credential. Application workers should use scoped keys, and browser access should use narrow search keys or tenant tokens only when direct client search is intentional.
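For the "too many tiny writes" footgun, the sketch below shows one way to coalesce pending changes before they reach the sync path: keep only the latest change per record, batch the rest, and skip records whose search-relevant fields did not change. `WriteCoalescer` and the change shape (maps with an `:id` key) are hypothetical and not part of Scrypath's API.

```elixir
defmodule WriteCoalescer do
  @moduledoc "Illustrative write coalescing; module name and change shape are assumptions."

  # Keeps only the last pending change per id, ordered by each id's last
  # occurrence, then splits the result into batches of `batch_size`.
  def coalesce(changes, batch_size) do
    changes
    |> Enum.reverse()
    |> Enum.uniq_by(& &1.id)
    |> Enum.reverse()
    |> Enum.chunk_every(batch_size)
  end

  # True when any of the given fields differs between the old and new record.
  # Use it to skip sync for updates that do not affect search.
  def search_relevant?(old, new, fields) do
    Map.take(old, fields) != Map.take(new, fields)
  end
end
```

For example, three edits to the same record inside one debounce window collapse into a single document in a single batch.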

## Backup and upgrade posture

Scrypath can rebuild from the source of truth, but large indexes may make rebuild-only DR too slow. Keep all three options in the runbook:

- **Snapshots** are for fast same-version restore.
- **Dumps** are for portability and version migration.
- **Rebuild from Postgres/source data** is the most trustworthy repair when projection or settings drift is suspected.

Before a Meilisearch version upgrade, rehearse dump/import in staging and run known Scrypath smoke checks: health, stats, settings diff, search, filters, facets, and sync status. Do not let production be the first place a new Meilisearch binary sees your data.
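The "settings diff" step of such a smoke check can be as small as comparing the settings you expect with the settings read back from the staging index after import. The sketch below assumes both sides are already decoded into maps keyed by Meilisearch setting names such as `"filterableAttributes"`. `SettingsDiff` is a hypothetical helper, not a Scrypath function, and it compares values exactly, so settings whose list order is not significant may need normalizing first.

```elixir
defmodule SettingsDiff do
  @moduledoc "Illustrative settings comparison; not part of Scrypath's API."

  # Returns %{key => {expected, actual}} for every expected key whose live
  # value differs. Keys present only in `actual` are ignored.
  def diff(expected, actual) do
    for {key, want} <- expected, Map.get(actual, key) != want, into: %{} do
      {key, {want, Map.get(actual, key)}}
    end
  end
end
```

An empty map means the expected settings survived the dump/import round trip; anything else is worth resolving before the upgrade reaches production.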

## What to run before you tune alerts

From the repo root (maintainer checks):

- **`mix verify.phase13`** (with integration enabled when `SCRYPATH_MEILISEARCH_URL` is set) — focused operator-flow checks against a real Meilisearch, matching the CI-style job that runs with live integration enabled.
- **Application-level:** `Scrypath.sync_status/2`, `Scrypath.failed_sync_work/2`, `Scrypath.reconcile_sync/2` for human-readable posture before you change indexing.

## Related docs

- [ARCHITECTURE.md](../ARCHITECTURE.md) — drift, reindex order, and sync guarantees
- [guides/sync-modes-and-visibility.md](../guides/sync-modes-and-visibility.md) — `:inline` / `:oban` / `:manual`
- [guides/operator-mix-tasks.md](../guides/operator-mix-tasks.md) — thin Mix wrappers over `Scrypath.*`
- [guides/relevance-tuning.md](../guides/relevance-tuning.md) — settings and verify-applied semantics
- [guides/meilisearch-concepts.md](../guides/meilisearch-concepts.md) — adopter-level Meilisearch mental model
- [guides/meilisearch-operations.md](../guides/meilisearch-operations.md) — production checklist, backup, upgrade, and storage posture