Search backend operations (SRE view)


This guide is for platform and SRE maintainers who run the search process alongside a Phoenix app using Scrypath. It complements Sync Modes and Visibility (application semantics) and Operator Support (library maintainer first response).

Scope

  • Scrypath v1 publicly targets Meilisearch first. Runbooks here assume a Meilisearch cluster or single node your app reaches over HTTP. Other engines need different metrics and failure modes; Scrypath does not abstract them on the public path.
  • Goal: a small set of metrics and alerts so teams notice user-visible failure and capacity risk without paging on normal variance.

Two layers: app vs search engine

| Layer | What you own | What breaks first when users complain |
| --- | --- | --- |
| Application | Scrypath sync/search/hydration paths, Oban queues, DB | Wrong or stale results, timeouts, 5xx from your app |
| Meilisearch | Process health, disk, RAM, version, task pipeline | Search down, writes stuck, index corruption risk under disk pressure |

Instrument both. Do not page only on Meilisearch CPU; pair with Scrypath search error rate and end-to-end latency from the app.

Scrypath telemetry (application signals)

Scrypath emits :telemetry.span/3-style events (see Telemetry for start / stop / exception and duration measurements on stop). Keep dashboards low cardinality: use schema, backend, index, sync_mode — avoid high-cardinality tags such as raw query text or primary keys on alert rules.

Stable event prefixes (each has :start, :stop, and on failure :exception where applicable):

| Event prefix | When | Useful aggregates |
| --- | --- | --- |
| [:scrypath, :search] | Common-path search | p95/p99 duration, error rate, hit_count from stop metadata |
| [:scrypath, :hydration] | Repo batch load after search | Duration vs hit_count / record_count / missing_count (drift indicator when missing_count grows) |
| [:scrypath, :sync, :upsert] / [:scrypath, :sync, :delete] | Document sync | Error rate, document_count, :noop ratio (noisy if alerted per call) |
| [:scrypath, :meilisearch, :request] | HTTP to Meilisearch | status_code, method, path pattern; alert on sustained 5xx / connection errors |
| [:scrypath, :meilisearch, :task_wait] | Waiting for Meilisearch task completion | poll_count, final_status; watch for large poll_count or non-:succeeded trends |
| [:scrypath, :reindex, :settings_verified] | Post-apply settings read-back | Stop metadata result tag (:parity, :drift, etc.) |
| [:scrypath, :reindex, :verify_skipped] | Execute only; settings verify skipped by opt | Rare spikes may be intentional deploys; correlate with logs |
| [:scrypath, :operator, :failed_work, :observed] | Each failed-work row materialized from backend tasks or Oban jobs | Useful for dashboards and structured logs; high volume on noisy data, so do not page on every event; treat as a diagnostic signal and aggregate |

Dashboard-first: sync upsert volume, search QPS, hydration missing_count distribution, Meilisearch request latency.
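
As a rough illustration of the low-cardinality advice above, the sketch below turns a [:scrypath, :search] stop or exception event into a small sample that keeps only the schema, backend, index, and sync_mode tags. The module name, handler id, and the report/1 sink are hypothetical; the :telemetry package is assumed to be available at runtime, and the exact metadata Scrypath attaches should be checked against its Telemetry guide.

```elixir
defmodule SearchTelemetrySketch do
  @moduledoc false

  # Low-cardinality metadata keys named in this guide.
  @tag_keys [:schema, :backend, :index, :sync_mode]

  # The [:scrypath, :search] prefix plus the span suffixes we aggregate on.
  @events [
    [:scrypath, :search, :stop],
    [:scrypath, :search, :exception]
  ]

  # Attaches the handler. Needs the :telemetry package at runtime (assumption).
  def attach do
    :telemetry.attach_many(
      "search-telemetry-sketch",
      @events,
      &__MODULE__.handle_event/4,
      nil
    )
  end

  # Telemetry handler callback: build a sample and hand it to the sink.
  def handle_event(event, measurements, metadata, _config) do
    sample = to_sample(event, measurements, metadata)
    report(sample)
  end

  # Pure conversion, kept separate so it can be tested without :telemetry.
  # Raw query text, primary keys, and any other metadata are dropped here.
  def to_sample(event, measurements, metadata) do
    %{
      outcome: List.last(event),
      duration_ms: duration_ms(measurements),
      tags: Map.take(metadata, @tag_keys)
    }
  end

  # Span durations are reported in native time units.
  defp duration_ms(%{duration: native}) when is_integer(native),
    do: System.convert_time_unit(native, :native, :millisecond)

  defp duration_ms(_measurements), do: nil

  # Hypothetical sink; replace with your metrics reporter.
  defp report(sample), do: IO.inspect(sample, label: "scrypath.search")
end
```

Calling SearchTelemetrySketch.attach() once at application start would be enough for this sketch. Error rate is then the share of samples whose outcome is :exception, grouped by the retained tags.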

Meilisearch infrastructure (minimal signals)

Prioritize signals that predict outage, data loss risk, or unbounded backlog. Exact metric names depend on your exporter (Prometheus sidecar, cloud vendor agent, or logs). Map these concepts to your stack:

  1. Process up / ready — HTTP GET /health (or vendor equivalent) from the same network path as the app. Page when unreachable for longer than a short window (e.g. two failed checks), not on single blips.
  2. Disk free — Meilisearch persists indexes; running out of disk is a top cause of corruption and wedged tasks. Alert on free space percentage or absolute GB with headroom for snapshots, dumps, task DB growth, and reindexes.
  3. Memory pressure — Meilisearch uses LMDB and memory-mapped I/O, so high RSS/page-cache-looking memory can be normal. Page on OOM kills or sustained memory limit pressure from your orchestrator, not one-off spikes during planned reindex.
  4. Task failures — Meilisearch indexes work through a task queue. Sustained failed tasks (not every transient validation error) indicate a bad deploy, schema mismatch, or upstream bug. Prefer a rate or count over a window, not every single failure.
  5. Task backlog age — A queue with recent work may be healthy; a queue whose oldest task age keeps growing means users may see stale search even while the app is otherwise healthy.
  6. Task DB size / retention — High-churn apps can create large task histories. Track task count or task DB size where your Meilisearch version/exporter exposes it, and define a retention/compaction policy before disk pressure makes it urgent.
  7. Replication / multi-node (if used) — split brain or lag between nodes is a separate product surface; follow Meilisearch’s own HA docs for your version.
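
Two of the signals above, health-check debouncing (item 1) and task backlog age (item 5), can be sketched as pure functions. This is a minimal sketch under stated assumptions: the module and function names are hypothetical, health results are assumed to be collected elsewhere as a newest-first list of :ok / :error, and tasks are assumed to be already-decoded JSON maps from Meilisearch's GET /tasks with "status" and "enqueuedAt" fields.

```elixir
defmodule SearchInfraSketch do
  @moduledoc false

  # True only when the newest `threshold` health checks all failed.
  # `results` is newest-first, e.g. [:error, :error, :ok].
  def page_on_health?(results, threshold \\ 2)
      when is_list(results) and is_integer(threshold) and threshold > 0 do
    recent = Enum.take(results, threshold)
    length(recent) == threshold and Enum.all?(recent, &(&1 == :error))
  end

  # Age in seconds of the oldest unfinished task, or nil when nothing is waiting.
  # A value that keeps growing between scrapes suggests a stuck or overloaded queue.
  def oldest_pending_age_seconds(tasks, now \\ DateTime.utc_now()) do
    tasks
    |> Enum.filter(&(&1["status"] in ["enqueued", "processing"]))
    |> Enum.map(&enqueued_at/1)
    |> Enum.min(DateTime, fn -> nil end)
    |> case do
      nil -> nil
      oldest -> DateTime.diff(now, oldest, :second)
    end
  end

  # Parses the task's enqueue timestamp (ISO 8601 / RFC 3339 string).
  defp enqueued_at(%{"enqueuedAt" => iso}) do
    {:ok, datetime, _offset} = DateTime.from_iso8601(iso)
    datetime
  end
end
```

Exporting the backlog age as a gauge and alerting on its trend, rather than on queue length, matches the intent of item 5: recent work in the queue is normal, an ever-older head of the queue is not.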

Avoid alert fatigue: do not page on single slow searches, one failed document in a batch, or Meilisearch 202 Accepted enqueue latency alone. Those belong on dashboards or SLO burn-rate rules with long windows.
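
One way to avoid paging on single failures is to evaluate a failure rate over a window and require a minimum number of events before alerting. The sketch below assumes task outcomes are available as {unix_seconds, status} tuples; the module name and the default window, rate, and minimum-event values are illustrative assumptions, not recommendations from Scrypath or Meilisearch.

```elixir
defmodule TaskFailureWindowSketch do
  @moduledoc false

  # Share of failed events inside the window (now - window_seconds, now].
  # `events` is a list of {unix_seconds, :succeeded | :failed} tuples.
  def failure_rate(events, now, window_seconds) do
    in_window = in_window(events, now, window_seconds)

    case length(in_window) do
      0 ->
        0.0

      total ->
        failed = Enum.count(in_window, fn {_ts, status} -> status == :failed end)
        failed / total
    end
  end

  # Alert only when the window holds enough events AND the rate is too high,
  # so one failed task in a quiet period never pages anyone.
  def alert?(events, now, opts \\ []) do
    window = Keyword.get(opts, :window_seconds, 900)
    max_rate = Keyword.get(opts, :max_rate, 0.05)
    min_events = Keyword.get(opts, :min_events, 20)

    enough? = length(in_window(events, now, window)) >= min_events
    enough? and failure_rate(events, now, window) > max_rate
  end

  defp in_window(events, now, window_seconds) do
    Enum.filter(events, fn {ts, _status} ->
      ts > now - window_seconds and ts <= now
    end)
  end
end
```

The same shape applies to search errors or slow requests; in practice most teams would express this as a rule in their metrics system rather than in application code.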

Footguns (Meilisearch + Scrypath-shaped)

  • filter and facetFilters AND together — Users can think they cleared facets while a base filter still narrows results. Document in your UI and ops playbooks; see the faceted search guide appendix.
  • Reindex + disk — Full reindex can temporarily double index footprint until old data is dropped. Plan disk headroom before Scrypath.reindex/2 on large corpora.
  • Settings verify skipped: skip_settings_verification?: true speeds emergencies but hides drift until the next verify. Treat as a temporary flag; do not leave it on silently.
  • Sync mode semantics: :oban means durable enqueue, not “search is updated.” Paging on queue depth without checking search visibility misdiagnoses user impact; see sync modes guide.
  • Version skew — Meilisearch minor versions change task and index behavior. Pin server and client expectations per environment; roll upgrades in a canary before production.
  • Live database on slow/network storage — Keep the live Meilisearch data path on low-latency local or block storage. Object storage is fine for dump/snapshot artifacts, not for the active database.
  • Too many tiny writes — Per-record or high-churn updates can create task debt. Batch, debounce, or omit fields that do not affect search.
  • Master key sprawl — The master key is an administrative credential. Application workers should use scoped keys, and browser access should use narrow search keys or tenant tokens only when direct client search is intentional.
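
The reindex and disk footgun above lends itself to a pre-flight check. The sketch below is only an approximation: it assumes you already know the current index footprint and free disk space in bytes (for example from your exporter or Meilisearch's stats endpoint), and the module name and the 1.2 safety factor are hypothetical. Since a full reindex can temporarily double the footprint, free space should be at least the current index size plus a margin.

```elixir
defmodule ReindexHeadroomSketch do
  @moduledoc false

  # Free bytes we want before starting: one extra copy of the index
  # scaled by a safety factor for snapshots, dumps, and task DB growth.
  def required_free_bytes(index_bytes, safety_factor \\ 1.2)
      when is_integer(index_bytes) and index_bytes >= 0 and safety_factor >= 1 do
    ceil(index_bytes * safety_factor)
  end

  # Returns :ok, or {:error, shortfall_in_bytes} when the disk is too full.
  def check(index_bytes, free_bytes, safety_factor \\ 1.2) do
    required = required_free_bytes(index_bytes, safety_factor)

    if free_bytes >= required do
      :ok
    else
      {:error, required - free_bytes}
    end
  end
end
```

A runbook step could call ReindexHeadroomSketch.check/3 before Scrypath.reindex/2 and stop when it returns an error tuple.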

Backup and upgrade posture

Scrypath can rebuild from the source of truth, but large indexes may make rebuild-only DR too slow. Keep both ideas in the runbook:

  • Snapshots are for fast same-version restore.
  • Dumps are for portability and version migration.
  • Rebuild from Postgres/source data is the most trustworthy repair when projection or settings drift is suspected.

Before a Meilisearch version upgrade, rehearse dump/import in staging and run known Scrypath smoke checks: health, stats, settings diff, search, filters, facets, and sync status. Do not let production be the first place a new Meilisearch binary sees your data.

What to run before you tune alerts

From the repo root (maintainer checks):