This guide covers the FSM error states and the common error tuples Rindle returns. The pattern is the same across the lifecycle: every "stuck" or "degraded" outcome has a queryable state, an explicit recovery path, and typically a Mix task that automates the recovery.
For the full state diagrams, see Core Concepts. For the day-2 task reference, see Operations.
Diagnostics Split
Start with the read-only surfaces first:
- mix rindle.doctor validates setup and drift.
- mix rindle.runtime_status reports degraded or stuck work.
- The repair verbs perform change only after diagnostics identify the right lane.
In short: doctor validates setup and drift, runtime status reports degraded or stuck work, and repair verbs perform change.
The contract intentionally has no dashboard and no auto-remediation layer in this release.
For upgrade troubleshooting, keep the same order: explicit migrations,
mix rindle.doctor, optional mix rindle.runtime_status, then the repair verb
that matches the actual state.
If the failing profile uses Rindle.Storage.GCS, keep the same order and then
use storage_gcs.md for the bucket, CORS, session_uri, and
resumable-upload operator runbook instead of rebuilding that flow from logs.
Supported Recovery Verbs
Phase 30 makes the recovery lanes explicit:
- reprobe: Rindle.reprobe/1 for probe-derived field drift on one asset
- requeue: Rindle.requeue_variants/2 for failed or cancelled variants on one asset
- regenerate: mix rindle.regenerate_variants for broad stale or missing derivative drift
- cleanup: mix rindle.cleanup_orphans after mix rindle.abort_incomplete_uploads for expired upload residue
- sweep: mix rindle.sweep_orphaned_temp_files for AV temp-run-dir residue
Use the same verb vocabulary here that the operations guide teaches. If one of these verbs applies, prefer it over direct DB row mutation.
AV Error Contract
Rindle v1.4 freezes eight AV-facing error reasons as a public operator vocabulary:
- :processor_capability_missing
- :ffmpeg_not_found
- :capability_drift
- :variant_source_not_found
- :unsupported_codec
- :streaming_not_configured
- :variant_processing_cancelled
- :range_unparseable
The exact user-facing text for those reasons is owned by
Rindle.Error.message/1 and locked in test/rindle/error_test.exs. Treat this
guide as the recovery map, not a second wording authority.
| Reason | What it usually means | First operator move |
|---|---|---|
| :processor_capability_missing | The configured processor cannot satisfy a declared AV variant. | Run mix rindle.doctor and compare the profile's variants to the processor capability list. |
| :ffmpeg_not_found | FFmpeg is missing from PATH or not configured at :ffmpeg_path. | Install FFmpeg, then re-run mix rindle.doctor. |
| :capability_drift | A storage or processor capability disappeared after the profile was already in use. | Re-check runtime configuration, then use the supported cleanup, requeue, or regenerate lane that matches the affected work before retrying. |
| :variant_source_not_found | Variant processing could not download the original media from storage. | Confirm the original object still exists and that the adapter can read it. |
| :unsupported_codec | The declared codec is not available in the current FFmpeg build or processor allowlist. | Inspect ffmpeg -codecs and compare it to the variant recipe. |
| :streaming_not_configured | A caller asked for streaming playback without a configured streaming provider. | Fall back to progressive delivery with Rindle.Delivery.url/3. |
| :variant_processing_cancelled | An in-flight transcode was intentionally cancelled. | Verify whether Rindle.cancel_processing/1 was invoked, then use Rindle.requeue_variants/2 if the asset should resume work. |
| :range_unparseable | A malformed HTTP Range header reached the local streaming surface. | Fix the caller/header generator or enable strict parsing if your app wants hard failures. |
The common FFmpeg- and capability-related recovery path is still
mix rindle.doctor, because it exercises the same runtime boundary that emits
these reasons.
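
The table above can be folded into application code so that an alert carries the first operator move alongside the reason. The following is a minimal sketch and not part of Rindle: the module name MyApp.MediaTriage and the "move" atoms it returns are hypothetical labels chosen for illustration, and the authoritative wording stays with Rindle.Error.message/1.

```elixir
defmodule MyApp.MediaTriage do
  @moduledoc """
  Hypothetical helper (not part of Rindle): maps the eight frozen AV error
  reasons to a coarse "first move" tag that an alerting layer could display.
  """

  # The keys are the reasons listed in this guide; the values are
  # illustrative labels that mirror the "First operator move" column.
  @first_moves %{
    processor_capability_missing: :run_doctor,
    ffmpeg_not_found: :install_ffmpeg_then_run_doctor,
    capability_drift: :recheck_runtime_config,
    variant_source_not_found: :check_original_in_storage,
    unsupported_codec: :inspect_ffmpeg_codecs,
    streaming_not_configured: :fall_back_to_progressive,
    variant_processing_cancelled: :verify_cancel_then_requeue,
    range_unparseable: :fix_range_header
  }

  @spec first_move(atom()) :: {:ok, atom()} | {:error, :unknown_reason}
  def first_move(reason) when is_atom(reason) do
    case Map.fetch(@first_moves, reason) do
      {:ok, move} -> {:ok, move}
      :error -> {:error, :unknown_reason}
    end
  end
end
```

Returning {:error, :unknown_reason} for anything outside the frozen list keeps the helper from silently guessing if a later release adds reasons.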
Quarantined Assets
State: MediaAsset.state == "quarantined"
What it means: The asset failed an upload-time policy check. Common triggers:
- MIME mismatch: the magic-byte sniff disagrees with the profile's allow_mime allowlist (e.g., user uploaded evil.png.exe and the bytes are an executable, not a PNG)
- Size or pixel limits exceeded
- Scanner verdict (if you've wired a Rindle.Scanner adapter)
- Manual quarantine by an operator
Telemetry / logs: Logger.warning("rindle.asset.quarantined", asset_id: ..., detected_mime: ..., reason: ...) — search your log aggregator for
this string.
Recovery options:
- Confirm + delete (most common): The user uploaded something we should not host. Run a deletion through Rindle.delete/3 (which transitions to deleted and enqueues PurgeStorage).
- False positive, un-quarantine manually: There is no public API for this because it should be rare and audited. Manual DB update remains the exception path here:

      asset |> Ecto.Changeset.change(state: "available") |> MyApp.Repo.update!()

  Document the audit trail (who reversed the quarantine, why, when).
- Inspect the storage object: The original is still in storage at asset.storage_key. Download it through your storage adapter to verify the verdict before any reversal.
The quarantined → deleted transition is allowed; quarantined →
available is not allowed by the FSM. This is one of the narrow cases
where the supported Phase 30 verbs do not apply and a documented manual update
is still required.
Failed Variants
State: MediaVariant.state == "failed"
What it means: The variant exhausted its retry budget (default 5
attempts on the internal variant-processing worker). The Oban job is in the
discarded state; the variant row is in failed.
Diagnosing root cause:
- Inspect the Oban job: MyApp.Repo.get!(Oban.Job, job_id). The errors column has the stack trace from each attempt.
- Check the variant's recipe: did the spec change recently? A recipe bug (e.g., quality: 200, which is out of range) will fail every attempt the same way.
- Check the source asset: is the original still in storage? A missing original prevents any variant from regenerating.
- Check libvips / Image: very large images can OOM the BEAM; check memory metrics around the failure time.
Recovery options:
| Cause | Action |
|---|---|
| Transient (network, storage) | Rindle.requeue_variants/2 for the affected asset (requeue) |
| Intentional cancellation that should be resumed | Rindle.requeue_variants/2 for the affected asset (requeue) |
| Recipe bug fixed for many assets | Update the profile, then mix rindle.regenerate_variants (regenerate) |
| Corrupt source repaired for one asset | Fix the source bytes, then Rindle.requeue_variants/2 (requeue) |
| OOM / resource exhaustion | Reduce rindle_process concurrency or move to a larger node before retrying with requeue or regenerate |
The supported split is intentional: requeue is asset-scoped repair for
failed/cancelled work, while regenerate remains the broad maintenance lane
for stale/missing drift. If the underlying issue persists, the variant will
flip back to failed after another 5 attempts; investigate further rather than
re-enqueuing in a loop.
That same split applies to upgraded adopters: use Rindle.requeue_variants/2
for one failed or cancelled upgraded asset, and keep mix rindle.regenerate_variants for broader drift only.
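
Before choosing a lane, it can help to check whether every attempt failed the same way, since a recipe bug fails identically each time while a transient fault usually does not. The sketch below assumes the errors value has the shape Oban stores on a job (a list of maps with an "error" string key, alongside "attempt" and "at") and works on plain data so it runs without a database. The module and function names are hypothetical.

```elixir
defmodule MyApp.VariantFailureTriage do
  @moduledoc """
  Hypothetical helper: classifies a discarded job's recorded errors as
  deterministic (same failure on every attempt) or possibly transient.
  """

  @spec classify([map()]) :: :no_errors | :deterministic | :possibly_transient
  def classify([]), do: :no_errors

  def classify(errors) when is_list(errors) do
    distinct =
      errors
      |> Enum.map(&first_line(&1["error"]))
      |> Enum.uniq()

    case distinct do
      # Several attempts, one distinct message: likely a recipe or source bug.
      [_single] when length(errors) > 1 -> :deterministic
      _ -> :possibly_transient
    end
  end

  # Compare only the first line of each error string; later lines may
  # hold stack traces or other per-attempt detail.
  defp first_line(error) when is_binary(error) do
    error |> String.split("\n", parts: 2) |> hd() |> String.trim()
  end

  defp first_line(_other), do: ""
end
```

A :deterministic result suggests fixing the recipe or source before any requeue or regenerate; :possibly_transient suggests a single Rindle.requeue_variants/2 is a reasonable first try. Treat the result as a hint, not a verdict.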
Stale Variants
State: MediaVariant.state == "stale"
What it means: The variant's stored recipe_digest no longer
matches the profile's current digest. You changed the variant spec
(quality, dimensions, format, mode) and existing variants predate
the change.
Detection: Stale detection happens on read. When Rindle resolves
a variant URL, it compares the stored digest to the current profile
digest and, if they differ, transitions the variant to stale. So
"stale" becomes observable the first time a variant is resolved after you change a profile.
Recovery: mix rindle.regenerate_variants (regenerate) walks all stale
variants and re-enqueues them. Filter by profile or variant name if
you only want to regenerate a subset:
mix rindle.regenerate_variants --profile Elixir.MyApp.PostImageProfile
mix rindle.regenerate_variants --variant thumb
While stale variants are being regenerated, Rindle.Delivery.variant_url/4
serves the original asset as a fallback (the default :stale_mode
is :fallback_original). To serve stale variants during regeneration
instead, pass stale_mode: :serve_stale — this is appropriate when
the visual diff between recipes is small and you'd rather show the
old variant than the unsized original.
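
The read-time digest comparison and the two stale modes can be pictured with a small standalone model. This is an illustration only: the digest scheme (SHA-256 over the sorted recipe), the module name MyApp.StaleModel, and the return tuples are assumptions made for the sketch and are not Rindle's actual implementation.

```elixir
defmodule MyApp.StaleModel do
  @moduledoc """
  Illustrative model of read-time stale detection. Not Rindle's code:
  the digest scheme and return values are assumptions for this sketch.
  """

  # Assumed digest: SHA-256 over the recipe with its keys sorted, hex-encoded.
  @spec digest(keyword()) :: String.t()
  def digest(recipe) when is_list(recipe) do
    recipe
    |> Enum.sort()
    |> :erlang.term_to_binary()
    |> then(&:crypto.hash(:sha256, &1))
    |> Base.encode16(case: :lower)
  end

  # Decide what to serve for a variant, given the profile's current recipe.
  @spec resolve(%{recipe_digest: String.t()}, keyword(), keyword()) ::
          {:serve, :variant | :original}
  def resolve(variant, current_recipe, opts \\ []) do
    stale_mode = Keyword.get(opts, :stale_mode, :fallback_original)

    cond do
      # Digests match: the variant is current.
      variant.recipe_digest == digest(current_recipe) -> {:serve, :variant}
      # Digests differ, but the caller opted into serving the old variant.
      stale_mode == :serve_stale -> {:serve, :variant}
      # Default: fall back to the original while regeneration runs.
      true -> {:serve, :original}
    end
  end
end
```

Sorting the recipe before hashing makes the assumed digest independent of option order, so reordering keys in a profile would not by itself mark variants stale in this model.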
Missing Variants
State: MediaVariant.state == "missing"
What it means: mix rindle.verify_storage HEAD-checked the
variant's storage object and got not_found. The DB still has the
row; the storage object is gone.
Common causes:
- Out-of-band deletion (operator using S3 console, CDN purge gone wrong)
- Storage lifecycle policy expired the object
- Multi-region replication gap during failover
- Storage backup restore that pre-dated the variant
Detection: Run mix rindle.verify_storage periodically. The task
emits a deterministic summary:
Rindle: verifying storage for variants...
checked: 120
present: 117
missing: 2
fsm_blocked: 1
errors: 0
Done.

Recovery: mix rindle.regenerate_variants (regenerate) re-enqueues missing
variants the same way it does stale ones. The processor downloads
the original (which must still be present) and re-derives the
variant.
If the original is also missing, the variant cannot be regenerated — you have data loss. The asset should be quarantined or deleted depending on your data-recovery posture.
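
Because the verify_storage summary is deterministic, a scheduled job could parse it and decide whether regeneration is worth running. The sketch below assumes the counters appear exactly as "name: integer" lines, as in the sample summary in this section. The module name is hypothetical, and how you capture the task output depends on your deployment.

```elixir
defmodule MyApp.VerifyStorageSummary do
  @moduledoc """
  Hypothetical parser for the `mix rindle.verify_storage` summary, assuming
  counter lines of the form `name: integer` as shown in this guide.
  """

  # Collect every `name: integer` line into a map; other lines
  # (the banner and the trailing "Done.") are ignored.
  @spec parse(String.t()) :: %{optional(String.t()) => non_neg_integer()}
  def parse(output) when is_binary(output) do
    ~r/^\s*([a-z_]+):\s*(\d+)\s*$/m
    |> Regex.scan(output, capture: :all_but_first)
    |> Map.new(fn [key, value] -> {key, String.to_integer(value)} end)
  end

  # Regeneration is only useful when something was reported missing.
  @spec needs_regenerate?(map()) :: boolean()
  def needs_regenerate?(summary), do: Map.get(summary, "missing", 0) > 0
end
```

A non-zero "errors" or "fsm_blocked" count is left for the operator to inspect; this sketch deliberately reacts only to "missing".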
Expired Upload Sessions
State: MediaUploadSession.state == "expired"
What it means: The session's expires_at elapsed before
Broker.verify_completion/2 was called. The presigned URL is no
longer valid; the upload either never happened or never reached
verification.
Detection: Logger.info("rindle.upload_session.expired", session_id: ..., reason: %{event: :expired, elapsed_seconds: ...}).
The transition itself happens in Rindle.Workers.AbortIncompleteUploads
(scheduled via Oban cron) or via mix rindle.abort_incomplete_uploads.
Recovery: Expired sessions are terminal — the FSM does not allow
transitions out of expired. The flow is:
1. mix rindle.abort_incomplete_uploads flips timed-out signed/uploading sessions to expired.
2. mix rindle.cleanup_orphans removes expired sessions and any staged storage objects they reference.
This is the supported cleanup lane. Do not remove expired upload-session rows
manually unless the cleanup workflow itself is broken and you have preserved an
audit trail.

If the user wants to retry their upload, they must call
Broker.initiate_session/2 to start a new session. There is no
"resume" path, by design, because the half-uploaded bytes (if any)
are not trustworthy.
Probe Drift
Symptom: An asset's source object is still authoritative, but the stored probe-derived fields are stale or were persisted before improved detection landed.
Recovery: Use Rindle.reprobe/1 (reprobe) for that asset. This refreshes
probe-owned fields such as MIME, kind, dimensions, duration, and track booleans
without mutating unrelated lifecycle state, variants, or upload sessions.
If the problem is analyzer metadata rather than probe facts, stay on
mix rindle.backfill_metadata; reprobe is not a metadata backfill surrogate.
AV Temp Residue
Symptom: AV processing left abandoned directories under Rindle.tmp/.
Recovery: Use mix rindle.sweep_orphaned_temp_files (sweep). Start in
dry-run, confirm the counts, then opt into live deletion with --no-dry-run
or a cron job configured with "dry_run" => false if you want destructive
execution.
This is separate from upload-session cleanup: temp sweeping targets local
transcoding residue, not staged upload objects or upload-session rows.
Common Error Tuples
| Error | Where Returned | Meaning + Recovery |
|---|---|---|
| {:error, :not_found} | Broker.sign_url/2, Broker.verify_completion/2 | Session ID does not exist. Surfaces as a 404 in your controller. |
| {:error, :storage_object_missing} | Broker.verify_completion/2 | The presigned PUT was never made (or hit the wrong URL). User should retry. |
| {:error, {:invalid_transition, from, to}} | Any FSM transition | A worker tried an FSM transition that is not allowlisted. Indicates a bug or a race; check rindle.*.transition_failed log entries. |
| {:error, {:delivery_unsupported, :signed_url}} | Rindle.Delivery.url/3 | A private profile is pointed at a storage adapter that does not support signed URLs. Either flip the profile to public, or switch adapters. |
| {:error, :forbidden} (or any authorizer error) | Rindle.Delivery.url/3 | The configured authorizer rejected the request. Check the actor and the subject. |
| {:error, {:storage_adapter_exception, term}} | Storage adapter calls | The storage adapter raised. Inspect term for the underlying cause (network, auth, malformed config). |
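
One way to keep controller handling of these tuples consistent is a single status-mapping function. The sketch below is hypothetical: only the 404 for :not_found comes from the table above, and every other status code is an assumption you should adjust to your API conventions. The module name MyApp.MediaErrorStatus is likewise illustrative.

```elixir
defmodule MyApp.MediaErrorStatus do
  @moduledoc """
  Hypothetical mapping from Rindle error tuples to HTTP status codes.
  Only the 404 for :not_found comes from this guide; the rest are assumptions.
  """

  @spec status({:error, term()}) :: pos_integer()
  def status({:error, :not_found}), do: 404
  # Assumption: authorizer rejections are reported as 403.
  def status({:error, :forbidden}), do: 403
  # Assumption: the client can fix this by retrying the upload.
  def status({:error, :storage_object_missing}), do: 422
  # Assumption: a refused FSM transition is reported as a conflict.
  def status({:error, {:invalid_transition, _from, _to}}), do: 409
  # Assumption: misconfiguration and adapter failures are server-side faults.
  def status({:error, {:delivery_unsupported, _mode}}), do: 500
  def status({:error, {:storage_adapter_exception, _term}}), do: 502
  # Anything not listed in this guide falls through to a generic 500.
  def status({:error, _other}), do: 500
end
```

Keeping the catch-all clause last means a new or custom authorizer error degrades to a 500 instead of crashing the request with a FunctionClauseError.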
Diagnostics Cheatsheet
Quick queries you can run to triage state distribution:
import Ecto.Query

# Asset state distribution
MyApp.Repo.all(
  from a in Rindle.Domain.MediaAsset,
    group_by: a.state,
    select: {a.state, count(a.id)}
)

# Variants in non-ready state
MyApp.Repo.all(
  from v in Rindle.Domain.MediaVariant,
    where: v.state in ["failed", "stale", "missing"],
    group_by: [v.state, v.name],
    select: {v.state, v.name, count(v.id)}
)

# Upload sessions past their TTL but not yet expired
MyApp.Repo.all(
  from s in Rindle.Domain.MediaUploadSession,
    where: s.state in ["signed", "uploading"] and s.expires_at < ago(0, "second"),
    select: count(s.id)
)

These three queries cover most "what's wrong with my media pipeline?" triage situations. Wire them into a LiveDashboard page or an admin LiveView for at-a-glance health.
Getting Help
If you encounter a state or error tuple that is not covered here, the canonical references are:
- The lifecycle/state tables in the core concepts guide for transition rules
- Rindle.Upload.Broker, Rindle.Delivery, Rindle (the public facade module) for the public API contracts
- Rindle.Error.message/1 and test/rindle/error_test.exs for the locked AV error text and remediation wording
- The Mix task @moduledoc blocks for command-line behavior
- The telemetry contract test (test/rindle/contracts/telemetry_contract_test.exs) for the locked event surface
When in doubt, the FSM is the source of truth. If the FSM forbids a transition, that is by design — work with the FSM (queue → process → ready, or queue → purge), not around it.