Layer-B retry helper consumed by non-streaming adapters and by the image façade.
Wraps a non-streaming adapter call in a bounded retry loop with
exponential backoff and additive jitter. Adapters call run/3 with
their own per-attempt closure; the closure returns one of
{:ok, term}, {:retry, delay_ms, error}, or {:error, term}. The
helper handles the loop, the sleep, and the per-attempt
[:allm, :adapter, :retry] telemetry emission.
Streaming adapters do not call run/3: streaming calls are not
retried automatically because partial output has already been
delivered to the consumer. The ALLM.StreamAdapter behaviour-doc
surfaces this contract; enforcement is by code review, not by the
helper.
Image-side caller class
The image façade (ALLM.generate_image/3, edit_image/4,
image_variations/3) wraps the adapter dispatch in run/3. Backoff
timing reuses the chat-side default_policy/0 unchanged at its
source; the image façade augments retry_on at the call site to
add the four image-error atoms (:rate_limited,
:provider_unavailable, :timeout, :network_error). The
chat-side default retry_on is HTTP-status-coded
([429, 500, 502, 503, 504, :timeout]) and the image side surfaces
closed-enum atoms only, so without the augmentation only :timeout
would coincidentally retry. Image-side closures emit
{:retry, delay_ms, %ALLM.Error.ImageAdapterError{}} for the four
retry-engaging reasons; other reasons surface verbatim with no retry
attempt.
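The effect of the call-site augmentation can be sketched as follows. This is a minimal illustration, not the façade's code: RetryOnSketch and its functions are hypothetical names, and a plain map with a :reason field stands in for %ALLM.Error.ImageAdapterError{}.

```elixir
defmodule RetryOnSketch do
  # Chat-side default retry_on, as listed under default_policy/0.
  @chat_retry_on [429, 500, 502, 503, 504, :timeout]
  # The four image-error atoms named above.
  @image_reasons [:rate_limited, :provider_unavailable, :timeout, :network_error]

  def chat_retry_on, do: @chat_retry_on
  def image_reasons, do: @image_reasons

  # Assumed augmentation: append the image atoms and drop the duplicate :timeout.
  def augmented, do: Enum.uniq(@chat_retry_on ++ @image_reasons)

  # Stand-in membership check: unwrap a :reason field when one is present.
  def matches?(%{reason: reason}, retry_on), do: reason in retry_on
  def matches?(error, retry_on), do: error in retry_on

  # Which image reasons would engage a retry under the given retry_on list?
  def retried(retry_on) do
    Enum.filter(@image_reasons, &matches?(%{reason: &1}, retry_on))
  end
end
```

Under these assumptions, RetryOnSketch.retried(RetryOnSketch.chat_retry_on()) yields only [:timeout], while the augmented list matches all four reasons.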
Where retry telemetry fires
Today the public Layer-C entry points (ALLM.generate/3,
ALLM.step/3, ALLM.chat/3) all route through ALLM.StreamRunner,
which calls the adapter's ALLM.StreamAdapter.stream/2 — the
streaming path. Streaming calls are not retried, so
[:allm, :adapter, :retry] events do NOT fire from any chat-side
public call. The retry round-trip is exercised by direct adapter
calls (e.g., ALLM.Providers.Fake.generate(req, adapter_opts: [retry_until_call: n])) and by the image façade.
:request_id on retry events
:request_id appears on [:allm, :adapter, :retry] metadata only
when the adapter call's opts carry a :request_id (typically
threaded from a wrapping Runner / StreamRunner span). Direct
adapter calls (e.g., the Fake retry round-trip) emit retry events
without :request_id because no wrapping span generated one.
Default policy
See default_policy/0 for the closed map. The policy is materialised
via materialize/1 from the engine's :retry field
(:default | false | keyword).
Closure contract
The closure passed to run/3 is invoked up to policy.max_attempts
times. It must return one of:
- {:ok, value}: success; loop returns {:ok, value}.
- {:retry, delay_ms, error}: retryable failure. When delay_ms > 0, that value (plus jitter) is the delay; otherwise the computed exponential backoff (with jitter) is used. The error term is checked against policy.retry_on for membership; a non-matching error returns {:error, error} immediately.
- {:error, error}: non-retryable failure; loop returns {:error, error} immediately, no telemetry, no sleep.
Closure-raised exceptions propagate to the caller of run/3
unchanged (no rescue, no telemetry — exceptions are not retryable).
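As an illustration of the closure contract, the sketch below maps a transport result onto the three return shapes. Everything here is assumed for the example: the ClosureSketch module, the response map shape (:status, :body, :retry_after_s), and the choice of which statuses count as retryable are hypothetical, not taken from any ALLM adapter.

```elixir
defmodule ClosureSketch do
  # Statuses this sketch treats as retryable (mirrors the default retry_on integers).
  @retryable [429, 500, 502, 503, 504]

  # 2xx: success, hand the body back to the caller.
  def to_closure_result({:ok, %{status: status, body: body}}) when status in 200..299 do
    {:ok, body}
  end

  # Retryable status: pass along a server-suggested delay in ms, or 0 to let
  # the helper fall back to its computed backoff.
  def to_closure_result({:ok, %{status: status} = resp}) when status in @retryable do
    delay_ms = (Map.get(resp, :retry_after_s) || 0) * 1_000
    {:retry, delay_ms, status}
  end

  # Any other status is a non-retryable failure.
  def to_closure_result({:ok, %{status: status}}), do: {:error, status}

  # Transport-level timeout: retryable with no suggested delay.
  def to_closure_result({:error, :timeout}), do: {:retry, 0, :timeout}

  # Everything else surfaces as-is.
  def to_closure_result({:error, reason}), do: {:error, reason}
end
```

A per-attempt closure would then be something like fn -> ClosureSketch.to_closure_result(do_request()) end, where do_request/0 is a placeholder for the adapter's own transport call.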
Telemetry
[:allm, :adapter, :retry] is emitted per retry attempt, before
sleeping, with measurements %{system_time: System.system_time}
and metadata Map.merge(telemetry_metadata, %{attempt: attempt, delay_ms: actual_delay, reason: error}). The final attempt (when
attempt == max_attempts) emits no retry event — the surrounding
[:allm, :adapter, :stop] (or :exception) span fires instead.
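The metadata shape and the "no event on the final attempt" rule can be sketched in isolation. RetryEventSketch is a hypothetical module written for this example; it only mirrors the Map.merge expression and the attempt check described above and does not call :telemetry.

```elixir
defmodule RetryEventSketch do
  # Build the metadata map as described: caller metadata merged with per-attempt fields.
  def metadata(telemetry_metadata, attempt, actual_delay, error) do
    Map.merge(telemetry_metadata, %{attempt: attempt, delay_ms: actual_delay, reason: error})
  end

  # A retry event is only emitted when another attempt will follow.
  def emits_retry_event?(attempt, max_attempts), do: attempt < max_attempts
end

# With a :request_id threaded through the caller's metadata:
with_id = RetryEventSketch.metadata(%{request_id: "req-1"}, 1, 512, 429)

# Direct adapter call with no wrapping span, so no :request_id key:
without_id = RetryEventSketch.metadata(%{}, 1, 512, 429)

IO.inspect(Map.has_key?(with_id, :request_id))
IO.inspect(Map.has_key?(without_id, :request_id))
```

Because Map.merge/2 lets the second map win on key collisions, a caller-supplied :attempt, :delay_ms, or :reason key would be overwritten by the per-attempt values in this sketch.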
Types
@type closure_result(ok) :: {:ok, ok} | {:retry, non_neg_integer(), term()} | {:error, term()}
Closure return: success, retry-with-delay, or non-retryable error.
@type engine_retry() :: :default | false | keyword()
Engine-side retry shapes accepted by materialize/1.
@type policy() :: %{
  max_attempts: non_neg_integer(),
  base_delay_ms: pos_integer(),
  max_delay_ms: pos_integer(),
  retry_on: [pos_integer() | atom()],
  jitter_ms: non_neg_integer(),
  respect_retry_after: boolean()
}
Materialised retry policy after merging :default | false | keyword().
Functions
@spec default_policy() :: policy()
Return the default retry policy.
Fields:
- :max_attempts: 3
- :base_delay_ms: 500
- :max_delay_ms: 30_000
- :retry_on: [429, 500, 502, 503, 504, :timeout]
- :jitter_ms: 250
- :respect_retry_after: true
Examples
iex> p = ALLM.Retry.default_policy()
iex> p.max_attempts
3
iex> p.retry_on
[429, 500, 502, 503, 504, :timeout]
@spec error_matches?(term(), [pos_integer() | atom()]) :: boolean()
Return true if error is a member of retry_on.
Integers match HTTP status codes (429 ∈ [429, 500, ...]); atoms
match error atoms (:timeout). For opaque error structs carrying a
:reason field, the reason is extracted before membership check.
Examples
iex> retry_on = [429, 500, :timeout]
iex> ALLM.Retry.error_matches?(429, retry_on)
true
iex> ALLM.Retry.error_matches?(400, retry_on)
false
iex> ALLM.Retry.error_matches?(%{reason: :timeout}, retry_on)
true
@spec materialize(engine_retry()) :: policy() | :no_retry
Materialise an engine :retry field into a policy or :no_retry.
- false → :no_retry.
- :default → default_policy.
- [] → default_policy.
- keyword → Map.merge(default_policy, Map.new(kw)).
- [max_attempts: 0] → :no_retry (zero attempts is indistinguishable from "no retry").
- Unknown keys raise ArgumentError (a typo like max_atempts: fails loudly).
Examples
iex> ALLM.Retry.materialize(:default).max_attempts
3
iex> ALLM.Retry.materialize(false)
:no_retry
iex> ALLM.Retry.materialize(max_attempts: 5).max_attempts
5
iex> ALLM.Retry.materialize(max_attempts: 0)
:no_retry
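The mapping rules for materialize/1 could be implemented along the following lines. This is a sketch under stated assumptions, not the module's source: MaterializeSketch is a hypothetical name, the default map is copied from the fields listed under default_policy/0, and the error message text is invented.

```elixir
defmodule MaterializeSketch do
  # Defaults copied from the default_policy/0 field list.
  @default %{
    max_attempts: 3,
    base_delay_ms: 500,
    max_delay_ms: 30_000,
    retry_on: [429, 500, 502, 503, 504, :timeout],
    jitter_ms: 250,
    respect_retry_after: true
  }

  def materialize(false), do: :no_retry
  def materialize(:default), do: @default

  def materialize(kw) when is_list(kw) do
    # Reject keys that are not policy fields so typos fail loudly.
    unknown = Keyword.keys(kw) -- Map.keys(@default)

    if unknown != [] do
      raise ArgumentError, "unknown retry option(s): #{inspect(unknown)}"
    end

    # Overrides win over defaults; zero attempts collapses to :no_retry.
    case Map.merge(@default, Map.new(kw)) do
      %{max_attempts: 0} -> :no_retry
      policy -> policy
    end
  end
end
```

Note that the empty keyword list falls through the list clause and yields the defaults unchanged, which matches the [] rule above.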
@spec run(policy() | :no_retry | engine_retry(), map(), (-> closure_result(ok))) :: {:ok, ok} | {:error, term()} when ok: var
Run fun under the given retry policy.
Accepts a materialised policy, :no_retry, or any of the engine
shapes (:default | false | keyword); the engine shape is
materialised internally via materialize/1.
Behaviour:
- :no_retry: fun is invoked once. {:ok, _} is returned verbatim; {:error, error} and {:retry, _, error} collapse to {:error, error} so the caller doesn't have to handle the third shape.
- policy: fun is invoked up to policy.max_attempts times. telemetry_metadata is shallow-merged with %{attempt: attempt, delay_ms: actual_delay, reason: error} per attempt and emitted under [:allm, :adapter, :retry] before sleeping. The final attempt (when attempt == max_attempts) emits no retry event because the caller's surrounding [:allm, :adapter, :stop] / :exception span fires instead.
- actual_delay = max(closure_delay_ms, computed_backoff) where computed_backoff = min(max_delay_ms, base_delay_ms * (2 ** (attempt - 1))) + jitter. When respect_retry_after: true AND closure_delay_ms > 0, actual_delay = closure_delay_ms + jitter (the closure-supplied Retry-After value wins).
Jitter bounds: jitter is additive in [0, jitter_ms]
inclusive, never subtractive. Implementation:
:rand.uniform(jitter_ms + 1) - 1. :rand.uniform/1 returns
1..N inclusive on OTP 27, and :rand.uniform(1) - 1 == 0 when
jitter_ms == 0 (no jitter, deterministic delay).
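The delay arithmetic can be worked through with a small standalone sketch. BackoffSketch is a hypothetical module that transcribes the formulas stated above; the jitter is passed in explicitly so the results are reproducible, and the real helper may structure this differently.

```elixir
defmodule BackoffSketch do
  # Additive jitter in 0..jitter_ms inclusive, per the stated implementation.
  def jitter(jitter_ms), do: :rand.uniform(jitter_ms + 1) - 1

  # Capped exponential backoff plus jitter; attempt is 1-based.
  def computed_backoff(policy, attempt, jitter) do
    min(policy.max_delay_ms, policy.base_delay_ms * Integer.pow(2, attempt - 1)) + jitter
  end

  # A positive closure-supplied delay wins when respect_retry_after is true;
  # otherwise the larger of the closure delay and the computed backoff is used.
  def actual_delay(policy, attempt, closure_delay_ms, jitter) do
    if policy.respect_retry_after and closure_delay_ms > 0 do
      closure_delay_ms + jitter
    else
      max(closure_delay_ms, computed_backoff(policy, attempt, jitter))
    end
  end
end

policy = %{base_delay_ms: 500, max_delay_ms: 30_000, respect_retry_after: true}

# Computed backoff for attempts 1, 2 and 7 with zero jitter.
for attempt <- [1, 2, 7] do
  IO.inspect(BackoffSketch.actual_delay(policy, attempt, 0, 0))
end
```

With the default base and cap and zero jitter, attempts 1 and 2 give 500 ms and 1000 ms; by attempt 7 the uncapped value (500 * 64 = 32_000) exceeds max_delay_ms, so the delay is clamped to 30_000 ms.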
Closure-raised exceptions propagate to the caller unchanged — no rescue, no telemetry, no retry. Exceptions are not retryable.
Examples
iex> {:ok, v} = ALLM.Retry.run(:no_retry, %{}, fn -> {:ok, 42} end)
iex> v
42
iex> ALLM.Retry.run(:no_retry, %{}, fn -> {:retry, 0, :transient} end)
{:error, :transient}
iex> {:ok, _pid} = Agent.start(fn -> 0 end, name: :doctest_retry_counter)
iex> {:ok, v} =
...>   ALLM.Retry.run(
...>     [base_delay_ms: 1, jitter_ms: 0],
...>     %{},
...>     fn ->
...>       n = Agent.get_and_update(:doctest_retry_counter, &{&1 + 1, &1 + 1})
...>       if n < 2, do: {:retry, 0, 429}, else: {:ok, n}
...>     end
...>   )
iex> Agent.stop(:doctest_retry_counter)
iex> v
2