ALLM.Retry (allm v0.4.0)


Layer-B retry helper consumed by non-streaming adapters and by the image façade.

Wraps a non-streaming adapter call in a bounded retry loop with exponential backoff and additive jitter. Adapters call run/3 with their own per-attempt closure; the closure returns one of {:ok, term}, {:retry, delay_ms, error}, or {:error, term}. The helper handles the loop, the sleep, and the per-attempt [:allm, :adapter, :retry] telemetry emission.

Streaming adapters do not call run/3: streaming calls are not retried automatically because partial output has already been delivered to the consumer. The ALLM.StreamAdapter behaviour-doc surfaces this contract; enforcement is by code review, not by the helper.

Image-side caller class

The image façade (ALLM.generate_image/3, edit_image/4, image_variations/3) wraps the adapter dispatch in run/3. Backoff timing reuses the chat-side default_policy/0 unchanged at its source; the image façade augments retry_on at the call site to add the four image-error atoms (:rate_limited, :provider_unavailable, :timeout, :network_error). The chat-side default retry_on is HTTP-status-coded ([429, 500, 502, 503, 504, :timeout]) and the image side surfaces closed-enum atoms only, so without the augmentation only :timeout would coincidentally retry. Image-side closures emit {:retry, delay_ms, %ALLM.Error.ImageAdapterError{}} for the four retry-engaging reasons; other reasons surface verbatim with no retry attempt.
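The call-site augmentation described above can be sketched as follows. This is only an illustration: the policy literal copies the documented defaults so the snippet runs without the library, and the variable names (`image_reasons`, `image_policy`) are hypothetical, not taken from the façade's source.

```elixir
# Stand-in for ALLM.Retry.default_policy/0, copied from the documented defaults.
policy = %{
  max_attempts: 3,
  base_delay_ms: 500,
  max_delay_ms: 30_000,
  retry_on: [429, 500, 502, 503, 504, :timeout],
  jitter_ms: 250,
  respect_retry_after: true
}

# The four image-error atoms that should engage the retry loop.
image_reasons = [:rate_limited, :provider_unavailable, :timeout, :network_error]

# Append the image-side atoms; Enum.uniq/1 drops the second :timeout.
image_policy = Map.update!(policy, :retry_on, &Enum.uniq(&1 ++ image_reasons))

IO.inspect(image_policy.retry_on)
```

Only `:retry_on` changes in this sketch; the backoff fields stay exactly as the chat side defines them.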

Where retry telemetry fires

Today the public Layer-C entry points (ALLM.generate/3, ALLM.step/3, ALLM.chat/3) all route through ALLM.StreamRunner, which calls the adapter's ALLM.StreamAdapter.stream/2 — the streaming path. Streaming calls are not retried, so [:allm, :adapter, :retry] events do NOT fire from any chat-side public call. The retry round-trip is exercised by direct adapter calls (e.g., ALLM.Providers.Fake.generate(req, adapter_opts: [retry_until_call: n])) and by the image façade.

:request_id on retry events

:request_id appears on [:allm, :adapter, :retry] metadata only when the adapter call's opts carry a :request_id (typically threaded from a wrapping Runner / StreamRunner span). Direct adapter calls (e.g., the Fake retry round-trip) emit retry events without :request_id because no wrapping span generated one.

Default policy

See default_policy/0 for the closed map. The policy is materialised via materialize/1 from the engine's :retry field (:default | false | keyword).

Closure contract

The closure passed to run/3 is invoked up to policy.max_attempts times. It must return one of:

  • {:ok, value} — success; loop returns {:ok, value}.
  • {:retry, delay_ms, error} — retryable failure. When delay_ms > 0, that value (plus jitter) is the delay; otherwise the computed exponential backoff (with jitter) is used. The error term is checked against policy.retry_on for membership; a non-matching error returns {:error, error} immediately.
  • {:error, error} — non-retryable failure; loop returns {:error, error} immediately, no telemetry, no sleep.

Closure-raised exceptions propagate to the caller of run/3 unchanged (no rescue, no telemetry — exceptions are not retryable).
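As a rough illustration of the three return shapes, the sketch below classifies a response-like map. The `ClosureSketch` module and the `:status`, `:body`, and `:retry_after_ms` keys are assumptions made for this example; real adapters will have their own response types.

```elixir
defmodule ClosureSketch do
  # Hypothetical helper: maps a response-like map onto the closure contract.

  # 2xx responses are successes.
  def classify(%{status: status} = resp) when status in 200..299 do
    {:ok, Map.get(resp, :body)}
  end

  # Statuses from the documented default retry_on are reported as retryable.
  def classify(%{status: status} = resp) when status in [429, 500, 502, 503, 504] do
    # A positive delay asks the helper to use it; 0 defers to the computed backoff.
    {:retry, Map.get(resp, :retry_after_ms, 0), status}
  end

  # Everything else is a non-retryable failure.
  def classify(%{status: status}), do: {:error, status}
end

IO.inspect(ClosureSketch.classify(%{status: 429, retry_after_ms: 1500}))
IO.inspect(ClosureSketch.classify(%{status: 400}))
```

A per-attempt closure would perform the request and pass the response through a classifier of this kind before returning to run/3.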

Telemetry

[:allm, :adapter, :retry] is emitted per retry attempt, before sleeping, with measurements %{system_time: System.system_time()} and metadata Map.merge(telemetry_metadata, %{attempt: attempt, delay_ms: actual_delay, reason: error}). The final attempt (when attempt == max_attempts) emits no retry event; the surrounding [:allm, :adapter, :stop] (or :exception) span fires instead.
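A handler for this event might look like the sketch below. `RetryLogSketch` is a hypothetical module; it uses the 4-arity handler shape that `:telemetry.attach/4` expects, but the example calls the function directly so that it runs without the `:telemetry` dependency.

```elixir
defmodule RetryLogSketch do
  # Hypothetical handler: formats one retry event as a log line.
  def handle_event([:allm, :adapter, :retry], _measurements, metadata, _config) do
    # :request_id is optional, so read it with Map.get/3 rather than matching on it.
    request_id = Map.get(metadata, :request_id, "n/a")

    "retry attempt=#{metadata.attempt} delay_ms=#{metadata.delay_ms} " <>
      "reason=#{inspect(metadata.reason)} request_id=#{request_id}"
  end
end

# Metadata as the helper is documented to build it: caller metadata merged
# with the per-attempt fields.
caller_metadata = %{request_id: "req-123"}
metadata = Map.merge(caller_metadata, %{attempt: 1, delay_ms: 500, reason: 429})
measurements = %{system_time: System.system_time()}

line = RetryLogSketch.handle_event([:allm, :adapter, :retry], measurements, metadata, nil)
IO.puts(line)
# prints: retry attempt=1 delay_ms=500 reason=429 request_id=req-123
```

In an application the function would typically be registered with `:telemetry.attach/4` under the `[:allm, :adapter, :retry]` event name instead of being called by hand.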

Summary

Types

closure_result(ok)

Closure return: success, retry-with-delay, or non-retryable error.

engine_retry()

Engine-side retry shapes accepted by materialize/1.

policy()

Materialised retry policy after merging :default | false | keyword().

Functions

default_policy()

Return the default retry policy.

error_matches?(error, retry_on)

Return true if error is a member of retry_on.

materialize(opts)

Materialise an engine :retry field into a policy or :no_retry.

run(policy_or_retry, telemetry_metadata, fun)

Run fun under the given retry policy.

Types

closure_result(ok)

@type closure_result(ok) ::
  {:ok, ok} | {:retry, non_neg_integer(), term()} | {:error, term()}

Closure return: success, retry-with-delay, or non-retryable error.

engine_retry()

@type engine_retry() :: :default | false | keyword()

Engine-side retry shapes accepted by materialize/1.

policy()

@type policy() :: %{
  max_attempts: non_neg_integer(),
  base_delay_ms: pos_integer(),
  max_delay_ms: pos_integer(),
  retry_on: [pos_integer() | atom()],
  jitter_ms: non_neg_integer(),
  respect_retry_after: boolean()
}

Materialised retry policy after merging :default | false | keyword().

Functions

default_policy()

@spec default_policy() :: policy()

Return the default retry policy.

Fields:

  • :max_attempts = 3
  • :base_delay_ms = 500
  • :max_delay_ms = 30_000
  • :retry_on = [429, 500, 502, 503, 504, :timeout]
  • :jitter_ms = 250
  • :respect_retry_after = true

Examples

iex> p = ALLM.Retry.default_policy()
iex> p.max_attempts
3
iex> p.retry_on
[429, 500, 502, 503, 504, :timeout]

error_matches?(error, retry_on)

@spec error_matches?(term(), [pos_integer() | atom()]) :: boolean()

Return true if error is a member of retry_on.

Integers match HTTP status codes (429 ∈ [429, 500, ...]); atoms match error atoms (:timeout). For opaque error structs carrying a :reason field, the reason is extracted before membership check.

Examples

iex> retry_on = [429, 500, :timeout]
iex> ALLM.Retry.error_matches?(429, retry_on)
true
iex> ALLM.Retry.error_matches?(400, retry_on)
false
iex> ALLM.Retry.error_matches?(%{reason: :timeout}, retry_on)
true
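The membership rule can be approximated in a few lines. This is a sketch of the documented behaviour, not the library's source; `MatchSketch` is a hypothetical module name.

```elixir
defmodule MatchSketch do
  # Maps (including structs) that carry a :reason key are unwrapped first.
  def error_matches?(%{reason: reason}, retry_on), do: reason in retry_on

  # Bare integers and atoms are checked for membership directly.
  def error_matches?(error, retry_on), do: error in retry_on
end

retry_on = [429, 500, :timeout]
IO.inspect(MatchSketch.error_matches?(429, retry_on))
IO.inspect(MatchSketch.error_matches?(%{reason: :timeout}, retry_on))
IO.inspect(MatchSketch.error_matches?(%{reason: :bad_request}, retry_on))
```

Clause order matters in this sketch: the `:reason` clause has to come first so that wrapped errors are unwrapped before the plain membership check.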

materialize(opts)

@spec materialize(engine_retry()) :: policy() | :no_retry

Materialise an engine :retry field into a policy or :no_retry.

  • false → :no_retry.
  • :default → default_policy().
  • [] → default_policy().
  • keyword → Map.merge(default_policy(), Map.new(kw)).
  • [max_attempts: 0] → :no_retry (zero attempts is indistinguishable from "no retry").
  • Unknown keys raise ArgumentError (a typo like max_atempts: fails loudly).

Examples

iex> ALLM.Retry.materialize(:default).max_attempts
3

iex> ALLM.Retry.materialize(false)
:no_retry

iex> ALLM.Retry.materialize(max_attempts: 5).max_attempts
5

iex> ALLM.Retry.materialize(max_attempts: 0)
:no_retry
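The rules listed above could be implemented roughly as follows. `MaterializeSketch` is a hypothetical stand-in that hard-codes the documented defaults, and the error message text is an assumption.

```elixir
defmodule MaterializeSketch do
  # Documented defaults, copied here so the sketch is self-contained.
  @default %{
    max_attempts: 3,
    base_delay_ms: 500,
    max_delay_ms: 30_000,
    retry_on: [429, 500, 502, 503, 504, :timeout],
    jitter_ms: 250,
    respect_retry_after: true
  }

  def materialize(false), do: :no_retry
  def materialize(:default), do: @default

  def materialize(opts) when is_list(opts) do
    # Reject keys that are not part of the policy, so typos fail loudly.
    unknown = Keyword.keys(opts) -- Map.keys(@default)

    if unknown != [] do
      raise ArgumentError, "unknown retry option(s): #{inspect(unknown)}"
    end

    # Zero attempts collapses to :no_retry; anything else is a full policy.
    case Map.merge(@default, Map.new(opts)) do
      %{max_attempts: 0} -> :no_retry
      policy -> policy
    end
  end
end

IO.inspect(MaterializeSketch.materialize(max_attempts: 5).max_attempts)
IO.inspect(MaterializeSketch.materialize(max_attempts: 0))
```

An empty keyword list passes the unknown-key check and merges nothing, so it yields the defaults unchanged.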

run(policy_or_retry, telemetry_metadata, fun)

@spec run(policy() | :no_retry | engine_retry(), map(), (-> closure_result(ok))) ::
  {:ok, ok} | {:error, term()}
when ok: var

Run fun under the given retry policy.

Accepts a materialised policy, :no_retry, or any of the engine shapes (:default | false | keyword); the engine shape is materialised internally via materialize/1.

Behaviour:

  • :no_retry: fun is invoked once. {:ok, _} is returned verbatim; {:error, error} and {:retry, _, error} collapse to {:error, error} so the caller doesn't have to handle the third shape.

  • policy: fun is invoked up to policy.max_attempts times. telemetry_metadata is shallow-merged with %{attempt: attempt, delay_ms: actual_delay, reason: error} per attempt and emitted under [:allm, :adapter, :retry] before sleeping. The final attempt (when attempt == max_attempts) emits no retry event because the caller's surrounding [:allm, :adapter, :stop] / :exception span fires instead.

    actual_delay = max(closure_delay_ms, computed_backoff) where computed_backoff = min(max_delay_ms, base_delay_ms * (2 ** (attempt - 1))) + jitter. When respect_retry_after: true AND closure_delay_ms > 0, actual_delay = closure_delay_ms + jitter (the closure-supplied Retry-After value wins).

Jitter bounds: jitter is additive in [0, jitter_ms] inclusive, never subtractive. Implementation: :rand.uniform(jitter_ms + 1) - 1. :rand.uniform/1 returns 1..N inclusive on OTP 27, and :rand.uniform(1) - 1 == 0 when jitter_ms == 0 (no jitter, deterministic delay).
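The delay formulas and jitter bounds above can be worked through with a small sketch. `BackoffSketch` is a hypothetical module that follows the stated formulas literally; it is not the helper's actual code, and the example sets jitter_ms to 0 so the results are deterministic.

```elixir
defmodule BackoffSketch do
  # Additive jitter in 0..jitter_ms inclusive; always 0 when jitter_ms == 0.
  def jitter(jitter_ms), do: :rand.uniform(jitter_ms + 1) - 1

  def delay(policy, attempt, closure_delay_ms) do
    j = jitter(policy.jitter_ms)

    # Exponential backoff capped at max_delay_ms, then jitter added on top.
    backoff =
      min(policy.max_delay_ms, policy.base_delay_ms * Integer.pow(2, attempt - 1)) + j

    if policy.respect_retry_after and closure_delay_ms > 0 do
      # The closure-supplied Retry-After value wins.
      closure_delay_ms + j
    else
      max(closure_delay_ms, backoff)
    end
  end
end

policy = %{base_delay_ms: 500, max_delay_ms: 30_000, jitter_ms: 0, respect_retry_after: true}

IO.inspect(BackoffSketch.delay(policy, 1, 0))     # 500
IO.inspect(BackoffSketch.delay(policy, 3, 0))     # 2000
IO.inspect(BackoffSketch.delay(policy, 10, 0))    # 30000 (capped)
IO.inspect(BackoffSketch.delay(policy, 1, 1500))  # 1500 (Retry-After wins)
```

With the default jitter_ms of 250, each of these values would be raised by a random amount between 0 and 250 ms.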

Closure-raised exceptions propagate to the caller unchanged — no rescue, no telemetry, no retry. Exceptions are not retryable.

Examples

iex> {:ok, v} = ALLM.Retry.run(:no_retry, %{}, fn -> {:ok, 42} end)
iex> v
42

iex> ALLM.Retry.run(:no_retry, %{}, fn -> {:retry, 0, :transient} end)
{:error, :transient}

iex> {:ok, _pid} = Agent.start(fn -> 0 end, name: :doctest_retry_counter)
iex> {:ok, v} =
...>   ALLM.Retry.run(
...>     [base_delay_ms: 1, jitter_ms: 0],
...>     %{},
...>     fn ->
...>       n = Agent.get_and_update(:doctest_retry_counter, &{&1 + 1, &1 + 1})
...>       if n < 2, do: {:retry, 0, 429}, else: {:ok, n}
...>     end
...>   )
iex> Agent.stop(:doctest_retry_counter)
iex> v
2
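To make the loop's control flow concrete, here is a deliberately simplified sketch. `LoopSketch` is hypothetical: the `on_retry` callback stands in for the telemetry emission, and the sleep uses the closure's delay directly instead of the backoff and jitter computation, so it should be read as an outline of the documented behaviour rather than the real implementation.

```elixir
defmodule LoopSketch do
  # on_retry receives the per-attempt metadata before each sleep.
  def run(policy, fun, on_retry \\ fn _meta -> :ok end) do
    attempt(policy, fun, on_retry, 1)
  end

  defp attempt(policy, fun, on_retry, n) do
    case fun.() do
      {:ok, value} ->
        {:ok, value}

      {:error, error} ->
        # Non-retryable: no event, no sleep.
        {:error, error}

      {:retry, delay_ms, error} ->
        if n < policy.max_attempts and error in policy.retry_on do
          on_retry.(%{attempt: n, delay_ms: delay_ms, reason: error})
          Process.sleep(delay_ms)
          attempt(policy, fun, on_retry, n + 1)
        else
          # Final attempt or non-matching error: give up without an event.
          {:error, error}
        end
    end
  end
end

policy = %{max_attempts: 3, retry_on: [429]}

# Succeeds on the third call; two retry events are sent to the current process.
Process.put(:calls, 0)

result =
  LoopSketch.run(
    policy,
    fn ->
      n = Process.get(:calls) + 1
      Process.put(:calls, n)
      if n < 3, do: {:retry, 0, 429}, else: {:ok, n}
    end,
    fn meta -> send(self(), {:retry_event, meta}) end
  )

IO.inspect(result)
# Always-retryable failure exhausts max_attempts.
IO.inspect(LoopSketch.run(policy, fn -> {:retry, 0, 429} end))
# An error outside retry_on returns immediately.
IO.inspect(LoopSketch.run(policy, fn -> {:retry, 0, 400} end))
```

The three calls return `{:ok, 3}`, `{:error, 429}`, and `{:error, 400}` respectively, and only the first produces retry events (for attempts 1 and 2).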