Aerospike.RetryPolicy (Aerospike Driver v0.3.1)

Retry configuration and error classification for the command path.

The retry driver consumes a t/0 value per command and decides, based on the classification helpers below, whether to re-dispatch an attempt against the next replica (on a rebalance-class error), re-dispatch against a fresh pool worker (on a transport-class error), or return the error verbatim (on anything else).
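The three-way decision described above can be sketched as a small loop. This is only an illustration under stated assumptions, not the driver's implementation: `RetryLoopSketch`, `attempt_fun`, and `bucket_fun` are hypothetical names, outcomes are assumed to be `{:ok, _}` or `{:error, code}` tuples, and the sketch collapses "next replica" and "fresh pool worker" into a single re-dispatch.

```elixir
defmodule RetryLoopSketch do
  # Hypothetical retry loop; not part of the Aerospike driver.
  # `attempt_fun` receives the zero-based attempt number and returns an outcome.
  # `bucket_fun` maps an outcome to one of the documented buckets.
  def run(attempt_fun, bucket_fun, policy, attempt \\ 0) do
    %{max_retries: max_retries, sleep_between_retries_ms: sleep_ms} = policy

    result = attempt_fun.(attempt)
    retryable? = bucket_fun.(result) in [:rebalance, :transport]

    if retryable? and attempt < max_retries do
      # Retryable outcome and budget remains: wait the fixed delay, then retry.
      Process.sleep(sleep_ms)
      run(attempt_fun, bucket_fun, policy, attempt + 1)
    else
      # Success, non-retryable error, or retry budget exhausted.
      result
    end
  end
end

policy = %{max_retries: 2, sleep_between_retries_ms: 0, replica_policy: :sequence}

bucket_fun = fn
  {:ok, _} -> :ok
  {:error, :timeout} -> :transport
  {:error, _} -> :server_fatal
end

# Fails on attempts 0 and 1, succeeds on attempt 2.
flaky = fn
  n when n < 2 -> {:error, :timeout}
  n -> {:ok, n}
end

RetryLoopSketch.run(flaky, bucket_fun, policy)
# => {:ok, 2}
```

With `max_retries: 1` the same `flaky` function would exhaust its budget and the last `{:error, :timeout}` would be returned unchanged.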

Writer discipline

The retry policy is cluster-scoped, not per-node, and is established once at Aerospike.start_link/1 time. The Tender computes the effective policy and publishes it to the :meta ETS table under the key :retry_opts; the command path reads it lock-free via load/1. Runtime mutation still stays behind the single-writer boundary that governs every other published :meta entry.

Per-command overrides (:max_retries, :sleep_between_retries_ms, :replica_policy) may be passed through Aerospike.get/3's opts and are merged on top of the cluster default by merge/2. Other keys in the same opts list, such as :timeout, are ignored by merge/2.

Classification

One canonical classifier drives the retry loop and the pool-side failure accounting. It returns:

  • :bucket — one of :ok, :rebalance, :transport, :routing_refusal, :server_fatal
  • :retry_classification — the retry telemetry label or nil
  • :close_connection? — whether the current worker should be discarded after the outcome
  • :node_failure? — whether the outcome should increment the node's :failed counter

The buckets stay disjoint:

  • rebalance — the server replied with a result code that says "this partition is not mine right now" (currently :partition_unavailable). The retry driver re-picks on a different replica and asynchronously asks the Tender for a fresh partition map.

  • transport — the command did not reach a server that answered cleanly: :network_error, :timeout, :connection_error (socket), :pool_timeout, :invalid_node (pool checkout), and :circuit_open (circuit-breaker refusal). These are not ownership signals; the retry driver re-dispatches without asking for a map refresh.

  • routing_refusal — the router refused to select a replica (:cluster_not_ready, :no_master). The driver returns the atom verbatim; no retry.

  • server_fatal — everything else: server logical errors (:key_not_found, :generation_error, …) and client-local fatal errors like :parse_error. The driver returns these verbatim.
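The bucket rules above can be written down as a small lookup. The following is a sketch, not the library's classifier: `BucketSketch` is a hypothetical module, the code lists are copied from the descriptions above, and outcomes are assumed to arrive as `{:ok, value}` or `{:error, code}` with a bare atom code. The real classifier also returns the other three classification fields.

```elixir
defmodule BucketSketch do
  # Hypothetical stand-in for the bucket part of the classifier.
  @rebalance [:partition_unavailable]
  @transport [
    :network_error,
    :timeout,
    :connection_error,
    :pool_timeout,
    :invalid_node,
    :circuit_open
  ]
  @routing_refusal [:cluster_not_ready, :no_master]

  def bucket({:ok, _value}), do: :ok
  def bucket({:error, code}) when code in @rebalance, do: :rebalance
  def bucket({:error, code}) when code in @transport, do: :transport
  def bucket({:error, code}) when code in @routing_refusal, do: :routing_refusal
  # Everything else, including server logical errors and :parse_error.
  def bucket({:error, _code}), do: :server_fatal
end

BucketSketch.bucket({:error, :partition_unavailable})
# => :rebalance
BucketSketch.bucket({:error, :key_not_found})
# => :server_fatal
```

Because each clause matches a separate code list and the catch-all comes last, every error lands in exactly one bucket.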

Types

bucket()

@type bucket() :: :ok | :rebalance | :transport | :routing_refusal | :server_fatal

High-level outcome bucket used by retry and pool-failure logic.

Buckets are intentionally disjoint so the retry driver can decide whether to retry, refresh cluster state, close a socket, or return the error as-is.

classification()

@type classification() :: %{
  bucket: bucket(),
  retry_classification: retry_classification(),
  close_connection?: boolean(),
  node_failure?: boolean()
}

Complete retry classification for one command outcome.

option()

@type option() ::
  {:max_retries, non_neg_integer()}
  | {:sleep_between_retries_ms, non_neg_integer()}
  | {:replica_policy, replica_policy()}
  | {atom(), term()}

Retry option accepted by from_opts/1 and merge/2.

  • :max_retries — retries after the initial attempt. 0 disables retry.
  • :sleep_between_retries_ms — fixed delay between attempts.
  • :replica_policy — :master or :sequence.
  • any other atom key — accepted and ignored.

Unknown keys are ignored so retry options can be merged from broader command/startup option lists.

options()

@type options() :: [option()]

Keyword list of retry options.

replica_policy()

@type replica_policy() :: :master | :sequence

Replica selection policy used when retrying read commands.

retry_classification()

@type retry_classification() :: :rebalance | :transport | :circuit_open | nil

Telemetry retry label derived from a command outcome.

nil means the outcome is not retryable.

t()

@type t() :: %{
  max_retries: non_neg_integer(),
  sleep_between_retries_ms: non_neg_integer(),
  replica_policy: replica_policy()
}

Effective retry policy for one command.

  • :max_retries — number of retries after the initial attempt (so a :max_retries of 2 means up to 3 attempts total). Must be a non-negative integer. 0 disables retry entirely.
  • :sleep_between_retries_ms — fixed delay between attempts; no jitter or exponential backoff.
  • :replica_policy — :master dispatches every attempt against the master replica (transport failures retry the same node); :sequence walks the replica list via rem(attempt, length(replicas)) on each retry.
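The two replica policies and the attempt arithmetic can be sketched as follows. `ReplicaPickSketch` is a hypothetical helper, not part of the driver. It assumes a zero-based attempt counter and a non-empty replica list with the master first.

```elixir
defmodule ReplicaPickSketch do
  # Hypothetical helper; assumes `replicas` is non-empty with the master first.

  # :master always targets the head of the list, whatever the attempt number.
  def pick(:master, [master | _rest], _attempt), do: master

  # :sequence wraps around the list using rem/2, as described above.
  def pick(:sequence, replicas, attempt) when replicas != [] do
    Enum.at(replicas, rem(attempt, length(replicas)))
  end

  # :max_retries counts retries after the initial attempt.
  def total_attempts(%{max_retries: max_retries}), do: max_retries + 1
end

replicas = [:node_a, :node_b, :node_c]

Enum.map(0..3, &ReplicaPickSketch.pick(:sequence, replicas, &1))
# => [:node_a, :node_b, :node_c, :node_a]

ReplicaPickSketch.total_attempts(%{max_retries: 2})
# => 3
```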

Functions

classify(err)

@spec classify(term()) :: classification()

Classifies one command outcome into retry buckets and the metadata the retry and pool layers consume.

defaults()

@spec defaults() :: t()

Returns the default retry policy. Used by the Tender at init.

from_opts(opts)

@spec from_opts(options()) :: t()

Builds an effective retry policy by overlaying the keyword opts on top of defaults/0.

Intended for the Tender's init path: validate the caller's start opts once and store the resulting map in :meta. Unknown keys are ignored so the retry policy can live alongside future policy knobs without a config migration.

load(meta_tab)

@spec load(atom()) :: t()

Reads the cluster-default retry policy from the :meta ETS table.

Falls back to defaults/0 when the slot is absent so readers never crash against a Tender that was started without the retry plumbing (a cluster-state-only test harness, for example, that skips the retry-opts init).
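A minimal sketch of the read-with-fallback pattern, assuming the policy is stored as a `{:retry_opts, policy}` row. `MetaSketch`, the table name, and the default values are placeholders chosen for illustration. The library's real defaults are not stated in this document.

```elixir
defmodule MetaSketch do
  # Hypothetical stand-in for the publish/read pair; not the library's code.
  @key :retry_opts

  # Placeholder values only.
  def defaults do
    %{max_retries: 2, sleep_between_retries_ms: 0, replica_policy: :sequence}
  end

  # :ets.insert/2 returns true.
  def put(tab, policy), do: :ets.insert(tab, {@key, policy})

  def load(tab) do
    case :ets.lookup(tab, @key) do
      [{@key, policy}] -> policy
      # Slot absent: fall back instead of crashing the reader.
      [] -> defaults()
    end
  end
end

tab = :ets.new(:meta_sketch, [:set, :public, :named_table, read_concurrency: true])

# Nothing published yet, so this returns the defaults.
MetaSketch.load(tab)

MetaSketch.put(tab, %{MetaSketch.defaults() | max_retries: 5})
MetaSketch.load(tab).max_retries
# => 5
```

Readers only call `:ets.lookup/2`, so they never block on the writer. This is the property the lock-free read path relies on.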

merge(base, opts)

@spec merge(t(), options()) :: t()

Overlays per-command opts on top of base. Only the three retry fields are recognised; other keys are ignored.
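One way to get this overlay behaviour is `Keyword.take/2` followed by `Map.merge/2`. The sketch below is an assumption about the approach, not the library's source, and `MergeSketch` is a hypothetical module name.

```elixir
defmodule MergeSketch do
  # Hypothetical overlay helper; only the three retry fields pass through.
  @fields [:max_retries, :sleep_between_retries_ms, :replica_policy]

  def merge(base, opts) do
    overrides = opts |> Keyword.take(@fields) |> Map.new()
    Map.merge(base, overrides)
  end
end

base = %{max_retries: 2, sleep_between_retries_ms: 0, replica_policy: :sequence}

# :timeout is not a retry field, so it is dropped.
MergeSketch.merge(base, timeout: 500, max_retries: 0)
# => %{max_retries: 0, sleep_between_retries_ms: 0, replica_policy: :sequence}
```

Filtering before merging lets callers pass a broader command option list without stray keys ending up in the policy map.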

node_failure?(term)

@spec node_failure?(term()) :: boolean()

Returns true when term should increment the node's :failed counter.

put(meta_tab, policy)

@spec put(atom(), t()) :: true

Writes policy to meta_tab under the ETS key used by load/1.

Runtime publication flows through the cluster-state writer; table creation also uses this helper once to seed the default row before the tend-cycle worker starts.

rebalance?(term)

@spec rebalance?(term()) :: boolean()

Returns true when term is an error the retry driver should treat as a cluster-rebalance signal. Accepts either a bare %Aerospike.Error{} or the {:error, _} tuple form the command path produces; delegates to the canonical classifier above.
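Accepting both shapes can be done by unwrapping the tuple and delegating to the struct clause. The sketch below is illustrative only: `ErrorSketch` stands in for %Aerospike.Error{}, and its `code` field name is an assumption, as is the `PredicateSketch` module.

```elixir
defmodule ErrorSketch do
  # Stand-in for the driver's error struct; the field name is assumed.
  defstruct [:code]
end

defmodule PredicateSketch do
  # Hypothetical predicate; the code list comes from the rebalance bucket.
  @rebalance [:partition_unavailable]

  # Tuple form: unwrap and re-dispatch on the inner term.
  def rebalance?({:error, err}), do: rebalance?(err)
  # Bare struct form.
  def rebalance?(%ErrorSketch{code: code}), do: code in @rebalance
  # Anything else, including success values, is not a rebalance signal.
  def rebalance?(_other), do: false
end

err = %ErrorSketch{code: :partition_unavailable}

PredicateSketch.rebalance?(err)
# => true
PredicateSketch.rebalance?({:error, err})
# => true
PredicateSketch.rebalance?({:ok, :value})
# => false
```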

retry_classification(term)

@spec retry_classification(term()) :: retry_classification()

Returns the retry telemetry label for term, or nil when the outcome is fatal / non-retryable.

transport?(term)

@spec transport?(term()) :: boolean()

Returns true when term is an error the retry driver should treat as a transport-class failure: it re-dispatches without requesting a partition-map refresh, and the only re-routing is the replica walk.

Examples of transport-class codes: :network_error, :timeout, :connection_error, :pool_timeout, :invalid_node, :circuit_open.