Idempotency And Safety

Copy Markdown View Source

Every Jidoka operation declares one idempotency policy. That single field drives whether the runtime can retry, whether resume can replay, whether the spec compiles without an operation control, and whether incomplete work surfaces to application reconciliation. This guide documents each policy in detail, the Jidoka.Effect.Journal semantics on resume, and the production guardrails that depend on getting this right.

When To Use This

  • Use this guide when an agent is about to perform an external action you cannot blindly retry (charges, sends, deletes, deploys).
  • Use this guide when you add a new operation source and need to pick the right idempotency: value.
  • Use this guide when an incomplete Jidoka.resume/2 should not blow up the agent loop but instead route to a reconciliation worker.

Prerequisites

mix deps.get
mix test

Quick Example

Declare a single risky operation. The spec refuses to compile until a matching operation control is attached.

defmodule MyApp.SupportAgent do
  use Jidoka.Agent

  agent :support_agent do
    instructions "Refund only when explicitly approved."
  end

  tools do
    action MyApp.RefundOrder, idempotency: :unsafe_once
  end

  controls do
    operation MyApp.RequireRefundApproval,
      when: [name: :refund_order, idempotency: :unsafe_once]
  end
end

Compiling the plan now succeeds:

{:ok, _plan} = Jidoka.plan(MyApp.SupportAgent)

Remove the control and the plan refuses to compile with {:error, {:unsafe_once_requires_control, "refund_order", :action}}.

Concepts

Idempotency policy is the contract between the agent definition, the runtime, and the journal. The runtime never assumes; it always asks.

╭───────────────────────╮     ╭──────────────────────╮
│ Operation.idempotency │────▶│ Spec validation      │
╰───────────────────────╯     │ (compile-time gate)  │
                              ╰──────┬───────────────╯
                                     │
                                     ▼
                          ╭──────────────────────╮
                          │ Effect.Intent        │
                          │ idempotency: ...     │
                          │ idempotency_key: ... │
                          ╰──────┬───────────────╯
                                 │
                  ╭──────────────┼──────────────╮
                  ▼              ▼              ▼
            Journal has     Journal has     Journal has
            no intent       intent only     intent + result
                  │              │              │
                  ▼              ▼              ▼
            run capability   per-policy      replay journal
                             decision        result

Per-policy resume rules:

  • :pure and :idempotent retry safely from inputs. Resume will replay the journaled result; missing results are recomputed.
  • :dedupe prefers a recorded journal result. Use it for operations that are expensive but safe to repeat.
  • :reconcile allows the application to inspect incomplete work after resume. The runtime returns the intent and lets a reconciliation worker decide.
  • :unsafe_once forbids automatic retry. Resume returns a typed error when an intent is recorded without a result.

How To

Step 1: Pick A Policy For Each Operation

Use the smallest policy that is still correct.

  • :pure - the operation is a deterministic function of its arguments with no observable side effects. Lookups, transformations, schema validations.
  • :idempotent (default) - calling twice with the same key has the same external outcome. Most external APIs that accept idempotency keys.
  • :dedupe - calling twice may be expensive or noisy, but is otherwise safe. Prefer the journaled result.
  • :reconcile - external work can leave the system in an in-between state (an enqueued job whose status is unknown). The application owns reconciliation.
  • :unsafe_once - calling twice is unsafe. Charges, sends, deletes, one-way deploys.
tools do
  action MyApp.LookupOrder, idempotency: :pure
  action MyApp.ChargeCard, idempotency: :unsafe_once
  action MyApp.EnqueueJob, idempotency: :reconcile
end

Step 2: Add Controls For :unsafe_once

Jidoka.Agent.Spec.Operation.requires_control?/1 returns true for :unsafe_once. The plan compiler refuses to produce a plan without a matching operation control.

controls do
  operation MyApp.RequireChargeApproval,
    when: [name: :charge_card, idempotency: :unsafe_once]
end

The control can allow, block, or interrupt for human review. See Human In The Loop for the durable approval flow.

Step 3: Understand The Journal On Resume

Every effect is recorded as an intent before the capability runs and as a result when the capability returns. On resume:

  • If the journal already has a result for the pending intent, the effect interpreter replays it and never calls the capability.
  • If only the intent is recorded, the per-policy validation runs.
  • If the intent is missing, the runtime asks for the result through the capability.
%Jidoka.Effect.Journal{
  intents: %{"operation:abc" => intent},
  results: %{"operation:abc" => result}
}

Jidoka.Effect.Journal.result_for/2 and Jidoka.Effect.Journal.intent_for/2 are the lookup helpers. Jidoka.Effect.Journal.incomplete_intent?/2 is true when an intent has no recorded result.

Step 4: Handle Reconciliation Paths

For :reconcile operations, the application is expected to observe incomplete intents and resolve them out of band. A common pattern is to enumerate snapshots whose journal has incomplete intents and route them to a reconciler.

def reconcile_pending(snapshot) do
  journal = snapshot.turn_state.journal

  for {_id, intent} <- journal.intents,
      is_nil(Jidoka.Effect.Journal.result_for(journal, intent)),
      intent.idempotency == :reconcile do
    MyApp.Reconciler.queue(intent)
  end
end

After reconciliation completes externally, persist the result into the journal (or rebuild the snapshot through your own session pipeline) and resume.

Step 5: Trust The :unsafe_once Guard On Resume

The runtime never quietly retries an :unsafe_once operation that has an incomplete intent. Resume returns a typed error instead:

case Jidoka.resume(snapshot, llm: llm, operations: operations) do
  {:error, %Jidoka.Error{} = error} ->
    case Jidoka.error_to_map(error) do
      %{reason: :unsafe_once_incomplete_effect, intent_id: id} ->
        MyApp.UnsafeOnceQueue.route(id, snapshot)

      _ ->
        Logger.error("resume failed: " <> inspect(error))
    end

  other ->
    other
end

Approved interrupts are the supported way to re-enter an :unsafe_once intent: the operation control approves the specific interrupt, and the runtime stamps metadata["approved_interrupt_id"] on the effect. The journal then accepts the call exactly once.

Step 6: Distinguish :dedupe From :idempotent

Both policies are safe to retry. The difference is intent:

  • :idempotent says "retry is correct; do not avoid it." Use it for HTTP calls with idempotency keys, database upserts, and most external APIs that already promise safe retries.
  • :dedupe says "retry is correct but wasteful; prefer the journaled result." Use it for cache-fronted lookups or any operation that encountered a cost (LLM call, paid API, expensive aggregation) you do not want to repeat.

In practice this guides resume behavior: :dedupe on resume always prefers the journaled result; :idempotent is happy to call the capability again when the intent has no result.

Common Patterns

  • Default to :idempotent. It is the spec default for a reason.
  • Promote to :unsafe_once for external state changes. Charges, sends, deletes, and deploys deserve the strongest guard.
  • Use :reconcile when truth lives elsewhere. Async work queued to an external system is the canonical case.
  • Pair :unsafe_once with an approval workflow. A blocking control is fine for "never allow"; an interrupting control is needed for "allow once a reviewer says so."
  • Treat the journal as the contract. Tests should make assertions against Effect.Journal.intent_recorded?/2 and Effect.Journal.result_for/2, not on capability call counts alone.

Testing

A simple test exercises both the compile-time gate and the resume-time guard.

test "unsafe_once requires an operation control before plan compiles" do
  spec =
    Jidoka.agent!(
      id: "risky",
      instructions: "Charge only when explicit.",
      operations: [
        %{name: "charge_card", idempotency: :unsafe_once, kind: :action}
      ]
    )

  assert {:error, {:unsafe_once_requires_control, "charge_card", :action}} =
           Jidoka.plan(spec)
end

test "incomplete unsafe_once intent fails on resume" do
  llm = fn _intent, _journal ->
    {:ok, %{type: :operation, name: "charge_card",
            arguments: %{"order_id" => "A1"}}}
  end

  operations = fn _intent, _journal -> raise "boom" end

  assert {:error, _error} =
           Jidoka.turn(MyApp.SupportAgent, "Charge A1",
             llm: llm,
             operations: operations
           )

  # The application persisted the snapshot for a reconciler before
  # surfacing the failure; resuming it never retries the operation.
end

For :dedupe and :reconcile operations, build a journal that already has the desired shape and assert that resume routes correctly.

Troubleshooting

SymptomLikely CauseFix
{:error, {:unsafe_once_requires_control, name, kind}}An :unsafe_once operation has no matching operation control.Add a controls do ... operation ... when: [name: name] end clause.
{:error, %Jidoka.Error{reason: :unsafe_once_incomplete_effect}}Resume saw a recorded intent without a result.Route the snapshot to reconciliation; do not auto-retry.
Capability called twice on resumeOperation is :idempotent and the journal lost the result.Persist the full snapshot including its journal; ensure your store preserves all fields.
Reconciliation never firesThe journal had no incomplete intents because the runtime did call the capability.Confirm the policy is :reconcile, not :idempotent, and inspect result.journal.
Approved interrupt still errors on :unsafe_onceApproval target was the wrong interrupt id.Build the response with Jidoka.Review.Response.approve(review.interrupt_id).
:dedupe operation still runs every timeThe journal across resumes is empty because each call started a new turn.Use a session so the journal persists, or pass the prior snapshot to resume/2.

Reference

Key modules touched in this guide: