Use deterministic tests for agent behavior. Inject a fake LLM and local operations, assert the compiled spec shape, and run small eval cases without calling a provider.

When To Use This

  • Use this guide when adding tests for a new agent, tool, control, or memory contract.
  • Use this guide when setting up regression coverage for the DSL-to-Jidoka.Agent.Spec contract.
  • Use this guide when building a small eval suite for CI.
  • Do not use this guide for live model evaluations or benchmarking; those belong in opt-in suites that explicitly require provider credentials.

Prerequisites

mix deps.get
mix test

Quick Example

A minimal eval pins both capabilities (the LLM and the operations), declares two assertions, and runs the case through the same harness as production.

defmodule MyApp.TimeAgent do
  use Jidoka.Agent

  agent :time_agent do
    model "openai:gpt-4o-mini"
    instructions "Call local_time when asked for the time."
  end

  tools do
    action MyApp.Tools.LocalTime
  end
end

operations =
  Jidoka.Runtime.LocalOperations.operations(%{
    "local_time" => fn _args -> {:ok, %{city: "Chicago", time: "09:30"}} end
  })

llm = fn _intent, journal ->
  case map_size(journal.results) do
    0 -> {:ok, %{type: :operation, name: "local_time", arguments: %{}}}
    _ -> {:ok, %{type: :final, content: "Chicago time is 09:30."}}
  end
end

{:ok, run} =
  Jidoka.Eval.run_case(
    %{
      id: "time_basic",
      agent: MyApp.TimeAgent.spec(),
      input: "What time is it?",
      assertions: %{
        contains: "09:30",
        operation_called: "local_time"
      }
    },
    llm: llm,
    operations: operations
  )

run.status
#=> :passed

The run is reproducible. The same inputs always produce the same Jidoka.Eval.Run, so this example doubles as a regression test.

Concepts

Deterministic testing in Jidoka uses four building blocks.

  1. Fake LLM function. Every LLM capability is a 2-arity function fn intent, journal -> {:ok, decision} | {:error, reason} end. The decision shape is %{type: :operation, name: ..., arguments: ...} or %{type: :final, content: ...}. The journal is the replay trace; counting map_size(journal.results) is the standard way to drive multi-step decisions.
  2. Local operation capability. Jidoka.Runtime.LocalOperations.operations/1 wraps a map of %{name => handler} into a capability. Handlers may be (args -> term) or (intent, journal -> term). The same helper is what Jidoka.Operation.Source.Local uses under the hood.
  3. Golden DSL-to-spec tests. Jidoka.project/1 produces compact, deterministic maps from any Jidoka data. Snapshotting those projections locks the DSL/import contract; changes show up as diffs in the golden file.
  4. Jidoka.Eval. Jidoka.Eval.Case packages an agent + request + assertion set into one value. Jidoka.Eval.run_case/2 runs the case through the normal turn runtime and returns a Jidoka.Eval.Run with status, evaluated assertions, and observations.

╭──────────────────╮     ╭───────────────────╮     ╭──────────────────╮
│  Eval.Case data  │────▶│ Jidoka.Eval       │────▶│ turn runtime     │
│ - agent (spec)   │     │   .run_case/2     │     │   .run_turn/3    │
│ - request/input  │     ╰─────────┬─────────╯     ╰────────┬─────────╯
│ - assertions     │               │                        │
╰──────────────────╯               │                        ▼
                                   │              {:ok, Turn.Result}
                                   │              | {:hibernate, Snap}
                                   │              | {:error, reason}
                                   ▼                        │
                          ╭───────────────────╮             │
                          │ evaluate/2        │◀────────────╯
                          │ - contains        │
                          │ - equals          │
                          │ - operation_called│
                          ╰─────────┬─────────╯
                                    ▼
                          ╭───────────────────╮
                          │ Jidoka.Eval.Run   │
                          │ status:           │
                          │   :passed         │
                          │   :failed         │
                          │   :error          │
                          ╰───────────────────╯
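
To make the interplay between the fake LLM, the local operations, and the journal concrete, here is a toy turn loop in plain Elixir. It is not Jidoka's runtime: ToyTurnLoop, its run function, and the step-keyed journal.results map are assumptions made for illustration. It only shows how a fake that counts journal results ends up driving a two-step turn.

```elixir
defmodule ToyTurnLoop do
  @moduledoc "Illustrative stand-in for a turn runtime; not part of Jidoka."

  # Asks the fake LLM for a decision, runs the named handler when the decision
  # is an operation, records the result in the journal, and repeats until the
  # LLM returns a final decision.
  def run(llm, handlers, intent, journal \\ %{results: %{}}) do
    case llm.(intent, journal) do
      {:ok, %{type: :final, content: content}} ->
        {:ok, %{content: content, operation_calls: called(journal)}}

      {:ok, %{type: :operation, name: name, arguments: args}} ->
        with {:ok, value} <- Map.fetch!(handlers, name).(args) do
          # Assumption: key results by step number so repeated calls to the
          # same operation still grow the map.
          step = map_size(journal.results)
          results = Map.put(journal.results, step, {name, value})
          run(llm, handlers, intent, %{journal | results: results})
        end

      {:error, reason} ->
        {:error, reason}
    end
  end

  # Operation names in the order they were recorded.
  defp called(journal) do
    journal.results
    |> Enum.sort()
    |> Enum.map(fn {_step, {name, _value}} -> name end)
  end
end

llm = fn _intent, journal ->
  case map_size(journal.results) do
    0 -> {:ok, %{type: :operation, name: "local_time", arguments: %{}}}
    _ -> {:ok, %{type: :final, content: "Chicago time is 09:30."}}
  end
end

handlers = %{"local_time" => fn _args -> {:ok, %{city: "Chicago", time: "09:30"}} end}

{:ok, %{operation_calls: calls}} = ToyTurnLoop.run(llm, handlers, "What time is it?")
IO.inspect(calls)
#=> ["local_time"]
```

Because the fake reads only the journal, replaying the same journal always yields the same decision, which is what makes the run deterministic.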

Three Kinds Of Outcome

Jidoka.Eval.Run.status is one of:

  • :passed - the harness returned {:ok, %Turn.Result{}} and every evaluated assertion passed.
  • :failed - the harness returned {:ok, _result} but at least one assertion failed. The :assertions list contains :passed/:failed entries with :expected and :actual.
  • :error - the harness did not produce a result. Three subcases live here:
    • Input validation errors ({:error, %Jidoka.Error.Invalid{}} from request normalization, context schema mismatch, or spec compilation). run.error is the projected error map.
    • Execution errors ({:error, reason} from the operation or LLM capability). run.error carries the same shape.
    • Hibernation outcomes ({:hibernate, snapshot} from an operation control returning {:interrupt, ...}). run.error is %{reason: :hibernated, snapshot: ...}. The eval does not resume automatically; treat hibernation as a non-pass outcome and feed the snapshot into a Jidoka.resume/2 test if you need to drive the rest.
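
The status rules above fit in three function clauses. The sketch below restates them in plain Elixir and is not Jidoka.Eval's source; OutcomeSketch and classify/2 are hypothetical names, and assertion results are simplified to a list of :passed and :failed atoms.

```elixir
defmodule OutcomeSketch do
  @moduledoc "Restates the status rules above; hypothetical, not Jidoka code."

  # A produced result passes only when every assertion passed.
  def classify({:ok, _result}, assertion_statuses) do
    if Enum.all?(assertion_statuses, &(&1 == :passed)), do: :passed, else: :failed
  end

  # Per the list above, hibernation and errors both map to :error.
  def classify({:hibernate, _snapshot}, _assertion_statuses), do: :error
  def classify({:error, _reason}, _assertion_statuses), do: :error
end

IO.inspect(OutcomeSketch.classify({:ok, %{content: "hi"}}, [:passed, :failed]))
#=> :failed
```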

How To

Step 1: Author A Fake LLM

The simplest fake returns one decision regardless of journal:

llm = fn _intent, _journal ->
  {:ok, %{type: :final, content: "pong"}}
end

For multi-step tests, branch on map_size(journal.results):

llm = fn _intent, journal ->
  case map_size(journal.results) do
    0 -> {:ok, %{type: :operation, name: "local_time", arguments: %{}}}
    1 -> {:ok, %{type: :final, content: "09:30"}}
  end
end

You can also branch on intent metadata or the journal contents when you need to assert specific tool arguments came back. The fake is just a function; complexity lives in your test, not in a mock framework.
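
As an illustration of branching on journal contents rather than only on the count, the fake below reads the recorded results before answering. It assumes each value in journal.results is the map the handler returned (here with a :time key); that value shape is an assumption for this sketch, so adjust the pattern to whatever your journal actually stores. JournalFake is a hypothetical module name.

```elixir
defmodule JournalFake do
  @moduledoc "Hypothetical fake LLM that builds its answer from the journal."

  def decide(_intent, journal) do
    # Assumption: journal.results values are the raw handler return maps.
    times = for %{time: time} <- Map.values(journal.results), do: time

    case times do
      [] -> {:ok, %{type: :operation, name: "local_time", arguments: %{}}}
      [time | _] -> {:ok, %{type: :final, content: "The time is #{time}."}}
    end
  end
end

# A captured named function is still a 2-arity fake.
llm = &JournalFake.decide/2

# Exercise the fake directly with hand-built journals; no runtime required.
{:ok, first} = llm.("What time is it?", %{results: %{}})
{:ok, second} = llm.("What time is it?", %{results: %{"local_time" => %{time: "09:30"}}})

IO.inspect(first.type)
#=> :operation
IO.inspect(second.content)
#=> "The time is 09:30."
```

Calling the fake by hand like this is also a cheap way to unit test the fake itself before wiring it into a case.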

Step 2: Provide Local Operations

Jidoka.Runtime.LocalOperations.operations/1 is the helper for local operation tests:

operations =
  Jidoka.Runtime.LocalOperations.operations(%{
    "local_time" => fn _args -> {:ok, %{time: "09:30"}} end,
    "echo" => fn %{"phrase" => phrase} -> {:ok, %{echoed: phrase}} end
  })

Handlers can be (args -> term) or (intent, journal -> term). A return value that is not {:ok, _} or {:error, _} is wrapped in {:ok, value}.

Pass the resulting capability to turn/3 (or to Jidoka.Eval.run_case/2) as operations:. The runtime routes any intent with kind: :operation through this capability.
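
The two handler shapes and the wrapping rule can be pictured with a small dispatcher. This is a sketch of the behavior described above, not the source of Jidoka.Runtime.LocalOperations; HandlerSketch, call/4, and normalize/1 are hypothetical names.

```elixir
defmodule HandlerSketch do
  @moduledoc "Illustrates handler arity and return wrapping; not Jidoka code."

  # 1-arity handlers receive only the operation arguments.
  def call(handler, args, _intent, _journal) when is_function(handler, 1) do
    normalize(handler.(args))
  end

  # 2-arity handlers receive the intent and the journal instead.
  def call(handler, _args, intent, journal) when is_function(handler, 2) do
    normalize(handler.(intent, journal))
  end

  # Tagged tuples pass through; any other term is wrapped in {:ok, value}.
  def normalize({:ok, _value} = ok), do: ok
  def normalize({:error, _reason} = error), do: error
  def normalize(value), do: {:ok, value}
end

bare = fn _args -> %{time: "09:30"} end
IO.inspect(HandlerSketch.call(bare, %{}, %{}, %{results: %{}}))
#=> {:ok, %{time: "09:30"}}
```

In practice this means a handler can return a bare map in simple tests and switch to {:error, reason} only when the test needs to exercise a failure path.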

Step 3: Write A Golden DSL-To-Spec Test

The DSL is data-first; the most effective regression test compares the projected spec to a snapshot.

defmodule MyApp.Golden.TimeAgentTest do
  use ExUnit.Case, async: true

  test "compiled spec matches the golden projection" do
    projection =
      MyApp.TimeAgent.spec()
      |> Jidoka.project()
      |> drop_volatile_fields()

    expected = %{
      id: "time_agent",
      operations: [
        %{name: "local_time", idempotency: :idempotent}
      ]
    }

    # Comparing with == lets ExUnit print a diff of both maps on failure.
    assert projection == expected
  end

  defp drop_volatile_fields(%{} = projection) do
    Map.update!(projection, :operations, fn operations ->
      Enum.map(operations, &Map.take(&1, [:name, :idempotency]))
    end)
    |> Map.take([:id, :operations])
  end
end

In the Jidoka repository, test/jidoka/golden/dsl_to_spec_test.exs asserts the full projection against a recorded snapshot.

Step 4: Use Jidoka.Eval.run_case For Behavior Tests

Jidoka.Eval.run_case/2 accepts a Jidoka.Eval.Case struct, a map, or a keyword list. Three assertion kinds are supported today:

  • contains: "substring" (or a list of substrings) - asserts result.content contains each.
  • equals: "exact content" - asserts result.content equals the value.
  • operation_called: "name" (or a list) - asserts each name appears in result.agent_state.operation_results.

{:ok, run} =
  Jidoka.Eval.run_case(
    %{
      id: "time_basic",
      agent: MyApp.TimeAgent.spec(),
      input: "What time is it?",
      assertions: %{
        contains: ["09:30", "Chicago"],
        operation_called: ["local_time"]
      }
    },
    llm: llm,
    operations: operations
  )

run.status
#=> :passed

run.observations
#=> %{content: "Chicago time is 09:30.", operation_calls: ["local_time"], ...}

The Run struct also carries result (the full Turn.Result), assertions (with :expected and :actual), and metadata so test output can stay close to the source data.
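
To show what the three assertion kinds check, the sketch below evaluates them against a plain map shaped like the run.observations example above. It is an approximation for illustration, not Jidoka.Eval's evaluator; AssertionSketch, evaluate/2, and the entry keys other than :expected and :actual are hypothetical.

```elixir
defmodule AssertionSketch do
  @moduledoc "Mimics the three assertion kinds on plain maps; not Jidoka code."

  # Returns one entry per assertion with :expected, :actual, and a status.
  def evaluate(assertions, %{content: content, operation_calls: calls}) do
    Enum.map(assertions, fn
      {:contains, expected} ->
        passed? = Enum.all?(List.wrap(expected), &String.contains?(content, &1))
        entry(:contains, expected, content, passed?)

      {:equals, expected} ->
        entry(:equals, expected, content, content == expected)

      {:operation_called, expected} ->
        passed? = Enum.all?(List.wrap(expected), &(&1 in calls))
        entry(:operation_called, expected, calls, passed?)
    end)
  end

  defp entry(kind, expected, actual, passed?) do
    status = if passed?, do: :passed, else: :failed
    %{kind: kind, expected: expected, actual: actual, status: status}
  end
end

observations = %{content: "Chicago time is 09:30.", operation_calls: ["local_time"]}

[result] = AssertionSketch.evaluate(%{equals: "09:30"}, observations)
IO.inspect(result.status)
#=> :failed
```

The failing equals entry keeps both :expected and :actual, which is the information you want in test output when a case regresses.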

Step 5: Distinguish Outcome Kinds

When a test fails, look at run.status and run.error first:

case Jidoka.Eval.run_case(case_input, llm: llm, operations: operations) do
  {:ok, %Jidoka.Eval.Run{status: :passed} = run} -> {:ok, run}
  {:ok, %Jidoka.Eval.Run{status: :failed, assertions: as}} -> {:failed, as}
  {:ok, %Jidoka.Eval.Run{status: :error, error: %{reason: :hibernated} = e}} ->
    {:hibernated, e.snapshot}
  {:ok, %Jidoka.Eval.Run{status: :error, error: e}} -> {:execution_error, e}
  {:error, reason} -> {:case_validation_error, reason}
end

{:error, reason} from run_case/2 itself is the case validation path - the input could not be normalized into a Jidoka.Eval.Case. The three statuses inside the run cover the runtime outcomes.

Step 6: Build A Small Eval Suite

Eval cases are plain data, so they compose well into a regular ExUnit suite. Iterate the case list, attach the agent spec, and assert on run.status. Jidoka.Eval is not a replacement for ExUnit, just a packaging convenience for the agent/request/assertions trio.
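
One way to lay this out is to keep the cases as a list of maps and push them through a runner function. In the sketch below, SuiteSketch, failures/3, and the stub runner are hypothetical; in a real suite the runner would call Jidoka.Eval.run_case/2 with your llm: and operations: options, and the final check would be an ExUnit assertion that the failure list is empty.

```elixir
defmodule SuiteSketch do
  @moduledoc "Hypothetical helper for iterating eval cases; not Jidoka code."

  # Attaches the agent spec to each case, runs it, and returns {id, status}
  # pairs for every case that did not pass. A runner that returns
  # {:error, reason} (case validation) raises a MatchError on purpose.
  def failures(cases, agent_spec, runner) do
    cases
    |> Enum.map(fn eval_case ->
      {:ok, run} = runner.(Map.put(eval_case, :agent, agent_spec))
      {eval_case.id, run.status}
    end)
    |> Enum.reject(fn {_id, status} -> status == :passed end)
  end
end

cases = [
  %{id: "time_basic", input: "What time is it?", assertions: %{contains: "09:30"}},
  %{id: "time_city", input: "Time in Chicago?", assertions: %{contains: "Chicago"}}
]

# Stub standing in for the real harness call: it answers every case with the
# same content and checks only the :contains assertion.
stub_runner = fn eval_case ->
  content = "Chicago time is 09:30."
  passed? = String.contains?(content, eval_case.assertions.contains)
  {:ok, %{status: if(passed?, do: :passed, else: :failed)}}
end

IO.inspect(SuiteSketch.failures(cases, :placeholder_spec, stub_runner))
#=> []
```

Returning the failing ids and statuses, rather than a bare boolean, keeps the ExUnit failure message pointed at the case that regressed.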

Common Patterns

  • One fake per scenario. Resist building a single mega-fake. Each test is clearest when the LLM function shows exactly the decisions that matter for that case.
  • Use the journal as the state machine. map_size(journal.results) and Map.values(journal.results) are usually enough to branch decisions without inventing a separate test state.
  • Inspect before asserting. When an assertion fails, run Jidoka.inspect(run.result) to see the timeline, then refine the assertion or the fake.
  • Project, then snapshot. Golden tests should compare Jidoka.project/1 output, not raw structs.
  • Treat hibernation as data. When a test deliberately exercises a control interrupt, assert on run.error.reason == :hibernated and use Jidoka.resume/2 in a follow-up test to drive the resume path.

Testing

The dedicated tests under test/jidoka/eval exercise this guide's surface end to end. The recipe is short: build a spec, pin an LLM and operations capability, then assert on Jidoka.Eval.Run.status.

test "passes when content and operations match" do
  operations =
    Jidoka.Runtime.LocalOperations.operations(%{
      "echo" => fn %{"phrase" => phrase} -> {:ok, %{echoed: phrase}} end
    })

  llm = fn _intent, journal ->
    case map_size(journal.results) do
      0 -> {:ok, %{type: :operation, name: "echo", arguments: %{"phrase" => "hi"}}}
      _ -> {:ok, %{type: :final, content: "hi"}}
    end
  end

  spec =
    Jidoka.agent!(
      id: "echo_agent",
      instructions: "Echo the user's input.",
      operations: [Jidoka.Agent.Spec.Operation.new!(name: "echo")]
    )

  assert {:ok, %Jidoka.Eval.Run{status: :passed}} =
           Jidoka.Eval.run_case(
             %{id: "echo_basic", agent: spec, input: "hi",
               assertions: %{contains: "hi", operation_called: "echo"}},
             llm: llm,
             operations: operations
           )
end

For tests that need to inspect the full run shape, project it with Jidoka.project(run).

Troubleshooting

| Symptom | Likely Cause | Fix |
|---|---|---|
| {:error, %Jidoka.Error.Invalid{}} from run_case/2 | The case input was malformed (missing :agent, invalid :input). | Verify the case keys; agent: is required and must be a spec or compatible map. |
| run.status == :error with error.reason == :hibernated | An operation control returned {:interrupt, _}. | Either remove the control for the test or assert on hibernation and resume in a follow-up. |
| run.status == :error with a Splode error map | The LLM or operation capability returned {:error, _}. | Inspect run.error.details; the capability is the fastest place to fix. |
| Assertions report :passed but content is wrong | The fake LLM returned the expected string by accident even when the operation was never called. | Add operation_called: to lock down the path. |
| Golden test fails after an unrelated change | Volatile fields (ids, timestamps) leaked into the snapshot. | Project the spec, drop the volatile keys, then assert. |

Reference