Use deterministic tests for agent behavior. Inject a fake LLM and local operations, assert the compiled spec shape, and run small eval cases without calling a provider.
When To Use This
- Use this guide when adding tests for a new agent, tool, control, or memory contract.
- Use this guide when setting up regression coverage for the DSL-to-Jidoka.Agent.Spec contract.
- Use this guide when building a small eval suite for CI.
- Do not use this guide for live model evaluations or benchmarking; those belong in opt-in suites that explicitly require provider credentials.
Prerequisites
- A working Jidoka project (see Getting Started).
- Familiarity with the operation contract from Tools And Operations.
- No provider keys are required for any example below.
mix deps.get
mix test
Quick Example
A minimal eval pins both capabilities, declares one assertion, and runs the case through the same harness as production.
defmodule MyApp.TimeAgent do
  use Jidoka.Agent
  agent :time_agent do
    model "openai:gpt-4o-mini"
    instructions "Call local_time when asked for the time."
  end
  tools do
    action MyApp.Tools.LocalTime
  end
end
operations =
  Jidoka.Runtime.LocalOperations.operations(%{
    "local_time" => fn _args -> {:ok, %{city: "Chicago", time: "09:30"}} end
  })
llm = fn _intent, journal ->
  case map_size(journal.results) do
    0 -> {:ok, %{type: :operation, name: "local_time", arguments: %{}}}
    _ -> {:ok, %{type: :final, content: "Chicago time is 09:30."}}
  end
end
{:ok, run} =
  Jidoka.Eval.run_case(
    %{
      id: "time_basic",
      agent: MyApp.TimeAgent.spec(),
      input: "What time is it?",
      assertions: %{
        contains: "09:30",
        operation_called: "local_time"
      }
    },
    llm: llm,
    operations: operations
  )
run.status
#=> :passed
The run is reproducible. The same inputs always produce the same Jidoka.Eval.Run, so this example doubles as a regression test.
Concepts
Deterministic testing in Jidoka uses four building blocks.
- Fake LLM function. Every LLM capability is a 2-arity function fn intent, journal -> {:ok, decision} | {:error, reason} end. The decision shape is %{type: :operation, name: ..., arguments: ...} or %{type: :final, content: ...}. The journal is the replay trace; counting map_size(journal.results) is the standard way to drive multi-step decisions.
- Local operation capability. Jidoka.Runtime.LocalOperations.operations/1 wraps a map of %{name => handler} into a capability. Handlers may be (args -> term) or (intent, journal -> term). The same helper is what Jidoka.Operation.Source.Local uses under the hood.
- Golden DSL-to-spec tests. Jidoka.project/1 produces compact, deterministic maps from any Jidoka data. Snapshotting those projections locks the DSL/import contract; changes show up as diffs in the golden file.
- Jidoka.Eval. Jidoka.Eval.Case packages an agent + request + assertion set into one value. Jidoka.Eval.run_case/2 runs the case through the normal turn runtime and returns a Jidoka.Eval.Run with status, evaluated assertions, and observations.
╭──────────────────╮     ╭───────────────────╮     ╭──────────────────╮
│ Eval.Case data   │────▶│ Jidoka.Eval       │────▶│ turn runtime     │
│ - agent (spec)   │     │ .run_case/2       │     │ .run_turn/3      │
│ - request/input  │     ╰─────────┬─────────╯     ╰────────┬─────────╯
│ - assertions     │               │                        │
╰──────────────────╯               │                        ▼
                                   │               {:ok, Turn.Result}
                                   │             | {:hibernate, Snap}
                                   │             | {:error, reason}
                                   ▼                        │
                         ╭───────────────────╮              │
                         │ evaluate/2        │◀─────────────╯
                         │ - contains        │
                         │ - equals          │
                         │ - operation_called│
                         ╰─────────┬─────────╯
                                   ▼
                         ╭───────────────────╮
                         │ Jidoka.Eval.Run   │
                         │ status:           │
                         │   :passed         │
                         │   :failed         │
                         │   :error          │
                         ╰───────────────────╯
Three Kinds Of Outcome
Jidoka.Eval.Run.status is one of:
- :passed - the harness returned {:ok, %Turn.Result{}} and every evaluated assertion passed.
- :failed - the harness returned {:ok, _result} but at least one assertion failed. The :assertions list contains :passed/:failed entries with :expected and :actual.
- :error - the harness did not produce a result. Three subcases live here:
  - Input validation errors ({:error, %Jidoka.Error.Invalid{}} from request normalization, context schema mismatch, or spec compilation). run.error is the projected error map.
  - Execution errors ({:error, reason} from the operation or LLM capability). run.error carries the same shape.
  - Hibernation outcomes ({:hibernate, snapshot} from an operation control returning {:interrupt, ...}). run.error is %{reason: :hibernated, snapshot: ...}. The eval does not resume automatically; treat hibernation as a non-pass outcome and feed the snapshot into a Jidoka.resume/2 test if you need to drive the rest.
How To
Step 1: Author A Fake LLM
The simplest fake returns one decision regardless of journal:
llm = fn _intent, _journal ->
  {:ok, %{type: :final, content: "pong"}}
end
For multi-step tests, branch on map_size(journal.results):
llm = fn _intent, journal ->
  case map_size(journal.results) do
    0 -> {:ok, %{type: :operation, name: "local_time", arguments: %{}}}
    1 -> {:ok, %{type: :final, content: "09:30"}}
  end
end
You can also branch on intent metadata or the journal contents when you need to assert that specific tool arguments came back. The fake is just a function; complexity lives in your test, not in a mock framework.
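The journal-size branching above can be generalized into a scripted fake. The following is a minimal sketch, not part of Jidoka: `ScriptedLLM` and `from_decisions/1` are hypothetical names, and a plain map with a `results` key stands in for the journal, which is an assumption about its shape based on the `journal.results` access shown in this guide.

```elixir
defmodule ScriptedLLM do
  @moduledoc "Hypothetical helper: builds a fake LLM from an ordered list of decisions."

  # Returns a 2-arity function with the fake LLM shape described above.
  # The number of recorded results selects the next decision in the script.
  def from_decisions(decisions) when is_list(decisions) do
    fn _intent, journal ->
      step = map_size(journal.results)

      case Enum.fetch(decisions, step) do
        {:ok, decision} -> {:ok, decision}
        :error -> {:error, {:script_exhausted, step}}
      end
    end
  end
end

llm =
  ScriptedLLM.from_decisions([
    %{type: :operation, name: "local_time", arguments: %{}},
    %{type: :final, content: "09:30"}
  ])

# Assumption: a plain map stands in for the journal; only `results` is read.
{:ok, first} = llm.(:ignored_intent, %{results: %{}})
{:ok, second} = llm.(:ignored_intent, %{results: %{"local_time" => %{time: "09:30"}}})
{:error, exhausted} = llm.(:ignored_intent, %{results: %{"a" => 1, "b" => 2}})
```

Returning an error once the script runs out, rather than repeating the last decision, makes an unexpected extra LLM call visible in the test instead of silently passing.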
Step 2: Provide Local Operations
Jidoka.Runtime.LocalOperations.operations/1 is the helper for local
operation tests:
operations =
  Jidoka.Runtime.LocalOperations.operations(%{
    "local_time" => fn _args -> {:ok, %{time: "09:30"}} end,
    "echo" => fn %{"phrase" => phrase} -> {:ok, %{echoed: phrase}} end
  })
Handlers can be (args -> term) or (intent, journal -> term). A return
value that is not {:ok, _} or {:error, _} is wrapped in {:ok, value}.
Pass it to turn/3 (or to Jidoka.Eval.run_case/2) as operations:. The
runtime routes any intent with kind: :operation through this capability.
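To make the handler contract concrete, here is a plain-Elixir sketch of arity dispatch and return-value wrapping as described above. It is illustrative only and is not Jidoka's implementation: `HandlerSketch` is a hypothetical module, and the assumption that a 1-arity handler receives the intent's `arguments` field is mine.

```elixir
defmodule HandlerSketch do
  @moduledoc "Illustrative only; not Jidoka's implementation."

  # 1-arity handlers receive the arguments (assumed to live on the intent).
  def call(handler, intent, _journal) when is_function(handler, 1) do
    normalize(handler.(intent.arguments))
  end

  # 2-arity handlers receive the intent and the journal.
  def call(handler, intent, journal) when is_function(handler, 2) do
    normalize(handler.(intent, journal))
  end

  # Tagged tuples pass through; any other value is wrapped in {:ok, value}.
  defp normalize({:ok, _} = ok), do: ok
  defp normalize({:error, _} = error), do: error
  defp normalize(value), do: {:ok, value}
end

intent = %{name: "echo", arguments: %{"phrase" => "hi"}}
journal = %{results: %{}}

# A bare map return is wrapped.
wrapped = HandlerSketch.call(fn %{"phrase" => p} -> %{echoed: p} end, intent, journal)

# A tagged error is passed through unchanged.
tagged = HandlerSketch.call(fn _intent, _journal -> {:error, :boom} end, intent, journal)
```

In a test, this means a handler can return the bare value when the success path is all that matters, and an explicit `{:error, reason}` when the case exercises the execution-error outcome.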
Step 3: Write A Golden DSL-To-Spec Test
The DSL is data-first; the most effective regression test compares the projected spec to a snapshot.
defmodule MyApp.Golden.TimeAgentTest do
  use ExUnit.Case, async: true
  test "compiled spec matches the golden projection" do
    projection =
      MyApp.TimeAgent.spec()
      |> Jidoka.project()
      |> drop_volatile_fields()
    expected = %{
      id: "time_agent",
      operations: [
        %{name: "local_time", idempotency: :idempotent}
      ]
    }
    # Comparing with == lets ExUnit print a diff when the projection drifts.
    assert projection == expected
  end
  # Keep only the fields the golden test is meant to lock down.
  defp drop_volatile_fields(%{} = projection) do
    projection
    |> Map.update!(:operations, fn operations ->
      Enum.map(operations, &Map.take(&1, [:name, :idempotency]))
    end)
    |> Map.take([:id, :operations])
  end
end
In the Jidoka repository, test/jidoka/golden/dsl_to_spec_test.exs asserts
the full projection against a recorded snapshot.
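When a projection nests volatile keys at several levels, an allow-list like the one above can become tedious. A deny-list that walks the whole projection is one alternative. The sketch below is a hypothetical helper, not a Jidoka API, and the key names `:compiled_at` and `:run_id` are invented for illustration.

```elixir
defmodule VolatileKeys do
  @moduledoc "Hypothetical helper: recursively removes keys from nested maps and lists."

  # Structs are left untouched; projections are expected to be plain maps.
  def drop(value, _keys) when is_struct(value), do: value

  def drop(%{} = map, keys) do
    map
    |> Map.drop(keys)
    |> Map.new(fn {key, value} -> {key, drop(value, keys)} end)
  end

  def drop(list, keys) when is_list(list), do: Enum.map(list, &drop(&1, keys))
  def drop(other, _keys), do: other
end

# Invented projection shape, for illustration only.
projection = %{
  id: "time_agent",
  compiled_at: "2024-01-01T00:00:00Z",
  operations: [%{name: "local_time", run_id: "abc123", idempotency: :idempotent}]
}

stable = VolatileKeys.drop(projection, [:compiled_at, :run_id])
```

The trade-off is that a deny-list lets new fields enter the snapshot automatically, which surfaces contract changes sooner but also makes the golden file more sensitive to additions.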
Step 4: Use Jidoka.Eval.run_case For Behavior Tests
Jidoka.Eval.run_case/2 accepts a Jidoka.Eval.Case struct, a map, or a
keyword list. Three assertion kinds are supported today:
- contains: "substring" (or a list of substrings) - asserts result.content contains each.
- equals: "exact content" - asserts result.content equals the value.
- operation_called: "name" (or a list) - asserts each name appears in result.agent_state.operation_results.
{:ok, run} =
  Jidoka.Eval.run_case(
    %{
      id: "time_basic",
      agent: MyApp.TimeAgent.spec(),
      input: "What time is it?",
      assertions: %{
        contains: ["09:30", "Chicago"],
        operation_called: ["local_time"]
      }
    },
    llm: llm,
    operations: operations
  )
run.status
#=> :passed
run.observations
#=> %{content: "Chicago time is 09:30.", operation_calls: ["local_time"], ...}
The Run struct also carries result (the full Turn.Result),
assertions (with :expected and :actual), and metadata so test
output can stay close to the source data.
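The semantics of the three assertion kinds can be sketched in plain Elixir. This is an approximation for reasoning about test failures, not the actual `Jidoka.Eval.evaluate/2`: `AssertionSketch` is hypothetical, and the observations shape (`content`, `operation_calls`) is taken from the `run.observations` example above.

```elixir
defmodule AssertionSketch do
  @moduledoc "Illustrative only; approximates the assertion kinds described above."

  # Evaluates each assertion against the observed content and operation calls.
  def evaluate(assertions, %{content: content, operation_calls: calls}) do
    Enum.map(assertions, fn
      {:contains, expected} ->
        passed? = Enum.all?(List.wrap(expected), &String.contains?(content, &1))
        entry(:contains, expected, content, passed?)

      {:equals, expected} ->
        entry(:equals, expected, content, content == expected)

      {:operation_called, expected} ->
        passed? = Enum.all?(List.wrap(expected), &(&1 in calls))
        entry(:operation_called, expected, calls, passed?)
    end)
  end

  # A run passes only when every evaluated assertion passed.
  def status(entries) do
    if Enum.all?(entries, &(&1.status == :passed)), do: :passed, else: :failed
  end

  defp entry(kind, expected, actual, passed?) do
    status = if passed?, do: :passed, else: :failed
    %{kind: kind, status: status, expected: expected, actual: actual}
  end
end

observations = %{content: "Chicago time is 09:30.", operation_calls: ["local_time"]}

entries =
  AssertionSketch.evaluate(
    %{contains: ["09:30", "Chicago"], operation_called: "local_time"},
    observations
  )

failing = AssertionSketch.evaluate(%{equals: "09:30"}, observations)
```

Note how `equals` fails here even though `contains` would pass: exact-match assertions are brittle against phrasing changes in the fake, so they suit cases where the final content is fully pinned.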
Step 5: Distinguish Outcome Kinds
When a test fails, look at run.status and run.error first:
case Jidoka.Eval.run_case(case_input, llm: llm, operations: operations) do
  {:ok, %Jidoka.Eval.Run{status: :passed} = run} -> {:ok, run}
  {:ok, %Jidoka.Eval.Run{status: :failed, assertions: as}} -> {:failed, as}
  {:ok, %Jidoka.Eval.Run{status: :error, error: %{reason: :hibernated} = e}} ->
    {:hibernated, e.snapshot}
  {:ok, %Jidoka.Eval.Run{status: :error, error: e}} -> {:execution_error, e}
  {:error, reason} -> {:case_validation_error, reason}
end
{:error, reason} from run_case/2 itself is the case validation
path - the input could not be normalized into a Jidoka.Eval.Case. The
three statuses inside the run cover the runtime outcomes.
Step 6: Build A Small Eval Suite
Eval cases are plain data, so they compose well into a regular ExUnit
suite. Iterate the case list, attach the agent spec, and assert on
run.status. Jidoka.Eval is not a replacement for ExUnit, just a
packaging convenience for the agent/request/assertions trio.
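The iterate-attach-assert loop described above might look like the following sketch. The runner is injected as a function so the example runs without Jidoka; in a real suite it would presumably be a closure around `Jidoka.Eval.run_case/2` with the pinned `llm:` and `operations:` options. `EvalSuiteSketch`, the stub runner, and the placeholder agent value are all hypothetical.

```elixir
defmodule EvalSuiteSketch do
  @moduledoc "Hypothetical helper: runs a list of plain-data eval cases."

  # Attaches the agent to each case, runs it, and collects {id, outcome} pairs.
  def run(cases, agent, run_case) when is_function(run_case, 1) do
    Enum.map(cases, fn eval_case ->
      case run_case.(Map.put(eval_case, :agent, agent)) do
        {:ok, %{status: status}} -> {eval_case.id, status}
        {:error, reason} -> {eval_case.id, {:invalid_case, reason}}
      end
    end)
  end

  # Everything that is not :passed counts as a failure for the suite.
  def failures(results) do
    Enum.reject(results, fn {_id, outcome} -> outcome == :passed end)
  end
end

# Stub runner standing in for the real eval harness.
stub = fn
  %{id: "time_basic"} -> {:ok, %{status: :passed}}
  %{id: "time_wrong"} -> {:ok, %{status: :failed}}
  _other -> {:error, :missing_input}
end

cases = [
  %{id: "time_basic", input: "What time is it?", assertions: %{contains: "09:30"}},
  %{id: "time_wrong", input: "What time is it?", assertions: %{equals: "noon"}},
  %{id: "no_input", assertions: %{}}
]

results = EvalSuiteSketch.run(cases, :agent_spec_placeholder, stub)
failures = EvalSuiteSketch.failures(results)
```

In an ExUnit test, a single `assert failures == []` over the collected results keeps the failing case ids visible in the diff.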
Common Patterns
- One fake per scenario. Resist building a single mega-fake. Each test is clearest when the LLM function shows exactly the decisions that matter for that case.
- Use the journal as the state machine. map_size(journal.results) and Map.values(journal.results) are usually enough to branch decisions without inventing a separate test state.
- Inspect before asserting. When an assertion fails, run Jidoka.inspect(run.result) to see the timeline, then refine the assertion or the fake.
- Project, then snapshot. Golden tests should compare Jidoka.project/1 output, not raw structs.
- Treat hibernation as data. When a test deliberately exercises a control interrupt, assert on run.error.reason == :hibernated and use Jidoka.resume/2 in a follow-up test to drive the resume path.
Testing
The dedicated tests under test/jidoka/eval exercise this guide's surface
end to end. The recipe is short: build a spec, pin an LLM and operations
capability, then assert on Jidoka.Eval.Run.status.
test "passes when content and operations match" do
  operations =
    Jidoka.Runtime.LocalOperations.operations(%{
      "echo" => fn %{"phrase" => phrase} -> {:ok, %{echoed: phrase}} end
    })
  llm = fn _intent, journal ->
    case map_size(journal.results) do
      0 -> {:ok, %{type: :operation, name: "echo", arguments: %{"phrase" => "hi"}}}
      _ -> {:ok, %{type: :final, content: "hi"}}
    end
  end
  spec =
    Jidoka.agent!(
      id: "echo_agent",
      instructions: "Echo the user's input.",
      operations: [Jidoka.Agent.Spec.Operation.new!(name: "echo")]
    )
  assert {:ok, %Jidoka.Eval.Run{status: :passed}} =
           Jidoka.Eval.run_case(
             %{id: "echo_basic", agent: spec, input: "hi",
               assertions: %{contains: "hi", operation_called: "echo"}},
             llm: llm,
             operations: operations
           )
end
For tests that need to inspect the full run shape, project it with
Jidoka.project(run).
Troubleshooting
| Symptom | Likely Cause | Fix |
|---|---|---|
| {:error, %Jidoka.Error.Invalid{}} from run_case/2 | The case input was malformed (missing :agent, invalid :input). | Verify the case keys; agent: is required and must be a spec or compatible map. |
| run.status == :error with error.reason == :hibernated | An operation control returned {:interrupt, _}. | Either remove the control for the test or assert on hibernation and resume in a follow-up. |
| run.status == :error with a Splode error map | The LLM or operation capability returned {:error, _}. | Inspect run.error.details; the capability is the fastest place to fix. |
| Assertions report :passed but content is wrong | The fake LLM returned the expected string by accident even when the operation was never called. | Add operation_called: to lock down the path. |
| Golden test fails after an unrelated change | Volatile fields (ids, timestamps) leaked into the snapshot. | Project the spec, drop the volatile keys, then assert. |
Reference
- Jidoka.Eval - run_case/2 and evaluate/2.
- Jidoka.Eval.Case - case schema, new/2, new!/2, from_input/2.
- Jidoka.Eval.Run - run schema, :passed | :failed | :error status, assertions, observations.
- Jidoka.Runtime.LocalOperations - operations/1 helper that wraps a handler map.
- Jidoka.Operation.Source.Local - source-shaped wrapper around the same handlers.
- Jidoka.Projection - data projector used by golden tests.
- Jidoka - public facade: turn/3, chat/3, resume/2, inspect/2, project/1.
Related Guides
- Tools And Operations - shape of the operation contract under test.
- Memory - test patterns for memory-backed turns.
- Handoffs - testing ownership transitions.
- Inspection And Preflight - debugging failures before adding assertions.
- Runtime And Harness - hibernation and resume flows referenced by error-status cases.