When you change a prompt or a circle and want evidence, you run an eval. This harness runs Familiar scenarios across seeds, scores each run against rubric criteria, persists transcripts, and writes a JSON report.
Multi-scenario, multi-seed evaluation harness for Cantrip.Familiar.
Scenarios are trusted Elixir data, usually loaded from an .exs file or a
directory of .exs / .json files. Each scenario creates a temporary
workspace, runs the Familiar against a prompt, persists that run's loom
transcript, applies rubric criteria, and contributes to a summary report.
Minimal scenario shape:

    [
      %{
        name: "read-note",
        prompt: "Read note.txt and return the first line.",
        fixtures: %{"note.txt" => "hello\n"},
        # ~S|...| is used here because the embedded code contains ] characters,
        # which would close a ~S[...] sigil early.
        llm: {Cantrip.FakeLLM, Cantrip.FakeLLM.new([%{code: ~S|
          {:ok, reader} = Cantrip.new(%{
            identity: %{system_prompt: "Read note.txt and return its contents."},
            circle: %{type: :code, gates: ["read_file", "done"], wards: [%{max_turns: 2}]}
          })
          {:ok, text, _reader, _loom, _meta} = Cantrip.cast(reader, "Read note.txt")
          done.(String.trim(text))
        |}])},
        rubric: [
          %{name: "terminated", terminated: true},
          %{name: "answer", expected_result: "hello"}
        ]
      }
    ]

Rubric criteria can be data-driven (:expected_result, :contains,
:terminated, :gate_used, :child_medium_used, :forbid_code_contains),
function-driven via :score, or judge-driven via :judge. Function criteria
receive the run map and return a boolean or a numeric score. Judge criteria
take their LLM from :judge_llm, :judge_llm_factory, or the runner's
:judge_llm option, and expect the judge to reply with a JSON object like
%{"score" => 4, "reason" => "..."} or a bare numeric response.
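To make the difference between data-driven and function-driven criteria concrete, here is a minimal sketch of a rubric that mixes both. The shape of the run map is not specified in this documentation, so the :result key below is an assumption made purely for illustration; the criterion names are also hypothetical.

```elixir
# Hypothetical run map. The real keys produced by the harness are not
# documented here, so :result is an assumption for illustration only.
run = %{result: "hello"}

rubric = [
  # Data-driven criteria, using keys shown in the scenario example.
  %{name: "terminated", terminated: true},
  %{name: "answer", expected_result: "hello"},
  # Function-driven criterion: receives the run map, returns a boolean.
  %{name: "short-answer",
    score: fn run -> String.length(to_string(run[:result])) <= 20 end},
  # Function-driven criterion returning a numeric score instead.
  %{name: "brevity",
    score: fn run -> 1 / max(String.length(to_string(run[:result])), 1) end}
]

# Apply only the function-driven criteria by hand to see what they return.
# Entries without a :score key do not match the pattern and are skipped.
scores =
  for %{name: name, score: score_fun} <- rubric, into: %{} do
    {name, score_fun.(run)}
  end
```

Calling the :score functions by hand like this is only a way to check them in isolation; in a real run the harness applies the rubric itself.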
Functions
Returns a JSON-safe projection of a report.
Loads scenarios from a trusted .exs file or a JSON file.
Loads scenarios from a trusted .exs/.json file or a directory.
.exs files may return either a list of scenario maps or
%{scenarios: scenarios}. JSON files support data-driven criteria only.
Directories load *.exs and *.json entries in lexical order.
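As an illustration of the %{scenarios: scenarios} form, a scenario file might look like the sketch below. The path scenarios/read_note.exs is hypothetical, and the scenario leaves out :llm on the assumption that the runner's :llm_factory option supplies one.

```elixir
# scenarios/read_note.exs (hypothetical path)
# The file's last expression is the value the loader receives.
%{
  scenarios: [
    %{
      name: "read-note",
      prompt: "Read note.txt and return the first line.",
      fixtures: %{"note.txt" => "hello\n"},
      # Only data-driven criteria are used, so the same scenario
      # could also be written as a JSON file.
      rubric: [
        %{name: "terminated", terminated: true},
        %{name: "answer", expected_result: "hello"}
      ]
    }
  ]
}
```

Because directories are read in lexical order, a numeric prefix on file names (for example 01_, 02_) is one way to keep the scenario order stable.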
Runs scenarios and returns a report map.
Options:

  * :seeds - integer count or explicit list of seeds. Default: 1.
  * :out_dir - directory for report and transcripts. Default: tmp/cantrip-evals/<timestamp>.
  * :llm_factory - fallback function (scenario, seed) -> llm.
  * :judge_llm - fallback LLM used by judge-driven rubric criteria.
  * :judge_llm_factory - fallback function (scenario, seed) -> judge_llm.
  * :familiar_opts - base options merged into every Familiar.
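The options could be assembled as an ordinary keyword list. The sketch below uses only option keys named in this section; the values are illustrative, the out_dir path is hypothetical, and the factory's return value is modeled on the {Cantrip.FakeLLM, ...} tuple from the scenario example, on the assumption that a factory returns the same kind of value a scenario's :llm key holds.

```elixir
# Options sketch. The keys come from the documented option list;
# every value here is an illustrative assumption.
opts = [
  # Explicit list of seeds; an integer count is also accepted.
  seeds: [1, 2, 3],
  # Hypothetical output directory for the report and transcripts.
  out_dir: "tmp/cantrip-evals/prompt-v2",
  # Fallback used when a scenario does not carry its own :llm.
  # The function body is only evaluated when the factory is called.
  llm_factory: fn _scenario, _seed ->
    {Cantrip.FakeLLM, Cantrip.FakeLLM.new([%{code: ~S|done.("hello")|}])}
  end
]
```

The list would then be passed as the options argument to the run function documented above.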
Loads a scenario file and runs it.
Loads a scenario file or directory and runs it.