Cantrip.Familiar.Eval (Cantrip v1.3.3)

When you change a prompt or a circle and want evidence, you run an eval. This harness runs Familiar scenarios across seeds, scores each run against rubric criteria, persists transcripts, and writes a JSON report.

Multi-scenario, multi-seed evaluation harness for Cantrip.Familiar.

Scenarios are trusted Elixir data, usually loaded from an .exs file or a directory of .exs / .json files. Each scenario creates a temporary workspace, runs the Familiar against a prompt, persists that run's loom transcript, applies rubric criteria, and contributes to a summary report.

Minimal scenario shape:

[
  %{
    name: "read-note",
    prompt: "Read note.txt and return the first line.",
    fixtures: %{"note.txt" => "hello\n"},
    llm: {Cantrip.FakeLLM, Cantrip.FakeLLM.new([%{code: ~S[
      {:ok, reader} = Cantrip.new(%{
        identity: %{system_prompt: "Read note.txt and return its contents."},
        circle: %{type: :code, gates: ["read_file", "done"], wards: [%{max_turns: 2}]}
      })
      {:ok, text, _reader, _loom, _meta} = Cantrip.cast(reader, "Read note.txt")
      done.(String.trim(text))
    ]}])},
    rubric: [
      %{name: "terminated", terminated: true},
      %{name: "answer", expected_result: "hello"}
    ]
  }
]

Rubric criteria can be data-driven (:expected_result, :contains, :terminated, :gate_used, :child_medium_used, :forbid_code_contains), function-driven via :score, or judge-driven via :judge. Function criteria receive the run map and return a boolean or numeric score. Judge criteria use :judge_llm, :judge_llm_factory, or the runner's :judge_llm option and expect a JSON object like %{"score" => 4, "reason" => "..."} or a bare numeric response.
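As a rough illustration of function-driven criteria, the sketch below builds two :score functions, one returning a boolean and one returning a number. The shape of the run map is not spelled out above, so the :result key used here is an assumption, and the criterion names are made up for the example.

```elixir
# ASSUMPTION: the run map exposes the Familiar's final answer under :result.
# Check the actual run map in your version before relying on this key.

# Boolean criterion: passes when the first line of the result is "hello".
first_line_is_hello = fn run ->
  case run[:result] do
    text when is_binary(text) ->
      first_line = text |> String.split("\n") |> List.first()
      String.trim(first_line) == "hello"

    _other ->
      false
  end
end

# Numeric criterion: full marks for short answers, half marks otherwise.
brevity = fn run ->
  text = to_string(run[:result] || "")
  if String.length(text) <= 20, do: 1.0, else: 0.5
end

# Function criteria sit in the rubric next to data-driven ones.
rubric = [
  %{name: "terminated", terminated: true},
  %{name: "first-line", score: first_line_is_hello},
  %{name: "brevity", score: brevity}
]
```

Because these criteria hold anonymous functions, they can only live in .exs scenarios; JSON scenarios are limited to the data-driven keys.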

Types

report()

@type report() :: map()

run_result()

@type run_result() :: map()

scenario()

@type scenario() :: map()

Functions

jsonable_report(report)

@spec jsonable_report(report()) :: map()

Returns a JSON-safe projection of a report.

load_file(path)

@spec load_file(Path.t()) :: {:ok, [scenario()]} | {:error, String.t()}

Loads scenarios from a trusted .exs file or a JSON file.

load_path(path)

@spec load_path(Path.t()) :: {:ok, [scenario()]} | {:error, String.t()}

Loads scenarios from a trusted .exs/.json file or a directory.

.exs files may return either a list of scenario maps or %{scenarios: scenarios}. JSON files support data-driven criteria only. Directories load *.exs and *.json entries in lexical order.
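For the second .exs shape, a scenario file might look like the sketch below. The file name is hypothetical, and the scenario leaves out the :llm key on the assumption that the runner's :llm_factory option provides one; whether :llm may be omitted this way is not confirmed above.

```elixir
# evals/01_read_note.exs (hypothetical file name; the numeric prefix only
# serves to control lexical load order inside a directory)

scenarios = [
  %{
    name: "read-note",
    prompt: "Read note.txt and return the first line.",
    fixtures: %{"note.txt" => "hello\n"},
    # ASSUMPTION: no :llm here, relying on the runner's :llm_factory fallback.
    rubric: [
      %{name: "terminated", terminated: true},
      %{name: "answer", expected_result: "hello"}
    ]
  }
]

# The file's last expression is its return value: the %{scenarios: ...} form.
%{scenarios: scenarios}
```

Returning the bare `scenarios` list as the last expression would be the other accepted form.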

run(scenarios, opts \\ [])

@spec run(
  [scenario()],
  keyword()
) :: {:ok, report()} | {:error, String.t()}

Runs scenarios and returns a report map.

Options:

  • :seeds - integer count or explicit list of seeds. Default: 1.
  • :out_dir - directory for report and transcripts. Default: tmp/cantrip-evals/<timestamp>.
  • :llm_factory - fallback function (scenario, seed) -> llm.
  • :judge_llm - fallback LLM used by judge-driven rubric criteria.
  • :judge_llm_factory - fallback function (scenario, seed) -> judge_llm.
  • :familiar_opts - base options merged into every Familiar.
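A minimal sketch of how these options could be combined follows. The out_dir value and the scripted FakeLLM reply are illustrative, and the factory's return value assumes the same {module, state} tuple that a scenario's :llm key uses. The call is guarded so the snippet also evaluates in a session where Cantrip is not loaded.

```elixir
scenarios = [
  %{
    name: "read-note",
    prompt: "Read note.txt and return the first line.",
    fixtures: %{"note.txt" => "hello\n"},
    rubric: [%{name: "answer", expected_result: "hello"}]
  }
]

# ASSUMPTION: the factory returns the same tuple shape as a scenario's :llm.
# The scripted reply below is a stand-in, not a recommended test double.
llm_factory = fn _scenario, _seed ->
  {Cantrip.FakeLLM, Cantrip.FakeLLM.new([%{code: ~S[done.("hello")]}])}
end

opts = [
  seeds: [1, 2, 3],                        # explicit list; an integer count also works
  out_dir: "tmp/cantrip-evals/prompt-v2",  # hypothetical directory
  llm_factory: llm_factory
]

# Only call the harness when the module is actually available.
if Code.ensure_loaded?(Cantrip.Familiar.Eval) do
  case Cantrip.Familiar.Eval.run(scenarios, opts) do
    {:ok, report} -> IO.inspect(Map.keys(report), label: "report keys")
    {:error, message} -> IO.puts(:stderr, "eval failed: " <> message)
  end
end
```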

run_file(path, opts \\ [])

@spec run_file(
  Path.t(),
  keyword()
) :: {:ok, report()} | {:error, String.t()}

Loads a scenario file and runs it.

run_path(path, opts \\ [])

@spec run_path(
  Path.t(),
  keyword()
) :: {:ok, report()} | {:error, String.t()}

Loads a scenario file or directory and runs it.
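Since run_path/2 returns either {:ok, report} or {:error, message}, a script or CI step could map that result to an exit status, roughly as sketched here. The "evals" directory and the exit_code helper are hypothetical names introduced for the example.

```elixir
# Hypothetical helper: turn the documented return shapes into an exit status.
exit_code = fn
  {:ok, _report} ->
    0

  {:error, message} when is_binary(message) ->
    IO.puts(:stderr, "eval failed: " <> message)
    1
end

# Guarded so the snippet is a no-op where Cantrip is not loaded.
if Code.ensure_loaded?(Cantrip.Familiar.Eval) do
  status =
    "evals"
    |> Cantrip.Familiar.Eval.run_path(seeds: 3)
    |> exit_code.()

  IO.puts("exit status: #{status}")
end
```

This only distinguishes a completed run from a load or run error; deciding whether rubric scores are good enough would require inspecting the report itself.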