mix cantrip.eval (Cantrip v1.3.3)


Runs Familiar eval scenarios from a directory or a single file.

mix cantrip.eval evals/familiar --out tmp/evals/current --seeds 5

Options

  • --out PATH - output directory for report.json, workspaces, and transcripts
  • --seeds N - run each scenario with seeds 1..N
  • --seeds A,B,C - run each scenario with explicit seed values
  • --min-mean FLOAT - fail the task if aggregate mean score is below this threshold
  • --min-worst FLOAT - fail the task if aggregate worst score is below this threshold
  • --json - print the full JSON report to stdout
  • --help - show usage
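
The threshold options lend themselves to a CI gate. The following is a minimal sketch, not part of Cantrip itself: the `run_eval_gate` function, the `MIX` override variable, the seed list `1,7,42`, and the thresholds `0.8` and `0.5` are all illustrative assumptions. It also assumes that "fail the task" means `mix` exits with a non-zero status.

```shell
#!/bin/sh
# Hypothetical CI wrapper around `mix cantrip.eval`.
# Assumptions (not from the Cantrip docs): the seed list and both thresholds
# are example values, and a failed threshold is assumed to produce a
# non-zero exit status from mix.
# MIX may be overridden for a dry run; it defaults to the real `mix`.

run_eval_gate() {
  scenarios="$1"   # directory or file of eval scenarios
  out_dir="$2"     # where report.json, workspaces, and transcripts are written

  if "${MIX:-mix}" cantrip.eval "$scenarios" \
      --out "$out_dir" \
      --seeds 1,7,42 \
      --min-mean 0.8 \
      --min-worst 0.5; then
    echo "eval gate passed"
  else
    echo "eval gate failed" >&2
    return 1
  fi
}
```

Calling `run_eval_gate evals/familiar tmp/evals/current` would run each scenario with the three explicit seeds and return a non-zero status if either aggregate threshold is missed, which is usually enough for a CI step to fail.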