Run a directory or file of Familiar eval scenarios.
mix cantrip.eval evals/familiar --out tmp/evals/current --seeds 5

Options

  --out PATH - output directory for report.json, workspaces, and transcripts
  --seeds N - run each scenario with seeds 1..N
  --seeds A,B,C - run each scenario with explicit seed values
  --min-mean FLOAT - fail the task if aggregate mean score is below this threshold
  --min-worst FLOAT - fail the task if aggregate worst score is below this threshold
  --json - print the full JSON report to stdout
  --help - show usage
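
The threshold options lend themselves to a CI gate. The following is a minimal sketch, not part of the task itself: run_gate is a hypothetical helper name, the seed and threshold values are purely illustrative, and it assumes that the flags can be combined, that scores fall on a 0 to 1 scale, and that "fail the task" means a non-zero exit status. None of these assumptions is confirmed by the text above.

```shell
# Hypothetical helper: run the given command and report whether it passed.
# Assumption: the eval task exits non-zero when a threshold is not met.
run_gate() {
  if "$@"; then
    echo "eval gate passed"
  else
    echo "eval gate failed"
    return 1
  fi
}

# Example use in CI (illustrative seeds and thresholds):
# run_gate mix cantrip.eval evals/familiar \
#   --out tmp/evals/current \
#   --seeds 1,7,42 \
#   --min-mean 0.8 \
#   --min-worst 0.5
```

Pinning explicit seeds with --seeds A,B,C keeps the set of runs identical from one CI job to the next, while --min-worst guards against a single poor run that a passing mean would otherwise hide.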