Runs a single eval and scores it.
The runner spins up a throwaway agent loaded with the eval's skills and
tools, sends the eval's prompt, and collects the resulting transcript
(final response + tool calls). A run is scored by two kinds of check: a
deterministic completion check (did the agent respond at all, vs.
erroring or timing out) and the LLM judge scoring the transcript against
the eval's ## Expect rubric.
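The two kinds of check can be pictured with a minimal sketch. This is not the runner's actual code: the module `ScoringSketch`, the `{:ok, transcript}` / `{:error, reason}` outcome shape, the transcript map keys, and the choice to skip the judge when the completion check fails are all assumptions made for illustration. The judge is passed in as a plain function so the sketch runs without an LLM.

```elixir
defmodule ScoringSketch do
  # Hypothetical outcome shape: {:ok, transcript} or {:error, reason}.
  # Deterministic check: passes only when the agent produced a non-empty response.
  def completion_check({:ok, %{response: response}})
      when is_binary(response) and response != "",
      do: :pass

  # Errors, timeouts, and empty responses all count as a failed completion.
  def completion_check(_outcome), do: :fail

  # judge_fun stands in for the LLM judge; it receives the transcript and the rubric.
  def score(outcome, rubric, judge_fun) do
    case completion_check(outcome) do
      :pass ->
        {:ok, transcript} = outcome
        %{completion: :pass, judge: judge_fun.(transcript, rubric)}

      :fail ->
        # Assumption: with no transcript there is nothing for the judge to score.
        %{completion: :fail, judge: :skipped}
    end
  end
end

# Stand-in judge: passes when the response mentions the rubric text.
judge = fn transcript, rubric ->
  if String.contains?(transcript.response, rubric), do: :pass, else: :fail
end

ok_run =
  ScoringSketch.score({:ok, %{response: "The capital is Paris.", tool_calls: []}}, "Paris", judge)

timed_out = ScoringSketch.score({:error, :timeout}, "Paris", judge)

IO.inspect(ok_run)
IO.inspect(timed_out)
```

Keeping the completion check deterministic means a timeout or error is reported the same way on every run, independent of whatever the judge model would have said.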
The agent and judge both run through SkillKit.LLM, so the configured
provider decides behavior: a real provider for mix test --include eval,
the mock for the harness's own unit tests.
Functions
@spec run(SkillKit.Eval.t(), keyword()) :: SkillKit.Eval.Result.t()
Runs eval and returns a SkillKit.Eval.Result.
Options:
:timeout - milliseconds to wait for the agent to respond (default 30000)
:judge - set false to skip the LLM-judge check (default true)
:model - overrides the eval's agent model
:judge_model - model URI for the judge (defaults to the eval's model)
:cache - skip cases whose scope already passed. true uses the default cache under _build, a string uses that path, false (default) disables caching. See SkillKit.Eval.Cache.
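As a rough illustration of how the documented defaults combine, the sketch below resolves a keyword list such as `[timeout: 60_000, judge: false]` into a full option map. It is only a sketch under stated assumptions: the module `RunOptionsSketch`, the `resolve/2` function, and the concrete default cache path are hypothetical (the documentation only says the default cache lives under `_build`), and the real runner may handle options differently.

```elixir
defmodule RunOptionsSketch do
  # Hypothetical path; the docs only state that the default cache is under _build.
  @default_cache_path "_build/eval_cache"

  # eval_model is the model declared by the eval itself.
  def resolve(opts, eval_model) when is_list(opts) do
    %{
      timeout: Keyword.get(opts, :timeout, 30_000),
      judge: Keyword.get(opts, :judge, true),
      # :model overrides the eval's agent model when given.
      model: Keyword.get(opts, :model, eval_model),
      # :judge_model falls back to the eval's model, as documented.
      judge_model: Keyword.get(opts, :judge_model, eval_model),
      cache: cache_path(Keyword.get(opts, :cache, false))
    }
  end

  # true selects the default cache, a string is used as the path, false disables caching.
  defp cache_path(true), do: @default_cache_path
  defp cache_path(path) when is_binary(path), do: path
  defp cache_path(false), do: nil
end

defaults = RunOptionsSketch.resolve([], "provider:some-model")

custom =
  RunOptionsSketch.resolve([timeout: 60_000, judge: false, cache: "tmp/evals"], "provider:some-model")

IO.inspect(defaults)
IO.inspect(custom)
```

With no options, the sketch yields a 30000 ms timeout, the judge enabled, both models equal to the eval's model, and caching disabled.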