SkillKit.Eval.Runner (SkillKit v0.3.0)

Runs a single eval and scores it.

The runner spins up a throwaway agent loaded with the eval's skills and tools, sends the eval's prompt, and collects the resulting transcript (the final response plus any tool calls). A run is scored by two kinds of check: a deterministic completion check (did the agent respond at all, rather than erroring or timing out) and an LLM-judge check that scores the transcript against the eval's ## Expect rubric.
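The two checks can be pictured with a small, self-contained sketch. It is not SkillKit's implementation: the module name ScoringSketch, the outcome shapes ({:ok, transcript}, {:error, reason}, :timeout), the result map keys, and the stand-in judge function are all hypothetical. Skipping the judge when the completion check fails is likewise an assumption.

```elixir
defmodule ScoringSketch do
  # Hypothetical illustration only; not SkillKit's actual code.

  # Deterministic completion check: passes only when the agent produced
  # a transcript, and fails on an error or a timeout.
  def completion_check({:ok, _transcript}), do: :pass
  def completion_check({:error, reason}), do: {:fail, {:error, reason}}
  def completion_check(:timeout), do: {:fail, :timeout}

  # Runs the completion check first, then the judge (if enabled) on the
  # transcript. Assumption: the judge is skipped when the run did not complete.
  def score(outcome, rubric, judge_fun, judge? \\ true) do
    case completion_check(outcome) do
      :pass ->
        {:ok, transcript} = outcome
        judge = if judge?, do: judge_fun.(transcript, rubric), else: :skipped
        %{completion: :pass, judge: judge}

      {:fail, _reason} = failed ->
        %{completion: failed, judge: :skipped}
    end
  end
end

# Stand-in for the LLM judge: a plain substring match against the rubric.
judge = fn transcript, rubric ->
  if String.contains?(transcript.response, rubric), do: :pass, else: :fail
end

transcript = %{response: "Deployed to staging", tool_calls: []}

completed = ScoringSketch.score({:ok, transcript}, "staging", judge)
timed_out = ScoringSketch.score(:timeout, "staging", judge)
```

Keeping the completion check deterministic means an error or timeout can be reported without spending a judge call on an empty transcript.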

The agent and judge both run through SkillKit.LLM, so the configured provider decides behavior: a real provider for mix test --include eval, the mock for the harness's own unit tests.

Functions

run(eval, opts \\ [])

Runs eval and returns a SkillKit.Eval.Result.

Options:

  • :timeout — ms to wait for the agent to respond (default 30000)
  • :judge — set false to skip the LLM-judge check (default true)
  • :model — overrides the eval's agent model
  • :judge_model — model URI for the judge (defaults to the eval's model)
  • :cache — skip cases whose scope already passed. true uses the default cache under _build, a string uses that path, false (default) disables caching. See SkillKit.Eval.Cache.
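As a rough illustration of how the documented defaults combine, the sketch below resolves an options list into concrete settings. It is a hypothetical helper, not part of SkillKit: the module RunOptsSketch, the default cache path, and the "example:model-a" model strings are placeholders, and the real runner may resolve options differently. It assumes Elixir 1.13 or later for Keyword.validate!/2.

```elixir
defmodule RunOptsSketch do
  # Hypothetical helper; not part of SkillKit.

  # Placeholder only: the real default cache location under _build is not
  # stated in the docs above.
  @default_cache_path "_build/eval_cache"

  # Merges the documented defaults into opts and raises ArgumentError on
  # unknown keys. eval_model stands for the model declared by the eval.
  def resolve(opts, eval_model) do
    opts =
      Keyword.validate!(opts, [
        :model,
        :judge_model,
        timeout: 30_000,
        judge: true,
        cache: false
      ])

    %{
      timeout: opts[:timeout],
      judge: opts[:judge],
      model: Keyword.get(opts, :model, eval_model),
      judge_model: Keyword.get(opts, :judge_model, eval_model),
      cache_path: cache_path(opts[:cache])
    }
  end

  # false disables caching, true picks the default path, a string is used as-is.
  defp cache_path(false), do: nil
  defp cache_path(true), do: @default_cache_path
  defp cache_path(path) when is_binary(path), do: path
end

defaults = RunOptsSketch.resolve([], "example:model-a")

custom =
  RunOptsSketch.resolve(
    [timeout: 60_000, judge: false, cache: "tmp/eval_cache"],
    "example:model-a"
  )
```

With the real runner, the equivalent call would look like SkillKit.Eval.Runner.run(eval, timeout: 60_000, judge: false, cache: "tmp/eval_cache").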