LLM-as-judge scoring for an eval's ## Expect rubric.
After the agent under test runs, the harness asks a model to decide whether the transcript satisfies the rubric. The judge is given the original user prompt (for context), the tools the agent called, and its final response.
The verdict is severity-weighted and always coalesces to pass or fail:
- FAIL is reserved for critical shortfalls: a security or safety problem, a vulnerability, incorrect/harmful output, or a critical failure to do what the rubric asks.
- Everything else is a PASS. When the substance is right but the transcript deviates in a non-critical way (different wording, optional suggestions, extra caveats, hypothetical edge cases), the judge still passes it and attaches a one-line WARNING:.
This keeps a capable agent from failing an eval over non-critical nitpicks while still hard-failing genuinely bad behavior.
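As a rough illustration of the pass/fail coalescing described above, the sketch below turns a raw judge reply into a verdict tuple. The module VerdictSketch, the function coalesce/1, and the reply layout (verdict word on the first line, an optional WARNING: line, everything else as reasoning) are all assumptions made for this example; the actual judge prompt and parsing are not shown here and may differ.

```elixir
defmodule VerdictSketch do
  # Hypothetical helper, not part of SkillKit. The reply layout it expects
  # (PASS or FAIL on the first line, an optional "WARNING:" line, the rest
  # as reasoning) is an assumption made for illustration.
  def coalesce(reply) when is_binary(reply) do
    lines =
      reply
      |> String.split("\n")
      |> Enum.map(&String.trim/1)
      |> Enum.reject(&(&1 == ""))

    # Separate any WARNING: lines from the verdict and reasoning lines.
    {warnings, rest} = Enum.split_with(lines, &String.starts_with?(&1, "WARNING:"))

    warning =
      case warnings do
        [first | _] -> first |> String.replace_prefix("WARNING:", "") |> String.trim()
        [] -> nil
      end

    case rest do
      # A critical shortfall: no warning is carried on a failure.
      ["FAIL" <> _ | reasoning] -> {:fail, Enum.join(reasoning, " ")}
      # A pass keeps the optional one-line warning alongside the reasoning.
      ["PASS" <> _ | reasoning] -> {:pass, Enum.join(reasoning, " "), warning}
      [] -> {:error, :empty_reply}
      _ -> {:error, :unparseable_reply}
    end
  end
end
```

Under these assumptions a reply that starts with PASS and carries a WARNING: line still yields a :pass tuple, with the warning preserved as the third element rather than changing the outcome.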
Judge calls go through SkillKit.LLM, so the configured provider (a real provider in --only eval runs, the mock in unit tests) decides the verdict.
Types
Functions
@spec judge(String.t(), SkillKit.Eval.Transcript.t(), keyword()) :: verdict()
Scores transcript against rubric.
Options:
- :prompt - the user prompt that was sent to the agent (judge context)
- :model - model URI for the judge call (defaults to the default provider)
Returns {:pass, reasoning, warning} (warning is a string or nil),
{:fail, reasoning}, or {:error, reason} when the LLM call itself fails.
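A caller typically pattern-matches on the three return shapes. The sketch below is a minimal example of that. VerdictReport and summarize/1 are hypothetical names introduced for illustration and are not part of SkillKit. In a real run the tuple would come from the judge/3 call documented above.

```elixir
defmodule VerdictReport do
  # Hypothetical reporter, not part of SkillKit: formats a verdict tuple of
  # the shapes documented above into a single status line.

  # A clean pass carries no warning (the third element is nil).
  def summarize({:pass, reasoning, nil}), do: "PASS: " <> reasoning

  # A pass with a non-critical deviation keeps its one-line warning.
  def summarize({:pass, reasoning, warning}) when is_binary(warning),
    do: "PASS: #{reasoning} (WARNING: #{warning})"

  def summarize({:fail, reasoning}), do: "FAIL: " <> reasoning

  # The LLM call itself failed, so there is no verdict to report.
  def summarize({:error, reason}), do: "ERROR: " <> inspect(reason)
end
```

Keeping the {:error, reason} clause separate from {:fail, reasoning} lets a harness tell a failed judge call apart from a genuine rubric failure.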