Turns EVAL.md files and module-colocated @eval attributes into ExUnit
tests.

use SkillKit.Eval.Case discovers eval cases at compile time and defines one
test per case. Each generated test runs the case through
SkillKit.Eval.Runner and asserts that all of its checks pass; a failure
renders the failing checks and transcript via
SkillKit.Eval.Result.failure_message/1.

    defmodule MyApp.SkillEvalTest do
      # files: EVAL.md / *.eval.md under a directory
      use SkillKit.Eval.Case, dir: "test/evals"
    end

    defmodule MyApp.CodeEvalTest do
      # @eval attributes colocated in application modules
      use SkillKit.Eval.Case, modules: [MyApp.Greeter, MyApp.Billing]
    end

Provide :dir, :modules, or both. Test names are qualified, either by the eval
file's directory ("greeter: greets the user by name") or by the module
("MyApp.Greeter: greets the user by name"), so cases don't collide.
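
The qualification rule above can be sketched as a small pure function. This is only an illustration of the naming scheme, not SkillKit's actual implementation: the module QualifiedNameSketch, the function qualify/2, and the {:file, path} / {:module, mod} tuples are all hypothetical names introduced for this example.

```elixir
defmodule QualifiedNameSketch do
  # Hypothetical helper: prefixes a case name with the scope it came from.

  # File-based case: qualify by the directory that contains the eval file.
  def qualify({:file, path}, name) do
    scope = path |> Path.dirname() |> Path.basename()
    "#{scope}: #{name}"
  end

  # Module-colocated case: qualify by the module name.
  def qualify({:module, mod}, name) when is_atom(mod) do
    "#{inspect(mod)}: #{name}"
  end
end
```

Two cases that share the name "greets the user by name" then end up with distinct test names, because one carries the directory prefix and the other the module prefix.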
Generated tests are tagged :eval. Because they drive a real agent (and an
LLM judge), exclude them from the default suite and opt in explicitly:

    # test_helper.exs
    ExUnit.start(exclude: [:eval])

    # run only the skill evals against a real provider
    ANTHROPIC_API_KEY=... mix test --only eval

Options

- :dir - directory to scan for EVAL.md / *.eval.md files, relative to the project root. An EVAL.md may set module: in its frontmatter to anchor the cache to that module's compiled hash (the sidecar pattern).
- :modules - modules that use SkillKit.Eval and carry @eval attributes; their cases are collected via __skill_evals__/0.
- :run - keyword options forwarded to SkillKit.Eval.Runner.run/2 (e.g. [timeout: 60_000, judge: false]).
- :storage - storage provider to use while these eval tests run, default SkillKit.Storage.File. Eval skills are loaded from real files on disk, but the test environment otherwise configures in-memory storage; a per-test setup swaps the provider in (and restores it after) so colocated SKILL.md files resolve. Set to false to leave the configured provider untouched.

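
The swap-and-restore behaviour described for :storage can be sketched with the application environment. This is a minimal illustration under stated assumptions, not SkillKit's code: the application name :skill_kit_sketch, the key :storage_provider, and the module StorageSwapSketch are hypothetical, and where SkillKit actually keeps its provider setting is not stated here.

```elixir
defmodule StorageSwapSketch do
  # Hypothetical app name and config key, chosen only for this example.
  @app :skill_kit_sketch
  @key :storage_provider

  # Installs `provider` and returns a zero-arity function that restores
  # whatever was configured before (or removes the key if nothing was set).
  def swap(provider) do
    previous = Application.fetch_env(@app, @key)
    Application.put_env(@app, @key, provider)

    fn ->
      case previous do
        {:ok, value} -> Application.put_env(@app, @key, value)
        :error -> Application.delete_env(@app, @key)
      end
    end
  end
end
```

In an ExUnit setup block, the returned function could be handed to on_exit/1 so that the previous provider comes back after each test, which mirrors the "swaps the provider in (and restores it after)" wording above.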
Because the agent and judge call a real provider, pin a provider URI via
:run so eval cases don't fall back to the test mock:

    use SkillKit.Eval.Case,
      dir: "skills",
      run: [
        model: "anthropic:claude-sonnet-4-6",
        judge_model: "anthropic:claude-sonnet-4-6"
      ]