SkillKit.Eval.Case (SkillKit v0.3.0)


Turns EVAL.md files and module-colocated @eval attributes into ExUnit tests.

use SkillKit.Eval.Case discovers eval cases at compile time and defines one test per case. Each generated test runs the case through SkillKit.Eval.Runner and asserts that all of its checks pass; a failure renders the failing checks and transcript via SkillKit.Eval.Result.failure_message/1.

defmodule MyApp.SkillEvalTest do
  # files: EVAL.md / *.eval.md under a directory
  use SkillKit.Eval.Case, dir: "test/evals"
end

defmodule MyApp.CodeEvalTest do
  # @eval attributes colocated in application modules
  use SkillKit.Eval.Case, modules: [MyApp.Greeter, MyApp.Billing]
end

Provide :dir, :modules, or both. Test names are qualified — by the eval file's directory ("greeter: greets the user by name") or by the module ("MyApp.Greeter: greets the user by name") — so cases don't collide.

Generated tests are tagged :eval. Because they drive a real agent (and an LLM judge), exclude them from the default suite and opt in explicitly:

# test_helper.exs
ExUnit.start(exclude: [:eval])

# run only the skill evals against a real provider
ANTHROPIC_API_KEY=... mix test --only eval
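
If passing --only eval by hand is awkward (in CI, for example), the exclusion can be made conditional in test_helper.exs. The following is a minimal sketch, not something SkillKit provides: the RUN_EVALS environment variable is a hypothetical name chosen for illustration, and only standard ExUnit and System calls are used.

```elixir
# test_helper.exs
# RUN_EVALS is a hypothetical opt-in switch, not a SkillKit setting.
# Comparing against "1" avoids treating an empty-but-set variable as opt-in.
run_evals? = System.get_env("RUN_EVALS") == "1"

# Exclude :eval-tagged tests unless the switch is on.
exclude = if run_evals?, do: [], else: [:eval]

ExUnit.start(exclude: exclude)
```

With this in place, RUN_EVALS=1 mix test would run the eval tests together with the rest of the suite, while a plain mix test still skips them. Note that --only eval narrows the run to the eval tests alone, which this switch does not do.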

Options

  • :dir — directory to scan for EVAL.md / *.eval.md files, relative to the project root. An EVAL.md may set module: in its frontmatter to anchor the cache to that module's compiled hash (the sidecar pattern).
  • :modules — modules that use SkillKit.Eval and carry @eval attributes; their cases are collected via __skill_evals__/0.
  • :run — keyword options forwarded to SkillKit.Eval.Runner.run/2 (e.g. [timeout: 60_000, judge: false]).
  • :storage — storage provider to use while these eval tests run, default SkillKit.Storage.File. Eval skills are loaded from real files on disk, but the test environment otherwise configures in-memory storage; a per-test setup swaps the provider in (and restores it after) so colocated SKILL.md files resolve. Set to false to leave the configured provider untouched.
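
As an illustration of how these options combine, here is a sketch that uses only the option names and example values listed above. The test module name MyApp.AllEvalsTest is hypothetical, and the :storage value simply spells out the documented default.

```elixir
defmodule MyApp.AllEvalsTest do
  use SkillKit.Eval.Case,
    # file-based cases (EVAL.md / *.eval.md), relative to the project root
    dir: "test/evals",
    # cases collected from @eval attributes in these modules
    modules: [MyApp.Greeter, MyApp.Billing],
    # forwarded to SkillKit.Eval.Runner.run/2
    run: [timeout: 60_000, judge: false],
    # the documented default, written out explicitly; false would leave
    # the configured provider untouched
    storage: SkillKit.Storage.File
end
```

Because test names are qualified by directory or module, mixing :dir and :modules in one test module should not produce name collisions.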

Eval cases are meant to exercise a real provider for both the agent and the judge, so pin a provider URI for each via :run; otherwise they fall back to the test mock:

use SkillKit.Eval.Case,
  dir: "skills",
  run: [
    model: "anthropic:claude-sonnet-4-6",
    judge_model: "anthropic:claude-sonnet-4-6"
  ]