An evaluation case for a skill, expressed in a markdown EVAL.md file.
Evals are the test counterpart to skills. Where a SKILL.md injects
instructions into an agent, an EVAL.md describes behaviors the skill should
produce and the criteria for success. The harness loads the skill under test
into a fresh agent, sends each case's prompt, and asks an LLM judge whether
the resulting transcript meets the criteria.
SkillKit.Eval.Case turns a directory of EVAL.md files into ExUnit tests,
so mix test runs your skill evals as part of the suite.
File Format
An EVAL.md is a suite of cases. Each ## heading is one case (its text is
the case name); under it, a ### Prompt section is the message sent to the
agent and a ### Expect section is the rubric the LLM judge scores against.
```markdown
## greets the user by name

### Prompt
Hi, I'm Sam

### Expect
The assistant greets the user by their name in a warm, friendly tone.

## handles a missing name

### Prompt
Hello there

### Expect
The assistant greets politely without inventing a name.
```

Skill under test
When the EVAL.md lives next to a SKILL.md, that skill is loaded
automatically — no frontmatter needed. To test a skill elsewhere (or add
tools / pin a model), use optional frontmatter:
```markdown
---
skills:
  - "skills/greeter"
tools:
  - "SkillKit.Tools.Shell"
model: "anthropic:claude-sonnet-4-6"
system: "You are being evaluated."
---

## greets the user by name
...
```

| Frontmatter (all optional) | Notes |
|---|---|
| skills | Skill providers under test (paths or module names). Overrides location inference. |
| tools | Tool providers, same forms as skills. |
| model | Model URI for the eval agent; falls back to the default provider. |
| system | System prompt for the eval agent. |
Headings matching Prompt/Expect (case-insensitive, any level) are section
markers; every other ## heading starts a new case. Other heading levels
inside a section stay part of its content.
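The heading rules above can be sketched as a small line classifier. This is only an illustration under stated assumptions: `HeadingSketch` and its return values are hypothetical names, not part of SkillKit, and the real parser may treat edge cases (frontmatter, code fences) differently.

```elixir
defmodule HeadingSketch do
  # Hypothetical helper, not SkillKit's parser.
  # Classifies one line of an EVAL.md body according to the rules in the text.
  def classify(line) when is_binary(line) do
    case Regex.run(~r/^(#+)\s+(.*?)\s*$/, line) do
      [_full, hashes, title] ->
        cond do
          # Prompt/Expect headings are section markers at any level, any case.
          String.downcase(title) == "prompt" -> {:section, :prompt}
          String.downcase(title) == "expect" -> {:section, :expect}
          # Any other level-2 heading starts a new case named after its text.
          hashes == "##" -> {:case, title}
          # Other heading levels stay part of the current section's content.
          true -> :content
        end

      nil ->
        :content
    end
  end
end
```

For example, `HeadingSketch.classify("### Prompt")` and `HeadingSketch.classify("# PROMPT")` both yield `{:section, :prompt}`, while `HeadingSketch.classify("### Details")` yields `:content`.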
Provider strings
Entries in skills/tools are resolved like SkillKit.start_agent/2
providers: a value starting with an uppercase letter is treated as an Elixir
module name ("SkillKit.Tools.Shell"), anything else as a filesystem path
("skills/greeter").
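A minimal sketch of that resolution rule, assuming "uppercase letter" means ASCII `A` to `Z`. `ProviderSketch` and the tagged tuples it returns are hypothetical and chosen for illustration; SkillKit's own resolution may return a different shape.

```elixir
defmodule ProviderSketch do
  # Hypothetical helper, not part of SkillKit.
  # A string starting with an uppercase ASCII letter is treated as a module name.
  def classify(<<first::utf8, _rest::binary>> = value) when first in ?A..?Z do
    {:module, Module.concat([value])}
  end

  # Anything else is treated as a filesystem path.
  def classify(value) when is_binary(value), do: {:path, value}
end

ProviderSketch.classify("SkillKit.Tools.Shell")
# => {:module, SkillKit.Tools.Shell}

ProviderSketch.classify("skills/greeter")
# => {:path, "skills/greeter"}
```

`Module.concat/1` only builds the module atom; it does not check that the module exists or is loaded.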
Colocating evals with what they test
Beyond a standalone EVAL.md, an eval can be anchored to its subject so the
result cache keys on that subject's source and re-runs when it changes:
- Application code: use SkillKit.Eval enables a doctest-style @eval attribute (see __using__/1), or a <source>.EVAL.md sidecar next to <source>.ex infers its subject module from that file. The module's compiled MD5 anchors the cache.
- A whole agent: an EVAL.md next to an AGENT.md (or an explicit agent:) runs that entire agent (identity, skills, sub-agents) via SkillKit.start_agent/2 and judges its transcript (see agent_source/1). The agent directory's contents anchor the cache.
See the evals guide for worked examples.
Types
```elixir
@type t() :: %SkillKit.Eval{
        agent: String.t() | nil,
        location: String.t() | nil,
        metadata: %{optional(String.t()) => term()},
        model: String.t() | nil,
        module: module() | nil,
        name: String.t() | nil,
        prompt: String.t() | nil,
        rubric: String.t() | nil,
        skills: [provider()],
        system: String.t() | nil,
        tools: [provider()]
      }
```
Functions
Colocates evals with the module they exercise via an @eval attribute.
use SkillKit.Eval registers an accumulating @eval string attribute. Each
value is a chunk of EVAL.md markdown (one or more ## cases, optional
frontmatter); they are parsed at compile time and exposed as
__skill_evals__/0. Every case records module: __MODULE__, so the eval
cache keys on the module's compiled hash — change the module's code (or the
@eval text) and the eval re-runs; leave it untouched and a prior pass is
reused.
```elixir
defmodule MyApp.Greeter do
  use SkillKit.Eval

  @eval """
  ## greets the user by name
  ### Prompt
  Hi, I'm Sam
  ### Expect
  Greets the user by name.
  """

  def greet(name), do: ...
end
```

Point SkillKit.Eval.Case at the module(s) with modules: [MyApp.Greeter].
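The "compiled hash" that anchors the cache can be inspected on any loaded module through the standard `module_info(:md5)` call. The sketch below only shows how to read that value; `CacheKeySketch` is a hypothetical name, and how SkillKit actually derives its cache key from the hash is not specified here.

```elixir
defmodule CacheKeySketch do
  # Hypothetical helper, not part of SkillKit.
  # Returns the compiled module's MD5 as a lowercase hex string.
  # The MD5 is computed by the compiler, so it changes when the module is
  # recompiled from different source and stays stable otherwise.
  def fingerprint(module) when is_atom(module) do
    module.module_info(:md5)
    |> Base.encode16(case: :lower)
  end
end

# Works for any loaded module, for example one from the standard library:
IO.puts(CacheKeySketch.fingerprint(Enum))
```

Because module attributes such as @eval are part of the compiled module, editing the attribute text is expected to change this fingerprint just as editing a function body does.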
Resolves the agent an eval targets, if any — the directory of an AGENT.md
whose whole identity (system prompt, skills, sub-agents) is run as the
subject. Returns the explicit agent: frontmatter, else the eval's own
directory when an AGENT.md sits beside it (the sidecar pattern), else nil.
When set, the eval runs the agent rather than loading a bare skill.
Loads every eval case under dir.
Discovers files named EVAL.md, *.eval.md, or *.EVAL.md at any depth and
flattens their cases. A <source>.EVAL.md next to <source>.ex infers its
subject module from that file. Returns {:ok, evals} ordered by path, or
{:error, {path, reason}} on the first file that fails to parse.
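The discovery step described above can be approximated with a recursive wildcard and a file-name filter. This is a sketch under assumptions: `EvalDiscoverySketch` is a hypothetical module, it covers only the file-name matching and path ordering, and it does not parse files or reproduce the library's error handling.

```elixir
defmodule EvalDiscoverySketch do
  # Hypothetical helper, not part of SkillKit.
  # Lists eval files under `dir` at any depth, ordered by path.
  def discover(dir) when is_binary(dir) do
    dir
    |> Path.join("**/*.md")
    |> Path.wildcard()
    |> Enum.filter(&eval_file?/1)
    |> Enum.sort()
  end

  # Matches the three naming forms: EVAL.md, *.eval.md, *.EVAL.md.
  def eval_file?(path) do
    name = Path.basename(path)
    name == "EVAL.md" or String.ends_with?(name, [".eval.md", ".EVAL.md"])
  end
end
```

Note that `Path.wildcard/1` skips dotfiles and dot-directories unless `match_dot: true` is passed; whether the library does the same is not stated in this page.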
Like load_dir/1 but raises on error. Used by SkillKit.Eval.Case at
compile time.
Loads and parses a single EVAL.md file from disk into its list of cases.
Parses EVAL.md content into a list of %Eval{} cases.
Returns {:ok, evals} or {:error, reason}. location is stored on each
case for diagnostics and skill inference.
Resolves the skill providers for an eval.
Returns the explicit skills when set, otherwise infers a sibling SKILL.md
next to the eval's file (loaded via SkillKit.Eval.SkillFile), or [] when
neither applies.
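The fallback order (explicit skills, then a sibling SKILL.md, then an empty list) might look like the sketch below. `SkillInferenceSketch` is hypothetical and works on plain maps; returning the path of the sibling SKILL.md is an assumption made for illustration, since the real function loads it via SkillKit.Eval.SkillFile and its return shape is not shown here.

```elixir
defmodule SkillInferenceSketch do
  # Hypothetical helper, not part of SkillKit.

  # 1. Explicit skills win over any inference.
  def resolve(%{skills: [_ | _] = skills}), do: skills

  # 2. Otherwise look for a SKILL.md next to the eval's own file.
  def resolve(%{location: location}) when is_binary(location) do
    candidate = location |> Path.dirname() |> Path.join("SKILL.md")
    if File.regular?(candidate), do: [candidate], else: []
  end

  # 3. No explicit skills and no known location: nothing to load.
  def resolve(_eval), do: []
end
```

For instance, `SkillInferenceSketch.resolve(%{skills: ["skills/greeter"], location: "x/EVAL.md"})` returns `["skills/greeter"]` without touching the filesystem.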
Resolves the tool providers for an eval — its explicit tools, plus the
eval's subject module when that module is itself a SkillKit.Tool.