SkillKit.Eval (SkillKit v0.4.0)

An evaluation case for a skill, expressed in a markdown EVAL.md file.

Evals are the test counterpart to skills. Where a SKILL.md injects instructions into an agent, an EVAL.md describes behaviors the skill should produce and the criteria for success. The harness loads the skill under test into a fresh agent, sends each case's prompt, and asks an LLM judge whether the resulting transcript meets the criteria.

SkillKit.Eval.Case turns a directory of EVAL.md files into ExUnit tests, so mix test runs your skill evals as part of the suite.

File Format

An EVAL.md is a suite of cases. Each ## heading is one case (its text is the case name); under it, a ### Prompt section is the message sent to the agent and a ### Expect section is the rubric the LLM judge scores against.

## greets the user by name
### Prompt
Hi, I'm Sam
### Expect
The assistant greets the user by their name in a warm, friendly tone.

## handles a missing name
### Prompt
Hello there
### Expect
The assistant greets politely without inventing a name.
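To make the heading rules concrete, here is a minimal sketch of how a file like the one above could be split into cases. EvalFormatSketch and its functions are hypothetical and are not SkillKit's parser; frontmatter handling is omitted, and the :rubric key is chosen only to mirror the rubric field of the struct documented under Types.

```elixir
defmodule EvalFormatSketch do
  # Hypothetical illustration only; this is not SkillKit's parser.

  # Splits EVAL.md body text (frontmatter not handled) into a list of case maps.
  def cases(markdown) do
    {cases, _section} =
      markdown
      |> String.split("\n")
      |> Enum.reduce({[], nil}, &step/2)

    cases
    |> Enum.reverse()
    |> Enum.map(&finish/1)
  end

  # State is {cases_newest_first, current_section}.
  defp step(line, {cases, section}) do
    cond do
      # A Prompt/Expect heading (any level, any case) switches the section.
      marker = section_marker(line) ->
        {cases, marker}

      # Any other level-2 heading starts a new case.
      match = Regex.run(~r/^##\s+(.+)$/, line) ->
        [_, name] = match
        {[%{name: String.trim(name), prompt: [], rubric: []} | cases], nil}

      # Everything else inside a section is content, including other headings.
      section != nil and cases != [] ->
        [current | rest] = cases
        updated = Map.update!(current, section, fn lines -> [line | lines] end)
        {[updated | rest], section}

      true ->
        {cases, section}
    end
  end

  defp section_marker(line) do
    case Regex.run(~r/^#+\s+(prompt|expect)\s*$/i, line) do
      [_, word] -> if String.downcase(word) == "prompt", do: :prompt, else: :rubric
      nil -> nil
    end
  end

  defp finish(c), do: %{c | prompt: join(c.prompt), rubric: join(c.rubric)}

  defp join(lines), do: lines |> Enum.reverse() |> Enum.join("\n") |> String.trim()
end

EvalFormatSketch.cases("## greets\n### Prompt\nHi, I'm Sam\n### Expect\nGreets by name.")
# => [%{name: "greets", prompt: "Hi, I'm Sam", rubric: "Greets by name."}]
```

In real code, SkillKit.Eval.parse/2 (documented below) is the entry point; the sketch only shows why a "### Prompt" heading never becomes a case name while any other "##" heading does.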

Skill under test

When the EVAL.md lives next to a SKILL.md, that skill is loaded automatically — no frontmatter needed. To test a skill elsewhere (or add tools / pin a model), use optional frontmatter:

---
skills:
  - "skills/greeter"
tools:
  - "SkillKit.Tools.Shell"
model: "anthropic:claude-sonnet-4-6"
system: "You are being evaluated."
---
## greets the user by name
...

| Frontmatter (all optional) | Notes |
| --- | --- |
| skills | Skill providers under test (paths or module names). Overrides location inference. |
| tools | Tool providers, same forms as skills. |
| model | Model URI for the eval agent; falls back to the default provider. |
| system | System prompt for the eval agent. |

Headings matching Prompt/Expect (case-insensitive, any level) are section markers; every other ## heading starts a new case. Other heading levels inside a section stay part of its content.

Provider strings

Entries in skills/tools are resolved like SkillKit.start_agent/2 providers: a value starting with an uppercase letter is treated as an Elixir module name ("SkillKit.Tools.Shell"), anything else as a filesystem path ("skills/greeter").
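The uppercase rule can be sketched in a few lines. ProviderStringSketch is a hypothetical module, the tagged tuples it returns are illustrative rather than SkillKit's internal representation, and treating "uppercase" as ASCII A-Z is an assumption.

```elixir
defmodule ProviderStringSketch do
  # Hypothetical illustration; SkillKit's actual resolver may differ.

  # A leading ASCII uppercase letter means "Elixir module name" (assumed A-Z only).
  def resolve(<<first, _rest::binary>> = value) when first in ?A..?Z do
    {:module, Module.concat([value])}
  end

  # Anything else is treated as a filesystem path.
  def resolve(value) when is_binary(value), do: {:path, value}
end

ProviderStringSketch.resolve("SkillKit.Tools.Shell")
# => {:module, SkillKit.Tools.Shell}

ProviderStringSketch.resolve("skills/greeter")
# => {:path, "skills/greeter"}
```

One practical consequence of a rule like this is that a relative path starting with a capital letter would be read as a module name, so prefixing such paths with "./" is a safe habit.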

Colocating evals with what they test

Beyond a standalone EVAL.md, an eval can be anchored to its subject, so that the result cache is keyed on that subject's source and the eval re-runs when the source changes:

  • Application code: adding use SkillKit.Eval to a module enables a doctest-style @eval attribute (see __using__/1); alternatively, a <source>.EVAL.md sidecar next to <source>.ex infers its subject module from that file. The module's compiled MD5 anchors the cache.
  • A whole agent: an EVAL.md next to an AGENT.md (or an explicit agent:) runs that entire agent (identity, skills, sub-agents) via SkillKit.start_agent/2 and judges its transcript (see agent_source/1). The agent directory's contents anchor the cache.

See the evals guide for worked examples.
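As a rough illustration of anchoring a cache on a module's compiled MD5, the sketch below reads the checksum that every compiled module exposes through module_info(:md5). CacheAnchorSketch, module_fingerprint/1, and stale?/2 are hypothetical names; how SkillKit actually builds and stores its cache keys is not described here.

```elixir
defmodule CacheAnchorSketch do
  # Hypothetical helper; SkillKit's real cache key format is not shown in these docs.

  # Hex-encoded MD5 of the compiled module, as reported by module_info/1.
  def module_fingerprint(module) when is_atom(module) do
    module.module_info(:md5) |> Base.encode16(case: :lower)
  end

  # A cached result is reusable only while the fingerprint is unchanged.
  def stale?(module, cached_fingerprint) do
    module_fingerprint(module) != cached_fingerprint
  end
end

fingerprint = CacheAnchorSketch.module_fingerprint(Enum)
CacheAnchorSketch.stale?(Enum, fingerprint)
# => false
```

Because the checksum is derived from the compiled code, recompiling a module with changed code yields a different fingerprint, which is what lets a previously passing eval be skipped only while its subject is untouched.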

Summary

Functions

__using__(opts)

Colocates evals with the module they exercise via an @eval attribute.

agent_source(eval)

Resolves the agent an eval targets, if any: the directory of an AGENT.md whose whole identity (system prompt, skills, sub-agents) is run as the subject. Returns the explicit agent: frontmatter, else the eval's own directory when an AGENT.md sits beside it (the sidecar pattern), else nil. When set, the eval runs the agent rather than loading a bare skill.

load_dir(dir)

Loads every eval case under dir.

load_dir!(dir)

Like load_dir/1 but raises on error. Used by SkillKit.Eval.Case at compile time.

load_file(path)

Loads and parses a single EVAL.md file from disk into its list of cases.

parse(content, location \\ nil)

Parses EVAL.md content into a list of %Eval{} cases.

skill_providers(eval)

Resolves the skill providers for an eval.

tool_providers(eval)

Resolves the tool providers for an eval: its explicit tools, plus the eval's subject module when that module is itself a SkillKit.Tool.

Types

provider()

@type provider() :: module() | String.t() | {module(), keyword()}

t()

@type t() :: %SkillKit.Eval{
  agent: String.t() | nil,
  location: String.t() | nil,
  metadata: %{optional(String.t()) => term()},
  model: String.t() | nil,
  module: module() | nil,
  name: String.t() | nil,
  prompt: String.t() | nil,
  rubric: String.t() | nil,
  skills: [provider()],
  system: String.t() | nil,
  tools: [provider()]
}

Functions

__using__(opts)

(macro)

Colocates evals with the module they exercise via an @eval attribute.

use SkillKit.Eval registers an accumulating @eval string attribute. Each value is a chunk of EVAL.md markdown (one or more ## cases, optional frontmatter); they are parsed at compile time and exposed as __skill_evals__/0. Every case records module: __MODULE__, so the eval cache keys on the module's compiled hash — change the module's code (or the @eval text) and the eval re-runs; leave it untouched and a prior pass is reused.

defmodule MyApp.Greeter do
  use SkillKit.Eval

  @eval """
  ## greets the user by name
  ### Prompt
  Hi, I'm Sam
  ### Expect
  Greets the user by name.
  """
  def greet(name), do: ...
end

Point SkillKit.Eval.Case at the module(s) with modules: [MyApp.Greeter].
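For readers unfamiliar with accumulating attributes, the following is a minimal sketch of the general mechanism, written with standard Elixir metaprogramming only. AccumulatingEvalSketch and __eval_chunks__/0 are hypothetical names; the real macro additionally parses each chunk and exposes the result as __skill_evals__/0, which this sketch does not attempt.

```elixir
defmodule AccumulatingEvalSketch do
  # Hypothetical illustration of the pattern; not SkillKit's implementation.

  defmacro __using__(_opts) do
    quote do
      # Each @eval "..." now appends to a list instead of overwriting.
      Module.register_attribute(__MODULE__, :eval, accumulate: true)
      @before_compile AccumulatingEvalSketch
    end
  end

  defmacro __before_compile__(env) do
    # Accumulated values are stored newest first, so reverse to source order.
    chunks = env.module |> Module.get_attribute(:eval) |> Enum.reverse()

    quote do
      def __eval_chunks__, do: unquote(chunks)
    end
  end
end

defmodule AccumulatingEvalSketch.Demo do
  use AccumulatingEvalSketch

  @eval "## first case"
  @eval "## second case"
end

AccumulatingEvalSketch.Demo.__eval_chunks__()
# => ["## first case", "## second case"]
```

Collecting the chunks in a before-compile hook means the attribute text becomes part of the compiled module, which is consistent with the note above that editing the @eval text changes the module's hash.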

agent_source(eval)

@spec agent_source(t()) :: String.t() | nil

Resolves the agent an eval targets, if any — the directory of an AGENT.md whose whole identity (system prompt, skills, sub-agents) is run as the subject. Returns the explicit agent: frontmatter, else the eval's own directory when an AGENT.md sits beside it (the sidecar pattern), else nil. When set, the eval runs the agent rather than loading a bare skill.

load_dir(dir)

@spec load_dir(Path.t()) :: {:ok, [t()]} | {:error, {Path.t(), term()}}

Loads every eval case under dir.

Discovers files named EVAL.md, *.eval.md, or *.EVAL.md at any depth and flattens their cases. A <source>.EVAL.md next to <source>.ex infers its subject module from that file. Returns {:ok, evals} ordered by path, or {:error, {path, reason}} on the first file that fails to parse.
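The discovery rule can be approximated with Path.wildcard/1 and a filename check. This is a sketch under the assumption that matching is done on the file's basename; EvalDiscoverySketch is a hypothetical module and does no parsing.

```elixir
defmodule EvalDiscoverySketch do
  # Hypothetical illustration of the documented naming rule; not SkillKit's code.

  # True for EVAL.md, *.eval.md, and *.EVAL.md (checked on the basename).
  def eval_file?(path) do
    base = Path.basename(path)
    base == "EVAL.md" or String.ends_with?(base, [".eval.md", ".EVAL.md"])
  end

  # Finds matching files at any depth under dir, ordered by path.
  def discover(dir) do
    dir
    |> Path.join("**/*.md")
    |> Path.wildcard()
    |> Enum.filter(&eval_file?/1)
    |> Enum.sort()
  end
end

EvalDiscoverySketch.eval_file?("lib/my_app/greeter.EVAL.md")
# => true

EvalDiscoverySketch.eval_file?("skills/greeter/SKILL.md")
# => false
```

When calling the real load_dir/1, match on both documented shapes: {:ok, evals} for the flattened cases, and {:error, {path, reason}} to report which file failed to parse.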

load_dir!(dir)

@spec load_dir!(Path.t()) :: [t()]

Like load_dir/1 but raises on error. Used by SkillKit.Eval.Case at compile time.

load_file(path)

@spec load_file(Path.t()) :: {:ok, [t()]} | {:error, term()}

Loads and parses a single EVAL.md file from disk into its list of cases.

parse(content, location \\ nil)

@spec parse(String.t(), String.t() | nil) :: {:ok, [t()]} | {:error, term()}

Parses EVAL.md content into a list of %Eval{} cases.

Returns {:ok, evals} or {:error, reason}. location is stored on each case for diagnostics and skill inference.

skill_providers(eval)

@spec skill_providers(t()) :: [provider()]

Resolves the skill providers for an eval.

Returns the explicit skills when set, otherwise infers a sibling SKILL.md next to the eval's file (loaded via SkillKit.Eval.SkillFile), or [] when neither applies.
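The fallback order described above could look roughly like this. SkillInferenceSketch is hypothetical, it works on plain maps rather than %SkillKit.Eval{} structs, and the {:sibling_skill, path} tuple is a placeholder: the actual provider value produced via SkillKit.Eval.SkillFile is not specified here. The file-existence check is injectable so the logic can be exercised without touching disk.

```elixir
defmodule SkillInferenceSketch do
  # Hypothetical restatement of the documented fallback order; not SkillKit's code.

  def skill_providers(eval, exists? \\ &File.exists?/1)

  # 1. Explicit skills always win.
  def skill_providers(%{skills: [_ | _] = skills}, _exists?), do: skills

  # 2. Otherwise look for a SKILL.md next to the eval's own file.
  def skill_providers(%{location: location}, exists?) when is_binary(location) do
    candidate = location |> Path.dirname() |> Path.join("SKILL.md")
    # Placeholder tuple; the real provider form is an assumption left open.
    if exists?.(candidate), do: [{:sibling_skill, candidate}], else: []
  end

  # 3. No explicit skills and no known location: nothing to load.
  def skill_providers(_eval, _exists?), do: []
end

SkillInferenceSketch.skill_providers(
  %{skills: [], location: "skills/greeter/EVAL.md"},
  fn _path -> true end
)
# => [{:sibling_skill, "skills/greeter/SKILL.md"}]
```

This also shows why parse/2 stores location on each case: without it, the sibling lookup has no directory to start from.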

tool_providers(eval)

@spec tool_providers(t()) :: [provider()]

Resolves the tool providers for an eval — its explicit tools, plus the eval's subject module when that module is itself a SkillKit.Tool.