CMDCEval.Suite behaviour (cmdc_eval v0.1.0)


Eval Suite behaviour. Each Suite (for example BFCL, tau2-bench, or an internal suite) implements three required callbacks.

The three required callbacks

name/0

Returns the Suite's unique identifier (an atom or a string; it is stored in Report.suite_name).

cases/0

Returns a [CMDCEval.Case.t()] list; the Runner runs the cases concurrently.

assert/2

Receives (case, agent_reply_text) and returns a boolean() indicating whether the case passed.

Optional callbacks

default_tools/0

Returns the Suite-wide default list of tool modules (overridden by a Case's :tools). Defaults to [].

cost_estimator/1

Receives %{model, tokens_in, tokens_out} and returns the estimated cost_usd. Defaults to 0.0.
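
As a rough illustration of the two optional callbacks, the sketch below shows one way they might be written. It is only a sketch under stated assumptions: the module name, the price table, and the model name "model-a" are hypothetical, the prices are not real pricing, and the input map is assumed to use atom keys. In a real suite these functions would sit in a module that declares @behaviour CMDCEval.Suite and be marked with @impl true; both are left out here so the snippet compiles on its own.

```elixir
defmodule MyApp.OptionalCallbacksSketch do
  # Hypothetical price table: USD per one million tokens, keyed by model name.
  # These numbers are placeholders for illustration, not real pricing.
  @prices %{
    "model-a" => %{input: 1.0, output: 2.0}
  }

  # No suite-wide tools; a Case may still set its own :tools.
  def default_tools, do: []

  # Assumes the map uses atom keys :model, :tokens_in and :tokens_out.
  def cost_estimator(%{model: model, tokens_in: tokens_in, tokens_out: tokens_out}) do
    case Map.fetch(@prices, model) do
      {:ok, %{input: price_in, output: price_out}} ->
        (tokens_in * price_in + tokens_out * price_out) / 1_000_000

      :error ->
        # Unknown model: fall back to the documented default of 0.0.
        0.0
    end
  end
end
```

Falling back to 0.0 for an unknown model mirrors the documented default, so a missing price entry does not crash the cost calculation.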

Example implementation

defmodule MyApp.MyEvalSuite do
  @behaviour CMDCEval.Suite

  alias CMDCEval.Case

  @impl true
  def name, do: "my_eval"

  @impl true
  def cases do
    [
      Case.new(id: "basic", input: "1 + 1 = ?", expected: ~r/2/),
      Case.new(id: "tool_call", input: "read README.md",
               expected: %{tool_called: "read_file"})
    ]
  end

  @impl true
  def assert(%Case{expected: %Regex{} = re}, reply), do: Regex.match?(re, reply)
  def assert(%Case{expected: %{tool_called: name}}, reply) do
    # Check whether the Agent called a given tool (this needs events data, passed through by the Runner).
    # Simplified here to a plain string match on the reply.
    String.contains?(reply, name)
  end
end
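
The assert/2 clauses in the example dispatch on the shape of the case's expected value. If a case carries an expected value that no clause matches, the call raises a FunctionClauseError. One possible way to avoid that is a catch-all clause that reports the case as failed, sketched below. SketchCase is a hypothetical stand-in for CMDCEval.Case with only the fields used here, so the snippet runs without the library; the catch-all is a suggestion, not something the behaviour requires.

```elixir
defmodule SketchCase do
  # Hypothetical stand-in for CMDCEval.Case; only the fields used in this sketch.
  defstruct [:id, :input, :expected]
end

defmodule AssertSketch do
  # Regex expectation: the reply must match the pattern.
  def assert(%SketchCase{expected: %Regex{} = re}, reply), do: Regex.match?(re, reply)

  # Tool expectation, simplified to a string match on the reply text.
  def assert(%SketchCase{expected: %{tool_called: name}}, reply) do
    String.contains?(reply, name)
  end

  # Catch-all: an unrecognised expected value counts as a failed case
  # instead of raising FunctionClauseError.
  def assert(%SketchCase{}, _reply), do: false
end
```

Calling AssertSketch.assert(%SketchCase{expected: ~r/2/}, "The answer is 2") takes the first clause, while a case whose expected value is, say, an integer falls through to the catch-all and is reported as failed.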

Summary

Callbacks

assert(t, reply)

@callback assert(CMDCEval.Case.t(), reply :: String.t()) :: boolean()

cases()

@callback cases() :: [CMDCEval.Case.t()]

cost_estimator(map)

(optional)
@callback cost_estimator(map()) :: float()

default_tools()

(optional)
@callback default_tools() :: [module()]

name()

@callback name() :: String.t() | atom()