CMDC Agent benchmark harness: runs public benchmarks and custom suites, and writes JSONL reports that LangSmith, Langfuse, and Datadog can all consume from the same source.

cmdc_eval is a standalone sub-library of cmdc. It provides:

  • Standard Suite behaviour (CMDCEval.Suite): implement 3 callbacks to register an eval set
  • Assertion context (CMDCEval.Context): a Suite can use assert/3 to read the reply, tool outputs, plugin events, and metadata
  • Generic RAG assertions (CMDCEval.Assertions.RAG): cover recall / citation / grounding / ACL / faithfulness / correctness
  • Workflow Eval seam (CMDCEval.Workflow + CMDCEval.Assertions.Workflow): computes gate metrics such as completion rate, branch coverage, and human_task SLA from the orchestrator event ledger
  • Reasoning Eval seam (CMDCEval.Reasoning + CMDCEval.Assertions.Reasoning): computes branch, score, revision, token, and answer gates from reasoning_* events or a Runner payload
  • Built-in Internal Suite: a regression benchmark that verifies cmdc kernel internals (DAG / Steering / Checkpoint)
  • BFCL v3 Suite integration framework: Berkeley Function Calling Leaderboard, public data
  • Mix.Tasks.Cmdc.Eval CLI: run evals and write JSONL with a single command
  • Stable JSONL report schema: suite / case_id / model / pass / latency_ms / tokens_in / tokens_out / cost_usd / events_digest

Installation

def deps do
  [
    {:cmdc, "~> 0.6"},
    {:cmdc_eval, "~> 0.3"}
  ]
end

Quick Start

1. Run the Internal Suite (cmdc kernel self-verification)

$ mix cmdc.eval --suite=internal --model="anthropic:claude-sonnet-4-5" --report=internal.jsonl

Output:


Suite:   internal
Model:   anthropic:claude-sonnet-4-5
Cases:   5
Pass:    5  (rate=1.0)
Fail:    0
Latency: avg=1234.0ms total=6170ms
Tokens:  in=234 out=567
Cost:    $0.0123
Report:  internal.jsonl

2. Run BFCL v3 (public benchmark)

# 1. Fetch the fixtures first (v0.1 writes placeholders; see the cmdc.eval.fetch_bfcl moduledoc for the real data)
$ mix cmdc.eval.fetch_bfcl

# 2. Run the evals
$ mix cmdc.eval --suite=bfcl --model="openai:gpt-4o" --report=bfcl.jsonl

3. Programmatic use

{:ok, report} = CMDCEval.run(
  suite: CMDCEval.Suites.Internal,
  model: "anthropic:claude-sonnet-4-5",
  concurrency: 4,
  timeout_ms: 60_000
)

report.summary
# => %{total: 5, pass: 5, fail: 0, pass_rate: 1.0, ...}

report.runs
# => [%CMDCEval.Run{case_id: "basic_text", pass: true, ...}, ...]
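The concurrency and timeout_ms options above suggest a bounded worker pool with a per-case deadline. The following is a minimal sketch of that pattern using Task.async_stream/3 from the Elixir standard library. ConcurrencySketch, run_cases/3, and the result map shape are hypothetical names chosen for illustration; how CMDCEval.run actually schedules cases is not described here.

```elixir
defmodule ConcurrencySketch do
  @moduledoc "Hypothetical illustration only; not the CMDCEval runner."

  # Runs `fun` once per case id with at most `:concurrency` tasks in flight.
  # A case that exceeds `:timeout_ms` is killed and reported as a failure.
  def run_cases(case_ids, fun, opts \\ []) do
    concurrency = Keyword.get(opts, :concurrency, 4)
    timeout_ms = Keyword.get(opts, :timeout_ms, 60_000)

    case_ids
    |> Task.async_stream(fun,
      max_concurrency: concurrency,
      timeout: timeout_ms,
      on_timeout: :kill_task
    )
    # async_stream keeps input order by default, so zipping is safe.
    |> Enum.zip(case_ids)
    |> Enum.map(fn
      {{:ok, result}, id} -> %{case_id: id, pass: result == true, error: nil}
      {{:exit, reason}, id} -> %{case_id: id, pass: false, error: inspect(reason)}
    end)
  end
end
```

With on_timeout: :kill_task a slow case becomes a failed run with an error string instead of crashing the whole batch, which matches the idea of recording an error per run in the report.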

4. Custom Suite

defmodule MyApp.MySuite do
  @behaviour CMDCEval.Suite

  alias CMDCEval.Case

  @impl true
  def name, do: "my_app_evals"

  @impl true
  def cases do
    [
      Case.new(id: "task_a", input: "Solve task A", expected: ~r/done/),
      Case.new(id: "task_b", input: "Solve task B", expected: ~r/completed/)
    ]
  end

  @impl true
  def assert(%Case{expected: %Regex{} = re}, reply), do: Regex.match?(re, reply)
end

# Run it
{:ok, report} = CMDCEval.run(
  suite: MyApp.MySuite,
  model: "anthropic:claude-sonnet-4-5"
)

Report JSONL Schema

Each line is the JSON for one Run. Downstream tools such as LangSmith, Langfuse, and Datadog can consume it directly:

{
  "suite": "internal",
  "case_id": "basic_text",
  "model": "anthropic:claude-sonnet-4-5",
  "pass": true,
  "latency_ms": 1234,
  "tokens_in": 234,
  "tokens_out": 567,
  "cost_usd": 0.0123,
  "events_digest": null,
  "error": null,
  "timestamp": "2026-05-18T12:34:56Z",
  "metadata": {"category": "smoke"}
}

The fields are stable: no field is removed or changed in a minor release, and new fields are passed through via metadata.
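Because the schema is one JSON object per line, a report can be post-processed with a few lines of code. The sketch below reads a report and recomputes a summary from the pass and cost_usd fields. ReportSketch and its functions are hypothetical, and the decoder is an assumption: it uses the JSON module built into Elixir 1.18 and later (on older versions, a library such as Jason would fill the same role).

```elixir
defmodule ReportSketch do
  @moduledoc "Hypothetical helper for reading a JSONL report; not part of cmdc_eval."

  # Decodes every non-empty line of the file into a map with string keys.
  # Assumes Elixir >= 1.18 for the built-in JSON module.
  def read_runs(path) do
    path
    |> File.stream!()
    |> Stream.map(&String.trim/1)
    |> Stream.reject(&(&1 == ""))
    |> Enum.map(&JSON.decode!/1)
  end

  # Aggregates decoded runs; a null cost_usd is counted as 0.
  def summarize(runs) do
    total = length(runs)
    pass = Enum.count(runs, &(&1["pass"] == true))
    cost = runs |> Enum.map(&(&1["cost_usd"] || 0)) |> Enum.sum()
    pass_rate = if total == 0, do: 0.0, else: pass / total

    %{total: total, pass: pass, fail: total - pass, pass_rate: pass_rate, cost_usd: cost}
  end
end
```

A typical call would be ReportSketch.read_runs("internal.jsonl") |> ReportSketch.summarize().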

RAG Suite example

defmodule MyApp.RAGEvalSuite do
  @behaviour CMDCEval.Suite

  alias CMDCEval.{Assertions.RAG, Case}

  def name, do: "rag_regression"

  def cases do
    [
      Case.new(
        id: "policy-approval",
        input: "Do high-risk operations require approval?",
        expected: %{rag: %{expected_chunk_ids: ["approval-policy-c1"]}},
        metadata: %{allowed_collections: ["policies"]}
      )
    ]
  end

  def assert(_case, _reply, context) do
    RAG.recall_at_k(context, 5, 1.0) and
      RAG.contains_citation(context) and
      RAG.no_unauthorized_source(context) and
      RAG.faithfulness_min(context, 0.8)
  end
end
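For readers unfamiliar with the recall metric used in the suite above, recall at k is the fraction of expected chunk ids that appear among the first k retrieved ids. The sketch below computes it on plain lists. RecallSketch is a hypothetical module, and treating an empty expectation as recall 1.0 is an assumption of this sketch; the internals of RAG.recall_at_k are not shown in this document.

```elixir
defmodule RecallSketch do
  @moduledoc "Hypothetical recall@k on plain lists; not the CMDCEval.Assertions.RAG code."

  # retrieved_ids: ranked list of chunk ids, best first.
  # expected_ids: ids that a correct retrieval should contain.
  def recall_at_k(retrieved_ids, expected_ids, k) when is_integer(k) and k > 0 do
    expected = MapSet.new(expected_ids)
    top_k = retrieved_ids |> Enum.take(k) |> MapSet.new()

    case MapSet.size(expected) do
      # Assumption: nothing expected means nothing can be missed.
      0 -> 1.0
      n -> MapSet.size(MapSet.intersection(expected, top_k)) / n
    end
  end
end
```

A threshold check in the style of the suite above would then compare the returned value against the minimum, for example recall_at_k(ids, expected, 5) >= 1.0.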

Workflow Eval example

alias CMDCEval.Assertions.Workflow

{:ok, snapshot} = CMDCOrchestrator.status(run_id)

context =
  CMDCEval.Workflow.from_status(snapshot,
    expected_branches: ["approved", "default"]
  )

Workflow.gate(context,
  completion_rate_min: 1.0,
  node_failure_rate_lte: 0.0,
  branch_coverage_min: 1.0,
  human_task_sla_ms_lte: 86_400_000,
  retry_count_lte: 2,
  cost_usd_lte: 1.0,
  latency_ms_lte: 300_000,
  require_fork_join_satisfied: true
)
# => true / false

CMDCEval.Assertions.Workflow.gate_failures(context, human_task_sla_ms_lte: 1_000)
# => [%{metric: :human_task_sla_ms, expected: {:<=, 1000}, actual: 4200}]

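Workflow Eval consumes only the stable ledger shape of Run / NodeRun / RunEvent; it does not depend on the Phoenix schema, Trace Viewer, or Eval Dashboard. An enterprise platform can run these gates before approving an AgentSpec / Workflow release and, when a gate fails, show the gate_failures/2 output to the publisher.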
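To make the gate semantics concrete, here is a minimal sketch of a threshold gate over a plain metrics map. It assumes that a key ending in _lte means "actual must be at most the limit" and a key ending in _min means "actual must be at least the limit", and it reuses the failure shape shown in the gate_failures example above. GateSketch is hypothetical: it takes a metrics map rather than a CMDCEval context, it does not handle require_fork_join_satisfied, and treating a missing metric as a failure is this sketch's own choice.

```elixir
defmodule GateSketch do
  @moduledoc "Hypothetical threshold gate; not the CMDCEval.Assertions.Workflow code."

  # Returns one failure map per threshold that the metrics do not satisfy.
  def gate_failures(metrics, thresholds) when is_map(metrics) and is_list(thresholds) do
    Enum.flat_map(thresholds, fn {key, limit} ->
      {metric, op} = parse_key(key)
      actual = Map.get(metrics, metric)

      if satisfied?(op, actual, limit) do
        []
      else
        [%{metric: metric, expected: {op, limit}, actual: actual}]
      end
    end)
  end

  # The gate passes only when no threshold fails.
  def gate(metrics, thresholds), do: gate_failures(metrics, thresholds) == []

  # Splits e.g. :retry_count_lte into {:retry_count, :<=}.
  defp parse_key(key) do
    name = Atom.to_string(key)

    cond do
      String.ends_with?(name, "_lte") -> {strip(name, "_lte"), :<=}
      String.ends_with?(name, "_min") -> {strip(name, "_min"), :>=}
      true -> raise ArgumentError, "unsupported threshold key: #{inspect(key)}"
    end
  end

  defp strip(name, suffix), do: name |> String.replace_suffix(suffix, "") |> String.to_atom()

  # A missing metric never satisfies a threshold.
  defp satisfied?(_op, nil, _limit), do: false
  defp satisfied?(:<=, actual, limit), do: actual <= limit
  defp satisfied?(:>=, actual, limit), do: actual >= limit
end
```

Returning structured failures rather than a bare boolean is what lets a release UI explain which metric blocked the publish and by how much.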

Reasoning Eval example

events = [
  {:reasoning_thought, %{strategy: "trm", stage: :start, depth: 0}},
  {:reasoning_score, %{strategy: "trm", branch_id: "b1", score: 0.91, depth: 1}},
  {:reasoning_done,
   %{
     strategy: "trm",
     depth: 1,
     token_usage: %{total_tokens: 120},
     answer: %{answer: "final", candidates: [%{id: "b1", score: 0.91}]}
   }}
]

context = CMDCEval.Reasoning.from_events(events)

CMDCEval.Assertions.Reasoning.gate(context,
  require_done: true,
  require_answer: true,
  best_score_min: 0.8,
  total_tokens_lte: 500,
  revision_count_lte: 3
)
# => true / false

Reasoning Eval does not depend on a real LLM. It can directly consume Trace Viewer replays, test fixtures, or the payload returned by CMDC.Reasoning.Runner.run/4.
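Since the input is just a list of event tuples, the kind of folding involved can be illustrated without any LLM or library. The sketch below reduces events of the shape used in the example above into a small summary holding the done flag, the final answer, the best score, and the total token count. ReasoningFoldSketch and its summary keys are hypothetical; the actual structure returned by CMDCEval.Reasoning.from_events is not documented here.

```elixir
defmodule ReasoningFoldSketch do
  @moduledoc "Hypothetical event fold; not the CMDCEval.Reasoning code."

  @initial %{done?: false, answer: nil, best_score: nil, total_tokens: 0}

  # Folds {event_name, payload} tuples into a summary map.
  def summarize(events), do: Enum.reduce(events, @initial, &step/2)

  # Track the highest score seen on any branch.
  defp step({:reasoning_score, %{score: score}}, acc) do
    %{acc | best_score: max_score(acc.best_score, score)}
  end

  # The done event carries the answer and the token usage.
  defp step({:reasoning_done, payload}, acc) do
    tokens = get_in(payload, [:token_usage, :total_tokens]) || 0
    answer = get_in(payload, [:answer, :answer])
    %{acc | done?: true, answer: answer, total_tokens: acc.total_tokens + tokens}
  end

  # Events this sketch does not interpret are ignored.
  defp step(_other, acc), do: acc

  defp max_score(nil, score), do: score
  defp max_score(current, score), do: max(current, score)
end
```

Gates such as best_score_min or total_tokens_lte then reduce to simple comparisons against fields of a summary like this one.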

v0.2 scope

Added

  • CMDCEval.Context: assert/3 can read the Agent reply, tool outputs, Plugin events, and metadata
  • CMDCEval.Runner: automatically subscribes to the CMDC EventBus events of the current eval session and writes them to Run.metadata.eval_context
  • CMDCEval.Assertions.RAG: recall_at_k / contains_citation / grounded_answer / no_unauthorized_source / faithfulness_min / correctness_min
  • Offline fixture support: RAG assertions can run directly against map fixtures, with no dependency on Arcana or a real LLM
  • CMDCEval.Workflow + CMDCEval.Assertions.Workflow: a minimal WorkflowEval seam based on the orchestrator event ledger; it does not provide a full Eval Dashboard or data flywheel

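v0.1 scope

Implemented

🔁 Deferred to v0.2

  • Automatic fetch of BFCL v3 fixtures (v0.1 writes placeholders; the upstream repository must be fetched manually with git clone)
  • tau2-bench airline suite
  • MemoryAgentBench subset (depends on the cmdc_memory_pg PG integration)
  • Direct LangSmith sync (OTLP)
  • All 5 BFCL subcategories (multiple / parallel / parallel_multiple / multi_turn)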

CLI exit codes

  • 0: all cases pass
  • 1: at least one case failed
  • 2: the Suite has no cases (for example, the BFCL fixtures have not been fetched)
  • 3: the Suite module does not exist or is invalid
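The exit codes listed above can be read as a mapping from the outcome of a run to an integer. The sketch below expresses that mapping as a pure function. The total and fail keys follow the report.summary example earlier in this README, while ExitCodeSketch and the {:error, :invalid_suite} tuple are hypothetical names used only for illustration; the real CLI may represent these outcomes differently.

```elixir
defmodule ExitCodeSketch do
  @moduledoc "Hypothetical mapping from a run outcome to the documented exit codes."

  # 3: the suite module could not be loaded or is invalid (error shape assumed).
  def exit_code({:error, :invalid_suite}), do: 3

  # 2: the suite produced no cases, e.g. fixtures were never fetched.
  def exit_code({:ok, %{total: 0}}), do: 2

  # 0: at least one case ran and none failed.
  def exit_code({:ok, %{fail: 0}}), do: 0

  # 1: one or more cases failed.
  def exit_code({:ok, %{fail: fail}}) when fail > 0, do: 1
end
```

The empty-suite clause is matched before the zero-failure clause, so a suite with no cases is reported as code 2 rather than as a vacuous success. A CI wrapper could pass the result to System.halt/1.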

License

Apache-2.0