AgentSea.Evaluate (agentsea_evaluate v0.1.0)

Copy Markdown

Run scoring metrics over a dataset, concurrently, and aggregate the results.

Examples are evaluated in parallel with Task.async_stream (concurrency is a setting, not hand-rolled). Each metric is {module, opts} (or just module).

Example

dataset = [
  %{id: 1, input: "capital of France?", output: "Paris", expected: "Paris"},
  %{id: 2, input: "capital of France?", output: "London", expected: "Paris"}
]

%{summary: summary} =
  AgentSea.Evaluate.run(dataset, [AgentSea.Evaluate.Metric.ExactMatch])

summary["exact_match"].pass_rate  #=> 0.5

Summary

Types

metric()

@type metric() :: module() | {module(), keyword()}

result()

@type result() :: %{
  id: term() | nil,
  metrics: %{required(String.t()) => AgentSea.Evaluate.Metric.result()}
}

Functions

run(examples, metrics, opts \\ [])

@spec run([AgentSea.Evaluate.Metric.example()], [metric()], keyword()) :: %{
  results: [result()],
  summary: map()
}