AgentSea.Evaluate
(agentsea_evaluate v0.1.0)
Runs scoring metrics over a dataset concurrently and aggregates the results.
Examples are evaluated in parallel with Task.async_stream; the concurrency
level is a setting rather than hand-rolled. Each metric is given as
{module, opts}, or as a bare module.
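The description above can be illustrated with a minimal, self-contained sketch. It is not AgentSea's implementation: the module `EvalSketch`, its `score/2` callback shape, and the `:max_concurrency` option name are all assumptions made for illustration. Only `Task.async_stream/3` and its `:max_concurrency` and `:ordered` options come from the Elixir standard library.

```elixir
defmodule EvalSketch do
  # Hypothetical stand-in metric; the real metric behaviour may differ.
  defmodule ExactMatch do
    def score(example, _opts) do
      if example.output == example.expected, do: 1.0, else: 0.0
    end
  end

  # Accept either a {module, opts} tuple or a bare module.
  def normalize({mod, opts}) when is_atom(mod) and is_list(opts), do: {mod, opts}
  def normalize(mod) when is_atom(mod), do: {mod, []}

  # Score every example concurrently. The :max_concurrency option name is
  # an assumption; it is passed straight through to Task.async_stream/3.
  def run(dataset, metrics, opts \\ []) do
    metrics = Enum.map(metrics, &normalize/1)
    max = Keyword.get(opts, :max_concurrency, System.schedulers_online())

    dataset
    |> Task.async_stream(
      fn example ->
        scores =
          Map.new(metrics, fn {mod, metric_opts} ->
            {mod, mod.score(example, metric_opts)}
          end)

        %{id: example.id, scores: scores}
      end,
      max_concurrency: max,
      ordered: true
    )
    # Each successful task yields {:ok, value}; unwrap it.
    |> Enum.map(fn {:ok, result} -> result end)
  end
end
```

Keeping concurrency as an option means callers can lower it for rate-limited metrics or raise it for cheap ones without changing the evaluation code.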
Example
    dataset = [
      %{id: 1, input: "capital of France?", output: "Paris", expected: "Paris"},
      %{id: 2, input: "capital of France?", output: "London", expected: "Paris"}
    ]

    %{summary: summary} =
      AgentSea.Evaluate.run(dataset, [AgentSea.Evaluate.Metric.ExactMatch])

    summary["exact_match"].pass_rate #=> 0.5
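The 0.5 in the example follows from one of the two examples matching its expected output. A rough sketch of how such a pass rate could be computed from per-example scores is shown here; `PassRateSketch` is a hypothetical helper, and the library's actual aggregation logic and pass threshold are not documented in this section.

```elixir
defmodule PassRateSketch do
  # scores: per-example scores for one metric, where 1.0 means a pass.
  # Treating an empty list as 0.0 is an assumption made for this sketch.
  def pass_rate([]), do: 0.0

  def pass_rate(scores) do
    passed = Enum.count(scores, &(&1 >= 1.0))
    passed / length(scores)
  end
end

# One pass ("Paris") and one fail ("London"), as in the dataset above.
PassRateSketch.pass_rate([1.0, 0.0]) #=> 0.5
```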
Summary
Types
Functions
@spec run([AgentSea.Evaluate.Metric.example()], [metric()], keyword()) :: %{ results: [result()], summary: map() }