CouncilEx.Reliability behaviour (CouncilEx v0.1.0)


Per-member historical reliability tracking for adaptive weighting.

Inspired by Wu et al. Council Mode (arXiv:2604.02923): "Agents demonstrating higher accuracy on similar historical queries receive elevated weights during aggregation."

Pieces

Persistent backends (Postgres, Redis, etc.) implement the same behaviour. They are out of scope for core; host apps wire their own.

Lifecycle

  1. Run a council. Members produce outputs.
  2. An out-of-band signal (eval harness, downstream metric, or user feedback) decides whether a member's output was correct.
  3. Caller invokes Reliability.record(:member_id, query_features, match?) once per (member, query).
  4. Future runs: WeightedConsensus (or any weighted aggregator) calls Reliability.score(:member_id, query_features) to fetch a reliability prior. On a cold start, score returns nil and the aggregator falls back to equal weights.
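Step 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the WeightedConsensus implementation: the WeightSketch module is hypothetical, and so is its policy for mixing members with and without history (a nil score takes the mean of the known scores, so a council with no history at all ends up equally weighted).

```elixir
defmodule WeightSketch do
  @moduledoc "Hypothetical helper for illustration; not part of CouncilEx."

  # Turns %{member_id => score | nil} into weights that sum to 1.0.
  def weights(scores) when map_size(scores) > 0 do
    known = for {_id, s} <- scores, is_number(s), do: s

    # Assumed policy: a member with no history gets the mean known score,
    # or 1.0 when nobody has history (which yields equal weights).
    prior = if known == [], do: 1.0, else: Enum.sum(known) / length(known)

    raw = Map.new(scores, fn {id, s} -> {id, s || prior} end)
    total = raw |> Map.values() |> Enum.sum()

    if total == 0 do
      # Every known score is zero: fall back to equal weights.
      equal = 1.0 / map_size(raw)
      Map.new(raw, fn {id, _w} -> {id, equal} end)
    else
      Map.new(raw, fn {id, w} -> {id, w / total} end)
    end
  end
end
```

In a real aggregator, the input map would be built by calling Reliability.score/3 once per member for the current query's features.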

This module is the chicken-and-egg layer: you need ground-truth signals to populate it, and bench/ is the obvious source.

Configuration

config :council_ex, :reliability_store, CouncilEx.Reliability.ETS

Or pass a :store option to score/3 or record/4:

Reliability.score(:m1, %{features: ...}, store: MyApp.PgStore)
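One plausible way the two configuration paths combine is sketched below. The lookup order (per-call :store option first, then application config) and the use of CouncilEx.Reliability.ETS as the final default are assumptions, and StoreResolutionSketch is a hypothetical module rather than library code.

```elixir
defmodule StoreResolutionSketch do
  @moduledoc "Hypothetical resolver for illustration; not part of CouncilEx."

  # Assumed default when neither the option nor the config is set.
  @default CouncilEx.Reliability.ETS

  # Resolves the store module: per-call option, then app config, then default.
  def resolve(opts \\ []) do
    Keyword.get_lazy(opts, :store, fn ->
      Application.get_env(:council_ex, :reliability_store, @default)
    end)
  end
end
```

Under this assumed order, a per-call option lets one test or one request use a different backend without touching global config.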

Summary

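Functions

record(member_id, query_features, match?, opts \\ [])

Record one outcome. match? is true when the member's output was correct.

score(member_id, query_features, opts \\ [])

Look up a reliability score for a member on a query.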

Callbacks

record(member_id, query_features, match?)

@callback record(
  member_id :: atom() | String.t(),
  query_features :: map(),
  match? :: boolean()
) :: :ok | {:error, term()}

score(member_id, query_features)

@callback score(
  member_id :: atom() | String.t(),
  query_features :: map()
) :: float() | nil
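A custom store only has to provide these two callbacks. The sketch below is a hypothetical in-memory store built on Agent. MyApp.AgentStore is an illustrative name, and ignoring query_features (one hit rate per member) is a simplifying assumption; a real backend would probably bucket history by feature. The behaviour is declared conditionally so the sketch also compiles as a standalone script.

```elixir
defmodule MyApp.AgentStore do
  @moduledoc "Hypothetical in-memory store for illustration; not part of CouncilEx."
  use Agent

  # Declare the behaviour only when council_ex is available.
  if Code.ensure_loaded?(CouncilEx.Reliability) do
    @behaviour CouncilEx.Reliability
  end

  def start_link(_opts \\ []) do
    # State maps member_id => {hits, total}.
    Agent.start_link(fn -> %{} end, name: __MODULE__)
  end

  # Assumption: query_features is ignored, so history is kept per member only.
  def record(member_id, _query_features, correct?) when is_boolean(correct?) do
    hit = if correct?, do: 1, else: 0

    Agent.update(__MODULE__, fn state ->
      Map.update(state, member_id, {hit, 1}, fn {hits, total} ->
        {hits + hit, total + 1}
      end)
    end)
  end

  def score(member_id, _query_features) do
    case Agent.get(__MODULE__, &Map.get(&1, member_id)) do
      # No history yet: cold start.
      nil -> nil
      {hits, total} -> hits / total
    end
  end
end
```

The store must be started (for example under the host app's supervision tree) before record/3 or score/2 is called.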

Functions

record(member_id, query_features, match?, opts \\ [])

@spec record(atom() | String.t(), map(), boolean(), keyword()) ::
  :ok | {:error, term()}

Record one outcome. match? is true when the member's output was correct.

score(member_id, query_features, opts \\ [])

@spec score(atom() | String.t(), map(), keyword()) :: float() | nil

Look up a reliability score for a member on a query.

Returns a float in [0.0, 1.0] or nil when no history exists (cold-start). Callers should treat nil as "use equal weight".