This doc maps the heterogeneous-consensus framework in Wu et al.,
Council Mode: A Heterogeneous Multi-Agent Consensus Framework for
Reducing LLM Hallucination and Bias (arXiv:2604.02923), onto the
modules CouncilEx ships in v0.8.
The paper's claim: combining diverse LLMs through a reliability-weighted consensus mechanism (not majority vote) reduces both factual hallucinations and demographic biases compared to single-model baselines, without retraining. CouncilEx already had the parallel-members + chair-synthesis shape. The new modules add the rest: weighted aggregation, per-member confidence, bias diagnostic, and reliability tracking.
TL;DR: what's new in CouncilEx
| Paper component | CouncilEx module | Status |
|---|---|---|
| Heterogeneous Agent Pool | Existing multi-provider adapters | shipped |
| Query Processor | Existing Provider.Adapter behaviour | shipped |
| Response Collector (parallel) | Existing :independent_analysis round | shipped |
| Consensus Engine (weighted) | Councils.WeightedConsensus + Rounds.WeightedSynthesis | new |
| Confidence / uncertainty quantification | MemberResult.confidence + CouncilEx.Confidence | new |
| Bias Detector | CouncilEx.BiasDetector (lexicon backend) | new (diagnostic only) |
| Learned reliability weighting | CouncilEx.Reliability + Reliability.ETS | new |
| Chair synthesis | Existing Rounds.Synthesis | shipped |
Where each piece lives
1. Councils.WeightedConsensus
lib/council_ex/councils/weighted_consensus.ex
Two-round topology: independent_analysis → weighted_synthesis.
Members run in parallel; the chair receives a per-round digest where
each member's content is annotated with a normalized :weight and the
extracted :confidence (if any).
Weight resolution per member, in order:
- Static :weight opt declared on the member tuple
- :confidence field on the member's %MemberResult{}
- Equal weight 1.0 / N fallback
Weights normalize to sum to 1.0 across :ok members.
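The resolution order and normalization can be sketched in a few lines. This is a hypothetical illustration, not the library's implementation: the module name, the tuple shape, and the plain maps standing in for %MemberResult{} (with assumed :status and :confidence keys) are all assumptions.

```elixir
defmodule WeightSketch do
  # Hypothetical sketch. `members` is a list of {id, opts, result} tuples;
  # `result` is a plain map standing in for %MemberResult{}.
  def resolve(members) do
    # Only members that finished with :ok take part in weighting.
    ok = Enum.filter(members, fn {_id, _opts, result} -> result.status == :ok end)
    n = length(ok)

    raw =
      Enum.map(ok, fn {id, opts, result} ->
        weight =
          cond do
            # 1. static :weight opt on the member tuple
            is_number(opts[:weight]) -> opts[:weight]
            # 2. extracted :confidence on the member result
            is_number(result[:confidence]) -> result[:confidence]
            # 3. equal-weight fallback
            true -> 1.0 / n
          end

        {id, weight}
      end)

    total = raw |> Enum.map(&elem(&1, 1)) |> Enum.sum()

    if total > 0 do
      # Normalize so the weights sum to 1.0.
      Map.new(raw, fn {id, w} -> {id, w / total} end)
    else
      # All-zero weights: fall back to equal weights (assumed behaviour).
      Map.new(raw, fn {id, _w} -> {id, 1.0 / n} end)
    end
  end
end
```

Members that did not return :ok are dropped before normalization, so a failed member never dilutes the remaining weights.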
```elixir
council =
  CouncilEx.Councils.WeightedConsensus.new(
    as: MyApp.WC,
    members: [
      {:expert, MyApp.Expert, [provider: :openrouter, model: "anthropic/claude-sonnet-4-6", weight: 0.7]},
      {:gen, MyApp.Gen, [provider: :openrouter, model: "openai/gpt-4o-mini", weight: 0.3]}
    ],
    chair: {MyApp.Synth, [provider: :openrouter, model: "openai/gpt-4o"]}
  )
```

Equivalent dynamic form via WeightedConsensus.new_dynamic/1.
The chair sees (per prior round) a list of:
```elixir
%{
  content: "...",
  model: "anthropic/claude-sonnet-4-6",
  weight: 0.7,
  confidence: 0.84
}
```

The chair's system prompt is responsible for actually using those weights.
If the prompt ignores them, the chair behaves as if the round were
plain Synthesis.
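One way a chair prompt might surface the digest is to render each entry with its weight and confidence in a header line, highest weight first. The helper below is a hypothetical sketch; the module name and the text layout are assumptions, and CouncilEx may present the digest to the chair differently.

```elixir
defmodule DigestPromptSketch do
  # Hypothetical helper: turns digest entries into text for a chair prompt.
  def render(entries) do
    entries
    |> Enum.sort_by(& &1.weight, :desc)
    |> Enum.map_join("\n\n", fn entry ->
      confidence =
        case entry[:confidence] do
          nil -> "n/a"
          c -> :erlang.float_to_binary(c * 1.0, decimals: 2)
        end

      weight = :erlang.float_to_binary(entry.weight * 1.0, decimals: 2)

      "[model=#{entry.model} weight=#{weight} confidence=#{confidence}]\n#{entry.content}"
    end)
  end
end
```

The chair's system prompt would still need an explicit instruction such as "give more credence to higher-weight answers"; rendering the numbers alone does not make the model use them.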
2. Per-member confidence (CouncilEx.Confidence)
lib/council_ex/confidence.ex, MemberResult.:confidence field.
Opt-in per member via the :confidence opt:
| Strategy | Cost | Calibration | Status |
|---|---|---|---|
| :self_report | free (one prompt edit) | noisy | shipped |
| :logprob | free where supported | OpenAI-only today | shipped (OpenAI raw responses) |
| {:semantic_entropy, samples: n} | n× | most accurate | reserved |
:self_report appends a confidence-rating instruction to the system
prompt and parses a trailing {"confidence": 0.0..1.0} JSON line. The
parsed JSON is stripped from :content before downstream consumers
see it.
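The parse-and-strip step for :self_report can be illustrated with a small regex-based sketch. It is an assumption about how such parsing could work, not the code in confidence.ex; the module and function names are hypothetical, and it avoids a JSON dependency by matching the one expected key directly.

```elixir
defmodule SelfReportSketch do
  # Matches a final line such as {"confidence": 0.84}, allowing trailing whitespace.
  defp trailer do
    ~r/\n?[ \t]*\{\s*"confidence"\s*:\s*([0-9]+(?:\.[0-9]+)?)\s*\}\s*\z/
  end

  # Returns {content_without_trailer, confidence}, or {text, nil} when no
  # valid trailer is present.
  def extract(text) do
    with [match, raw] <- Regex.run(trailer(), text),
         {value, ""} when value <= 1.0 <- Float.parse(raw) do
      content = text |> String.replace_suffix(match, "") |> String.trim_trailing()
      {content, value}
    else
      _ -> {text, nil}
    end
  end
end
```

Out-of-range values (for example 1.5) are treated as missing rather than clamped, which leaves the content untouched; whether the real parser clamps or rejects is not stated in the source.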
:logprob averages OpenAI-style top-token logprobs from
Response.raw["choices"][0]["logprobs"]["content"] and returns
exp(avg) ∈ (0, 1]. Other providers don't expose logprobs uniformly
yet; they return nil.
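The :logprob computation amounts to a geometric mean of token probabilities. A minimal sketch, assuming `raw` is the decoded JSON body of an OpenAI-style chat completion with string keys (the module name is hypothetical):

```elixir
defmodule LogprobSketch do
  # Averages the per-token "logprob" values of the first choice and maps the
  # average back to a probability with exp/1.
  def confidence(%{"choices" => [%{"logprobs" => %{"content" => [_ | _] = tokens}} | _]}) do
    logprobs = Enum.map(tokens, & &1["logprob"])
    :math.exp(Enum.sum(logprobs) / length(logprobs))
  end

  # Anything else (no logprobs, other provider shapes) yields nil.
  def confidence(_raw), do: nil
end
```

Because logprobs are at most 0, the average is at most 0 and the result stays in (0, 1].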
```elixir
member :a, MyApp.A, provider: :openrouter, model: "openai/gpt-4o-mini",
  confidence: :self_report
```

After the run, result.rounds |> hd |> Map.get(:member_results) |> Map.get(:a) |> Map.get(:confidence) is a float or nil.
3. CouncilEx.BiasDetector
lib/council_ex/bias_detector.ex
Diagnostic only. Does not mitigate. Mirrors the paper's Bias Detector component: surfaces when member disagreement correlates with
demographic-axis terms.
analyze/2 takes a %{member_id => MemberResult} map and returns:
```elixir
%{
  flagged: true,
  axes: [
    %{axis: :gender, score: 0.42, evidence: [{:a, ["women", "her"]}]},
    %{axis: :ethnicity, score: 0.0, evidence: []},
    ...
  ],
  baseline_disagreement: 0.61
}
```

- score ∈ [0, 1]: rescaled coverage variance, which captures how much members differ in how heavily they invoke that axis.
- baseline_disagreement ∈ [0, 1]: token-set Jaccard distance across members. High baseline + low axis scores = members disagree on substance. Low baseline + high axis = the axis is doing the disagreeing.
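
To make baseline_disagreement concrete, here is a sketch of token-set Jaccard distance averaged over all member pairs. The tokenizer and the choice of a mean over pairs are assumptions (the source only says "token-set Jaccard distance across members"), and the module name is hypothetical.

```elixir
defmodule JaccardSketch do
  # Lowercased word tokens as a set; the real tokenizer may differ.
  def tokens(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> MapSet.new()
  end

  # Jaccard distance = 1 - |A ∩ B| / |A ∪ B|; two empty sets count as identical.
  def distance(a, b) do
    union = MapSet.size(MapSet.union(a, b))

    if union == 0 do
      0.0
    else
      1.0 - MapSet.size(MapSet.intersection(a, b)) / union
    end
  end

  # Mean pairwise distance over member texts; 0.0 with fewer than two members.
  def baseline(texts) do
    sets = Enum.map(texts, &tokens/1)

    pairs =
      for {a, i} <- Enum.with_index(sets),
          {b, j} <- Enum.with_index(sets),
          i < j,
          do: {a, b}

    case pairs do
      [] -> 0.0
      _ -> Enum.sum(Enum.map(pairs, fn {a, b} -> distance(a, b) end)) / length(pairs)
    end
  end
end
```

Identical answers give 0.0 and answers with no shared tokens give 1.0, which matches the [0, 1] range described above.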
Default backend is :lexicon (substring match against built-in term
lists for gender/ethnicity/religion/age/ability). Override via
:lexicon opt. :llm_judge and :embedding_cluster backends are
planned.
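
The substring matching behind the :lexicon backend can be pictured with the sketch below, which produces evidence tuples in the same {member_id, terms} shape as the report. The module name and the term list are hypothetical; the built-in lexicons are not reproduced here.

```elixir
defmodule LexiconSketch do
  # Returns [{member_id, matched_terms}] for members whose text contains at
  # least one term. Matching is a case-insensitive substring check.
  def evidence(member_texts, terms) do
    for {id, text} <- member_texts,
        lowered = String.downcase(text),
        hits = Enum.filter(terms, &String.contains?(lowered, &1)),
        hits != [],
        do: {id, hits}
  end
end
```

Plain substring matching is cheap but can over-match: in this sketch "her" also fires on "other". That is one reason to treat the output as a diagnostic signal to review rather than a verdict.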
4. CouncilEx.Reliability
lib/council_ex/reliability.ex + lib/council_ex/reliability/ets_store.ex + null_store.ex.
Tracks per-member historical accuracy bucketed by query features. Closes the paper's "agents demonstrating higher accuracy on similar historical queries receive elevated weights" loop.
```elixir
# After grading a council run out-of-band:
CouncilEx.Reliability.record(:expert, %{topic: :code_review}, true)

# Future runs:
CouncilEx.Reliability.score(:expert, %{topic: :code_review})
# => 0.818 (Laplace-smoothed (s+1)/(s+f+2))
```

Cold-start (no history for that bucket) returns nil; callers should
treat that as "use equal weight". Buckets are exact-match. For
semantic similarity, plug in a different store.
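
The Laplace-smoothed score is simple enough to check by hand. In the sketch below (hypothetical module name), s is the number of successes and f the number of failures recorded for one member and bucket; the nil return for an empty bucket mirrors the cold-start behaviour described above.

```elixir
defmodule LaplaceSketch do
  # No history for the bucket: signal cold start with nil.
  def score(0, 0), do: nil

  # Laplace smoothing: (s + 1) / (s + f + 2).
  def score(successes, failures) do
    (successes + 1) / (successes + failures + 2)
  end
end
```

For example, 8 successes and 1 failure give 9 / 11 ≈ 0.818. Those counts are an illustration chosen to reproduce that value; the source does not state which history produced it.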
The behaviour is record/3 + score/2. Default is Reliability.Null. Set
:reliability_store in app config or pass :store per call.
5. Benchmark harness (bench/eval/)
bench/eval/{datasets.ex,runner.ex,README.md}
Skeleton only. Opt-in fetching, bring-your-own grading function, no
CI integration. Loaders for TruthfulQA / HaluEval / BBQ. Runner runs a
council across a dataset, grades each item, and optionally feeds the
match? signal back into Reliability.record/3 so adaptive weighting
kicks in on subsequent runs.
See ../bench/eval/README.md for the
recommended workflow (load dataset → run baseline council → run
heterogeneous council → diff summaries).
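
The run, grade, and feed-back loop can be sketched generically with injected functions. This is not the runner's actual API: the module name, the argument shapes, and the summary map are assumptions made so the example runs without the library or a dataset.

```elixir
defmodule EvalLoopSketch do
  # run_fun:    item -> answer            (e.g. wraps a council run)
  # grade_fun:  (item, answer) -> boolean (bring-your-own grading)
  # record_fun: (item, boolean) -> any    (optional feedback hook)
  def run(dataset, run_fun, grade_fun, record_fun \\ fn _item, _correct? -> :ok end) do
    results =
      Enum.map(dataset, fn item ->
        answer = run_fun.(item)
        correct? = grade_fun.(item, answer)
        # Feed the grading signal back, e.g. into a reliability store.
        record_fun.(item, correct?)
        correct?
      end)

    %{total: length(results), correct: Enum.count(results, & &1)}
  end
end
```

A feedback hook could be as small as `fn item, ok? -> CouncilEx.Reliability.record(:expert, %{topic: item.topic}, ok?) end`, where `item.topic` is a hypothetical dataset field.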
What the paper does that CouncilEx does NOT do
- Mitigation. The paper's Bias Detector informs aggregation; it doesn't rewrite outputs. CouncilEx's BiasDetector is purely diagnostic. You read the report and decide.
- Embedding-based query similarity. The paper alludes to "similar historical queries"; CouncilEx's ETS store buckets by exact feature hash. Embedding-NN buckets would require a vector store; out of scope for core.
- Built-in TruthfulQA/HaluEval/BBQ runners. The harness is a skeleton. Datasets are not vendored; users fetch and grade.
- Adaptive per-query-type weight learning. Reliability tracks per-bucket accuracy; turning that into a chair input still requires the caller to read the score and pass it as a member :weight (or rely on :confidence as a proxy). A first-class adapter is a reasonable next step.
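
The manual step in the last item, reading a reliability score and passing it as a member :weight, might look like the sketch below. The module name is hypothetical, and `score_fun` stands in for CouncilEx.Reliability.score/2 so the example runs without the library.

```elixir
defmodule ReliabilityWeightsSketch do
  # members:   [{id, module, opts}] as passed to a weighted council
  # features:  the query-feature bucket, e.g. %{topic: :code_review}
  # score_fun: (id, features) -> float | nil
  def with_weights(members, features, score_fun) do
    Enum.map(members, fn {id, module, opts} ->
      case score_fun.(id, features) do
        # Cold start: leave :weight unset so the later fallbacks apply.
        nil -> {id, module, opts}
        score -> {id, module, Keyword.put(opts, :weight, score)}
      end
    end)
  end
end
```

With the library loaded, the call would be along the lines of `ReliabilityWeightsSketch.with_weights(members, %{topic: :code_review}, &CouncilEx.Reliability.score/2)`. Members without history then fall through to :confidence or the equal-weight fallback, so their effective weight sits on a different scale from the scored members; a real adapter would need to decide how to reconcile the two.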
Examples
| File | Demonstrates |
|---|---|
| examples/weighted_consensus_example.exs | Static :weight opts, OpenRouter |
| examples/confidence_example.exs | :self_report driving weights |
| examples/bias_detector_example.exs | Panel + analyze/2 over real member outputs |
| examples/reliability_example.exs | ETS store mechanics, no API key |
References
- Wu, S., Li, X., Feng, Y., Li, Y., Wang, Z., & Wang, R. (2026). Council Mode: A Heterogeneous Multi-Agent Consensus Framework for Reducing LLM Hallucination and Bias. arXiv:2604.02923.