Council Mode (Wu et al.) — Paper-to-CouncilEx Mapping


This doc maps the heterogeneous-consensus framework in Wu et al., "Council Mode: A Heterogeneous Multi-Agent Consensus Framework for Reducing LLM Hallucination and Bias" (arXiv:2604.02923), onto the modules CouncilEx ships in v0.8.

The paper's claim: combining diverse LLMs through a reliability-weighted consensus mechanism (not majority vote) reduces both factual hallucinations and demographic biases compared to single-model baselines, without retraining. CouncilEx already had the parallel-members + chair-synthesis shape. The new modules add the rest: weighted aggregation, per-member confidence, bias diagnostic, and reliability tracking.

TL;DR: what's new in CouncilEx

| Paper component | CouncilEx module | Status |
| --- | --- | --- |
| Heterogeneous Agent Pool | Existing multi-provider adapters | shipped |
| Query Processor | Existing Provider.Adapter behaviour | shipped |
| Response Collector (parallel) | Existing :independent_analysis round | shipped |
| Consensus Engine (weighted) | Councils.WeightedConsensus + Rounds.WeightedSynthesis | new |
| Confidence / uncertainty quantification | MemberResult.confidence + CouncilEx.Confidence | new |
| Bias Detector | CouncilEx.BiasDetector (lexicon backend) | new (diagnostic only) |
| Learned reliability weighting | CouncilEx.Reliability + Reliability.ETS | new |
| Chair synthesis | Existing Rounds.Synthesis | shipped |

Where each piece lives

1. Councils.WeightedConsensus

lib/council_ex/councils/weighted_consensus.ex

Two-round topology: independent_analysis → weighted_synthesis. Members run in parallel; the chair receives a per-round digest where each member's content is annotated with a normalized :weight and the extracted :confidence (if any).

Weight resolution per member, in order:

  1. Static :weight opt declared on the member tuple
  2. :confidence field on the member's %MemberResult{}
  3. Equal weight 1.0 / N fallback

Weights normalize to sum to 1.0 across :ok members.
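The resolution order and normalization above can be sketched in a few lines. This is a minimal illustration, not CouncilEx's actual implementation: the `WeightSketch` module, the member map shape (`:id`, `:status`, optional `:weight` and `:confidence`), and the zero-total fallback are all assumptions made for the example.

```elixir
defmodule WeightSketch do
  @moduledoc "Hypothetical sketch of the weight-resolution order; not CouncilEx code."

  # `members` is a list of maps with :id, :status (:ok | :error), and optional
  # :weight (static opt) and :confidence (from the member result).
  def resolve(members) do
    ok = Enum.filter(members, &(&1.status == :ok))
    n = length(ok)

    raw =
      Enum.map(ok, fn m ->
        # 1. static weight, 2. confidence, 3. equal-weight fallback
        {m.id, m[:weight] || m[:confidence] || 1.0 / n}
      end)

    total = raw |> Enum.map(&elem(&1, 1)) |> Enum.sum()

    if total == 0 do
      # Assumed behaviour: fall back to equal weights when every raw weight is zero.
      Map.new(raw, fn {id, _w} -> {id, 1.0 / n} end)
    else
      # Normalize so the weights of :ok members sum to 1.0.
      Map.new(raw, fn {id, w} -> {id, w / total} end)
    end
  end
end
```

With one member carrying a static weight of 0.7, a second carrying only a confidence of 0.5, and a third that failed, the failed member is dropped and the remaining two are rescaled to 0.7/1.2 and 0.5/1.2.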

council =
  CouncilEx.Councils.WeightedConsensus.new(
    as: MyApp.WC,
    members: [
      {:expert, MyApp.Expert, [provider: :openrouter, model: "anthropic/claude-sonnet-4-6", weight: 0.7]},
      {:gen,    MyApp.Gen,    [provider: :openrouter, model: "openai/gpt-4o-mini", weight: 0.3]}
    ],
    chair: {MyApp.Synth, [provider: :openrouter, model: "openai/gpt-4o"]}
  )

An equivalent dynamic form is available via WeightedConsensus.new_dynamic/1.

The chair sees (per prior round) a list of:

%{
  content: "...",
  model: "anthropic/claude-sonnet-4-6",
  weight: 0.7,
  confidence: 0.84
}

The chair's system prompt is responsible for actually using those weights. If the prompt ignores them, the chair behaves as if the round were plain Synthesis.

2. Per-member confidence (CouncilEx.Confidence)

lib/council_ex/confidence.ex, MemberResult.:confidence field.

Opt-in per member via the :confidence opt:

| Strategy | Cost | Calibration | Status |
| --- | --- | --- | --- |
| :self_report | free (one prompt edit) | noisy | shipped |
| :logprob | free where supported | OpenAI-only today | shipped (OpenAI raw responses) |
| {:semantic_entropy, samples: n} | | most accurate | reserved |

:self_report appends a confidence-rating instruction to the system prompt and parses a trailing {"confidence": 0.0..1.0} JSON line. The parsed JSON is stripped from :content before downstream consumers see it.
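The parse-and-strip step might look roughly like the following. This is a sketch under assumptions: the `SelfReportSketch` module and its regex are hypothetical, and the real parser may well decode the trailer with a JSON library rather than a regular expression.

```elixir
defmodule SelfReportSketch do
  @moduledoc "Hypothetical sketch of trailing-confidence extraction; not CouncilEx code."

  # Matches a final line such as {"confidence": 0.84}, on its own line.
  defp trailer do
    ~r/(?:\A|\n)[ \t]*\{\s*"confidence"\s*:\s*([01](?:\.\d+)?)\s*\}\s*\z/
  end

  # Returns {content_without_trailer, confidence_or_nil}.
  def extract(content) do
    case Regex.run(trailer(), content) do
      [match, number] ->
        {value, _rest} = Float.parse(number)

        if value >= 0.0 and value <= 1.0 do
          stripped = content |> String.replace_suffix(match, "") |> String.trim_trailing()
          {stripped, value}
        else
          # Out-of-range values are treated as "no confidence reported".
          {content, nil}
        end

      nil ->
        {content, nil}
    end
  end
end
```

Content without a well-formed trailer is returned unchanged with a `nil` confidence, which matches the "float or nil" contract described for `MemberResult`.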

:logprob averages OpenAI-style top-token logprobs from Response.raw["choices"][0]["logprobs"]["content"] and returns exp(avg) ∈ (0, 1]. Other providers don't expose logprobs uniformly yet; they return nil.
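As an illustration of the `exp(avg)` computation, here is a small sketch over a decoded OpenAI-style response body. The `LogprobSketch` module is hypothetical; the only thing taken from the text above is the `["choices"][0]["logprobs"]["content"]` path, where each entry carries a `"logprob"` value.

```elixir
defmodule LogprobSketch do
  @moduledoc "Hypothetical sketch of logprob-based confidence; not CouncilEx code."

  # Pattern-match the OpenAI-style shape; any other shape falls through to nil.
  def confidence(%{"choices" => [%{"logprobs" => %{"content" => [_ | _] = tokens}} | _]}) do
    avg = Enum.sum(Enum.map(tokens, & &1["logprob"])) / length(tokens)
    # exp of the mean logprob is the geometric mean of the token probabilities.
    :math.exp(avg)
  end

  def confidence(_raw), do: nil
end
```

Because every logprob is at most 0, the result always lands in (0, 1], and a single very unlikely token pulls the score down more than an arithmetic mean of probabilities would.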

member :a, MyApp.A, provider: :openrouter, model: "openai/gpt-4o-mini",
                    confidence: :self_report

After the run, result.rounds |> hd |> Map.get(:member_results) |> Map.get(:a) |> Map.get(:confidence) is a float or nil.

3. CouncilEx.BiasDetector

lib/council_ex/bias_detector.ex

Diagnostic only. Does not mitigate. Mirrors the paper's Bias Detector component: surfaces when member disagreement correlates with demographic-axis terms.

analyze/2 takes a %{member_id => MemberResult} map and returns:

%{
  flagged: true,
  axes: [
    %{axis: :gender, score: 0.42, evidence: [{:a, ["women", "her"]}]},
    %{axis: :ethnicity, score: 0.0, evidence: []},
    ...
  ],
  baseline_disagreement: 0.61
}
  • score ∈ [0, 1]: rescaled coverage variance. Members differ in how much they invoke that axis.
  • baseline_disagreement ∈ [0, 1]: token-set Jaccard distance across members. High baseline + low axis scores = members disagree on substance. Low baseline + high axis = the axis is doing the disagreeing.
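To make the baseline concrete, the following sketch computes a token-set Jaccard distance over member outputs. The `JaccardSketch` module, the tokenizer, and the choice to average over all member pairs are assumptions; the text above only states that the metric is a token-set Jaccard distance across members.

```elixir
defmodule JaccardSketch do
  @moduledoc "Hypothetical sketch of a token-set Jaccard baseline; not CouncilEx code."

  # Lowercase, split on non-alphanumerics, and keep the set of distinct tokens.
  def tokens(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^[:alnum:]]+/u, trim: true)
    |> MapSet.new()
  end

  # Jaccard distance = 1 - |A ∩ B| / |A ∪ B|; two empty sets count as identical.
  def distance(a, b) do
    union = MapSet.size(MapSet.union(a, b))

    if union == 0 do
      0.0
    else
      1.0 - MapSet.size(MapSet.intersection(a, b)) / union
    end
  end

  # Assumed aggregation: mean distance over all unordered member pairs.
  def baseline(texts) do
    indexed = texts |> Enum.map(&tokens/1) |> Enum.with_index()
    pairs = for {a, i} <- indexed, {b, j} <- indexed, i < j, do: distance(a, b)

    case pairs do
      [] -> 0.0
      _ -> Enum.sum(pairs) / length(pairs)
    end
  end
end
```

For example, "the cat sat" and "the cat ran" share two of four distinct tokens, so their distance is 0.5.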

Default backend is :lexicon (substring match against built-in term lists for gender/ethnicity/religion/age/ability). Override via :lexicon opt. :llm_judge and :embedding_cluster backends are planned.

4. CouncilEx.Reliability

lib/council_ex/reliability.ex + lib/council_ex/reliability/ets_store.ex + null_store.ex.

Tracks per-member historical accuracy bucketed by query features. Closes the paper's "agents demonstrating higher accuracy on similar historical queries receive elevated weights" loop.

# After grading a council run out-of-band:
CouncilEx.Reliability.record(:expert, %{topic: :code_review}, true)

# Future runs:
CouncilEx.Reliability.score(:expert, %{topic: :code_review})
# => 0.818  (Laplace-smoothed (s+1)/(s+f+2); e.g. 8 successes and 1 failure give 9/11)

Cold-start (no history for that bucket) returns nil; callers should treat that as "use equal weight". Buckets are exact-match. For semantic similarity, plug in a different store.
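The scoring rule and the cold-start behaviour can be illustrated with a plain map standing in for the store. `LaplaceSketch` is a hypothetical module written for this example; it mirrors the record/score shape but threads the store explicitly instead of using ETS.

```elixir
defmodule LaplaceSketch do
  @moduledoc "Hypothetical in-memory sketch of Laplace-smoothed reliability; not CouncilEx code."

  # store: %{{member, bucket} => {successes, failures}}; buckets match exactly.
  def record(store, member, bucket, correct?) do
    Map.update(store, {member, bucket}, init(correct?), fn {s, f} ->
      if correct?, do: {s + 1, f}, else: {s, f + 1}
    end)
  end

  # Returns (s + 1) / (s + f + 2), or nil when the bucket has no history.
  def score(store, member, bucket) do
    case Map.fetch(store, {member, bucket}) do
      {:ok, {s, f}} -> (s + 1) / (s + f + 2)
      :error -> nil
    end
  end

  defp init(true), do: {1, 0}
  defp init(false), do: {0, 1}
end
```

After eight correct answers and one incorrect answer in the same bucket, the score is 9/11, which rounds to 0.818. A bucket that differs in any feature, such as `%{topic: :docs}`, still returns `nil`.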

The behaviour is record/3 + score/2. Default is Reliability.Null. Set :reliability_store in app config or pass :store per call.

5. Benchmark harness (bench/eval/)

bench/eval/{datasets.ex,runner.ex,README.md}

Skeleton only. Opt-in fetching, bring-your-own grading function, no CI integration. Loaders for TruthfulQA / HaluEval / BBQ. Runner runs a council across a dataset, grades each item, and optionally feeds the match? signal back into Reliability.record/3 so adaptive weighting kicks in on subsequent runs.

See ../bench/eval/README.md for the recommended workflow (load dataset → run baseline council → run heterogeneous council → diff summaries).

What the paper does that CouncilEx does NOT do

  • Mitigation. The paper's Bias Detector informs aggregation; it doesn't rewrite outputs. CouncilEx's BiasDetector is purely diagnostic. You read the report and decide.
  • Embedding-based query similarity. The paper alludes to "similar historical queries"; CouncilEx's ETS store buckets by exact feature hash. Embedding-NN buckets would require a vector store; out of scope for core.
  • Built-in TruthfulQA/HaluEval/BBQ runners. The harness is a skeleton. Datasets are not vendored; users fetch and grade.
  • Adaptive per-query-type weight learning. Reliability tracks per-bucket accuracy; turning that into a chair input still requires the caller to read the score and pass it as a member :weight (or rely on :confidence as a proxy). A first-class adapter is a reasonable next step.
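Until a first-class adapter exists, the glue described in the last bullet could be as small as the sketch below. `ReliabilityWeightsSketch` and `apply_scores/3` are hypothetical names; the scoring function is passed in so the example does not assume anything about `CouncilEx.Reliability` beyond the `score/2` shape documented above.

```elixir
defmodule ReliabilityWeightsSketch do
  @moduledoc "Hypothetical glue from reliability scores to member :weight opts; not CouncilEx code."

  # `members` are {id, module, opts} tuples as passed to WeightedConsensus.new/1.
  # `score_fun` is any 2-arity function returning a float or nil,
  # for example &CouncilEx.Reliability.score/2.
  def apply_scores(members, features, score_fun) do
    Enum.map(members, fn {id, mod, opts} ->
      case score_fun.(id, features) do
        # Cold start: leave the opts untouched so the normal resolution order applies.
        nil -> {id, mod, opts}
        score -> {id, mod, Keyword.put(opts, :weight, score)}
      end
    end)
  end
end
```

One design choice to note: a non-nil score overrides any static `:weight` already on the tuple. Callers who prefer static weights to win would need to check `Keyword.has_key?(opts, :weight)` first.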

Examples

| File | Demonstrates |
| --- | --- |
| examples/weighted_consensus_example.exs | Static :weight opts, OpenRouter |
| examples/confidence_example.exs | :self_report driving weights |
| examples/bias_detector_example.exs | Panel + analyze/2 over real member outputs |
| examples/reliability_example.exs | ETS store mechanics, no API key |

References

  • Wu, S., Li, X., Feng, Y., Li, Y., Wang, Z., & Wang, R. (2026). Council Mode: A Heterogeneous Multi-Agent Consensus Framework for Reducing LLM Hallucination and Bias. arXiv:2604.02923.