Related Work: LLM Council Landscape vs CouncilEx


Comparison of CouncilEx against the most-cited papers and notable GitHub projects in the multi-agent LLM space (as of 2026-05). Catalogues what's already covered, where the gaps are, and which gaps are worth closing.

Source: research pass over ~/Downloads/llm_council_overview.md.

Items surveyed

Papers

| # | Paper | Year | Citation |
|---|-------|------|----------|
| 1 | Council Mode (Wu, Li, Feng, Li, Wang, Wang) | 2026 | arXiv:2604.02923 |
| 2 | Can LLM Agents Really Debate? (Wu, Li, Li) | 2025 | arXiv:2511.07784 |
| 3 | Multi-Agent Debate (MAD) (Du, Li, Torralba, Tenenbaum, Mordatch) | ICML 2024 | arXiv:2305.14325 |
| 4 | Adjudicator: KG-Informed Council (You, Paul) | 2025 | arXiv:2512.13704 |

Repos

| # | Repo | Stars | Status |
|---|------|-------|--------|
| 1 | karpathy/llm-council | 18.2k | experimental, frozen |
| 2 | bawfng04/Chaos-MoA-Pipeline | 2 | hobby, active |
| 3 | dayeonki/cultural_debate | ~8 | research code, ACL 2025 oral |
| 4 | dubs3c/council and 0xNyk/council-of-high-intelligence | 7 / 615 | two candidates for "The_Council" |

Per-item summaries

1. Council Mode (arXiv:2604.02923)

Heterogeneous multi-agent consensus framework. Parallel LLMs → reliability-weighted aggregation → structured synthesis. Reports ~36% hallucination reduction on TruthfulQA/HaluEval, bias mitigation on BBQ/BiasBench. Already ported to CouncilEx in v0.8; see COUNCIL_MODE_PAPER.md.

2. Can LLM Agents Really Debate? (arXiv:2511.07784)

Controlled study on Knight–Knave–Spy logic puzzles. Six factors varied (team size/composition, confidence visibility, debate order/depth, difficulty). Process-level analysis tracks flip / hold / capitulate behavior.

Findings that matter for CouncilEx:

  • Diversity > structure. Heterogeneous strong-model teams drive almost all of the gains; turn order, confidence display, more rounds yield marginal improvements.
  • Visible majority hurts correctness. When a wrong answer becomes the apparent majority, individually correct agents capitulate. Effective debate teams are precisely those that can overturn incorrect consensus.
  • Reasoning quality (validity, not assertiveness) predicts truth flips.
  • Diminishing returns on depth; small + diverse beats large + homogeneous.

3. Multi-Agent Debate (MAD, Du et al.) (arXiv:2305.14325)

Canonical foundational paper. N homogeneous LLM instances; each round, every agent sees the others' prior responses and revises. Final answer = majority vote at last round. Tested on Arithmetic, GSM8K, Chess, MMLU, biographies. Accuracy scales monotonically with agents and rounds; cross-model debate (ChatGPT + Bard) solves problems neither solves alone. Limits: O(N×R) cost, long contexts, can converge on wrong consensus, requires sufficiently capable base models.
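The round structure described above can be sketched in a few lines. This is a minimal illustration in Python, not code from the paper or from CouncilEx: the `Agent` type, `debate`, and the two stub agents (`stubborn`, `follower`) are hypothetical names, and a real agent would wrap an LLM call instead of a plain function.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical agent type: (question, peers' previous answers) -> answer.
Agent = Callable[[str, List[str]], str]


def debate(question: str, agents: List[Agent], rounds: int) -> str:
    """Run `rounds` rounds and return the majority answer of the last round."""
    if rounds < 1 or not agents:
        raise ValueError("need at least one agent and one round")
    # Round 1: every agent answers independently (no peer answers yet).
    answers = [agent(question, []) for agent in agents]
    # Rounds 2..R: each agent sees the others' previous answers and revises.
    for _ in range(rounds - 1):
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
    # Final answer: most common answer; ties go to the earliest one seen.
    return Counter(answers).most_common(1)[0][0]


def stubborn(answer: str) -> Agent:
    """Stub agent that never changes its answer."""
    return lambda question, peers: answer


def follower(initial: str) -> Agent:
    """Stub agent that adopts the most common peer answer once it sees any."""
    def agent(question: str, peers: List[str]) -> str:
        return Counter(peers).most_common(1)[0][0] if peers else initial
    return agent
```

Each round costs one call per agent, which is where the O(N×R) cost comes from. The stubs also reproduce the wrong-consensus limit: `debate("q", [stubborn("5"), stubborn("5"), follower("4")], rounds=2)` ends on the majority answer even when the lone follower started with a different one.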

4. Adjudicator (arXiv:2512.13704)

KG-grounded 3-agent council for noisy-label correction in e-commerce taxonomy. Three agents (Policy Expert weight 1.0, Data Analyst weight 2.0, Pattern Detector weight 0.5) vote on whether a product's category is wrong; KG provides a hard structural override.

Headline result (F1), with caveats: 0.48 single-LLM → 0.59 plain council → 0.99 KG-informed council. The 0.59 → 0.99 jump is mostly from the KG hard gate, not the multi-agent debate. Balanced 1k-item test set; perfect precision is largely a threshold/override artifact.
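The weighted vote with a hard override can be sketched as follows. Only the three agent weights come from the summary above; the function name `is_mislabeled`, the `threshold` default, and the shape of the KG signal (`kg_verdict`) are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, Optional

# Weights as listed above for Adjudicator's three agents.
WEIGHTS: Dict[str, float] = {
    "policy_expert": 1.0,
    "data_analyst": 2.0,
    "pattern_detector": 0.5,
}


def is_mislabeled(votes: Dict[str, bool],
                  kg_verdict: Optional[bool] = None,
                  threshold: float = 0.5) -> bool:
    """Return True if the product's category is judged wrong.

    votes: agent name -> True if that agent thinks the label is wrong.
    kg_verdict: hypothetical hard gate; when not None it overrides the vote.
    threshold: assumed share of the total weight needed to flag the label.
    """
    if kg_verdict is not None:
        return kg_verdict  # the symbolic check wins regardless of the agents
    if not votes:
        raise ValueError("need at least one vote")
    total = sum(WEIGHTS[name] for name in votes)
    flagged = sum(WEIGHTS[name] for name, wrong in votes.items() if wrong)
    return flagged / total > threshold
```

Under these assumed settings the Data Analyst alone (2.0 of 3.5) can flag a label, while the other two agents together (1.5 of 3.5) cannot, and any KG verdict bypasses the vote entirely. That is the mechanical reason the gate, not the debate, can dominate the headline number.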

5. karpathy/llm-council

Three stages: opinions → anonymized peer review (rankings) → chairman synthesis. Anonymization is server-side enforced; judging models never see vendor labels. Karpathy's choice of frontier models (see upstream repo) is mirrored in CouncilEx's port with active OpenRouter IDs — see PROVIDER_MODELS.md for the catalog. Chairman: a single strong model. Strict prompt-contract ranking format with regex fallback. CouncilEx port: TUTORIAL_KARPATHY_COUNCIL.md.

6. Chaos-MoA-Pipeline

Four stages: Generate → Cross-Critique → Rebuttal → Multi-Judge majority with 0–100 confidence. Same model + temperature/nonce diversity (Wu 2025 anti-pattern). Notable: cache-busting nonces, confidence-triggered re-run with more workers, "brutally fault-tolerant" graceful degradation. Diverges from vanilla MoA (Wang et al.) by adding adversarial Critique + Rebuttal + judge vote.

7. cultural_debate (Ki et al., ACL 2025)

Two debaters + neutral judge for cross-country cultural-norm alignment. NormAd-ETI benchmark (~2.6k stories, 75 countries). Headline: pairs of small 7–9B debaters + 27B judge match a single 27B model's accuracy with better cultural parity, cheaper and fairer. "Fairness" = accuracy variance across cultural clusters, not demographic identity.
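Fairness as accuracy variance across clusters is simple to compute from graded results. The sketch below is a Python illustration under stated assumptions: the input shape (pairs of cluster name and correctness) and the names `accuracy_by_cluster` and `parity_gap` are hypothetical, and population variance is one reasonable reading of "variance"; the paper may define it differently.

```python
from collections import defaultdict
from statistics import pvariance
from typing import Dict, Iterable, Tuple


def accuracy_by_cluster(graded: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """graded: (cluster, is_correct) pairs -> accuracy per cluster."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for cluster, correct in graded:
        totals[cluster] += 1
        hits[cluster] += int(correct)
    return {cluster: hits[cluster] / totals[cluster] for cluster in totals}


def parity_gap(graded: Iterable[Tuple[str, bool]]) -> float:
    """Population variance of per-cluster accuracy; 0.0 means perfectly even."""
    return pvariance(list(accuracy_by_cluster(graded).values()))
```

Two systems with the same overall accuracy can differ on this number: one that is right everywhere 75% of the time scores 0.0, while one that is perfect on one cluster and at 50% on another scores higher.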

8. The_Council (two candidates)

  • dubs3c/council: 3 personas (Architect / Critic / AppSec), Proposal → Debate (max 3) → Early-term on consensus → Moderator synthesis. Per-agent temperatures via YAML. ~7 stars.
  • 0xNyk/council-of-high-intelligence: 18 named personas (Aristotle, Feynman, Karpathy, etc.) intentionally paired as counterweights, distributed across 4 providers. Three modes (Full / Quick / Duo). Enforces dissent quotas, novelty gates, anti-recursion. ~615 stars.

Architecture coverage matrix

| Item | Members | Round flow | Aggregation | CouncilEx coverage |
|------|---------|------------|-------------|--------------------|
| Council Mode (paper) | Heterogeneous LLMs | parallel → weighted consensus | reliability-weighted | Councils.WeightedConsensus |
| Can LLM Agents Really Debate? | Heterogeneous strong models | controlled-debate study | majority vote (criticized) | ⚠️ no anti-conformity safeguards |
| MAD (Du et al.) | N homogeneous instances | propose → see-others → revise (R rounds) → majority | majority vote | Iterate(:critique), Councils.Consensus |
| Adjudicator | 3 specialized agents | KG → 3 agents vote → weighted threshold + KG override | weighted binary + symbolic veto | ⚠️ weighted vote yes, no KG layer |
| karpathy/llm-council | 4 frontier models | opinions → anonymized peer review → chairman | ranking + chairman | AnonymizedPeerReview |
| Chaos-MoA-Pipeline | Same model multi-temp | Generate → Critique → Rebuttal → Judge majority | judge majority + confidence | ⚠️ critique/rebuttal yes, no multi-judge + retry |
| cultural_debate | 2 debaters + 1 judge | independent → debate → judge verdict | judge | Councils.PeerReview, but no parity metric |
| dubs3c/council | 3 personas | Proposal → Debate (max 3) → Synth | consensus convergence | Councils.Consensus w/ until callback |
| 0xNyk/council-of-high-intelligence | 18 personas across providers | independent → cross-exam → crystallize → verdict | synth w/ dissent quotas | ⚠️ no dissent quota / novelty gate |

What CouncilEx already covers

| Pattern | Mapping |
|---------|---------|
| Heterogeneous parallel members | Multi-provider adapters + Councils.ParallelPanel |
| Independent-then-aggregate | :independent_analysis round |
| MAD-style iterative critique | Rounds.Iterate(:critique) + Councils.Consensus |
| Anonymized peer review | Rounds.AnonymizedPeerReview |
| Chairman synthesis | Rounds.Synthesis + chair macro |
| Reliability-weighted consensus | Councils.WeightedConsensus |
| Per-member confidence | Confidence.{:self_report, :logprob} |
| Early termination on consensus | Iterate(until: ...) |
| Multi-vendor heterogeneity | OpenAI / Anthropic / Gemini / Ollama / OpenRouter adapters |
| Persona / role definition | use CouncilEx.Member + system_prompt |
| Tournament / pairwise | Councils.Tournament + Rounds.PairwiseElimination |
| Voting / aggregator | Rounds.Vote + Aggregators.{Plurality, WeightedMean} |
| Bias diagnostic | BiasDetector.analyze/2 |

Gaps surfaced by this research

Ordered by lift / cost.

🟢 Easy adds

  1. expose_confidence opt on WeightedSynthesis — SHIPPED. Audit found existing PeerReview / Critique / AnonymizedPeerReview only surface parsed content / response text to peers, never raw confidence or vote tallies — Wu-2025-safe by default. Only WeightedSynthesis ever exposed confidence (chair-terminal, not peer-conformity-prone). Added :expose_confidence opt (default true) on Rounds.WeightedSynthesis + Councils.WeightedConsensus so callers can audit chair sensitivity to confidence signal. See lib/council_ex/rounds/weighted_synthesis.ex moduledoc.
  2. Councils.JuryWithRetry — SHIPPED. K judges run :independent_analysis; iterate until avg :self_report confidence ≥ threshold or max_iterations exhausted. Judges do NOT see each other across iterations (Wu 2025 conformity mitigation: independent re-sample, not debate). Defaults: confidence_threshold: 0.7, max_iterations: 2, auto-injects :self_report. See lib/council_ex/councils/jury_with_retry.ex.
  3. Ranking-parser regex fallback for cheap models (karpathy). Karpathy's FINAL RANKING: strict-format prompt with regex fallback. Useful for local/Ollama models that can't do structured output reliably.
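The strict-format-then-fallback idea in item 3 can be sketched as below. This is a hedged Python illustration, not the upstream parser or a CouncilEx function: `parse_ranking` is a hypothetical name, and the sketch assumes judges are asked to label candidates as "Response A", "Response B", and so on, under a `FINAL RANKING:` header.

```python
import re
from typing import List

HEADER = "FINAL RANKING:"
# Strict form: one numbered entry per line, e.g. "1. Response C".
STRICT = re.compile(r"^\s*\d+\.\s*(Response [A-Z])\s*$", re.MULTILINE)
# Loose form: any mention of a response label anywhere in the text.
LOOSE = re.compile(r"Response [A-Z]")


def parse_ranking(text: str) -> List[str]:
    """Extract an ordered list of response labels from a judge's reply."""
    # Only look at what follows the header, if the model emitted one.
    _, found, tail = text.partition(HEADER)
    section = tail if found else text
    # Strict pass: trust the numbered list when the model followed the format.
    strict = STRICT.findall(section)
    if strict:
        return strict
    # Fallback: order labels by first mention, dropping repeats.
    seen: List[str] = []
    for label in LOOSE.findall(section):
        if label not in seen:
            seen.append(label)
    return seen
```

A reply that follows the contract parses through the strict pass; a chatty reply such as "I think Response B is best, then Response A" still yields an ordering through the fallback. An empty list signals that the reply should be rejected or retried.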

🟡 Medium adds

  1. Cultural / fairness parity metric (cultural_debate). BiasDetector flags demographic disagreement but has no parity metric (variance of accuracy across groups; in cultural_debate the groups are cultural clusters, not demographic identities). Add a Fairness.parity/2 helper that consumes a graded result set.
  2. Persona-counterweight presets (0xNyk). Opposing-persona pairs (Aristotle↔Lao Tzu, Feynman↔Socrates) as Profiles.Counterweight library. Pure prompt engineering.
  3. Dissent quotas / novelty gates (0xNyk). Round-level constraint: nth response must reference m peers, include ≥1 dissent, add novel content. Postprocess validator that bounces non-compliant outputs.
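A postprocess validator for item 3 might look like the following. Everything here is an assumption made for illustration: the function name `check_response`, the default thresholds, the list of dissent markers, and the token-overlap definition of novelty are crude lexical stand-ins, not how 0xNyk's project or CouncilEx implements these gates.

```python
import re
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased word tokens, used as a rough proxy for content."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def check_response(response: str,
                   peers: Dict[str, str],
                   min_refs: int = 2,
                   min_novelty: float = 0.3) -> List[str]:
    """Return the violated constraints; an empty list means compliant.

    peers: peer label (e.g. "Feynman") -> that peer's earlier response.
    """
    problems: List[str] = []
    # 1. Reference quota: the response must name at least `min_refs` peers.
    lowered = response.lower()
    referenced = [name for name in peers if name.lower() in lowered]
    if len(referenced) < min_refs:
        problems.append(f"references {len(referenced)} peers, need {min_refs}")
    # 2. Dissent quota: at least one explicit disagreement marker.
    if not re.search(r"\b(disagree|however|objection)\b", response, re.IGNORECASE):
        problems.append("no dissent marker")
    # 3. Novelty gate: share of tokens that no peer has used yet.
    own = _tokens(response)
    prior = set().union(*(_tokens(text) for text in peers.values()))
    novelty = len(own - prior) / len(own) if own else 0.0
    if novelty < min_novelty:
        problems.append(f"novelty {novelty:.2f} below {min_novelty}")
    return problems
```

A round could bounce any response whose problem list is non-empty and re-prompt the member with the listed violations. An LLM-judged check would be more robust than these string heuristics, at the cost of an extra call per response.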

🔴 Harder adds

  1. Symbolic-grounding / tool-as-veto layer (Adjudicator). Most of Adjudicator's 0.48 → 0.99 gain came from the KG hard gate (0.59 → 0.99), not from the council itself (0.48 → 0.59). A non-LLM check that can veto chair output. Real lift is task-specific.
  2. Process-level flip/capitulation diagnostics (Wu 2025). Track when agents flip vs hold across rounds. Diagnostics.flip_analysis(result) surfaces the conformity failure mode.
  3. Logical-validity-aware aggregation (Wu 2025). Reasoning quality, not assertiveness, predicts truth-flips. RatingAggregator that scores arguments on validity (LLM-judged) before voting. Pairs with Reliability store.
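The flip/capitulation diagnostic in item 2 reduces to classifying each agent's answer transitions between consecutive rounds. The sketch below is a Python illustration with assumed definitions: "capitulate" is read as right → wrong, "truth_flip" as wrong → right, and "other_flip" is an added bucket for wrong → different wrong. The input shape and the function signature are hypothetical and require a known correct answer.

```python
from collections import Counter
from typing import Dict, List


def flip_analysis(history: Dict[str, List[str]], truth: str) -> Counter:
    """Count per-round transitions across all agents.

    history: agent name -> its answer in each round, in order.
    truth: the known correct answer, so this only applies to graded tasks.
    """
    counts: Counter = Counter()
    for answers in history.values():
        # Compare each round's answer with the same agent's next-round answer.
        for before, after in zip(answers, answers[1:]):
            if before == after:
                counts["hold"] += 1
            elif after == truth:
                counts["truth_flip"] += 1   # wrong -> right
            elif before == truth:
                counts["capitulate"] += 1   # right -> wrong
            else:
                counts["other_flip"] += 1   # wrong -> different wrong
    return counts
```

A run where capitulations outnumber truth flips is the conformity failure mode: debate moved agents away from the correct answer more often than toward it.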

Top recommendations status

Both shipped in v0.8:

expose_confidence opt — SHIPPED

Rounds.WeightedSynthesis + Councils.WeightedConsensus accept :expose_confidence (default true). Audit confirmed other built-in peer-visibility rounds (PeerReview, Critique, AnonymizedPeerReview) were already Wu-2025-safe: they never exposed confidence or tallies to peers, only parsed content. The opt exists for callers who want to audit chair sensitivity to confidence signal even at the terminal synthesis step.

Councils.JuryWithRetry — SHIPPED

K judges run :independent_analysis in parallel; iterate until avg :self_report confidence ≥ threshold or max_iterations reached. Sensible defaults: confidence_threshold: 0.7, max_iterations: 2. Judges DO NOT see each other across iterations: independent re-sample, not debate (Wu 2025 conformity mitigation). Auto-injects confidence: :self_report on judges that didn't opt in. Example at examples/jury_with_retry_example.exs.
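The retry loop described above can be sketched in Python as follows. This illustrates the control flow only and is not the CouncilEx implementation: the `Judge` type, the return shape, and the `scripted_judge` stub are hypothetical, and the judges run sequentially here rather than in parallel. The two defaults mirror the ones quoted above.

```python
from statistics import mean
from typing import Callable, Iterable, List, Tuple

# Hypothetical judge type: question -> (verdict, self-reported confidence 0..1).
Judge = Callable[[str], Tuple[str, float]]


def jury_with_retry(question: str,
                    judges: List[Judge],
                    confidence_threshold: float = 0.7,
                    max_iterations: int = 2):
    """Re-sample all judges until average confidence clears the threshold.

    Returns (last round's results, its average confidence, iterations used).
    """
    if not judges or max_iterations < 1:
        raise ValueError("need at least one judge and one iteration")
    for iteration in range(1, max_iterations + 1):
        # Each judge sees only the question, never another judge's output.
        results = [judge(question) for judge in judges]
        avg = mean([confidence for _, confidence in results])
        if avg >= confidence_threshold:
            break
    return results, avg, iteration


def scripted_judge(verdict: str, confidences: Iterable[float]) -> Judge:
    """Stub judge that replays a fixed sequence of confidences."""
    remaining = iter(confidences)
    return lambda question: (verdict, next(remaining))
```

Because no state is passed between iterations, a retry is an independent re-sample rather than a debate round, which is the property the conformity mitigation depends on.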

Items NOT worth chasing

  • Adjudicator KG construction: too task-specific. The 0.99 F1 doesn't generalize. KG-as-veto is a doc pattern, not a feature.
  • Chaos-MoA's same-model temperature diversity: an anti-pattern per Wu 2025. Heterogeneous models + heterogeneous prompts beat single-model + temp variance.
  • Wholesale port of 0xNyk's 18-persona set: interesting demo, but personas are easy to define ad-hoc. Counterweight presets capture the actual insight.

Three insights worth internalizing

  1. Diversity is the load-bearing variable. Wu 2025 + MAD + Council Mode converge: gains come from heterogeneous strong base models, not clever debate machinery. CouncilEx's multi-provider design is correct; recommended user pattern is "one strong model from each major vendor," not "five copies of the same model with different temps."
  2. Visible majority is a footgun. Karpathy handles vendor anonymization. Nobody fully handles the running-tally problem. CouncilEx's built-in debate/critique rounds already keep peer confidence and tallies hidden until synthesis (confirmed by the gap #1 audit), and that default should stay.
  3. Symbolic grounding > more agents. For tasks with structured ground truth (taxonomy, schema, KG), a small symbolic check beats adding LLMs. Worth a CouncilEx pattern doc: tool-as-veto.

References