Comparison of CouncilEx against the most-cited papers and notable GitHub projects in the multi-agent LLM space (as of 2026-05). Catalogues what's already covered, where the gaps are, and which gaps are worth closing.
Source: research pass over ~/Downloads/llm_council_overview.md.
Items surveyed
Papers
| # | Paper | Year | Citation |
|---|---|---|---|
| 1 | Council Mode (Wu, Li, Feng, Li, Wang, Wang) | 2026 | arXiv:2604.02923 |
| 2 | Can LLM Agents Really Debate? (Wu, Li, Li) | 2025 | arXiv:2511.07784 |
| 3 | Multi-Agent Debate (MAD) (Du, Li, Torralba, Tenenbaum, Mordatch) | ICML 2024 | arXiv:2305.14325 |
| 4 | Adjudicator: KG-Informed Council (You, Paul) | 2025 | arXiv:2512.13704 |
Repos
| # | Repo | Stars | Status |
|---|---|---|---|
| 1 | karpathy/llm-council | 18.2k | experimental, frozen |
| 2 | bawfng04/Chaos-MoA-Pipeline | 2 | hobby, active |
| 3 | dayeonki/cultural_debate | ~8 | research code, ACL 2025 oral |
| 4 | dubs3c/council and 0xNyk/council-of-high-intelligence | 7 / 615 | two candidates for "The_Council" |
Per-item summaries
1. Council Mode (arXiv:2604.02923)
Heterogeneous multi-agent consensus framework. Parallel LLMs →
reliability-weighted aggregation → structured synthesis. Reports ~36%
hallucination reduction on TruthfulQA/HaluEval, bias mitigation on
BBQ/BiasBench. Already ported to CouncilEx in v0.8; see
COUNCIL_MODE_PAPER.md.
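As a rough illustration of what reliability-weighted aggregation means, the sketch below sums each member's reliability weight behind the answer it gave and returns the heaviest answer. It is a language-neutral Python sketch: the function name `weighted_consensus`, the example weights, the default weight of 1.0, and the tie-breaking rule are all assumptions made for illustration, not the paper's method or CouncilEx's API.

```python
from collections import defaultdict

def weighted_consensus(answers, reliability):
    """Pick the answer backed by the largest total reliability weight.

    answers:     {member_name: answer}
    reliability: {member_name: weight}; a missing weight defaults to 1.0 (assumption)
    """
    totals = defaultdict(float)
    for member, answer in answers.items():
        totals[answer] += reliability.get(member, 1.0)
    # Ties go to the answer that was seen first (assumption)
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

# Two low-reliability members agree, one high-reliability member dissents
answers = {"model_a": "Paris", "model_b": "Lyon", "model_c": "Lyon"}
reliability = {"model_a": 0.9, "model_b": 0.3, "model_c": 0.4}
winner, totals = weighted_consensus(answers, reliability)
```

With these made-up weights the single reliable member outweighs the two-member majority, which is the behavior that separates this scheme from a plain majority vote.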
2. Can LLM Agents Really Debate? (arXiv:2511.07784)
Controlled study on Knight–Knave–Spy logic puzzles. Six factors varied (team size/composition, confidence visibility, debate order/depth, difficulty). Process-level analysis tracks flip / hold / capitulate behavior.
Findings that matter for CouncilEx:
- Diversity > structure. Heterogeneous strong-model teams drive almost all of the gains; turn order, confidence display, and additional rounds yield only marginal improvements.
- Visible majority hurts correctness. When a wrong answer becomes the apparent majority, individually correct agents capitulate. Effective debate teams are precisely those that can overturn incorrect consensus.
- Reasoning quality (validity, not assertiveness) predicts truth flips.
- Diminishing returns on depth; small + diverse beats large + homogeneous.
3. Multi-Agent Debate (MAD, Du et al.) (arXiv:2305.14325)
Canonical foundational paper. N homogeneous LLM instances; each round, every agent sees the others' prior responses and revises. Final answer = majority vote at last round. Tested on Arithmetic, GSM8K, Chess, MMLU, biographies. Accuracy scales monotonically with agents and rounds; cross-model debate (ChatGPT + Bard) solves problems neither solves alone. Limits: O(N×R) cost, long contexts, can converge on wrong consensus, requires sufficiently capable base models.
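The round structure described above can be sketched as follows. This is a minimal Python illustration with stub agents standing in for LLM calls; the names `debate`, `stubborn`, and `follower` are hypothetical, and the code is not taken from the paper or from CouncilEx.

```python
from collections import Counter

def debate(agents, question, rounds=2):
    """Run a MAD-style debate; return (final_answer, per-round history).

    agents: callables agent(question, peer_answers) -> answer, where
            peer_answers holds the other agents' previous-round answers
            (empty in the first round).
    """
    answers = [agent(question, []) for agent in agents]
    history = [answers]
    for _ in range(rounds - 1):
        # Every agent revises after seeing everyone else's prior answer
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
        history.append(answers)
    # Final answer is the majority vote over the last round
    final = Counter(answers).most_common(1)[0][0]
    return final, history

def stubborn(value):
    # Stub agent that never changes its answer
    return lambda question, peers: value

def follower(initial):
    # Stub agent that adopts the most common peer answer once peers have spoken
    def agent(question, peers):
        return Counter(peers).most_common(1)[0][0] if peers else initial
    return agent

agents = [stubborn("42"), stubborn("42"), follower("41")]
final, history = debate(agents, "6 * 7 = ?", rounds=2)
```

The loop makes N agent calls in each of R rounds, which is where the O(N×R) cost comes from. The `follower` stub also shows why convergence alone proves nothing: it adopts the majority answer whether or not that answer is right.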
4. Adjudicator (arXiv:2512.13704)
KG-grounded 3-agent council for noisy-label correction in e-commerce taxonomy. Three agents (Policy Expert weight 1.0, Data Analyst weight 2.0, Pattern Detector weight 0.5) vote on whether a product's category is wrong; KG provides a hard structural override.
Headline result (F1), with caveats: 0.48 single-LLM → 0.59 plain council → 0.99 KG-informed council. The 0.59 → 0.99 jump comes mostly from the KG hard gate, not from the multi-agent debate. The test set is a balanced 1k-item sample, and the perfect precision is largely a threshold/override artifact.
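A minimal sketch of a weighted vote with a symbolic override, assuming the three agent weights quoted above. The threshold value, the `kg_says_valid` input, and the function name `adjudicate` are assumptions for illustration; the paper's actual decision rule and KG interface may differ.

```python
# Agent weights as quoted in the summary above
WEIGHTS = {"policy_expert": 1.0, "data_analyst": 2.0, "pattern_detector": 0.5}

def adjudicate(votes, kg_says_valid, threshold=0.5):
    """Decide whether a product's category label is wrong.

    votes:         {agent_name: True if that agent thinks the label is wrong}
    kg_says_valid: True/False from a symbolic check, or None when the
                   knowledge graph has no opinion (hypothetical interface)
    threshold:     share of total weight needed to flag (assumed value)
    """
    # Hard structural override: the symbolic check wins whenever it answers
    if kg_says_valid is not None:
        return (not kg_says_valid), "kg_override"
    total = sum(WEIGHTS[agent] for agent in votes)
    flagged = sum(WEIGHTS[agent] for agent, wrong in votes.items() if wrong)
    return flagged / total > threshold, "weighted_vote"

votes = {"policy_expert": True, "data_analyst": False, "pattern_detector": True}
by_vote = adjudicate(votes, kg_says_valid=None)
by_kg = adjudicate(votes, kg_says_valid=False)
```

In this example two of three agents flag the label, yet the weighted vote does not (1.5 of 3.5 total weight), while the KG override flags it regardless of the vote. That asymmetry is consistent with the caveat above that the gate, not the debate, carries the result.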
5. karpathy/llm-council
Three stages: opinions → anonymized peer review (rankings) → chairman
synthesis. Anonymization is server-side enforced; judging models
never see vendor labels. Karpathy's choice of frontier models (see
upstream repo) is mirrored in CouncilEx's port with active OpenRouter
IDs — see PROVIDER_MODELS.md for the catalog.
Chairman: a single strong model. Strict prompt-contract
ranking format with regex fallback. CouncilEx port:
TUTORIAL_KARPATHY_COUNCIL.md.
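The anonymization step can be illustrated with a short Python sketch: responses are shuffled and relabelled before judges see them, and the label-to-model key stays on the server. The function name `anonymize`, the "Response A" label scheme, and the placeholder model IDs are assumptions for illustration, not code from the upstream repo or from CouncilEx.

```python
import random
import string

def anonymize(responses, seed=None):
    """Replace vendor-identifying keys with neutral labels.

    responses: {model_id: response_text} (at most 26 entries in this sketch)
    Returns (anonymized, key): anonymized maps "Response A", "Response B", ...
    to text; key maps each label back to its model_id and is never shown
    to the judging models.
    """
    rng = random.Random(seed)
    model_ids = list(responses)
    rng.shuffle(model_ids)  # break any link between label order and vendor
    anonymized, key = {}, {}
    for letter, model_id in zip(string.ascii_uppercase, model_ids):
        label = f"Response {letter}"
        anonymized[label] = responses[model_id]
        key[label] = model_id
    return anonymized, key

responses = {
    "vendor_one/model": "Answer text 1",
    "vendor_two/model": "Answer text 2",
    "vendor_three/model": "Answer text 3",
}
anonymized, key = anonymize(responses, seed=0)
```

After judging, rankings expressed in labels are mapped back through `key` to attribute scores to models.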
6. Chaos-MoA-Pipeline
Four stages: Generate → Cross-Critique → Rebuttal → Multi-Judge majority with 0–100 confidence. Same model + temperature/nonce diversity (Wu 2025 anti-pattern). Notable: cache-busting nonces, confidence-triggered re-run with more workers, "brutally fault-tolerant" graceful degradation. Diverges from vanilla MoA (Wang et al.) by adding adversarial Critique + Rebuttal + judge vote.
7. cultural_debate (Ki et al., ACL 2025)
Two debaters + neutral judge for cross-country cultural-norm alignment. NormAd-ETI benchmark (~2.6k stories, 75 countries). Headline: pairs of small 7–9B debaters + 27B judge match a single 27B model's accuracy with better cultural parity, cheaper and fairer. "Fairness" = accuracy variance across cultural clusters, not demographic identity.
8. The_Council (two candidates)
- dubs3c/council: 3 personas (Architect / Critic / AppSec), Proposal → Debate (max 3) → Early-term on consensus → Moderator synthesis. Per-agent temperatures via YAML. ~7 stars.
- 0xNyk/council-of-high-intelligence: 18 named personas (Aristotle, Feynman, Karpathy, etc.) intentionally paired as counterweights, distributed across 4 providers. Three modes (Full / Quick / Duo). Enforces dissent quotas, novelty gates, anti-recursion. ~615 stars.
Architecture coverage matrix
| Item | Members | Round flow | Aggregation | CouncilEx coverage |
|---|---|---|---|---|
| Council Mode (paper) | Heterogeneous LLMs | parallel → weighted consensus | reliability-weighted | ✅ Councils.WeightedConsensus |
| Can LLM Agents Really Debate? | Heterogeneous strong models | controlled-debate study | majority vote (criticized) | ⚠️ no anti-conformity safeguards |
| MAD (Du et al.) | N homogeneous instances | propose → see-others → revise (R rounds) → majority | majority vote | ✅ Iterate(:critique), Councils.Consensus |
| Adjudicator | 3 specialized agents | KG → 3 agents vote → weighted threshold + KG override | weighted binary + symbolic veto | ⚠️ weighted vote yes, no KG layer |
| karpathy/llm-council | 4 frontier models | opinions → anonymized peer review → chairman | ranking + chairman | ✅ AnonymizedPeerReview |
| Chaos-MoA-Pipeline | Same model multi-temp | Generate → Critique → Rebuttal → Judge majority | judge majority + confidence | ⚠️ critique/rebuttal yes, no multi-judge + retry |
| cultural_debate | 2 debaters + 1 judge | independent → debate → judge verdict | judge | ✅ Councils.PeerReview — but no parity metric |
| dubs3c/council | 3 personas | Proposal → Debate (max 3) → Synth | consensus convergence | ✅ Councils.Consensus w/ until callback |
| 0xNyk/council-of-high-intelligence | 18 personas across providers | independent → cross-exam → crystallize → verdict | synth w/ dissent quotas | ⚠️ no dissent quota / novelty gate |
What CouncilEx already covers
| Pattern | Mapping |
|---|---|
| Heterogeneous parallel members | Multi-provider adapters + Councils.ParallelPanel |
| Independent-then-aggregate | :independent_analysis round |
| MAD-style iterative critique | Rounds.Iterate(:critique) + Councils.Consensus |
| Anonymized peer review | Rounds.AnonymizedPeerReview |
| Chairman synthesis | Rounds.Synthesis + chair macro |
| Reliability-weighted consensus | Councils.WeightedConsensus |
| Per-member confidence | Confidence.{:self_report, :logprob} |
| Early termination on consensus | Iterate(until: ...) |
| Multi-vendor heterogeneity | OpenAI / Anthropic / Gemini / Ollama / OpenRouter adapters |
| Persona / role definition | use CouncilEx.Member + system_prompt |
| Tournament / pairwise | Councils.Tournament + Rounds.PairwiseElimination |
| Voting / aggregator | Rounds.Vote + Aggregators.{Plurality, WeightedMean} |
| Bias diagnostic | BiasDetector.analyze/2 |
Gaps surfaced by this research
Ordered by expected lift relative to implementation cost.
🟢 Easy adds
- ✅ expose_confidence opt on WeightedSynthesis: SHIPPED. Audit found existing PeerReview / Critique / AnonymizedPeerReview only surface parsed content / response text to peers, never raw confidence or vote tallies, so they are Wu-2025-safe by default. Only WeightedSynthesis ever exposed confidence (chair-terminal, not peer-conformity-prone). Added :expose_confidence opt (default true) on Rounds.WeightedSynthesis + Councils.WeightedConsensus so callers can audit chair sensitivity to confidence signal. See lib/council_ex/rounds/weighted_synthesis.ex moduledoc.
- ✅ Councils.JuryWithRetry: SHIPPED. K judges run :independent_analysis; iterate until avg :self_report confidence ≥ threshold or max_iterations exhausted. Judges do NOT see each other across iterations (Wu 2025 conformity mitigation: independent re-sample, not debate). Defaults: confidence_threshold: 0.7, max_iterations: 2, auto-injects :self_report. See lib/council_ex/councils/jury_with_retry.ex.
- Ranking-parser regex fallback for cheap models (karpathy). Karpathy's FINAL RANKING: strict-format prompt with regex fallback. Useful for local/Ollama models that can't do structured output reliably.
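A minimal sketch of the ranking-parser idea from the last item above: try the strict numbered format first, then fall back to collecting label mentions in order of appearance. The function name `parse_ranking`, the "Response X" label scheme, and both regular expressions are assumptions for illustration; they are not copied from the upstream repo.

```python
import re

def parse_ranking(text):
    """Extract an ordered list of response labels from a judge's reply."""
    _, header, tail = text.partition("FINAL RANKING:")
    # Only look after the header when it exists, so earlier prose is ignored
    scope = tail if header else text
    # Strict path: one numbered label per line, e.g. "1. Response C"
    strict = re.findall(r"^\s*\d+\.\s*(Response [A-Z])\s*$", scope, flags=re.MULTILINE)
    if header and strict:
        return strict
    # Fallback: any label mentions, de-duplicated, in order of first appearance
    seen = []
    for label in re.findall(r"Response [A-Z]\b", scope):
        if label not in seen:
            seen.append(label)
    return seen

well_formed = "Response A is thorough.\nFINAL RANKING:\n1. Response C\n2. Response A\n3. Response B\n"
sloppy = "FINAL RANKING: Response B > Response A, then Response C"
no_header = "I think Response A is best, Response B second."

ranked = [parse_ranking(t) for t in (well_formed, sloppy, no_header)]
```

The fallback trades precision for recall: without a header it treats order of mention as rank order, which can be wrong, so a caller may want to record which path produced each ranking.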
🟡 Medium adds
- Cultural / fairness parity metric (cultural_debate). BiasDetector flags demographic disagreement but computes no parity metric (variance of accuracy across protected groups). Add a Fairness.parity/2 helper that consumes a graded result set.
- Persona-counterweight presets (0xNyk). Opposing-persona pairs (Aristotle↔Lao Tzu, Feynman↔Socrates) as a Profiles.Counterweight library. Pure prompt engineering.
- Dissent quotas / novelty gates (0xNyk). Round-level constraint: nth response must reference m peers, include ≥1 dissent, add novel content. Postprocess validator that bounces non-compliant outputs.
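The parity metric in the first item above reduces to per-group accuracy plus its variance across groups. The Python sketch below shows that computation on a toy graded set; the function name `parity`, the input shape, and the choice of population variance are assumptions, not the proposed Fairness.parity/2 signature or the cultural_debate paper's exact definition.

```python
from collections import defaultdict
from statistics import pvariance

def parity(graded):
    """Return per-group accuracy and its variance across groups.

    graded: iterable of (group, is_correct) pairs
    A lower variance means accuracy is spread more evenly over groups.
    """
    hits, counts = defaultdict(int), defaultdict(int)
    for group, is_correct in graded:
        counts[group] += 1
        hits[group] += int(is_correct)
    accuracy = {group: hits[group] / counts[group] for group in counts}
    # Population variance, since the groups are the full set being compared
    return accuracy, pvariance(list(accuracy.values()))

graded = [
    ("cluster_a", True), ("cluster_a", True),
    ("cluster_b", True), ("cluster_b", False),
]
accuracy, variance = parity(graded)
```

Reporting the per-group accuracies alongside the variance matters, because a low variance can also mean the system is uniformly bad.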
🔴 Harder adds
- Symbolic-grounding / tool-as-veto layer (Adjudicator). Most of Adjudicator's gain (the 0.59 → 0.99 step) came from the KG hard gate. The feature would be a non-LLM check that can veto chair output. Real lift is task-specific.
- Process-level flip/capitulation diagnostics (Wu 2025). Track when agents flip vs hold across rounds. Diagnostics.flip_analysis(result) surfaces the conformity failure mode.
- Logical-validity-aware aggregation (Wu 2025). Reasoning quality, not assertiveness, predicts truth flips. RatingAggregator that scores arguments on validity (LLM-judged) before voting. Pairs with Reliability store.
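The flip/hold diagnostic in the second item above can be sketched as a pass over consecutive rounds. This Python version assumes a known ground-truth answer and a per-round list of answers indexed by agent; the category names and the function name `flip_analysis` are illustrative assumptions, not Wu et al.'s exact taxonomy or the proposed Diagnostics API.

```python
def flip_analysis(history, truth):
    """Count how agents move between consecutive rounds.

    history: list of rounds, each a list of answers indexed by agent
    truth:   the known correct answer
    """
    counts = {"hold": 0, "corrected": 0, "capitulated": 0, "other_flip": 0}
    for before, after in zip(history, history[1:]):
        for old, new in zip(before, after):
            if old == new:
                counts["hold"] += 1
            elif new == truth:
                counts["corrected"] += 1    # wrong -> right
            elif old == truth:
                counts["capitulated"] += 1  # right -> wrong
            else:
                counts["other_flip"] += 1   # wrong -> different wrong
    return counts

# Agent 0 starts correct ("A") and gives in to the wrong majority ("B")
history = [["A", "B", "B"], ["B", "B", "B"]]
counts = flip_analysis(history, truth="A")
```

A high capitulated-to-corrected ratio is the signal of interest: it indicates that debate is moving agents toward the majority rather than toward the truth.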
Top recommendations status
Both shipped in v0.8:
✅ expose_confidence opt — SHIPPED
Rounds.WeightedSynthesis + Councils.WeightedConsensus accept
:expose_confidence (default true). Audit confirmed other built-in
peer-visibility rounds (PeerReview, Critique, AnonymizedPeerReview)
were already Wu-2025-safe: they never exposed confidence or tallies
to peers, only parsed content. The opt exists for callers who want to
audit chair sensitivity to confidence signal even at the terminal
synthesis step.
✅ Councils.JuryWithRetry — SHIPPED
K judges run :independent_analysis in parallel; iterate until avg
:self_report confidence ≥ threshold or max_iterations reached.
Sensible defaults: confidence_threshold: 0.7, max_iterations: 2.
Judges DO NOT see each other across iterations: independent
re-sample, not debate (Wu 2025 conformity mitigation). Auto-injects
confidence: :self_report on judges that didn't opt in. Example at
examples/jury_with_retry_example.exs.
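The control flow described above can be sketched in a few lines. This is a Python illustration with stub judges, not the Elixir implementation: the function and helper names are hypothetical, judges run sequentially here rather than in parallel, and only the two default values are taken from the text.

```python
from statistics import mean

def jury_with_retry(judges, question, confidence_threshold=0.7, max_iterations=2):
    """Re-sample independent judges until average confidence clears the bar.

    judges: callables judge(question) -> (verdict, confidence in [0, 1]).
            Each judge receives only the question, never another judge's
            output, in every iteration. Assumes max_iterations >= 1.
    """
    for iteration in range(1, max_iterations + 1):
        results = [judge(question) for judge in judges]
        avg = mean(confidence for _, confidence in results)
        if avg >= confidence_threshold:
            break
    return {"results": results, "avg_confidence": avg, "iterations": iteration}

def make_judge(confidences):
    # Stub judge that reports a scripted confidence on each successive call
    calls = iter(confidences)
    return lambda question: ("approve", next(calls))

judges = [make_judge([0.5, 0.9]), make_judge([0.6, 0.8])]
outcome = jury_with_retry(judges, "Is the patch safe?")
```

In this run the first pass averages 0.55, below the 0.7 threshold, so the jury is re-sampled once and the second pass is returned. When the threshold is never met, the last pass is still returned, so callers should check `avg_confidence` rather than assume success.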
Items NOT worth chasing
- Adjudicator KG construction: too task-specific. The 0.99 F1 doesn't generalize. KG-as-veto is a doc pattern, not a feature.
- Chaos-MoA's same-model temperature diversity: an explicit anti-pattern per Wu 2025. Heterogeneous models + heterogeneous prompts beat single-model + temp variance.
- Wholesale port of 0xNyk's 18-persona set: interesting demo, but personas are easy to define ad-hoc. Counterweight presets capture the actual insight.
Three insights worth internalizing
- Diversity is the load-bearing variable. Wu 2025 + MAD + Council Mode converge: gains come from heterogeneous strong base models, not clever debate machinery. CouncilEx's multi-provider design is correct; recommended user pattern is "one strong model from each major vendor," not "five copies of the same model with different temps."
- Visible majority is a footgun. Karpathy handles vendor anonymization. Nobody fully handles the running-tally problem. CouncilEx's built-in debate/critique rounds already withhold peer confidence and tallies until synthesis (confirmed by the audit under the first easy add), and that should remain the default.
- Symbolic grounding > more agents. For tasks with structured ground truth (taxonomy, schema, KG), a small symbolic check beats adding LLMs. Worth a CouncilEx pattern doc: tool-as-veto.
References
- arXiv:2604.02923 Council Mode
- arXiv:2511.07784 Can LLM Agents Really Debate?
- arXiv:2305.14325 MAD (Du et al.)
- arXiv:2512.13704 Adjudicator
- arXiv:2505.24671 cultural_debate (ACL 2025)
- https://github.com/karpathy/llm-council
- https://github.com/bawfng04/Chaos-MoA-Pipeline
- https://github.com/dayeonki/cultural_debate
- https://github.com/dubs3c/council
- https://github.com/0xNyk/council-of-high-intelligence