This tutorial walks through reproducing the karpathy/llm-council 3-stage deliberation pattern using council_ex. By the end you will have a council that:
- Stage 1: Opinions. Sends a query to N frontier models in parallel and collects their independent answers.
- Stage 2: Anonymized peer review. Hands each model the other models' answers under blind labels (Response A, Response B, …) and asks each to rank them. Anonymization stops models from recognizing (and favoring) their own outputs.
- Stage 3: Chairman synthesis. A designated model reads every prior answer plus every peer ranking and produces the final reply.
The Elixir version takes ~60 lines.
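Before bringing in the library, it can help to see the stage-1 fan-out in plain Elixir. The following is a minimal standalone sketch that does not use council_ex: `FanOutSketch` and `ask/2` are hypothetical stand-ins for real model calls, and the only real API assumed is `Task.async_stream/3` from the standard library.

```elixir
defmodule FanOutSketch do
  # Hypothetical stand-in for a network call to one model.
  def ask(model, question), do: "#{model} answers: #{question}"

  # Stage 1: query every model concurrently and collect answers keyed by model.
  def stage1(models, question) do
    models
    |> Task.async_stream(fn model -> {model, ask(model, question)} end,
      timeout: 30_000,
      on_timeout: :kill_task
    )
    |> Enum.reduce(%{}, fn
      {:ok, {model, answer}}, acc -> Map.put(acc, model, answer)
      # A task that exceeds the timeout is killed and skipped.
      {:exit, _reason}, acc -> acc
    end)
  end
end

answers = FanOutSketch.stage1([:gpt, :claude, :gemini], "LiveView or React?")
IO.inspect(answers)
```

Each model's answer is produced independently of the others, which is the property stage 2 relies on.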
Prerequisites
# mix.exs
def deps do
  [{:council_ex, "~> 0.7"}]
end

export OPENROUTER_API_KEY=sk-or-v1-...
OpenRouter is the cleanest path because all four council models live
behind one key. Single-vendor setups (OpenAI, Anthropic, Gemini) work
the same way: change the provider: and model: fields.
Step 1: Configure the provider
Application.put_env(:council_ex, :providers,
  openrouter: [
    adapter: CouncilEx.Provider.Adapters.OpenRouter,
    api_key: {:system, "OPENROUTER_API_KEY"}
  ]
)

Step 2: Define members
You need two member roles: an Author (used by every council seat in stage 1, and again in stage 2 wearing a judge hat) and a Chair (stage 3 synthesizer).
defmodule LLMCouncil.Members.Author do
  use CouncilEx.Member

  role "Author"

  system_prompt """
  You are one voice on an LLM Council. You will receive the user's
  query in stage 1. In stage 2 you will receive a `peers` map whose
  keys are anonymous labels ("Response A", "Response B", …) and whose
  values are the OTHER members' stage-1 answers: yours has been
  removed.

  Stage 1 contract: answer the user's `:question` directly and
  thoroughly. No preamble.

  Stage 2 contract: read every entry in `peers`, then emit a strict
  JSON object matching this schema:
    {"ordering": ["Response X", "Response Y", "Response Z"],
     "rationale": "<one sentence per ranked entry>"}
  Most-preferred first. No commentary outside the JSON.
  """

  @impl true
  def output_schema, do: CouncilEx.Schemas.Ranking
end
defmodule LLMCouncil.Members.Chair do
  use CouncilEx.Member

  role "Chair"

  system_prompt """
  You are the Chair of an LLM Council. You receive every prior member's
  stage-1 answer plus every member's stage-2 ranking. Synthesize a
  single final answer for the user. Acknowledge tradeoffs the rankings
  surface. Do NOT cite anonymous labels: speak in the user's voice.
  """
end

The output_schema/0 callback on Author makes stage-2 rankings
parse cleanly into %CouncilEx.Schemas.Ranking{}; :ordering is the
field the aggregator reads.
Step 3: Define the council
defmodule LLMCouncil do
  use CouncilEx

  alias LLMCouncil.Members.{Author, Chair}

  # Stage 1 council (each seat = one frontier model)
  member :gpt, Author, provider: :openrouter, model: "openai/gpt-4o-mini"
  member :claude, Author, provider: :openrouter, model: "anthropic/claude-sonnet-4-6"
  member :gemini, Author, provider: :openrouter, model: "google/gemini-2.5-flash"
  member :llama, Author, provider: :openrouter, model: "meta-llama/llama-3.3-70b-instruct"

  # Stage 1: opinions in parallel
  round :independent_analysis

  # Stage 2: every member judges the other members' stage-1 outputs,
  # but sees them under anonymous labels.
  round :anonymized_peer_review,
    aggregator: CouncilEx.Aggregators.PeerRanking

  # Stage 3: chairman synthesis (auto-runs when chair/2 is declared)
  chair Chair, id: :chair,
    provider: :openrouter, model: "google/gemini-2.5-flash"
end

:anonymized_peer_review does the work. It:
- Builds a global label map (gpt → Response A, claude → Response B, …) by sorting member ids alphabetically.
- Removes each judge's own slot from the peers map it sees.
- Threads the label_to_id map into the aggregator so winners, scores, and per-judge ballots are all reported in the original-id space even though judges only ever saw labels.
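The three steps above can be sketched in plain Elixir. This is an illustration of the idea only, not the library's implementation: `AnonymizeSketch`, its function names, and the placeholder member ids (`:alpha`, `:bravo`, `:charlie`) are all hypothetical, and the sketch assumes at most 26 members so that single-letter labels suffice.

```elixir
defmodule AnonymizeSketch do
  # Build the global label map: sorted ids get "Response A", "Response B", …
  def label_map(member_ids) do
    member_ids
    |> Enum.sort()
    |> Enum.with_index()
    |> Map.new(fn {id, index} -> {id, "Response " <> <<?A + index>>} end)
  end

  # The view one judge gets: every answer except its own, keyed by label.
  def peers_for(judge_id, answers, labels) do
    answers
    |> Map.delete(judge_id)
    |> Map.new(fn {id, answer} -> {Map.fetch!(labels, id), answer} end)
  end

  # Inverse map, used to translate ballots back into member ids.
  def label_to_id(labels), do: Map.new(labels, fn {id, label} -> {label, id} end)
end

answers = %{alpha: "answer 1", bravo: "answer 2", charlie: "answer 3"}
labels = AnonymizeSketch.label_map(Map.keys(answers))

IO.inspect(AnonymizeSketch.peers_for(:bravo, answers, labels))
IO.inspect(AnonymizeSketch.label_to_id(labels))
```

Because the label map is global rather than per judge, a given label refers to the same member in every ballot, which is what lets an aggregator combine ballots before translating them back to ids.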
Step 4: Run a query
{:ok, result} =
  CouncilEx.run(LLMCouncil,
    %{question: "Should an early-stage Elixir startup pick LiveView " <>
                "or Inertia.js + React for its admin panel?"})

IO.puts(result.final.content)

That prints the chairman's synthesis. The interesting parts are in
result.rounds:
[stage1, stage2, _synthesis] = result.rounds

# Each model's raw stage-1 answer:
Enum.each(stage1.member_results, fn {id, mr} ->
  IO.puts("=== #{id} ===")
  IO.puts(mr.response.content)
end)

# Stage-2 aggregate ranking (keyed by member id, not by label):
%{
  winner: winner,
  scores: scores,
  raw: %{
    avg_position: avg_position,
    votes: votes,
    judge_ballots: judge_ballots,
    label_to_id: label_to_id
  }
} = stage2.aggregated

IO.inspect(winner, label: "consensus best stage-1 answer")
IO.inspect(avg_position, label: "average rank position per model (lower = better)")
IO.inspect(judge_ballots, label: "what each judge ranked (de-anonymized)")
IO.inspect(label_to_id, label: "anon label → real model id")

avg_position is what karpathy's UI shows above the stage-2 tabs:
each model's average rank across all judges. label_to_id lets a UI
display "Response A → openai/gpt-4o-mini" alongside the raw evaluation
text the model actually produced.
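To make "average rank across all judges" concrete, here is a minimal sketch of that arithmetic in plain Elixir. It assumes ballots shaped as a map from judge id to an ordered list of member ids (most-preferred first); that shape, the module name `AvgPositionSketch`, and the sample ballots are assumptions for illustration, and the library's aggregator may compute or break ties differently.

```elixir
defmodule AvgPositionSketch do
  # ballots: %{judge_id => [member_id, ...]} with the most-preferred id first.
  def avg_position(ballots) do
    ballots
    |> Enum.flat_map(fn {_judge, ordering} ->
      # Positions are 1-based, so position 1 is the best rank.
      Enum.with_index(ordering, 1)
    end)
    |> Enum.group_by(fn {id, _pos} -> id end, fn {_id, pos} -> pos end)
    |> Map.new(fn {id, positions} -> {id, Enum.sum(positions) / length(positions)} end)
  end
end

# Made-up ballots: each judge ranks the three other members.
ballots = %{
  gpt: [:claude, :gemini, :llama],
  claude: [:gemini, :gpt, :llama],
  gemini: [:claude, :gpt, :llama],
  llama: [:claude, :gemini, :gpt]
}

avg = AvgPositionSketch.avg_position(ballots)
{best, _position} = Enum.min_by(avg, fn {_id, position} -> position end)

IO.inspect(avg, label: "average position")
IO.inspect(best, label: "lowest average position")
```

In these sample ballots `:claude` is ranked first by every judge that sees it, so its average position is 1.0 and it comes out as the best-ranked member.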
What anonymization buys you
Without anonymization, every judge knows whose work is whose. Models have well-documented self-recognition behavior: given an unlabeled mix, they can often pick out their own writing style and weight it favorably. Stage-2 rankings then collapse toward "everyone ranks themselves first," which is useless signal.
AnonymizedPeerReview replaces ids with Response A/B/C before any
peer-review prompt is rendered. Because each judge's own answer is
dropped from its peers map, the labels it does see give no hint of
which model wrote which answer (unless a writing style is
unmistakable, which is a separate problem). The aggregator
de-anonymizes after the fact.
Conversation history (multi-turn)
The library is one-shot per run/3. For multi-turn chat history,
prepend prior turns into the input map yourself:
history = [
  %{role: "user", content: "..."},
  %{role: "assistant", content: result_prev.final.content}
]

{:ok, result} =
  CouncilEx.run(LLMCouncil,
    %{question: "follow-up", history: history})

Then reference :history in your member system prompts. Persisting
turns is the caller's job; the library has no built-in Session
helper. Wrap CouncilEx in your own Phoenix / Ecto layer if you need a
chat UI on top. docs/RUNNING_IN_PHOENIX.md covers the integration
patterns.
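Since turn bookkeeping is the caller's job, a small helper can keep it in one place. The sketch below is plain Elixir with hypothetical names (`HistorySketch`, `append_turn/3`, `next_input/2`); it only builds the input map in the shape shown above and does not call the library.

```elixir
defmodule HistorySketch do
  # Append one completed exchange to the running history.
  def append_turn(history, question, answer) do
    history ++
      [
        %{role: "user", content: question},
        %{role: "assistant", content: answer}
      ]
  end

  # Build the input map for the next run from the history and a new question.
  def next_input(history, question), do: %{question: question, history: history}
end

# In real use the answer would come from the previous run's result.final.content.
history = HistorySketch.append_turn([], "first question", "first answer")
input = HistorySketch.next_input(history, "follow-up")

IO.inspect(input)
```

Keeping the history as a plain list of maps means it can be stored wherever the caller already persists data.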
Comparison to karpathy/llm-council
| Concern | karpathy/llm-council (Python) | This council (council_ex) |
|---|---|---|
| Stage 1 (opinions) | stage1_collect_responses async gather | :independent_analysis round, parallel members |
| Stage 2 (anon peer review) | stage2_collect_rankings + manual Response A relabel + regex parse | :anonymized_peer_review round + Schemas.Ranking JSON output + Aggregators.PeerRanking |
| Stage 3 (chairman) | stage3_synthesize_final single call | chair macro, auto-synthesis round |
| De-anon for UI | label_to_model returned in API response | aggregated.raw.label_to_id on the round result |
| Aggregate rankings | calculate_aggregate_rankings (avg position) | aggregated.raw.avg_position |
| Per-judge ballots | shown as raw text in tabs | aggregated.raw.judge_ballots (parsed + de-anonymized) |
| Failure handling | skip failed members, continue | failure_mode: :continue or :fail_fast, retries, timeouts |
| Telemetry | print() | :telemetry events + PubSub events |
| Persistence | JSON files | caller-owned (library has no DB) |
| UI | React + Vite app | none (library only: see companion-app notes) |
Troubleshooting
Stage-2 returns {:error, :no_rankings}. A judge's response did
not parse into %Schemas.Ranking{}. Check that the LLM is honoring the
JSON contract in the system prompt, and tighten the wording if it is
not. The Schemas.Ranking Ecto schema requires :ordering to be a list
of strings, so make sure the system prompt asks for a list of label
strings, not a numbered string.
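When debugging this, it can help to check a judge's decoded reply by hand. The following sketch is plain Elixir and independent of Schemas.Ranking: `RankingCheckSketch` and its error atoms are hypothetical, it expects the JSON to be decoded already into a map with string keys, and the "every shown label appears exactly once" check is an extra assumption rather than something the schema is documented to enforce.

```elixir
defmodule RankingCheckSketch do
  # decoded: the judge's reply after JSON decoding, with string keys.
  # allowed_labels: the labels this judge was actually shown.
  def check(%{"ordering" => ordering}, allowed_labels) when is_list(ordering) do
    cond do
      not Enum.all?(ordering, &is_binary/1) -> {:error, :non_string_entry}
      Enum.sort(ordering) != Enum.sort(allowed_labels) -> {:error, :label_mismatch}
      true -> :ok
    end
  end

  # Anything else, including a numbered string in "ordering", is rejected.
  def check(_other, _allowed_labels), do: {:error, :ordering_not_a_list}
end

labels = ["Response A", "Response B"]

IO.inspect(RankingCheckSketch.check(%{"ordering" => ["Response B", "Response A"]}, labels))
IO.inspect(RankingCheckSketch.check(%{"ordering" => "1. Response B 2. Response A"}, labels))
```

The second call shows the failure mode described above: a numbered string instead of a list of label strings.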
One judge keeps voting for "Response A" regardless of order. The
model has positional bias; that's a known weakness of LLM
ranking. Mitigations: shuffle the label order via
order: :shuffle (not yet exposed as a round opt; open an issue if
you need it), or use :tournament instead of :ranking so each
comparison is pairwise.
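To illustrate what shuffling would change, here is a plain-Elixir sketch that assigns labels to member ids in random order instead of sorted order. It is only a picture of the idea: `ShuffleSketch` is a hypothetical name, the placeholder ids are made up, and nothing here hooks into the round, which does not expose this option as noted above.

```elixir
defmodule ShuffleSketch do
  # Assign "Response A", "Response B", … to member ids in random order,
  # so the member behind "Response A" varies from run to run.
  def shuffled_label_map(member_ids) do
    member_ids
    |> Enum.shuffle()
    |> Enum.with_index()
    |> Map.new(fn {id, index} -> {id, "Response " <> <<?A + index>>} end)
  end
end

IO.inspect(ShuffleSketch.shuffled_label_map([:alpha, :bravo, :charlie]))
```

With a random assignment, a judge that always favors the first label still adds noise, but that noise no longer lands on the same member every time.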
A judge ranks itself. This should not happen, because prepare_input/3
drops the judge's own slot from peers. If you see it, file a bug with
the run id.
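A quick way to confirm the symptom before filing is to scan the de-anonymized ballots for judges whose own id appears in their ordering. This sketch assumes ballots shaped as a map from judge id to a list of ranked member ids; that shape and the name `SelfVoteSketch` are assumptions, so adapt the pattern to whatever judge_ballots actually contains.

```elixir
defmodule SelfVoteSketch do
  # judge_ballots: %{judge_id => [ranked member ids]} (assumed shape).
  # Returns the judges whose own id appears in their ballot.
  def offenders(judge_ballots) do
    for {judge, ordering} <- judge_ballots, judge in ordering, do: judge
  end
end

# Made-up ballots in which :gpt has ranked itself.
ballots = %{gpt: [:claude, :gpt], claude: [:gpt, :gemini], gemini: [:claude, :gpt]}

IO.inspect(SelfVoteSketch.offenders(ballots), label: "judges that ranked themselves")
```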
Reference
- Round: CouncilEx.Rounds.AnonymizedPeerReview
- Aggregator: CouncilEx.Aggregators.PeerRanking
- Helper: CouncilEx.Anonymize (pure function, usable outside rounds)
- Schema: CouncilEx.Schemas.Ranking
- Decision guide for PeerReview vs AnonymizedPeerReview: PEER_REVIEW_PATTERNS.md. Covers the failure modes anonymization prevents, when each round is the correct choice, and when neither fits.