Tutorial: Building a Karpathy-Style LLM Council

This tutorial walks through reproducing the karpathy/llm-council 3-stage deliberation pattern using council_ex. By the end you will have a council that:

  1. Stage 1: Opinions. Sends a query to N frontier models in parallel, collects their independent answers.
  2. Stage 2: Anonymized peer review. Hands each model the other models' answers under blind labels (Response A, Response B, …) and asks each to rank them. Anonymization stops models from recognizing (and favoring) their own outputs.
  3. Stage 3: Chairman synthesis. A designated model reads every prior answer plus every peer ranking and produces the final reply.

The Elixir version takes ~60 lines.
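
To make the stage-1 fan-out concrete before any library code appears, here is a minimal sketch in plain Elixir. It does not use council_ex: `FanOutSketch`, `collect/3`, and the stub member functions are hypothetical names standing in for real model calls. It only illustrates "ask N members in parallel, keep whatever came back in time".

```elixir
defmodule FanOutSketch do
  # Hypothetical helper, not part of council_ex.
  # members: %{member_id => fun}, where fun takes the question and
  # returns an answer string. Each fun runs in its own task.
  def collect(members, question, timeout \\ 5_000) do
    members
    |> Task.async_stream(
      fn {id, ask_fun} -> {id, ask_fun.(question)} end,
      timeout: timeout,
      on_timeout: :kill_task,
      max_concurrency: max(map_size(members), 1)
    )
    |> Enum.reduce(%{}, fn
      # A member answered in time: keep its answer under its id.
      {:ok, {id, answer}}, acc -> Map.put(acc, id, answer)
      # A member timed out: skip it and keep going.
      {:exit, _reason}, acc -> acc
    end)
  end
end

# Stub "models": plain functions standing in for real LLM calls.
members = %{
  gpt: fn q -> "gpt says: " <> q end,
  claude: fn q -> "claude says: " <> q end
}

answers = FanOutSketch.collect(members, "LiveView or React?")
IO.inspect(answers, label: "stage-1 answers")
```

The result is a map keyed by member id, which is the shape the later stages in this tutorial also work with.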

Prerequisites

# mix.exs
def deps do
  [{:council_ex, "~> 0.7"}]
end

# shell
export OPENROUTER_API_KEY=sk-or-v1-...

OpenRouter is the cleanest path because all four council models live behind one key. Single-vendor setups (OpenAI, Anthropic, Gemini) work the same way: change the provider: and model: fields.

Step 1: Configure the provider

Application.put_env(:council_ex, :providers,
  openrouter: [
    adapter: CouncilEx.Provider.Adapters.OpenRouter,
    api_key: {:system, "OPENROUTER_API_KEY"}
  ]
)
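
The `{:system, "OPENROUTER_API_KEY"}` tuple names an environment variable rather than embedding the key. Assuming the lookup happens at run time, a missing variable would only surface on the first request, so a pre-flight check at boot can be useful. The sketch below is plain Elixir; `EnvCheckSketch.fetch_key/1` is a hypothetical helper, not a council_ex function.

```elixir
defmodule EnvCheckSketch do
  # Hypothetical helper: returns {:ok, key} when the variable is set
  # and non-empty, otherwise {:error, {:missing_env, name}}.
  def fetch_key(name \\ "OPENROUTER_API_KEY") do
    case System.get_env(name) do
      nil -> {:error, {:missing_env, name}}
      "" -> {:error, {:missing_env, name}}
      key -> {:ok, key}
    end
  end
end

case EnvCheckSketch.fetch_key() do
  {:ok, _key} -> IO.puts("API key found")
  {:error, {:missing_env, name}} -> IO.puts("#{name} is not set")
end
```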

Step 2: Define members

You need two member roles: an Author (used by every council seat in stage 1, and again in stage 2 wearing a judge hat) and a Chair (stage 3 synthesizer).

defmodule LLMCouncil.Members.Author do
  use CouncilEx.Member
  role "Author"

  system_prompt """
  You are one voice on an LLM Council. You will receive the user's
  query in stage 1. In stage 2 you will receive a `peers` map whose
  keys are anonymous labels ("Response A", "Response B", …) and whose
  values are the OTHER members' stage-1 answers: yours has been
  removed.

  Stage 1 contract: answer the user's `:question` directly and
  thoroughly. No preamble.

  Stage 2 contract: read every entry in `peers`, then emit a strict
  JSON object matching this schema:

      {"ordering": ["Response X", "Response Y", "Response Z"],
       "rationale": "<one sentence per ranked entry>"}

  Most-preferred first. No commentary outside the JSON.
  """

  @impl true
  def output_schema, do: CouncilEx.Schemas.Ranking
end

defmodule LLMCouncil.Members.Chair do
  use CouncilEx.Member
  role "Chair"

  system_prompt """
  You are the Chair of an LLM Council. You receive every prior member's
  stage-1 answer plus every member's stage-2 ranking. Synthesize a
  single final answer for the user. Acknowledge tradeoffs the rankings
  surface. Do NOT cite anonymous labels: speak in the user's voice.
  """
end

The output_schema/0 callback on Author makes stage-2 rankings parse cleanly into %CouncilEx.Schemas.Ranking{}; its :ordering field is what the aggregator reads.

Step 3: Define the council

defmodule LLMCouncil do
  use CouncilEx

  alias LLMCouncil.Members.{Author, Chair}

  # Stage 1 council (each seat = one frontier model)
  member :gpt,    Author, provider: :openrouter, model: "openai/gpt-4o-mini"
  member :claude, Author, provider: :openrouter, model: "anthropic/claude-sonnet-4-6"
  member :gemini, Author, provider: :openrouter, model: "google/gemini-2.5-flash"
  member :llama,  Author, provider: :openrouter, model: "meta-llama/llama-3.3-70b-instruct"

  # Stage 1: opinions in parallel
  round :independent_analysis

  # Stage 2: every member judges the other members' stage-1 outputs,
  #           but sees them under anonymous labels.
  round :anonymized_peer_review,
    aggregator: CouncilEx.Aggregators.PeerRanking

  # Stage 3: chairman synthesis (auto-runs when chair/2 is declared)
  chair Chair, id: :chair,
    provider: :openrouter, model: "google/gemini-2.5-flash"
end

:anonymized_peer_review does the work. It:

  • Builds a global label map (claude → Response A, gemini → Response B, …) by sorting member ids alphabetically.
  • Removes each judge's own slot from the peers map it sees.
  • Threads the label_to_id map into the aggregator so winners, scores, and per-judge ballots are all reported in the original-id space even though judges only ever saw labels.
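
The first two steps above can be sketched in plain Elixir. This is an illustration of the idea, not the library's implementation: `AnonymizeSketch`, `label_map/1`, and `peers_for/3` are hypothetical names, and the single-letter labels assume at most 26 members.

```elixir
defmodule AnonymizeSketch do
  # Hypothetical helpers; council_ex's internals may differ.

  # Sort ids alphabetically and pair them with "Response A",
  # "Response B", ... Returns %{label => member_id}.
  def label_map(member_ids) do
    member_ids
    |> Enum.sort_by(&Atom.to_string/1)
    |> Enum.with_index()
    |> Map.new(fn {id, index} -> {"Response " <> <<?A + index>>, id} end)
  end

  # Build the peers map one judge sees: label => answer, with the
  # judge's own answer dropped.
  def peers_for(judge_id, answers, label_to_id) do
    for {label, id} <- label_to_id, id != judge_id, into: %{} do
      {label, Map.fetch!(answers, id)}
    end
  end
end

answers = %{gpt: "g", claude: "c", gemini: "m", llama: "l"}

label_to_id = AnonymizeSketch.label_map(Map.keys(answers))
# %{"Response A" => :claude, "Response B" => :gemini,
#   "Response C" => :gpt, "Response D" => :llama}

peers = AnonymizeSketch.peers_for(:gpt, answers, label_to_id)
# %{"Response A" => "c", "Response B" => "m", "Response D" => "l"}
```

Because the label map is global, "Response C" means the same member for every judge; the gpt judge simply never sees that label.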

Step 4: Run a query

{:ok, result} =
  CouncilEx.run(LLMCouncil,
    %{question: "Should an early-stage Elixir startup pick LiveView " <>
                "or Inertia.js + React for its admin panel?"})

IO.puts(result.final.content)

That prints the chairman's synthesis. The interesting parts are in result.rounds:

[stage1, stage2, _synthesis] = result.rounds

# Each model's raw stage-1 answer:
Enum.each(stage1.member_results, fn {id, mr} ->
  IO.puts("=== #{id} ===")
  IO.puts(mr.response.content)
end)

# Stage-2 aggregate ranking: ids, not labels:
%{
  winner: winner,
  scores: scores,
  raw: %{
    avg_position: avg_position,
    votes: votes,
    judge_ballots: judge_ballots,
    label_to_id: label_to_id
  }
} = stage2.aggregated

IO.inspect(winner, label: "consensus best stage-1 answer")
IO.inspect(avg_position, label: "average rank position per model (lower = better)")
IO.inspect(judge_ballots, label: "what each judge ranked (de-anonymized)")
IO.inspect(label_to_id, label: "anon label → real model id")

avg_position is what karpathy's UI shows above the stage-2 tabs: each model's average rank across all judges. label_to_id lets a UI display "Response A → anthropic/claude-sonnet-4-6" (Response A being the claude seat under the alphabetical label map) alongside the raw evaluation text the model actually produced.
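
For intuition about what an average-position score means, here is a small plain-Elixir sketch that computes one from de-anonymized ballots. `AvgPositionSketch` and its functions are hypothetical and the ballots are made-up sample data; the real aggregator may handle ties or missing ballots differently.

```elixir
defmodule AvgPositionSketch do
  # ballots: %{judge_id => [member_id, ...]}, most-preferred first.
  # Returns %{member_id => average 1-based position}, averaged over
  # the judges that ranked that member.
  def avg_position(ballots) do
    ballots
    |> Enum.flat_map(fn {_judge, ordering} -> Enum.with_index(ordering, 1) end)
    |> Enum.group_by(fn {id, _pos} -> id end, fn {_id, pos} -> pos end)
    |> Map.new(fn {id, positions} ->
      {id, Enum.sum(positions) / length(positions)}
    end)
  end

  # Lowest average position wins.
  def winner(avg) do
    avg |> Enum.min_by(fn {_id, pos} -> pos end) |> elem(0)
  end
end

# Sample ballots: each judge ranks the other three members.
ballots = %{
  gpt: [:claude, :gemini, :llama],
  claude: [:gemini, :gpt, :llama],
  gemini: [:claude, :gpt, :llama],
  llama: [:claude, :gemini, :gpt]
}

avg = AvgPositionSketch.avg_position(ballots)
IO.inspect(avg, label: "average position")
IO.inspect(AvgPositionSketch.winner(avg), label: "winner")
```

In this sample, claude is ranked first by all three of its judges, so its average position is 1.0 and it wins; llama is ranked last by all three and lands at 3.0.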

What anonymization buys you

Without anonymization, every judge knows whose work is whose. Models have well-documented self-recognition behavior: given an unlabeled mix, they can often pick out their own writing style and weight it favorably. Stage-2 rankings then collapse toward "everyone ranks themselves first," which is useless signal.

AnonymizedPeerReview replaces ids with Response A/B/C before any peer-review prompt is rendered, and each judge's own answer is dropped from the set it ranks. A judge therefore cannot tell which model wrote which answer (unless a writing style is unmistakable, which is a separate problem). The aggregator de-anonymizes after the fact.

Conversation history (multi-turn)

The library is one-shot per run/3. For multi-turn chat history, prepend prior turns into the input map yourself:

history = [
  %{role: "user", content: "..."},
  %{role: "assistant", content: result_prev.final.content}
]

{:ok, result} =
  CouncilEx.run(LLMCouncil,
    %{question: "follow-up", history: history})

Then reference :history in your member system prompts. Persisting turns is the caller's job; the library has no built-in Session helper. Wrap CouncilEx in your own Phoenix / Ecto layer if you need a chat UI on top. docs/RUNNING_IN_PHOENIX.md covers the integration patterns.
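
Since persisting turns is left to the caller, a small helper for growing the history list between runs may be handy. The sketch below is plain Elixir under that assumption; `HistorySketch`, `append_turn/3`, and `render/1` are hypothetical names, and the rendered text format is only one possible choice.

```elixir
defmodule HistorySketch do
  # Append one completed turn (user question + final answer) to the
  # history list, keeping the %{role:, content:} shape used above.
  def append_turn(history, question, answer) do
    history ++
      [
        %{role: "user", content: question},
        %{role: "assistant", content: answer}
      ]
  end

  # Render history as plain text that a prompt could embed.
  def render(history) do
    Enum.map_join(history, "\n", fn %{role: role, content: content} ->
      "#{role}: #{content}"
    end)
  end
end

history = HistorySketch.append_turn([], "LiveView or React?", "LiveView.")
IO.puts(HistorySketch.render(history))
```

After each successful run, the caller would pass the question and `result.final.content` to `append_turn/3` and store the returned list wherever sessions live.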

Comparison to karpathy/llm-council

| Concern | karpathy/llm-council (Python) | This council (council_ex) |
|---|---|---|
| Stage 1 (opinions) | stage1_collect_responses async gather | :independent_analysis round, parallel members |
| Stage 2 (anon peer review) | stage2_collect_rankings + manual Response A relabel + regex parse | :anonymized_peer_review round + Schemas.Ranking JSON output + Aggregators.PeerRanking |
| Stage 3 (chairman) | stage3_synthesize_final single call | chair macro, auto-synthesis round |
| De-anon for UI | label_to_model returned in API response | aggregated.raw.label_to_id on the round result |
| Aggregate rankings | calculate_aggregate_rankings (avg position) | aggregated.raw.avg_position |
| Per-judge ballots | shown as raw text in tabs | aggregated.raw.judge_ballots (parsed + de-anonymized) |
| Failure handling | skip failed members, continue | failure_mode (:continue or :fail_fast), retries, timeouts |
| Telemetry | print() | :telemetry events + PubSub events |
| Persistence | JSON files | caller-owned (library has no DB) |
| UI | React + Vite app | none (library only: see companion-app notes) |

Troubleshooting

Stage-2 returns {:error, :no_rankings}. A judge's response did not parse into %Schemas.Ranking{}. Check that the model is honoring the JSON contract in the system prompt, and tighten the wording if it is not. The Schemas.Ranking Ecto schema requires :ordering to be a list of strings, so the prompt must ask for a list of label strings, not a single numbered string.
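
When debugging this, it can help to check a judge's decoded reply by hand. The sketch below is a plain-Elixir stand-in for that check, not the library's validation: `RankingCheckSketch.validate/2` and its error atoms are hypothetical, and it assumes the reply has already been JSON-decoded into a map with string keys.

```elixir
defmodule RankingCheckSketch do
  # decoded: the map obtained after JSON-decoding a judge's reply.
  # allowed_labels: the labels that judge was actually shown.
  def validate(%{"ordering" => ordering}, allowed_labels) when is_list(ordering) do
    cond do
      not Enum.all?(ordering, &is_binary/1) -> {:error, :ordering_not_strings}
      Enum.any?(ordering, &(&1 not in allowed_labels)) -> {:error, :unknown_label}
      Enum.uniq(ordering) != ordering -> {:error, :duplicate_label}
      true -> {:ok, ordering}
    end
  end

  # Covers a missing "ordering" key and the "numbered string" case.
  def validate(_other, _allowed_labels), do: {:error, :ordering_not_a_list}
end

labels = ["Response A", "Response B"]

good = %{"ordering" => ["Response B", "Response A"], "rationale" => "..."}
bad = %{"ordering" => "1. Response B, 2. Response A"}

IO.inspect(RankingCheckSketch.validate(good, labels))
IO.inspect(RankingCheckSketch.validate(bad, labels))
```

The second call returns `{:error, :ordering_not_a_list}`, which is the failure shape the paragraph above warns about.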

One judge keeps voting for "Response A" regardless of order. The model has positional bias; that's a known weakness of LLM ranking. Mitigations: shuffle the label order via order: :shuffle (not yet exposed as a round opt; open an issue if you need it), or use :tournament instead of :ranking so each comparison is pairwise.
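
Until a shuffle option exists, the idea behind it can be sketched outside the library. The helper below is hypothetical (`ShuffleLabelsSketch.shuffled_label_map/1` is not a council_ex function) and only shows how a per-judge random label assignment would spread a "Response A" bias across members instead of always rewarding the same one.

```elixir
defmodule ShuffleLabelsSketch do
  # Assign "Response A", "Response B", ... to peer ids in random
  # order. Returns %{label => member_id}. Assumes at most 26 peers.
  def shuffled_label_map(peer_ids) do
    peer_ids
    |> Enum.shuffle()
    |> Enum.with_index()
    |> Map.new(fn {id, index} -> {"Response " <> <<?A + index>>, id} end)
  end
end

# Each judge would get its own independently shuffled map.
for judge <- [:gpt, :claude] do
  peers = [:gpt, :claude, :gemini, :llama] -- [judge]
  IO.inspect(ShuffleLabelsSketch.shuffled_label_map(peers), label: judge)
end
```

With per-judge maps, each judge's ballot has to be de-anonymized with that judge's own map before the ballots are aggregated.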

A judge ranks itself. Should not happen: prepare_input/3 drops the judge's own slot from peers. If you see it, file a bug with the run id.

Reference