Two rounds in council_ex look superficially similar but solve
different problems. Picking the wrong one silently degrades signal.
This page is the decision guide.
- CouncilEx.Rounds.PeerReview: cross-visibility. Members read each other's prior outputs (keyed by original member id) and keep working with that context.
- CouncilEx.Rounds.AnonymizedPeerReview: blind judging. Members rank each other's prior outputs under anonymous labels (Response A, Response B, …), and the aggregator reports rankings de-anonymized.
Decision matrix
| | PeerReview | AnonymizedPeerReview |
|---|---|---|
| Member ids visible to peers | yes (semantic) | no (Response A/B/C) |
| Own slot in peers map | omitted | omitted |
| Aggregates? | no | yes (default Aggregators.PeerRanking) |
| Suitable for ranking / voting | no: biased | yes |
| Suitable for collaboration / refinement | yes | no: strips role context |
| Output shape required from members | free-form | :ordering (e.g. Schemas.Ranking) |
| label_to_id map exposed | n/a | yes, in aggregated.raw.label_to_id |
Rule of thumb:
- "Read the others, then keep working" → PeerReview.
- "Read the others, then rank them" → AnonymizedPeerReview.
Why anonymization matters
When LLMs judge each other's work with author identities visible, rankings collapse to garbage signal. Three failure modes anonymization prevents:
- Self-recognition bias. LLMs recognize their own writing style (idioms, formatting, hedging patterns). Given a mixed pile labeled by id, a model spots its own output and ranks itself first. Every model does it. Result: every judge picks self → no winner, no signal.
- Brand bias. If labels expose model names (gpt-4o-mini, claude-sonnet-4-6), models defer to known-strong brands or attack rivals based on training-data sentiment rather than actual answer quality. Judgments are based on reputation, not text.
- Stable-position leakage. Repeated runs with the same id order let a judge learn "slot N = competitor, downrank." Stable id ordering across runs leaks the signal anonymization is meant to remove.
Anonymous labels (Response A/B/C) plus own-slot removal close all three.
The judge sees only text and is forced to evaluate substance.
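The label-plus-removal mechanism described above can be sketched in a few lines. This is a minimal illustration, not the council_ex implementation: the module name AnonSketch and both functions are hypothetical. The caller passes member ids in the order labels should be assigned (for example, shuffled once per run); the sketch itself does not shuffle, so its output stays deterministic.

```elixir
defmodule AnonSketch do
  # Hypothetical helper for illustration only; not part of council_ex.

  # Builds ONE global id -> label map shared by every judge.
  # Labels follow the order of `member_ids` (supports up to 26 members).
  def label_map(member_ids) do
    member_ids
    |> Enum.with_index()
    |> Map.new(fn {id, idx} -> {id, "Response " <> <<?A + idx>>} end)
  end

  # Returns what a single judge sees: peers keyed by anonymous label,
  # with the judge's own output removed entirely.
  def view_for(judge_id, outputs, labels) do
    outputs
    |> Map.delete(judge_id)
    |> Map.new(fn {id, text} -> {Map.fetch!(labels, id), text} end)
  end
end
```

Because the map is built once and reused for every judge, "Response A" refers to the same member on every ballot, which is what makes cross-judge aggregation meaningful.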
This is the only stage of a multi-model council that adds signal a single-model call cannot produce. Stage 1 = N parallel queries (boring). Stage 3 = synthesis (any model can do). Stage 2 anonymized peer review is the actual research contribution. karpathy/llm-council calls this out explicitly in its README: it is the reason that project exists.
Why AnonymizedPeerReview lives in the library
User-side anonymization is doable but error-prone:
- Easy to leak ids in prompts (forget to strip from one field).
- Easy to assign per-judge labels inconsistently: label A meaning a different model to different judges breaks aggregation.
- Easy to drop the de-anon map and lose UI traceability.
AnonymizedPeerReview solves all three:
- Global stable map. Every judge sees gpt → Response A. Aggregation across judges is meaningful.
- Own-slot removal. Judge never sees own answer at all. Self-recognition impossible.
- Map preserved through to aggregator. winner, scores, avg_position, judge_ballots are all reported in original-id space.
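To make "reported in original-id space" concrete, here is a hedged sketch of de-anonymizing ballots and aggregating them. It is not the CouncilEx.Aggregators.PeerRanking implementation: the module RankingSketch is hypothetical, and scoring by mean position (lowest wins) is an assumption made for illustration. Ballots are assumed to be best-first lists of anonymous labels.

```elixir
defmodule RankingSketch do
  # Hypothetical aggregation sketch; the real aggregator may score differently.

  # ballots:     %{judge_id => [label, ...]}, ordered best-first
  # label_to_id: %{label => member_id}
  def aggregate(ballots, label_to_id) do
    # Translate every ballot from label space back to member ids.
    judge_ballots =
      Map.new(ballots, fn {judge, ordering} ->
        {judge, Enum.map(ordering, &Map.fetch!(label_to_id, &1))}
      end)

    # Collect the 1-based positions each member received across judges.
    positions =
      for {_judge, ordering} <- judge_ballots,
          {id, pos} <- Enum.with_index(ordering, 1),
          reduce: %{} do
        acc -> Map.update(acc, id, [pos], &[pos | &1])
      end

    avg_position =
      Map.new(positions, fn {id, ps} -> {id, Enum.sum(ps) / length(ps)} end)

    # Lowest mean position wins (assumed tie-breaking: whichever Enum.min_by finds first).
    {winner, _avg} = Enum.min_by(avg_position, fn {_id, avg} -> avg end)

    %{winner: winner, avg_position: avg_position, judge_ballots: judge_ballots}
  end
end
```

Note that with own-slot removal each member is ranked by every judge except itself, so each average is taken over N-1 positions.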
When PeerReview (visible ids) is correct
Keep ids visible when identity carries meaning the next round needs.
- Heterogeneous roles. :researcher → :critic → :synthesizer. The critic needs to know it's reading Researcher's draft, not "Response B." Anonymization destroys role context the workflow depends on.
- Iterative refinement. :draft and :editor collaborating across rounds. The editor's prompt likely references "the draft above": id is the semantic anchor.
- Non-judgment cross-pollination. "Each member sees what the others wrote, then writes a new version factoring in those perspectives." No ranking, no voting. Visibility for inspiration, not adjudication.
- Critique chains. Rounds.Critique is built on top of PeerReview.prepare_input/3.
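For contrast with the anonymized case, the visible-id shape can be sketched as follows. This does not reproduce PeerReview.prepare_input/3, whose arguments and return value are not shown on this page; the module VisiblePeersSketch and its functions are hypothetical. It only illustrates peers staying keyed by their original member id, with the reader's own slot omitted.

```elixir
defmodule VisiblePeersSketch do
  # Hypothetical illustration; not the council_ex API.

  # Peers keep their real ids; only the reader's own output is dropped.
  def peers_for(member_id, prior_outputs) do
    Map.delete(prior_outputs, member_id)
  end

  # Renders peers into prompt text, using the member id as the heading
  # so a downstream prompt can refer to a peer by role.
  def render(peers) do
    peers
    |> Enum.sort_by(fn {id, _text} -> id end)
    |> Enum.map_join("\n\n", fn {id, text} -> "[#{id}]\n#{text}" end)
  end
end
```

Here the id is the semantic anchor: a critic reading `[researcher]` knows whose draft it is, which is the context that anonymous labels would remove.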
When NEITHER fits
- Single-judge setups. No peer pool, nothing to anonymize over. Use a plain synthesis round.
- Identity-as-signal tasks. For "Which model is most aligned with house style?" you want the judge to know who wrote what. Use PeerReview so labels stay visible, or write a custom round.
- Vendor-leaking content. Anonymization is label-level only. If models sign their answers ("As Claude, I…"), no round-level relabeling will hide that. Sanitize content first or rewrite the member system prompts to suppress self-identification.
- Pairwise judgments at scale. For tournament-style elimination use Rounds.PairwiseElimination; ranking entire fields per judge doesn't scale past ~6 members.
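For the vendor-leaking case above, a pre-check on member outputs is one possible first step before anonymized judging. The sketch below is an assumption-heavy illustration: the module SanitizeSketch is hypothetical, the name list is deliberately incomplete, and a plain substring match will produce false positives (for example, an answer that legitimately discusses Gemini). It detects rather than rewrites, leaving the remedy to the caller.

```elixir
defmodule SanitizeSketch do
  # Hypothetical pre-check; not part of council_ex.
  # Illustrative, incomplete list of names a model might sign with.
  @names ~w(Claude ChatGPT Gemini)

  # True when the text mentions any listed name, case-insensitively.
  def leaks_identity?(text) do
    lowered = String.downcase(text)
    Enum.any?(@names, &String.contains?(lowered, String.downcase(&1)))
  end

  # Returns the ids of members whose outputs should be reviewed or regenerated.
  def flagged(outputs) do
    for {id, text} <- outputs, leaks_identity?(text), do: id
  end
end
```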
Both rounds, both supported
PeerReview is not deprecated by AnonymizedPeerReview. They
do different jobs. Removing PeerReview would break
Rounds.Critique and every collaborative-refinement council that
depends on visible ids.
See also
- CouncilEx.Rounds.PeerReview (moduledoc has the same guidance, shorter)
- CouncilEx.Rounds.AnonymizedPeerReview (moduledoc has the full "why anonymize" rationale)
- CouncilEx.Aggregators.PeerRanking (report fields: avg_position, votes, judge_ballots, label_to_id)
- CouncilEx.Anonymize (pure helper, usable outside any round)
- Tutorial: Karpathy-style council: end-to-end 3-stage example