Status: proposal, not implemented.
Problem
Need to compare relative cost/effort of running councils before spending money. Real $ pricing is volatile (changes per provider, per model, per tier, per discount). Output token size unknown pre-run. Maintaining a price table inside the library means drift, stale data, support burden.
Goal: stable, unitless score. Predicts relative weight, not dollars. Users wanting $ multiply score by their own blended rate.
Non-goals
- Exact $/run prediction.
- Tracking provider price changes.
- Replacing post-run actual usage in Result.metadata.
Core idea
Two functions:
- Complexity.score(council, opts): static, pre-run estimate from council shape.
- Complexity.actual(result): post-run score from real total_input_tokens / total_output_tokens. Same units. Calibrates the static estimate.
Ratio actual / score shows how off the static model was. Tune over time.
Formula (static)

    score(council) =
      Σ_rounds (
        Σ_members_in_round (
          tier(member.model)
          × context_factor(round, council_kind)
          × tools_factor(member)
          × output_factor(member)
        )
      )

Tier table
Maps model name → relative weight. One axis: how heavy is one call.
| Tier | Weight | Examples |
|---|---|---|
| nano | 0.5 | gpt-4.1-nano, claude-haiku, gemini-flash-lite |
| mini | 1.0 | gpt-4.1-mini, claude-haiku-3.5, gemini-flash |
| mid | 2.0 | gpt-4.1, gpt-4o, claude-sonnet, gemini-pro |
| large | 4.0 | gpt-5, claude-opus, gemini-ultra |
| reasoning | 8.0 | o1, o3, o4-mini-high, deepseek-r1 |
Source: MemberSpec.model/1 (TBD helper) → tier lookup table in
CouncilEx.Complexity.Tiers. Unknown model → default mid (2.0) with
warning. User override: member opts ++ [tier: :reasoning].
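
The lookup described above could look roughly like the sketch below. It is not the planned CouncilEx.Complexity.Tiers module: the module name TierLookupSketch, the weight/2 function, and the abbreviated model map are all assumptions made for illustration. It shows the three behaviours named in the text: table lookup, explicit tier override via opts, and fallback to mid with a warning.

```elixir
defmodule TierLookupSketch do
  # Hypothetical sketch; names and the abbreviated model map are assumptions.
  require Logger

  # Tier weights copied from the tier table.
  @tiers %{nano: 0.5, mini: 1.0, mid: 2.0, large: 4.0, reasoning: 8.0}

  # A few example models from the tier table, one per tier.
  @models %{
    "gpt-4.1-nano" => :nano,
    "gpt-4.1-mini" => :mini,
    "gpt-4.1" => :mid,
    "gpt-5" => :large,
    "o3" => :reasoning
  }

  # Returns the relative weight for a model name.
  # An explicit tier in opts wins over the table lookup.
  def weight(model, opts \\ []) do
    case Keyword.fetch(opts, :tier) do
      {:ok, tier} ->
        Map.fetch!(@tiers, tier)

      :error ->
        case Map.fetch(@models, model) do
          {:ok, tier} ->
            Map.fetch!(@tiers, tier)

          :error ->
            # Unknown model: fall back to mid and warn.
            Logger.warning("unknown model #{inspect(model)}, defaulting to tier mid")
            Map.fetch!(@tiers, :mid)
        end
    end
  end
end
```

With this sketch, TierLookupSketch.weight("my-finetune", tier: :reasoning) uses the override, while TierLookupSketch.weight("my-finetune") logs a warning and falls back to the mid weight.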
Tier weights are opinionated defaults, not authoritative. Easy to override per project via app config:

    config :council_ex, :complexity_tiers,
      "gpt-4.1-mini": 1.2,
      "my-finetune": 3.0

Context factor (per council kind)
How much the input grows across rounds because prior outputs are re-fed.
| Kind | Round 1 | Round N |
|---|---|---|
| parallel_panel | 1.0 | n/a (single round) |
| specialist | 1.0 | 1.2 (router + specialist) |
| chairman | 1.0 | 1.5 (chairman re-reads members) |
| consensus | 1.0 | 1.0 + 0.3·(N-1) |
| peer_review | 1.0 | 1.0 + 0.5·(N-1) |
| tournament | 1.0 | 1.0 + 0.4·(N-1), members halve |
| voting | 1.0 | 1.0 + 0.2 (tally pass) |
Numbers picked by inspection of each council's prompt-stitching
behavior. Calibrate post-launch against Complexity.actual/1.
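
The context factor table can be written as one function clause per council kind. This is a sketch only: the module ContextFactorSketch and the factor/2 signature are assumptions, and N is read as the 1-based round number from the table header. The tournament's halving of members affects how many members are summed per round, not this factor, so it is left out here.

```elixir
defmodule ContextFactorSketch do
  # Hypothetical sketch; module and function names are assumptions.

  # Round 1 is always 1.0, for every council kind.
  def factor(_kind, 1), do: 1.0

  # parallel_panel has no later rounds.
  def factor(:parallel_panel, n) when is_integer(n) and n > 1,
    do: raise(ArgumentError, "parallel_panel has a single round")

  # Flat factors for kinds with one fixed follow-up pass.
  def factor(:specialist, n) when is_integer(n) and n > 1, do: 1.2
  def factor(:chairman, n) when is_integer(n) and n > 1, do: 1.5
  def factor(:voting, n) when is_integer(n) and n > 1, do: 1.0 + 0.2

  # Factors that grow linearly with the round number.
  def factor(:consensus, n) when is_integer(n) and n > 1, do: 1.0 + 0.3 * (n - 1)
  def factor(:peer_review, n) when is_integer(n) and n > 1, do: 1.0 + 0.5 * (n - 1)
  def factor(:tournament, n) when is_integer(n) and n > 1, do: 1.0 + 0.4 * (n - 1)
end
```

Under this reading, ContextFactorSketch.factor(:peer_review, 2) gives 1.5 and an unknown kind in a later round raises a FunctionClauseError, which may or may not be the desired behaviour.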
Tools factor

    tools_factor(member) = 1.0 + 0.3 × tool_turns_estimate

Default tool_turns_estimate = 1 if member has tools, else 0. User can override via member opt :tool_turns_estimate.
Reason: tool calls add round-trips. Each tool turn ≈ extra small call.
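
A minimal sketch of the tools factor, assuming a member is a plain map with a :tools list and an optional :tool_turns_estimate key. The real MemberSpec shape is not defined in this proposal, so both field names and the module name are hypothetical.

```elixir
defmodule ToolsFactorSketch do
  # Hypothetical sketch; the member map shape is an assumption.

  # Each estimated tool turn adds 0.3 on top of the base 1.0.
  def tools_factor(member) do
    has_tools? = Map.get(member, :tools, []) != []

    # Default estimate: 1 turn if the member has tools, else 0.
    default_turns = if has_tools?, do: 1, else: 0

    # An explicit :tool_turns_estimate overrides the default.
    turns = Map.get(member, :tool_turns_estimate, default_turns)

    1.0 + 0.3 * turns
  end
end
```

A member with tools and no override scores about 1.3; one with tool_turns_estimate: 3 scores about 1.9.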
Output factor

    output_factor(member) =
      1.0 when no schema
      1.2 when Ecto schema (structured output cost)
      1.3 when streaming + tools

API sketch
    defmodule CouncilEx.Complexity do
      # Signatures only; function bodies are TBD.
      @type score :: float()
      @type member_id :: term()

      @type breakdown :: %{
              score: score(),
              per_round: [%{round: integer(), score: score(), members: map()}],
              per_member: %{member_id() => score()},
              factors: %{
                members: integer(),
                rounds: integer(),
                tier_avg: float(),
                tools_count: integer()
              },
              council_kind: atom()
            }

      # Static, pre-run estimate from council shape.
      @spec score(module() | DynamicCouncil.t(), keyword()) :: breakdown()
      def score(council, opts \\ [])

      # Post-run score from real token usage.
      @spec actual(Result.t()) :: breakdown()
      def actual(result)

      # Ratio and delta between two breakdowns (e.g. actual vs static).
      @spec compare(breakdown(), breakdown()) :: %{ratio: float(), delta: score()}
      def compare(a, b)
    end

Optional thin layer for users wanting $:
    defmodule CouncilEx.Complexity.Cost do
      @doc """
      User-supplied price/score-unit. Library does not ship prices.

          Cost.estimate(score, 0.0008) # $0.0008 per complexity unit
          #=> %{usd: 0.064, score: 80.0}
      """
      @spec estimate(Complexity.breakdown(), float()) :: %{usd: float(), score: float()}
      def estimate(breakdown, price_per_unit)
    end

Worked example
Council: peer_review, 3 members, 2 rounds, all gpt-4.1 (tier mid =
2.0), no tools, no schema.
Round 1:

    3 members × tier 2.0 × ctx 1.0 × tools 1.0 × out 1.0 = 6.0

Round 2 (peer review):

    3 members × tier 2.0 × ctx (1.0 + 0.5·(2-1) = 1.5) × tools 1.0 × out 1.0
    = 3 × 2.0 × 1.5 = 9.0

    score = 6.0 + 9.0 = 15.0

Compare vs parallel_panel, same members:

    1 round × 3 × 2.0 × 1.0 × 1.0 × 1.0 = 6.0

Peer review 2.5× heavier than parallel panel. Matches intuition. Extra round, members re-read each other.
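
The static formula is a nested sum, which can be sketched as below. This is not the proposed Complexity.score/2: the ScoreSketch module is hypothetical and takes already-resolved factors (keys :tier, :ctx, :tools, :out, all assumed names) instead of a council. It reproduces the parallel_panel figure from the comparison above.

```elixir
defmodule ScoreSketch do
  # Hypothetical sketch of the static formula; not the CouncilEx API.

  # One member's contribution: tier × context × tools × output.
  def member_score(%{tier: tier, ctx: ctx, tools: tools, out: out}) do
    tier * ctx * tools * out
  end

  # rounds is a list of rounds; each round is a list of member maps
  # whose factors have already been resolved for that round.
  def score(rounds) do
    rounds
    |> Enum.map(fn members ->
      members |> Enum.map(&member_score/1) |> Enum.sum()
    end)
    |> Enum.sum()
  end
end

# parallel_panel: one round, three mid-tier members, no tools, no schema.
mid = %{tier: 2.0, ctx: 1.0, tools: 1.0, out: 1.0}
IO.inspect(ScoreSketch.score([[mid, mid, mid]]))
# prints 6.0
```

Keeping factor resolution separate from the summation means the same function could serve both the static estimate and a per-round breakdown, though that split is a design assumption and not part of the proposal.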
Calibration
After implementation, run a fixed test suite of councils, log:

    {council, score_static, score_actual, ratio}

If ratios cluster in [0.8, 1.2], model is fine. If wide, adjust context/tier factors. Keep calibration table in docs/COMPLEXITY_CALIBRATION.md.
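
The calibration check could be sketched as follows. The module CalibrationSketch and its functions are hypothetical, and the rows in the usage lines are made-up illustrative values, not measurements.

```elixir
defmodule CalibrationSketch do
  # Hypothetical sketch; names are assumptions.

  # rows: [{council, score_static, score_actual}]
  # Returns the logged tuple shape {council, score_static, score_actual, ratio}.
  def with_ratio(rows) do
    Enum.map(rows, fn {council, static, actual} ->
      {council, static, actual, actual / static}
    end)
  end

  # True when every ratio falls inside the accepted band.
  def within_band?(rows, low \\ 0.8, high \\ 1.2) do
    rows
    |> with_ratio()
    |> Enum.all?(fn {_council, _static, _actual, ratio} ->
      ratio >= low and ratio <= high
    end)
  end
end

# Illustrative values only.
rows = [{"panel", 6.0, 6.6}, {"review", 12.0, 10.8}]
IO.inspect(CalibrationSketch.within_band?(rows))
```

A single out-of-band council makes within_band?/3 return false; a real calibration run might prefer reporting which councils are off instead of a single boolean.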
Open questions
- Streaming: chunks add wire overhead but not token cost. Ignore?
- Cache hits (Anthropic prompt caching): real $ savings, no complexity savings. Score model ignores. OK because score is compute-shaped, not $-shaped.
- Reasoning tokens (o1/o3): hidden output. Tier 8.0 already accounts for "this model is heavy". Don't double-count.
- DynamicMember without explicit model: needs sensible default. Probably mid + warning.
- Aggregator overhead (e.g. consensus aggregator pass): treated as N-th round? Or separate aggregator_factor?
- Should score/2 walk static councils via the DSL macro state, or require running the council once with dry_run: true to materialize member specs? Materialized form is more accurate but couples to runtime.
Non-decision
Whether to ship this at all. It's a small file (~300 LOC), low
maintenance once tiers stabilize, no network calls. Main risk: users
treat score as $ proxy and get burned. Mitigate with docs and
Cost.estimate/2 requiring explicit price_per_unit.
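
One way to make the explicit price requirement concrete is a function with no default for the price argument. The sketch below assumes the breakdown is a map with a :score key, as in the API sketch; the module name CostSketch and the guard are illustrative assumptions.

```elixir
defmodule CostSketch do
  # Hypothetical sketch; not the proposed CouncilEx.Complexity.Cost module.

  # No default for price_per_unit: callers must pass their own rate.
  # The guard rejects non-numeric and non-positive prices.
  def estimate(%{score: score}, price_per_unit)
      when is_number(price_per_unit) and price_per_unit > 0 do
    %{usd: score * price_per_unit, score: score}
  end
end
```

Calling CostSketch.estimate(%{score: 80.0}, 0.0008) returns a map whose :usd value is approximately 0.064; omitting the price is an arity error, so there is no implicit rate to mistake for a real one.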
Next steps
- Decide: ship or shelve.
- If ship: implement Tiers table first, validate against existing examples/ councils, then score/2, then actual/1.
- Add complexity field to Result.metadata when computed.
- CLI helper: mix council.complexity MyCouncil prints breakdown.