Council Complexity Metric: design sketch

Copy Markdown View Source

Status: proposal, not implemented.

Problem

Need to compare relative cost/effort of running councils before spending money. Real $ pricing is volatile (changes per provider, per model, per tier, per discount). Output token size unknown pre-run. Maintaining a price table inside the library means drift, stale data, support burden.

Goal: stable, unitless score. Predicts relative weight, not dollars. Users wanting $ multiply score by their own blended rate.

Non-goals

  • Exact $/run prediction.
  • Tracking provider price changes.
  • Replacing post-run actual usage in Result.metadata.

Core idea

Two functions:

  1. Complexity.score(council, opts): static, pre-run estimate from council shape.
  2. Complexity.actual(result): post-run score from real total_input_tokens / total_output_tokens. Same units. Calibrates the static estimate.

Ratio actual / score shows how off the static model was. Tune over time.

Formula (static)

score(council) =
  Σ_rounds (
    Σ_members_in_round (
      tier(member.model)
      × context_factor(round, council_kind)
      × tools_factor(member)
      × output_factor(member)
    )
  )

Tier table

Maps model name → relative weight. One axis: how heavy is one call.

TierWeightExamples
nano0.5gpt-4.1-nano, claude-haiku, gemini-flash-lite
mini1.0gpt-4.1-mini, claude-haiku-3.5, gemini-flash
mid2.0gpt-4.1, gpt-4o, claude-sonnet, gemini-pro
large4.0gpt-5, claude-opus, gemini-ultra
reasoning8.0o1, o3, o4-mini-high, deepseek-r1

Source: MemberSpec.model/1 (TBD helper) → tier lookup table in CouncilEx.Complexity.Tiers. Unknown model → default mid (2.0) with warning. User override: member opts ++ [tier: :reasoning].

Tier weights are opinionated defaults, not authoritative. Easy to override per project via app config:

config :council_ex, :complexity_tiers,
  "gpt-4.1-mini": 1.2,
  "my-finetune": 3.0

Context factor (per council kind)

How much the input grows across rounds because prior outputs are re-fed.

KindRound 1Round N
parallel_panel1.0n/a (single round)
specialist1.01.2 (router + specialist)
chairman1.01.5 (chairman re-reads members)
consensus1.01.0 + 0.3·(N-1)
peer_review1.01.0 + 0.5·(N-1)
tournament1.01.0 + 0.4·(N-1), members halve
voting1.01.0 + 0.2 (tally pass)

Numbers picked by inspection of each council's prompt-stitching behavior. Calibrate post-launch against Complexity.actual/1.

Tools factor

tools_factor(member) = 1.0 + 0.3 × tool_turns_estimate

Default tool_turns_estimate = 1 if member has tools, else 0. User can override via member opt :tool_turns_estimate.

Reason: tool calls add round-trips. Each tool turn ≈ extra small call.

Output factor

output_factor(member) =
  1.0   when no schema
  1.2   when ecto schema (structured output cost)
  1.3   when streaming + tools

API sketch

defmodule CouncilEx.Complexity do
  @type score :: float()
  @type breakdown :: %{
    score: score(),
    per_round: [%{round: integer(), score: score(), members: map()}],
    per_member: %{member_id => score()},
    factors: %{
      members: integer(),
      rounds: integer(),
      tier_avg: float(),
      tools_count: integer()
    },
    council_kind: atom()
  }

  @spec score(module() | DynamicCouncil.t(), keyword()) :: breakdown()
  def score(council, opts \\ [])

  @spec actual(Result.t()) :: breakdown()
  def actual(result)

  @spec compare(breakdown(), breakdown()) :: %{ratio: float(), delta: score()}
  def compare(a, b)
end

Optional thin layer for users wanting $:

defmodule CouncilEx.Complexity.Cost do
  @doc """
  User-supplied price/score-unit. Library does not ship prices.

      Cost.estimate(score, 0.0008)   # $0.0008 per complexity unit
      #=> %{usd: 0.064, score: 80.0}
  """
  @spec estimate(Complexity.breakdown(), float()) :: %{usd: float(), score: float()}
  def estimate(breakdown, price_per_unit)
end

Worked example

Council: peer_review, 3 members, 2 rounds, all gpt-4.1 (tier mid = 2.0), no tools, no schema.

Round 1:
  3 members × tier 2.0 × ctx 1.0 × tools 1.0 × out 1.0 = 6.0

Round 2 (peer review):
  3 members × tier 2.0 × ctx (1.0 + 0.5·2) = 2.0 × tools 1.0 × out 1.0
  = 3 × 2.0 × 2.0 = 12.0

score = 18.0

Compare vs parallel_panel, same members:

1 round × 3 × 2.0 × 1.0 × 1.0 × 1.0 = 6.0

Peer review 3.0× heavier than parallel panel. Matches intuition. Extra round, members re-read each other.

Calibration

After implementation, run a fixed test suite of councils, log:

{council, score_static, score_actual, ratio}

If ratios cluster in [0.8, 1.2], model is fine. If wide, adjust context/tier factors. Keep calibration table in docs/COMPLEXITY_CALIBRATION.md.

Open questions

  1. Streaming: chunks add wire overhead but not token cost. Ignore?
  2. Cache hits (Anthropic prompt caching): real $ savings, no complexity savings. Score model ignores. OK because score is compute-shaped, not $-shaped.
  3. Reasoning tokens (o1/o3): hidden output. Tier 8.0 already accounts for "this model is heavy". Don't double-count.
  4. DynamicMember without explicit model: needs sensible default. Probably mid + warning.
  5. Aggregator overhead (e.g. consensus aggregator pass): treated as N-th round? Or separate aggregator_factor?
  6. Should score/2 walk static councils via the DSL macro state, or require running the council once with dry_run: true to materialize member specs? Materialized form is more accurate but couples to runtime.

Non-decision

Whether to ship this at all. It's a small file (~300 LOC), low maintenance once tiers stabilize, no network calls. Main risk: users treat score as $ proxy and get burned. Mitigate with docs and Cost.estimate/2 requiring explicit price_per_unit.

Next steps

  1. Decide: ship or shelve.
  2. If ship: implement Tiers table first, validate against existing examples/ councils, then score/2, then actual/1.
  3. Add complexity field to Result.metadata when computed.
  4. CLI helper: mix council.complexity MyCouncil prints breakdown.