CommBus.Tokenizer.Simple (CommBus v0.1.0)


Fallback tokenizer that estimates token counts with a character- and word-based heuristic.

Roughly approximates GPT tokenization by counting each alphanumeric run and each punctuation character as one token.

Functions

count_message(message, opts)

Counts tokens for a conversation message by summing the content token count and a fixed role-based overhead (2 tokens for most roles, 4 for tool messages).

Parameters

  • message — A %CommBus.Message{} struct.
  • opts — Forwarded to count_tokens/2.

Returns

A non-negative integer representing the estimated token count.
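
The role-based overhead described above can be sketched as follows. This is a minimal illustration, not the CommBus source: the MessageCountSketch module, its nested Message struct, the role and content field names, the :tool atom, and the whitespace-only content counter are all assumptions made for the example. Only the overhead values (4 for tool messages, 2 otherwise) come from the description.

```elixir
defmodule MessageCountSketch do
  defmodule Message do
    # Hypothetical stand-in for %CommBus.Message{}; the field names are assumed.
    defstruct role: :user, content: ""
  end

  # Overheads from the description: 4 tokens for tool messages, 2 for any other role.
  defp role_overhead(:tool), do: 4
  defp role_overhead(_role), do: 2

  # Placeholder content counter (assumption): counts whitespace-separated words only.
  defp count_tokens(text, _opts), do: text |> String.split() |> length()

  # Total = content token estimate + fixed role overhead.
  def count_message(%Message{role: role, content: content}, opts \\ []) do
    count_tokens(content, opts) + role_overhead(role)
  end
end
```

Under these assumptions, a user message with the content "hi there" would be estimated at 2 content tokens plus 2 overhead tokens, and the same content in a tool message would add 4 overhead tokens instead.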

count_tokens(text, opts)

Estimates the token count of a text string using a heuristic word-and-punctuation scan. Splits on word boundaries and counts each alphanumeric run and punctuation character as one token, roughly approximating GPT tokenization.

Parameters

  • text — The text string to count tokens for.
  • _opts — Ignored; present for callback conformance.

Returns

A non-negative integer token count estimate.
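
One way to picture the word-and-punctuation scan is the sketch below. It is an approximation written for illustration, not the module's actual implementation: the SimpleTokenSketch module name and the regular expression are assumptions, and the real heuristic may treat edge cases such as apostrophes or symbols differently.

```elixir
defmodule SimpleTokenSketch do
  # Hypothetical stand-in, not the CommBus source.
  # The pattern matches either a run of letters/digits (one token)
  # or a single character that is neither a letter, a digit, nor whitespace (one token).
  def count_tokens(text, _opts \\ []) when is_binary(text) do
    ~r/[\p{L}\p{N}]+|[^\p{L}\p{N}\s]/u
    |> Regex.scan(text)
    |> length()
  end
end

# "Hello, world!" splits into "Hello", ",", "world", "!"
IO.inspect(SimpleTokenSketch.count_tokens("Hello, world!"))
```

With this pattern, whitespace contributes nothing to the count, so an empty or all-whitespace string yields 0.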