textmetrics/count

Counts of words, sentences, syllables, characters, paragraphs and polysyllables — the primitives consumed by readability scores in textmetrics/readability.

Functions in this module are pure, deterministic, and O(n) in the length of their input. They iterate over extended grapheme clusters (via gleam/string.to_graphemes), not raw bytes, so "naïve" is 5 graphemes and 1 word whether the ï is encoded as a single precomposed code point (U+00EF) or as i followed by a combining diaeresis (U+0308).
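The grapheme claim can be checked directly against the standard library. The snippet below is a small illustration rather than part of this module: it spells "naïve" both ways and compares grapheme counts with code point counts, using only gleam/string and gleam/list.

```gleam
import gleam/list
import gleam/string

pub fn main() {
  // The same word with two encodings of the ï.
  let precomposed = "na\u{00EF}ve"
  let decomposed = "nai\u{0308}ve"

  // Grapheme counts agree for both encodings...
  let assert 5 = list.length(string.to_graphemes(precomposed))
  let assert 5 = list.length(string.to_graphemes(decomposed))

  // ...while code point counts differ by the combining mark.
  let assert 5 = list.length(string.to_utf_codepoints(precomposed))
  let assert 6 = list.length(string.to_utf_codepoints(decomposed))

  Nil
}
```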

All language-specific heuristics in this module are tuned for English. Behaviour on other scripts is documented per function and biased towards “do nothing surprising”: CJK text without whitespace stays one word, and syllables_in_word returns 1 for a word that contains no ASCII letters.

Values

pub fn characters(text: String) -> Int

Count characters that contribute to readability formulas: ASCII letters plus ASCII digits, excluding whitespace and ASCII punctuation. Non-ASCII graphemes that look letter-like (Latin-1 accents, CJK ideographs) also count, mirroring the behaviour of textstat’s char_count.

Examples:

count.characters("Hello, World!") // 10
count.characters("123 abc")       // 6

pub fn paragraphs(text: String) -> Int

Count paragraphs. A paragraph is a maximal run of non-blank lines separated by one or more blank lines. Trailing blank lines do not produce empty paragraphs.

Examples:

count.paragraphs("a\n\nb\n\nc")  // 3
count.paragraphs("one line")     // 1
count.paragraphs("")             // 0
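One way to picture the paragraph rule is as a single pass over lines that counts each transition from "blank" to "non-blank". The sketch below is an illustration under stated assumptions, not the module's source: paragraphs_sketch is a hypothetical name, lines are assumed to be separated by "\n", and a line is assumed to be blank when it is empty after string.trim.

```gleam
import gleam/list
import gleam/string

// Hypothetical sketch: count runs of non-blank lines.
pub fn paragraphs_sketch(text: String) -> Int {
  let #(total, _) =
    text
    |> string.split("\n")
    |> list.fold(#(0, False), fn(state, line) {
      let #(count, in_paragraph) = state
      // Assumption: a line is blank when nothing is left after trimming.
      case string.trim(line) == "", in_paragraph {
        // A blank line closes any open paragraph.
        True, _ -> #(count, False)
        // Another line of the paragraph we are already in.
        False, True -> state
        // First non-blank line after a blank run starts a new paragraph.
        False, False -> #(count + 1, True)
      }
    })
  total
}

pub fn main() {
  let assert 3 = paragraphs_sketch("a\n\nb\n\nc")
  let assert 1 = paragraphs_sketch("one line")
  let assert 0 = paragraphs_sketch("")
  Nil
}
```

Because only the blank-to-non-blank transition increments the count, trailing blank lines add nothing, which matches the behaviour described above.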

pub fn polysyllables(text: String) -> Int

Count words with three or more syllables in text. This is the “polysyllable” count consumed by SMOG with no exclusions; the stricter Gunning-Fog “complex word” count is computed inline inside readability.gunning_fog.

pub fn sentences(text: String) -> Int

Count sentences. Sentence terminators are ., !, ?. A run of consecutive terminators counts as one boundary (so "What?!" is one sentence). A trailing non-empty fragment that lacks a terminator still counts as a sentence ("hello" → 1).

This implementation does not special-case abbreviations such as Mr., Dr., or e.g., so text dense in such abbreviations will be over-segmented: "Dr. Smith left." counts as two sentences. Callers that need abbreviation-aware segmentation should pre-process their input.

Empty input returns 0.
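The rules above can be sketched as a fold over graphemes that tracks whether any sentence content is pending. This is a hedged illustration, not the module's implementation: sentences_sketch is a hypothetical name, whitespace is detected with string.trim, and a terminator run with no content before it is assumed to count as nothing.

```gleam
import gleam/list
import gleam/string

// Hypothetical sketch of terminator-based sentence counting.
pub fn sentences_sketch(text: String) -> Int {
  let #(total, pending) =
    text
    |> string.to_graphemes
    |> list.fold(#(0, False), fn(state, grapheme) {
      let #(count, has_content) = state
      case grapheme {
        // A terminator closes the pending sentence, if there is one.
        // Later terminators in the same run see no pending content,
        // so "?!" yields a single boundary.
        "." | "!" | "?" ->
          case has_content {
            True -> #(count + 1, False)
            False -> state
          }
        _ ->
          // Assumption: whitespace alone never starts a sentence.
          case string.trim(grapheme) == "" {
            True -> state
            False -> #(count, True)
          }
      }
    })
  // A trailing fragment without a terminator still counts.
  case pending {
    True -> total + 1
    False -> total
  }
}

pub fn main() {
  let assert 1 = sentences_sketch("What?!")
  let assert 1 = sentences_sketch("hello")
  let assert 0 = sentences_sketch("")
  // No abbreviation handling: "Dr." ends a sentence here.
  let assert 2 = sentences_sketch("Dr. Smith left.")
  Nil
}
```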

pub fn syllables(text: String) -> Int

Count syllables in text. Sums syllables_in_word over each word found by words.

pub fn syllables_in_word(word: String) -> Int

Count syllables in a single word using an English heuristic:

  1. Lowercase the word.
  2. Strip non-ASCII-letter graphemes.
  3. Count maximal groups of the vowels a, e, i, o, u and y, where y counts as a vowel only when it is not the first letter.
  4. Subtract one if the word ends in a silent e (the preceding letter being a consonant).
  5. Floor at 1.

Examples:

count.syllables_in_word("the")       // 1
count.syllables_in_word("hello")     // 2
count.syllables_in_word("syllable")  // 3
count.syllables_in_word("rhythm")    // 1 (the non-initial y is the only vowel group)

Empty input is a special case and returns 0 rather than the floor of 1. A non-empty word that contains no ASCII letters (for example a word in a non-Latin script) returns 1.
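The five steps can be sketched as follows. This is a literal reading of the list above, not the module's source: the function names are hypothetical, and "consonant" is assumed to mean any letter that is not a, e, i, o, u or y. Because it applies only the listed steps, it returns 2 for "syllable" rather than the 3 shown in the examples, so the real implementation presumably includes at least one rule not described here.

```gleam
import gleam/int
import gleam/list
import gleam/string

// Step 3 helper: y is a vowel only away from the first position.
fn is_vowel(letter: String, index: Int) -> Bool {
  case letter {
    "a" | "e" | "i" | "o" | "u" -> True
    "y" -> index > 0
    _ -> False
  }
}

// Step 2 helper: input is already lowercased.
fn is_ascii_letter(grapheme: String) -> Bool {
  string.contains("abcdefghijklmnopqrstuvwxyz", grapheme)
}

// Step 3: count maximal runs of vowels.
fn count_vowel_groups(letters: List(String)) -> Int {
  let #(total, _) =
    list.index_fold(letters, #(0, False), fn(state, letter, index) {
      let #(groups, in_group) = state
      case is_vowel(letter, index), in_group {
        True, False -> #(groups + 1, True)
        True, True -> state
        False, _ -> #(groups, False)
      }
    })
  total
}

// Step 4: final e preceded by a consonant (assumed: not a/e/i/o/u/y).
fn has_silent_e(letters: List(String)) -> Bool {
  case list.reverse(letters) {
    ["e", previous, ..] -> !is_vowel(previous, 1)
    _ -> False
  }
}

// Hypothetical sketch of the documented heuristic.
pub fn syllables_in_word_sketch(word: String) -> Int {
  case word {
    "" -> 0
    _ -> {
      let letters =
        word
        |> string.lowercase
        |> string.to_graphemes
        |> list.filter(is_ascii_letter)
      let groups = count_vowel_groups(letters)
      let adjusted = case has_silent_e(letters) {
        True -> groups - 1
        False -> groups
      }
      // Step 5: never report fewer than one syllable.
      int.max(adjusted, 1)
    }
  }
}

pub fn main() {
  let assert 1 = syllables_in_word_sketch("the")
  let assert 2 = syllables_in_word_sketch("hello")
  let assert 1 = syllables_in_word_sketch("rhythm")
  let assert 0 = syllables_in_word_sketch("")
  Nil
}
```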

pub fn words(text: String) -> Int

Count words. A word is a maximal run of “letter-like” graphemes separated by whitespace or punctuation.

A grapheme counts as letter-like when its first code point is an ASCII letter (a-z / A-Z), an ASCII digit, or any non-ASCII character (covering Latin-1 letters, CJK ideographs, accented letters delivered as a single grapheme, etc.). Whitespace and ASCII punctuation are word boundaries.

Examples:

count.words("")              // 0
count.words("hello")         // 1
count.words("hello world")   // 2
count.words("hello, world!") // 2
count.words("hello   world") // 2 (whitespace collapses)
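The letter-like rule and the run counting can be sketched together. The code below is an illustration under the rule as stated above, not the module's source: is_letter_like and words_sketch are hypothetical names, and the code point ranges are the plain ASCII ranges for digits and letters.

```gleam
import gleam/list
import gleam/string

// Hypothetical predicate: classify a grapheme by its first code point.
fn is_letter_like(grapheme: String) -> Bool {
  case string.to_utf_codepoints(grapheme) {
    [] -> False
    [first, ..] -> {
      let code = string.utf_codepoint_to_int(first)
      // Non-ASCII, 0-9, A-Z, or a-z.
      code >= 128
      || { code >= 48 && code <= 57 }
      || { code >= 65 && code <= 90 }
      || { code >= 97 && code <= 122 }
    }
  }
}

// Hypothetical sketch: count maximal runs of letter-like graphemes.
pub fn words_sketch(text: String) -> Int {
  let #(total, _) =
    text
    |> string.to_graphemes
    |> list.fold(#(0, False), fn(state, grapheme) {
      let #(count, in_word) = state
      case is_letter_like(grapheme), in_word {
        // Entering a run starts a new word.
        True, False -> #(count + 1, True)
        True, True -> state
        // Whitespace or ASCII punctuation ends the current run.
        False, _ -> #(count, False)
      }
    })
  total
}

pub fn main() {
  let assert 0 = words_sketch("")
  let assert 2 = words_sketch("hello, world!")
  let assert 2 = words_sketch("hello   world")
  // CJK text without whitespace stays one word.
  let assert 1 = words_sketch("日本語")
  Nil
}
```

Counting only the entry into a run is what makes repeated separators collapse: any number of spaces or punctuation marks between two runs still produces exactly two words.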