Single-pass DFA-style implementation of UAX #29 sentence break with locale-specific class extensions and abbreviation suppressions.
Background
The sentence-break algorithm differs from grapheme/word break in two important ways:
- The default rule is no break (rule SB998, × Any). Sentence boundaries are emitted only by SB4 (ParaSep ÷) and by SB11 (SATerm Close* Sp* ParaSep? ÷), in the absence of an earlier suppressing rule.
- SB8 has unbounded forward look-ahead: at an ATerm Close* Sp* it scans ahead and suppresses the break if the first codepoint it reaches from OLetter | Upper | Lower | ParaSep | SATerm is a Lower letter.
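The SB8 look-ahead can be sketched as a scan over the classes that follow ATerm Close* Sp*. This is a minimal illustration, not the walker's actual code: the module name SB8Sketch, the function suppress?/1, and the atom names for the property classes are all assumptions made for the example.

```elixir
defmodule SB8Sketch do
  # Classes that end the SB8 scan. SATerm is spelled out as :sterm and :aterm.
  # :lower is handled by an earlier clause, because it is the one class that
  # ends the scan by suppressing the break.
  @stoppers [:oletter, :upper, :lower, :para_sep, :sterm, :aterm]

  # Input: the (hypothetical) list of class atoms following ATerm Close* Sp*.
  # Returns true when SB8 suppresses the break.
  def suppress?([]), do: false
  def suppress?([:lower | _rest]), do: true
  def suppress?([prop | _rest]) when prop in @stoppers, do: false
  def suppress?([_other | rest]), do: suppress?(rest)
end
```

Classes outside the stop set (Numeric, Sp, and so on) are skipped, which is why the look-ahead has no fixed bound.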
Locale-specific class extensions
Some locales extend the standard Sentence_Break property classes.
CLDR's el.xml, for example, extends $STerm to include U+003B
(ASCII semicolon) so that Greek text like "γδ; Ε" breaks at the
semicolon. The walker accepts a locale argument and applies these
per-locale overrides via classify/2.
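A per-locale override lookup of this kind might look like the sketch below. Only the name classify/2 and the Greek U+003B override come from the text above. The argument order, the module name ClassifySketch, the atom names, and the tiny default table are assumptions for illustration; the stand-in table is deliberately incomplete and does not reflect the real Sentence_Break property data.

```elixir
defmodule ClassifySketch do
  # Locale overrides are consulted before the default table.
  # The "el" entry mirrors the CLDR extension of $STerm described above.
  @overrides %{"el" => %{0x003B => :sterm}}

  def classify(codepoint, locale) do
    locale_table = Map.get(@overrides, locale, %{})

    case Map.fetch(locale_table, codepoint) do
      {:ok, prop} -> prop
      :error -> default_class(codepoint)
    end
  end

  # Stand-in for the full property table; anything unlisted becomes :other.
  defp default_class(?.), do: :aterm
  defp default_class(cp) when cp in [?!, ??], do: :sterm
  defp default_class(cp) when cp in ?a..?z, do: :lower
  defp default_class(cp) when cp in ?A..?Z, do: :upper
  defp default_class(?\s), do: :sp
  defp default_class(_cp), do: :other
end
```

With this shape, the semicolon is classified as :sterm only when the locale is "el"; every other locale falls through to the default table.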
State
The walker carries:
- prev_actual: the property of the immediately previous codepoint (without the SB5 transparent skip). Needed for SB3 (CR × LF).
- effective_prev: the property of the previous non-transparent codepoint (Extend/Format are skipped per SB5).
- before_aterm: the effective property before the most recent ATerm, used by SB7 ((Upper|Lower) ATerm × Upper).
- phase: encodes how far we are through a potential sentence-terminating sequence (SA)Term Close* Sp* ParaSep?.
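The bookkeeping for the first three fields can be sketched as a struct plus a per-codepoint update. The field names come from the list above; the module name WalkerStateSketch, the functions new/0 and step/2, the atom names, and the :none placeholder for phase are assumptions. The sketch leaves phase untouched and ignores the case where Extend/Format follows a ParaSep.

```elixir
defmodule WalkerStateSketch do
  defstruct prev_actual: nil, effective_prev: nil, before_aterm: nil, phase: :none

  @transparent [:extend, :format]

  def new, do: %__MODULE__{}

  # SB5: Extend/Format are transparent, so only the literal predecessor moves.
  def step(%__MODULE__{} = state, prop) when prop in @transparent do
    %{state | prev_actual: prop}
  end

  # An ATerm records what preceded it, so SB7 can be checked later.
  def step(%__MODULE__{} = state, :aterm) do
    %{state | prev_actual: :aterm, before_aterm: state.effective_prev, effective_prev: :aterm}
  end

  # Any other class becomes both the literal and the effective predecessor.
  def step(%__MODULE__{} = state, prop) do
    %{state | prev_actual: prop, effective_prev: prop}
  end
end
```

Feeding the classes Upper, Extend, ATerm, Format through step/2 leaves prev_actual at Format while effective_prev stays at ATerm and before_aterm holds Upper.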
Suppressions
Locale-specific suppressions (e.g. "Mr.", "Dr.") are applied as a post-pass: when SB11 would fire after an ATerm-led sequence, the walker compares the trailing fragment of the segment against the suppression set and cancels the break if any entry matches, preferring the longest match.
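A longest-match check against a suppression set could be sketched as follows. The module name SuppressionSketch, both function names, and the sample entries are hypothetical; real entries would come from locale data. For brevity the sketch uses a plain suffix test and does not enforce a word boundary before the match.

```elixir
defmodule SuppressionSketch do
  # Hypothetical suppression set, for illustration only.
  @suppressions ["Mr.", "Dr.", "D.", "Ph.D."]

  # Returns the longest suppression that ends `segment`, or nil if none does.
  def longest_match(segment, suppressions \\ @suppressions) do
    suppressions
    |> Enum.filter(&String.ends_with?(segment, &1))
    |> Enum.sort_by(&String.length/1, :desc)
    |> List.first()
  end

  # The break is cancelled whenever some suppression matches the tail.
  def suppress_break?(segment), do: longest_match(segment) != nil
end
```

When both "D." and "Ph.D." end the segment, the longer entry wins, which matters if suppressions ever carry per-entry behaviour.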
Functions
Boundary predicate. Returns true if there is a sentence break
between string_before and string_after.
When suppressing, the suppression check matches the trailing word
of string_before.
Returns {first_sentence, rest} for string, or nil for empty input.
Splits string into sentences.
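The {first_sentence, rest} or nil return shape is the contract expected by Stream.unfold/2, so a splitter can be layered on a first-sentence function. The sketch below illustrates only that relationship. The names SplitSketch, first_sentence/1 and split/1 are hypothetical, and the stand-in boundary finder is a naive regex that ignores the UAX #29 rules entirely.

```elixir
defmodule SplitSketch do
  # Naive stand-in: a sentence ends after the first ".", "!" or "?" that is
  # followed by whitespace. It reproduces the return shape, not the rules.
  def first_sentence(""), do: nil

  def first_sentence(string) do
    case Regex.run(~r/\A(.*?[.!?]\s+)(.*)\z/su, string) do
      [_whole, first, rest] -> {first, rest}
      nil -> {string, ""}
    end
  end

  # Stream.unfold/2 stops as soon as first_sentence/1 returns nil.
  def split(string) do
    string
    |> Stream.unfold(&first_sentence/1)
    |> Enum.to_list()
  end
end
```

Trailing whitespace stays attached to the sentence it follows, so concatenating the pieces reproduces the input.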