Unicode.String.Break.Sentence (Unicode String v2.1.0)


Single-pass DFA-style implementation of UAX #29 sentence break with locale-specific class extensions and abbreviation suppressions.

Background

The sentence-break algorithm differs from grapheme/word break in two important ways:

  • The default rule is no break (rule SB998 × Any). Sentence boundaries are emitted only by SB4 (ParaSep ÷) and by SB11 (SATerm Close* Sp* ParaSep? ÷), in the absence of an earlier suppressing rule.

  • SB8 has unbounded forward look-ahead: after ATerm Close* Sp*, it suppresses the break if the first subsequent codepoint that is one of OLetter | Upper | Lower | ParaSep | SATerm is a Lower letter.
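
The SB8 look-ahead can be sketched as a scan for the first "stopper" property. This is only an illustration, not the library's code: the module name SB8Sketch, the function suppress?/1, and the atom names for the properties are assumptions made for the example.

```elixir
defmodule SB8Sketch do
  # Properties that end the SB8 scan (SATerm is STerm | ATerm).
  # Every other property is skipped over.
  @stoppers [:oletter, :upper, :lower, :parasep, :sterm, :aterm]

  # `props` is the list of Sentence_Break properties that follow
  # ATerm Close* Sp*. Returns true when SB8 would suppress the break,
  # i.e. when the first stopper found is Lower.
  def suppress?(props) do
    case Enum.find(props, &(&1 in @stoppers)) do
      :lower -> true
      _other -> false
    end
  end
end
```

Because the scan only ends at a stopper, its length is not bounded by any fixed window, which is what makes the look-ahead unbounded.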

Locale-specific class extensions

Some locales extend the standard Sentence_Break property classes. CLDR's el.xml, for example, extends $STerm to include U+003B (ASCII semicolon) so that Greek text like "γδ; Ε" breaks at the semicolon. The walker accepts a locale argument and applies these per-locale overrides via classify/2.
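
A minimal sketch of a per-locale override follows, assuming a two-argument classifier that takes a codepoint and a locale. Only the name classify/2 comes from the text above; the argument order, the atom names, and the tiny default_class/1 table are hypothetical stand-ins.

```elixir
defmodule ClassifySketch do
  # Stand-in for the default Sentence_Break lookup. Only a handful of
  # codepoints are covered here; a real table covers all of Unicode.
  defp default_class(?.), do: :aterm
  defp default_class(cp) when cp in [?!, ??], do: :sterm
  defp default_class(?\s), do: :sp
  defp default_class(_cp), do: :other

  # The locale override is tried first: Greek treats U+003B as STerm.
  def classify(?;, locale) when locale in [:el, "el"], do: :sterm

  # Every other codepoint and locale falls back to the default class.
  def classify(cp, _locale), do: default_class(cp)
end
```

With this shape, adding another locale is a matter of adding clauses ahead of the fallback, and the rest of the walker never needs to know that an override was applied.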

State

The walker carries:

  • prev_actual: the property of the immediately preceding codepoint, without the SB5 transparent skip. Needed for SB3 (CR × LF).

  • effective_prev: the property of the previous non-transparent codepoint (Extend and Format are skipped per SB5).

  • before_aterm: the effective property before the most recent ATerm, used by SB7 ((Upper|Lower) ATerm × Upper).

  • phase: encodes how far the walker is through a potential sentence-terminating sequence SATerm Close* Sp* ParaSep?.
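
The way the first three fields move as codepoints are consumed can be sketched as follows. The struct and step/2 are assumptions for illustration only; phase is left out, and the sketch ignores the case where Extend or Format follows a ParaSep.

```elixir
defmodule StateSketch do
  defstruct prev_actual: nil, effective_prev: nil, before_aterm: nil

  # Extend and Format are transparent (SB5): only prev_actual changes.
  def step(%__MODULE__{} = state, prop) when prop in [:extend, :format] do
    %{state | prev_actual: prop}
  end

  # An ATerm records the effective property that preceded it, so that
  # an SB7-style check can look at it later.
  def step(%__MODULE__{} = state, :aterm) do
    %{
      state
      | prev_actual: :aterm,
        effective_prev: :aterm,
        before_aterm: state.effective_prev
    }
  end

  # Any other property updates both "previous" fields.
  def step(%__MODULE__{} = state, prop) do
    %{state | prev_actual: prop, effective_prev: prop}
  end
end
```

Feeding the properties Upper, Extend, ATerm, Format through step/2 leaves prev_actual at Format while effective_prev stays at ATerm and before_aterm holds Upper, which is the information SB7 needs.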

Suppressions

Locale-specific suppressions (e.g. "Mr.", "Dr.") are applied as a post-pass: when SB11 would fire after an ATerm-led sequence, the walker compares the trailing fragment of the segment against the suppression set and cancels the break if an entry matches, preferring the longest match.
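
A rough sketch of the longest-match check is shown below. It assumes the segment is the text up to and including the ATerm and that a suppression matches when the segment ends with it; the module and function names are hypothetical, and the library's real matching may be stricter (for example about word boundaries).

```elixir
defmodule SuppressionSketch do
  # Returns the longest suppression that `segment` ends with, or nil
  # when no entry in the set matches.
  def longest_match(segment, suppressions) do
    suppressions
    |> Enum.filter(&String.ends_with?(segment, &1))
    |> Enum.max_by(&String.length/1, fn -> nil end)
  end

  # The break is cancelled whenever some suppression matches.
  def suppressed?(segment, suppressions) do
    longest_match(segment, suppressions) != nil
  end
end
```

Preferring the longest entry matters when one suppression is a suffix of another, such as "m." and "a.m." in the same set.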

Functions

break?(string_before, string_after, locale, suppressions)

@spec break?(String.t(), String.t(), atom() | binary(), MapSet.t()) :: boolean()

Boundary predicate. Returns true if there is a sentence break between string_before and string_after.

When suppressions are in effect, the suppression check is made against the trailing word of string_before.

next(string, locale, suppressions)

@spec next(String.t(), atom() | binary(), MapSet.t()) ::
  {String.t(), String.t()} | nil

Returns {first_sentence, rest} for string, or nil for empty input.

split(string, locale, suppressions)

@spec split(String.t(), atom() | binary(), MapSet.t()) :: [String.t()]

Splits string into sentences.
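
The relationship between a next-style function and a split-style function can be sketched as a simple loop. This is not the library's implementation: next_fun stands in for next/3 with the locale and suppressions already applied, and naive_next/1 is a deliberately crude placeholder that only breaks after ". ".

```elixir
defmodule SplitSketch do
  # `next_fun` returns {first_sentence, rest}, or nil for empty input,
  # mirroring the documented return shape of next/3.
  def split(string, next_fun) do
    case next_fun.(string) do
      nil -> []
      {sentence, rest} -> [sentence | split(rest, next_fun)]
    end
  end

  # Naive stand-in: the trailing space stays with the sentence it ends.
  def naive_next(""), do: nil

  def naive_next(string) do
    case String.split(string, ". ", parts: 2) do
      [first, rest] -> {first <> ". ", rest}
      [only] -> {only, ""}
    end
  end
end
```

The nil return for empty input is what terminates the loop, so no separate end-of-string check is needed.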