Unicode.String.Break.Word (Unicode String v2.1.0)


Single-pass DFA-style implementation of UAX #29 word break.

State

Per-position state is intentionally compact:

  • prev, prev2 — the effective Word_Break property of the previous and second-previous codepoint, where Extend, Format, and ZWJ are skipped over (per WB4). prev2 is only needed for WB7 (AHLetter MidLetter|MidNumLetQ × AHLetter), WB7c (HebrewLetter DoubleQuote × HebrewLetter), and WB11 (Numeric MidNum|MidNumLetQ × Numeric).

  • ri_parity, either :odd or :even, is the parity of the run of Regional_Indicator codepoints ending at prev (WB15/WB16).

  • prev_actual — the Word_Break property of the codepoint immediately preceding the current one (without WB4 skipping). Required by rules that don't allow transparent characters in between, namely WB3 (CR × LF), WB3c (ZWJ × ExtPict), and WB3d (WSegSpace × WSegSpace).
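The state described above can be pictured with a small sketch. This is not the module's actual internals: the struct layout, the `advance/2` function, and the property atoms such as `:aletter` and `:regional_indicator` are assumptions made for illustration. It also ignores the cases where WB4 does not apply (start of text, and after CR, LF, or Newline).

```elixir
defmodule WordBreakStateSketch do
  # Hypothetical struct: field names follow the list above, the layout is assumed.
  defstruct prev: nil, prev2: nil, prev_actual: nil, ri_parity: :even

  # Properties that WB4 treats as transparent.
  @transparent [:extend, :format, :zwj]

  def new, do: %__MODULE__{}

  # Advance the state past one codepoint whose Word_Break property is `prop`.
  # Transparent codepoints only update prev_actual; prev, prev2 and ri_parity
  # keep describing the last non-transparent codepoints.
  def advance(%__MODULE__{} = state, prop) when prop in @transparent do
    %{state | prev_actual: prop}
  end

  def advance(%__MODULE__{} = state, prop) do
    %{
      state
      | prev2: state.prev,
        prev: prop,
        prev_actual: prop,
        ri_parity: next_parity(state, prop)
    }
  end

  # An odd-length run of Regional_Indicators followed by one more becomes even.
  defp next_parity(%{prev: :regional_indicator, ri_parity: :odd}, :regional_indicator), do: :even
  # Any other Regional_Indicator either starts a run (length 1) or extends an even one.
  defp next_parity(_state, :regional_indicator), do: :odd
  # A non-RI codepoint ends the run.
  defp next_parity(_state, _prop), do: :even
end
```

Feeding the properties `[:aletter, :extend]` through `advance/2` leaves `prev` at `:aletter` while `prev_actual` becomes `:extend`, which is the distinction that WB3, WB3c, and WB3d rely on.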

Lookahead

Some rules (WB6, WB7b, WB12) depend on the codepoint that comes after the one immediately following the candidate break. The walker therefore reads codepoints with one codepoint of buffered lookahead and resolves these rules at decision time.
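As a rough illustration of why the lookahead is needed, the three rules can be written as a predicate over the previous, current, and lookahead properties. The module name, function name, and property atoms below are hypothetical, and the sketch assumes the properties have already had WB4 skipping applied.

```elixir
defmodule LookaheadSketch do
  # UAX #29 macros: AHLetter = ALetter | Hebrew_Letter,
  # MidNumLetQ = MidNumLet | Single_Quote.
  @ah_letter [:aletter, :hebrew_letter]
  @mid_letter_or_q [:mid_letter, :mid_num_let, :single_quote]
  @mid_num_or_q [:mid_num, :mid_num_let, :single_quote]

  # True when WB6, WB7b, or WB12 forbids a break between `prev` and `current`.
  # `lookahead` is the property of the codepoint after `current`, or nil at
  # the end of input.
  def joined_by_lookahead?(prev, current, lookahead) do
    cond do
      # WB6: AHLetter × (MidLetter | MidNumLetQ) AHLetter
      prev in @ah_letter and current in @mid_letter_or_q and lookahead in @ah_letter ->
        true

      # WB7b: Hebrew_Letter × Double_Quote Hebrew_Letter
      prev == :hebrew_letter and current == :double_quote and lookahead == :hebrew_letter ->
        true

      # WB12: Numeric × (MidNum | MidNumLetQ) Numeric
      prev == :numeric and current in @mid_num_or_q and lookahead == :numeric ->
        true

      true ->
        false
    end
  end
end
```

For the apostrophe in "can't", the call would be `joined_by_lookahead?(:aletter, :single_quote, :aletter)`, which returns `true`. With `nil` as the lookahead (a trailing apostrophe) it returns `false`, so the decision cannot be made from `prev` and `current` alone.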

Functions

break?(string_before, string_after)

@spec break?(String.t(), String.t()) :: boolean()

Boundary predicate: true if there is a word boundary between string_before and string_after.

next(string)

@spec next(String.t()) :: {String.t(), String.t()} | nil

Returns {first_word, rest} for string, or nil for the empty string.

split(string)

@spec split(String.t()) :: [String.t()]

Splits string into a list of word-break segments.
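The return shape of next/1 ({segment, rest}, or nil when the input is exhausted) is the same contract that `Stream.unfold/2` expects, so a split-style function can be sketched on top of any next-style function. The module below is hypothetical, and the source does not state that split/1 is implemented this way or that the two always agree.

```elixir
defmodule SplitViaNextSketch do
  # `next_fun` stands in for a next/1-style function: it must return
  # {segment, rest} for a non-empty string and nil for the empty string.
  def split(string, next_fun) when is_binary(string) and is_function(next_fun, 1) do
    string
    |> Stream.unfold(next_fun)
    |> Enum.to_list()
  end
end

# Stand-in demonstration using String.next_grapheme/1, which has the same
# {head, rest} | nil shape but segments by grapheme rather than by word.
SplitViaNextSketch.split("abc", &String.next_grapheme/1)
```

With the library loaded, passing `&Unicode.String.Break.Word.next/1` as `next_fun` would presumably produce word-break segments one at a time, which also allows lazy consumption if the `Enum.to_list/1` step is dropped.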