Single-pass DFA-style implementation of UAX #29 word break.
State
Per-position state is intentionally compact:
- prev, prev2: the effective Word_Break property of the previous and second-previous codepoint, where Extend, Format, and ZWJ are skipped over (per WB4). prev2 is only needed for WB7 (AHLetter (MidLetter | MidNumLetQ) × AHLetter), WB7c (HebrewLetter DoubleQuote × HebrewLetter), and WB11 (Numeric (MidNum | MidNumLetQ) × Numeric).
- ri_parity: :odd or :even, the parity of the run of Regional_Indicators ending at prev (WB15/16).
- prev_actual: the Word_Break property of the codepoint immediately preceding the current one (without WB4 skipping). Required by rules that do not allow transparent characters in between, namely WB3 (CR × LF), WB3c (ZWJ × ExtPict), and WB3d (WSegSpace × WSegSpace).
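The state fields described above could be modelled roughly as follows. This is a minimal sketch, not the module's actual code: the module name WordBreakStateSketch, the functions new/0 and advance/2, and the atoms used for Word_Break properties (:aletter, :extend, :regional_indicator, and so on) are all assumptions made for illustration.

```elixir
defmodule WordBreakStateSketch do
  # Hypothetical container for the per-position state.
  defstruct prev: nil, prev2: nil, ri_parity: :even, prev_actual: nil

  # Properties that WB4 treats as transparent.
  @transparent [:extend, :format, :zwj]
  # WB4 does not absorb transparent codepoints after these.
  @hard [:cr, :lf, :newline]

  # Fresh state at start of text.
  def new, do: %__MODULE__{}

  # A transparent codepoint after an ordinary one only updates prev_actual;
  # the effective properties (prev, prev2) and the RI parity are untouched.
  def advance(%__MODULE__{prev: prev} = state, prop)
      when prop in @transparent and prev != nil and prev not in @hard do
    %{state | prev_actual: prop}
  end

  # Any other codepoint shifts the effective window by one.
  def advance(%__MODULE__{} = state, prop) do
    %{
      state
      | prev2: state.prev,
        prev: prop,
        prev_actual: prop,
        ri_parity: next_parity(state.ri_parity, prop)
    }
  end

  # Parity flips on each Regional_Indicator and resets on anything else.
  defp next_parity(:even, :regional_indicator), do: :odd
  defp next_parity(:odd, :regional_indicator), do: :even
  defp next_parity(_parity, _prop), do: :even
end
```

Under these assumptions, feeding :aletter, :extend, :mid_letter through advance/2 leaves prev2 at :aletter and prev at :mid_letter, which is the window WB7 inspects, while prev_actual still records each raw codepoint along the way.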
Lookahead
Some rules (WB6, WB7b, WB12) depend on the codepoint that follows the one immediately after the candidate break. The walker therefore reads codepoints with one codepoint of buffered lookahead and resolves these rules at decision time.
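A lookahead decision of this kind might look like the sketch below. It is illustrative only: the module name WordBreakLookaheadSketch, the function joined?/3, and the property atoms are hypothetical, and the three arguments are assumed to be effective properties (already adjusted for WB4), with nil standing for end of text.

```elixir
defmodule WordBreakLookaheadSketch do
  # Macro classes from UAX #29, spelled as hypothetical atoms.
  @ah_letter [:aletter, :hebrew_letter]
  @mid_letter_q [:mid_letter, :mid_num_let, :single_quote]
  @mid_num_q [:mid_num, :mid_num_let, :single_quote]

  # Returns true when a lookahead rule forbids a break between prev and curr.
  # `next` is the buffered codepoint after curr, or nil at end of text.

  # WB6: AHLetter × (MidLetter | MidNumLetQ) AHLetter
  def joined?(prev, curr, next)
      when prev in @ah_letter and curr in @mid_letter_q and next in @ah_letter,
      do: true

  # WB7b: Hebrew_Letter × Double_Quote Hebrew_Letter
  def joined?(:hebrew_letter, :double_quote, :hebrew_letter), do: true

  # WB12: Numeric × (MidNum | MidNumLetQ) Numeric
  def joined?(:numeric, curr, :numeric) when curr in @mid_num_q, do: true

  # No lookahead rule applies; other rules decide.
  def joined?(_prev, _curr, _next), do: false
end
```

Because the third argument is the only piece of information not yet consumed, a single buffered slot is enough to evaluate all three rules at the moment the break between prev and curr is decided.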
Summary
Functions
Boundary predicate: true if there is a word boundary between
string_before and string_after.
Returns {first_word, rest} for string, or nil for the empty string.
Splits string into a list of word-break segments.
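The last two functions share a natural relationship: a list of segments can be produced by repeatedly taking the first word until the remainder is empty. The sketch below shows that pattern under stated assumptions. SplitSketch and its next_word argument are hypothetical, and String.next_grapheme/1 is used purely as a stand-in because it has the same contract ({head, rest}, or nil for the empty string); it splits on graphemes, not words, and nothing here claims the module is implemented this way.

```elixir
defmodule SplitSketch do
  # Builds a list of segments by repeatedly applying `next_word`,
  # a one-argument function returning {segment, rest} or nil.
  def split(string, next_word) when is_binary(string) and is_function(next_word, 1) do
    case next_word.(string) do
      nil -> []
      {segment, rest} -> [segment | split(rest, next_word)]
    end
  end
end

# Stand-in usage: String.next_grapheme/1 has the same return shape.
segments = SplitSketch.split("abc", &String.next_grapheme/1)
```

Substituting the module's own first-word function for the stand-in would yield word-break segments instead of graphemes.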