Single-pass DFA-style implementation of UAX #29 grapheme cluster segmentation.
The state carried between characters is intentionally small:
- prev: the Grapheme_Cluster_Break property of the previous codepoint
- ri_parity: either :even or :odd, tracking the parity of the run of Regional_Indicators ending at prev (used by GB12/GB13)
- ext_pict_zwj: true when the prefix ends with \p{Extended_Pictographic} \p{Extend}* \p{ZWJ} (used by GB11)
- incb: :none | :consonant | :linker, tracking progress through the GB9c sequence \p{InCB=Consonant} [\p{InCB=Extend}\p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend}\p{InCB=Linker}]* × \p{InCB=Consonant}
Each character is classified once via Unicode.GraphemeClusterBreak,
Unicode.IndicConjunctBreak and a compile-time set of
Extended_Pictographic ranges, then a constant-time decision determines
whether to emit a break or continue the cluster.
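To make the shape of the state machine concrete, here is a minimal sketch of the ri_parity part only. It is not this module's implementation: the module name RIParitySketch and all of its helpers are hypothetical, classification is done with a hard-coded codepoint range instead of Unicode.GraphemeClusterBreak, and only GB12/GB13 plus the default break rule GB999 are modelled. Input is assumed to be valid UTF-8.

```elixir
defmodule RIParitySketch do
  # Illustrative only: models GB12/GB13 plus the default break (GB999).
  # Every other UAX #29 rule is deliberately left out.

  @initial %{prev: nil, ri_parity: :even}

  # Split a string into clusters using only the rules modelled here.
  def split(string) do
    string
    |> String.to_charlist()
    |> Enum.reduce({[], [], @initial}, &step/2)
    |> finish()
  end

  # Regional_Indicator symbols occupy U+1F1E6..U+1F1FF.
  defp classify(cp) when cp in 0x1F1E6..0x1F1FF, do: :regional_indicator
  defp classify(_cp), do: :other

  # No break between two Regional_Indicators when the run ending at prev is odd.
  defp break?(%{prev: :regional_indicator, ri_parity: :odd}, :regional_indicator), do: false
  defp break?(_state, _class), do: true

  # Constant-time state update: flip the parity on an RI, reset it otherwise.
  defp advance(%{ri_parity: :even}, :regional_indicator),
    do: %{prev: :regional_indicator, ri_parity: :odd}

  defp advance(%{ri_parity: :odd}, :regional_indicator),
    do: %{prev: :regional_indicator, ri_parity: :even}

  defp advance(_state, class), do: %{prev: class, ri_parity: :even}

  # The first codepoint always opens a new cluster.
  defp step(cp, {done, [], state}) do
    {done, [cp], advance(state, classify(cp))}
  end

  # Later codepoints either close the current cluster or extend it.
  defp step(cp, {done, current, state}) do
    class = classify(cp)
    next = advance(state, class)

    if break?(state, class) do
      {[flush(current) | done], [cp], next}
    else
      {done, [cp | current], next}
    end
  end

  # Codepoints are accumulated in reverse, so restore order before encoding.
  defp flush(current), do: current |> Enum.reverse() |> List.to_string()

  defp finish({done, [], _state}), do: Enum.reverse(done)
  defp finish({done, current, _state}), do: Enum.reverse([flush(current) | done])
end
```

With this sketch, a run of four Regional_Indicators splits into two pairs, and a run of three splits into a pair followed by a lone indicator. The real state also carries ext_pict_zwj and incb, which would be updated in the same step in the same way.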
Summary
Functions
Returns true if there is a grapheme cluster boundary between
string_before and string_after.
Finds the next grapheme cluster boundary after position 0 in string and
returns the split as a {first_grapheme, rest} tuple.
Splits string into a list of grapheme clusters according to UAX #29.
Functions
Returns true if there is a grapheme cluster boundary between
string_before and string_after.
When string_before is empty there is always a boundary (GB1).
When string_after is empty there is always a boundary (GB2).
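The contract above can be illustrated with a reference check built on Elixir's standard String.next_grapheme/1 rather than on this module's DFA. The module name BoundarySketch and its helpers are hypothetical, and the sketch assumes that "boundary between string_before and string_after" means a boundary at that byte offset in the concatenated string.

```elixir
defmodule BoundarySketch do
  # GB1 and GB2: an empty side always yields a boundary.
  def boundary?("", _after_str), do: true
  def boundary?(_before, ""), do: true

  # Otherwise segment the concatenation and see whether some cluster
  # ends exactly where string_before ends.
  def boundary?(before, after_str) do
    walk(before <> after_str, byte_size(before))
  end

  # A cluster ended exactly at the join: boundary.
  defp walk(_rest, 0), do: true
  # A cluster straddles the join: no boundary.
  defp walk(_rest, remaining) when remaining < 0, do: false

  defp walk(rest, remaining) do
    case String.next_grapheme(rest) do
      {grapheme, tail} -> walk(tail, remaining - byte_size(grapheme))
      nil -> false
    end
  end
end
```

For example, BoundarySketch.boundary?("a", "b") is true, while BoundarySketch.boundary?("e", "\u0301") is false because the combining acute accent extends the cluster started by "e". A check like this walks the whole prefix, so it is only useful as a cross-check against a single-step decision.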
Finds the next grapheme cluster boundary after position 0 in string and
returns the split as a {first_grapheme, rest} tuple.
Returns nil for the empty string.
Splits string into a list of grapheme clusters according to UAX #29.
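The {first_grapheme, rest} or nil contract described above is exactly the shape Stream.unfold/2 expects, so a full split can be expressed as repeated next-grapheme steps. The sketch below shows this with the standard library's String.next_grapheme/1 as the default stepper. UnfoldSketch and its next parameter are hypothetical names, and whether this module builds its list this way is an assumption.

```elixir
defmodule UnfoldSketch do
  # `next` is any function returning {first_grapheme, rest} or nil,
  # which is the same contract Stream.unfold/2 uses to emit and continue.
  def split(string, next \\ &String.next_grapheme/1) do
    string
    |> Stream.unfold(next)
    |> Enum.to_list()
  end
end
```

Because nil ends the unfold, the empty string yields an empty list without a special case.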