Unicode.String.Break.Grapheme (Unicode String v2.1.0)


Single-pass DFA-style implementation of UAX #29 grapheme cluster segmentation.

The state carried between characters is intentionally small:

  • prev: the Grapheme_Cluster_Break property of the previous codepoint
  • ri_parity: either :even or :odd, tracking the parity of the run of Regional_Indicators ending at prev (used by GB12/GB13)
  • ext_pict_zwj: true when the prefix ends with \p{Extended_Pictographic} \p{Extend}* \p{ZWJ} (used by GB11)
  • incb: one of :none, :consonant or :linker, tracking progress through the GB9c sequence \p{InCB=Consonant} [\p{InCB=Extend}\p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend}\p{InCB=Linker}]* × \p{InCB=Consonant}

Each character is classified once via Unicode.GraphemeClusterBreak, Unicode.IndicConjunctBreak and a compile-time set of Extended_Pictographic ranges, then a constant-time decision determines whether to emit a break or continue the cluster.
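To make the carried state concrete, here is a minimal sketch of how a parity flag alone can decide the Regional_Indicator rules (GB12/GB13). It is not the library's implementation: the module name RIParitySketch and its functions are hypothetical, only regional indicators (U+1F1E6..U+1F1FF) are modelled, and every other codepoint is treated as its own cluster for simplicity.

```elixir
defmodule RIParitySketch do
  # Illustrative only: NOT the library's implementation.
  # Only GB12/GB13 are modelled. Every non-indicator codepoint is treated as
  # a one-codepoint cluster, which is a deliberate simplification.

  # Returns the clusters of `string` as a list of strings.
  def split(string) when is_binary(string) do
    {clusters, _parity} =
      string
      |> String.to_charlist()
      |> Enum.reduce({[], :even}, &step/2)

    # Clusters and their codepoints were accumulated in reverse order.
    clusters
    |> Enum.reverse()
    |> Enum.map(fn cluster -> cluster |> Enum.reverse() |> List.to_string() end)
  end

  # Indicator after an odd-length run: it completes a pair, so no break.
  defp step(cp, {[current | done], :odd}) when cp in 0x1F1E6..0x1F1FF do
    {[[cp | current] | done], :even}
  end

  # Indicator after an even-length run (including zero): start a new cluster.
  defp step(cp, {clusters, :even}) when cp in 0x1F1E6..0x1F1FF do
    {[[cp] | clusters], :odd}
  end

  # Any other codepoint: start a new cluster and reset the indicator run.
  defp step(cp, {clusters, _parity}) do
    {[[cp] | clusters], :even}
  end
end

# Four regional indicators (U, S, G, B) group into two flag pairs.
RIParitySketch.split("\u{1F1FA}\u{1F1F8}\u{1F1EC}\u{1F1E7}")
```

Because the decision depends only on the current codepoint and the parity flag, each step is constant time and never needs to look back over the run of indicators.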

Summary

Functions

break?(string_before, string_after)

Returns true if there is a grapheme cluster boundary between string_before and string_after.

next(string)

Splits off the first grapheme cluster of string, returning a {first_grapheme, rest} tuple.

split(string)

Splits string into a list of grapheme clusters according to UAX #29.

Functions

break?(string_before, string_after)

@spec break?(String.t(), String.t()) :: boolean()

Returns true if there is a grapheme cluster boundary between string_before and string_after.

When string_before is empty there is always a boundary (GB1). When string_after is empty there is always a boundary (GB2).

next(string)

@spec next(String.t()) :: {String.t(), String.t()} | nil

Splits off the first grapheme cluster of string, returning a {first_grapheme, rest} tuple in which rest is the remainder of string after the first grapheme cluster boundary.

Returns nil for the empty string.
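The {first_grapheme, rest} | nil return shape lends itself to a simple recursive loop. The sketch below shows one way to consume a string cluster by cluster. NextLoopSketch and collect/2 are hypothetical names, and the standard library's String.next_grapheme/1, which has the same return shape, stands in for next/1 so that the example runs without extra dependencies.

```elixir
defmodule NextLoopSketch do
  # Hypothetical helper, not part of Unicode.String.Break.Grapheme.
  # `next_fun` must return {first_grapheme, rest}, or nil for the empty string.
  def collect(string, next_fun) when is_binary(string) and is_function(next_fun, 1) do
    do_collect(next_fun.(string), next_fun, [])
  end

  # nil means the input is exhausted: return the clusters in original order.
  defp do_collect(nil, _next_fun, acc), do: Enum.reverse(acc)

  # Otherwise keep the cluster and continue with the remainder.
  defp do_collect({grapheme, rest}, next_fun, acc) do
    do_collect(next_fun.(rest), next_fun, [grapheme | acc])
  end
end

# Stand-in from the Elixir standard library with the same return shape.
NextLoopSketch.collect("abc", &String.next_grapheme/1)
```

Given the documented contract, passing &Unicode.String.Break.Grapheme.next/1 in place of the stand-in should work the same way.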

split(string)

@spec split(String.t()) :: [String.t()]

Splits string into a list of grapheme clusters according to UAX #29.