Text segmentation, modelled on
Intl.Segmenter.
Splits text into segments by grapheme cluster, word, or sentence boundaries.
:grapheme segmentation uses Elixir's built-in String.graphemes/1 and is always available. :word and :sentence segmentation require the optional unicode_string dependency. When that library is not installed, these granularities return an error.
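To illustrate what grapheme-cluster boundaries mean in practice, the following sketch uses only the standard library's String module. It compares codepoints with graphemes for an "é" written as a base letter plus a combining accent; the variable name is illustrative.

```elixir
# "é" written as "e" followed by U+0301 (combining acute accent).
decomposed = "e\u0301"

# Two codepoints: the base letter and the combining mark.
String.codepoints(decomposed)
# => ["e", "\u0301"]

# One grapheme cluster: what a reader perceives as a single character.
String.graphemes(decomposed)
# => ["e\u0301"]

# String.length/1 counts graphemes, not codepoints or bytes.
String.length(decomposed)
# => 1
```

Because :grapheme granularity is described as relying on String.graphemes/1, its segments should follow the same cluster boundaries.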
Note: the JS Intl.Segmenter returns an iterable of rich
segment objects with segment, index, input, and
isWordLike properties. This module returns a flat list of
segment strings for simplicity.
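If positional information similar to the JS index property is needed, it can be rebuilt from the flat list. The sketch below is one possible approach, not part of this module: SegmentOffsets and with_byte_offsets/1 are hypothetical names, and the offsets are UTF-8 byte offsets, which differ from the UTF-16 code unit indexes that JavaScript reports.

```elixir
defmodule SegmentOffsets do
  # Pairs each segment with its byte offset in the original string.
  # Assumes the segments concatenate back to the original input,
  # so it should not be applied to a list that was trimmed.
  def with_byte_offsets(segments) do
    {pairs, _total_bytes} =
      Enum.map_reduce(segments, 0, fn segment, offset ->
        {{segment, offset}, offset + byte_size(segment)}
      end)

    pairs
  end
end

# "\u00E9" is a precomposed "é", which takes two bytes in UTF-8.
"h\u00E9llo"
|> String.graphemes()
|> SegmentOffsets.with_byte_offsets()
# => [{"h", 0}, {"é", 1}, {"l", 3}, {"l", 4}, {"o", 5}]
```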
Functions
Segments a string into a list of substrings.
Arguments
- string is the text to segment.
- options is a keyword list of options.
Options
- :granularity is :grapheme, :word, or :sentence. The default is :grapheme.
- :locale is a locale identifier string. Only used for :word and :sentence granularity. The default is "root".
- :trim is a boolean. When true, whitespace-only segments are removed. Only applies to :word and :sentence granularity. The default is false.
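The effect described for :trim can be pictured as a filter over the segment list. The sketch below is an assumption about the behaviour, not the module's actual implementation; TrimSketch is a hypothetical name and the input list is hand-written rather than real segmenter output.

```elixir
defmodule TrimSketch do
  # Drops segments that are empty or consist solely of whitespace.
  def drop_whitespace(segments) do
    Enum.reject(segments, fn segment -> String.trim(segment) == "" end)
  end
end

# A hand-written list standing in for untrimmed word segments.
TrimSketch.drop_whitespace(["Hello", " ", "world", "!"])
# => ["Hello", "world", "!"]
```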
Returns
- {:ok, segments} where segments is a list of strings.
- {:error, reason} if the granularity is not supported or the unicode_string dependency is missing.
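Since :word and :sentence segmentation can fail when the optional dependency is absent, callers may want a fallback. The following sketch matches on both return shapes and falls back to graphemes. SegmentFallback is a hypothetical helper, and the segmenter is passed in as a function so the example runs without the library installed; in real use one would presumably pass &Intl.Segmenter.segment/2. The error reason used in the stub is made up, since the shape of reason is not documented here.

```elixir
defmodule SegmentFallback do
  # `segmenter` is any 2-arity function that follows the documented
  # contract: it returns {:ok, segments} or {:error, reason}.
  def words_or_graphemes(text, segmenter) do
    case segmenter.(text, granularity: :word, trim: true) do
      {:ok, segments} -> segments
      # Word segmentation is unavailable, so use the always-available
      # grapheme split from the standard library instead.
      {:error, _reason} -> String.graphemes(text)
    end
  end
end

# Stub that mimics a missing dependency (the reason atom is invented).
missing_dep = fn _text, _opts -> {:error, :word_segmentation_unavailable} end

SegmentFallback.words_or_graphemes("hi", missing_dep)
# => ["h", "i"]
```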
Examples
iex> Intl.Segmenter.segment("héllo", granularity: :grapheme)
{:ok, ["h", "é", "l", "l", "o"]}
Segments a string, raising on error.
Same as segment/2 but returns the list directly or raises.
Arguments
- string is the text to segment.
- options is a keyword list of options.
Returns
- A list of segment strings.
Examples
iex> Intl.Segmenter.segment!("héllo", granularity: :grapheme)
["h", "é", "l", "l", "o"]
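The relationship between segment/2 and segment!/2 follows the usual Elixir convention for bang functions: unwrap the :ok tuple or raise. The sketch below shows that convention in isolation. BangSketch is a hypothetical name, and raising ArgumentError is an assumption; the exception that segment!/2 actually raises is not stated above.

```elixir
defmodule BangSketch do
  # Returns the segment list from a successful result.
  def unwrap!({:ok, segments}), do: segments

  # Raises on failure. The exception type here is assumed.
  def unwrap!({:error, reason}) do
    raise ArgumentError, "segmentation failed: #{inspect(reason)}"
  end
end

BangSketch.unwrap!({:ok, ["h", "i"]})
# => ["h", "i"]
```

The tuple-returning form suits code that can recover from a missing dependency, while the raising form suits pipelines where a failure should stop execution.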