Unicode.String (Unicode String v1.2.0)

This module provides functions that implement some of the Unicode standards:

  • The Unicode Case Folding algorithm to provide case-independent equality checking irrespective of language or script.

  • The Unicode Segmentation algorithm to detect, break or split strings into grapheme clusters, words and sentences.

  • The Unicode Line Breaking algorithm to determine line breaks (as in word-wrapping).

Summary

Functions

break(arg, options \\ [])
Returns match data indicating if the requested break is applicable at the point between the two string segments represented by {string_before, string_after}.

break?(arg, options \\ [])
Returns a boolean indicating if the requested break is applicable at the point between the two string segments represented by {string_before, string_after}.

equals_ignoring_case?(string_a, string_b, type \\ :full)
Compares two strings in a case insensitive manner.

next(string, options \\ [])
Returns the next segment in a string.

split(string, options \\ [])
Splits a string according to the specified break type.

splitter(string, options)
Returns an enumerable that splits a string on demand.

stream(string, options \\ [])
Returns a stream that breaks a string into graphemes, words, sentences or lines.

Types

@type break_match() ::
  {break_or_no_break(), {String.t(), {String.t(), String.t()}}}
  | {break_or_no_break(), {String.t(), String.t()}}
@type break_or_no_break() :: :break | :no_break
@type break_type() :: :grapheme | :word | :line | :sentence
@type error_return() :: {:error, String.t()}
@type options() :: [locale: String.t(), break: break_type(), suppressions: boolean()]
@type split_options() :: [
  locale: String.t(),
  break: break_type(),
  suppressions: boolean(),
  trim: boolean()
]
@type string_interval() :: {String.t(), String.t()}

Functions

break(arg, options \\ [])

@spec break(string_interval(), options()) :: break_match() | error_return()

Returns match data indicating if the requested break is applicable at the point between the two string segments represented by {string_before, string_after}.

Arguments

  • arg is a tuple {string_before, string_after} of two String.t/0 values. The break is tested at the point between them.

  • options is a keyword list of options.

Returns

A tuple indicating if a break would be applicable at this point between string_before and string_after.

  • {:break, {string_before, {matched_string, remaining_string}}} or

  • {:no_break, {string_before, {matched_string, remaining_string}}} or

  • {:error, reason}

Options

  • :locale is any locale returned by Unicode.String.Segment.known_locales/0. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.

  • :break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.

  • :suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.

Examples

iex> Unicode.String.break {"This is ", "some words"}
{:break, {"This is ", {"s", "ome words"}}}

iex> Unicode.String.break {"This is ", "some words"}, break: :sentence
{:no_break, {"This is ", {"s", "ome words"}}}

iex> Unicode.String.break {"This is one. ", "This is some words."}, break: :sentence
{:break, {"This is one. ", {"T", "his is some words."}}}

break?(arg, options \\ [])

@spec break?(string_interval(), options()) :: boolean()

Returns a boolean indicating if the requested break is applicable at the point between the two string segments represented by {string_before, string_after}.

Arguments

  • arg is a tuple {string_before, string_after} of two String.t/0 values. The break is tested at the point between them.

  • options is a keyword list of options.

Returns

  • true or false or

  • raises an exception if there is an error

Options

  • :locale is any locale returned by Unicode.String.Segment.known_locales/0. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.

  • :break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.

  • :suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.

Examples

iex> Unicode.String.break? {"This is ", "some words"}
true

iex> Unicode.String.break? {"This is ", "some words"}, break: :sentence
false

iex> Unicode.String.break? {"This is one. ", "This is some words."}, break: :sentence
true

equals_ignoring_case?(string_a, string_b, type \\ :full)

@spec equals_ignoring_case?(String.t(), String.t(), atom()) :: boolean()

Compares two strings in a case insensitive manner.

Case folding is applied to the two string arguments which are then compared with the == operator.

Arguments

  • string_a and string_b are two strings to be compared

  • type is the case folding type to be applied. The alternatives are :full, :simple and :turkic. The default is :full.

Returns

  • true or false

Notes

  • This function applies the Unicode Case Folding algorithm

  • The algorithm does not apply any treatment to diacritical marks, so "compare strings without accents" is not part of this function.
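
Because case folding leaves diacritical marks alone, an accent-insensitive comparison needs an extra step before the comparison. The following is a minimal sketch using only the Elixir standard library. The module AccentInsensitive and its functions are hypothetical and not part of Unicode.String, and String.downcase/1 stands in for case folding so the snippet runs without any dependency.

```elixir
defmodule AccentInsensitive do
  # Hypothetical helper module; not part of Unicode.String.

  # Decompose to NFD so accents become separate combining marks,
  # then remove every nonspacing mark (Unicode category Mn).
  def strip_marks(string) do
    string
    |> String.normalize(:nfd)
    |> String.replace(~r/\p{Mn}/u, "")
  end

  # String.downcase/1 is a stand-in for case folding here (an assumption
  # made to keep the sketch dependency-free).
  def equal?(a, b) do
    String.downcase(strip_marks(a)) == String.downcase(strip_marks(b))
  end
end

IO.inspect(AccentInsensitive.strip_marks("Café"))
IO.inspect(AccentInsensitive.equal?("Café", "cafe"))
```

In a project that depends on this library, the stripped strings could instead be passed to Unicode.String.equals_ignoring_case?/3 so that full case folding is applied.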

Examples

iex> Unicode.String.equals_ignoring_case? "ABC", "abc"
true

iex> Unicode.String.equals_ignoring_case? "beißen", "beissen"
true

iex> Unicode.String.equals_ignoring_case? "grüßen", "grussen"
false

See Unicode.String.Case.Folding.fold/1.

See Unicode.String.Case.Folding.fold/2.

next(string, options \\ [])

@spec next(String.t(), split_options()) :: String.t() | nil | error_return()

Returns the next segment in a string.

Arguments

  • string is any String.t/0.

  • options is a keyword list of options.

Returns

A tuple with the next segment and the remainder of the string, or "" if the string has reached its end.

  • {next_string, rest_of_the_string} or

  • {:error, reason}

Options

  • :locale is any locale returned by Unicode.String.Segment.known_locales/0. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.

  • :break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.

  • :suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.

Examples

iex> Unicode.String.next "This is a sentence. And another.", break: :word
{"This", " is a sentence. And another."}

iex> Unicode.String.next "This is a sentence. And another.", break: :sentence
{"This is a sentence. ", "And another."}

split(string, options \\ [])

@spec split(String.t(), split_options()) :: [String.t(), ...] | error_return()

Splits a string according to the specified break type.

Arguments

  • string is any String.t/0.

  • options is a keyword list of options.

Returns

  • A list of strings after applying the specified break rules or

  • {:error, reason}

Options

  • :locale is any locale returned by Unicode.String.Segment.known_locales/0. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.

  • :break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.

  • :suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.

  • :trim is a boolean indicating if segments that consist only of white space are to be excluded from the returned list. The default is false.

Examples

iex> Unicode.String.split "This is a sentence. And another.", break: :word
["This", " ", "is", " ", "a", " ", "sentence", ".", " ", "And", " ", "another", "."]

iex> Unicode.String.split "This is a sentence. And another.", break: :word, trim: true
["This", "is", "a", "sentence", ".", "And", "another", "."]

iex> Unicode.String.split "This is a sentence. And another.", break: :sentence
["This is a sentence. ", "And another."]
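
Since split/2 returns either a list or {:error, reason}, callers typically pattern match on both shapes. The sketch below assumes the library is published on Hex as unicode_string and that "~> 1.2" is a suitable version requirement. SentenceCount is a hypothetical helper introduced only for illustration.

```elixir
# Assumption: the Hex package name is `unicode_string`; adjust the version as needed.
Mix.install([{:unicode_string, "~> 1.2"}])

defmodule SentenceCount do
  # Hypothetical helper: counts sentences and passes an error tuple through unchanged.
  def count(text) do
    case Unicode.String.split(text, break: :sentence) do
      {:error, _reason} = error -> error
      sentences when is_list(sentences) -> {:ok, length(sentences)}
    end
  end
end

IO.inspect(SentenceCount.count("This is a sentence. And another."))
```

Wrapping the list result in an {:ok, value} tuple is a design choice of this sketch, not something the library does. It lets callers handle both outcomes with a single `with` or `case` clause.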

splitter(string, options)

@spec splitter(String.t(), split_options()) :: function() | error_return()

Returns an enumerable that splits a string on demand.

Arguments

  • string is any String.t/0.

  • options is a keyword list of options.

Returns

  • A function that implements the enumerable protocol or

  • {:error, reason}

Options

  • :locale is any locale returned by Unicode.String.Segment.known_locales/0. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.

  • :break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.

  • :suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.

  • :trim is a boolean indicating if segments that consist only of white space are to be excluded from the returned list. The default is false.

Examples

iex> enum = Unicode.String.splitter "This is a sentence. And another.", break: :word, trim: true
iex> Enum.take enum, 3
["This", "is", "a"]
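
To illustrate what "on demand" means, here is a minimal sketch of a lazy splitter built from the standard library alone. It makes no claim about how Unicode.String.splitter/2 is implemented. LazySegments is a hypothetical module, and String.next_grapheme/1 stands in for a segment finder because it returns the {segment, rest} or nil shape that Stream.unfold/2 expects.

```elixir
defmodule LazySegments do
  # Hypothetical module; a stand-in for an on-demand splitter.
  # Stream.unfold/2 calls the function only when a consumer asks for
  # the next element, so the rest of the string is never scanned early.
  def graphemes(string) do
    Stream.unfold(string, &String.next_grapheme/1)
  end
end

# Only the first three graphemes are computed.
"héllo wörld"
|> LazySegments.graphemes()
|> Enum.take(3)
|> IO.inspect()
```

The same pattern applies to any function that returns the next segment together with the remaining string.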

stream(string, options \\ [])

(since 1.2.0)
@spec stream(String.t(), Keyword.t()) :: Enumerable.t() | {:error, String.t()}

Returns a stream that breaks a string into graphemes, words, sentences or lines.

Arguments

  • string is any String.t/0.

  • options is a keyword list of options.

Returns

  • A stream (an Enumerable.t/0) of string segments or

  • {:error, reason}

Options

  • :locale is any locale returned by Unicode.String.Segment.known_locales/0. The default is "root" which corresponds to the break rules defined by the Unicode Segmentation rules.

  • :break is the type of break. It is one of :grapheme, :word, :line or :sentence. The default is :word.

  • :suppressions is a boolean which, if true, will suppress breaks for common abbreviations defined for the locale. The default is true.

  • :trim is a boolean indicating if segments that consist only of white space are to be excluded from the returned list. The default is false.

Examples

iex> Enum.to_list Unicode.String.stream("this is a set of words", trim: true)
["this", "is", "a", "set", "of", "words"]

iex> Enum.to_list Unicode.String.stream("this is a set of words", break: :sentence, trim: true)
["this is a set of words"]