Gibran.Tokeniser
This module contains functions that allow the caller to convert a string into a list of tokens using different strategies.
Functions
Takes a string and splits it into a list of tokens using a regular expression. If a regular expression is not provided, it defaults to `@token_regexp`.
iex> Gibran.Tokeniser.tokenise("The Prophet")
["the", "prophet"]
The default regular expression ignores punctuation, but accounts for apostrophes and compound words.
iex> Gibran.Tokeniser.tokenise("Prophet, The")
["prophet", "the"]
iex> Gibran.Tokeniser.tokenise("Al-Ajniha al-Mutakassira")
["al-ajniha", "al-mutakassira"]
The tokeniser normalises its input by downcasing all tokens.
iex> Gibran.Tokeniser.tokenise("THE PROPHET")
["the", "prophet"]
Options
:pattern
A regular expression used to tokenise the input. Defaults to `@token_regexp`.
:exclude
A filter applied after the string has been tokenised. It can be a function, string, list, or regular expression, and is useful for excluding tokens from the final list.
Examples
iex> Gibran.Tokeniser.tokenise("Broken Wings, 1912", pattern: ~r/\,/)
["broken wings", " 1912"]
iex> Gibran.Tokeniser.tokenise("Kingdom of the Imagination", exclude: &(String.length(&1) < 10))
["imagination"]
iex> Gibran.Tokeniser.tokenise("Sand and Foam", exclude: ~r/and/)
["foam"]
iex> Gibran.Tokeniser.tokenise("Eye of The Prophet", exclude: "eye of")
["the", "prophet"]
iex> Gibran.Tokeniser.tokenise("Eye of The Prophet", exclude: ["eye", "of"])
["the", "prophet"]