Gibran.Tokeniser

This module provides functions for converting a string into a list of tokens using different strategies.

Summary

Functions

tokenise(input, opts \\ [])

Takes a string and splits it into a list of tokens using a regular expression. If a regular expression is not provided, it defaults to @token_regexp.

Functions

tokenise(input, opts \\ [])

Takes a string and splits it into a list of tokens using a regular expression. If a regular expression is not provided, it defaults to @token_regexp.

iex> Gibran.Tokeniser.tokenise("The Prophet")
["the", "prophet"]

The default regular expression ignores punctuation, but accounts for apostrophes and compound words.

iex> Gibran.Tokeniser.tokenise("Prophet, The")
["prophet", "the"]

iex> Gibran.Tokeniser.tokenise("Al-Ajniha al-Mutakassira")
["al-ajniha", "al-mutakassira"]
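
Because the default pattern preserves apostrophes, a possessive form survives as a single token. The following is an illustrative example based on the behaviour described above, not one of the library's own doctests:

iex> Gibran.Tokeniser.tokenise("The Prophet's Garden")
["the", "prophet's", "garden"]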

The tokeniser normalises its input by downcasing all tokens.

iex> Gibran.Tokeniser.tokenise("THE PROPHET")
["the", "prophet"]

Options

  • :pattern - A regular expression used to tokenise the input. Defaults to @token_regexp.
  • :exclude - A filter applied to the token list after the input has been tokenised. It can be a function, a string, a list of strings, or a regular expression. This is useful for removing unwanted tokens from the final list.

Examples

iex> Gibran.Tokeniser.tokenise("Broken Wings, 1912", pattern: ~r/\,/)
["broken wings", " 1912"]

iex> Gibran.Tokeniser.tokenise("Kingdom of the Imagination", exclude: &(String.length(&1) < 10))
["imagination"]

iex> Gibran.Tokeniser.tokenise("Sand and Foam", exclude: ~r/and/)
["foam"]

iex> Gibran.Tokeniser.tokenise("Eye of The Prophet", exclude: "eye of")
["the", "prophet"]

iex> Gibran.Tokeniser.tokenise("Eye of The Prophet", exclude: ["eye", "of"])
["the", "prophet"]
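
The two options can be combined: :pattern splits the input first, and :exclude then filters the resulting tokens. The following is an illustrative sketch inferred from the examples above, not one of the library's own doctests:

iex> Gibran.Tokeniser.tokenise("Sand and Foam, 1926", pattern: ~r/,\s*/, exclude: ~r/\d/)
["sand and foam"]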