Tokenizers.Tokenizer (Tokenizers v0.1.2)

The struct and associated functions for a tokenizer.

A Tokenizers.Tokenizer.t() is a container that holds the constituent parts of the tokenization pipeline.

When you call Tokenizers.Tokenizer.encode/3, the input text goes through the following pipeline:

  • normalization
  • pre-tokenization
  • model
  • post-processing

This returns a Tokenizers.Encoding.t(), which can then give you the token ids for each token in the input text. These token ids are usually used as the input for natural language processing machine learning models.
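The pipeline above can be sketched end to end as follows (a minimal sketch, assuming the `tokenizers` package is installed, a network connection is available, and that the `"bert-base-cased"` tokenizer exists on the Hugging Face Hub):

```elixir
# Load a pretrained tokenizer from the Hugging Face Hub.
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Encode a single sequence; the text runs through normalization,
# pre-tokenization, the model, and post-processing.
{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")

# The resulting Tokenizers.Encoding.t() exposes the token ids,
# typically fed as input to a downstream NLP model.
ids = Tokenizers.Encoding.get_ids(encoding)
```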

Summary

Types

encode_input()

An input subject to tokenization.

t()

Functions

decode(tokenizer, ids, opts \\ [])

Decode the given list of ids or list of lists of ids back to strings.

encode(tokenizer, input, opts \\ [])

Encode the given sequence or batch of sequences to a Tokenizers.Encoding.t().

from_file(path)

Instantiate a new tokenizer from the file at the given path.

from_pretrained(identifier)

Instantiate a new tokenizer from an existing file on the Hugging Face Hub.

get_model(tokenizer)

Get the tokenizer's model.

get_vocab(tokenizer)

Get the tokenizer's vocabulary as a map of token to id.

get_vocab_size(tokenizer)

Get the number of tokens in the vocabulary.

id_to_token(tokenizer, id)

Convert a given id to its token.

save(tokenizer, path)

Save the tokenizer to the provided path.

token_to_id(tokenizer, token)

Convert a given token to its id.

Types

Specs

encode_input() :: String.t() | {String.t(), String.t()}

An input subject to tokenization.

Can be either a single sequence or a pair of sequences.

Specs

t() :: %Tokenizers.Tokenizer{reference: reference(), resource: binary()}

Functions

decode(tokenizer, ids, opts \\ [])

Specs

decode(Tokenizer.t(), non_neg_integer() | [non_neg_integer()], Keyword.t()) ::
  {:ok, String.t() | [String.t()]} | {:error, term()}

Decode the given list of ids or list of lists of ids back to strings.

Options

  • :skip_special_tokens - whether the special tokens should be removed from the decoded string. Defaults to true.
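For example (a sketch assuming the `"bert-base-cased"` tokenizer is available on the Hugging Face Hub; the exact ids depend on the vocabulary):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")
ids = Tokenizers.Encoding.get_ids(encoding)

# With the default skip_special_tokens: true, markers such as
# [CLS] and [SEP] are dropped from the decoded string.
{:ok, text} = Tokenizers.Tokenizer.decode(tokenizer, ids)

# Pass skip_special_tokens: false to keep them.
{:ok, with_special} =
  Tokenizers.Tokenizer.decode(tokenizer, ids, skip_special_tokens: false)
```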

encode(tokenizer, input, opts \\ [])

Specs

encode(Tokenizer.t(), encode_input() | [encode_input()], Keyword.t()) ::
  {:ok, Encoding.t() | [Encoding.t()]} | {:error, term()}

Encode the given sequence or batch of sequences to a Tokenizers.Encoding.t().

Options

  • :add_special_tokens - whether to add special tokens to the encoding. Defaults to true.
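The three input shapes can be sketched as follows (assuming the `"bert-base-cased"` tokenizer is available on the Hugging Face Hub):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# A pair of sequences counts as a single encode_input().
{:ok, pair} = Tokenizers.Tokenizer.encode(tokenizer, {"Question?", "Answer."})

# A list of inputs is treated as a batch and yields a list of encodings.
{:ok, batch} = Tokenizers.Tokenizer.encode(tokenizer, ["First.", "Second."])

# Disable special tokens (e.g. [CLS]/[SEP] for BERT-style models).
{:ok, plain} =
  Tokenizers.Tokenizer.encode(tokenizer, "Hello", add_special_tokens: false)
```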

from_file(path)

Specs

from_file(String.t()) :: {:ok, Tokenizer.t()} | {:error, term()}

Instantiate a new tokenizer from the file at the given path.

from_pretrained(identifier)

Specs

from_pretrained(String.t()) :: {:ok, Tokenizer.t()} | {:error, term()}

Instantiate a new tokenizer from an existing file on the Hugging Face Hub.

get_model(tokenizer)

Specs

get_model(Tokenizer.t()) :: Tokenizers.Model.t()

Get the tokenizer's model.

get_vocab(tokenizer)

Specs

get_vocab(Tokenizer.t()) :: %{required(binary()) => integer()}

Get the tokenizer's vocabulary as a map of token to id.

get_vocab_size(tokenizer)

Specs

get_vocab_size(Tokenizer.t()) :: non_neg_integer()

Get the number of tokens in the vocabulary.
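A short sketch of inspecting the vocabulary (assuming the `"bert-base-cased"` tokenizer is available on the Hugging Face Hub; the token `"hello"` is only an illustrative lookup key):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Map of token string to integer id.
vocab = Tokenizers.Tokenizer.get_vocab(tokenizer)

# Look up the id of a token, if it is present in the vocabulary.
id = Map.get(vocab, "hello")

# Total number of tokens known to the tokenizer.
size = Tokenizers.Tokenizer.get_vocab_size(tokenizer)
```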

id_to_token(tokenizer, id)

Specs

id_to_token(Tokenizer.t(), integer()) :: String.t()

Convert a given id to its token.

save(tokenizer, path)

Specs

save(Tokenizer.t(), String.t()) :: {:ok, String.t()} | {:error, term()}

Save the tokenizer to the provided path.
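Per the spec, a successful save returns the path in an `:ok` tuple, so a save/reload round trip can be sketched as (assuming the `"bert-base-cased"` tokenizer is available and `/tmp` is writable):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Persist the tokenizer to a single file at the given path...
{:ok, path} = Tokenizers.Tokenizer.save(tokenizer, "/tmp/tokenizer.json")

# ...and load it back later with from_file/1.
{:ok, reloaded} = Tokenizers.Tokenizer.from_file(path)
```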

token_to_id(tokenizer, token)

Specs

token_to_id(Tokenizer.t(), binary()) :: non_neg_integer()

Convert a given token to its id.
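Together with id_to_token/2 this gives a round trip between tokens and ids (a sketch assuming the `"bert-base-cased"` tokenizer is available; the concrete id depends on the vocabulary):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Convert a token to its id, then back again.
id = Tokenizers.Tokenizer.token_to_id(tokenizer, "hello")
token = Tokenizers.Tokenizer.id_to_token(tokenizer, id)
```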