Tokenizers.Tokenizer (Tokenizers v0.1.1)
The struct and associated functions for a tokenizer.
A Tokenizers.Tokenizer.t() is a container that holds the constituent parts of the tokenization pipeline.
When you call Tokenizers.Tokenizer.encode/3, the input text goes through the following pipeline:
- normalization
- pre-tokenization
- model
- post-processing
This returns a Tokenizers.Encoding.t(), which gives you the token id for each token in the input text. These token ids are typically used as the input to natural language processing machine learning models.
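As an end-to-end sketch of that pipeline (the "bert-base-cased" checkpoint name and the Tokenizers.Encoding.get_ids/1 helper are illustrative assumptions, not part of this module):

```elixir
# Load a pretrained tokenizer from the Hugging Face Hub
# (the checkpoint name is illustrative).
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Normalization, pre-tokenization, the model, and post-processing
# all run inside encode/3.
{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")

# Extract the token ids from the resulting Encoding.t().
ids = Tokenizers.Encoding.get_ids(encoding)

# The ids round-trip back to text via decode/3.
{:ok, text} = Tokenizers.Tokenizer.decode(tokenizer, ids)
```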
Summary
Functions
Decode the given list of ids or list of lists of ids back to strings.
Encode the given sequence or batch of sequences to a Tokenizers.Encoding.t().
Instantiate a new tokenizer from the file at the given path.
Instantiate a new tokenizer from an existing file on the Hugging Face Hub.
Get the Tokenizer's Model.
Get the tokenizer's vocabulary as a map of token to id.
Get the number of tokens in the vocabulary.
Convert a given id to its token.
Save the tokenizer to the provided path.
Convert a given token to its id.
Types
Specs
An input subject to tokenization: either a single sequence or a pair of sequences.
Functions
Specs
decode(Tokenizer.t(), non_neg_integer() | [non_neg_integer()], Keyword.t()) :: {:ok, String.t() | [String.t()]} | {:error, term()}
Decode the given list of ids or list of lists of ids back to strings.
Options
- :skip_special_tokens - whether the special tokens should be removed from the decoded string. Defaults to true.
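A sketch of decoding (the checkpoint name and the ids shown are illustrative assumptions):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Decode a list of ids; special tokens are stripped by default.
{:ok, text} = Tokenizers.Tokenizer.decode(tokenizer, [101, 8667, 102])

# Keep special tokens such as [CLS] and [SEP] in the output.
{:ok, raw} =
  Tokenizers.Tokenizer.decode(tokenizer, [101, 8667, 102],
    skip_special_tokens: false
  )
```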
Specs
encode(Tokenizer.t(), encode_input() | [encode_input()], Keyword.t()) :: {:ok, Encoding.t() | [Encoding.t()]} | {:error, term()}
Encode the given sequence or batch of sequences to a Tokenizers.Encoding.t().
Options
- :add_special_tokens - whether to add special tokens to the encoding. Defaults to true.
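A sketch of the three input shapes accepted via encode_input() (the checkpoint name is an illustrative assumption):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Encode a single sequence.
{:ok, encoding} = Tokenizers.Tokenizer.encode(tokenizer, "Hello there!")

# Encode a pair of sequences (see the encode_input() type).
{:ok, pair} = Tokenizers.Tokenizer.encode(tokenizer, {"Question?", "Answer."})

# Encode a batch; each element yields its own Encoding.t().
{:ok, encodings} =
  Tokenizers.Tokenizer.encode(tokenizer, ["First.", "Second."],
    add_special_tokens: false
  )
```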
Specs
Instantiate a new tokenizer from the file at the given path.
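For example (the file path is an illustrative assumption; the file is the JSON serialization produced by save/2 or by the Hugging Face tokenizers library):

```elixir
# Load a tokenizer serialized as a tokenizer.json file.
{:ok, tokenizer} = Tokenizers.Tokenizer.from_file("./tokenizer.json")
```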
Specs
Instantiate a new tokenizer from an existing file on the Hugging Face Hub.
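For example (the repository name is an illustrative assumption):

```elixir
# Download the tokenizer definition for a Hugging Face Hub repository.
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")
```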
Specs
get_model(Tokenizer.t()) :: Tokenizers.Model.t()
Get the Tokenizer's Model.
Specs
Get the tokenizer's vocabulary as a map of token to id.
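For example (the checkpoint name and the looked-up token are illustrative assumptions):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# A plain map of token (binary) to id (non-negative integer).
vocab = Tokenizers.Tokenizer.get_vocab(tokenizer)
id = Map.get(vocab, "hello")
```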
Specs
get_vocab_size(Tokenizer.t()) :: non_neg_integer()
Get the number of tokens in the vocabulary.
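For example (the checkpoint name is an illustrative assumption):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Total number of tokens in the vocabulary.
size = Tokenizers.Tokenizer.get_vocab_size(tokenizer)
```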
Specs
Convert a given id to its token.
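For example (the checkpoint name and the id are illustrative assumptions):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Look up the surface form for a single id.
token = Tokenizers.Tokenizer.id_to_token(tokenizer, 101)
```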
Specs
Save the tokenizer to the provided path.
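For example (the destination path is an illustrative assumption):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Serialize the whole pipeline to a single JSON file;
# it can be loaded back with from_file/1.
Tokenizers.Tokenizer.save(tokenizer, "./tokenizer.json")
```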
Specs
token_to_id(Tokenizer.t(), binary()) :: non_neg_integer()
Convert a given token to its id.
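For example (the checkpoint name and the token are illustrative assumptions):

```elixir
{:ok, tokenizer} = Tokenizers.Tokenizer.from_pretrained("bert-base-cased")

# Look up the id for a token string.
id = Tokenizers.Tokenizer.token_to_id(tokenizer, "hello")
```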