Tokenizers.Encoding (Tokenizers v0.1.1)

The struct and associated functions for an encoding, the output of a tokenizer.

Use these functions to retrieve the inputs needed for a natural language processing machine learning model.
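
As a rough sketch of how these functions fit together, the snippet below encodes a sentence and pulls out the lists a model typically consumes. It assumes `Tokenizers.Tokenizer.from_pretrained/1` and `Tokenizers.Tokenizer.encode/2` behave as shown in the project README (both returning `{:ok, value}` tuples); neither is documented in this module. The model name is only an example, and fetching it needs network access.

```elixir
alias Tokenizers.{Encoding, Tokenizer}

# Assumption: from_pretrained/1 and encode/2 come from Tokenizers.Tokenizer
# and return {:ok, value}; they are not part of this module.
{:ok, tokenizer} = Tokenizer.from_pretrained("bert-base-cased")
{:ok, encoding} = Tokenizer.encode(tokenizer, "Hello there!")

# Each getter returns a plain list with one entry per token.
ids = Encoding.get_ids(encoding)
attention_mask = Encoding.get_attention_mask(encoding)
type_ids = Encoding.get_type_ids(encoding)

# n_tokens/1 should agree with the length of any of those lists.
true = Encoding.n_tokens(encoding) == length(ids)
true = length(attention_mask) == length(type_ids)
```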

Summary

Functions

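get_attention_mask(encoding)

Get the attention mask from an encoding.

get_ids(encoding)

Get the ids from an encoding.

get_offsets(encoding)

Get offsets from an encoding.

get_special_tokens_mask(encoding)

Get special tokens mask from an encoding.

get_tokens(encoding)

Get the tokens from an encoding.

get_type_ids(encoding)

Get token type ids from an encoding.

n_tokens(encoding)

Returns the number of tokens in an Encoding.t().

pad(encoding, length, opts \\ [])

Pad the encoding to the given length.

truncate(encoding, max_len, opts \\ [])

Truncate the encoding to the given length.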

Types

Specs

t() :: %Tokenizers.Encoding{reference: reference(), resource: binary()}

Functions

get_attention_mask(encoding)

Specs

get_attention_mask(Encoding.t()) :: [integer()]

Get the attention mask from an encoding: one integer per token, 1 for tokens the model should attend to and 0 for padding.

get_ids(encoding)

Specs

get_ids(Encoding.t()) :: [integer()]

Get the ids from an encoding.

get_offsets(encoding)

Specs

get_offsets(Encoding.t()) :: [{integer(), integer()}]

Get offsets from an encoding: one {start, end} tuple per token, locating that token in the original input.
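
To illustrate how offsets can be used, here is a minimal sketch that maps each `{start, end}` pair back to a slice of the input. The module name `OffsetsSketch` and the offsets in the usage line are hand-written for illustration, not library output. The sketch assumes the offsets are byte positions with an exclusive end; for ASCII input, byte and character positions coincide.

```elixir
defmodule OffsetsSketch do
  # Hypothetical helper, not part of Tokenizers.
  # Returns the slice of `text` covered by each {start, stop} offset.
  # Assumption: offsets are byte positions and `stop` is exclusive.
  def spans(text, offsets) do
    Enum.map(offsets, fn {start, stop} ->
      binary_part(text, start, stop - start)
    end)
  end
end

# Illustrative offsets: a {0, 0} pair yields an empty slice, which is how
# tokens with no counterpart in the input would show up here.
OffsetsSketch.spans("Hello there!", [{0, 0}, {0, 5}, {6, 11}, {11, 12}, {0, 0}])
```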

get_special_tokens_mask(encoding)

Specs

get_special_tokens_mask(Encoding.t()) :: [integer()]

Get the special tokens mask from an encoding: one integer per token, 1 for special tokens (such as [CLS] or [SEP]) and 0 for regular tokens.
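
One common use of this mask is to drop special tokens before showing tokens to a user. The sketch below does that on plain lists; `SpecialTokensSketch` is a hypothetical helper, and the tokens and mask in the usage line are illustrative values rather than real tokenizer output.

```elixir
defmodule SpecialTokensSketch do
  # Hypothetical helper, not part of Tokenizers.
  # Keeps only the tokens whose mask entry is 0 (i.e. not special).
  def drop_special(tokens, special_tokens_mask) do
    tokens
    |> Enum.zip(special_tokens_mask)
    |> Enum.reject(fn {_token, flag} -> flag == 1 end)
    |> Enum.map(fn {token, _flag} -> token end)
  end
end

# Illustrative input, shaped like get_tokens/1 and get_special_tokens_mask/1 results.
SpecialTokensSketch.drop_special(["[CLS]", "Hello", "there", "!", "[SEP]"], [1, 0, 0, 0, 1])
```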

get_tokens(encoding)

Specs

get_tokens(Encoding.t()) :: [binary()]

Get the tokens from an encoding.

get_type_ids(encoding)

Specs

get_type_ids(Encoding.t()) :: [integer()]

Get token type ids from an encoding: one integer per token, identifying which input sequence (segment) the token belongs to.

n_tokens(encoding)

Specs

n_tokens(encoding :: Encoding.t()) :: non_neg_integer()

Returns the number of tokens in an Encoding.t().

pad(encoding, length, opts \\ [])

Specs

pad(encoding :: Encoding.t(), length :: pos_integer(), opts :: Keyword.t()) ::
  Encoding.t()

Pad the encoding to the given length.

Options

  • direction - The padding direction. Can be :right or :left. Default: :right.
  • pad_id - The id corresponding to the padding token. Default: 0.
  • pad_token - The padding token to use. Default: "[PAD]".
  • pad_type_id - The type ID corresponding to the padding token. Default: 0.
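
The effect of the direction and pad_id options can be sketched on plain lists. `PadSketch` below is a hypothetical stand-in, not the library implementation. It assumes, as in the upstream Hugging Face tokenizers library, that padded positions get an attention mask of 0 and that an encoding already at or beyond the target length is left unchanged. The ids in the usage line are made up.

```elixir
defmodule PadSketch do
  # Hypothetical model of padding on plain lists, not part of Tokenizers.
  # Assumption: padding positions receive attention mask 0, and input that
  # is already long enough is returned unchanged.
  def pad(ids, target_length, opts \\ []) do
    direction = Keyword.get(opts, :direction, :right)
    pad_id = Keyword.get(opts, :pad_id, 0)

    missing = max(target_length - length(ids), 0)
    padding = List.duplicate(pad_id, missing)
    real = List.duplicate(1, length(ids))
    masked = List.duplicate(0, missing)

    case direction do
      :right -> %{ids: ids ++ padding, attention_mask: real ++ masked}
      :left -> %{ids: padding ++ ids, attention_mask: masked ++ real}
    end
  end
end

# Illustrative ids padded on the left to length 5.
PadSketch.pad([7, 8, 9], 5, direction: :left)
```

With a real encoding, the corresponding call would take the form pad(encoding, 5, direction: :left), per the spec above.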

truncate(encoding, max_len, opts \\ [])

Specs

truncate(encoding :: Encoding.t(), length :: integer(), opts :: Keyword.t()) ::
  Encoding.t()

Truncate the encoding to the given length.

Options

  • direction - The truncation direction. Can be :right or :left. Default: :right.
  • stride - The length of previous content to be included in each overflowing piece. Default: 0.
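
To make the stride option concrete, the sketch below splits a plain list of ids into the piece that right-truncation would keep, followed by the overflowing pieces. `TruncateSketch` is hypothetical and only models the :right direction. It assumes the windowing matches the upstream Hugging Face tokenizers library, where each overflowing piece starts `max_len - stride` positions after the previous one, so consecutive pieces share `stride` ids.

```elixir
defmodule TruncateSketch do
  # Hypothetical model of right-truncation with stride on plain lists,
  # not part of Tokenizers. The first piece is what the truncated encoding
  # would keep; the remaining pieces are the overflow.
  def pieces(ids, max_len, stride \\ 0)
      when max_len > 0 and stride >= 0 and stride < max_len do
    do_pieces(ids, max_len, max_len - stride, [])
  end

  defp do_pieces(ids, max_len, step, acc) do
    piece = Enum.take(ids, max_len)

    if length(ids) <= max_len do
      # The remaining ids fit in one piece, so this is the last one.
      Enum.reverse([piece | acc])
    else
      # Advance by `step`, re-including the last `stride` ids of this piece.
      do_pieces(Enum.drop(ids, step), max_len, step, [piece | acc])
    end
  end
end

# Seven illustrative ids, at most 4 per piece, 2 ids of overlap.
TruncateSketch.pieces([1, 2, 3, 4, 5, 6, 7], 4, 2)
```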