Tokenizers.Model.BPE (Tokenizers v0.5.1)

Summary

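Types

options()
Options for model initialisation.

Functions

empty()
Instantiate an empty BPE model.

from_file(vocab_path, merges_path, options \\ [])
Instantiate a BPE model from the given vocab and merges files.

init(vocab, merges, options \\ [])
Instantiate a BPE model from the given vocab and merges.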

Types

@type options() :: [
  cache_capacity: number(),
  dropout: float(),
  unk_token: String.t(),
  continuing_subword_prefix: String.t(),
  end_of_word_suffix: String.t(),
  fuse_unk: boolean(),
  byte_fallback: boolean()
]

Options for model initialisation.

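  • :byte_fallback - Whether to use the byte fallback trick: characters missing from the vocabulary are encoded as byte tokens (such as <0x61>) when those tokens exist in the vocabulary, instead of as the unknown token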

  • :cache_capacity - The number of words that the BPE cache can contain. The cache speeds up tokenization by keeping the results of the merge operations for a number of words. Defaults to 10_000

  • :dropout - The BPE dropout to use. Must be a float between 0 and 1

  • :unk_token - The unknown token to be used by the model

  • :fuse_unk - Whether to fuse consecutive unknown tokens into a single one

  • :continuing_subword_prefix - The prefix to attach to subword units that don't represent the beginning of a word

  • :end_of_word_suffix - The suffix to attach to subword units that represent the end of a word
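
As a rough illustration of how these options might be passed, the sketch below builds a keyword list and hands it to init/3. The vocab and merges are hypothetical toy data, and the script assumes the package is fetched with Mix.install/1. It also assumes that every merge result ("ab" here) must appear in the vocab.

```elixir
# Fetch the library when running this as a standalone script.
Mix.install([{:tokenizers, "~> 0.5.1"}])

alias Tokenizers.Model.BPE

# Hypothetical toy data: the merged token "ab" is listed in the vocab too.
vocab = %{"<unk>" => 0, "a" => 1, "b" => 2, "ab" => 3}
merges = [{"a", "b"}]

# Options are a plain keyword list; any key may be omitted.
options = [
  # Keep merge results for up to 1,000 words.
  cache_capacity: 1_000,
  # BPE dropout, a float between 0 and 1.
  dropout: 0.1,
  # The unknown token, also listed in the vocab above.
  unk_token: "<unk>",
  fuse_unk: true
]

{:ok, model} = BPE.init(vocab, merges, options)
```

Omitted keys fall back to the library defaults, so passing only unk_token: "<unk>" is often enough for a first experiment.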

Functions

empty()

@spec empty() :: {:ok, Tokenizers.Model.t()}

Instantiate an empty BPE model.

from_file(vocab_path, merges_path, options \\ [])

@spec from_file(String.t(), String.t(), options()) :: {:ok, Tokenizers.Model.t()}

Instantiate a BPE model from the given vocab and merges files.
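
The file formats are not described here. The sketch below assumes the Hugging Face tokenizers convention: the vocab file is a JSON object mapping tokens to ids, and the merges file holds one space-separated pair per line, optionally preceded by a "#version" header. The file names and contents are hypothetical toy data written to a temporary directory.

```elixir
# Fetch the library when running this as a standalone script.
Mix.install([{:tokenizers, "~> 0.5.1"}])

alias Tokenizers.Model.BPE

dir = System.tmp_dir!()
vocab_path = Path.join(dir, "bpe_vocab.json")
merges_path = Path.join(dir, "bpe_merges.txt")

# Assumed vocab format: a JSON object of token => id.
File.write!(vocab_path, ~s({"<unk>": 0, "a": 1, "b": 2, "ab": 3}))

# Assumed merges format: one "left right" pair per line.
File.write!(merges_path, "#version: 0.2\na b\n")

{:ok, model} = BPE.from_file(vocab_path, merges_path, unk_token: "<unk>")
```

In practice both paths would point at files produced by a trained tokenizer rather than at hand-written ones.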

init(vocab, merges, options \\ [])

@spec init(
  %{required(String.t()) => integer()},
  [{String.t(), String.t()}],
  options()
) :: {:ok, Tokenizers.Model.t()}

Instantiate a BPE model from the given vocab and merges.
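
A minimal sketch of building a model in memory and using it to tokenize follows. The vocab and merges are hypothetical toy data. Tokenizers.Tokenizer.init/1, Tokenizers.Tokenizer.encode/2, and Tokenizers.Encoding.get_tokens/1 belong to other modules of the same library and are not documented on this page, so treat their use here as an assumption about the wider API.

```elixir
# Fetch the library when running this as a standalone script.
Mix.install([{:tokenizers, "~> 0.5.1"}])

alias Tokenizers.{Encoding, Tokenizer}
alias Tokenizers.Model.BPE

# Vocab maps each token to its integer id. The merged token "ab"
# is listed too, on the assumption that merge results must be known.
vocab = %{"<unk>" => 0, "a" => 1, "b" => 2, "ab" => 3}

# Merges are {left, right} pairs, applied in order of priority.
merges = [{"a", "b"}]

{:ok, model} = BPE.init(vocab, merges, unk_token: "<unk>")

# Wrap the model in a tokenizer and encode a short input.
{:ok, tokenizer} = Tokenizer.init(model)
{:ok, encoding} = Tokenizer.encode(tokenizer, "ab")

IO.inspect(Encoding.get_tokens(encoding), label: "tokens")
```

Keeping the vocab and merges as ordinary Elixir terms is convenient for tests and small experiments; for real vocabularies, from_file/3 avoids building large maps by hand.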