Penelope.NLP.Tokenize.BertTokenizer (Penelope v0.5.0)

This is a BERT-compatible wordpiece tokenizer/vectorizer. It encodes a text string into an integer vector of ids drawn from a wordpiece vocabulary, and the encoded result can be decoded back to the original text or a substring of it.

The initial tokenization is performed by splitting on whitespace. These tokens are then further split on punctuation and pieced to find the longest matching wordpieces in the vocabulary. Indexes into the original whitespace tokenization are maintained, so that the vectorization can be inverted without losing anything except non-space whitespace.

For background, see the BERT paper: https://arxiv.org/abs/1810.04805
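
The piecing step described above can be sketched as greedy longest-match splitting of a single whitespace token. The WordpieceSketch module below is a hypothetical illustration, not this module's implementation; its names and vocabulary contents are assumptions chosen only to show the general technique.

defmodule WordpieceSketch do
  @moduledoc "Hypothetical sketch of greedy longest-match wordpiece splitting."

  @unknown_token "[UNK]"
  @piece_prefix "##"

  # Split a single whitespace token into the longest matching vocabulary
  # pieces; if any remainder cannot be matched, the whole token maps to
  # the unknown token.
  def piece(token, vocab) do
    case do_piece(token, vocab, "") do
      :error -> [@unknown_token]
      pieces -> pieces
    end
  end

  defp do_piece("", _vocab, _prefix), do: []

  defp do_piece(token, vocab, prefix) do
    case longest_match(token, vocab, prefix, String.length(token)) do
      0 ->
        :error

      len ->
        {matched, rest} = String.split_at(token, len)

        case do_piece(rest, vocab, @piece_prefix) do
          :error -> :error
          pieces -> [prefix <> matched | pieces]
        end
    end
  end

  # Try the longest prefix first and back off one character at a time.
  defp longest_match(_token, _vocab, _prefix, 0), do: 0

  defp longest_match(token, vocab, prefix, len) do
    if Map.has_key?(vocab, prefix <> String.slice(token, 0, len)) do
      len
    else
      longest_match(token, vocab, prefix, len - 1)
    end
  end
end

# WordpieceSketch.piece("friendly", %{"friend" => 4, "##ly" => 5})
# => ["friend", "##ly"]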

Summary

Functions

Detokenizes a (possibly sub-)sequence of an encoded string.

Tokenizes and vectorizes a string.

Functions

decode(arg)
decode({[String.t()], [integer()], [integer()]}) :: String.t()

Detokenizes a (possibly sub-)sequence of an encoded string.
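
For illustration only, a round trip might look like the following; decode/1 accepts the tuple produced by encode/3 (documented below), and the toy vocabulary and the commented result are assumptions, not verified output of this module:

alias Penelope.NLP.Tokenize.BertTokenizer

# toy wordpiece vocabulary: piece string => integer id (assumed contents)
vocab = %{"[UNK]" => 0, "hello" => 1, "friend" => 2, "##ly" => 3}

"hello friendly"
|> BertTokenizer.encode(vocab)
|> BertTokenizer.decode()
# expected to return a string close to the input, e.g. "hello friendly"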

encode(text, vocab, options \\ [])
encode(
  text :: String.t(),
  vocab :: %{required(String.t()) => integer()},
  options :: keyword()
) :: {[String.t()], [integer()], [integer()]}

Tokenizes and vectorizes a string.

The following options are supported:

key              description                                   default
lowercase        downcase during vectorization?                true
split_regex      regex used to tokenize the text               ~r/[\s]/u
strip_regex      regex used to remove invalid characters       ~r/[\p{Mn}\p{C}\x{0000}\x{FFFD}]/u
punct_regex      regex used to split pieces on punctuation     ~r/[\p{P}$+<=>^`|~]/u
special_tokens   list of special tokens not to piece           ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
unknown_token    key used for out-of-vocabulary tokens         "[UNK]"
piece_prefix     prefix used to indicate subsequent pieces     "##"
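
As a usage sketch, encoding against a toy vocabulary might look like the following. The vocabulary contents and the commented list values are assumptions for illustration; only the call shape follows the spec above.

alias Penelope.NLP.Tokenize.BertTokenizer

# toy wordpiece vocabulary: piece string => integer id (assumed contents)
vocab = %{
  "[UNK]"  => 0,
  "hello"  => 1,
  "friend" => 2,
  "##ly"   => 3
}

{strings, ids, indexes} = BertTokenizer.encode("Hello friendly", vocab, lowercase: true)

# with this vocabulary the pieces would be "hello", "friend" and "##ly", so
# `ids` might be [1, 2, 3], with `indexes` relating each piece back to its
# whitespace token (e.g. [0, 1, 1]); the exact contents of each list are
# assumptions here, see the spec above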