Penelope.NLP.Tokenize.BertTokenizer
This is a BERT-compatible wordpiece tokenizer/vectorizer. It encodes a text string into a vector of integer IDs drawn from a wordpiece vocabulary, and the encoded result can be converted back to the original text or a substring of it.
The initial tokenization splits on whitespace. Each whitespace token is then split on punctuation and greedily pieced into the longest matching wordpieces in the vocabulary. Indexes into the original whitespace tokenization are maintained, so the vectorization can be inverted without losing anything except non-space whitespace.
https://arxiv.org/abs/1810.04805
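To make the piecing step concrete, here is a minimal sketch of greedy longest-match wordpiece splitting. It is illustrative only: the WordpieceSketch module, the wordpieces/2 function, and the MapSet vocabulary representation are assumptions made for the example, not this module's actual internals.

```elixir
defmodule WordpieceSketch do
  # Illustrative greedy longest-match piecing; not the library's implementation.

  @piece_prefix "##"
  @unknown_token "[UNK]"

  # Splits one whitespace token into the longest matching wordpieces,
  # given a MapSet of known pieces (e.g. MapSet.new(["play", "##ing"])).
  def wordpieces(token, vocab), do: do_pieces(token, vocab, true, [])

  defp do_pieces("", _vocab, _first?, acc), do: Enum.reverse(acc)

  defp do_pieces(rest, vocab, first?, acc) do
    # Non-initial pieces are looked up with the piece prefix attached.
    prefix = if first?, do: "", else: @piece_prefix

    # Try the longest candidate first, shrinking one grapheme at a time.
    match =
      String.length(rest)..1//-1
      |> Enum.find_value(fn len ->
        piece = prefix <> String.slice(rest, 0, len)
        if MapSet.member?(vocab, piece), do: {piece, len}
      end)

    case match do
      {piece, len} ->
        {_matched, remainder} = String.split_at(rest, len)
        do_pieces(remainder, vocab, false, [piece | acc])

      nil ->
        # No piece matches at this position: the whole token maps to [UNK].
        [@unknown_token]
    end
  end
end
```

With a vocabulary containing "play" and "##ing", `WordpieceSketch.wordpieces("playing", MapSet.new(["play", "##ing"]))` returns `["play", "##ing"]`; a token with no matching pieces collapses to `["[UNK]"]`.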
Functions
Detokenizes a (possibly sub-)sequence of an encoded string.
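For a rough picture of how the maintained indexes let a span of pieces be inverted, here is an illustrative sketch; the argument shapes (a list of source-token indexes and a piece range) are assumptions for the example and do not reflect this function's actual signature.

```elixir
defmodule DetokenizeSketch do
  # Illustrative only; the library's detokenize has its own argument shapes.
  #
  # tokens  - original whitespace tokens, e.g. ["Hello,", "world!"]
  # indexes - for each encoded piece, the index of its source token
  # range   - span of pieces to recover, e.g. 2..3
  def detokenize(tokens, indexes, range) do
    indexes
    |> Enum.slice(range)
    |> Enum.uniq()
    |> Enum.map(&Enum.at(tokens, &1))
    |> Enum.join(" ")
  end
end
```

Rejoining the covered whitespace tokens with single spaces is what makes the inversion lossless except for non-space whitespace.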
Tokenizes and vectorizes a string.
The following options are supported (a usage sketch follows the table):
key | description | default |
---|---|---|
lowercase | downcase the text during vectorization? | true |
split_regex | regex used to tokenize the text | ~r/[\s]/u |
strip_regex | regex used to remove invalid characters | ~r/[\p{Mn}\p{C}\x{0000}\x{FFFD}]/u |
punct_regex | regex used to split pieces on punctuation | ~r/[\p{P}$+<=>^`|~]/u |
special_tokens | special tokens that are never pieced | ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] |
unknown_token | token used to represent out-of-vocabulary words | "[UNK]" |
piece_prefix | prefix marking non-initial wordpieces | "##" |
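As a rough illustration of how the default regexes above behave, the snippet below uses only the standard Regex, Enum, and String modules; it is not this module's internal pipeline, and the variable names are just for the example.

```elixir
split_regex = ~r/[\s]/u
strip_regex = ~r/[\p{Mn}\p{C}\x{0000}\x{FFFD}]/u
punct_regex = ~r/[\p{P}$+<=>^`|~]/u

text = "Hello, world!"

# 1. Remove invalid characters (combining marks, control characters, etc.).
clean = Regex.replace(strip_regex, text, "")

# 2. Split on whitespace to get the initial tokens.
tokens = Regex.split(split_regex, clean, trim: true)
# => ["Hello,", "world!"]

# 3. Split each token on punctuation, keeping the punctuation as tokens.
pieces =
  Enum.flat_map(tokens, fn token ->
    Regex.split(punct_regex, token, include_captures: true, trim: true)
  end)
# => ["Hello", ",", "world", "!"]

# 4. With the default lowercase: true, text is downcased before piecing.
Enum.map(pieces, &String.downcase/1)
# => ["hello", ",", "world", "!"]
```

Any of these defaults can be overridden by passing the corresponding key in the options list.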