Text.Language.Classifier.Fasttext.Tokenizer (Text v0.6.0)


Whitespace tokenizer matching fastText's Dictionary::readWord.

fastText splits on a fixed byte set: space, newline, carriage return, tab, vertical tab, form feed, and NUL. Any maximal run of non-whitespace bytes between whitespace bytes (or string boundaries) is one token. The rules are byte-level, not codepoint-level, so this implementation is not Unicode-aware: Unicode whitespace outside that set, such as U+00A0 (no-break space) or U+3000 (ideographic space), does not separate tokens. This intentionally mirrors the reference.
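The splitting rule described above can be sketched in a few lines of Elixir. This is an illustration only, not the library's actual implementation: the module name TokenizerSketch is hypothetical, and the sketch assumes that splitting on the seven separator bytes and discarding empty parts is equivalent to collecting maximal runs of non-separator bytes.

```elixir
defmodule TokenizerSketch do
  # Hypothetical module for illustration; not part of the Text library.

  # The seven separator bytes: space, newline, carriage return, tab,
  # vertical tab, form feed, and NUL.
  @separators [" ", "\n", "\r", "\t", "\v", "\f", <<0>>]

  @spec tokenize(binary()) :: [binary()]
  def tokenize(text) when is_binary(text) do
    # :global splits at every separator occurrence; :trim_all drops the empty
    # parts produced by leading, trailing, or consecutive separators.
    :binary.split(text, @separators, [:global, :trim_all])
  end
end
```

Because :binary.split/3 works on raw bytes, the sketch never decodes codepoints, which is the property the byte-level rule depends on.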

At inference time the official Python wrapper replaces newlines with spaces before calling into C++ (predict in python/fasttext_module/fasttext/FastText.py). This tokenizer reaches the same result without that preprocessing step: by default it treats \n as ordinary whitespace and never emits the special </s> EOS token that the training-time readWord would produce. Callers who want strict Python parity can still pre-strip newlines themselves.

See src/dictionary.cc (Dictionary::readWord, Dictionary::readWordNoNewline).

Functions

tokenize(text)

@spec tokenize(binary()) :: [binary()]

Splits a binary into tokens using fastText's whitespace rules.

Arguments

  • text is a UTF-8 binary or arbitrary byte sequence. Whitespace is matched at the byte level: any byte in [\s, \n, \r, \t, \v, \f, \0] (0x20, 0x0A, 0x0D, 0x09, 0x0B, 0x0C, 0x00) is a separator, regardless of the surrounding bytes.
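One consequence of byte-level matching is that Unicode whitespace encoded as multi-byte UTF-8 does not act as a separator, because none of its encoded bytes fall in the separator set. The following sketch checks this for a given codepoint; SeparatorCheck is a hypothetical helper written for illustration and is not part of the library.

```elixir
defmodule SeparatorCheck do
  # Hypothetical helper for illustration; not part of the Text library.

  # Byte values of space, newline, carriage return, tab, vertical tab,
  # form feed, and NUL.
  @separators [0x20, 0x0A, 0x0D, 0x09, 0x0B, 0x0C, 0x00]

  # Returns true if the UTF-8 encoding of `codepoint` contains a separator byte.
  @spec contains_separator?(non_neg_integer()) :: boolean()
  def contains_separator?(codepoint) do
    <<codepoint::utf8>>
    |> :binary.bin_to_list()
    |> Enum.any?(&(&1 in @separators))
  end
end
```

For example, U+00A0 encodes as the bytes 0xC2 0xA0 and U+3000 as 0xE3 0x80 0x80, so the check returns false for both, while the ASCII space 0x20 returns true.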

Returns

  • A list of binaries, in document order, with no empty tokens. Returns [] for empty or whitespace-only input.

Examples

iex> Text.Language.Classifier.Fasttext.Tokenizer.tokenize("hello world")
["hello", "world"]

iex> Text.Language.Classifier.Fasttext.Tokenizer.tokenize("  hello\tworld\n")
["hello", "world"]

iex> Text.Language.Classifier.Fasttext.Tokenizer.tokenize("")
[]

iex> Text.Language.Classifier.Fasttext.Tokenizer.tokenize("一個 中文 句子")
["一個", "中文", "句子"]