Modules
Functions for basic text processing and analysis.
Text cleanup utilities: HTML stripping, whitespace collapse, Unicode normalization, and mojibake repair.
Extract statistically significant word bigrams from a token stream.
Locates runtime data files used by Text modules, fetching them from upstream sources when permitted.
String edit-distance algorithms.
Word embeddings — load pre-trained vectors and compute similarity, nearest neighbours, and analogies.
Emoji detection and short-name conversion.
Extract URLs and email addresses from arbitrary text at social-media quality.
Phase 3 of the URL / email extraction pipeline: trim spurious trailing punctuation from a candidate span.
Phase 2 validator for email-address candidates.
Phase 1 of the URL / email extraction pipeline: find candidate spans.
UTR #39 §5.1 single-script restriction check, used by Text.Extract to flag mixed-script hosts as potential homograph attacks.
Top-level domain validation for Text.Extract.
Twitter-text-specific URL handling quirks, gated behind the :twitter_quirks option.
Phase 2 validator for URL candidates.
Hyphenation via Liang's algorithm with TeX hyphenation patterns.
Information-retrieval scoring against an indexed corpus.
An indexed corpus of documents for information-retrieval scoring.
Pluralisation for the English language, based on the paper "An Algorithmic Approach to English Pluralization".
Keyword-In-Context concordance.
A single keyword-in-context occurrence.
Language tag utilities used across the package.
Pure-Elixir port of fastText's lid.176 language identification model.
Training and model hyperparameters extracted from a fastText model file.
The result of running fastText language identification on a piece of text.
Vocabulary and label table parsed from a fastText model file.
A single dictionary entry parsed from a fastText model file.
Converts an input string into the flat list of input-matrix row indices that fastText averages to produce a feature vector.
Bit-exact port of fastText's string hash function.
The Huffman tree fastText constructs over output labels for hierarchical softmax inference.
Forward-pass scoring for fastText models.
Resolves a language detection into a CLDR-canonical locale string.
A fully-loaded fastText model.
Parses a fastText .bin model file into a Text.Language.Classifier.Fasttext.Model struct.
Identifies the dominant Unicode script of a piece of text.
Character n-gram extraction and input-matrix indexing for fastText models.
Whitespace tokenizer matching fastText's Dictionary::readWord.
Dictionary-driven lemmatization.
A single named entity span.
Computes n-grams and their counts from a given UTF-8 string.
Pattern-based detection and redaction of personally-identifiable information.
Cologne phonetics (Kölner Phonetik), the German-language counterpart to Soundex.
Double Metaphone phonetic encoding (Lawrence Philips, 2000).
Metaphone phonetic encoding (Lawrence Philips, 1990).
New York State Identification and Intelligence System (NYSIIS) phonetic encoding (Robert L. Taft, 1970).
Soundex phonetic encoding (Russell-Odell, 1918).
Readability metrics for English text.
Locale-aware word and sentence segmentation.
Sentiment analysis with multilingual support.
Behaviour for sentiment-analysis backends.
Neural sentiment backend built on Bumblebee.
Default sentiment backend — lexicon-based, multilingual via the bundled AFINN lexicons.
Lexicon-based sentiment scoring.
Bundled AFINN sentiment lexicons.
Set- and vector-based string similarity coefficients.
URL-safe slug generation with locale-aware Unicode folding.
Spell correction.
Bundled multilingual stopword lists.
Extractive text summarization.
Syllable counting for English words.
Restore case to text that has been lowercased.
Implements word counting for lists, streams and flows.
Builds a weighted list of terms suitable for rendering as a word cloud.
Behaviour for Text.WordCloud scoring backends.
Trivial frequency-counting backend for Text.WordCloud.
Neural keyword-extraction backend built on Bumblebee.
RAKE (Rapid Automatic Keyword Extraction) backend for Text.WordCloud.
TF-IDF backend for Text.WordCloud.
TextRank backend for Text.WordCloud.
YAKE! (Yet Another Keyword Extractor) backend for Text.WordCloud.
Wordle-style spiral layout for word-cloud rendering.
Renders a laid-out word cloud as an SVG string.
Word frequency lookup tables.
Mix Tasks
Downloads lemmatization dictionaries from the michmech/lemmatization-lists upstream and places them in the configured Text.Data cache so Text.Lemma can load them with no further network access.
Downloads the fastText lid.176.bin model file used by Text.Language.Classifier.Fasttext for language identification.
Pre-downloads every external model used by :text so that subsequent calls run without network access.
Refreshes priv/extract/tlds.txt from data.iana.org.
Converts the AFINN data vendored under data/affin/ into per-language TSV files under priv/sentiment/, ready for compile-time loading by Text.Sentiment.Lexicons.AFINN.
Runs the canonical test inputs through the reference fastText implementation and writes per-input top-K predictions to test/fixtures/golden_predictions.json.
Fetches the stopwords-iso bundle (a single JSON file mapping ISO 639-1 codes to lists of stopwords) and writes one plain-text file per language under priv/stopwords/.
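The splitting step of the stopwords task can be pictured with a minimal sketch. It is not the task's actual implementation: the module name StopwordsSplitSketch, the function write_per_language/2, the "<code>.txt" file naming, and the one-word-per-line layout are all assumptions made for illustration. The bundle is assumed to have been fetched and JSON-decoded into a map already.

```elixir
defmodule StopwordsSplitSketch do
  @moduledoc false

  # Hypothetical helper, not part of the package.
  # `bundle` is assumed to be the already-decoded JSON map, e.g.
  # %{"en" => ["a", "the"], "de" => ["der", "die"]}.
  # Writes one newline-separated file per language code into `dir`
  # and returns the list of written paths, sorted by language code.
  def write_per_language(bundle, dir) when is_map(bundle) do
    File.mkdir_p!(dir)

    for {code, words} <- Enum.sort(bundle) do
      # Assumed naming scheme: "<code>.txt", one stopword per line.
      path = Path.join(dir, code <> ".txt")
      File.write!(path, Enum.join(words, "\n") <> "\n")
      path
    end
  end
end

# Usage with a tiny in-memory bundle and a temporary directory.
dir = Path.join(System.tmp_dir!(), "stopwords_sketch")
StopwordsSplitSketch.write_per_language(%{"en" => ["a", "the"], "de" => ["der"]}, dir)
```

Sorting the map entries first makes the list of returned paths deterministic, which is convenient when the output is committed to a repository.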