API Reference Text v0.6.0

Modules

Functions for basic text processing and analysis.

Text cleanup utilities: HTML stripping, whitespace collapse, Unicode normalization, and mojibake repair.

Extract statistically significant word bigrams from a token stream.

Locates runtime data files used by Text modules, fetching them from upstream sources when permitted.

String edit-distance algorithms.
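
As a rough illustration of what an edit-distance algorithm computes, the sketch below implements the classic Levenshtein distance with a two-row dynamic-programming table. It is written in Python for illustration only; the function name `levenshtein` is hypothetical and is not this package's API, and the package may offer other distance variants.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `a` into `b`."""
    # Keep the shorter string on the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a

    # previous[j] holds the distance between a[:i-1] and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

The two-row layout keeps memory proportional to the shorter input rather than to the product of both lengths.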

Word embeddings — load pre-trained vectors and compute similarity, nearest neighbours, and analogies.

Emoji detection and short-name conversion.

Extract URLs and email addresses from arbitrary text at social-media quality.

Phase 3 of the URL / email extraction pipeline: trim spurious trailing punctuation from a candidate span.

Phase 2 validator for email-address candidates.

Phase 1 of the URL / email extraction pipeline: find candidate spans.

UTR #39 §5.1 single-script restriction check, used by Text.Extract to flag mixed-script hosts as potential homograph attacks.

Top-level domain validation for Text.Extract.

Twitter-text-specific URL handling quirks, gated behind the :twitter_quirks option.

Phase 2 validator for URL candidates.

Hyphenation via Liang's algorithm with TeX hyphenation patterns.

Information-retrieval scoring against an indexed corpus.

An indexed corpus of documents for information-retrieval scoring.

Pluralisation for the English language based on the paper "An Algorithmic Approach to English Pluralization".

Keyword-In-Context concordance.

A single keyword-in-context occurrence.

Language tag utilities used across the package.

Pure-Elixir port of fastText's lid.176 language identification model.

Training and model hyperparameters extracted from a fastText model file.

The result of running fastText language identification on a piece of text.

Vocabulary and label table parsed from a fastText model file.

A single dictionary entry parsed from a fastText model file.

Converts an input string into the flat list of input-matrix row indices that fastText averages to produce a feature vector.

Bit-exact port of fastText's string hash function.
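
For context, the reference fastText hash is 32-bit FNV-1a with one quirk: each byte is cast to a signed 8-bit value before the XOR, so bytes at or above 0x80 are sign-extended. The Python sketch below reproduces that behaviour as I understand it from the upstream C++ source; the name `fasttext_hash` is hypothetical and this is not the package's own code.

```python
FNV_OFFSET_BASIS = 2166136261  # 32-bit FNV offset basis
FNV_PRIME = 16777619           # 32-bit FNV prime
MASK_32 = 0xFFFFFFFF


def fasttext_hash(text: str) -> int:
    """32-bit FNV-1a over the UTF-8 bytes of `text`, with each byte
    sign-extended the way a C++ int8_t -> uint32_t cast would."""
    h = FNV_OFFSET_BASIS
    for byte in text.encode("utf-8"):
        if byte >= 0x80:
            # Emulate uint32_t(int8_t(byte)): high bytes become 0xFFFFFFxx.
            byte |= 0xFFFFFF00
        h ^= byte
        h = (h * FNV_PRIME) & MASK_32  # wrap to 32 bits like uint32_t
    return h


print(hex(fasttext_hash("a")))
```

For pure ASCII input the result matches plain FNV-1a; the sign extension only changes the result for non-ASCII bytes, which is why a bit-exact port has to reproduce it.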

The Huffman tree fastText constructs over output labels for hierarchical softmax inference.

Forward-pass scoring for fastText models.

Resolves a language detection into a CLDR-canonical locale string.

A fully-loaded fastText model.

Identifies the dominant Unicode script of a piece of text.

Character n-gram extraction and input-matrix indexing for fastText models.

Whitespace tokenizer matching fastText's Dictionary::readWord.

Dictionary-driven lemmatization.

Named-entity recognition via Bumblebee.

A single named entity span.

Compute n-grams and their counts from a given UTF-8 string.
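
A minimal sketch of the idea, assuming character-level n-grams taken with a sliding window: the hypothetical `ngram_counts` function below is illustrative Python, not this package's API, and it slices by Unicode code point rather than by grapheme cluster.

```python
from collections import Counter


def ngram_counts(text: str, n: int = 2) -> Counter:
    """Count every contiguous run of `n` code points in `text`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A string of length L has L - n + 1 windows; none when L < n.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


print(ngram_counts("banana", 2))
```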

Pattern-based detection and redaction of personally-identifiable information.

Part-of-speech tagging via Bumblebee.

Cologne phonetics (Kölner Phonetik), the German-language counterpart to Soundex.

Double Metaphone phonetic encoding (Lawrence Philips, 2000).

Metaphone phonetic encoding (Lawrence Philips, 1990).

New York State Identification and Intelligence System (NYSIIS) phonetic encoding (Robert L. Taft, 1970).

Soundex phonetic encoding (Russell-Odell, 1918).
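
To show what a Soundex code looks like, here is a small Python sketch of the commonly described American Soundex rules: keep the first letter, map the remaining consonants to digits, collapse adjacent duplicates, and pad to four characters. The `soundex` function and `CODES` table are hypothetical names for illustration, and the package's handling of edge cases may differ.

```python
# Consonant groups and their Soundex digits; vowels, H, W, and Y have no code.
CODES = {}
for letters, digit in (
    ("BFPV", "1"),
    ("CGJKQSXZ", "2"),
    ("DT", "3"),
    ("L", "4"),
    ("MN", "5"),
    ("R", "6"),
):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the four-character Soundex code, or "" if `word`
    contains no ASCII letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""

    encoded = letters[0]
    previous = CODES.get(letters[0], "")
    for letter in letters[1:]:
        if letter in "HW":
            # H and W are skipped without separating equal codes.
            continue
        code = CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        # Vowels reset `previous`, so a repeated code after a vowel counts again.
        previous = code
    return (encoded + "000")[:4]


print(soundex("Robert"), soundex("Ashcraft"))
```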

Readability metrics for English text.

Locale-aware word and sentence segmentation.

Sentiment analysis with multilingual support.

Behaviour for sentiment-analysis backends.

Neural sentiment backend backed by Bumblebee.

Default sentiment backend — lexicon-based, multilingual via the bundled AFINN lexicons.

Lexicon-based sentiment scoring.

Bundled AFINN sentiment lexicons.

Set- and vector-based string similarity coefficients.
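
As one example from each family, the Python sketch below computes a Jaccard coefficient over sets of character bigrams and a cosine similarity over bigram count vectors. Which coefficients the package actually provides is not stated here, so the choice of these two, the bigram tokenisation, and the function names are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def bigrams(text: str) -> list:
    """Overlapping two-character windows of `text`."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def jaccard(a: str, b: str) -> float:
    """Set-based: size of the intersection over size of the union."""
    set_a, set_b = set(bigrams(a)), set(bigrams(b))
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here when neither string has bigrams
    return len(set_a & set_b) / len(union)


def cosine(a: str, b: str) -> float:
    """Vector-based: cosine of the angle between bigram count vectors."""
    counts_a, counts_b = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(counts_a[gram] * counts_b[gram] for gram in counts_a)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # same convention: nothing to compare
    return dot / (norm_a * norm_b)


print(jaccard("night", "nacht"), cosine("night", "nacht"))
```

The set-based form ignores how often a bigram repeats, while the vector-based form weights repeated bigrams more heavily.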

URL-safe slug generation with locale-aware Unicode folding.

Spell correction.

Bundled multilingual stopword lists.

Extractive text summarization.

Syllable counting for English words.

Restore case to text that has been lowercased.

Implements word counting for lists, streams and flows.

Builds a weighted list of terms suitable for rendering as a word cloud.

Behaviour for Text.WordCloud scoring backends.

Trivial frequency-counting backend for Text.WordCloud.

Neural keyword-extraction backend backed by Bumblebee.

RAKE (Rapid Automatic Keyword Extraction) backend for Text.WordCloud.

YAKE! (Yet Another Keyword Extractor) backend for Text.WordCloud.

Wordle-style spiral layout for word-cloud rendering.

Renders a laid-out word cloud as an SVG string.

Word frequency lookup tables.

Mix Tasks

Downloads lemmatization dictionaries from the michmech/lemmatization-lists upstream and places them in the configured Text.Data cache so Text.Lemma can load them with no further network access.

Downloads the fastText lid.176.bin model file used by Text.Language.Classifier.Fasttext for language identification.

Pre-downloads every external model used by :text so that subsequent calls run without network access.

Refreshes priv/extract/tlds.txt from data.iana.org.

Converts the AFINN data vendored under data/affin/ into per-language TSV files under priv/sentiment/, ready for compile-time loading by Text.Sentiment.Lexicons.AFINN.

Runs the canonical test inputs through the reference fastText implementation and writes per-input top-K predictions to test/fixtures/golden_predictions.json.

Fetches the stopwords-iso bundle (a single JSON file mapping ISO 639-1 codes to lists of stopwords) and writes one plain-text file per language under priv/stopwords/.