# Text v0.6.1 - Table of Contents

Text analysis and processing for Elixir, including n-grams,
language detection, and more.

## Pages

- [Text](readme.md)
- [Changelog](changelog.md)
- [LICENSE](license.md)

- Guides
  - [Text classification — language identification](text_classification.md)
  - [Sentiment analysis](sentiment.md)
  - [Part-of-speech tagging and named-entity recognition](pos_ner.md)
  - [Keyword-in-context (KWIC) concordance](kwic.md)
  - [Word clouds](word_clouds.md)

## Modules

- [Text](Text.md): Functions for basic text processing
and analysis.
- [Text.Clean](Text.Clean.md): Text cleanup utilities: HTML stripping, whitespace collapse,
Unicode normalization, and mojibake repair.
- [Text.Collocation](Text.Collocation.md): Extract statistically significant word bigrams from a token stream.
- [Text.Data](Text.Data.md): Locates runtime data files used by Text modules, fetching them
from upstream sources when permitted.
- [Text.Distance](Text.Distance.md): String edit-distance algorithms.
- [Text.Embedding](Text.Embedding.md): Word embeddings — load pre-trained vectors and compute similarity,
nearest neighbours, and analogies.
- [Text.Emoji](Text.Emoji.md): Emoji detection and short-name conversion.
- [Text.Extract](Text.Extract.md): Extract URLs and email addresses from arbitrary text at
social-media quality.
- [Text.Extract.Boundary](Text.Extract.Boundary.md): Phase 3 of the URL / email extraction pipeline: trim spurious
trailing punctuation from a candidate span.
- [Text.Extract.Email](Text.Extract.Email.md): Phase 2 validator for email-address candidates.
- [Text.Extract.Scanner](Text.Extract.Scanner.md): Phase 1 of the URL / email extraction pipeline: find candidate spans.
- [Text.Extract.Script](Text.Extract.Script.md): UTR #39 §5.1 single-script restriction check, used by
`Text.Extract` to flag mixed-script hosts as potential homograph
attacks.
- [Text.Extract.Tld](Text.Extract.Tld.md): Top-level domain validation for `Text.Extract`.
- [Text.Extract.Twitter](Text.Extract.Twitter.md): Twitter-text-specific URL handling quirks, gated behind the
`:twitter_quirks` option.
- [Text.Extract.Url](Text.Extract.Url.md): Phase 2 validator for URL candidates.
- [Text.Hyphenation](Text.Hyphenation.md): Hyphenation via Liang's algorithm with TeX hyphenation patterns.
- [Text.IR](Text.IR.md): Information-retrieval scoring against an indexed corpus.
- [Text.IR.Corpus](Text.IR.Corpus.md): An indexed corpus of documents for information-retrieval scoring.
- [Text.Inflect.En](Text.Inflect.En.md): Pluralisation for the English language based on the paper
[An Algorithmic Approach to English Pluralization](http://users.monash.edu/~damian/papers/HTML/Plurals.html).
- [Text.KWIC](Text.KWIC.md): Keyword-In-Context concordance.
- [Text.KWIC.Match](Text.KWIC.Match.md): A single keyword-in-context occurrence.
- [Text.Language](Text.Language.md): Language tag utilities used across the package.
- [Text.Language.Classifier.Fasttext](Text.Language.Classifier.Fasttext.md): Pure-Elixir port of fastText's `lid.176` language identification model.
- [Text.Language.Classifier.Fasttext.Args](Text.Language.Classifier.Fasttext.Args.md): Training and model hyperparameters extracted from a fastText model file.
- [Text.Language.Classifier.Fasttext.Detection](Text.Language.Classifier.Fasttext.Detection.md): The result of running fastText language identification on a piece of
text.
- [Text.Language.Classifier.Fasttext.Dictionary](Text.Language.Classifier.Fasttext.Dictionary.md): Vocabulary and label table parsed from a fastText model file.
- [Text.Language.Classifier.Fasttext.Entry](Text.Language.Classifier.Fasttext.Entry.md): A single dictionary entry parsed from a fastText model file.
- [Text.Language.Classifier.Fasttext.Features](Text.Language.Classifier.Fasttext.Features.md): Converts an input string into the flat list of input-matrix row indices
that fastText averages to produce a feature vector.
- [Text.Language.Classifier.Fasttext.Hash](Text.Language.Classifier.Fasttext.Hash.md): Bit-exact port of fastText's string hash function.
- [Text.Language.Classifier.Fasttext.HuffmanTree](Text.Language.Classifier.Fasttext.HuffmanTree.md): The Huffman tree fastText constructs over output labels for hierarchical
softmax inference.
- [Text.Language.Classifier.Fasttext.Inference](Text.Language.Classifier.Fasttext.Inference.md): Forward-pass scoring for fastText models.
- [Text.Language.Classifier.Fasttext.Locale](Text.Language.Classifier.Fasttext.Locale.md): Resolves a language detection into a CLDR-canonical locale string.
- [Text.Language.Classifier.Fasttext.Model](Text.Language.Classifier.Fasttext.Model.md): A fully-loaded fastText model.
- [Text.Language.Classifier.Fasttext.ModelLoader](Text.Language.Classifier.Fasttext.ModelLoader.md): Parses a fastText `.bin` model file into a
`Text.Language.Classifier.Fasttext.Model` struct.
- [Text.Language.Classifier.Fasttext.ScriptDetector](Text.Language.Classifier.Fasttext.ScriptDetector.md): Identifies the dominant Unicode script of a piece of text.
- [Text.Language.Classifier.Fasttext.Subwords](Text.Language.Classifier.Fasttext.Subwords.md): Character n-gram extraction and input-matrix indexing for fastText models.
- [Text.Language.Classifier.Fasttext.Tokenizer](Text.Language.Classifier.Fasttext.Tokenizer.md): Whitespace tokenizer matching fastText's `Dictionary::readWord`.
- [Text.Lemma](Text.Lemma.md): Dictionary-driven lemmatization.
- [Text.NER](Text.NER.md): Named-entity recognition via [Bumblebee](https://hex.pm/packages/bumblebee).
- [Text.NER.Entity](Text.NER.Entity.md): A single named entity span.
- [Text.Ngram](Text.Ngram.md): Compute n-grams and their counts from a given
UTF-8 string.
- [Text.PII](Text.PII.md): Pattern-based detection and redaction of personally-identifiable
information.
- [Text.POS](Text.POS.md): Part-of-speech tagging via [Bumblebee](https://hex.pm/packages/bumblebee).
- [Text.Phonetic.Cologne](Text.Phonetic.Cologne.md): Cologne phonetics (Kölner Phonetik), the German-language counterpart
to Soundex.
- [Text.Phonetic.DoubleMetaphone](Text.Phonetic.DoubleMetaphone.md): Double Metaphone phonetic encoding (Lawrence Philips, 2000).
- [Text.Phonetic.Metaphone](Text.Phonetic.Metaphone.md): Metaphone phonetic encoding (Lawrence Philips, 1990).
- [Text.Phonetic.NYSIIS](Text.Phonetic.NYSIIS.md): New York State Identification and Intelligence System (NYSIIS)
phonetic encoding (Robert L. Taft, 1970).
- [Text.Phonetic.Soundex](Text.Phonetic.Soundex.md): Soundex phonetic encoding (Russell-Odell, 1918).
- [Text.Readability](Text.Readability.md): Readability metrics for English text.
- [Text.Segment](Text.Segment.md): Locale-aware word and sentence segmentation.
- [Text.Sentiment](Text.Sentiment.md): Sentiment analysis with multilingual support.
- [Text.Sentiment.Backend](Text.Sentiment.Backend.md): Behaviour for sentiment-analysis backends.
- [Text.Sentiment.Backends.Bumblebee](Text.Sentiment.Backends.Bumblebee.md): Neural sentiment backend backed by
[Bumblebee](https://hex.pm/packages/bumblebee).
- [Text.Sentiment.Backends.Lexicon](Text.Sentiment.Backends.Lexicon.md): Default sentiment backend — lexicon-based, multilingual via the
bundled AFINN lexicons.
- [Text.Sentiment.Lexicon](Text.Sentiment.Lexicon.md): Lexicon-based sentiment scoring.
- [Text.Sentiment.Lexicons.AFINN](Text.Sentiment.Lexicons.AFINN.md): Bundled [AFINN](https://github.com/fnielsen/afinn) sentiment lexicons.
- [Text.Similarity](Text.Similarity.md): Set- and vector-based string similarity coefficients.
- [Text.Slug](Text.Slug.md): URL-safe slug generation with locale-aware Unicode folding.
- [Text.Spell](Text.Spell.md): Spell correction.
- [Text.Stopwords](Text.Stopwords.md): Bundled multilingual stopword lists.
- [Text.Summarize](Text.Summarize.md): Extractive text summarization.
- [Text.Syllable](Text.Syllable.md): Syllable counting for English words.
- [Text.Truecase](Text.Truecase.md): Restore case to text that has been lowercased.
- [Text.Word](Text.Word.md): Implements word counting for lists, streams and flows.
- [Text.WordCloud](Text.WordCloud.md): Builds a weighted list of terms suitable for rendering as a word cloud.
- [Text.WordCloud.Backend](Text.WordCloud.Backend.md): Behaviour for `Text.WordCloud` scoring backends.
- [Text.WordCloud.Backends.Frequency](Text.WordCloud.Backends.Frequency.md): Trivial frequency-counting backend for `Text.WordCloud`.
- [Text.WordCloud.Backends.KeyBERT](Text.WordCloud.Backends.KeyBERT.md): Neural keyword-extraction backend backed by
[Bumblebee](https://hex.pm/packages/bumblebee).
- [Text.WordCloud.Backends.RAKE](Text.WordCloud.Backends.RAKE.md): RAKE (Rapid Automatic Keyword Extraction) backend for `Text.WordCloud`.
- [Text.WordCloud.Backends.TFIDF](Text.WordCloud.Backends.TFIDF.md): TF-IDF backend for `Text.WordCloud`.
- [Text.WordCloud.Backends.TextRank](Text.WordCloud.Backends.TextRank.md): TextRank backend for `Text.WordCloud`.
- [Text.WordCloud.Backends.YAKE](Text.WordCloud.Backends.YAKE.md): YAKE! (Yet Another Keyword Extractor) backend for `Text.WordCloud`.
- [Text.WordCloud.Layout](Text.WordCloud.Layout.md): Wordle-style spiral layout for word-cloud rendering.
- [Text.WordCloud.SVG](Text.WordCloud.SVG.md): Renders a laid-out word cloud as an SVG string.
- [Text.WordFreq](Text.WordFreq.md): Word frequency lookup tables.

## Mix Tasks

- [mix text.download_lemma_data](Mix.Tasks.Text.DownloadLemmaData.md): Downloads lemmatization dictionaries from the
[`michmech/lemmatization-lists`](https://github.com/michmech/lemmatization-lists)
upstream and places them in the configured `Text.Data` cache so
`Text.Lemma` can load them with no further network access.
- [mix text.download_lid176](Mix.Tasks.Text.DownloadLid176.md): Downloads the fastText `lid.176.bin` model file used by
`Text.Language.Classifier.Fasttext` for language identification.
- [mix text.download_models](Mix.Tasks.Text.DownloadModels.md): Pre-downloads every external model used by `:text` so that subsequent
calls run without network access.
- [mix text.download_tlds](Mix.Tasks.Text.DownloadTlds.md): Refreshes `priv/extract/tlds.txt` from
[`data.iana.org`](https://data.iana.org/TLD/tlds-alpha-by-domain.txt).
- [mix text.gen_afinn_lexicons](Mix.Tasks.Text.GenAfinnLexicons.md): Converts the [AFINN](https://github.com/fnielsen/afinn) data vendored
under `data/affin/` into per-language TSV files under
`priv/sentiment/`, ready for compile-time loading by
`Text.Sentiment.Lexicons.AFINN`.
- [mix text.gen_golden_fixtures](Mix.Tasks.Text.GenGoldenFixtures.md): Runs the canonical test inputs through the reference fastText
implementation and writes per-input top-K predictions to
`test/fixtures/golden_predictions.json`.
- [mix text.gen_stopwords](Mix.Tasks.Text.GenStopwords.md): Fetches the [stopwords-iso](https://github.com/stopwords-iso/stopwords-iso)
bundle (a single JSON file mapping ISO 639-1 codes to lists of stopwords)
and writes one plain-text file per language under `priv/stopwords/`.

