All notable changes to this project are documented here. The format follows Keep a Changelog.

[0.6.1] — 2026-05-04

Fixed

  • Fix CI and :json_polyfill config on OTP 26.

[0.6.0] — 2026-05-04

Added

  • Text.Extract — twitter-text-quality URL and email extraction with full UTS #46 IDNA, IANA TLD validation, and UTR #39 single-script defence against homograph attacks. Public API is urls/2, emails/2, all/2, split/2, and autolink/2; options include :require_scheme, :tld_mode, :eai, :strict_idn, and :twitter_quirks.

  • Text.Extract.split/2 — splits text into an interleaved list of plain-string fragments and validated entity maps, byte-for-byte round-trippable to the original. The building block for custom rendering of extracted URLs/emails into anchors, mentions, badges, or link-preview cards.

  • Text.Extract.autolink/2 — wraps URLs and emails in HTML <a> anchors, returning Phoenix.HTML.safe() for drop-in Phoenix template use. Display text preserves the original Unicode (bücher.de); the href uses Punycode (xn--bcher-kva.de).

  • mix text.download_tlds — refreshes the bundled IANA TLD list at priv/extract/tlds.txt. --diff previews added/removed entries; --force overwrites unconditionally.

  • Text.WordCloud.to_d3_cloud/2 — adapts terms/2 output into the [%{text, size}, …] shape consumed by d3-cloud. Supports :linear (default) and :sqrt sizing; shares the :font_size_range vocabulary with Text.WordCloud.Layout.
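
As a rough illustration of the display-text/href split described for Text.Extract.autolink/2 above, the following Python sketch builds an anchor whose visible text keeps the Unicode host while the href carries the Punycode form. It is not the library's implementation: the anchor helper is hypothetical, the https:// scheme is an assumption, and Python's built-in idna codec follows the older IDNA 2003 rules rather than UTS #46 (the two agree for this host).

```python
import html


def anchor(host: str) -> str:
    """Hypothetical helper: Unicode display text, Punycode href."""
    # Python's built-in "idna" codec converts each label to its ASCII (xn--) form.
    ascii_host = host.encode("idna").decode("ascii")
    # The https:// scheme is assumed purely for illustration.
    href = html.escape(f"https://{ascii_host}", quote=True)
    return f'<a href="{href}">{html.escape(host)}</a>'


print(anchor("bücher.de"))
# <a href="https://xn--bcher-kva.de">bücher.de</a>
```

Keeping the two forms separate means readers see the name they expect, while the link target stays in the ASCII form that DNS resolvers accept.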

[0.5.0] — 2026-05-02

Added

  • Text.Phonetic.NYSIIS — New York State Identification and Intelligence System phonetic encoding (Taft, 1970). Designed as a Soundex successor for English personal-name matching; produces pronounceable letter codes rather than digits and is more discriminating than Soundex on common name variations.

  • Text.Phonetic.Cologne — Kölner Phonetik (Postel, 1969), the German-language counterpart to Soundex. Optimized for German spelling variants — Müller / Mueller / Muller and Meyer / Mayer / Maier / Meier collapse to single codes.

  • Text.Phonetic.DoubleMetaphone — Lawrence Philips' Double Metaphone (2000), the de-facto standard for fuzzy matching of English names with non-Anglo origins. Returns a {primary, alternate} code pair so the same Anglicised name can match across multiple plausible pronunciations (e.g. Smith / Schmidt, Catherine / Katherine). Handles Germanic, Italian, Spanish, French, Greek, and Slavic patterns.

  • match?/2 (and match?/3 where options apply) on every Text.Phonetic.* module for direct equality comparison without manual encode/2 == encode/2 boilerplate. Text.Phonetic.DoubleMetaphone.match?/3 checks all four primary/alternate combinations.

  • Text.Clean.unaccent/1 — strip diacritics and fold non-decomposable Latin letters (Þ → Th, ß → ss, Æ → AE, ł → l, đ → d) by delegating to Unicode.Transform.LatinAscii.transform/1. Also exposed as the :unaccent option on Text.Clean.clean/2.

  • Text.Distance gains four set-based similarity metrics over character n-grams: jaccard/3, sorensen_dice/3, tanimoto/3 (alias for jaccard/3), and cosine/3. All accept an :n option for configurable shingle size (default 2) and operate at the grapheme level for Unicode correctness.

  • Text.Inflect.En.singularize/2 and Text.Inflect.En.singularize_noun/2 — invert the existing pluralizer. Combines reverse lookup of Conway's irregular tables, explicit suffix rules for unambiguous English plural forms (-ies, -shes/-ches/-xes/-zes/-sses), small whitelists for Greek-derived -is/-es plurals (analyses → analysis) and English -us plurals (geniuses → genius), and a pluralize/2 round-trip search to validate other candidates.

  • Text.Readability.dale_chall/2 and Text.Readability.spache/2 — the two classic word-list readability indices, backed by bundled easy-words lists in priv/readability/ (Dale-Chall 2,949 words, Spache 1,063 words; both sourced from the MIT-licensed py-readability-metrics distribution of the public-domain originals). statistics/2 now also returns :difficult_words and :unfamiliar_words counts.

  • Text.Hyphenation bundles six additional language packs: de-1996, fr, es, it, nl, pt. All loaded at compile time with zero I/O, joining the existing en-us pack. Source: hyph-utf8 upstream; per-file licenses (MIT/X11/BSD/LPPL) are preserved in each .tex header.

  • Text.WordFreq bundles six additional frequency tables at the same top-30,000 cap as English: de, fr, es, it, nl, pt. Source: Hermit Dave's MIT-licensed FrequencyWords OpenSubtitles 2018 corpus.

  • Text.Emoji.sentiment/1 and Text.Emoji.text_sentiment/1 — per-emoji and aggregate sentiment scoring backed by the bundled Emoji Sentiment Ranking v1.0 (Kralj Novak et al., 2015 — CC-BY-SA 3.0; data file at priv/emoji_sentiment/emoji_sentiment_v1.csv, ~750 emoji with negative/neutral/positive proportions and an aggregate score in [-1.0, 1.0]). Aggregate scoring is occurrence-weighted to match the original paper.

  • mix text.download_lemma_data <lang>... — fetches lemmatization dictionaries from the michmech upstream into the Text.Data cache without requiring the per-app auto_download_lemma_data flag. Useful as a build step when shipping a release with the dictionaries pre-warmed. Pass --list to see the supported languages; --force to refresh.
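
The set-based n-gram metrics added to Text.Distance in this release can be sketched in a few lines. The Python below is a language-neutral illustration of Jaccard and Sørensen–Dice over character bigrams, not the library's code: the function names are hypothetical, returning 1.0 for two inputs with no n-grams is an assumed convention, and Python strings are split by code point here, whereas the entry above says the library works on graphemes.

```python
def shingles(text: str, n: int = 2) -> set:
    """Set of overlapping character n-grams (by code point, not by grapheme)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """|A ∩ B| / |A ∪ B| over the two n-gram sets."""
    x, y = shingles(a, n), shingles(b, n)
    if not x and not y:
        return 1.0  # assumed convention when neither input yields an n-gram
    return len(x & y) / len(x | y)


def sorensen_dice(a: str, b: str, n: int = 2) -> float:
    """2 * |A ∩ B| / (|A| + |B|) over the two n-gram sets."""
    x, y = shingles(a, n), shingles(b, n)
    if not x and not y:
        return 1.0  # same assumed convention as above
    return 2 * len(x & y) / (len(x) + len(y))


# "night" and "nacht" share only the bigram "ht" out of 7 distinct bigrams.
print(jaccard("night", "nacht"))        # 1/7
print(sorensen_dice("night", "nacht"))  # 0.25
```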

Changed

  • Text.Lemma moduledoc now enumerates the upstream-available languages (~20 languages from the michmech project) and notes that no Dutch (nl) dictionary exists upstream. Bundling the non-English dictionaries was evaluated and deferred because the smallest of them (French, 4.7 MB raw) would by itself push the package near Hex's 8 MB limit. Use the new mix text.download_lemma_data task or set auto_download_lemma_data: true to populate the cache.

Fixed

  • Text.Inflect.En.Helpers.replace_suffix/3 now actually replaces only the trailing suffix instead of all repeated trailing occurrences, fixing cases like theses (which previously transformed to thisis instead of thesis because both es occurrences were rewritten). Affects rule output where the suffix repeats inside the base word.
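
The difference between the old and new behaviour can be reproduced with a small model. The Python sketch below is an illustration only: both function names are hypothetical, and the first merely mimics the behaviour described above rather than reproducing the original Elixir code.

```python
def replace_all_trailing(word: str, suffix: str, replacement: str) -> str:
    """Model of the old behaviour: rewrite every repeated trailing occurrence."""
    count = 0
    while suffix and word.endswith(suffix):
        word = word[: -len(suffix)]
        count += 1
    return word + replacement * count


def replace_suffix(word: str, suffix: str, replacement: str) -> str:
    """Model of the fixed behaviour: rewrite only the final occurrence."""
    if suffix and word.endswith(suffix):
        return word[: -len(suffix)] + replacement
    return word


print(replace_all_trailing("theses", "es", "is"))  # thisis
print(replace_suffix("theses", "es", "is"))        # thesis
```

Words whose stem does not end in the suffix (for example "boxes" with "es") come out the same either way, which is why only words with a repeated suffix were affected.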

[0.4.0] — 2026-05-01

Added

  • Text.Truecase — case restoration for ALL-CAPS or lowercased text using POS-aware heuristics for proper nouns, acronyms, and sentence starts.

  • Text.Clean — pipeline-style normalization (whitespace, control characters, smart-quotes, dashes, NFC/NFKC) with a composable clean/2 API.

  • Text.Emoji — emoji detection, stripping, and counting. Uses the :unicode package's emoji property tables; no external data required.

  • Text.Hyphenation — Knuth–Liang TeX-pattern hyphenation. Ships en-US patterns (~5k); other languages can be loaded via Text.Hyphenation.Parser from any hyph-*.tex file.

  • Text.PII — pattern-based detection and redaction of phone numbers, emails, credit-card-shaped digits, IBANs, IPv4/IPv6, and US SSNs.

  • Text.Spell — Norvig-style edit-distance spelling suggestions backed by Text.WordFreq. Returns ranked candidates with their corpus frequency.

  • Text.Summarize — extractive summarization via a sentence-graph TextRank with configurable similarity (:cosine or :jaccard) and target length.

  • Text.Syllable — English syllable counting using a vowel-group heuristic with override exceptions. Used as the per-word syllable signal feeding Text.Readability.

  • Text.Readability — Flesch, Flesch–Kincaid, Gunning-Fog, SMOG, Coleman–Liau, ARI, and Linsear-Write scores plus a unified analyze/2 summary.

  • Text.WordFreq — frequency lookup over a 30k-word English corpus shipped in priv/wordfreq/en.tsv. Provides rank/2, frequency/2, is_common?/2, and top/2.

  • Text.Lemma — dictionary-based lemmatization. Ships an en-US table of ~42k inflected→base mappings; lookup/2 falls back to the input when no entry exists.

  • Text.Inflect.En.Pluralize and Text.Inflect.En.Singularize — English noun inflection covering ~1.6 KLoC of irregular-form rules and exceptions, with Text.Inflect.En.Helpers for shared morphology utilities.

  • Text.Sentiment.Lexicons.AFINN now ships sentiment lexicons for 104 languages (up from 7), an Emoji Sentiment Ranking 1.0 lexicon (:emoji, ~840 entries derived from the upstream corpus and rescaled onto AFINN's −5..+5 integer range), and per-language negator lists (negators/1). The seven hand-curated 0.3.0 lexicons (:en, :da, :fi, :fr, :pl, :sv, :tr) are preserved unchanged; the other ~95 are upstream machine-translated and ship as a baseline.

  • Text.Sentiment.Backends.Lexicon automatically resolves per-language negators from Text.Sentiment.Lexicons.AFINN.negators/1 based on the requested :language option, so non-English text gets negation handling out of the box. Callers can still override with an explicit :negators list.

  • mix text.gen_afinn_lexicons regenerates priv/sentiment/ from the vendored data/affin/ source files. Hand-curated TSVs are preserved unless --overwrite is passed.

Changed

  • The :unicode_string dependency requirement is ~> 2.1. The 2.1 release replaces its regex evaluator with a single-pass DFA engine; benchmarks show ~17× faster word-cloud builds for typical English prose, with linear (rather than O(N²)) scaling on long unbroken inputs.

  • Text.Word.word_count/2 documentation now explicitly calls out that the default &String.split/1 splitter does not implement UAX #29 segmentation and does not work for languages without inter-word whitespace (Chinese, Japanese, Korean, Thai, Lao, Khmer, Burmese). Examples show how to pass a UAX-aware or dictionary-aware splitter for those cases.
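
The whitespace limitation called out for Text.Word.word_count/2 is easy to see in any language. The Python sketch below uses str.split() as a stand-in for a whitespace-only splitter; the helper name is hypothetical and nothing here calls the library itself.

```python
def whitespace_word_count(text: str) -> int:
    """Count words by splitting on runs of whitespace only."""
    return len(text.split())


english = "the quick brown fox"
japanese = "今日は良い天気です"  # roughly "the weather is nice today"

print(whitespace_word_count(english))   # 4
print(whitespace_word_count(japanese))  # 1: the whole sentence counts as one "word"
```

A splitter that is aware of UAX #29 or backed by a dictionary is needed to recover word boundaries in text like the second example.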

[0.3.0] — 2026-04-29

Added

  • Text.WordCloud — multilingual keyword extraction returning a weighted term list suitable for rendering as a word cloud. Six backends: YAKE! (default, unsupervised statistical), frequency, RAKE, TextRank, TF-IDF (requires :reference_corpus), and KeyBERT (neural, requires :bumblebee). The :stem option (requires the optional :text_stemmer dependency) buckets morphological variants — demolish, demolished, demolishing — into a single entry labelled with the most-frequent surface form.

  • Text.WordCloud.Layout — Wordle-style Archimedean-spiral packing that produces renderer-agnostic (x, y, width, height, font_size, rotation) placements. Pluggable :font_metrics callback so callers can supply pixel-accurate metrics from their actual font stack.

  • Text.WordCloud.SVG — renders placements as a self-contained SVG document. Pluggable :palette (list of hex strings, a Color.Palette.Tonal scale, a Color.Palette.Theme, or nil for single-colour) plus three mapping strategies (:by_weight, :by_index, :by_hash). Hex-string palettes work without optional deps; Color.Palette structs require the optional :color dependency.

  • Text.Stopwords — bundled multilingual stopword lists from stopwords-iso (~60 languages, MIT license). Public API: for/1, contains?/2, available_languages/0, available?/1, union/2, extend/2. Generation tooling lives in mix text.gen_stopwords.

  • mix text.download_models --keybert — pre-fetches the multilingual MiniLM sentence-transformer used by Text.WordCloud.Backends.KeyBERT (~470 MB). The --bumblebee shorthand now includes --keybert alongside --sentiment --pos --ner.

  • Text.POS — part-of-speech tagging via the optional :bumblebee dependency. English by default (vblagoje/bert-english-uncased-finetuned-pos); override :model for other checkpoints. Returns coarse-grained tag atoms (:noun, :verb, :adj, …) with confidence scores.

  • Text.NER — named-entity recognition via the optional :bumblebee dependency. Multilingual by default (Davlan/bert-base-multilingual-cased-ner-hrl, 10 high-resource languages, CoNLL-2003 tag set). Returns Text.NER.Entity structs with span byte offsets, type atom (:per, :org, :loc, :misc), and score.

  • Text.Embedding — load pre-trained word vectors in fastText .vec format. Exposes vector/2, similarity/3, nearest/3, and analogy/5 over an L2-normalised Nx matrix. Supports :filter and :max_tokens options for partial loads.

  • Text.Language.Classifier.Fasttext.ScriptDetector.han_variant/1 — disambiguates Simplified (:Hans) from Traditional (:Hant) Chinese using a curated codepoint-frequency analysis. detect/1 now returns :Hans or :Hant directly for Han text when the input is unambiguous, falling back to :Hani otherwise. The script signal flows through to Text.Language.Classifier.Fasttext.Locale.resolve/2, producing zh-Hans-CN vs zh-Hant-TW automatically.

  • Text.Language.normalize/1 and Text.Language.to_locale_string/1 — every public function in the package that takes a :language or :locale option now accepts an atom, a string (BCP-47 or otherwise), or a Localize.LanguageTag struct (when the optional :localize dependency is loaded). The new helpers normalise to the language subtag (atom) or to a canonical BCP-47 string respectively.

  • Text.Sentiment.Backend behaviour with two shipped backends: Text.Sentiment.Backends.Lexicon (the default — lexicon-based, multilingual via AFINN, always available) and Text.Sentiment.Backends.Bumblebee (optional — neural via Bumblebee and XLM-RoBERTa, requires :bumblebee and :exla deps). Routing via the :backend option to Text.Sentiment.analyze/2 or globally via the :sentiment_backend application configuration.

  • Text.Sentiment — multilingual lexicon-based sentiment analysis. Returns a label (:positive, :negative, :neutral), a normalised compound score, and the matched-token count. Handles negation ("not good" flips polarity) and intensifiers ("very good" boosts) via VADER-style scalars.

  • Text.Sentiment.Lexicons.AFINN — bundled AFINN sentiment lexicons (Apache 2.0) for English, Danish, Finnish, French, Polish, Swedish, and Turkish, plus a language-agnostic emoticon lexicon. Routed automatically by Text.Sentiment.analyze/2's :language option.

  • Text.Sentiment.lexicon_for/2 — composes a per-language lexicon with the emoticon lexicon and/or domain-specific overrides.

  • Text.Language.Classifier.Fasttext — a pure-Elixir port of fastText's lid.176 language identification model. Validated bit-for-bit against the official C++/Python reference for hashing, subword extraction, feature assembly, and tree traversal. See the README for usage.

  • Text.Language.Classifier.Fasttext.ModelLoader.load/2 parses an lid.176.bin file (~126 MB) into a typed Model struct with the input/output matrices held as Nx tensors.

  • Text.Language.Classifier.Fasttext.detect/3, classify/2, and to_locale/2 for the public detection API.

  • Text.Language.Classifier.Fasttext.ScriptDetector for Unicode-script-of-text classification, used to disambiguate multi-script locales (e.g. sr-Latn vs sr-Cyrl). Backed by the unicode Hex package.

  • Text.Language.Classifier.Fasttext.Locale.resolve/2 for CLDR-canonical locale assembly via likely-subtags. Uses the optional localize dependency when present, with a built-in fallback table for the most common languages otherwise.

  • mix text.download_lid176 task that fetches lid.176.bin into priv/lid_176/. The model file is gitignored and not part of the Hex package.

  • mix text.download_models task (plural) that pre-fetches every external model used by :text (lid.176.bin plus the default Hugging Face checkpoints behind Text.Sentiment.Backends.Bumblebee, Text.POS, and Text.NER), for production environments that need every artefact present at boot. Selection flags (--lid176, --sentiment, --pos, --ner, --bumblebee) limit the download to a subset.

  • mix text.gen_subword_fixtures, mix text.gen_features_fixtures, mix text.gen_predict_fixtures (via priv/scripts/*.py) for regenerating the differential test fixtures against the reference fasttext Python bindings.

  • docs/lid176_binary_format.md — full byte-layout specification of fastText's model file, derived from the C++ source.

Changed

  • The minimum Elixir version is now ~> 1.17 (raised from ~> 1.8). All development and testing targets Elixir 1.20 on Erlang/OTP 28.

  • Added required dependencies on :nx and :unicode, and optional dependencies on :exla (recommended for inference performance) and :localize (for CLDR-canonical locale resolution).

  • The fastText inference forward pass (take + mean + dot, plus the softmax tail for softmax-loss models) is now wrapped in Nx.Defn so that an EXLA-compiled execution runs the entire pass as a single fused XLA kernel. With EXLA configured as both backend and defn compiler, per-prediction wall time on lid.176 drops from roughly 200 μs to ~100 μs — about 2× over the unfused EXLA path and 6-9× over Nx.BinaryBackend. Bit-equivalent to the pre-fusion form; the test suite passes both ways.

  • The hierarchical-softmax scoring path is now also fused into the same defn graph: per-leaf paths through the Huffman tree are pre-computed at model load time and stored as fixed-shape tensors on Text.Language.Classifier.Fasttext.HuffmanTree. The recursive BEAM-side DFS (and its accompanying f32-rounding workaround) is gone. For lid.176 specifically the latency is comparable to the previous DFS approach (~125 μs vs ~110 μs) — the win materialises for larger label spaces. The simpler architecture removes a fragile spot.

  • Hex package version bumped to 0.3.0.

Removed

  • Breaking: the legacy n-gram language classifiers (Text.Language.Classifier.NaiveBayesian, CummulativeFrequency, RankOrder) and their supporting modules (Text.Language, Text.Language.Classifier, Text.Corpus, Text.Vocabulary). These required a separately-installed corpus (text_corpus_udhr) and were not competitive with the fastText classifier on inputs outside the UDHR register. Use Text.Language.Classifier.Fasttext.classify/2 and detect/3 instead.

  • The :meeseeks build-time HTML scraper dependency along with the English-inflection scraper module (Text.Inflect.Data.En) and its mix text.create_english_plurals task. Pluralization data continues to ship as a precompiled ETF blob in priv/inflection/en/en.etf; only the regeneration tooling is gone.

  • Text.Ngram.Frequency struct, Text.frequency_tuple typedef, and the Text.ensure_compiled?/1 helper. All three existed solely to support the deleted classifier behaviour and had no other callers.

[0.2.0] — 2020-06-28

Added

  • Pluralization for English words.

  • Language detection classifiers — corpora defined in separate libraries, e.g. text_corpus_udhr.

Changed

  • Refactored word counting.

[0.1.0] — 2019-08-26

Added

  • Initial version implementing ngrams.