All notable changes to this project are documented here. The format follows Keep a Changelog.
[0.6.0] — 2026-05-04
Added
- Text.Extract: twitter-text-quality URL and email extraction with full UTS #46 IDNA, IANA TLD validation, and UTR #39 single-script defence against homograph attacks. Public API is urls/2, emails/2, all/2, split/2, and autolink/2; options include :require_scheme, :tld_mode, :eai, :strict_idn, and :twitter_quirks.
- Text.Extract.split/2: splits text into an interleaved list of plain-string fragments and validated entity maps, byte-for-byte round-trippable to the original. The building block for custom rendering of extracted URLs/emails into anchors, mentions, badges, or link-preview cards.
- Text.Extract.autolink/2: wraps URLs and emails in HTML <a> anchors, returning Phoenix.HTML.safe() for drop-in Phoenix template use. Display text preserves the original Unicode (bücher.de); the href uses Punycode (xn--bcher-kva.de).
- mix text.download_tlds: refreshes the bundled IANA TLD list at priv/extract/tlds.txt. --diff previews added/removed entries; --force overwrites unconditionally.
- Text.WordCloud.to_d3_cloud/2: adapts terms/2 output into the [%{text, size}, …] shape consumed by d3-cloud. Supports :linear (default) and :sqrt sizing; shares the :font_size_range vocabulary with Text.WordCloud.Layout.
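As a rough illustration of the :linear and :sqrt sizing modes mentioned for to_d3_cloud/2, the sketch below maps term weights onto a font-size range. It is not the library's implementation: the D3CloudSketch module, the {text, weight} tuple input, the :scale option name, the default range, and the min-max normalisation are all assumptions made for this example.

```elixir
defmodule D3CloudSketch do
  # Hypothetical helper, NOT the Text.WordCloud API. Input shape, option
  # names, and the normalisation formula are assumptions for illustration.
  def to_d3_cloud(terms, opts \\ [])

  def to_d3_cloud([], _opts), do: []

  def to_d3_cloud(terms, opts) do
    scale = Keyword.get(opts, :scale, :linear)
    {min_size, max_size} = Keyword.get(opts, :font_size_range, {12, 48})

    weights = Enum.map(terms, fn {_text, weight} -> weight end)
    {lo, hi} = Enum.min_max(weights)

    Enum.map(terms, fn {text, weight} ->
      # Normalise the weight into 0.0..1.0 (all-equal weights map to 1.0).
      t = if hi == lo, do: 1.0, else: (weight - lo) / (hi - lo)
      # :sqrt compresses the gap between heavy and light terms.
      t = if scale == :sqrt, do: :math.sqrt(t), else: t
      %{text: text, size: min_size + t * (max_size - min_size)}
    end)
  end
end

terms = [{"a", 0.0}, {"b", 0.25}, {"c", 1.0}]
IO.inspect(D3CloudSketch.to_d3_cloud(terms, font_size_range: {10, 50}))
IO.inspect(D3CloudSketch.to_d3_cloud(terms, font_size_range: {10, 50}, scale: :sqrt))
```

Under these assumptions, the mid-weight term "b" gets size 20.0 with linear sizing and 30.0 with square-root sizing, which is why square-root scaling tends to keep low-weight terms legible.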
[0.5.0] — 2026-05-02
Added
- Text.Phonetic.NYSIIS: New York State Identification and Intelligence System phonetic encoding (Taft, 1970). Designed as a Soundex successor for English personal-name matching; produces pronounceable letter codes rather than digits and is more discriminating than Soundex on common name variations.
- Text.Phonetic.Cologne: Kölner Phonetik (Postel, 1969), the German-language counterpart to Soundex. Optimized for German spelling variants: Müller/Mueller/Muller and Meyer/Mayer/Maier/Meier collapse to single codes.
- Text.Phonetic.DoubleMetaphone: Lawrence Philips' Double Metaphone (2000), the de-facto standard for fuzzy English-name matching with non-Anglo origins. Returns a {primary, alternate} code pair so the same Anglicised name can match across multiple plausible pronunciations (e.g. Smith ↔ Schmidt, Catherine ↔ Katherine). Handles Germanic, Italian, Spanish, French, Greek, and Slavic patterns.
- match?/2 (and match?/3 where options apply) on every Text.Phonetic.* module for direct equality comparison without manual encode/2 == encode/2 boilerplate. Text.Phonetic.DoubleMetaphone.match?/3 checks all four primary/alternate combinations.
- Text.Clean.unaccent/1: strips diacritics and folds non-decomposable Latin letters (Þ → Th, ß → ss, Æ → AE, ł → l, đ → d) by delegating to Unicode.Transform.LatinAscii.transform/1. Also exposed as the :unaccent option on Text.Clean.clean/2.
- Text.Distance gains four set-based similarity metrics over character n-grams: jaccard/3, sorensen_dice/3, tanimoto/3 (alias for jaccard/3), and cosine/3. All accept an :n option for configurable shingle size (default 2) and operate at the grapheme level for Unicode correctness.
- Text.Inflect.En.singularize/2 and Text.Inflect.En.singularize_noun/2: invert the existing pluralizer. Combines reverse lookup of Conway's irregular tables, explicit suffix rules for unambiguous English plural forms (-ies, -shes/-ches/-xes/-zes/-sses), small whitelists for Greek-derived -is/-es plurals (analyses → analysis) and English -us plurals (geniuses → genius), and a pluralize/2 round-trip search to validate other candidates.
- Text.Readability.dale_chall/2 and Text.Readability.spache/2: the two classic word-list readability indices, backed by bundled easy-words lists in priv/readability/ (Dale-Chall 2,949 words, Spache 1,063 words; both sourced from the MIT-licensed py-readability-metrics distribution of the public-domain originals). statistics/2 now also returns :difficult_words and :unfamiliar_words counts.
- Text.Hyphenation bundles six additional language packs: de-1996, fr, es, it, nl, pt. All loaded at compile time with zero I/O, joining the existing en-us pack. Source: hyph-utf8 upstream; per-file licenses (MIT/X11/BSD/LPPL) are preserved in each .tex header.
- Text.WordFreq bundles six additional frequency tables at the same top-30,000 cap as English: de, fr, es, it, nl, pt. Source: Hermit Dave's MIT-licensed FrequencyWords OpenSubtitles 2018 corpus.
- Text.Emoji.sentiment/1 and Text.Emoji.text_sentiment/1: per-emoji and aggregate sentiment scoring backed by the bundled Emoji Sentiment Ranking v1.0 (Kralj Novak et al., 2015; CC-BY-SA 3.0; data file at priv/emoji_sentiment/emoji_sentiment_v1.csv, ~750 emoji with negative/neutral/positive proportions and an aggregate score in [-1.0, 1.0]). Aggregate scoring is occurrence-weighted to match the original paper.
- mix text.download_lemma_data <lang>... fetches lemmatization dictionaries from the michmech upstream into the Text.Data cache without requiring the per-app auto_download_lemma_data flag. Useful as a build step when shipping a release with the dictionaries pre-warmed. Pass --list to see the supported languages; --force to refresh.
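To make the set-based n-gram metrics concrete, here is a minimal sketch of Jaccard and Sørensen-Dice similarity over grapheme shingles. The NgramSimilaritySketch module and its function signatures are hypothetical and are not the Text.Distance API; returning 1.0 when both inputs yield no shingles is also an assumption.

```elixir
defmodule NgramSimilaritySketch do
  # Hypothetical module for illustration; not the library's implementation.

  # Set of overlapping n-grapheme shingles, e.g. "night" -> ni, ig, gh, ht.
  def shingles(string, n \\ 2) do
    string
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
    |> MapSet.new()
  end

  # |A ∩ B| / |A ∪ B|
  def jaccard(a, b, n \\ 2) do
    {sa, sb} = {shingles(a, n), shingles(b, n)}
    union = MapSet.size(MapSet.union(sa, sb))
    if union == 0, do: 1.0, else: MapSet.size(MapSet.intersection(sa, sb)) / union
  end

  # 2 * |A ∩ B| / (|A| + |B|)
  def sorensen_dice(a, b, n \\ 2) do
    {sa, sb} = {shingles(a, n), shingles(b, n)}
    total = MapSet.size(sa) + MapSet.size(sb)
    if total == 0, do: 1.0, else: 2 * MapSet.size(MapSet.intersection(sa, sb)) / total
  end
end

IO.inspect(NgramSimilaritySketch.sorensen_dice("night", "nacht"))
# => 0.25
```

Working on graphemes rather than bytes or codepoints means a decomposed "é" (e plus a combining accent) counts as one character, so "café" produces three bigrams regardless of its normalisation form.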
Changed
- Text.Lemma moduledoc now enumerates the upstream-available languages (~20 languages from the michmech project) and notes that no Dutch (nl) dictionary exists upstream. Bundling the non-English dictionaries was evaluated and deferred: the smallest of them (French, 4.7 MB raw) would by itself push the package near Hex's 8 MB limit. Use the new mix text.download_lemma_data task or set auto_download_lemma_data: true to populate the cache.
Fixed
- Text.Inflect.En.Helpers.replace_suffix/3 now replaces only the trailing suffix instead of all repeated trailing occurrences, fixing cases like theses (which previously transformed to thisis instead of thesis because both es occurrences were rewritten). Affects rule output where the suffix repeats inside the base word.
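The behaviour described in this fix can be reproduced with two standard-library functions. The sketch below is only an illustration of how such a bug can arise; the changelog does not state how replace_suffix/3 was or is implemented, and the SuffixSketch module is hypothetical.

```elixir
defmodule SuffixSketch do
  # Hypothetical module; shows the two behaviours, not the library's code.

  # String.replace_trailing/3 rewrites EVERY repeated trailing occurrence.
  def replace_all_trailing(word, suffix, replacement),
    do: String.replace_trailing(word, suffix, replacement)

  # String.replace_suffix/3 rewrites only the final occurrence and returns
  # the word unchanged when it does not end with the suffix.
  def replace_suffix(word, suffix, replacement),
    do: String.replace_suffix(word, suffix, replacement)
end

IO.inspect(SuffixSketch.replace_all_trailing("theses", "es", "is"))
# => "thisis"
IO.inspect(SuffixSketch.replace_suffix("theses", "es", "is"))
# => "thesis"
```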
[0.4.0] — 2026-05-01
Added
- Text.Truecase: case restoration for ALL-CAPS or lowercased text using POS-aware heuristics for proper nouns, acronyms, and sentence starts.
- Text.Clean: pipeline-style normalization (whitespace, control characters, smart-quotes, dashes, NFC/NFKC) with a composable clean/2 API.
- Text.Emoji: emoji detection, stripping, and counting. Uses the :unicode package's emoji property tables; no external data required.
- Text.Hyphenation: Knuth–Liang TeX-pattern hyphenation. Ships en-US patterns (~5k); other languages can be loaded via Text.Hyphenation.Parser from any hyph-*.tex file.
- Text.PII: pattern-based detection and redaction of phone numbers, emails, credit-card-shaped digits, IBANs, IPv4/IPv6, and US SSNs.
- Text.Spell: Norvig-style edit-distance spelling suggestions backed by Text.WordFreq. Returns ranked candidates with their corpus frequency.
- Text.Summarize: extractive summarization via a sentence-graph TextRank with configurable similarity (:cosine or :jaccard) and target length.
- Text.Syllable: English syllable counting using a vowel-group heuristic with override exceptions. Used as the per-word syllable signal feeding Text.Readability.
- Text.Readability: Flesch, Flesch–Kincaid, Gunning-Fog, SMOG, Coleman–Liau, ARI, and Linsear-Write scores plus a unified analyze/2 summary.
- Text.WordFreq: frequency lookup over a 30k-word English corpus shipped in priv/wordfreq/en.tsv. Provides rank/2, frequency/2, is_common?/2, and top/2.
- Text.Lemma: dictionary-based lemmatization. Ships an en-US table of ~42k inflected→base mappings; lookup/2 falls back to the input when no entry exists.
- Text.Inflect.En.Pluralize and Text.Inflect.En.Singularize: English noun inflection covering ~1.6 KLoC of irregular-form rules and exceptions, with Text.Inflect.En.Helpers for shared morphology utilities.
- Text.Sentiment.Lexicons.AFINN now ships sentiment lexicons for 104 languages (up from 7), an Emoji Sentiment Ranking 1.0 lexicon (:emoji, ~840 entries derived from the upstream corpus and rescaled onto AFINN's −5..+5 integer range), and per-language negator lists (negators/1). The seven hand-curated 0.3.0 lexicons (:en, :da, :fi, :fr, :pl, :sv, :tr) are preserved unchanged; the other ~95 are upstream machine-translated and ship as a baseline.
- Text.Sentiment.Backends.Lexicon automatically resolves per-language negators from Text.Sentiment.Lexicons.AFINN.negators/1 based on the requested :language option, so non-English text gets negation handling out of the box. Callers can still override with an explicit :negators list.
- mix text.gen_afinn_lexicons regenerates priv/sentiment/ from the vendored data/affin/ source files. Hand-curated TSVs are preserved unless --overwrite is passed.
Changed
- The :unicode_string dependency requirement is ~> 2.1. The 2.1 release replaces its regex evaluator with a single-pass DFA engine; benchmarks show ~17× faster word-cloud builds for typical English prose, with linear (rather than O(N²)) scaling on long unbroken inputs.
- Text.Word.word_count/2 documentation now explicitly calls out that the default &String.split/1 splitter does not implement UAX #29 segmentation and does not work for languages without inter-word whitespace (Chinese, Japanese, Korean, Thai, Lao, Khmer, Burmese). Examples show how to pass a UAX-aware or dictionary-aware splitter for those cases.
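The whitespace-splitter limitation is easy to see in isolation. The sketch below uses a hypothetical WordCountSketch module with a pluggable splitter and a toy greedy longest-match dictionary splitter; neither is the Text.Word API or its documented example, and the four-entry dictionary is invented for the demonstration.

```elixir
defmodule WordCountSketch do
  # Hypothetical stand-in for a word counter that accepts a splitter function.
  def word_count(text, splitter \\ &String.split/1) do
    text |> splitter.() |> Enum.frequencies()
  end

  # Toy dictionary-aware splitter: greedy longest match, falling back to a
  # single grapheme when no dictionary entry matches.
  def dictionary_split("", _dictionary), do: []

  def dictionary_split(text, dictionary) do
    graphemes = String.graphemes(text)

    len =
      Enum.find(length(graphemes)..1//-1, 1, fn n ->
        graphemes |> Enum.take(n) |> Enum.join() |> then(&MapSet.member?(dictionary, &1))
      end)

    {head, tail} = Enum.split(graphemes, len)
    [Enum.join(head) | dictionary_split(Enum.join(tail), dictionary)]
  end
end

dictionary = MapSet.new(["私", "は", "学生", "です"])

# Whitespace splitting treats the whole sentence as one "word".
IO.inspect(WordCountSketch.word_count("私は学生です"))
# => %{"私は学生です" => 1}

# A dictionary-aware splitter recovers the individual words.
IO.inspect(WordCountSketch.dictionary_split("私は学生です", dictionary))
# => ["私", "は", "学生", "です"]
```

A production splitter would rely on a real segmentation dictionary or a UAX #29 implementation; greedy longest match is only the simplest scheme that shows the idea.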
[0.3.0] — 2026-04-29
Added
- Text.WordCloud: multilingual keyword extraction returning a weighted term list suitable for rendering as a word cloud. Six backends: YAKE! (default, unsupervised statistical), frequency, RAKE, TextRank, TF-IDF (requires :reference_corpus), and KeyBERT (neural, requires :bumblebee). The :stem option (requires the optional :text_stemmer dependency) buckets morphological variants (demolish, demolished, demolishing) into a single entry labelled with the most-frequent surface form.
- Text.WordCloud.Layout: Wordle-style Archimedean-spiral packing that produces renderer-agnostic (x, y, width, height, font_size, rotation) placements. Pluggable :font_metrics callback so callers can supply pixel-accurate metrics from their actual font stack.
- Text.WordCloud.SVG: renders placements as a self-contained SVG document. Pluggable :palette (list of hex strings, a Color.Palette.Tonal scale, a Color.Palette.Theme, or nil for single-colour) plus three mapping strategies (:by_weight, :by_index, :by_hash). Hex-string palettes work without optional deps; Color.Palette structs require the optional :color dependency.
- Text.Stopwords: bundled multilingual stopword lists from stopwords-iso (~60 languages, MIT license). Public API: for/1, contains?/2, available_languages/0, available?/1, union/2, extend/2. Generation tooling lives in mix text.gen_stopwords.
- mix text.download_models --keybert: pre-fetches the multilingual MiniLM sentence-transformer used by Text.WordCloud.Backends.KeyBERT (~470 MB). The --bumblebee shorthand now includes --keybert alongside --sentiment --pos --ner.
- Text.POS: part-of-speech tagging via the optional :bumblebee dependency. English by default (vblagoje/bert-english-uncased-finetuned-pos); override :model for other checkpoints. Returns coarse-grained tag atoms (:noun, :verb, :adj, …) with confidence scores.
- Text.NER: named-entity recognition via the optional :bumblebee dependency. Multilingual by default (Davlan/bert-base-multilingual-cased-ner-hrl, 10 high-resource languages, CoNLL-2003 tag set). Returns Text.NER.Entity structs with span byte offsets, type atom (:per, :org, :loc, :misc), and score.
- Text.Embedding: load pre-trained word vectors in fastText .vec format. Exposes vector/2, similarity/3, nearest/3, and analogy/5 over an L2-normalised Nx matrix. Supports :filter and :max_tokens options for partial loads.
- Text.Language.Classifier.Fasttext.ScriptDetector.han_variant/1: disambiguates Simplified (:Hans) from Traditional (:Hant) Chinese using a curated codepoint-frequency analysis. detect/1 now returns :Hans or :Hant directly for Han text when the input is unambiguous, falling back to :Hani otherwise. The script signal flows through to Text.Language.Classifier.Fasttext.Locale.resolve/2, producing zh-Hans-CN vs zh-Hant-TW automatically.
- Text.Language.normalize/1 and Text.Language.to_locale_string/1: every public function in the package that takes a :language or :locale option now accepts an atom, a string (BCP-47 or otherwise), or a Localize.LanguageTag struct (when the optional :localize dependency is loaded). The new helpers normalise to the language subtag (atom) or to a canonical BCP-47 string respectively.
- Text.Sentiment.Backend behaviour with two shipped backends: Text.Sentiment.Backends.Lexicon (the default; lexicon-based, multilingual via AFINN, always available) and Text.Sentiment.Backends.Bumblebee (optional; neural via Bumblebee and XLM-RoBERTa, requires :bumblebee and :exla deps). Routing via the :backend option to Text.Sentiment.analyze/2 or globally via the :sentiment_backend application configuration.
- Text.Sentiment: multilingual lexicon-based sentiment analysis. Returns a label (:positive, :negative, :neutral), a normalised compound score, and the matched-token count. Handles negation ("not good" flips polarity) and intensifiers ("very good" boosts) via VADER-style scalars.
- Text.Sentiment.Lexicons.AFINN: bundled AFINN sentiment lexicons (Apache 2.0) for English, Danish, Finnish, French, Polish, Swedish, and Turkish, plus a language-agnostic emoticon lexicon. Routed automatically by Text.Sentiment.analyze/2's :language option.
- Text.Sentiment.lexicon_for/2: composes a per-language lexicon with the emoticon lexicon and/or domain-specific overrides.
- Text.Language.Classifier.Fasttext: a pure-Elixir port of fastText's lid.176 language identification model. Validated bit-for-bit against the official C++/Python reference for hashing, subword extraction, feature assembly, and tree traversal. See the README for usage.
- Text.Language.Classifier.Fasttext.ModelLoader.load/2 parses an lid.176.bin file (~126 MB) into a typed Model struct with the input/output matrices held as Nx tensors.
- Text.Language.Classifier.Fasttext.detect/3, classify/2, and to_locale/2 for the public detection API.
- Text.Language.Classifier.Fasttext.ScriptDetector for Unicode-script-of-text classification, used to disambiguate multi-script locales (e.g. sr-Latn vs sr-Cyrl). Backed by the unicode Hex package.
- Text.Language.Classifier.Fasttext.Locale.resolve/2 for CLDR-canonical locale assembly via likely-subtags. Uses the optional localize dependency when present, with a built-in fallback table for the most common languages otherwise.
- mix text.download_lid176 task that fetches lid.176.bin into priv/lid_176/. The model file is gitignored and not part of the Hex package.
- mix text.download_models task (plural) that pre-fetches every external model used by :text (lid.176.bin plus the default Hugging Face checkpoints behind Text.Sentiment.Backends.Bumblebee, Text.POS, and Text.NER) for production environments that need every artefact present at boot. Selection flags (--lid176, --sentiment, --pos, --ner, --bumblebee) limit the download to a subset.
- mix text.gen_subword_fixtures, mix text.gen_features_fixtures, mix text.gen_predict_fixtures (via priv/scripts/*.py) for regenerating the differential test fixtures against the reference fasttext Python bindings.
- docs/lid176_binary_format.md: full byte-layout specification of fastText's model file, derived from the C++ source.
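For readers unfamiliar with VADER-style scalars, the sketch below shows one way negation and intensifiers can adjust a lexicon valence. It is an assumption-laden illustration, not the Text.Sentiment implementation: the NegationSketch module, the toy lexicon and word lists, and the two-token look-back window are invented here, while the 0.74 damping and 0.293 boost are the constants used by the VADER reference implementation.

```elixir
defmodule NegationSketch do
  # Hypothetical module with toy data; not the library's lexicon or logic.
  @lexicon %{"good" => 3, "bad" => -3, "great" => 4}
  @negators ["not", "never", "no"]
  @intensifiers ["very", "really"]

  # VADER multiplies a negated valence by -0.74 and boosts by 0.293.
  @negation_damping 0.74
  @boost 0.293

  # Sums the adjusted valence of every lexicon word in the text.
  def score(text) do
    tokens = text |> String.downcase() |> String.split()

    tokens
    |> Enum.with_index()
    |> Enum.reduce(0.0, fn {token, i}, acc ->
      case Map.fetch(@lexicon, token) do
        {:ok, valence} ->
          # Look back at up to two preceding tokens (assumed window size).
          preceding = Enum.slice(tokens, max(i - 2, 0), min(i, 2))
          acc + adjust(valence, preceding)

        :error ->
          acc
      end
    end)
  end

  defp adjust(valence, preceding) do
    boosted =
      if Enum.any?(preceding, &(&1 in @intensifiers)),
        do: valence + sign(valence) * @boost,
        else: valence * 1.0

    if Enum.any?(preceding, &(&1 in @negators)),
      do: -(boosted * @negation_damping),
      else: boosted
  end

  defp sign(valence) when valence < 0, do: -1
  defp sign(_valence), do: 1
end

IO.inspect(NegationSketch.score("good"))
IO.inspect(NegationSketch.score("not good"))
IO.inspect(NegationSketch.score("very good"))
```

With this toy setup, "not good" scores below zero and "very good" scores above plain "good", which is the polarity flip and boost the entry describes.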
Changed
- The minimum Elixir version is now ~> 1.17 (raised from ~> 1.8). All development and testing targets Elixir 1.20 on Erlang/OTP 28.
- Added required dependencies on :nx and :unicode. Optional dependencies on :exla (recommended for inference performance) and :localize (for CLDR-canonical locale resolution).
- The fastText inference forward pass (take + mean + dot, plus the softmax tail for softmax-loss models) is now wrapped in Nx.Defn so that an EXLA-compiled execution runs the entire pass as a single fused XLA kernel. With EXLA configured as both backend and defn compiler, per-prediction wall time on lid.176 drops from roughly 200 μs to ~100 μs, about 2× over the unfused EXLA path and 6-9× over Nx.BinaryBackend. Bit-equivalent to the pre-fusion form; the test suite passes both ways.
- The hierarchical-softmax scoring path is now also fused into the same defn graph: per-leaf paths through the Huffman tree are pre-computed at model load time and stored as fixed-shape tensors on Text.Language.Classifier.Fasttext.HuffmanTree. The recursive BEAM-side DFS (and its accompanying f32-rounding workaround) is gone. For lid.176 specifically the latency is comparable to the previous DFS approach (~125 μs vs ~110 μs); the win materialises for larger label spaces. The simpler architecture removes a fragile spot.
- Hex package version bumped to 0.3.0.
Removed
- Breaking: the legacy n-gram language classifiers (Text.Language.Classifier.NaiveBayesian, CummulativeFrequency, RankOrder) and their supporting modules (Text.Language, Text.Language.Classifier, Text.Corpus, Text.Vocabulary). These required a separately-installed corpus (text_corpus_udhr) and were not competitive with the fastText classifier on inputs outside the UDHR register. Use Text.Language.Classifier.Fasttext.classify/2 and detect/3 instead.
- The :meeseeks build-time HTML scraper dependency along with the English-inflection scraper module (Text.Inflect.Data.En) and its mix text.create_english_plurals task. Pluralization data continues to ship as a precompiled ETF blob in priv/inflection/en/en.etf; only the regeneration tooling is gone.
- Text.Ngram.Frequency struct, Text.frequency_tuple typedef, and the Text.ensure_compiled?/1 helper. All three existed solely to support the deleted classifier behaviour and had no other callers.
[0.2.0] — 2020-06-28
Added
- Pluralization for English words.
- Language detection classifiers, with corpora defined in separate libraries, e.g. text_corpus_udhr.
Changed
- Refactored word counting.
[0.1.0] — 2019-08-26
Added
- Initial version implementing ngrams.