Builds a weighted list of terms suitable for rendering as a word cloud.
The function returns a list of %{term, weight, count, kind} maps
sorted by :weight (descending). The top term always has weight
1.0; every other weight is normalised relative to it. Visual
layout — placing the words on a canvas — is handled separately by
Text.WordCloud.Layout.
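The normalisation described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `WeightSketch`, `normalise/1`, and the `:score` key are hypothetical names, and the sketch assumes raw scores are positive with higher meaning more important.

```elixir
defmodule WeightSketch do
  # Hypothetical helper, not part of Text.WordCloud: rescales raw scores
  # so that the highest-scoring term ends up with weight 1.0.
  def normalise([]), do: []

  def normalise(scored) do
    # Assumes every score is positive, so max is never zero.
    max = scored |> Enum.map(& &1.score) |> Enum.max()

    scored
    |> Enum.map(fn %{score: score} = entry ->
      entry |> Map.delete(:score) |> Map.put(:weight, score / max)
    end)
    |> Enum.sort_by(& &1.weight, :desc)
  end
end

raw = [
  %{term: "mat", score: 1.0, count: 1, kind: :word},
  %{term: "cat", score: 4.0, count: 3, kind: :word},
  %{term: "cat sat", score: 2.0, count: 1, kind: :phrase}
]

# "cat" comes first with weight 1.0, then "cat sat" (0.5), then "mat" (0.25).
WeightSketch.normalise(raw)
```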
Supports several scoring algorithms via the :scoring option;
:yake (the default) requires no reference corpus and is
multilingual by construction. See the Text.WordCloud.Backends.*
modules for the catalogue.
Multilingual end-to-end:
- Tokenisation runs through Text.Segment.words/2 (Unicode UAX #29).
- Sentence segmentation uses Text.Segment.sentences/2.
- Stopwords come from the bundled Text.Stopwords (~60 languages) via the :stopwords option.
- Language is auto-detected with Text.Language.Classifier.Fasttext when :language is unset, falling back to no language-specific behaviour if the classifier is not available.
Types
@type term_entry() :: %{ term: String.t(), weight: float(), count: pos_integer(), kind: :word | :phrase }
A scored term, ready for rendering.
Functions
@spec terms( String.t() | [String.t()], keyword() ) :: [term_entry()]
Returns a weighted list of terms for text suitable for word-cloud rendering.
Arguments
- text is a UTF-8 string or a list of strings. A list is treated as a corpus of independent documents.
Options
- :scoring selects the scoring backend: :yake (default), :frequency, :tf_idf, :rake, :text_rank, :key_bert, or any module implementing Text.WordCloud.Backend.
- :max_terms caps the number of returned entries. Default 100.
- :min_count drops terms occurring fewer times than this. Default 1.
- :ngram_range is the {min, max} token length for candidate terms. The default depends on the backend ({1, 3} for YAKE, {1, 1} for Frequency).
- :language is an atom, a BCP-47 string, or a Localize.LanguageTag. Default nil (no language-specific behaviour). Pass {:auto, model} to auto-detect via a pre-loaded Text.Language.Classifier.Fasttext.Model. The orchestrator does not load the fastText model itself, so callers wanting detection load it once at boot and hand it in.
- :stopwords is :auto (use the bundled list for the resolved language; default), :none, a list, a MapSet, or {:extend, [extra]} to add to the bundled list.
- :case_fold is a boolean. Default true.
- :stem is a boolean. Default false. When true, candidate terms are bucketed by their Snowball stem so morphological variants (demolish, demolished, demolishing, demolition) collapse into a single entry. The most-frequent surface form represents the bucket; counts and raw scores are summed across members. Requires the optional :text_stemmer dependency. The stemmer language defaults to the resolved :language; override with :stem_language.
- :stem_language is an atom override for the stemmer language. Useful when the corpus language differs from the bucketing language (e.g. mixed-language text where you want only English variants consolidated). Defaults to :language.
- :include is :all (default), :words only, or :phrases only.
- :reference_corpus is used by :tf_idf and :log_likelihood.
Returns
- A list of %{term, weight, count, kind} maps sorted by :weight descending. The top entry has weight: 1.0.
Examples
iex> text = "the cat sat on the mat. the cat ran. the cat slept."
iex> [first | _] = Text.WordCloud.terms(text, scoring: :frequency, language: :en, max_terms: 3)
iex> first.term
"cat"
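The bucketing behaviour described for the :stem option can be illustrated with a small, self-contained sketch. Everything here is hypothetical: `StemBucketSketch`, `bucket/2`, and `toy_stem/1` are illustrative names, and the toy stemmer simply keeps the first six characters as a stand-in for a real Snowball stemmer. It only shows the grouping rule (most frequent surface form wins, counts and raw scores are summed), not how the library does it internally.

```elixir
defmodule StemBucketSketch do
  # Toy stand-in for a Snowball stemmer (assumption): keeps the first six
  # characters. A real stemmer applies language-aware rules instead.
  def toy_stem(word), do: String.slice(word, 0, 6)

  # Groups %{term, count, score} candidates by stem. Each bucket is
  # represented by its most frequent surface form; counts and raw scores
  # are summed across the bucket's members.
  def bucket(candidates, stem_fun \\ &toy_stem/1) do
    candidates
    |> Enum.group_by(fn candidate -> stem_fun.(candidate.term) end)
    |> Enum.map(fn {_stem, members} ->
      representative = Enum.max_by(members, & &1.count)

      %{
        term: representative.term,
        count: members |> Enum.map(& &1.count) |> Enum.sum(),
        score: members |> Enum.map(& &1.score) |> Enum.sum()
      }
    end)
  end
end

candidates = [
  %{term: "demolish", count: 2, score: 0.5},
  %{term: "demolished", count: 5, score: 1.0},
  %{term: "demolition", count: 1, score: 0.25},
  %{term: "cat", count: 3, score: 0.75}
]

# The three "demoli…" variants collapse into one entry represented by
# "demolished" (count 8, score 1.75); "cat" stays on its own.
StemBucketSketch.bucket(candidates)
```

Passing the stemmer as a function keeps the sketch independent of any particular stemming library.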
@spec to_d3_cloud( [term_entry()], keyword() ) :: [ %{ text: String.t(), size: float(), weight: float(), count: pos_integer(), kind: :word | :phrase } ]
Converts scored terms into the shape consumed by d3-cloud.
d3-cloud expects an array of {text, size} records and runs its
Wordle-style layout in the browser. This adapter maps each entry's
:weight to a pixel font size using the same :font_size_range
vocabulary as Text.WordCloud.Layout, so a server-rendered SVG and a
client-rendered d3-cloud will scale identically.
The original :weight, :count, and :kind fields are passed through
unchanged. d3-cloud ignores them but exposes the full datum to its
text, fontSize, fontWeight, and rotate callbacks, so consumers
can read e.g. d.count for tooltips with no extra plumbing.
Arguments
- terms is the output of Text.WordCloud.terms/2 (or any list of %{term, weight, count, kind} maps).
Options
- :font_size_range is a {min, max} pixel tuple. Weight 1.0 maps to max, weight 0.0 maps to min. Default {12, 96}.
- :scale is :linear (default) or :sqrt. :sqrt produces area-proportional sizing, which is the convention most d3-cloud examples use. :linear matches Text.WordCloud.Layout's behaviour.
Returns
- A list of %{text, size, weight, count, kind} maps sorted by :size descending. The :text and :size keys are what d3-cloud consumes; the rest are passed through for callbacks.
Examples
iex> terms = [
...> %{term: "elixir", weight: 1.0, count: 5, kind: :word},
...> %{term: "phoenix", weight: 0.5, count: 2, kind: :word}
...> ]
iex> Text.WordCloud.to_d3_cloud(terms, font_size_range: {10, 100})
[
%{text: "elixir", size: 100.0, weight: 1.0, count: 5, kind: :word},
%{text: "phoenix", size: 55.0, weight: 0.5, count: 2, kind: :word}
]
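The weight-to-size mapping behind :font_size_range and :scale can be sketched as a simple interpolation. `FontSizeSketch` and `size/3` are hypothetical names used only for illustration. The :linear branch reproduces the 55.0 shown in the example above; the :sqrt branch assumes the square root is applied to the weight before interpolating, which may differ from what the library actually does.

```elixir
defmodule FontSizeSketch do
  # Maps a weight in 0.0..1.0 to a pixel size within {min, max}:
  # weight 0.0 yields min, weight 1.0 yields max.
  def size(weight, {min, max}, scale \\ :linear) do
    min + scaled(weight, scale) * (max - min)
  end

  defp scaled(weight, :linear), do: weight
  # Assumption: :sqrt takes the square root of the weight before
  # interpolating, so a word's area tracks its weight.
  defp scaled(weight, :sqrt), do: :math.sqrt(weight)
end

FontSizeSketch.size(1.0, {10, 100})          # 100.0
FontSizeSketch.size(0.5, {10, 100})          # 55.0
FontSizeSketch.size(0.25, {10, 100}, :sqrt)  # 55.0
```

Under this assumption a term with a quarter of the top weight renders at the same size under :sqrt as a term with half the top weight under :linear, which is why :sqrt tends to keep low-weight terms legible.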