Text.WordCloud.Backends.TFIDF (Text v0.6.0)


TF-IDF backend for Text.WordCloud.

Scores each candidate term t as tf(t) * idf(t), where tf(t) is the raw count of t in the foreground text and idf(t) is the inverse document frequency of t over a user-supplied reference corpus. This is the classical "what is distinctive about this document?" scorer: it surfaces terms that are common in the foreground but rare across the background.

Foreground vs background

  • The first argument to Text.WordCloud.terms/2 is the foreground: a single string or a list of strings (treated as one document).

  • The reference corpus is supplied via the :reference_corpus option, either as a list of background documents (TF-IDF computes IDF over them) or as a precomputed %{term => idf} map.

Without a reference corpus the backend falls back to IDF = 1.0 for every term, which reduces the result to a plain frequency cloud and is rarely what you want. The orchestrator emits a warning via IO.warn/2 in that case.
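The foreground handling and the no-reference fallback can be sketched as follows. This is a minimal illustration, not the backend's implementation: the ForegroundSketch module, its function names, the tokenizer, and the default_idf argument are all assumptions made for the example.

```elixir
defmodule ForegroundSketch do
  # Hypothetical helper for illustration; not part of Text.WordCloud.

  # A list of strings is joined and counted as a single document.
  def term_counts(foreground) when is_list(foreground) do
    foreground |> Enum.join(" ") |> term_counts()
  end

  def term_counts(foreground) when is_binary(foreground) do
    foreground
    |> String.downcase()
    # Assumed tokenizer: split on anything that is not a letter or digit.
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> Enum.frequencies()
  end

  # score(t) = tf(t) * idf(t). Terms missing from `idf` use `default_idf`
  # (an assumption of this sketch, defaulting to 1.0).
  def scores(foreground, idf \\ %{}, default_idf \\ 1.0) do
    foreground
    |> term_counts()
    |> Map.new(fn {term, tf} -> {term, tf * Map.get(idf, term, default_idf)} end)
  end
end

# No reference corpus: every IDF is 1.0, so scores equal raw counts.
ForegroundSketch.scores(["the cat sat", "the cat"])

# With a precomputed %{term => idf} map, a term with IDF 0.0 scores zero.
ForegroundSketch.scores("the cat sat", %{"the" => 0.0, "cat" => 0.7, "sat" => 1.4})
```

In the first call "the" ties with "cat" for the top score, which is the frequency-cloud behaviour the warning is about.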

Smoothing

Uses smoothed IDF of the form idf(t) = log(N / (1 + df_t)), where:

  • N = number of reference documents.
  • df_t = number of reference documents containing term t.

Terms unseen in the reference have df_t = 0 and so get IDF = log(N / 1) = log(N), the largest value the formula can yield, which gives them a sensible high score rather than zero.
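As a rough sketch of the formula above, the following computes log(N / (1 + df_t)) over a list of reference documents using the natural logarithm. The IdfSketch module, its tokenizer, and the {idf_map, unseen_idf} return shape are hypothetical and chosen only for illustration; the backend's internals may differ.

```elixir
defmodule IdfSketch do
  # Hypothetical helper illustrating log(N / (1 + df_t)); not the library's code.

  # Assumed tokenizer: lowercase, split on non-letter/non-digit characters.
  def tokenize(text) do
    text
    |> String.downcase()
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
  end

  # Returns {idf_map, unseen_idf} for a non-empty list of reference documents.
  def from_documents([_ | _] = docs) do
    n = length(docs)

    # df_t: number of documents containing t (each document counts once).
    df =
      docs
      |> Enum.flat_map(fn doc -> doc |> tokenize() |> Enum.uniq() end)
      |> Enum.frequencies()

    idf = Map.new(df, fn {term, count} -> {term, :math.log(n / (1 + count))} end)

    # Unseen terms have df_t = 0, so their IDF is log(N / 1) = log(N).
    {idf, :math.log(n)}
  end
end

{idf, unseen_idf} = IdfSketch.from_documents(["the cat", "the dog", "the bird", "a fish"])
# idf["the"] is log(4 / 4), idf["cat"] is log(4 / 2), unseen_idf is log(4).
```

Note that under this formula a term appearing in every reference document gets log(N / (N + 1)), which is slightly negative; whether the backend clamps such values is not stated here.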

Defaults

  • :ngram_range defaults to {1, 1} for this backend — IDF over multi-token phrases is rarely meaningful unless the reference corpus is large enough that phrases recur. Override explicitly if you have such a corpus.
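To make the effect of the range concrete, here is a small sketch of n-gram generation, assuming :ngram_range is a {min, max} tuple of phrase lengths in tokens. The NgramSketch module is hypothetical and is not the library's tokenizer.

```elixir
defmodule NgramSketch do
  # Hypothetical helper; assumes the range is {min_n, max_n} in tokens.
  def ngrams(tokens, {min_n, max_n}) when min_n >= 1 and max_n >= min_n do
    Enum.flat_map(min_n..max_n, fn n ->
      tokens
      # Sliding window of n tokens; incomplete trailing windows are dropped.
      |> Enum.chunk_every(n, 1, :discard)
      |> Enum.map(&Enum.join(&1, " "))
    end)
  end
end

tokens = ["new", "york", "city"]

# {1, 1}: single tokens only.
NgramSketch.ngrams(tokens, {1, 1})

# {1, 2}: single tokens plus adjacent pairs such as "new york".
NgramSketch.ngrams(tokens, {1, 2})
```

Each added phrase needs its own document frequency in the reference corpus, which is why longer ranges only pay off when that corpus is large enough for phrases to recur.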

Standard Text.WordCloud orchestrator options (:language, :stopwords, :case_fold, :locale) are honoured.