Text.WordCloud.Backends.YAKE (Text v0.6.0)


YAKE! (Yet Another Keyword Extractor) backend for Text.WordCloud.

Implements the unsupervised, statistical keyword-extraction algorithm described in Campos et al., Information Sciences 509, 2020. YAKE! computes five per-word features (casing, position, frequency, relatedness to context, sentence dispersion) entirely from the input document, then composes them into n-gram candidate scores. No reference corpus or trained model is required — this is what makes it the right default for a multilingual word-cloud library.

The algorithm's only language-specific dependency is the stopword list, supplied via Text.Stopwords.for/1 (or the caller's :stopwords override). YAKE!'s own design treats stopwords as phrase-boundary markers and as low-content interior fillers, so a good list directly improves output quality.

Score direction

YAKE!'s published score is "lower = more important". This module inverts internally before returning, so the value passed to the orchestrator is the standard "higher = more important" form every other backend uses.
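As a rough illustration of the inversion, the sketch below maps a raw "lower is better" score onto a "higher is better" scale with a monotone transform. The `ScoreSketch` module, the `1 / (1 + s)` formula, and the sample scores are all assumptions made for illustration; the transform this module actually applies may differ.

```elixir
defmodule ScoreSketch do
  # Hypothetical helper, not part of Text.WordCloud.
  # Maps a non-negative raw YAKE!-style score (lower = more important) to a
  # value in (0, 1] where higher = more important. Any strictly decreasing
  # function would preserve the ranking; 1 / (1 + s) is just one choice.
  def invert(raw) when is_number(raw) and raw >= 0, do: 1.0 / (1.0 + raw)
end

# Made-up raw scores, for illustration only.
raw_scores = [{"text", 0.35}, {"keyword extraction", 0.02}, {"library", 0.9}]

ranked =
  raw_scores
  |> Enum.map(fn {term, raw} -> {term, ScoreSketch.invert(raw)} end)
  |> Enum.sort_by(fn {_term, score} -> score end, :desc)
```

After inversion, sorting in descending order puts the term with the lowest raw score first, which is the ordering the orchestrator expects from every backend.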

Options

  • :ngram_range - {min, max} candidate length. Defaults to {1, 3} (the YAKE! paper's default).

  • :window_size — neighbour-context window for the relatedness feature. Defaults to 1 (immediate neighbours), matching the reference implementation.

Standard Text.WordCloud orchestrator options (:language, :stopwords, :case_fold, :locale) are honoured.
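To make the effect of :window_size concrete, the following sketch collects, for each token, the distinct tokens that appear within the window to its left and right. The relatedness feature in the paper is built from this kind of left/right context information. `WindowSketch` and `neighbours/2` are hypothetical names, and the sketch assumes the tokens are already case-folded and come from a single sentence; it is not this module's implementation.

```elixir
defmodule WindowSketch do
  # Hypothetical helper, not part of Text.WordCloud.
  # Returns %{token => %{left: MapSet, right: MapSet}} holding the distinct
  # tokens seen up to `window` positions before and after each occurrence.
  def neighbours(tokens, window \\ 1) do
    tokens
    |> Enum.with_index()
    |> Enum.reduce(%{}, fn {token, i}, acc ->
      # Clamp the left edge at the start of the list.
      start = max(i - window, 0)
      left = MapSet.new(Enum.slice(tokens, start, i - start))
      # Enum.slice/3 clips at the end of the list on its own.
      right = MapSet.new(Enum.slice(tokens, i + 1, window))

      Map.update(acc, token, %{left: left, right: right}, fn seen ->
        %{left: MapSet.union(seen.left, left), right: MapSet.union(seen.right, right)}
      end)
    end)
  end
end

tokens = ~w(fast keyword extraction for fast search)
contexts = WindowSketch.neighbours(tokens, 1)
```

With a window of 1, the token "fast" ends up with "for" in its left context and both "keyword" and "search" in its right context. A larger window widens both sets.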

Caveats

This is a faithful but simplified port of the algorithm. The five features are computed exactly as in the paper, but candidate generation uses the stricter rule that a phrase must start and end with a non-stopword, rather than the paper's full composition rules. In practice this produces output that correlates well with the reference Python implementation (LIAAD/yake); a differential-fixture test against that implementation is planned as a follow-up.
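The stricter candidate rule can be sketched as follows: enumerate every n-gram in the configured length range and keep only those whose first and last tokens are not stopwords. Stopwords may still appear in the interior of a phrase. `CandidateSketch` and `candidates/3` are hypothetical names, and the sketch assumes case-folded tokens from a single sentence; the module's real candidate generation may differ in detail.

```elixir
defmodule CandidateSketch do
  # Hypothetical helper, not part of Text.WordCloud.
  # Returns all n-grams (as token lists) with min_n <= n <= max_n whose
  # first and last tokens are both non-stopwords.
  def candidates(tokens, stopwords, {min_n, max_n}) do
    stop = MapSet.new(stopwords)

    for n <- min_n..max_n//1,
        # Sliding window of length n, step 1; short trailing chunks dropped.
        gram <- Enum.chunk_every(tokens, n, 1, :discard),
        not MapSet.member?(stop, hd(gram)),
        not MapSet.member?(stop, List.last(gram)) do
      gram
    end
  end
end

tokens = ~w(extraction of keywords from text)
phrases = CandidateSketch.candidates(tokens, ~w(of from), {1, 3})
```

Here "extraction of keywords" and "keywords from text" survive because the stopword sits in the interior, while "of keywords" and "keywords from" are rejected at the boundary.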