Text.Word (Text v0.6.0)


Implements word counting for strings, lists, streams and flows.

Tokenization

Word counting separates a text into tokens via a caller-supplied splitter function. The default splitter is String.split/1, which splits only on Unicode whitespace. This default has two important properties to be aware of:

  • It does not implement Unicode word segmentation (UAX #29). String.split/1 is a fast whitespace split and does not apply the boundary rules in UAX #29; for example, it keeps don't, co-operate, U.S., and 1,200 as single tokens, whereas UAX #29 would split some of them (co-operate at the hyphen, for instance). For frequency counting this is usually the desired behaviour, but if you need standards-compliant boundaries (e.g. for cursor movement, search highlighting, or linguistic analysis), pass an explicit splitter that delegates to Unicode.String.split/2. See the examples below.

  • It does not work for languages that don't use whitespace between words. Chinese, Japanese, Korean, Thai, Lao, Khmer, and Burmese (Myanmar) write running text without spaces. Calling word_count/1 on text in those languages with the default splitter will return the entire passage (or large punctuation-delimited chunks of it) as a single "word". For these languages you must pass a splitter that uses dictionary-based segmentation, e.g. &Unicode.String.split(&1, break: :word, locale: :zh, trim: true).
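Both properties can be observed with String.split/1 on its own, without loading this library. A minimal sketch; the sample sentences are illustrative only:

```elixir
# The default splitter: String.split/1 splits on Unicode whitespace only,
# so punctuation stays attached to the surrounding token.
english = String.split("don't co-operate with the U.S. for 1,200 days")
# => ["don't", "co-operate", "with", "the", "U.S.", "for", "1,200", "days"]

# Text written without spaces comes back as a single token.
chinese = String.split("中文文本不使用空格")
# => ["中文文本不使用空格"]
```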

Choosing a splitter

| Splitter | Behaviour | When to use |
|---|---|---|
| &String.split/1 (default) | Whitespace only, ~50–100× faster than UAX | English / Western prose, fast counting |
| &Unicode.String.split(&1, break: :word, trim: true) | UAX #29 segmentation | Standards-compliant boundaries; required for CJK/SE-Asian text (with :locale) |
| &Regex.split(~r/\W+/u, &1, trim: true) | Alphabetic runs only | Strip all punctuation, ASCII-only words |

Note that Unicode.String.split/2 with break: :word produces punctuation tokens (",", ".", "'", etc.) as their own words. Filter or rejoin those before frequency counting if you only want alphabetic tokens.
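For comparison, the regex-based splitter listed above never emits punctuation tokens, but it also breaks contractions apart at the apostrophe. A small sketch using only the standard library; the input sentence is made up for illustration:

```elixir
# The regex-based splitter: split on runs of non-word characters.
regex_splitter = &Regex.split(~r/\W+/u, &1, trim: true)

tokens = regex_splitter.("Don't stop, believe!")
# => ["Don", "t", "stop", "believe"]
```

Whether splitting "Don't" into "Don" and "t" is acceptable depends on the analysis; for frequency counts of English prose the default whitespace splitter may be the better fit.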

Summary

Types

frequency_list()

A list of words and their frequencies in a text

splitter()

A function to split text

text()

Enumerable types for word counting

Functions

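average_word_length(frequency_list)

Counts the average word length in a frequency list.

sort(frequency_list, direction \\ :desc)

Sorts the words in a frequency list by frequency.

total_word_count(frequency_list)

Counts the total number of words in a frequency list.

word_count(text, splitter \\ &String.split/1)

Counts the number of words in a string, File.Stream, or Flow.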

Types

frequency_list()

@type frequency_list() :: [{String.t(), pos_integer()}, ...]

A list of words and their frequencies in a text

splitter()

@type splitter() :: function()

A function to split text

text()

@type text() :: Flow.t() | File.Stream.t() | String.t() | [String.t(), ...]

Enumerable types for word counting

Functions

average_word_length(frequency_list)

@spec average_word_length(frequency_list()) :: float()

Counts the average word length in a frequency list.

Arguments

  • frequency_list is a list of frequencies returned from Text.Word.word_count/2.

Returns

  • A float representing the average word length.

Examples
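A minimal sketch of the calculation, written with Enum only so that it runs without this library. It assumes the average is weighted by each word's count, which this page does not confirm, and the frequency list is made-up sample data:

```elixir
# Hypothetical frequency list, shaped like the output of Text.Word.word_count/2.
frequencies = [{"the", 2}, {"quick", 1}, {"fox", 1}]

# Accumulate total characters and total words, weighting each word by its count.
# Assumption: Text.Word.average_word_length/1 uses the same weighting.
{chars, words} =
  Enum.reduce(frequencies, {0, 0}, fn {word, count}, {chars, words} ->
    {chars + String.length(word) * count, words + count}
  end)

average = chars / words
# => 3.5  (14 characters over 4 words)
```

With the library loaded, Text.Word.average_word_length(frequencies) is the call this section documents.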

sort(frequency_list, direction \\ :desc)

@spec sort(frequency_list(), :asc | :desc) :: frequency_list()

Sorts the words in a frequency list by frequency.

Arguments

  • frequency_list is a list of frequencies returned from Text.Word.word_count/2.

  • direction is either :asc or :desc. The default is :desc.

Returns

  • The frequency_list sorted in the direction specified.

Examples
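A rough stand-in for the sorting step, using Enum.sort_by/3 from the standard library rather than the library's own code. The frequency list is illustrative, and how Text.Word.sort/2 orders words with equal counts is not stated on this page, so the sample uses distinct counts:

```elixir
# Hypothetical frequency list with distinct counts.
frequencies = [{"fox", 1}, {"the", 3}, {"dog", 2}]

# Sort by the count element of each tuple, in either direction.
descending = Enum.sort_by(frequencies, fn {_word, count} -> count end, :desc)
# => [{"the", 3}, {"dog", 2}, {"fox", 1}]

ascending = Enum.sort_by(frequencies, fn {_word, count} -> count end, :asc)
# => [{"fox", 1}, {"dog", 2}, {"the", 3}]
```

The equivalent library calls would be Text.Word.sort(frequencies) and Text.Word.sort(frequencies, :asc).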

total_word_count(frequency_list)

@spec total_word_count(frequency_list()) :: pos_integer()

Counts the total number of words in a frequency list.

Arguments

  • frequency_list is a list of frequencies returned from Text.Word.word_count/2.

Returns

  • An integer number of words.

Notes

The total reflects whatever tokenization was used to build frequency_list. With the default String.split/1 splitter the count is the number of whitespace-separated tokens, which:

  • counts contractions (don't), hyphenations (co-operate), abbreviations (U.S.) and grouped numbers (1,200) as a single word each, which is typically what a frequency counter wants;

  • drastically undercounts on Chinese / Japanese / Korean / Thai / Lao / Khmer / Burmese, where the entire input may collapse to a single token. Use a UAX #29 or dictionary-aware splitter via word_count/2 for those languages.

Examples
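The total is the sum of the counts in the frequency list. A minimal sketch with Enum only, assuming that is all Text.Word.total_word_count/1 does; the frequency list is sample data:

```elixir
# Hypothetical frequency list, shaped like the output of Text.Word.word_count/2.
frequencies = [{"the", 2}, {"quick", 1}, {"brown", 1}, {"fox", 1}]

# Sum the count element of every {word, count} tuple.
total =
  frequencies
  |> Enum.map(fn {_word, count} -> count end)
  |> Enum.sum()
# => 5
```

With the library loaded, the equivalent call would be Text.Word.total_word_count(frequencies).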

word_count(text, splitter \\ &String.split/1)

@spec word_count(Flow.t() | File.Stream.t() | String.t() | [String.t()], splitter()) ::
  frequency_list()

Counts the number of words in a string, File.Stream, or Flow.

Arguments

  • text is either a String.t, Flow.t, File.Stream.t or a list of strings.

  • splitter is an arity-1 function that takes a string and returns a list of tokens. The default is &String.split/1, which splits only on Unicode whitespace.

Returns

  • A list of 2-tuples of the form {word, count}, referred to as a frequency list.
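To make the splitter contract and the result shape concrete, here is a hand-rolled stand-in built on Enum.frequencies/1. It is a sketch, not the library's implementation: downcase_splitter and count_words are hypothetical names, and only string input is handled.

```elixir
# A custom arity-1 splitter: lower-case first, then split on whitespace.
# (Hypothetical helper, not part of Text.Word.)
downcase_splitter = fn text -> text |> String.downcase() |> String.split() end

# Hand-rolled stand-in for the counting step. Assumption: this mirrors the
# shape of the result only, not how Text.Word.word_count/2 is implemented.
count_words = fn text, splitter ->
  text
  |> splitter.()
  |> Enum.frequencies()
  |> Map.to_list()
end

frequencies = count_words.("The quick fox the lazy dog", downcase_splitter)
# Contains {"the", 2} plus one {word, 1} tuple for each other word.
```

Any function with the same shape (string in, list of tokens out) can be passed as the second argument to word_count/2.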

Notes on the default splitter

The default &String.split/1 is fast but does not implement Unicode word segmentation (UAX #29) and does not work for languages that write without spaces between words (Chinese, Japanese, Korean, Thai, Lao, Khmer, Burmese). On such input the whole passage will be returned as a single token (or as a small number of punctuation-delimited chunks).

See the module documentation for a full discussion of splitter choices.

Examples

# English / Western prose — default splitter is fine.
Text.Word.word_count("the quick brown fox the lazy dog")
#=> [{"the", 2}, {"quick", 1}, {"brown", 1}, ...]

# Chinese — must use dictionary-aware UAX segmentation.
splitter = &Unicode.String.split(&1, break: :word, locale: :zh, trim: true)
Text.Word.word_count("中文文本不使用空格", splitter)

# Standards-compliant Western tokenization, with punctuation
# tokens filtered out.
uax_alpha = fn text ->
  text
  |> Unicode.String.split(break: :word, trim: true)
  |> Enum.reject(&Regex.match?(~r/^\W+$/u, &1))
end
Text.Word.word_count("Don't stop — believe!", uax_alpha)