Implements word counting for lists, streams and flows.
Tokenization
Word counting separates a text into tokens via a caller-supplied
splitter function. The default splitter is String.split/1,
which splits only on Unicode whitespace. This default has two
important properties to be aware of:
- It does not implement Unicode word segmentation (UAX #29). String.split/1 is a fast whitespace split. It does not apply the boundary rules in UAX #29: for example, it keeps don't, co-operate, U.S., and 1,200 as single tokens, where UAX #29 can emit several tokens for some of them (co-operate, for instance, breaks at the hyphen). For frequency counting this is usually the desired behaviour, but if you need standards-compliant boundaries (e.g. for cursor movement, search highlighting, or linguistic analysis) pass an explicit splitter that delegates to Unicode.String.split/2. See examples below.
- It does not work for languages that don't use whitespace between words. Chinese, Japanese, Korean, Thai, Lao, Khmer, and Burmese (Myanmar) write running text without spaces. Calling word_count/1 on text in those languages with the default splitter will return the entire passage (or large punctuation-delimited chunks of it) as a single "word". For these languages you must pass a splitter that uses dictionary-based segmentation, e.g. &Unicode.String.split(&1, break: :word, locale: :zh, trim: true).
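The two properties above can be seen with String.split/1 alone. This is a minimal sketch using only the Elixir standard library; the sample strings are illustrative.

```elixir
# The default splitter breaks on whitespace only, so contractions,
# hyphenations, abbreviations and grouped numbers stay whole.
english = String.split("don't co-operate U.S. 1,200")
IO.inspect(english)
# => ["don't", "co-operate", "U.S.", "1,200"]

# Text written without spaces contains no split points,
# so the whole passage comes back as a single token.
chinese = String.split("中文文本不使用空格")
IO.inspect(length(chinese))
# => 1
```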
Choosing a splitter
| Splitter | Behaviour | When to use |
|---|---|---|
| &String.split/1 (default) | Whitespace only, ~50–100× faster than UAX | English / Western prose, fast counting |
| &Unicode.String.split(&1, break: :word, trim: true) | UAX #29 segmentation | Standards-compliant boundaries; required for CJK/SE-Asian text (with :locale) |
| &Regex.split(~r/\W+/u, &1, trim: true) | Runs of word characters only (letters, digits, underscore) | Strip all punctuation, ASCII-only words |
Note that Unicode.String.split/2 with break: :word produces
punctuation tokens (",", ".", "'", etc.) as their own words.
Filter or rejoin those before frequency counting if you only want
alphabetic tokens.
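To make the trade-offs concrete, the sketch below runs the whitespace and regex splitters on one sentence, then filters punctuation-only tokens out of a token list. The `segmented` list is a hand-written stand-in for the output of a segmenter that emits punctuation separately; it is an assumption for illustration, not captured output from Unicode.String.split/2.

```elixir
text = "Don't stop, believe!"

# Whitespace splitter: punctuation stays attached to the neighbouring word.
whitespace_tokens = String.split(text)
IO.inspect(whitespace_tokens)
# => ["Don't", "stop,", "believe!"]

# Regex splitter: every run of non-word characters is a boundary,
# so the apostrophe also splits "Don't" in two.
regex_tokens = Regex.split(~r/\W+/u, text, trim: true)
IO.inspect(regex_tokens)
# => ["Don", "t", "stop", "believe"]

# Hypothetical token list with punctuation emitted as separate tokens.
segmented = ["Don't", "stop", ",", "believe", "!"]

# Drop tokens that consist entirely of non-word characters.
alphabetic = Enum.reject(segmented, &Regex.match?(~r/^\W+$/u, &1))
IO.inspect(alphabetic)
# => ["Don't", "stop", "believe"]
```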
Summary
Types
A list of words and their frequencies in a text
A function to split text
Enumerable types for word counting
Functions
Counts the average word length in a frequency list.
Sorts the words in a frequency list by frequency.
Counts the total number of words in a frequency list.
Counts the number of words in a string,
File.Stream, or Flow.
Types
@type frequency_list() :: [{String.t(), pos_integer()}, ...]
A list of words and their frequencies in a text
@type splitter() :: function()
A function to split text
@type text() :: Flow.t() | File.Stream.t() | String.t() | [String.t(), ...]
Enumerable types for word counting
Functions
@spec average_word_length(frequency_list()) :: float()
Counts the average word length in a frequency list.
Arguments
- frequency_list is a list of frequencies returned from Text.Word.word_count/2.
Returns
- A float representing the average word length.
Examples
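The snippet below is a rough sketch of one way such an average could be computed, assuming each word's length is weighted by its frequency. The helper `average_word_length` defined here is a local, hypothetical function, not the library's implementation, and the frequency list is made up.

```elixir
# Hypothetical helper: mean word length in characters,
# weighting each word by how often it occurs.
average_word_length = fn frequency_list ->
  {chars, words} =
    Enum.reduce(frequency_list, {0, 0}, fn {word, count}, {chars, words} ->
      {chars + String.length(word) * count, words + count}
    end)

  chars / words
end

frequencies = [{"the", 2}, {"quick", 1}, {"fox", 1}]

# (3 * 2 + 5 + 3) characters over 4 words.
average = average_word_length.(frequencies)
IO.inspect(average)
# => 3.5
```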
@spec sort(frequency_list(), :asc | :desc) :: frequency_list()
Sorts the words in a frequency list by frequency.
Arguments
- frequency_list is a list of frequencies returned from Text.Word.word_count/2.
- direction is either :asc or :desc. The default is :desc.
Returns
- The frequency_list sorted in the direction specified.
Examples
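As an illustration of the ordering described above, here is a minimal sketch that sorts a made-up frequency list by count with Enum.sort_by/3. The `sort_by_count` helper is hypothetical and only mirrors the documented behaviour; how the library orders words with equal counts is not covered here.

```elixir
# Hypothetical helper: order {word, count} tuples by count.
sort_by_count = fn frequency_list, direction ->
  Enum.sort_by(frequency_list, fn {_word, count} -> count end, direction)
end

frequencies = [{"the", 2}, {"fox", 1}, {"dog", 3}]

descending = sort_by_count.(frequencies, :desc)
IO.inspect(descending)
# => [{"dog", 3}, {"the", 2}, {"fox", 1}]

ascending = sort_by_count.(frequencies, :asc)
IO.inspect(ascending)
# => [{"fox", 1}, {"the", 2}, {"dog", 3}]
```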
@spec total_word_count(frequency_list()) :: pos_integer()
Counts the total number of words in a frequency list.
Arguments
- frequency_list is a list of frequencies returned from Text.Word.word_count/2.
Returns
- An integer number of words.
Notes
The total reflects whatever tokenization was used to build
frequency_list. With the default String.split/1 splitter the
count is the number of whitespace-separated tokens, which:
- counts contractions (don't), hyphenations (co-operate), abbreviations (U.S.) and numbers with digit separators (1,200) as a single word each, which is typically what a frequency counter wants;
- undercounts radically on Chinese / Japanese / Korean / Thai / Lao / Khmer / Burmese, where the entire input may collapse to a single token. Use a UAX/dictionary-aware splitter via word_count/2 for those languages.
Examples
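A small sketch of the idea, assuming the total is simply the sum of the counts in the frequency list. The `total_word_count` helper below is a local, hypothetical function, and both frequency lists are hand-written rather than produced by the library.

```elixir
# Hypothetical helper: add up the count of every {word, count} tuple.
total_word_count = fn frequency_list ->
  frequency_list
  |> Enum.map(fn {_word, count} -> count end)
  |> Enum.sum()
end

western = [{"the", 2}, {"quick", 1}, {"fox", 1}]
western_total = total_word_count.(western)
IO.inspect(western_total)
# => 4

# With a whitespace splitter, unspaced text collapses to one token,
# so the total is 1 regardless of how many words the passage contains.
unspaced = [{"中文文本不使用空格", 1}]
unspaced_total = total_word_count.(unspaced)
IO.inspect(unspaced_total)
# => 1
```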
@spec word_count(Flow.t() | File.Stream.t() | String.t() | [String.t()], splitter()) :: frequency_list()
Counts the number of words in a string,
File.Stream, or Flow.
Arguments
- text is either a String.t, Flow.t, File.Stream.t or a list of strings.
- splitter is an arity-1 function that takes a string and returns a list of tokens. The default is &String.split/1, which splits only on Unicode whitespace.
Returns
- A list of 2-tuples of the form {word, count}, referred to as a frequency list.
Notes on the default splitter
The default &String.split/1 is fast but does not implement
Unicode word segmentation (UAX #29) and does not work for
languages that write without spaces between words (Chinese,
Japanese, Korean, Thai, Lao, Khmer, Burmese). On such input the
whole passage will be returned as a single token (or as a small
number of punctuation-delimited chunks).
See the module documentation for a full discussion of splitter choices.
Examples
# English / Western prose — default splitter is fine.
Text.Word.word_count("the quick brown fox the lazy dog")
#=> [{"the", 2}, {"quick", 1}, {"brown", 1}, ...]
# Chinese — must use dictionary-aware UAX segmentation.
splitter = &Unicode.String.split(&1, break: :word, locale: :zh, trim: true)
Text.Word.word_count("中文文本不使用空格", splitter)
# Standards-compliant Western tokenization, with punctuation
# tokens filtered out.
uax_alpha = fn text ->
  text
  |> Unicode.String.split(break: :word, trim: true)
  |> Enum.reject(&Regex.match?(~r/^\W+$/u, &1))
end
Text.Word.word_count("Don't stop — believe!", uax_alpha)
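For readers who want to see how a splitter and a frequency list fit together, the following is a simplified sketch for plain strings using only the standard library. The `count_words` helper is hypothetical and is not how Text.Word.word_count/2 is implemented; it ignores streams and flows, and its result order is not meant to match the library's.

```elixir
# Hypothetical helper: split the text with the given arity-1 splitter,
# then tally identical tokens into {word, count} tuples.
count_words = fn text, splitter ->
  text
  |> splitter.()
  |> Enum.frequencies()
  |> Map.to_list()
end

result = count_words.("the quick brown fox the lazy dog", &String.split/1)

# Sorted alphabetically here only to make the output order predictable.
sorted = Enum.sort(result)
IO.inspect(sorted)
# => [{"brown", 1}, {"dog", 1}, {"fox", 1}, {"lazy", 1}, {"quick", 1}, {"the", 2}]
```

Any arity-1 function that returns a list of strings can be passed in place of &String.split/1, which is the same contract the splitter argument describes.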