scrape v2.0.0 Scrape.Util.Text

Small helper functions for dealing with plain text, sanitizing HTML snippets, and the like.

Summary

Functions

Extract the main content from an HTML site. The resulting paragraphs are stripped of all whitespace other than spaces, and then joined via " ".

Count the meaningful words of a given text or list of words. The results are aggregated into a map of the form: %{"tag" => 17}

A text paragraph is relevant if it has a minimum amount of characters and contains any indicators of a sentence-like structure

A text paragraph shall not include any whitespace except single spaces between words

Split a given text up into a list of (downcased) meaningful words

Strip all HTML tags from a text

Remove all occurrences of JavaScript from an HTML snippet. Uses a regex.

Functions

article_from_html(html)
article_from_html(String.t) :: String.t

Extract the main content from an HTML site. The resulting paragraphs are stripped of all whitespace other than spaces, and then joined via " ".

If you need the individual paragraphs back, split the text via String.split(text, " ").

count_words(text)
count_words(String.t | [String.t]) :: %{optional(String.t) => float}

Count the meaningful words of a given text or list of words. The results are aggregated into a map of the form: %{"tag" => 17}
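
The aggregation described above can be illustrated with a minimal sketch. WordCountSketch and its count/1 function are hypothetical names, not part of Scrape.Util.Text. The tokenization rule (split on anything that is not a letter or digit) is an assumption, and no filtering for "meaningful" words is applied here.

```elixir
defmodule WordCountSketch do
  # Hypothetical helper, not the library's implementation.
  # Accepts either a raw text or an already tokenized list of words.
  def count(text) when is_binary(text) do
    text
    |> String.downcase()
    # Assumed tokenization: split on runs of non-letter, non-digit characters.
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    |> count()
  end

  def count(words) when is_list(words) do
    # Builds a map of word => number of occurrences.
    Enum.frequencies(words)
  end
end

IO.inspect(WordCountSketch.count("Tag cloud, tag list"))
```

This sketch returns integer counts, in line with the %{"tag" => 17} form shown above.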

is_relevant?(text)
is_relevant?(String.t) :: String.t

A text paragraph is relevant if it has a minimum amount of characters and contains any indicators of a sentence-like structure.

Very naive approach, but works surprisingly well so far.
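
A rough sketch of such a heuristic follows. RelevanceSketch, the 30-character minimum, and the punctuation check are all assumptions made for illustration; the actual threshold and indicators used by is_relevant?/1 are not documented here.

```elixir
defmodule RelevanceSketch do
  # Hypothetical threshold; the real minimum length is not documented.
  @min_length 30

  # Returns true when the text is long enough and contains sentence
  # punctuation followed by whitespace or the end of the text.
  def relevant?(text) when is_binary(text) do
    String.length(text) >= @min_length and
      Regex.match?(~r/[.!?](\s|$)/, text)
  end
end

IO.inspect(RelevanceSketch.relevant?("Home | About | Contact"))
IO.inspect(RelevanceSketch.relevant?("This paragraph is long enough and ends like a sentence."))
```

Short navigation fragments fail the length check, while ordinary prose passes both conditions.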

normalize_whitespace(text)
normalize_whitespace(String.t) :: String.t

A text paragraph shall not include any whitespace except single spaces between words.

iex> Scrape.Util.Text.normalize_whitespace("\n\nhello world\n\n")
"hello world"
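
One way to get this behaviour, shown as a sketch rather than the library's actual code, is to collapse every run of whitespace into a single space and trim the ends. WhitespaceSketch is a hypothetical module name.

```elixir
defmodule WhitespaceSketch do
  # Hypothetical helper, not the library's implementation.
  def normalize(text) when is_binary(text) do
    text
    # Collapse any run of whitespace (newlines, tabs, spaces) into one space.
    |> String.replace(~r/\s+/u, " ")
    # Remove the leading and trailing space left over from the step above.
    |> String.trim()
  end
end

IO.inspect(WhitespaceSketch.normalize("\n\nhello   world\t\n"))
```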

to_words(text)
to_words(String.t) :: [String.t]

Split a given text up into a list of (downcased) meaningful words.
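
The following sketch shows one possible reading of "meaningful": downcase, split on non-word characters, and drop a small stop word list. WordsSketch and the stop word list are hypothetical; the criteria actually used by to_words/1 are not stated in this documentation.

```elixir
defmodule WordsSketch do
  # Hypothetical stop word list, chosen only for illustration.
  @stopwords ~w(the a an and or of to in is)

  def to_words(text) when is_binary(text) do
    text
    |> String.downcase()
    # Assumed tokenization: split on runs of non-letter, non-digit characters.
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    # Drop words that carry little meaning on their own.
    |> Enum.reject(&(&1 in @stopwords))
  end
end

IO.inspect(WordsSketch.to_words("The Quick brown fox, and the lazy dog"))
```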

without_html(text)
without_html(String.t) :: String.t

Strip all HTML tags from a text
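
As an illustration only, tag stripping can be approximated with a regular expression. Whether without_html/1 works this way is an assumption; HtmlSketch is a hypothetical name, and a regex like this does not handle every edge case of real HTML (for example, a ">" inside an attribute value).

```elixir
defmodule HtmlSketch do
  # Hypothetical helper, not the library's implementation.
  def strip_tags(text) when is_binary(text) do
    # Remove anything that looks like <...>, leaving the text content in place.
    String.replace(text, ~r/<[^>]*>/, "")
  end
end

IO.inspect(HtmlSketch.strip_tags("<p>Hello <b>world</b></p>"))
```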

without_js(text)
without_js(String.t) :: String.t

Remove all occurrences of JavaScript from an HTML snippet. Uses a regex.
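
A regex-based removal of script elements might look like the sketch below. JsSketch and the specific pattern are assumptions; the pattern used by without_js/1 is not shown in this documentation, and inline event handlers such as onclick attributes are not covered here.

```elixir
defmodule JsSketch do
  # Hypothetical helper, not the library's implementation.
  def strip_scripts(html) when is_binary(html) do
    # i: case-insensitive tag names, s: let "." match newlines inside the script.
    # The lazy ".*?" stops at the first closing tag, so separate scripts
    # are removed one by one instead of swallowing the markup between them.
    String.replace(html, ~r/<script\b[^>]*>.*?<\/script>/is, "")
  end
end

IO.inspect(JsSketch.strip_scripts("<p>Hi</p><script>alert(1)</script><p>Bye</p>"))
```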