scrape v2.0.0 Scrape.Util.Text
Small helper functions for working with plain text, sanitizing HTML snippets, and the like.
Summary
Functions
Extract the main content from an HTML page. The resulting paragraphs are stripped of all non-space whitespace and then joined via "\n\n"
Count the meaningful words of a given text or list of words. The results
are aggregated into a map of the form: %{"tag" => 17}
A text paragraph is relevant if it has a minimum number of characters and contains any indicators of a sentence-like structure
A text paragraph shall not include any whitespace except single spaces between words
Split a given text up into a list of (downcased) meaningful words
Strip all HTML tags from a text
Remove all occurrences of JavaScript from an HTML snippet. Uses a regex
Functions
Extract the main content from an HTML page. The resulting paragraphs are
stripped of all non-space whitespace and then joined via "\n\n".
If you need the individual paragraphs back, just split the text via
String.split(text, "\n\n").
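The join/split round trip described above can be sketched as follows. This is only an illustration: the module `ArticleTextSketch` and its functions are hypothetical and not part of `Scrape.Util.Text`, and the blank-line separator `"\n\n"` is an assumption about how the paragraphs are joined.

```elixir
defmodule ArticleTextSketch do
  # Assumed separator between paragraphs (hypothetical, for illustration).
  @separator "\n\n"

  # Collapse all whitespace inside each paragraph to single spaces,
  # then join the cleaned paragraphs with the separator.
  def join(paragraphs) do
    paragraphs
    |> Enum.map(fn paragraph ->
      paragraph |> String.split() |> Enum.join(" ")
    end)
    |> Enum.join(@separator)
  end

  # Recover the individual paragraphs from the joined text.
  def split(text), do: String.split(text, @separator)
end
```

Because each paragraph contains no newlines after cleaning, splitting on the separator returns exactly the cleaned paragraphs that were joined.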
Count the meaningful words of a given text or list of words. The results
are aggregated into a map of the form: %{"tag" => 17}
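A minimal sketch of how such a count could be aggregated, assuming nothing about the library's actual implementation. The module `WordCountSketch` is hypothetical, and it skips the "meaningful word" filtering entirely: every whitespace-separated, downcased token is counted.

```elixir
defmodule WordCountSketch do
  # Accept a plain text: downcase it, split on whitespace, then count.
  def count(text) when is_binary(text) do
    text
    |> String.downcase()
    |> String.split()
    |> count()
  end

  # Accept a list of words: aggregate occurrences into a map.
  def count(words) when is_list(words) do
    Enum.reduce(words, %{}, fn word, acc ->
      Map.update(acc, word, 1, &(&1 + 1))
    end)
  end
end
```

Calling `WordCountSketch.count("Tag cloud tag")` yields a map of the same shape as the one shown above, with `"tag"` mapped to 2 and `"cloud"` mapped to 1.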
A text paragraph is relevant if it has a minimum number of characters and contains any indicators of a sentence-like structure.
This is a very naive approach, but it works surprisingly well so far.
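To make the heuristic concrete, here is one way such a check might look. Both the length threshold of 30 characters and the choice of sentence punctuation as the "indicator" are assumptions made for this sketch; the module `RelevanceSketch` is hypothetical and the library's real criteria may differ.

```elixir
defmodule RelevanceSketch do
  # Assumed minimum length in characters (hypothetical value).
  @min_length 30

  # A paragraph counts as relevant when it is long enough and contains
  # at least one sentence-ending punctuation mark.
  def relevant?(paragraph) do
    String.length(paragraph) >= @min_length and
      String.match?(paragraph, ~r/[.!?]/)
  end
end
```

Short fragments such as navigation labels fail the length check, while long strings without any punctuation fail the structure check.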
A text paragraph shall not include any whitespace except single spaces between words.
iex> Scrape.Util.Text.normalize_whitespace("\nhello world\n")
"hello world"
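The normalization rule can be approximated with a single regex replacement followed by a trim. This is a sketch under the assumption that "whitespace" means anything matched by `\s`; the module `WhitespaceSketch` is hypothetical and is not the library's implementation.

```elixir
defmodule WhitespaceSketch do
  # Replace every run of whitespace (spaces, tabs, newlines, ...) with a
  # single space, then drop leading and trailing spaces.
  def normalize(text) do
    text
    |> String.replace(~r/\s+/u, " ")
    |> String.trim()
  end
end
```

Replacing whole runs rather than single characters is what guarantees that only single spaces remain between words.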
Split a given text up into a list of (downcased) meaningful words.
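As an illustration of what "meaningful" might mean here, the sketch below drops very short tokens and a handful of English stopwords. The stopword list, the minimum length of 3, and the module `TokenizeSketch` are all assumptions for this example; the library's own definition is not stated in this document.

```elixir
defmodule TokenizeSketch do
  # Hypothetical stopword list and minimum token length.
  @stopwords ~w(a an and the of to in is)
  @min_length 3

  def tokenize(text) do
    text
    |> String.downcase()
    # Split on any run of characters that are neither letters nor digits.
    |> String.split(~r/[^\p{L}\p{N}]+/u, trim: true)
    # Discard tokens that are too short or are known stopwords.
    |> Enum.reject(fn word ->
      String.length(word) < @min_length or word in @stopwords
    end)
  end
end
```

For the input `"The Quick brown fox, and the lazy dog!"` this returns `["quick", "brown", "fox", "lazy", "dog"]`.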