Scrape.Website (scrape v2.0.0)

Every function in this module takes an HTML string and returns some data extracted from it, mostly strings. Floki is used for parsing the raw HTML.

Usually we only want some general metadata from a website, not deep text analysis or the like. Since this module is meant to be the foundation of a future web crawler, the algorithms used should be as fast as possible, even if the resulting quality suffers a little.
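For orientation, a minimal usage sketch. The HTML snippet and the shown return values are illustrative assumptions, not guaranteed library output:

    iex> html = ~s(<html><head><title>Example Page</title><meta name="description" content="A short example."></head></html>)
    iex> website = Scrape.Website.parse(html, "http://example.com/article")
    iex> website.title
    "Example Page"
    iex> website.description
    "A short example."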

Summary

Functions

Look for a canonical URL and use it; otherwise fall back to the given URL

Returns the description of an HTML site

Returns the favicon URL of an HTML site

Returns a list of feed URLs for an HTML site

Returns the main image URL of an HTML site

Fetch the meta keywords, if they exist

Returns the title of an HTML site

Iterate over all URLs in the website struct and expand them to absolute ones

Functions

find_canonical(website, html)
find_canonical(%Scrape.Website{description: term, favicon: term, feeds: term, image: term, tags: term, title: term, url: term}, String.t) :: %Scrape.Website{description: term, favicon: term, feeds: term, image: term, tags: term, title: term, url: term}

Look for a canonical URL and use it; otherwise fall back to the given URL
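A hedged sketch, assuming a canonical link tag in the document head and that the struct's remaining fields have defaults; the values are illustrative:

    iex> website = %Scrape.Website{url: "http://example.com/article?ref=home"}
    iex> html = ~s(<head><link rel="canonical" href="http://example.com/article"></head>)
    iex> Scrape.Website.find_canonical(website, html).url
    "http://example.com/article"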

find_description(html)
find_description(String.t) :: String.t

Returns the description of an HTML site
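A sketch, assuming the description is read from the meta description tag; whether other sources (such as og:description) are also consulted is an implementation detail:

    iex> html = ~s(<head><meta name="description" content="An example description."></head>)
    iex> Scrape.Website.find_description(html)
    "An example description."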

find_favicon(html)
find_favicon(String.t) :: String.t

Returns the favicon URL of an HTML site
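A sketch, assuming the favicon is picked up from a link rel="icon" tag; note that the returned URL may still be relative until normalize_urls/1 is applied:

    iex> html = ~s(<head><link rel="icon" href="/favicon.ico"></head>)
    iex> Scrape.Website.find_favicon(html)
    "/favicon.ico"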

find_feeds(html)
find_feeds(String.t) :: [String.t]

Returns a list of feed URLs for an HTML site
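A sketch, assuming feeds are discovered via link rel="alternate" tags for RSS/Atom; the return value is illustrative:

    iex> html = ~s(<head><link rel="alternate" type="application/rss+xml" href="/feed.xml"></head>)
    iex> Scrape.Website.find_feeds(html)
    ["/feed.xml"]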

find_image(html)
find_image(String.t) :: String.t

Returns the main image URL of an HTML site
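A sketch, assuming the main image is taken from an og:image meta tag; other sources may be considered as well:

    iex> html = ~s(<head><meta property="og:image" content="http://example.com/cover.jpg"></head>)
    iex> Scrape.Website.find_image(html)
    "http://example.com/cover.jpg"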

find_tags(html)
find_tags(String.t) :: [%{name: String.t, accuracy: float}]

Fetch the meta keywords, if they exist
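A sketch based on the meta keywords tag. The accuracy values shown are placeholders, since the actual weighting is an implementation detail:

    iex> html = ~s(<head><meta name="keywords" content="elixir, scraping"></head>)
    iex> Scrape.Website.find_tags(html)
    [%{name: "elixir", accuracy: 1.0}, %{name: "scraping", accuracy: 1.0}]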

find_title(html)
find_title(String.t) :: String.t

Returns the title of an HTML site
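A minimal sketch, assuming the title comes from the title element; the exact post-processing (e.g. trimming) is not specified here:

    iex> html = "<html><head><title>Hello World</title></head><body></body></html>"
    iex> Scrape.Website.find_title(html)
    "Hello World"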

normalize_urls(website)
normalize_urls(%Scrape.Website{description: term, favicon: term, feeds: term, image: term, tags: term, title: term, url: term}) :: %Scrape.Website{description: term, favicon: term, feeds: term, image: term, tags: term, title: term, url: term}

Iterate over all URLs in the website struct and expand them to absolute ones
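A sketch, assuming relative paths are expanded against the struct's url field and that unset fields have defaults; the expanded values are illustrative:

    iex> website = %Scrape.Website{url: "http://example.com/article", favicon: "/favicon.ico", feeds: ["/feed.xml"]}
    iex> normalized = Scrape.Website.normalize_urls(website)
    iex> {normalized.favicon, normalized.feeds}
    {"http://example.com/favicon.ico", ["http://example.com/feed.xml"]}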

parse(html, url)
parse(String.t, String.t) :: %Scrape.Website{description: term, favicon: term, feeds: term, image: term, tags: term, title: term, url: term}
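Judging from the spec, this is the module's entry point: it takes raw HTML plus the page URL and returns a populated %Scrape.Website{} struct. A hedged sketch, assuming parse/2 also applies the URL normalization described above:

    iex> html = ~s(<head><title>Example</title><link rel="icon" href="/favicon.ico"></head>)
    iex> website = Scrape.Website.parse(html, "http://example.com/")
    iex> website.favicon
    "http://example.com/favicon.ico"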