scrape v2.0.0 Scrape.Website
Most functions in this module take an HTML string and return data extracted from it, mostly strings. Floki is used to parse the raw HTML.
Usually, we want general metadata from websites, not a deep text analysis. Because this module is meant to become the foundation of a web crawler, the algorithms should be as fast as possible, even if the quality of the results suffers a little.
Summary
Functions
Look for a canonical URL and use it; otherwise keep the given URL
Returns the description of an HTML site
Returns the favicon URL of an HTML site
Returns a list of feed URLs for an HTML site
Returns the main image URL of an HTML site
Fetch the meta keywords if they exist
Returns the title of an HTML site
Iterate over all URLs in the website object and expand them to absolute ones
Functions
find_canonical(%Scrape.Website{description: term, favicon: term, feeds: term, image: term, tags: term, title: term, url: term}, String.t) :: %Scrape.Website{description: term, favicon: term, feeds: term, image: term, tags: term, title: term, url: term}
Look for a canonical URL and use it; otherwise keep the given URL
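The canonical lookup described above might look roughly like the following sketch. It is not the library's actual implementation: the module name `CanonicalSketch` is hypothetical, the website is assumed to be any map or struct with a `:url` key, the second argument is assumed to be the raw HTML, and the `link[rel='canonical']` selector is an assumption about where the canonical URL is read from.

```elixir
# Fetches Floki when run as a standalone script; inside a Mix project,
# list :floki as a dependency in mix.exs instead.
Mix.install([{:floki, "~> 0.36"}])

defmodule CanonicalSketch do
  # Hypothetical stand-in for find_canonical/2.
  # `website` is any map or struct that already has a :url key.
  def find_canonical(%{url: _} = website, html) when is_binary(html) do
    case canonical_href(html) do
      # A canonical link was found: prefer it over the given URL.
      {:ok, href} -> %{website | url: href}
      # No canonical link or unparsable HTML: keep the given URL.
      :error -> website
    end
  end

  # Returns {:ok, href} for the first <link rel="canonical"> element, or :error.
  defp canonical_href(html) do
    with {:ok, document} <- Floki.parse_document(html),
         links = Floki.find(document, "link[rel='canonical']"),
         [href | _] <- Floki.attribute(links, "href") do
      {:ok, href}
    else
      _ -> :error
    end
  end
end
```

Falling back to the unchanged struct keeps the function total: callers always get a website back, whether or not the page declares a canonical URL.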
Returns the description of an HTML site
Fetch the meta keywords if they exist
normalize_urls(%Scrape.Website{description: term, favicon: term, feeds: term, image: term, tags: term, title: term, url: term}) :: %Scrape.Website{description: term, favicon: term, feeds: term, image: term, tags: term, title: term, url: term}
Iterate over all URLs in the website object and expand them to absolute ones
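One way to expand relative URLs is `URI.merge/2` from the Elixir standard library, as in the minimal sketch below. The `NormalizeSketch` struct is a hypothetical, reduced stand-in for `%Scrape.Website{}`, and the sketch assumes that the `url` field holds an absolute URL that serves as the base for `favicon`, `image`, and `feeds`. The library itself may resolve URLs differently.

```elixir
defmodule NormalizeSketch do
  # Hypothetical stand-in for %Scrape.Website{}, reduced to its URL fields.
  defstruct url: nil, favicon: nil, image: nil, feeds: []

  # Expands every URL field relative to the website's own (absolute) URL.
  def normalize_urls(%__MODULE__{url: base} = website) do
    %{
      website
      | favicon: expand(website.favicon, base),
        image: expand(website.image, base),
        feeds: Enum.map(website.feeds, &expand(&1, base))
    }
  end

  # Missing values stay missing.
  defp expand(nil, _base), do: nil

  # URI.merge/2 resolves a relative reference against an absolute base
  # and leaves references that are already absolute pointing at their own host.
  defp expand(url, base), do: base |> URI.merge(url) |> URI.to_string()
end
```

Note that `URI.merge/2` raises an `ArgumentError` when the base is not absolute, so a caller would need to make sure `url` is set before normalizing.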