Funkspector (funkspector v1.2.0)
Funkspector is a web scraper that lets you extract data from web pages.
Summary
Functions
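page_scrape(url, options \\ %{})
Parses an HTML document.
resolve(url, options \\ %{})
Follows any redirects from the given URL and returns the final URL together with the final response.
scrape(url, options, scraping_function)
sitemap_scrape(url, options \\ %{})
Parses an XML sitemap.
text_sitemap_scrape(url, options \\ %{})
Parses a text sitemap.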
Functions
page_scrape(url, options \\ %{})
Parses an HTML document.
This can be used to request a document by passing its URL, like:
Funkspector.page_scrape("https://example.com")
Or to scrape an already loaded document, by passing its HTML contents:
Funkspector.page_scrape("https://example.com", contents: "<html>...</html>")
Example: request a document
iex> {:ok, document} = Funkspector.page_scrape("https://jaimeiniesta.com")
iex> Enum.take(document.data.links.http.external, 3)
["http://www.archive.elixirconf.eu/elixirconf2016", "https://steadyhq.com/", "https://stuart.com/"]
Example: site not found
iex> Funkspector.page_scrape("https://notfoundwebsite.com")
{:error, "https://notfoundwebsite.com", %HTTPoison.Error{reason: :nxdomain, id: nil}}
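Both outcomes can be handled in one place by pattern matching on the tuples shown in the two examples. The following is a minimal sketch, not part of Funkspector: `ScrapeReport` and `summarize/1` are hypothetical names, and the code assumes only the `{:ok, document}` and `{:error, url, reason}` shapes seen above, with `document.data.links.http.external` holding a list of URLs.

```elixir
defmodule ScrapeReport do
  # Hypothetical helper (not part of Funkspector): turns a result tuple
  # shaped like the examples above into a one-line summary.

  # Success: count the external HTTP links found in the document.
  def summarize({:ok, %{data: %{links: %{http: %{external: external}}}}}) do
    "ok: #{length(external)} external links"
  end

  # Failure: report the URL and the error term returned with it.
  def summarize({:error, url, reason}) do
    "error: #{url} (#{inspect(reason)})"
  end
end
```

A call such as `Funkspector.page_scrape(url) |> ScrapeReport.summarize()` would then yield a string for either outcome, provided the result matches one of the two assumed shapes.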
resolve(url, options \\ %{})
Follows any redirects from the given URL and returns the final URL together with the final response.
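A caller may want to know whether a redirect actually happened. The following is a minimal sketch, not part of Funkspector: `RedirectCheck` and `redirected?/2` are hypothetical names, and the only assumption about `resolve` is that a successful call returns `{:ok, final_url, response}`. The error shape is not documented here, so every other result falls through to `false`.

```elixir
defmodule RedirectCheck do
  # Hypothetical helper (not part of Funkspector): reports whether the
  # final URL differs from the one originally requested.
  def redirected?(original_url, {:ok, final_url, _response}) do
    original_url != final_url
  end

  # Any other result shape is treated as "not resolved".
  def redirected?(_original_url, _other), do: false
end
```

For instance, `RedirectCheck.redirected?(url, Funkspector.resolve(url))` compares the requested URL with the resolved one. The comparison is a plain string comparison, so a trailing slash alone counts as a difference in this sketch.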
Examples
iex> {:ok, final_url, _response} = Funkspector.resolve("http://github.com")
iex> final_url
"https://github.com/"
scrape(url, options, scraping_function)
sitemap_scrape(url, options \\ %{})
Parses an XML sitemap.
This can be used to request a document by passing its URL, like:
Funkspector.sitemap_scrape("https://example.com/sitemap.xml")
Or to scrape an already loaded document, by passing its XML contents:
Funkspector.sitemap_scrape("https://example.com/sitemap.xml", contents: "<xml>...</xml>")
Example
iex> {:ok, document} = Funkspector.sitemap_scrape("https://rocketvalidator.com/sitemap.xml")
iex> length(document.data.locs)
1244
iex> Enum.take(document.data.locs, 3)
["https://rocketvalidator.com/", "https://rocketvalidator.com/who", "https://rocketvalidator.com/html-validation"]
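Since `document.data.locs` is a plain list of URL strings, it can be processed with the standard library. The following is a minimal sketch, not part of Funkspector: `SitemapStats` and `count_by_host/1` are hypothetical names, and the function assumes only a list of URL strings like the one in the example above.

```elixir
defmodule SitemapStats do
  # Hypothetical helper (not part of Funkspector): counts how many
  # sitemap URLs belong to each host.
  def count_by_host(locs) do
    locs
    |> Enum.map(fn loc -> URI.parse(loc).host end)
    |> Enum.frequencies()
  end
end
```

Calling `SitemapStats.count_by_host(document.data.locs)` returns a map from host name to URL count, which can help spot entries that point outside the expected domain.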
text_sitemap_scrape(url, options \\ %{})
Parses a text sitemap.
This can be used to request a document by passing its URL, like:
Funkspector.text_sitemap_scrape("https://example.com/sitemap.txt")
Or to scrape an already loaded document, by passing its text contents:
Funkspector.text_sitemap_scrape("https://example.com/sitemap.txt", contents: "...")
Example
iex> {:ok, document} = Funkspector.text_sitemap_scrape("https://rocketvalidator.com/sitemap.txt")
iex> length(document.data.lines)
1244
iex> Enum.take(document.data.lines, 3)
["https://rocketvalidator.com/", "https://rocketvalidator.com/who", "https://rocketvalidator.com/html-validation"]
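A text sitemap is free-form, so it can be useful to separate lines that look like absolute HTTP(S) URLs from everything else. The following is a minimal sketch, not part of Funkspector: `SitemapLines` and `split_valid/1` are hypothetical names, and whether `document.data.lines` can contain non-URL lines at all is an assumption, since this page does not say.

```elixir
defmodule SitemapLines do
  # Hypothetical helper (not part of Funkspector): splits lines into
  # {valid, invalid}, where "valid" means an absolute http/https URL.
  def split_valid(lines) do
    Enum.split_with(lines, &absolute_http_url?/1)
  end

  # A line qualifies when it parses with an http(s) scheme and a
  # non-empty host.
  defp absolute_http_url?(line) do
    case URI.parse(line) do
      %URI{scheme: scheme, host: host}
      when scheme in ["http", "https"] and is_binary(host) and host != "" ->
        true

      _ ->
        false
    end
  end
end
```

`SitemapLines.split_valid(document.data.lines)` keeps the original order within each of the two returned lists.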