Crawler v1.0.0 Crawler.Parser
Parses pages and calls a link handler to handle the detected links.
Functions
parse(input)
Parses the links and returns the page.
There are two hooks:
link_handler is useful when a custom parser calls this default parser and uses a different link handler for processing links.
scraper is useful for scraping content immediately as the parser parses the page. Alternatively, you can access the crawled data asynchronously; refer to the README.
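The link_handler hook can be pictured with a small, self-contained sketch. It does not use Crawler and is not Crawler's implementation: the module name LinkHookSketch, the regex-based link extraction, and the two-argument handler shape (url, opts) are all assumptions made for illustration. It only shows how a caller can swap in its own function for processing each detected link.

```elixir
defmodule LinkHookSketch do
  # Hypothetical stand-in for a parser that accepts a link handler hook.
  # This is NOT Crawler.Parser; it only illustrates the hook pattern.
  def parse_links(body, opts, link_handler) do
    # Assumption: opts carries :html_tag, as in the examples on this page.
    tag = Map.get(opts, :html_tag, "a")

    # Naive extraction of href/src values; a real parser would use an HTML library.
    regex =
      Regex.compile!("<#{Regex.escape(tag)}\\b[^>]*\\b(?:href|src)=['\"]([^'\"]+)['\"]")

    regex
    |> Regex.scan(body, capture: :all_but_first)
    |> List.flatten()
    # Each detected link is passed to the caller-supplied handler.
    |> Enum.map(&link_handler.(&1, opts))
  end
end

# A custom handler that tags each link instead of crawling it.
collect = fn url, _opts -> {:seen, url} end

LinkHookSketch.parse_links("<a href='http://parser/1'>Link</a>", %{html_tag: "a"}, collect)
|> IO.inspect()
# prints [seen: "http://parser/1"]
```

A custom parser could follow the same shape: do its own work on the page, then delegate link detection to the default parser while passing its own handler.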
Examples
iex> {:ok, page} = Parser.parse(%Page{
iex> body: "Body",
iex> opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"Body"
iex> {:ok, page} = Parser.parse(%Page{
iex> body: "<a href='http://parser/1'>Link</a>",
iex> opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a href='http://parser/1'>Link</a>"
iex> {:ok, page} = Parser.parse(%Page{
iex> body: "<a name='hello'>Link</a>",
iex> opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a name='hello'>Link</a>"
iex> {:ok, page} = Parser.parse(%Page{
iex> body: "<a href='http://parser/2' target='_blank'>Link</a>",
iex> opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html"}
iex> })
iex> page.body
"<a href='http://parser/2' target='_blank'>Link</a>"
iex> {:ok, page} = Parser.parse(%Page{
iex> body: "<a href='parser/2'>Link</a>",
iex> opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html", referrer_url: "http://hello"}
iex> })
iex> page.body
"<a href='parser/2'>Link</a>"
iex> {:ok, page} = Parser.parse(%Page{
iex> body: "<a href='../parser/2'>Link</a>",
iex> opts: %{scraper: Crawler.Scraper, html_tag: "a", content_type: "text/html", referrer_url: "http://hello"}
iex> })
iex> page.body
"<a href='../parser/2'>Link</a>"
iex> {:ok, page} = Parser.parse(%Page{
iex> body: image_file(),
iex> opts: %{scraper: Crawler.Scraper, html_tag: "img", content_type: "image/png"}
iex> })
iex> page.body
"#{image_file()}"
parse_links(body, opts, link_handler)