Hop (hop v0.1.0)

Hop is a tiny web crawling framework for Elixir.

Hop's goal is to be simple and extensible, while still providing enough guardrails to get up and running quickly. Hop implements a simple, depth-limited, breadth-first crawler, and it keeps track of already-visited URLs for you. It also provides several utility functions for implementing crawlers with best practices.

You can crawl a webpage with just a few lines of code with Hop:

url
|> Hop.new()
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("Visited: #{url}")
end)

By default, Hop will perform a single-host, breadth-first crawl of all of the pages on the website. The stream/2 execution function returns a stream of {url, response, state} tuples for each successfully visited page in the crawl. This makes it easy to cache responses, extract items from a page, or perform any other work you need.

The defaults are designed to get you up and running quickly; the real power, however, lies in Hop's simple extensibility. Hop allows you to customize your crawler's behavior at three different stages:

  1. Prefetch
  2. Fetch
  3. Next

Prefetch

The prefetch stage is a pre-request stage typically used for request validation. By default, during the prefetch stage, Hop will:

  1. Check that the URL has a populated scheme and host

  2. Check that the URL's scheme is acceptable. The default accepted schemes are http and https. This prevents Hop from attempting to visit tel:, mailto:, and other links with non-HTTP schemes.

  3. Check that the host is actually valid, using :inet.gethostbyname

  4. Check that the host is an acceptable host to visit, e.g. not outside of the host the crawl started on.

  5. Perform a HEAD request to check that the content-length does not exceed the configured maximum and that the content-type matches one of the accepted MIME types. This is useful for preventing Hop from spending time downloading large files you don't care about.

Most users should stick with Hop's default pre-request logic. If you want to customize the behavior, however, you can pass your own prefetch/3 function:

url
|> Hop.new()
|> Hop.prefetch(fn url, _state, _opts -> {:ok, url} end)
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("Visited: #{url}")
end)

This simple example performs no validation, and forwards all URLs to the fetch stage.

Your custom prefetch/3 function should be an arity-3 function which takes a URL, the current crawl state, and the current Hop's configuration options as input and returns {:ok, url} or an error value. Any value other than {:ok, url} causes the URL to be skipped during fetch.
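For example, a prefetch function that restricts the crawl to a single path prefix might look like the following sketch. The /blog prefix and the :skipped error value are illustrative; any non-{:ok, url} value is simply ignored:

```elixir
# Illustrative prefetch: only allow URLs under the /blog path.
custom_prefetch = fn url, _state, _opts ->
  case URI.parse(url) do
    # Matches any path starting with "/blog"
    %URI{path: "/blog" <> _} -> {:ok, url}
    # Everything else is skipped; the exact error value is ignored
    _ -> {:error, :skipped}
  end
end
```

You would then plug this in with Hop.prefetch/2, exactly as in the example above.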

Fetch

The fetch stage performs the actual requests during your crawl. By default, Hop uses Req and performs a GET request without retries on each URL. In other words, the default fetch function in Hop is:

def fetch(url, state, opts) do
  req_options = opts[:req_options]

  with {:ok, response} <- Req.get(url, req_options) do
    {:ok, response, state}
  end
end

Note that you can customize the Req request options by passing :req_options as part of your Hop's configuration.
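For instance, using put_config/3 with Req's standard :retry and :receive_timeout options, you could disable retries and tighten the timeout. A sketch, with illustrative values:

```elixir
# Pass custom Req options through the Hop configuration.
url
|> Hop.new()
|> Hop.put_config(:req_options, retry: false, receive_timeout: 5_000)
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("Visited: #{url}")
end)
```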

This is simple and acceptable for many cases; however, certain applications require more advanced setups and customization. For example, you might want your fetch to proxy requests through Puppeteer to render JavaScript. You can do this easily:

def custom_fetch(url, state, _opts) do
  # URL of a proxy service that renders the page with Puppeteer
  proxy_url = "http://localhost:3000/render"

  with {:ok, response} <- Req.post(proxy_url, json: %{url: url}) do
    {:ok, response, state}
  end
end

Next

The next function dictates which links are crawled next during execution. It takes as input the current URL, the response from fetch, the current state, and the configuration options. By default, Hop queues every link found on the current page to be crawled next. You can customize this behavior by implementing your own next/4 function:

def custom_next(url, response, state, _opts) do
  links =
    url
    |> Hop.fetch_links(response)
    |> Enum.reject(&String.contains?(&1, "wp-uploads"))

  {:ok, links, state}
end

This simple example ignores all URLs that contain wp-uploads. Hop provides a convenience function, fetch_links/2, which fetches all of the absolute URLs on a webpage, using Floki under the hood.

Summary

Builders

  * Sets this hop's fetch function.
  * Creates a new Hop starting at the given URL(s).
  * Sets this hop's next function.
  * Sets this hop's prefetch function.

Execution

  * Returns a stream that represents the execution of the given Hop.

Configuration

  * Returns the current configuration value set for the given key in the Hop.
  * Puts the given configuration value for the given key in the given hop.

Validators

  * Validates that the given URL's content is valid for the crawl.
  * Validates that the given URL's hostname is valid for the crawl.
  * Validates that the given URL's scheme is valid for the crawl.
  * Validates that the given URL has not already been visited.

HTML Helpers

  * Fetches all of the links on a given page.

State Manipulation

  * Marks the given URL as visited.

Builders

Sets this hop's fetch function.

The fetch function is what is used to actually make requests. By default, Hop uses &Req.get(&1, retry: false). If you want to change the options passed to Req, you can do so via the :req_options configuration option.

Your fetch function should be an arity-3 function which accepts a URL, the current crawl state, and the Hop's configuration options, and returns {:ok, response, state}.

Note that Hop is HTTP-client agnostic. The response object is simply forwarded to subsequent stages, which means you can swap in a different HTTP client if necessary.
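As a sketch of that flexibility, here is a fetch function built on Erlang's bundled :httpc client instead of Req. The map shape of the response is our own choice; Hop forwards it untouched:

```elixir
def httpc_fetch(url, state, _opts) do
  # :inets (and :ssl, for https) must be running before :httpc can be used
  {:ok, _} = Application.ensure_all_started(:inets)
  {:ok, _} = Application.ensure_all_started(:ssl)

  case :httpc.request(:get, {String.to_charlist(url), []}, [], body_format: :binary) do
    {:ok, {{_http_vsn, status, _reason_phrase}, headers, body}} ->
      # Any response shape works; it is forwarded as-is to your next function
      {:ok, %{status: status, headers: headers, body: body}, state}

    {:error, reason} ->
      {:error, reason}
  end
end
```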

Creates a new Hop starting at the given URL(s).

Sets this hop's next function.

The next function dictates which links are meant to be crawled next after the current page.

Sets this hop's prefetch function.

The prefetch function is essentially a pre-request validation stage. It can be used to validate that a given URL is well-formed, that its content is acceptable (e.g. via a HEAD request), that the request is allowed by the site's robots.txt, etc. Most users will want to leave this as-is.

Execution

stream(hop, state \\ %State{})

Returns a stream that represents the execution of the given Hop.

This function will perform a limited-depth, breadth-first crawl from the given start URL, and lazily return tuples of {url, response, state} for each successfully visited page.
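Because the crawl is lazy, you can cap it without visiting the whole site by consuming only part of the stream. A sketch:

```elixir
# Visit at most 10 pages, then stop the crawl.
url
|> Hop.new()
|> Hop.stream()
|> Stream.map(fn {url, _response, _state} -> url end)
|> Enum.take(10)
```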

Configuration

Returns the current configuration value set for the given key in the Hop.

put_config(hop, key, value)

Puts the given configuration value for the given key in the given hop.

Validators

validate_content(value, state, opts)

Validates that the given URL's content is valid for the crawl.

This function attempts a HEAD request to the given URL to check that the response content-type is one of the accepted MIME types and that the content length does not exceed the maximum specified for the crawl.

If the given server does not support HEAD requests, it will simply accept the URL as valid.

validate_hostname(value, state, opts)

Validates that the given URL's hostname is valid for the crawl.

Validates that the hostname is populated, correct, and falls within the set of hostnames allowed according to the crawl state.

validate_scheme(value, state, opts)

Validates that the given URL's scheme is valid for the crawl.

Validates that the scheme is populated, correct, and falls within one of the configured accepted schemes according to the :accepted_schemes configuration option.

validate_visited(value, state, opts)

Validates that the given URL has not already been visited.

The crawl state contains a :visited member holding the set of URLs that have already been visited.

HTML Helpers

fetch_links(url, body, opts \\ [])

Fetches all of the links on a given page.

This function takes the current URL and merges it with each anchor's href on the page to generate fully-qualified URLs.
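This merging presumably follows the same RFC 3986 resolution rules as Elixir's built-in URI.merge/2, which you can use directly to see how a relative href resolves against a page's URL:

```elixir
# A relative href is resolved against the URL of the page it appears on.
"https://example.com/blog/index.html"
|> URI.merge("post.html")
|> URI.to_string()
# => "https://example.com/blog/post.html"
```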

Options

* `:crawl_query?` - whether or not to treat query parameters
as unique links to crawl. Defaults to `true`

* `:crawl_fragment?` - whether or not to treat fragments as
unique links to crawl. Defaults to `false`  

State Manipulation

Marks the given URL as visited.

This will update the set of visited URLs, and also mark this URL as the last crawled URL in the state struct.