Hop (hop v0.1.0)
Hop is a tiny web crawling framework for Elixir.
Hop's goal is to be simple and extensible, while still providing enough guardrails to get up and running quickly. Hop implements a simple, depth-limited, breadth-first crawler - and keeps track of already visited URLs for you. It also provides several utility functions for implementing crawlers with best practices.
You can crawl a website in just a few lines of code with Hop:
url
|> Hop.new()
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("Visited: #{url}")
end)
By default, Hop performs a single-host, breadth-first crawl of all of the pages on the website. The stream/2 execution function returns a stream of {url, response, state} tuples, one for each successfully visited page in the crawl. This is designed to make it easy to perform caching, extract items from a page, or do whatever other work is necessary.
The defaults are designed to get you up and running quickly; the real power, however, comes from Hop's simple extensibility. Hop lets you customize your crawler's behavior at three different steps:
- Prefetch
- Fetch
- Next
Prefetch
The prefetch stage runs before each request and is typically used for request validation. By default, during the prefetch stage, Hop will:
- Check that the URL has a populated scheme and host.
- Check that the URL's scheme is acceptable. The default accepted schemes are http and https. This prevents Hop from attempting to visit tel:, mailto:, and other links with non-HTTP schemes.
- Check that the host is actually valid, using :inet.gethostbyname.
- Check that the host is an acceptable host to visit, e.g. not outside of the host the crawl started on.
- Perform a HEAD request to check that the content-length does not exceed the maximum specified content length and that the content-type matches one of the acceptable MIME types. This is useful for preventing Hop from spending time downloading large files you don't care about.
Most users should stick to using Hop's default pre-request logic. If you want to customize the behavior, however, you can pass your own prefetch/3 function:
url
|> Hop.new()
|> Hop.prefetch(fn url, _state, _opts -> {:ok, url} end)
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("Visited: #{url}")
end)
This simple example performs no validation, and forwards all URLs to the fetch stage.
Your custom prefetch/3 function should be an arity-3 function which takes a URL, the current crawl state, and the current Hop's configuration options as input, and returns {:ok, url} or an error value. Any non-{:ok, url} values will be ignored during fetch.
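For a more realistic sketch, a custom prefetch can reproduce part of the default content check itself with a HEAD request. The 5 MB limit, the :content_too_large error, and the header handling below (Req 0.4+ stores response headers as a map of lists) are illustrative assumptions, not Hop's exact defaults:
def size_limited_prefetch(url, _state, _opts) do
  # Sketch only: reject pages whose reported content-length exceeds an
  # assumed 5 MB limit. Hop's built-in content check behaves similarly.
  with {:ok, %Req.Response{headers: headers}} <- Req.head(url, retry: false) do
    length =
      headers
      |> Map.get("content-length", ["0"])
      |> List.first()
      |> String.to_integer()

    if length <= 5_000_000, do: {:ok, url}, else: {:error, :content_too_large}
  end
end
You would wire it in with Hop.prefetch(&size_limited_prefetch/3), just like the inline function above.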
Fetch
The fetch stage performs the actual requests during your crawl. By default, Hop uses Req and performs a GET request without retries on each URL. In other words, the default fetch function in Hop is:
def fetch(url, state, opts) do
  req_options = opts[:req_options]

  with {:ok, response} <- Req.get(url, req_options) do
    {:ok, response, state}
  end
end
Notice that you can customize the Req request options by passing :req_options as part of your Hop's configuration.
This is simple and acceptable for many cases; however, certain applications require more advanced setups and customization options. For example, you might want to configure your fetch function to proxy requests through Puppeteer in order to render JavaScript. You can do this easily:
def custom_fetch(url, state, _opts) do
  # URL of a proxy that renders the requested page with Puppeteer
  proxy_url = "http://localhost:3000/render"

  with {:ok, response} <- Req.post(proxy_url, json: %{url: url}) do
    {:ok, response, state}
  end
end
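Then attach it as the hop's fetch function, mirroring how prefetch was overridden above:
url
|> Hop.new()
|> Hop.fetch(&custom_fetch/3)
|> Hop.stream()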
Next
The next function dictates the next links to be crawled during execution. It takes as input the current URL, the response from fetch, the current state, and the configuration options. By default, Hop returns all links on the current page as next in the queue to be crawled. You can customize this behavior by implementing your own next/4 function:
def custom_next(url, response, state, _opts) do
  links =
    url
    |> Hop.fetch_links(response)
    |> Enum.reject(&String.contains?(&1, "wp-uploads"))

  {:ok, links, state}
end
This simple example ignores all URLs that contain wp-uploads. Hop provides a convenience function, fetch_links/2, to fetch all of the absolute URLs on a webpage. This just uses Floki under the hood.
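Putting the pieces together, the builders compose in a single pipeline (reusing the custom_fetch/3 and custom_next/4 sketches from above):
url
|> Hop.new()
|> Hop.prefetch(fn url, _state, _opts -> {:ok, url} end)
|> Hop.fetch(&custom_fetch/3)
|> Hop.next(&custom_next/4)
|> Hop.stream()
|> Enum.each(fn {url, _response, _state} ->
  IO.puts("Visited: #{url}")
end)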
Summary
Builders
- Sets this hop's fetch function.
- Creates a new Hop starting at the given URL(s).
- Sets this hop's next function.
- Sets this hop's prefetch function.
Execution
- Returns a stream that represents the execution of the given Hop.
Configuration
- Returns the current configuration value set for the given key in the Hop.
- Puts the given configuration value for the given key in the given hop.
Validators
- Validates that the given URL's content is valid for the crawl.
- Validates that the given URL's hostname is valid for the crawl.
- Validates that the given URL's scheme is valid for the crawl.
- Validates that the given URL has not already been visited.
HTML Helpers
- Fetches all of the links on a given page.
State Manipulation
- Marks the given URL as visited.
Builders
Sets this hop's fetch function.
The fetch function is what is used to actually make requests. By default, Hop uses &Req.get(&1, retry: false). If you want to change the options passed to Req, you can do so here.
Your fetch/1 function should accept a URL and return a tuple of {:ok, response}.
Note that Hop is HTTP-client agnostic. The response object is simply forwarded to the process function. This means you can swap in a new HTTP client if necessary.
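For example, here is a sketch of a custom fetch that keeps Req but changes its options, written in the three-argument form used in the module overview above; the :transient retries and the custom user agent are illustrative assumptions:
url
|> Hop.new()
|> Hop.fetch(fn url, state, _opts ->
  # Assumed options: retry transient errors and send a custom user agent.
  with {:ok, response} <- Req.get(url, retry: :transient, user_agent: "my-crawler/0.1") do
    {:ok, response, state}
  end
end)
|> Hop.stream()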
Creates a new Hop starting at the given URL(s).
Sets this hop's next function.
The next function dictates which links are meant to be crawled next after the current page.
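For instance, a sketch of a next function that only follows links under an assumed /blog path prefix:
def blog_only_next(url, response, state, _opts) do
  # Sketch: only enqueue links whose path starts with /blog.
  links =
    url
    |> Hop.fetch_links(response)
    |> Enum.filter(fn link ->
      path = URI.parse(link).path || "/"
      String.starts_with?(path, "/blog")
    end)

  {:ok, links, state}
end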
Sets this hop's prefetch function.
The prefetch function is essentially a pre-request validation stage. It can be used to validate that a given URL is well-formed, that the content is valid (e.g. via a HEAD request), that the request is allowed by a site's robots.txt, and so on. Most clients will want to leave this as-is.
Execution
Returns a stream that represents the execution of the given Hop.
This function will perform a depth-limited, breadth-first crawl from the given start URL, and lazily return tuples of {url, response, state} for each successfully visited page.
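Because the stream is lazy, you can bound the crawl simply by limiting how much of it you consume; for example:
# Visit at most 10 pages, then stop the crawl.
url
|> Hop.new()
|> Hop.stream()
|> Stream.map(fn {url, _response, _state} -> url end)
|> Enum.take(10)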
Configuration
Returns the current configuration value set for the given key in the Hop.
Puts the given configuration value for the given key in the given hop.
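A sketch of how these two might be used together, assuming the getter and putter are exposed as get_config/2 and put_config/3 (the names here are assumptions; check the function signatures in your installed version). The :req_options key comes from the fetch section above:
# Assumed function names; shapes may differ in the actual API.
hop =
  url
  |> Hop.new()
  |> Hop.put_config(:req_options, retry: false, receive_timeout: 15_000)

Hop.get_config(hop, :req_options)
#=> [retry: false, receive_timeout: 15_000]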
Validators
Validates that the given URL's content is valid for the crawl.
This function attempts to perform a HEAD request to the given URL to check that the response content-type is one of the accepted MIME types and that the content length does not exceed the maximum content length specified for the crawl.
If the given server does not support HEAD requests, it will simply accept the URL as valid.
Validates that the given URL's hostname is valid for the crawl.
Validates that the hostname is populated, correct, and falls within the set of hostnames allowed according to the crawl state.
Validates that the given URL's scheme is valid for the crawl.
Validates that the scheme is populated, correct, and falls within one of the configured accepted schemes according to the :accepted_schemes configuration option.
Validates that the given URL has not already been visited.
The crawl state contains a member, :visited, that is populated with the set of URLs that have already been visited.
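As a standalone illustration of what the scheme check above amounts to (not Hop's own implementation, whose name and return shape may differ):
def acceptable_scheme?(url, accepted_schemes \\ ["http", "https"]) do
  # Parse the URL and check its scheme against the accepted list.
  %URI{scheme: scheme} = URI.parse(url)
  scheme in accepted_schemes
end

# acceptable_scheme?("https://example.com") #=> true
# acceptable_scheme?("tel:+15555555555")    #=> false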
HTML Helpers
Fetches all of the links on a given page.
This function takes the current URL and merges it with each anchor link on the page to generate fully-qualified URLs.
Options
* `:crawl_query?` - whether or not to treat query parameters as unique links to crawl. Defaults to `true`.
* `:crawl_fragment?` - whether or not to treat fragments as unique links to crawl. Defaults to `false`.
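For example, the helper combines naturally with ordinary Enum functions when you only want a subset of a page's links:
# Collect only links served over https.
url
|> Hop.fetch_links(response)
|> Enum.filter(&String.starts_with?(&1, "https://"))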
State Manipulation
Marks the given URL as visited.
This will update the set of visited URLs, and also mark this URL as the last crawled URL in the state struct.