Spidey v0.3.0

A dead-simple, concurrent web crawler which focuses on ease of use and speed.

Usage

Spidey has been designed with ease of use in mind, so all you have to do to get started is:

iex> Spidey.crawl("https://manzanit0.github.io", :crawler_pool, pool_size: 15)
[
  "https://manzanit0.github.io/foo",
  "https://manzanit0.github.io/bar-baz/#",
  ...
]

In a nutshell, the above line will:

  1. Spin up a new supervision tree under the Spidey OTP Application that will contain a pool of workers for crawling.
  2. Create an ETS table to store crawled URLs.
  3. Crawl the website.
  4. Return all the URLs as a list.
  5. Tear down the supervision tree and the ETS table.

The function is synchronous, but if you call it concurrently multiple times, each invocation will spin up a new supervision tree with its own pool and set of workers, as sketched below.
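For instance, here is a minimal sketch of two concurrent crawls, each running under its own supervision tree. The pool names :blog_pool and :docs_pool, and the second site, are arbitrary examples:

tasks = [
  Task.async(fn -> Spidey.crawl("https://manzanit0.github.io", :blog_pool) end),
  Task.async(fn -> Spidey.crawl("https://hexdocs.pm", :docs_pool) end)
]

# Await both crawls; each one tears down its own pool and ETS table when it finishes.
[blog_urls, docs_urls] = Task.await_many(tasks, :infinity)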

Specifying your own filter

Furthermore, if you want to specify your own filter for crawled URLs, you can do so by implementing the Spidey.Filter behaviour:

defmodule MyApp.RssFilter do
  @behaviour Spidey.Filter

  @impl true
  def filter_urls(urls, _opts) do
    # Drop RSS feed links from the stream of crawled URLs.
    urls
    |> Stream.reject(&String.ends_with?(&1, "feed/"))
    |> Stream.reject(&String.ends_with?(&1, "feed"))
  end
end

And simply pass it down to the crawler as an option:

Spidey.crawl("https://manzanit0.github.io", :crawler_pool, filter: MyApp.RssFilter)

It's encouraged to use the Stream module instead of Enum, since the code that handles the filtering works with streams.

Configuration

Currently Spidey supports the following configuration:

  • :log - the log level used when logging events with Elixir's Logger. If false, disables logging. Defaults to :debug.

config :spidey, log: :info

Using the CLI

To be able to run the application, make sure you have Elixir installed. Please check the official instructions: link

Once you have Elixir installed, to set up the application run:

git clone https://github.com/Manzanit0/spidey
cd spidey
mix deps.get
mix escript.build

To crawl websites, run the escript ./spidey:

./spidey --site https://manzanit0.github.io/

Escripts run on any system that has Erlang/OTP installed, regardless of whether Elixir is installed.

CLI options

Spidey provides two main functionalities: crawling a specific domain, and saving the results to a file according to the plain-text sitemap protocol. For the latter, simply append --save to the command, as shown below.
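For example, to crawl a site and save the resulting URLs to a file:

./spidey --site https://manzanit0.github.io/ --save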

Installation

The package can be installed by adding spidey to your list of dependencies in mix.exs:

def deps do
  [
    {:spidey, "~> 0.2"}
  ]
end

The docs can be found at https://hexdocs.pm/spidey

Summary

Functions

Crawls a website for all the same-domain URLs, returning them as a list.

Just like crawl/3, but saves the list of URLs to a file.

Functions

crawl(url, pool_name \\ :default, opts \\ [])

Crawls a website for all the same-domain URLs, returning them as a list.

The default pool_name is :default, but a custom one can be provided.

The default filter rejects assets, WordPress links, and others. To provide custom filtering, implement the Spidey.Filter behaviour and pass it via the filter option.

Furthermore, crawl/3 accepts the following options:

  • filter: a custom URL filter
  • pool_size: the number of workers to crawl the website. Defaults to 20.
  • max_overflow: the number of workers to overflow before queueing URLs. Defaults to 5.

Examples

iex> Spidey.crawl("https://manzanit0.github.io", :crawler_pool, filter: MyCustomFilter, pool_size: 15)
["https://manzanit0.github.io/foo", "https://manzanit0.github.io/bar-baz/#", ...]

crawl_to_file(url, path, pool_name \\ :default, opts \\ [])

Just like crawl/3, but saves the list of URLs to a file.
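For example, a minimal sketch that writes the crawled URLs to a plain-text file (the urls.txt path is just an illustrative choice):

Spidey.crawl_to_file("https://manzanit0.github.io", "urls.txt", :crawler_pool, pool_size: 15)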