Crawlie v0.3.1

The simple Elixir web crawler.

Functions

crawl(source, parser_logic, options \\ [])

Crawls the urls provided in source, using the Crawlie.ParserLogic provided in parser_logic.

The options are used to tweak the crawler’s behaviour. You can use most of HTTPoison’s options, as well as Crawlie-specific options.

Arguments

  • source - a Stream or an Enum containing the urls to crawl
  • parser_logic - a Crawlie.ParserLogic behaviour implementation
  • options - a Keyword List of options
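Putting the arguments together, a minimal call could look like the sketch below. MyParser is a hypothetical Crawlie.ParserLogic implementation (not shown here), and consuming the result with Enum.to_list/1 is an assumption based on typical usage, not something this page guarantees:

```elixir
# MyParser is assumed to implement the Crawlie.ParserLogic behaviour.
urls = ["https://example.com/"]

# Crawl the seed urls, following links up to depth 2.
# Treating the return value as enumerable is an assumption.
results =
  urls
  |> Crawlie.crawl(MyParser, max_depth: 2)
  |> Enum.to_list()
```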

Crawlie-specific options

  • :http_client - module implementing the Crawlie.HttpClient behaviour to be used to make the requests. If not provided, will default to Crawlie.HttpClient.HTTPoisonClient.
  • :mock_client_fun - if you’re using the Crawlie.HttpClient.MockClient, this is the url -> {:ok, body :: String.t} | {:error, term} function simulating making the requests. See the Crawlie.HttpClient.MockClient documentation for details.
  • :max_depth - maximum crawling “depth”. 0 by default.
  • :max_retries - the maximum number of times Crawlie will try to fetch any individual page before giving up. 3 by default.
  • :fetch_phase - Flow partition configuration for the fetching phase of the crawling Flow. It should be a Keyword List containing any subset of the :min_demand, :max_demand and :stages properties. For the meaning of these options, see the Flow documentation.
  • :process_phase - same as :fetch_phase, but for the processing (page parsing, data and link extraction) part of the process
  • :pqueue_module - one of the pqueue implementations: :pqueue, :pqueue2, :pqueue3, :pqueue4. Different implementations have different performance characteristics and allow for different :max_depth values. Consult the pqueue documentation for details. The default is :pqueue3, which offers good performance and allows arbitrary :max_depth values.
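As a sketch of how the :http_client and :mock_client_fun options above fit together in a test, the mock client can be swapped in like this. Only the option names and the function shape come from this page; MyParser and the response bodies are made-up illustrations:

```elixir
# url -> {:ok, body} | {:error, term}, as described in the
# :mock_client_fun option above. The bodies here are fabricated.
mock_fun = fn
  "https://example.com/" -> {:ok, "<html><body>hello</body></html>"}
  _other_url -> {:error, :not_found}
end

# MyParser is assumed to implement Crawlie.ParserLogic.
Crawlie.crawl(["https://example.com/"], MyParser,
  http_client: Crawlie.HttpClient.MockClient,
  mock_client_fun: mock_fun
)
```

Using the mock client keeps tests deterministic and avoids real network traffic, while the rest of the crawling pipeline runs unchanged.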