Crawly v0.7.0 API Reference

Modules

Crawly is a fast high-level web crawling & scraping framework for Elixir.

Crawly HTTP API. Allows scheduling and stopping spiders and getting stats for all running spiders.
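
A hedged sketch of driving the API from Elixir. The port (4001) and the /spiders route are taken from the Crawly README of this era but should be treated as assumptions for this exact version, and HTTPoison stands in for any HTTP client:

    # Ask the management API to schedule a run of a spider
    # (MySpider and the route are illustrative assumptions).
    {:ok, response} =
      HTTPoison.get("http://localhost:4001/spiders/MySpider/schedule")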

Data storage, a module responsible for storing crawled items. At a high level, it starts one storage worker process per running spider.

A worker process which stores items for individual spiders. All items are pre-processed by item_pipelines.

Crawly Engine - process responsible for starting and stopping spiders.
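
A minimal usage sketch; MySpider is a hypothetical spider module, and start_spider/1 / stop_spider/1 are the Engine operations this entry describes (the :ok returns are the expected success values):

    # Start and later stop a spider by module name.
    iex> Crawly.Engine.start_spider(MySpider)
    :ok
    iex> Crawly.Engine.stop_spider(MySpider)
    :ok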

Crawler manager module.

Filters out requests that would go outside the crawled domain.

Avoids scheduling duplicate requests for already-seen pages.

Sets/rotates user agents for crawling. The user agents are read from the :crawly, :user_agents settings.
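
A configuration sketch for config/config.exs covering the three middlewares above; the :middlewares and :user_agents keys follow the Crawly docs, while the agent strings are placeholders:

    # config/config.exs
    config :crawly,
      middlewares: [
        Crawly.Middlewares.DomainFilter,
        Crawly.Middlewares.UniqueRequest,
        Crawly.Middlewares.UserAgent
      ],
      # UserAgent picks agents from this list for outgoing requests.
      user_agents: [
        "Mozilla/5.0 (compatible; MyCrawler/1.0)",
        "Mozilla/5.0 (compatible; MyCrawler/2.0)"
      ]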

Defines the structure of a spider's parse result.
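
As a sketch, the parse result pairs scraped items with follow-up requests; the items and requests fields match the struct's purpose, and the values are illustrative:

    %Crawly.ParsedItem{
      items: [%{title: "Example article", url: "https://www.example.com/1"}],
      requests: [%Crawly.Request{url: "https://www.example.com/2"}]
    }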

A behaviour module for implementing item pipelines.
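
A minimal sketch of a custom pipeline, assuming the run/2 callback shape (item and shared state in, {item, state} out) used by Crawly pipelines of this era; MyApp.AddTimestamp is a hypothetical module:

    defmodule MyApp.AddTimestamp do
      @behaviour Crawly.Pipeline

      # Receives a scraped item and the shared pipeline state;
      # returns the (possibly transformed) item and the new state.
      @impl Crawly.Pipeline
      def run(item, state) do
        {Map.put(item, :crawled_at, DateTime.utc_now()), state}
      end
    end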

Encodes a given item (map) into CSV. Does not flatten nested maps.

Filters out duplicated items based on the provided item_id.

Encodes a given item (map) into JSON.

Ensures that a scraped item contains a set of required fields.

Stores a given item on the filesystem.
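
Putting the pipelines above together, a hedged config/config.exs sketch; the :item and :item_id keys (consumed by Validate and DuplicatesFilter respectively) follow the Crawly docs, the field names are placeholders, and whether WriteToFile needs extra options such as a target folder varies by version:

    # config/config.exs
    config :crawly,
      # Fields every scraped item must contain (checked by Validate).
      item: [:title, :url],
      # Field used by DuplicatesFilter to detect duplicates.
      item_id: :url,
      pipelines: [
        Crawly.Pipelines.Validate,
        Crawly.Pipelines.DuplicatesFilter,
        Crawly.Pipelines.JSONEncoder,
        Crawly.Pipelines.WriteToFile
      ]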

Request wrapper.
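
A sketch of building a request by hand; url is the one field the wrapper is certain to carry, while the headers and options fields (assumed to default to empty lists) are taken from the Crawly docs of this era:

    request = %Crawly.Request{
      url: "https://www.example.com/products",
      headers: [],
      options: []
    }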

Request storage, a module responsible for storing URLs for crawling.

Requests storage worker, a process responsible for storing requests for a given spider.

A behaviour module for implementing a Crawly Spider.
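
A minimal spider sketch under the 0.x callback set (base_url/0, init/0 returning :start_urls, and parse_item/1 producing the Crawly.ParsedItem shape); the module name and scraped values are illustrative, and the response is assumed to be an HTTPoison.Response:

    defmodule MyApp.ExampleSpider do
      @behaviour Crawly.Spider

      @impl Crawly.Spider
      def base_url(), do: "https://www.example.com"

      @impl Crawly.Spider
      def init(), do: [start_urls: ["https://www.example.com/"]]

      @impl Crawly.Spider
      def parse_item(response) do
        # Extract items and follow-up requests from response.body here
        # (e.g. with an HTML parser such as Floki).
        %Crawly.ParsedItem{
          items: [%{title: "Example", url: response.request_url}],
          requests: []
        }
      end
    end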

Utility functions for Crawly.
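
Two helpers, hedged as examples of what lives here: build_absolute_url/2 and request_from_url/1 appear in Crawly tutorials, though their availability in this exact version is an assumption:

    # Resolve a relative href against the page it came from (assumed helper).
    url = Crawly.Utils.build_absolute_url("/products/1", "https://www.example.com")

    # Wrap a URL into a Crawly.Request ready for scheduling (assumed helper).
    request = Crawly.Utils.request_from_url(url)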

A worker process responsible for the actual work (fetching requests, processing responses).