Crawly v0.2.0 API Reference

Modules

Crawly is a fast, high-level web crawling & scraping framework for Elixir.

Crawly HTTP API. Allows you to schedule, stop, and get stats of all running spiders.

Data Storage is a module responsible for storing crawled items. At a high level, the architecture of item storage can be represented as follows

A worker process which stores items for individual spiders. All items are pre-processed by item_pipelines.

Crawly Engine - process responsible for starting and stopping spiders.
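A minimal usage sketch, assuming the Engine exposes start_spider/1 and stop_spider/1 (check the function list for the exact names and arities in this version); MyApp.EslSpider is a hypothetical spider module:

    # Start crawling with a given spider module (illustrative module name).
    Crawly.Engine.start_spider(MyApp.EslSpider)

    # Stop it once enough items have been collected.
    Crawly.Engine.stop_spider(MyApp.EslSpider)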

Engine supervisor responsible for spider subtrees

Crawler manager module

A supervisor module used to spawn Crawler trees

Filters out requests that go outside of the crawled domain

Avoids scheduling requests for the same pages.

Sets/rotates user agents for crawling. The user agents are read from the :crawly, :user_agents setting.
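A configuration sketch for the middlewares described above; the module names are inferred from this listing and the exact config keys may differ between Crawly versions, so treat them as assumptions:

    # config/config.exs
    use Mix.Config

    config :crawly,
      middlewares: [
        Crawly.Middlewares.DomainFilter,   # drop requests outside the crawled domain
        Crawly.Middlewares.UniqueRequest,  # avoid re-scheduling the same pages
        Crawly.Middlewares.UserAgent       # set/rotate the user agent
      ],
      # Read by the user agent middleware, as described above.
      user_agents: [
        "Mozilla/5.0 (compatible; MyBot/1.0; +https://example.com/bot)"
      ]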

Defines the structure of a spider's result

A behavior module for implementing a pipeline module
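A sketch of a custom pipeline, assuming the behaviour is Crawly.Pipeline and that it expects a run/2 callback which receives the scraped item and the pipeline state and returns {item, state} (returning {false, state} is assumed to drop the item):

    defmodule MyApp.Pipelines.TrimTitle do
      @behaviour Crawly.Pipeline

      @impl Crawly.Pipeline
      def run(item, state) do
        case item do
          %{title: title} when is_binary(title) ->
            # Normalise the title before the item is stored.
            {%{item | title: String.trim(title)}, state}

          _ ->
            # No usable title; drop the item (assumed convention).
            {false, state}
        end
      end
    end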

Encodes a given item (map) into CSV

Filters out duplicated items (helps to avoid storing duplicates)

Encodes a given item (map) into JSON

Ensures that a scraped item contains all fields defined in the :item config.
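A pipeline configuration sketch tying the items above together; the :item key is taken from the description above, while :item_id and the exact pipeline module names are assumptions:

    # config/config.exs
    use Mix.Config

    config :crawly,
      # Every scraped item must contain these fields (validation pipeline).
      item: [:title, :url],
      # Field used to detect duplicated items (assumed key name).
      item_id: :url,
      pipelines: [
        Crawly.Pipelines.Validate,
        Crawly.Pipelines.DuplicatesFilter,
        Crawly.Pipelines.JSONEncoder
      ]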

Request wrapper

Request storage, a module responsible for storing URLs for crawling

Requests Storage is a module responsible for storing requests for a given spider.

A behavior module for implementing a Crawly Spider
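A minimal spider sketch, assuming the behaviour expects base_url/0, init/0 (returning the start urls) and parse_item/1 (returning the spider result struct described above); the module name, URLs, and fields are illustrative:

    defmodule MyApp.BlogSpider do
      @behaviour Crawly.Spider

      @impl Crawly.Spider
      def base_url(), do: "https://blog.example.com"

      @impl Crawly.Spider
      def init(), do: [start_urls: ["https://blog.example.com/posts"]]

      @impl Crawly.Spider
      def parse_item(response) do
        # A real spider would parse response.body (e.g. with Floki) and
        # build follow-up requests; here a single placeholder item is returned.
        items = [%{url: response.request_url, title: "TODO: extract the title"}]
        %Crawly.ParsedItem{items: items, requests: []}
      end
    end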

Utility functions for Crawly

A worker process responsible for the actual work (fetching requests, processing responses)