Crawly v0.7.0 API Reference
Modules
Crawly is a fast, high-level web crawling & scraping framework for Elixir.
Crawly HTTP API. Allows scheduling, stopping, and getting stats of all running spiders.
Data Storage is a module responsible for storing crawled items. At a high level, each spider's items are handed to a dedicated storage worker, described next.
A worker process that stores items for an individual spider. All items are pre-processed by the configured item pipelines.
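For illustration, a minimal sketch of handing an item to storage, assuming a store/2 function that takes the spider module and an item map (the spider name, item fields, and the exact function signature are assumptions):

```elixir
# Hand a scraped item to the spider's storage worker; the item passes
# through the configured item pipelines before it is persisted.
# MyApp.BooksSpider and the item fields are made-up examples.
Crawly.DataStorage.store(MyApp.BooksSpider, %{
  title: "Elixir in Action",
  url: "https://example.com/books/elixir-in-action"
})
```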
Crawly Engine - process responsible for starting and stopping spiders.
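For example, spiders can be started and stopped from an IEx session; a sketch assuming the start_spider/1 and stop_spider/1 functions this summary implies (the spider module is a placeholder):

```elixir
# Start the spider's crawl under the engine (MyApp.BooksSpider is hypothetical).
Crawly.Engine.start_spider(MyApp.BooksSpider)

# Stop it again once enough items have been collected.
Crawly.Engine.stop_spider(MyApp.BooksSpider)
```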
Crawler manager module
Filters out requests that go outside of the crawled domain.
Obeys robots.txt.
Avoids scheduling requests for the same pages.
Sets/rotates user agents for crawling. The user agents are read from the :crawly, :user_agents setting (a configuration sketch follows).
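A minimal sketch of wiring these middlewares together in config/config.exs; the :middlewares and :user_agents keys are assumptions based on the summaries above:

```elixir
import Config

config :crawly,
  # Consulted by the UserAgent middleware (key name per the summary above).
  user_agents: ["Crawly Bot 1.0"],
  # Requests flow through these middlewares in order before being scheduled.
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.RobotsTxt,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ]
```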
Defines the structure of a spider's result.
A behavior module for implementing an item pipeline.
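As a sketch, a custom pipeline could look like the following, assuming the behavior expects a run/2 callback that receives the item and the pipeline state and returns an {item, state} tuple ({false, state} to drop the item); the module name and field are hypothetical:

```elixir
defmodule MyApp.Pipelines.AddTimestamp do
  @moduledoc "Hypothetical pipeline that stamps each item with the crawl time."
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state) do
    # Returning {false, state} instead would drop the item here.
    {Map.put(item, :crawled_at, DateTime.utc_now()), state}
  end
end
```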
Encodes a given item (map) into CSV. Does not flatten nested maps.
Filters out duplicated items based on the provided item_id.
Encodes a given item (map) into JSON
Ensures that a scraped item contains the set of required fields.
Stores a given item on the filesystem.
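A sketch of chaining the built-in pipelines above in config/config.exs; the :pipelines, :item, and :item_id keys are assumptions about how the validation and duplicates-filter pipelines read their settings:

```elixir
import Config

config :crawly,
  # Fields every scraped item must contain (checked by the validation pipeline).
  item: [:title, :url],
  # Field used by the duplicates filter to identify items.
  item_id: :title,
  # Items pass through these pipelines in order before being stored.
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.JSONEncoder
  ]
```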
Request wrapper
Request storage: a module responsible for storing URLs for crawling.
Requests Storage is a module responsible for storing requests for a given spider.
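For illustration, assuming store/2 and pop/1 functions with these shapes (the request URL and spider module are hypothetical):

```elixir
# Queue a request for a given spider...
request = %Crawly.Request{url: "https://example.com/page/1"}
Crawly.RequestsStorage.store(MyApp.BooksSpider, request)

# ...which a worker later pops in order to fetch it.
request_to_fetch = Crawly.RequestsStorage.pop(MyApp.BooksSpider)
```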
A behavior module for implementing a Crawly Spider
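For illustration, a minimal spider sketch, assuming the behavior's callbacks are base_url/0, init/0 (returning the start URLs), and parse_item/1 (returning a Crawly.ParsedItem); the target site, CSS selectors, the Floki dependency, and the HTTPoison-style response struct are assumptions:

```elixir
defmodule MyApp.BooksSpider do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://books.example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://books.example.com/catalogue/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Parse the HTML once (Floki is an assumed HTML-parsing dependency).
    {:ok, document} = Floki.parse_document(response.body)

    # Extract one item per product card on the page.
    # response.request_url assumes an HTTPoison-style response struct.
    items =
      document
      |> Floki.find("article.product h3 a")
      |> Enum.map(fn link ->
        %{title: Floki.text(link), url: response.request_url}
      end)

    # Follow pagination links as new requests.
    requests =
      document
      |> Floki.find("li.next a")
      |> Floki.attribute("href")
      |> Enum.map(fn href ->
        %Crawly.Request{url: "https://books.example.com/catalogue/" <> href}
      end)

    %Crawly.ParsedItem{items: items, requests: requests}
  end
end
```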
Utility functions for Crawly
A worker process responsible for the actual work (fetching requests, processing responses).