Crawly v0.3.0 API Reference
Modules
Crawly is a fast high-level web crawling & scraping framework for Elixir.
Crawly HTTP API. Allows you to schedule, stop, and get stats of all running spiders.
Data storage, a module responsible for storing crawled items. At a high level, it hands each spider's items off to a dedicated storage worker process.
A worker process that stores items for an individual spider. All items are pre-processed by the item pipelines before being stored.
Crawly Engine - process responsible for starting and stopping spiders.
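For illustration, a minimal sketch of driving the engine from IEx; `MySpider` is a hypothetical module implementing the `Crawly.Spider` behaviour, and the `start_spider/1` / `stop_spider/1` calls are assumed from Crawly's public API.

```elixir
# Start a crawl for a spider module (MySpider is hypothetical).
Crawly.Engine.start_spider(MySpider)

# ...later, stop the crawl and release its resources.
Crawly.Engine.stop_spider(MySpider)
```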
Engine supervisor responsible for spider subtrees
Crawler manager module, responsible for managing an individual spider's crawl.
A supervisor module used to spawn crawler trees.
Filters out requests that go outside of the crawled domain.
Obeys robots.txt.
Avoids scheduling duplicate requests for the same pages.
Sets/rotates user agents for crawling. The user agents are read from the :crawly, :user_agents settings.
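As a sketch, the middleware chain and user agents might be wired up as below; the config keys (`:middlewares`, `:user_agents`) and middleware module names are assumptions based on Crawly's codebase and should be checked against v0.3.0.

```elixir
# config/config.exs -- a hedged sketch, not verified against v0.3.0.
use Mix.Config

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt,
    Crawly.Middlewares.UserAgent
  ],
  # The user-agent middleware picks one of these strings for each request.
  user_agents: [
    "Mozilla/5.0 (compatible; MyCrawler/1.0)",
    "Mozilla/5.0 (compatible; MyCrawler/1.0; +https://example.com/bot)"
  ]
```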
Defines the structure of a spider's result.
A behaviour module for implementing a pipeline module.
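A minimal sketch of a custom pipeline, assuming the behaviour requires a `run/2` callback that takes the item and the shared pipeline state and returns `{item, state}` to continue or `{false, state}` to drop the item; `MyCrawler.Pipelines.TrimTitle` is hypothetical.

```elixir
defmodule MyCrawler.Pipelines.TrimTitle do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state) do
    case item do
      %{title: title} when is_binary(title) ->
        # Normalize the title and pass the item down the pipeline.
        {Map.put(item, :title, String.trim(title)), state}

      _ ->
        # Drop items without a title.
        {false, state}
    end
  end
end
```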
Encodes a given item (map) into CSV
Filters out duplicated items (helps to avoid storing duplicates)
Encodes a given item (map) into JSON
Ensures that a scraped item contains all of the fields defined in the :item config setting.
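A sketch of the related item settings, assuming the `:item`, `:item_id`, and `:pipelines` keys described in these summaries; the pipeline module names are taken from Crawly's codebase and may differ in v0.3.0.

```elixir
use Mix.Config

config :crawly,
  # Every scraped item must contain these fields (enforced by the validation pipeline).
  item: [:title, :url],
  # Field used to detect duplicate items.
  item_id: :url,
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.JSONEncoder
  ]
```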
Request wrapper
Request storage, a module responsible for storing URLs for crawling.
Requests storage, a module responsible for storing requests for a given spider.
A behaviour module for implementing a Crawly spider.
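A hypothetical spider sketch, assuming the behaviour defines `base_url/0`, `init/0`, and `parse_item/1`, with `parse_item/1` returning a map of `:items` to store and `:requests` to schedule.

```elixir
defmodule MyCrawler.BooksSpider do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://books.toscrape.com"

  @impl Crawly.Spider
  def init() do
    # Requests for these URLs are scheduled when the spider starts.
    [start_urls: ["https://books.toscrape.com/"]]
  end

  @impl Crawly.Spider
  def parse_item(_response) do
    # Extraction is omitted; a real spider would parse the response body
    # (e.g. with Floki) and return scraped items plus follow-up requests.
    %{items: [], requests: []}
  end
end
```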
Utility functions for Crawly
A worker process responsible for the actual work (fetching requests, processing responses).