ReqCrawl.Robots (ReqCrawl v0.2.0)

A Req plugin to parse robots.txt files

You can attach this plugin to any %Req.Request you use for a crawler, and it will only run against requests whose URL path is /robots.txt.

It outputs a map with the following fields:

  • :errors - A list of any errors encountered during parsing
  • :sitemaps - A list of sitemap URLs
  • :rules - A map of the rules, keyed by User-Agent, where each value is a map with the following fields:
    • :allow - A list of allowed paths
    • :disallow - A list of the disallowed paths

Output is stored in the Req.Response's :private field under the :crawl_robots key.
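For illustration, a result read back from the response might look like this (the sitemap URL, User-Agent, and paths below are hypothetical, not values produced by the library):

```elixir
response.private[:crawl_robots]
#=> %{
#=>   errors: [],
#=>   sitemaps: ["https://example.com/sitemap.xml"],
#=>   rules: %{
#=>     "*" => %{allow: ["/"], disallow: ["/admin", "/private"]}
#=>   }
#=> }
```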

Summary

Functions

attach(request, options \\ [])

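A minimal sketch of attaching the plugin, assuming the standard Req plugin flow (the base URL is illustrative and no options are passed):

```elixir
req =
  Req.new(base_url: "https://example.com")
  |> ReqCrawl.Robots.attach()

# The plugin only acts on requests whose path is /robots.txt.
{:ok, response} = Req.request(req, url: "/robots.txt")

# Parsed output is available under the :crawl_robots key (see above).
robots = response.private[:crawl_robots]
```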