ReqCrawl.Sitemap (ReqCrawl v0.2.0)

Gathers all URLs from a Sitemap or SitemapIndex according to the specification described at https://sitemaps.org/protocol.html

Supports the following formats:

  • .xml (for sitemap and sitemapindex)
  • .txt (for sitemap)
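For the .txt format, the sitemaps.org protocol describes a plain text file with one URL per line. The following is a minimal sketch of that idea, not ReqCrawl's actual implementation; the module name `TxtSitemapSketch` and its `parse/1` function are hypothetical and exist only for illustration.

```elixir
defmodule TxtSitemapSketch do
  # Hypothetical helper, NOT part of ReqCrawl: splits a text sitemap body
  # into one URL per line and tags the result as a :sitemap.
  def parse(body) when is_binary(body) do
    urls =
      body
      # Accept both Unix and Windows line endings; drop empty pieces.
      |> String.split(["\r\n", "\n"], trim: true)
      # Remove stray whitespace around each URL.
      |> Enum.map(&String.trim/1)
      # Discard lines that contained only whitespace.
      |> Enum.reject(&(&1 == ""))

    {:sitemap, urls}
  end
end
```

A text sitemap can only list page URLs, which is why this sketch always tags its result as :sitemap.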

Outputs a two-element tuple {type, urls}, where type is either :sitemap or :sitemapindex and urls is a list of URL strings extracted from the response body.

The output is stored in the private field of the Req.Response struct, under the :crawl_sitemap key.
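The sketch below shows one way the stored tuple might be read back. It assumes the step is attached like other Req plugins, by piping a request through attach/2, and that the `req` and `req_crawl` dependencies are already available. The module `SitemapExample`, its functions, and the way a missing key is handled are hypothetical and not part of ReqCrawl.

```elixir
defmodule SitemapExample do
  # Hypothetical helper: summarizes the {type, urls} tuple described above.
  def summarize({:sitemap, urls}) when is_list(urls),
    do: "sitemap with #{length(urls)} page URL(s)"

  def summarize({:sitemapindex, urls}) when is_list(urls),
    do: "sitemap index pointing to #{length(urls)} sitemap(s)"

  # Map.get/2 returns nil when the :crawl_sitemap key is absent.
  def summarize(nil), do: "no sitemap data"

  # Needs network access, so it is defined here but not called.
  # Assumption: attach/2 is used with its default options.
  def fetch(url) do
    response =
      Req.new(url: url)
      |> ReqCrawl.Sitemap.attach()
      |> Req.get!()

    # The private field is a map; read the tuple stored by the plugin.
    Map.get(response.private, :crawl_sitemap)
  end
end
```

With this in place, something like `SitemapExample.fetch(url) |> SitemapExample.summarize()` would report what was found for a given sitemap URL.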

Functions

attach(request, options \\ [])
