Roboxir (Roboxir v0.1.1)

Roboxir is a straightforward robots.txt parser that tells you whether a crawler with a given name is allowed to crawl a website. It exposes two functions, crawlable/2 and crawlable?/2.

Summary

Functions

crawlable(agent_name, url \\ nil)

Like crawlable?/2, parses the robots.txt of the given website, but returns a struct that can be used to determine the allowed and disallowed URL paths for the agent.

crawlable?(agent_name, url \\ nil)

Checks whether a user agent is allowed to crawl the website. Returns true if the agent can crawl the page, false otherwise.

Functions

crawlable(agent_name, url \\ nil)

Specs

crawlable(String.t(), String.t()) :: Roboxir.UserAgent.t()

Like crawlable?/2, parses the robots.txt of the given website, but returns a struct that can be used to determine the allowed and disallowed URL paths for the agent.

Examples

iex> user_agent = Roboxir.crawlable("some_random_agent", "https://google.com/")
%Roboxir.UserAgent{
  allowed_urls: ["/js/", "/finance", "/maps/reserve/partners", "/maps/reserve",
   "/searchhistory/", "/alerts/$", "/alerts/remove", "/alerts/manage",
   "/accounts/o8/id", "/s2/static"],
  delay: 0,
  disallowed_urls: ["/nonprofits/account/", "/localservices/*", "/local/tab/",
   "/local/place/rap/", "/local/place/reviews/", ..],
  name: "google",
  sitemap_urls: []
}

iex> user_agent = Roboxir.crawlable("some_random_agent", "https://google.com/")
iex> user_agent.disallowed_urls
["/nonprofits/account/", "/localservices/*", "/local/tab/", "/local/place/rap/",
 "/local/place/reviews/", "/local/place/products/", "/local/dining/",
 "/local/dealership/", "/local/cars/", "/local/cars", "/intl/*/about/views/",
 "/about/views/", ..]
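
The returned struct can be inspected with ordinary Enum and String functions. The following is a minimal sketch of checking a path against disallowed_urls. PathCheck and disallowed?/2 are hypothetical names, not part of Roboxir, and the plain map stands in for a struct returned by Roboxir.crawlable/2 so that the snippet runs without network access. It treats each rule as a literal path prefix and does not interpret wildcard (*) or end-anchor ($) rules.

```elixir
defmodule PathCheck do
  # Hypothetical helper, not part of Roboxir.
  # Treats every rule as a literal path prefix; `*` and `$` are NOT interpreted.
  def disallowed?(%{disallowed_urls: rules}, path) when is_binary(path) do
    Enum.any?(rules, &String.starts_with?(path, &1))
  end
end

# Stand-in for a struct returned by Roboxir.crawlable/2, using a few of the
# rules shown in the example output (assumption: only these fields are needed).
user_agent = %{
  allowed_urls: ["/js/", "/finance"],
  disallowed_urls: ["/nonprofits/account/", "/local/tab/", "/local/cars"],
  delay: 0
}

PathCheck.disallowed?(user_agent, "/local/cars/used")
# => true

PathCheck.disallowed?(user_agent, "/finance")
# => false
```

Because the function pattern-matches only on the disallowed_urls key, it should accept a %Roboxir.UserAgent{} struct as well as a plain map.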

crawlable?(agent_name, url \\ nil)

Specs

crawlable?(String.t(), String.t()) :: boolean()

Checks whether a user agent is allowed to crawl the website. Returns true if the agent can crawl the page, false otherwise.

Examples

iex> Roboxir.crawlable?("your_agent_name", "https://google.com/")
true
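
One possible use of a boolean check like this is to filter a list of candidate URLs before fetching them. The sketch below is an illustration only: CrawlFilter and permitted/3 are hypothetical names, not part of Roboxir. The checker is passed in as a function and is assumed to default to Roboxir.crawlable?/2, while the example call uses an offline stub so that it runs without the library or network access.

```elixir
defmodule CrawlFilter do
  # Hypothetical helper, not part of Roboxir.
  # `checker` is any 2-arity function (agent_name, url) -> boolean.
  # The default is assumed to be Roboxir.crawlable?/2.
  def permitted(urls, agent_name, checker \\ &Roboxir.crawlable?/2) do
    Enum.filter(urls, fn url -> checker.(agent_name, url) end)
  end
end

# Offline stub standing in for Roboxir.crawlable?/2 (illustrative rule only).
stub = fn _agent, url -> not String.contains?(url, "blocked") end

CrawlFilter.permitted(
  ["https://example.com/", "https://blocked.example.com/"],
  "your_agent_name",
  stub
)
# => ["https://example.com/"]
```

Taking the checker as an argument keeps the filtering logic testable without real requests. In an application with Roboxir installed, the third argument could be omitted.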