ScraperEx (scraper_ex v0.1.2)

ScraperEx

This library exists to make scraping a bit easier for business use cases.

Installation

Available in Hex, the package can be installed by adding scraper_ex to your list of dependencies in mix.exs:

def deps do
  [
    {:scraper_ex, "~> 0.1.0"}
  ]
end

The docs can be found at https://hexdocs.pm/scraper_ex.

Usage

ScraperEx uses Hound under the hood, so you can configure Hound to use any browser or runner you like. By default, ScraperEx uses chrome_headless.
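As a minimal sketch of what that configuration might look like, the fragment below sets Hound's own `driver` and `browser` options in `config/config.exs`. It assumes a ChromeDriver instance is running locally. Whether ScraperEx applies further defaults on top of this is not stated here, so treat the values as illustrative.

```elixir
# config/config.exs
import Config

# These are Hound's settings, not ScraperEx's.
# Assumption: a ChromeDriver server is already running on this machine.
config :hound,
  driver: "chrome_driver",
  browser: "chrome_headless"
```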

ScraperEx.Window

This module manages windows with Hound. By default, Hound's window management does little to help with session management, which leaves zombie windows hanging around that can really start to eat up memory. To avoid this, use ScraperEx.Window to run and interact with an individual session.

ScraperEx

The two useful functions here are ScraperEx.run_task_in_window and ScraperEx.run_task. Both take a list of steps for a scraper to run. run_task_in_window also starts a window for you, whereas run_task does not, so you need to manage your own ScraperEx.Window.

Tasks

Tasks are defined by configs. You can use either the struct form, via the ScraperEx.Task.Config modules, or the short forms.

The following actions are currently implemented:

You can allow a command to fail without stopping the task by wrapping it in ScraperEx.allow_error:

  ScraperEx.allow_error({:click, {:css, ".thing"}})
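To show how an error-tolerant step might sit inside a full task list, here is a small sketch that builds one from plain data. It uses only short forms that appear in the `task_atom_config()` type, and it writes the wrapped step as the `{:allow_error, task}` tuple directly, on the assumption (based on the `@spec` for `allow_error`) that this is what `ScraperEx.allow_error/1` returns. The module name `TaskListSketch`, the `.cookie-banner button` selector, and the screenshot path are hypothetical.

```elixir
defmodule TaskListSketch do
  @moduledoc false

  # Builds a task list as plain data; nothing is executed here.
  def steps(url) when is_binary(url) do
    [
      {:navigate_to, url},
      # Hypothetical selector: a banner that may or may not be on the page,
      # so a failed click should not abort the remaining steps.
      {:allow_error, {:click, {:css, ".cookie-banner button"}}},
      # Hypothetical output path.
      {:screenshot, "/tmp/page.png"}
    ]
  end
end
```

A list built this way could then be passed to `ScraperEx.run_task_in_window/2`, for example `ScraperEx.run_task_in_window(TaskListSketch.steps("https://example.com/"))`.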

Example

iex> ScraperEx.run_task_in_window([
...>   {:navigate_to, "https://en.wikipedia.org/wiki/Example.com"},
...>   {:read, :references, {:css, ".reference-text"}},
...>   {:read, :page_title, {:id, "firstHeading"}},
...>   {:read, :external_link_4, {:css, "#bodyContent ul:nth-child(21) li:nth-child(4)"}},
...>   {:click, {:css, "h2:has(#External_links) + ul li:nth-of-type(3) a"}, :timer.seconds(1)},
...>   {:read, :clicked_url, {:css, "h1"}},
...> ])
{:ok,
 %{
   page_title: "Example.com",
   external_link_4: "example.edu",
   clicked_url: "Example Domain",
   references: [
     "\"IANA WHOIS Service\". IANA. Retrieved 2022-10-25.",
     "\"IANA-managed Reserved Domains\". IANA. Retrieved 2020-06-20.",
     "RFC 2606, Reserved Top Level DNS Names, D. Eastlake, A. Panitz, The Internet Society (June 1999), Section 3.",
     "RFC 6761, S. Cheshire, M. Krochmal, Special-Use Domain Names, IETF (February 2013)"
   ]
 }}

Summary

Functions

run_task/1

Runs a task within a window you control. This is useful when you have a long-running window on which you need to run multiple tasks.

run_task_in_window/2

Runs a task in a window that is started for you. This is useful for short-running tasks that open and close a web page.

Types

@type task_atom_config() ::
  {:navigate_to, url :: String.t()}
  | {:navigate_to, url :: String.t(), load_time :: pos_integer()}
  | {:input, Hound.Element.selector()}
  | {:value, key :: String.t() | atom(), Hound.Element.selector()}
  | {:click, Hound.Element.selector()}
  | :screenshot
  | {:screenshot, path :: String.t()}
  | {:scroll, x :: pos_integer()}
  | {:scroll, x :: pos_integer(), y :: pos_integer()}
  | {:sleep, period :: pos_integer()}
  | {:send_text, text :: String.t()}
  | {:send_keys, keys :: [atom()] | atom()}
  | {:javascript, script :: String.t()}
  | {:javascript, key :: atom() | String.t(), script :: String.t()}
@type task_config() ::
  task_atom_config()
  | task_module_config()
  | {:allow_error, task_atom_config() | task_module_config()}
@type window_opts() :: [name: String.t(), start_fn: (() -> any())]

Functions

@spec allow_error(task_config()) ::
  {:allow_error, task_atom_config() | task_module_config()}
@spec run_task([task_config()]) :: ErrorMessage.t_res(map())

Runs a task within a window you control. This is useful when you have a long-running window on which you need to run multiple tasks.

Example

iex> Hound.start_session()
iex> ScraperEx.run_task([
...>   {:navigate_to, "https://example.com/", :timer.seconds(1)},
...>   {:click, {:css, "a"}, :timer.seconds(1)},
...>   {:read, :page_title, {:css, "h1"}},
...> ])
{:ok, %{page_title: "IANA-managed Reserved Domains"}}
iex> Hound.end_session()
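
When you manage the session yourself, a raised error between starting and ending it can leave the session open. One way to guard against that is a small wrapper built on `try/after`, sketched below. The module `SessionSketch` and its `with_session/3` function are hypothetical helpers, not part of ScraperEx or Hound. The start and end callbacks are passed in so the wrapper does not depend on either library.

```elixir
defmodule SessionSketch do
  @moduledoc false

  # Calls start_fn, runs work_fn, and always calls end_fn afterwards.
  # Returns whatever work_fn returns.
  def with_session(start_fn, end_fn, work_fn)
      when is_function(start_fn, 0) and is_function(end_fn, 0) and
             is_function(work_fn, 0) do
    start_fn.()

    try do
      work_fn.()
    after
      # Runs even if work_fn raises, so the session is not left behind.
      end_fn.()
    end
  end
end
```

With Hound, this might be called as `SessionSketch.with_session(&Hound.start_session/0, &Hound.end_session/0, fn -> ScraperEx.run_task(tasks) end)`, where `tasks` is your own list of task configs.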

run_task_in_window(configs, window_opts \\ [])

@spec run_task_in_window([task_config()], window_opts()) :: ErrorMessage.t_res(map())

Runs a task in a window that is started for you. This is useful for short-running tasks that open and close a web page.

Example

iex> ScraperEx.run_task_in_window([
...>   {:navigate_to, "https://example.com/", :timer.seconds(1)},
...>   {:click, {:css, "a"}, :timer.seconds(1)},
...>   {:read, :page_title, {:css, "h1"}},
...> ])
{:ok, %{page_title: "IANA-managed Reserved Domains"}}
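
Callers usually need to handle both outcomes of a run. The sketch below pulls a single value out of a result shaped like the one above. It assumes `ErrorMessage.t_res(map())` is an `{:ok, map}` or `{:error, error}` tuple, which is consistent with the example output. `ResultSketch` and `fetch_value/2` are hypothetical names, and the `{:missing_key, key}` error shape is this sketch's own convention.

```elixir
defmodule ResultSketch do
  @moduledoc false

  # Returns {:ok, value} when the run succeeded and the key was read,
  # otherwise an {:error, reason} tuple.
  def fetch_value({:ok, results}, key) when is_map(results) do
    case Map.fetch(results, key) do
      {:ok, value} -> {:ok, value}
      :error -> {:error, {:missing_key, key}}
    end
  end

  # Pass a failed run through unchanged.
  def fetch_value({:error, reason}, _key), do: {:error, reason}
end
```

For example, `ScraperEx.run_task_in_window(tasks) |> ResultSketch.fetch_value(:page_title)` would yield `{:ok, title}` only when the run succeeded and a `:page_title` step was read.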