ScraperEx (scraper_ex v0.1.2)
This library exists to make scraping a bit easier for business use cases.
Installation
Available in Hex, the package can be installed by adding scraper_ex to your list of dependencies in mix.exs:
def deps do
  [
    {:scraper_ex, "~> 0.1.0"}
  ]
end
The docs can be found at https://hexdocs.pm/scraper_ex.
Usage
ScraperEx uses Hound under the hood, which means you can configure Hound to use any browser/runner you'd like. By default we use chrome_headless.
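As a minimal sketch of that configuration, the snippet below sets Hound's driver and browser in config/config.exs. The `driver` and `browser` keys are Hound's own configuration options. The assumption here is that a ChromeDriver server is already running locally on its default port.

```elixir
# config/config.exs
import Config

# Assumes a ChromeDriver server is already running locally.
config :hound,
  driver: "chrome_driver",
  browser: "chrome_headless"
```

Changing the values to `driver: "selenium", browser: "firefox"` should point Hound at a Selenium server instead, provided one is running.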
ScraperEx.Window
This module manages windows with Hound. By default, Hound's window management does little to help with session management, which leaves zombie windows hanging around that can really start to eat up memory. To avoid this, use ScraperEx.Window to run and interact with an individual session.
ScraperEx
The two useful functions here are ScraperEx.run_task_in_window and ScraperEx.run_task. run_task lets you supply a list of steps for a scraper. run_task_in_window does the same but also starts a window for you. The bare run_task does not, so you need to manage your own ScraperEx.Window.
Tasks
Tasks are defined by configs. You can use either the struct form, via the ScraperEx.Task.Config modules, or the short forms.
The following actions are currently implemented:
:navigate_to or ScraperEx.Task.Config.Navigate
:input or ScraperEx.Task.Config.Input
:click or ScraperEx.Task.Config.Click
:read or ScraperEx.Task.Config.Read
:screenshot or ScraperEx.Task.Config.Screenshot
:scroll or ScraperEx.Task.Config.Scroll
:sleep or ScraperEx.Task.Config.Sleep
:send_text or ScraperEx.Task.Config.SendText
:send_keys or ScraperEx.Task.Config.SendKeys
:javascript or ScraperEx.Task.Config.Javascript
You can allow errors by wrapping a command in ScraperEx.allow_error:
ScraperEx.allow_error({:click, {:css, ".thing"}})
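A small sketch of how an optional step might sit in a task list. Going by the `allow_error` spec, the wrapper is just an `{:allow_error, config}` tuple, so the list below is plain data and can be built without a browser. The `OptionalStepExample` module, the URL, and the `.cookie-banner button` selector are hypothetical, and the idea that the run continues past a failed step is an assumption drawn from the function's name.

```elixir
defmodule OptionalStepExample do
  # Hypothetical task list: dismiss a cookie banner if present, then read the heading.
  def tasks do
    [
      {:navigate_to, "https://example.com/"},
      # Tuple form that ScraperEx.allow_error/1 returns according to its spec;
      # assumed to let the run continue when the click fails.
      {:allow_error, {:click, {:css, ".cookie-banner button"}}},
      {:read, :page_title, {:css, "h1"}}
    ]
  end
end
```

The list can then be passed to `ScraperEx.run_task_in_window(OptionalStepExample.tasks())`.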
Example
iex> ScraperEx.run_task_in_window([
...> {:navigate_to, "https://en.wikipedia.org/wiki/Example.com"},
...> {:read, :references, {:css, ".reference-text"}},
...> {:read, :page_title, {:id, "firstHeading"}},
...> {:read, :external_link_4, {:css, "#bodyContent ul:nth-child(21) li:nth-child(4)"}},
...> {:click, {:css, "h2:has(#External_links) + ul li:nth-of-type(3) a"}, :timer.seconds(1)},
...> {:read, :clicked_url, {:css, "h1"}},
...> ])
%{
  page_title: "Example.com",
  external_link_4: "example.edu",
  clicked_url: "Example Domain",
  references: [
    "\"IANA WHOIS Service\". IANA. Retrieved 2022-10-25.",
    "\"IANA-managed Reserved Domains\". IANA. Retrieved 2020-06-20.",
    "RFC 2606, Reserved Top Level DNS Names, D. Eastlake, A. Panitz, The Internet Society (June 1999), Section 3.",
    "RFC 6761, S. Cheshire, M. Krochmal, Special-Use Domain Names, IETF (February 2013)"
  ]
}
Summary
Functions
run_task: Runs a task within a window you control. Useful when you have a long-running window that you need to run multiple tasks in.
run_task_in_window: Runs a task and starts a window for you. Useful for a short-running task that opens and closes a web page.
Types
@type task_atom_config() ::
        {:navigate_to, url :: String.t()}
        | {:navigate_to, url :: String.t(), load_time :: pos_integer()}
        | {:input, Hound.Element.selector()}
        | {:value, key :: String.t() | atom(), Hound.Element.selector()}
        | {:click, Hound.Element.selector()}
        | :screenshot
        | {:screenshot, path :: String.t()}
        | {:scroll, x :: pos_integer()}
        | {:scroll, x :: pos_integer(), y :: pos_integer()}
        | {:sleep, period :: pos_integer()}
        | {:send_text, text :: String.t()}
        | {:send_keys, keys :: [atom()] | atom()}
        | {:javascript, script :: String.t()}
        | {:javascript, key :: atom() | String.t(), script :: String.t()}
@type task_config() ::
        task_atom_config()
        | task_module_config()
        | {:allow_error, task_atom_config() | task_module_config()}
@type task_module_config() ::
        ScraperEx.Task.Config.Screenshot.t()
        | ScraperEx.Task.Config.Navigate.t()
        | ScraperEx.Task.Config.Input.t()
        | ScraperEx.Task.Config.Click.t()
        | ScraperEx.Task.Config.Scroll.t()
        | ScraperEx.Task.Config.Read.t()
Functions
@spec allow_error(task_config()) :: {:allow_error, task_atom_config() | task_module_config()}
@spec run_task([task_config()]) :: ErrorMessage.t_res(map())
Runs a task within a window you control. Useful when you have a long-running window that you need to run multiple tasks in.
Example
iex> Hound.start_session()
iex> ScraperEx.run_task([
...> {:navigate_to, "https://example.com/", :timer.seconds(1)},
...> {:click, {:css, "a"}, :timer.seconds(1)},
...> {:read, :page_title, {:css, "h1"}},
...> ])
{:ok, %{page_title: "IANA-managed Reserved Domains"}}
iex> Hound.end_session()
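Because `run_task` leaves session management to the caller, one way to avoid leaking a session when a step raises is to wrap the call in `try`/`after`. The helper below is a sketch, not part of ScraperEx: `SessionGuard` and `with_session` are hypothetical names, and the start and end callbacks default to `Hound.start_session/0` and `Hound.end_session/0` but can be swapped out so the helper can be exercised without a browser.

```elixir
defmodule SessionGuard do
  @doc """
  Starts a session, runs `fun`, and always ends the session afterwards.

  The start and end callbacks default to Hound's own functions and are
  injectable so the helper can be exercised without a browser.
  """
  def with_session(fun, start_fun \\ &Hound.start_session/0, end_fun \\ &Hound.end_session/0) do
    start_fun.()

    try do
      fun.()
    after
      # Runs whether fun returns normally, raises, throws, or exits.
      end_fun.()
    end
  end
end
```

With Hound available, a call such as `SessionGuard.with_session(fn -> ScraperEx.run_task(tasks) end)` would return whatever `run_task` returns while still closing the session.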
@spec run_task_in_window([task_config()], window_opts()) :: ErrorMessage.t_res(map())
Runs a task and starts a window for you. Useful for a short-running task that opens and closes a web page.
Example
iex> ScraperEx.run_task_in_window([
...> {:navigate_to, "https://example.com/", :timer.seconds(1)},
...> {:click, {:css, "a"}, :timer.seconds(1)},
...> {:read, :page_title, {:css, "h1"}},
...> ])
{:ok, %{page_title: "IANA-managed Reserved Domains"}}
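Both specs return `ErrorMessage.t_res(map())`, which is assumed here to mean either `{:ok, map}` or `{:error, error}`. A minimal sketch of handling both branches with pattern matching follows. `ResultExample` and `page_title/1` are hypothetical names, and the `:page_title` key matches the one read in the example above.

```elixir
defmodule ResultExample do
  # Successful run that captured the :page_title key.
  def page_title({:ok, %{page_title: title}}), do: {:ok, title}

  # Successful run, but no step stored a :page_title value.
  def page_title({:ok, _data}), do: {:error, :page_title_missing}

  # Failed run: pass the error through unchanged.
  def page_title({:error, error}), do: {:error, error}
end
```

The result of `ScraperEx.run_task_in_window/2` can be piped straight into `ResultExample.page_title/1`.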