SpiderMan behaviour (spider_man v0.3.1)

Documentation for SpiderMan.

Spider Life Cycle

  1. Spider.settings()
  2. Spider.prepare_for_start(:pre, state)
  3. Spider.prepare_for_start_component(:downloader, state)
  4. Spider.prepare_for_start_component(:spider, state)
  5. Spider.prepare_for_start_component(:item_processor, state)
  6. Spider.prepare_for_start(:post, state)
  7. Spider.init(state)
  8. Spider.handle_response(response, context)
  9. Spider.prepare_for_stop_component(:downloader, state)
  10. Spider.prepare_for_stop_component(:spider, state)
  11. Spider.prepare_for_stop_component(:item_processor, state)
  12. Spider.prepare_for_stop(state)
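
The life cycle above can be sketched as a minimal callback module. This is an illustration only: MySpider is a hypothetical name, the module assumes the spider_man package is available as a dependency, and each callback simply passes its argument through. Only handle_response/2 is required; the other callbacks shown are optional, and the per-component callbacks (steps 3-5 and 9-11) are left out for brevity.

```elixir
defmodule MySpider do
  # Hypothetical spider; assumes the spider_man package is a dependency.
  @behaviour SpiderMan

  require Logger

  # Step 1: spider-level settings (:print_stats is a documented setting).
  @impl SpiderMan
  def settings, do: [print_stats: false]

  # Steps 2 and 6: called once with :pre and once with :post.
  @impl SpiderMan
  def prepare_for_start(stage, state) do
    Logger.debug("prepare_for_start #{inspect(stage)}")
    state
  end

  # Step 7: receives the engine state and returns it.
  @impl SpiderMan
  def init(state), do: state

  # Step 8: the only required callback; both map keys are optional.
  @impl SpiderMan
  def handle_response(_response, _context) do
    %{requests: [], items: []}
  end

  # Step 12: must return :ok.
  @impl SpiderMan
  def prepare_for_stop(_state), do: :ok
end
```

A real spider would build SpiderMan.Request and SpiderMan.Item structs inside handle_response/2; their fields are not described in this page, so the sketch returns empty lists.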

Summary

Functions

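ets_stats(spider)

Fetches statistics for all of the spider's ETS tables.

get_state(spider)

Fetches the spider's state.

insert_request(spider, request)

Inserts a request into the spider.

insert_requests(spider, requests)

Inserts multiple requests into the spider.

list_spiders()

Lists the spiders that have already been started.

stats(spider)

Fetches the spider's statistics.

stats(spider, component)

Fetches a component's statistics.

status(spider)

Fetches the spider's status.

stop(spider)

Stops a spider.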

Types

component()

Specs

component() :: :downloader | :spider | :item_processor

ets_stats()

Specs

ets_stats() :: [size: pos_integer(), memory: pos_integer()] | nil

prepare_for_start_stage()

Specs

prepare_for_start_stage() :: :pre | :post

request()

Specs

request() :: SpiderMan.Request.t()

requests()

Specs

requests() :: [request()]

settings()

Specs

settings() :: keyword()

spider()

Specs

spider() :: module() | atom()

status()

Specs

status() :: :running | :suspended

Functions

continue(spider, timeout \\ :infinity)

Specs

continue(spider(), timeout()) :: :ok

Resumes a suspended spider.

ets_stats(spider)

Specs

ets_stats(spider()) :: [
  common_pipeline_tid: ets_stats(),
  downloader_tid: ets_stats(),
  failed_tid: ets_stats(),
  spider_tid: ets_stats(),
  item_processor_tid: ets_stats()
]

Fetches statistics for all of the spider's ETS tables.
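
Because any table's entry may be nil, code that consumes this keyword list has to allow for missing stats. A minimal sketch, assuming only the return shape in the spec above: EtsStatsSummary is a hypothetical helper, not part of spider_man, and the sample numbers are made up.

```elixir
defmodule EtsStatsSummary do
  # Hypothetical helper: totals the :size and :memory entries of a keyword
  # list shaped like the return value of ets_stats/1. Tables whose stats
  # are nil are skipped.
  def totals(stats) do
    stats
    |> Keyword.values()
    |> Enum.reject(&is_nil/1)
    |> Enum.reduce(%{size: 0, memory: 0}, fn table, acc ->
      %{
        size: acc.size + Keyword.fetch!(table, :size),
        memory: acc.memory + Keyword.fetch!(table, :memory)
      }
    end)
  end
end

# Sample data in the documented shape (values are invented for illustration).
sample = [
  common_pipeline_tid: [size: 2, memory: 300],
  downloader_tid: [size: 5, memory: 400],
  failed_tid: nil,
  spider_tid: [size: 1, memory: 200],
  item_processor_tid: [size: 3, memory: 250]
]

IO.inspect(EtsStatsSummary.totals(sample))
```

With a running spider the same helper could be fed directly, for example `MySpider |> SpiderMan.ets_stats() |> EtsStatsSummary.totals()`, where MySpider is again a hypothetical module.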

get_state(spider)

Specs

get_state(spider()) :: SpiderMan.Engine.state()

Fetches the spider's state.

insert_request(spider, request)

Specs

insert_request(spider(), request()) :: true | nil

Inserts a request into the spider.

insert_requests(spider, requests)

Specs

insert_requests(spider(), requests()) :: true | nil

Inserts multiple requests into the spider.

list_spiders()

Specs

list_spiders() :: [spider()]

Lists the spiders that have already been started.

retry_failed(spider, max_retries \\ 3, timeout \\ :infinity)

Specs

retry_failed(spider(), max_retries :: integer(), timeout()) ::
  {:ok, count :: integer()}

Retries failed events for a spider.

run_until(spider, settings \\ [], fun)

Specs

run_until(spider(), settings(), (... -> any())) :: millisecond :: integer()

run_until_zero(spider, settings \\ [], check_interval \\ 1500)

Specs

run_until_zero(spider(), settings(), check_interval :: integer()) ::
  millisecond :: integer()

start(spider, settings \\ [])

Specs

Starts a spider.

Settings

  • :print_stats - The default value is true.

  • :log2file - The default value is true.

  • :status - The default value is :running.

  • :spider_module

  • :ets_file

  • :downloader_options

  • :spider_options

  • :item_processor_options

Downloader options

  • :requester - The default value is {SpiderMan.Requester.Finch, []}.

  • :producer - The default value is SpiderMan.Producer.ETS.

  • :context - The default value is %{}.

  • :processor - The default value is [max_demand: 1].

    • :stages
    • :concurrency - The default value is 8.
    • :min_demand
    • :max_demand - The default value is 10.
    • :partition_by
    • :spawn_opt
    • :hibernate_after
  • :rate_limiting - The default value is [allowed_messages: 10, interval: 1000].

    • :allowed_messages - Required.
    • :interval - Required.
  • :pipelines - The default value is [SpiderMan.Pipeline.DuplicateFilter].

  • :post_pipelines - The default value is [].

Spider options

  • :producer - The default value is SpiderMan.Producer.ETS.

  • :context - The default value is %{}.

  • :processor - The default value is [max_demand: 1].

    • :stages
    • :concurrency - The default value is 8.
    • :min_demand
    • :max_demand - The default value is 10.
    • :partition_by
    • :spawn_opt
    • :hibernate_after
  • :rate_limiting

    • :allowed_messages - Required.
    • :interval - Required.
  • :pipelines - The default value is [].

  • :post_pipelines - The default value is [].

Batchers options

  • :concurrency - The default value is 1.

  • :batch_size - The default value is 100.

  • :batch_timeout - The default value is 1000.

  • :partition_by

  • :spawn_opt

  • :hibernate_after

ItemProcessor options

  • :storage - The default value is SpiderMan.Storage.JsonLines.

  • :batchers - The default value is [default: [concurrency: 1, batch_size: 50, batch_timeout: 1000]].

  • :producer - The default value is SpiderMan.Producer.ETS.

  • :context - The default value is %{}.

  • :processor - The default value is [].

    • :stages
    • :concurrency - The default value is 8.
    • :min_demand
    • :max_demand - The default value is 10.
    • :partition_by
    • :spawn_opt
    • :hibernate_after
  • :rate_limiting

    • :allowed_messages - Required.
    • :interval - Required.
  • :pipelines - The default value is [SpiderMan.Pipeline.DuplicateFilter].

  • :post_pipelines - The default value is [].
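
The options above nest as ordinary keyword lists. The following sketch shows one possible settings list; the values are illustrative only, every key is taken from the lists above, and the comment about the interval unit is an assumption that this page does not confirm.

```elixir
# Illustrative values only; every key below appears in the option lists above.
settings = [
  print_stats: false,
  log2file: false,
  status: :suspended,
  downloader_options: [
    # At most 5 messages per interval (the interval is assumed to be in
    # milliseconds; the unit is not stated in this page).
    rate_limiting: [allowed_messages: 5, interval: 1_000],
    processor: [concurrency: 4, max_demand: 1]
  ],
  spider_options: [
    processor: [concurrency: 2]
  ],
  item_processor_options: [
    batchers: [default: [concurrency: 1, batch_size: 50, batch_timeout: 1_000]]
  ]
]

# Nested options are plain keyword lists, so get_in/2 can read them back.
IO.inspect(get_in(settings, [:downloader_options, :rate_limiting, :interval]))
```

Such a list would be passed as the second argument, for example `SpiderMan.start(MySpider, settings)`, where MySpider is a hypothetical spider module.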

stats(spider)

Specs

stats(spider()) :: [
  status: status(),
  common_pipeline_tid: ets_stats(),
  downloader_tid: ets_stats(),
  failed_tid: ets_stats(),
  spider_tid: ets_stats(),
  item_processor_tid: ets_stats()
]

Fetches the spider's statistics.

stats(spider, component)

Specs

stats(spider(), component()) :: ets_stats()

Fetches a component's statistics.

status(spider)

Specs

status(spider()) :: status()

Fetches the spider's status.

stop(spider)

Specs

stop(spider()) :: :ok | {:error, error}
when error: :not_found | :running | :restarting

Stops a spider.
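
The spec lists four possible results, so callers may want to handle each explicitly. A minimal sketch: StopResult is a hypothetical helper, and the messages are one reading of the error atoms, whose exact meaning this page does not spell out.

```elixir
defmodule StopResult do
  # Hypothetical helper mapping every documented return value of stop/1
  # to a log-friendly message.
  def describe(:ok), do: "spider stopped"
  def describe({:error, :not_found}), do: "spider not found"

  def describe({:error, reason}) when reason in [:running, :restarting],
    do: "spider could not be stopped (#{reason})"
end

IO.puts(StopResult.describe({:error, :not_found}))
```

In use, the result of stop/1 can be piped straight in: `MySpider |> SpiderMan.stop() |> StopResult.describe()`, with MySpider standing for any started spider.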

suspend(spider, timeout \\ :infinity)

Specs

suspend(spider(), timeout()) :: :ok

Suspends a spider.
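
suspend/2 and continue/2 both return :ok, which makes it straightforward to pair them around a block of work. The sketch below is one way to do that, not something the library provides: PauseHelper and its engine argument are hypothetical, and engine exists only so the helper can be tried against a stand-in module when spider_man is not loaded.

```elixir
defmodule PauseHelper do
  # Hypothetical helper: suspends `spider`, runs `fun`, and always resumes
  # the spider afterwards, even if `fun` raises.
  # `engine` defaults to SpiderMan; any module exporting suspend/1 and
  # continue/1 that return :ok can stand in for it.
  def pause_while(spider, fun, engine \\ SpiderMan) when is_function(fun, 0) do
    :ok = engine.suspend(spider)

    try do
      fun.()
    after
      :ok = engine.continue(spider)
    end
  end
end
```

A call such as `PauseHelper.pause_while(MySpider, fn -> SpiderMan.stats(MySpider) end)` would then read the statistics while the spider is suspended and resume it afterwards; MySpider is a hypothetical spider module.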

Callbacks

handle_response(arg1, context)

Specs

handle_response(SpiderMan.Response.t(), context :: map()) :: %{
  optional(:requests) => [SpiderMan.Request.t()],
  optional(:items) => [SpiderMan.Item.t()]
}

init(state)

(optional)

Specs

init(state) :: state when state: SpiderMan.Engine.state()

prepare_for_start(prepare_for_start_stage, state)

(optional)

Specs

prepare_for_start(prepare_for_start_stage(), state) :: state
when state: SpiderMan.Engine.state()

prepare_for_start_component(component, arg2)

(optional)

Specs

prepare_for_start_component(component(), options | false) :: options
when options: keyword()

prepare_for_stop(arg1)

(optional)

Specs

prepare_for_stop(SpiderMan.Engine.state()) :: :ok

prepare_for_stop_component(component, options)

(optional)

Specs

prepare_for_stop_component(component(), options :: keyword() | false) :: :ok

settings()

(optional)

Specs

settings() :: settings()