Stateless HTTP-first web retrieval with optional domain policy, caching, focused filtering, citation-ready passage metadata, and Extractous-backed document extraction.
This module is intended for document retrieval workloads where starting a full browser session would be unnecessary or too expensive.
Types
@type result() :: %{
  :url => String.t(),
  :final_url => String.t(),
  :content => String.t(),
  :format => atom(),
  :content_type => String.t(),
  :document_type => atom(),
  :retrieved_at => String.t(),
  :estimated_tokens => non_neg_integer(),
  :original_estimated_tokens => non_neg_integer(),
  :truncated => boolean(),
  :filtered => boolean(),
  :focus_matches => non_neg_integer(),
  :cached => boolean(),
  :citations => %{enabled: boolean()},
  :passages => [map()],
  optional(:title) => String.t() | nil,
  optional(:metadata) => map()
}
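As a rough illustration of how a caller might read this map, the sketch below builds a one-line description from a few of the required keys and the optional :title. The ResultSummary module, its describe/1 function, and every value in the sample map are made up for this example; none of them belong to the documented module.

```elixir
defmodule ResultSummary do
  # Hypothetical helper for illustration only; not part of the documented module.

  # Matches on keys that result() lists as required.
  def describe(%{final_url: final_url, estimated_tokens: tokens, truncated: truncated, cached: cached} = result) do
    # :title is optional and may also be nil, so fall back to a placeholder.
    title = Map.get(result, :title) || "(untitled)"
    source = if cached, do: "cache", else: "network"

    size =
      if truncated do
        "#{tokens} of #{result.original_estimated_tokens} estimated tokens (truncated)"
      else
        "#{tokens} estimated tokens"
      end

    "#{title} <#{final_url}> from #{source}, #{size}"
  end
end

# Hand-built sample with invented values, shaped like result().
sample = %{
  url: "https://example.com",
  final_url: "https://example.com/",
  content: "# Example Domain",
  format: :markdown,
  content_type: "text/html",
  document_type: :html,
  retrieved_at: "2024-01-01T00:00:00Z",
  estimated_tokens: 1200,
  original_estimated_tokens: 5400,
  truncated: true,
  filtered: false,
  focus_matches: 0,
  cached: false,
  citations: %{enabled: false},
  passages: [],
  title: "Example Domain"
}

IO.puts(ResultSummary.describe(sample))
```

Reading :title through Map.get/2 rather than result.title keeps the helper safe for results that omit the optional key.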
Functions
@spec fetch(String.t(), keyword()) :: {:ok, result()} | {:error, Exception.t()}
Fetches a URL over HTTP(S) and returns normalized document content.
Supported options:
* :format - :markdown, :text, or :html
* :selector - CSS selector for HTML pages
* :allowed_domains / :blocked_domains - mutually exclusive host/path rules
* :max_content_tokens - approximate token cap
* :citations - boolean; when true, include passage spans
* :focus_terms - list of terms used for focused filtering
* :focus_window - paragraph window around focus matches
* :timeout - receive timeout in milliseconds
* :cache - enable ETS cache, defaults to true
* :cache_ttl_ms - cache TTL in milliseconds
* :require_known_url / :known_urls - optional URL provenance guard
* :extractous - optional ExtractousEx keyword options merged with config
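The following is a minimal sketch of how several of these options might be combined in one call. Because the module name is not shown here, the example takes the fetch function as an argument (in real use, a capture of the documented fetch/2) and exercises it against a stub. FetchSketch, focused_markdown/3, the stub, and all option values (including passing terms as strings and :focus_window as an integer) are assumptions made for illustration.

```elixir
defmodule FetchSketch do
  # Hypothetical wrapper for illustration only; not part of the documented module.
  # `fetch_fun` is expected to behave like the documented fetch/2.
  def focused_markdown(fetch_fun, url, terms) when is_function(fetch_fun, 2) do
    opts = [
      format: :markdown,
      max_content_tokens: 4_000,  # approximate cap; value chosen arbitrarily
      citations: true,
      focus_terms: terms,         # assumed to be a list of strings
      focus_window: 1,            # assumed to be a paragraph count
      timeout: 10_000,            # milliseconds
      cache: true,
      cache_ttl_ms: 60_000
    ]

    case fetch_fun.(url, opts) do
      {:ok, result} -> {:ok, result.content, result.focus_matches}
      # The spec returns an Exception.t(), so Exception.message/1 applies.
      {:error, exception} -> {:error, Exception.message(exception)}
    end
  end
end

# Stub standing in for the real fetch/2 so the sketch runs without network access.
stub = fn
  "https://example.com/ok", _opts -> {:ok, %{content: "# Stub", focus_matches: 1}}
  _url, _opts -> {:error, %RuntimeError{message: "blocked"}}
end

IO.inspect(FetchSketch.focused_markdown(stub, "https://example.com/ok", ["cache"]))
IO.inspect(FetchSketch.focused_markdown(stub, "https://example.com/other", ["cache"]))
```

Passing the function in keeps the sketch independent of the real module and makes the success and error branches easy to test in isolation.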