UrlFetcher
UrlFetcher collects the URLs found in the image and anchor tags of the page at a given URL.
Usage
UrlFetcher
UrlFetcher.fetch("https://myawesome.url/page.html") will retrieve all link and image URLs present in https://myawesome.url/page.html, returning them as the lists links and assets in a UrlFetcher.SiteData struct.
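To give a feel for the shape of the result, here is a minimal sketch that separates anchor URLs from image URLs in a small HTML string. It is not UrlFetcher's implementation: the module ExtractionSketch, the sample HTML, and the naive regex-based parsing are all assumptions made for illustration, and a plain map stands in for the UrlFetcher.SiteData struct.

```elixir
defmodule ExtractionSketch do
  # Hypothetical helper, not part of UrlFetcher. It collects href values from
  # <a> tags and src values from <img> tags using simple regular expressions.
  # Only double-quoted attribute values are recognised.
  def extract(html) do
    %{
      links: capture(~r/<a\s[^>]*href="([^"]*)"/i, html),
      assets: capture(~r/<img\s[^>]*src="([^"]*)"/i, html)
    }
  end

  # Returns the first capture group of every match as a flat list.
  defp capture(regex, html) do
    regex
    |> Regex.scan(html, capture: :all_but_first)
    |> List.flatten()
  end
end

# Made-up page content for the example.
html = """
<a href="/about">About</a>
<img src="img/logo.png">
<a href="https://other.site/">Elsewhere</a>
"""

site = ExtractionSketch.extract(html)
IO.inspect(site.links, label: "links")
IO.inspect(site.assets, label: "assets")
```

Anchor targets end up under links and image sources under assets, mirroring the two lists described above.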
Some options you can provide to the fetcher:
- http_client: HTTP client to be used. Must comply with the UrlFetcher.Http.Client behaviour. Defaults to UrlFetcher.Http.Adapter.Poison.
- unique: boolean. If set, removes duplicates from the results. Defaults to true.
- normalize: transforms all URLs to absolute if set to :absolute, or leaves them as they are with :original. Defaults to :original.
- internal_only: boolean. If set, filters URLs to the ones internal to the site being fetched. Defaults to false.
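To make the option semantics concrete, the sketch below mimics unique, normalize and internal_only on a hand-written list of URLs, using only Elixir's standard URI and Enum modules. It is an illustration under stated assumptions, not UrlFetcher's code: the module OptionSketch, its function names, and the sample URLs are hypothetical, and treating relative URLs as internal is a choice made for this example.

```elixir
defmodule OptionSketch do
  # Hypothetical illustration of the option semantics; not UrlFetcher code.

  # normalize: :absolute -> resolve every URL against the page URL.
  def normalize(urls, page_url, :absolute) do
    Enum.map(urls, fn url -> page_url |> URI.merge(url) |> URI.to_string() end)
  end

  # normalize: :original -> leave the URLs untouched.
  def normalize(urls, _page_url, :original), do: urls

  # internal_only: true -> keep URLs whose host matches the page's host.
  # Relative URLs carry no host of their own and are treated as internal here.
  def internal_only(urls, page_url) do
    page_host = URI.parse(page_url).host
    Enum.filter(urls, fn url -> URI.parse(url).host in [nil, page_host] end)
  end
end

page = "https://myawesome.url/page.html"
urls = ["/about", "img/logo.png", "https://other.site/", "/about"]

urls
# unique: true -> drop the repeated "/about"
|> Enum.uniq()
|> OptionSketch.normalize(page, :absolute)
|> OptionSketch.internal_only(page)
|> IO.inspect()
```

With the library itself, the options are presumably passed alongside the URL, for example as a keyword list in a second argument to UrlFetcher.fetch; that calling convention is an assumption, so check the documentation on HexDocs for the exact signature.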
HTTP Client behaviour
The HTTP client behaviour is defined in UrlFetcher.Http.Client. You can use whichever HTTP client you prefer, as long as it complies with that behaviour or you implement a wrapper around it. Note that the HTTP client is expected to follow redirects by default.
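The wrapper approach can be sketched as follows. The callbacks of UrlFetcher.Http.Client are not listed in this README, so the example defines its own stand-in behaviour: Sketch.HttpClient, its get/1 callback, Sketch.StubClient and Sketch.Fetcher are all hypothetical names, and the real behaviour may look different. The stub returns canned HTML so that the example runs without network access.

```elixir
defmodule Sketch.HttpClient do
  # Hypothetical behaviour standing in for UrlFetcher.Http.Client.
  # The real callback names and types may differ.
  @callback get(url :: String.t()) :: {:ok, String.t()} | {:error, term()}
end

defmodule Sketch.StubClient do
  # A wrapper satisfies a behaviour by implementing its callbacks.
  # A real wrapper would delegate to an HTTP library and follow redirects;
  # this stub answers from a fixed response instead.
  @behaviour Sketch.HttpClient

  @impl true
  def get("https://myawesome.url/page.html") do
    {:ok, ~s(<a href="/about">About</a>)}
  end

  def get(_url), do: {:error, :not_found}
end

defmodule Sketch.Fetcher do
  # The caller depends only on the behaviour, so any compliant module can be
  # swapped in through an option (called :http_client here, as in the README).
  def fetch(url, opts \\ []) do
    client = Keyword.get(opts, :http_client, Sketch.StubClient)
    client.get(url)
  end
end

"https://myawesome.url/page.html"
|> Sketch.Fetcher.fetch(http_client: Sketch.StubClient)
|> IO.inspect()
```

Because the client is just a module passed as an option, tests can substitute a stub like this one while production code keeps the default adapter.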
Installation
The package is available in Hex, and can be installed by adding url_fetcher to your list of dependencies in mix.exs:
def deps do
  [
    {:url_fetcher, "~> 0.2.1"}
  ]
end
Documentation can be found at https://hexdocs.pm/url_fetcher/.
Contributing
Please have a look at the contributing guidelines.
UrlFetcher has some automated GitHub Actions CI checks that will take care of reviewing any pull request:
- Check code formatting
- Check tests pass
- Check static analysis with dialyzer
- Submit any code style suggestions and improvements as comments on the PR
Once everything looks good, your PR will be merged. Every push to the main branch triggers automated publishing of the package and its documentation to Hex.
Benchmarking
To improve performance, it is important to actually benchmark the code. UrlFetcher uses Benchee for that. Have a look at benchmark.exs and compare your implementation against the current code before submitting a pull request. Run the benchmark with mix run bin/benchmark.exs <url>.