CommonCrawl.Index (CommonCrawl v0.2.0)

Functions for interacting with Common Crawl index files.

Functions

all_paths_url(crawl_id)

@spec all_paths_url(String.t()) :: String.t()

Returns URL of the file containing the index paths for a given crawl ID.

cluster_idx_url(crawl_id)

@spec cluster_idx_url(String.t()) :: String.t()

Returns URL of the cluster.idx file.

Examples

iex> CommonCrawl.Index.cluster_idx_url("CC-MAIN-2017-34")
"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2017-34/indexes/cluster.idx"

filter_cluster_idx(cluster_idx, fun)

@spec filter_cluster_idx(binary(), function()) :: list()

Filters filenames from cluster.idx with a given function. The function receives each line of cluster.idx and returns a truthy value for the lines to keep. Returns a stream.

Examples

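# Get index files covering the ".de" TLD.
# `cluster_idx` is the binary content of the cluster.idx file.
# Keys start with the reversed host name, so match "de," (with the comma)
# to avoid also matching TLDs such as "dev".
index_files =
  CommonCrawl.Index.filter_cluster_idx(
    cluster_idx,
    fn line -> String.starts_with?(line, "de,") end
  )
  |> Enum.to_list()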
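
The filtering step can be sketched without the library. This minimal example assumes each cluster.idx line has the form "<key> <timestamp>", then a tab, the index filename, and further tab-separated fields. The module ClusterIdxSketch, its filter_filenames/2 function, and the sample lines are hypothetical, and the library's actual implementation may differ.

```elixir
defmodule ClusterIdxSketch do
  # Hypothetical stand-in for filter_cluster_idx/2; not the library's code.
  # Assumes tab-separated lines whose second field is the index filename.
  def filter_filenames(cluster_idx, fun) when is_binary(cluster_idx) do
    cluster_idx
    |> String.splitter("\n", trim: true)
    |> Stream.filter(fun)
    |> Stream.map(fn line ->
      line |> String.split("\t") |> Enum.at(1)
    end)
    |> Stream.uniq()
  end
end

# Made-up sample data in the assumed cluster.idx layout.
sample =
  Enum.join(
    [
      "com,example)/ 20170820000102\tcdx-00000.gz\t0\t195191\t1",
      "de,example)/ 20170820000102\tcdx-00118.gz\t0\t188224\t2",
      "de,example)/page 20170821000102\tcdx-00118.gz\t188224\t190101\t3",
      "dev,example)/ 20170820000102\tcdx-00120.gz\t0\t187000\t4"
    ],
    "\n"
  )

de_files =
  sample
  |> ClusterIdxSketch.filter_filenames(&String.starts_with?(&1, "de,"))
  |> Enum.to_list()

IO.inspect(de_files)
# prints ["cdx-00118.gz"]
```

Nothing is read or filtered until Enum.to_list/1 consumes the stream, and Stream.uniq/1 collapses the two ".de" lines that point at the same index file. The trailing comma in "de," keeps keys such as "dev,example)/" from matching.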

get(crawl_id, filename, opts \\ [])

Fetches a gzipped index file.

get_all_paths(crawl_id, opts \\ [])

Fetches the paths of all available index files for a given crawl. The "metadata.yaml" and "cluster.idx" files come last in the list.

get_cluster_idx(crawl_id, opts \\ [])

Fetches the cluster.idx file.

parser(line)

@spec parser(Enum.t()) :: {:ok, {String.t(), integer(), map()}} | {:error, any()}

Parses a single line of an index file. On success, returns {:ok, {string, integer, map}}; otherwise returns {:error, reason}.
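
As an illustration of the return shape in the spec, here is a minimal sketch of parsing one line. It assumes an index line consists of a key, a numeric timestamp, and a JSON object, separated by single spaces, and it uses the JSON module built into Elixir 1.18 and later. The module IndexLineSketch, its parse/1 function, and the sample line are hypothetical; the library's parser/1 may use a different decoder and different error values.

```elixir
defmodule IndexLineSketch do
  # Hypothetical parser; the library's parser/1 may work differently.
  # Assumes "<key> <timestamp> <json>" separated by single spaces.
  def parse(line) when is_binary(line) do
    with [key, timestamp, json] <- String.split(line, " ", parts: 3),
         {ts, ""} <- Integer.parse(timestamp),
         {:ok, %{} = meta} <- JSON.decode(json) do
      {:ok, {key, ts, meta}}
    else
      # JSON decoding failed.
      {:error, reason} -> {:error, reason}
      # Wrong number of fields, non-numeric timestamp, or non-object JSON.
      other -> {:error, {:invalid_line, other}}
    end
  end
end

# Made-up sample line in the assumed layout.
line = ~s|de,example)/ 20170820000102 {"url": "https://example.de/", "status": "200"}|

IO.inspect(IndexLineSketch.parse(line))
IO.inspect(IndexLineSketch.parse("not an index line"))
```

Splitting with parts: 3 keeps any spaces inside the JSON payload intact, since only the first two spaces act as separators.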

url(crawl_id, filename)

@spec url(String.t(), String.t()) :: String.t()

Returns URL of the index file.

Examples

iex> CommonCrawl.Index.url("CC-MAIN-2017-34", "cdx-00203.gz")
"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2017-34/indexes/cdx-00203.gz"
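
Both documented example URLs share one layout: a base, the crawl ID, an "indexes" segment, and the filename. The sketch below reproduces that layout as inferred from the examples. The module IndexUrlSketch is hypothetical and is not the library's implementation.

```elixir
defmodule IndexUrlSketch do
  # Hypothetical helper, inferred from the documented example URLs.
  @base "https://data.commoncrawl.org/cc-index/collections"

  def url(crawl_id, filename) when is_binary(crawl_id) and is_binary(filename) do
    Enum.join([@base, crawl_id, "indexes", filename], "/")
  end

  # Under this layout, cluster.idx is just another file in the indexes directory.
  def cluster_idx_url(crawl_id), do: url(crawl_id, "cluster.idx")
end

IO.puts(IndexUrlSketch.url("CC-MAIN-2017-34", "cdx-00203.gz"))
IO.puts(IndexUrlSketch.cluster_idx_url("CC-MAIN-2017-34"))
```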