CommonCrawl.Index (CommonCrawl v0.2.0)
Interacting with index files of Common Crawl.
Functions
Returns URL of the file containing the index paths for a given crawl ID.
Returns URL of the cluster.idx file.
Examples
iex> CommonCrawl.Index.cluster_idx_url("CC-MAIN-2017-34")
"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2017-34/indexes/cluster.idx"
Filters filenames from cluster.idx with a given function. Returns a stream.
Examples
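# Get index files for the ".de" TLD. `cluster_idx` is assumed to hold the
# previously fetched cluster.idx data.
index_files =
  CommonCrawl.Index.filter_cluster_idx(
    cluster_idx,
    # The trailing comma keeps other TLDs that start with "de" (such as "dev") out.
    fn line -> String.starts_with?(line, "de,") end
  )
  |> Enum.to_list()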
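To illustrate the lazy filtering described above, here is a minimal sketch that applies a predicate to a few hand-written lines with `Stream.filter/2`. The `ClusterIdxSketch` module, the sample lines, and the tab-separated layout (SURT key and timestamp, then file name, offset, length, and block number) are assumptions made for illustration. It is not the library's implementation.

```elixir
defmodule ClusterIdxSketch do
  # Hypothetical helper, not part of CommonCrawl.Index.
  # Lazily keeps the lines for which `fun` returns true, then extracts the
  # index file name (assumed to be the second tab-separated field).
  def filenames(lines, fun) do
    lines
    |> Stream.filter(fun)
    |> Stream.map(fn line -> line |> String.split("\t") |> Enum.at(1) end)
    |> Stream.uniq()
  end
end

# Hand-written sample lines; real cluster.idx content may differ.
sample = [
  "com,example)/ 20170816000000\tcdx-00010.gz\t0\t1000\t1",
  "de,example)/ 20170816000000\tcdx-00100.gz\t0\t1000\t2",
  "de,example)/page 20170816000000\tcdx-00100.gz\t1000\t1000\t3"
]

result =
  sample
  |> ClusterIdxSketch.filenames(&String.starts_with?(&1, "de,"))
  |> Enum.to_list()
# => ["cdx-00100.gz"]
```

Because every step is a stream, nothing is evaluated until `Enum.to_list/1` runs, which matters when the input has many lines.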
Fetches a gzipped index file.
Fetches all available index files for a given crawl. The list ends with the "metadata.yaml" and "cluster.idx" files.
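Since the list ends with two entries that are not gzipped index files, a caller who only wants the index files may need to drop them. The sketch below shows one way to do that. It assumes the result is a list of path strings shaped like the hand-written sample, which may not match what the function actually returns.

```elixir
# Hand-written sample of a path list; the real list is much longer and its
# exact shape is an assumption here.
paths = [
  "cc-index/collections/CC-MAIN-2017-34/indexes/cdx-00000.gz",
  "cc-index/collections/CC-MAIN-2017-34/indexes/cdx-00001.gz",
  "cc-index/collections/CC-MAIN-2017-34/indexes/metadata.yaml",
  "cc-index/collections/CC-MAIN-2017-34/indexes/cluster.idx"
]

# Keep only the gzipped index files and reduce each path to its file name.
index_files =
  paths
  |> Enum.filter(&String.ends_with?(&1, ".gz"))
  |> Enum.map(&Path.basename/1)
# => ["cdx-00000.gz", "cdx-00001.gz"]
```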
Fetches the cluster.idx file.
Parses a line of an index file.
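As a rough picture of what parsing such a line can involve, the sketch below splits a line into three parts. It assumes the space-separated layout `<SURT key> <timestamp> <JSON>`. The `IndexLineSketch` module and the sample line are hypothetical, the JSON part is left as a raw string to avoid depending on a JSON library, and the library's own parser may behave differently.

```elixir
defmodule IndexLineSketch do
  # Hypothetical parser, not the library's own implementation.
  # Assumes "<SURT key> <timestamp> <JSON>" separated by single spaces.
  def parse(line) do
    case String.split(String.trim_trailing(line), " ", parts: 3) do
      [key, timestamp, json] -> {:ok, {key, timestamp, json}}
      _ -> {:error, :invalid_line}
    end
  end
end

# Hand-written sample line; real index lines carry more JSON fields.
line = ~s|com,example)/ 20170816000000 {"url": "http://example.com/", "status": "200"}|

{:ok, {key, timestamp, json}} = IndexLineSketch.parse(line)
```

Passing `parts: 3` to `String.split/3` stops splitting after the second space, so spaces inside the JSON payload stay intact.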
Returns URL of the index file.
Examples
iex> CommonCrawl.Index.url("CC-MAIN-2017-34", "cdx-00203.gz")
"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2017-34/indexes/cdx-00203.gz"