CommonCrawl.Index (CommonCrawl v0.3.4)
Functions for interacting with Common Crawl index files.
Summary
Functions
all_paths_url: Returns the URL of the file containing the index paths for a given crawl ID.
cluster_idx_url: Returns the URL of the cluster.idx file.
get: Fetches a gzipped index file.
get_all_paths: Fetches the paths of all available index files for a given crawl. The list ends with the "metadata.yaml" and "cluster.idx" entries.
get_cluster_idx: Fetches the cluster.idx file.
parser: Parses a line of an index file into a tuple containing the search key, timestamp, and metadata map.
stream: Creates a stream of parsed index entries from index files.
stream_host: Streams parsed index entries for the specified host.
url: Returns the URL of the index file.
Functions
Returns the URL of the file containing the index paths for a given crawl ID.
Examples
iex> CommonCrawl.Index.all_paths_url("CC-MAIN-2017-34")
"https://data.commoncrawl.org/crawl-data/CC-MAIN-2017-34/cc-index.paths.gz"
Returns the URL of the cluster.idx file.
Examples
iex> CommonCrawl.Index.cluster_idx_url("CC-MAIN-2017-34")
"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2017-34/indexes/cluster.idx"
Fetches a gzipped index file.
Examples
iex> CommonCrawl.Index.get("CC-MAIN-2024-51", "cdx-00000.gz")
{:ok, <<31, 139, 8, 0, 0, 0, 0, 0, 0, 3, ...>>}
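The returned body is gzip-compressed (note the leading bytes 31, 139), so it has to be decompressed before the index lines can be read. The following is a minimal sketch using Erlang's `:zlib` module; the sample data is built locally as a stand-in for a real response, and real index files may consist of several concatenated gzip members, in which case a single `:zlib.gunzip/1` call may not be sufficient.

```elixir
# Stand-in for the body returned by CommonCrawl.Index.get/2: a gzipped
# binary built locally so the sketch runs without network access.
sample_lines = [
  ~s|com,example)/ 20240108123456 {"url": "http://www.example.com"}|,
  ~s|com,example)/page2 20240108123457 {"url": "http://www.example.com/page2"}|
]

gzipped = :zlib.gzip(Enum.join(sample_lines, "\n") <> "\n")

# Gzip data starts with the magic bytes 31, 139.
<<31, 139, _rest::binary>> = gzipped

# Decompress and split into individual index lines.
lines =
  gzipped
  |> :zlib.gunzip()
  |> String.split("\n", trim: true)

IO.inspect(length(lines))
```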
Fetches the paths of all available index files for a given crawl. The list ends with the "metadata.yaml" and "cluster.idx" entries.
Examples
iex> CommonCrawl.Index.get_all_paths("CC-MAIN-2024-51")
{:ok, [
"cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00000.gz",
"cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00001.gz",
# ... more index files
"cc-index/collections/CC-MAIN-2024-51/indexes/metadata.yaml",
"cc-index/collections/CC-MAIN-2024-51/indexes/cluster.idx"
]}
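The returned paths are relative. As an illustration, the sketch below turns a sample list of that shape into full URLs and drops the two trailing non-index entries. The `base_url` value is an assumption taken from the URLs shown elsewhere on this page, and `url/2` remains the documented way to build an index file URL.

```elixir
# Assumption: relative paths resolve against this host, as in the URLs
# returned by the *_url functions on this page.
base_url = "https://data.commoncrawl.org/"

# Sample in the shape of the get_all_paths result.
paths = [
  "cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00000.gz",
  "cc-index/collections/CC-MAIN-2024-51/indexes/cdx-00001.gz",
  "cc-index/collections/CC-MAIN-2024-51/indexes/metadata.yaml",
  "cc-index/collections/CC-MAIN-2024-51/indexes/cluster.idx"
]

# Keep only the gzipped index shards, dropping the two trailing entries.
index_urls =
  paths
  |> Enum.filter(&String.ends_with?(&1, ".gz"))
  |> Enum.map(&(base_url <> &1))

IO.inspect(index_urls)
```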
Fetches the cluster.idx file.
Examples
iex> CommonCrawl.Index.get_cluster_idx("CC-MAIN-2024-51")
{:ok, "0,100,22,165)/ 20241209080420..."}
Parses a line of an index file into a tuple containing the search key, timestamp, and metadata map.
Examples
iex> line = ~s|com,example)/ 20240108123456 {"url": "http://www.example.com"}|
iex> CommonCrawl.Index.parser(line)
{:ok, {"com,example)/", 20240108123456, %{"url" => "http://www.example.com"}}}
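To illustrate the line layout (search key, timestamp, JSON metadata, separated by single spaces), here is a minimal sketch of the splitting step only. `IndexLineSketch.split_line/1` is a hypothetical helper, not part of the library, and it leaves the metadata as a raw JSON string; decoding it into a map would need a JSON library.

```elixir
defmodule IndexLineSketch do
  # Hypothetical helper, not part of CommonCrawl.Index.
  # Splits on the first two spaces only, so spaces inside the JSON
  # payload stay intact.
  def split_line(line) do
    case String.split(line, " ", parts: 3) do
      [key, timestamp, json] ->
        case Integer.parse(timestamp) do
          {ts, ""} -> {:ok, {key, ts, json}}
          _ -> {:error, :invalid_timestamp}
        end

      _ ->
        {:error, :invalid_line}
    end
  end
end

line = ~s|com,example)/ 20240108123456 {"url": "http://www.example.com"}|
IO.inspect(IndexLineSketch.split_line(line))
```

Limiting the split to three parts matters because the JSON payload itself contains spaces.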
@spec stream(String.t(), keyword()) :: Enumerable.t()
Creates a stream of parsed index entries from index files.
Options
:preprocess_fun - function to preprocess the stream before processing (default: & &1)
:dir - temporary directory for storing downloaded files (default: System.tmp_dir!())
:max_attempts - maximum number of retry attempts for fetching cluster.idx (default: 3)
:backoff - milliseconds to wait between retry attempts (default: 500)
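As a rough illustration of how the :max_attempts and :backoff options could interact, the sketch below retries a function with a fixed delay between attempts. `RetrySketch` is a hypothetical module; the library's actual retry logic is not shown in this documentation and may differ.

```elixir
defmodule RetrySketch do
  # Hypothetical helper illustrating :max_attempts and :backoff.
  # Calls fun until it returns {:ok, _} or the attempts run out.
  def with_retry(fun, opts \\ []) do
    max_attempts = Keyword.get(opts, :max_attempts, 3)
    backoff = Keyword.get(opts, :backoff, 500)
    attempt(fun, 1, max_attempts, backoff)
  end

  defp attempt(fun, n, max_attempts, backoff) do
    case fun.() do
      {:ok, _} = ok ->
        ok

      error when n >= max_attempts ->
        # Out of attempts: return the last error as-is.
        error

      _error ->
        Process.sleep(backoff)
        attempt(fun, n + 1, max_attempts, backoff)
    end
  end
end

# A stand-in for a flaky fetch: fails twice, then succeeds.
{:ok, counter} = Agent.start_link(fn -> 0 end)

flaky = fn ->
  case Agent.get_and_update(counter, &{&1 + 1, &1 + 1}) do
    n when n < 3 -> {:error, :timeout}
    n -> {:ok, "succeeded on attempt #{n}"}
  end
end

result = RetrySketch.with_retry(flaky, max_attempts: 3, backoff: 10)
IO.inspect(result)
```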
Examples
# Stream all index entries
CommonCrawl.Index.stream("CC-MAIN-2024-51")
# Stream only German domains and shuffle them before processing
CommonCrawl.Index.stream("CC-MAIN-2024-51", preprocess_fun: fn stream ->
  stream
  |> Stream.filter(&String.starts_with?(&1, "de"))
  |> Enum.shuffle()
end)
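The prefix used in such a filter deserves some care. The sketch below runs two candidate filters over hand-written sample data, assuming the stream passed to :preprocess_fun yields strings that begin with a SURT-style key (this element shape is an assumption, not confirmed by this page). Under that assumption, a bare "de" prefix also matches other top-level domains that start with those letters, while "de," does not.

```elixir
# Assumption: the stream handed to :preprocess_fun yields strings that
# begin with a SURT-style key such as "de,example)/".
sample = [
  "de,example)/ 20241209080420",
  "dev,example)/ 20241209080421",
  "com,example)/ 20241209080422",
  "de,beispiel)/ 20241209080423"
]

loose = fn stream -> Stream.filter(stream, &String.starts_with?(&1, "de")) end
strict = fn stream -> Stream.filter(stream, &String.starts_with?(&1, "de,")) end

loose_result = sample |> loose.() |> Enum.to_list()
strict_result = sample |> strict.() |> Enum.to_list()

IO.inspect(loose_result)   # also contains the "dev," entry
IO.inspect(strict_result)  # only keys under "de,"
```

Note that `Enum.shuffle/1` is eager: it reads the entire filtered stream into memory before returning a list.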
@spec stream_host(String.t(), String.t(), keyword()) :: Enumerable.t()
Streams parsed index entries for the specified host.
This function wraps stream/2, applying a filter to include only those entries whose URL host matches the given host.
Examples
iex> CommonCrawl.Index.stream_host("CC-MAIN-2024-51", "www.example.com") |> Enum.take(2)
[
{"com,example)/", 20240108123456, %{"url" => "http://www.example.com"}},
{"com,example)/", 20240108123457, %{"url" => "http://www.example.com/page2"}}
]
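The host filter can be pictured as follows. This is a minimal sketch over hand-written entries in the parsed tuple shape, comparing the host part of each entry's "url" field via `URI.parse/1`; it illustrates the idea only, and the library's own matching may be implemented differently.

```elixir
# Sample entries in the {search_key, timestamp, metadata} shape.
entries = [
  {"com,example)/", 20240108123456, %{"url" => "http://www.example.com"}},
  {"com,example,blog)/post", 20240108123457, %{"url" => "http://blog.example.com/post"}},
  {"com,example)/page2", 20240108123458, %{"url" => "http://www.example.com/page2"}}
]

host = "www.example.com"

# Keep only entries whose URL host equals the requested host exactly.
matching =
  entries
  |> Stream.filter(fn {_key, _timestamp, %{"url" => url}} ->
    URI.parse(url).host == host
  end)
  |> Enum.to_list()

IO.inspect(matching)
```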
Returns the URL of the index file.
Examples
iex> CommonCrawl.Index.url("CC-MAIN-2017-34", "cdx-00203.gz")
"https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2017-34/indexes/cdx-00203.gz"