Meeseeks v0.3.1

Meeseeks is an Elixir library for extracting data from HTML.

iex> import Meeseeks.CSS
Meeseeks.CSS
iex> html = Tesla.get("https://news.ycombinator.com/").body
"..."
iex> for story <- Meeseeks.all(html, css("tr.athing")) do
       title = Meeseeks.one(story, css(".title a"))
       %{title: Meeseeks.text(title),
         url: Meeseeks.attr(title, "href")}
     end
[%{title: "...", url: "..."}, %{title: "...", url: "..."}, ...]

Dependencies

Meeseeks depends on html5ever via the html5ever NIF.

Because html5ever is a Rust library, you will need to have the Rust compiler installed.

This dependency is necessary because there are no HTML5 spec-compliant parsers written in Elixir or Erlang.

Getting Started

Parse

Start by parsing a source (HTML string or Meeseeks.TupleTree) into a Meeseeks.Document so that it can be queried.

iex> document = Meeseeks.parse("<div id=main><p>1</p><p>2</p><p>3</p></div>")
%Meeseeks.Document{...}

The selection functions accept an unparsed source, but parsing is expensive, so parse ahead of time when running multiple selections on the same document.

Select

Next, use one of Meeseeks's two selection functions, all or one, to search for nodes. Both functions accept a queryable (a source, a document, or a Meeseeks.Result) and one or more Meeseeks.Selectors.

all returns a list of results representing every node matching one of the provided selectors, while one returns a result representing the first node to match a selector (depth-first).

Use the css macro provided by Meeseeks.CSS to generate selectors.

iex> import Meeseeks.CSS
Meeseeks.CSS
iex> result = Meeseeks.one(document, css("#main p"))
%Meeseeks.Result{ "<p>1</p>" }
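To make "depth-first" concrete, here is a minimal sketch in plain Elixir. It does not use Meeseeks and is not how Meeseeks is implemented; the DepthFirstSketch module and its predicate-based all and one functions are hypothetical, and the input is a tuple tree of the form {tag, attributes, children}. It only illustrates why the first match can be a deeply nested node that appears earlier in the document rather than a shallower node that appears later.

```elixir
defmodule DepthFirstSketch do
  # Hypothetical helper, not part of Meeseeks: walks a tuple tree of the
  # form {tag, attributes, children} in depth-first (document) order.

  # Returns every node for which `matches?` returns true.
  def all(node, matches?) do
    node
    |> walk()
    |> Enum.filter(matches?)
  end

  # Returns the first matching node in depth-first order, or nil.
  def one(node, matches?) do
    node
    |> walk()
    |> Enum.find(matches?)
  end

  # Pre-order traversal: the node itself, then each child subtree in order.
  defp walk({_tag, _attrs, children} = node) do
    [node | Enum.flat_map(children, &walk/1)]
  end

  # Text nodes (plain strings) are leaves.
  defp walk(text) when is_binary(text), do: [text]
end

tree =
  {"div", [],
   [
     {"section", [], [{"p", [], ["deep"]}]},
     {"p", [], ["shallow"]}
   ]}

paragraph? = fn node -> match?({"p", _, _}, node) end

DepthFirstSketch.one(tree, paragraph?)
# {"p", [], ["deep"]}

DepthFirstSketch.all(tree, paragraph?)
# [{"p", [], ["deep"]}, {"p", [], ["shallow"]}]
```

The nested paragraph wins in one because its parent section comes first in the document, even though the other paragraph sits closer to the root.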

Extract

Retrieve information from the result with an extraction function.

The Meeseeks.Result extraction functions are attr, attrs, data, dataset, html, own_text, tag, text, tree.

iex> Meeseeks.tag(result)
"p"
iex> Meeseeks.text(result)
"1"
iex> Meeseeks.tree(result)
{"p", [], ["1"]}

Custom Selectors

Meeseeks is designed to have extremely extensible selectors, and creating a custom selector is as easy as defining a struct that implements the Meeseeks.Selector behaviour.

iex> defmodule CommentContainsSelector do
       use Meeseeks.Selector

       alias Meeseeks.Document

       defstruct value: ""

       def match?(selector, %Document.Comment{} = node, _document) do
         String.contains?(node.content, selector.value)
       end

       def match?(_selector, _node, _document) do
         false
       end
     end
{:module, ...}
iex> selector = %CommentContainsSelector{value: "TODO"}
%CommentContainsSelector{value: "TODO"}
iex> Meeseeks.one("<!-- TODO: Close vuln! -->", selector)
%Meeseeks.Result{ "<!-- TODO: Close vuln! -->" }

To learn more, check the documentation for Meeseeks.Selector and Meeseeks.Selector.Combinator.

Summary

Functions

all(queryable, selectors)
Returns a Result for each node in the queryable matching a selector

attr(result, attribute)
Returns the value for attribute in result, or nil if there isn't one

attrs(result)
Returns the result's attributes list, which may be empty, or nil if result represents a node without attributes

data(result)
Returns the combined data of result or result's children, which may be an empty string

dataset(result)
Returns a map of result's data attributes, or nil if result represents a node without attributes

html(result)
Returns the combined HTML of result and its descendants

one(queryable, selectors)
Returns a Result for the first node in the queryable (depth-first) matching a selector

own_text(result)
Returns the combined text of result or result's children, which may be an empty string

parse(source)
Parses an HTML string or Meeseeks.TupleTree into a Meeseeks.Document

tag(result)
Returns result's tag, or nil if result represents a node without a tag

text(result)
Returns the combined text of result or result's descendants, which may be an empty string

tree(result)
Returns a Meeseeks.TupleTree of result and its descendants

Types

queryable()
selectors()
source()

Functions

all(queryable, selectors)

Returns a Result for each node in the queryable matching a selector.

Examples

iex> Meeseeks.all("<div id=main><p>1</p><p>2</p><p>3</p></div>", css("#main p"))
[%Meeseeks.Result{ "<p>1</p>" }, %Meeseeks.Result{ "<p>2</p>" },
 %Meeseeks.Result{ "<p>3</p>" }]

attr(result, attribute)

Returns the value for attribute in result, or nil if there isn't one.

Examples

iex> result = Meeseeks.one("<div id=example>Hi</div>", css("#example"))
%Meeseeks.Result{ "<div id=\"example\">Hi</div>" }
iex> Meeseeks.attr(result, "id")
"example"

attrs(result)
attrs(Meeseeks.Result.t) :: [{String.t, String.t}] | nil

Returns the result's attributes list, which may be empty, or nil if result represents a node without attributes.

Examples

iex> result = Meeseeks.one("<div id=example>Hi</div>", css("#example"))
%Meeseeks.Result{ "<div id=\"example\">Hi</div>" }
iex> Meeseeks.attrs(result)
[{"id", "example"}]

data(result)

Returns the combined data of result or result's children, which may be an empty string.

Examples

iex> result1 = Meeseeks.one("<div id=example>Hi</div>", css("#example"))
%Meeseeks.Result{ "<div id=\"example\">Hi</div>" }
iex> Meeseeks.data(result1)
""

iex> result2 = Meeseeks.one("<script id=example>Hi</script>", css("#example"))
%Meeseeks.Result{ "<script id=\"example\">Hi</script>" }
iex> Meeseeks.data(result2)
"Hi"

dataset(result)
dataset(Meeseeks.Result.t) ::
  %{optional(String.t) => String.t} |
  nil

Returns a map of result's data attributes, or nil if result represents a node without attributes.

Behaves like HTMLElement.dataset; only valid data attributes are included, and attribute names have "data-" removed and are converted to camelCase.

See: https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/dataset
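The name conversion described above can be sketched in plain Elixir. This is an illustration only, not Meeseeks's implementation: the DatasetSketch module is hypothetical, it works on a list of {name, value} attribute tuples, and it skips the validity checks that decide which data attributes count as valid.

```elixir
defmodule DatasetSketch do
  # Hypothetical helper, not Meeseeks's implementation: approximates the
  # HTMLElement.dataset name conversion for a list of {name, value} tuples.

  # Keeps only attributes whose name starts with "data-", strips that
  # prefix, and camelCases the remainder. Validity checks are omitted.
  def dataset(attributes) do
    for {"data-" <> rest, value} <- attributes, into: %{} do
      {camelize(rest), value}
    end
  end

  # Replaces each "-" followed by a lowercase ASCII letter with the
  # uppercased letter, so "x-val" becomes "xVal".
  defp camelize(name) do
    Regex.replace(~r/-([a-z])/, name, fn _match, letter ->
      String.upcase(letter)
    end)
  end
end

DatasetSketch.dataset([{"id", "example"}, {"data-x-val", "1"}, {"data-y-val", "2"}])
# %{"xVal" => "1", "yVal" => "2"}
```

Attributes without the "data-" prefix, such as id here, are dropped because they fail the pattern match in the comprehension.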

Examples

iex> result = Meeseeks.one("<div id=example data-x-val=1 data-y-val=2></div>", css("#example"))
%Meeseeks.Result{ "<div id=\"example\" data-x-val=\"1\" data-y-val=\"2\"></div>" }
iex> Meeseeks.dataset(result)
%{"xVal" => "1", "yVal" => "2"}

html(result)

Returns the combined HTML of result and its descendants.

Examples

iex> result = Meeseeks.one("<div id=example>Hi</div>", css("#example"))
%Meeseeks.Result{ "<div id=\"example\">Hi</div>" }
iex> Meeseeks.html(result)
"<div id=\"example\">Hi</div>"

one(queryable, selectors)

Returns a Result for the first node in the queryable (depth-first) matching a selector.

Examples

iex> Meeseeks.one("<div id=main><p>1</p><p>2</p><p>3</p></div>", css("#main p"))
%Meeseeks.Result{ "<p>1</p>" }

own_text(result)

Returns the combined text of result or result's children, which may be an empty string.

Examples

iex> result = Meeseeks.one("<div>Hello, <b>World!</b></div>", css("div"))
%Meeseeks.Result{ "<div>Hello, <b>World!</b></div>" }
iex> Meeseeks.own_text(result)
"Hello,"

parse(source)

Parses an HTML string or Meeseeks.TupleTree into a Meeseeks.Document.

Examples

iex> Meeseeks.parse("<div id=main><p>Hello, Meeseeks!</p></div>")
%Meeseeks.Document{...}

iex> Meeseeks.parse({"div", [{"id", "main"}], [{"p", [], ["Hello, Meeseeks!"]}]})
%Meeseeks.Document{...}

tag(result)
tag(Meeseeks.Result.t) :: String.t | nil

Returns result's tag, or nil if result represents a node without a tag.

Examples

iex> result = Meeseeks.one("<div id=example>Hi</div>", css("#example"))
%Meeseeks.Result{ "<div id=\"example\">Hi</div>" }
iex> Meeseeks.tag(result)
"div"

text(result)

Returns the combined text of result or result's descendants, which may be an empty string.
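The difference between own_text (children only) and text (all descendants) can be sketched in plain Elixir over a tuple tree. This is an illustration, not Meeseeks's implementation: the TextSketch module is hypothetical, and its whitespace rule (trim each piece, drop empty pieces, join with a single space) is an assumption chosen only because it reproduces the outputs shown in the examples on this page.

```elixir
defmodule TextSketch do
  # Hypothetical helpers, not Meeseeks's implementation: operate on tuple
  # trees of the form {tag, attributes, children}; text nodes are strings.

  # Combines text found in the node's direct children only.
  def own_text({_tag, _attrs, children}) do
    children
    |> Enum.filter(&is_binary/1)
    |> combine()
  end

  # Combines text found in all descendants, in document order.
  def text(node) do
    node
    |> collect()
    |> combine()
  end

  defp collect(text) when is_binary(text), do: [text]

  defp collect({_tag, _attrs, children}) do
    Enum.flat_map(children, &collect/1)
  end

  # Assumption: trim each piece, drop empty ones, join with one space.
  defp combine(pieces) do
    pieces
    |> Enum.map(&String.trim/1)
    |> Enum.reject(&(&1 == ""))
    |> Enum.join(" ")
  end
end

tree = {"div", [], ["Hello, ", {"b", [], ["World!"]}]}

TextSketch.own_text(tree)
# "Hello,"

TextSketch.text(tree)
# "Hello, World!"
```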

Examples

iex> result = Meeseeks.one("<div>Hello, <b>World!</b></div>", css("div"))
%Meeseeks.Result{ "<div>Hello, <b>World!</b></div>" }
iex> Meeseeks.text(result)
"Hello, World!"

tree(result)

Returns a Meeseeks.TupleTree of result and its descendants.

Examples

iex> result = Meeseeks.one("<div id=example>Hi</div>", css("#example"))
%Meeseeks.Result{ "<div id=\"example\">Hi</div>" }
iex> Meeseeks.tree(result)
{"div", [{"id", "example"}], ["Hi"]}