Premailex.HTMLParser.Xmerl (Premailex v1.0.0)


A simple HTML parser using Erlang's built-in :xmerl library.

This is used as a fallback when no other HTML parsing library is available. It handles only well-formed, XML-like HTML; real-world HTML, which is often not valid XML, is not supported.

For more robust HTML parsing, use Premailex.HTMLParser.LazyHTML, Premailex.HTMLParser.Floki, or Premailex.HTMLParser.Meeseeks.
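Switching to one of these parsers is typically a one-line application config change. The fragment below is a sketch that assumes Premailex reads its parser from an `:html_parser` key under the `:premailex` application, and that the chosen parser's dependency (here Floki) is already listed in your `mix.exs`; check the Premailex README for the exact key in your version.

```elixir
# config/config.exs
import Config

# Assumption: Premailex reads the parser module from the :html_parser key.
# Any of the parser modules named above could be used here instead.
config :premailex, html_parser: Premailex.HTMLParser.Floki
```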

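:xmerl_sax_parser is used to prevent atom leaks: unlike :xmerl_scan, which converts element and attribute names to atoms, the SAX parser reports names as strings.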
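To illustrate why the SAX interface avoids atom creation, here is a minimal sketch that calls :xmerl_sax_parser directly and collects element names. The `SaxSketch` module and its `element_names/1` function are hypothetical and are not part of Premailex; the sketch only shows that names reach the event callback as charlists, which can be turned into binaries without touching the atom table.

```elixir
defmodule SaxSketch do
  # Hypothetical helper, not part of Premailex.
  # Returns the element names of a well-formed document in document order.
  def element_names(xml) when is_binary(xml) do
    {:ok, names, _rest} =
      :xmerl_sax_parser.stream(xml,
        event_fun: &handle_event/3,
        event_state: []
      )

    Enum.reverse(names)
  end

  # The local name is a charlist, so no atom is created for it.
  defp handle_event({:startElement, _uri, local_name, _qname, _attrs}, _location, acc) do
    [List.to_string(local_name) | acc]
  end

  # Every other SAX event leaves the accumulator unchanged.
  defp handle_event(_event, _location, acc), do: acc
end

SaxSketch.element_names("<div><p>Hello</p></div>")
# => ["div", "p"]
```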

Parser limitations

  • Only well-formed XML-like HTML is supported. HTML5 syntax such as unquoted attribute values (<div data-x=a>) or unclosed non-void tags is rejected.

  • A round-trip through this parser is not lossless:

    • XML normalises whitespace in attribute values (newlines and tabs become spaces).

    • Named entities are decoded and serialized as the literal character (e.g. &copy; comes back as ©).

    • xmlns:* namespace declarations always appear first in the attribute list, regardless of source position (:xmerl_sax_parser removes them from the element's attribute list).

    • Void elements always serialize as <br> (HTML style), regardless of whether the source used <br/>.
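The well-formedness requirement can be checked ahead of time against the underlying XML parser. The sketch below calls :xmerl_sax_parser directly rather than going through Premailex, so it only illustrates the XML rule itself; the `WellFormedSketch` module and `well_formed?/1` are hypothetical names, and Premailex may pre-process its input in ways this check does not reflect.

```elixir
defmodule WellFormedSketch do
  # Hypothetical helper, not part of Premailex.
  # Returns true when the markup parses as well-formed XML.
  def well_formed?(markup) when is_binary(markup) do
    result =
      :xmerl_sax_parser.stream(markup,
        # Ignore every event; only the overall parse result matters here.
        event_fun: fn _event, _location, state -> state end,
        event_state: nil
      )

    case result do
      {:ok, _state, _rest} -> true
      _fatal_error -> false
    end
  end
end

# Quoted attribute values and balanced tags are accepted.
WellFormedSketch.well_formed?(~s(<div data-x="a"></div>))

# Unquoted attribute values and unclosed tags are rejected.
WellFormedSketch.well_formed?("<div data-x=a></div>")
WellFormedSketch.well_formed?("<div><p>text</div>")
```

A check like this could be used to decide whether a template is safe to pass to this parser or whether one of the more robust parsers is needed.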