A simple HTML parser using Erlang's built-in :xmerl library.
This is used as a fallback when no other HTML parsing library is available. It handles only well-formed, XML-like HTML and rejects much real-world HTML, which is often not valid XML.
For more robust HTML parsing, use Premailex.HTMLParser.LazyHTML,
Premailex.HTMLParser.Floki, or Premailex.HTMLParser.Meeseeks.
:xmerl_sax_parser is used to prevent atom leaks.
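As a rough illustration of why a SAX-based approach avoids atom leaks, the sketch below calls :xmerl_sax_parser.stream/2 directly and collects element names. The module name SaxSketch and the function element_names/1 are hypothetical and not part of Premailex; this is a minimal sketch, not the parser's actual implementation. The SAX events deliver element names as charlists, so parsing untrusted markup does not create new atoms.

```elixir
defmodule SaxSketch do
  # Hypothetical helper for illustration only; not part of Premailex.
  # Returns {:ok, names} with element names in document order,
  # or {:error, reason} when the input is not well-formed XML.
  def element_names(html) when is_binary(html) do
    opts = [event_fun: &handle_event/3, event_state: []]

    case :xmerl_sax_parser.stream(html, opts) do
      {:ok, names, _rest} ->
        {:ok, Enum.reverse(names)}

      {:fatal_error, _location, reason, _end_tags, _state} ->
        {:error, reason}

      other ->
        {:error, other}
    end
  end

  # Element names arrive as charlists, never as atoms.
  defp handle_event({:startElement, _uri, local_name, _qname, _attrs}, _location, acc) do
    [List.to_string(local_name) | acc]
  end

  # All other events (characters, endElement, ...) are ignored here.
  defp handle_event(_event, _location, acc), do: acc
end
```

With this sketch, SaxSketch.element_names("<div><p>hi</p></div>") should yield the names "div" and "p", while HTML5-only syntax such as "<div data-x=a></div>" should come back as an error tuple rather than a parsed result.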
Parser limitations
Only well-formed XML-like HTML is supported. HTML5 syntax such as unquoted attribute values (<div data-x=a>) or unclosed non-void tags is rejected.

A round-trip through this parser is not lossless:

XML normalises whitespace in attribute values (newlines and tabs become spaces).
Named entities are decoded to characters and serialized as the literal character (e.g. &copy; round-trips as ©).
xmlns:* namespace declarations always appear first in the attribute list, regardless of source position (:xmerl_sax_parser removes them from the element's attribute list).
Void elements always serialize as <br> (HTML style), regardless of whether the source used <br/>.