Newxp.PreProcessing
(newxp v0.1.1)
Functions for processing HTML content into plain text for different use cases.
Summary
Functions
get_html2text_handler/0: Get configured html2text options.
process_for_general/1: Process content for general applications.
process_for_summary/1: Convert HTML to plain text for summarization.
Functions
get_html2text_handler()
Get configured html2text options.
Returns a keyword list suitable for passing to HTML2Text.convert/2:
- link_footnotes: false (omits link footnotes)
- empty_img_mode: :ignore (skips images without alt text)
- width: :infinity (disables line wrapping)
Examples
Newxp.PreProcessing.get_html2text_handler()
# => [link_footnotes: false, empty_img_mode: :ignore, width: :infinity]
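Because the return value is an ordinary keyword list, a caller can presumably adjust individual options before handing them to the converter. The sketch below copies the documented defaults into a literal so that it runs without Newxp installed. The override to an integer width is an assumption about what the html2text library accepts, not something this page confirms.

```elixir
# Defaults copied from the documented return value of get_html2text_handler/0.
defaults = [link_footnotes: false, empty_img_mode: :ignore, width: :infinity]

# Hypothetical override: wrap at 80 columns instead of disabling wrapping.
# Keyword.merge/2 keeps every key from `defaults` unless the second list
# supplies a replacement for it.
opts = Keyword.merge(defaults, width: 80)

IO.inspect(Keyword.fetch!(opts, :width))
IO.inspect(Keyword.fetch!(opts, :link_footnotes))
```

In a project that depends on Newxp, `defaults` would come from `Newxp.PreProcessing.get_html2text_handler()` instead of the literal.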
process_for_general(html)
Process content for general applications.
This includes:
- Core HTML cleaning (figures, tables, noscript, read-more)
- Conversion to plain text (preserving most HTML structure)
Examples
html = "<p>Hello</p><figure><img/></figure>"
Newxp.PreProcessing.process_for_general(html)
# => "Hello\n"
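To make the cleaning step more concrete, here is a minimal sketch of dropping whole elements before conversion. It is not Newxp's implementation: `CleaningSketch` and `drop_elements/1` are hypothetical names, the approach is regex-only, and the "read-more" handling is left out because this page does not describe it.

```elixir
defmodule CleaningSketch do
  # Hypothetical helper, not part of Newxp: removes whole elements
  # (opening tag, contents, closing tag) for a fixed list of tag names.
  @dropped_tags ~w(figure table noscript)

  def drop_elements(html) do
    Enum.reduce(@dropped_tags, html, fn tag, acc ->
      # "i" = case-insensitive, "s" = "." also matches newlines.
      # The lazy ".*?" stops at the first closing tag, so nested
      # elements of the same name are not handled correctly.
      pattern = Regex.compile!("<#{tag}\\b[^>]*>.*?</#{tag}\\s*>", "is")
      Regex.replace(pattern, acc, "")
    end)
  end
end

IO.puts(CleaningSketch.drop_elements("<p>Hello</p><figure><img/></figure>"))
```

A real cleaner would more likely use an HTML parser, since regular expressions cannot reliably handle nested or malformed markup.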
process_for_summary(html)
Convert HTML to plain text for summarization.
Strips links, images, and formatting. Output is unwrapped plain text suitable for feeding into summarization models.
Examples
html = "<p>Hello <a href=\"https://example.com\">world</a></p>"
Newxp.PreProcessing.process_for_summary(html)
# => "Hello world\n"
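The effect on links and images can be approximated with a few lines of standard-library Elixir. This is only an illustration of the idea. The real function presumably delegates to html2text, and `SummarySketch.strip_markup/1` is a hypothetical name that does not exist in Newxp.

```elixir
defmodule SummarySketch do
  # Hypothetical helper, not part of Newxp.
  def strip_markup(html) do
    html
    # Drop image tags entirely; they carry no text for a summary.
    |> then(&Regex.replace(~r/<img\b[^>]*>/i, &1, ""))
    # Drop every remaining tag but keep the text between tags,
    # so link text survives while the href is discarded.
    |> then(&Regex.replace(~r/<[^>]+>/, &1, ""))
    |> String.trim()
  end
end

IO.puts(SummarySketch.strip_markup(~s(<p>Hello <a href="https://example.com">world</a></p>)))
```

Unlike the documented output, this sketch trims the trailing newline and does not decode HTML entities.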