Newxp.PreProcessing (newxp v0.1.1)

Functions for processing HTML content into plain text for different use cases.

Summary

Functions

get_html2text_handler()

Get configured html2text options.

process_for_general(html)

Process content for general applications.

process_for_summary(html)

Convert HTML to plain text for summarization.

Functions

get_html2text_handler()

Get configured html2text options.

Returns a keyword list suitable for passing to HTML2Text.convert/2:

  • link_footnotes: false — omits link footnotes
  • empty_img_mode: :ignore — skips images without alt text
  • width: :infinity — disables line wrapping

Examples

Newxp.PreProcessing.get_html2text_handler()
# => [link_footnotes: false, empty_img_mode: :ignore, width: :infinity]
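
Callers that need different settings could override individual defaults before handing the list to the converter. The following is a minimal sketch, not part of Newxp: the defaults are copied from the example above so the snippet runs on its own, and the 80-column width is a hypothetical value that assumes the converter accepts an integer width.

```elixir
# Defaults as documented above, copied so this runs without Newxp installed.
defaults = [link_footnotes: false, empty_img_mode: :ignore, width: :infinity]

# Hypothetical override: wrap lines at 80 columns, e.g. for terminal output.
# Keyword.merge/2 drops the overridden key from the first list and appends
# the entries of the second.
opts = Keyword.merge(defaults, width: 80)

IO.inspect(opts)
# => [link_footnotes: false, empty_img_mode: :ignore, width: 80]
```

Merging rather than rebuilding the list keeps any defaults that are not overridden.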

process_for_general(html)

Process content for general applications.

This includes:

  • Core HTML cleaning (figures, tables, noscript, read-more)
  • Conversion to plain text (preserving most of the document structure)

Examples

html = "<p>Hello</p><figure><img/></figure>"
Newxp.PreProcessing.process_for_general(html)
# => "Hello\n"
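
The figure removal shown in the example can be approximated with the standard library alone. This sketch is an assumption about what such a cleaning step might look like, not the module's actual implementation. `CleaningSketch` and `strip_blocks/1` are hypothetical names, and the regex is a stand-in for a real HTML parser, so nested elements of the same tag are not handled.

```elixir
defmodule CleaningSketch do
  # Hypothetical helper, not part of Newxp.
  # Tags whose whole element (opening tag, content, closing tag) is dropped.
  @block_tags ~w(figure table noscript)

  def strip_blocks(html) do
    Enum.reduce(@block_tags, html, fn tag, acc ->
      # i: case-insensitive tag names; s: let . match newlines.
      # The lazy .*? stops at the first matching closing tag.
      Regex.replace(~r/<#{tag}\b[^>]*>.*?<\/#{tag}>/is, acc, "")
    end)
  end
end

IO.inspect(CleaningSketch.strip_blocks("<p>Hello</p><figure><img/></figure>"))
# => "<p>Hello</p>"
```

The remaining markup would still need to go through the plain-text conversion step to produce the output shown above.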

process_for_summary(html)

Convert HTML to plain text for summarization.

Strips link targets, images, and formatting while keeping link text. The output is unwrapped plain text suitable for feeding into summarization models.

Examples

html = "<p>Hello <a href=\"https://example.com\">world</a></p>"
Newxp.PreProcessing.process_for_summary(html)
# => "Hello world\n"