LiteParse (liteparse v0.1.0)


Elixir wrapper for the LiteParse Rust library, providing fast, local PDF and document parsing with spatial text extraction.

Summary

Functions

parse(path, opts \\ [])

Parses a document file from disk and returns its extracted text and page count.

parse_input(bytes, opts \\ [])

Parses a document from in-memory binary data and returns its extracted text and page count. Useful when the document is not on disk (e.g. received from a network request or an upload).

Types

parse_result()

@type parse_result() :: %{text: String.t(), page_count: non_neg_integer()}

Functions

parse(path, opts \\ [])

@spec parse(Path.t(), keyword() | LiteParse.Config.t()) ::
  {:ok, parse_result()} | {:error, String.t()}

Parses a document file from disk and returns its extracted text and page count.

Options

See LiteParse.Config for the full list. Pass options as a keyword list:

LiteParse.parse("doc.pdf", max_pages: 100, ocr_enabled: false)

Or as a reusable struct:

config = LiteParse.Config.new(ocr_enabled: false)
LiteParse.parse("doc.pdf", config)

Returns {:ok, %{text: text, page_count: page_count}} on success, where text is a String.t() and page_count is a non-negative integer, or {:error, reason} with a string reason if the file cannot be read or parsed.
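Both return shapes can be handled with ordinary pattern matching. The sketch below is illustrative only: DocSummary and describe/1 are hypothetical names that are not part of LiteParse, and the helper relies solely on the parse_result() shape documented above.

```elixir
defmodule DocSummary do
  @moduledoc false

  # Hypothetical helper (not part of LiteParse): turns a result tuple shaped
  # like the documented return value of LiteParse.parse/2 into a summary line.
  def describe({:ok, %{text: text, page_count: page_count}}) do
    "#{page_count} page(s), #{String.length(text)} character(s)"
  end

  # The spec documents the error reason as a string, so it is interpolated as-is.
  def describe({:error, reason}) when is_binary(reason) do
    "parse failed: #{reason}"
  end
end

# Intended usage, assuming the liteparse dependency is installed:
#
#   "doc.pdf"
#   |> LiteParse.parse(ocr_enabled: false)
#   |> DocSummary.describe()
```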

parse_input(bytes, opts \\ [])

@spec parse_input(binary(), keyword() | LiteParse.Config.t()) ::
  {:ok, parse_result()} | {:error, String.t()}

Parses a document from in-memory binary data and returns its extracted text and page count. Useful when the document is not on disk (e.g. received from a network request or an upload).

Mirrors the underlying Rust API: liteparse::LiteParse::parse_input called with PdfInput::Bytes.

Options

See LiteParse.Config for the full list. Pass options as a keyword list:

LiteParse.parse_input(uploaded_pdf_binary, max_pages: 100, ocr_enabled: false)

Or as a reusable struct:

config = LiteParse.Config.new(ocr_enabled: false)
LiteParse.parse_input(uploaded_pdf_binary, config)

Returns {:ok, %{text: text, page_count: page_count}} on success, where text is a String.t() and page_count is a non-negative integer, or {:error, reason} with a string reason if the data cannot be parsed.
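For uploads, it can be convenient to reject obviously unusable input before handing the bytes to the parser. The following is a minimal sketch under stated assumptions: UploadParser, parse_upload/3, and the "empty upload" message are hypothetical and not part of LiteParse, and how LiteParse itself treats empty input is not documented here. The parser is passed in as a function so the wrapper can be exercised without the native library; by default it points at LiteParse.parse_input/2.

```elixir
defmodule UploadParser do
  @moduledoc false

  # Hypothetical wrapper (not part of LiteParse). The third argument is the
  # parsing function; it defaults to LiteParse.parse_input/2 and can be
  # replaced with a stub in tests.
  def parse_upload(bytes, opts \\ [], parser \\ &LiteParse.parse_input/2)

  # Assumption: an empty binary is treated as a caller error and short-circuited
  # here, using the same {:error, String.t()} shape as the documented API.
  def parse_upload(<<>>, _opts, _parser), do: {:error, "empty upload"}

  # Any other binary is forwarded unchanged, together with the options.
  def parse_upload(bytes, opts, parser) when is_binary(bytes) do
    parser.(bytes, opts)
  end
end

# Intended usage, assuming the liteparse dependency is installed:
#
#   UploadParser.parse_upload(uploaded_pdf_binary, max_pages: 100, ocr_enabled: false)
```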