Hex.pm Docs License

Native Elixir PDF reader — text extraction with positions, layout reconstruction, links, images, metadata, encryption (RC4/AES-128/AES-256), AcroForm fields, outlines, and annotations. Uses only the Erlang/OTP standard library.

Part of the ExPDF umbrella.

Installation

def deps do
  [
    {:ex_pdf_read, "~> 1.0"}
  ]
end

Most users should depend on ex_pdf instead, which bundles core, components, and the reader.

Usage

{:ok, doc} = Pdf.Reader.open("invoice.pdf")
{:ok, %Pdf.Reader.Result{} = result, _doc} = Pdf.Reader.read(doc)

result.meta.title       # "Invoice 042"
result.meta.page_count  # 3

for page <- result.pages do
  for line <- page.lines, token <- line.tokens do
    IO.puts(token.text)
  end
end

Convenience shapes

{:ok, pages, _} = Pdf.Reader.read(doc, shape: :text)    # [String.t()] per page
{:ok, shapes, _} = Pdf.Reader.read(doc, shape: :shapes) # [%Shape{}] flat

Encrypted PDFs

{:ok, doc} = Pdf.Reader.open("encrypted.pdf", password: "secret")

Supports Standard Security Handler V1–V5 (RC4-40, RC4-128, AES-128, AES-256).

Error recovery

{:ok, doc} = Pdf.Reader.open(broken_bin, recover: true)

Recovers from corrupted xref tables, missing %%EOF, broken page-tree links, dangling font refs, and truncated streams.

Dictionary-based word split

{:ok, result, _} = Pdf.Reader.read(doc, dictionary: :es)

Bundled Spanish frequency list (~50k words). Custom wordlists via MapSet.

License

MIT. See LICENSE.md.