A pure-Elixir PDF parsing and surgery engine. No NIFs, no C bindings, no external binaries — one runtime dependency (:telemetry).

PdfEx is built around a lossless invariant: serialize(open(bytes)) == bytes for any unmodified document, and edits are appended as PDF incremental updates so the original bytes are always a byte-for-byte prefix of the output.
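
The prefix half of that invariant can be checked with nothing but binary operations. The sketch below is illustrative only: `PrefixSketch` and `prefix?/2` are hypothetical names, not part of PdfEx, and the sample bytes are a stand-in rather than a valid PDF.

```elixir
defmodule PrefixSketch do
  # True when `original` is a byte-for-byte prefix of `edited`.
  # Hypothetical helper, not a PdfEx function.
  def prefix?(original, edited) when is_binary(original) and is_binary(edited) do
    size = byte_size(original)
    byte_size(edited) >= size and binary_part(edited, 0, size) == original
  end
end

# Stand-in bytes: an incremental update only ever appends to the source.
original = "%PDF-1.7\n1 0 obj\n<< >>\nendobj\n%%EOF\n"
appended = original <> "2 0 obj\n<< >>\nendobj\n%%EOF\n"
true = PrefixSketch.prefix?(original, appended)
```

The same check works against real output: pass the source bytes and the serialized bytes of an incrementally edited document.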

What it does

  • Parse real-world PDFs: classic xref tables, PDF 1.5+ xref streams, object streams, Flate + PNG predictors — with deliberate leniency for the malformed files real producers emit (sloppy xref entry EOLs, mispointed startxref, junk bytes in content streams).
  • Extract text with positions, fonts, real /Widths metrics, and ToUnicode/encoding decoding.
  • Edit structure (PdfEx.Editor): insert, delete, and reorder pages. Deletions are lossless: removed objects are marked free in the appended revision, and their original bytes stay in the file.
  • Edit text (PdfEx.ContentEdit): run-level text replacement and glyph deletion via token-span patching, with width compensation so the rest of the line doesn't drift. Works on single-byte fonts and Type0 / Identity-H composite fonts.
  • Move text (PdfEx.Convert): per-glyph stable UIDs and position mutations that token-patch the content stream without regenerating anything.
  • Project to HTML, both ways (PdfEx.Convert): a byte-faithful visual mode and a semantic mode (<h*>/<p>/<li> with data-uid ranges). Editing the HTML maps back to per-run text ops (reverse mapping).
  • Collaborate (PdfEx.Session): supervised, per-document editing sessions with a crash-surviving snapshot cache and operational-transform conflict resolution for concurrent edits.
  • Serialize (PdfEx.Serializer): incremental by default (lossless, matching the source's xref style) or opt-in full re-serialization (one clean revision; not byte-lossless).
  • Subset fonts (PdfEx.Font.Surgery): TrueType glyph-retaining subsetting (composite-glyph closure, recomputed checksums).
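
To make the "sloppy xref entry EOLs" leniency in the first bullet concrete, here is a minimal sketch of a tolerant parser for one classic xref entry. It is not PdfEx's parser: `XrefEntrySketch` is a hypothetical module, and the only assumption it encodes is the standard entry layout (10-digit offset, 5-digit generation, `n` or `f`) followed by whatever mix of space, CR, and LF the producer wrote.

```elixir
defmodule XrefEntrySketch do
  # Parses one classic xref entry from the front of `data`.
  # Returns {:ok, {kind, offset, generation}, rest} or {:error, :bad_xref_entry}.
  def parse(<<offset::binary-size(10), ?\s, gen::binary-size(5), ?\s, kind, rest::binary>>)
      when kind in [?n, ?f] do
    with {off, ""} <- Integer.parse(offset),
         {g, ""} <- Integer.parse(gen) do
      type = if kind == ?n, do: :in_use, else: :free
      # Strict entries end in exactly two EOL bytes; accept any run instead.
      {:ok, {type, off, g}, skip_eol(rest)}
    else
      _ -> {:error, :bad_xref_entry}
    end
  end

  def parse(_other), do: {:error, :bad_xref_entry}

  # Drop any run of spaces, CRs, and LFs before the next entry.
  defp skip_eol(<<c, rest::binary>>) when c in [?\s, ?\r, ?\n], do: skip_eol(rest)
  defp skip_eol(rest), do: rest
end
```

Malformed input yields a tagged error instead of raising, in keeping with the error convention described under Design.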

original   = File.read!("report.pdf")
{:ok, doc} = PdfEx.open(original)

{:ok, n}    = PdfEx.page_count(doc)
{:ok, text} = PdfEx.extract_text(doc)

# Structural surgery — lossless, incremental
{:ok, doc}   = PdfEx.Editor.delete_page(doc, 2)
edited_bytes = PdfEx.Serializer.serialize(doc)
# byte_size(edited_bytes) > byte_size(original); original is a prefix

# Rewrite the text of a run (addressed by a stable glyph UID)
{:ok, doc} = PdfEx.ContentEdit.replace_text(doc, "p_3_g_0", "Revised heading")

# Semantic HTML, with data-uid back-references for round-tripping edits
{:ok, html} = PdfEx.Convert.to_html(doc, mode: :semantic)

# Collaborative session: reads bypass the server; writes are OT-coordinated
{:ok, id}  = PdfEx.Session.open(doc)
{:ok, _op} = PdfEx.Session.apply_op(id, %PdfEx.Op.UpdateText{uid: "p_3_g_0", text: "Hi"})
{:ok, doc} = PdfEx.Session.fetch(id)
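
For readers new to operational transform, the idea behind "OT-coordinated" writes can be sketched on plain strings. This is a generic insert/insert transform, not PdfEx's algorithm or op format: `OtSketch`, the map shape `%{pos, text, site}`, and the `site` tie-breaker are all assumptions made for illustration.

```elixir
defmodule OtSketch do
  # An op is a map: insert `text` at grapheme index `pos`.
  # `site` is a hypothetical client id used only to break ties.
  def apply_op(doc, %{pos: pos, text: text}) do
    {head, tail} = String.split_at(doc, pos)
    head <> text <> tail
  end

  # Rewrite `op` so it can be applied after `other` has already been applied.
  def transform(op, other) do
    if other.pos < op.pos or (other.pos == op.pos and other.site < op.site) do
      %{op | pos: op.pos + String.length(other.text)}
    else
      op
    end
  end
end

# Two clients edit "Heading" concurrently.
a = %{pos: 0, text: "A ", site: 1}
b = %{pos: 7, text: "!", site: 2}

via_a = "Heading" |> OtSketch.apply_op(a) |> OtSketch.apply_op(OtSketch.transform(b, a))
via_b = "Heading" |> OtSketch.apply_op(b) |> OtSketch.apply_op(OtSketch.transform(a, b))
true = via_a == via_b
```

Both application orders converge on the same text, which is the property a session needs when concurrent edits arrive in different orders.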

Design

  • Lazy dual-AST. Untouched objects stay as zero-copy binary references; only touched objects materialize. Content-stream edits patch token spans in place.
  • Pure functional core. Every parse/edit/serialize API is a pure function over an immutable PdfEx.Document. Errors are tagged tuples ({:error, %PdfEx.Error{}}) — malformed input never raises. The only stateful component is the optional collaborative session shell (a supervised GenServer per document; reads still bypass it).
  • Hardened against hostile input: atom-table exhaustion (unknown names stay binaries), nesting-depth bombs, circular xref and /Length chains, unbounded xref-stream ranges, refc binary pinning.
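
The atom-table defence in the last bullet follows a common BEAM pattern: atoms are never garbage-collected, so only a fixed set of names is allowed to become atoms. The sketch below shows one way to do that; `NameSketch` and its whitelist are hypothetical, and how PdfEx stores its known names is not something this README specifies.

```elixir
defmodule NameSketch do
  # Hypothetical whitelist; a real parser would list every key it dispatches on.
  @known %{"Type" => :Type, "Pages" => :Pages, "Kids" => :Kids, "Length" => :Length}

  # Known names become atoms; anything else stays a binary, so hostile
  # input cannot grow the atom table.
  def intern(name) when is_binary(name), do: Map.get(@known, name, name)
end

:Type = NameSketch.intern("Type")
"X-Attacker-Name-000001" = NameSketch.intern("X-Attacker-Name-000001")
```

Because the atoms are literals in the module, no call to `String.to_atom/1` on untrusted data is ever needed.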

Current limitations (0.1.x)

  • No encryption support — encrypted PDFs return an error at open.
  • Text and position edits require uncompressed content streams and patch only the first /Contents stream of a page.
  • Composite-font editing covers Identity-H only; other CMaps and CFF (Type0/CIDFontType0) glyph injection are out of scope. Re-encoding maps to glyphs already present in the font's ToUnicode (no new glyphs).
  • TrueType subsetting is glyph-retaining (ids preserved, unused outlines emptied); glyph renumbering/compaction, CFF subsetting, and wiring the subset back into a document's FontFile2 are future work.
  • Full re-serialization (mode: :full) is explicitly not byte-lossless.
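
"Glyph-retaining" in the subsetting bullet means glyph ids keep their positions while unused outlines shrink to zero bytes. A simplified sketch of that idea follows. It is not PdfEx.Font.Surgery: `GlyphRetainSketch` is a hypothetical module, outlines are modelled as a plain list of binaries indexed by glyph id, and composite-glyph closure, offset padding, and checksum updates are deliberately left out.

```elixir
defmodule GlyphRetainSketch do
  # glyphs: list of outline binaries indexed by glyph id.
  # keep:   MapSet of glyph ids whose outlines must survive.
  # Returns {glyf_bytes, loca_offsets}.
  def retain(glyphs, keep) do
    emptied =
      glyphs
      |> Enum.with_index()
      |> Enum.map(fn {outline, gid} ->
        if MapSet.member?(keep, gid), do: outline, else: <<>>
      end)

    # loca holds n + 1 running offsets into glyf; an emptied glyph
    # shows up as two equal consecutive offsets.
    loca = Enum.scan([0 | Enum.map(emptied, &byte_size/1)], &(&1 + &2))
    {IO.iodata_to_binary(emptied), loca}
  end
end

{glyf, loca} = GlyphRetainSketch.retain(["aaa", "bb", "cc"], MapSet.new([0, 2]))
```

Glyph 1 is dropped from `glyf` but still has a slot in `loca`, so every glyph id referenced elsewhere in the document stays valid.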

Installation

Add pdf_ex to the dependencies in mix.exs:

def deps do
  [
    {:pdf_ex, "~> 0.1.0"}
  ]
end

Documentation

Generate the docs locally with ExDoc:

mix docs        # writes HTML to doc/

Testing

mix test                          # unit + integration suite
mix test --include corpus         # also sweep real PDFs in test/fixtures/corpus/
mix dialyzer                      # static analysis

The corpus sweep asserts the library's hard invariants against any PDFs you drop into test/fixtures/corpus/ (gitignored): open never raises, unmutated round-trips are byte-identical, and incremental edits re-parse.

License

MIT — see LICENSE.