A pure-Elixir PDF parsing and surgery engine. No NIFs, no C bindings, no
external binaries — one runtime dependency (:telemetry).
PdfEx is built around a lossless invariant: serialize(open(bytes)) == bytes
for any unmodified document, and edits are appended as PDF incremental updates
so the original bytes are always a byte-for-byte prefix of the output.
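The two halves of the invariant above can be expressed as plain binary checks. The following is a minimal sketch, not part of PdfEx: the module name `LosslessCheck` and both function names are hypothetical, and the open and serialize steps are passed in as functions so the block runs without the library.

```elixir
defmodule LosslessCheck do
  # Hypothetical helper for illustration; not part of PdfEx.

  # True when `original` is a byte-for-byte prefix of `output`,
  # which is what an appended incremental update should preserve.
  def prefix?(original, output) when is_binary(original) and is_binary(output) do
    size = byte_size(original)
    byte_size(output) >= size and binary_part(output, 0, size) == original
  end

  # True when opening and then serializing reproduces `bytes` exactly.
  # Assumes `open_fun` returns {:ok, doc} on success and `serialize_fun`
  # returns a binary.
  def round_trip?(bytes, open_fun, serialize_fun) when is_binary(bytes) do
    case open_fun.(bytes) do
      {:ok, doc} -> serialize_fun.(doc) == bytes
      _other -> false
    end
  end
end
```

Assuming the return shapes used in this README's own examples, the check could presumably be wired up as `LosslessCheck.round_trip?(bytes, &PdfEx.open/1, &PdfEx.Serializer.serialize/1)`.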
What it does
- Parse real-world PDFs: classic xref tables, PDF 1.5+ xref streams, object streams, Flate + PNG predictors, with deliberate leniency for the malformed files real producers emit (sloppy xref entry EOLs, mispointed startxref, junk bytes in content streams).
- Extract text with positions, fonts, real /Widths metrics, and ToUnicode/encoding decoding.
- Edit structure (PdfEx.Editor): insert, delete, and reorder pages. Deletions are lossless (objects are freed, never destroyed).
- Edit text (PdfEx.ContentEdit): run-level text replacement and glyph deletion via token-span patching, with width compensation so the rest of the line doesn't drift. Works on single-byte fonts and Type0 / Identity-H composite fonts.
- Move text (PdfEx.Convert): per-glyph stable UIDs and position mutations that token-patch the content stream without regenerating anything.
- Project to HTML, both ways (PdfEx.Convert): a byte-faithful visual mode and a semantic mode (<h*>/<p>/<li> with data-uid ranges). Editing the HTML maps back to per-run text ops (reverse mapping).
- Collaborate (PdfEx.Session): supervised, per-document editing sessions with a crash-surviving snapshot cache and operational-transform conflict resolution for concurrent edits.
- Serialize (PdfEx.Serializer): incremental by default (lossless, matching the source's xref style) or opt-in full re-serialization (one clean revision; not byte-lossless).
- Subset fonts (PdfEx.Font.Surgery): TrueType glyph-retaining subsetting (composite-glyph closure, recomputed checksums).
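The width compensation mentioned under "Edit text" can be illustrated with the arithmetic behind the PDF TJ operator, where a number in the array is subtracted from the horizontal position in thousandths of a text-space unit. This is a sketch under stated assumptions, not PdfEx's implementation: the module and function names are hypothetical, the widths are made-up sample values, and character and word spacing (Tc, Tw) are assumed to be zero.

```elixir
defmodule WidthCompensation do
  # Hypothetical helper for illustration; not part of PdfEx.
  # `widths` maps a character code to its advance in thousandths of a
  # text-space unit, as a font's /Widths array would provide.

  # Total advance of a run of character codes; unknown codes use `default`.
  def run_width(codes, widths, default \\ 0) do
    codes |> Enum.map(&Map.get(widths, &1, default)) |> Enum.sum()
  end

  # TJ adjustment to emit after the replacement run so that the text
  # following it starts where it did before the edit. A TJ number is
  # subtracted from the position, so a narrower replacement yields a
  # negative value (extra advance) and a wider one a positive value.
  def tj_adjustment(old_codes, new_codes, widths) do
    run_width(new_codes, widths) - run_width(old_codes, widths)
  end
end
```

With sample widths `%{?A => 722, ?i => 278, ?W => 944}`, replacing `[?W, ?W]` with `[?A, ?i]` shrinks the run from 1888 to 1000 units, so the adjustment is the difference between the two totals and is negative.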
original = File.read!("report.pdf")
{:ok, doc} = PdfEx.open(original)
{:ok, n} = PdfEx.page_count(doc)
{:ok, text} = PdfEx.extract_text(doc)
# Structural surgery — lossless, incremental
{:ok, doc} = PdfEx.Editor.delete_page(doc, 2)
edited_bytes = PdfEx.Serializer.serialize(doc)
# byte_size(edited_bytes) > byte_size(original); original is a prefix
# Rewrite the text of a run (addressed by a stable glyph UID)
{:ok, doc} = PdfEx.ContentEdit.replace_text(doc, "p_3_g_0", "Revised heading")
# Semantic HTML, with data-uid back-references for round-tripping edits
{:ok, html} = PdfEx.Convert.to_html(doc, mode: :semantic)
# Collaborative session: reads bypass the server; writes are OT-coordinated
{:ok, id} = PdfEx.Session.open(doc)
{:ok, _op} = PdfEx.Session.apply_op(id, %PdfEx.Op.UpdateText{uid: "p_3_g_0", text: "Hi"})
{:ok, doc} = PdfEx.Session.fetch(id)
Design
- Lazy dual-AST. Untouched objects stay as zero-copy binary references; only touched objects materialize. Content-stream edits patch token spans in place.
- Pure functional core. Every parse/edit/serialize API is a pure function over an immutable PdfEx.Document. Errors are tagged tuples ({:error, %PdfEx.Error{}}); malformed input never raises. The only stateful component is the optional collaborative session shell (a supervised GenServer per document; reads still bypass it).
- Hardened against hostile input: atom-table exhaustion (unknown names stay binaries), nesting-depth bombs, circular xref and /Length chains, unbounded xref-stream ranges, refc binary pinning.
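The atom-table point deserves a concrete picture: atoms on the BEAM are never garbage-collected and the table has a fixed limit, so converting attacker-controlled PDF names with `String.to_atom/1` can take down the VM. One way to keep unknown names as binaries is a fixed lookup table, sketched below. The module name `SafeName` and the list of known names are hypothetical, and this is not necessarily how PdfEx does it.

```elixir
defmodule SafeName do
  # Hypothetical helper for illustration; not part of PdfEx.
  # Names assumed to be known ahead of time. They are atom literals,
  # so the set of atoms this module can ever return is fixed at
  # compile time.
  @known [:Type, :Length, :Filter, :Contents, :Pages, :Kids]
  @lookup Map.new(@known, fn atom -> {Atom.to_string(atom), atom} end)

  # Returns the atom for a known name, or the original binary otherwise.
  # No new atom is ever created from input.
  def intern(name) when is_binary(name), do: Map.get(@lookup, name, name)
end
```

Callers then match on either shape: `SafeName.intern("Type")` gives `:Type`, while a producer-private name comes back unchanged as a binary.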
Current limitations (0.1.x)
- No encryption support — encrypted PDFs return an error at open.
- Text/position edits require uncompressed content streams and patch the first /Contents stream of a page.
- Composite-font editing covers Identity-H only; other CMaps and CFF (Type0/CIDFontType0) glyph injection are out of scope. Re-encoding maps to glyphs already present in the font's ToUnicode (no new glyphs).
- TrueType subsetting is glyph-retaining (ids preserved, unused outlines emptied); glyph renumbering/compaction, CFF subsetting, and wiring the subset back into a document's FontFile2 are future work.
- Full re-serialization (mode: :full) is explicitly not byte-lossless.
Installation
def deps do
  [
    {:pdf_ex, "~> 0.1.0"}
  ]
end
Documentation
Generate the docs locally with ExDoc:
mix docs # writes HTML to doc/
Testing
mix test # unit + integration suite
mix test --include corpus # also sweep real PDFs in test/fixtures/corpus/
mix dialyzer # static analysis
The corpus sweep asserts the library's hard invariants against any PDFs you
drop into test/fixtures/corpus/ (gitignored): open never raises, unmutated
round-trips are byte-identical, and incremental edits re-parse.
License
MIT — see LICENSE.