View Source Pdf.Reader (ExPDF v1.0.1)
Native PDF reader — opens a PDF binary or file path and provides pure-functional access to text runs with positions, raster images, document metadata, interactive form fields, document outlines (bookmarks), and page annotations. No GenServer, no mutable state; the reader is a fully lazy, immutable pipeline.
Typical usage
{:ok, doc} = Pdf.Reader.open("report.pdf")
{:ok, [page1_text | _], doc} = Pdf.Reader.read_text(doc)
{:ok, runs, doc} = Pdf.Reader.read_text_with_positions(doc)
{:ok, meta, doc} = Pdf.Reader.read_metadata(doc)
{:ok, n} = Pdf.Reader.page_count(doc)
:ok = Pdf.Reader.close(doc)Outlines (bookmarks)
{:ok, outlines, _doc} = Pdf.Reader.read_outlines(doc)
# => [%Pdf.Reader.Outline{title: "Chapter 1", level: 0, dest_page: 1, children: [...]}, ...]
# Bang variant — raises Pdf.Reader.Error on failure
outlines = Pdf.Reader.read_outlines!(doc)Annotations
{:ok, annotations, _doc} = Pdf.Reader.read_annotations(doc)
# => [%Pdf.Reader.Annotation{type: :highlight, page: 2, rect: {x1, y1, x2, y2}, ...}, ...]
# Bang variant — raises Pdf.Reader.Error on failure
annotations = Pdf.Reader.read_annotations!(doc)Error recovery
open/2 accepts a recover: true option that activates four orthogonal
recovery phases (R-1..R-4). Each recovery action is logged as a structured
event tuple appended to doc.recovery_log. Use recovery_log/1 to inspect:
{:ok, doc} = Pdf.Reader.open(bin, recover: true)
Pdf.Reader.recovery_log(doc)
# => [] when the PDF was well-formed
# => [{:xref_recovered, 5}, {:page_failed, 2, :unresolved_ref}] on a corrupt PDFClosed set of recovery event tuples:
| Tuple | Meaning |
|---|---|
{:xref_recovered, n} | Linear scan recovered n object entries (R-3) |
{:eof_marker_missing, :linear_scan_used} | %%EOF absent; linear scan used (R-3) |
{:page_failed, page_n, reason} | Page skipped; text/images from other pages returned (R-1) |
{:font_skipped, page_n, font_name, reason} | Font replaced with U+FFFD fallback (R-2) |
{:page_tree_recovered, n_pages} | Catalog/Pages fallback; n_pages recovered (R-4) |
An empty recovery_log after open/2 guarantees no recovery occurred.
No other tuple shapes are appended by the recovery paths.
The following errors remain fatal even with recover: true:
:not_a_pdf, :encrypted_password_required, :encrypted_wrong_password,
:encrypted_unsupported_handler, {:io_error, reason}.
Known gaps (documented limitations):
- Encrypted AND corrupted PDFs — the synthetic trailer from R-3 does not
include
/Encrypt; decryption cannot proceed. - Catalog-fallback page order (R-4) — the page list is in xref-insertion
order, NOT document order.
{:page_tree_recovered, n}signals this. - R-4 probe cost —
recover: truetriggers a full page-tree walk atopen/2time (O(pages)). Acceptable for opt-in mode; document in callers that open very large PDFs.
Encryption (Phase 2)
Standard Security Handler V1/V2/V4/V5-R6 supported. Use open/2 with the
password: opt:
{:ok, doc} = Pdf.Reader.open(bin, password: "secret")Empty password is auto-tried first (covers metadata-protection cases).
Errors: :encrypted_password_required, :encrypted_wrong_password,
:encrypted_unsupported_handler. See Pdf.Reader.Errors for the full set.
Form XObject recursion (Phase 3)
Do operators referencing /Type /XObject /Subtype /Form objects are
recursed into transparently — text and images inside Forms (headers,
footers, repeated logos, templated form fields) appear in read_text*
and read_images/1 output. CTM is multiplied with the Form's /Matrix
and resources are merged (Form wins on key collision). Cycle detection
via a visited-set guards against A → B → A loops; recursion depth is
capped at 8 ({:cycle_detected, ref} and {:max_depth_exceeded, ref}
events are emitted internally and dropped from text output).
Known limitations
- No CID fonts beyond ToUnicode — CID-keyed fonts that rely on
/CIDToGIDMapor registry/ordering/supplement data are not decoded. Onlybfchar/bfrangesections of ToUnicode CMaps are parsed. - No CCITT / JBIG2 / JPEG2000 image filters — images using
CCITTFaxDecode,JBIG2Decode, orJPXDecodeproduce{:error, {:unsupported_filter, name}}. - No OCR — scanned PDFs with no embedded text produce an empty text list.
- Standard-14 font metrics — fonts without embedded
/Widths(Standard-14 such as Helvetica, Times-Roman) produce zero-width glyph advance; onlyTc/Twcharacter/word spacing contribute. Hardcoded AFM metrics are a separate change. - No BBox clipping — text outside a Form's
/BBoxis still extracted. - Annotation appearance streams not rendered — visual rendering is out of scope.
- Markup popup hierarchies not resolved — popup windows are not extracted.
- Sound/movie/screen/redact/3D annotations — not extracted; surface as
:unknown. - AcroForm widget annotations — covered by
read_acroform/1, notread_annotations/1.
Spec references
- PDF 1.7 § 7.7.3 — Page Tree
- PDF 1.7 § 7.7.3.4 — Inheritance of Page Attributes (resource walk, cycle guard): https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
- PDF 1.7 § 12.3.2 — Destinations
- PDF 1.7 § 12.3.3 — Document Outline
- PDF 1.7 § 12.5 — Annotations
- PDF 1.7 § 12.5.6.x — Annotation subtypes
- PDF 1.7 § 12.6 — Actions
- PDF 1.7 § 14.3.2 — Metadata Streams (XMP merge precedence): https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
Error reasons
See Pdf.Reader.Errors for the complete documented reason set.
Summary
Functions
Pure helper: enriches each token in a Line list with :kind and
:shape. Tokens without an overlapping shape get kind: :text and
shape: nil. Tokens overlapping a shape get the shape attached and
:kind derived from shape.type
No-op in Phase 1 (no file handle or process held after open/1).
Pure helper: groups a flat TextRun list into Line structs.
Opens a PDF from a binary or a file path.
Bang variant of open/2. Raises Pdf.Reader.Error on failure.
Returns the total number of pages in the document.
Bang variant of page_count/1. Raises Pdf.Reader.Error on failure.
Unified entry point — returns the entire extracted PDF in one struct.
Bang variant of read/2. Raises Pdf.Reader.Error on failure.
Extracts AcroForm interactive form fields from the document.
Bang variant of read_acroform/1. Raises Pdf.Reader.Error on failure.
Extracts all annotations from all pages in the document.
Bang variant of read_annotations/1. Raises Pdf.Reader.Error on failure.
Returns the annotations list directly on success.
Extracts images from all pages.
Bang variant of read_images/1. Raises Pdf.Reader.Error on failure.
Reconstructs logical text lines from the page's TextRuns.
Bang variant of read_lines/2. Raises Pdf.Reader.Error on failure.
Extracts document metadata from the Info dictionary.
Bang variant of read_metadata/1. Raises Pdf.Reader.Error on failure.
Extracts document outline (bookmarks) from the PDF catalog's /Outlines tree.
Bang variant of read_outlines/1. Raises Pdf.Reader.Error on failure.
Returns the outlines list directly on success.
Returns the actionable elements (link-like shapes) of the document.
Bang variant of read_shapes/1. Raises Pdf.Reader.Error on failure.
Returns the plain text for each page as a list of strings.
Bang variant of read_text/2. Raises Pdf.Reader.Error on failure.
Returns text runs with absolute positions for all pages.
Bang variant of read_text_with_positions/1. Raises Pdf.Reader.Error on failure.
Returns the recovery event log for a document in chronological (oldest-first) order.
Pure helper: scans a list of Line structs for URL and email patterns
and emits the inferred shapes. Exposed for callers that already have
a lines list and want the inference layer alone (no annotations).
Types
@type reason() :: :not_a_pdf | :malformed | :encrypted_password_required | :encrypted_wrong_password | :encrypted_unsupported_handler | :io_error | {:io_error, File.posix()} | {:unsupported_filter, atom()} | {:unresolved_ref, Pdf.Reader.Document.ref()} | {:unsupported_pdf_version, String.t()} | {:malformed, atom(), map()}
Functions
@spec attach_shapes_to_tokens([Pdf.Reader.Line.t()], [Pdf.Reader.Shape.t()]) :: [ Pdf.Reader.Line.t() ]
Pure helper: enriches each token in a Line list with :kind and
:shape. Tokens without an overlapping shape get kind: :text and
shape: nil. Tokens overlapping a shape get the shape attached and
:kind derived from shape.type:
:uri | :goto | :launch | :named→:link:email→:email
A shape "contains" a token when:
- The shape and the line are on the same page.
- The shape's X range overlaps the token's X range.
- The shape's Y is within ±2 points of the line's Y.
Spec references:
- PDF 1.7 § 12.5.6.5 — Link Annotations (rect semantics)
- PDF 1.7 § 12.6.4 — Action types (URI/GoTo/Launch/Named)
@spec close(Pdf.Reader.Document.t()) :: :ok
No-op in Phase 1 (no file handle or process held after open/1).
Exists to reserve the API slot for future streaming/mmap support and to
signal to callers that they may drop the :binary field to reclaim memory.
Always returns :ok. Does NOT raise.
@spec lines_from_runs( [Pdf.Reader.TextRun.t()], keyword() ) :: [Pdf.Reader.Line.t()]
Pure helper: groups a flat TextRun list into Line structs.
Exposed publicly so callers who already have a runs list (from
read_text_with_positions/1 or hand-crafted in tests) can reuse the
grouping logic without reopening the document.
See read_lines/2 for option semantics.
@spec open( binary() | Path.t(), keyword() ) :: {:ok, Pdf.Reader.Document.t()} | {:error, reason()}
Opens a PDF from a binary or a file path.
Options
password: String.t()— the password to use when opening an encrypted PDF. Defaults to""(the empty string). The empty password is ALWAYS tried first regardless of this option (R-ENC4). If the empty password succeeds, the PDF is opened without requiring a non-empty password.
Success
Returns {:ok, %Pdf.Reader.Document{}} with:
:version— the PDF version string (e.g."1.7"):xref— merged cross-reference table (all/Prevchains followed):trailer— the most-recent trailer dictionary as a plain map:binary— the full PDF binary (held for lazy object resolution):cache— starts as%{}:encryption—nilfor non-encrypted PDFs; populated%StandardHandler{}on success
Errors
{:error, :not_a_pdf}— binary does not start with%PDF-{:error, :malformed}— missing%%EOF, invalidstartxref, etc.{:error, :encrypted_password_required}—/Encryptfound; no password supplied or empty password rejected.{:error, :encrypted_wrong_password}— password supplied but authentication failed.{:error, :encrypted_unsupported_handler}— unsupported encryption handler or RC4 unavailable.{:error, :io_error}— file read failed (no detail){:error, {:io_error, posix}}— file read failed with POSIX reason
Spec references
- PDF 1.7 § 7.6 — Standard Security Handler: https://opensource.adobe.com/dc-acrobat-sdk-docs/standards/pdfstandards/pdf/PDF32000_2008.pdf
- PDF 2.0 § 7.6 — Standard Security Handler (V5/R6): https://www.pdfa.org/wp-content/uploads/2023/04/ISO_32000_2_2020_PDF_2.0_FDIS.pdf
@spec open!( binary() | Path.t(), keyword() ) :: Pdf.Reader.Document.t()
Bang variant of open/2. Raises Pdf.Reader.Error on failure.
@spec page_count(Pdf.Reader.Document.t()) :: {:ok, pos_integer()} | {:error, reason()}
Returns the total number of pages in the document.
Cross-validates the /Count entry in the page tree root against the
actual number of leaf page refs found by traversal. If they disagree,
returns {:error, {:malformed, :page_tree_count_mismatch, %{declared: n, actual: m}}}.
Recovery mode (R-4)
When recover_mode: true and the page list was recovered via the catalog
fallback (xref scan), there is no /Pages /Count to cross-validate against.
In that case, the declared-count lookup is skipped and the actual count from
the xref scan is returned directly. This branch is signalled by
{:page_tree_recovered, n} in recovery_log.
Spec references: PDF 1.7 § 7.7.3 (Page Tree), § 7.7.3.4 (Inheritance).
@spec page_count!(Pdf.Reader.Document.t()) :: pos_integer()
Bang variant of page_count/1. Raises Pdf.Reader.Error on failure.
@spec read( Pdf.Reader.Document.t(), keyword() ) :: {:ok, term(), Pdf.Reader.Document.t()} | {:error, reason()}
Unified entry point — returns the entire extracted PDF in one struct.
Default shape is %Pdf.Reader.Result{} carrying:
:meta— document-level metadata (title, author, subject, creator, producer, dates, page_count, PDF version, encryption flag, recovery_log, plus the raw Info+XMP map). PDF 1.7 § 14.3.:pages—[%Pdf.Reader.Result.Page{number, meta, lines}]. Each page's:linesincludes text lines AND embedded images as synthetic lines, sorted top-to-bottom. Each line's tokens carry:kind(:text | :link | :email | :image) and:shape.
Convenience shapes
Pass :shape if you only want one slice without building the full struct:
:text→[String.t()](plain text per page):shapes→[%Pdf.Reader.Shape{}](links/emails/images flat)
Line tokenisation opts
:y_tolerance(default2.0) — PDF point tolerance to collapse text runs onto the same line.:gap_factor(default1.0em) — token-split threshold inside a line. Forwarded toread_lines/2.
Image opts
:image_bytes(defaultfalse) — whentrue, image tokens carry the raw decoded:bytesinmetaalongside the always-present:data_uri. Off by default to keep the result lightweight; turn on if the caller needs the binary (e.g. to write images to disk or run a QR decoder).
Spec references
- PDF 1.7 § 7.7.3 — Page Tree
- PDF 1.7 § 8.9 — Images (XObject /Subtype /Image)
- PDF 1.7 § 9.4 — Text objects
- PDF 1.7 § 12.5.6.5 — Link Annotations
- PDF 1.7 § 12.6.4 — Action types (URI, GoTo, Launch, Named)
- PDF 1.7 § 14.3 — Document Information Dictionary + XMP
@spec read!( Pdf.Reader.Document.t(), keyword() ) :: term()
Bang variant of read/2. Raises Pdf.Reader.Error on failure.
@spec read_acroform(Pdf.Reader.Document.t()) :: {:ok, [Pdf.Reader.FormField.t()], Pdf.Reader.Document.t()} | {:error, reason()}
Extracts AcroForm interactive form fields from the document.
Walks the /AcroForm /Fields tree depth-first, emitting only leaf fields
as a flat list of %Pdf.Reader.FormField{} structs. Hierarchical names
(/T dot-joined from ancestor path) are resolved. /FT is inherited
downward from the nearest ancestor that defines it.
Returns {:ok, [], doc} when no /AcroForm is present or /Fields is empty.
Never returns {:error, _} for absent or empty AcroForms.
Spec references
- PDF 1.7 § 12.7 (Interactive Forms)
- PDF 1.7 § 12.7.3 (Field Dictionaries)
- PDF 1.7 § 12.7.4 (Field Types)
@spec read_acroform!(Pdf.Reader.Document.t()) :: [Pdf.Reader.FormField.t()]
Bang variant of read_acroform/1. Raises Pdf.Reader.Error on failure.
@spec read_annotations(Pdf.Reader.Document.t()) :: {:ok, [Pdf.Reader.Annotation.t()], Pdf.Reader.Document.t()} | {:error, reason()}
Extracts all annotations from all pages in the document.
Enumerates every page via Page.list_refs/1 and, for each page, resolves
its /Annots array. Supports 10 annotation subtypes:
:link, :text, :highlight, :underline, :strikeout, :squiggly,
:square, :circle, :freetext, :file_attachment. Other subtypes
surface as :unknown with raw fields preserved in :kind_specific.
Returns {:ok, [], doc} when no page has an /Annots array — never an error.
Spec references
- PDF 1.7 § 12.5 — Annotations
- PDF 1.7 § 12.5.6.x — Annotation subtypes
- PDF 1.7 § 12.6 — Actions
@spec read_annotations!(Pdf.Reader.Document.t()) :: [Pdf.Reader.Annotation.t()]
Bang variant of read_annotations/1. Raises Pdf.Reader.Error on failure.
Returns the annotations list directly on success.
@spec read_images(Pdf.Reader.Document.t()) :: {:ok, [Pdf.Reader.Image.t()], Pdf.Reader.Document.t()} | {:error, reason()}
Extracts images from all pages.
For each page, resolves the XObject references from content-stream Do
operators and classifies them as JPEG or PNG-like based on their /Filter.
Returns {:ok, [], doc} when no images are found. The returned doc
carries an updated :recovery_log when opened with recover: true.
@spec read_images!(Pdf.Reader.Document.t()) :: [Pdf.Reader.Image.t()]
Bang variant of read_images/1. Raises Pdf.Reader.Error on failure.
@spec read_lines( Pdf.Reader.Document.t(), keyword() ) :: {:ok, [Pdf.Reader.Line.t()], Pdf.Reader.Document.t()} | {:error, reason()}
Reconstructs logical text lines from the page's TextRuns.
Many machine-generated PDFs (government forms, tax documents) place
glyphs individually with TJ + per-glyph kerning, producing one
TextRun per character. This function coalesces those runs into a
list of Pdf.Reader.Line structs, where each line carries:
:page,:y,:x— absolute position in user space:text— the joined text with single spaces between tokens:tokens—[%{x, text, width}]separated by visible whitespace
The token list lets callers detect column layouts (e.g. table rows where every line has tokens at the same X positions).
Options
:y_tolerance(default2.0) — runs whose Y differs by less than this many points collapse onto the same line. PDFs often jitter by fractional points within a line.:gap_factor(default1.0) — split into a new token when the horizontal gap between two consecutive runs exceedsfont_size × gap_factor. Lower factor = more splits. The default of 1 em separates tokens at visible whitespace (typical inter-glyph advance is ~0.5 em, so 1 em reliably catches real spaces).
Returns {:ok, [Line.t()], doc}. Lines are ordered by page ascending,
then by Y descending (top-to-bottom in PDF user space).
Spec references
- PDF 1.7 § 9.4 — Text objects
- PDF 1.7 § 9.4.4 — Text-showing operators
@spec read_lines!( Pdf.Reader.Document.t(), keyword() ) :: [Pdf.Reader.Line.t()]
Bang variant of read_lines/2. Raises Pdf.Reader.Error on failure.
@spec read_metadata(Pdf.Reader.Document.t()) :: {:ok, %{required(String.t()) => String.t()}, Pdf.Reader.Document.t()} | {:error, reason()}
Extracts document metadata from the Info dictionary.
Resolves the trailer's /Info reference and returns its key-value pairs
as a %{String.t() => String.t()} map. String values are decoded from
PDF literal strings ({:string, binary}).
Common keys: "Title", "Author", "Subject", "Keywords",
"Creator", "Producer", "CreationDate", "ModDate".
Returns {:ok, %{}, doc} when no /Info entry is present.
Spec reference
PDF 1.7 § 14.3.3 (Document Information Dictionary).
@spec read_metadata!(Pdf.Reader.Document.t()) :: %{required(String.t()) => String.t()}
Bang variant of read_metadata/1. Raises Pdf.Reader.Error on failure.
@spec read_outlines(Pdf.Reader.Document.t()) :: {:ok, [Pdf.Reader.Outline.t()], Pdf.Reader.Document.t()} | {:error, reason()}
Extracts document outline (bookmarks) from the PDF catalog's /Outlines tree.
Walks the /First//Next linked list at each nesting level, threading
/Parent for depth. Cycle detection via MapSet and a depth cap of 32
prevent hangs on corrupt PDFs.
Returns {:ok, [], doc} when no /Outlines entry is present — never an error.
Spec references
- PDF 1.7 § 12.3.3 — Document Outline
- PDF 1.7 § 12.3.2 — Destinations
@spec read_outlines!(Pdf.Reader.Document.t()) :: [Pdf.Reader.Outline.t()]
Bang variant of read_outlines/1. Raises Pdf.Reader.Error on failure.
Returns the outlines list directly on success.
@spec read_shapes(Pdf.Reader.Document.t()) :: {:ok, [Pdf.Reader.Shape.t()], Pdf.Reader.Document.t()} | {:error, reason()}
Returns the actionable elements (link-like shapes) of the document.
Combines two sources:
- Annotations of subtype
/Link(PDF 1.7 § 12.5.6.5) — real clickable regions placed by the document author. Each becomes a%Pdf.Reader.Shape{source: :annotation}. - Inferred shapes — URL and email patterns appearing as plain
text in
read_lines/2output. Common in government forms that printhttp://...oremail@domainwithout making them clickable. Each becomes%Pdf.Reader.Shape{source: :inferred}.
Returns {:ok, shapes, doc}. Shapes are sorted by :page ascending,
then by :y descending (top-to-bottom) when a rect is available.
Spec references
- PDF 1.7 § 12.5.6.5 — Link Annotations
- PDF 1.7 § 12.6.4 — Action types (URI, GoTo, Launch, Named)
- RFC 3986 § 3 — URI Generic Syntax
- RFC 5321 § 4.1.2 — Mailbox/Domain syntax (mailto)
@spec read_shapes!(Pdf.Reader.Document.t()) :: [Pdf.Reader.Shape.t()]
Bang variant of read_shapes/1. Raises Pdf.Reader.Error on failure.
@spec read_text( Pdf.Reader.Document.t(), keyword() ) :: {:ok, [String.t()], Pdf.Reader.Document.t()} | {:error, reason()}
Returns the plain text for each page as a list of strings.
Options:
:pages—[pos_integer]to filter to specific 1-indexed page numbers. Default: all pages.
Returns {:ok, page_strings, doc} where each element is the concatenated
text for one page. The returned doc carries an updated :recovery_log
when opened with recover: true. Unresolved glyphs appear as U+FFFD
(already encoded by the encoding cascade layer).
@spec read_text!( Pdf.Reader.Document.t(), keyword() ) :: [String.t()]
Bang variant of read_text/2. Raises Pdf.Reader.Error on failure.
@spec read_text_with_positions(Pdf.Reader.Document.t()) :: {:ok, [Pdf.Reader.TextRun.t()], Pdf.Reader.Document.t()} | {:error, reason()}
Returns text runs with absolute positions for all pages.
Walks each page, decodes its content stream(s), and returns a flat list of
%Pdf.Reader.TextRun{} structs ordered by page then appearance in the
content stream.
Returns {:ok, [], doc} when no text is found. Never returns :no_text_found
as an error per the spec resolution (empty is valid).
The returned doc carries an updated :recovery_log when opened with
recover: true — callers should pass the returned doc to recovery_log/1
to inspect per-page failures.
Form XObjects (Do operator referencing /Type /Form) are NOT recursed —
per Phase 1 scope. A deferred marker is recorded but produces no TextRun.
@spec read_text_with_positions!(Pdf.Reader.Document.t()) :: [Pdf.Reader.TextRun.t()]
Bang variant of read_text_with_positions/1. Raises Pdf.Reader.Error on failure.
@spec recovery_log(Pdf.Reader.Document.t()) :: [Pdf.Reader.Document.recovery_event()]
Returns the recovery event log for a document in chronological (oldest-first) order.
An empty list guarantees that no recovery action occurred during open/2.
This is the canonical way for callers to inspect recovery events — direct
access to doc.recovery_log MUST NOT be used in application code.
The closed set of recovery event tuples is documented in Pdf.Reader.Document.
Spec reference
PDF 1.7 § 7.5 — PDF file structure (recovery model).
@spec shapes_from_lines([Pdf.Reader.Line.t()]) :: [Pdf.Reader.Shape.t()]
Pure helper: scans a list of Line structs for URL and email patterns
and emits the inferred shapes. Exposed for callers that already have
a lines list and want the inference layer alone (no annotations).