Changelog
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
1.0.1 — 2026-05-07 (fork: ExPDF)
First release of ex_pdf on Hex. Fork of
andrewtimberlake/elixir-pdf
v0.7.2 (Hex package :pdf, ©Andrew Timberlake, MIT). The writer API is
preserved unchanged; this release adds a native PDF reader, error
recovery, AcroForm/outlines/annotations extraction, encryption support,
and many quality-of-life improvements.
Renamed
- OTP app and Hex package: :pdf → :ex_pdf. Module names are unchanged (Pdf, Pdf.Reader, etc.; the namespace Pdf.* was kept to avoid breaking writer callers).
- Mixfile module: Pdf.Mixfile → ExPdf.Mixfile.
- Repo: andrewtimberlake/elixir-pdf → MisaelMa/ExPDF.
- All internal :code.priv_dir(:pdf) and Application.compile_env(:pdf, …) references updated to :ex_pdf (transparent to library users).
Added — Native PDF reader
The reader is implemented in pure Elixir + Erlang/OTP stdlib (:zlib,
:crypto, :binary, :unicode, :xmerl). No new Hex runtime deps,
no system tool deps.
Unified entry point
- Pdf.Reader.read/2: one call returns %Pdf.Reader.Result{meta, pages} carrying document-level metadata + per-page lines with :kind-tagged tokens (:text | :link | :email | :image). See Pdf.Reader.Result and Pdf.Reader.Line for the full shape.
- Convenience shapes: read(doc, shape: :text) returns [String.t()], read(doc, shape: :shapes) returns [%Pdf.Reader.Shape{}].
- Image opt: image_bytes: true includes raw decoded :bytes in shape meta (off by default; :data_uri is always present).
Core extraction primitives
- Pdf.Reader.open/2 (with optional password: and recover: opts)
- Pdf.Reader.read_text/1: plain text per page
- Pdf.Reader.read_text_with_positions/1: text runs with absolute X/Y
- Pdf.Reader.read_lines/2: logical lines split into tokens
- Pdf.Reader.read_metadata/1: Info dict + XMP (PDF 1.7 § 14.3)
- Pdf.Reader.read_images/1: embedded raster images with positions
- Pdf.Reader.read_outlines/1: bookmark tree
- Pdf.Reader.read_annotations/1: per-page annotations
- Pdf.Reader.read_acroform/1: interactive form fields
- Pdf.Reader.read_shapes/1: link-like elements (annotations + inferred)
- Pdf.Reader.recovery_log/1: recovery event accessor
- Pdf.Reader.page_count/1
- Bang variants for every read function
Encryption (PDF 1.7 § 7.6, PDF 2.0 § 7.6)
- Standard Security Handler V1/V2 (RC4-40, RC4-128)
- V4 with /AESV2 (AES-128) via crypto:crypto_init_dyn/4
- V5/R6 with /AESV3 (AES-256): full Algorithm 2.A round trip
- Empty-password auto-try, file-key derivation per Algorithms 2/3/5/6/8
CID fonts (PDF 1.7 § 9.7)
- Identity-H/V composite fonts via 2-byte CID tokenisation
- 40 predefined CMaps from adobe-type-tools/cmap-resources bundled in priv/cmap/ (UniJIS, GBK, KSC, ETen, Identity, …)
- Adobe Japan1/CNS1/Korea1/GB1 collections bundled in priv/
- PostScript-subset CMap parser with codespace-aware variable-length tokenizer
- ToUnicode CMap fallback for glyphs outside predefined ranges
Per-glyph widths (PDF 1.7 § 9.4.4, § 9.6.2.1, § 9.7.4.3)
- Full advance formula tx = ((w/1000 - Tj_kern) × Tfs + Tc + Tw_space) × Th applied per glyph
- Heterogeneous CIDFont /W parsing (Form A + Form B + interleaved)
- Standard 14 fallback to 500-unit average glyph width when /Widths is absent, which restores correct positional advance for Helvetica / Times-Roman / Courier PDFs
Form XObject recursion (PDF 1.7 § 8.10)
- Do operator transparent recursion into /Subtype /Form XObjects
- CTM × Form /Matrix multiplication, resource merging, cycle detection, depth cap (8)
AcroForm field extraction (PDF 1.7 § 12.7)
- Pdf.Reader.FormField struct with /FT inheritance walk
- Text, button, choice, signature field types
Outlines and annotations (PDF 1.7 § 12.3, § 12.5)
- Pdf.Reader.Outline tree with destinations resolved
- Pdf.Reader.Annotation struct with subtype detection (link, text, highlight, file attachment, …)
- Annotation-source links automatically merge into the unified Pdf.Reader.Shape API
Image extraction (PDF 1.7 § 8.9)
- /Subtype /Image XObjects with absolute positions and CTM-derived rendered dimensions
- :jpeg (DCTDecode passthrough) and :png_like (FlateDecode + Predictor) classification
- shape.meta.data_uri: RFC 2397 data: URI ready for HTML embedding. JPEG is passthrough; png_like is re-encoded into a real PNG (PNG 1.2 § 5: signature + IHDR + IDAT + IEND with filter byte 0 + zlib) so the URI is browser-loadable.
Shape inference
- Pdf.Reader.Shape struct unifies link annotations and pattern-inferred URIs/emails/images.
- URL regex per RFC 3986 § 3, email regex per RFC 5321 § 4.1.2.
- Trailing punctuation (. , ; : ) ]) stripped from inferred URIs.
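The trailing-punctuation rule can be illustrated with a minimal sketch. The module and function names here are hypothetical and this is not the library's implementation; it only shows one way to drop a run of the listed characters from the end of a candidate URI.

```elixir
defmodule UriTrimSketch do
  # Hypothetical helper: strip any run of trailing ". , ; : ) ]"
  # characters from a candidate URI found in running text.
  def strip_trailing_punctuation(uri) when is_binary(uri) do
    String.replace(uri, ~r/[.,;:)\]]+\z/, "")
  end
end
```

A URI that does not end in one of these characters is returned unchanged.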
Error recovery
- Opt-in recover: true flag with four orthogonal phases:
  - R-1 Per-page isolation: one bad page does not kill the doc
  - R-2 Font lenience: bad font refs fall back to U+FFFD per byte
  - R-3 XRef linear scan: :binary.matches/2 recovers from corrupted xref tables; trailer synthesis from last trailer <<…>> block or /Type /Catalog scan; multi-gen dedup (highest gen wins)
  - R-4 Catalog/Pages tree fallback: /Type /Page xref scan when /Root or /Pages doesn't resolve
- Closed set of recovery event tuples observable via recovery_log/1: :eof_marker_missing, :xref_recovered, :page_tree_recovered, :page_failed, :font_skipped.
- Fatal errors (:not_a_pdf, :encrypted_password_required, :encrypted_wrong_password, :encrypted_unsupported_handler) remain hard errors even under recover: true.
Added — Tooling
- releaser ~> 0.0.7 dev dependency for monorepo-aware version bumping, changelog generation, and Hex publishing.
- Hex package metadata: maintainers, contributors (Andrew Timberlake + Misael Sánchez), licenses, links (GitHub + upstream + changelog).
- ExDoc groups_for_modules separating Reader and Writer namespaces.
- Comprehensive README documenting every reader feature with spec citations.
Test suite
1180 tests, 0 failures, 30 excluded as of this release.
Unreleased — pdf-reader-error-recovery
Added
- Pdf.Reader.open/2 recover: true option: opts in to error-recovery mode. When recover: false (the default, unchanged), all existing strict behavior is preserved. When recover: true, the reader activates four orthogonal recovery phases (R-1..R-4) and logs each recovery action instead of returning {:error, _}.
- Pdf.Reader.recovery_log/1: public accessor returning the recovery event log in chronological (oldest-first) order. An empty list after open/2 guarantees that no recovery action occurred. Direct access to doc.recovery_log is discouraged in application code.
- Pdf.Reader.Document struct extension: two new fields with defaults that are invisible to code that does not reference them:
  - recover_mode :: boolean(), default false
  - recovery_log :: [recovery_event()], default []
- PUBLIC API CHANGE: read_text/1 and read_images/1 return shape. Both functions now return {:ok, list, doc} 3-tuples (doc is the updated document carrying the recovery log). The bang variants (read_text!/2, read_images!/1) are unchanged.
- R-1 (Per-page isolation): when recover: true, a failed page is logged as {:page_failed, page_n, reason} and skipped; remaining pages continue. Spec: PDF 1.7 § 7.7.3, § 7.8.
- R-2 (Font decoder lenience): when recover: true and a font dict fails to resolve, the decoder for that font is replaced with a per-byte U+FFFD identity decoder (<<0xFFFD::utf8>> per byte). The event {:font_skipped, page_n, font_name, reason} is logged. String.valid?/1 is guaranteed true on recovery output. Spec: PDF 1.7 § 9.6, § 9.10.
- R-3 (XRef linear scan): when recover: true and normal xref loading fails (corrupt startxref offset, absent %%EOF), XRef.recover/1 performs a :binary.matches/2 scan to reconstruct the cross-reference table. A {:xref_recovered, n_objects} event is logged. When %%EOF is absent, an additional {:eof_marker_missing, :linear_scan_used} event is prepended. Spec: PDF 1.7 § 7.5.4, § 7.5.5, § 7.5.8.
- R-4 (Catalog / Pages tree fallback): when recover: true and /Root or /Pages cannot be resolved, the reader scans the recovered xref entries for objects with /Type /Page and (/Contents OR /Parent). A {:page_tree_recovered, n_pages} event is logged. Form XObjects (which also carry /Type /Page sometimes) are correctly excluded by the filter. Spec: PDF 1.7 § 7.7.2, § 7.7.3.
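The R-2 fallback decoder can be pictured with a short sketch. The module name is hypothetical and this is not the library's code; it only shows how emitting `<<0xFFFD::utf8>>` once per input byte always yields valid UTF-8, whatever the input bytes are.

```elixir
defmodule FffdFallbackSketch do
  # Hypothetical stand-in for a "per-byte U+FFFD" decoder:
  # every input byte, whatever its value, becomes one U+FFFD character.
  def decode(bytes) when is_binary(bytes) do
    for <<_byte <- bytes>>, into: "", do: <<0xFFFD::utf8>>
  end
end
```

Because the output is built only from a well-formed code point, `String.valid?/1` holds for any input binary.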
Closed set of recovery event tuples
| Tuple | Meaning |
|---|---|
| {:xref_recovered, n} | Linear scan recovered n object entries |
| {:eof_marker_missing, :linear_scan_used} | %%EOF absent; linear scan was invoked |
| {:page_failed, page_n, reason} | Page page_n skipped; reason is an atom or term |
| {:font_skipped, page_n, font_name, reason} | Font replaced with U+FFFD fallback |
| {:page_tree_recovered, n_pages} | Catalog/Pages fallback found n_pages page objects |
No other tuple shapes are appended.
Known gaps (recovery)
- Encrypted AND corrupted PDFs: when a PDF is both encrypted and has a corrupt xref/catalog, the synthetic trailer built by the linear scan does not include /Encrypt. Decryption cannot proceed; these PDFs are non-decryptable even with recover: true.
- Catalog-fallback page order: when R-4 triggers, the page list is in xref-insertion order, NOT document order. The {:page_tree_recovered, n} event signals this known limitation to callers.
- R-4 probe cost: with recover: true, do_open/2 runs a full page-tree walk immediately after xref load (to surface {:page_tree_recovered, n} on the doc returned from open/2). This is O(pages) and measurable on large documents. It is opt-in by design.
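Because the event set is closed, application code can pattern-match on it exhaustively. The sketch below is illustrative only: the module and function names are hypothetical, and it operates on a plain list of event tuples (such as the list `recovery_log/1` is described as returning) rather than on a real document.

```elixir
defmodule RecoveryLogSketch do
  # One clause per tuple shape in the closed set of recovery events.
  def describe({:xref_recovered, n}), do: "linear scan recovered #{n} object entries"
  def describe({:eof_marker_missing, :linear_scan_used}), do: "%%EOF absent; linear scan was invoked"
  def describe({:page_failed, page_n, reason}), do: "page #{page_n} skipped: #{inspect(reason)}"

  def describe({:font_skipped, page_n, font_name, reason}),
    do: "page #{page_n}: font #{inspect(font_name)} replaced with U+FFFD fallback (#{inspect(reason)})"

  def describe({:page_tree_recovered, n_pages}), do: "catalog fallback found #{n_pages} page objects"

  # Page order is only trustworthy when the catalog fallback (R-4) did not run.
  def page_order_reliable?(log) do
    not Enum.any?(log, &match?({:page_tree_recovered, _}, &1))
  end
end
```

A caller might log `Enum.map(events, &RecoveryLogSketch.describe/1)` and treat page numbers as approximate whenever `page_order_reliable?/1` returns false.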
Internal
- Test suite: 1128 tests, 0 failures, 29 excluded (was 1125 before this change).
- New test file: test/pdf/reader/recovery_test.exs (65 tests: 16 RED, 11 GREEN, integration, smoke, and stress).
- Strict TDD throughout (red → green → refactor per task).
- Spec-driven via SDD (sdd/pdf-reader-error-recovery/* artifacts in engram).
Unreleased — pdf-reader-per-glyph-widths
Added
- Per-glyph width support (Pdf.Reader.Font.Widths): text-matrix advance now uses the full PDF 1.7 § 9.4.4 formula, tx = ((w/1000) * Tfs + Tc + Tw_if_space) * Th, rather than a uniform 1-em approximation. Glyph widths are loaded from:
  - /Widths, /FirstChar, /LastChar for simple fonts (Type1, TrueType), § 9.6.2.1
  - /W Form A/B arrays and /DW fallback for CIDFonts (Type0), § 9.7.4.3
- Pdf.Reader.Font.Widths: new module with closures of type (binary() -> [non_neg_integer()]), one per font, built alongside the existing decoder map in extract_page_runs/3 and threaded through Form XObject recursion.
- GraphicsState.widths_fn: new field (default nil) storing the active font's width closure. Set by the Tf operator alongside decoder. (§ 9.4.4)
- Tc, Tw, Tz, TL text-state operators now correctly update GraphicsState. Previously their operands were silently dropped.
- TJ kerning shift now applies horizontal scaling (Th): shift = -(n/1000) * Tfs * Th (previously Th was omitted).
Changed
- GraphicsState.horizontal_scaling default changed from 1.0 to 100.0 (the PDF spec unit is a percentage; Th = horizontal_scaling / 100). Existing code that reads this field directly and expects the percentage form is unaffected; callers that divided by 100 already will need to adjust.
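The advance and kerning formulas quoted in this entry can be checked with a small worked sketch. The module, function, and parameter names are hypothetical; only the two formulas and the percentage convention (Th = horizontal_scaling / 100) come from the entry itself.

```elixir
defmodule GlyphAdvanceSketch do
  # tx = ((w/1000) * Tfs + Tc + Tw_if_space) * Th
  # w: glyph width in 1/1000 text-space units; tfs: font size;
  # tc: character spacing; tw: word spacing, applied only when space? is true;
  # scaling_percent: horizontal scaling as a percentage (100.0 = unscaled).
  def advance(w, tfs, tc, tw, scaling_percent, space?) do
    th = scaling_percent / 100
    tw_if_space = if space?, do: tw, else: 0.0
    (w / 1000 * tfs + tc + tw_if_space) * th
  end

  # shift = -(n/1000) * Tfs * Th for a numeric element n of a TJ array.
  def tj_shift(n, tfs, scaling_percent) do
    -(n / 1000) * tfs * (scaling_percent / 100)
  end
end
```

For example, a 500-unit glyph at 12 pt with no extra spacing and 100% scaling advances 6.0 units, and a TJ adjustment of -200 at 10 pt moves the pen forward by 2.0 units.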
Documented gaps (not in scope)
- Vertical writing widths (/W2, /DW2), § 9.7.4.4
- Standard-14 hardcoded AFM metrics, § 9.6.2.2 (fonts without embedded /Widths currently produce zero-width advance; Tc/Tw still apply correctly)
- Non-default /FontMatrix scaling on CIDType2 fonts, § 9.7.4.3
Internal
- Test suite: 1107 tests, 0 failures, 27 excluded (was 1095 before this change).
- New file: lib/pdf/reader/font/widths.ex
- New test file: test/pdf/reader/font/widths_test.exs (25 tests)
Unreleased — housekeeping-dialyzer-warnings
Internal
- Removed 9 dead-code clauses flagged by Dialyzer "pattern can never match the type". All were defensive {:error, _} arms in bang-wrappers and downstream pattern dispatches where the upstream success typing was {:ok, ...}-only. No behavior change. Specifically: read_metadata!/1 error branch; extract_doc_id/1 {:hex_string, _} and _ patterns; resolve_page_resources/4 dead {n, g} and nil key branches plus unreachable {:error, _} cache arm; do_resolve_page_resources/4 dead {n, g} and nil parent_key branches; font.ex {:error, _} arm for CID.Decoder.build/2; decoder.ex parse_registry(nil) clause. Defensive _error/_ fallbacks in outlines.ex and annotations.ex that guard against future widening of the Destination.resolve/3 return type were intentionally kept and annotated with comments.
Unreleased — pdf-reader-cid-fonts-tier3
Added
- 10 Tier 3 predefined CMaps bundled in priv/cmap/:
  - Adobe-Japan1: EUC-H, EUC-V
  - Adobe-CNS1: B5-H, B5-V, ETenms-B5-H, ETenms-B5-V
  - Adobe-GB1: GB-H, GB-V
  - Adobe-Korea1: KSCms-UHC-HW-H, KSCms-UHC-HW-V
  - Source: adobe-type-tools/cmap-resources (Apache-2.0)
- Pdf.Reader.CID.PredefinedCMap.@bundled set extended from 30 to 40 names.
Internal
- Test suite: 1063 tests, 3 pre-existing failures (encryption), 0 new failures.
- Bundle size: +51.9 KB additional priv/cmap/ data.
Unreleased — housekeeping-mix-format
Internal
- Auto-formatted 12 pre-existing files via mix format to satisfy --check-formatted. Affected files: lib/pdf.ex, lib/pdf/builder.ex, lib/pdf/fonts.ex, lib/pdf/images/png.ex, lib/pdf/layout.ex, lib/pdf/page.ex, lib/pdf/styled_table.ex, test/pdf/builder_test.exs, test/pdf/fonts_test.exs, test/pdf/layout_test.exs, test/pdf/page_templates_test.exs, test/pdf/styled_table_test.exs. No behavior change.
Unreleased — pdf-reader-annotations-outlines
Added
- Document outlines (bookmarks): Pdf.Reader.read_outlines/1 returns [%Pdf.Reader.Outline{title, level, dest_page, children}] walking catalog /Outlines linked list with cycle detection (visited MapSet) and depth cap 32.
- Annotations: Pdf.Reader.read_annotations/1 returns [%Pdf.Reader.Annotation{type, page, rect, contents, ...}] for the 10 in-scope subtypes: Link, Text, Highlight, Underline, StrikeOut, Squiggly, Square, Circle, FreeText, FileAttachment. Other subtypes surface as :type :unknown with raw fields preserved in :kind_specific.
- Pdf.Reader.Destination: resolves all 4 /Dest variants (direct array, named string, /A /S /GoTo /D <array>, /A /S /GoTo /D <name>). Name-tree walker handles depth-20 + cycle detection.
- Pdf.Reader.Utils: extracted shared decode_pdf_string/1 (UTF-16BE BOM + hex string aware) and parse_rect/1. Pdf.Reader and Pdf.Reader.AcroForm migrated to use Utils; private duplicates removed.
- Page index cache: :page_ref_index cached once per read_* call via Destination.ensure_page_index/1. Avoids O(n) page-ref lookups per annotation/outline.
Out of scope
- Annotation appearance streams.
- Markup popup hierarchies.
- Sound/movie/screen/redact/3D annotations.
- AcroForm widget annotations (covered by separate pdf-reader-acroform-extraction).
Internal
- 1053+ tests, 0 failures.
- Strict TDD throughout.
- Spec-driven via SDD (sdd/pdf-reader-annotations-outlines/* artifacts in engram).
Unreleased — pdf-reader-cid-fonts-cmap-resources
Added
- 30 Adobe predefined CMaps bundled in priv/cmap/ (Tier 1 + Tier 2):
  - Tier 1 (16 files): UniJIS-UTF16-H/V, UniJIS-UCS2-H/V, UniCNS-UTF16-H/V, UniCNS-UCS2-H/V, UniGB-UTF16-H/V, UniGB-UCS2-H/V, UniKS-UTF16-H/V, UniKS-UCS2-H/V
  - Tier 2 (14 files): GBK-EUC-H/V, GBKp-EUC-H/V, GBK2K-H/V, ETen-B5-H/V, KSCms-UHC-H/V, 90ms-RKSJ-H/V, 90msp-RKSJ-H/V
  - Source: adobe-type-tools/cmap-resources (Apache-2.0)
- Pdf.Reader.CID.CMapParser: minimal PostScript subset parser (codespacerange, cidchar, cidrange, notdefchar, notdefrange, usecmap). Silently skips all other PS content. Returns {:ok, cmap_fields} | {:error, reason}. Never raises on malformed input.
- Pdf.Reader.CID.Codespace.tokenize/2: variable-length 1–4 byte codespace-aware tokenizer per PDF 1.7 § 9.7.6 shortest-match rule. Bytes outside all codespace ranges are silently dropped one-at-a-time.
- Pdf.Reader.CID.PredefinedCMap: lazy loader with Document.cache keyed {:predefined_cmap, name} and usecmap chain support (cycle detection via visited MapSet; missing/non-bundled parents fall back to empty CMap per discovery #182).
- Pdf.Reader.CID.Decoder.build_predefined/2: new dispatch branch resolves bytes → CID via codespace + CMap → Unicode via existing Adobe collection table. Resolution cascade: ToUnicode CMap → predefined CMap → Adobe registry → U+FFFD with sentinel.
- Pdf.Reader.Font.cid_font_type/1: extends the former cid_font?/1 predicate to recognise bundled predefined CMap names; dispatch returns :identity | {:predefined, name} | :not_cid.
Known Limitations
- Tier 3 CMaps not bundled: EUC, B5, GB, ETenms-B5, KSCms-UHC-HW and similar encodings were deferred. Shipped in pdf-reader-cid-fonts-tier3.
- Adobe-{Japan1,CNS1,Korea1,GB1}-UCS2 abstract parent files do not exist in adobe-type-tools/cmap-resources. The usecmap operator falls back to empty parent CMap if the named parent is not bundled; the child's mappings still work standalone. Real-world usecmap chains are exercised via -V → -H pairs (e.g. UniJIS-UTF16-V usecmap UniJIS-UTF16-H).
Internal
- Test suite: 909 tests, 0 failures (was 890 before this change's tests).
- Strict TDD throughout (red → green → refactor per task pair).
- Spec-driven via SDD (sdd/pdf-reader-cid-fonts-cmap-resources/*).
Unreleased — pdf-reader-cid-fonts
Added
- CID composite font support: Type0 fonts with /Encoding /Identity-H or /Identity-V are now dispatched to a new CID decoder path in Pdf.Reader.Font.build_decoder_internal/2. Text extraction from standard CJK PDFs (Japanese, Chinese Traditional/Simplified, Korean) now returns correct Unicode instead of U+FFFD.
- Four Adobe collection modules: compile-time CID → Unicode tables bundled as @external_resource pattern-match clauses (O(1) BEAM dispatch):
  - Pdf.Reader.CID.AdobeJapan1: ~9 600 entries (UniJIS-UCS2 column)
  - Pdf.Reader.CID.AdobeCNS1: ~18 300 entries (UniCNS-UCS2 column)
  - Pdf.Reader.CID.AdobeKorea1: ~17 100 entries (UniKS-UCS2 column)
  - Pdf.Reader.CID.AdobeGB1: ~28 700 entries (UniGB-UCS2 column)
- Source data: adobe-type-tools/cmap-resources repository. Blob SHAs committed:
  - Adobe-Japan1-7/cid2code.txt → 4aead36837da
  - Adobe-CNS1-7/cid2code.txt → 13ebdcb98e07
  - Adobe-Korea1-2/cid2code.txt → 0b5db6b5f5c3
  - Adobe-GB1-6/cid2code.txt → c94c7bf8c943
  - Repository HEAD at time of normalization: f5cf3bca7fdf
- Pdf.Reader.CID.CIDToGIDMap: parses /CIDToGIDMap entries (/Identity, FlateDecode-decoded binary stream, or indirect ref). Stored for future glyph-rendering work; not used in the Unicode cascade.
- Pdf.Reader.CID.Decoder: resolves per-CID Unicode via cascade: ToUnicode CMap → Adobe registry table → U+FFFD with sentinel {idx, "cid:0xHHHH"}.
- mix.exs package.files: "priv" added so that @external_resource paths in the Adobe collection modules resolve correctly at Hex compile time.
Known limitations
- Non-Identity predefined CMaps not decoded: fonts with /Encoding /UniJIS-UTF16-H, /GBK-EUC-H, etc. fall through to the simple-font path and emit U+FFFD with sentinels. Full support planned for future change pdf-reader-cid-fonts-cmap-resources.
- Vertical writing mode: Identity-V is dispatched to the same decoder as Identity-H. No positional adjustments for vertical layout.
Unreleased — pdf-reader-acroform-extraction
Added
- Pdf.Reader.read_acroform/1 and read_acroform!/1: extract interactive AcroForm form fields from a PDF document. Returns a flat list of %Pdf.Reader.FormField{} structs with decoded names, types, values, flags, and rectangles. Absent /AcroForm returns {:ok, [], doc}, never an error.
- Pdf.Reader.FormField struct: carries :name (fully-qualified dot-path), :partial_name, :type (:text | :button | :choice | :signature | :unknown), :value (type-specific decoded value), :default, :tooltip, :flags (%{atom => boolean} decoded from /Ff bitmask), :rect.
- Pdf.Reader.AcroForm walker module: depth-first leaf-only walker with cycle detection (MapSet of {n, g} xref keys), depth cap (@max_field_depth 8), /FT inheritance, hierarchical naming, and widget-only annotation filtering.
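Decoding an /Ff bitmask into a %{atom => boolean} map can be sketched as follows. The module name and the atom keys are assumptions chosen for illustration (the library's actual key names are not stated here); the bit positions are the three flags that PDF 1.7 § 12.7.3 defines for all field types: ReadOnly (bit 1), Required (bit 2), NoExport (bit 3).

```elixir
defmodule FieldFlagsSketch do
  import Bitwise

  # 1-based bit positions of the flags common to every field type.
  # The atom names are hypothetical, not the library's.
  @common_flags [read_only: 1, required: 2, no_export: 3]

  def decode(ff) when is_integer(ff) and ff >= 0 do
    Map.new(@common_flags, fn {name, bit} ->
      {name, band(ff, 1 <<< (bit - 1)) != 0}
    end)
  end
end
```

Type-specific flags (for example the multiline or password bits of text fields) would extend the same table with higher bit positions.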
Unreleased — pdf-reader-resource-inheritance-multilevel
Fixed
- Cyclic /Parent infinite loop: resolve_page_resources/4 now carries a visited MapSet of {obj_num, gen_num} xref refs during each /Parent-chain walk. If a ref is encountered a second time (direct self-ref or transitive cycle), the walk is silently terminated and %{} is returned. Prevents corrupt PDFs from hanging the reader indefinitely.
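The visited-set guard can be shown on a toy object table. Everything in this sketch is hypothetical (module name, the plain-map object table, string keys); it is not resolve_page_resources/4, only the same idea: remember each ref seen on the way up and return %{} as soon as one repeats.

```elixir
defmodule ParentWalkSketch do
  # objects: %{{obj_num, gen_num} => %{"Resources" => map, "Parent" => ref}}
  # Walks up the "Parent" chain until a node carries "Resources".
  def resolve_resources(objects, ref), do: walk(objects, ref, MapSet.new())

  defp walk(_objects, nil, _visited), do: %{}

  defp walk(objects, ref, visited) do
    if MapSet.member?(visited, ref) do
      # Second visit to the same ref: a cycle, so stop quietly.
      %{}
    else
      case Map.get(objects, ref) do
        %{"Resources" => res} when is_map(res) -> res
        %{"Parent" => parent} -> walk(objects, parent, MapSet.put(visited, ref))
        _ -> %{}
      end
    end
  end
end
```

Since the set grows by one ref per step and the object table is finite, the walk always terminates, even on a self-referencing node.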
Added
- Per-leaf-page resource cache: resolved /Resources maps are now stored in doc.cache under {:page_resources, {n, g}} keyed by the leaf page's xref ref. Subsequent calls for the same page (e.g. a second read_text/1 call on an open doc) skip the /Parent-chain walk entirely and return the cached value.
Note
- Moduledoc clarified — removed the stale "Known limitations" entry that stated resource inheritance was limited to one level of parent-chain walk. The full recursive walk has been in place since Phase 1.1; only the documentation was wrong. Added PDF 1.7 § 7.7.3 and § 7.7.3.4 spec references.
Unreleased — fix-writer-set-info-state
Fixed
- Info dict lost after page mutations: Pdf.set_info/2 (and its single-key variants set_title/2, set_author/2, etc.) stores metadata by updating document.objects. However, Page carries its own copy of objects that was snapshotted at page-creation time. Any subsequent page mutation (set_font, text_at, …) calls sync_page/2, which replaces document.objects with page.objects, silently discarding the info update. The fix propagates the info-dict change into document.current.objects inside put_info/2, so both copies stay in sync and sync_page/2 no longer clobbers metadata.
Unreleased — pdf-reader-form-xobject-recursion (Phase 3)
Added — Phase 3 (pdf-reader-form-xobject-recursion)
- Form XObject recursion: Do operators referencing /Type /XObject /Subtype /Form are now recursed into transparently. Text and images inside Forms (headers, footers, repeated logos, templated form fields) appear in Pdf.Reader.read_text/2, read_text_with_positions/1, and read_images/1 output. Previously these objects were emitted as {:deferred, :form_xobject, name} events and silently dropped; that behavior is REPLACED.
- CTM × /Matrix inheritance: child Form's CTM is Form.Matrix × parent CTM at time of Do. Graphics state is saved on entry and restored on exit (effectively q ... Q around the form).
- Resource merging: Form's /Resources is shallow-merged with the page's resources (Form wins on key collision). Per-Form font decoders are built via Pdf.Reader.Font.build_decoders_for_resources/2 and benefit from the existing Document.cache {:font_decoder, font_ref} cache.
- Cycle detection: interpreter state carries a :visited MapSet of {obj_num, gen_num} xref keys, threaded forward into child states. When a Form references an already-visited Form (directly or transitively), an internal {:cycle_detected, ref} event is emitted and recursion is skipped.
- Depth cap: recursion is capped at @max_form_depth 8. Beyond that, an internal {:max_depth_exceeded, ref} event is emitted and the Form is skipped.
- Image bubble-up: images embedded inside Form XObjects bubble up to the parent's event stream and appear in read_images/1 output, with CTM reflecting the full transform (Form.Matrix × parent CTM × image local CTM).
- Internal/cycle/depth events dropped from text output: the new {:cycle_detected, _} and {:max_depth_exceeded, _} event types flow through Pdf.Reader.ContentStream.interpret/3's output but are silently dropped by events_to_text_runs/2. Public read_text* API surface unchanged.
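The CTM × /Matrix step is an ordinary product of two PDF affine matrices. The sketch below assumes matrices are held as 6-tuples {a, b, c, d, e, f}; the module and function names are hypothetical and the library's internal representation may differ.

```elixir
defmodule CtmSketch do
  # {a, b, c, d, e, f} stands for the 3x3 matrix [a b 0; c d 0; e f 1]
  # in the PDF row-vector convention; multiply/2 returns m1 × m2.
  def multiply({a1, b1, c1, d1, e1, f1}, {a2, b2, c2, d2, e2, f2}) do
    {a1 * a2 + b1 * c2, a1 * b2 + b1 * d2,
     c1 * a2 + d1 * c2, c1 * b2 + d1 * d2,
     e1 * a2 + f1 * c2 + e2, e1 * b2 + f1 * d2 + f2}
  end

  # Child Form CTM = Form.Matrix × parent CTM at the time of Do.
  def child_ctm(form_matrix, parent_ctm), do: multiply(form_matrix, parent_ctm)
end
```

With a Form /Matrix that translates by (10, 20) and a parent CTM that scales by 2, the child CTM is {2, 0, 0, 2, 20, 40}: the Form's offset is scaled along with everything else on the page.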
Modified — Phase 3
- Pdf.Reader.ContentStream.interpret/3: public arity and return shape unchanged (backward-compat). New private do_interpret_with_doc/5 for the recursive path; extract_page_runs/3 and extract_page_images/3 now use it.
- Pdf.Reader.Image and Pdf.Reader.TextRun events from inside Forms are appended to the parent page's event list.
- build_xobjects_map/1 simplified: passes raw {:ref, n, g} refs from resources["XObject"] instead of pre-classifying as :form. ContentStream classifies on demand inside Do.
Out of scope (Phase 3)
- BBox clipping of Form contents: text outside a Form's /BBox is still extracted (presentational concern, not data extraction).
- Pattern XObject recursion: /Type /Pattern objects referenced via Do are skipped.
- Multi-level page-tree resource inheritance (still one-level walk only).
- AcroForm interactive field extraction.
Internal — Phase 3
- Test suite: 756 tests, 0 failures (738 default + 18 @tag :fixtures).
- Strict TDD applied throughout (red → green → refactor per task).
- Spec-driven via SDD (sdd/pdf-reader-form-xobject-recursion/* artifacts in engram).
- Pdf.Reader.ContentStream @moduledoc cites PDF 1.7 § 8.10 (Form XObjects), § 8.10.2 (Form Dictionaries), § 8.4 (Coordinate Systems), § 8.8 (External Objects / Do operator) plus pdf.js + pdfminer-six reference impls.
Unreleased — pdf-reader-encryption (Phase 2)
Added — Phase 2 (pdf-reader-encryption)
- Standard Security Handler support: encrypted PDFs are now READABLE via Pdf.Reader.open/2 when the correct password is provided (or empty for metadata-protection cases). Implements all four spec versions:
  - V1 / R=2: RC4 40-bit (legacy)
  - V2 / R=3: RC4 up to 128-bit (most common pre-2008)
  - V4 / R=4: Crypt Filters + AES-128 (PDF 1.6+)
  - V5 / R=6: AES-256 + SHA-256/384/512 mixing (PDF 2.0 / Acrobat X+)
- Pdf.Reader.open/2 with password: String.t() opt (default "").
  - Always tries empty password first (metadata-protection auto-unlock).
  - If non-empty password supplied, tries as user → owner password.
  - Pdf.Reader.open/1 retained; it delegates to open/2 with empty opts.
- New error atoms in Pdf.Reader.reason/0:
  - :encrypted_password_required: no password supplied, empty failed.
  - :encrypted_wrong_password: supplied password rejected as user AND owner.
  - :encrypted_unsupported_handler: /Filter != /Standard, V5/R5 (deprecated), or RC4 unavailable on the runtime.
  - The legacy :encrypted atom is REMOVED (existing test updated to assert the new atom).
- Pdf.Reader.Document struct gained :encryption field (%StandardHandler{} when encrypted, nil otherwise).
- Decryption hook integrated transparently in Pdf.Reader.ObjectResolver.resolve_in_use/3 only; resolve_compressed/3 is left untouched (object-stream contents are decrypted ONCE at the containing-stream level; double-decryption would corrupt them).
- Per-object encryption key derivation per PDF 1.7 § 7.6.2 for V1/V2/V4 (file key + obj_num + gen_num + optional sAlT literal → MD5 → truncate). V5 uses the file encryption key directly.
- Crypt Filter /Identity honored: V4 streams marked /Identity are passed through plaintext (common XMP metadata pattern).
- /EncryptMetadata false honored: when set in the Encrypt dict, the catalog's /Metadata stream is read as plaintext regardless of the default Stream Filter.
- mix.exs: :crypto added to extra_applications (required at release time; the OTP :crypto app is stdlib, not a Hex dep).
- New modules: Pdf.Reader.Encryption (facade), Pdf.Reader.Encryption.{PasswordPad, ObjectKey, StandardHandler, V1V2, V4, V5}.
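The per-object key derivation listed above (file key + obj_num + gen_num + optional sAlT → MD5 → truncate) follows Algorithm 1 of PDF 1.7 § 7.6.2 and can be sketched in a few lines. The module and function names are hypothetical; this is an illustration of the algorithm, not the code in Pdf.Reader.Encryption.ObjectKey.

```elixir
defmodule ObjectKeySketch do
  # file_key: the n-byte file encryption key (V1/V2/V4 only).
  # obj_num / gen_num: the indirect object's numbers; only the low-order
  # 3 and 2 bytes are used, in little-endian order.
  # aes?: true for AES crypt filters, which append the literal "sAlT".
  def object_key(file_key, obj_num, gen_num, aes?) when is_binary(file_key) do
    salt = if aes?, do: "sAlT", else: <<>>

    input =
      <<file_key::binary, obj_num::little-size(24), gen_num::little-size(16), salt::binary>>

    digest = :crypto.hash(:md5, input)
    # Keep the first min(n + 5, 16) bytes of the MD5 digest.
    binary_part(digest, 0, min(byte_size(file_key) + 5, 16))
  end
end
```

A 5-byte (40-bit) file key therefore yields a 10-byte object key, while a 16-byte key yields the full 16-byte digest. V5 skips this step and uses the file key as is.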
Known Limitations (Phase 2, carried forward)
- End-to-end V4/V5 round-trip integration tests deferred: algorithm-level unit tests (73 total across V1V2/V4/V5) verify each cipher against published vectors from Mozilla pdf.js crypto_spec.js, cross-checked with Node.js. V2/R3 is fully covered end-to-end via craft_rc4_v2_pdf/1 (round-trip from hand-crafted PDF through open/2 → read_text/1). V4/V5 dispatch through the resolver hook is unit-validated but lacks a full hand-crafted PDF round-trip fixture. Planned as pdf-reader-encryption-fixtures-handcraft.
- Real-world fixture PDFs not committed: this would require qpdf as a build/test dependency, which contradicts the project's "native only, zero external dependencies" principle. Planned as a separate optional change if/when the constraint is relaxed.
- R5 (deprecated V5 variant): unsupported by design. PDFs with V=5 R=5 return {:error, :encrypted_unsupported_handler}.
- Public-Key Security Handler (X.509 cert-based, /Filter /Adobe.PubSec or similar): not supported. Returns :encrypted_unsupported_handler.
- Permission flag enforcement: flags are read but NOT enforced. We are a reader; downstream tools may choose to honor /P bits.
- RC4 availability: runtime dependent on OpenSSL configuration. On systems where RC4 is disabled (some OpenSSL 3 builds), V1/V2 PDFs return :encrypted_unsupported_handler. AES paths (V4/V5) work everywhere.
Internal — Phase 2
- Test suite: 726 tests, 0 failures (708 default + 18 @tag :fixtures).
- 73 unit tests across V1V2/V4/V5 verify algorithms 2, 4, 5, 6, 7, 8, 9, 10 against vectors sourced from Mozilla pdf.js test/unit/crypto_spec.js (Apache-2.0). Each vector independently re-computed with Node.js crypto and Erlang :crypto to confirm parity.
- Strict TDD applied throughout (red → green → refactor per task).
- Spec-driven via SDD (sdd/pdf-reader-encryption/* artifacts in engram).
- All algorithm modules cite canonical spec URLs (PDF 1.7/2.0, NIST FIPS 197, NIST SP 800-38A, RFC 1321) in @moduledoc.
Unreleased — pdf-reader-cascade-wire (Phase 1.1)
Added — Phase 1.1 (pdf-reader-cascade-wire)
- Encoding cascade wired through read_text/2 and read_text_with_positions/1: text is now decoded to Unicode (was raw bytes in Phase 1). Per-font cascade order: ToUnicode CMap → /Differences + AGL → base encoding (WinAnsi/MacRoman/Standard) → U+FFFD.
- Per-font decoder construction with cache: Pdf.Reader.Font.build_decoder/2 builds closures per font dict; decoders are cached in Document.cache keyed by {:font_decoder, font_ref} (indirect-ref fonts only; inline font dicts are not cached).
- Tf operator switches active decoder mid-content-stream: font changes in the stream are respected; each text operation uses the decoder for the currently active font.
- XMP metadata parsing via :xmerl (OTP stdlib): read_metadata/1 merges XMP with /Info; XMP wins on conflict (PDF 1.7 § 14.3.2). Recognized namespaces: dc:, xmp:, pdf:. Malformed XMP falls back to /Info-only silently.
- Pdf.Reader.Image struct gained :ctm, :render_width, :render_height, :rotation_radians fields. CTM decomposition follows PDF 1.7 § 8.3.3 and § 8.9.5.
- Resource inheritance: one-level parent-chain walk added to resolve_page_resources/2 so writer-built PDFs (which store resources on the Pages parent node, not the leaf page) extract text and images correctly.
- New modules: Pdf.Reader.Font, Pdf.Reader.XMP
- New fixture: test/fixtures/images/tiny.jpg (32×32 px, ~900 B, public-domain JPEG from picsum.photos, used by image CTM integration tests)
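The per-font cascade order (ToUnicode CMap → /Differences + AGL → base encoding → U+FFFD) amounts to "first table that knows the byte wins". The sketch below models each stage as a plain map from byte to string; the module name and that map representation are assumptions made for illustration and do not reflect how Pdf.Reader.Font builds its closures.

```elixir
defmodule EncodingCascadeSketch do
  @replacement "\uFFFD"

  # Try each lookup table in priority order; fall back to U+FFFD.
  def decode_byte(byte, to_unicode, differences, base) do
    Enum.find_value([to_unicode, differences, base], @replacement, &Map.get(&1, byte))
  end

  # Decode a single-byte-encoded string one byte at a time.
  def decode(bytes, to_unicode, differences, base) when is_binary(bytes) do
    for <<b <- bytes>>, into: "", do: decode_byte(b, to_unicode, differences, base)
  end
end
```

With a ToUnicode entry for byte 65, a /Differences entry for byte 66, and a base table covering both, the ToUnicode and /Differences values take precedence, and a byte known to none of the tables becomes U+FFFD.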
Known Limitations (Phase 1.1, carried forward)
- Resource inheritance: only one level of parent-chain walk is implemented. PDFs with deeply nested page trees that store resources two or more levels above the leaf page may produce empty text. Planned as pdf-reader-resource-inheritance change.
- Per-glyph advance via /Widths: glyph advance is approximated as uniform (char_count × font_size). Per-glyph widths are a separate change.
- Form XObject Do recursion: content inside form XObjects (/Type /Form) is not extracted. Planned for Phase 3.
- CID fonts beyond ToUnicode: CID-keyed fonts without a /ToUnicode CMap produce U+FFFD substitutions. Planned for Phase 3.
- CCITTFaxDecode, JBIG2Decode, JPXDecode: not supported; these require third-party C libraries and are outside scope.
Internal — Phase 1.1
- Test suite: 616 tests, 0 failures (598 default + 18 @tag :fixtures)
- Strict TDD applied throughout (red → green → refactor per task)
- Spec-driven via SDD (sdd/pdf-reader-cascade-wire/* artifacts in engram)
Unreleased — pdf-reader-core (Phase 1)
Added
- Pdf.Reader.open/1, read_text/2, read_text_with_positions/1, read_images/1, read_metadata/1, page_count/1, close/1, and bang variants (open!/1, etc.)
- Stream filter pipeline:
  - FlateDecode with PNG predictors 1–4 and 10–14 and TIFF Predictor 2 (horizontal differencing)
  - ASCII85Decode with z shortcut and ~> EOD marker
  - ASCIIHexDecode with whitespace tolerance and > EOD
  - RunLengthDecode (128 = EOD, 0–127 = literal, 129–255 = repeat)
  - LZWDecode with variable-width codes (9–12 bit), EarlyChange 0 and 1
- Cross-reference table support: classic xref (PDF 1.0–1.4) AND xref streams (PDF 1.5+) with /Prev chain merging and hybrid chains (mixed classic + stream)
- Object stream (/Type /ObjStm) decoding via Pdf.Reader.ObjectStream
- Encoding cascade (per-glyph): ToUnicode CMap → /Differences + Adobe Glyph List → base encoding (WinAnsi / MacRoman / StandardEncoding) → U+FFFD with diagnostic sentinel
- Bundled Adobe Glyph List 2.0 as a compile-time module (~4 500 entries, BSD-licensed)
- Public-domain encoding tables:
  - Apple ROMAN.TXT (canonical Mac Roman mapping)
  - PDF 1.7 Annex D.2 StandardEncoding (cross-checked against Mozilla pdf.js)
- Lazy indirect-object resolver with pure Map cache (no GenServer, no Agent)
- Pure tagged-tuple internal value model: {:ref, n, g}, {:name, _}, {:string, _}, {:hex_string, _}, {:stream, dict, body}, plain %{} for dicts, plain lists for arrays
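Of the filters listed above, RunLengthDecode is small enough to show in full. The sketch below follows the rule stated in the list (128 = EOD, 0–127 = literal run of length + 1 bytes, 129–255 = the next byte repeated 257 − length times); the module name and the {:error, :truncated} return value are assumptions, not the library's API.

```elixir
defmodule RunLengthSketch do
  # Returns {:ok, decoded_binary} or {:error, :truncated}.
  def decode(data) when is_binary(data), do: decode(data, [])

  # 128 marks end of data; anything after it is ignored.
  defp decode(<<128, _::binary>>, acc), do: finish(acc)
  defp decode(<<>>, acc), do: finish(acc)

  # 0..127: copy the next n + 1 bytes literally.
  defp decode(<<n, rest::binary>>, acc) when n < 128 do
    len = n + 1

    case rest do
      <<literal::binary-size(len), tail::binary>> -> decode(tail, [literal | acc])
      _ -> {:error, :truncated}
    end
  end

  # 129..255: repeat the next byte 257 - n times.
  defp decode(<<n, byte, rest::binary>>, acc) when n > 128 do
    decode(rest, [:binary.copy(<<byte>>, 257 - n) | acc])
  end

  defp decode(_, _acc), do: {:error, :truncated}

  defp finish(acc), do: {:ok, acc |> Enum.reverse() |> IO.iodata_to_binary()}
end
```

For instance, the bytes `<<2, "abc", 254, "x", 128>>` decode to "abcxxx": a three-byte literal run, then "x" repeated three times, then EOD.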
Known Limitations (Phase 1)
- No encryption support: encrypted PDFs return {:error, :encrypted} (deferred to Phase 2)
- No CID fonts beyond ToUnicode-mapped glyphs (deferred to Phase 3)
- No CCITTFaxDecode, JBIG2Decode, or JPEG 2000 image filters; these require third-party C libraries and are outside scope
- No AcroForm or XFA form field extraction
- No OCR or scanned-PDF text extraction (impossible without third parties)
- Form XObject (Do operator) is recognised but not recursed; content is not extracted
- Glyph advance approximation: uses char_count × font_size instead of per-glyph /Widths; start-of-run position is exact, inter-run drift is possible for proportional fonts
- CMap multi-codepoint mappings (ligatures): only the first codepoint is used
- Malformed PDFs return strict {:error, :malformed}; there is no partial-recovery mode
- XMP metadata streams are not parsed; read_metadata/1 reads only the /Info dictionary
Internal
- Test suite: 550 tests, 0 failures (541 default + 9 @tag :fixtures)
- Strict TDD applied throughout (red → green → refactor per task)
- Spec-driven via SDD (sdd/pdf-reader-core/* artifacts in engram)
0.7.1 (2024-07-23)
- Fix memory leak when cleaning up a PDF process
0.7.0 (2024-07-12)
- Add autoprint/1 to automatically open the print dialog in a browser
0.6.1 (2023-01-19)
- Fix bug with zero width strings and empty rows (also fixes [#24])
- Fix issue with nil cap height [#35]
- Raise RuntimeError when attempting to add text without a font [#36]
- Fix typespec for text_wrap/5 [#37]
0.6.0 (2021-12-07)
- Add :odd and :even to :row_style on table with a lower precedence than indexed styles
- Fix bug where only the first non-WinAnsi character was replaced [#32]
0.5.0 (2020-12-02)
- Catch errors raised within the GenServer and re-raise them in the calling process
0.4.0 (2020-08-12)
- Add :encoding_replacement_character option to supply a replacement character when encoding fails
- Add :allow_row_overflow option to Pdf.table/4 to allow row contents to be split across pages
0.3.7 (2020-04-29)
- Bug fix: Fix memory leak by stopping internal processes
0.3.6 (2020-04-22)
- Bug fix: Correctly handle encoded text as binary, not UTF-8 encoded string
- Bug fix: External fonts now work like built-in fonts #17
- Bug fix: Reset colours changed by attributed text
- Bug fix: Fix global options for text_at/4 when using a string #11
0.3.5 (2020-04-14)
- Deprecate: Pdf.delete/1 in favour of Pdf.cleanup/1
- Deprecate: Pdf.open/2 in favour of Pdf.build/2