Pdf.Reader.XRef (ExPDF v1.0.1)
Facade that dispatches to the appropriate xref reader and follows /Prev chains.
Dispatch logic (PDF 1.7 § 7.5.8)
At a given startxref offset, peeks at the first non-whitespace bytes:
- Starts with xref → classic xref table (§ 7.5.4). Delegates to Pdf.Reader.XRef.Classic.
- Starts with digits matching N G obj → xref stream (§ 7.5.8). Delegates to Pdf.Reader.XRef.Stream.
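The peek-and-dispatch decision can be sketched roughly as follows. This is a minimal illustration, not the library's code: the module XRefDispatchSketch and its classify/2 function are hypothetical names, and the whitespace set is assumed to be the six PDF whitespace bytes.

```elixir
defmodule XRefDispatchSketch do
  # Hypothetical helper, not part of ExPDF: reports which kind of xref
  # section appears to start at `offset`.
  @spec classify(binary(), non_neg_integer()) :: :classic | :stream | :unknown
  def classify(pdf, offset)
      when is_binary(pdf) and is_integer(offset) and offset >= 0 and
             offset <= byte_size(pdf) do
    pdf
    |> binary_part(offset, byte_size(pdf) - offset)
    |> skip_whitespace()
    |> kind()
  end

  def classify(_pdf, _offset), do: :unknown

  # Assumed whitespace set: NUL, TAB, LF, FF, CR, SPACE.
  defp skip_whitespace(<<c, rest::binary>>) when c in [0, 9, 10, 12, 13, 32],
    do: skip_whitespace(rest)

  defp skip_whitespace(bin), do: bin

  # The "xref" keyword marks a classic table.
  defp kind(<<"xref", _::binary>>), do: :classic

  # "N G obj" marks an indirect object, taken here to be an xref stream.
  defp kind(bin) do
    if Regex.match?(~r/\A\d+\s+\d+\s+obj/, bin), do: :stream, else: :unknown
  end
end
```

A real dispatcher would hand the binary to the matching reader instead of returning an atom; the atoms here only make the branch visible.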
Both formats carry /Prev chain links that reference older xref sections.
Those are followed recursively, with newer entries overriding older ones.
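The override rule can be illustrated with plain maps. The sketch below assumes the sections have already been parsed into entry maps and are listed newest first, the order in which a /Prev walk visits them; XRefChainSketch and merge_newest_first/1 are hypothetical names.

```elixir
defmodule XRefChainSketch do
  # Hypothetical helper: `sections` is a list of entry maps, newest first.
  def merge_newest_first(sections) do
    Enum.reduce(sections, %{}, fn section, acc ->
      # Map.merge/2 prefers values from its second argument, so keys already
      # collected from newer sections survive when an older section is added.
      Map.merge(section, acc)
    end)
  end
end

newest = %{{1, 0} => {:in_use, 500, 0}}
older = %{{1, 0} => {:in_use, 15, 0}, {2, 0} => {:in_use, 80, 0}}
merged = XRefChainSketch.merge_newest_first([newest, older])
```

Here object 1 keeps the offset from the newest section, while object 2, which only the older section defines, is carried over unchanged.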
Hybrid PDFs
Incremental updates may mix classic and stream xrefs in the same /Prev chain.
load/2 handles this transparently by dispatching each chain link independently.
Linear scan recovery (PDF 1.7 § 7.5.4, § 7.5.8)
When normal xref loading fails (corrupt or missing %%EOF, bad startxref
offset), recover/1 performs a linear scan of the full PDF binary to
reconstruct the cross-reference table without relying on the startxref
pointer or the on-disk xref section.
Algorithm:
- Use :binary.matches/2 to find all occurrences of " obj" in the binary.
- Back-scan each match for a \n<digits> <digits> prefix; this distinguishes real object headers from obj substrings inside content streams or strings.
- Build a map of {obj_num, gen_num} => {:in_use, offset, gen_num} entries.
- On collision (same obj_num, different gen_num), keep the highest gen_num; ties are broken by the later (higher) byte offset.
- Synthesise a trailer dict by scanning the binary for the LAST trailer\n<<...>> block. If none is found, scan recovered object entries for one containing /Type /Catalog to derive /Root.
- Returns {:ok, entries_map, trailer_struct}.
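The scanning and back-scanning steps can be sketched as below. This is an illustration under assumptions, not the library's implementation: XRefScanSketch and its helpers are hypothetical, only a literal "\n" before the object number is accepted, and collision handling and trailer synthesis are left out.

```elixir
defmodule XRefScanSketch do
  # Hypothetical sketch: collects "N G obj" headers that start a line.
  def scan(pdf) when is_binary(pdf) do
    pdf
    |> :binary.matches(" obj")
    |> Enum.flat_map(fn {pos, _len} ->
      case header_before(pdf, pos) do
        {:ok, {num, gen, offset}} -> [{{num, gen}, {:in_use, offset, gen}}]
        :error -> []
      end
    end)
    # Matches arrive in ascending offset order, so for an identical
    # {num, gen} key the later offset is the one that remains in the map.
    |> Map.new()
  end

  # `pos` is the index of the space in " obj". Walk left over the generation
  # digits, one space, the object-number digits, and finally a newline.
  defp header_before(pdf, pos) do
    gen_start = digits_start(pdf, pos)

    with true <- gen_start < pos,
         true <- gen_start > 0 and :binary.at(pdf, gen_start - 1) == ?\s,
         num_start = digits_start(pdf, gen_start - 1),
         true <- num_start < gen_start - 1,
         true <- num_start > 0 and :binary.at(pdf, num_start - 1) == ?\n do
      num = pdf |> binary_part(num_start, gen_start - 1 - num_start) |> String.to_integer()
      gen = pdf |> binary_part(gen_start, pos - gen_start) |> String.to_integer()
      {:ok, {num, gen, num_start}}
    else
      _ -> :error
    end
  end

  # Index of the first byte of the digit run that ends just before `i`.
  defp digits_start(_pdf, 0), do: 0

  defp digits_start(pdf, i) do
    case :binary.at(pdf, i - 1) do
      c when c in ?0..?9 -> digits_start(pdf, i - 1)
      _ -> i
    end
  end
end
```

In a binary where "12 0 obj" appears inside a parenthesised string, that match is rejected because the byte before the object number is "(" rather than a newline.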
Spec references
- PDF 1.7 § 7.5.4 — Cross-reference table
- PDF 1.7 § 7.5.5 — File trailer
- PDF 1.7 § 7.5.8 — Cross-reference streams
Types
@type entries() :: %{required(Pdf.Reader.Document.ref()) => entry()}
@type entry() :: Pdf.Reader.Document.xref_entry()
Functions
@spec load(binary(), non_neg_integer()) :: {:ok, entries(), Pdf.Reader.Trailer.t()} | {:error, term()}
Loads all xref sections reachable from start_offset (following /Prev links)
and merges them into a single entries map.
Newer sections' entries override older ones on conflict (reverse-chain order).
Returns {:ok, entries_map, trailer_struct} or {:error, reason}.
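load/2 expects the caller to supply start_offset. One plausible way to obtain it, sketched here as an assumption and not as part of ExPDF, is to read the integer that follows the last startxref keyword in the file; StartXRefSketch and find/1 are hypothetical names.

```elixir
defmodule StartXRefSketch do
  # Hypothetical helper, not part of ExPDF: returns the offset written after
  # the last "startxref" keyword in the binary.
  @spec find(binary()) :: {:ok, non_neg_integer()} | :error
  def find(pdf) when is_binary(pdf) do
    case :binary.matches(pdf, "startxref") do
      [] ->
        :error

      matches ->
        {pos, len} = List.last(matches)
        rest = binary_part(pdf, pos + len, byte_size(pdf) - pos - len)

        case Regex.run(~r/\A\s*(\d+)/, rest) do
          [_, digits] -> {:ok, String.to_integer(digits)}
          nil -> :error
        end
    end
  end
end

# Possible use with this module (not executed here):
#   with {:ok, offset} <- StartXRefSketch.find(pdf),
#        {:ok, entries, trailer} <- Pdf.Reader.XRef.load(pdf, offset) do
#     {:ok, entries, trailer}
#   else
#     _ -> Pdf.Reader.XRef.recover(pdf)
#   end
```

The commented fallback reflects the documented contract: load/2 may return {:error, reason}, while recover/1 is specified to return {:ok, ...}.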
@spec recover(binary()) :: {:ok, entries(), Pdf.Reader.Trailer.t()}
Recovers a cross-reference table from a PDF binary by linear scan, without
relying on startxref or any xref section in the file.
Algorithm
- Use :binary.matches/2 to find every " obj" substring in binary.
- For each match position, back-scan to validate the \n<digits> <digits> prefix that characterises a real indirect-object header. This rejects false positives where obj appears inside a content stream or string literal.
- Parse (obj_num, gen_num) from the prefix and compute the byte offset of the object (start of N G obj).
- Deduplicate by obj_num: when the same number appears more than once, keep the entry with the highest gen_num. If gen_num values tie, the entry at the larger byte offset wins (later in the file = more recent revision).
- Synthesise a %Pdf.Reader.Trailer{} by scanning for the last trailer\n<<...>> block. If none is found, scan recovered entries for an object whose dict contains /Type /Catalog and use its ref as /Root.
Returns {:ok, entries_map, trailer_struct} where entries_map is keyed by
{obj_num, gen_num} tuples.
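The deduplication rule amounts to taking, per object number, the maximum over {gen_num, offset}. A minimal sketch, assuming the scan has produced a list of {obj_num, gen_num, offset} tuples (a hypothetical intermediate shape; XRefDedupSketch is likewise illustrative):

```elixir
defmodule XRefDedupSketch do
  # Hypothetical helper: `headers` is a list of {obj_num, gen_num, offset}.
  def dedup(headers) do
    headers
    |> Enum.group_by(fn {num, _gen, _offset} -> num end)
    |> Map.new(fn {num, candidates} ->
      # Tuples compare element by element: highest generation first,
      # then the larger byte offset as the tie-break.
      {_num, gen, offset} = Enum.max_by(candidates, fn {_n, g, o} -> {g, o} end)
      {{num, gen}, {:in_use, offset, gen}}
    end)
  end
end

headers = [{1, 0, 9}, {2, 0, 60}, {1, 1, 200}, {2, 0, 310}]
deduped = XRefDedupSketch.dedup(headers)
```

With this input, object 1 resolves to generation 1 at offset 200, and object 2, whose two headers share generation 0, resolves to the later offset 310.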
Spec references: PDF 1.7 § 7.5.4 (Cross-reference table), PDF 1.7 § 7.5.8 (Cross-reference streams).