Minimal PDF trailer / xref scanner for the PAdES adapter.
Scope is the file-level structure only — the four primitives the Phase 4 plan calls out:
- Locate startxref and the most-recent xref offset.
- Parse the text-format xref subsections at that offset.
- Extract /Size, /Root, /Prev from the trailer dict.
- Walk the /Prev chain across revisions.
Out of scope (deliberately): content streams, encoded streams, page
resources, font dictionaries, and any indirect-object body. None of
those are required for incremental signature emission or for
recomputing the byte-range covered by a /Sig.
Cross-reference streams (PDF 1.5+, /Type /XRef) are not handled in
v1 and surface as {:error, {:malformed_pdf, :xref_stream_unsupported}}.
Per the Phase 4 plan, the writer always emits the legacy text-format
xref (still legal in PDF 1.7+); the reader is the side that needs to
tolerate vendor variation, and we accept the limitation until a real
corpus argues for lopdf on the verify path.
Types
@type error() :: {:malformed_pdf, atom()}
Reader error. Always carries :malformed_pdf as the class atom.
Functions
@spec merged_xref_offsets(binary()) :: {:ok, %{required(non_neg_integer()) => non_neg_integer()}} | {:error, error()}
Returns the merged xref offsets across every revision in the PDF — newest entry per object number wins (incremental updates override).
Used by the verify path to enumerate indirect objects by number without picking older revisions of objects that were superseded.
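The "newest entry wins" rule can be illustrated with plain maps. This is a minimal sketch, not the module's implementation: `MergeSketch` and its input shape (one `%{object_number => offset}` map per revision, newest first) are assumptions made for the example.

```elixir
defmodule MergeSketch do
  # Hypothetical helper: `newest_first` is a list of
  # %{object_number => byte_offset} maps, one per revision.
  def merge(newest_first) do
    newest_first
    # Fold from the original revision forward so that later
    # (newer) maps overwrite earlier entries for the same key.
    |> Enum.reverse()
    |> Enum.reduce(%{}, fn revision, acc -> Map.merge(acc, revision) end)
  end
end
```

With a newest revision of `%{1 => 900, 5 => 950}` and an original of `%{1 => 15, 2 => 80}`, the merged map is `%{1 => 900, 2 => 80, 5 => 950}`: object 1 resolves to its updated offset, and object 2 is still reachable through the original revision.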
@spec next_object_number(binary()) :: {:ok, non_neg_integer()} | {:error, error()}
Returns the next free indirect-object number, derived from the
most-recent revision's /Size. PAdES incremental updates allocate
fresh object numbers starting here.
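/Size works as the next free number because the PDF specification defines it as one greater than the highest object number in the file. The sketch below shows one way the three trailer keys might be pulled out of a trailer dict body. `TrailerSketch`, its return shape, and the `:bad_trailer` atom are hypothetical and not taken from this module.

```elixir
defmodule TrailerSketch do
  # Hypothetical helper: `trailer_body` is the text between the
  # trailer's outer << and >>.
  def fields(trailer_body) do
    with {:ok, size} <- int(trailer_body, ~r{/Size\s+(\d+)}),
         {:ok, root} <- int(trailer_body, ~r{/Root\s+(\d+)\s+\d+\s+R}) do
      # /Prev is absent in the original (first) revision.
      prev =
        case int(trailer_body, ~r{/Prev\s+(\d+)}) do
          {:ok, offset} -> offset
          :error -> nil
        end

      {:ok, %{size: size, root: root, prev: prev}}
    else
      _ -> {:error, {:malformed_pdf, :bad_trailer}}
    end
  end

  # Returns the first capture group as an integer, or :error.
  defp int(body, regex) do
    case Regex.run(regex, body, capture: :all_but_first) do
      [digits] -> {:ok, String.to_integer(digits)}
      nil -> :error
    end
  end
end
```

Under these assumptions, a trailer body of `/Size 12 /Root 1 0 R /Prev 400` gives `size: 12`, so the first object number an incremental update would allocate is 12.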
@spec parse(binary()) :: {:ok, SignCore.PDF.Reader.Revision.t()} | {:error, error()}
Convenience: locate the most-recent xref and read it.
Returns the catalog dict body (the bytes between << and >> of the
object pointed at by /Root). The catalog is what an incremental
update must re-emit when adding a /Sig field — its /AcroForm and
/Pages entries need to be preserved.
Returns {:error, {:malformed_pdf, :catalog_not_indirect}} if the
catalog body isn't a plain dict (rare; would only happen if /Root
pointed at an object stream).
@spec read_dict_at(binary(), non_neg_integer()) :: {:ok, binary()} | {:error, error() | :not_a_dict}
Returns the dict body (the bytes between the object's outer <<
and matching >>) for the object at offset. Returns
{:error, :not_a_dict} for objects that don't begin with a dict
(streams, primitives).
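Finding the matching >> means tracking nesting depth, since dictionaries can contain other dictionaries. The following sketch shows that idea only. `DictBodySketch` is hypothetical, it assumes the input already starts at the outer `<<` (the `N G obj` header has been skipped), and it does not handle literal `( ... )` strings that happen to contain angle brackets. The error atoms other than `:not_a_dict` are invented for the example.

```elixir
defmodule DictBodySketch do
  # Hypothetical helper: input must begin with the outer "<<".
  def body(<<"<<", rest::binary>>), do: scan(rest, 1, 0, rest)
  def body(_other), do: {:error, :not_a_dict}

  # `len` counts the bytes consumed so far; `orig` is the text after the outer "<<".
  defp scan(<<"<<", t::binary>>, depth, len, orig), do: scan(t, depth + 1, len + 2, orig)
  defp scan(<<">>", _::binary>>, 1, len, orig), do: {:ok, binary_part(orig, 0, len)}
  defp scan(<<">>", t::binary>>, depth, len, orig), do: scan(t, depth - 1, len + 2, orig)

  # A single "<" opens a hex string; skip to its ">" so that
  # "<AB>>>" is not misread as closing the dict one byte early.
  defp scan(<<"<", t::binary>>, depth, len, orig) do
    case :binary.match(t, ">") do
      {pos, 1} ->
        skipped = pos + 1
        <<_::binary-size(skipped), rest::binary>> = t
        scan(rest, depth, len + 1 + skipped, orig)

      :nomatch ->
        {:error, {:malformed_pdf, :unterminated_hex_string}}
    end
  end

  defp scan(<<_byte, t::binary>>, depth, len, orig), do: scan(t, depth, len + 1, orig)
  defp scan(<<>>, _depth, _len, _orig), do: {:error, {:malformed_pdf, :unterminated_dict}}
end
```

The hex-string clause matters for signature dictionaries in particular, where a `/Contents <...>` value may sit directly before the closing `>>`.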
@spec read_object_body(binary(), non_neg_integer()) :: {:ok, binary()} | {:error, error()}
Reads the textual body of the indirect object at the given offset.
Returns the bytes between obj and endobj, trimmed.
Used by the Writer to extract the catalog dict so an incremental
update can re-emit it with a merged /AcroForm entry. Does not
parse stream contents; the bytes are returned verbatim.
@spec read_revision(binary(), non_neg_integer()) :: {:ok, SignCore.PDF.Reader.Revision.t()} | {:error, error()}
Reads the xref table + trailer at the given offset and returns a
Revision describing this PDF revision.
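As an illustration of the text xref format, the sketch below turns the lines between the `xref` keyword and the `trailer` keyword into a map of in-use objects. It is not this module's parser: `XrefSketch`, the choice to drop free (`f`) entries, and the `:bad_xref_entry` atom are assumptions, and it expects well-formed decimal fields.

```elixir
defmodule XrefSketch do
  # Hypothetical helper: `text` holds the subsection headers and
  # 20-byte entries that follow the "xref" keyword.
  def parse_table(text) do
    text
    |> String.split(~r/\r\n|\r|\n/, trim: true)
    |> Enum.map(&String.split/1)
    |> Enum.reduce_while({%{}, nil}, &step/2)
    |> case do
      {:error, _reason} = err -> err
      {entries, _next} -> {:ok, entries}
    end
  end

  # Subsection header "<first object number> <count>".
  defp step([first, _count], {entries, _next}),
    do: {:cont, {entries, String.to_integer(first)}}

  # In-use entry: record the byte offset for the current object number.
  defp step([offset, _gen, "n"], {entries, next}) when is_integer(next),
    do: {:cont, {Map.put(entries, next, String.to_integer(offset)), next + 1}}

  # Free entry: nothing to record, but it still consumes an object number.
  defp step([_offset, _gen, "f"], {entries, next}) when is_integer(next),
    do: {:cont, {entries, next + 1}}

  # Anything else, including an entry before any subsection header.
  defp step(_other, _acc), do: {:halt, {:error, {:malformed_pdf, :bad_xref_entry}}}
end
```

Each subsection header resets the running object number, which is how an incremental update can list only the handful of objects it touched.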
@spec revisions(binary()) :: {:ok, [SignCore.PDF.Reader.Revision.t()]} | {:error, error()}
Walks the /Prev chain newest-first. The first element is the most
recent revision (the one startxref points at); the last is the
original.
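The walk itself is a simple linked-list traversal. In the sketch below, `PrevChainSketch` and `read_fun` are hypothetical: `read_fun` stands in for something like `read_revision/2` and is assumed to return a map with a `:prev` key (`nil` for the original revision). The cycle guard and its `:prev_cycle` atom are additions for the example; the source does not say whether the module has one.

```elixir
defmodule PrevChainSketch do
  # Hypothetical helper: follows :prev offsets, newest first.
  def walk(start_offset, read_fun), do: walk(start_offset, read_fun, MapSet.new(), [])

  # No /Prev left: the accumulated list was built newest-last, so flip it.
  defp walk(nil, _read_fun, _seen, acc), do: {:ok, Enum.reverse(acc)}

  defp walk(offset, read_fun, seen, acc) do
    if MapSet.member?(seen, offset) do
      # A /Prev that points back at a visited offset would loop forever.
      {:error, {:malformed_pdf, :prev_cycle}}
    else
      case read_fun.(offset) do
        {:ok, revision} ->
          walk(revision.prev, read_fun, MapSet.put(seen, offset), [revision | acc])

        {:error, _reason} = err ->
          err
      end
    end
  end
end
```

Passing the reader in as a function keeps the sketch self-contained and makes the traversal easy to exercise with an in-memory table of offsets.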
@spec signature_dicts(binary()) :: {:ok, [{non_neg_integer(), binary()}]} | {:error, error()}
Returns the list of {object_number, dict_body} pairs for every
indirect object whose body is a dictionary containing /Type /Sig.
This is the canonical way to locate signature dicts: it ignores
comments, content-stream text that happens to mention /Type /Sig,
and superseded older revisions of the same object number. Each
returned dict body is bounded — only the dict content between its
outer << and matching >>, suitable for whitespace-tolerant
regex extraction of /ByteRange and /Contents.
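A whitespace-tolerant extraction over one returned dict body might look like the sketch below. `SigFieldsSketch` and both patterns are assumptions for illustration and are not the expressions used by the verify path. The /Contents pattern covers only the hex-string form.

```elixir
defmodule SigFieldsSketch do
  # Hypothetical helper: returns the four /ByteRange integers.
  def byte_range(dict_body) do
    regex = ~r{/ByteRange\s*\[\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*\]}

    case Regex.run(regex, dict_body, capture: :all_but_first) do
      nil -> :error
      numbers -> {:ok, Enum.map(numbers, &String.to_integer/1)}
    end
  end

  # Hypothetical helper: returns the /Contents hex digits with
  # any embedded whitespace removed.
  def contents_hex(dict_body) do
    regex = ~r{/Contents\s*<([0-9A-Fa-f\s]*)>}

    case Regex.run(regex, dict_body, capture: :all_but_first) do
      nil -> :error
      [hex] -> {:ok, String.replace(hex, ~r/\s+/, "")}
    end
  end
end
```

Because each dict body is already bounded by its own outer delimiters, these patterns cannot accidentally match a /ByteRange belonging to a different object.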
@spec startxref(binary()) :: {:ok, non_neg_integer()} | {:error, error()}
Returns the byte offset stored in the file's terminating startxref
marker. Searches the last 8192 bytes: PDF 1.7 §7.5.5 places the
marker directly before the final %%EOF line, and readers
conventionally scan only the last 1 KiB, but real-world authoring
tools emit trailing whitespace that pushes the marker further back.
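The tail search can be sketched as follows. This is an illustration under assumptions, not the module's code: `StartxrefSketch`, the regex, and the `:startxref_not_found` atom are hypothetical. Only the 8192-byte window comes from the description above.

```elixir
defmodule StartxrefSketch do
  # Window size taken from the documentation above.
  @tail_window 8192

  # Hypothetical helper: returns the offset that follows the
  # last "startxref" keyword inside the tail window.
  def find(pdf) when is_binary(pdf) do
    size = byte_size(pdf)
    start = max(size - @tail_window, 0)
    tail = binary_part(pdf, start, size - start)

    case Regex.scan(~r/startxref\s+(\d+)/, tail, capture: :all_but_first) do
      [] ->
        {:error, {:malformed_pdf, :startxref_not_found}}

      matches ->
        # An incrementally updated file can hold several markers in
        # the window; the last one belongs to the newest revision.
        {:ok, matches |> List.last() |> hd() |> String.to_integer()}
    end
  end
end
```

Taking the last match rather than the first is the important design choice here, because each incremental update appends its own `startxref` section after the previous one.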