Pdf.Reader.Font (ExPDF v1.0.1)

Per-font decoder construction for the encoding cascade.

A "decoder" is a closure (binary -> {String.t(), [{non_neg_integer(), binary()}]}) that maps raw font-code bytes to UTF-8 text plus a list of unresolved sentinels.
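To make the closure shape concrete, here is a minimal sketch of a function that returns something with the decoder_fn() signature. The module name DecoderShapeSketch, the byte => codepoint table, and the reading of each sentinel as {byte_offset, raw_bytes} are assumptions for illustration only; the source gives the type but not the meaning of the tuple fields.

```elixir
defmodule DecoderShapeSketch do
  # Hypothetical helper, not part of ExPDF: builds a closure with the
  # decoder_fn() shape from a byte => codepoint table.
  def build(table) when is_map(table) do
    fn bytes when is_binary(bytes) ->
      {chunks, sentinels, _offset} =
        for <<byte <- bytes>>, reduce: {[], [], 0} do
          {chunks, sentinels, offset} ->
            case Map.fetch(table, byte) do
              {:ok, codepoint} ->
                {[<<codepoint::utf8>> | chunks], sentinels, offset + 1}

              :error ->
                # Assumption: a sentinel is {byte_offset, raw_bytes}.
                {[<<0xFFFD::utf8>> | chunks], [{offset, <<byte>>} | sentinels], offset + 1}
            end
        end

      {chunks |> Enum.reverse() |> IO.iodata_to_binary(), Enum.reverse(sentinels)}
    end
  end
end
```

Calling the returned closure with a binary yields a {text, sentinels} tuple, where text is always valid UTF-8 because unresolved bytes are replaced by U+FFFD.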

Simple fonts (Type1, TrueType, etc.)

Cascade per byte (delegates to Pdf.Reader.Encoding.resolve_byte/3): ToUnicode CMap → /Differences + AGL → base encoding → U+FFFD + sentinel.
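The cascade order above amounts to a first-match-wins lookup. The sketch below illustrates that idea only: the module CascadeSketch, the plain-map tables, and the return tuples are hypothetical, and the real Pdf.Reader.Encoding.resolve_byte/3 may have a different signature. The /Differences stage is simplified to a direct byte => codepoint map, skipping the glyph-name lookup through the AGL.

```elixir
defmodule CascadeSketch do
  # Hypothetical stand-in for the per-byte cascade.
  # Each table is a plain map of byte => codepoint (an assumption).
  def resolve(byte, tables) when byte in 0..255 do
    stages = [
      Map.get(tables, :to_unicode, %{}),
      Map.get(tables, :differences, %{}),
      Map.get(tables, :base_encoding, %{})
    ]

    # The first stage that knows the byte wins.
    case Enum.find_value(stages, fn table -> Map.get(table, byte) end) do
      nil -> {:unresolved, 0xFFFD}
      codepoint -> {:ok, codepoint}
    end
  end
end
```

In this sketch an unresolved byte is reported as {:unresolved, 0xFFFD}, leaving it to the caller to record whatever sentinel it needs.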

Composite fonts (Type0/Identity-H/V)

When /Encoding is Identity-H or Identity-V, the font is dispatched to Pdf.Reader.CID.Decoder.build/2. The CID decoder consumes bytes in 2-byte big-endian chunks and resolves via: ToUnicode CMap → Adobe registry table (Japan1/CNS1/Korea1/GB1) → U+FFFD.
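Consuming bytes in 2-byte big-endian chunks maps naturally onto Elixir binary pattern matching. The following is a sketch under assumptions: CidChunkSketch is a hypothetical module, and returning a trailing odd byte as a leftover is an illustrative choice, since the source does not say how ExPDF treats that case.

```elixir
defmodule CidChunkSketch do
  # Hypothetical helper: splits a string operand into 16-bit big-endian codes.
  # Returns {codes, leftover}, where leftover is <<>> or one trailing byte.
  def split(bytes) when is_binary(bytes), do: split(bytes, [])

  defp split(<<code::16-big, rest::binary>>, acc), do: split(rest, [code | acc])
  defp split(leftover, acc), do: {Enum.reverse(acc), leftover}
end
```

With Identity-H/V each 16-bit code is the CID itself, so the resulting integers can be fed straight into the ToUnicode or registry lookup.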

Non-Identity predefined CMaps (UniJIS-UTF16-H, GBK-EUC-H, etc.) are also supported when bundled in priv/cmap/. For these, the decoder dispatches to Pdf.Reader.CID.Decoder.build_predefined/2, which uses Pdf.Reader.CID.PredefinedCMap for the byte→CID lookup and then applies the same Adobe registry → Unicode resolution as Identity-H/V.

Cache

Decoders for fonts referenced by indirect ref {:ref, n, g} are cached in Document.cache under key {:font_decoder, {n, g}} for reuse across pages with shared font resources. Inline font dicts (plain maps, no ref) are NOT cached.
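The caching rule can be sketched as follows. This is not the library's code: FontCacheSketch is a hypothetical module, cache is a plain map standing in for Document.cache, and build_fun stands in for the actual decoder construction.

```elixir
defmodule FontCacheSketch do
  # Indirect reference: look up {:font_decoder, {n, g}} and build only on a miss.
  def fetch_or_build({:ref, n, g}, cache, build_fun) do
    key = {:font_decoder, {n, g}}

    case Map.fetch(cache, key) do
      {:ok, decoder} ->
        {decoder, cache}

      :error ->
        decoder = build_fun.()
        {decoder, Map.put(cache, key, decoder)}
    end
  end

  # Inline font dict (plain map): built every time, cache left untouched.
  def fetch_or_build(font_dict, cache, build_fun) when is_map(font_dict) do
    {build_fun.(), cache}
  end
end
```

Returning the updated cache alongside the decoder mirrors the way build_decoder/2 returns an updated document.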

Recovery mode (R-2)

When doc.recover_mode is true and a font dict fails to resolve or build, build_decoders_for_resources/2 installs a fallback U+FFFD decoder for that font instead of returning {:error, _}. The fallback emits <<0xFFFD::utf8>> once per input byte, which guarantees that String.valid?/1 returns true for the resulting text. Each failed font is recorded in doc.recovery_log as a {:font_skipped, page_n, font_name, reason} event; build_decoders_for_resources/2 itself only returns the {font_name, reason} pairs, and its caller performs the logging. Fonts that build successfully are NOT affected.
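A fallback of this kind is small enough to sketch in full. FallbackSketch is a hypothetical module, and returning an empty sentinel list is an assumption; the source only specifies the one-U+FFFD-per-byte text output.

```elixir
defmodule FallbackSketch do
  # Returns a closure with the decoder_fn() shape that ignores the byte
  # values and emits one U+FFFD per input byte.
  def decoder do
    fn bytes when is_binary(bytes) ->
      text = for <<_byte <- bytes>>, into: "", do: <<0xFFFD::utf8>>
      # Assumption: the fallback reports no sentinels.
      {text, []}
    end
  end
end
```

Because every emitted chunk is the UTF-8 encoding of a single valid codepoint, the result is valid UTF-8 regardless of the input bytes.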

Spec: PDF 1.7 § 9.6 (font dictionaries), § 9.10 (text content extraction).

Summary

Functions

build_decoder(font_ref, doc)

Build a decoder closure for a font.

build_decoders_for_resources(resources, doc)

Build decoders for all fonts in a page's resources map.

Types

@type decoder_fn() :: (binary() -> {String.t(), [{non_neg_integer(), binary()}]})

Functions

build_decoder(font_ref, doc)

@spec build_decoder(
  map() | {:ref, pos_integer(), non_neg_integer()},
  Pdf.Reader.Document.t()
) ::
  {:ok, decoder_fn(), Pdf.Reader.Document.t()} | {:error, term()}

Build a decoder closure for a font.

Accepts either:

  • A font_dict (plain map) — inline font, built directly without caching.
  • A {:ref, n, g} tuple — indirect font reference; result is cached in doc.cache under {:font_decoder, {n, g}}.

Returns {:ok, decoder_fn, updated_doc} on success, or {:error, reason} if the font cannot be resolved or built.

build_decoders_for_resources(resources, doc)

@spec build_decoders_for_resources(map(), Pdf.Reader.Document.t()) ::
  {:ok, %{required(binary()) => decoder_fn()}, [{binary(), term()}],
   Pdf.Reader.Document.t()}
  | {:error, term()}

Build decoders for all fonts in a page's resources map.

Walks resources["Font"] (a map of font name → font dict or ref) and calls build_decoder/2 for each entry. Returns a map keyed by font name.

In strict mode (doc.recover_mode == false): returns {:ok, decoders, [], doc} on success, or {:error, reason} on the first font build failure.

In recovery mode (doc.recover_mode == true): on per-font build failure, installs a per-byte U+FFFD fallback decoder for that font name and appends {font_name, reason} to the returned font_failures list. The page is NOT aborted. The caller is responsible for converting failures to {:font_skipped, page_n, font_name, reason} events and logging them.

Returns {:ok, %{font_name => decoder_fn}, [{font_name, reason}], updated_doc}.
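The difference between the two modes can be sketched with Enum.reduce_while/3. This is an illustration under assumptions, not the library's implementation: ResourceWalkSketch is a hypothetical module, recover? stands in for doc.recover_mode, and build_fun stands in for build_decoder/2 with the document threading left out, so it returns {:ok, decoder} or {:error, reason}.

```elixir
defmodule ResourceWalkSketch do
  # Walks a font name => font entry map and builds a decoder per entry.
  def build_all(fonts, recover?, build_fun) when is_map(fonts) and is_boolean(recover?) do
    result =
      Enum.reduce_while(fonts, {%{}, []}, fn {name, font}, {decoders, failures} ->
        case build_fun.(font) do
          {:ok, decoder} ->
            {:cont, {Map.put(decoders, name, decoder), failures}}

          {:error, reason} when recover? ->
            # Recovery mode: install the fallback and record the failure.
            {:cont, {Map.put(decoders, name, &fallback/1), [{name, reason} | failures]}}

          {:error, reason} ->
            # Strict mode: abort on the first failure.
            {:halt, {:error, reason}}
        end
      end)

    case result do
      {:error, reason} -> {:error, reason}
      {decoders, failures} -> {:ok, decoders, Enum.reverse(failures)}
    end
  end

  # One U+FFFD per input byte; the empty sentinel list is an assumption.
  defp fallback(bytes) do
    {for(<<_byte <- bytes>>, into: "", do: <<0xFFFD::utf8>>), []}
  end
end
```

In recovery mode the failure list is what a caller would turn into {:font_skipped, page_n, font_name, reason} events.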

Spec references

  • PDF 1.7 § 9.6 — Font dictionaries
  • PDF 1.7 § 9.10 — Extraction of text content