ExPdfium (ExPdfium v0.2.0)


Elixir bindings for pdfium, Google's Chromium PDF engine, via the Rust pdfium-render crate. The native library ships precompiled (rustler_precompiled), so there is no Rust toolchain or separately installed pdfium to set up.

Read-only toolkit

ExPdfium is a read-and-extract toolkit: it opens documents and exposes page counts, rendering, text extraction and search, metadata, page geometry, permissions, structure (bookmarks, links, attachments), and read-only access to forms and annotations. It does not create, edit, or save PDFs.

Example

{:ok, doc} = ExPdfium.open("file.pdf")
{:ok, 3} = ExPdfium.page_count(doc)

{:ok, %ExPdfium.Bitmap{data: data, width: w, height: h}} =
  ExPdfium.render_page(doc, 0, dpi: 300)
{:ok, image} = Vix.Vips.Image.new_from_binary(data, w, h, 4, :VIPS_FORMAT_UCHAR)

:ok = ExPdfium.close(doc)

# Encrypted documents:
{:ok, doc} = ExPdfium.open("secret.pdf", password: "hunter2")

Summary

Documents

close(document)

Explicitly close a document, releasing pdfium memory early. Optional and idempotent.

open(path_or_binary, opts \\ [])

Open a PDF from a file path or an in-memory binary.

page_count(document)

Number of pages in the document.

Rendering

render_page(document, page_index, opts \\ [])

Render a 0-indexed page to an ExPdfium.Bitmap (an uncompressed 4-channel pixel buffer).

Structure & navigation

attachment_data(document, index)

Extract the bytes of the embedded file at index (see attachments/1).

attachments(document)

List the document's embedded files.

links(document, page_index)

Return the links on a 0-indexed page.

outline(document)

Return the document outline (bookmarks) as a nested tree.

Forms & annotations

annotations(document, page_index)

Return the annotations on a 0-indexed page, in page order.

form_fields(document)

Read the document's AcroForm fields, one entry per widget, across all pages.

form_type(document)

Return which interactive-form technology the document uses.

Diagnostics

pdfium_version()

Return a marker string confirming the native pdfium library loaded and initialized. Useful as a smoke test that the precompiled NIF is healthy.

Types

bounds()

A bounding rectangle in PDF user-space points (1/72 inch). The origin is the page's bottom-left corner and y increases upward, so top >= bottom.

Documents

close(document)

@spec close(ExPdfium.Document.t()) :: :ok

Explicitly close a document, releasing pdfium memory early. Optional and idempotent.

Documents are also closed when garbage-collected, but that close is processed asynchronously (on a background thread, so it can't stall a scheduler while a long render holds the pdfium lock). Call this for deterministic, immediate release.

open(path_or_binary, opts \\ [])

@spec open(
  Path.t() | binary(),
  keyword()
) :: {:ok, ExPdfium.Document.t()} | {:error, atom()}

Open a PDF from a file path or an in-memory binary.

A binary beginning with "%PDF" is treated as document bytes; any other binary is treated as a file path. (A few PDFs carry junk bytes before the header; pass those as an explicit path, or strip the leading bytes.)
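For the junk-before-header case, stripping the leading bytes could look like the sketch below. PdfHeader and strip_leading_junk/1 are hypothetical names used only for illustration; they are not part of ExPdfium.

```elixir
defmodule PdfHeader do
  # Hypothetical helper; not part of ExPdfium.
  # Drops any bytes that precede the first "%PDF" marker so the result
  # begins with the header and is treated as document bytes by open/2.
  def strip_leading_junk(bytes) when is_binary(bytes) do
    case :binary.match(bytes, "%PDF") do
      {offset, _length} ->
        {:ok, binary_part(bytes, offset, byte_size(bytes) - offset)}

      :nomatch ->
        {:error, :no_pdf_header}
    end
  end
end
```

The cleaned binary can then be passed to ExPdfium.open/2 in place of the original bytes.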

Options

  • :password — password for an encrypted PDF (default nil)

Errors

Returns {:error, reason} where reason is one of:

  • :enoent — the path does not exist
  • :invalid_pdf — the bytes are not a parseable PDF
  • :password_error — the document is encrypted and the password was missing or incorrect
  • :unsupported_security — unsupported encryption/security handler
  • :file_error / :io_error / :open_failed — other read/open failures
  • :bad_source — internal: malformed source argument (e.g. a non-UTF-8 path)
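One way a caller might group these reasons is sketched below, for example to decide whether to prompt for a password or give up. OpenErrors, classify/1, and the category atoms it returns are hypothetical and not part of ExPdfium; only the reason atoms come from the list above.

```elixir
defmodule OpenErrors do
  # Hypothetical helper; not part of ExPdfium. The matched atoms are the
  # reasons listed above for open/2; the returned categories are invented
  # for this example.
  def classify(:enoent), do: :not_found
  def classify(:password_error), do: :needs_password
  def classify(reason) when reason in [:invalid_pdf, :unsupported_security], do: :unreadable
  def classify(reason) when reason in [:file_error, :io_error, :open_failed], do: :io_failure
  # :bad_source is documented as internal, so it lands here with anything else.
  def classify(_other), do: :unexpected
end
```

A caller could pattern match on {:error, reason} from open/2 and pass reason to classify/1.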

page_count(document)

@spec page_count(ExPdfium.Document.t()) ::
  {:ok, non_neg_integer()} | {:error, :document_closed | :lock_poisoned}

Number of pages in the document.

Returns {:error, :document_closed} if the document has been closed with close/1.

Rendering

render_page(document, page_index, opts \\ [])

@spec render_page(ExPdfium.Document.t(), non_neg_integer(), keyword()) ::
  {:ok, ExPdfium.Bitmap.t()} | {:error, atom()}

Render a 0-indexed page to an ExPdfium.Bitmap (an uncompressed 4-channel pixel buffer).

Options

Sizing (highest precedence first; the default is dpi: 72):

  • :width and/or :height — output size in pixels (aspect-preserving if only one is given)
  • :scale — multiple of the natural size (1.0 == 72 DPI)
  • :dpi — dots per inch (e.g. 150, 300)

Other:

  • :format, either :rgba (default) or :bgra (pdfium's native order, no conversion)
  • :background, either :white (default) or :transparent
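The sizing options above can be worked through as arithmetic: a page measured in PDF points (1/72 inch) scales by dpi / 72. The sketch below estimates the output size for a given page size. RenderSize and pixels/3 are hypothetical, the rounding to the nearest pixel is an assumption, and returning an explicit :width and :height pair unchanged is also an assumption; ExPdfium's own computation may differ.

```elixir
defmodule RenderSize do
  # Hypothetical helper; not part of ExPdfium. Page size is in PDF points.
  # Rounding to the nearest pixel is an assumption of this sketch.
  @points_per_inch 72

  # Follows the documented precedence: :width/:height, then :scale, then :dpi.
  def pixels(width_pt, height_pt, opts \\ []) do
    cond do
      opts[:width] && opts[:height] ->
        {opts[:width], opts[:height]}

      opts[:width] ->
        # Only width given: derive height from the page's aspect ratio.
        {opts[:width], round(opts[:width] * height_pt / width_pt)}

      opts[:height] ->
        {round(opts[:height] * width_pt / height_pt), opts[:height]}

      opts[:scale] ->
        scaled(width_pt, height_pt, opts[:scale])

      true ->
        dpi = Keyword.get(opts, :dpi, @points_per_inch)
        scaled(width_pt, height_pt, dpi / @points_per_inch)
    end
  end

  defp scaled(width_pt, height_pt, factor) do
    {round(width_pt * factor), round(height_pt * factor)}
  end
end
```

For a US Letter page (612 x 792 points), dpi: 300 gives a factor of 300 / 72, so the estimate is 2550 x 3300 pixels.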

Bitmap layout

data holds width * height * 4 bytes in row-major order, with stride (== width * 4) bytes per row and 8 bits per channel. Hand it straight to Vix/Image:

{:ok, %ExPdfium.Bitmap{data: data, width: w, height: h}} =
  ExPdfium.render_page(doc, 0, dpi: 300)
{:ok, image} = Vix.Vips.Image.new_from_binary(data, w, h, 4, :VIPS_FORMAT_UCHAR)
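Given that layout, the byte offset of pixel (x, y) is (y * width + x) * 4. A minimal sketch of reading one pixel directly from the buffer follows, assuming the default :rgba format. BitmapPixels and pixel_at/4 are hypothetical names, not part of ExPdfium.

```elixir
defmodule BitmapPixels do
  # Hypothetical helper; not part of ExPdfium. Takes the raw `data` binary
  # and `width` of a bitmap rendered with the default :rgba format.
  @bytes_per_pixel 4

  def pixel_at(data, width, x, y) when x >= 0 and x < width and y >= 0 do
    # Row-major layout: skip y full rows, then x pixels within the row.
    offset = (y * width + x) * @bytes_per_pixel

    case data do
      <<_::binary-size(offset), r, g, b, a, _::binary>> -> {:ok, {r, g, b, a}}
      # The buffer is too short, so y is past the last row.
      _ -> {:error, :out_of_bounds}
    end
  end

  def pixel_at(_data, _width, _x, _y), do: {:error, :out_of_bounds}
end
```

With format: :bgra the first and third channels would swap places in the returned tuple.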

Errors

  • :page_out_of_bounds — no such page index
  • :document_closed — the document was closed
  • :unsupported_format / :unsupported_background — bad option value
  • :render_failed — pdfium failed to render the page

Metadata & geometry

Structure & navigation

attachment_data(document, index)

@spec attachment_data(ExPdfium.Document.t(), non_neg_integer()) ::
  {:ok, binary()} | {:error, atom()}

Extract the bytes of the embedded file at index (see attachments/1).

Returns {:error, :attachment_not_found} for an invalid index, or {:error, :attachment_failed} if pdfium cannot read the file data.

attachments(document)

@spec attachments(ExPdfium.Document.t()) ::
  {:ok,
   [%{index: non_neg_integer(), name: String.t(), size: non_neg_integer()}]}
  | {:error, atom()}

List the document's embedded files.

Each is %{index: non_neg_integer(), name: String.t(), size: non_neg_integer()}. Use attachment_data/2 with the index to extract the bytes.
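To extract a specific file by name, the list can be searched for its index first. The sketch below assumes only the documented map shape. AttachmentLookup and index_of/2 are hypothetical, and reusing :attachment_not_found as the error atom is a choice made for this example.

```elixir
defmodule AttachmentLookup do
  # Hypothetical helper; not part of ExPdfium. `attachments` is the list
  # returned inside {:ok, list} by attachments/1.
  def index_of(attachments, name) do
    case Enum.find(attachments, &(&1.name == name)) do
      %{index: index} -> {:ok, index}
      nil -> {:error, :attachment_not_found}
    end
  end
end
```

The returned index is the value to pass to attachment_data/2.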

links(document, page_index)

@spec links(ExPdfium.Document.t(), non_neg_integer()) ::
  {:ok,
   [
     %{
       bounds: bounds() | nil,
       uri: String.t() | nil,
       page: non_neg_integer() | nil
     }
   ]}
  | {:error, atom()}

Return the links on a 0-indexed page.

Each link is %{bounds: bounds() | nil, uri: String.t() | nil, page: non_neg_integer() | nil}: uri is set for a web link, and page for an internal /Dest destination. bounds is nil if the link has no rectangle; uri and page are both nil for an unsupported or action-based link.
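Because uri and page distinguish the link kinds, a caller can branch on them with pattern matching. This is a small sketch over the documented map shape. LinkKinds, classify/1, and the returned tags are hypothetical, and checking uri before page is a choice of this example.

```elixir
defmodule LinkKinds do
  # Hypothetical helper; not part of ExPdfium. Operates on one link map
  # from the list returned by links/2.
  def classify(%{uri: uri}) when is_binary(uri), do: {:external, uri}
  def classify(%{page: page}) when is_integer(page), do: {:internal, page}
  # Both nil: an unsupported or action-based link.
  def classify(%{uri: nil, page: nil}), do: :unsupported
end
```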

outline(document)

@spec outline(ExPdfium.Document.t()) :: {:ok, [map()]} | {:error, atom()}

Return the document outline (bookmarks) as a nested tree.

Each node is %{title: String.t(), page: non_neg_integer() | nil, children: [node]}, where page is the 0-indexed destination page (or nil). A document with no outline returns {:ok, []}.

page is nil for a bookmark whose target is a GoTo action rather than a /Dest. The tree is capped (depth 64, 50_000 nodes) to bound pathological or cyclic outlines; beyond that it is silently truncated.
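The nested tree can be walked recursively, for instance to build a flat table of contents. The sketch below relies only on the node shape described above. OutlineFlatten and flatten/2 are hypothetical names, not part of ExPdfium.

```elixir
defmodule OutlineFlatten do
  # Hypothetical helper; not part of ExPdfium. `nodes` is the list returned
  # inside {:ok, list} by outline/1. Returns {depth, title, page} tuples in
  # document order, with depth 0 for top-level bookmarks.
  def flatten(nodes, depth \\ 0) do
    Enum.flat_map(nodes, fn %{title: title, page: page, children: children} ->
      [{depth, title, page} | flatten(children, depth + 1)]
    end)
  end
end
```

Since the tree is capped at depth 64, the recursion depth stays small.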

Forms & annotations

annotations(document, page_index)

@spec annotations(ExPdfium.Document.t(), non_neg_integer()) ::
  {:ok, [map()]} | {:error, atom()}

Return the annotations on a 0-indexed page, in page order.

Each annotation is:

%{
  type: atom(),                # the PDF /Subtype, e.g. :text, :highlight,
                               # :link, :widget, :ink, :stamp, :free_text…
  bounds: bounds() | nil,      # the annotation rectangle, in PDF points
  contents: String.t() | nil,  # the /Contents text
  name: String.t() | nil,      # the annotation's /NM name (not a field name)
  hidden: boolean(),
  printed: boolean()
}

Widget annotations (form-field controls) are listed alongside markup annotations; use form_fields/1 to read their field values. A page with no annotations returns {:ok, []}.
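As an illustration of working with this shape, the sketch below collects the visible reviewer comments on a page: it skips widgets, hidden annotations, and entries with no /Contents text. AnnotationNotes and comments/1 are hypothetical names, not part of ExPdfium.

```elixir
defmodule AnnotationNotes do
  # Hypothetical helper; not part of ExPdfium. `annotations` is the list
  # returned inside {:ok, list} by annotations/2.
  def comments(annotations) do
    # The generator pattern keeps only annotations with hidden: false.
    for %{type: type, hidden: false, contents: contents} <- annotations,
        type != :widget,
        is_binary(contents) and contents != "" do
      {type, contents}
    end
  end
end
```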

form_fields(document)

@spec form_fields(ExPdfium.Document.t()) :: {:ok, [map()]} | {:error, atom()}

Read the document's AcroForm fields, one entry per widget, across all pages.

Each field is:

%{
  name: String.t() | nil,   # the field's /T name
  type: :text | :checkbox | :radio_button | :combo_box | :list_box |
        :push_button | :signature | :unknown,
  value: String.t() | nil,  # text/combo/list value, or the selected on-state of a button group
  checked: boolean() | nil, # checkbox/radio only; nil for other types
  read_only: boolean(),
  required: boolean(),
  page: non_neg_integer(),  # 0-indexed page the widget sits on
  bounds: bounds() | nil
}

A checkbox or radio group shares one name across its option widgets, so it surfaces as one entry per option widget. For these, value is the group's currently-selected on-state (the same string on every widget in the group), and checked flags which widget is the selected one — so to find a radio group's answer, take the value of the entry whose checked is true. A document with no form returns {:ok, []}.

value and checked are read straight from pdfium without coercion: a checked checkbox is %{value: "Yes", checked: true}, never flattened to a string.

Limitations

  • form_fields/1 reads a group's selected value, not its available options, because pdfium does not expose per-option export names for checkbox/radio groups. A naive Map.new(fields, &{&1.name, &1.value}) collapses a group to one entry; to find a group's answer, take the value of the entry whose checked is true.
  • A multi-select list box reports only pdfium's single value string, so additional selections beyond the first are not surfaced.
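The guidance on checkbox and radio groups can be sketched as a reduction that builds one answer per field name, taking a group's value from the entry whose checked is true. FormAnswers and answers/1 are hypothetical names, and mapping a group with no checked widget to nil is a choice of this example rather than ExPdfium behavior.

```elixir
defmodule FormAnswers do
  # Hypothetical helper; not part of ExPdfium. `fields` is the list returned
  # inside {:ok, list} by form_fields/1.
  @group_types [:checkbox, :radio_button]

  def answers(fields) do
    Enum.reduce(fields, %{}, fn field, acc ->
      cond do
        # The selected widget of a group carries the group's answer.
        field.type in @group_types and field.checked == true ->
          Map.put(acc, field.name, field.value)

        # Unselected widgets only reserve the name; they never overwrite.
        field.type in @group_types ->
          Map.put_new(acc, field.name, nil)

        # Text, combo box, list box, and other fields: use the value as-is.
        true ->
          Map.put(acc, field.name, field.value)
      end
    end)
  end
end
```

The result does not depend on the order in which a group's widgets appear in the list.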

form_type(document)

@spec form_type(ExPdfium.Document.t()) ::
  {:ok, :none | :acrobat | :xfa_full | :xfa_foreground} | {:error, atom()}

Return which interactive-form technology the document uses.

One of :none, :acrobat (a classic AcroForm), :xfa_full, or :xfa_foreground (XFA forms). A document with no form returns {:ok, :none}.

XFA caveat

Reading XFA form data requires a pdfium build with the V8 JavaScript engine, which ExPdfium does not ship. form_fields/1 reads AcroForm fields; for an :xfa_full document the AcroForm view may be empty or partial.

Diagnostics

pdfium_version()

@spec pdfium_version() :: String.t()

Return a marker string confirming the native pdfium library loaded and initialized. Useful as a smoke test that the precompiled NIF is healthy.

pdfium exposes no build-version string through its public C API, so this is a fixed confirmation marker rather than a version number.

Types

bounds()

@type bounds() :: %{left: float(), bottom: float(), right: float(), top: float()}

A bounding rectangle in PDF user-space points (1/72 inch). The origin is the page's bottom-left corner and y increases upward, so top >= bottom.
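Because y increases upward in PDF space but downward in a rendered bitmap, mapping a bounds() onto pixels needs a vertical flip as well as the dpi / 72 scale. The sketch below takes the page height in points as a parameter, since no page-size function is shown in this section. It assumes an unrotated page whose visible area starts at the origin and a render sized by :dpi. BoundsToPixels and to_pixel_rect/3 are hypothetical names, not part of ExPdfium.

```elixir
defmodule BoundsToPixels do
  # Hypothetical helper; not part of ExPdfium. Assumes an unrotated page
  # whose visible area starts at the PDF origin, rendered with the given dpi.
  def to_pixel_rect(%{left: l, bottom: b, right: r, top: t}, page_height_pt, dpi) do
    scale = dpi / 72

    %{
      x: round(l * scale),
      # Flip: distance from the top edge of the page, not from the bottom.
      y: round((page_height_pt - t) * scale),
      width: round((r - l) * scale),
      height: round((t - b) * scale)
    }
  end
end
```

For example, a one-inch square whose top edge sits one inch below the top of a 792-point page lands at x = 144, y = 144 with size 144 x 144 when rendered at 144 DPI.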