PhoenixKitCatalogue.Workers.PdfExtractor (PhoenixKitCatalogue v0.2.0)

Oban worker that extracts text page-by-page from a PDF using pdfinfo (page count) + pdftotext (per-page text).

Keyed by file_uuid (core's phoenix_kit_files.uuid), not the per-upload phoenix_kit_cat_pdfs.uuid — so two uploads of identical content share one extraction job.

Lifecycle

  1. Look up the extraction row by file_uuid. If its status is already terminal (extracted / scanned_no_text / failed), the job is a no-op. This covers both a retry of an already-finished job and a duplicate enqueue from a content-dedup upload.
  2. Resolve the binary via Storage.retrieve_file/1 — returns a temp path. Works whether the file lives on local disk, S3, or anything core supports.
  3. Mark "extracting".
  4. pdfinfo for page count. Treat parse failures as fatal.
  5. For each page, pdftotext -layout, normalize, hash, upsert into the per-page content cache, insert a pdf_pages row.
  6. Transition to extracted (or scanned_no_text if all pages came back empty). Failures mid-loop transition to failed.
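
The pure parts of steps 1 and 4 to 6 can be sketched roughly as below. This is a minimal illustration, not the worker's actual code: the module name PdfTextSketch, every function name, the normalization rules, and the choice of SHA-256 are assumptions. Only the pdfinfo / pdftotext command-line usage (-f, -l, -layout, and "-" for stdout) reflects the real tools.

```elixir
defmodule PdfTextSketch do
  @moduledoc false
  # Hypothetical helper module; not part of PhoenixKitCatalogue.

  @terminal ~w(extracted scanned_no_text failed)

  # Step 1: a terminal status means the job has nothing left to do.
  def terminal?(status), do: status in @terminal

  # Step 4: pdfinfo prints a line such as "Pages:          12".
  def parse_page_count(pdfinfo_output) do
    case Regex.run(~r/^Pages:\s+(\d+)\s*$/m, pdfinfo_output) do
      [_, n] -> {:ok, String.to_integer(n)}
      _ -> {:error, :no_page_count}
    end
  end

  # Step 5 (assumed rules): drop form feeds, strip trailing whitespace
  # on each line, then trim the page as a whole.
  def normalize(text) do
    text
    |> String.replace("\f", "")
    |> String.replace(~r/[ \t]+$/m, "")
    |> String.trim()
  end

  # Step 5: content hash for the per-page cache (SHA-256 is an assumption).
  def content_hash(text) do
    :crypto.hash(:sha256, text) |> Base.encode16(case: :lower)
  end

  # Step 6: every page empty means the PDF is most likely a scan.
  def final_status(page_texts) do
    if Enum.all?(page_texts, &(&1 == "")), do: "scanned_no_text", else: "extracted"
  end

  # Shells out to pdfinfo. System.cmd/3 raises if the binary is missing.
  def page_count(path) do
    case System.cmd("pdfinfo", [path], stderr_to_stdout: true) do
      {out, 0} -> parse_page_count(out)
      {out, code} -> {:error, {:pdfinfo_failed, code, out}}
    end
  end

  # Extracts a single page to stdout, preserving layout.
  def page_text(path, page) do
    args = ["-f", "#{page}", "-l", "#{page}", "-layout", path, "-"]

    case System.cmd("pdftotext", args) do
      {out, 0} -> {:ok, normalize(out)}
      {_out, code} -> {:error, {:pdftotext_failed, code}}
    end
  end
end
```

With a temp path from step 2, a caller would run page_count/1 once and then page_text/2 for each page in 1..count, stopping and marking the row failed on the first error tuple. The database upserts and status transitions are left out because their schema is not described here.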

Concurrency

Concurrency is configured through the host app's Oban queue config. A dedicated queue is recommended (queue: :catalogue_pdf, limit: 2) so that a 1000-page PDF doesn't pin the CPU or block other queues.
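
As an illustration, the host app's config might look something like the fragment below. The :my_app OTP app name, MyApp.Repo, and the default queue are placeholders; only the catalogue_pdf queue name and its limit of 2 come from the recommendation above.

```elixir
# config/config.exs in the host application (names are placeholders)
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [
    default: 10,
    # At most two PDF extraction jobs run concurrently on each node.
    catalogue_pdf: 2
  ]
```

Oban applies a queue's limit per node, so a cluster of several nodes can run more than two extractions in total.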