PhoenixKitCatalogue.Workers.PdfExtractor (PhoenixKitCatalogue v0.8.0)


Oban worker that extracts text page-by-page from a PDF using pdfinfo (page count) + pdftotext (per-page text).

Keyed by file_uuid (core's phoenix_kit_files.uuid), not the per-upload phoenix_kit_cat_pdfs.uuid — so two uploads of identical content share one extraction job.

Lifecycle

  1. Look up the extraction row by file_uuid. If terminal (extracted / scanned_no_text / failed), no-op (retry of an already-done job, or duplicate enqueue from a content-dedup upload).
  2. Resolve the binary via Storage.retrieve_file/1 — returns a temp path. Works whether the file lives on local disk, S3, or anything core supports.
  3. Mark "extracting".
  4. pdfinfo for page count. Treat parse failures as fatal.
  5. For each page, pdftotext -layout, normalize, hash, upsert into the per-page content cache, insert a pdf_pages row.
  6. Transition to extracted (or scanned_no_text if all pages came back empty). Failures mid-loop transition to failed.
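
The command-line half of this lifecycle can be sketched as below. This is a minimal illustration, not the worker's actual code: the module name PdfCliSketch, the whitespace-collapsing normalization, and the choice of SHA-256 for the content hash are all assumptions. The pdfinfo "Pages:" output line and the pdftotext -f / -l / -layout flags are standard Poppler behaviour.

```elixir
defmodule PdfCliSketch do
  # Hypothetical helper module; names and return shapes are illustrative only.

  # Terminal statuses named in the lifecycle (step 1).
  @terminal ~w(extracted scanned_no_text failed)

  # Step 1: a job for a row already in a terminal status is a no-op.
  def terminal?(status), do: status in @terminal

  # Step 4: pdfinfo prints a line such as "Pages:          12".
  # A missing or unparseable line is reported as an error (treated as fatal).
  def parse_page_count(pdfinfo_output) do
    case Regex.run(~r/^Pages:\s+(\d+)/m, pdfinfo_output) do
      [_, n] -> {:ok, String.to_integer(n)}
      nil -> {:error, :page_count_not_found}
    end
  end

  # Runs pdfinfo on a local path. System.cmd/2 raises if the binary is missing.
  def page_count(path) do
    case System.cmd("pdfinfo", [path]) do
      {out, 0} -> parse_page_count(out)
      {_out, status} -> {:error, {:pdfinfo_exit, status}}
    end
  end

  # Step 5: extract one page. "-f n -l n" limits the range to a single page,
  # and "-" as the output file sends the text to stdout.
  def page_text(path, page) do
    n = Integer.to_string(page)

    case System.cmd("pdftotext", ["-layout", "-f", n, "-l", n, path, "-"]) do
      {text, 0} -> {:ok, text}
      {_out, status} -> {:error, {:pdftotext_exit, status}}
    end
  end

  # Assumed normalization: collapse whitespace runs (including form feeds) and trim.
  def normalize(text), do: text |> String.replace(~r/\s+/, " ") |> String.trim()

  # Assumed hash: lowercase hex SHA-256 of the normalized text.
  def content_hash(text), do: :crypto.hash(:sha256, text) |> Base.encode16(case: :lower)

  # Step 6: all-empty pages mean a scanned PDF with no text layer.
  def final_status(page_texts) do
    if Enum.all?(page_texts, &(&1 == "")), do: "scanned_no_text", else: "extracted"
  end
end
```

Keeping the parsing and status decisions as pure functions, separate from the System.cmd/2 calls, makes them testable without Poppler installed.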

Concurrency

Concurrency is configured through the host app's Oban queue config. The recommended setting is queue: :catalogue_pdf with limit: 2, so that a 1000-page PDF doesn't pin the CPU or block other queues.
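
A host-app configuration along these lines would apply the recommended limit. The OTP app name :my_app, the repo module MyApp.Repo, and the default queue are placeholders; only the catalogue_pdf: 2 entry comes from the recommendation above.

```elixir
# config/config.exs in the host application (app and repo names are placeholders)
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [
    default: 10,
    # At most two PDF extraction jobs run concurrently per node.
    catalogue_pdf: 2
  ]
```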

Deduplication

Re-enqueueing the same content (a duplicate-content upload, the self-heal requeue_stuck_extractions/1, or the per-PDF Retry button) is deduplicated application-side in PdfLibrary.insert_extraction_job/1, which skips the insert when a non-terminal PdfExtractor job already exists for the file_uuid.

We deliberately do not use Oban's built-in unique: option. Satisfying its compile-time check requires listing every incomplete state, including :suspended, but that value is absent from the oban_job_state enum on hosts that upgraded the Oban library without running its latest migration. On those hosts the uniqueness query raises 22P02 (PostgreSQL's invalid_text_representation error) and kills every enqueue. The app-side guard queries only the four states (available / scheduled / executing / retryable) present in every Oban version.

Races are harmless: this worker short-circuits on a terminal status and page inserts are upserts.
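
The decision the app-side guard makes can be modelled in memory as follows. This is only a sketch of the rule, not the code in PdfLibrary.insert_extraction_job/1, which presumably runs a database query instead. The DedupSketch module, the shape of the job maps, and the "file_uuid" args key are assumptions for illustration.

```elixir
defmodule DedupSketch do
  # Hypothetical in-memory model of the guard; the real check queries the jobs table.

  @worker "PhoenixKitCatalogue.Workers.PdfExtractor"

  # Only the four non-terminal states present in every Oban version.
  # :suspended is deliberately left out.
  @non_terminal ~w(available scheduled executing retryable)

  # Returns true when a new extraction job should be inserted for file_uuid,
  # i.e. when no non-terminal PdfExtractor job for it already exists.
  def insert?(existing_jobs, file_uuid) do
    not Enum.any?(existing_jobs, fn job ->
      job.worker == @worker and
        job.state in @non_terminal and
        job.args["file_uuid"] == file_uuid
    end)
  end
end

# Usage: one job is already executing for file "a".
jobs = [
  %{worker: "PhoenixKitCatalogue.Workers.PdfExtractor", state: "executing", args: %{"file_uuid" => "a"}},
  %{worker: "PhoenixKitCatalogue.Workers.PdfExtractor", state: "completed", args: %{"file_uuid" => "b"}}
]

DedupSketch.insert?(jobs, "a") # false: a live job already covers this file
DedupSketch.insert?(jobs, "b") # true: the only job for "b" has finished
```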