Oban worker that extracts text page-by-page from a PDF using
pdfinfo (page count) + pdftotext (per-page text).
Keyed by file_uuid (core's phoenix_kit_files.uuid), not the
per-upload phoenix_kit_cat_pdfs.uuid — so two uploads of identical
content share one extraction job.
Lifecycle
- Look up the extraction row by file_uuid. If it is terminal (extracted/scanned_no_text/failed), no-op: this is a retry of an already-done job, or a duplicate enqueue from a content-dedup upload.
- Resolve the binary via Storage.retrieve_file/1, which returns a temp path. This works whether the file lives on local disk, S3, or anything else core supports.
- Mark "extracting". Run pdfinfo for the page count. Treat parse failures as fatal.
- For each page, run pdftotext -layout, normalize, hash, upsert into the per-page content cache, and insert a pdf_pages row.
- Transition to extracted (or scanned_no_text if all pages came back empty). Failures mid-loop transition to failed.
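
The steps above can be sketched roughly as follows. This is a minimal illustration, not the worker's actual code: the module name ExtractionSketch, every function name in it, the normalization rules, and the choice of SHA-256 for the page hash are all assumptions. Only the pdfinfo and pdftotext invocations and the status names come from the description above. Storage lookup and database writes are left out.

```elixir
defmodule ExtractionSketch do
  # Hypothetical helper module; none of these names are the worker's real API.

  @terminal ~w(extracted scanned_no_text failed)

  # A job whose row is already terminal should return without doing work.
  def terminal?(status), do: status in @terminal

  # Pull the integer out of pdfinfo's "Pages:" line.
  # Anything unparseable is an error, which the worker treats as fatal.
  def parse_page_count(pdfinfo_output) do
    case Regex.run(~r/^Pages:\s+(\d+)\s*$/m, pdfinfo_output) do
      [_, n] -> {:ok, String.to_integer(n)}
      _ -> {:error, :unparseable_pdfinfo}
    end
  end

  # Assumed normalization: unify line endings, drop trailing whitespace
  # on each line, and trim the page as a whole.
  def normalize(text) do
    text
    |> String.replace("\r\n", "\n")
    |> String.split("\n")
    |> Enum.map(&String.trim_trailing/1)
    |> Enum.join("\n")
    |> String.trim()
  end

  # Assumed hash for the per-page content cache key: lowercase hex SHA-256.
  def content_hash(text) do
    :crypto.hash(:sha256, text) |> Base.encode16(case: :lower)
  end

  # All pages empty after normalization means a scanned PDF with no text layer.
  def final_status(page_texts) do
    if Enum.all?(page_texts, &(&1 == "")), do: "scanned_no_text", else: "extracted"
  end

  # Run pdfinfo on the temp path and parse the page count.
  def page_count(path) do
    case System.cmd("pdfinfo", [path], stderr_to_stdout: true) do
      {out, 0} -> parse_page_count(out)
      {out, code} -> {:error, {:pdfinfo_failed, code, out}}
    end
  end

  # Extract a single page: -f and -l bound the page range, "-" writes to stdout.
  def page_text(path, page) do
    n = Integer.to_string(page)

    case System.cmd("pdftotext", ["-layout", "-f", n, "-l", n, path, "-"]) do
      {out, 0} -> {:ok, normalize(out)}
      {_out, code} -> {:error, {:pdftotext_failed, code}}
    end
  end
end
```

In this sketch a caller would check terminal?/1 first, call page_count/1 once, then loop page_text/2 over 1..count, stopping at the first error and recording failed. Keeping the parsing, normalization, and hashing as pure functions makes them testable without Poppler installed.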
Concurrency
Configured via the host app's Oban queue config. Recommend
queue: :catalogue_pdf, limit: 2 so a 1000-page PDF doesn't pin
CPU or block other queues.
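
As an illustration, the host app's Oban configuration might look like the fragment below. The app name :my_app, the repo MyApp.Repo, and the default queue are placeholders. Only the catalogue_pdf queue and its limit of 2 come from the recommendation above.

```elixir
# config/config.exs in the host app (placeholder names throughout).
import Config

config :my_app, Oban,
  repo: MyApp.Repo,
  queues: [
    # Other queues keep their own limits and are unaffected by PDF work.
    default: 10,
    # At most two extraction jobs run concurrently on each node.
    catalogue_pdf: 2
  ]
```

The limit applies per node, so the total number of concurrent extractions scales with the number of nodes running this queue.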