PDF library — upload, extract, search.
Layered on top of core's phoenix_kit_files system. The catalogue
owns only:
- phoenix_kit_cat_pdfs: per-upload row (the user-facing "this name in the library"). Soft-delete via status (active / trashed).
- phoenix_kit_cat_pdf_extractions: per unique file content (one row per file_uuid). Holds the worker state machine.
- phoenix_kit_cat_pdf_pages: per-page join.
- phoenix_kit_cat_pdf_page_contents: content-addressed page text dedup cache.
Core handles binary storage, content checksum dedup, multi-bucket
redundancy, on-disk lifecycle (Storage.trash_file/1,
PruneTrashJob).
Public surface re-exported from PhoenixKitCatalogue.Catalogue.
Activity logging follows the catalogue convention — success-only on
the context layer; the LV layer's Web.Helpers.log_operation_error/3
writes the db_pending: true audit row on failure.
Authorization
The mutating context functions accept :actor_uuid for activity
attribution but do not enforce role checks — authorization is
the LV mount layer's job (admin live_session + on_mount hook).
Same convention as the rest of the catalogue context. New non-LV
callers (background jobs, RPC, extension modules) MUST verify the
caller is allowed before invoking these functions.
create_pdf_from_upload/3 does require a non-nil :actor_uuid —
not as authorization, but because core's phoenix_kit_files.user_uuid
is NOT NULL and we'd otherwise crash mid-flow after writing bytes
to disk. Returns {:error, :missing_actor} cleanly when missing.
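For a non-LV caller, the permission check has to run before the context function is invoked. A minimal sketch of that ordering, assuming a hypothetical CallerGuard module and a hypothetical :roles list on the caller map (neither exists in the catalogue); the context call is passed in as a plain function so the sketch stays self-contained:

```elixir
defmodule CallerGuard do
  # Hypothetical helper, not part of PhoenixKitCatalogue.
  # Calls `fun` with opts carrying :actor_uuid only when the caller
  # has a uuid and holds the (assumed) "admin" role.
  def run_as(%{uuid: uuid, roles: roles}, fun)
      when is_binary(uuid) and is_list(roles) and is_function(fun, 1) do
    if "admin" in roles do
      fun.(actor_uuid: uuid)
    else
      {:error, :unauthorized}
    end
  end

  # No usable uuid: refuse before anything is written.
  def run_as(_caller, _fun), do: {:error, :missing_actor}
end
```

A background job would wrap its call, for example CallerGuard.run_as(caller, fn opts -> create_pdf_from_upload(path, name, opts) end), so an unauthorized caller never reaches the context layer.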
Types
@type group() :: %{ pdf: PhoenixKitCatalogue.Schemas.Pdf.t(), total_matches: non_neg_integer(), hits: [hit()] }
Per-PDF group returned by search_pdfs_for_item/2.
@type hit() :: %{ pdf: PhoenixKitCatalogue.Schemas.Pdf.t(), page_number: pos_integer(), snippet: String.t(), score: float() }
One PDF search hit returned to the UI.
Functions
@spec count_pdfs(keyword()) :: non_neg_integer()
Returns the total PDF count, matching the optional status filter.
@spec create_pdf_from_upload(String.t(), String.t(), keyword()) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, term()}
Stores an uploaded PDF.
tmp_path is the local file from consume_uploaded_entry's callback.
original_filename is the user's chosen name. The third argument is the
options keyword list, which must carry a non-nil :actor_uuid.
Flow:
- Storage.store_file/2 (core): handles SHA-256 dedup, on-disk placement, multi-bucket redundancy. Same content uploaded twice (any name) returns the same file_uuid.
- Upsert the per-file extraction row. If newly created, enqueue the worker; otherwise the previous extraction is reused.
- Always insert a fresh phoenix_kit_cat_pdfs row so each upload gets its own per-name entry in the library.
- Activity action: pdf.uploaded. Metadata flags content_dedup: true when the file row was a hit.
Returns {:ok, pdf} on success.
The persisted byte_size is read from the file on disk via
File.stat!/1 — never from a browser-supplied value — so the
recorded size always matches the actual stored bytes.
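The content-dedup behaviour in the flow can be illustrated with a small in-memory model. This is only a sketch of the idea, not core's Storage.store_file/2: the DedupSketch module, the map-based store, and the made-up file_uuid format are all assumptions for illustration.

```elixir
defmodule DedupSketch do
  # Hypothetical model of content-addressed storage.
  # `store` maps a lowercase SHA-256 hex digest to a file_uuid.
  def store_file(store, bytes) when is_map(store) and is_binary(bytes) do
    digest = :crypto.hash(:sha256, bytes) |> Base.encode16(case: :lower)

    case Map.fetch(store, digest) do
      {:ok, file_uuid} ->
        # Same bytes seen before: reuse the existing file row.
        {store, file_uuid, [content_dedup: true]}

      :error ->
        # First time these bytes are seen: mint an illustrative id.
        file_uuid = "file-" <> String.slice(digest, 0, 8)
        {Map.put(store, digest, file_uuid), file_uuid, [content_dedup: false]}
    end
  end
end
```

Storing the same bytes twice returns the same file_uuid both times, with content_dedup: true only on the second call, which mirrors the metadata flag described above.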
@spec get_extraction(PhoenixKitCatalogue.Schemas.Pdf.t() | Ecto.UUID.t()) :: PhoenixKitCatalogue.Schemas.PdfExtraction.t() | nil
Returns the extraction state for a PDF (or its file_uuid), or
nil if the file has no extraction row yet.
@spec get_pdf(Ecto.UUID.t()) :: PhoenixKitCatalogue.Schemas.Pdf.t() | nil
Fetches a PDF by UUID. Returns nil if not found.
@spec get_pdf!(Ecto.UUID.t()) :: PhoenixKitCatalogue.Schemas.Pdf.t()
Fetches a PDF by UUID. Raises Ecto.NoResultsError if not found.
@spec list_pdfs(keyword()) :: [PhoenixKitCatalogue.Schemas.Pdf.t()]
Lists PDFs in the library, newest first.
Options
- :status: filter to a status string ("active" / "trashed"). Pass nil to include all. Defaults to "active".
- :limit (default 100), :offset (default 0)
@spec more_pdf_matches_for_item( PhoenixKitCatalogue.Schemas.Item.t(), Ecto.UUID.t(), keyword() ) :: [ hit() ]
Loads additional hits for one PDF beyond what the initial grouped search returned. Used by the modal's per-PDF "Show more matches" expand action.
Returns a flat list of hit() ordered by page_number ASC (literal
search) or similarity DESC (when a :trigram_query opt is given).
Options
- :offset (default 0)
- :limit (default 50)
- :trigram_query: when set, score by pg_trgm similarity against this string (matches the trigram fallback's ordering).
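The ordering and paging rules can be modelled on plain maps shaped like hit(). A minimal sketch, assuming a hypothetical HitOrderSketch module; the real function does this in SQL, and only the option names and defaults are taken from the docs above:

```elixir
defmodule HitOrderSketch do
  # Hypothetical in-memory model of the ordering and paging rules.
  def page(hits, opts \\ []) do
    offset = Keyword.get(opts, :offset, 0)
    limit = Keyword.get(opts, :limit, 50)

    sorted =
      if Keyword.get(opts, :trigram_query) do
        # Trigram mode: best similarity first.
        Enum.sort_by(hits, & &1.score, :desc)
      else
        # Literal mode: document order.
        Enum.sort_by(hits, & &1.page_number, :asc)
      end

    sorted |> Enum.drop(offset) |> Enum.take(limit)
  end
end
```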
@spec permanently_delete_pdf( PhoenixKitCatalogue.Schemas.Pdf.t(), keyword() ) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}
Permanently removes a phoenix_kit_cat_pdfs row.
When this is the last (active OR trashed) row referencing the
underlying file_uuid, hands the file off to Storage.trash_file/1
so core's daily PruneTrashJob deletes the binary, cascading to
the extraction and page rows.
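The "last row referencing the file" test can be sketched on plain maps. The LastRefSketch module and the row shape are assumptions for illustration; the point is that status is deliberately ignored, since a trashed row still counts as a reference.

```elixir
defmodule LastRefSketch do
  # Hypothetical model: `rows` are all phoenix_kit_cat_pdfs rows as maps
  # with :uuid, :file_uuid and :status; `pdf` is the row being deleted.
  def last_reference?(rows, %{uuid: uuid, file_uuid: file_uuid}) do
    # Any other row (active OR trashed) on the same file keeps it alive.
    not Enum.any?(rows, fn row ->
      row.file_uuid == file_uuid and row.uuid != uuid
    end)
  end
end
```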
@spec prune_orphan_page_contents() :: non_neg_integer()
Removes phoenix_kit_cat_pdf_page_contents rows that no
phoenix_kit_cat_pdf_pages row references anymore. Safe to call
any time.
Returns the number of rows removed. Suitable for wiring to a daily Oban cron once the corpus is large enough to care.
@spec requeue_stuck_extractions(keyword()) :: {:ok, %{ requeued: non_neg_integer(), skipped: non_neg_integer(), failed: non_neg_integer() }}
Re-enqueues extraction for every PDF stuck in a non-terminal state.
The heal path for PDFs uploaded while the :catalogue_pdf queue was
unavailable (their jobs never ran) or orphaned extracting rows whose
worker died mid-run. The per-upload enqueue_extraction/1 guard only
fires at upload time, so without this nothing ever re-drives those rows.
pending rows are always re-enqueued — no live job can exist for them.
extracting rows are re-enqueued only when older than
:stale_after_seconds (default 900) so an actively-running
extraction isn't double-processed.
Returns {:ok, %{requeued: n, skipped: s, failed: m}}:
- requeued: rows whose extraction job was actually (re-)enqueued.
- skipped: rows a live job already covers, so there was nothing to do (the app-level dedup). Reported separately so requeued can't claim credit for rows we didn't touch.
- failed: rows whose enqueue was refused (e.g. the :catalogue_pdf queue is still not running, so they were marked failed with the actionable message instead).
The split keeps "re-queued N" honest when every enqueue actually failed or was a no-op. Safe to call repeatedly (the worker is idempotent).
The whole selection is de-duped against live jobs in a single query and
enqueued with one Oban.insert_all/1, so a full 1000-row
run takes a handful of statements rather than ~2k per-row round-trips.
Capped at 1000 rows per call; re-run to process more.
Options
- :stale_after_seconds (default 900): minimum age of an extracting row before it's considered orphaned.
- :limit (default 1000): max rows touched per call.
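The selection rule (pending always, extracting only once stale) can be sketched in memory. The StuckSketch module and the :updated_at field are assumptions; the docs do not name the timestamp column the real query compares against, and the skipped / failed bookkeeping is not modelled here.

```elixir
defmodule StuckSketch do
  # Hypothetical model: rows are maps with :status and :updated_at (DateTime).

  # pending rows have no live job, so they always qualify.
  def requeue?(%{status: "pending"}, _now, _stale_after), do: true

  # extracting rows qualify only when older than the staleness window.
  def requeue?(%{status: "extracting", updated_at: at}, now, stale_after) do
    DateTime.diff(now, at, :second) > stale_after
  end

  # Terminal rows are never re-driven.
  def requeue?(_row, _now, _stale_after), do: false

  def select(rows, now, opts \\ []) do
    stale_after = Keyword.get(opts, :stale_after_seconds, 900)
    limit = Keyword.get(opts, :limit, 1000)

    rows
    |> Enum.filter(&requeue?(&1, now, stale_after))
    |> Enum.take(limit)
  end
end
```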
@spec restore_pdf( PhoenixKitCatalogue.Schemas.Pdf.t(), keyword() ) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}
Restores a trashed PDF back to active.
@spec retry_extraction( PhoenixKitCatalogue.Schemas.Pdf.t() | Ecto.UUID.t(), keyword() ) :: {:ok, PhoenixKitCatalogue.Schemas.PdfExtraction.t()} | {:error, term()}
Retries text extraction for a single PDF.
Resets the extraction row to pending (clearing any prior
error_message) and re-enqueues the worker. Use for a failed row
(transient failure: queue was down, pdftotext hiccup) or one that
looks stuck in pending / extracting.
This is a retry, not a full re-extract: it does not delete existing
pdf_pages rows or clear page_count / extracted_at. The worker's
page inserts are upserts and mark_extracted/2 overwrites page_count
on success, so a re-run self-heals. The admin UI only offers Retry on
failed rows (which carry no successful page data), so the distinction
rarely matters in practice.
The worker no-ops on a terminal status, so resetting to pending
first is what lets a failed row run again.
Returns:
- {:ok, extraction}: reset + enqueued.
- {:error, :no_extraction}: the file has no extraction row.
- {:error, :already_extracted}: the row is already in a SUCCESS terminal (extracted / scanned_no_text). Refused so a stray caller can't reset a good extraction back to pending and drop the PDF out of search mid-run. Pass force: true to override (e.g. a deliberate re-extract after a normalizer change). The admin UI only offers Retry on failed rows, so this only bites a programmatic caller.
- {:error, reason}: the enqueue guard refused (e.g. :extraction_queue_unavailable when the :catalogue_pdf queue still isn't running). The row is left failed with the actionable message in that case, exactly as on upload.
Accepts a %Pdf{} (the LV path) or a bare file_uuid.
Options
- :force (default false): re-run even a success-terminal row.
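The guard around the reset can be modelled as a pure function. A minimal sketch, assuming a hypothetical RetrySketch module and extraction rows represented as maps with :status and :error_message keys; the real function also enqueues the worker, which is left out here.

```elixir
defmodule RetrySketch do
  @success_terminal ["extracted", "scanned_no_text"]

  # Hypothetical pure model of the retry guard.
  def decide(extraction, opts \\ [])

  def decide(nil, _opts), do: {:error, :no_extraction}

  def decide(%{status: status} = extraction, opts) do
    force? = Keyword.get(opts, :force, false)

    if status in @success_terminal and not force? do
      # A good extraction is protected unless the caller forces it.
      {:error, :already_extracted}
    else
      # Reset to pending and clear the old error so the worker runs again.
      # The update syntax requires both keys to already exist on the map.
      {:ok, %{extraction | status: "pending", error_message: nil}}
    end
  end
end
```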
@spec search_pdfs_for_item( PhoenixKitCatalogue.Schemas.Item.t(), keyword() ) :: [group()]
Searches the PDF library for any active PDF whose pages match one of the item's translated names.
Returns groups keyed by PDF, each with that PDF's total match count
plus the first :per_pdf hits (default 5). Use
more_pdf_matches_for_item/3 to load additional hits within one PDF
on demand (the "Show more matches" expand action).
Strategy:
- Build the title list from the item's primary name + every enabled language's translated name. Drop blanks and duplicates.
- Literal ILIKE ANY against the deduped page-content table: fast and precise. Joined to active phoenix_kit_cat_pdfs rows via file_uuid. Rows are window-ranked per PDF and window-counted per PDF in a single SQL pass; the outer query caps at rn <= per_pdf so the result is bounded by per_pdf × distinct PDFs that match.
- If literal returns nothing, fall back to a pg_trgm similarity search using the longest title (default threshold 0.4): same grouping shape, best similarity first within each PDF.
Trashed PDFs are excluded. Groups are ordered newest-PDF-first.
Options
- :per_pdf (default 5): preview hits returned per PDF.
- :similarity_threshold (default 0.4): trigram fallback threshold.
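The first strategy step, and the title choice for the trigram fallback, can be sketched without a database. The TitleSketch module and its input shape (a primary name plus a map of language code to translated name, assumed to hold only enabled languages) are illustrative assumptions, not the item schema.

```elixir
defmodule TitleSketch do
  # Hypothetical model of building the search title list.
  def titles(primary, translations) when is_map(translations) do
    [primary | Map.values(translations)]
    |> Enum.filter(&is_binary/1)
    |> Enum.map(&String.trim/1)
    # Drop blanks, then duplicates (first occurrence wins).
    |> Enum.reject(&(&1 == ""))
    |> Enum.uniq()
  end

  # The trigram fallback searches with the longest title only.
  def longest([]), do: nil
  def longest(titles), do: Enum.max_by(titles, &String.length/1)
end
```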
@spec trash_pdf( PhoenixKitCatalogue.Schemas.Pdf.t(), keyword() ) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}
Soft-deletes a PDF: flips status to "trashed" and records
trashed_at. Underlying file + extraction + page rows untouched
(other live PDF entries may still reference them).