PhoenixKitCatalogue.Catalogue.PdfLibrary (PhoenixKitCatalogue v0.5.0)


PDF library — upload, extract, search.

Layered on top of core's phoenix_kit_files system. The catalogue owns only:

  • phoenix_kit_cat_pdfs — per-upload row (the user-facing "this name in the library"). Soft-delete via status (active / trashed).
  • phoenix_kit_cat_pdf_extractions — per unique file content (one row per file_uuid). Holds the worker state machine.
  • phoenix_kit_cat_pdf_pages — per-page join.
  • phoenix_kit_cat_pdf_page_contents — content-addressed page text dedup cache.

Core handles binary storage, content-checksum dedup, multi-bucket redundancy, and the on-disk lifecycle (Storage.trash_file/1, PruneTrashJob).

The public surface is re-exported from PhoenixKitCatalogue.Catalogue. Activity logging follows the catalogue convention: the context layer logs successes only, and the LV layer's Web.Helpers.log_operation_error/3 writes the db_pending: true audit row on failure.

Authorization

The mutating context functions accept :actor_uuid for activity attribution but do not enforce role checks — authorization is the LV mount layer's job (admin live_session + on_mount hook). Same convention as the rest of the catalogue context. New non-LV callers (background jobs, RPC, extension modules) MUST verify the caller is allowed before invoking these functions.

create_pdf_from_upload/3 does require a non-nil :actor_uuid — not as authorization, but because core's phoenix_kit_files.user_uuid is NOT NULL and we'd otherwise crash mid-flow after writing bytes to disk. Returns {:error, :missing_actor} cleanly when missing.
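
For a non-LV caller, the check-then-call pattern described above might look like the sketch below. Everything in it is hypothetical: the GuardedCallSketch module, the actor map shape (:uuid, :roles), and the "admin" role name are illustrative assumptions, not part of the catalogue or core API. The wrapped function stands in for any mutating context function that takes :actor_uuid in its options.

```elixir
defmodule GuardedCallSketch do
  # Hypothetical guard for background jobs / RPC callers.
  # The actor shape and the "admin" role are assumptions for illustration.

  # Authorized path: the actor has a uuid and carries the admin role.
  def run(%{uuid: uuid, roles: roles}, fun)
      when is_binary(uuid) and is_list(roles) and is_function(fun, 1) do
    if "admin" in roles do
      # Pass the actor through for activity attribution only.
      opts = [actor_uuid: uuid]
      fun.(opts)
    else
      {:error, :unauthorized}
    end
  end

  # Anything else (nil actor, missing uuid, malformed map) is refused.
  def run(_actor, _fun), do: {:error, :unauthorized}
end
```

The point of the wrapper is that the role check happens before the context function runs, since the context function itself only uses :actor_uuid for attribution.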

Types

group()

@type group() :: %{
  pdf: PhoenixKitCatalogue.Schemas.Pdf.t(),
  total_matches: non_neg_integer(),
  hits: [hit()]
}

Per-PDF group returned by search_pdfs_for_item/2.

hit()

@type hit() :: %{
  pdf: PhoenixKitCatalogue.Schemas.Pdf.t(),
  page_number: pos_integer(),
  snippet: String.t(),
  score: float()
}

One PDF search hit returned to the UI.

Functions

count_pdfs(opts \\ [])

@spec count_pdfs(keyword()) :: non_neg_integer()

Returns the total PDF count, matching the optional status filter.

create_pdf_from_upload(tmp_path, original_filename, opts \\ [])

@spec create_pdf_from_upload(String.t(), String.t(), keyword()) ::
  {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, term()}

Stores an uploaded PDF.

tmp_path is the local file path passed to the consume_uploaded_entry callback. original_filename is the name the user chose for the upload.

Flow:

  1. Storage.store_file/2 (core) — handles SHA-256 dedup, on-disk placement, multi-bucket redundancy. Same content uploaded twice (any name) returns the same file_uuid.
  2. Upsert the per-file extraction row. If newly created, enqueue the worker — otherwise the previous extraction is reused.
  3. Always insert a fresh phoenix_kit_cat_pdfs row so each upload gets its own per-name entry in the library.
  4. Activity action: pdf.uploaded. Metadata flags content_dedup: true when the file row was a hit.

Returns {:ok, pdf} on success, or {:error, reason} on failure ({:error, :missing_actor} when :actor_uuid is missing).

The persisted byte_size is read from the file on disk via File.stat!/1 — never from a browser-supplied value — so the recorded size always matches the actual stored bytes.
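
The flow above can be modelled in memory to show how content dedup and per-upload rows interact. This is a toy sketch under stated assumptions, not the library's implementation: UploadFlowSketch, its state map, the "file-" id scheme, and the returned three-element tuple are all hypothetical. Only the SHA-256 keying, the "enqueue only on first sight" rule, the always-fresh PDF row, and the :missing_actor refusal come from the text.

```elixir
defmodule UploadFlowSketch do
  # Toy in-memory model. State keys are hypothetical:
  #   :files       checksum => file_uuid
  #   :extractions file_uuid => status
  #   :pdfs        per-upload rows
  #   :enqueued    file_uuids handed to the (imaginary) worker
  def new, do: %{files: %{}, extractions: %{}, pdfs: [], enqueued: []}

  def upload(state, bytes, original_filename, opts \\ []) do
    case Keyword.get(opts, :actor_uuid) do
      nil ->
        # Refuse before anything is stored.
        {:error, :missing_actor}

      actor_uuid ->
        # Step 1: identical bytes map to the same file id, whatever the name.
        checksum = Base.encode16(:crypto.hash(:sha256, bytes), case: :lower)
        dedup_hit? = Map.has_key?(state.files, checksum)
        file_uuid = Map.get(state.files, checksum, "file-" <> binary_part(checksum, 0, 8))

        # Step 2: create the extraction row and enqueue only for new content.
        state =
          if dedup_hit? do
            state
          else
            %{
              state
              | files: Map.put(state.files, checksum, file_uuid),
                extractions: Map.put(state.extractions, file_uuid, "pending"),
                enqueued: [file_uuid | state.enqueued]
            }
          end

        # Step 3: every upload gets its own row; size comes from the bytes.
        pdf = %{
          file_uuid: file_uuid,
          original_filename: original_filename,
          byte_size: byte_size(bytes),
          status: "active",
          uploaded_by: actor_uuid,
          content_dedup: dedup_hit?
        }

        {:ok, pdf, %{state | pdfs: [pdf | state.pdfs]}}
    end
  end
end
```

Uploading the same bytes twice under different names yields two PDF rows that share one file_uuid and one enqueued extraction.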

get_extraction(file_uuid)

Returns the extraction state for a PDF (or its file_uuid), or nil if the file has no extraction row yet.

get_pdf(uuid)

@spec get_pdf(Ecto.UUID.t()) :: PhoenixKitCatalogue.Schemas.Pdf.t() | nil

Fetches a PDF by UUID. Returns nil if not found.

get_pdf!(uuid)

Fetches a PDF by UUID. Raises Ecto.NoResultsError if not found.

list_pdfs(opts \\ [])

@spec list_pdfs(keyword()) :: [PhoenixKitCatalogue.Schemas.Pdf.t()]

Lists PDFs in the library, newest first.

Options

  • :status — filter to a status string ("active" / "trashed"). Pass nil to include all. Defaults to "active".
  • :limit (default 100), :offset (default 0)
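
The difference between omitting :status and passing status: nil is easy to miss. The sketch below shows one way such option handling can behave, using an in-memory list in place of the database. ListOptsSketch and the inserted_at field used for "newest first" are assumptions for illustration; the real query is not shown in this documentation.

```elixir
defmodule ListOptsSketch do
  # Hypothetical in-memory stand-in for the option handling described above.
  def list(pdfs, opts \\ []) do
    # Keyword.get/3 falls back to the default only when the key is absent,
    # so an explicit `status: nil` survives and means "no status filter".
    status = Keyword.get(opts, :status, "active")
    limit = Keyword.get(opts, :limit, 100)
    offset = Keyword.get(opts, :offset, 0)

    pdfs
    |> Enum.filter(fn pdf -> is_nil(status) or pdf.status == status end)
    # Newest first; `inserted_at` is an assumed timestamp field.
    |> Enum.sort_by(& &1.inserted_at, {:desc, DateTime})
    |> Enum.drop(offset)
    |> Enum.take(limit)
  end
end
```

With this shape, `list(pdfs)` returns only active rows, while `list(pdfs, status: nil)` returns active and trashed rows together.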

more_pdf_matches_for_item(item, pdf_uuid, opts \\ [])

@spec more_pdf_matches_for_item(
  PhoenixKitCatalogue.Schemas.Item.t(),
  Ecto.UUID.t(),
  keyword()
) :: [
  hit()
]

Loads additional hits for one PDF beyond what the initial grouped search returned. Used by the modal's per-PDF "Show more matches" expand action.

Returns a flat list of hit() ordered by page_number ASC (literal search) or similarity DESC (when a :trigram_query opt is given).

Options

  • :offset (default 0)
  • :limit (default 50)
  • :trigram_query — when set, score by pg_trgm similarity against this string (matches the trigram fallback's ordering).
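
The two orderings and the offset/limit window can be sketched over an already-scored list of hits. MoreMatchesSketch is a hypothetical helper; the real function does this work in SQL and computes the similarity score itself, which this sketch simply assumes is present on each hit.

```elixir
defmodule MoreMatchesSketch do
  # Hypothetical paging helper over hit maps shaped like hit().
  def page(hits, opts \\ []) do
    offset = Keyword.get(opts, :offset, 0)
    limit = Keyword.get(opts, :limit, 50)

    sorted =
      if Keyword.get(opts, :trigram_query) do
        # Trigram mode: best similarity first.
        Enum.sort_by(hits, & &1.score, :desc)
      else
        # Literal mode: reading order.
        Enum.sort_by(hits, & &1.page_number, :asc)
      end

    sorted |> Enum.drop(offset) |> Enum.take(limit)
  end
end
```

A caller implementing "Show more matches" would typically pass the number of hits already displayed as :offset.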

permanently_delete_pdf(pdf, opts \\ [])

@spec permanently_delete_pdf(
  PhoenixKitCatalogue.Schemas.Pdf.t(),
  keyword()
) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}

Permanently removes a phoenix_kit_cat_pdfs row.

When this is the last (active OR trashed) row referencing the underlying file_uuid, hands the file off to Storage.trash_file/1 so core's daily PruneTrashJob deletes the binary, cascading to the extraction and page rows.
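
The "last reference" rule can be illustrated with a small in-memory model. DeleteSketch and its return shape are hypothetical; the sketch only shows the decision of whether the underlying file should be handed off after a row is removed, counting both active and trashed rows as references.

```elixir
defmodule DeleteSketch do
  # Hypothetical model: `pdf_rows` is a list of maps with :uuid and :file_uuid.
  # Returns {remaining_rows, action}.
  def delete(pdf_rows, pdf) do
    remaining = Enum.reject(pdf_rows, &(&1.uuid == pdf.uuid))

    # Any surviving row, active or trashed, keeps the file alive.
    still_referenced? = Enum.any?(remaining, &(&1.file_uuid == pdf.file_uuid))

    action =
      if still_referenced? do
        :keep_file
      else
        {:trash_file, pdf.file_uuid}
      end

    {remaining, action}
  end
end
```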

prune_orphan_page_contents()

@spec prune_orphan_page_contents() :: non_neg_integer()

Removes phoenix_kit_cat_pdf_page_contents rows that no phoenix_kit_cat_pdf_pages row references anymore. Safe to call any time.

Returns the number of rows removed. Suitable for wiring to a daily Oban cron once the corpus is large enough to care.
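
Conceptually the prune is a set difference between stored page contents and the contents that pages still point at. A minimal in-memory sketch follows; PruneSketch and the content_hash key are assumed names for illustration, not the actual column names.

```elixir
defmodule PruneSketch do
  # Hypothetical model:
  #   page_contents: %{content_hash => text}
  #   pages: list of maps with a :content_hash reference
  # Returns {kept_contents, removed_count}.
  def prune(page_contents, pages) do
    referenced = pages |> Enum.map(& &1.content_hash) |> Enum.uniq()

    # Keep only entries that some page still references.
    kept = Map.take(page_contents, referenced)

    {kept, map_size(page_contents) - map_size(kept)}
  end
end
```

Running it a second time on the result removes nothing, which matches the "safe to call any time" property.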

requeue_stuck_extractions(opts \\ [])

@spec requeue_stuck_extractions(keyword()) ::
  {:ok,
   %{
     requeued: non_neg_integer(),
     skipped: non_neg_integer(),
     failed: non_neg_integer()
   }}

Re-enqueues extraction for every PDF stuck in a non-terminal state.

This is the heal path for PDFs uploaded while the :catalogue_pdf queue was unavailable (their jobs never ran) and for orphaned extracting rows whose worker died mid-run. The per-upload enqueue_extraction/1 guard only fires at upload time, so without this function nothing ever re-drives those rows.

pending rows are always re-enqueued — no live job can exist for them. extracting rows are re-enqueued only when older than :stale_after_seconds (default 900) so an actively-running extraction isn't double-processed.

Returns {:ok, %{requeued: n, skipped: s, failed: m}}:

  • requeued — rows whose extraction job was actually (re-)enqueued.
  • skipped — rows a live job already covers, so there was nothing to do (the app-level dedup). Reported separately so requeued can't claim credit for rows we didn't touch.
  • failed — rows whose enqueue was refused (e.g. the :catalogue_pdf queue is still not running, so they were marked failed with the actionable message instead).

The split keeps "re-queued N" honest when every enqueue actually failed or was a no-op. Safe to call repeatedly (the worker is idempotent).

The whole selection is de-duped against live jobs in a single query and enqueued with one Oban.insert_all/1, so a full 1000-row click is a handful of statements rather than ~2k per-row round-trips.

Capped at 1000 rows per call; re-run to process more.

Options

  • :stale_after_seconds (default 900) — minimum age of an extracting row before it's considered orphaned.
  • :limit (default 1000) — max rows touched per call.
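
The selection rule (pending always, extracting only when stale) and the three-way result tally can be sketched as pure functions. RequeueSketch is hypothetical, and the updated_at field used to measure a row's age is an assumption; the documentation does not say which timestamp the real query compares.

```elixir
defmodule RequeueSketch do
  # Picks the rows that count as stuck at time `now`.
  def select(rows, now, opts \\ []) do
    stale_after = Keyword.get(opts, :stale_after_seconds, 900)
    limit = Keyword.get(opts, :limit, 1000)

    rows
    |> Enum.filter(&stuck?(&1, now, stale_after))
    |> Enum.take(limit)
  end

  # Pending rows are always candidates.
  defp stuck?(%{status: "pending"}, _now, _stale_after), do: true

  # Extracting rows only once they are older than the staleness window
  # (`updated_at` is an assumed field name).
  defp stuck?(%{status: "extracting", updated_at: updated_at}, now, stale_after),
    do: DateTime.diff(now, updated_at, :second) > stale_after

  # Terminal rows are never touched.
  defp stuck?(_row, _now, _stale_after), do: false

  # Folds per-row outcomes (:requeued | :skipped | :failed) into the
  # summary map shape returned by the documented function.
  def tally(outcomes) do
    Enum.reduce(outcomes, %{requeued: 0, skipped: 0, failed: 0}, fn outcome, acc ->
      Map.update!(acc, outcome, &(&1 + 1))
    end)
  end
end
```

Keeping skipped and failed as separate counters is what lets a UI message such as "re-queued N" stay accurate when some rows were no-ops or were refused.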

restore_pdf(pdf, opts \\ [])

Restores a trashed PDF back to active.

retry_extraction(pdf_or_file_uuid, opts \\ [])

@spec retry_extraction(
  PhoenixKitCatalogue.Schemas.Pdf.t() | Ecto.UUID.t(),
  keyword()
) :: {:ok, PhoenixKitCatalogue.Schemas.PdfExtraction.t()} | {:error, term()}

Retries text extraction for a single PDF.

Resets the extraction row to pending (clearing any prior error_message) and re-enqueues the worker. Use for a failed row (transient failure: queue was down, pdftotext hiccup) or one that looks stuck in pending / extracting.

This is a retry, not a full re-extract: it does not delete existing pdf_pages rows or clear page_count / extracted_at. The worker's page inserts are upserts and mark_extracted/2 overwrites page_count on success, so a re-run self-heals. The admin UI only offers Retry on failed rows (which carry no successful page data), so the distinction rarely matters in practice.

The worker no-ops on a terminal status, so resetting to pending first is what lets a failed row run again.

Returns:

  • {:ok, extraction} — reset + enqueued.
  • {:error, :no_extraction} — the file has no extraction row.
  • {:error, :already_extracted} — the row is already in a SUCCESS terminal (extracted / scanned_no_text). Refused so a stray caller can't reset a good extraction back to pending and drop the PDF out of search mid-run. Pass force: true to override (e.g. a deliberate re-extract after a normalizer change). The admin UI only offers Retry on failed rows, so this only bites a programmatic caller.
  • {:error, reason} — the enqueue guard refused (e.g. :extraction_queue_unavailable when the :catalogue_pdf queue still isn't running). The row is left failed with the actionable message in that case, exactly as on upload.

Accepts a %Pdf{} (the LV path) or a bare file_uuid.

Options

  • :force (default false) — re-run even a success-terminal row.
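
The return values above follow from a small decision on the row's current status. The sketch below models only that decision; RetrySketch is hypothetical, operates on plain maps rather than the PdfExtraction schema, and leaves out the enqueue guard (the {:error, reason} case).

```elixir
defmodule RetrySketch do
  # Success-terminal statuses named in the documentation above.
  @success_terminal ["extracted", "scanned_no_text"]

  def retry(extraction, opts \\ [])

  # No extraction row exists for the file.
  def retry(nil, _opts), do: {:error, :no_extraction}

  def retry(%{status: status} = extraction, opts) do
    force? = Keyword.get(opts, :force, false)

    if status in @success_terminal and not force? do
      # Protect a good extraction from an accidental reset.
      {:error, :already_extracted}
    else
      # Reset to pending and clear the old error. page_count and other
      # fields are left as they are: a retry, not a full re-extract.
      {:ok, %{extraction | status: "pending", error_message: nil}}
    end
  end
end
```

A programmatic caller that really wants to re-extract a finished PDF would pass `force: true`; everyone else gets the refusal.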

search_pdfs_for_item(item, opts \\ [])

@spec search_pdfs_for_item(
  PhoenixKitCatalogue.Schemas.Item.t(),
  keyword()
) :: [group()]

Searches the PDF library for any active PDF whose pages match one of the item's translated names.

Returns groups keyed by PDF, each with the total match count for that PDF plus the first :per_pdf hits (default 5). Use more_pdf_matches_for_item/3 to load additional hits within one PDF on demand (the "Show more matches" expand action).
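
The group() shape can be illustrated by folding a flat list of hits into per-PDF groups. GroupSketch is a hypothetical in-memory equivalent; the real function produces this shape in SQL, and the inserted_at field used for ordering the groups is an assumption.

```elixir
defmodule GroupSketch do
  # Builds group()-shaped maps from a flat list of hit()-shaped maps.
  def group(hits, per_pdf \\ 5) do
    hits
    |> Enum.group_by(& &1.pdf)
    |> Enum.map(fn {pdf, pdf_hits} ->
      %{
        pdf: pdf,
        # Count every match, even those beyond the preview window.
        total_matches: length(pdf_hits),
        hits: Enum.take(pdf_hits, per_pdf)
      }
    end)
    # Newest PDF first; `inserted_at` is an assumed timestamp field.
    |> Enum.sort_by(& &1.pdf.inserted_at, {:desc, DateTime})
  end
end
```

When total_matches exceeds the number of preview hits, the UI knows there is more to load for that PDF.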

Strategy:

  1. Build the title list from the item's primary name + every enabled language's translated name. Drop blanks and duplicates.
  2. Literal ILIKE ANY against the deduped page-content table — fast and precise. Joined to active phoenix_kit_cat_pdfs rows via file_uuid. Rows are window-ranked per PDF and window-counted per PDF in a single SQL pass; the outer query caps at rn <= per_pdf so the result is bounded by per_pdf × distinct PDFs that match.
  3. If literal returns nothing, fall back to a pg_trgm similarity search using the longest title (default threshold 0.4) — same grouping shape, best similarity first within each PDF.

Trashed PDFs are excluded. Groups are ordered newest-PDF-first.
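
Step 1 of the strategy, and the choice of the longest title for the trigram fallback in step 3, can be sketched as follows. TitleListSketch, the translations map, and the trimming of whitespace before the blank check are illustrative assumptions; the Item schema's actual fields are not described here.

```elixir
defmodule TitleListSketch do
  # `translations` is an assumed map of language code => translated name.
  def titles(primary_name, translations, enabled_languages) do
    translated = Enum.map(enabled_languages, &Map.get(translations, &1))

    [primary_name | translated]
    |> Enum.reject(&is_nil/1)
    |> Enum.map(&String.trim/1)
    # Drop blanks, then duplicates (first occurrence wins).
    |> Enum.reject(&(&1 == ""))
    |> Enum.uniq()
  end

  # Longest title, used as the single trigram query; nil when there is none.
  def fallback_title(titles) do
    Enum.max_by(titles, &String.length/1, fn -> nil end)
  end
end
```

Translations for languages that are not enabled never reach the title list, so they cannot produce matches.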

Options

  • :per_pdf (default 5) — preview hits returned per PDF.
  • :similarity_threshold (default 0.4) — trigram fallback threshold.

trash_pdf(pdf, opts \\ [])

Soft-deletes a PDF: flips status to "trashed" and records trashed_at. The underlying file, extraction, and page rows are left untouched, because other live PDF entries may still reference them.