PhoenixKitCatalogue.Catalogue.PdfLibrary (PhoenixKitCatalogue v0.2.0)


PDF library — upload, extract, search.

Layered on top of core's phoenix_kit_files system. The catalogue owns only:

  • phoenix_kit_cat_pdfs — per-upload row (the user-facing "this name in the library"). Soft-delete via status (active / trashed).
  • phoenix_kit_cat_pdf_extractions — per unique file content (one row per file_uuid). Holds the worker state machine.
  • phoenix_kit_cat_pdf_pages — per-page join.
  • phoenix_kit_cat_pdf_page_contents — content-addressed page text dedup cache.

Core handles binary storage, content-checksum dedup, multi-bucket redundancy, and the on-disk lifecycle (Storage.trash_file/1, PruneTrashJob).

The public surface is re-exported from PhoenixKitCatalogue.Catalogue. Activity logging follows the catalogue convention: the context layer logs successes only, and the LiveView (LV) layer's Web.Helpers.log_operation_error/3 writes the db_pending: true audit row on failure.

Authorization

The mutating context functions accept :actor_uuid for activity attribution but do not enforce role checks — authorization is the LV mount layer's job (admin live_session + on_mount hook). Same convention as the rest of the catalogue context. New non-LV callers (background jobs, RPC, extension modules) MUST verify the caller is allowed before invoking these functions.

create_pdf_from_upload/3 does require a non-nil :actor_uuid. This is not an authorization check: core's phoenix_kit_files.user_uuid is NOT NULL, so a nil actor would otherwise crash mid-flow after bytes had already been written to disk. The function returns {:error, :missing_actor} cleanly when the actor is missing.
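A minimal sketch of how a non-LV caller might check authorization before invoking a mutating function. The GuardedPdfUpload module, the actor map shape (%{role: ..., uuid: ...}), and the :admin role check are all hypothetical and not part of this library; only create_pdf_from_upload/3 and its :actor_uuid option come from the documentation above.

```elixir
defmodule GuardedPdfUpload do
  # Hypothetical wrapper for background jobs, RPC handlers, or extension modules.
  alias PhoenixKitCatalogue.Catalogue.PdfLibrary

  # upload_fun is injectable so the guard can be exercised without the library.
  def run(actor, tmp_path, filename, upload_fun \\ &PdfLibrary.create_pdf_from_upload/3)

  # Assumption: the caller's actor is a map carrying a role and a UUID.
  def run(%{role: :admin, uuid: uuid}, tmp_path, filename, upload_fun) when is_binary(uuid) do
    # Passing :actor_uuid also avoids the {:error, :missing_actor} result.
    upload_fun.(tmp_path, filename, actor_uuid: uuid)
  end

  # Anyone else is rejected before any bytes are stored.
  def run(_actor, _tmp_path, _filename, _upload_fun), do: {:error, :unauthorized}
end
```

The point of the sketch is ordering: the role check runs before the context function, because the context function itself will not perform it.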

Types

group()

@type group() :: %{
  pdf: PhoenixKitCatalogue.Schemas.Pdf.t(),
  total_matches: non_neg_integer(),
  hits: [hit()]
}

Per-PDF group returned by search_pdfs_for_item/2.

hit()

@type hit() :: %{
  pdf: PhoenixKitCatalogue.Schemas.Pdf.t(),
  page_number: pos_integer(),
  snippet: String.t(),
  score: float()
}

One PDF search hit returned to the UI.

Functions

count_pdfs(opts \\ [])

@spec count_pdfs(keyword()) :: non_neg_integer()

Returns the total PDF count, matching the optional status filter.

create_pdf_from_upload(tmp_path, original_filename, opts \\ [])

@spec create_pdf_from_upload(String.t(), String.t(), keyword()) ::
  {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, term()}

Stores an uploaded PDF.

tmp_path is the path of the local file supplied to the consume_uploaded_entry callback. original_filename is the user's chosen name.

Flow:

  1. Storage.store_file/2 (core) — handles SHA-256 dedup, on-disk placement, multi-bucket redundancy. Same content uploaded twice (any name) returns the same file_uuid.
  2. Upsert the per-file extraction row. If newly created, enqueue the worker — otherwise the previous extraction is reused.
  3. Always insert a fresh phoenix_kit_cat_pdfs row so each upload gets its own per-name entry in the library.
  4. Activity action: pdf.uploaded. The metadata carries content_dedup: true when the stored file already existed (a dedup hit).

Returns {:ok, pdf} on success.

The persisted byte_size is read from the file on disk via File.stat!/1 — never from a browser-supplied value — so the recorded size always matches the actual stored bytes.
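The two facts the flow relies on, a content checksum that ignores the filename and a size read from disk, can be illustrated with a small self-contained sketch. The UploadFacts module is hypothetical, and the hex-encoded SHA-256 shown here is an assumption about the general technique, not a description of how core's Storage.store_file/2 computes or stores its checksum.

```elixir
defmodule UploadFacts do
  # Hex-encoded SHA-256 of the file's bytes. The result depends only on
  # content, so the same bytes under two names yield the same value.
  def checksum(path) do
    path
    |> File.read!()
    |> then(&:crypto.hash(:sha256, &1))
    |> Base.encode16(case: :lower)
  end

  # Size taken from the filesystem rather than from a client-reported value.
  def size_on_disk(path), do: File.stat!(path).size
end
```

Reading the whole file into memory keeps the sketch short; a production implementation would more likely stream the file through the hash in chunks.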

get_extraction(file_uuid)

Returns the extraction state for a PDF (or its file_uuid), or nil if the file has no extraction row yet.

get_pdf(uuid)

@spec get_pdf(Ecto.UUID.t()) :: PhoenixKitCatalogue.Schemas.Pdf.t() | nil

Fetches a PDF by UUID. Returns nil if not found.

get_pdf!(uuid)

Fetches a PDF by UUID. Raises Ecto.NoResultsError if not found.

list_pdfs(opts \\ [])

@spec list_pdfs(keyword()) :: [PhoenixKitCatalogue.Schemas.Pdf.t()]

Lists PDFs in the library, newest first.

Options

  • :status — filter to a status string ("active" / "trashed"). Pass nil to include all. Defaults to "active".
  • :limit (default 100), :offset (default 0)
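One way to walk the whole library with :limit and :offset is sketched below. The PdfPager module is hypothetical; list_fun stands in for a call such as &PhoenixKitCatalogue.Catalogue.PdfLibrary.list_pdfs/1 and is injected so the paging logic stands alone.

```elixir
defmodule PdfPager do
  # Lazily yields every row by requesting successive pages.
  # list_fun receives the same keyword options that list_pdfs/1 documents.
  def stream(list_fun, page_size \\ 100) do
    Stream.unfold(0, fn
      # A short page was already seen, so there is nothing left to fetch.
      nil ->
        nil

      offset ->
        case list_fun.(status: "active", limit: page_size, offset: offset) do
          [] -> nil
          page when length(page) < page_size -> {page, nil}
          page -> {page, offset + page_size}
        end
    end)
    |> Stream.flat_map(& &1)
  end
end
```

Because results are ordered newest first, uploads that land while the stream is being consumed can shift rows between pages; offset paging of this kind is best suited to listings where an occasional repeated or skipped row is acceptable.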

more_pdf_matches_for_item(item, pdf_uuid, opts \\ [])

@spec more_pdf_matches_for_item(
  PhoenixKitCatalogue.Schemas.Item.t(),
  Ecto.UUID.t(),
  keyword()
) :: [
  hit()
]

Loads additional hits for one PDF beyond what the initial grouped search returned. Used by the modal's per-PDF "Show more matches" expand action.

Returns a flat list of hit() ordered by page_number ASC (literal search) or similarity DESC (when a :trigram_query opt is given).

Options

  • :offset (default 0)
  • :limit (default 50)
  • :trigram_query — when set, score by pg_trgm similarity against this string (matches the trigram fallback's ordering).

permanently_delete_pdf(pdf, opts \\ [])

@spec permanently_delete_pdf(
  PhoenixKitCatalogue.Schemas.Pdf.t(),
  keyword()
) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}

Permanently removes a phoenix_kit_cat_pdfs row.

When this is the last (active OR trashed) row referencing the underlying file_uuid, hands the file off to Storage.trash_file/1 so core's daily PruneTrashJob deletes the binary, cascading to the extraction and page rows.
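The "last reference" condition can be sketched in plain Elixir. The LastReference module and the map shape (:uuid, :file_uuid, :status) are hypothetical stand-ins for library rows; the real function presumably makes this decision with a database query rather than an in-memory list.

```elixir
defmodule LastReference do
  # True when no other library row points at the same file_uuid.
  # Status is deliberately ignored: a trashed row still counts as a reference.
  def last_reference?(rows, %{uuid: uuid, file_uuid: file_uuid}) do
    not Enum.any?(rows, fn row ->
      row.file_uuid == file_uuid and row.uuid != uuid
    end)
  end
end
```

Only when this check is true would the file be handed to Storage.trash_file/1; otherwise just the one library row disappears and the shared file stays in place.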

prune_orphan_page_contents()

@spec prune_orphan_page_contents() :: non_neg_integer()

Removes phoenix_kit_cat_pdf_page_contents rows that no phoenix_kit_cat_pdf_pages row references anymore. Safe to call any time.

Returns the number of rows removed. Suitable for wiring to a daily Oban cron job once the corpus is large enough for orphaned rows to matter.
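A possible shape for that daily job, using Oban's Cron plugin. Everything under MyApp (the worker module, the repo, the :maintenance queue, the 04:00 schedule) is hypothetical, and a host application that already configures Oban through PhoenixKit would merge the crontab entry into its existing configuration instead of copying this fragment.

```elixir
# lib/my_app/workers/prune_orphan_pdf_pages.ex (hypothetical module name)
defmodule MyApp.Workers.PruneOrphanPdfPages do
  use Oban.Worker, queue: :maintenance

  alias PhoenixKitCatalogue.Catalogue.PdfLibrary

  @impl Oban.Worker
  def perform(%Oban.Job{}) do
    # The return value is the number of rows removed; it is ignored here.
    _removed = PdfLibrary.prune_orphan_page_contents()
    :ok
  end
end

# config/config.exs (hypothetical app name, repo, and schedule)
# config :my_app, Oban,
#   repo: MyApp.Repo,
#   queues: [maintenance: 1],
#   plugins: [
#     {Oban.Plugins.Cron, crontab: [{"0 4 * * *", MyApp.Workers.PruneOrphanPdfPages}]}
#   ]
```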

restore_pdf(pdf, opts \\ [])

Restores a trashed PDF back to active.

search_pdfs_for_item(item, opts \\ [])

@spec search_pdfs_for_item(
  PhoenixKitCatalogue.Schemas.Item.t(),
  keyword()
) :: [group()]

Searches the PDF library for any active PDF whose pages match one of the item's translated names.

Returns groups keyed by PDF, each carrying that PDF's total match count (total_matches) plus the first :per_pdf hits (default 5). Use more_pdf_matches_for_item/3 to load additional hits within one PDF on demand (the "Show more matches" expand action).

Strategy:

  1. Build the title list from the item's primary name + every enabled language's translated name. Drop blanks and duplicates.
  2. Literal ILIKE ANY against the deduped page-content table — fast and precise. Joined to active phoenix_kit_cat_pdfs rows via file_uuid. Rows are window-ranked per PDF and window-counted per PDF in a single SQL pass; the outer query caps at rn <= per_pdf so the result is bounded by per_pdf × distinct PDFs that match.
  3. If literal returns nothing, fall back to a pg_trgm similarity search using the longest title (default threshold 0.4) — same grouping shape, best similarity first within each PDF.

Trashed PDFs are excluded. Groups are ordered newest-PDF-first.

Options

  • :per_pdf (default 5) — preview hits returned per PDF.
  • :similarity_threshold (default 0.4) — trigram fallback threshold.
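Steps 1 and 3 of the strategy can be sketched without a database. The SearchSketch module is hypothetical; literal_fun and trigram_fun stand in for the two SQL passes, and the choice to trim whitespace and dedupe by exact string equality is an assumption (the real implementation may normalise titles differently).

```elixir
defmodule SearchSketch do
  # Step 1: primary name plus translated names, blanks and duplicates dropped.
  def titles(primary, translations) do
    [primary | translations]
    |> Enum.map(&String.trim(&1 || ""))
    |> Enum.reject(&(&1 == ""))
    |> Enum.uniq()
  end

  # The trigram fallback uses the longest title; nil when there are none.
  def longest(titles), do: Enum.max_by(titles, &String.length/1, fn -> nil end)

  # Steps 2 and 3: literal search first, trigram only if literal finds nothing.
  def search([], _literal_fun, _trigram_fun), do: []

  def search(titles, literal_fun, trigram_fun) do
    case literal_fun.(titles) do
      [] -> trigram_fun.(longest(titles))
      groups -> groups
    end
  end
end
```

Running the cheaper, more precise literal pass first means the fuzzy pass only costs anything for items that would otherwise return no results.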

trash_pdf(pdf, opts \\ [])

Soft-deletes a PDF: flips status to "trashed" and records trashed_at. The underlying file, extraction, and page rows are left untouched, since other live PDF entries may still reference them.