PDF library — upload, extract, search.
Layered on top of core's phoenix_kit_files system. The catalogue
owns only:
- phoenix_kit_cat_pdfs: per-upload row (the user-facing "this name in the library"). Soft-delete via status (active / trashed).
- phoenix_kit_cat_pdf_extractions: per unique file content (one row per file_uuid). Holds the worker state machine.
- phoenix_kit_cat_pdf_pages: per-page join.
- phoenix_kit_cat_pdf_page_contents: content-addressed page text dedup cache.
Core handles binary storage, content checksum dedup, multi-bucket
redundancy, on-disk lifecycle (Storage.trash_file/1,
PruneTrashJob).
Public surface re-exported from PhoenixKitCatalogue.Catalogue.
Activity logging follows the catalogue convention — success-only on
the context layer; the LV layer's Web.Helpers.log_operation_error/3
writes the db_pending: true audit row on failure.
Authorization
The mutating context functions accept :actor_uuid for activity
attribution but do not enforce role checks — authorization is
the LV mount layer's job (admin live_session + on_mount hook).
Same convention as the rest of the catalogue context. New non-LV
callers (background jobs, RPC, extension modules) MUST verify the
caller is allowed before invoking these functions.
create_pdf_from_upload/3 does require a non-nil :actor_uuid —
not as authorization, but because core's phoenix_kit_files.user_uuid
is NOT NULL and we'd otherwise crash mid-flow after writing bytes
to disk. Returns {:error, :missing_actor} cleanly when missing.
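The guard described above can be sketched as follows. This is a minimal illustration, not the module's actual code: ActorGuardSketch and the shape of the success value are hypothetical, and only the early return on a missing :actor_uuid is taken from the text.

```elixir
defmodule ActorGuardSketch do
  # Hypothetical stand-in for the real upload function; only the guard is modelled.
  def create_pdf_from_upload(tmp_path, original_filename, opts \\ []) do
    case Keyword.get(opts, :actor_uuid) do
      nil ->
        # Bail out before any bytes are stored, so no NOT NULL violation can
        # happen halfway through the flow.
        {:error, :missing_actor}

      actor_uuid when is_binary(actor_uuid) ->
        # Placeholder for the real storage and insert steps.
        {:ok, %{path: tmp_path, original_filename: original_filename, user_uuid: actor_uuid}}
    end
  end
end
```

Checking the actor first keeps the failure cheap: nothing has been written, so there is nothing to roll back.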
Types
@type group() :: %{ pdf: PhoenixKitCatalogue.Schemas.Pdf.t(), total_matches: non_neg_integer(), hits: [hit()] }
Per-PDF group returned by search_pdfs_for_item/2.
@type hit() :: %{ pdf: PhoenixKitCatalogue.Schemas.Pdf.t(), page_number: pos_integer(), snippet: String.t(), score: float() }
One PDF search hit returned to the UI.
Functions
@spec count_pdfs(keyword()) :: non_neg_integer()
Returns the total PDF count, matching the optional status filter.
@spec create_pdf_from_upload(String.t(), String.t(), keyword()) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, term()}
Stores an uploaded PDF.
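tmp_path is the local file from consume_uploaded_entry's callback.
original_filename is the user's chosen name. The third argument is a
keyword list of options; :actor_uuid is required. byte_size is not
passed in: it is read from the stored file, not from entry.client_size.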
Flow:
- Storage.store_file/2 (core): handles SHA-256 dedup, on-disk placement, multi-bucket redundancy. Same content uploaded twice (any name) returns the same file_uuid.
- Upsert the per-file extraction row. If newly created, enqueue the worker; otherwise the previous extraction is reused.
- Always insert a fresh phoenix_kit_cat_pdfs row so each upload gets its own per-name entry in the library.
- Activity action: pdf.uploaded. Metadata flags content_dedup: true when the file row was a hit.
Returns {:ok, pdf} on success.
The persisted byte_size is read from the file on disk via
File.stat!/1 — never from a browser-supplied value — so the
recorded size always matches the actual stored bytes.
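The upload flow can be modelled in memory to show how content dedup and per-upload rows interact. This is a sketch under assumptions: UploadFlowSketch, its state map, and the made-up file_uuid format are hypothetical, and the real work is done by core's Storage.store_file/2 and the database. Only the SHA-256 keying, the "enqueue only when new" rule, the always-fresh per-upload row, and the on-disk size come from the description above.

```elixir
defmodule UploadFlowSketch do
  # In-memory model: `files` maps checksum => file_uuid, `pdfs` holds the
  # per-upload rows, `enqueued` records which files got an extraction job.
  def new, do: %{files: %{}, pdfs: [], enqueued: []}

  def upload(state, tmp_path, original_filename) do
    checksum =
      tmp_path
      |> File.read!()
      |> then(&:crypto.hash(:sha256, &1))
      |> Base.encode16(case: :lower)

    # Size comes from the file on disk, never from a client-supplied value.
    %File.Stat{size: size} = File.stat!(tmp_path)

    {file_uuid, dedup?, state} =
      case Map.fetch(state.files, checksum) do
        {:ok, uuid} ->
          # Same bytes seen before: reuse the file and its extraction.
          {uuid, true, state}

        :error ->
          # Hypothetical id scheme, for illustration only.
          uuid = "file-" <> String.slice(checksum, 0, 8)
          files = Map.put(state.files, checksum, uuid)
          {uuid, false, %{state | files: files, enqueued: [uuid | state.enqueued]}}
      end

    # Every upload gets its own library entry, even on a dedup hit.
    pdf = %{
      file_uuid: file_uuid,
      original_filename: original_filename,
      byte_size: size,
      content_dedup: dedup?
    }

    {pdf, %{state | pdfs: [pdf | state.pdfs]}}
  end
end
```

Uploading the same bytes under two names yields two library entries that share one file_uuid, with a single extraction job enqueued.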
@spec get_extraction(PhoenixKitCatalogue.Schemas.Pdf.t() | Ecto.UUID.t()) :: PhoenixKitCatalogue.Schemas.PdfExtraction.t() | nil
Returns the extraction state for a PDF (or its file_uuid), or
nil if the file has no extraction row yet.
@spec get_pdf(Ecto.UUID.t()) :: PhoenixKitCatalogue.Schemas.Pdf.t() | nil
Fetches a PDF by UUID. Returns nil if not found.
@spec get_pdf!(Ecto.UUID.t()) :: PhoenixKitCatalogue.Schemas.Pdf.t()
Fetches a PDF by UUID. Raises Ecto.NoResultsError if not found.
@spec list_pdfs(keyword()) :: [PhoenixKitCatalogue.Schemas.Pdf.t()]
Lists PDFs in the library, newest first.
Options
- :status (default "active") - filter to a status string ("active" / "trashed"). Pass nil to include all.
- :limit (default 100)
- :offset (default 0)
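One subtlety in these options is that an explicit status: nil must differ from an omitted :status. The sketch below shows one way that can work, using an in-memory list instead of a query. ListOptsSketch and the integer inserted_at field are assumptions made for illustration; the real function may resolve its options differently.

```elixir
defmodule ListOptsSketch do
  def list_pdfs(pdfs, opts \\ []) do
    # Keyword.get/3 applies the default only when the key is absent,
    # so an explicit `status: nil` survives and disables the filter.
    status = Keyword.get(opts, :status, "active")
    limit = Keyword.get(opts, :limit, 100)
    offset = Keyword.get(opts, :offset, 0)

    pdfs
    |> Enum.filter(fn pdf -> is_nil(status) or pdf.status == status end)
    # Newest first; `inserted_at` is a plain integer in this sketch.
    |> Enum.sort_by(& &1.inserted_at, :desc)
    |> Enum.drop(offset)
    |> Enum.take(limit)
  end
end
```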
@spec more_pdf_matches_for_item( PhoenixKitCatalogue.Schemas.Item.t(), Ecto.UUID.t(), keyword() ) :: [ hit() ]
Loads additional hits for one PDF beyond what the initial grouped search returned. Used by the modal's per-PDF "Show more matches" expand action.
Returns a flat list of hit() ordered by page_number ASC (literal
search) or similarity DESC (when a :trigram_query opt is given).
Options
- :offset (default 0)
- :limit (default 50)
- :trigram_query - when set, score by pg_trgm similarity against this string (matches the trigram fallback's ordering).
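The two orderings and the paging options can be illustrated on an in-memory list of hits. MoreMatchesSketch is hypothetical and assumes each hit already carries its score; in the real function the ordering and the similarity score come from SQL.

```elixir
defmodule MoreMatchesSketch do
  def page(hits, opts \\ []) do
    offset = Keyword.get(opts, :offset, 0)
    limit = Keyword.get(opts, :limit, 50)

    sorted =
      if Keyword.get(opts, :trigram_query) do
        # Trigram mode: best similarity first.
        Enum.sort_by(hits, & &1.score, :desc)
      else
        # Literal mode: reading order.
        Enum.sort_by(hits, & &1.page_number, :asc)
      end

    sorted |> Enum.drop(offset) |> Enum.take(limit)
  end
end
```

A caller that showed the first five hits from the grouped search would plausibly request the next batch with offset: 5.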
@spec permanently_delete_pdf( PhoenixKitCatalogue.Schemas.Pdf.t(), keyword() ) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}
Permanently removes a phoenix_kit_cat_pdfs row.
When this is the last (active OR trashed) row referencing the
underlying file_uuid, hands the file off to Storage.trash_file/1
so core's daily PruneTrashJob deletes the binary, cascading to
the extraction and page rows.
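The "last reference" rule can be sketched over an in-memory list of rows. PermanentDeleteSketch and its return tuples are hypothetical; the point is only that trashed rows still count as references, so the file is handed off when no row of any status remains.

```elixir
defmodule PermanentDeleteSketch do
  # Returns {remaining_rows, action}. Raises MatchError if pdf_uuid is unknown.
  def delete(pdfs, pdf_uuid) do
    {[target], rest} = Enum.split_with(pdfs, &(&1.uuid == pdf_uuid))

    # Status is deliberately ignored: active and trashed rows both keep the file alive.
    last_reference? = not Enum.any?(rest, &(&1.file_uuid == target.file_uuid))

    action = if last_reference?, do: {:trash_file, target.file_uuid}, else: :keep_file
    {rest, action}
  end
end
```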
@spec prune_orphan_page_contents() :: non_neg_integer()
Removes phoenix_kit_cat_pdf_page_contents rows that no
phoenix_kit_cat_pdf_pages row references anymore. Safe to call
any time.
Returns the number of rows removed. Suitable for wiring to a daily Oban cron once the corpus is large enough to care.
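Orphan pruning amounts to an anti-join between the content cache and the page rows. A minimal in-memory sketch, with hypothetical field names (hash, content_hash) and a hypothetical module name:

```elixir
defmodule PruneSketch do
  # Returns {kept_contents, number_removed}.
  def prune_orphans(page_contents, pages) do
    referenced = MapSet.new(pages, & &1.content_hash)

    {kept, orphans} =
      Enum.split_with(page_contents, &MapSet.member?(referenced, &1.hash))

    {kept, length(orphans)}
  end
end
```

Running it twice in a row removes nothing the second time, which is the sense in which such a cleanup is safe to call at any time.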
@spec restore_pdf( PhoenixKitCatalogue.Schemas.Pdf.t(), keyword() ) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}
Restores a trashed PDF back to active.
@spec search_pdfs_for_item( PhoenixKitCatalogue.Schemas.Item.t(), keyword() ) :: [group()]
Searches the PDF library for any active PDF whose pages match one of the item's translated names.
Returns groups keyed by PDF, each with that PDF's total match count
plus the first :per_pdf hits (default 5). Use
more_pdf_matches_for_item/3 to load additional hits within one PDF
on demand (the "Show more matches" expand action).
Strategy:
- Build the title list from the item's primary name + every enabled language's translated name. Drop blanks and duplicates.
- Literal ILIKE ANY against the deduped page-content table: fast and precise. Joined to active phoenix_kit_cat_pdfs rows via file_uuid. Rows are window-ranked per PDF and window-counted per PDF in a single SQL pass; the outer query caps at rn <= per_pdf so the result is bounded by per_pdf × distinct PDFs that match.
- If literal returns nothing, fall back to a pg_trgm similarity search using the longest title (default threshold 0.4): same grouping shape, best similarity first within each PDF.
Trashed PDFs are excluded. Groups are ordered newest-PDF-first.
Options
- :per_pdf (default 5) - preview hits returned per PDF.
- :similarity_threshold (default 0.4) - trigram fallback threshold.
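The title-list step and the per-PDF cap can be illustrated without a database. The sketch below is an approximation under stated assumptions: SearchSketch and the page map shape are hypothetical, a case-insensitive substring check stands in for ILIKE, and Enum.group_by/2 stands in for the SQL window functions. The trigram fallback is not modelled.

```elixir
defmodule SearchSketch do
  # Primary name plus translations, with blanks and duplicates dropped.
  def titles(primary, translations) do
    [primary | translations]
    |> Enum.map(&String.trim(&1 || ""))
    |> Enum.reject(&(&1 == ""))
    |> Enum.uniq()
  end

  # The fallback searches with the longest title. Raises on an empty list.
  def longest(titles), do: Enum.max_by(titles, &String.length/1)

  # pages: [%{pdf_uuid: ..., page_number: ..., text: ...}]
  def group_literal(pages, titles, per_pdf \\ 5) do
    needles = Enum.map(titles, &String.downcase/1)

    pages
    |> Enum.filter(fn page ->
      text = String.downcase(page.text)
      Enum.any?(needles, &String.contains?(text, &1))
    end)
    |> Enum.group_by(& &1.pdf_uuid)
    |> Enum.map(fn {pdf_uuid, hits} ->
      sorted = Enum.sort_by(hits, & &1.page_number)

      # total_matches counts every hit; only the first per_pdf are returned.
      %{pdf_uuid: pdf_uuid, total_matches: length(sorted), hits: Enum.take(sorted, per_pdf)}
    end)
  end
end
```

Because total_matches is computed before the cap, a UI can tell from one result whether more matches remain to be loaded for that PDF.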
@spec trash_pdf( PhoenixKitCatalogue.Schemas.Pdf.t(), keyword() ) :: {:ok, PhoenixKitCatalogue.Schemas.Pdf.t()} | {:error, Ecto.Changeset.t()}
Soft-deletes a PDF: flips status to "trashed" and records
trashed_at. Underlying file + extraction + page rows untouched
(other live PDF entries may still reference them).
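The soft-delete is a change to the row alone, which can be sketched on a plain map. TrashSketch is hypothetical, and clearing trashed_at on restore is an assumption: the text only says that restore returns the PDF to active.

```elixir
defmodule TrashSketch do
  # Flip the status and stamp the time; nothing else is touched.
  def trash(pdf, now \\ DateTime.utc_now()) do
    %{pdf | status: "trashed", trashed_at: now}
  end

  # Assumption: restoring also clears the timestamp.
  def restore(pdf), do: %{pdf | status: "active", trashed_at: nil}
end
```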