PhoenixKit.Migrations.Postgres.V111 (phoenix_kit v1.7.116)


V111: PDF library tables for the catalogue module.

Backs the "PDFs" subtab in phoenix_kit_catalogue, layered on top of core's phoenix_kit_files for binary storage / dedup / soft-delete / multi-bucket redundancy. Catalogue owns only the per-page text index and the user-facing per-upload row.

Tables

  • phoenix_kit_cat_pdfs: thin per-upload row, one per "user uploaded this name" event. file_uuid is a foreign key to phoenix_kit_files.uuid with ON DELETE RESTRICT: catalogue manages the lifecycle, so core's prune cannot remove a file that a catalogue row still references. Two uploads of identical content under different filenames produce two phoenix_kit_cat_pdfs rows, one shared phoenix_kit_files row, and one shared extraction. Soft-delete uses the status sentinel "active" / "trashed" (workspace convention), plus trashed_at so the UI can show how long ago a row was trashed.

  • phoenix_kit_cat_pdf_extractions: one row per unique PDF content, with file_uuid as the primary key. Holds the worker's state machine (pending → extracting → extracted | scanned_no_text | failed), page_count, extracted_at, and error_message. Rows are removed by cascade when the file row is hard-deleted.

  • phoenix_kit_cat_pdf_page_contents: content-addressed dedup cache, keyed by content_hash (the SHA-256 hex digest of the page's normalized text). Page text that repeats across PDFs (boilerplate, legal disclaimers, cross-referenced product entries) is stored once. The GIN trigram index on text lives here, so the search index doesn't grow with duplication.

  • phoenix_kit_cat_pdf_pages: per-page join with composite PK (file_uuid, page_number). References the file with ON DELETE CASCADE and the page-content cache with ON DELETE RESTRICT. Orphaned content rows are garbage-collected by a catalogue-side helper rather than by FK cascade, so the cache doesn't churn during normal upload/delete cycles.
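The content-addressed layout described above can be illustrated with a minimal in-memory sketch. The module name PageDedupSketch, the normalize/1 rules (trim plus whitespace collapsing), and the choice of lowercase hex are assumptions made for illustration; the catalogue's actual normalization is not specified here.

```elixir
defmodule PageDedupSketch do
  @moduledoc false

  # Hypothetical normalization: trim the text and collapse whitespace runs.
  # The real normalization rules are not stated in this documentation.
  def normalize(text) do
    text
    |> String.trim()
    |> String.replace(~r/\s+/u, " ")
  end

  # SHA-256 of the normalized text as a 64-character hex string.
  # Lowercase hex is an assumption.
  def content_hash(text) do
    :crypto.hash(:sha256, normalize(text))
    |> Base.encode16(case: :lower)
  end

  # Takes [{file_uuid, page_number, text}] and returns {cache, joins}:
  #   cache - %{content_hash => normalized_text}, one entry per unique page text
  #           (stands in for phoenix_kit_cat_pdf_page_contents)
  #   joins - [{file_uuid, page_number, content_hash}], one entry per page
  #           (stands in for phoenix_kit_cat_pdf_pages)
  def index_pages(pages) do
    {cache, joins} =
      Enum.reduce(pages, {%{}, []}, fn {file_uuid, page_number, text}, {cache, joins} ->
        hash = content_hash(text)
        cache = Map.put_new(cache, hash, normalize(text))
        {cache, [{file_uuid, page_number, hash} | joins]}
      end)

    {cache, Enum.reverse(joins)}
  end
end

pages = [
  {"file-a", 1, "Legal  disclaimer\n"},
  {"file-a", 2, "Product A"},
  {"file-b", 1, "Legal disclaimer"}
]

{cache, joins} = PageDedupSketch.index_pages(pages)
IO.inspect(map_size(cache), label: "unique page texts")
IO.inspect(length(joins), label: "page rows")
```

In this sketch the disclaimer page appears in two files but occupies a single cache entry, while each page still gets its own join row, which mirrors why the trigram index only has to cover unique page text.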

The migration also enables the pg_trgm extension, which the trigram index requires.
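As a rough illustration of what enabling pg_trgm and adding a GIN trigram index can look like in an Ecto migration, here is a sketch. It is not the body of V111: the module name, the index name, and the assumption that the indexed column is literally named text are all hypothetical, and the real migration may handle prefixes and rollback differently.

```elixir
defmodule TrigramIndexSketch do
  # Illustrative only; requires Ecto SQL and a PostgreSQL repo to run.
  use Ecto.Migration

  def up do
    # pg_trgm provides the gin_trgm_ops operator class used below.
    execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")

    # Hypothetical index name; the column name "text" is assumed.
    execute("""
    CREATE INDEX IF NOT EXISTS cat_pdf_page_contents_text_trgm_idx
    ON phoenix_kit_cat_pdf_page_contents
    USING gin ("text" gin_trgm_ops)
    """)
  end

  def down do
    execute("DROP INDEX IF EXISTS cat_pdf_page_contents_text_trgm_idx")
    # The extension is left in place in this sketch, since other
    # objects in the database may depend on it.
  end
end
```

With an index like this, PostgreSQL can use the trigram index for LIKE / ILIKE patterns and pg_trgm similarity operators on the indexed column, rather than scanning every page row.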

Summary

Functions

down(opts)

up(opts)