PhoenixKitCatalogue.Schemas.PdfPageContent (PhoenixKitCatalogue v0.1.17)


Content-addressed cache of PDF page text.

Rows are keyed by content_hash, the SHA-256 hex digest of the page's normalized text. When the same page text appears in multiple PDFs (cross-referenced supplier catalogues, shared boilerplate, repeated legal disclaimers), it is stored once.
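As a rough illustration of content addressing, the key could be derived along these lines. This is a minimal sketch, not the library's code: the module name `PageHashSketch` and the normalization rule (trim, then collapse whitespace runs) are assumptions, since the actual normalization is not described here.

```elixir
defmodule PageHashSketch do
  @moduledoc false

  # ASSUMPTION: normalization = trim + collapse whitespace runs to one space.
  # The normalization PhoenixKitCatalogue really applies is not documented here.
  def normalize(text) when is_binary(text) do
    text
    |> String.trim()
    |> String.replace(~r/\s+/u, " ")
  end

  # SHA-256 of the normalized text, encoded as lowercase hex (64 characters).
  def content_hash(text) when is_binary(text) do
    :sha256
    |> :crypto.hash(normalize(text))
    |> Base.encode16(case: :lower)
  end
end
```

Under this assumed normalization, two pages that differ only in spacing or line breaks map to the same key and therefore share one row.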

The GIN trigram index lives on the text column of this table. Duplicate page text is indexed only once, so the index grows with the number of distinct pages rather than with the total page count.

Rows are write-once: a page either references an existing row or inserts a new one (insert-on-conflict-do-nothing). Orphaned rows (those that no pdf_pages row references) are removed by a catalogue-side GC helper, not by FK cascade. The pdf_pages.content_hash foreign key is ON DELETE RESTRICT, which keeps the cache stable during normal upload/delete cycles.
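The write-once insert and the orphan sweep might look roughly like the sketch below. Everything named here is hypothetical: the module `PageContentStoreSketch`, its function names, the table name `pdf_page_contents`, and the `inserted_at` column type are assumptions made for illustration. The only calls relied on are `insert_all/3` and `query!/2` as provided by an Ecto repo with a SQL adapter; the repo module is passed in as an argument.

```elixir
defmodule PageContentStoreSketch do
  @moduledoc false

  # ASSUMPTION: the real table name is not given in these docs.
  @table "pdf_page_contents"

  # Insert-on-conflict-do-nothing: if a row with this content_hash already
  # exists, the insert is a no-op and the existing row is reused as-is.
  def put(repo, content_hash, text) do
    row = %{
      content_hash: content_hash,
      text: text,
      # ASSUMPTION: inserted_at is set by the caller, not by a DB default.
      inserted_at: DateTime.utc_now()
    }

    {_count, _returned} =
      repo.insert_all(@table, [row],
        on_conflict: :nothing,
        conflict_target: [:content_hash]
      )

    content_hash
  end

  # GC sweep: delete cache rows that no pdf_pages row references.
  # Referenced rows are never touched, so ON DELETE RESTRICT is not triggered.
  def delete_orphans(repo) do
    repo.query!(
      """
      DELETE FROM pdf_page_contents AS c
      WHERE NOT EXISTS (
        SELECT 1 FROM pdf_pages AS p WHERE p.content_hash = c.content_hash
      )
      """,
      []
    )
  end
end
```

A caller would use something like `PageContentStoreSketch.put(MyApp.Repo, hash, text)` before inserting the pdf_pages row that points at `hash`, where `MyApp.Repo` stands in for the host application's repo.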

Summary

Types

t()

@type t() :: %PhoenixKitCatalogue.Schemas.PdfPageContent{
  __meta__: term(),
  content_hash: term(),
  inserted_at: term(),
  text: term()
}

Functions

changeset(content, attrs)

@spec changeset(
  t()
  | %PhoenixKitCatalogue.Schemas.PdfPageContent{
      __meta__: term(),
      content_hash: term(),
      inserted_at: term(),
      text: term()
    },
  map()
) :: Ecto.Changeset.t(t())