Content-addressed cache of PDF page text.
Keyed by content_hash (SHA-256 hex of the page's normalized text).
Same page text appearing in multiple PDFs (cross-referenced supplier
catalogues, shared boilerplate, repeated legal disclaimers) is stored
once.
The GIN trigram index lives on text here — duplicates indexed only
once, so the index stays small as the corpus grows.
Write-once: pages either reference an existing row or insert a new
one (insert-on-conflict-do-nothing). Orphaned rows (no pdf_pages
row referencing them) are removed by a catalogue-side GC helper, not
by FK cascade — pdf_pages.content_hash → ON DELETE RESTRICT keeps
the cache stable during normal upload/delete cycles.