Postgres COPY staging design

Copy Markdown View Source

This note records the next credible Postgres bulk-ingest direction for Exograph. The small --postgres-copy experiment that copied only fragment_terms did not improve indexing because Exograph currently appends many package-sized batches; COPY setup overhead dominated.

A fair COPY implementation should batch across packages and load larger staging tables.

Goals

  • Preserve existing functionality and persisted data: files, fragments, ASTs, hashes, terms, facts, graph nodes, call edges, and package scope.
  • Preserve upsert/conflict semantics for package versions, files, fragments, and terms.
  • Keep stable IDs available before dependent rows are inserted.
  • Keep query indexes deferred until after the load when --postgres-defer-indexes is enabled.

Proposed pipeline

  1. Create regular target tables with only required unique indexes/constraints.
  2. Create temporary or unlogged staging tables for high-volume rows:
    • files
    • fragments
    • comments
    • definitions
    • references
    • graph nodes
    • call edges
    • terms
    • fragment_terms
  3. Accumulate rows across many packages in memory or bounded disk-backed chunks.
  4. COPY staging chunks using COPY ... FROM STDIN.
  5. Merge dimensions first:
    • packages
    • package_versions
    • terms
    • files
    • fragments
  6. Fetch generated IDs into hash maps keyed by natural keys:
    • package (ecosystem, name)
    • package version (package_id, version)
    • file (package_version_id, sha256)
    • fragment content_hash
    • term text
  7. Rewrite dependent staging rows with database IDs.
  8. COPY/insert fact tables in dependency order.
  9. Build deferred indexes and run ANALYZE.

Candidate implementation shape

Add an opt-in loader module rather than complicating ordinary per-package indexing:

Exograph.Postgres.StagingLoader.load_packages(repo, prefix, package_rows, opts)

The Hex corpus task can route to it only when all of these are true:

  • backend is Postgres
  • postgres_copy?: true
  • package workload is corpus-style, not ad-hoc source indexing
  • staging batch size is above a threshold

Risks

  • Memory blow-up if too many package rows are accumulated.
  • Duplicate handling must remain identical to current insert_all conflict behavior.
  • Foreign-key order and ID resolution add complexity.
  • Temporary/unlogged staging tables are appropriate for rebuildable local indexes, not durable ingestion guarantees.

Recommendation

Do not extend the current row-by-row COPY helper further. The next useful implementation should be a dedicated staging loader with large batches and explicit ID-resolution phases. Until then, keep --postgres-copy documented as experimental and do not use it for headline benchmark claims.