This note records the next credible Postgres bulk-ingest direction for Exograph. The small --postgres-copy experiment that copied only fragment_terms did not improve indexing because Exograph currently appends many package-sized batches; COPY setup overhead dominated.
A fair COPY implementation should batch across packages and load larger staging tables.
Goals
- Preserve existing functionality and persisted data: files, fragments, ASTs, hashes, terms, facts, graph nodes, call edges, and package scope.
- Preserve upsert/conflict semantics for package versions, files, fragments, and terms.
- Keep stable IDs available before dependent rows are inserted.
- Keep query indexes deferred until after the load when
--postgres-defer-indexesis enabled.
Proposed pipeline
- Create regular target tables with only required unique indexes/constraints.
- Create temporary or unlogged staging tables for high-volume rows:
- files
- fragments
- comments
- definitions
- references
- graph nodes
- call edges
- terms
- fragment_terms
- Accumulate rows across many packages in memory or bounded disk-backed chunks.
- COPY staging chunks using
COPY ... FROM STDIN. - Merge dimensions first:
- packages
- package_versions
- terms
- files
- fragments
- Fetch generated IDs into hash maps keyed by natural keys:
- package
(ecosystem, name) - package version
(package_id, version) - file
(package_version_id, sha256) - fragment
content_hash - term text
- package
- Rewrite dependent staging rows with database IDs.
- COPY/insert fact tables in dependency order.
- Build deferred indexes and run
ANALYZE.
Candidate implementation shape
Add an opt-in loader module rather than complicating ordinary per-package indexing:
Exograph.Postgres.StagingLoader.load_packages(repo, prefix, package_rows, opts)The Hex corpus task can route to it only when all of these are true:
- backend is Postgres
postgres_copy?: true- package workload is corpus-style, not ad-hoc source indexing
- staging batch size is above a threshold
Risks
- Memory blow-up if too many package rows are accumulated.
- Duplicate handling must remain identical to current
insert_allconflict behavior. - Foreign-key order and ID resolution add complexity.
- Temporary/unlogged staging tables are appropriate for rebuildable local indexes, not durable ingestion guarantees.
Recommendation
Do not extend the current row-by-row COPY helper further. The next useful implementation should be a dedicated staging loader with large batches and explicit ID-resolution phases. Until then, keep --postgres-copy documented as experimental and do not use it for headline benchmark claims.