These are local Exograph backend benchmarks for the Hex.pm top workload. They are intended to compare Exograph's current backend implementations on this machine, not to make a universal claim about PostgreSQL or DuckDB.
Method
All runs used the same package cache and full Exograph persistence: files, fragments, ASTs, hashes, symbols, references, terms, and queryable facts were retained.
Common settings:
--mode top
--runs 3
--concurrency 4
--index-concurrency 4
--duckdb-threads 1
--postgres-defer-indexes
--postgres-synchronous-commit off
--postgres-maintenance-work-mem 512MB
--postgres-max-parallel-maintenance-workers 2
DuckDB sharded runs used:
--duckdb-shards 4 --duckdb-recovery-mode no_wal_writes
Postgres settings are a rebuildable/local-index challenge mode: deferred non-unique query indexes, synchronous_commit=off, and larger maintenance memory. They are not durable-production defaults.
Indexing medians
| Workload | Postgres plain | DuckDB plain | DuckDB sharded plain | Result |
|---|---|---|---|---|
top --limit 100 | 38.42s | 39.22s | 41.17s | tuned Postgres slightly faster |
top --limit 500 | 181.22s | 109.27s | 91.05s | DuckDB 1.66× faster; sharded DuckDB 1.99× faster |
For limit 100, the systems are close and tuned Postgres wins indexing. For limit 500, DuckDB wins indexing, and sharding improves throughput further.
Query medians
top --limit 100
| Query | Postgres plain | DuckDB plain | DuckDB sharded plain |
|---|---|---|---|
api_text_defmodule | 71.1ms | 27.7ms | 45.4ms |
references_enum | 24.1ms | 2.3ms | 3.3ms |
files_defmodule | 31.0ms | 7.3ms | 2.4ms |
api_comments_todo | 134.3ms | 129.8ms | 65.6ms |
top --limit 500
| Query | Postgres plain | DuckDB plain | DuckDB sharded plain |
|---|---|---|---|
api_text_defmodule | 134.2ms | 49.0ms | 37.0ms |
references_enum | 56.7ms | 8.9ms | 9.9ms |
files_defmodule | 98.3ms | 23.6ms | 7.1ms |
api_comments_todo | 143.3ms | 159.1ms | 199.9ms |
Search/query paths usually favor DuckDB materially, especially on the larger workload.
DuckDB ingestion decision checkpoint, June 2026
The benchmark evidence now supports keeping single-DuckDB online MERGE as the default ingestion path. It is correct, stable, uses released QuackDB dependencies, and has repeated fixed top500/top2000 parity. Sharding is faster for indexing (70.2s fixed top500 with four shards and global concurrency 8), but it changes global content/term de-duplication and broad structural search semantics, so it should remain an opt-in/product architecture candidate rather than a transparent speed flag.
Future work should choose between two explicit designs:
- sharded read architecture with documented shard-local/global semantics and package-scoped parity tests;
- global finalization pipeline that rebuilds one logical index after extraction/staging.
Do not spend more time on the exhausted micro-optimizations without new evidence: anti-join insert-select, direct append bypass, persistent staging, narrow staging, Quack structured logs, serial ingestion, transaction semaphores, deferred-term offline staging, larger fragment-stage chunks, or simple fanout result dedup.
DuckDB Hex corpus notes, June 2026
A focused DuckDB Hex-corpus tuning pass used the local Hex tarball mirror and 16 DuckDB shards:
mix exograph.index.hex \
--backend duckdb \
--mode top \
--limit 2000 \
--concurrency 16 \
--duckdb-shards 16 \
--duckdb-threads 1 \
--duckdb-recovery-mode no_wal_writes \
--tarball-dir /srv/toys/hex-mirror/tarballs \
--missing-tarballs-report-path /tmp/exograph-missing-top2000.json
The clean baseline after QuackDB 0.5.7 was later rechecked after QuackDB 0.5.8's DML-builder refactor. The 0.5.8 run used an equivalent temporary tarball mirror because amqp@4.1.1 was missing from the shared mirror.
| QuackDB | Workload | Indexed | Skipped | Failed | Index elapsed | Wall time |
|---|---|---|---|---|---|---|
0.5.7 | top --limit 2000 | 1635 | 365 | 0 | 158.09s | 168.21s |
0.5.8 | top --limit 2000 | 1635 | 365 | 0 | 156.99s | 166.86s |
No missing local tarballs were reported for either completed run. The largest cumulative timing buckets in the 0.5.7 baseline were:
| Stage | Total |
|---|---|
fragment_store_put | 1094.4s |
fragment_store_upsert_fragments | 555.0s |
fragment_store_code_facts | 433.2s |
fragment_store_resolve_fragment_ids | 237.1s |
fragment_append_rows | 230.7s |
code_facts_insert_references | 179.5s |
fragment_store_build_fragment_rows | 160.6s |
fragment_store_normalize_terms | 150.9s |
QuackDB 0.5.6 and 0.5.7 included small adapter/protocol improvements for Ecto native append paths. QuackDB 0.5.8 reused QuackDB's DML builder for the same Ecto append temporary-table SQL and was performance-neutral in the full workload. These changes were positive or neutral but not large enough to change the main bottleneck: fragment append/upsert remains dominated by the append + conflict-ignore + returning/staging path.
MERGE fragment append
DuckDB docs and source notes indicate that recent MERGE paths can choose better bulk strategies than older ON CONFLICT index-conflict paths. After repeated parity benchmarks, Exograph uses the DuckDB MERGE fragment append path by default. The previous Ecto-style path remains available for comparison with:
mix exograph.index.hex --duckdb-fragment-append ecto ...
Initial top --limit 2000 results:
| Mode | Indexed | Skipped | Failed | Index elapsed | Wall time | fragment_append_rows total |
|---|---|---|---|---|---|---|
default (0.5.8) | 1635 | 365 | 0 | 156.99s | 166.86s | 241.2s |
--duckdb-fragment-append merge run 1 | 1635 | 365 | 0 | 153.91s | 158.99s | 188.0s |
--duckdb-fragment-append merge run 2 | 1635 | 365 | 0 | 157.16s | 166.95s | 186.1s |
A persisted top --limit 500 quality check matched the default path:
| Query/check | Default | MERGE |
|---|---|---|
Map.get(_, _) | 7 | 7 |
Enum.map(_, _) | 959 | 959 |
| _ |> _ | 1864 | 1864 |
| jason fragments | 1391 | 1391 |
These early results made MERGE promising, but not yet sufficient to change the default. Later runs below resolved correctness blockers and provided the stronger evidence used to switch the DuckDB default to MERGE.
A later single-DB top --limit 500 comparison with local tarballs still did not justify changing the default:
| Fragment append | Indexed | Skipped | Failed | Index elapsed | Wall time | fragment_append_rows total | Notes |
|---|---|---|---|---|---|---|---|
| ecto | 379 | 121 | 0 | 1m42s | 112s | 97.3s | baseline |
| merge | 379 | 121 | 0 | 1m41s | 111s | 96.2s | representative structural/text/package checks matched, but fact table counts differed (definitions +140, references +762, comments +4) and need investigation before further promotion |
A follow-up investigation found two correctness issues around this comparison rather than a proven MERGE quality regression:
- File lookup after file upserts must be scoped by
{package_version_id, sha256}, not only by SHA, because different package versions can legitimately share identical files. duckdb_fragment_appendwas recorded in Hex index reports but was not forwarded fromExograph.Hex.Corpusinto per-packageExograph.index_sources/2calls, so earlier corpus-level MERGE comparisons were not a clean test of the MERGE path.
A focused mariaex@0.9.1 rerun after forwarding the option showed Ecto and MERGE fact counts match exactly (fragments 4614, definitions 1043, references 5307, comments 59). This restores confidence in MERGE correctness for the known repro, but the previous top500 MERGE numbers should be treated as invalidated.
Clean follow-up artifacts live under ignored bench-results/duckdb-merge-20260614/; large .duckdb files were removed after comparison. With --concurrency 4, the Ecto baseline hit a DuckDB unique-constraint race in the append path while indexing cachex@4.1.1, so that run is not a valid quality baseline. MERGE completed the same workload successfully:
| Fragment append | Concurrency | Indexed | Skipped | Failed | Index elapsed | Wall time | fragment_append_rows total | Notes |
|---|---|---|---|---|---|---|---|---|
| ecto | 4 | 378 | 121 | 1 | 1m53s | 125s | 111.3s | failed on duplicate content_hash unique constraint while indexing cachex@4.1.1 |
| merge | 4 | 379 | 121 | 0 | 1m50s | 122s | 93.3s | completed; not directly comparable because Ecto failed |
DuckDB source inspection explains the failure shape: PhysicalInsert::OnConflictHandling verifies conflicts before append, then concurrent transactions can still both try the same unique ART key and one fails at commit. The relevant source paths are src/execution/operator/persistent/physical_insert.cpp, src/common/types/conflict_manager.cpp, and src/execution/index/art/art.cpp. Exograph now retries DuckDB fragment appends on primary-key/unique-constraint races inside the fragment append operation, allowing the losing transaction to re-run after the winner commits.
After that retry guard, the same top --limit 500 --concurrency 4 Ecto workload completed:
| Fragment append | Concurrency | Indexed | Skipped | Failed | Index elapsed | Wall time | fragment_append_rows total | Notes |
|---|---|---|---|---|---|---|---|---|
| ecto + retry | 4 | 379 | 121 | 0 | 1m54s | 124s | 118.1s | clean completion after unique-conflict retry |
Repeated top --limit 500 --concurrency 4 Ecto/MERGE pairs still do not justify making MERGE the default. Artifacts live under ignored bench-results/duckdb-merge-repeat-20260614/; large .duckdb files were removed after each comparison.
| Run | Ecto result | MERGE result | Wall time Ecto/MERGE | fragment_append_rows Ecto/MERGE | Quality/count parity | Notes |
|---|---|---|---|---|---|---|
| 1 | 379 indexed, 0 failed | 378 indexed, 1 failed | 127s / 123s | 119.1s / 102.4s | no | MERGE hit the same DuckDB unique-constraint commit race before broadening the retry guard |
| 2 | 379 indexed, 0 failed | 379 indexed, 0 failed | 130s / 122s | 120.7s / 97.0s | yes | clean MERGE win |
| 3 | 379 indexed, 0 failed | 379 indexed, 0 failed | 126s / 121s | 120.0s / 97.6s | no | representative searches matched, but fact counts differed (definitions +238, references +1227, comments +14) |
| 4 | not rerun | 379 indexed, 0 failed | n/a / 125s | n/a / not summarized here | n/a | after broad retry, MERGE completed but produced the lower fact-count variant (definitions 175434, references 751847, comments 26135) |
The concurrency-4 fact-count nondeterminism was traced to duplicate source paths inside Hex tarballs rather than MERGE itself. mariaex@0.9.1 contains repeated identical entries for paths such as lib/mariaex/protocol.ex and lib/mariaex/row_parser.ex. Exograph now deduplicates source tuples by {path, package_version} before extraction, so duplicate archive entries do not insert duplicate definitions/references/comments.
A focused mariaex@0.9.1 rerun after source dedupe matched Ecto and MERGE exactly with the lower, deduplicated fact counts (fragments 4614, definitions 665, references 3318, comments 41).
Post-dedupe repeated top --limit 500 --concurrency 4 Ecto/MERGE pairs matched exactly for table counts and representative quality probes in clean runs:
| Run | Ecto wall / append | MERGE wall / append | Result |
|---|---|---|---|
| post-dedupe smoke | 129s / 117.3s | 120s / 93.7s | exact parity |
| decision run 1 | 132s / 117.5s | 122s / 96.1s | exact parity |
| decision run 4 | 132s / 115.2s | 153s / 123.2s | exact parity; MERGE wall-time outlier |
| decision run 5 | 132s / 109.7s | 131s / 101.0s | exact parity |
| decision run 6 | 129s / 117.6s | 123s / 99.9s | exact parity |
A clean top --limit 2000 --concurrency 4 pair, after filling missing local tarballs, also matched exactly:
| Fragment append | Concurrency | Indexed | Skipped | Failed | Index elapsed | Wall time | fragment_append_rows total | Notes |
|---|---|---|---|---|---|---|---|---|
| ecto + retry | 4 | 1635 | 365 | 0 | 9m00s | 569s | 652.9s | exact parity baseline |
| merge + retry | 4 | 1635 | 365 | 0 | 8m52s | 565s | 548.4s | exact parity; lower append time |
Based on the repeated exact-parity runs and the larger append-time reduction, DuckDB fragment append now defaults to MERGE while retaining --duckdb-fragment-append ecto as an escape hatch/comparison path.
Default-path validation after switching the default:
| Run | Indexed | Skipped | Failed | Index elapsed | Wall time | duckdb_fragment_append | fragment_append_rows total | Retry count | Notes |
|---|---|---|---|---|---|---|---|---|---|
top --limit 500 | 379 | 121 | 0 | 1m58s | 130s | merge | 100.6s | 0 | default path, no explicit append flag |
top --limit 100 --reach | 75 | 25 | 0 | 1m11s | 78s | merge | 21.6s | 0 | default path with Reach enabled |
A default top --limit 2000 run also reported duckdb_fragment_append: merge, but the live top list had moved beyond the local tarball cache (ash@3.29.0, then llm_db@2026.6.2), so those runs are not clean quality/timing baselines. Use fixed entries or refresh the tarball cache before future top2000 default validation.
The benchmark workflow now supports entry snapshots with --entries-output-path, writing NDJSON lines like {\"name\":\"jason\",\"version\":\"1.4.5\"}. Rerun a fixed set with --entries-file. A fixed top2000 snapshot was created under ignored bench-results/fixed-top2000-20260614/entries.ndjson; after downloading the single missing tarball, the default MERGE run completed cleanly:
| Run | Indexed | Skipped | Failed | Index elapsed | Wall time | duckdb_fragment_append | fragment_append_rows total | Retry count | Notes |
|---|---|---|---|---|---|---|---|---|---|
fixed top2000 entries | 1635 | 365 | 0 | 9m14s | 582s | merge | 596.1s | 1 | --entries-file bench-results/fixed-top2000-20260614/entries.ndjson |
A final fixed-snapshot comparison of explicit Ecto vs default MERGE matched exactly for table counts and representative probes:
| Fragment append | Indexed | Skipped | Failed | Index elapsed | Wall time | fragment_append_rows total | Retry count | Notes |
|---|---|---|---|---|---|---|---|---|
ecto | 1635 | 365 | 0 | 9m49s | 613s | 696.3s | 1 | explicit comparison path |
default merge | 1635 | 365 | 0 | 9m13s | 579s | 580.4s | 0 | default path; exact parity with Ecto |
Post-MERGE cost model
After making MERGE the default, fragment_append_rows is no longer the only story. Timing snapshots now include counter metrics for row volumes (total, avg, max) in addition to duration metrics (total_ms, avg_ms, max_ms). A fixed top500 default-MERGE run (bench-results/stage-cost-20260615/top500-default-timings.json) showed:
| Stage / metric | Value |
|---|---|
| wall time | 135s |
| index elapsed | 123s |
index_sources | 485.6s cumulative |
fragment_store_put | 414.6s cumulative |
fragment_store_upsert_fragments | 239.6s cumulative |
fragment_store_code_facts | 122.5s cumulative |
fragment_store_resolve_fragment_ids | 108.8s cumulative |
fragment_append_rows | 105.6s cumulative |
fragment_store_normalize_terms | 76.3s cumulative |
fragment_store_upsert_terms | 52.8s cumulative |
code_facts_insert_comments | 52.9s cumulative |
| input fragments | 985,339 rows |
| unique fragment rows | 976,408 rows |
| unique terms observed | 459,667 rows |
| reference facts | 739,115 rows |
| definition facts | 158,592 rows |
| comment facts | 21,064 rows |
| file rows | 3,714 rows |
| fragment append retries | 1 |
The next optimization target should be chosen from this profile rather than further MERGE tuning. The remaining storage-heavy cluster is fragment ID resolution + fact insertion + term normalization/upsert. Previous offline term-string batching and hybrid file handling both regressed, so prefer narrower, measurable changes such as reducing per-package ID lookup/fact flush overhead or improving fact/term batching without inflating staged payloads.
A follow-up split of fragment_store_resolve_fragment_ids showed that the fallback lookup is not the bottleneck in the default MERGE path (bench-results/fragment-id-resolution-20260615/top500-default-timings.json):
| Fragment ID resolution sub-stage / metric | Value |
|---|---|
fragment_store_resolve_fragment_ids | 108.6s |
fragment_store_insert_fragments_by_hash | 106.5s |
fragment_append_rows | 105.4s |
fragment_store_missing_hashes | 0.4s |
fragment_store_lookup_existing_fragment_ids | 1.5s |
| unique fragment rows | 985,178 |
| inserted fragment rows returned | 982,754 |
| missing existing fragment IDs | 657 |
So optimizing the fallback content_hash IN (...) lookup is not worth pursuing now. The time attributed to ID resolution is almost entirely the MERGE append itself. Move the next optimization attempt to term/fact batching or broader flush granularity rather than fragment-ID lookup shape.
A follow-up fact-buffer split (bench-results/fact-buffer-cost-20260615/top500-default-timings.json) showed that code-fact insert time is mostly synchronous InsertBuffer.insert/3 backpressure rather than final flush cost:
| Fact/buffer metric | Value |
|---|---|
fragment_store_code_facts | 109.7s |
code_facts_insert_comments | 41.4s |
code_facts_bulk_insert_comments | 41.3s |
code_facts_insert_references | 28.3s |
code_facts_bulk_insert_references | 27.0s |
code_facts_insert_definitions | 8.6s |
code_facts_bulk_insert_definitions | 8.3s |
duckdb_insert_buffer_flush_references | 26.0s across 14 flushes |
duckdb_insert_buffer_flush_definitions | 5.6s across 4 flushes |
duckdb_insert_buffer_flush_comments | 0.25s across 1 flush |
| reference rows enqueued/flushed | 732,375 / 750,994 |
| definition rows enqueued/flushed | 158,315 / 175,266 |
| comment rows enqueued/flushed | 20,283 / 26,131 |
The surprising comments cost is not comment row volume: comments flush only took ~0.25s. The expensive code_facts_bulk_insert_comments timing is the synchronous call into a shared InsertBuffer, which can block behind reference/definition flushes from other package tasks. The next low-risk experiment should therefore target InsertBuffer flush granularity/serialization (for example, larger per-source thresholds or per-source workers), not comment extraction or direct comment insert SQL.
Two immediate InsertBuffer variants did not justify a behavior change:
| Variant | Result |
|---|---|
--duckdb-insert-buffer-size 200000 | completed top500 but regressed wall time to 146s; fewer, larger flushes made final/large flushes expensive (duckdb_insert_buffer_flush 13.7s; definition flush 8.4s) and did not reduce comment enqueue backpressure enough |
| async flush prototype | not kept; naive Task.async from inside the GenServer let task replies hit handle_info before Task.await, causing final flush/reporting to hang |
A proper per-source worker version of the InsertBuffer did help. The public buffer API stayed the same, but each source table now has an independent worker, so comment enqueue calls no longer wait behind reference/definition flushes.
| Run | Indexed | Skipped | Failed | Index elapsed | Wall time | fragment_store_code_facts | code_facts_bulk_insert_comments | fragment_append_rows |
|---|---|---|---|---|---|---|---|---|
| fixed top500 baseline | 379 | 121 | 0 | 116.9s | ~124s | 109.7s | 41.3s | 105.4s |
| fixed top500 per-source workers | 379 | 121 | 0 | 97.6s | 106.7s | 72.8s | 0.05s | n/a in adjacent baseline |
| fixed top2000 default MERGE | 1635 | 365 | 0 | 553.3s | 579s | 537.1s | 233.8s | 580.4s |
| fixed top2000 per-source workers | 1635 | 365 | 0 | 470.9s | 495s | 459.6s | 7.9s | 435.7s |
| fixed top2000 per-source workers repeat | 1635 | 365 | 0 | 536.1s | 561s | 515.6s | 7.3s | 486.8s |
The top2000 runs used bench-results/fixed-top2000-20260614/entries.ndjson, completed with 1635 indexed / 365 skipped / 0 failed, and still reported the default duckdb_fragment_append: merge. This is a real win, but it changes fact flushing from globally serialized to per-source serialized. Keep watching larger fixed-entry runs for DuckDB append contention and retry counts.
A follow-up split showed the remaining code_facts_bulk_insert_references time was mostly same-source worker queueing, not row construction. Switching worker enqueue from synchronous call to non-blocking cast keeps final InsertBuffer.flush/1 as the durability barrier and moved that wait out of per-package indexing:
| Run | Indexed | Skipped | Failed | Index elapsed | Wall time | fragment_store_code_facts | code_facts_bulk_insert_references | reference worker service / flush |
|---|---|---|---|---|---|---|---|---|
| fixed top500 non-blocking enqueue | 379 | 121 | 0 | 89.6s | 99.1s | 28.9s | 0.4s | 17.8s / 18.5s |
| fixed top2000 non-blocking enqueue | 1635 | 365 | 0 | 466.1s | 490s | 147.5s | 1.8s | 101.6s / 101.3s |
Fact row flush counts matched the prior per-source worker top2000 run (references 2,889,685; definitions 748,332; comments 113,887), so this change did not drop buffered facts in the fixed workload. A fresh fixed top2000 quality run also preserved indexed/skipped/failed totals and the same package/search/fact counts as the previous non-blocking summary, with only a 4-row fragment_terms difference out of ~99.4M rows; keep watching this tiny nondeterminism when changing fragment append concurrency.
The remaining wall time is now dominated by fragment append/upsert and the actual serialized DuckDB fact flushes rather than caller backpressure. A fixed top500 MERGE split (bench-results/fragment-append-split-20260615/top500-split2-timings.json) showed:
| MERGE append sub-stage | Total |
|---|---|
fragment_append_rows / transaction | 92.3s / 92.2s |
| staging append into temp table | 32.5s |
| MERGE query | 27.6s |
| temp table create + clear | 1.4s |
| input / returned rows | 671,837 / 683,055 |
The temp-table DDL/cleanup is not the bottleneck. Increasing the fragment temp-stage append chunk size from 2k to 10k reduced fixed top500 fragment_append_rows from 92.3s to 80.5s and fixed top2000 elapsed to 460.4s (1635 indexed / 365 skipped / 0 failed). A 20k chunk was worse on fixed top500 (100.3s elapsed, 88.4s append), so 10k is the current local best.
A temporary cardinality probe showed the temp stage was already unique per call (661,790 rows / 661,790 distinct hashes in the fixed top500 probe), and only 308 staged hashes already existed in the target at the sampled MERGE point. That does not support revisiting the earlier prelookup/minimal-key path: there are too few existing hashes to offset an extra lookup/staging pass. The probe was removed after recording the result because it added query overhead.
A local QuackDB prototype optimized ordinary signed 64-bit vector encoding, including large BIGINT[] child vectors. Dogfooding it through a temporary local path dependency produced a fixed top500 run with 379 indexed / 121 skipped / 0 failed, 91.6s elapsed, and fragment_append_rows at 78.8s (stage_rows 28.7s, merge_query 20.0s). The change shipped in QuackDB 0.5.12; after updating Exograph to the released dependency, fixed-snapshot runs completed cleanly:
| Run | Indexed | Skipped | Failed | Index elapsed | Wall time | fragment_append_rows | stage_rows | MERGE query |
|---|---|---|---|---|---|---|---|---|
fixed top500, QuackDB 0.5.12 | 379 | 121 | 0 | 90.4s | 101.1s | 82.0s | 29.1s | 25.4s |
fixed top2000, QuackDB 0.5.12 | 1635 | 365 | 0 | 465.1s | 489.4s | 522.6s | 92.5s | 213.7s |
A follow-up attempt to replace MERGE with INSERT INTO ... SELECT ... WHERE NOT EXISTS ... RETURNING using the same temp stage regressed fixed top500 (114.2s elapsed, 98.6s append, 36.4s stage append, 29.7s insert query), so the MERGE shape remains better. The experiment was reverted.
QuackDB append telemetry is now captured in Hex timing snapshots when DuckDB indexing writes --timings-path. A fixed top500 telemetry run (bench-results/quackdb-append-telemetry-20260615/top500-telemetry-timings.json) showed the fragment temp-stage append payload is very wide:
| QuackDB append metric | Fragment temp stage |
|---|---|
| rows / batches | 671,837 / 3 |
| request bytes | 734.1 MB |
| encode duration | 10.6s |
| transport/server append duration | 3.5s transport / 14.3s append |
surrounding fragment_append_stage_rows | 30.2s |
The gap between QuackDB append duration and fragment_append_stage_rows suggests Ecto adapter row normalization/DBConnection queue/transaction overhead is also material, not just protocol encoding. Two direct-append variants were investigated and reverted:
| Variant | Result |
|---|---|
direct QuackDB.insert_columns/4 through repo pool | failed: checked out a different QuackDB connection than repo.transaction/2, so the connection-local temp table was invisible (Table .exograph_fragment_merge_* does not exist) |
| direct append on the active transaction connection via Ecto's process dictionary | worked, but regressed fixed top500 (105.4s elapsed, 88.6s append); stage append wrapper fell to 19.7s, but native append and MERGE time increased |
| unique persistent staging table plus direct append outside the transaction | worked, but regressed fixed top500 (105.6s elapsed, 117.9s append); regular table create/drop and cross-connection append cost dominated |
This means a QuackDB/Ecto active-transaction append helper is technically possible, but it is not yet justified by the measured workload.
A follow-up telemetry run also attached to QuackDB fetch spans and compared against an offline DuckDB CLI profile. Fixed top500 with fetch telemetry (bench-results/fetch-telemetry-20260615/top500-fetch-timings.json) completed in 91.9s elapsed with 79.9s fragment append, 28.1s stage append, and 21.3s MERGE query. Fetching the MERGE RETURNING content_hash, id result was not the bottleneck (1 fetch, 36 chunks, 0.55s). A DuckDB CLI profile over the completed database (bench-results/fetch-telemetry-20260615/cli-merge-returning-summary.json) recreated a MERGE-with-RETURNING shape from hex_fragments into an empty profiled table and reported only 3.53s latency / 2.61s CPU, with the MERGE_INTO operator at 2.10s and table scan at 0.51s. This is not the exact temp-stage transaction path, but it strongly suggests the 21s+ app-side MERGE query time is not primarily result fetching or DuckDB's logical MERGE operator alone.
A local QuackDB query-phase telemetry dogfood run (bench-results/quackdb-query-telemetry-20260615/top500-query-timings.json) added query/fetch encode, transport, decode, normalize, and byte counters. Fixed top500 completed in 93.5s elapsed with 86.7s fragment append, 28.3s stage append, and 27.1s MERGE query. Aggregated INSERT query telemetry showed nearly all query time is transport/server execution (63.0s total query duration, 62.8s transport, negligible encode/decode/normalize, tiny direct query response bytes). MERGE RETURNING fetch remained small (45 chunks, 0.43s, 5.2 MB response). QuackDB 0.5.13 published those telemetry metrics; the released top500 smoke (bench-results/quackdb-0.5.13-20260615/top500-release-timings.json) completed in 91.7s elapsed with 87.2s fragment append, 27.9s stage append, and 25.9s MERGE query. A synthetic QuackDB wide-append benchmark matching the Exograph fragment stage payload shape showed 670k rows / 721 MB request bytes. The matrix confirmed that concurrency improves wall time but inflates aggregate transport/server cost (1 connection, 10k chunks: 12.66s wall / 11.66s append / 5.67s transport-server; 4 connections, 10k chunks: 7.56s wall / 27.91s append / 19.83s transport-server). Larger chunks reduced per-request overhead in the isolated benchmark (4 connections, 40k chunks: 5.33s wall / 18.87s append), but a targeted Exograph fixed-top500 validation with 40k fragment-stage chunks (bench-results/fragment-stage-chunk-40k-20260615/top500-stage40k-timings.json) did not improve the full workload: 92.2s elapsed, 86.6s fragment append, 27.9s stage append, and 25.3s MERGE query, versus the released 0.5.13 baseline at 91.7s elapsed, 87.2s fragment append, 27.9s stage append, and 25.9s MERGE query. Keep the production stage chunk at 10k; isolated append chunk improvements do not transfer meaningfully to the full online MERGE workload. Remaining optimization should therefore focus on DuckDB/Quack server contention under concurrent native append and the MERGE/server execution path, not client-side query row materialization or simple chunk tuning.
An attempt to use DuckDB's built-in structured Quack log type was not useful. In small managed-server probes, both CALL enable_logging(['Quack']) and CALL enable_logging(level := 'debug') left duckdb_logs/duckdb_logs_parsed('Quack') empty for remote Quack requests. A fixed top500 run with CALL enable_logging(['Quack']) enabled before indexing then hung after the first few packages and hit per-package timeouts, so the experimental CLI log export hook was reverted. Treat built-in Quack logs as unavailable for this workload unless the upstream extension logging behavior changes; use client telemetry, DuckDB profiling, or purpose-built server instrumentation instead.
A DuckDB CLI reproduction of the actual wide temp-stage/MERGE shape (bench-results/duckdb-cli-merge-profile-20260615/summary.json) separated DuckDB engine cost from Quack remote path cost. The Exograph top500 run behind the profile had 84.2s fragment append, 30.6s stage append, and 22.1s app-side MERGE query. In the CLI over the same database, creating a temp stage from the full fragment columns took only 0.25s latency (1.20s CPU), and MERGE-with-RETURNING from that temp stage into an empty indexed target took 3.29s latency (2.26s CPU), with 2.18s in MERGE_INTO, 0.70s checkpoint latency, 610.5 MB written, and a 16.5 MB result. This is still a synthetic all-at-once CLI reproduction rather than the exact per-package concurrent remote path, but it strongly rules out DuckDB's core MERGE operator as the main 22s+ app-side MERGE cost. The remaining gap is most likely Quack server request handling/serialization around query execution, temp-stage remote append ingestion, transaction/commit scheduling, or connection contention.
A local QuackDB query-telemetry dogfood run with --concurrency 1 (bench-results/concurrency1-20260615/compare-normal-vs-c1.json) confirmed connection/concurrency contention is a large part of the remote MERGE cost. Compared with the normal local-telemetry top500 (93.5s elapsed, 86.7s fragment append, 28.3s stage append, 27.1s MERGE query), single package/DB concurrency regressed wall time to 251.3s because extraction/indexing became serial, but improved fragment append internals: 66.9s fragment append, 24.4s stage append, and 8.3s MERGE query. Aggregated INSERT query transport dropped from 62.8s to 24.3s, while COMMIT transport stayed high (42.5s normal vs 39.8s c1). This points to DB connection/write contention inflating MERGE and insert transport under normal concurrency, but serializing all package work is too costly.
A follow-up fragment-append semaphore prototype kept package concurrency at 4 while limiting only DuckDB fragment append transactions (bench-results/fragment-append-limiter-20260615/compare-normal-vs-limiters.json). It was reverted. Limit 1 improved MERGE query time (27.1s -> 14.3s) but added 74.8s of limiter wait, regressing elapsed time to 114.1s and fragment append aggregate time to 158.1s. Limit 2 also regressed (101.0s elapsed, 99.9s fragment append) and did not improve MERGE (27.7s). A coarse semaphore around fragment append is therefore not the right bounded-writer shape; the wait cost outweighs reduced contention. Future concurrency work should target a real pipelined writer/ingestion architecture or narrower critical sections, not a simple transaction-level semaphore.
A sharded DuckDB ingestion probe (bench-results/duckdb-shards-20260615/) split fixed top500 across four managed DuckDB databases. With global concurrency 4, per-shard concurrency 1, and --duckdb-threads 2, wall time regressed to 131.4s even though aggregate MERGE query dropped to 8.0s; per-shard serial extraction left too little parallelism. With global concurrency 8, per-shard concurrency/pool size 2, and --duckdb-threads 2, wall time improved to 70.2s versus 76.3s for a single DuckDB database at global concurrency 8 and 91.7s for the normal concurrency-4 baseline. The shard run reduced aggregate MERGE query (12.0s vs single-DB concurrency-8 33.1s) but still had similar/worse stage append cost (36.7s vs 32.1s). Correctness is not a drop-in match: package/text counts and sample package fragment counts matched, but global table counts differed because fragment/content/term de-duplication is local to each shard (fragments 689870 sharded vs 689495 single, terms 254414 vs 213090), and one broad structural query over-counted (_ |> _ 1506 sharded vs 2 single). A follow-up attempt to deduplicate merged sharded hits by fragment hash was reverted: it did not fix the broad structural over-count (_ |> _ stayed around 1498) and it regressed other global counts (Enum.map(_, _) 954 vs single 959, defmodule 6794 vs single 6807). Sharding is promising for write contention but changes global de-dup/search semantics in a way that cannot be fixed by a simple fanout result dedup pass.
An offline deferred-term prototype tried to avoid per-package term finalization by staging term text and fragment-term text rows, inserting fragments with empty terms, then backfilling fragments.terms and fragment_terms during finalization. It was reverted. It improved existing offline top500 elapsed time from about 249.9s to 160.9s, but still lost badly to online default (91.3s in the same comparison directory). It also failed exact parity: structural/text search counts matched, but table counts differed (comments 26023 default vs 22950 offline, references 750887 vs 750761, definitions off by 1, fragment_terms off by 4). The gap likely comes from subtle duplicate-content/fact attribution semantics in the offline finalization joins. Do not pursue deferred-term offline staging without a stronger correctness model for facts associated with duplicate fragments.
A fragment payload attribution run (bench-results/fragment-payload-20260615/top500-payload-summary.json) confirmed the wide payload is dominated by AST and token-list columns:
| Column | Approx payload |
|---|---|
ast | 452.6 MB |
terms | 234.1 MB |
sub_hashes | 56.4 MB |
exact_hash | 43.0 MB |
content_hash | 21.5 MB |
module | 13.4 MB |
These approximate column bytes explain nearly all of the ~734 MB QuackDB request. The ast blob is the largest single cost, followed by BIGINT[] term vectors. This made a narrower MERGE staging shape attractive, but the prototype regressed and was reverted. A first direct append into hex_fragments failed because DuckDB append requires all physical columns, including id (table hex_fragments has 18 columns but 17 values were supplied). A second variant allocated id values from hex_fragments_id_seq, staged only content_hash, direct-appended full missing rows, then looked up IDs; it completed correctly but regressed fixed top500 (111.3s elapsed, 103.8s append; direct insert alone 59.8s). The stage payload was much smaller (~105 MB of QuackDB append requests instead of ~734 MB), but losing MERGE's insert-from-temp execution path outweighed the protocol savings. The next fragment-append work should move to DuckDB MERGE profiling or QuackDB native append/server-side ingest profiling rather than more prelookup/direct-insert staging variants.
A serial top --limit 500 --concurrency 1 run remains the earlier clean correctness comparison. Counts and representative checks matched exactly for files, fragments, terms, fragmentterms, definitions, references, comments, packages, package_versions, defmodule, `Map.get(, ),Enum.map(, ), |> _, and package fragment counts forjason,ecto, andphoenix`:
| Fragment append | Concurrency | Indexed | Skipped | Failed | Index elapsed | Wall time | fragment_append_rows total | Notes |
|---|---|---|---|---|---|---|---|---|
| ecto | 1 | 379 | 121 | 0 | 5m04s | 315s | 86.8s | clean baseline |
| merge | 1 | 379 | 121 | 0 | 4m44s | 296s | 73.3s | exact quality/count parity; ~19s wall improvement in this serial run |
Historical invalidated artifacts: /tmp/exograph-top500-ecto-report.json, /tmp/exograph-top500-ecto-timings.json, /tmp/exograph-top500-merge-report.json, /tmp/exograph-top500-merge-timings.json.
Reports include selected DuckDB experiment metadata under options, including duckdb_fragment_append and duckdb_build_mode. The --duckdb-build-mode offline flag selects the experimental offline staging path for files, terms, fragments, definitions, references, comments, fragment_terms, graph_nodes, and call_edges. Keep it explicit until quality parity and larger repeated benchmarks are complete.
Automated top-package parity guards now compare online vs offline counts for files, fragments, terms, fragment_terms, definitions, references, comments, graph_nodes, call_edges, and representative text/definition/reference/caller/callee searches.
Initial top --limit 100 sharded smoke, using local tarballs and --duckdb-shards 2, showed quality-level totals match but no end-to-end win yet:
| Build mode | Indexed | Skipped | Failed | Index elapsed | Wall time | Notes |
|---|---|---|---|---|---|---|
| online | 75 | 25 | 0 | 50.54s | 52s | baseline online path |
| offline | 75 | 25 | 0 | 50.48s | 52s | stages files/terms/fragments/facts, but still finalizes files/terms during per-package puts |
This supports the next milestone: batch staging across larger units and finalize once per shard/corpus, rather than repeatedly resolving file/term IDs during FragmentStore.put/2.
Initial single-DB top --limit 100 --reach smoke results:
| Build mode | Indexed | Skipped | Failed | Index elapsed | Wall time | Notes |
|---|---|---|---|---|---|---|
| online + Reach | 75 | 25 | 0 | 64.35s | 69s | baseline online Reach path |
| offline + Reach | 75 | 25 | 0 | 61.87s | 73s | stages call graph facts; slower wall due finalization/staging overhead |
A sharded online + Reach top100 run exposed a dynamic-repo issue in asynchronous call-graph inserts (could not lookup Ecto repo Exograph.DuckDBRepo). Exograph.Storage.Ecto.SQL.bulk_insert_all/4 now preserves the current dynamic repo in async chunk insert tasks; sharded online + Reach top10 and top100 runs completed successfully after the fix.
Sharded top --limit 100 --reach, using local tarballs and --duckdb-shards 2 after the dynamic-repo fix:
| Build mode | Indexed | Skipped | Failed | Index elapsed | Wall time | Notes |
|---|---|---|---|---|---|---|
| online + Reach | 75 | 25 | 0 | 65.46s | 67s | sharded baseline after dynamic-repo fix |
| offline + Reach | 75 | 25 | 0 | 61.51s | 63s | stages call graph facts; still pays larger finalization cost |
Package batching experiment
mix exograph.index.hex has an explicit --package-batch-size option for experimenting with flushing multiple packages together. Quality checks on top --limit 500 matched the default mode for representative structural queries:
| Query/check | Default | Batch size 4 |
|---|---|---|
Map.get(_, _) | 7 | 7 |
Enum.map(_, _) | 959 | 959 |
| _ |> _ | 1864 | 1864 |
| jason fragments | 1391 | 1391 |
Controlled top --limit 1000 benchmarks with 16 shards were noisy and did not justify changing the default:
--package-batch-size | Run 1 | Run 2 | Median |
|---|---|---|---|
| 1 | 134.5s | 116.8s | 125.6s |
| 2 | 145.7s | 150.1s | 147.9s |
| 4 | 126.2s | 130.3s | 128.2s |
Keep package batching explicit until a larger corpus or repeated low-noise runs show a consistent win.
Dead ends from the tuning pass
The following experiments were reverted because they were neutral or worse on the full workload, even when some looked promising in microbenchmarks:
- Prelooking up existing fragment hashes and direct-appending only missing rows.
- Replacing the Ecto append/conflict path with an Exograph-local temp staging table plus
INSERT ... SELECT ... WHERE NOT EXISTS; atop --limit 500run was slower than the default path (1m12svs1m05sindex elapsed in adjacent runs). - Resolving existing fragment IDs through a DuckDB temp hash table and join. Isolated lookups were faster for large hash sets, but full
top --limit 2000indexing regressed (2m43sindex elapsed vs2m37sfor the adjacent0.5.8baseline), likely because transaction/temp-table overhead outweighed lookup gains in the real workload. - Batch-staging term strings in offline mode and resolving
fragment_termsat finalization time. Top-package parity passed, buttop --limit 100single-DB indexing regressed badly: online was75 indexed / 25 skipped / 0 failed,53sindex elapsed and60swall including a compile; the experimental offline term-batched path was75 indexed / 25 skipped / 0 failed,1m07sindex elapsed and75swall. Timings showedoffline_build_stage_fragmentsballooning to124.2stotal, so staging large per-fragment term-string payloads was worse than resolving term IDs during per-package puts. Artifacts:/tmp/exograph-batch-online-top100-report.json,/tmp/exograph-batch-online-top100-timings.json,/tmp/exograph-batch-offline-top100-report.json,/tmp/exograph-batch-offline-top100-timings.json. - Hybrid offline mode that used the normal online file upsert path while keeping offline fragment/fact staging. Top-package parity passed, but
top --limit 100remained slower than online: adjacent online was75 indexed / 25 skipped / 0 failed,55sindex elapsed and62swall; hybrid offline was75 indexed / 25 skipped / 0 failed,1m04sindex elapsed and73swall. Artifacts:/tmp/exograph-hybrid-online-top100-report.json,/tmp/exograph-hybrid-online-top100-timings.json,/tmp/exograph-hybrid-offline-top100-report.json,/tmp/exograph-hybrid-offline-top100-timings.json. - Removing post-insert temp-table cleanup.
- Combining temp-table create and clear statements.
- Switching temp-table cleanup from
DELETEtoTRUNCATE. - Increasing DuckDB code-fact insert buffer size from 50k to 100k.
- Changing AST compression level or selectively changing fragment row construction.
- Fusing code-fact extraction passes.
- Special-casing int64, blob, or LIST vector encoding in QuackDB without full-workload wins.
- Lowering per-package source parsing concurrency.
The next meaningful design target is a dedicated bulk fragment upsert/staging path below the current SQL/Ecto shape. A local SQL staging rewrite was not enough; the larger design likely needs adapter-level support that reduces append + conflict-ignore + ID lookup work together rather than rearranging the same operations.
Artifacts
Machine-readable benchmark artifacts are generated locally and intentionally not committed to git. Use --output-json for the JSON report and --explain-dir for Postgres plans, for example:
mix exograph.bench.backends \
--mode top --limit 500 --runs 3 \
--only postgres_plain,duckdb_plain,duckdb_sharded_plain \
--output-json bench-results/backend-limit500-runs3-current.json \
--explain-dir bench-results/explain-limit500-runs3-current
bench-results/ is gitignored to avoid polluting the repository with local benchmark outputs. The tables above record the latest checked benchmark summary; regenerate artifacts locally when you need machine-readable evidence or plans.
Current fair wording
A defensible summary is:
On Exograph's Hex.pm top-package workload, tuned Postgres is slightly faster at indexing 100 packages. At 500 packages, DuckDB indexes about 1.66× faster single-node and about 1.99× faster with 4 shards, while several important query paths are roughly 3×–14× faster on DuckDB. These numbers describe Exograph's current backends and local benchmark setup, not PostgreSQL or DuckDB universally.