These are local Exograph backend benchmarks for the Hex.pm top workload. They are intended to compare Exograph's current backend implementations on this machine, not to make a universal claim about PostgreSQL or DuckDB.

Method

All runs used the same package cache and full Exograph persistence: files, fragments, ASTs, hashes, symbols, references, terms, and queryable facts were retained.

Common settings:

--mode top
--runs 3
--concurrency 4
--index-concurrency 4
--duckdb-threads 1
--postgres-defer-indexes
--postgres-synchronous-commit off
--postgres-maintenance-work-mem 512MB
--postgres-max-parallel-maintenance-workers 2

DuckDB sharded runs used:

--duckdb-shards 4 --duckdb-recovery-mode no_wal_writes

Postgres settings are a rebuildable/local-index challenge mode: deferred non-unique query indexes, synchronous_commit=off, and larger maintenance memory. They are not durable-production defaults.

Indexing medians

WorkloadPostgres plainDuckDB plainDuckDB sharded plainResult
top --limit 10038.42s39.22s41.17stuned Postgres slightly faster
top --limit 500181.22s109.27s91.05sDuckDB 1.66× faster; sharded DuckDB 1.99× faster

For limit 100, the systems are close and tuned Postgres wins indexing. For limit 500, DuckDB wins indexing, and sharding improves throughput further.

Query medians

top --limit 100

QueryPostgres plainDuckDB plainDuckDB sharded plain
api_text_defmodule71.1ms27.7ms45.4ms
references_enum24.1ms2.3ms3.3ms
files_defmodule31.0ms7.3ms2.4ms
api_comments_todo134.3ms129.8ms65.6ms

top --limit 500

QueryPostgres plainDuckDB plainDuckDB sharded plain
api_text_defmodule134.2ms49.0ms37.0ms
references_enum56.7ms8.9ms9.9ms
files_defmodule98.3ms23.6ms7.1ms
api_comments_todo143.3ms159.1ms199.9ms

Search/query paths usually favor DuckDB materially, especially on the larger workload.

DuckDB ingestion decision checkpoint, June 2026

The benchmark evidence now supports keeping single-DuckDB online MERGE as the default ingestion path. It is correct, stable, uses released QuackDB dependencies, and has repeated fixed top500/top2000 parity. Sharding is faster for indexing (70.2s fixed top500 with four shards and global concurrency 8), but it changes global content/term de-duplication and broad structural search semantics, so it should remain an opt-in/product architecture candidate rather than a transparent speed flag.

Future work should choose between two explicit designs:

  • sharded read architecture with documented shard-local/global semantics and package-scoped parity tests;
  • global finalization pipeline that rebuilds one logical index after extraction/staging.

Do not spend more time on the exhausted micro-optimizations without new evidence: anti-join insert-select, direct append bypass, persistent staging, narrow staging, Quack structured logs, serial ingestion, transaction semaphores, deferred-term offline staging, larger fragment-stage chunks, or simple fanout result dedup.

DuckDB Hex corpus notes, June 2026

A focused DuckDB Hex-corpus tuning pass used the local Hex tarball mirror and 16 DuckDB shards:

mix exograph.index.hex \
  --backend duckdb \
  --mode top \
  --limit 2000 \
  --concurrency 16 \
  --duckdb-shards 16 \
  --duckdb-threads 1 \
  --duckdb-recovery-mode no_wal_writes \
  --tarball-dir /srv/toys/hex-mirror/tarballs \
  --missing-tarballs-report-path /tmp/exograph-missing-top2000.json

The clean baseline after QuackDB 0.5.7 was later rechecked after QuackDB 0.5.8's DML-builder refactor. The 0.5.8 run used an equivalent temporary tarball mirror because amqp@4.1.1 was missing from the shared mirror.

QuackDBWorkloadIndexedSkippedFailedIndex elapsedWall time
0.5.7top --limit 200016353650158.09s168.21s
0.5.8top --limit 200016353650156.99s166.86s

No missing local tarballs were reported for either completed run. The largest cumulative timing buckets in the 0.5.7 baseline were:

StageTotal
fragment_store_put1094.4s
fragment_store_upsert_fragments555.0s
fragment_store_code_facts433.2s
fragment_store_resolve_fragment_ids237.1s
fragment_append_rows230.7s
code_facts_insert_references179.5s
fragment_store_build_fragment_rows160.6s
fragment_store_normalize_terms150.9s

QuackDB 0.5.6 and 0.5.7 included small adapter/protocol improvements for Ecto native append paths. QuackDB 0.5.8 reused QuackDB's DML builder for the same Ecto append temporary-table SQL and was performance-neutral in the full workload. These changes were positive or neutral but not large enough to change the main bottleneck: fragment append/upsert remains dominated by the append + conflict-ignore + returning/staging path.

MERGE fragment append

DuckDB docs and source notes indicate that recent MERGE paths can choose better bulk strategies than older ON CONFLICT index-conflict paths. After repeated parity benchmarks, Exograph uses the DuckDB MERGE fragment append path by default. The previous Ecto-style path remains available for comparison with:

mix exograph.index.hex --duckdb-fragment-append ecto ...

Initial top --limit 2000 results:

ModeIndexedSkippedFailedIndex elapsedWall timefragment_append_rows total
default (0.5.8)16353650156.99s166.86s241.2s
--duckdb-fragment-append merge run 116353650153.91s158.99s188.0s
--duckdb-fragment-append merge run 216353650157.16s166.95s186.1s

A persisted top --limit 500 quality check matched the default path:

Query/checkDefaultMERGE
Map.get(_, _)77
Enum.map(_, _)959959

| _ |> _ | 1864 | 1864 | | jason fragments | 1391 | 1391 |

These early results made MERGE promising, but not yet sufficient to change the default. Later runs below resolved correctness blockers and provided the stronger evidence used to switch the DuckDB default to MERGE.

A later single-DB top --limit 500 comparison with local tarballs still did not justify changing the default:

Fragment appendIndexedSkippedFailedIndex elapsedWall timefragment_append_rows totalNotes
ecto37912101m42s112s97.3sbaseline
merge37912101m41s111s96.2srepresentative structural/text/package checks matched, but fact table counts differed (definitions +140, references +762, comments +4) and need investigation before further promotion

A follow-up investigation found two correctness issues around this comparison rather than a proven MERGE quality regression:

  • File lookup after file upserts must be scoped by {package_version_id, sha256}, not only by SHA, because different package versions can legitimately share identical files.
  • duckdb_fragment_append was recorded in Hex index reports but was not forwarded from Exograph.Hex.Corpus into per-package Exograph.index_sources/2 calls, so earlier corpus-level MERGE comparisons were not a clean test of the MERGE path.

A focused mariaex@0.9.1 rerun after forwarding the option showed Ecto and MERGE fact counts match exactly (fragments 4614, definitions 1043, references 5307, comments 59). This restores confidence in MERGE correctness for the known repro, but the previous top500 MERGE numbers should be treated as invalidated.

Clean follow-up artifacts live under ignored bench-results/duckdb-merge-20260614/; large .duckdb files were removed after comparison. With --concurrency 4, the Ecto baseline hit a DuckDB unique-constraint race in the append path while indexing cachex@4.1.1, so that run is not a valid quality baseline. MERGE completed the same workload successfully:

Fragment appendConcurrencyIndexedSkippedFailedIndex elapsedWall timefragment_append_rows totalNotes
ecto437812111m53s125s111.3sfailed on duplicate content_hash unique constraint while indexing cachex@4.1.1
merge437912101m50s122s93.3scompleted; not directly comparable because Ecto failed

DuckDB source inspection explains the failure shape: PhysicalInsert::OnConflictHandling verifies conflicts before append, then concurrent transactions can still both try the same unique ART key and one fails at commit. The relevant source paths are src/execution/operator/persistent/physical_insert.cpp, src/common/types/conflict_manager.cpp, and src/execution/index/art/art.cpp. Exograph now retries DuckDB fragment appends on primary-key/unique-constraint races inside the fragment append operation, allowing the losing transaction to re-run after the winner commits.

After that retry guard, the same top --limit 500 --concurrency 4 Ecto workload completed:

Fragment appendConcurrencyIndexedSkippedFailedIndex elapsedWall timefragment_append_rows totalNotes
ecto + retry437912101m54s124s118.1sclean completion after unique-conflict retry

Repeated top --limit 500 --concurrency 4 Ecto/MERGE pairs still do not justify making MERGE the default. Artifacts live under ignored bench-results/duckdb-merge-repeat-20260614/; large .duckdb files were removed after each comparison.

RunEcto resultMERGE resultWall time Ecto/MERGEfragment_append_rows Ecto/MERGEQuality/count parityNotes
1379 indexed, 0 failed378 indexed, 1 failed127s / 123s119.1s / 102.4snoMERGE hit the same DuckDB unique-constraint commit race before broadening the retry guard
2379 indexed, 0 failed379 indexed, 0 failed130s / 122s120.7s / 97.0syesclean MERGE win
3379 indexed, 0 failed379 indexed, 0 failed126s / 121s120.0s / 97.6snorepresentative searches matched, but fact counts differed (definitions +238, references +1227, comments +14)
4not rerun379 indexed, 0 failedn/a / 125sn/a / not summarized heren/aafter broad retry, MERGE completed but produced the lower fact-count variant (definitions 175434, references 751847, comments 26135)

The concurrency-4 fact-count nondeterminism was traced to duplicate source paths inside Hex tarballs rather than MERGE itself. mariaex@0.9.1 contains repeated identical entries for paths such as lib/mariaex/protocol.ex and lib/mariaex/row_parser.ex. Exograph now deduplicates source tuples by {path, package_version} before extraction, so duplicate archive entries do not insert duplicate definitions/references/comments.

A focused mariaex@0.9.1 rerun after source dedupe matched Ecto and MERGE exactly with the lower, deduplicated fact counts (fragments 4614, definitions 665, references 3318, comments 41).

Post-dedupe repeated top --limit 500 --concurrency 4 Ecto/MERGE pairs matched exactly for table counts and representative quality probes in clean runs:

RunEcto wall / appendMERGE wall / appendResult
post-dedupe smoke129s / 117.3s120s / 93.7sexact parity
decision run 1132s / 117.5s122s / 96.1sexact parity
decision run 4132s / 115.2s153s / 123.2sexact parity; MERGE wall-time outlier
decision run 5132s / 109.7s131s / 101.0sexact parity
decision run 6129s / 117.6s123s / 99.9sexact parity

A clean top --limit 2000 --concurrency 4 pair, after filling missing local tarballs, also matched exactly:

Fragment appendConcurrencyIndexedSkippedFailedIndex elapsedWall timefragment_append_rows totalNotes
ecto + retry4163536509m00s569s652.9sexact parity baseline
merge + retry4163536508m52s565s548.4sexact parity; lower append time

Based on the repeated exact-parity runs and the larger append-time reduction, DuckDB fragment append now defaults to MERGE while retaining --duckdb-fragment-append ecto as an escape hatch/comparison path.

Default-path validation after switching the default:

RunIndexedSkippedFailedIndex elapsedWall timeduckdb_fragment_appendfragment_append_rows totalRetry countNotes
top --limit 50037912101m58s130smerge100.6s0default path, no explicit append flag
top --limit 100 --reach752501m11s78smerge21.6s0default path with Reach enabled

A default top --limit 2000 run also reported duckdb_fragment_append: merge, but the live top list had moved beyond the local tarball cache (ash@3.29.0, then llm_db@2026.6.2), so those runs are not clean quality/timing baselines. Use fixed entries or refresh the tarball cache before future top2000 default validation.

The benchmark workflow now supports entry snapshots with --entries-output-path, writing NDJSON lines like {\"name\":\"jason\",\"version\":\"1.4.5\"}. Rerun a fixed set with --entries-file. A fixed top2000 snapshot was created under ignored bench-results/fixed-top2000-20260614/entries.ndjson; after downloading the single missing tarball, the default MERGE run completed cleanly:

RunIndexedSkippedFailedIndex elapsedWall timeduckdb_fragment_appendfragment_append_rows totalRetry countNotes
fixed top2000 entries163536509m14s582smerge596.1s1--entries-file bench-results/fixed-top2000-20260614/entries.ndjson

A final fixed-snapshot comparison of explicit Ecto vs default MERGE matched exactly for table counts and representative probes:

Fragment appendIndexedSkippedFailedIndex elapsedWall timefragment_append_rows totalRetry countNotes
ecto163536509m49s613s696.3s1explicit comparison path
default merge163536509m13s579s580.4s0default path; exact parity with Ecto

Post-MERGE cost model

After making MERGE the default, fragment_append_rows is no longer the only story. Timing snapshots now include counter metrics for row volumes (total, avg, max) in addition to duration metrics (total_ms, avg_ms, max_ms). A fixed top500 default-MERGE run (bench-results/stage-cost-20260615/top500-default-timings.json) showed:

Stage / metricValue
wall time135s
index elapsed123s
index_sources485.6s cumulative
fragment_store_put414.6s cumulative
fragment_store_upsert_fragments239.6s cumulative
fragment_store_code_facts122.5s cumulative
fragment_store_resolve_fragment_ids108.8s cumulative
fragment_append_rows105.6s cumulative
fragment_store_normalize_terms76.3s cumulative
fragment_store_upsert_terms52.8s cumulative
code_facts_insert_comments52.9s cumulative
input fragments985,339 rows
unique fragment rows976,408 rows
unique terms observed459,667 rows
reference facts739,115 rows
definition facts158,592 rows
comment facts21,064 rows
file rows3,714 rows
fragment append retries1

The next optimization target should be chosen from this profile rather than further MERGE tuning. The remaining storage-heavy cluster is fragment ID resolution + fact insertion + term normalization/upsert. Previous offline term-string batching and hybrid file handling both regressed, so prefer narrower, measurable changes such as reducing per-package ID lookup/fact flush overhead or improving fact/term batching without inflating staged payloads.

A follow-up split of fragment_store_resolve_fragment_ids showed that the fallback lookup is not the bottleneck in the default MERGE path (bench-results/fragment-id-resolution-20260615/top500-default-timings.json):

Fragment ID resolution sub-stage / metricValue
fragment_store_resolve_fragment_ids108.6s
fragment_store_insert_fragments_by_hash106.5s
fragment_append_rows105.4s
fragment_store_missing_hashes0.4s
fragment_store_lookup_existing_fragment_ids1.5s
unique fragment rows985,178
inserted fragment rows returned982,754
missing existing fragment IDs657

So optimizing the fallback content_hash IN (...) lookup is not worth pursuing now. The time attributed to ID resolution is almost entirely the MERGE append itself. Move the next optimization attempt to term/fact batching or broader flush granularity rather than fragment-ID lookup shape.

A follow-up fact-buffer split (bench-results/fact-buffer-cost-20260615/top500-default-timings.json) showed that code-fact insert time is mostly synchronous InsertBuffer.insert/3 backpressure rather than final flush cost:

Fact/buffer metricValue
fragment_store_code_facts109.7s
code_facts_insert_comments41.4s
code_facts_bulk_insert_comments41.3s
code_facts_insert_references28.3s
code_facts_bulk_insert_references27.0s
code_facts_insert_definitions8.6s
code_facts_bulk_insert_definitions8.3s
duckdb_insert_buffer_flush_references26.0s across 14 flushes
duckdb_insert_buffer_flush_definitions5.6s across 4 flushes
duckdb_insert_buffer_flush_comments0.25s across 1 flush
reference rows enqueued/flushed732,375 / 750,994
definition rows enqueued/flushed158,315 / 175,266
comment rows enqueued/flushed20,283 / 26,131

The surprising comments cost is not comment row volume: comments flush only took ~0.25s. The expensive code_facts_bulk_insert_comments timing is the synchronous call into a shared InsertBuffer, which can block behind reference/definition flushes from other package tasks. The next low-risk experiment should therefore target InsertBuffer flush granularity/serialization (for example, larger per-source thresholds or per-source workers), not comment extraction or direct comment insert SQL.

Two immediate InsertBuffer variants did not justify a behavior change:

VariantResult
--duckdb-insert-buffer-size 200000completed top500 but regressed wall time to 146s; fewer, larger flushes made final/large flushes expensive (duckdb_insert_buffer_flush 13.7s; definition flush 8.4s) and did not reduce comment enqueue backpressure enough
async flush prototypenot kept; naive Task.async from inside the GenServer let task replies hit handle_info before Task.await, causing final flush/reporting to hang

A proper per-source worker version of the InsertBuffer did help. The public buffer API stayed the same, but each source table now has an independent worker, so comment enqueue calls no longer wait behind reference/definition flushes.

RunIndexedSkippedFailedIndex elapsedWall timefragment_store_code_factscode_facts_bulk_insert_commentsfragment_append_rows
fixed top500 baseline3791210116.9s~124s109.7s41.3s105.4s
fixed top500 per-source workers379121097.6s106.7s72.8s0.05sn/a in adjacent baseline
fixed top2000 default MERGE16353650553.3s579s537.1s233.8s580.4s
fixed top2000 per-source workers16353650470.9s495s459.6s7.9s435.7s
fixed top2000 per-source workers repeat16353650536.1s561s515.6s7.3s486.8s

The top2000 runs used bench-results/fixed-top2000-20260614/entries.ndjson, completed with 1635 indexed / 365 skipped / 0 failed, and still reported the default duckdb_fragment_append: merge. This is a real win, but it changes fact flushing from globally serialized to per-source serialized. Keep watching larger fixed-entry runs for DuckDB append contention and retry counts.

A follow-up split showed the remaining code_facts_bulk_insert_references time was mostly same-source worker queueing, not row construction. Switching worker enqueue from synchronous call to non-blocking cast keeps final InsertBuffer.flush/1 as the durability barrier and moved that wait out of per-package indexing:

RunIndexedSkippedFailedIndex elapsedWall timefragment_store_code_factscode_facts_bulk_insert_referencesreference worker service / flush
fixed top500 non-blocking enqueue379121089.6s99.1s28.9s0.4s17.8s / 18.5s
fixed top2000 non-blocking enqueue16353650466.1s490s147.5s1.8s101.6s / 101.3s

Fact row flush counts matched the prior per-source worker top2000 run (references 2,889,685; definitions 748,332; comments 113,887), so this change did not drop buffered facts in the fixed workload. A fresh fixed top2000 quality run also preserved indexed/skipped/failed totals and the same package/search/fact counts as the previous non-blocking summary, with only a 4-row fragment_terms difference out of ~99.4M rows; keep watching this tiny nondeterminism when changing fragment append concurrency.

The remaining wall time is now dominated by fragment append/upsert and the actual serialized DuckDB fact flushes rather than caller backpressure. A fixed top500 MERGE split (bench-results/fragment-append-split-20260615/top500-split2-timings.json) showed:

MERGE append sub-stageTotal
fragment_append_rows / transaction92.3s / 92.2s
staging append into temp table32.5s
MERGE query27.6s
temp table create + clear1.4s
input / returned rows671,837 / 683,055

The temp-table DDL/cleanup is not the bottleneck. Increasing the fragment temp-stage append chunk size from 2k to 10k reduced fixed top500 fragment_append_rows from 92.3s to 80.5s and fixed top2000 elapsed to 460.4s (1635 indexed / 365 skipped / 0 failed). A 20k chunk was worse on fixed top500 (100.3s elapsed, 88.4s append), so 10k is the current local best.

A temporary cardinality probe showed the temp stage was already unique per call (661,790 rows / 661,790 distinct hashes in the fixed top500 probe), and only 308 staged hashes already existed in the target at the sampled MERGE point. That does not support revisiting the earlier prelookup/minimal-key path: there are too few existing hashes to offset an extra lookup/staging pass. The probe was removed after recording the result because it added query overhead.

A local QuackDB prototype optimized ordinary signed 64-bit vector encoding, including large BIGINT[] child vectors. Dogfooding it through a temporary local path dependency produced a fixed top500 run with 379 indexed / 121 skipped / 0 failed, 91.6s elapsed, and fragment_append_rows at 78.8s (stage_rows 28.7s, merge_query 20.0s). The change shipped in QuackDB 0.5.12; after updating Exograph to the released dependency, fixed-snapshot runs completed cleanly:

RunIndexedSkippedFailedIndex elapsedWall timefragment_append_rowsstage_rowsMERGE query
fixed top500, QuackDB 0.5.12379121090.4s101.1s82.0s29.1s25.4s
fixed top2000, QuackDB 0.5.1216353650465.1s489.4s522.6s92.5s213.7s

A follow-up attempt to replace MERGE with INSERT INTO ... SELECT ... WHERE NOT EXISTS ... RETURNING using the same temp stage regressed fixed top500 (114.2s elapsed, 98.6s append, 36.4s stage append, 29.7s insert query), so the MERGE shape remains better. The experiment was reverted.

QuackDB append telemetry is now captured in Hex timing snapshots when DuckDB indexing writes --timings-path. A fixed top500 telemetry run (bench-results/quackdb-append-telemetry-20260615/top500-telemetry-timings.json) showed the fragment temp-stage append payload is very wide:

QuackDB append metricFragment temp stage
rows / batches671,837 / 3
request bytes734.1 MB
encode duration10.6s
transport/server append duration3.5s transport / 14.3s append
surrounding fragment_append_stage_rows30.2s

The gap between QuackDB append duration and fragment_append_stage_rows suggests Ecto adapter row normalization/DBConnection queue/transaction overhead is also material, not just protocol encoding. Two direct-append variants were investigated and reverted:

VariantResult
direct QuackDB.insert_columns/4 through repo poolfailed: checked out a different QuackDB connection than repo.transaction/2, so the connection-local temp table was invisible (Table .exograph_fragment_merge_* does not exist)
direct append on the active transaction connection via Ecto's process dictionaryworked, but regressed fixed top500 (105.4s elapsed, 88.6s append); stage append wrapper fell to 19.7s, but native append and MERGE time increased
unique persistent staging table plus direct append outside the transactionworked, but regressed fixed top500 (105.6s elapsed, 117.9s append); regular table create/drop and cross-connection append cost dominated

This means a QuackDB/Ecto active-transaction append helper is technically possible, but it is not yet justified by the measured workload.

A follow-up telemetry run also attached to QuackDB fetch spans and compared against an offline DuckDB CLI profile. Fixed top500 with fetch telemetry (bench-results/fetch-telemetry-20260615/top500-fetch-timings.json) completed in 91.9s elapsed with 79.9s fragment append, 28.1s stage append, and 21.3s MERGE query. Fetching the MERGE RETURNING content_hash, id result was not the bottleneck (1 fetch, 36 chunks, 0.55s). A DuckDB CLI profile over the completed database (bench-results/fetch-telemetry-20260615/cli-merge-returning-summary.json) recreated a MERGE-with-RETURNING shape from hex_fragments into an empty profiled table and reported only 3.53s latency / 2.61s CPU, with the MERGE_INTO operator at 2.10s and table scan at 0.51s. This is not the exact temp-stage transaction path, but it strongly suggests the 21s+ app-side MERGE query time is not primarily result fetching or DuckDB's logical MERGE operator alone.

A local QuackDB query-phase telemetry dogfood run (bench-results/quackdb-query-telemetry-20260615/top500-query-timings.json) added query/fetch encode, transport, decode, normalize, and byte counters. Fixed top500 completed in 93.5s elapsed with 86.7s fragment append, 28.3s stage append, and 27.1s MERGE query. Aggregated INSERT query telemetry showed nearly all query time is transport/server execution (63.0s total query duration, 62.8s transport, negligible encode/decode/normalize, tiny direct query response bytes). MERGE RETURNING fetch remained small (45 chunks, 0.43s, 5.2 MB response). QuackDB 0.5.13 published those telemetry metrics; the released top500 smoke (bench-results/quackdb-0.5.13-20260615/top500-release-timings.json) completed in 91.7s elapsed with 87.2s fragment append, 27.9s stage append, and 25.9s MERGE query. A synthetic QuackDB wide-append benchmark matching the Exograph fragment stage payload shape showed 670k rows / 721 MB request bytes. The matrix confirmed that concurrency improves wall time but inflates aggregate transport/server cost (1 connection, 10k chunks: 12.66s wall / 11.66s append / 5.67s transport-server; 4 connections, 10k chunks: 7.56s wall / 27.91s append / 19.83s transport-server). Larger chunks reduced per-request overhead in the isolated benchmark (4 connections, 40k chunks: 5.33s wall / 18.87s append), but a targeted Exograph fixed-top500 validation with 40k fragment-stage chunks (bench-results/fragment-stage-chunk-40k-20260615/top500-stage40k-timings.json) did not improve the full workload: 92.2s elapsed, 86.6s fragment append, 27.9s stage append, and 25.3s MERGE query, versus the released 0.5.13 baseline at 91.7s elapsed, 87.2s fragment append, 27.9s stage append, and 25.9s MERGE query. Keep the production stage chunk at 10k; isolated append chunk improvements do not transfer meaningfully to the full online MERGE workload. Remaining optimization should therefore focus on DuckDB/Quack server contention under concurrent native append and the MERGE/server execution path, not client-side query row materialization or simple chunk tuning.

An attempt to use DuckDB's built-in structured Quack log type was not useful. In small managed-server probes, both CALL enable_logging(['Quack']) and CALL enable_logging(level := 'debug') left duckdb_logs/duckdb_logs_parsed('Quack') empty for remote Quack requests. A fixed top500 run with CALL enable_logging(['Quack']) enabled before indexing then hung after the first few packages and hit per-package timeouts, so the experimental CLI log export hook was reverted. Treat built-in Quack logs as unavailable for this workload unless the upstream extension logging behavior changes; use client telemetry, DuckDB profiling, or purpose-built server instrumentation instead.

A DuckDB CLI reproduction of the actual wide temp-stage/MERGE shape (bench-results/duckdb-cli-merge-profile-20260615/summary.json) separated DuckDB engine cost from Quack remote path cost. The Exograph top500 run behind the profile had 84.2s fragment append, 30.6s stage append, and 22.1s app-side MERGE query. In the CLI over the same database, creating a temp stage from the full fragment columns took only 0.25s latency (1.20s CPU), and MERGE-with-RETURNING from that temp stage into an empty indexed target took 3.29s latency (2.26s CPU), with 2.18s in MERGE_INTO, 0.70s checkpoint latency, 610.5 MB written, and a 16.5 MB result. This is still a synthetic all-at-once CLI reproduction rather than the exact per-package concurrent remote path, but it strongly rules out DuckDB's core MERGE operator as the main 22s+ app-side MERGE cost. The remaining gap is most likely Quack server request handling/serialization around query execution, temp-stage remote append ingestion, transaction/commit scheduling, or connection contention.

A local QuackDB query-telemetry dogfood run with --concurrency 1 (bench-results/concurrency1-20260615/compare-normal-vs-c1.json) confirmed connection/concurrency contention is a large part of the remote MERGE cost. Compared with the normal local-telemetry top500 (93.5s elapsed, 86.7s fragment append, 28.3s stage append, 27.1s MERGE query), single package/DB concurrency regressed wall time to 251.3s because extraction/indexing became serial, but improved fragment append internals: 66.9s fragment append, 24.4s stage append, and 8.3s MERGE query. Aggregated INSERT query transport dropped from 62.8s to 24.3s, while COMMIT transport stayed high (42.5s normal vs 39.8s c1). This points to DB connection/write contention inflating MERGE and insert transport under normal concurrency, but serializing all package work is too costly.

A follow-up fragment-append semaphore prototype kept package concurrency at 4 while limiting only DuckDB fragment append transactions (bench-results/fragment-append-limiter-20260615/compare-normal-vs-limiters.json). It was reverted. Limit 1 improved MERGE query time (27.1s -> 14.3s) but added 74.8s of limiter wait, regressing elapsed time to 114.1s and fragment append aggregate time to 158.1s. Limit 2 also regressed (101.0s elapsed, 99.9s fragment append) and did not improve MERGE (27.7s). A coarse semaphore around fragment append is therefore not the right bounded-writer shape; the wait cost outweighs reduced contention. Future concurrency work should target a real pipelined writer/ingestion architecture or narrower critical sections, not a simple transaction-level semaphore.

A sharded DuckDB ingestion probe (bench-results/duckdb-shards-20260615/) split fixed top500 across four managed DuckDB databases. With global concurrency 4, per-shard concurrency 1, and --duckdb-threads 2, wall time regressed to 131.4s even though aggregate MERGE query dropped to 8.0s; per-shard serial extraction left too little parallelism. With global concurrency 8, per-shard concurrency/pool size 2, and --duckdb-threads 2, wall time improved to 70.2s versus 76.3s for a single DuckDB database at global concurrency 8 and 91.7s for the normal concurrency-4 baseline. The shard run reduced aggregate MERGE query (12.0s vs single-DB concurrency-8 33.1s) but still had similar/worse stage append cost (36.7s vs 32.1s). Correctness is not a drop-in match: package/text counts and sample package fragment counts matched, but global table counts differed because fragment/content/term de-duplication is local to each shard (fragments 689870 sharded vs 689495 single, terms 254414 vs 213090), and one broad structural query over-counted (_ |> _ 1506 sharded vs 2 single). A follow-up attempt to deduplicate merged sharded hits by fragment hash was reverted: it did not fix the broad structural over-count (_ |> _ stayed around 1498) and it regressed other global counts (Enum.map(_, _) 954 vs single 959, defmodule 6794 vs single 6807). Sharding is promising for write contention but changes global de-dup/search semantics in a way that cannot be fixed by a simple fanout result dedup pass.

An offline deferred-term prototype tried to avoid per-package term finalization by staging term text and fragment-term text rows, inserting fragments with empty terms, then backfilling fragments.terms and fragment_terms during finalization. It was reverted. It improved existing offline top500 elapsed time from about 249.9s to 160.9s, but still lost badly to online default (91.3s in the same comparison directory). It also failed exact parity: structural/text search counts matched, but table counts differed (comments 26023 default vs 22950 offline, references 750887 vs 750761, definitions off by 1, fragment_terms off by 4). The gap likely comes from subtle duplicate-content/fact attribution semantics in the offline finalization joins. Do not pursue deferred-term offline staging without a stronger correctness model for facts associated with duplicate fragments.

A fragment payload attribution run (bench-results/fragment-payload-20260615/top500-payload-summary.json) confirmed the wide payload is dominated by AST and token-list columns:

ColumnApprox payload
ast452.6 MB
terms234.1 MB
sub_hashes56.4 MB
exact_hash43.0 MB
content_hash21.5 MB
module13.4 MB

These approximate column bytes explain nearly all of the ~734 MB QuackDB request. The ast blob is the largest single cost, followed by BIGINT[] term vectors. This made a narrower MERGE staging shape attractive, but the prototype regressed and was reverted. A first direct append into hex_fragments failed because DuckDB append requires all physical columns, including id (table hex_fragments has 18 columns but 17 values were supplied). A second variant allocated id values from hex_fragments_id_seq, staged only content_hash, direct-appended full missing rows, then looked up IDs; it completed correctly but regressed fixed top500 (111.3s elapsed, 103.8s append; direct insert alone 59.8s). The stage payload was much smaller (~105 MB of QuackDB append requests instead of ~734 MB), but losing MERGE's insert-from-temp execution path outweighed the protocol savings. The next fragment-append work should move to DuckDB MERGE profiling or QuackDB native append/server-side ingest profiling rather than more prelookup/direct-insert staging variants.

A serial top --limit 500 --concurrency 1 run remains the earlier clean correctness comparison. Counts and representative checks matched exactly for files, fragments, terms, fragmentterms, definitions, references, comments, packages, package_versions, defmodule, `Map.get(, ),Enum.map(, ), |> _, and package fragment counts forjason,ecto, andphoenix`:

Fragment appendConcurrencyIndexedSkippedFailedIndex elapsedWall timefragment_append_rows totalNotes
ecto137912105m04s315s86.8sclean baseline
merge137912104m44s296s73.3sexact quality/count parity; ~19s wall improvement in this serial run

Historical invalidated artifacts: /tmp/exograph-top500-ecto-report.json, /tmp/exograph-top500-ecto-timings.json, /tmp/exograph-top500-merge-report.json, /tmp/exograph-top500-merge-timings.json.

Reports include selected DuckDB experiment metadata under options, including duckdb_fragment_append and duckdb_build_mode. The --duckdb-build-mode offline flag selects the experimental offline staging path for files, terms, fragments, definitions, references, comments, fragment_terms, graph_nodes, and call_edges. Keep it explicit until quality parity and larger repeated benchmarks are complete.

Automated top-package parity guards now compare online vs offline counts for files, fragments, terms, fragment_terms, definitions, references, comments, graph_nodes, call_edges, and representative text/definition/reference/caller/callee searches.

Initial top --limit 100 sharded smoke, using local tarballs and --duckdb-shards 2, showed quality-level totals match but no end-to-end win yet:

Build modeIndexedSkippedFailedIndex elapsedWall timeNotes
online7525050.54s52sbaseline online path
offline7525050.48s52sstages files/terms/fragments/facts, but still finalizes files/terms during per-package puts

This supports the next milestone: batch staging across larger units and finalize once per shard/corpus, rather than repeatedly resolving file/term IDs during FragmentStore.put/2.

Initial single-DB top --limit 100 --reach smoke results:

Build modeIndexedSkippedFailedIndex elapsedWall timeNotes
online + Reach7525064.35s69sbaseline online Reach path
offline + Reach7525061.87s73sstages call graph facts; slower wall due finalization/staging overhead

A sharded online + Reach top100 run exposed a dynamic-repo issue in asynchronous call-graph inserts (could not lookup Ecto repo Exograph.DuckDBRepo). Exograph.Storage.Ecto.SQL.bulk_insert_all/4 now preserves the current dynamic repo in async chunk insert tasks; sharded online + Reach top10 and top100 runs completed successfully after the fix.

Sharded top --limit 100 --reach, using local tarballs and --duckdb-shards 2 after the dynamic-repo fix:

Build modeIndexedSkippedFailedIndex elapsedWall timeNotes
online + Reach7525065.46s67ssharded baseline after dynamic-repo fix
offline + Reach7525061.51s63sstages call graph facts; still pays larger finalization cost

Package batching experiment

mix exograph.index.hex has an explicit --package-batch-size option for experimenting with flushing multiple packages together. Quality checks on top --limit 500 matched the default mode for representative structural queries:

Query/checkDefaultBatch size 4
Map.get(_, _)77
Enum.map(_, _)959959

| _ |> _ | 1864 | 1864 | | jason fragments | 1391 | 1391 |

Controlled top --limit 1000 benchmarks with 16 shards were noisy and did not justify changing the default:

--package-batch-sizeRun 1Run 2Median
1134.5s116.8s125.6s
2145.7s150.1s147.9s
4126.2s130.3s128.2s

Keep package batching explicit until a larger corpus or repeated low-noise runs show a consistent win.

Dead ends from the tuning pass

The following experiments were reverted because they were neutral or worse on the full workload, even when some looked promising in microbenchmarks:

  • Prelooking up existing fragment hashes and direct-appending only missing rows.
  • Replacing the Ecto append/conflict path with an Exograph-local temp staging table plus INSERT ... SELECT ... WHERE NOT EXISTS; a top --limit 500 run was slower than the default path (1m12s vs 1m05s index elapsed in adjacent runs).
  • Resolving existing fragment IDs through a DuckDB temp hash table and join. Isolated lookups were faster for large hash sets, but full top --limit 2000 indexing regressed (2m43s index elapsed vs 2m37s for the adjacent 0.5.8 baseline), likely because transaction/temp-table overhead outweighed lookup gains in the real workload.
  • Batch-staging term strings in offline mode and resolving fragment_terms at finalization time. Top-package parity passed, but top --limit 100 single-DB indexing regressed badly: online was 75 indexed / 25 skipped / 0 failed, 53s index elapsed and 60s wall including a compile; the experimental offline term-batched path was 75 indexed / 25 skipped / 0 failed, 1m07s index elapsed and 75s wall. Timings showed offline_build_stage_fragments ballooning to 124.2s total, so staging large per-fragment term-string payloads was worse than resolving term IDs during per-package puts. Artifacts: /tmp/exograph-batch-online-top100-report.json, /tmp/exograph-batch-online-top100-timings.json, /tmp/exograph-batch-offline-top100-report.json, /tmp/exograph-batch-offline-top100-timings.json.
  • Hybrid offline mode that used the normal online file upsert path while keeping offline fragment/fact staging. Top-package parity passed, but top --limit 100 remained slower than online: adjacent online was 75 indexed / 25 skipped / 0 failed, 55s index elapsed and 62s wall; hybrid offline was 75 indexed / 25 skipped / 0 failed, 1m04s index elapsed and 73s wall. Artifacts: /tmp/exograph-hybrid-online-top100-report.json, /tmp/exograph-hybrid-online-top100-timings.json, /tmp/exograph-hybrid-offline-top100-report.json, /tmp/exograph-hybrid-offline-top100-timings.json.
  • Removing post-insert temp-table cleanup.
  • Combining temp-table create and clear statements.
  • Switching temp-table cleanup from DELETE to TRUNCATE.
  • Increasing DuckDB code-fact insert buffer size from 50k to 100k.
  • Changing AST compression level or selectively changing fragment row construction.
  • Fusing code-fact extraction passes.
  • Special-casing int64, blob, or LIST vector encoding in QuackDB without full-workload wins.
  • Lowering per-package source parsing concurrency.

The next meaningful design target is a dedicated bulk fragment upsert/staging path below the current SQL/Ecto shape. A local SQL staging rewrite was not enough; the larger design likely needs adapter-level support that reduces append + conflict-ignore + ID lookup work together rather than rearranging the same operations.

Artifacts

Machine-readable benchmark artifacts are generated locally and intentionally not committed to git. Use --output-json for the JSON report and --explain-dir for Postgres plans, for example:

mix exograph.bench.backends \
  --mode top --limit 500 --runs 3 \
  --only postgres_plain,duckdb_plain,duckdb_sharded_plain \
  --output-json bench-results/backend-limit500-runs3-current.json \
  --explain-dir bench-results/explain-limit500-runs3-current

bench-results/ is gitignored to avoid polluting the repository with local benchmark outputs. The tables above record the latest checked benchmark summary; regenerate artifacts locally when you need machine-readable evidence or plans.

Current fair wording

A defensible summary is:

On Exograph's Hex.pm top-package workload, tuned Postgres is slightly faster at indexing 100 packages. At 500 packages, DuckDB indexes about 1.66× faster single-node and about 1.99× faster with 4 shards, while several important query paths are roughly 3×–14× faster on DuckDB. These numbers describe Exograph's current backends and local benchmark setup, not PostgreSQL or DuckDB universally.