Exograph is built around one principle: storage and indexes are advisory; ExAST remains the semantic authority for structural matches.

Components

  • ExAST extracts structural terms, comments, symbols, and verifies patterns
  • ExDNA provides structural fingerprints for fragments and similarity search
  • Reach optionally extracts call graph facts
  • Ecto/Postgres stores normalized files, fragments, facts, package scope, and graph facts
  • ParadeDB optionally accelerates text and code-fact retrieval

Indexing pipeline

source files
  ├── ExAST extractor
  │   ├── fragments
  │   ├── comments
  │   ├── definitions
  │   └── references
  ├── Reach extractor (optional)
  │   ├── graph nodes
  │   └── call edges
  └── Postgres stores
      ├── files
      ├── fragments
      ├── facts
      └── package/version scope

For Hex.pm indexing, an outer streaming loop wraps the pipeline:

Hex registry
  └── for each package (concurrent, bounded)
        ├── download tarball (HTTP, mirror round-robin)
        ├── detect Elixir files (skip non-Elixir before disk write)
        ├── extract to tmpdir
        ├── indexing pipeline (above)
        └── rm -rf tmpdir

Storage model

Exograph.Index separates execution by concern:

  • Postgres inverted index: structural term candidate retrieval from fragment rows
  • fragment store: AST blobs, ExDNA hashes, symbols, and file joins
  • source files: source text and aggregated comment text stored once per file
  • code facts: normalized comments, definitions, references, graph nodes, and call edges
  • tree access: derived lazily from stored AST fragments
  • verifier: ExAST.Pattern / ExAST.Query
  • similarity: ExDNA structural reranking

Query execution

Structural queries are planned into candidate retrieval plus verification:

ExAST selector
  ├── required/advisory terms
  ├── Postgres candidate scan
  ├── hydrate fragments/source
  └── ExAST verification

DSL queries add relational candidate filters before structural verification:

Exograph.DSL.Query
  ├── Exograph.DSL.Plan validation
  ├── Ecto query over fragments/facts/calls
  ├── containing-function join semantics
  └── ExAST verification for fragment matches

Lateral joins for line-range containment

The "containing function" join — find the def that contains a given fragment at line N — uses a SQL LATERAL subquery rather than a self-join. The lateral join evaluates the subquery once per outer row and uses the (file_id, line, end_line) index to locate the enclosing fragment in O(log n) per row. This keeps the containing-function semantic available without materializing a closure table.

Advisory locks for concurrent term insertion

When multiple workers index packages concurrently, term insertion into the inverted index can deadlock on duplicate-key conflicts. Exograph acquires a Postgres advisory lock keyed on hash(term_text) before inserting or looking up a term record. This serializes conflicting inserts per term without locking the entire terms table, and retries automatically on the rare case where two workers hash-collide to different lock IDs.

(kind, name, arity) btree index

Most structural patterns extract kind, name, and arity at query planning time (e.g. def handle_call(_, _, _) do ... end → kind=def, name=handle_call, arity=3). A btree index on (kind, name, arity) on the fragments table lets these queries bypass the GIN term index entirely and go to a btree range scan, which is significantly faster at high fragment counts. The GIN term index is used only when the pattern has no extractable kind/name (e.g. _ + _).

File-first text search with lateral fragment lookup

Text and regex search operate file-first rather than fragment-first:

text query
  ├── scan files.source with pg_trgm ILIKE (or BM25 ranking)
  ├── collect matching file IDs
  └── LATERAL join: for each file, find fragments containing the match line

This avoids storing duplicated source text per fragment and keeps files.source as the single source of truth. The lateral join uses the (file_id, line, end_line) btree index to locate the containing fragment efficiently.

Why Postgres

Postgres gives Exograph:

  • durable local/self-hosted indexes
  • Ecto schemas and migrations
  • package/version scope
  • joins across structural and semantic facts
  • optional ParadeDB BM25 indexes
  • a natural substrate for tools that already run inside Elixir applications

Raw SQL boundary

Exograph uses Ecto where possible. Raw SQL is limited to extension/backend features Ecto cannot express directly, especially ParadeDB index creation, tokenizer casts, BM25 operators, and scoring.