Exgit.FS (exgit v0.1.0)


Path-oriented read/write access to a git repository — the interface an agent actually wants.

All functions accept a reference that can be any of:

  • "HEAD" (or any ref name like "refs/heads/main")
  • a raw binary commit SHA (20 bytes)
  • a raw binary tree SHA (20 bytes), treated as the root tree

Path separators are always forward slashes. Leading/trailing slashes are tolerated; "" and "/" refer to the root tree.
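
The path rule above can be pictured as splitting on "/" and discarding empty segments. The following is a minimal sketch only: PathSketch is a hypothetical module, not part of Exgit, and the fact that it also collapses doubled slashes is a side effect of the sketch rather than documented behaviour.

```elixir
defmodule PathSketch do
  # Hypothetical helper, not part of Exgit: splits a repository path into
  # segments, ignoring leading, trailing, and empty segments.
  def segments(path) when is_binary(path) do
    String.split(path, "/", trim: true)
  end

  # "" and "/" both produce no segments, i.e. they name the root tree.
  def root?(path), do: segments(path) == []
end
```

With this sketch, PathSketch.segments("/lib/a.ex/") and PathSketch.segments("lib/a.ex") yield the same segment list.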

Threaded vs streaming

Strict operations (read_path/3, ls/3, stat/3, write_path/4, prefetch/3) thread the repository through each call. They return a tagged triple {:ok, result, repo} (prefetch/3, which has no separate result, returns {:ok, repo}) so that any object-store state grown during the call (e.g. lazy cache population from Promisor.resolve/2) is visible to subsequent calls:

{:ok, {_mode, blob1}, repo} = FS.read_path(repo, "HEAD", "a.ex")
{:ok, {_mode, blob2}, repo} = FS.read_path(repo, "HEAD", "b.ex")
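
When many paths are read in a row, the same threading can be folded into a reduce. This is a sketch under assumptions: ThreadedReads is a hypothetical helper, not part of Exgit, and the reader is passed in as a function so that the only thing it relies on is the documented return shape of read_path/3.

```elixir
defmodule ThreadedReads do
  # Hypothetical helper, not part of Exgit. `read_fun` is expected to behave
  # like Exgit.FS.read_path/3: {:ok, {mode, blob}, repo} | {:error, reason}.
  def read_all(repo, reference, paths, read_fun) do
    result =
      Enum.reduce_while(paths, {:ok, [], repo}, fn path, {:ok, acc, repo} ->
        case read_fun.(repo, reference, path) do
          {:ok, {_mode, blob}, repo} ->
            # Carry the updated repo into the next read.
            {:cont, {:ok, [{path, blob} | acc], repo}}

          {:error, reason} ->
            {:halt, {:error, {path, reason}}}
        end
      end)

    case result do
      {:ok, acc, repo} -> {:ok, Enum.reverse(acc), repo}
      {:error, _} = error -> error
    end
  end
end
```

A call would look like ThreadedReads.read_all(repo, "HEAD", ["a.ex", "b.ex"], &Exgit.FS.read_path/3); the repo in the result is the one returned by the last successful read.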

Streaming operations (walk/2, grep/4) return lazy enumerables and use pure ObjectStore.get/2. They do NOT fetch missing objects from a promisor. Prime the cache first with prefetch/3 if needed.

Types

grep_match()

@type grep_match() :: %{
  :path => String.t(),
  :line_number => pos_integer(),
  :line => String.t(),
  :match => String.t(),
  optional(:context_before) => [{pos_integer(), String.t()}],
  optional(:context_after) => [{pos_integer(), String.t()}]
}

line_range()

@type line_range() :: pos_integer() | Range.t() | [pos_integer() | Range.t()]

multi_grep_match()

@type multi_grep_match() :: %{
  :tag => term(),
  :path => String.t(),
  :line_number => pos_integer(),
  :line => String.t(),
  :match => String.t(),
  optional(:context_before) => [{pos_integer(), String.t()}],
  optional(:context_after) => [{pos_integer(), String.t()}]
}

multi_grep_patterns()

@type multi_grep_patterns() ::
  %{required(atom() | String.t()) => String.t() | Regex.t()}
  | [String.t() | Regex.t()]

path()

@type path() :: String.t()

ref()

@type ref() :: String.t() | binary()

stat()

@type stat() :: %{
  type: :blob | :tree | :submodule,
  mode: String.t(),
  size: non_neg_integer() | nil
}

Functions

await_prefetch(task, timeout_ms \\ 60000)

@spec await_prefetch(Task.t(), timeout()) :: {:ok, :prefetched} | {:error, term()}

Wait for an async prefetch task to complete.

Returns {:ok, :prefetched} on success, {:error, reason} on failure, or {:error, :timeout} if the task didn't complete within timeout_ms (default 60_000).

Does NOT cancel the task on timeout — call cancel_prefetch/1 explicitly if you want to abort.
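
Because a timeout leaves the task running, a caller that wants "wait, then give up" semantics has to combine the two calls itself. A minimal sketch, assuming a hypothetical PrefetchWait wrapper (not part of Exgit) and using only the return values documented here; the `fs` argument exists so the wrapper can be exercised against a stub.

```elixir
defmodule PrefetchWait do
  # Hypothetical helper, not part of Exgit. `fs` is any module exposing
  # await_prefetch/2 and cancel_prefetch/1 as documented for Exgit.FS.
  def await_or_cancel(task, timeout_ms, fs \\ Exgit.FS) do
    case fs.await_prefetch(task, timeout_ms) do
      {:ok, :prefetched} ->
        :ok

      {:error, :timeout} ->
        # The task is still running after a timeout; stop it explicitly.
        :ok = fs.cancel_prefetch(task)
        {:error, :timeout}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```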

cancel_prefetch(task)

@spec cancel_prefetch(Task.t()) :: :ok

Cancel an in-flight async prefetch.

The task is shut down brutally; any work it had done is discarded (the handle was never updated). Emits [:exgit, :fs, :prefetch_async, :cancelled] telemetry so operators can see cancellation patterns.

Returns :ok whether the task was still running or already completed.

exists?(repo, reference, path)

@spec exists?(Exgit.Repository.t(), ref(), path()) :: boolean()

Return true if the path exists under the given reference.

Does not return the updated repo — this is a boolean shortcut. If you care about the grown cache after the check, use stat/3.

glob(repo, reference, pattern)

@spec glob(Exgit.Repository.t(), ref(), String.t()) :: [String.t()]

List the file paths matching pattern, like Path.wildcard/1.

Walks the tree lazily (pure ObjectStore.get/2, does not grow the cache) and returns the sorted list of matching paths. Unlike walk/2 and grep/4 this is not a stream — sorting requires collecting every match, so the whole tree is traversed before the call returns. An unmatched pattern returns [].

grep(repo, reference, pattern, opts \\ [])

@spec grep(Exgit.Repository.t(), ref(), String.t() | Regex.t(), keyword()) ::
  Enumerable.t()

Stream grep over the blobs reachable from reference. Streaming; does not grow the cache.

Options

  • :path — glob restricting which paths are searched. Default "**".
  • :max_count — stop after N matches (across all files). Default unlimited.
  • :include_binary — include binary blobs. Default false; binary detection is a NUL-byte heuristic on the first 8 KB.
  • :case_insensitive — "i" regex flag. Default false.
  • :max_concurrency — parallel worker count. Default 1.
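
The binary heuristic mentioned for :include_binary can be illustrated in a few lines. This is a sketch, not Exgit's implementation: BinarySniff is a hypothetical module, and "8 KB" is assumed to mean 8192 bytes.

```elixir
defmodule BinarySniff do
  # Assumption: "first 8 KB" means the first 8192 bytes.
  @sniff_bytes 8 * 1024

  # Hypothetical helper, not part of Exgit: treats content as binary when a
  # NUL byte appears within the sniffed prefix.
  def binary?(content) when is_binary(content) do
    head = binary_part(content, 0, min(byte_size(content), @sniff_bytes))
    :binary.match(head, <<0>>) != :nomatch
  end
end
```

A consequence of any prefix-based heuristic is that a NUL byte appearing only after the sniffed prefix goes unnoticed.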

Context

  • :context — symmetric context, N lines before + after each match. Sets :before and :after to the same value.
  • :before — lines of context BEFORE each match.
  • :after — lines of context AFTER each match.

When any of :context, :before, or :after is positive, result rows gain :context_before and :context_after fields, each a list of {line_number, line} tuples. Lists may be empty when a match is near the start or end of a file. When no context is requested, those fields are absent from the returned map, so existing callers pattern-matching %{path: _, line_number: _, line: _, match: _} continue to work.

When two matches in the same file are closer than before + after lines, their context ranges overlap. Each match emits its own independent row; the consumer may deduplicate by line number.
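
One way a consumer might deduplicate overlapping context is to flatten the rows of a single file into a sorted list of unique lines. A minimal sketch, assuming a hypothetical GrepContext module (not part of Exgit) and rows that all belong to the same path:

```elixir
defmodule GrepContext do
  # Hypothetical helper, not part of Exgit: merges the match lines and
  # context lines of rows from ONE file into unique {line_number, line}
  # tuples, sorted by line number.
  def merged_lines(rows) do
    rows
    |> Enum.flat_map(fn row ->
      Map.get(row, :context_before, []) ++
        [{row.line_number, row.line}] ++
        Map.get(row, :context_after, [])
    end)
    |> Enum.uniq_by(fn {line_number, _line} -> line_number end)
    |> Enum.sort_by(fn {line_number, _line} -> line_number end)
  end
end
```

Rows produced without any context option work as well, since the missing fields default to empty lists.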

ls(repo, reference, path)

@spec ls(Exgit.Repository.t(), ref(), path()) ::
  {:ok, [{String.t(), String.t(), binary()}], Exgit.Repository.t()}
  | {:error, term()}

List entries of the directory at path. Returns {:ok, entries, repo}.

multi_grep(repo, reference, patterns, opts \\ [])

@spec multi_grep(Exgit.Repository.t(), ref(), multi_grep_patterns(), keyword()) ::
  Enumerable.t()

Stream grep over multiple patterns in a single tree walk.

Each result row is tagged with the pattern that matched, so an agent looking for "any of N vulnerability signatures" or "any of N identifiers" gets back a uniform stream with :tag identifying which pattern hit.

patterns may be either:

  • a map %{tag => pattern} — the tag is whatever the caller chose (atom, string, tuple, anything); each result row has :tag => that_tag.
  • a list [pattern, ...] — each pattern is its own tag (the pattern itself appears in :tag).

pattern in each position is the same type grep/4 accepts: a string (escaped to a literal regex) or a %Regex{}.
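
The two accepted shapes can be thought of as normalizing to a list of {tag, regex} pairs. The sketch below illustrates that idea only; PatternSketch is a hypothetical module, not Exgit's implementation, and it ignores options such as :case_insensitive.

```elixir
defmodule PatternSketch do
  # Hypothetical helper, not Exgit's implementation: turns either accepted
  # `patterns` shape into a list of {tag, %Regex{}} pairs.
  def normalize(patterns) when is_map(patterns) do
    for {tag, pattern} <- patterns, do: {tag, compile(pattern)}
  end

  def normalize(patterns) when is_list(patterns) do
    # In list form each pattern doubles as its own tag.
    for pattern <- patterns, do: {pattern, compile(pattern)}
  end

  defp compile(%Regex{} = regex), do: regex

  # Strings are matched literally, so regex metacharacters are escaped.
  defp compile(string) when is_binary(string) do
    Regex.compile!(Regex.escape(string))
  end
end
```

With this sketch the string "a.b" matches only a literal "a.b", not "axb", which is the practical difference between passing a string and passing ~r/a.b/.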

Result shape

%{
  tag: :auth,
  path: "lib/auth.ex",
  line_number: 42,
  line: "  @auth_token System.get_env(...)",
  match: "auth_token",
  # :context_before / :context_after present iff a context
  # option was set (same semantics as `grep/4`).
}

Two patterns matching the same (path, line_number) produce two result rows, each with its own tag. Consumers that want per-line deduplication can merge on (path, line_number).
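
A consumer that prefers one row per line could merge on (path, line_number) and collect the tags. A minimal sketch, assuming a hypothetical MultiGrepMerge module that is not part of Exgit; the :tags field is an invention of this sketch.

```elixir
defmodule MultiGrepMerge do
  # Hypothetical helper, not part of Exgit: collapses rows that share
  # {path, line_number} into one row carrying every tag that hit the line.
  def by_line(rows) do
    rows
    |> Enum.group_by(&{&1.path, &1.line_number})
    |> Enum.map(fn {{path, line_number}, hits} ->
      %{
        path: path,
        line_number: line_number,
        line: hd(hits).line,
        tags: Enum.map(hits, & &1.tag)
      }
    end)
    |> Enum.sort_by(&{&1.path, &1.line_number})
  end
end
```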

Options

Same options as grep/4: :path, :max_count, :include_binary, :case_insensitive, :max_concurrency, :context / :before / :after. :case_insensitive applies uniformly to all patterns.

Patterns within a list/map are NOT deduplicated; duplicate patterns with distinct tags are scanned once per tag, producing duplicate result rows (the caller asked for it).

Implementation

Sequentially applies each compiled regex to each blob. For N patterns and M blobs this is O(N×M) regex scans, but the walk, path-glob filter, binary detection, and blob decompression happen only once per blob, and those steps dominate the cost for real workloads. An alternation-regex alternative (one scan, pattern dispatch via named captures) would be a future optimization; the current shape is correct-first and easy to debug.

Example

patterns = %{
  token: ~r/auth_token/i,
  key:   ~r/api_key/i,
  secret: "SECRET"
}

repo
|> Exgit.FS.multi_grep("HEAD", patterns, context: 2)
|> Enum.group_by(& &1.tag)

prefetch(repo, reference, opts \\ [])

@spec prefetch(Exgit.Repository.t(), ref(), keyword()) ::
  {:ok, Exgit.Repository.t()} | {:error, term()}

Prefetch trees reachable from reference (and optionally all blobs) into the object store.

Options:

  • :blobs — when true, fetch blobs in addition to trees. When the call fetches blobs AND the repo is :lazy, the returned repo's :mode flips to :eager because every reachable object from reference is now resident and streaming ops can proceed without further transport calls. When blobs: false, mode is unchanged.

Prefer Exgit.Repository.materialize/2 for the one-shot "lazy-to-eager" conversion; prefetch/3 is the progressive variant that lets you stage trees and blobs independently.

Performance

For :lazy repos backed by a remote Promisor, prefetch uses a single batched fetch: one want <tree_sha> HTTP request that asks the server to pack the tree and everything reachable from it (with filter: blob:none added when blobs: false). This collapses what could be thousands of on-demand HTTP round trips into one request plus one pack parse. Measured on anomalyco/opencode (4,605 files, ~53 MB pack): prefetch drops from ~52s (per-object) to ~7s (batched) on a home connection.

For already-materialized repos (eager, Disk, Memory) the call is a no-op — everything reachable is already local.

prefetch_async(handle, reference \\ "HEAD", opts \\ [])

@spec prefetch_async(Exgit.RepoHandle.t(), ref(), keyword()) ::
  {:ok, Task.t()} | {:error, term()}

Kick off a prefetch against a RepoHandle as a background task.

Returns {:ok, task} immediately. The task runs under Exgit.TaskSupervisor and calls prefetch/3 against the handle's current snapshot. When the task completes, it atomically commits the populated repo back to the handle via RepoHandle.update/2.

Critically, the prefetch uses update/2 (not put/2), so if other processes have written to the handle during the prefetch, their writes are preserved — the prefetch's new objects are imported into whatever the current cache is at commit time.
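
The difference between replacing and merging at commit time can be shown with a plain Agent. This is an analogy only: HandleSketch is hypothetical, a map of objects stands in for the repository cache, and nothing here reflects how RepoHandle is actually implemented.

```elixir
defmodule HandleSketch do
  # Analogy only: a plain Agent holding a map of objects stands in for a
  # RepoHandle. None of this is Exgit's implementation.
  def start(objects), do: Agent.start_link(fn -> objects end)

  def snapshot(handle), do: Agent.get(handle, & &1)

  # put-style commit: replaces the state, dropping concurrent writes.
  def put(handle, objects), do: Agent.update(handle, fn _current -> objects end)

  # update-style commit: merges into whatever the state is at commit time.
  def update(handle, new_objects) do
    Agent.update(handle, fn current -> Map.merge(current, new_objects) end)
  end
end

{:ok, handle} = HandleSketch.start(%{"a" => "blob a"})

# The background prefetch works from a snapshot taken when it started.
snapshot = HandleSketch.snapshot(handle)
fetched = Map.put(snapshot, "b", "blob b")

# Meanwhile another process writes to the handle.
HandleSketch.update(handle, %{"c" => "blob c"})

# The prefetch commits with update-style semantics, so "c" survives.
HandleSketch.update(handle, fetched)
merged = HandleSketch.snapshot(handle)
```

Had the last step used HandleSketch.put/2 instead, the state would have been reset to the prefetch's stale snapshot plus "b", and the concurrent write of "c" would have been lost.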

Options

Forwarded to prefetch/3:

  • :blobs (default true) — fetch blobs in addition to the trees reachable from reference. This is the sensible default for async prefetch: if you're going to search or read, you need blobs.

Lifecycle

  • Await completion: await_prefetch(task, timeout).
  • Cancel in flight: cancel_prefetch(task). Any work the task had done is discarded — the handle is unchanged.
  • Result on success: {:ok, :prefetched}.
  • Result on failure: {:error, reason}.

Telemetry

The task emits [:exgit, :fs, :prefetch_async, :start] and [:exgit, :fs, :prefetch_async, :stop] events so operators can see background prefetches in their dashboards.

Example

{:ok, handle} = Exgit.RepoHandle.start_link(repo)
{:ok, task} = Exgit.FS.prefetch_async(handle)

# ... do other work with the handle; reads see a growing cache ...

{:ok, :prefetched} = Exgit.FS.await_prefetch(task, 30_000)

prefetch_history(repo, reference \\ "HEAD")

@spec prefetch_history(Exgit.Repository.t(), ref()) ::
  {:ok, Exgit.Repository.t()} | {:error, term()}

Prefetch the commit graph (commits + trees, no blobs) reachable from reference. Required for operations that walk history such as Exgit.Blame.blame/3.

This is a separate concern from prefetch/3:

  • prefetch(repo, ref, blobs: true) fetches the tree at ref plus all reachable blobs, which is everything grep / read_path / walk need.
  • prefetch_history(repo, ref) additionally fetches ancestor commits + their trees. History walks (blame, log, merge_base) need this.

Splitting them means callers pay only for what they use. On anomalyco/opencode: prefetch(blobs: true) ~8s (53 MB of blobs), prefetch_history/2 ~2s (15 MB of commit graph), full materialization ~10s — vs ~52s in the older non-batched implementation.

Transport protocol

Issues one want <commit_sha> with filter: "blob:none". The server returns a pack containing every commit and tree reachable from the commit, minus blob content. For repos with deep history this is much smaller than the full reachability set — opencode's commit graph is 15 MB vs its full reachability of 211 MB.

For non-Promisor stores (Disk / Memory / eager) this is a no-op.

read_lines(repo, reference, path, line_range)

@spec read_lines(Exgit.Repository.t(), ref(), path(), line_range()) ::
  {:ok, [{pos_integer(), String.t()}], Exgit.Repository.t()} | {:error, term()}

Read a slice of path at reference, returning only the lines within line_range.

line_range is 1-indexed and accepts:

  • N — a single line number.
  • first..last — inclusive range (step must be 1).
  • a list of any of the above.

Returns {:ok, [{line_number, line}], repo}. Line numbers match FS.grep/4's convention:

  • trailing \n does NOT create a phantom empty line;
  • a file not ending in \n still counts its partial last line;
  • an empty file has zero lines.

Requested lines that fall outside the file are silently dropped (so read_lines(repo, ref, path, 1..1000) returns up to as many lines as the file has, rather than erroring). Duplicate or overlapping ranges in a list-form range are deduplicated, and returned lines are sorted ascending.

Errors

  • {:error, :not_found} — path missing
  • {:error, :not_a_blob} — path is a directory
  • {:error, {:invalid_line_range, term()}} — unparseable range (zero/negative line numbers, non-unit step, etc.)
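
The accepted line_range forms and the invalid-range error can be sketched as a small expansion step. LineRangeSketch is a hypothetical module, not Exgit's implementation; in particular, treating an empty range such as 5..4//1 as invalid is an assumption of this sketch. The range pattern with an explicit step requires Elixir 1.12 or later.

```elixir
defmodule LineRangeSketch do
  # Hypothetical helper, not Exgit's implementation: expands a line_range()
  # into a sorted, de-duplicated list of 1-indexed line numbers.
  def expand(range) do
    case collect(range) do
      {:ok, numbers} -> {:ok, numbers |> Enum.uniq() |> Enum.sort()}
      :error -> {:error, {:invalid_line_range, range}}
    end
  end

  defp collect(list) when is_list(list) do
    Enum.reduce_while(list, {:ok, []}, fn item, {:ok, acc} ->
      case item_numbers(item) do
        {:ok, numbers} -> {:cont, {:ok, acc ++ numbers}}
        :error -> {:halt, :error}
      end
    end)
  end

  defp collect(item), do: item_numbers(item)

  defp item_numbers(n) when is_integer(n) and n >= 1, do: {:ok, [n]}

  # Only ascending ranges with step 1 are accepted.
  defp item_numbers(first..last//1) when first >= 1 and last >= first do
    {:ok, Enum.to_list(first..last//1)}
  end

  defp item_numbers(_other), do: :error
end
```

For example, LineRangeSketch.expand([100, 10..12, 1, 11]) yields the numbers 1, 10, 11, 12, and 100 once each, in ascending order.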

Why not just read_path and slice?

For a 10k-line source file, read_path materializes the full decompressed blob and the caller then does the line splitting and binary_parts. This function does one decompress + one newline scan + O(requested_lines) binary_parts — same result, bounded work per call. It also composes with grep + :context: grep can give you a match and narrow context; read_lines can give you wider context only when the agent asks.
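
The "one newline scan plus one binary_part per requested line" idea can be sketched on an in-memory binary. LineSliceSketch is a hypothetical illustration, not Exgit's code; it assumes the content is already decompressed and that the wanted line numbers are valid and sorted.

```elixir
defmodule LineSliceSketch do
  # Hypothetical illustration, not Exgit's code. `wanted` is a list of
  # 1-indexed line numbers; lines past the end of the content are dropped.
  def slice(content, wanted) when is_binary(content) do
    size = byte_size(content)

    # One scan for newline offsets.
    newlines = for {pos, _len} <- :binary.matches(content, "\n"), do: pos

    # {start, stop} byte span of every line. A span starting at `size` is the
    # phantom line after a trailing "\n" (or an empty file), so it is dropped.
    spans =
      [0 | Enum.map(newlines, &(&1 + 1))]
      |> Enum.zip(newlines ++ [size])
      |> Enum.reject(fn {start, _stop} -> start == size end)
      |> List.to_tuple()

    for n <- wanted, n <= tuple_size(spans) do
      {start, stop} = elem(spans, n - 1)
      {n, binary_part(content, start, stop - start)}
    end
  end
end
```

The span table follows the line-counting convention listed above: a trailing newline adds no line, an unterminated last line still counts, and an empty binary has no lines.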

Examples

{:ok, [{42, "def foo do"}], _repo} =
  FS.read_lines(repo, "HEAD", "lib/a.ex", 42)

{:ok, lines, _repo} =
  FS.read_lines(repo, "HEAD", "lib/a.ex", 10..20)

{:ok, lines, _repo} =
  FS.read_lines(repo, "HEAD", "lib/a.ex", [1, 10..12, 100])

read_path(repo, reference, path, opts \\ [])

@spec read_path(Exgit.Repository.t(), ref(), path(), keyword()) ::
  {:ok,
   {String.t(),
    Exgit.Object.Blob.t() | {:lfs_pointer, Exgit.LFS.pointer_info()}},
   Exgit.Repository.t()}
  | {:error, :not_found | :not_a_blob | :submodule | term()}

Read the blob at path. Returns {:ok, {mode, %Blob{}}, repo} or {:error, reason}. The returned repo reflects any cache growth triggered during resolution.

Options

  • :detect_lfs_pointers (default false) — when true, blobs that parse as git-lfs pointer files are returned as {:ok, {mode, {:lfs_pointer, info}}, repo} instead of {:ok, {mode, %Blob{}}, repo}. info is a map with :oid, :size, and :raw (the original pointer bytes).

    Detection only — the actual LFS content is never fetched (that requires a separate batch-API protocol against the LFS server). Callers that need the real bytes can hand info.raw to an LFS client. An agent reading blobs without this flag against an LFS-using repo will silently receive ~130-byte pointer text as if it were file content — a correctness cliff. See Exgit.LFS for detection details.
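
For orientation, a git-lfs pointer file is a short text file with a version line, an oid line, and a size line. The parser below is a hypothetical sketch, not Exgit.LFS: the 1024-byte cap and the choice to strip the "sha256:" prefix from :oid are assumptions of this sketch, and the shape of the info map Exgit actually returns may differ.

```elixir
defmodule LfsPointerSketch do
  @version "https://git-lfs.github.com/spec/v1"
  # Assumption: pointer files are tiny, so larger content is never a pointer.
  @max_bytes 1024

  # Hypothetical parser, not Exgit.LFS: returns {:ok, info} for content that
  # looks like a git-lfs pointer, :error otherwise.
  def parse(content) when is_binary(content) and byte_size(content) <= @max_bytes do
    # Collect "key value" lines into a map; lines without a space are skipped.
    fields =
      for line <- String.split(content, "\n", trim: true),
          [key, value] <- [String.split(line, " ", parts: 2)],
          into: %{},
          do: {key, value}

    with %{"version" => @version, "oid" => "sha256:" <> hex, "size" => size} <- fields,
         {bytes, ""} <- Integer.parse(size) do
      {:ok, %{oid: hex, size: bytes, raw: content}}
    else
      _ -> :error
    end
  end

  def parse(_content), do: :error
end
```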

rm_path(repo, reference, path, opts \\ [])

@spec rm_path(Exgit.Repository.t(), ref(), path(), keyword()) ::
  {:ok, binary(), Exgit.Repository.t()} | {:error, term()}

Remove the entry at path from the tree at reference. Returns {:ok, new_tree_sha, repo} — the new tree omits the entry; existing blob/tree objects are left untouched (git is content-addressed; orphan objects are GC'd separately).

Options

  • :recursive — when true, removing a directory also removes its contents. Default false; removing a directory without :recursive returns {:error, :eisdir}.

Errors:

  • {:error, :not_found} — path does not exist in the tree
  • {:error, :eisdir} — path is a directory and :recursive is not set
  • {:error, :cannot_rm_root} — path is empty or "/"

Mirrors write_path/5's tree-rewrite shape so a workspace can chain rm_path and write_path calls to assemble multi-file edits before committing.
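
Chaining might look like the reduce below, where each step's new tree SHA becomes the reference for the next step. This is a sketch under assumptions: EditChain is hypothetical, and it relies on raw tree SHAs being accepted as references by write_path and rm_path, as the module overview states for all functions. The `fs` argument exists so the helper can be run against a stub.

```elixir
defmodule EditChain do
  # Hypothetical helper, not part of Exgit. `fs` is any module exposing
  # write_path/4 and rm_path/3 as documented for Exgit.FS. Each step's new
  # tree SHA is passed as the reference for the next step.
  def apply_edits(repo, reference, edits, fs \\ Exgit.FS) do
    Enum.reduce_while(edits, {:ok, reference, repo}, fn edit, {:ok, ref, repo} ->
      result =
        case edit do
          {:write, path, content} -> fs.write_path(repo, ref, path, content)
          {:rm, path} -> fs.rm_path(repo, ref, path)
        end

      case result do
        {:ok, tree_sha, repo} -> {:cont, {:ok, tree_sha, repo}}
        {:error, reason} -> {:halt, {:error, {edit, reason}}}
      end
    end)
  end
end
```

A call such as EditChain.apply_edits(repo, "HEAD", [{:write, "lib/a.ex", "..."}, {:rm, "lib/old.ex"}]) would return the final tree SHA together with the threaded repo, ready to be committed.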

size(repo, reference, path)

@spec size(Exgit.Repository.t(), ref(), path()) ::
  {:ok, non_neg_integer(), Exgit.Repository.t()} | {:error, term()}

Size in bytes of the blob at path — WITHOUT reading its content.

The size-aware companion to read_path/4: use it to decide whether a blob is too large to pull into memory before you pull it. For the in-memory store this is O(1) (the size is indexed, not recomputed); for on-disk loose objects it inflates only the header.

Resolving the path may fetch tree objects (small) on a lazy clone, but the blob itself is never fetched. For a lazy/partial clone whose blob has not been materialized yet, returns {:error, :not_local} rather than triggering a possibly-multi-GB fetch — call read_path/4 when you actually want the bytes. Directories return {:error, :not_a_blob}. Gitlink (submodule) entries return {:error, :submodule} — the entry's SHA names a commit in the submodule's own repository, so it has no size here and no amount of prefetching will make it local.

{:ok, size, repo} = Exgit.FS.size(repo, "HEAD", "go.mod")
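
A size check followed by a conditional read could be wrapped as follows. SizeGate is a hypothetical helper, not part of Exgit, and the {:too_large, size} error is an invention of this sketch; errors from size/3 such as :not_local are passed through unchanged.

```elixir
defmodule SizeGate do
  # Hypothetical helper, not part of Exgit. `fs` is any module exposing
  # size/3 and read_path/3 as documented for Exgit.FS.
  def read_if_small(repo, reference, path, max_bytes, fs \\ Exgit.FS) do
    case fs.size(repo, reference, path) do
      {:ok, size, repo} when size <= max_bytes ->
        # Reuse the repo returned by size/3 so cached trees are not refetched.
        fs.read_path(repo, reference, path)

      {:ok, size, _repo} ->
        # Note: this branch discards the updated repo for brevity.
        {:error, {:too_large, size}}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```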

stat(repo, reference, path)

@spec stat(Exgit.Repository.t(), ref(), path()) ::
  {:ok, stat(), Exgit.Repository.t()} | {:error, term()}

Stat the path. Returns {:ok, stat, repo}.

Gitlink (submodule) entries stat as %{type: :submodule, size: nil} without fetching anything — the entry's SHA lives in the submodule's own repository.

walk(repo, reference)

@spec walk(Exgit.Repository.t(), ref()) :: Enumerable.t()

Lazy {path, blob_sha} stream of every file reachable from the given reference's tree.

This is a streaming operation — it does NOT grow the object store cache on a lazy repo. Prefetch first if needed.
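
Since the result is an ordinary enumerable of {path, blob_sha} tuples, it composes with the Enum and Stream modules. A minimal sketch, assuming a hypothetical WalkStats helper that is not part of Exgit:

```elixir
defmodule WalkStats do
  # Hypothetical helper, not part of Exgit: counts files per extension from
  # any enumerable of {path, blob_sha} tuples, such as the walk/2 stream.
  def by_extension(entries) do
    Enum.frequencies_by(entries, fn {path, _sha} -> Path.extname(path) end)
  end
end
```

A call would look like repo |> Exgit.FS.walk("HEAD") |> WalkStats.by_extension(); files without an extension are counted under the empty string.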

write_path(repo, reference, path, content, opts \\ [])

@spec write_path(Exgit.Repository.t(), ref(), path(), binary(), keyword()) ::
  {:ok, binary(), Exgit.Repository.t()} | {:error, term()}

Write content to path. Returns {:ok, new_tree_sha, repo}.