Path-oriented read/write access to a git repository — the interface an agent actually wants.
All functions accept a reference that can be any of:
- "HEAD" (or any ref name like "refs/heads/main")
- a raw commit binary SHA (20 bytes)
- a raw tree binary SHA (20 bytes), treated as the root tree
Path separators are always forward slashes. Leading/trailing slashes
are tolerated; "" and "/" refer to the root tree.
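The slash rule above amounts to splitting on "/" and discarding empty segments. A minimal sketch of that rule in plain Elixir follows; it is not Exgit's actual implementation, and `PathSketch.segments/1` is a hypothetical helper introduced only for illustration.

```elixir
defmodule PathSketch do
  # Hypothetical helper: split a repository path into its segments.
  # `trim: true` drops every empty string produced by the split, so leading
  # and trailing slashes vanish and both "" and "/" yield [] (the root tree).
  # Note: this also collapses repeated slashes, which the text above does
  # not specify either way.
  def segments(path) when is_binary(path) do
    String.split(path, "/", trim: true)
  end
end

PathSketch.segments("/lib/exgit/fs.ex/")
# => ["lib", "exgit", "fs.ex"]

PathSketch.segments("/")
# => []
```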
Threaded vs streaming
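Strict operations (read_path/3, ls/3, stat/3, write_path/4,
prefetch/3) return the updated repo as the last element of their
result, usually a tagged triple {:ok, result, repo} (prefetch/3, which
has no separate result, returns {:ok, repo}), so that any
object-store state grown during the call (e.g. lazy cache population
from Promisor.resolve/2) is visible to subsequent calls: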
{:ok, {_mode, blob1}, repo} = FS.read_path(repo, "HEAD", "a.ex")
{:ok, {_mode, blob2}, repo} = FS.read_path(repo, "HEAD", "b.ex")
Streaming operations (walk/2, grep/4) return lazy enumerables
and use pure ObjectStore.get/2. They do NOT fetch missing objects
from a promisor. Prime the cache first with prefetch/3 if needed.
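When the number of reads is not known up front, the same threading can be written as a fold. The sketch below assumes only the documented read_path/3 return shapes; `ThreadingSketch.read_all/3` and its `{:error, {path, reason}}` result are hypothetical names chosen for illustration.

```elixir
defmodule ThreadingSketch do
  alias Exgit.FS

  # Read several paths, passing the repo returned by each call into the
  # next one so that cache growth from earlier reads is reused.
  # Stops at the first failing path.
  def read_all(repo, ref, paths) do
    Enum.reduce_while(paths, {:ok, %{}, repo}, fn path, {:ok, blobs, repo} ->
      case FS.read_path(repo, ref, path) do
        {:ok, {_mode, blob}, repo} ->
          {:cont, {:ok, Map.put(blobs, path, blob), repo}}

        {:error, reason} ->
          {:halt, {:error, {path, reason}}}
      end
    end)
  end
end
```

A caller would then match on `{:ok, blobs_by_path, repo}` and keep using the returned `repo` for later calls.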
Types
@type grep_match() :: %{
  :path => String.t(),
  :line_number => pos_integer(),
  :line => String.t(),
  :match => String.t(),
  optional(:context_before) => [{pos_integer(), String.t()}],
  optional(:context_after) => [{pos_integer(), String.t()}]
}
@type line_range() :: pos_integer() | Range.t() | [pos_integer() | Range.t()]
@type multi_grep_match() :: %{
  :tag => term(),
  :path => String.t(),
  :line_number => pos_integer(),
  :line => String.t(),
  :match => String.t(),
  optional(:context_before) => [{pos_integer(), String.t()}],
  optional(:context_after) => [{pos_integer(), String.t()}]
}
@type path() :: String.t()
@type stat() :: %{ type: :blob | :tree | :submodule, mode: String.t(), size: non_neg_integer() | nil }
Functions
Wait for an async prefetch task to complete.
Returns {:ok, :prefetched} on success, {:error, reason} on
failure, or {:error, :timeout} if the task didn't complete
within timeout_ms (default 60_000).
Does NOT cancel the task on timeout — call cancel_prefetch/1
explicitly if you want to abort.
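Because a timeout leaves the task running, a caller that wants "wait, then give up" semantics has to combine the two calls. A minimal sketch, assuming only the return values documented here; `AwaitSketch.await_or_cancel/2` is a hypothetical wrapper, not part of Exgit.

```elixir
defmodule AwaitSketch do
  alias Exgit.FS

  # Wait up to `timeout_ms` for a prefetch task. On timeout, cancel the
  # task explicitly, since await_prefetch leaves it running.
  def await_or_cancel(task, timeout_ms) do
    case FS.await_prefetch(task, timeout_ms) do
      {:ok, :prefetched} ->
        :ok

      {:error, :timeout} ->
        :ok = FS.cancel_prefetch(task)
        {:error, :timeout}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```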
@spec cancel_prefetch(Task.t()) :: :ok
Cancel an in-flight async prefetch.
The task is shut down brutally; any work it had done is
discarded (the handle was never updated). Emits
[:exgit, :fs, :prefetch_async, :cancelled] telemetry so
operators can see cancellation patterns.
Returns :ok whether the task was still running or already
completed.
@spec exists?(Exgit.Repository.t(), ref(), path()) :: boolean()
Return true if the path exists under the given reference.
Does not return the updated repo — this is a boolean shortcut. If you
care about the grown cache after the check, use stat/3.
@spec glob(Exgit.Repository.t(), ref(), String.t()) :: [String.t()]
List the file paths matching pattern, like Path.wildcard/1.
Walks the tree lazily (pure ObjectStore.get/2, does not grow the
cache) and returns the sorted list of matching paths. Unlike
walk/2 and grep/4 this is not a stream — sorting requires
collecting every match, so the whole tree is traversed before the
call returns. An unmatched pattern returns [].
@spec grep(Exgit.Repository.t(), ref(), String.t() | Regex.t(), keyword()) :: Enumerable.t()
Stream grep over the blobs reachable from reference. Streaming; does
not grow the cache.
Options
- :path - glob restricting which paths are searched. Default "**".
- :max_count - stop after N matches (across all files). Default unlimited.
- :include_binary - include binary blobs. Default false; binary detection is a NUL-byte heuristic on the first 8 KB.
- :case_insensitive - "i" regex flag. Default false.
- :max_concurrency - parallel worker count. Default 1.
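The NUL-byte heuristic mentioned for :include_binary can be pictured as follows. This is an illustrative sketch, not Exgit's code: `BinarySketch.binary?/1` is a hypothetical name, and reading "8 KB" as 8192 bytes is an assumption.

```elixir
defmodule BinarySketch do
  # Assumption: "the first 8 KB" means the first 8192 bytes.
  @probe_bytes 8 * 1024

  # Treat content as binary when a NUL byte occurs within the probe window.
  # NUL bytes after the window are not seen, so this is only a heuristic.
  def binary?(content) when is_binary(content) do
    probe = binary_part(content, 0, min(byte_size(content), @probe_bytes))
    :binary.match(probe, <<0>>) != :nomatch
  end
end

BinarySketch.binary?("defmodule A do\nend\n")
# => false

BinarySketch.binary?(<<137, 80, 78, 71, 0, 0>>)
# => true
```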
Context
- :context - symmetric context, N lines before + after each match. Sets :before and :after to the same value.
- :before - lines of context BEFORE each match.
- :after - lines of context AFTER each match.
When any of :context, :before, or :after is positive, result
rows gain :context_before and :context_after fields, each a list
of {line_number, line} tuples. Lists may be empty when a match is
near the start or end of a file. When no context is requested, those
fields are absent from the returned map — existing callers pattern-
matching %{path: _, line_number: _, line: _, match: _} continue to
work.
When two matches in the same file are closer than before + after
lines, their context ranges overlap. Each match emits its own
independent row; the consumer may deduplicate by line number.
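One way a consumer might do that deduplication is sketched below. It assumes the rows all come from the same file and have the grep_match() shape; `ContextSketch.merged_lines/1` is a hypothetical helper.

```elixir
defmodule ContextSketch do
  # Collapse the rows for ONE file into a sorted list of unique
  # {line_number, line} tuples: context before, the matching line itself,
  # and context after. The context keys are optional, so default to [].
  def merged_lines(rows) do
    rows
    |> Enum.flat_map(fn row ->
      Map.get(row, :context_before, []) ++
        [{row.line_number, row.line}] ++
        Map.get(row, :context_after, [])
    end)
    |> Enum.uniq_by(fn {line_number, _line} -> line_number end)
    |> Enum.sort_by(fn {line_number, _line} -> line_number end)
  end
end
```

For a multi-file result, grouping with `Enum.group_by(rows, & &1.path)` first and merging each group separately keeps line numbers from different files apart.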
@spec ls(Exgit.Repository.t(), ref(), path()) :: {:ok, [{String.t(), String.t(), binary()}], Exgit.Repository.t()} | {:error, term()}
List entries of the directory at path. Returns {:ok, entries, repo}.
@spec multi_grep(Exgit.Repository.t(), ref(), multi_grep_patterns(), keyword()) :: Enumerable.t()
Stream grep over multiple patterns in a single tree walk.
Each result row is tagged with the pattern that matched, so an
agent looking for "any of N vulnerability signatures" or "any
of N identifiers" gets back a uniform stream with :tag
identifying which pattern hit.
patterns may be either:
- a map %{tag => pattern}: the tag is whatever the caller chose (atom, string, tuple, anything); each result row has :tag => that_tag.
- a list [pattern, ...]: each pattern is its own tag (the pattern itself appears in :tag).
pattern in each position is the same type grep/4 accepts:
a string (escaped to a literal regex) or a %Regex{}.
Result shape
%{
tag: :auth,
path: "lib/auth.ex",
line_number: 42,
line: " @auth_token System.get_env(...)",
match: "auth_token",
# :context_before / :context_after present iff a context
# option was set (same semantics as `grep/4`).
}
Two patterns matching the same (path, line_number) produce
two result rows, each with its own tag. Consumers that want
per-line deduplication can merge on (path, line_number).
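A merge on (path, line_number) could look like the sketch below, which collects the tags that hit each location. It relies only on the documented row fields; `TagMergeSketch.tags_by_location/1` is a hypothetical helper.

```elixir
defmodule TagMergeSketch do
  # Group multi_grep rows by location and keep, for each location, the
  # list of tags that matched there (in the order the rows arrived).
  def tags_by_location(rows) do
    Enum.group_by(rows, &{&1.path, &1.line_number}, & &1.tag)
  end
end

rows = [
  %{tag: :token, path: "lib/auth.ex", line_number: 42},
  %{tag: :secret, path: "lib/auth.ex", line_number: 42},
  %{tag: :key, path: "lib/api.ex", line_number: 7}
]

TagMergeSketch.tags_by_location(rows)
# => %{{"lib/api.ex", 7} => [:key], {"lib/auth.ex", 42} => [:token, :secret]}
```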
Options
Same options as grep/4: :path, :max_count,
:include_binary, :case_insensitive, :max_concurrency,
:context / :before / :after. :case_insensitive applies
uniformly to all patterns.
Patterns within a list/map are NOT deduplicated; duplicate patterns with distinct tags are scanned once per tag, producing duplicate result rows (the caller asked for it).
Implementation
Sequentially applies each compiled regex to each blob. For N patterns and M blobs this is O(N×M) regex scans, but the walk + path-glob filter + binary-detection + blob-decompress happen once per blob — dominating costs for real workloads. An alternation-regex alternative (one scan, pattern dispatch via named captures) would be a future optimization; the current shape is correct-first and easy to debug.
Example
patterns = %{
token: ~r/auth_token/i,
key: ~r/api_key/i,
secret: "SECRET"
}
repo
|> Exgit.FS.multi_grep("HEAD", patterns, context: 2)
|> Enum.group_by(& &1.tag)
@spec prefetch(Exgit.Repository.t(), ref(), keyword()) :: {:ok, Exgit.Repository.t()} | {:error, term()}
Prefetch trees reachable from reference (and optionally all
blobs) into the object store.
Options:
- :blobs - when true, fetch blobs in addition to trees. When the call fetches blobs AND the repo is :lazy, the returned repo's :mode flips to :eager because every object reachable from reference is now resident and streaming ops can proceed without further transport calls. When blobs: false, mode is unchanged.
Prefer Exgit.Repository.materialize/2 for the one-shot
"lazy-to-eager" conversion; prefetch/3 is the progressive
variant that lets you stage trees and blobs independently.
Performance
For :lazy repos backed by a remote Promisor, prefetch uses a
single batched fetch: one want <tree_sha> HTTP request that
asks the server to pack up the tree AND everything reachable from
it (or with filter: blob:none when blobs: false). This
collapses what could be thousands of on-demand HTTP round trips
into one request + one pack parse. Measured on
anomalyco/opencode (4,605 files, ~53 MB pack): prefetch drops
from ~52s (per-object) to ~7s (batched) on a home connection.
For already-materialized repos (eager, Disk, Memory) the call
is a no-op — everything reachable is already local.
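Since streaming operations do not fetch from a promisor, a typical use of prefetch/3 is to run it immediately before a grep. The sketch below uses only the documented prefetch/3 and grep/4 signatures and options; `PrefetchSketch.grep_after_prefetch/3`, the "lib/**" glob, and the limit of 20 are illustrative choices.

```elixir
defmodule PrefetchSketch do
  alias Exgit.FS

  # Make every object reachable from `ref` local, then run a streaming
  # grep over it. On failure the {:error, reason} from prefetch/3 is
  # returned unchanged.
  def grep_after_prefetch(repo, ref, pattern) do
    with {:ok, repo} <- FS.prefetch(repo, ref, blobs: true) do
      matches =
        repo
        |> FS.grep(ref, pattern, path: "lib/**", max_count: 20)
        |> Enum.to_list()

      {:ok, matches, repo}
    end
  end
end
```

Returning the prefetched `repo` alongside the matches lets the caller reuse the populated cache for later reads.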
@spec prefetch_async(Exgit.RepoHandle.t(), ref(), keyword()) :: {:ok, Task.t()} | {:error, term()}
Kick off a prefetch against a RepoHandle as a background task.
Returns {:ok, task} immediately. The task runs under
Exgit.TaskSupervisor and calls prefetch/3 against the
handle's current snapshot. When the task completes, it atomically
commits the populated repo back to the handle via
RepoHandle.update/2.
Critically, the prefetch uses update/2 (not put/2), so if
other processes have written to the handle during the prefetch,
their writes are preserved — the prefetch's new objects are
imported into whatever the current cache is at commit time.
Options
Forwarded to prefetch/3:
- :blobs (default true) - fetch blobs in addition to trees for the HEAD reachability. The sensible default for async prefetch: if you're going to search or read, you need blobs.
Lifecycle
- Await completion: await_prefetch(task, timeout).
- Cancel in flight: cancel_prefetch(task). Any work the task had done is discarded; the handle is unchanged.
- Result on success: {:ok, :prefetched}.
- Result on failure: {:error, reason}.
Telemetry
The task emits [:exgit, :fs, :prefetch_async, :start] and
[:exgit, :fs, :prefetch_async, :stop] events so operators
can see background prefetches in their dashboards.
Example
{:ok, handle} = Exgit.RepoHandle.start_link(repo)
{:ok, task} = Exgit.FS.prefetch_async(handle)
# ... do other work with the handle; reads see a growing cache ...
{:ok, :prefetched} = Exgit.FS.await_prefetch(task, 30_000)
@spec prefetch_history(Exgit.Repository.t(), ref()) :: {:ok, Exgit.Repository.t()} | {:error, term()}
Prefetch the commit graph (commits + trees, no blobs) reachable
from reference. Required for operations that walk history such
as Exgit.Blame.blame/3.
This is a separate concern from prefetch/3:
- prefetch(repo, ref, blobs: true) fetches HEAD's tree + all reachable blobs: everything grep / read_path / walk need.
- prefetch_history(repo, ref) additionally fetches ancestor commits + their trees. History walks (blame, log, merge_base) need this.
Splitting them means callers pay only for what they use. On
anomalyco/opencode: prefetch(blobs: true) ~8s (53 MB of
blobs), prefetch_history/2 ~2s (15 MB of commit graph), full
materialization ~10s — vs ~52s in the older non-batched
implementation.
Transport protocol
Issues one want <commit_sha> with filter: "blob:none". The
server returns a pack containing every commit and tree
reachable from the commit, minus blob content. For repos with
deep history this is much smaller than the full reachability
set — opencode's commit graph is 15 MB vs its full reachability
of 211 MB.
For non-Promisor stores (Disk / Memory / eager) this is a no-op.
@spec read_lines(Exgit.Repository.t(), ref(), path(), line_range()) :: {:ok, [{pos_integer(), String.t()}], Exgit.Repository.t()} | {:error, term()}
Read a slice of path at reference, returning only the lines
within line_range.
line_range is 1-indexed and accepts:
- N: a single line number.
- first..last: inclusive range (step must be 1).
- a list of any of the above.
Returns {:ok, [{line_number, line}], repo}. Line numbers match
FS.grep/4's convention:
- a trailing \n does NOT create a phantom empty line;
- a file not ending in \n still counts its partial last line;
- an empty file has zero lines.
Requested lines that fall outside the file are silently dropped
(so read_lines(repo, ref, path, 1..1000) returns up to as many
lines as the file has, rather than erroring). Duplicate or
overlapping ranges in a list-form range are deduplicated, and
returned lines are sorted ascending.
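The line-counting convention and the range normalization described above can be reproduced in a few lines of plain Elixir. This is a sketch of the documented behaviour, not the library's implementation (which the text says works with binary parts rather than splitting); `LineSliceSketch` and its functions are hypothetical, and invalid ranges raise here instead of returning {:error, {:invalid_line_range, _}}.

```elixir
defmodule LineSliceSketch do
  # Number the lines of `content` from 1. A trailing "\n" ends the last
  # line instead of starting an empty one, and "" has no lines at all.
  def number_lines(content) when is_binary(content) do
    parts = String.split(content, "\n")
    parts = if List.last(parts) == "", do: Enum.drop(parts, -1), else: parts

    parts
    |> Enum.with_index(1)
    |> Enum.map(fn {line, number} -> {number, line} end)
  end

  # Select the requested lines: duplicates collapse, out-of-range numbers
  # are dropped, and the result comes back in ascending order.
  def slice(content, line_range) do
    wanted =
      line_range
      |> List.wrap()
      |> Enum.flat_map(&expand/1)
      |> MapSet.new()

    Enum.filter(number_lines(content), fn {number, _line} -> number in wanted end)
  end

  # Only positive integers and unit-step ranges are accepted.
  defp expand(n) when is_integer(n) and n > 0, do: [n]
  defp expand(first..last//1) when first > 0, do: Enum.to_list(first..last//1)
end

LineSliceSketch.slice("a\nb\nc\n", [3, 1..2, 2, 10])
# => [{1, "a"}, {2, "b"}, {3, "c"}]
```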
Errors
- {:error, :not_found}: path missing
- {:error, :not_a_blob}: path is a directory
- {:error, {:invalid_line_range, term()}}: unparseable range (zero/negative line numbers, non-unit step, etc.)
Why not just read_path and slice?
For a 10k-line source file, read_path materializes the full
decompressed blob and the caller then does the line splitting
and binary_parts. This function does one decompress + one
newline scan + O(requested_lines) binary_parts — same result,
bounded work per call. It also composes with grep +
:context: grep can give you a match and narrow context;
read_lines can give you wider context only when the agent
asks.
Examples
{:ok, [{42, "def foo do"}], _repo} =
FS.read_lines(repo, "HEAD", "lib/a.ex", 42)
{:ok, lines, _repo} =
FS.read_lines(repo, "HEAD", "lib/a.ex", 10..20)
{:ok, lines, _repo} =
FS.read_lines(repo, "HEAD", "lib/a.ex", [1, 10..12, 100])
@spec read_path(Exgit.Repository.t(), ref(), path(), keyword()) :: {:ok, {String.t(), Exgit.Object.Blob.t() | {:lfs_pointer, Exgit.LFS.pointer_info()}}, Exgit.Repository.t()} | {:error, :not_found | :not_a_blob | :submodule | term()}
Read the blob at path. Returns {:ok, {mode, %Blob{}}, repo} or
{:error, reason}. The returned repo reflects any cache growth
triggered during resolution.
Options
- :detect_lfs_pointers (default false) - when true, blobs that parse as git-lfs pointer files are returned as {:ok, {mode, {:lfs_pointer, info}}, repo} instead of {:ok, {mode, %Blob{}}, repo}. info is a map with :oid, :size, and :raw (the original pointer bytes).
  Detection only: the actual LFS content is never fetched (that requires a separate batch-API protocol against the LFS server). Callers that need the real bytes can hand info.raw to an LFS client. An agent reading blobs without this flag against an LFS-using repo will silently receive ~130-byte pointer text as if it were file content, which is a correctness cliff. See Exgit.LFS for detection details.
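With the flag enabled, a caller has to handle both result shapes. A minimal sketch based on the return values documented above; `LfsSketch.read/3` and the tuples it returns are hypothetical.

```elixir
defmodule LfsSketch do
  alias Exgit.FS

  # Read a path with LFS detection on and classify the result.
  # The :lfs_pointer clause must come first: the generic blob clause
  # below would otherwise match the {:lfs_pointer, info} tuple as well.
  def read(repo, ref, path) do
    case FS.read_path(repo, ref, path, detect_lfs_pointers: true) do
      {:ok, {_mode, {:lfs_pointer, info}}, repo} ->
        {:lfs_pointer, info.oid, info.size, repo}

      {:ok, {_mode, blob}, repo} ->
        {:blob, blob, repo}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```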
@spec rm_path(Exgit.Repository.t(), ref(), path(), keyword()) :: {:ok, binary(), Exgit.Repository.t()} | {:error, term()}
Remove the entry at path from the tree at reference. Returns
{:ok, new_tree_sha, repo} — the new tree omits the entry; existing
blob/tree objects are left untouched (git is content-addressed; orphan
objects are GC'd separately).
Options
- :recursive - when true, removing a directory also removes its contents. Default false; removing a directory without :recursive returns {:error, :eisdir}.
Errors:
- {:error, :not_found}: path does not exist in the tree
- {:error, :eisdir}: path is a directory and :recursive is not set
- {:error, :cannot_rm_root}: path is empty or "/"
Mirrors write_path/5's tree-rewrite shape so a workspace can chain
rm_path and write_path calls to assemble multi-file edits before
committing.
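Such a chain might look like the sketch below. It rests on an assumption drawn from the reference rules at the top of this page: because a raw tree SHA is a valid reference, the tree SHA returned by one call can be passed as the reference of the next. `EditSketch.replace_file/3` and the file names are hypothetical.

```elixir
defmodule EditSketch do
  alias Exgit.FS

  # Add lib/new.ex and drop lib/old.ex, starting from `ref`.
  # Assumption: the new tree SHA from write_path is usable as the
  # reference for rm_path, so the second edit builds on the first.
  def replace_file(repo, ref, content) do
    with {:ok, tree, repo} <- FS.write_path(repo, ref, "lib/new.ex", content),
         # recursive: false is the documented default, spelled out here.
         {:ok, tree, repo} <- FS.rm_path(repo, tree, "lib/old.ex", recursive: false) do
      {:ok, tree, repo}
    end
  end
end
```

The first failing step short-circuits the `with`, so the caller sees that step's `{:error, reason}` unchanged.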
@spec size(Exgit.Repository.t(), ref(), path()) :: {:ok, non_neg_integer(), Exgit.Repository.t()} | {:error, term()}
Size in bytes of the blob at path — WITHOUT reading its content.
The size-aware companion to read_path/4: use it to decide whether a
blob is too large to pull into memory before you pull it. For the
in-memory store this is O(1) (the size is indexed, not recomputed);
for on-disk loose objects it inflates only the header.
Resolving the path may fetch tree objects (small) on a lazy clone,
but the blob itself is never fetched. For a lazy/partial clone whose
blob has not been materialized yet, returns {:error, :not_local}
rather than triggering a possibly-multi-GB fetch — call read_path/4
when you actually want the bytes. Directories return
{:error, :not_a_blob}. Gitlink (submodule) entries return
{:error, :submodule} — the entry's SHA names a commit in the
submodule's own repository, so it has no size here and no amount
of prefetching will make it local.
{:ok, size, repo} = Exgit.FS.size(repo, "HEAD", "go.mod")
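The "check before you pull" pattern described above can be sketched as a small guard. Only size/3 and read_path/3 are taken from this page; `SizeGuardSketch.read_if_small/4` and the `{:too_large, size}` error are hypothetical.

```elixir
defmodule SizeGuardSketch do
  alias Exgit.FS

  # Read a blob only when it is at most `max_bytes`. Errors from size/3
  # (:not_local, :not_a_blob, :submodule, ...) are passed through.
  def read_if_small(repo, ref, path, max_bytes) do
    case FS.size(repo, ref, path) do
      {:ok, size, repo} when size <= max_bytes ->
        # Use the repo returned by size/3 so tree lookups are not repeated.
        FS.read_path(repo, ref, path)

      {:ok, size, _repo} ->
        {:error, {:too_large, size}}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```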
@spec stat(Exgit.Repository.t(), ref(), path()) :: {:ok, stat(), Exgit.Repository.t()} | {:error, term()}
Stat the path. Returns {:ok, stat, repo}.
Gitlink (submodule) entries stat as %{type: :submodule, size: nil}
without fetching anything — the entry's SHA lives in the submodule's
own repository.
@spec walk(Exgit.Repository.t(), ref()) :: Enumerable.t()
Lazy {path, blob_sha} stream of every file reachable from the given
reference's tree.
This is a streaming operation — it does NOT grow the object store cache on a lazy repo. Prefetch first if needed.
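Because the result is an ordinary enumerable of {path, blob_sha} tuples, it composes with Enum and Stream. As an illustration, the sketch below counts files per extension; `WalkSketch.extension_counts/2` is a hypothetical helper, and on a lazy repo it assumes the trees have already been prefetched.

```elixir
defmodule WalkSketch do
  alias Exgit.FS

  # Count files per extension by consuming the lazy {path, blob_sha}
  # stream. Files without an extension are counted under "".
  def extension_counts(repo, ref) do
    repo
    |> FS.walk(ref)
    |> Enum.frequencies_by(fn {path, _blob_sha} -> Path.extname(path) end)
  end
end
```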
@spec write_path(Exgit.Repository.t(), ref(), path(), binary(), keyword()) :: {:ok, binary(), Exgit.Repository.t()} | {:error, term()}
Write content to path. Returns {:ok, new_tree_sha, repo}.