Exgit.ObjectStore.Promisor (exgit v0.1.0)

An object store that fetches missing objects on demand from a transport, caching them locally.

The Promisor is a pure value — no processes, no pids, no shared state. Growing the cache requires the caller to thread the updated struct forward via resolve/2.

{:ok, obj, promisor2} = Promisor.resolve(promisor, sha)
{:ok, obj2, promisor3} = Promisor.resolve(promisor2, other_sha)

Two callers holding the same %Promisor{} see the same cache. Comparing promisors by value (==) reflects their logical state. Sharing via message passing, snapshotting, or serialization just works.
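As a minimal sketch of the threading pattern, the helper below resolves a list of SHAs and carries the grown struct from one call to the next. The module name PromisorFold and the `store` parameter are hypothetical: `store` defaults to the real Promisor and exists only so the sketch can be run against a stub. Transport failures are normalized to a three-element tuple carrying the last good promisor.

```elixir
defmodule PromisorFold do
  @moduledoc "Hypothetical helper: resolve many SHAs while threading the promisor."

  # `store` is any module exposing resolve/2 with the documented return shapes.
  def resolve_all(promisor, shas, store \\ Exgit.ObjectStore.Promisor) do
    result =
      Enum.reduce_while(shas, {:ok, [], promisor}, fn sha, {:ok, acc, p} ->
        case store.resolve(p, sha) do
          # Success: keep the object and carry the grown promisor forward.
          {:ok, obj, p2} -> {:cont, {:ok, [obj | acc], p2}}
          # Fetch succeeded but `sha` was absent: keep the grown cache anyway.
          {:error, reason, p2} -> {:halt, {:error, reason, p2}}
          # Transport failure: the cache did not change, so return `p` as-is.
          {:error, reason} -> {:halt, {:error, reason, p}}
        end
      end)

    case result do
      {:ok, objs, p} -> {:ok, Enum.reverse(objs), p}
      error -> error
    end
  end
end
```

A call such as PromisorFold.resolve_all(promisor, [sha_a, sha_b]) returns the objects in request order together with a single promisor that holds everything fetched along the way.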

Concurrency

Because the struct is pure, two concurrent resolve(p, sha_a) and resolve(p, sha_b) calls from the same p each fetch independently, and only the return value the caller threads forward "wins" — the other fetch's cache growth is discarded. This is a CACHE RACE, not a correctness race: both results are valid, but the merged cache is strictly smaller than if the calls had been serialized.

For workloads that do concurrent bulk reads against the same repo (e.g. a grep agent spawning N tasks), use Exgit.ObjectStore.SharedPromisor — a GenServer wrapper that serializes cache access across processes and eliminates the cache race entirely.
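The cache race can be illustrated with a toy model. The sketch below does not use Exgit at all: ToyStore is a hypothetical stand-in whose "cache" is a plain map and whose "fetch" just builds a string. It only shows why two calls that start from the same value cannot see each other's cache growth.

```elixir
defmodule ToyStore do
  # Stand-in for a pure-value store: returns the value plus the (possibly grown) map.
  def resolve(cache, key) do
    case Map.fetch(cache, key) do
      {:ok, value} ->
        {:ok, value, cache}

      :error ->
        value = "fetched:" <> key
        {:ok, value, Map.put(cache, key, value)}
    end
  end
end

p = %{}

# Both tasks start from the same value `p`, so neither sees the other's fetch.
task_a = Task.async(fn -> ToyStore.resolve(p, "sha_a") end)
task_b = Task.async(fn -> ToyStore.resolve(p, "sha_b") end)

{:ok, _obj_a, p_a} = Task.await(task_a)
{:ok, _obj_b, _p_b} = Task.await(task_b)

# Threading only `p_a` forward discards what task B cached.
raced = p_a

# Serializing the two calls keeps both entries.
{:ok, _, s1} = ToyStore.resolve(p, "sha_a")
{:ok, _, serialized} = ToyStore.resolve(s1, "sha_b")

IO.inspect(map_size(raced), label: "raced")
IO.inspect(map_size(serialized), label: "serialized")
```

Both objects are returned correctly in either case; only the surviving cache differs (one entry when raced, two when serialized).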

Integration

Exgit.FS threads the updated repo through its strict operations (read_path, ls, stat, write_path) so callers get the grown cache back:

{:ok, {mode, blob}, repo} = Exgit.FS.read_path(repo, "HEAD", path)

Streaming operations (FS.walk, FS.grep) use the pure ObjectStore.get/2 and do NOT grow the cache. For a warm cache, call Exgit.FS.prefetch/2 up front.

Memory

The cache is unbounded by default. Pass :max_cache_bytes explicitly to enable FIFO-by-commit eviction bounded at a byte count of your choosing.

The unbounded default is a deliberate choice: partial-clone and full-clone workflows typically prefetch 100-500 MB of tree and blob data up front, then do many reads against that working set. A small cap (e.g. 64 MiB) trips during prefetch on any real-world repo, and because eviction only evicts COMMITS (not blobs or trees — git's access patterns don't cleanly map to LRU at the blob level), triggering the evictor mid-stream can drop state the caller is actively using. For long-running daemons or memory-constrained deployments, size the cap to your actual envelope — e.g. max_cache_bytes: 2 * 1024 * 1024 * 1024 for a 2 GiB budget.

When a cap IS set and the cache approaches it, the eviction loop drops the oldest commits (and their associated entries in the commit queue) in FIFO order. Tree and blob objects are NOT evicted individually; they remain until either (a) the process dies or (b) a higher-level operation discards the whole repo.

Server negotiation (haves)

On-demand fetches (resolve/2 → fetch_and_cache/2) deliberately send no haves to the server. This is counter-intuitive, since every bulk git fetch DOES send haves to avoid redundant transfer, but on-demand fetches have different semantics:

  • Bulk fetch (Exgit.fetch/3): "I'm at commit X, catch me up to ref Y." Haves save bandwidth by excluding objects reachable from X.

  • On-demand fetch (Promisor): "Ship me exactly this blob, please." Haves actively break this. A smart server (GitHub, anything running modern git-upload-pack) treats haves as a reachability closure — "the client has commit X, therefore they have everything reachable from X" — and returns an empty pack. The blob is "reachable" from any cached commit that points at its containing tree, so every partial-clone read after the first would fail.

See test/exgit/security/haves_empty_pack_test.exs for an offline regression against this.

Overfull behavior

When the evictor runs out of commits to drop but cache_bytes is still above the cap, the cache is technically over-full. The :on_overfull option selects the policy:

  • :log (default): emit [:exgit, :object_store, :cache_overfull] telemetry and keep going. Matches the previous behavior.
  • :error: the next put/resolve returns {:error, :cache_overfull, promisor}. Use this to fail fast and surface misconfigured caps quickly.
  • {:callback, fun}: fun.(promisor) is invoked; its return value is discarded. Use for custom metrics, alerting, or graceful shutdown.
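As a rough sketch of how the cap and the overfull policy might be configured together, the hypothetical helper below builds the option list for new/2 from a budget in GiB. The module name CacheBudget and the choice of :error as its default policy are assumptions made for illustration; only the :max_cache_bytes and :on_overfull keys come from the documentation above.

```elixir
defmodule CacheBudget do
  # Bytes in one GiB.
  @gib 1024 * 1024 * 1024

  # Builds the keyword list for Promisor.new/2 from a whole number of GiB.
  # The policy may be :log, :error, or {:callback, fun}.
  def opts(gib, on_overfull \\ :error) when is_integer(gib) and gib > 0 do
    [max_cache_bytes: gib * @gib, on_overfull: on_overfull]
  end
end

# Usage, assuming `transport` is an already-built transport value:
#
#   promisor = Exgit.ObjectStore.Promisor.new(transport, CacheBudget.opts(2))
#
# With a callback that reports the current size (cache_bytes is a struct field):
#
#   report = fn p -> IO.puts("cache over cap at #{p.cache_bytes} bytes") end
#   promisor = Exgit.ObjectStore.Promisor.new(transport, CacheBudget.opts(2, {:callback, report}))
```

Choosing :error while tuning a deployment makes an undersized cap show up as an explicit error tuple rather than as a telemetry event that may go unobserved.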

Summary

Functions

empty?(promisor)

True if the cache is empty (no objects). Provides a stable abstraction for callers that used to reach into %Promisor{cache: %Memory{objects: objs}}, e.g. FS.require_non_promisor!/2.

fetch_with_filter(p, wants, opts)

Fetch wants from the transport with explicit fetch options (e.g. a partial-clone filter), and merge the returned objects into the cache. Returns {:ok, new_promisor}.

has_object?(promisor, sha)

True if sha is in the local cache. Does NOT trigger a fetch.

import_objects(p, raw_objects)

Merge raw_objects into the cache.

new(transport, opts \\ [])

Build a fresh Promisor wrapping transport.

object_size(promisor, sha)

Uncompressed byte size of sha IF it is already cached locally, without triggering a fetch. Returns {:error, :not_local} when the object has not been fetched yet, so a size check can never silently pull a multi-GB blob over the network.

put(p, object)

Return a new Promisor with object inserted into its cache.

resolve(p, sha)

Look up sha. On a cache hit, returns {:ok, obj, promisor} where the promisor is unchanged. On a miss, fetches from the transport, caches every object the pack returned, and returns {:ok, obj, new_promisor}; the returned struct carries the grown cache.

Types

overfull_policy()

@type overfull_policy() :: :log | :error | {:callback, (t() -> any())}

t()

@type t() :: %Exgit.ObjectStore.Promisor{
  cache: Exgit.ObjectStore.Memory.t(),
  cache_bytes: non_neg_integer(),
  commit_counter: non_neg_integer(),
  commit_queue: :gb_trees.tree() | nil,
  default_fetch_opts: keyword(),
  haves_cap: pos_integer(),
  max_cache_bytes: non_neg_integer() | :infinity,
  on_overfull: overfull_policy(),
  transport: term()
}

Functions

empty?(promisor)

@spec empty?(t()) :: boolean()

True if the cache is empty (no objects). Provides a stable abstraction for callers that used to reach into %Promisor{cache: %Memory{objects: objs}} — e.g. FS.require_non_promisor!/2.

fetch_with_filter(p, wants, opts)

@spec fetch_with_filter(t(), [binary()], keyword()) :: {:ok, t()} | {:error, term()}

Fetch wants from the transport with explicit fetch options (e.g. a partial-clone filter), and merge the returned objects into the cache. Returns {:ok, new_promisor}.

Used by Exgit.clone/2 (with filter:) to perform the eager commits+trees fetch under a blob:none filter at clone time. End users should normally rely on resolve/2, which handles misses transparently.

has_object?(promisor, sha)

@spec has_object?(t(), binary()) :: boolean()

True if sha is in the local cache. Does NOT trigger a fetch.

import_objects(p, raw_objects)

@spec import_objects(t(), [{atom(), binary(), binary()}]) :: {:ok, t()}

Merge raw_objects into the cache.

new(transport, opts \\ [])

@spec new(
  transport :: term(),
  keyword()
) :: t()

Build a fresh Promisor wrapping transport.

Options:

  • :initial_objects — list of pre-decoded objects to seed the cache.
  • :default_fetch_opts — keyword list merged into every Transport.fetch/3 call the Promisor makes. Used by lazy_clone to propagate things like the partial-clone filter spec onto subsequent on-demand fetches.
  • :max_cache_bytes — cap on total cached object bytes. Default :infinity (no cap). Set to an integer byte count for long-running daemons / memory-constrained deployments that need a bound. See "Memory" in the moduledoc for sizing guidance.
  • :on_overfull — policy when the eviction loop can't reduce cache_bytes below max_cache_bytes (commit queue empty; only raw blobs/trees left in the cache). One of:
    • :log (default) — emit [:exgit, :object_store, :cache_overfull] telemetry and keep accepting new objects.
    • :error — fail subsequent put/resolve with {:error, :cache_overfull, promisor}.
    • {:callback, fun} — invoke fun.(promisor). Return value is ignored; raise for hard-fail.

object_size(promisor, sha)

@spec object_size(t(), binary()) :: {:ok, non_neg_integer()} | {:error, :not_local}

Uncompressed byte size of sha IF it is already cached locally — without triggering a fetch. Returns {:error, :not_local} when the object has not been fetched yet, so a size check can never silently pull a multi-GB blob over the network.
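One possible use of the size check is a guard that only hands back objects that are already local and below a size limit. The sketch below is an assumption-laden illustration: the module name SizeGuard, the {:too_large, size} error term, and the `store` parameter (which defaults to the real Promisor and exists so a stub can be substituted) are all hypothetical.

```elixir
defmodule SizeGuard do
  # Returns the object only when it is cached locally and at most `max_bytes`.
  # The size check itself never touches the network.
  def resolve_if_small(promisor, sha, max_bytes, store \\ Exgit.ObjectStore.Promisor) do
    case store.object_size(promisor, sha) do
      # Already local and small enough: resolve/2 will be a cache hit.
      {:ok, size} when size <= max_bytes -> store.resolve(promisor, sha)
      # Local but larger than the caller is willing to load.
      {:ok, size} -> {:error, {:too_large, size}}
      # Not fetched yet: report it instead of pulling it over the network.
      {:error, :not_local} -> {:error, :not_local}
    end
  end
end
```

A caller that receives {:error, :not_local} can then decide explicitly whether a fetch via resolve/2 is acceptable.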

put(p, object)

@spec put(t(), Exgit.Object.t()) ::
  {:ok, binary(), t()} | {:error, :cache_overfull, t()}

Return a new Promisor with object inserted into its cache.

When :on_overfull is :error and the post-insert cache exceeds :max_cache_bytes with no commits left to evict, returns {:error, :cache_overfull, promisor} instead — the promisor is still threaded back so the caller can inspect cache_bytes / decide what to do.
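A caller might use the returned promisor to report how far over the cap the cache is. The following is a sketch only: the module name PutReport, the shape of the enriched error term, and the `store` parameter (defaulting to the real Promisor, injectable for stubbing) are assumptions, while cache_bytes and max_cache_bytes are the struct fields listed in t().

```elixir
defmodule PutReport do
  # Wraps put/2 and attaches the current and maximum byte counts to the error.
  def put(promisor, object, store \\ Exgit.ObjectStore.Promisor) do
    case store.put(promisor, object) do
      {:ok, sha, p2} ->
        {:ok, sha, p2}

      {:error, :cache_overfull, p2} ->
        # The promisor is threaded back, so its counters can be read directly.
        {:error, {:cache_overfull, p2.cache_bytes, p2.max_cache_bytes}, p2}
    end
  end
end
```

The enriched tuple keeps the promisor in the last position, so the caller can still thread it forward or discard it.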

resolve(p, sha)

@spec resolve(t(), binary()) ::
  {:ok, Exgit.Object.t(), t()} | {:error, term()} | {:error, term(), t()}

Look up sha. On a cache hit, returns {:ok, obj, promisor} where the promisor is unchanged. On a miss, fetches from the transport, caches every object the pack returned, and returns {:ok, obj, new_promisor} — the returned struct carries the grown cache.

Error shape

Errors come in two flavors:

  • {:error, reason} — transport-level failure, no cache change. Returned when the fetch itself failed (connection error, HTTP non-2xx, malformed pack).
  • {:error, reason, promisor} — the fetch succeeded and the cache grew, but the specific SHA requested wasn't in the returned pack (rare; happens when a partial-clone server defers the requested object itself). Callers should thread the returned promisor forward to avoid refetching the sibling objects that WERE returned.

Pattern-match on both shapes:

case Promisor.resolve(p, sha) do
  {:ok, obj, p2} -> ...
  {:error, _, p2} -> ...      # grown cache, but sha missing
  {:error, _} -> ...          # fetch failed entirely
end