Exgit.Pack.StreamParser (exgit v0.1.0)

Forward-only, bounded-memory streaming pack parser.

Accepts raw pack bytes incrementally via ingest/2 and writes each resolved object directly to an Exgit.ObjectStore as it is decoded.

Memory model (Phase 3+)

| Component | Bound |
|---|---|
| Parse buffer | O(zlib_window) per ingest/2 chunk |
| In-flight inflate | O(one zlib output chunk, ~4 KB) |
| In-flight write handle | O(compressed output); raw content never sits alongside the compressed form in the heap |
| offset_to_sha map | ~35 bytes × N objects |
| sha_to_depth map | ~30 bytes × N objects |
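
As a rough worked example, the per-entry figures for the two maps can be multiplied out to see how the bookkeeping scales with pack size. The module and function names below are hypothetical; only the 35-byte and 30-byte figures come from the memory model.

```elixir
defmodule MapOverheadSketch do
  # Approximate per-entry costs quoted in the memory model.
  @offset_to_sha_bytes 35
  @sha_to_depth_bytes 30

  # Estimated bytes held by both maps together for a pack of `n` objects.
  def bytes(n) when is_integer(n) and n >= 0 do
    n * (@offset_to_sha_bytes + @sha_to_depth_bytes)
  end
end

MapOverheadSketch.bytes(1_000_000)
# => 65_000_000 (about 65 MB for a million objects)
```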

For non-delta objects (types blob/tree/commit/tag), each decompressed chunk is piped immediately to the object store via ObjectStore.open_write / write_chunk / close_write. The raw content is never materialised in full — it flows inflate-port → write-handle → store one HTTP-chunk-sized piece at a time. The adler32 (for zlib boundary detection) and the git SHA are both computed incrementally.
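
The non-delta path can be approximated with nothing but Erlang's :zlib and :crypto modules. This is a minimal sketch, not the parser's actual code: the module name and the `sink` callback (a stand-in for the store's write handle) are assumptions, and adler32 boundary detection is left out.

```elixir
defmodule StreamingInflateSketch do
  # Inflates a zlib stream delivered as a list of binaries and returns the
  # git object SHA-1 of the result without ever holding the full content.
  # `sink` is a hypothetical callback standing in for a store write handle.
  def hash_object(chunks, type, size, sink \\ fn _piece -> :ok end) do
    z = :zlib.open()
    :ok = :zlib.inflateInit(z)

    # Git hashes the header "<type> <size>\0" followed by the raw content.
    header = [type, " ", Integer.to_string(size), 0]
    ctx = :crypto.hash_update(:crypto.hash_init(:sha), header)

    try do
      ctx =
        Enum.reduce(chunks, ctx, fn chunk, acc ->
          drain(:zlib.safeInflate(z, chunk), z, acc, sink)
        end)

      # Raises if the zlib stream was not complete.
      :ok = :zlib.inflateEnd(z)
      :crypto.hash_final(ctx)
    after
      :zlib.close(z)
    end
  end

  # :continue means more output is pending for the input queued so far.
  defp drain({:continue, out}, z, ctx, sink) do
    drain(:zlib.safeInflate(z, []), z, emit(out, ctx, sink), sink)
  end

  # :finished means all queued input has been decompressed.
  defp drain({:finished, out}, _z, ctx, sink), do: emit(out, ctx, sink)

  # Hand the piece to the sink and fold it into the running hash.
  defp emit(out, ctx, sink) do
    sink.(out)
    :crypto.hash_update(ctx, out)
  end
end
```

Because :zlib.safeInflate/2 returns after a bounded amount of output, each piece can be forwarded and hashed before the next one is produced.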

For delta objects (OFS_DELTA / REF_DELTA), the decompressed delta instructions must be held in full to call Pack.Delta.apply/2. These objects are still accumulated in inflate_out; the resulting resolved content then goes through ObjectStore.put/2 as before.

The compressed-buffer spike of the naive approach (inflate_upper_bound bytes must be present before inflate can start) is eliminated: the zlib port is opened as soon as @zlib_min bytes are available and fed incrementally on every subsequent ingest/2.

Adversarial hardening (Phase 4)

Every limit is enforced per-object during the streaming parse:

  • max_object_bytes — rejects any object whose declared uncompressed size exceeds the limit before allocating.
  • max_inflate_ratio — zip-bomb defence; if uncompressed / compressed > ratio, the object is rejected.
  • max_delta_depth — cap on delta chain length; stops an attacker from constructing a chain that forces O(depth) store fetches per object.
  • max_objects — rejects packs with an absurd object count header before any objects are parsed.
  • deadline — monotonic deadline (:erlang.monotonic_time(:millisecond)); ingest/2 returns {:error, :deadline_exceeded} when the clock passes it.
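
The per-object checks in the list above could look roughly like the following. This is an illustrative sketch only: the module, the function names, and every error atom except :deadline_exceeded are hypothetical and not taken from the library.

```elixir
defmodule LimitChecksSketch do
  # Reject on the size declared in the object header, before inflating.
  def check_declared_size(declared, max) when declared > max,
    do: {:error, :object_too_large}

  def check_declared_size(_declared, _max), do: :ok

  # Zip-bomb check: compare bytes produced with bytes consumed so far.
  # Multiplying instead of dividing avoids a divide-by-zero on empty input.
  def check_ratio(out_bytes, in_bytes, max_ratio)
      when out_bytes > in_bytes * max_ratio,
      do: {:error, :inflate_ratio_exceeded}

  def check_ratio(_out_bytes, _in_bytes, _max_ratio), do: :ok

  # A nil deadline means no deadline; otherwise compare monotonic clocks.
  def check_deadline(nil, _now), do: :ok
  def check_deadline(deadline, now) when now > deadline, do: {:error, :deadline_exceeded}
  def check_deadline(_deadline, _now), do: :ok
end
```

A caller would pass `:erlang.monotonic_time(:millisecond)` as `now`, so the comparison is unaffected by wall-clock adjustments.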

OFS_DELTA / REF_DELTA resolution

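The parser relies on a delta's base being available before the delta itself: an OFS_DELTA base always appears earlier in the pack, because it is addressed by a backward offset, and a REF_DELTA base is expected either earlier in the pack or already in the store. Each resolved object is written to the store immediately; OFS_DELTA looks up pack_offset → {type, sha, depth} in offset_to_sha and fetches the base from the store. REF_DELTA uses sha_to_depth to look up the base's depth for chain-length tracking; the depth defaults to 0 for objects already in the store from a prior fetch.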
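
The bookkeeping described here can be sketched with two plain maps. The struct, function names, and error atoms are hypothetical; only the map names and the {type, sha, depth} value shape come from the description above.

```elixir
defmodule DeltaIndexSketch do
  defstruct offset_to_sha: %{}, sha_to_depth: %{}

  # Record an object once it has been written to the store.
  def record(%__MODULE__{} = idx, offset, type, sha, depth) do
    %{
      idx
      | offset_to_sha: Map.put(idx.offset_to_sha, offset, {type, sha, depth}),
        sha_to_depth: Map.put(idx.sha_to_depth, sha, depth)
    }
  end

  # OFS_DELTA: the base sits `distance` bytes before the delta's own offset.
  def ofs_base(%__MODULE__{} = idx, delta_offset, distance, max_depth) do
    with {:ok, {type, sha, depth}} <- Map.fetch(idx.offset_to_sha, delta_offset - distance),
         :ok <- check_depth(depth + 1, max_depth) do
      {:ok, {type, sha, depth + 1}}
    else
      :error -> {:error, :missing_base}
      {:error, _reason} = err -> err
    end
  end

  # REF_DELTA: a base not seen in this pack is assumed to be at depth 0.
  def ref_base(%__MODULE__{} = idx, base_sha, max_depth) do
    depth = Map.get(idx.sha_to_depth, base_sha, 0) + 1
    with :ok <- check_depth(depth, max_depth), do: {:ok, {base_sha, depth}}
  end

  defp check_depth(depth, max) when depth > max, do: {:error, :delta_depth_exceeded}
  defp check_depth(_depth, _max), do: :ok
end
```

The returned depth is the depth of the delta being resolved, which is what would be recorded for it once it is written.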

SHA-1 checksum

The running SHA-1 is computed with a rolling 20-byte delay, so that sha_tail at finalize/1 contains exactly the pack's trailing checksum. Verification happens only in finalize/1, not in the streaming loop, because sha_tail does not reach its final value until all bytes have been fed.
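
One way to implement such a delay is to withhold the most recent 20 bytes from the hash on every call. The sketch below assumes that approach; the module and function names are hypothetical, and the real parser may organise this differently.

```elixir
defmodule TrailerSketch do
  @trailer_bytes 20

  # State is {running SHA-1 context, last bytes not yet hashed}.
  def new, do: {:crypto.hash_init(:sha), <<>>}

  # Hash everything except the final 20 bytes seen so far; keep those as the tail.
  def feed({ctx, tail}, bytes) when is_binary(bytes) do
    data = tail <> bytes
    keep = min(byte_size(data), @trailer_bytes)
    hash_len = byte_size(data) - keep
    <<to_hash::binary-size(hash_len), new_tail::binary>> = data
    {:crypto.hash_update(ctx, to_hash), new_tail}
  end

  # At end of input the tail is the trailer and the context covers the rest.
  def verify({ctx, tail}) do
    if byte_size(tail) == @trailer_bytes and :crypto.hash_final(ctx) == tail do
      :ok
    else
      {:error, :checksum_mismatch}
    end
  end
end
```

The tail is only the true trailer after the final chunk, which is why the comparison belongs in the finalisation step rather than in the feed loop.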

Types

t()

@type t() :: %Exgit.Pack.StreamParser{
  buffer: term(),
  buffer_start: term(),
  current: term(),
  limits: term(),
  num_objects: term(),
  objects_done: term(),
  offset_to_sha: term(),
  phase: term(),
  raw_cache: term(),
  raw_cache_bytes: term(),
  sha_ctx: term(),
  sha_tail: term(),
  sha_to_depth: term(),
  store: term()
}

Functions

finalize(stream_parser)

@spec finalize(t()) ::
  {:ok, non_neg_integer(), Exgit.ObjectStore.t()} | {:error, term()}

Assert the parse is complete: all N objects were decoded and the pack's SHA-1 trailer matches. Returns {:ok, n_objects, final_store} or {:error, reason}.

final_store is the object store after all objects have been written. For value-typed stores (e.g. Memory) this is the updated struct; for side-effect stores (e.g. Disk) it equals the original store reference.

ingest(state, bytes)

@spec ingest(t(), binary()) :: {:ok, t()} | {:error, term()}

Feed a chunk of raw pack bytes into the parser.

Objects are written to the store as they complete. Returns {:ok, state} when the chunk was processed successfully (the parser may need more bytes), or {:error, reason} on a fatal parse error.
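
A typical driver threads the state through ingest/2 for each chunk and then calls finalize/1. The sketch below uses only the documented new/2, ingest/2 and finalize/1 return shapes; the `PackFeedSketch` module is hypothetical, and the parser module is passed as an argument purely so the loop can be exercised with a stand-in.

```elixir
defmodule PackFeedSketch do
  # `parser` is any module exposing new/2, ingest/2 and finalize/1,
  # for example Exgit.Pack.StreamParser. `chunks` is any enumerable of binaries.
  def run(parser, store, chunks, opts \\ []) do
    result =
      Enum.reduce_while(chunks, {:ok, parser.new(store, opts)}, fn chunk, {:ok, state} ->
        case parser.ingest(state, chunk) do
          # The parser may simply need more bytes; keep feeding.
          {:ok, next} -> {:cont, {:ok, next}}
          # A fatal parse error stops the loop immediately.
          {:error, _reason} = err -> {:halt, err}
        end
      end)

    # Only a fully fed parser is finalised; errors pass through unchanged.
    with {:ok, state} <- result, do: parser.finalize(state)
  end
end
```

Assumed usage: `PackFeedSketch.run(Exgit.Pack.StreamParser, store, chunks)`, where `chunks` might be a stream of HTTP body chunks.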

new(store, opts \\ [])

@spec new(
  Exgit.ObjectStore.t(),
  keyword()
) :: t()

Create a new StreamParser state that will write objects to store.

Options:

  • :max_object_bytes - max inflated size of any single object (default 100 MB).
  • :max_objects - max number of objects in the pack (default 10 M).
  • :max_delta_depth - max delta chain depth (default 50, same as git).
  • :max_inflate_ratio - max uncompressed/compressed ratio; detects zip bombs (default 1000×).
  • :deadline - :erlang.monotonic_time(:millisecond) value after which ingest/2 returns {:error, :deadline_exceeded}. nil (default) means no deadline.
  • :raw_cache_bytes - budget in bytes for the raw-content cache used to speed up delta base resolution (default 64 MB). Set to 0 to disable and always go through the store.
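
As an illustration of how these options fit together, the sketch below builds a keyword list with tighter limits than the defaults and a deadline relative to the current monotonic clock. The module name and the specific values are hypothetical and not recommendations; only the option keys come from the list above.

```elixir
defmodule ParserOptsSketch do
  # Returns options for new/2. `now` is injectable so the result is predictable.
  def hardened(timeout_ms, now \\ :erlang.monotonic_time(:millisecond)) do
    [
      max_object_bytes: 10 * 1024 * 1024,
      max_objects: 1_000_000,
      max_delta_depth: 50,
      max_inflate_ratio: 200,
      # Absolute monotonic timestamp, not a duration.
      deadline: now + timeout_ms,
      # 0 disables the raw-content cache, so bases are read from the store.
      raw_cache_bytes: 0
    ]
  end
end
```

Assumed usage: `Exgit.Pack.StreamParser.new(store, ParserOptsSketch.hardened(30_000))` gives the parse roughly 30 seconds before ingest/2 starts returning {:error, :deadline_exceeded}.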