Exgit.Diff.LineDiff (exgit v0.1.0)

Copy Markdown View Source

Line-level diff between two sequences of lines via Myers diff.

The primary output is a list of matched pairs [{a_idx, b_idx}] — 0-based indices of lines that appear in both inputs in the same relative order. Anything in a not in that set is "deleted"; anything in b not in that set is "added."

Algorithm: Myers diff

Myers (1986) finds the shortest edit script (SES) — the minimum number of insertions and deletions that transforms a into b. An SES corresponds to the longest common subsequence (LCS).

Complexity

  • Time: O((N + M) × D) where D = |insertions| + |deletions|
  • Space: O((N + M) × D) for the backtracking trace

For typical blame workloads where each commit edits a handful of lines (small D), this is orders of magnitude faster than the previous O(N × M) LCS DP:

File sizeLCS DP (D≈10)Myers (D≈10)
100 lines~100K~2K
500 lines~2.5M~10K
5000 lines~250M~100K

The algorithm in brief

A diagonal k = x − y represents all positions (x, y) where we've consumed x lines from A and y lines from B. Walking diagonally (x++, y++ while A[x]==B[y]) is a snake — free matching. A single horizontal step (x++) is a deletion; a vertical step (y++) is an insertion; together they cost one edit.

Forward phase: for edit depth D = 0, 1, 2, …, explore every diagonal k ∈ {−D, −D+2, …, D−2, D}. On each diagonal, choose the greedy starting position (either V[k−1]+1 from the right or V[k+1] from below), extend the snake as far as possible, and record the furthest x. Save a snapshot of V at each D. Stop when some diagonal reaches (n, m).

Backtrack phase: replay the snapshots in reverse, mirroring the forward greedy choice exactly. At each level, locate the snake (diagonal run of matched lines) and emit the matched pairs; recurse with the position before the edit (not the snake start).

Tie-breaking convention

When V[k−1] == V[k+1], the forward algorithm prefers "right" (delete from A). The backtrack mirrors this identically using the same condition. Changing the tie-break in one place without the other produces wrong results.

API contract (unchanged from LCS implementation)

iex> Exgit.Diff.LineDiff.matched_pairs(
...>   ["a", "b", "c"],
...>   ["a", "x", "c"]
...> )
[{0, 0}, {2, 2}]

Summary

Functions

Convenience: given matched pairs, return the list of 0-based indices in b_lines that are new (not carried from a_lines).

Convenience: for each 0-based index in b_lines that was carried from a_lines, return {b_idx, a_idx}.

Compute matched line pairs between a_lines and b_lines.

Types

line_index()

@type line_index() :: non_neg_integer()

matched_pairs()

@type matched_pairs() :: [{line_index(), line_index()}]

Functions

b_additions(pairs, b_length)

@spec b_additions(matched_pairs(), non_neg_integer()) :: [line_index()]

Convenience: given matched pairs, return the list of 0-based indices in b_lines that are new (not carried from a_lines).

b_carryovers(pairs)

@spec b_carryovers(matched_pairs()) :: [{line_index(), line_index()}]

Convenience: for each 0-based index in b_lines that was carried from a_lines, return {b_idx, a_idx}.

matched_pairs(a_lines, b_lines)

@spec matched_pairs([String.t()], [String.t()]) :: matched_pairs()

Compute matched line pairs between a_lines and b_lines.

Returns a list of {a_index, b_index} tuples, 0-based, in strictly increasing order of both indices. Each index appears in at most one pair.

Examples

iex> Exgit.Diff.LineDiff.matched_pairs(
...>   ["a", "b", "c"],
...>   ["a", "x", "c"]
...> )
[{0, 0}, {2, 2}]

iex> Exgit.Diff.LineDiff.matched_pairs(["a", "b"], ["a", "b"])
[{0, 0}, {1, 1}]

iex> Exgit.Diff.LineDiff.matched_pairs([], ["a"])
[]

iex> Exgit.Diff.LineDiff.matched_pairs(["a"], [])
[]