Line-level diff between two sequences of lines via Myers diff.
The primary output is a list of matched pairs [{a_idx, b_idx}]
— 0-based indices of lines that appear in both inputs in the same
relative order. Anything in a not in that set is "deleted";
anything in b not in that set is "added."
Algorithm: Myers diff
Myers (1986) finds the shortest edit script (SES) — the minimum
number of insertions and deletions that transforms a into b.
An SES corresponds to the longest common subsequence (LCS).
Complexity
- Time: O((N + M) × D) where D = |insertions| + |deletions|
- Space: O((N + M) × D) for the backtracking trace
For typical blame workloads where each commit edits a handful of lines (small D), this is orders of magnitude faster than the previous O(N × M) LCS DP:
| File size | LCS DP (D≈10) | Myers (D≈10) |
|---|---|---|
| 100 lines | ~100K | ~2K |
| 500 lines | ~2.5M | ~10K |
| 5000 lines | ~250M | ~100K |
The algorithm in brief
A diagonal k = x − y represents all positions (x, y) where we've consumed x lines from A and y lines from B. Walking diagonally (x++, y++ while A[x]==B[y]) is a snake — free matching. A single horizontal step (x++) is a deletion; a vertical step (y++) is an insertion; together they cost one edit.
Forward phase: for edit depth D = 0, 1, 2, …, explore every diagonal k ∈ {−D, −D+2, …, D−2, D}. On each diagonal, choose the greedy starting position (either V[k−1]+1 from the right or V[k+1] from below), extend the snake as far as possible, and record the furthest x. Save a snapshot of V at each D. Stop when some diagonal reaches (n, m).
Backtrack phase: replay the snapshots in reverse, mirroring the forward greedy choice exactly. At each level, locate the snake (diagonal run of matched lines) and emit the matched pairs; recurse with the position before the edit (not the snake start).
Tie-breaking convention
When V[k−1] == V[k+1], the forward algorithm prefers "right" (delete from A). The backtrack mirrors this identically using the same condition. Changing the tie-break in one place without the other produces wrong results.
API contract (unchanged from LCS implementation)
iex> Exgit.Diff.LineDiff.matched_pairs(
...> ["a", "b", "c"],
...> ["a", "x", "c"]
...> )
[{0, 0}, {2, 2}]
Summary
Functions
Convenience: given matched pairs, return the list of 0-based
indices in b_lines that are new (not carried from a_lines).
Convenience: for each 0-based index in b_lines that was carried
from a_lines, return {b_idx, a_idx}.
Compute matched line pairs between a_lines and b_lines.
Types
@type line_index() :: non_neg_integer()
@type matched_pairs() :: [{line_index(), line_index()}]
Functions
@spec b_additions(matched_pairs(), non_neg_integer()) :: [line_index()]
Convenience: given matched pairs, return the list of 0-based
indices in b_lines that are new (not carried from a_lines).
@spec b_carryovers(matched_pairs()) :: [{line_index(), line_index()}]
Convenience: for each 0-based index in b_lines that was carried
from a_lines, return {b_idx, a_idx}.
@spec matched_pairs([String.t()], [String.t()]) :: matched_pairs()
Compute matched line pairs between a_lines and b_lines.
Returns a list of {a_index, b_index} tuples, 0-based, in
strictly increasing order of both indices. Each index appears in
at most one pair.
Examples
iex> Exgit.Diff.LineDiff.matched_pairs(
...> ["a", "b", "c"],
...> ["a", "x", "c"]
...> )
[{0, 0}, {2, 2}]
iex> Exgit.Diff.LineDiff.matched_pairs(["a", "b"], ["a", "b"])
[{0, 0}, {1, 1}]
iex> Exgit.Diff.LineDiff.matched_pairs([], ["a"])
[]
iex> Exgit.Diff.LineDiff.matched_pairs(["a"], [])
[]