textmetrics/distance

Edit-distance functions.

All string-typed functions operate on extended grapheme clusters as defined by Unicode UAX #29; user-visible characters (CJK ideographs, emoji ZWJ sequences, Hangul jamo) count as one unit. Inputs are not normalized — callers wanting NFC equivalence must normalize strings before invoking these functions.

levenshtein_list/2 exposes the same algorithm over arbitrary equality-comparable lists for callers diffing tokens, AST nodes, or any non-string sequence.

Types

Returned by hamming when its two inputs have a different number of graphemes.

pub type HammingError {
  LengthMismatch(left: Int, right: Int)
}

Constructors

  • LengthMismatch(left: Int, right: Int)

Values

pub fn damerau_levenshtein(a: String, b: String) -> Int

True Damerau-Levenshtein distance: insert, delete, substitute, and transposition of two adjacent graphemes are each counted as one operation. Unlike OSA, the same substring may participate in multiple edits (e.g. "ca""abc" is 2).

Time O(m·n), space O(m·n) (full matrix is required for the transposition recurrence).

pub fn hamming(a: String, b: String) -> Result(Int, HammingError)

Hamming distance: the number of grapheme positions at which the inputs differ. Returns Error(LengthMismatch(...)) when the inputs have different grapheme counts.

Edge cases:

  • hamming("", "") = Ok(0)
  • hamming("a", "") = Error(LengthMismatch(1, 0))

Time O(n), space O(1).

pub fn levenshtein(a: String, b: String) -> Int

Minimum number of single-grapheme insert / delete / substitute operations needed to transform a into b.

Edge cases:

  • levenshtein("", "") = 0
  • levenshtein("", b) = grapheme count of b
  • levenshtein(a, "") = grapheme count of a
  • levenshtein(a, a) = 0

Time O(m·n), space O(min(m, n)).

pub fn levenshtein_list(a: List(t), b: List(t)) -> Int

Generic Levenshtein distance over any equality-comparable list type. Used by diff internally and exposed for callers diffing token streams or AST nodes.

pub fn normalized_levenshtein(a: String, b: String) -> Float

Levenshtein-based similarity in [0.0, 1.0].

Defined as 1.0 - levenshtein(a, b) / max(|a|, |b|) where lengths are grapheme counts. 1.0 means identical, 0.0 means every grapheme position differs at the longer length.

Edge cases:

  • normalized_levenshtein("", "") = 1.0 (convention).
  • normalized_levenshtein(a, a) = 1.0.
  • normalized_levenshtein("", non_empty) = 0.0.

Use this when ranking by Levenshtein-style edit distance is preferred over Jaro / Jaro-Winkler. Time O(m·n), space O(min(m, n)) — same as levenshtein.

pub fn osa(a: String, b: String) -> Int

Optimal String Alignment distance. Same operations as Damerau-Levenshtein but each substring is edited at most once, which is what most “Damerau distance” implementations actually compute.

OSA does not satisfy the triangle inequality. Use damerau_levenshtein when a true metric is required.

Time O(m·n), space O(min(m, n)) (three rows).

Search Document