textmetrics/distance

Edit-distance functions.

All string-typed functions operate on extended grapheme clusters as defined by Unicode UAX #29; user-visible characters (CJK ideographs, emoji ZWJ sequences, Hangul jamo) count as one unit. Inputs are pre-normalised to Unicode Normalization Form C (NFC), so canonically-equivalent strings such as "\u{00C1}" (precomposed) and "A\u{0301}" (decomposed) compare as equal.

levenshtein_list/2 exposes the same algorithm over arbitrary equality-comparable lists for callers diffing tokens, AST nodes, or any non-string sequence. The list variant does not normalise its inputs — equality is defined by the element type’s own ==.

Types

Returned by hamming when its two inputs have a different number of graphemes.

pub type HammingError {
  LengthMismatch(left: Int, right: Int)
}

Constructors

  • LengthMismatch(left: Int, right: Int)

Values

pub fn damerau_levenshtein(a: String, b: String) -> Int

True Damerau-Levenshtein distance: insert, delete, substitute, and transposition of two adjacent graphemes are each counted as one operation. Unlike OSA, the same substring may participate in multiple edits (e.g. "ca""abc" is 2).

Both inputs are pre-normalised to NFC, so canonically-equivalent strings have distance 0.

Time O(m·n), space O(m·n) (full matrix is required for the transposition recurrence).

pub fn hamming(a: String, b: String) -> Result(Int, HammingError)

Hamming distance: the number of grapheme positions at which the inputs differ. Returns Error(LengthMismatch(...)) when the inputs have different grapheme counts.

Both inputs are pre-normalised to NFC, so canonically-equivalent strings have distance 0 (and equal grapheme counts after normalisation, so no spurious LengthMismatch).

Edge cases:

  • hamming("", "") = Ok(0)
  • hamming("a", "") = Error(LengthMismatch(1, 0))

Time O(n), space O(1).

pub fn levenshtein(a: String, b: String) -> Int

Minimum number of single-grapheme insert / delete / substitute operations needed to transform a into b.

Both inputs are pre-normalised to NFC, so canonically-equivalent strings ("\u{00C1}" vs "A\u{0301}") compare as equal.

Edge cases:

  • levenshtein("", "") = 0
  • levenshtein("", b) = grapheme count of b
  • levenshtein(a, "") = grapheme count of a
  • levenshtein(a, a) = 0

Time O(m·n), space O(min(m, n)).

pub fn levenshtein_list(a: List(t), b: List(t)) -> Int

Generic Levenshtein distance over any equality-comparable list type. Used by diff internally and exposed for callers diffing token streams or AST nodes.

pub fn normalized_levenshtein(a: String, b: String) -> Float

Levenshtein-based similarity in [0.0, 1.0].

Defined as 1.0 - levenshtein(a, b) / max(|a|, |b|) where lengths are grapheme counts. 1.0 means identical, 0.0 means every grapheme position differs at the longer length.

Both inputs are pre-normalised to NFC, so canonically-equivalent strings score 1.0.

Edge cases:

  • normalized_levenshtein("", "") = 1.0 (convention).
  • normalized_levenshtein(a, a) = 1.0.
  • normalized_levenshtein("", non_empty) = 0.0.

Use this when ranking by Levenshtein-style edit distance is preferred over Jaro / Jaro-Winkler. Time O(m·n), space O(min(m, n)) — same as levenshtein.

pub fn osa(a: String, b: String) -> Int

Optimal String Alignment distance. Same operations as Damerau-Levenshtein but each substring is edited at most once, which is what most “Damerau distance” implementations actually compute.

Both inputs are pre-normalised to NFC, so canonically-equivalent strings have distance 0.

OSA does not satisfy the triangle inequality. Use damerau_levenshtein when a true metric is required.

Time O(m·n), space O(min(m, n)) (three rows).

Search Document