textmetrics/distance
Edit-distance functions.
All string-typed functions operate on extended grapheme clusters as defined by Unicode UAX #29; user-visible characters (CJK ideographs, emoji ZWJ sequences, Hangul jamo) count as one unit. Inputs are not normalized — callers wanting NFC equivalence must normalize strings before invoking these functions.
levenshtein_list/2 exposes the same algorithm over arbitrary
equality-comparable lists for callers diffing tokens, AST nodes,
or any non-string sequence.
Types
Returned by hamming when its two inputs have a
different number of graphemes.
pub type HammingError {
LengthMismatch(left: Int, right: Int)
}
Constructors
-
LengthMismatch(left: Int, right: Int)
Values
pub fn damerau_levenshtein(a: String, b: String) -> Int
True Damerau-Levenshtein distance: insert, delete, substitute, and
transposition of two adjacent graphemes are each counted as one
operation. Unlike OSA, the same substring may participate in
multiple edits (e.g. "ca" → "abc" is 2).
Time O(m·n), space O(m·n) (full matrix is required for the
transposition recurrence).
pub fn hamming(a: String, b: String) -> Result(Int, HammingError)
Hamming distance: the number of grapheme positions at which the
inputs differ. Returns Error(LengthMismatch(...)) when the inputs
have different grapheme counts.
Edge cases:
hamming("", "")=Ok(0)hamming("a", "")=Error(LengthMismatch(1, 0))
Time O(n), space O(1).
pub fn levenshtein(a: String, b: String) -> Int
Minimum number of single-grapheme insert / delete / substitute
operations needed to transform a into b.
Edge cases:
levenshtein("", "")=0levenshtein("", b)= grapheme count ofblevenshtein(a, "")= grapheme count ofalevenshtein(a, a)=0
Time O(m·n), space O(min(m, n)).
pub fn levenshtein_list(a: List(t), b: List(t)) -> Int
Generic Levenshtein distance over any equality-comparable list
type. Used by diff internally and exposed for callers diffing
token streams or AST nodes.
pub fn normalized_levenshtein(a: String, b: String) -> Float
Levenshtein-based similarity in [0.0, 1.0].
Defined as 1.0 - levenshtein(a, b) / max(|a|, |b|) where lengths
are grapheme counts. 1.0 means identical, 0.0 means every
grapheme position differs at the longer length.
Edge cases:
normalized_levenshtein("", "")=1.0(convention).normalized_levenshtein(a, a)=1.0.normalized_levenshtein("", non_empty)=0.0.
Use this when ranking by Levenshtein-style edit distance is
preferred over Jaro / Jaro-Winkler. Time O(m·n), space
O(min(m, n)) — same as levenshtein.
pub fn osa(a: String, b: String) -> Int
Optimal String Alignment distance. Same operations as Damerau-Levenshtein but each substring is edited at most once, which is what most “Damerau distance” implementations actually compute.
OSA does not satisfy the triangle inequality. Use
damerau_levenshtein when a true metric is
required.
Time O(m·n), space O(min(m, n)) (three rows).