textmetrics/similarity
Similarity metrics whose scores lie in the closed interval [0.0, 1.0].
1.0 means “identical”, 0.0 means “no similarity by this metric”.
No function in this module returns NaN or a negative Float.
All string-typed functions operate on extended grapheme clusters and do not normalize their inputs. Callers that want NFC equivalence must normalize both strings first.
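To illustrate why normalization matters, here is a minimal Python sketch (not part of this module) that uses the standard `unicodedata` library. It shows two renderings of the same word that differ before NFC normalization and agree after it. The helper name `to_nfc` is hypothetical.

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as a single precomposed code point
decomposed = "cafe\u0301"  # 'e' followed by a combining acute accent

# Both strings render the same, yet they are not equal as sequences,
# so a metric that does not normalize sees two different inputs.
print(composed == decomposed)  # False


def to_nfc(s: str) -> str:
    """Hypothetical helper: normalize to NFC before comparing."""
    return unicodedata.normalize("NFC", s)


print(to_nfc(composed) == to_nfc(decomposed))  # True
```

A caller would apply the equivalent normalization step to both arguments before passing them to any function in this module.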
Types
Validated parameter bag for jaro_winkler_with.
A value of this type is guaranteed to encode parameters that keep Jaro-Winkler output inside [0.0, 1.0]. Construct via default_jaro_winkler_config or jaro_winkler_config.
pub opaque type JaroWinklerConfig
Returned by jaro_winkler_config when its
arguments fall outside the validated range.
pub type JaroWinklerConfigError {
  PrefixScaleOutOfRange(got: Float)
  PrefixMaxNegative(got: Int)
}
Constructors
- PrefixScaleOutOfRange(got: Float)
- PrefixMaxNegative(got: Int)
Returned by sorensen_dice when given an n-gram
size below 1.
pub type SorensenDiceError {
  NgramSizeInvalid(got: Int)
}
Constructors
- NgramSizeInvalid(got: Int)
Values
pub fn default_jaro_winkler_config() -> JaroWinklerConfig
Winkler-1990 defaults: prefix_scale = 0.1, prefix_max = 4.
pub fn jaro(a: String, b: String) -> Float
Jaro similarity at the grapheme level.
Edge cases (defined by convention):
- jaro("", "") = 1.0
- jaro("", b) = 0.0 for non-empty b
- jaro(a, "") = 0.0 for non-empty a
Time O(m·n), space O(m + n), where m and n are the lengths of a and b in graphemes.
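The edge cases and complexity above can be illustrated with a minimal Python sketch of the textbook Jaro algorithm. This is not the module's implementation. As an assumption, Python code points stand in for extended grapheme clusters, and the matching window uses the common `max(len) // 2 - 1` definition.

```python
def jaro(a: str, b: str) -> float:
    # Edge cases defined by convention.
    if not a and not b:
        return 1.0
    if not a or not b:
        return 0.0

    # Symbols match only if they are equal and at most `window` positions apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ch:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Half the number of matched symbols that appear in a different order.
    a_seq = [ch for ch, hit in zip(a, a_matched) if hit]
    b_seq = [ch for ch, hit in zip(b, b_matched) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2

    return (
        matches / len(a)
        + matches / len(b)
        + (matches - transpositions) / matches
    ) / 3.0


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```

The two boolean arrays account for the O(m + n) space, and the nested scan over the window gives the O(m·n) worst-case time.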
pub fn jaro_winkler(a: String, b: String) -> Float
Jaro-Winkler similarity using Winkler-1990 defaults
(prefix_scale = 0.1, prefix_max = 4).
pub fn jaro_winkler_config(
  prefix_scale prefix_scale: Float,
  prefix_max prefix_max: Int,
) -> Result(JaroWinklerConfig, JaroWinklerConfigError)
Construct a JaroWinklerConfig.
Invariants:
- prefix_scale must be in [0.0, 0.25] (Winkler’s upper bound that keeps the score in [0, 1]).
- prefix_max must be >= 0.
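The validated-constructor pattern described here can be sketched in Python as follows. This is an illustration only: the Python classes mirror the names in this page, returning an error object stands in for Gleam's `Error(...)`, and the order in which the two invariants are checked is an assumption that the documentation does not state.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class JaroWinklerConfig:
    prefix_scale: float
    prefix_max: int


@dataclass(frozen=True)
class PrefixScaleOutOfRange:
    got: float


@dataclass(frozen=True)
class PrefixMaxNegative:
    got: int


def jaro_winkler_config(
    prefix_scale: float, prefix_max: int
) -> Union[JaroWinklerConfig, PrefixScaleOutOfRange, PrefixMaxNegative]:
    # Assumption: prefix_scale is validated before prefix_max.
    # The negated range test also rejects NaN.
    if not (0.0 <= prefix_scale <= 0.25):
        return PrefixScaleOutOfRange(prefix_scale)
    if prefix_max < 0:
        return PrefixMaxNegative(prefix_max)
    return JaroWinklerConfig(prefix_scale, prefix_max)


print(jaro_winkler_config(0.1, 4))
print(jaro_winkler_config(0.3, 4))
print(jaro_winkler_config(0.1, -1))
```

Because every config value passes through this check, downstream code can rely on the invariants without re-validating.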
pub fn jaro_winkler_with(
  a: String,
  b: String,
  config: JaroWinklerConfig,
) -> Float
Jaro-Winkler similarity with caller-supplied parameters.
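As a rough illustration of how the two parameters enter the score, the Python sketch below applies the standard Winkler adjustment, `base + l * prefix_scale * (1 - base)`, where `l` is the length of the common prefix capped at `prefix_max`. The function name `winkler_boost` is hypothetical and takes a precomputed Jaro score. The final clamp is an assumption of this sketch: the 0.25 bound keeps the result in range when `l` is at most 4, and this page does not state how the module handles larger caps.

```python
def winkler_boost(
    base: float,
    a: str,
    b: str,
    prefix_scale: float = 0.1,
    prefix_max: int = 4,
) -> float:
    # Count the common prefix, stopping at the first mismatch or at the cap.
    # Assumption: code points stand in for extended grapheme clusters.
    prefix = 0
    for x, y in zip(a, b):
        if x != y or prefix >= prefix_max:
            break
        prefix += 1

    boosted = base + prefix * prefix_scale * (1.0 - base)
    # Assumption: clamp so the result never exceeds 1.0.
    return min(boosted, 1.0)


# The Jaro score of "MARTHA" and "MARHTA" is 17/18; they share the prefix "MAR".
print(round(winkler_boost(17 / 18, "MARTHA", "MARHTA"), 4))  # 0.9611
```

Setting `prefix_scale` to 0.0 or `prefix_max` to 0 leaves the Jaro score unchanged.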
pub fn prefix_max(config: JaroWinklerConfig) -> Int
Read the prefix-cap parameter of a config.
pub fn prefix_scale(config: JaroWinklerConfig) -> Float
Read the prefix-scale parameter of a config.
pub fn sorensen_dice(
  a: String,
  b: String,
  n: Int,
) -> Result(Float, SorensenDiceError)
Sørensen-Dice coefficient over grapheme n-grams of size n.
Edge cases (per spec §7.5):
- When both n-gram multisets are empty, the result is Ok(1.0) if the inputs are equal (including both empty) and Ok(0.0) otherwise.
- When exactly one input has no n-grams the score is Ok(0.0) (no overlap is possible).
- n < 1 returns Error(NgramSizeInvalid(n)).
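The edge cases above can be illustrated with a minimal Python sketch of the multiset form of the coefficient, `2 * |A ∩ B| / (|A| + |B|)`. This is not the module's implementation. As assumptions, code points stand in for extended grapheme clusters, a `ValueError` stands in for `Error(NgramSizeInvalid(n))`, and the helper `ngrams` is hypothetical.

```python
from collections import Counter


def ngrams(s: str, n: int) -> Counter:
    # Multiset of all contiguous substrings of length n (empty if s is too short).
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def sorensen_dice(a: str, b: str, n: int) -> float:
    if n < 1:
        # Stands in for Error(NgramSizeInvalid(n)).
        raise ValueError(f"n-gram size must be >= 1, got {n}")

    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Neither input has any n-grams: fall back to plain equality.
        return 1.0 if a == b else 0.0

    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((grams_a & grams_b).values())
    return 2.0 * overlap / total


print(sorensen_dice("night", "nacht", 2))  # 0.25
print(sorensen_dice("ab", "ab", 3))        # 1.0
print(sorensen_dice("a", "abc", 2))        # 0.0
```

When exactly one input has no n-grams, the overlap is zero and the general formula already yields 0.0, so only the both-empty case needs special handling.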