textmetrics/similarity

String similarity metrics. Every score lies in the closed interval [0.0, 1.0].

1.0 means “identical”, 0.0 means “no similarity by this metric”. No function in this module returns NaN or a negative Float.

All functions that take strings operate on extended grapheme clusters and do not normalize their inputs. Callers that need NFC equivalence must normalize first.
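
The effect of skipping normalization can be seen with the standard library alone. The sketch below is illustrative and does not call this module; normalization_demo is a hypothetical name. It builds "café" once with a precomposed é and once with e plus a combining accent.

```gleam
import gleam/string

/// Returns the grapheme count of each spelling and whether the strings are equal.
pub fn normalization_demo() -> #(Int, Int, Bool) {
  // "é" as a single code point (U+00E9).
  let precomposed = "caf\u{00E9}"
  // "e" followed by a combining acute accent (U+0301).
  let decomposed = "cafe\u{0301}"

  #(
    // string.length counts grapheme clusters, not bytes or code points.
    string.length(precomposed),
    string.length(decomposed),
    // String equality compares the underlying data, so these differ.
    precomposed == decomposed,
  )
}
```

On the Erlang target this evaluates to #(4, 4, False): both spellings have four grapheme clusters, but the final clusters are not equal, so a grapheme-level comparison sees them as different unless both inputs are normalized to the same form first.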

Types

Validated parameter bag for jaro_winkler_with.

A value of this type is guaranteed to encode parameters that keep Jaro-Winkler output inside [0.0, 1.0]. Construct via default_jaro_winkler_config or jaro_winkler_config.

pub opaque type JaroWinklerConfig

Returned by jaro_winkler_config when its arguments fall outside the validated range.

pub type JaroWinklerConfigError {
  PrefixScaleOutOfRange(got: Float)
  PrefixMaxNegative(got: Int)
}

Constructors

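  • PrefixScaleOutOfRange(got: Float): prefix_scale was outside [0.0, 0.25]; got is the rejected value.
  • PrefixMaxNegative(got: Int): prefix_max was below 0; got is the rejected value.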

Returned by sorensen_dice when given an n-gram size below 1.

pub type SorensenDiceError {
  NgramSizeInvalid(got: Int)
}

Constructors

  • NgramSizeInvalid(got: Int): n was below 1; got is the value passed as n.

Values

pub fn default_jaro_winkler_config() -> JaroWinklerConfig

Winkler-1990 defaults: prefix_scale = 0.1, prefix_max = 4.
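
A minimal usage sketch, assuming the module is imported under the path shown in the page title. default_params is a hypothetical helper name; the accessors are the ones documented on this page.

```gleam
import textmetrics/similarity

/// Reads both parameters back out of the default config.
pub fn default_params() -> #(Float, Int) {
  let config = similarity.default_jaro_winkler_config()
  #(similarity.prefix_scale(config), similarity.prefix_max(config))
}
```

Given the defaults stated above, this should evaluate to #(0.1, 4).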

pub fn jaro(a: String, b: String) -> Float

Jaro similarity at the grapheme level.

Edge cases (defined by convention):

  • jaro("", "") = 1.0
  • jaro("", b) = 0.0 for non-empty b
  • jaro(a, "") = 0.0 for non-empty a

Time O(m·n), space O(m + n), where m and n are the grapheme counts of a and b.
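
For reference, the textbook definition of Jaro similarity is written out below. It is the standard formula, not a description of this module's internals, and the symbols are notation introduced here: |a| and |b| are grapheme counts, m is the number of matching graphemes, and t is half the number of matched graphemes that appear in a different order. Two graphemes match when they are equal and their positions differ by at most floor(max(|a|, |b|) / 2) - 1.

```latex
\[
\mathrm{jaro}(a, b) =
\begin{cases}
  1 & \text{if } |a| = |b| = 0 \\[4pt]
  0 & \text{if } m = 0 \text{ and } |a| + |b| > 0 \\[4pt]
  \dfrac{1}{3}\left( \dfrac{m}{|a|} + \dfrac{m}{|b|} + \dfrac{m - t}{m} \right) & \text{otherwise}
\end{cases}
\]

% Worked example: a = MARTHA, b = MARHTA, so m = 6 and t = 1 (T and H are swapped).
\[
\mathrm{jaro}(\texttt{MARTHA}, \texttt{MARHTA})
  = \frac{1}{3}\left( \frac{6}{6} + \frac{6}{6} + \frac{6 - 1}{6} \right)
  = \frac{17}{18} \approx 0.9444
\]
```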

pub fn jaro_winkler(a: String, b: String) -> Float

Jaro-Winkler similarity using Winkler-1990 defaults (prefix_scale = 0.1, prefix_max = 4).
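
The standard Jaro-Winkler adjustment adds a bonus for a shared prefix. The formula below is the textbook form, offered as a reference and not as a description of this module's internals. The symbols are notation introduced here: J is the Jaro score, c is the length of the common prefix in graphemes, and p is prefix_scale.

```latex
\[
\ell = \min(c, \mathtt{prefix\_max}), \qquad
\mathrm{jw}(a, b) = J + \ell \, p \, (1 - J)
\]

% Worked example with the defaults p = 0.1 and prefix_max = 4:
% MARTHA and MARHTA share the prefix "MAR", so \ell = 3, and J = 17/18.
\[
\mathrm{jw}(\texttt{MARTHA}, \texttt{MARHTA})
  = \frac{17}{18} + 3 \cdot 0.1 \cdot \left( 1 - \frac{17}{18} \right)
  = \frac{17.3}{18} \approx 0.9611
\]
```

With the defaults, the product of ℓ and p is at most 0.4, so the bonus can never push the score above 1.0.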

pub fn jaro_winkler_config(
  prefix_scale prefix_scale: Float,
  prefix_max prefix_max: Int,
) -> Result(JaroWinklerConfig, JaroWinklerConfigError)

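Construct a JaroWinklerConfig. Returns Error(PrefixScaleOutOfRange(got)) or Error(PrefixMaxNegative(got)) when an argument violates the invariants below.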

Invariants:

  • prefix_scale must be in [0.0, 0.25] (Winkler’s upper bound that keeps the score in [0, 1]).
  • prefix_max must be >= 0.
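
The invariants above can be exercised with a small sketch. describe_config is a hypothetical helper, not part of this module; the constructors and labelled arguments are the ones documented on this page.

```gleam
import gleam/float
import gleam/int
import textmetrics/similarity.{PrefixMaxNegative, PrefixScaleOutOfRange}

/// Hypothetical helper: validates the parameters and describes any failure.
pub fn describe_config(scale: Float, max: Int) -> String {
  case similarity.jaro_winkler_config(prefix_scale: scale, prefix_max: max) {
    // Both invariants hold, so a usable config was returned.
    Ok(_config) -> "ok"
    // Each error carries the value it was constructed with.
    Error(PrefixScaleOutOfRange(got)) ->
      "prefix_scale out of range: " <> float.to_string(got)
    Error(PrefixMaxNegative(got)) ->
      "prefix_max is negative: " <> int.to_string(got)
  }
}
```

Under the documented invariants, describe_config(0.1, 4) should take the Ok branch, describe_config(0.5, 4) the PrefixScaleOutOfRange branch, and describe_config(0.1, -1) the PrefixMaxNegative branch. Which error is reported when both arguments are invalid is not stated here.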

pub fn jaro_winkler_with(
  a: String,
  b: String,
  config: JaroWinklerConfig,
) -> Float

Jaro-Winkler similarity with caller-supplied parameters.
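
One way to combine this function with the validating constructor is sketched below. score_with is a hypothetical helper, and falling back to the defaults on invalid parameters is an illustrative design choice, not something this module prescribes.

```gleam
import textmetrics/similarity

/// Hypothetical helper: scores with custom parameters, using the
/// Winkler-1990 defaults when the supplied parameters are rejected.
pub fn score_with(a: String, b: String, scale: Float, max: Int) -> Float {
  let config = case
    similarity.jaro_winkler_config(prefix_scale: scale, prefix_max: max)
  {
    Ok(config) -> config
    Error(_) -> similarity.default_jaro_winkler_config()
  }
  similarity.jaro_winkler_with(a, b, config)
}
```

Because JaroWinklerConfig is opaque, every value that reaches jaro_winkler_with has passed validation, so the call itself needs no Result.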

pub fn prefix_max(config: JaroWinklerConfig) -> Int

Read the prefix-cap parameter of a config.

pub fn prefix_scale(config: JaroWinklerConfig) -> Float

Read the prefix-scale parameter of a config.

pub fn sorensen_dice(
  a: String,
  b: String,
  n: Int,
) -> Result(Float, SorensenDiceError)

Sørensen-Dice coefficient over grapheme n-grams of size n.

Edge cases (per spec §7.5):

  • When both n-gram multisets are empty (both inputs have fewer than n graphemes), the result is Ok(1.0) if the inputs are equal (including both empty) and Ok(0.0) otherwise.
  • When exactly one input has no n-grams, the score is Ok(0.0), since no overlap is possible.
  • n < 1 returns Error(NgramSizeInvalid(n)).
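
The edge cases above can be collected into one sketch. edge_cases is a hypothetical name, and the expected values in the comments are taken from the rules listed above, not from running the library.

```gleam
import textmetrics/similarity.{type SorensenDiceError}

/// One call per documented edge case, using bigrams (n = 2) except in the last call.
pub fn edge_cases() -> List(Result(Float, SorensenDiceError)) {
  [
    // Both inputs empty: documented as Ok(1.0).
    similarity.sorensen_dice("", "", 2),
    // Neither input has a bigram and the inputs are equal: Ok(1.0).
    similarity.sorensen_dice("a", "a", 2),
    // Neither input has a bigram and the inputs differ: Ok(0.0).
    similarity.sorensen_dice("a", "b", 2),
    // Only the first input has no bigrams: Ok(0.0).
    similarity.sorensen_dice("a", "night", 2),
    // n below 1: Error(NgramSizeInvalid(0)).
    similarity.sorensen_dice("night", "nacht", 0),
  ]
}
```

Callers that always pass a literal n of 1 or more still have to handle the Error case, for example with a case expression or result.unwrap from the standard library.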