fuzzy_compare v1.0.0 FuzzyCompare View Source
This module compares two strings for their similarity and uses multiple approaches to get high quality results.
Getting started
In order to compare two strings with each other do the following:
iex> FuzzyCompare.similarity("Oscar-Claude Monet", "monet, claude")
0.95
Inner workings
Imagine you had to match some names.
Try to match the following list of painters:
"Oscar-Claude Monet"
"Edouard Manet"
"Monet, Claude"
For a human it is easy to see that some of the names have just been flipped and that others are different but similar sounding.
A first approrach could be to compare the strings with a string similarity function like the Jaro-Winkler function.
iex> String.jaro_distance("Oscar-Claude Monet", "Monet, Claude")
0.6032763532763533
iex> String.jaro_distance("Oscar-Claude Monet", "Edouard Manet")
0.6749287749287749
This is not an improvement over exact equality.
In order to improve the results this library uses two different approaches,
FuzzyCompare.ChunkSet
and FuzzyCompare.SortedChunks
.
Sorted chunks
This approach yields good results when words within a string have been shuffled around. The strategy will sort all substrings by words and compare the sorted strings.
iex> FuzzyCompare.SortedChunks.substring_similarity("Oscar-Claude Monet", "Monet, Claude")
1.0
iex(4)> FuzzyCompare.SortedChunks.substring_similarity("Oscar-Claude Monet", "Edouard Manet")
0.6944444444444443
Chunkset
The chunkset approach is best in scenarios when the strings contain other substrings that are not relevant to what is being searched for.
iex> FuzzyCompare.ChunkSet.standard_similarity("Claude Monet", "Alice Hoschedé was the wife of Claude Monet")
1.0
Substring comparison
Should one of the strings be much longer than the other the library will attempt to compare matching substrings only.
Credits
This library is inspired by a seatgeek blogpost from 2011.
Link to this section Summary
Functions
Compares two binaries for their similarity and returns a float in the range of
0.0
and 1.0
where 0.0
means no similarity and 1.0
means exactly alike
Link to this section Functions
similarity( binary() | FuzzyCompare.Preprocessed.t(), binary() | FuzzyCompare.Preprocessed.t() ) :: float()
Compares two binaries for their similarity and returns a float in the range of
0.0
and 1.0
where 0.0
means no similarity and 1.0
means exactly alike.
Examples
iex> FuzzyCompare.similarity("Oscar-Claude Monet", "monet, claude")
0.95
iex> String.jaro_distance("Oscar-Claude Monet", "monet, claude")
0.6032763532763533
Preprocessing
The ratio function expects either strings or the FuzzyCompare.Preprocessed
struct.
When comparing a large list of strings against always the same string it is
advisable to run the preprocessing once and pass the FuzzyCompare.Preprocessed
struct.
That way you pay for preprocessing of the constant string only once.