simhash v0.1.2 Simhash
Provides simhash.
Examples
iex> Simhash.similarity("Universal Avenue", "Universe Avenue")
0.71875
iex> Simhash.similarity("hocus pocus", "pocus hocus")
0.8125
iex> Simhash.similarity("Sankt Eriksgatan 1", "S:t Eriksgatan 1")
0.8125
iex> Simhash.similarity("Purple flowers", "Green grass")
0.5625
iex> Simhash.similarity("Peanut butter", "Strawberry cocktail")
0.4375
By default trigrams (N-gram of size 3) are used as language features, but you can set a different N-gram size:
iex> Simhash.similarity("hocus pocus", "pocus hocus", 1)
1.0
iex> Simhash.similarity("Sankt Eriksgatan 1", "S:t Eriksgatan 1", 6)
0.859375
iex> Simhash.similarity("Purple flowers", "Green grass", 6)
0.546875
Algorithm description: http://matpalm.com/resemblance/simhash/
Summary
Functions
Returns list of lists of bits of 64bit Siphashes for each shingle
Hamming distance between the left and right hash, given as lists of bits
Generates the hash for the given subject. The feature hashes are N-grams, where N is given by the parameter n
Calculate the similarity between the left and right hash, using Simhash
Returns N-grams of input str
Calculates the similarity between the left and right string, using Simhash
Returns the 64bit Siphash for input str as bitstring
Reduce list of lists to list of integers, following vector addition
Functions
Hamming distance between the left and right hash, given as lists of bits.
iex> Simhash.hamming_distance([1, 1, 0, 1, 0], [0, 1, 1, 1, 0])
2
Generates the hash for the given subject. The feature hashes are N-grams, where N is given by the parameter n.
Calculate the similarity between the left and right hash, using Simhash.
Returns N-grams of input str.
iex> Simhash.n_grams("Universal")
["Uni", "niv", "ive", "ver", "ers", "rsa", "sal"]
Calculates the similarity between the left and right string, using Simhash.
Returns the 64bit Siphash for input str as bitstring.
iex> Simhash.siphash("abc")
<<249, 236, 145, 130, 66, 18, 3, 247>>
iex> byte_size(Simhash.siphash("abc"))
8