simhash v0.1.2 Simhash

Provides Simhash, a locality-sensitive hash used to estimate the similarity of two strings.

Examples

iex> Simhash.similarity("Universal Avenue", "Universe Avenue")
0.71875
iex> Simhash.similarity("hocus pocus", "pocus hocus")
0.8125
iex> Simhash.similarity("Sankt Eriksgatan 1", "S:t Eriksgatan 1")
0.8125
iex> Simhash.similarity("Purple flowers", "Green grass")
0.5625
iex> Simhash.similarity("Peanut butter", "Strawberry cocktail")
0.4375

By default, trigrams (N-grams of size 3) are used as language features, but you can set a different N-gram size with the optional third argument:

iex> Simhash.similarity("hocus pocus", "pocus hocus", 1)
1.0
iex> Simhash.similarity("Sankt Eriksgatan 1", "S:t Eriksgatan 1", 6)
0.859375
iex> Simhash.similarity("Purple flowers", "Green grass", 6)
0.546875

Algorithm description: http://matpalm.com/resemblance/simhash/

Summary

Functions

feature_hashes(subject, n)

Returns a list of bit lists, one 64-bit SipHash per shingle (N-gram) of subject

hamming_distance(left, right)

Hamming distance between the left and right hash, given as lists of bits

hash(subject, n \\ 3)

Generates the Simhash for the given subject. The features are the N-grams of subject, where N is given by the parameter n

hash_similarity(left, right)

Calculates the similarity between the left and right hash, using Simhash

n_grams(str, n \\ 3)

Returns N-grams of input str

similarity(left, right, n \\ 3)

Calculates the similarity between the left and right string, using Simhash

siphash(str)

Returns the 64-bit SipHash for input str as bitstring

vector_addition(lists)

Reduces a list of lists to a list of integers, following vector addition

Functions

feature_hashes(subject, n)

Returns a list of bit lists, one per shingle (N-gram) of subject. Each inner list holds the bits of that shingle's 64-bit SipHash.
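To make "list of bits" concrete, here is a minimal sketch of expanding an 8-byte digest into 64 integers. The module `BitsSketch` and the function `to_bits/1` are hypothetical names, and the bit order (most significant bit of each byte first) is an assumption rather than something this page states. The input is the value shown for `Simhash.siphash("abc")` elsewhere on this page.

```elixir
defmodule BitsSketch do
  # Hypothetical helper: expands a binary into a list of 0/1 integers.
  # ASSUMPTION: bits are read most significant bit first within each byte.
  def to_bits(bin) when is_bitstring(bin) do
    for <<bit::1 <- bin>>, do: bit
  end
end

# The 8-byte value this page shows for Simhash.siphash("abc").
digest = <<249, 236, 145, 130, 66, 18, 3, 247>>
bits = BitsSketch.to_bits(digest)

IO.inspect(length(bits))        # 64
IO.inspect(Enum.take(bits, 8))  # [1, 1, 1, 1, 1, 0, 0, 1]  (249 = 0b11111001)
```

Applying such a helper to the hash of every shingle would give the list-of-lists shape described above.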

hamming_distance(left, right)

Hamming distance between the left and right hash, given as lists of bits.

iex> Simhash.hamming_distance([1, 1, 0, 1, 0], [0, 1, 1, 1, 0])
2

hash(subject, n \\ 3)

Generates the Simhash for the given subject. The features are the N-grams of subject, where N is given by the parameter n (default 3).
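The general Simhash recipe can be sketched as follows: split the subject into N-grams, hash each one, turn every hash bit into +1 or -1, sum the columns, and keep a 1 wherever the sum is positive. This is an illustration of the textbook algorithm, not the library's source. The module `SimhashSketch` and its functions are hypothetical, and `:erlang.phash2/2` (32 bits) stands in for the 64-bit SipHash so the sketch needs no dependencies. The tie-breaking rule (a zero sum maps to 0) is also an assumption.

```elixir
defmodule SimhashSketch do
  # Character N-grams of `str`, produced with a sliding window of size n.
  def n_grams(str, n \\ 3) do
    str
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end

  # Stand-in feature hash, returned as a list of 32 bits.
  # ASSUMPTION: :erlang.phash2/2 replaces the library's 64-bit SipHash.
  def feature_bits(feature) do
    h = :erlang.phash2(feature, 4_294_967_296)
    bin = <<h::32>>
    for <<bit::1 <- bin>>, do: bit
  end

  # Simhash of `subject` as a list of 32 bits.
  def hash(subject, n \\ 3) do
    subject
    |> n_grams(n)
    |> Enum.map(&feature_bits/1)
    # Weight each bit: 1 stays +1, 0 becomes -1.
    |> Enum.map(fn bits -> Enum.map(bits, fn b -> 2 * b - 1 end) end)
    # Column-wise sum across all features.
    |> Enum.zip()
    |> Enum.map(fn column -> column |> Tuple.to_list() |> Enum.sum() end)
    # A positive column sum becomes 1, anything else 0.
    |> Enum.map(fn sum -> if sum > 0, do: 1, else: 0 end)
  end
end

fingerprint = SimhashSketch.hash("Universal Avenue")
IO.inspect(length(fingerprint))  # 32
```

Because a different feature hash is used, the fingerprints from this sketch will not match the values the library produces.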

hash_similarity(left, right)

Calculates the similarity between the left and right hash, using Simhash.
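This page does not state the similarity formula. Every value in its examples is a multiple of 1/64, which is consistent with `1 - hamming_distance / 64`, so the sketch below assumes that relationship. The module `SimilaritySketch` is a hypothetical name, and dividing by the length of the left list (instead of a fixed 64) is a choice made here for illustration.

```elixir
defmodule SimilaritySketch do
  # Number of positions at which two equal-length bit lists differ.
  def hamming_distance(left, right) do
    left
    |> Enum.zip(right)
    |> Enum.count(fn {l, r} -> l != r end)
  end

  # ASSUMPTION: similarity = 1 - distance / number_of_bits.
  def hash_similarity(left, right) do
    1 - hamming_distance(left, right) / length(left)
  end
end

# Two 64-bit hashes that differ in exactly 18 positions.
left = List.duplicate(0, 64)
right = List.duplicate(1, 18) ++ List.duplicate(0, 46)

IO.inspect(SimilaritySketch.hamming_distance(left, right))  # 18
IO.inspect(SimilaritySketch.hash_similarity(left, right))   # 0.71875
```

Under this assumption, identical hashes score 1.0 and hashes that differ in every bit score 0.0.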

n_grams(str, n \\ 3)

Returns N-grams of input str.

iex> Simhash.n_grams("Universal")
["Uni", "niv", "ive", "ver", "ers", "rsa", "sal"]

More about N-gram
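One way to produce such N-grams is a sliding window over the characters of the string. The sketch below reproduces the output shown above. The module `NGramSketch` is a hypothetical name; splitting on graphemes (rather than bytes) and returning an empty list for strings shorter than n are assumptions about behaviour this page does not specify.

```elixir
defmodule NGramSketch do
  # Slides a window of size n over the graphemes of `str`, one step at a time.
  # Incomplete trailing windows are discarded.
  def n_grams(str, n \\ 3) do
    str
    |> String.graphemes()
    |> Enum.chunk_every(n, 1, :discard)
    |> Enum.map(&Enum.join/1)
  end
end

IO.inspect(NGramSketch.n_grams("Universal"))
# ["Uni", "niv", "ive", "ver", "ers", "rsa", "sal"]

IO.inspect(NGramSketch.n_grams("hocus", 2))
# ["ho", "oc", "cu", "us"]
```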

similarity(left, right, n \\ 3)

Calculates the similarity between the left and right string, using Simhash.

siphash(str)

Returns the 64-bit SipHash of the input str as an 8-byte bitstring.

iex> Simhash.siphash("abc")
<<249, 236, 145, 130, 66, 18, 3, 247>>
iex> byte_size(Simhash.siphash("abc"))
8

vector_addition(lists)

Reduces a list of equal-length integer lists to a single list of integers by element-wise (vector) addition.

Example:

iex> Simhash.vector_addition([[1, 3, 2, 1], [0, 1, -1, 2], [2, 0, 0, 0]])
[3, 4, 1, 3]
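Element-wise addition of this kind can be sketched by zipping the lists into columns and summing each column. The module `VectorSketch` is a hypothetical name, and this is one possible approach rather than the library's code. Note that `Enum.zip/1` stops at the shortest list, so the sketch assumes all inner lists have the same length.

```elixir
defmodule VectorSketch do
  # Element-wise sum of a list of equal-length integer lists.
  def vector_addition(lists) do
    lists
    |> Enum.zip()
    |> Enum.map(fn column -> column |> Tuple.to_list() |> Enum.sum() end)
  end
end

IO.inspect(VectorSketch.vector_addition([[1, 3, 2, 1], [0, 1, -1, 2], [2, 0, 0, 0]]))
# [3, 4, 1, 3]
```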