ExLSH
Calculates a locality sensitive hash for text.
Locality-sensitive hashing is a technique for dimensionality reduction. By design, similar inputs produce similar output vectors, which makes it useful for clustering and near-duplicate detection. This implementation targets natural-language input. It takes a String of arbitrary length and outputs a vector encoded as a :binary.
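To make the "similar inputs, similar outputs" idea concrete, here is a minimal SimHash-style sketch. It is not ExLSH's actual implementation: the module name `SimHashSketch`, the 32-bit width, the word-level tokenization, and the use of `:erlang.phash2/2` are all assumptions chosen for brevity.

```elixir
defmodule SimHashSketch do
  # Illustrative only; a hypothetical module, not part of ExLSH.
  import Bitwise

  @bits 32

  # Returns a 32-bit fingerprint of `text` as a 4-byte binary.
  def hash(text) when is_binary(text) do
    tokens =
      text
      |> String.downcase()
      |> String.split(~r/\W+/u, trim: true)

    # Every token hashes to a 32-bit integer and votes +1 or -1
    # on each bit position, depending on whether that bit is set.
    votes =
      Enum.reduce(tokens, List.duplicate(0, @bits), fn token, acc ->
        h = :erlang.phash2(token, 1 <<< @bits)

        acc
        |> Enum.with_index()
        |> Enum.map(fn {vote, i} ->
          if ((h >>> i) &&& 1) == 1, do: vote + 1, else: vote - 1
        end)
      end)

    # Positions with a positive total become 1 bits, all others 0.
    for vote <- votes, into: <<>> do
      bit = if vote > 0, do: 1, else: 0
      <<bit::1>>
    end
  end
end
```

Because each bit is decided by a majority vote over all tokens, changing a few words tends to flip only a few bits, so near-duplicate texts tend to end up with fingerprints that differ in few positions.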
Installation
Add ex_lsh to your list of dependencies in mix.exs:
def deps do
  [
    {:ex_lsh, "~> 0.4"}
  ]
end
Usage
"Lorem ipsum dolor sit amet"
|> ExLSH.lsh()
|> Base.encode64()
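For near-duplicate detection you typically compare two hashes rather than inspect one. A common measure for SimHash-style fingerprints is the Hamming distance, the number of differing bits. The helper below is a sketch under that assumption; `HammingSketch` is a hypothetical module and not part of ExLSH.

```elixir
defmodule HammingSketch do
  # Illustrative only; a hypothetical helper, not part of ExLSH.
  import Bitwise

  # Counts the bits that differ between two binaries of equal byte size.
  def distance(a, b)
      when is_binary(a) and is_binary(b) and byte_size(a) == byte_size(b) do
    Enum.zip(:binary.bin_to_list(a), :binary.bin_to_list(b))
    |> Enum.reduce(0, fn {x, y}, acc -> acc + popcount(bxor(x, y)) end)
  end

  # Number of set bits in a non-negative integer.
  defp popcount(0), do: 0
  defp popcount(n), do: (n &&& 1) + popcount(n >>> 1)
end
```

With ex_lsh installed, the two arguments would be the results of `ExLSH.lsh/1` for the texts being compared. A smaller distance suggests more similar texts; a useful threshold depends on the hash length and your data, so it is best determined empirically.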
Docs
Contributions
Please fork the project and submit a PR.
Credits
- SimHash is a very similar, but less versatile implementation that is focused on short strings only.
- Resemblance explores simhash and sketching in Ruby. The author has documented his findings in a series of articles. You may want to familiarize yourself with Part 3: The SimHash Algorithm.
- Near-duplicate detection is a very helpful article by Moz. It explains core concepts such as tokenization, shingling, MinHash, and SimHash.