ExLSH


Calculates a locality-sensitive hash for text.

Locality-sensitive hashing (LSH) is a technique for dimensionality reduction. It is designed so that similar inputs produce similar output vectors with high probability, which makes it useful for clustering and near-duplicate detection. This implementation targets natural-language input: it takes a String of arbitrary length and outputs a vector encoded as :binary.
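
To make the "similar inputs, similar outputs" idea concrete, here is a minimal SimHash-style sketch. It is an illustration only and is not ExLSH's implementation: the module name SimHashSketch, the 32-bit width, the whitespace tokenizer, and the use of :erlang.phash2/2 as the token hash are all assumptions chosen for brevity.

```elixir
defmodule SimHashSketch do
  # Hypothetical module for illustration; not part of ex_lsh.
  import Bitwise

  @bits 32

  # Returns a 32-bit fingerprint, encoded as a 4-byte binary, for the text.
  def fingerprint(text) when is_binary(text) do
    hashes =
      text
      |> String.downcase()
      |> String.split()
      # Assumption: phash2 stands in for whatever hash a real library uses.
      |> Enum.map(&:erlang.phash2(&1, 1 <<< @bits))

    value =
      Enum.reduce(0..(@bits - 1), 0, fn i, acc ->
        if bit_score(hashes, i) > 0, do: acc ||| (1 <<< i), else: acc
      end)

    <<value::size(@bits)>>
  end

  # Each token votes +1 if bit i of its hash is set and -1 otherwise.
  defp bit_score(hashes, i) do
    Enum.reduce(hashes, 0, fn h, acc ->
      if ((h >>> i) &&& 1) == 1, do: acc + 1, else: acc - 1
    end)
  end
end
```

Because every token only contributes one vote per bit, changing a few tokens in a long text tends to flip only a few bits of the fingerprint.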

Installation

Add ex_lsh to your list of dependencies in mix.exs:

def deps do
  [
    {:ex_lsh, "~> 0.4"}
  ]
end

Usage

"Lorem ipsum dolor sit amet"
|> ExLSH.lsh()
|> Base.encode64()
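
For near-duplicate detection, two hashes are typically compared by counting the bits in which they differ (the Hamming distance). The helper below is a minimal sketch of that comparison; HammingSketch is a hypothetical module, not part of ex_lsh, and it assumes both binaries have the same byte length.

```elixir
defmodule HammingSketch do
  # Hypothetical helper for illustration; not part of ex_lsh.
  import Bitwise

  # Counts the differing bits between two binaries of equal byte length.
  def distance(a, b)
      when is_binary(a) and is_binary(b) and byte_size(a) == byte_size(b) do
    Enum.zip(:binary.bin_to_list(a), :binary.bin_to_list(b))
    |> Enum.map(fn {x, y} -> popcount(bxor(x, y)) end)
    |> Enum.sum()
  end

  # Number of set bits in a non-negative integer.
  defp popcount(0), do: 0
  defp popcount(n), do: (n &&& 1) + popcount(n >>> 1)
end

# With ex_lsh installed, the raw (not Base64-encoded) hashes could be compared as:
#   HammingSketch.distance(ExLSH.lsh(text_a), ExLSH.lsh(text_b))
```

A small distance suggests the two texts are similar. What counts as "small" depends on the hash width and the data, so any threshold should be tuned empirically.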

Docs

See hexdocs.pm/ex_lsh.

Contributions

Please fork the project and submit a PR.

Credits

  • SimHash is a very similar but less versatile implementation that focuses on short strings only.
  • Resemblance explores simhash and sketching in Ruby. The author has documented his findings in a series of articles. You may want to familiarize yourself with Part 3: The SimHash Algorithm.
  • Near-duplicate detection is a very helpful article by Moz. It explains core concepts such as tokenization, shingling, MinHash, and SimHash.