ExLSH v0.4.0

Calculates a locality sensitive hash for text.

Examples:

iex> "Lorem ipsum dolor sit amet"
...> |> ExLSH.lsh()
...> |> Base.encode64()
"uX05itKaghA0gQHCwDCIFg=="

iex> "Lorem ipsum dolor sit amet"
...> |> ExLSH.lsh(2, &:crypto.hash(:sha, &1))
...> |> Base.encode64()
"VhW06EEJyWQA1gKIAAlQgI4NHUE="

Summary

Functions

Compute an LSH for a short string, e.g. a username or email

Default hash, uses :crypto.hash(:md5)

A noop filter

Default text normalizer: unicode normalization, lower case, replace all non-word chars with space, reduce consecutive spaces to one

Split a string to its unicode graphemes

Split a string into words

Compute an LSH for a piece of text, e.g. a document

Functions

charwise_lsh(text, shingle_width \\ 3)

Compute an LSH for a short string, e.g. a username or email.

Default hash; uses :crypto.hash(:md5).
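
To illustrate the hasher contract (iodata in, binary out), the sketch below calls the standard Erlang :crypto.hash/2 directly and checks the digest sizes. It is only an illustration, not ExLSH's internal code; the module name HasherSketch and its function names are hypothetical.

```elixir
defmodule HasherSketch do
  # Hypothetical helpers with the same shape as the hasher parameter:
  # (iodata() -> binary()).
  def md5(data), do: :crypto.hash(:md5, data)
  def sha1(data), do: :crypto.hash(:sha, data)
end

# iodata may be a binary or a (nested) list of binaries and bytes.
md5 = HasherSketch.md5(["Lorem", " ", "ipsum"])
sha1 = HasherSketch.sha1("Lorem ipsum")

IO.inspect(byte_size(md5), label: "md5 bytes")   # 16
IO.inspect(byte_size(sha1), label: "sha1 bytes") # 20
```

These sizes match the examples on this page: the default (MD5) result encodes to 24 Base64 characters (16 bytes), while the :sha variant encodes to 28 characters (20 bytes).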

A noop filter.
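
As a rough sketch of what a filter looks like, the block below shows a noop filter next to a stop-word filter with the same list-in, list-out shape. The module FilterSketch and its stop-word list are hypothetical and only serve as an illustration; they are not part of ExLSH.

```elixir
defmodule FilterSketch do
  # A noop filter returns the token list unchanged.
  def noop(tokens), do: tokens

  # Hypothetical stop-word list, for illustration only.
  @stop_words MapSet.new(["a", "an", "the", "of"])

  # Drops every token that appears in the stop-word list.
  def drop_stop_words(tokens) do
    Enum.reject(tokens, &MapSet.member?(@stop_words, &1))
  end
end

IO.inspect(FilterSketch.noop(["the", "quick", "fox"]))
IO.inspect(FilterSketch.drop_stop_words(["the", "quick", "fox"]))
```

Either function could be captured (for example &FilterSketch.drop_stop_words/1) and passed wherever a function of type ([String.t()] -> [String.t()]) is expected.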

lsh(text, shingle_width \\ 3, hasher \\ &default_hash/1, normalizer \\ &normalize/1, tokenizer \\ &tokenize_words/1, filter \\ &filter/1)
lsh(
  String.t(),
  pos_integer(),
  (iodata() -> binary()),
  (String.t() -> String.t()),
  (String.t() -> [String.t()]),
  ([String.t()] -> [String.t()])
) :: binary()

Compute an LSH/SimHash for a given text.

Returns a non-printable :binary of the hash.
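
Because the result is raw bytes, two hashes are usually compared bit by bit rather than printed. The following is a minimal sketch, assuming similarity is judged by Hamming distance (the usual measure for SimHash fingerprints). The module HammingSketch is hypothetical and not part of ExLSH.

```elixir
defmodule HammingSketch do
  # Number of bit positions in which two equal-length binaries differ.
  def distance(a, b)
      when is_binary(a) and is_binary(b) and byte_size(a) == byte_size(b) do
    Enum.zip(bits(a), bits(b))
    |> Enum.count(fn {x, y} -> x != y end)
  end

  # Expand a binary into a list of 0/1 integers, most significant bit first.
  defp bits(binary), do: for(<<bit::1 <- binary>>, do: bit)
end

# 0b00001010 and 0b00000110 differ in two bit positions.
IO.inspect(HammingSketch.distance(<<0b1010>>, <<0b0110>>))
```

A smaller distance between two hashes produced with the same parameters suggests more similar input texts; any threshold for "similar enough" would be application specific.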

The following parameters are configurable:

  • shingle_width: if 1, the "bag of words" approach is used. For an integer > 1, hashes are computed over n-grams (shingles) of that width.
  • hasher: a function that takes iodata and returns its hash as a :binary. LSH computation is significantly faster on shorter hashes. See :crypto.supports()[:hashs] for all hash functions available on your platform.
  • normalizer: a function that takes a string and returns a normalized string.
  • tokenizer: a function that takes a normalized string and returns tokens, e.g. graphemes or words.
  • filter: a function that filters a list of tokens, e.g. removes stop-words, non-ASCII chars, etc.
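
To make the role of these parameters more concrete, here is a simplified SimHash sketch in plain Elixir. It is an assumption-laden illustration, not ExLSH's actual implementation: the module SimHashSketch, the way shingles are joined with a space, and the bit-voting details are all hypothetical, so its output should not be expected to match ExLSH.lsh/6.

```elixir
defmodule SimHashSketch do
  # Illustrative only: tokenization and normalization are reduced to
  # downcasing and whitespace splitting; MD5 stands in for the hasher.
  def simhash(text, shingle_width \\ 3) do
    tokens = text |> String.downcase() |> String.split()

    tokens
    |> shingles(shingle_width)
    |> Enum.map(&:crypto.hash(:md5, &1))
    |> Enum.map(&votes/1)
    |> sum_columns()
    |> Enum.map(fn total -> if total > 0, do: 1, else: 0 end)
    |> pack()
  end

  # All runs of `width` consecutive tokens. With width 1 every token is
  # its own shingle, which amounts to a bag of words.
  def shingles(tokens, width) do
    tokens
    |> Enum.chunk_every(width, 1, :discard)
    |> Enum.map(&Enum.join(&1, " "))
  end

  # Each hash bit votes +1 (bit set) or -1 (bit unset).
  defp votes(hash), do: for(<<bit::1 <- hash>>, do: bit * 2 - 1)

  # Sum the votes per bit position across all shingle hashes.
  defp sum_columns(rows) do
    rows
    |> Enum.zip()
    |> Enum.map(fn column -> column |> Tuple.to_list() |> Enum.sum() end)
  end

  # Pack a list of 0/1 values back into a bitstring.
  defp pack(bit_list), do: for(bit <- bit_list, into: <<>>, do: <<bit::1>>)
end

"Lorem ipsum dolor sit amet"
|> SimHashSketch.simhash()
|> Base.encode64()
|> IO.puts()
```

In this sketch the output has as many bits as the hash function produces and every bit position is tallied once per shingle, which gives some intuition for why shorter hashes are cheaper to process.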

Examples:

iex> "Lorem ipsum dolor sit amet"
...> |> ExLSH.lsh()
...> |> Base.encode64()
"uX05itKaghA0gQHCwDCIFg=="

iex> "Lorem ipsum dolor sit amet"
...> |> ExLSH.lsh(2, &:crypto.hash(:sha, &1))
...> |> Base.encode64()
"VhW06EEJyWQA1gKIAAlQgI4NHUE="

Default text normalizer: unicode normalization, lower case, replace all non-word chars with space, reduce consecutive spaces to one.
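
The normalization steps listed above can be sketched roughly as follows. This is an approximation under stated assumptions: the normalization form (:nfc) and the exact regular expressions are guesses, and NormalizeSketch is a hypothetical module, not the library's code.

```elixir
defmodule NormalizeSketch do
  def normalize(text) do
    text
    # Assumption: NFC; the form ExLSH uses is not stated in this document.
    |> String.normalize(:nfc)
    |> String.downcase()
    # Replace every non-word character with a space (Unicode-aware).
    |> String.replace(~r/\W/u, " ")
    # Collapse runs of spaces into a single space.
    |> String.replace(~r/ +/, " ")
  end
end

IO.inspect(NormalizeSketch.normalize("Lorem, IPSUM"))
```

Any function of type (String.t() -> String.t()) could take its place as the normalizer argument.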

Split a string to its unicode graphemes.

Split a string into words.
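
Both tokenizers can be approximated with functions from Elixir's String module. The sketch below assumes String.graphemes/1 and String.split/1 are reasonable stand-ins; TokenizerSketch is a hypothetical module and may differ from what ExLSH does internally.

```elixir
defmodule TokenizerSketch do
  # One token per Unicode grapheme cluster.
  def graphemes(text), do: String.graphemes(text)

  # One token per whitespace-separated word.
  def words(text), do: String.split(text)
end

IO.inspect(TokenizerSketch.graphemes("lorem"))
IO.inspect(TokenizerSketch.words("lorem ipsum dolor"))
```

Grapheme tokens suit short inputs such as usernames, where word boundaries carry little information; word tokens suit longer documents.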

wordwise_lsh(text, shingle_width \\ 3)

Compute an LSH for a piece of text, e.g. a document.