ExLSH v0.4.0
Calculates a locality-sensitive hash (LSH) for text.
Examples:
iex> "Lorem ipsum dolor sit amet"
...> |> ExLSH.lsh()
...> |> Base.encode64()
"uX05itKaghA0gQHCwDCIFg=="
iex> "Lorem ipsum dolor sit amet"
...> |> ExLSH.lsh(2, &:crypto.hash(:sha, &1))
...> |> Base.encode64()
"VhW06EEJyWQA1gKIAAlQgI4NHUE="
Summary
Functions
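charwise_lsh(text, shingle_width \\ 3)
Compute an LSH for a short string, e.g. a username or email.
default_hash(message)
Default hash, uses :crypto.hash(:md5).
filter(words)
A noop filter.
lsh(text, shingle_width \\ 3, hasher \\ &default_hash/1, normalizer \\ &normalize/1, tokenizer \\ &tokenize_words/1, filter \\ &filter/1)
Compute an LSH/SimHash for a given text.
normalize(text)
Default text normalizer: unicode normalization, lower case, replace all non-word chars with space, reduce consecutive spaces to one.
tokenize_chars(text)
Split a string to its unicode graphemes.
tokenize_words(text)
Split a string into words.
wordwise_lsh(text, shingle_width \\ 3)
Compute an LSH for a piece of text, e.g. a document.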
Functions
charwise_lsh(text, shingle_width \\ 3)
Compute an LSH for a short string, e.g. a username or email.
default_hash(message)
Default hash, uses :crypto.hash(:md5)
filter(words)
A noop filter.
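As an illustration of what a non-trivial filter could look like, here is a minimal sketch of a stop-word filter. The module name StopWordFilter and its word list are hypothetical and not part of ExLSH; the only thing taken from the documentation is the contract that a filter receives a list of tokens and returns a list of tokens.

```elixir
defmodule StopWordFilter do
  # Hypothetical stop-word list, chosen only for illustration.
  @stop_words ~w(a an and of the)

  # Takes a list of tokens and returns it without the stop words.
  # Token order is preserved.
  def filter(words) do
    Enum.reject(words, fn word -> word in @stop_words end)
  end
end

IO.inspect(StopWordFilter.filter(["the", "quick", "fox", "and", "a", "dog"]))
```

A function like this could be passed as the last argument of lsh, for example as &StopWordFilter.filter/1, in place of the default noop filter.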
lsh(text, shingle_width \\ 3, hasher \\ &default_hash/1, normalizer \\ &normalize/1, tokenizer \\ &tokenize_words/1, filter \\ &filter/1)
Compute an LSH/SimHash for a given text.
Returns the hash as a non-printable :binary.
The following parameters are configurable:
shingle_width: if given 1, the "bag of words" approach is used. Given an integer greater than 1, hashes are computed over n-grams of that width.
hasher: a function that takes an IOList and returns its hash as a :binary. LSH computation is significantly faster on shorter hashes. See :crypto.supports()[:hashs] for all hash functions available on your platform.
normalizer: a function that takes a string and returns a normalized string.
tokenizer: a function that takes a normalized string and returns tokens, e.g. graphemes or words.
filter: a function that filters a list of tokens, e.g. removes stop-words, non-ASCII chars, etc.
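To illustrate the shingle_width parameter, the following sketch builds n-grams (shingles) from a token list with a sliding window. ShingleSketch is a hypothetical module, and discarding incomplete trailing windows is an assumption; ExLSH may handle short inputs differently.

```elixir
defmodule ShingleSketch do
  # Returns every run of `width` consecutive tokens (sliding window, step 1).
  # Assumption: windows shorter than `width` are discarded.
  def shingles(tokens, width) when is_integer(width) and width >= 1 do
    Enum.chunk_every(tokens, width, 1, :discard)
  end
end

tokens = ~w(lorem ipsum dolor sit amet)

# Width 1 keeps each token on its own ("bag of words").
IO.inspect(ShingleSketch.shingles(tokens, 1))

# Width 2 yields overlapping pairs (bigrams).
IO.inspect(ShingleSketch.shingles(tokens, 2))
```

Each shingle is a list of strings, which is a valid IOList and can therefore be handed to a hasher such as :crypto.hash/2 without joining it first.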
Examples:
iex> "Lorem ipsum dolor sit amet"
...> |> ExLSH.lsh()
...> |> Base.encode64()
"uX05itKaghA0gQHCwDCIFg=="
iex> "Lorem ipsum dolor sit amet"
...> |> ExLSH.lsh(2, &:crypto.hash(:sha, &1))
...> |> Base.encode64()
"VhW06EEJyWQA1gKIAAlQgI4NHUE="
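For readers unfamiliar with SimHash, the sketch below shows the textbook version of the technique on an already tokenized input: hash every shingle, let each hash vote +1 or -1 per bit position, and keep the sign of each sum. It is not ExLSH's actual implementation and will not necessarily reproduce the values above; the module name SimHashSketch and the tie-breaking rule (a tie yields a 0 bit) are assumptions.

```elixir
defmodule SimHashSketch do
  # Textbook SimHash over a list of tokens.
  # `hasher` takes an IOList and returns a binary, as described for lsh.
  def simhash(tokens, width \\ 3, hasher \\ &:crypto.hash(:md5, &1)) do
    votes =
      tokens
      # Overlapping shingles of `width` tokens; short tails are dropped.
      |> Enum.chunk_every(width, 1, :discard)
      |> Enum.map(hasher)
      # Turn every hash into a list of votes: bit 1 -> +1, bit 0 -> -1.
      |> Enum.map(fn hash -> for <<bit::1 <- hash>>, do: 2 * bit - 1 end)
      # Sum the votes column-wise, one sum per bit position.
      |> Enum.zip_with(&Enum.sum/1)

    # A positive sum becomes a 1 bit, anything else a 0 bit (assumed tie rule).
    for vote <- votes, into: <<>> do
      bit = if vote > 0, do: 1, else: 0
      <<bit::1>>
    end
  end
end

hash = SimHashSketch.simhash(~w(lorem ipsum dolor sit amet), 2)
IO.puts(Base.encode64(hash))
```

The result has the same length as the underlying hash, which is one reason a shorter hash function makes the computation cheaper: there are fewer bit positions to tally.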
normalize(text)
Default text normalizer: unicode normalization, lower case, replace all non-word chars with space, reduce consecutive spaces to one.
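The steps listed above could be sketched roughly as follows. NormalizeSketch is a hypothetical module, not ExLSH's source; in particular, the choice of NFC as the normalization form and the exact regular expressions are assumptions.

```elixir
defmodule NormalizeSketch do
  def normalize(text) do
    text
    # Unicode normalization (NFC is an assumed choice of form).
    |> String.normalize(:nfc)
    # Lower case.
    |> String.downcase()
    # Replace every non-word character with a space (Unicode-aware).
    |> String.replace(~r/\W/u, " ")
    # Reduce consecutive spaces to one.
    |> String.replace(~r/ +/, " ")
  end
end

IO.inspect(NormalizeSketch.normalize("Lorem,  IPSUM!dolor"))
```

Any function with the same shape, string in and string out, can be supplied as the normalizer argument of lsh.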
tokenize_chars(text)
Split a string to its unicode graphemes.
tokenize_words(text)
Split a string into words.
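The difference between the two tokenizers can be illustrated with the standard library alone. TokenizeSketch is a hypothetical module; that ExLSH builds on String.graphemes/1 and String.split/1 is an assumption, not something the documentation states.

```elixir
defmodule TokenizeSketch do
  # One token per Unicode grapheme (user-perceived character).
  def tokenize_chars(text), do: String.graphemes(text)

  # One token per whitespace-separated word.
  def tokenize_words(text), do: String.split(text)
end

IO.inspect(TokenizeSketch.tokenize_chars("abc"))
IO.inspect(TokenizeSketch.tokenize_words("lorem ipsum dolor"))

# A base letter plus a combining accent is two code points but one grapheme.
IO.inspect(length(TokenizeSketch.tokenize_chars("e\u0301")))
```

Grapheme tokens suit short strings such as usernames, where there are too few words to form meaningful shingles.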
wordwise_lsh(text, shingle_width \\ 3)
Compute an LSH for a piece of text, e.g. a document.