Locality-Sensitive Hashing for fast approximate nearest-neighbor search on sub-hash sets.
Computes MinHash signatures and uses banding to find candidate fragment pairs with high Jaccard similarity without quadratic pairwise comparison.
How it works
MinHash — for each fragment's sub-hash set, compute
kminimum hash values underkdifferent hash functions (simulated via seeds). The probability that two signatures agree at a position equals their exact Jaccard similarity.Banding — split the
k-length signature intobbands ofrrows each. Two fragments become candidates if they match on all rows in any band. This creates an S-curve acceptance probability:P(candidate) = 1 - (1 - J^r)^bwhere J is Jaccard similarity. The inflection point sits at approximately
(1/b)^(1/r).Parameter selection —
optimal_params/1picks(b, r)so the S-curve inflection is near the desired similarity threshold, keeping the total hash count between 16 and 64.
Summary
Functions
Find candidate pairs using LSH banding.
Compute (num_bands, rows_per_band) that place the LSH S-curve inflection
point near threshold.
Compute a MinHash signature for a set of sub-hashes.
Functions
@spec candidate_pairs( [{non_neg_integer(), [non_neg_integer()]}], pos_integer(), pos_integer() ) :: MapSet.t()
Find candidate pairs using LSH banding.
Accepts a list of {index, signature} tuples and returns a MapSet of
{i, j} pairs (where i < j) that collided in at least one band.
@spec optimal_params(float()) :: {pos_integer(), pos_integer()}
Compute (num_bands, rows_per_band) that place the LSH S-curve inflection
point near threshold.
The inflection point of the banding S-curve is (1/b)^(1/r). We search
the parameter space for the pair that minimizes distance to the target
while keeping total hashes in a practical range.
@spec signature(MapSet.t(), pos_integer()) :: [non_neg_integer()] | nil
Compute a MinHash signature for a set of sub-hashes.
Returns nil for empty sets (these fragments cannot participate in
fuzzy matching).