ExDNA.Detection.LSH (ExDNA v1.5.1)

Copy Markdown View Source

Locality-Sensitive Hashing for fast approximate nearest-neighbor search on sub-hash sets.

Computes MinHash signatures and uses banding to find candidate fragment pairs with high Jaccard similarity without quadratic pairwise comparison.

How it works

  1. MinHash — for each fragment's sub-hash set, compute k minimum hash values under k different hash functions (simulated via seeds). The probability that two signatures agree at a position equals their exact Jaccard similarity.

  2. Banding — split the k-length signature into b bands of r rows each. Two fragments become candidates if they match on all rows in any band. This creates an S-curve acceptance probability:

    P(candidate) = 1 - (1 - J^r)^b

    where J is Jaccard similarity. The inflection point sits at approximately (1/b)^(1/r).

  3. Parameter selectionoptimal_params/1 picks (b, r) so the S-curve inflection is near the desired similarity threshold, keeping the total hash count between 16 and 64.

Summary

Functions

Find candidate pairs using LSH banding.

Compute (num_bands, rows_per_band) that place the LSH S-curve inflection point near threshold.

Compute a MinHash signature for a set of sub-hashes.

Functions

candidate_pairs(indexed_signatures, num_bands, rows_per_band)

@spec candidate_pairs(
  [{non_neg_integer(), [non_neg_integer()]}],
  pos_integer(),
  pos_integer()
) ::
  MapSet.t()

Find candidate pairs using LSH banding.

Accepts a list of {index, signature} tuples and returns a MapSet of {i, j} pairs (where i < j) that collided in at least one band.

optimal_params(threshold)

@spec optimal_params(float()) :: {pos_integer(), pos_integer()}

Compute (num_bands, rows_per_band) that place the LSH S-curve inflection point near threshold.

The inflection point of the banding S-curve is (1/b)^(1/r). We search the parameter space for the pair that minimizes distance to the target while keeping total hashes in a practical range.

signature(sub_hashes, num_hashes)

@spec signature(MapSet.t(), pos_integer()) :: [non_neg_integer()] | nil

Compute a MinHash signature for a set of sub-hashes.

Returns nil for empty sets (these fragments cannot participate in fuzzy matching).