Searching for Sets: Jaccard Index and MinHash

2022-01-02 3 minute read

Here’s a well-written intro to audio fingerprinting. One part of the article contains a clever trick that seems generally useful and interesting to think about. I will attempt to quickly describe the problem below and explain the solution.

The Problem: Fingerprints of Sparse Sets

To summarize the article in an extremely lossy fashion, the problem is that we want to take an audio clip, and find similar clips in a database. To do so, we can convert each clip into low dimensional “fingerprints”, and utilize nearest neighbor algorithms to find clips with the most similar fingerprints. Given any clip, one can cut it into small segments, run FFT to get spectrograms, then binarize the resulting images to get bit vectors. But even with, say, 128 x 32 = 4096 bits per fingerprint, it’s still too many to run nearest neighbors. How can we further reduce dimensions to facilitate faster matching?

It is important to note that in this context, the 0s and the 1s in the bit vectors aren’t symmetric. Instead, it is far more useful to think of the bit vectors as sets of 1s. Imagine if clip A has a C note and clip B has a D note. You would think they’re completely different because they don’t share any common notes; it would be silly to say they’re mostly the same because they have a lot of common missing notes.

If you think about it, finding similar sets is a general problem: you have a universe of unique items (pixels in the spectrogram), you have many sparse sets of them (spectrograms), and you wish to reduce each set’s dimensionality while preserving pairwise similarity. It’s just like comparing people’s book lists, movie lists, or interests.

The Trick: MinHash

First, we need to define a metric of similarity between two sets. It seems like a natural definition would be: the size of the intersection divided by the size of the union. If both sets are equal, you get 1; if both sets share no common elements, you get 0. It turns out this is called the Jaccard Index.

Here’s the interesting part: say you have two n-bit vectors. If you randomly permute both vectors in the same way, and then find the index of the first 1 bit (this is called the MinHash), then the probability that the two indices are equal is exactly equal to their Jaccard Index. Let’s go through a quick example before explaining why this works.

A = 0101 0001
B = 0011 0001

Random permutation 1:
abcdefgh becomes -> daefbcgh (1st position -> 2nd position, 2nd -> 5th, etc)
A becomes -> 1000 1001, first 1: 1
B becomes -> 1000 0101, first 1: 1, equal to A's

Random permutation 2:
abcdefgh becomes -> bcadfehg
A becomes -> 1001 0010, first 1: 1
B becomes -> 0101 0010, first 1: 2, different from A's

In the above example, the Jaccard Index is 2/4 (intersection / union), which means we should expect the test to return “equal” 50% of the time. So, why is this true?

First, note that matching zeros in both vectors will not affect the result of the test. This is because no matter where they end up after permutation, they will not affect the test result. Therefore, we can safely drop them from the inputs without changing the result.

A drops matching 0s -> 1011
B drops matching 0s -> 0111

After dropping matching zeros, we are left with matching 1s or differing bits. It is now easy to see that the first 1’s index will be equal iff an index with matching 1s is selected as the first digit in the random permutation, hence the probability will be equal to the Jaccard Index.

In a universe of n things, any n-permutation will yield a MinHash function. Now, we just have to precompute a bunch of these functions, and apply them to each bit vector to get the fingerprints. Given a pair of fingerprints, we just have to count the number of matching MinHash outputs to get an estimate of the similarity. One cool observation is that more hashes only gives you more accurate similarity estimates, which means even when the universe becomes larger, or more sets are added, you still don’t really have to linearly scale up the total fingerprint size.

Given the fingerprints, we still have to figure out how to search efficiently, but that’s another complicated subject for another day.