Hash Collisions

Bloom embeddings is one way of handling large number of user / item IDs. Instead of assigning each unique ID to a unique embedding, we may perform hashing to assign each unique ID to one or multiple bins. The embeddings at these positions are then taken and typically summed to get the final representation of each user / item.

One natural question arises: given n number of unique values and m number of bins, what is the expected number of bins to have 2 or more values assigned to it? This collision is especially a problem if we only represent each unique value with one embedding (i.e. num_hashes = 1). In practice, we would usually have at least 2 or more hashes to avoid this problem.

We may approach the problem by observing that each item is hashed uniformly and has a chance of landing in a particular bin. The hashing is also independent, i.e. it is not affected by the hashing for any other item.

This means that the expected number of items assigned to a particular bin may be modeled as a distribution, since we have n trials and a fixed probability of "success" of 1/m.

  • The probability we desire is
  • The PMF is , so:

So we may put the above together to obtain . It then remains to get the expected number of colliding bins by , by using the linearity of expectation.

For example, given , the expected number of colliding bins is 1,210 (out of 1 million bins), which is not insignificant.

Now note that since is large and is small, may be well approximated by . Recall that the Poisson PMF is . So:

Using this approximation, we get the expected number of colliding bins as 1,209, which is a very good approximation. Hence the formula we want is: