Dong 2023 - MINE Loss

Revisiting Recommendation Loss Functions through Contrastive Learning

This paper compares several recommendation loss functions, such as BPR and CCL, and introduces two new losses: InfoNCE+ and MINE+.

Setup

Let $\mathcal{U}$ and $\mathcal{I}$ denote the user and item sets respectively. We denote that:

  • Each user $u$ has interactions with a set of items $\mathcal{I}_u^+ \subseteq \mathcal{I}$, and has no interactions with the remaining set of items $\mathcal{I}_u^- = \mathcal{I} \setminus \mathcal{I}_u^+$.
  • On the item side, $\mathcal{U}_i^+$ denotes all users who interacted with item $i$.
  • We can also denote $r_{ui} = 1$ if there was an interaction and $r_{ui} = 0$ otherwise.

Let the latent embeddings $\mathbf{v}_u$ and $\mathbf{v}_i$ represent user $u$ and item $i$ respectively. The similarity measure between them is then denoted $s(u, i)$ (e.g. the dot product $\mathbf{v}_u^\top \mathbf{v}_i$).

BPR Loss

The most widely used loss is Bayesian Personalized Ranking:
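Writing it out in a standard form (with $\sigma$ the sigmoid and $p_N$ my shorthand for the negative-sampling distribution):

$$\mathcal{L}_{\mathrm{BPR}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\,\mathbb{E}_{j \sim p_N}\Big[\ln \sigma\big(s(u,i) - s(u,j)\big)\Big]$$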

Note that for each user, we take the expectation over the set of items $\mathcal{I}_u^+$ relevant to that user. We then sample negatives $j$ from the overall item distribution (usually uniformly at random).

Softmax Loss

One common approach is to model $P(i \mid u)$ as an extreme classification problem, where $\mathcal{I}$ is a very large set of classes. The probability may then be modeled as a softmax:
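In the notation above:

$$P(i \mid u) = \frac{\exp\big(s(u,i)\big)}{\sum_{j \in \mathcal{I}} \exp\big(s(u,j)\big)}$$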

In practice, it is infeasible to compute the normalizing sum over the large item set, so we sample negative candidates for the denominator. The sampling bias is then corrected via importance weighting.

Using this approach, the loss may be formulated as:
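One common way to write the resulting sampled softmax loss, with $N$ negatives $j_1, \dots, j_N$ drawn from a per-user proposal distribution $p_u$ and the usual log-correction applied to their logits:

$$\mathcal{L}_{\mathrm{SSM}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\left[\ln \frac{\exp\big(s(u,i)\big)}{\exp\big(s(u,i)\big) + \sum_{n=1}^{N}\exp\big(s(u,j_n) - \ln p_u(j_n)\big)}\right]$$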

Note that $p_u$ is the negative sampling distribution for each user $u$, and is typically implemented as a distribution based on item popularity.

Contrastive Learning Loss (InfoNCE)

The InfoNCE loss looks very similar to the sampled softmax loss, although the motivation is different: the key idea of contrastive learning is to pull similar points closer together and push dissimilar points apart. InfoNCE is the most famous contrastive learning loss:
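In the recommendation setting it typically takes the form below, with a temperature $\tau$ and negatives $j_1, \dots, j_N$ sampled from $p_u$:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\left[\ln \frac{\exp\big(s(u,i)/\tau\big)}{\exp\big(s(u,i)/\tau\big) + \sum_{n=1}^{N}\exp\big(s(u,j_n)/\tau\big)}\right]$$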

Note that, temperature aside, the only real difference from the sampled softmax loss is that no importance-weighting correction is applied to the sampled negatives. The InfoNCE loss has been shown to maximize the mutual information between a user and the items they interact with, and to minimize the mutual information between unrelated pairs.

Empirical Exploration of InfoNCE+

The authors propose InfoNCE+, which is just InfoNCE with a couple of extra hyperparameters that are then tuned empirically; I will call them $w$ (a weight on the positive term inside the denominator) and $\lambda$ (a weight controlling the positive term against the negative term):
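Written out (this is my reading of where the two weights go):

$$\mathcal{L}_{\mathrm{InfoNCE+}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\left[\lambda\,\frac{s(u,i)}{\tau} - \ln\Big(w\,\exp\big(s(u,i)/\tau\big) + \sum_{n=1}^{N}\exp\big(s(u,j_n)/\tau\big)\Big)\right]$$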

Empirically, the authors find that setting $w = 0$ (together with a suitably tuned $\lambda$) usually works best (tbh, the empirical evidence is not super convincing).

Theoretical Support for Removing the Positive Term from the Denominator

As we can see, setting $w = 0$ effectively removes the positive term from the denominator of the loss. This makes intuitive sense: the positive term in the denominator would otherwise constrain $s(u,i)$ from increasing, and increasing $s(u,i)$ is exactly what we want.

This has theoretical backing as well, as explored in Decoupled Contrastive Learning - Yeh 2022. The DCL paper also shows that removing the positive term from the denominator leads to more stable training and less hyperparameter sensitivity.

The DCL loss is thus:
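i.e. InfoNCE with the positive term removed from the denominator (written here in our recommendation notation):

$$\mathcal{L}_{\mathrm{DCL}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\left[\ln \frac{\exp\big(s(u,i)/\tau\big)}{\sum_{n=1}^{N}\exp\big(s(u,j_n)/\tau\big)}\right]$$

A quick way to see the intuition: per positive pair this is $-s(u,i)/\tau + \ln\sum_{n}\exp\big(s(u,j_n)/\tau\big)$, so the gradient with respect to $s(u,i)$ is a constant $-1/\tau$, whereas with the positive term kept in the denominator the gradient is $-(1-p_i)/\tau$ (with $p_i$ the softmax probability of the positive), which vanishes as $s(u,i)$ grows.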

The authors show that this "decoupling" is also justified from the Mutual Information Neural Estimation (MINE) perspective. Specifically, the MINE paper shows that we can estimate (a lower bound on) the true mutual information between the user and item variables via the following optimization problem:
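In our notation, with $p(u,i)$ the joint distribution of observed (user, item) interactions, $p(u)\,p(i)$ the product of the marginals, and $\theta$ the embedding parameters, the MINE (Donsker-Varadhan) bound reads roughly:

$$I(U; I) \;\geq\; \sup_{\theta}\;\Big\{\mathbb{E}_{(u,i) \sim p(u,i)}\big[s_\theta(u,i)\big] \;-\; \ln\,\mathbb{E}_{u \sim p(u),\, i \sim p(i)}\big[\exp\big(s_\theta(u,i)\big)\big]\Big\}$$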

Intuitively, we want to maximize the above objective over the similarity function $s(\cdot,\cdot)$, parametrized by the embeddings $\mathbf{v}_u$ and $\mathbf{v}_i$.

  • The first term takes an expectation of similarity scores over the joint user, item distribution where an interaction occurs (i.e. positive pairs).
  • The second term takes an expectation of exponentiated similarity scores over the product measure of marginal user and item distributions (i.e. assuming independence between user and item distribution).

MINE Loss

The authors then say that a "simple" adaptation of the MINE problem to the recommendation setting is formalized as the MINE loss:
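If I had to write it down, I would guess something like the negated bound above, with the joint replaced by observed positive pairs and the second expectation taken over users and independently sampled items $j \sim p_{\mathcal{I}}$:

$$\mathcal{L}_{\mathrm{MINE}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\big[s(u,i)\big] \;+\; \ln\,\mathbb{E}_{u}\,\mathbb{E}_{j \sim p_{\mathcal{I}}}\big[\exp\big(s(u,j)\big)\big]$$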

Not too sure how this is derived from the above.

They also add a hyperparameter (call it $\lambda$) to control the relative weightage of the positive and negative samples. This results in what they term MINE+:
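Placing $\lambda$ on the negative term (it could equivalently rescale the positive term instead), this looks roughly like:

$$\mathcal{L}_{\mathrm{MINE+}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\big[s(u,i)\big] \;+\; \lambda\,\ln\,\mathbb{E}_{u}\,\mathbb{E}_{j \sim p_{\mathcal{I}}}\big[\exp\big(s(u,j)\big)\big]$$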

Based on some ablation studies, they find a particular setting of $\lambda$ that usually works best.
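For concreteness, here is a minimal PyTorch-style sketch of a MINE+-style loss under the form above; the dot-product similarity, optional temperature, batch layout, and function name are all my own choices rather than the paper's:

```python
import math
import torch


def mine_plus_loss(user_emb, pos_item_emb, neg_item_emb, lam=1.0, tau=1.0):
    """Sketch of a MINE+-style loss (my reading, not the paper's official code).

    user_emb:     (B, d) user embeddings for the batch
    pos_item_emb: (B, d) embeddings of each user's positive item
    neg_item_emb: (B, N, d) embeddings of N sampled items per user
    lam:          relative weight on the negative (log-mean-exp) term
    tau:          optional temperature on the similarity scores
    """
    # Positive term: average similarity over observed (user, item) pairs.
    pos_scores = (user_emb * pos_item_emb).sum(dim=-1) / tau               # (B,)
    pos_term = pos_scores.mean()

    # Negative term: log of the mean exponentiated similarity over all
    # (user, sampled item) pairs, approximating the product measure.
    neg_scores = torch.einsum("bd,bnd->bn", user_emb, neg_item_emb) / tau  # (B, N)
    log_mean_exp = torch.logsumexp(neg_scores.reshape(-1), dim=0) - math.log(neg_scores.numel())

    return -pos_term + lam * log_mean_exp
```

This would be called on a batch of user and item embeddings from whatever encoder (e.g. matrix factorization or a two-tower model) is being trained.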

The paper also offers some lower-bound analysis and a de-biasing of InfoNCE, which I will not delve into for now.