Dong 2023 - MINE Loss

Revisiting Recommendation Loss Functions through Contrastive Learning

This paper compares several recommendation loss functions, such as BPR and CCL, and introduces two new losses: InfoNCE+ and MINE+.

Setup

Let $\mathcal{U}$ and $\mathcal{I}$ denote the user and item sets respectively. We denote that:

  • Each user $u$ has interactions with a set of items $\mathcal{I}_u^+ \subseteq \mathcal{I}$, and has no interactions with the remaining set of items $\mathcal{I}_u^- = \mathcal{I} \setminus \mathcal{I}_u^+$.
  • On the item side, $\mathcal{U}_i^+$ denotes all users who interacted with item $i$.
  • We can also denote $r_{ui} = 1$ if there was an interaction and $r_{ui} = 0$ otherwise.

Let the latent embeddings $\mathbf{v}_u$ and $\mathbf{v}_i$ represent user $u$ and item $i$ respectively. The similarity measure between them is then denoted $s(u, i)$ (e.g. the dot product $\mathbf{v}_u^\top \mathbf{v}_i$).

BPR Loss

The most widely used loss is Bayesian Personalized Ranking:
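Writing it out in a standard form (with $\sigma$ the sigmoid and $p_N$ my shorthand for the negative-sampling distribution):

$$\mathcal{L}_{\mathrm{BPR}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\,\mathbb{E}_{j \sim p_N}\Big[\ln \sigma\big(s(u,i) - s(u,j)\big)\Big]$$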

Note that for each user, we take the expectation over the set of items $\mathcal{I}_u^+$ relevant to that user. We then sample negatives $j$ from the overall item distribution (usually uniformly at random).

Softmax Loss

One common approach is to model $P(i \mid u)$ as an extreme classification problem, where $\mathcal{I}$ is a very large set of classes. The probability may then be modeled as a softmax:
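In the notation above:

$$P(i \mid u) = \frac{\exp\big(s(u,i)\big)}{\sum_{j \in \mathcal{I}} \exp\big(s(u,j)\big)}$$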

In practice, it is infeasible to compute the normalizing sum over the large item set, so we sample negative candidates for the denominator. The sampling bias is then corrected via importance weighting.

Using this approach, the loss may be formulated as:
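One common way to write the resulting sampled softmax loss, with $N$ negatives $j_1, \dots, j_N$ drawn from a per-user proposal distribution $p_u$ and the usual log-correction applied to their logits:

$$\mathcal{L}_{\mathrm{SSM}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\left[\ln \frac{\exp\big(s(u,i)\big)}{\exp\big(s(u,i)\big) + \sum_{n=1}^{N}\exp\big(s(u,j_n) - \ln p_u(j_n)\big)}\right]$$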

Note that $p_u$ is the negative sampling distribution for each user $u$, and is typically implemented as a distribution based on item popularity.

Contrastive Learning Loss (InfoNCE)

The InfoNCE loss looks very similar to the sampled softmax loss, although the motivation is different: the key idea of contrastive learning is to pull similar points closer together and push dissimilar points apart. InfoNCE is the most famous contrastive learning loss:
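In the recommendation setting it typically takes the form below, with a temperature $\tau$ and negatives $j_1, \dots, j_N$ sampled from $p_u$:

$$\mathcal{L}_{\mathrm{InfoNCE}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\left[\ln \frac{\exp\big(s(u,i)/\tau\big)}{\exp\big(s(u,i)/\tau\big) + \sum_{n=1}^{N}\exp\big(s(u,j_n)/\tau\big)}\right]$$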

Note that, temperature aside, the only real difference from the sampled softmax loss is that no importance-weighting correction is applied to the sampled negatives. The InfoNCE loss has been shown to maximize the mutual information between a user and the items they interact with, and to minimize the mutual information between unrelated pairs.

Empirical Exploration of InfoNCE+

The authors propose InfoNCE+, which is just InfoNCE with a couple of extra hyperparameters that are then tuned empirically; I will call them $w$ (a weight on the positive term inside the denominator) and $\lambda$ (a weight controlling the positive term against the negative term):
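Written out (this is my reading of where the two weights go):

$$\mathcal{L}_{\mathrm{InfoNCE+}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\left[\lambda\,\frac{s(u,i)}{\tau} - \ln\Big(w\,\exp\big(s(u,i)/\tau\big) + \sum_{n=1}^{N}\exp\big(s(u,j_n)/\tau\big)\Big)\right]$$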

Empirically, the authors find that setting $w = 0$ (together with a suitably tuned $\lambda$) usually works best (tbh, the empirical evidence is not super convincing).

Theoretical Support for Removing the Positive Term from the Denominator

As we can see, setting $w = 0$ effectively removes the positive term from the denominator of the loss. This makes intuitive sense: the positive term in the denominator would otherwise constrain $s(u,i)$ from increasing, and increasing $s(u,i)$ is exactly what we want.

This has theoretical backing as well, as explored in Decoupled Contrastive Learning - Yeh 2022. The DCL paper also shows that removing the positive term from the denominator leads to more stable training and less hyperparameter sensitivity.

The DCL loss is thus:
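i.e. InfoNCE with the positive term removed from the denominator (written here in our recommendation notation):

$$\mathcal{L}_{\mathrm{DCL}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\left[\ln \frac{\exp\big(s(u,i)/\tau\big)}{\sum_{n=1}^{N}\exp\big(s(u,j_n)/\tau\big)}\right]$$

A quick way to see the intuition: per positive pair this is $-s(u,i)/\tau + \ln\sum_{n}\exp\big(s(u,j_n)/\tau\big)$, so the gradient with respect to $s(u,i)$ is a constant $-1/\tau$, whereas with the positive term kept in the denominator the gradient is $-(1-p_i)/\tau$ (with $p_i$ the softmax probability of the positive), which vanishes as $s(u,i)$ grows.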

The authors show that this "decoupling" is also justified from the Mutual Information Neural Estimation (MINE) perspective. Specifically, the MINE paper shows that we can estimate (a lower bound on) the true mutual information between the user and item variables via the following optimization problem:
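In our notation, with $p(u,i)$ the joint distribution of observed (user, item) interactions, $p(u)\,p(i)$ the product of the marginals, and $\theta$ the embedding parameters, the MINE (Donsker-Varadhan) bound reads roughly:

$$I(U; I) \;\geq\; \sup_{\theta}\;\Big\{\mathbb{E}_{(u,i) \sim p(u,i)}\big[s_\theta(u,i)\big] \;-\; \ln\,\mathbb{E}_{u \sim p(u),\, i \sim p(i)}\big[\exp\big(s_\theta(u,i)\big)\big]\Big\}$$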

Intuitively, we want to maximize the above objective over the similarity function $s(\cdot,\cdot)$, parametrized by the embeddings $\mathbf{v}_u$ and $\mathbf{v}_i$.

  • The first term takes an expectation of similarity scores over the joint user, item distribution where an interaction occurs (i.e. positive pairs).
  • The second term takes an expectation of exponentiated similarity scores over the product measure of marginal user and item distributions (i.e. assuming independence between user and item distribution).

MINE Loss

The authors then say that a "simple" adaptation of the MINE problem to the recommendation setting is formalized as the MINE loss:
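If I had to write it down, I would guess something like the negated bound above, with the joint replaced by observed positive pairs and the second expectation taken over users and independently sampled items $j \sim p_{\mathcal{I}}$:

$$\mathcal{L}_{\mathrm{MINE}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\big[s(u,i)\big] \;+\; \ln\,\mathbb{E}_{u}\,\mathbb{E}_{j \sim p_{\mathcal{I}}}\big[\exp\big(s(u,j)\big)\big]$$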

Not too sure how this is derived from the above.

They also add a hyperparameter (call it $\lambda$) to control the relative weightage of the positive and negative samples. This results in what they term MINE+:
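Placing $\lambda$ on the negative term (it could equivalently rescale the positive term instead), this looks roughly like:

$$\mathcal{L}_{\mathrm{MINE+}} = -\,\mathbb{E}_{u}\,\mathbb{E}_{i \sim \mathcal{I}_u^+}\big[s(u,i)\big] \;+\; \lambda\,\ln\,\mathbb{E}_{u}\,\mathbb{E}_{j \sim p_{\mathcal{I}}}\big[\exp\big(s(u,j)\big)\big]$$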

Based on some ablation studies, they find a particular setting of $\lambda$ that usually works best.
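For concreteness, here is a minimal PyTorch-style sketch of a MINE+-style loss under the form above; the dot-product similarity, optional temperature, batch layout, and function name are all my own choices rather than the paper's:

```python
import math
import torch


def mine_plus_loss(user_emb, pos_item_emb, neg_item_emb, lam=1.0, tau=1.0):
    """Sketch of a MINE+-style loss (my reading, not the paper's official code).

    user_emb:     (B, d) user embeddings for the batch
    pos_item_emb: (B, d) embeddings of each user's positive item
    neg_item_emb: (B, N, d) embeddings of N sampled items per user
    lam:          relative weight on the negative (log-mean-exp) term
    tau:          optional temperature on the similarity scores
    """
    # Positive term: average similarity over observed (user, item) pairs.
    pos_scores = (user_emb * pos_item_emb).sum(dim=-1) / tau               # (B,)
    pos_term = pos_scores.mean()

    # Negative term: log of the mean exponentiated similarity over all
    # (user, sampled item) pairs, approximating the product measure.
    neg_scores = torch.einsum("bd,bnd->bn", user_emb, neg_item_emb) / tau  # (B, N)
    log_mean_exp = torch.logsumexp(neg_scores.reshape(-1), dim=0) - math.log(neg_scores.numel())

    return -pos_term + lam * log_mean_exp
```

This would be called on a batch of user and item embeddings from whatever encoder (e.g. matrix factorization or a two-tower model) is being trained.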

The paper also offers some lower-bound analysis and a de-biasing of InfoNCE, which I will not delve into for now.