Yi 2019 - LogQ Correction for In-Batch Sampling

Yi 2019 - Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations

This paper proposes a way to perform logQ correction for the sampling bias introduced by in-batch negative sampling when training two-tower models. The proposed method is a streaming algorithm that estimates item frequencies, updating its estimates after seeing each mini-batch.

Setup

Let $x$ and $y$ denote a user (query) and an item respectively, where there are $N$ users and $M$ items. Let $u(x; \theta)$ and $v(y; \theta)$ denote user and item embedding functions that map each $x$ and $y$ to $\mathbb{R}^k$. These functions are typically:

  • Some sentence transformer model for texts
  • Some hash embedding in the collaborative filtering setting

The output of the model is the inner product of the embeddings, i.e. $s(x, y) = \langle u(x; \theta), v(y; \theta) \rangle$. The goal is to train the model from a training dataset of $T$ user-item interactions, denoted by $\mathcal{T} = \{(x_i, y_i, r_i)\}_{i=1}^{T}$, where $x_i$, $y_i$ are the interacting query and item and $r_i \in \mathbb{R}$ is the associated reward.

  • Typically $r_i = 1$ to denote an interaction
  • We can also use $r_i$ to denote some quality weight, e.g. time spent on product
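
In the collaborative filtering setting, a minimal sketch of such a two-tower model could look like the following (PyTorch assumed; the names and sizes are hypothetical, not from the paper):

```python
import torch
import torch.nn as nn


class TwoTowerModel(nn.Module):
    """Minimal two-tower model: ID/hash embeddings for users and items."""

    def __init__(self, num_users: int, num_items: int, emb_dim: int = 64):
        super().__init__()
        self.user_tower = nn.Embedding(num_users, emb_dim)  # u(x; theta)
        self.item_tower = nn.Embedding(num_items, emb_dim)  # v(y; theta)

    def score(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        # s(x, y) = <u(x), v(y)>, computed pair-wise over a batch
        u = self.user_tower(user_ids)  # (B, emb_dim)
        v = self.item_tower(item_ids)  # (B, emb_dim)
        return (u * v).sum(dim=-1)     # (B,)
```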

Given a query $x$, we typically model the conditional probability of picking item $y$ with the softmax function, where $\theta$ parametrizes the embedding model:

$$\mathcal{P}(y \mid x; \theta) = \frac{e^{s(x, y)}}{\sum_{j \in [M]} e^{s(x, y_j)}}$$

We then design the loss function as a weighted log likelihood of the training interactions:

$$L_T(\theta) = -\frac{1}{T} \sum_{i \in [T]} r_i \cdot \log \mathcal{P}(y_i \mid x_i; \theta)$$
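
For concreteness, a sketch of this full-softmax loss, reusing the hypothetical TwoTowerModel above. Note that it scores each query against all $M$ items, which is exactly what becomes infeasible at large catalogue sizes:

```python
import torch
import torch.nn.functional as F


def full_softmax_loss(model, user_ids, item_ids, rewards):
    """Reward-weighted log likelihood over the FULL item catalogue (O(M) per row)."""
    u = model.user_tower(user_ids)             # (B, d)
    all_v = model.item_tower.weight            # (M, d): every item embedding
    logits = u @ all_v.T                       # (B, M): s(x_i, y_j) for all j
    log_probs = F.log_softmax(logits, dim=-1)  # log P(y_j | x_i)
    nll = -log_probs[torch.arange(len(item_ids)), item_ids]
    return (rewards * nll).mean()
```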

In-Batch Sampling

In practice, the denominator of $\mathcal{P}(y \mid x; \theta)$ above is not feasible to compute when the number of items $M$ is very large. The common practice is to sum over only the subset of items that are drawn in a mini-batch. Hence given a mini-batch of $B$ pairs $\{(x_i, y_i, r_i)\}_{i \in [B]}$ and for any $i \in [B]$, the batch softmax becomes:

$$\mathcal{P}_B(y_i \mid x_i; \theta) = \frac{e^{s(x_i, y_i)}}{\sum_{j \in [B]} e^{s(x_i, y_j)}}$$

Note that each $(x_i, y_i)$ refers to a positive pair. However, the batch softmax above is usually a very biased estimate of the full softmax. This is because our training data usually has a heavy bias toward popular items, hence popular items are far more likely than rare items to appear in the denominator as negatives.
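
In code, in-batch sampling amounts to scoring each query in the batch against every item in the same batch, giving a $B \times B$ logit matrix whose diagonal holds the positive pairs. A sketch (no correction yet), continuing with the hypothetical model above:

```python
import torch
import torch.nn.functional as F


def in_batch_softmax_loss(model, user_ids, item_ids, rewards):
    """Batch softmax: items in the same mini-batch serve as negatives."""
    u = model.user_tower(user_ids)         # (B, d)
    v = model.item_tower(item_ids)         # (B, d)
    logits = u @ v.T                       # (B, B): logits[i, j] = s(x_i, y_j)
    labels = torch.arange(len(user_ids))   # positive pair sits on the diagonal
    nll = F.cross_entropy(logits, labels, reduction="none")
    return (rewards * nll).mean()
```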

In other words, a model trained with this biased likelihood function may achieve a low training loss because it only has to discriminate against the popular items in the denominator during training. But when used for retrieval, the model may assign high scores to rare items that should be negatives, simply because it never had a chance to learn to discriminate against them due to the biased sampling during training.

This issue underlies a common phenomenon when training such retrieval embedding models: reranking performance looks good while retrieval performance is poor. The reason is that reranking is often performed against popular items that the model sees often, whereas retrieval by definition searches across the whole item catalogue. Hence retrieval is (from this perspective) a harder task than reranking. Special attention must be paid during training to ensure that the model learns to discriminate well against all items in the catalogue, and logQ correction is one of the methods at our disposal.

In Adaptive Importance Sampling to Accelerate Training of A Neural Probabilistic Language Model, the authors propose the following way to correct the biased batch softmax by correcting each score logit:

$$s^c(x_i, y_j) = s(x_i, y_j) - \log(p_j)$$

where $p_j$ denotes the probability of sampling item $j$ into a random batch. With this correction, we can denote the batch softmax as:

$$\mathcal{P}_B^c(y_i \mid x_i; \theta) = \frac{e^{s^c(x_i, y_i)}}{e^{s^c(x_i, y_i)} + \sum_{j \in [B], j \neq i} e^{s^c(x_i, y_j)}}$$

And finally we have the batch loss function as:

$$L_B(\theta) = -\frac{1}{B} \sum_{i \in [B]} r_i \cdot \log \mathcal{P}_B^c(y_i \mid x_i; \theta)$$
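
The correction is essentially a one-line change to the uncorrected in-batch loss sketched earlier: subtract $\log p_j$ from column $j$ of the logit matrix before the softmax. Here `sampling_probs` is an assumed tensor holding the estimated $p_j$ for the items in the batch:

```python
import torch
import torch.nn.functional as F


def logq_corrected_in_batch_loss(model, user_ids, item_ids, rewards, sampling_probs):
    """Batch softmax with logQ correction: s^c(x_i, y_j) = s(x_i, y_j) - log(p_j)."""
    u = model.user_tower(user_ids)               # (B, d)
    v = model.item_tower(item_ids)               # (B, d)
    logits = u @ v.T                             # (B, B)
    logits = logits - torch.log(sampling_probs)  # broadcast over columns j
    labels = torch.arange(len(user_ids))         # diagonal entries are positives
    nll = F.cross_entropy(logits, labels, reduction="none")
    return (rewards * nll).mean()
```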

Estimating the Sampling Probability in a Streaming Setting

Notably, the batch loss function does not require holding a fixed set of items in memory to serve as negative candidates, making it suitable for use in a streaming training data setting. Thus, the authors propose a method to estimate the sampling probability in a streaming fashion as well.

The first observation is that, rather than estimating $p$ directly, it is easier to track $\delta$, the number of steps (or batches) between two consecutive hits of item $y$, and then take $p = 1/\delta$. E.g. if we only see an item once every 50 batches, then $\delta = 50$ and $p = 1/50 = 0.02$. The proposed algorithm is as follows (a code sketch follows the list):

  1. Initialize two arrays $A$ and $B$, both of size $H$, with all entries set to 0.
  2. Let $h$ be a hash function that maps an item ID to $[H]$.
  3. At step $t$, sample a batch of items. For each item $y$ in the batch, update $B[h(y)] \leftarrow (1 - \alpha) \cdot B[h(y)] + \alpha \cdot (t - A[h(y)])$ and then set $A[h(y)] \leftarrow t$, where $\alpha$ is a learning rate. $A$ records the last step at which the item was hit, and $B$ maintains a moving average of the number of steps between consecutive hits.
  4. At inference time, the sampling probability for item $y$ will be $\hat{p}_y = 1 / B[h(y)]$.
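
A minimal Python sketch of this streaming estimator, under assumed values for the array size `H` and the learning rate `alpha`, and with Python's built-in `hash` standing in for the hash function $h$:

```python
import numpy as np


class StreamingFrequencyEstimator:
    """Estimates per-item sampling probability as 1 / (average steps between hits)."""

    def __init__(self, H: int = 2**20, alpha: float = 0.01):
        self.H = H
        self.alpha = alpha
        self.A = np.zeros(H)  # A[h(y)]: last step at which item y was seen
        self.B = np.zeros(H)  # B[h(y)]: moving average of the gap between hits

    def _h(self, item_id) -> int:
        # Any hash from item ID to [H]; Python's hash() is just a stand-in.
        return hash(item_id) % self.H

    def update(self, step: int, batch_item_ids) -> None:
        for y in batch_item_ids:
            idx = self._h(y)
            self.B[idx] = (1 - self.alpha) * self.B[idx] + self.alpha * (step - self.A[idx])
            self.A[idx] = step

    def sampling_prob(self, item_id) -> float:
        # p_y = 1 / B[h(y)]; guard against items that were never seen.
        gap = self.B[self._h(item_id)]
        return 1.0 / gap if gap > 0 else 1.0
```

During training, `update` is called once per mini-batch with the item IDs in that batch, and `sampling_prob` supplies the $p_j$ values fed into the logQ-corrected batch loss above.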

Other Notes

The authors note that adding L2-normalization to the embeddings improves model trainability and leads to better retrieval quality. Also, dividing each logit by a temperature $\tau$ (i.e. using $s(x, y) / \tau$) helps to sharpen the predictions. In their experiments, the best $\tau$ is usually around 0.05 (i.e. logits get multiplied by 20x).
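
In code, the two tricks together amount to L2-normalizing both towers' outputs and dividing the resulting cosine similarity by $\tau$; a minimal sketch with $\tau = 0.05$:

```python
import torch
import torch.nn.functional as F


def scaled_cosine_logits(u: torch.Tensor, v: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """L2-normalize both towers' outputs, then sharpen the logits with a temperature."""
    u = F.normalize(u, dim=-1)  # user embeddings, (B, d)
    v = F.normalize(v, dim=-1)  # item embeddings, (B, d)
    return (u @ v.T) / tau      # (B, B) logits, scaled by 1/tau = 20
```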