From RankNET to LambdaRank to LambdaMART

Paper Link.

This is an overview paper that explains the model behind LambdaMART, a technique for learning Gradient Boosted Trees that optimize for NDCG on recommendation-type datasets.

RankNET

RankNET is a generic framework for training a learning-to-rank model. Given a differentiable model $f$ that maps an $n$-dimensional feature vector $x_i$ to a score $s_i = f(x_i)$, RankNET trains $f$ such that, for a given query session with items and corresponding features $x_j$ and $x_k$, $f$ learns to predict the relationship $f(x_j) > f(x_k)$ for any two items $j$ and $k$ in the query session when item $j$ is a better recommendation than item $k$. The differentiable model $f$ is typically a neural network or boosted trees.
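As a minimal sketch (not from the paper), the snippet below stands in for the differentiable model $f$ with a simple linear scorer $f(x) = w^\top x$; any model whose score is differentiable in its parameters would work. The feature dimension, weights, and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 4                      # hypothetical feature dimension
w = rng.normal(size=n_features)     # model parameters (randomly initialised)

def score(x, w):
    """Differentiable scoring model f: here a simple linear scorer f(x) = w.x."""
    return x @ w

# Two items from the same query session, each described by n_features numbers.
x_j = rng.normal(size=n_features)
x_k = rng.normal(size=n_features)

s_j, s_k = score(x_j, w), score(x_k, w)
print(s_j > s_k)  # RankNET trains w so that this ordering matches the revealed preference
```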

RankNET uses revealed pair-wise preferences within each query session to train $f$. Specifically, suppose we have the following data for one query session:

| query | item | clicked |
|-------|------|---------|
| qid1  | a    | 0       |
| qid1  | b    | 1       |
| qid1  | c    | 0       |

We can transform this data into a pairwise dataset as follows, where $S_{jk}$ denotes the preference relationship between items $j$ and $k$ which we inferred from the click data. Note that the pairwise comparisons are only made within the same query session (e.g. qid1), as each comparison reflects a given user's preferences in the context of a query and the items impressed to him/her in that session. A code sketch of this transformation follows the table below.

| query | item $j$ | item $k$ | $S_{jk}$ |
|-------|----------|----------|----------|
| qid1  | a        | b        | -1       |
| qid1  | a        | c        | 0        |
| qid1  | b        | a        | 1        |
| qid1  | b        | c        | 1        |
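Here is a minimal sketch of this transformation, assuming clicks are the source of the revealed preferences; the helper name pairwise_from_clicks and the data layout are my own, not from the paper.

```python
from itertools import permutations

# Click data for one query session: item id -> clicked (1) or not (0).
clicks = {"a": 0, "b": 1, "c": 0}

def pairwise_from_clicks(qid, clicks):
    """Turn per-item click labels into pairwise preference rows (qid, j, k, S_jk)."""
    rows = []
    for j, k in permutations(clicks, 2):   # all ordered pairs of distinct items
        if clicks[j] > clicks[k]:
            s_jk = 1        # j was clicked, k was not -> j preferred
        elif clicks[j] < clicks[k]:
            s_jk = -1       # k preferred
        else:
            s_jk = 0        # no revealed preference
        rows.append((qid, j, k, s_jk))
    return rows

for row in pairwise_from_clicks("qid1", clicks):
    print(row)
# ('qid1', 'a', 'b', -1), ('qid1', 'a', 'c', 0), ('qid1', 'b', 'a', 1), ...
```

In practice one might keep only one ordering per pair, or only pairs with a revealed preference, but the idea is the same.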

The pairwise setting is now more amenable to modelling (compared to directly optimizing for a good ranking), since we can treat the task as a classification problem. For each row of the pairwise dataset, we only need to model the probability that item $j$ is preferred (or not) to item $k$. This can be formalized using a cross entropy loss comparing the predicted preference of our model $f$ to the revealed preference in the dataset.

First, we model the predicted probability from the model $f$. Given a row of the pairwise dataset with items $j$ and $k$ and features $x_j$ and $x_k$ respectively, we model the predicted probability $P_{jk}$ that $j$ is preferred to $k$ (using $j \triangleright k$ to denote a preference relationship) by passing the difference between the predicted scores $s_j = f(x_j)$ and $s_k = f(x_k)$ for items $j$ and $k$ respectively through a sigmoid function, like so:

$$P_{jk} \equiv P(j \triangleright k) = \frac{1}{1 + e^{-\sigma (s_j - s_k)}}$$

where $\sigma$ is a parameter that determines the shape of the sigmoid.
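In code, the predicted probability is just a sigmoid of the scaled score difference (a small illustrative snippet; the function and parameter names are mine):

```python
import numpy as np

def predicted_prob(s_j, s_k, sigma=1.0):
    """P_jk: predicted probability that item j is preferred to item k."""
    return 1.0 / (1.0 + np.exp(-sigma * (s_j - s_k)))

print(predicted_prob(2.0, 1.0))   # > 0.5: the model ranks j above k
print(predicted_prob(1.0, 1.0))   # = 0.5: the model is indifferent
```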

Now let us denote the revealed probability that $j$ is preferred to $k$ as $\bar{P}_{jk}$, such that:

  • $\bar{P}_{jk} = 1$ if we prefer item $j$ to item $k$
  • $\bar{P}_{jk} = \frac{1}{2}$ if we have no preference between the two items
  • $\bar{P}_{jk} = 0$ if we prefer item $k$ to item $j$

The cross entropy loss of our model can then be expressed as:

$$C = -\bar{P}_{jk} \log P_{jk} - \left(1 - \bar{P}_{jk}\right) \log \left(1 - P_{jk}\right)$$

For convenience, let us denote $\bar{P}_{jk} = \frac{1}{2}\left(1 + S_{jk}\right)$ (and conversely, $S_{jk} = 2\bar{P}_{jk} - 1$), which translates into the following:

  • $S_{jk} = 1$ if we prefer item $j$ to item $k$
  • $S_{jk} = 0$ if we have no preference between the two items
  • $S_{jk} = -1$ if we prefer item $k$ to item $j$

Let us also define the convenience variable $s_{jk} \equiv s_j - s_k$. The cross entropy loss then simplifies to:

$$
\begin{aligned}
C &= -\tfrac{1}{2}\left(1 + S_{jk}\right) \log P_{jk} - \tfrac{1}{2}\left(1 - S_{jk}\right) \log\left(1 - P_{jk}\right) \\
  &= \tfrac{1}{2}\left(1 + S_{jk}\right) \log\left(1 + e^{-\sigma s_{jk}}\right) + \tfrac{1}{2}\left(1 - S_{jk}\right) \sigma s_{jk} + \tfrac{1}{2}\left(1 - S_{jk}\right) \log\left(1 + e^{-\sigma s_{jk}}\right) \\
  &= \tfrac{1}{2}\left(1 - S_{jk}\right) \sigma s_{jk} + \log\left(1 + e^{-\sigma s_{jk}}\right)
\end{aligned}
$$

Note that in line 2 of the above, we use the useful identities $\log P_{jk} = -\log\left(1 + e^{-\sigma s_{jk}}\right)$ and $\log\left(1 - P_{jk}\right) = -\sigma s_{jk} - \log\left(1 + e^{-\sigma s_{jk}}\right)$. In line 3, the first and last term of line 2 combine (their $S_{jk}$ parts cancel out) to simply return $\log\left(1 + e^{-\sigma s_{jk}}\right)$.
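As a quick numerical sanity check (my own sketch, not from the paper), the original cross entropy and the simplified form agree for all three values of $S_{jk}$:

```python
import numpy as np

def cross_entropy(s_j, s_k, S_jk, sigma=1.0):
    """Original form: C = -Pbar * log(P) - (1 - Pbar) * log(1 - P)."""
    P = 1.0 / (1.0 + np.exp(-sigma * (s_j - s_k)))
    P_bar = 0.5 * (1.0 + S_jk)
    return -P_bar * np.log(P) - (1.0 - P_bar) * np.log(1.0 - P)

def simplified(s_j, s_k, S_jk, sigma=1.0):
    """Simplified form: C = 1/2 (1 - S_jk) sigma*s_jk + log(1 + exp(-sigma*s_jk))."""
    d = sigma * (s_j - s_k)
    return 0.5 * (1.0 - S_jk) * d + np.log1p(np.exp(-d))

for S_jk in (-1, 0, 1):
    print(np.isclose(cross_entropy(1.7, 0.3, S_jk), simplified(1.7, 0.3, S_jk)))  # True
```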

Having written out the loss function, we now need to differentiate the loss with respect to the model scores $s_j$ and $s_k$ and the parameters $w$ to obtain the gradient descent formula used to train the RankNET model. Differentiating $C$ wrt $s_j$ and $s_k$ gives:

$$
\begin{aligned}
\frac{\partial C}{\partial s_j} &= \frac{1}{2}\left(1 - S_{jk}\right) \sigma - \frac{\sigma e^{-\sigma s_{jk}}}{1 + e^{-\sigma s_{jk}}} \\
&= \sigma \left( \frac{1}{2}\left(1 - S_{jk}\right) - \frac{1}{1 + e^{\sigma s_{jk}}} \right) \\
\frac{\partial C}{\partial s_k} &= -\frac{\partial C}{\partial s_j}
\end{aligned}
$$

Note that the first line of the above uses the result $\frac{\partial}{\partial s_j} \log\left(1 + e^{-\sigma s_{jk}}\right) = \frac{-\sigma e^{-\sigma s_{jk}}}{1 + e^{-\sigma s_{jk}}}$. We obtain line 2 by multiplying the right term by $e^{\sigma s_{jk}}$ in both the numerator and denominator. We obtain line 3 by observing that $C$ is a function of the score difference $s_{jk} = s_j - s_k$, such that $\frac{\partial s_{jk}}{\partial s_j} = 1$ and likewise $\frac{\partial s_{jk}}{\partial s_k} = -1$. The symmetry of the derivative wrt $s_j$ and $s_k$ will be important for the next section on factorizing RankNET to speed it up.
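The derivative formula and its symmetry can be checked numerically with finite differences (an illustrative sketch with made-up values):

```python
import numpy as np

def loss(s_j, s_k, S_jk, sigma=1.0):
    """Simplified pairwise cross entropy loss C."""
    d = sigma * (s_j - s_k)
    return 0.5 * (1.0 - S_jk) * d + np.log1p(np.exp(-d))

def dC_ds_j(s_j, s_k, S_jk, sigma=1.0):
    """Analytic derivative of C with respect to s_j (equals -dC/ds_k)."""
    d = sigma * (s_j - s_k)
    return sigma * (0.5 * (1.0 - S_jk) - 1.0 / (1.0 + np.exp(d)))

s_j, s_k, S_jk, eps = 1.3, 0.4, 1, 1e-6
numeric_j = (loss(s_j + eps, s_k, S_jk) - loss(s_j - eps, s_k, S_jk)) / (2 * eps)
print(np.isclose(dC_ds_j(s_j, s_k, S_jk), numeric_j))      # analytic matches finite difference
numeric_k = (loss(s_j, s_k + eps, S_jk) - loss(s_j, s_k - eps, S_jk)) / (2 * eps)
print(np.isclose(numeric_k, -dC_ds_j(s_j, s_k, S_jk)))      # symmetry: dC/ds_k = -dC/ds_j
```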

Finally, we use the gradient to update the individual parameters $w$ of the model $f$ via gradient descent with learning rate $\eta$. In the below, $m$ denotes the number of data points in the pairwise dataset:

$$w \leftarrow w - \eta \sum_{i=1}^{m} \frac{\partial C_i}{\partial w} = w - \eta \sum_{i=1}^{m} \left( \frac{\partial C_i}{\partial s_j} \frac{\partial s_j}{\partial w} + \frac{\partial C_i}{\partial s_k} \frac{\partial s_k}{\partial w} \right)$$

where $C_i$ is the loss for the $i$-th pair $(j, k)$. This update procedure rounds out the discussion on RankNET and is sufficient for training a generic differentiable model from ranking data.
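Putting the pieces together, here is a hypothetical end-to-end training loop for a RankNET-style linear scorer, where $\partial s / \partial w = x$; the toy data, hyperparameters, and variable names are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, sigma, eta = 4, 1.0, 0.1

# Toy pairwise dataset: (x_j, x_k, S_jk) tuples for a single query session.
pairs = [(rng.normal(size=n_features), rng.normal(size=n_features), S)
         for S in (-1, 0, 1, 1)]

w = np.zeros(n_features)                       # linear model f(x) = w.x

for epoch in range(100):
    grad = np.zeros_like(w)
    for x_j, x_k, S_jk in pairs:
        s_j, s_k = x_j @ w, x_k @ w
        # dC/ds_j from the derivation above; dC/ds_k is its negative.
        dC_dsj = sigma * (0.5 * (1.0 - S_jk) - 1.0 / (1.0 + np.exp(sigma * (s_j - s_k))))
        # Chain rule: for a linear model, ds_j/dw = x_j and ds_k/dw = x_k.
        grad += dC_dsj * x_j + (-dC_dsj) * x_k
    w -= eta * grad                            # gradient descent step over all m pairs

print(w)
```

Swapping the linear scorer for a neural network only changes how $\partial s / \partial w$ is computed (e.g. via backpropagation); the pairwise loss and its gradient stay the same.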

Factorizing RankNET