From RankNET to LambdaRank to LambdaMART

Paper Link.

This is an overview paper that explains the model behind LambdaMART, a technique for learning Gradient Boosted Trees that optimize for NDCG on recommendation-type datasets.

RankNET

RankNET is a generic framework for training a learning-to-rank model. Given a differentiable model $f$ that maps an $n$-dimensional feature vector $x_i$ to a score $s_i = f(x_i)$, RankNET trains $f$ such that, for a given query session with items and corresponding features $x_j$ and $x_k$, $f$ learns to predict the relationship $f(x_j) > f(x_k)$ for any two items $j$ and $k$ in the query session when item $j$ is a better recommendation than item $k$. The differentiable model $f$ is typically a neural network or boosted trees.
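As a minimal sketch (not from the paper), the snippet below stands in for the differentiable model $f$ with a simple linear scorer $f(x) = w^\top x$; any model whose score is differentiable in its parameters would work. The feature dimension, weights, and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 4                      # hypothetical feature dimension
w = rng.normal(size=n_features)     # model parameters (randomly initialised)

def score(x, w):
    """Differentiable scoring model f: here a simple linear scorer f(x) = w.x."""
    return x @ w

# Two items from the same query session, each described by n_features numbers.
x_j = rng.normal(size=n_features)
x_k = rng.normal(size=n_features)

s_j, s_k = score(x_j, w), score(x_k, w)
print(s_j > s_k)  # RankNET trains w so that this ordering matches the revealed preference
```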

RankNET uses revealed pair-wise preferences within each query session to train $f$. Specifically, suppose we have the following data for one query session:

| query | item | clicked |
|-------|------|---------|
| qid1  | a    | 0       |
| qid1  | b    | 1       |
| qid1  | c    | 0       |

We can transform this data into a pairwise dataset as follows, where $S_{jk}$ denotes the preference relationship between items $j$ and $k$ which we inferred from the click data. Note that the pairwise comparisons are only made within the same query session (e.g. qid1), as each comparison reflects a given user's preferences in the context of a query and the items impressed to him/her in that session. A code sketch of this transformation follows the table below.

| query | item $j$ | item $k$ | $S_{jk}$ |
|-------|----------|----------|----------|
| qid1  | a        | b        | -1       |
| qid1  | a        | c        | 0        |
| qid1  | b        | a        | 1        |
| qid1  | b        | c        | 1        |
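Here is a minimal sketch of this transformation, assuming clicks are the source of the revealed preferences; the helper name pairwise_from_clicks and the data layout are my own, not from the paper.

```python
from itertools import permutations

# Click data for one query session: item id -> clicked (1) or not (0).
clicks = {"a": 0, "b": 1, "c": 0}

def pairwise_from_clicks(qid, clicks):
    """Turn per-item click labels into pairwise preference rows (qid, j, k, S_jk)."""
    rows = []
    for j, k in permutations(clicks, 2):   # all ordered pairs of distinct items
        if clicks[j] > clicks[k]:
            s_jk = 1        # j was clicked, k was not -> j preferred
        elif clicks[j] < clicks[k]:
            s_jk = -1       # k preferred
        else:
            s_jk = 0        # no revealed preference
        rows.append((qid, j, k, s_jk))
    return rows

for row in pairwise_from_clicks("qid1", clicks):
    print(row)
# ('qid1', 'a', 'b', -1), ('qid1', 'a', 'c', 0), ('qid1', 'b', 'a', 1), ...
```

In practice one might keep only one ordering per pair, or only pairs with a revealed preference, but the idea is the same.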

The pairwise setting is now more amenable to modelling (compared to directly optimizing for a good ranking), since we can treat the task as a classification problem. For each row of the pairwise dataset, we only need to model the probability that item $j$ is preferred (or not) to item $k$. This can be formalized using a cross entropy loss comparing the predicted preference of our model $f$ to the revealed preference in the dataset.

First, we model the predicted probability from the model $f$. Given a row of the pairwise dataset with items $j$ and $k$ and features $x_j$ and $x_k$ respectively, we model the predicted probability $P_{jk}$ that $j$ is preferred to $k$ (using $j \triangleright k$ to denote a preference relationship) by passing the difference between the predicted scores $s_j = f(x_j)$ and $s_k = f(x_k)$ for items $j$ and $k$ respectively through a sigmoid function, like so:

$$P_{jk} \equiv P(j \triangleright k) = \frac{1}{1 + e^{-\sigma (s_j - s_k)}}$$

where $\sigma$ is a parameter that determines the shape of the sigmoid.
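In code, the predicted probability is just a sigmoid of the scaled score difference (a small illustrative snippet; the function and parameter names are mine):

```python
import numpy as np

def predicted_prob(s_j, s_k, sigma=1.0):
    """P_jk: predicted probability that item j is preferred to item k."""
    return 1.0 / (1.0 + np.exp(-sigma * (s_j - s_k)))

print(predicted_prob(2.0, 1.0))   # > 0.5: the model ranks j above k
print(predicted_prob(1.0, 1.0))   # = 0.5: the model is indifferent
```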

Now let us denote the revealed probability that $j$ is preferred to $k$ as $\bar{P}_{jk}$, such that:

  • $\bar{P}_{jk} = 1$ if we prefer item $j$ to item $k$
  • $\bar{P}_{jk} = \frac{1}{2}$ if we have no preference between the two items
  • $\bar{P}_{jk} = 0$ if we prefer item $k$ to item $j$

The cross entropy loss of our model can then be expressed as:

$$C = -\bar{P}_{jk} \log P_{jk} - \left(1 - \bar{P}_{jk}\right) \log \left(1 - P_{jk}\right)$$

For convenience, let us denote $\bar{P}_{jk} = \frac{1}{2}\left(1 + S_{jk}\right)$ (and conversely, $S_{jk} = 2\bar{P}_{jk} - 1$), which translates into the following:

  • $S_{jk} = 1$ if we prefer item $j$ to item $k$
  • $S_{jk} = 0$ if we have no preference between the two items
  • $S_{jk} = -1$ if we prefer item $k$ to item $j$

Let us also define the convenience variable $s_{jk} \equiv s_j - s_k$. The cross entropy loss then simplifies to:

$$
\begin{aligned}
C &= -\tfrac{1}{2}\left(1 + S_{jk}\right) \log P_{jk} - \tfrac{1}{2}\left(1 - S_{jk}\right) \log\left(1 - P_{jk}\right) \\
  &= \tfrac{1}{2}\left(1 + S_{jk}\right) \log\left(1 + e^{-\sigma s_{jk}}\right) + \tfrac{1}{2}\left(1 - S_{jk}\right) \sigma s_{jk} + \tfrac{1}{2}\left(1 - S_{jk}\right) \log\left(1 + e^{-\sigma s_{jk}}\right) \\
  &= \tfrac{1}{2}\left(1 - S_{jk}\right) \sigma s_{jk} + \log\left(1 + e^{-\sigma s_{jk}}\right)
\end{aligned}
$$

Note that in line 2 of the above, we use the useful identities $\log P_{jk} = -\log\left(1 + e^{-\sigma s_{jk}}\right)$ and $\log\left(1 - P_{jk}\right) = -\sigma s_{jk} - \log\left(1 + e^{-\sigma s_{jk}}\right)$. In line 3, the first and last term of line 2 combine (their $S_{jk}$ parts cancel out) to simply return $\log\left(1 + e^{-\sigma s_{jk}}\right)$.
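As a quick numerical sanity check (my own sketch, not from the paper), the original cross entropy and the simplified form agree for all three values of $S_{jk}$:

```python
import numpy as np

def cross_entropy(s_j, s_k, S_jk, sigma=1.0):
    """Original form: C = -Pbar * log(P) - (1 - Pbar) * log(1 - P)."""
    P = 1.0 / (1.0 + np.exp(-sigma * (s_j - s_k)))
    P_bar = 0.5 * (1.0 + S_jk)
    return -P_bar * np.log(P) - (1.0 - P_bar) * np.log(1.0 - P)

def simplified(s_j, s_k, S_jk, sigma=1.0):
    """Simplified form: C = 1/2 (1 - S_jk) sigma*s_jk + log(1 + exp(-sigma*s_jk))."""
    d = sigma * (s_j - s_k)
    return 0.5 * (1.0 - S_jk) * d + np.log1p(np.exp(-d))

for S_jk in (-1, 0, 1):
    print(np.isclose(cross_entropy(1.7, 0.3, S_jk), simplified(1.7, 0.3, S_jk)))  # True
```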

Having written out the loss function, we now need to differentiate the loss with respect to the model scores $s_j$ and $s_k$ and the parameters $w$ to obtain the gradient descent formula used to train the RankNET model. Differentiating $C$ wrt $s_j$ and $s_k$ gives:

$$
\begin{aligned}
\frac{\partial C}{\partial s_j} &= \frac{1}{2}\left(1 - S_{jk}\right) \sigma - \frac{\sigma e^{-\sigma s_{jk}}}{1 + e^{-\sigma s_{jk}}} \\
&= \sigma \left( \frac{1}{2}\left(1 - S_{jk}\right) - \frac{1}{1 + e^{\sigma s_{jk}}} \right) \\
\frac{\partial C}{\partial s_k} &= -\frac{\partial C}{\partial s_j}
\end{aligned}
$$

Note that the first line of the above uses the result $\frac{\partial}{\partial s_j} \log\left(1 + e^{-\sigma s_{jk}}\right) = \frac{-\sigma e^{-\sigma s_{jk}}}{1 + e^{-\sigma s_{jk}}}$. We obtain line 2 by multiplying the right term by $e^{\sigma s_{jk}}$ in both the numerator and denominator. We obtain line 3 by observing that $C$ is a function of the score difference $s_{jk} = s_j - s_k$, such that $\frac{\partial s_{jk}}{\partial s_j} = 1$ and likewise $\frac{\partial s_{jk}}{\partial s_k} = -1$. The symmetry of the derivative wrt $s_j$ and $s_k$ will be important for the next section on factorizing RankNET to speed it up.
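The derivative formula and its symmetry can be checked numerically with finite differences (an illustrative sketch with made-up values):

```python
import numpy as np

def loss(s_j, s_k, S_jk, sigma=1.0):
    """Simplified pairwise cross entropy loss C."""
    d = sigma * (s_j - s_k)
    return 0.5 * (1.0 - S_jk) * d + np.log1p(np.exp(-d))

def dC_ds_j(s_j, s_k, S_jk, sigma=1.0):
    """Analytic derivative of C with respect to s_j (equals -dC/ds_k)."""
    d = sigma * (s_j - s_k)
    return sigma * (0.5 * (1.0 - S_jk) - 1.0 / (1.0 + np.exp(d)))

s_j, s_k, S_jk, eps = 1.3, 0.4, 1, 1e-6
numeric_j = (loss(s_j + eps, s_k, S_jk) - loss(s_j - eps, s_k, S_jk)) / (2 * eps)
print(np.isclose(dC_ds_j(s_j, s_k, S_jk), numeric_j))      # analytic matches finite difference
numeric_k = (loss(s_j, s_k + eps, S_jk) - loss(s_j, s_k - eps, S_jk)) / (2 * eps)
print(np.isclose(numeric_k, -dC_ds_j(s_j, s_k, S_jk)))      # symmetry: dC/ds_k = -dC/ds_j
```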

Finally, we use the gradient to update the individual parameters $w$ of the model $f$ via gradient descent with learning rate $\eta$. In the below, $m$ denotes the number of data points in the pairwise dataset:

$$w \leftarrow w - \eta \sum_{i=1}^{m} \frac{\partial C_i}{\partial w} = w - \eta \sum_{i=1}^{m} \left( \frac{\partial C_i}{\partial s_j} \frac{\partial s_j}{\partial w} + \frac{\partial C_i}{\partial s_k} \frac{\partial s_k}{\partial w} \right)$$

where $C_i$ is the loss for the $i$-th pair $(j, k)$. This update procedure rounds out the discussion on RankNET and is sufficient for training a generic differentiable model from ranking data.
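Putting the pieces together, here is a hypothetical end-to-end training loop for a RankNET-style linear scorer, where $\partial s / \partial w = x$; the toy data, hyperparameters, and variable names are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, sigma, eta = 4, 1.0, 0.1

# Toy pairwise dataset: (x_j, x_k, S_jk) tuples for a single query session.
pairs = [(rng.normal(size=n_features), rng.normal(size=n_features), S)
         for S in (-1, 0, 1, 1)]

w = np.zeros(n_features)                       # linear model f(x) = w.x

for epoch in range(100):
    grad = np.zeros_like(w)
    for x_j, x_k, S_jk in pairs:
        s_j, s_k = x_j @ w, x_k @ w
        # dC/ds_j from the derivation above; dC/ds_k is its negative.
        dC_dsj = sigma * (0.5 * (1.0 - S_jk) - 1.0 / (1.0 + np.exp(sigma * (s_j - s_k))))
        # Chain rule: for a linear model, ds_j/dw = x_j and ds_k/dw = x_k.
        grad += dC_dsj * x_j + (-dC_dsj) * x_k
    w -= eta * grad                            # gradient descent step over all m pairs

print(w)
```

Swapping the linear scorer for a neural network only changes how $\partial s / \partial w$ is computed (e.g. via backpropagation); the pairwise loss and its gradient stay the same.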

Factorizing RankNET