Kang 2018 - SASRec
Self-Attentive Sequential Recommendation
This paper uses a transformer model with causal self-attention to perform recommendation, by representing each user as a sequence of item embeddings and predicting the item interacted with at time $t+1$ using all the information up to time $t$.
Background
This paper came out shortly after the transformer (Attention is all you need) was invented. Up to this point, sequential recommendation was performed using Markov Chain methods or RNN-based methods. Since the self-attention mechanism of transformers is well suited to sequential modelling, this paper makes the natural adaptation of self-attention to the recommendation task.
Setup
In the setting of sequential recommendation, we have for each user $u$ a sequence of item interactions $\mathcal{S}^u = (\mathcal{S}^u_1, \mathcal{S}^u_2, \dots, \mathcal{S}^u_{|\mathcal{S}^u|})$, where each element represents an item. For computational reasons we may choose to truncate to the $n$ most recent interactions. For simplicity we may also denote the truncated sequence as $s = (s_1, s_2, \dots, s_n)$. We have user and item sets $\mathcal{U}$ and $\mathcal{I}$. Let us also define:
- $M \in \mathbb{R}^{|\mathcal{I}| \times d}$ as the full item embedding matrix with latent dimension $d$
- $P \in \mathbb{R}^{n \times d}$ as the learned position embedding matrix
For each user, we receive $\mathcal{S}^u$ and truncate it to the $n$ most recent items. If there are fewer than $n$ items, we left-pad the sequence with a constant zero vector. This results in an input embedding matrix $E \in \mathbb{R}^{n \times d}$ for the user, with $E_t = M_{s_t}$.
Analogous to the language modelling task, the target sequence for each user is simply the input sequence shifted to the left by one. In other words, the target at time step $t$ is the item interacted with at time step $t+1$.
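For concreteness, here is a minimal sketch of how the truncation, left-padding and shifted targets could be constructed, assuming item ids start at 1 and id 0 is reserved for padding (the function name and exact pipeline are my own illustration, not the paper's code):

```python
import numpy as np

def build_inputs_and_targets(item_seq, n=50, pad_id=0):
    """Truncate a user's interaction history, left-pad it to length n, and build
    next-item targets (targets[t] is the item that follows inputs[t])."""
    window = item_seq[-(n + 1):]          # keep the n+1 most recent interactions
    inputs, targets = window[:-1], window[1:]
    pad = [pad_id] * (n - len(inputs))    # left-pad short sequences
    return np.array(pad + list(inputs)), np.array(pad + list(targets))

# Example: history [3, 7, 2, 9] with n = 3 -> inputs [3, 7, 2], targets [7, 2, 9]
```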
Model
Position Embeddings. We start by adding position embeddings to the user representation; learned absolute position embeddings are used here. Since this is a transformer model, it has no notion of the order of the item sequence unless we inject the position embeddings, and it would not be able to learn that more recent items carry more information about the next item to predict. The authors later visualize the self-attention heatmaps and show that without position embeddings all items are attended to roughly uniformly, whereas with position embeddings the attention weights concentrate near the diagonal, i.e. more recent items are attended to more strongly.
Specifically, we simply add the position embedding matrix to the input embedding matrix, such that:

$$\hat{E}_t = E_t + P_t = M_{s_t} + P_t$$
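A minimal sketch of this lookup and addition in PyTorch (module names and sizes here are my own illustrative choices, not the paper's code):

```python
import torch
import torch.nn as nn

n, d, num_items = 50, 50, 10_000                 # illustrative sizes, not the paper's

item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)   # M (row 0 = padding)
pos_emb = nn.Embedding(n, d)                                # P, learned absolute positions

seq = torch.randint(1, num_items + 1, (1, n))    # a left-padded batch of item ids
positions = torch.arange(n).unsqueeze(0)         # 0 .. n-1

E_hat = item_emb(seq) + pos_emb(positions)       # \hat{E}: (batch, n, d)
```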
Self attention. The standard scaled dot-product attention is used to perform self-attention on the input embedding. Specifically:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

$$S = \text{SA}(\hat{E}) = \text{Attention}(\hat{E} W^Q, \hat{E} W^K, \hat{E} W^V)$$

Where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$ are the projection matrices. We then mask the softmax attention matrix in a causal manner, so that there can be no interaction between query $Q_i$ and key $K_j$ for all $j > i$ (i.e. no attending to future items).
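A sketch of the masked scaled dot-product attention under these assumptions (single head, square $d \times d$ projections); this is my own illustration rather than the authors' released code:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(E_hat, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention with a causal (lower-triangular) mask.

    E_hat: (batch, n, d) input embeddings; W_q, W_k, W_v: (d, d) projection matrices.
    """
    Q, K, V = E_hat @ W_q, E_hat @ W_k, E_hat @ W_v
    scores = Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5      # (batch, n, n)
    n = scores.size(-1)
    causal = torch.tril(torch.ones(n, n, device=scores.device)).bool()
    scores = scores.masked_fill(~causal, float('-inf'))       # block attention to future items
    return F.softmax(scores, dim=-1) @ V                      # S: (batch, n, d)
```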
Feedforward network. A point-wise two-layer feedforward network is applied to the output of the self-attention (i.e. $S$), like so:

$$F_t = \text{FFN}(S_t) = \text{ReLU}(S_t W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}$$

Where $W^{(1)}, W^{(2)} \in \mathbb{R}^{d \times d}$ and $b^{(1)}, b^{(2)} \in \mathbb{R}^{d}$. Note that in the feedforward network there remains no interaction between the $S_t$ at different time positions, since it is applied to each position independently.
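A corresponding sketch of the point-wise feedforward network; `nn.Linear` acts on the last dimension, so each time step is transformed independently:

```python
import torch.nn as nn

class PointWiseFFN(nn.Module):
    """Two-layer feedforward network applied to every time step independently."""
    def __init__(self, d):
        super().__init__()
        # nn.Linear operates on the last dimension, so positions never mix here.
        self.net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, S):        # S: (batch, n, d)
        return self.net(S)       # F: (batch, n, d)
```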
Stacking self attention layers. Now we stack the self-attention layers and also apply (i) a residual connection, (ii) dropout and (iii) LayerNorm each time. This is standard practice in transformers and leads to more stable training.

$$S^{(b)} = \text{SA}'(F^{(b-1)}), \qquad F^{(b)}_t = \text{FFN}'(S^{(b)}_t), \qquad F^{(0)} = \hat{E}$$

Where we define the composite function $\text{SA}'$ as follows, and $\text{FFN}'$ is defined similarly:

$$\text{SA}'(x) = x + \text{Dropout}\big(\text{SA}(\text{LayerNorm}(x))\big)$$
Note: In modern transformers, the $\text{LayerNorm}$ is replaced by the simpler $\text{RMSNorm}$ and the $\text{ReLU}$ function is replaced by the $\text{GELU}$ function.
This gives us the full specification for one layer (layer $b$) of the transformer. Several such layers ($b = 1, \dots, B$) are stacked to provide the full model.
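Putting the pieces together, a sketch of one full block with the residual/dropout/LayerNorm wrapping, using PyTorch's `nn.MultiheadAttention` with a single head (an illustrative re-implementation, not the released code):

```python
import torch
import torch.nn as nn

class SASRecBlock(nn.Module):
    """One SASRec layer: causal self-attention plus point-wise FFN, each wrapped
    as x + Dropout(sublayer(LayerNorm(x))). An illustrative sketch."""
    def __init__(self, d, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=1, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                  # x: (batch, n, d)
        n = x.size(1)
        # True above the diagonal = positions that may NOT be attended to (the future).
        future = torch.triu(torch.ones(n, n, device=x.device), diagonal=1).bool()
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=future)
        x = x + self.drop(h)                               # residual around the attention sublayer
        x = x + self.drop(self.ffn(self.norm2(x)))         # residual around the FFN sublayer
        return x
```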
Prediction. After the final layer, we have $F^{(B)} \in \mathbb{R}^{n \times d}$ as our representation. The predicted score at each time step $t$ for any item $i$ is computed with a simple dot product against the item embedding:

$$r_{i,t} = F^{(B)}_t M_i^\top$$
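Since the item embedding table is shared with the input layer (as noted in the Experiments section), the scores for every item at every time step can be computed as one matrix product; a small sketch, reusing `item_emb` from the earlier snippets:

```python
import torch
import torch.nn as nn

def score_all_items(model_out: torch.Tensor, item_emb: nn.Embedding) -> torch.Tensor:
    """model_out: F^{(B)} of shape (batch, n, d); returns r_{i,t} of shape (batch, n, num_items + 1)."""
    return model_out @ item_emb.weight.T

# At inference time, the next item is recommended by ranking the scores at the
# last time step, e.g. score_all_items(F_B, item_emb)[:, -1, :].topk(10).
```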
Training Loss
As discussed above, the target for time step $t$ is simply the next item at time step $t+1$. Specifically, if we define $o_t$ as the target output at time step $t$, we have:
- $o_t = \langle \text{pad} \rangle$ if $s_t$ is a padding item
- $o_t = s_{t+1}$ for $1 \le t < n$
Note: The paper masks the loss at padding positions (terms with $o_t = \langle \text{pad} \rangle$ are ignored) rather than predicting the padding item. For the last time step $t = n$, the target is the user's most recent interaction, which is held out of the input sequence.
Finally, the binary cross entropy loss is chosen as the objective function for each time step $t$. Specifically, a random negative item $j_t \notin \mathcal{S}^u$ that user $u$ has not interacted with is sampled for each time step and used as the negative example. The loss is:

$$\mathcal{L} = -\sum_{\mathcal{S}^u} \sum_{t \in [1, \dots, n]} \Big[ \log\big(\sigma(r_{o_t,t})\big) + \log\big(1 - \sigma(r_{j_t,t})\big) \Big]$$
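A sketch of this loss with one sampled negative per position and padded positions masked out (function and argument names are mine):

```python
import torch
import torch.nn.functional as F

def sasrec_bce_loss(model_out, item_emb, targets, negatives, pad_id=0):
    """Binary cross-entropy with one sampled negative per time step.

    model_out: (batch, n, d) F^{(B)}; targets, negatives: (batch, n) item ids.
    """
    pos_logits = (model_out * item_emb(targets)).sum(-1)     # r_{o_t, t}
    neg_logits = (model_out * item_emb(negatives)).sum(-1)   # r_{j_t, t}
    mask = (targets != pad_id).float()                       # ignore padded positions
    loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits), reduction='none') \
         + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits), reduction='none')
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```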
Note: The binary cross entropy loss is used here with one sampled negative per time step. A later paper, Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?, shows that using a softmax cross entropy loss with many sampled negatives leads to much better performance.
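For comparison, a rough sketch of the sampled softmax cross-entropy alternative mentioned in that note (a generic formulation of the idea, not necessarily the exact setup of that paper):

```python
import torch
import torch.nn.functional as F

def sampled_softmax_ce_loss(model_out, item_emb, targets, neg_ids, pad_id=0):
    """Cross-entropy over the positive item vs. K sampled negatives per time step.

    neg_ids: (batch, n, K) item ids of sampled negatives.
    """
    pos_logits = (model_out * item_emb(targets)).sum(-1, keepdim=True)        # (batch, n, 1)
    neg_logits = torch.einsum('bnd,bnkd->bnk', model_out, item_emb(neg_ids))  # (batch, n, K)
    logits = torch.cat([pos_logits, neg_logits], dim=-1)                      # positive item in slot 0
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long, device=logits.device)
    loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), reduction='none')
    mask = (targets != pad_id).float().flatten()                              # ignore padded positions
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```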
Experiments
For the experiments, two attention layers were used (i.e. $B = 2$). Item embeddings are shared between the input embedding layer and the prediction layer (i.e. $M$ is reused when computing $r_{i,t}$). The latent dimension is set to $d = 50$.
The ablation studies found that:
- Increasing the number of layers saturates at around two to three blocks; stacking more brings little gain
- Using multi-head attention did not improve over single head attention
- The absolute position embeddings generally improved performance relative to no position embeddings