Kang 2018 - SASRec
Self-Attentive Sequential Recommendation
This paper uses a transformer model with causal self-attention to perform recommendation, by representing each user as a sequence of item embeddings and predicting the item interacted with at time $t+1$ using all the information up to time $t$.
Background
This paper came out shortly after the transformer (Attention is all you need) was invented. Up to this point, sequential recommendation was performed using Markov Chain methods or RNN-based methods. Since the self-attention mechanism of transformers is well suited to sequential modelling, this paper makes the natural adaptation of self-attention to the recommendation task.
Setup
In the setting of sequential recommendation, we have for each user $u$ a sequence of item interactions $\mathcal{S}^u = (\mathcal{S}^u_1, \mathcal{S}^u_2, \dots, \mathcal{S}^u_{|\mathcal{S}^u|})$, where each element represents an item. For computational reasons we may choose to truncate to the $n$ most recent interactions. For simplicity we may also denote the truncated sequence as $s = (s_1, s_2, \dots, s_n)$. We have user and item sets $\mathcal{U}$ and $\mathcal{I}$. Let us also define:
- $M \in \mathbb{R}^{|\mathcal{I}| \times d}$ as the full item embedding matrix with latent dimension $d$
- $P \in \mathbb{R}^{n \times d}$ as the learned position embedding matrix
For each user, we receive $\mathcal{S}^u$ and truncate it to the $n$ most recent items. If there are fewer than $n$ items, we left-pad the sequence with a constant zero vector. This results in an input embedding matrix $E \in \mathbb{R}^{n \times d}$ for the user, with $E_t = M_{s_t}$.
Analogous to the language modelling task, the target sequence for each user is simply the input sequence shifted to the left by one. In other words, the target at time step $t$ is the item interacted with at time step $t+1$.
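For concreteness, here is a minimal sketch of how the truncation, left-padding and shifted targets could be constructed, assuming item ids start at 1 and id 0 is reserved for padding (the function name and exact pipeline are my own illustration, not the paper's code):

```python
import numpy as np

def build_inputs_and_targets(item_seq, n=50, pad_id=0):
    """Truncate a user's interaction history, left-pad it to length n, and build
    next-item targets (targets[t] is the item that follows inputs[t])."""
    window = item_seq[-(n + 1):]          # keep the n+1 most recent interactions
    inputs, targets = window[:-1], window[1:]
    pad = [pad_id] * (n - len(inputs))    # left-pad short sequences
    return np.array(pad + list(inputs)), np.array(pad + list(targets))

# Example: history [3, 7, 2, 9] with n = 3 -> inputs [3, 7, 2], targets [7, 2, 9]
```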
Model
Position Embeddings. We start by adding position embeddings to the user representation; learned absolute position embeddings are used here. Since this is a transformer model, it has no notion of the order of the item sequence unless we inject the position embeddings, and it would not be able to learn that more recent items carry more information about the next item to predict. The authors later visualize the self-attention heatmaps and show that without position embeddings all items are attended to roughly uniformly, whereas with position embeddings the attention weights concentrate near the diagonal, i.e. more recent items are attended to more strongly.
Specifically, we simply add the position embedding matrix to the input embedding matrix, such that:

$$\hat{E}_t = E_t + P_t = M_{s_t} + P_t$$
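A minimal sketch of this lookup and addition in PyTorch (module names and sizes here are my own illustrative choices, not the paper's code):

```python
import torch
import torch.nn as nn

n, d, num_items = 50, 50, 10_000                 # illustrative sizes, not the paper's

item_emb = nn.Embedding(num_items + 1, d, padding_idx=0)   # M (row 0 = padding)
pos_emb = nn.Embedding(n, d)                                # P, learned absolute positions

seq = torch.randint(1, num_items + 1, (1, n))    # a left-padded batch of item ids
positions = torch.arange(n).unsqueeze(0)         # 0 .. n-1

E_hat = item_emb(seq) + pos_emb(positions)       # \hat{E}: (batch, n, d)
```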
Self attention. The standard scaled dot-product attention is used to perform self-attention on the input embedding. Specifically:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V$$

$$S = \text{SA}(\hat{E}) = \text{Attention}(\hat{E} W^Q, \hat{E} W^K, \hat{E} W^V)$$

Where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$ are the projection matrices. We then mask the softmax attention matrix in a causal manner, so that there can be no interaction between query $Q_i$ and key $K_j$ for all $j > i$ (i.e. no attending to future items).
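A sketch of the masked scaled dot-product attention under these assumptions (single head, square $d \times d$ projections); this is my own illustration rather than the authors' released code:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(E_hat, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention with a causal (lower-triangular) mask.

    E_hat: (batch, n, d) input embeddings; W_q, W_k, W_v: (d, d) projection matrices.
    """
    Q, K, V = E_hat @ W_q, E_hat @ W_k, E_hat @ W_v
    scores = Q @ K.transpose(-2, -1) / Q.size(-1) ** 0.5      # (batch, n, n)
    n = scores.size(-1)
    causal = torch.tril(torch.ones(n, n, device=scores.device)).bool()
    scores = scores.masked_fill(~causal, float('-inf'))       # block attention to future items
    return F.softmax(scores, dim=-1) @ V                      # S: (batch, n, d)
```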
Feedforward network. A point-wise two-layer feedforward network is applied to the output of the self-attention (i.e. $S$), like so:

$$F_t = \text{FFN}(S_t) = \text{ReLU}(S_t W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}$$

Where $W^{(1)}, W^{(2)} \in \mathbb{R}^{d \times d}$ and $b^{(1)}, b^{(2)} \in \mathbb{R}^{d}$. Note that in the feedforward network there remains no interaction between the $S_t$ at different time positions, since it is applied to each position independently.
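A corresponding sketch of the point-wise feedforward network; `nn.Linear` acts on the last dimension, so each time step is transformed independently:

```python
import torch.nn as nn

class PointWiseFFN(nn.Module):
    """Two-layer feedforward network applied to every time step independently."""
    def __init__(self, d):
        super().__init__()
        # nn.Linear operates on the last dimension, so positions never mix here.
        self.net = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, S):        # S: (batch, n, d)
        return self.net(S)       # F: (batch, n, d)
```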
Stacking self attention layers. Now we stack the self-attention layers and also apply (i) a residual connection, (ii) dropout and (iii) LayerNorm each time. This is standard practice in transformers and leads to more stable training.

$$S^{(b)} = \text{SA}'(F^{(b-1)}), \qquad F^{(b)}_t = \text{FFN}'(S^{(b)}_t), \qquad F^{(0)} = \hat{E}$$

Where we define the composite function $\text{SA}'$ as follows, and $\text{FFN}'$ is defined similarly:

$$\text{SA}'(x) = x + \text{Dropout}\big(\text{SA}(\text{LayerNorm}(x))\big)$$
Note: In modern transformers, the $\text{LayerNorm}$ is replaced by the simpler $\text{RMSNorm}$ and the $\text{ReLU}$ function is replaced by the $\text{GELU}$ function.
This gives us the full specification for one layer (layer $b$) of the transformer. Several such layers ($b = 1, \dots, B$) are stacked to provide the full model.
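Putting the pieces together, a sketch of one full block with the residual/dropout/LayerNorm wrapping, using PyTorch's `nn.MultiheadAttention` with a single head (an illustrative re-implementation, not the released code):

```python
import torch
import torch.nn as nn

class SASRecBlock(nn.Module):
    """One SASRec layer: causal self-attention plus point-wise FFN, each wrapped
    as x + Dropout(sublayer(LayerNorm(x))). An illustrative sketch."""
    def __init__(self, d, dropout=0.2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, num_heads=1, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                                  # x: (batch, n, d)
        n = x.size(1)
        # True above the diagonal = positions that may NOT be attended to (the future).
        future = torch.triu(torch.ones(n, n, device=x.device), diagonal=1).bool()
        h = self.norm1(x)
        h, _ = self.attn(h, h, h, attn_mask=future)
        x = x + self.drop(h)                               # residual around the attention sublayer
        x = x + self.drop(self.ffn(self.norm2(x)))         # residual around the FFN sublayer
        return x
```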
Prediction. After the final layer, we have $F^{(B)} \in \mathbb{R}^{n \times d}$ as our representation. The predicted score at each time step $t$ for any item $i$ is computed with a simple dot product against the item embedding:

$$r_{i,t} = F^{(B)}_t M_i^\top$$
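Since the item embedding table is shared with the input layer (as noted in the Experiments section), the scores for every item at every time step can be computed as one matrix product; a small sketch, reusing `item_emb` from the earlier snippets:

```python
import torch
import torch.nn as nn

def score_all_items(model_out: torch.Tensor, item_emb: nn.Embedding) -> torch.Tensor:
    """model_out: F^{(B)} of shape (batch, n, d); returns r_{i,t} of shape (batch, n, num_items + 1)."""
    return model_out @ item_emb.weight.T

# At inference time, the next item is recommended by ranking the scores at the
# last time step, e.g. score_all_items(F_B, item_emb)[:, -1, :].topk(10).
```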
Training Loss
As discussed above, the target for time step $t$ is simply the next item at time step $t+1$. Specifically, if we define $o_t$ as the target output at time step $t$, we have:
- $o_t = \langle \text{pad} \rangle$ if $s_t$ is a padding item
- $o_t = s_{t+1}$ for $1 \le t < n$
Note: The paper masks the loss at padding positions (terms with $o_t = \langle \text{pad} \rangle$ are ignored) rather than predicting the padding item. For the last time step $t = n$, the target is the user's most recent interaction, which is held out of the input sequence.
Finally, the binary cross entropy loss is chosen as the objective function for each time step $t$. Specifically, a random negative item $j_t \notin \mathcal{S}^u$ that user $u$ has not interacted with is sampled for each time step and used as the negative example. The loss is:

$$\mathcal{L} = -\sum_{\mathcal{S}^u} \sum_{t \in [1, \dots, n]} \Big[ \log\big(\sigma(r_{o_t,t})\big) + \log\big(1 - \sigma(r_{j_t,t})\big) \Big]$$
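A sketch of this loss with one sampled negative per position and padded positions masked out (function and argument names are mine):

```python
import torch
import torch.nn.functional as F

def sasrec_bce_loss(model_out, item_emb, targets, negatives, pad_id=0):
    """Binary cross-entropy with one sampled negative per time step.

    model_out: (batch, n, d) F^{(B)}; targets, negatives: (batch, n) item ids.
    """
    pos_logits = (model_out * item_emb(targets)).sum(-1)     # r_{o_t, t}
    neg_logits = (model_out * item_emb(negatives)).sum(-1)   # r_{j_t, t}
    mask = (targets != pad_id).float()                       # ignore padded positions
    loss = F.binary_cross_entropy_with_logits(pos_logits, torch.ones_like(pos_logits), reduction='none') \
         + F.binary_cross_entropy_with_logits(neg_logits, torch.zeros_like(neg_logits), reduction='none')
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```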
Note: The binary cross entropy loss is used here with one sampled negative per time step. A later paper, Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?, shows that using a softmax cross entropy loss with many sampled negatives leads to much better performance.
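For comparison, a rough sketch of the sampled softmax cross-entropy alternative mentioned in that note (a generic formulation of the idea, not necessarily the exact setup of that paper):

```python
import torch
import torch.nn.functional as F

def sampled_softmax_ce_loss(model_out, item_emb, targets, neg_ids, pad_id=0):
    """Cross-entropy over the positive item vs. K sampled negatives per time step.

    neg_ids: (batch, n, K) item ids of sampled negatives.
    """
    pos_logits = (model_out * item_emb(targets)).sum(-1, keepdim=True)        # (batch, n, 1)
    neg_logits = torch.einsum('bnd,bnkd->bnk', model_out, item_emb(neg_ids))  # (batch, n, K)
    logits = torch.cat([pos_logits, neg_logits], dim=-1)                      # positive item in slot 0
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long, device=logits.device)
    loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), reduction='none')
    mask = (targets != pad_id).float().flatten()                              # ignore padded positions
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```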
Experiments
For the experiments, two attention layers were used (i.e. $B = 2$). Item embeddings are shared between the input embedding layer and the prediction layer (i.e. $M$ is reused when computing $r_{i,t}$). The latent dimension is set to $d = 50$.
The ablation studies found that:
- Increasing the number of layers saturates at around two to three blocks; stacking more brings little gain
- Using multi-head attention did not improve over single head attention
- The absolute position embeddings generally improved performance relative to no position embeddings