He 2020 - LightGCN

He 2020 - LightGCN is a simple and effective Graph Convolution Network for recommendation.

LightGCN adapts Graph Convolutional Networks (GCN) to the task of recommendation. In a typical Convolutional Neural Network for vision, convolution aggregations (such as linear projections, pooling, averaging) are applied to a neighbourhood of nearby pixels. The aggregations transform the raw pixel values into a hidden layer of "embedding" values, and the next layer of aggregations is applied to that hidden layer, allowing the CNN to learn more abstract features with each additional layer. A GCN uses essentially the same idea, except that the neighbourhood of a node A is defined as the set of nodes connected to A by an edge. The GCN thus allows us to train node embeddings on all types of graph data, such as social networks, user-item interactions etc.

Setting

This paper tackles the task of collaborative filtering without features, i.e. making recommendations purely from the user and item id. Also, no explicitly labelled negatives are required - all we need is edges between users and items based on some form of implicit interaction (negatives for training are sampled from unobserved pairs, see the Loss section).

Neural Graph Collaborative Filtering (NGCF)

The LightGCN model is essentially a simplification of the NGCF model, so the paper starts here. Btw, there are some overlaps between LightGCN authors and NGCF authors. The setup is as follows:

Each user and item are embedded from their id -> embedding
Let $e_{u}^{(0)}$ denote the ID embedding of user $u$ and $e_{i}^{(0)}$ denote the ID embedding of item $i$

NGCF uses the user-item interaction graph (derived from data) to propagate the embeddings as follows:

$e_{u}^{(k+1)} = \sigma \left( W_{1}^{(k)} e_{u}^{(k)} + \sum_{i \in N_{u}} \frac{1}{\sqrt{|N_{u}|} \sqrt{|N_{i}|}} \left( W_{1}^{(k)} e_{i}^{(k)} + W_{2}^{(k)} (e_{i}^{(k)} \odot e_{u}^{(k)}) \right) \right)$

$e_{i}^{(k+1)} = \sigma \left( W_{1}^{(k)} e_{i}^{(k)} + \sum_{u \in N_{i}} \frac{1}{\sqrt{|N_{u}|} \sqrt{|N_{i}|}} \left( W_{1}^{(k)} e_{u}^{(k)} + W_{2}^{(k)} (e_{i}^{(k)} \odot e_{u}^{(k)}) \right) \right)$

Some notes about the propagation equations above:

$e_{u}^{(k + 1)}$ and $e_{i}^{(k + 1)}$ denote the embedding of user $u$ and item $i$ respectively after $k + 1$ layers of propagation
$σ$ is a non-linear activation function
$N_{u}$ denotes the set of items that user $u$ interacted with. For instance, it could be all the items the user purchased within the past 3 months. $N_{i}$ is the set of users that interacted with item $i$, defined in a similar way.
$W_{1}^{(k)}$ and $W_{2}^{(k)}$ are trainable weights

Intuitively, for a given user, the equation propagates (i) the user embedding itself (order-1), (ii) the embeddings of neighbouring items (order-1) and (iii) the Hadamard interaction between the user and the neighbouring items (order-2). And likewise for the item embeddings. Note that no neighbour sampling is performed - the entire neighbour set is taken per node.

Finally, after training the network of $L$ layers, we obtain $L + 1$ embeddings for each user and item. The embeddings are concatenated as such: $e_{u} = [e_{u}^{(0)}, ..., e_{u}^{(L)}]$ and $e_{i} = [e_{i}^{(0)}, ..., e_{i}^{(L)}]$, where $e_{u}, e_{i}$ are vectors in $\mathbb{R}^{(L + 1) d}$ if every layer has embedding dimension $d$. Prediction scores for the match between user $u$ and item $i$ are then computed via the inner product $⟨ e_{u}, e_{i} ⟩$.
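To make the NGCF setup above concrete, here is a minimal NumPy sketch of one propagation layer, the layer concatenation and the inner-product score. It is an illustration under assumptions, not the authors' code: the function names are hypothetical, $σ$ is taken to be LeakyReLU, the normalisation is the square-root symmetric form $1/\sqrt{|N_u||N_i|}$, and the weights are applied as `e @ W`.

```python
import numpy as np

def ngcf_layer(E_user, E_item, edges, W1, W2, slope=0.2):
    """One NGCF propagation step (sketch).

    E_user: (n_users, d), E_item: (n_items, d), edges: list of (u, i) pairs.
    W1, W2: (d, d_out) projection matrices (assumed shapes).
    """
    deg_u = np.zeros(E_user.shape[0])
    deg_i = np.zeros(E_item.shape[0])
    for u, i in edges:
        deg_u[u] += 1
        deg_i[i] += 1

    # Self term: W1 applied to the node's own embedding.
    out_u = E_user @ W1
    out_i = E_item @ W1
    for u, i in edges:
        norm = 1.0 / np.sqrt(deg_u[u] * deg_i[i])
        hadamard = E_item[i] * E_user[u]  # order-2 interaction term
        out_u[u] += norm * (E_item[i] @ W1 + hadamard @ W2)
        out_i[i] += norm * (E_user[u] @ W1 + hadamard @ W2)

    # Assumed activation: LeakyReLU.
    leaky = lambda x: np.where(x > 0, x, slope * x)
    return leaky(out_u), leaky(out_i)

def ngcf_final_embeddings(E_user, E_item, edges, weights):
    """Concatenate the L + 1 per-layer embeddings; weights is a list of (W1, W2)."""
    users, items = [E_user], [E_item]
    for W1, W2 in weights:
        E_user, E_item = ngcf_layer(E_user, E_item, edges, W1, W2)
        users.append(E_user)
        items.append(E_item)
    return np.concatenate(users, axis=1), np.concatenate(items, axis=1)

def ngcf_scores(final_user, final_item):
    """Inner product between every user and every item."""
    return final_user @ final_item.T
```

With $L$ layers of width $d$, `ngcf_final_embeddings` returns matrices with $(L + 1) d$ columns, which is why the inner product is taken over the concatenated vectors.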

Problem With NGCF

The authors argue that NGCF is unnecessarily complicated because traditionally, the base embedding layer $e_{u}^{(0)}, e_{i}^{(0)}$ is derived from rich semantic features such as embedding the title of papers etc. This justifies the usage of the activation function $σ$ and the projection weights $W_{1}^{(k)}, W_{2}^{(k)}$ etc. to learn a transformation of the semantic features. In contrast, for the collaborative filtering setting, the embeddings are arbitrary numbers tied to each user or item ID. Hence, performing multiple non-linear transformations will not lead to better feature learning.

I'm not fully convinced by this argument, although the empirical results do support it. I agree with the argument to the extent that the base embedding layer is arbitrary, but imo NGCF can still learn a bigger representation space of models through its non-linear transformations. The problem seems to be more that (i) the richer feature representation is not very useful and (ii) the additional complexity makes the model harder to learn.

LightGCN Forward Propagation

In LightGCN, we essentially remove the non-linear activation and weight projections. The propagation equations simplify to the following:

$e_{u}^{(k+1)} = \sum_{i \in N_{u}} \frac{e_{i}^{(k)}}{\sqrt{|N_{u}|} \sqrt{|N_{i}|}} = \frac{1}{\sqrt{|N_{u}|}} \sum_{i \in N_{u}} \frac{e_{i}^{(k)}}{\sqrt{|N_{i}|}}$

$e_{i}^{(k+1)} = \sum_{u \in N_{i}} \frac{e_{u}^{(k)}}{\sqrt{|N_{u}|} \sqrt{|N_{i}|}} = \frac{1}{\sqrt{|N_{i}|}} \sum_{u \in N_{i}} \frac{e_{u}^{(k)}}{\sqrt{|N_{u}|}}$
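In matrix form, one LightGCN propagation step is just a multiplication by the degree-normalised interaction matrix. The sketch below assumes a dense binary matrix `R` of shape (n_users, n_items) and the square-root symmetric normalisation $1/\sqrt{|N_u||N_i|}$; the function names are hypothetical and a real implementation would use sparse matrices.

```python
import numpy as np

def normalized_interactions(R):
    """Scale each entry R[u, i] by 1 / sqrt(|N_u| * |N_i|)."""
    R = np.asarray(R, dtype=float)
    deg_u = R.sum(axis=1)  # |N_u| for every user
    deg_i = R.sum(axis=0)  # |N_i| for every item
    # Isolated nodes get weight 0 instead of a division by zero.
    inv_sqrt_u = np.where(deg_u > 0, 1.0 / np.sqrt(np.maximum(deg_u, 1.0)), 0.0)
    inv_sqrt_i = np.where(deg_i > 0, 1.0 / np.sqrt(np.maximum(deg_i, 1.0)), 0.0)
    return inv_sqrt_u[:, None] * R * inv_sqrt_i[None, :]

def lightgcn_propagate(E_user, E_item, R_norm):
    """One layer: users aggregate items and items aggregate users, no weights, no activation."""
    return R_norm @ E_item, R_norm.T @ E_user
```

Note that there is nothing to train inside the layer: the only inputs are the previous embeddings and the fixed graph structure.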

The final representation of each node $v$ (whether user or item) is then a weighted sum of its hidden representation across all layers:

$e_{v} = \sum_{k=0}^{K} α_{k} \cdot e_{v}^{(k)}$

Although $α_{k}$ could be a parameter to be optimized, the authors propose just setting $α_{k} = 1/ (K + 1)$ for simplicity.
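The layer combination can be sketched as a running weighted sum over the propagated embeddings. This is a minimal illustration, assuming `R_norm` is an already degree-normalised (n_users, n_items) matrix; the function name and arguments are hypothetical.

```python
import numpy as np

def lightgcn_final_embeddings(E_user, E_item, R_norm, num_layers, alphas=None):
    """Weighted sum of the layer-0..K embeddings, with alpha_k = 1 / (K + 1) by default."""
    if alphas is None:
        alphas = np.full(num_layers + 1, 1.0 / (num_layers + 1))
    final_user = alphas[0] * E_user
    final_item = alphas[0] * E_item
    for k in range(1, num_layers + 1):
        # Propagate one step, then add this layer's contribution.
        E_user, E_item = R_norm @ E_item, R_norm.T @ E_user
        final_user = final_user + alphas[k] * E_user
        final_item = final_item + alphas[k] * E_item
    return final_user, final_item
```

Passing a custom `alphas` array of length K + 1 is how one would experiment with non-uniform layer weights.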

Notably, the forward propagation does not include the self-connection from the previous layer, i.e. the update for $e_{u}^{(k + 1)}$ does not explicitly include $e_{u}^{(k)}$, which other papers like GraphSAGE argue is important. The authors argue that because the final representation is a weighted sum of the hidden representations across all layers, this is essentially equivalent to including self-connections, so explicit self-connections are no longer necessary.
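One way to see the self-connection argument is that propagating K times with self-loops, $(A + I)^{K} E$, expands into a binomially weighted sum of the loop-free layers $A^{k} E$. The sketch below checks this numerically; it ignores degree normalisation for brevity, and `A` stands for a hypothetical square adjacency matrix over all nodes (users and items together).

```python
import numpy as np
from math import comb

def propagate_with_self_loops(A, E, K):
    """Apply (A + I) to the embeddings K times."""
    A_loop = A + np.eye(A.shape[0])
    out = np.asarray(E, dtype=float)
    for _ in range(K):
        out = A_loop @ out
    return out

def binomial_layer_sum(A, E, K):
    """Sum over k of C(K, k) * A^k E, i.e. a weighted sum of loop-free layers."""
    layer = np.asarray(E, dtype=float)
    out = np.zeros_like(layer)
    for k in range(K + 1):
        out += comb(K, k) * layer
        layer = A @ layer
    return out
```

The two functions agree for any `A` and `K`, so self-loops amount to one particular choice of the layer weights $α_{k}$.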

Loss

The only trainable parameters of the model are the embeddings at the 0th layer, i.e. $E^{(0)}$ . The authors propose using Bayesian Personalized Ranking loss, which is a pairwise loss that encourages the score of a neighbour to be higher than the score of an unobserved, randomly sampled counterpart.

$L_{BPR} = - \sum_{u=1}^{M} \sum_{i \in N_{u}} \sum_{j \notin N_{u}} \ln σ (\hat{y}_{ui} - \hat{y}_{uj}) + λ \| E^{(0)} \|^{2}$

In contrast to NGCF and other GCN approaches, the authors do not use dropout as a regularizer. Instead, they think the L2 regularization on the embedding layer is sufficient, as these are the only parameters in the model. Training of the model is done in a mini-batch manner, where batches of (user, item) tuples are drawn, negative items sampled, and the loss evaluated.
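A mini-batch version of this loss can be sketched as follows. The names are hypothetical, one negative is drawn uniformly per (user, item) pair, and the sampler assumes every user has at least one item they have not interacted with (otherwise the loop would not terminate).

```python
import numpy as np

def sample_negatives(users, interacted, n_items, rng):
    """For each user, draw one item id that is not in interacted[user]."""
    negatives = []
    for u in users:
        j = int(rng.integers(n_items))
        while j in interacted[u]:
            j = int(rng.integers(n_items))
        negatives.append(j)
    return np.array(negatives)

def bpr_loss(final_user, final_item, users, pos_items, neg_items,
             E0_user, E0_item, reg):
    """BPR ranking term on final embeddings plus L2 on the layer-0 embeddings."""
    pos_scores = np.sum(final_user[users] * final_item[pos_items], axis=1)
    neg_scores = np.sum(final_user[users] * final_item[neg_items], axis=1)
    # -ln(sigmoid(x)) = ln(1 + exp(-x)), computed stably with logaddexp.
    ranking = np.logaddexp(0.0, -(pos_scores - neg_scores)).sum()
    l2 = reg * (np.sum(E0_user ** 2) + np.sum(E0_item ** 2))
    return ranking + l2
```

In training, the gradient of this loss flows back through the (parameter-free) propagation to the layer-0 embeddings, which is why only those are regularised.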

Ablation Studies

The paper has a few ablation findings:

Symmetric normalization is important, i.e. it is important in the forward propagation to divide by both $\sqrt{|N_{u}|}$ and $\sqrt{|N_{i}|}$. Omitting either one leads to performance degradation. Note that in GraphSAGE, the GraphSAGE-mean variant essentially does $\sum_{i \in N_{u}} e_{i}^{(k)} / |N_{u}|$, i.e. it only normalizes by the user degree. I suppose normalizing by the item degree as well penalizes popular items, so it could be useful.
Layer combination is important for robustness as we increase the number of layers, i.e. instead of just taking $e^{(K)}$ as the final embeddings, it is useful to take the element-wise mean of the embeddings at each layer. This might be analogous to the impact of including self-connections.
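To illustrate the point about popular items, the sketch below compares the weight an item receives in a user's aggregation under the two schemes. The function names are hypothetical, and the symmetric weight assumes the square-root form $1/\sqrt{|N_u||N_i|}$.

```python
from math import sqrt

def symmetric_weight(user_degree, item_degree):
    """Symmetric normalisation: depends on both the user and the item degree."""
    return 1.0 / sqrt(user_degree * item_degree)

def mean_weight(user_degree, item_degree):
    """GraphSAGE-mean style: depends only on the aggregating user's degree."""
    return 1.0 / user_degree

# A user with two items: one niche (1 interaction) and one popular (100 interactions).
niche = symmetric_weight(2, 1)
popular = symmetric_weight(2, 100)
```

Under the mean scheme both items contribute equally to the user, while under the symmetric scheme the popular item's contribution is ten times smaller than the niche item's.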

Cornac Implementation

Cornac has a torch implementation of LightGCN.

The code relies on the dmlc/dgl package for constructing the bipartite user-item graph which will be used to compute neighbourhoods. The construct_graph function works as follows:

user_indices and item_indices are lists of the same length; the elements at index i of the two lists together give one (user, item) pair that interacted
A dgl.heterograph is constructed with both directions:
- ("user", "user_item", "item") represents user -> item direction
- ("item", "item_user", "user") represents item -> user direction
- Hence there are two node types and two edge types in the graph
Starting with the user->item direction:
- src and dst are torch tensors containing the users and items respectively that interacted, both of length M
- dst_degree is a torch float tensor of length M containing the number of users interacting with each item in dst
- src_degree is a torch float tensor of length M containing the number of items interacting with each user in src
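The per-edge degree tensors described above can be sketched without DGL. This is a NumPy stand-in rather than the Cornac code: the function name is hypothetical, and the last line, which combines the two degrees into a symmetric edge weight, is an assumption about how they would be used.

```python
import numpy as np

def edge_degrees(user_indices, item_indices, n_users, n_items):
    """Return per-edge source degree, destination degree and symmetric norm."""
    src = np.asarray(user_indices)
    dst = np.asarray(item_indices)
    user_degree = np.bincount(src, minlength=n_users)  # items per user
    item_degree = np.bincount(dst, minlength=n_items)  # users per item
    # Look up the degree of each edge's endpoints, giving arrays of length M.
    src_degree = user_degree[src].astype(float)
    dst_degree = item_degree[dst].astype(float)
    # Assumed combination: 1 / sqrt(|N_u| * |N_i|) for every edge.
    norm = 1.0 / np.sqrt(src_degree * dst_degree)
    return src_degree, dst_degree, norm
```

For the item -> user direction the same arrays can be reused with the roles of `src` and `dst` swapped.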

At model initialization, self.feature_dict is initialized with xavier initialization as follows. Note that because we have a heterograph, the nodes are defined as a dictionary of the form node_type: feature_tensor.

    self.feature_dict = {
        "user": user_embed, # (n_users, embed_dim)
        "item": item_embed, # (n_items, embed_dim)
    }
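For reference, a Xavier-initialised feature dictionary of this shape can be sketched in NumPy as below. This assumes the uniform variant of Xavier initialisation with gain 1; the function names are hypothetical and this is not the Cornac code.

```python
import numpy as np

def xavier_uniform(n_rows, n_cols, rng):
    """Sample from U(-b, b) with b = sqrt(6 / (n_rows + n_cols))."""
    bound = np.sqrt(6.0 / (n_rows + n_cols))
    return rng.uniform(-bound, bound, size=(n_rows, n_cols))

def init_feature_dict(n_users, n_items, embed_dim, seed=0):
    """Build the node_type -> embedding matrix dictionary for a two-type graph."""
    rng = np.random.default_rng(seed)
    return {
        "user": xavier_uniform(n_users, embed_dim, rng),  # (n_users, embed_dim)
        "item": xavier_uniform(n_items, embed_dim, rng),  # (n_items, embed_dim)
    }
```

These matrices play the role of $E^{(0)}$, the only trainable parameters of the model.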

The GCNLayer class represents one layer of the message passing network.

Chux's Notebook