Collaborative Filtering

In the RecSys setting, collaborative filtering is typically done with implicit feedback, where interactions are very sparse. Most of the time only positive signals are recorded, and a non-interaction could mean either that (i) the user dislikes the item or (ii) the user was never exposed to the item. Hence, we cannot use algorithms like SVD that treat every non-interaction as irrelevance.

A useful repository is https://github.com/recommenders-team/recommenders.

A generic and fairly common architecture for a collaborative filtering model is to embed each user and each item into a fixed-size vector and use the cosine similarity between the two vectors as a score. This score is fed into a cross-entropy loss against the labelled relevance of the user-item pair to train the embeddings, as sketched below.
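A minimal PyTorch sketch of this architecture (layer sizes, names, and the toy data are illustrative assumptions, not any particular library's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerCF(nn.Module):
    """Embed users and items separately, score with cosine similarity."""
    def __init__(self, n_users, n_items, dim=64):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, dim)
        self.item_emb = nn.Embedding(n_items, dim)

    def forward(self, user_ids, item_ids):
        u = self.user_emb(user_ids)
        v = self.item_emb(item_ids)
        return F.cosine_similarity(u, v, dim=-1)  # score in [-1, 1]

# Train against 0/1 relevance labels with binary cross entropy.
model = TwoTowerCF(n_users=1000, n_items=5000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
users = torch.tensor([0, 1, 2])
items = torch.tensor([10, 20, 30])
labels = torch.tensor([1.0, 0.0, 1.0])

score = model(users, items)
loss = F.binary_cross_entropy_with_logits(score, labels)
opt.zero_grad()
loss.backward()
opt.step()
```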

Setup

Let $\mathbf{x}_u$ and $\mathbf{y}_i$ denote the $d$-dimensional embedding vectors for user $u$ and item $i$. Let the similarity function be $s(u, i)$, which is typically $\mathbf{x}_u^\top \mathbf{y}_i$, and the distance function be $d(u, i)$, which is typically $\lVert \mathbf{x}_u - \mathbf{y}_i \rVert_2$. Then some common loss functions may be denoted as below.

Pointwise losses are typically low-performing. For a given $(u, i)$ pair, pointwise losses assume the presence of a relevance label $y_{ui} \in \{0, 1\}$ and try to predict it. The typical pointwise loss is binary cross entropy, which may be expressed as:

$$\mathcal{L}_{\text{BCE}} = -\sum_{(u, i)} \Big[\, y_{ui} \log \sigma\big(s(u, i)\big) + (1 - y_{ui}) \log\big(1 - \sigma(s(u, i))\big) \Big]$$

where $\sigma$ is the sigmoid function.
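A small sketch of this loss, treating raw similarity scores as logits (the function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def pointwise_bce_loss(scores, labels):
    """Binary cross entropy over raw scores s(u, i) treated as logits."""
    return F.binary_cross_entropy_with_logits(scores, labels)

scores = torch.tensor([2.1, -0.3, 0.7])   # s(u, i) for three (u, i) pairs
labels = torch.tensor([1.0, 0.0, 1.0])    # observed relevance
print(pointwise_bce_loss(scores, labels))
```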

Pairwise losses assume the presence of training triplets $(u, i, j)$ corresponding to a user, a positive item, and a negative item. A typical pairwise loss is Bayesian Personalized Ranking (BPR), as follows:

$$\mathcal{L}_{\text{BPR}} = -\sum_{(u, i, j)} \log \sigma\big(s(u, i) - s(u, j)\big)$$
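A matching sketch of BPR, which pushes the positive score above the sampled negative score (again, names are illustrative):

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores):
    """BPR: maximize log-sigmoid of the positive-negative score gap."""
    return -F.logsigmoid(pos_scores - neg_scores).mean()

pos_scores = torch.tensor([2.1, 0.7, 1.5])   # s(u, i) for positive items
neg_scores = torch.tensor([0.4, 0.9, -0.2])  # s(u, j) for sampled negatives
print(bpr_loss(pos_scores, neg_scores))
```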

Weighted Matrix Factorization

This describes the Cornac implementation of WMF.

Let $R \in \mathbb{R}^{m \times n}$ describe a rating matrix of $m$ users and $n$ items. For simplicity, we may restrict $r_{ui} \in \{0, 1\}$. Given a user embedding matrix $X \in \mathbb{R}^{m \times d}$ and item embedding matrix $Y \in \mathbb{R}^{n \times d}$, WMF computes the similarity score as the dot product $\hat{r}_{ui} = \mathbf{x}_u^\top \mathbf{y}_i$.

The general loss function is:

$$\mathcal{L} = \sum_{u, i} w_{ui} \big(r_{ui} - \mathbf{x}_u^\top \mathbf{y}_i\big)^2 + \lambda_u \sum_u \lVert \mathbf{x}_u \rVert^2 + \lambda_v \sum_i \lVert \mathbf{y}_i \rVert^2, \qquad w_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ b & r_{ui} = 0 \end{cases}$$

The idea is simply to take the squared error against the true ratings matrix as our loss, but apply a lower weight $b$ to elements of the rating matrix where the rating is zero (these are usually unobserved / implicit negatives that we are less confident about). Usually $b$ is set to 0.01. Regularization is performed on the user and item embedding matrices, with $\lambda_u$ and $\lambda_v$ as hyperparameters to adjust the strength of regularization.
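A minimal NumPy sketch of this weighted loss (matrix names and shapes follow the setup above; the helper is illustrative, not Cornac's code):

```python
import numpy as np

def wmf_loss(R, X, Y, b=0.01, lambda_u=0.01, lambda_v=0.01):
    """Weighted squared error with L2 regularization, as described above."""
    R_hat = X @ Y.T                      # predicted ratings, shape (m, n)
    W = np.where(R > 0, 1.0, b)          # weight 1 for observed entries, b for zeros
    sq_err = (R - R_hat) ** 2
    return (W * sq_err).sum() + lambda_u * (X ** 2).sum() + lambda_v * (Y ** 2).sum()

# Toy example: 3 users, 4 items, latent dimension 2.
rng = np.random.default_rng(0)
R = (rng.random((3, 4)) > 0.7).astype(float)
X, Y = rng.normal(size=(3, 2)), rng.normal(size=(4, 2))
print(wmf_loss(R, X, Y))
```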

In Cornac, this loss is adapted to the mini-batch setting. Specifically, the algorithm is:

  1. Draw a mini-batch (default: B = 128) of items, but use all the users.
  2. Compute the model predictions $\hat{R}_B = X Y_B^\top$ for the sampled items.
  3. Compute the squared error $(R_B - \hat{R}_B)^2$.
  4. Multiply the matrix of weights (either 1 for positive ratings or b for negative ratings) element-wise with the squared error, and sum to obtain the batch loss.

Note that the Adam optimizer is used, and gradients are clipped to [-5, 5].
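A rough PyTorch sketch of this mini-batch procedure (the learning rate, batch handling, and variable names are assumptions; regularization is omitted for brevity, and Cornac's actual implementation may differ in details):

```python
import torch

m, n, d, B, b = 1000, 2000, 50, 128, 0.01
R = (torch.rand(m, n) > 0.95).float()          # toy implicit rating matrix
X = torch.randn(m, d, requires_grad=True)      # user embeddings
Y = torch.randn(n, d, requires_grad=True)      # item embeddings
opt = torch.optim.Adam([X, Y], lr=0.001)

for step in range(100):
    items = torch.randint(0, n, (B,))          # 1. mini-batch of items, all users
    R_b = R[:, items]
    R_hat = X @ Y[items].T                     # 2. model predictions
    sq_err = (R_b - R_hat) ** 2                # 3. squared error
    W = torch.where(R_b > 0,                   # 4. weights: 1 for positives, b otherwise
                    torch.ones_like(R_b),
                    torch.full_like(R_b, b))
    loss = (W * sq_err).sum()

    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_value_([X, Y], 5.0)   # clip gradients to [-5, 5]
    opt.step()
```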

Bilateral Variational Autoencoder (BiVAE)

References: Recommenders BiVAE Deep Dive, BiVAE Paper.

A working implementation of BiVAE is available on Cornac.

A variational autoencoder improves over traditional linear matrix factorization methods through non-linearity and a probabilistic formulation. Given a user, the encoder maps the data representing that user (e.g. their interaction vector) into a vector in some latent space. A decoder then takes the latent vector and reconstructs something close to the original data.

The difference between a VAE and a regular autoencoder is that it doesn't learn a fixed vector representation, but rather a probability distribution over the latent space. This allows it to model noisy, sparse interaction data better. BiVAE extends this idea bilaterally, learning variational representations for both users and items.
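A hedged usage sketch with Cornac's BiVAECF model (hyperparameter values are illustrative toy settings, and the constructor arguments below are based on the Recommenders deep dive, so they may differ across Cornac versions):

```python
import cornac

# (user, item, rating) triplets; ratings are implicit 1.0 for observed interactions.
uir_triplets = [
    ("u1", "i1", 1.0),
    ("u1", "i2", 1.0),
    ("u2", "i2", 1.0),
    ("u2", "i3", 1.0),
]
train_set = cornac.data.Dataset.from_uir(uir_triplets, seed=42)

bivae = cornac.models.BiVAECF(
    k=10,                      # latent dimension
    encoder_structure=[20],    # hidden layer sizes for the encoders
    act_fn="tanh",
    likelihood="pois",
    n_epochs=5,
    batch_size=2,
    learning_rate=0.001,
    seed=42,
)
bivae.fit(train_set)

# Score all items for the first user (internal index 0).
print(bivae.score(0))
```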

Splitting

The recommenders package provides a few different types of data splitting:

  1. Stratified splitting, as sketched below.
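A short sketch of stratified splitting with the recommenders package (column names and the ratio are illustrative; I believe the helper lives at `recommenders.datasets.python_splitters.python_stratified_split`, but check the current API):

```python
import pandas as pd
from recommenders.datasets.python_splitters import python_stratified_split

# Toy interaction data.
data = pd.DataFrame({
    "userID": [1, 1, 1, 2, 2, 2, 3, 3],
    "itemID": [10, 11, 12, 10, 13, 14, 11, 12],
    "rating": [1.0] * 8,
})

# Stratify by user so every user appears in both the train and test splits.
train, test = python_stratified_split(
    data, ratio=0.75, col_user="userID", col_item="itemID", seed=42
)
print(len(train), len(test))
```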