Lee 2020 - Large Scale Video Representation Learning
Large Scale Video Representation Learning via Relational Graph Clustering
This paper proposes an efficient method to learn video representations from relational graph data (i.e. derived from user-item interactions). Specifically, it learns a small-ish embedding model that transforms raw audio-visual features for a given video into a vector representation that is useful for downstream recommendation tasks. This method seems to have been in use at YouTube at least until 2023, as it is mentioned in Singh 2023 - Better Generalization with Semantic IDs.
This paper is in the genre of metric learning, i.e. learning a similarity metric between videos. Similar to FaceNet, it uses a triplet contrastive loss to pull related videos together and push random videos apart. The contribution of the paper is a hierarchical clustering approach to sample smart negatives, which they show to be much more informative for learning than random negatives.
Setup
We start with a raw input representation $x_v \in \mathbb{R}^D$ of a given video $v$, where $x_v$ could be a concatenation of various raw input features or a representation from some off-the-shelf pretrained embedding model. We are also given a relational graph $G = (V, E)$ where each node is a video and each edge represents some relationship between two videos. The edge weights can be binary or real-valued. We may obtain such edge relationships from implicit feedback, e.g. how frequently two videos are co-watched, co-clicked, co-searched, etc. The aim is to learn a representation $f(x_v) \in \mathbb{R}^d$, where $d$ is much smaller than $D$, such that $\|f(x_u) - f(x_v)\|$ is small if videos $u$ and $v$ are related and large otherwise.
Method
The relational graph $G$ is pre-processed with a hierarchical clustering algorithm, producing an affinity tree $T$. A hierarchical algorithm is chosen so that we can later sample negatives with varying levels of difficulty. The paper uses Affinity Clustering, although it notes that any suitable hierarchical clustering algorithm would work as well. At a high level, affinity clustering is a parallel variant of Borůvka's minimum spanning tree algorithm: in each round, every cluster selects its lowest-weight outgoing edge and merges with the cluster at the other end. Some desirable properties of affinity clustering are that it (i) tends to produce clusters of similar size, which helps keep negative sampling consistent across clusters, and (ii) is easily parallelizable.
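To make the clustering step concrete, below is a minimal single-machine sketch of one affinity clustering round, treating edge weights as distances (lower = more related). The paper runs a distributed implementation; the union-find helper and the round structure here are my own simplification.

```python
class DSU:
    """Union-find over video ids, tracking the current cluster of each node."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def affinity_round(edges, dsu):
    """One Boruvka-style round: every cluster merges along its cheapest outgoing edge."""
    best = {}  # cluster root -> (weight, u, v) of its cheapest outgoing edge
    for u, v, w in edges:
        ru, rv = dsu.find(u), dsu.find(v)
        if ru == rv:
            continue  # edge lies entirely inside one cluster
        for root in (ru, rv):
            if root not in best or w < best[root][0]:
                best[root] = (w, u, v)
    for _, u, v in best.values():
        dsu.union(u, v)

# Repeating affinity_round until one cluster remains gives one level of
# the hierarchy per round; recording the merges yields the affinity tree T.
```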
Once we have the relational graph $G$ and the affinity tree $T$, training proceeds as follows:

- Construct triplets $(a, p, n)$:
  - Sample a random anchor video $a$ from all videos
  - Choose a positive video $p$ from its neighbours in $G$
  - Sample a negative video $n$ at a desired difficulty level from sibling clusters in $T$
- Compute the distances for each triplet $(a, p, n)$:
  - Anchor-positive distance: $d_{ap} = \|f(x_a) - f(x_p)\|^2$
  - Anchor-negative distance: $d_{an} = \|f(x_a) - f(x_n)\|^2$
- Perform online semi-hard negative mining:
  - In each mini-batch, resample negatives to choose the hardest semi-hard negative in the mini-batch for each row
  - Recall that a semi-hard negative is one where $d_{ap} < d_{an} < d_{ap} + \alpha$ for margin $\alpha$
  - So we are choosing negatives such that $d_{an}$ is as close as possible to $d_{ap}$ without being lower
  - Note that this means we need to compute $d_{an}$ for all anchor-negative pairs in the mini-batch
- Optimize the following objective:

$$\mathcal{L} = \sum_{(a, p, n)} \max(d_{ap} - d_{an} + \alpha, 0)$$
This is essentially the same procedure as FaceNet, except for the smart and dynamic choice of negatives. Also note that since we only train with semi-hard negatives, the inner term $d_{ap} - d_{an} + \alpha > 0$, so the $\max$ operator in the loss can actually be ignored.
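To illustrate the mining step, here is a minimal PyTorch sketch of the in-batch loss. The exact batching details are not spelled out in the paper, so the choice of letting every anchor pick from all sampled negatives in the mini-batch is my assumption.

```python
import torch

def semi_hard_triplet_loss(f_a, f_p, f_n, margin=0.5):
    """f_a, f_p, f_n: (B, d) embeddings of anchors, positives, and sampled negatives."""
    d_ap = (f_a - f_p).pow(2).sum(dim=1)   # (B,) anchor-positive squared distances
    d_an = torch.cdist(f_a, f_n).pow(2)    # (B, B) all anchor-negative pairs
    # Semi-hard condition: d_ap < d_an < d_ap + margin.
    semi_hard = (d_an > d_ap[:, None]) & (d_an < d_ap[:, None] + margin)
    # Hardest semi-hard negative = smallest qualifying d_an for each anchor.
    d_an_mined = torch.where(semi_hard, d_an, torch.full_like(d_an, float("inf")))
    d_an_best = d_an_mined.min(dim=1).values
    # Anchors with no semi-hard candidate get -inf inside and clamp to zero loss.
    return (d_ap - d_an_best + margin).clamp(min=0).mean()
```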
Denote the anchor video as $A$ and its leaf cluster in the affinity tree as $C(A)$. The smart negative sampling from sibling clusters is defined as:

- At L0, negatives are chosen from all the descendants of $C(A)$'s parent in the affinity tree, excluding $C(A)$ itself
- At L1, negatives are chosen from all the descendants of $C(A)$'s grandparent, excluding those under its parent
- And so on
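Under this reading, a sketch of the sampler might look like the following, where `parent` maps each cluster in the affinity tree to its parent and `members` maps each cluster to the videos beneath it; both are hypothetical data structures derived from the clustering output.

```python
import random

def sample_smart_negative(anchor_cluster, level, parent, members):
    """Sample a negative at the given level from the anchor's sibling clusters."""
    exclude = anchor_cluster           # subtree considered "too close" to the anchor
    ancestor = parent[anchor_cluster]  # L0 uses the parent, L1 the grandparent, ...
    for _ in range(level):
        exclude = ancestor
        ancestor = parent[ancestor]
    too_close = set(members[exclude])
    candidates = [v for v in members[ancestor] if v not in too_close]
    return random.choice(candidates)
```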
Experiments
The embedding network is simply a fully connected MLP with two layers of dimension 4,000 and 256 respectively. Each video is first preprocessed into an embedding representing its raw audio-visual features, using techniques like Fourier transforms and pre-trained audio-visual ResNets, and this raw embedding is fed into the MLP to produce the final representation.
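For reference, the network could be as small as the sketch below; the input width and the ReLU activation are placeholders of mine, since the paper only gives the two layer sizes.

```python
import torch.nn as nn

# Two-layer fully connected embedding network. The 1,152-d input is a
# placeholder for the raw audio-visual feature size; only the 4,000 and
# 256 widths come from the paper.
embedding_net = nn.Sequential(
    nn.Linear(1152, 4000),
    nn.ReLU(),
    nn.Linear(4000, 256),
)
```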
The relational graph $G$ is constructed by adding an edge between a pair of videos if they are frequently co-watched by multiple users. Specifics are not provided, but I presume the rule is something like co-watched by at least $n$ distinct users (sketched below). In other words, the edge rule sounds fairly stringent, to minimize the number of false edges. The videos are then split into training and test sets with a 7:3 ratio. Each mini-batch comprises 9,600 triplets and the margin $\alpha$ is set to 0.5. The learning rate starts at 0.1 and decays by a factor of 0.98 every 300k steps.
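As an illustration of the presumed edge rule, here is a small sketch that thresholds pairwise co-watch counts; the `min_users` cutoff is entirely hypothetical, since the paper does not state the rule.

```python
from collections import Counter
from itertools import combinations

def build_cowatch_graph(watch_histories, min_users=10):
    """watch_histories: iterable of per-user sets of watched video ids."""
    pair_counts = Counter()
    for history in watch_histories:
        for u, v in combinations(sorted(history), 2):
            pair_counts[(u, v)] += 1
    # Keep only pairs co-watched by at least min_users distinct users.
    return {pair for pair, count in pair_counts.items() if count >= min_users}
```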
The evaluation simulates a cold-start scenario, where both the query video and candidate videos are all taken from the unseen test set. For each query video, retrieval is performed by ranking all test-set videos by cosine similarity of their model embeddings. NDCG and MAP are then computed based on whether we successfully retrieved the videos that are relevant to the query according to the true relational graph $G$.
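A bare-bones version of this evaluation might look as follows; binary relevance and the cutoff `k` are my assumptions (the paper could just as well grade relevance by edge weight).

```python
import numpy as np

def ndcg_at_k(ranked_ids, relevant, k=10):
    """Binary-relevance NDCG; `relevant` is the query's true neighbour set in G."""
    gains = np.array([1.0 if vid in relevant else 0.0 for vid in ranked_ids[:k]])
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    ideal = discounts[: min(len(relevant), k)].sum()
    return (gains * discounts).sum() / ideal if ideal > 0 else 0.0

def evaluate_cold_start(query_emb, test_embs, test_ids, relevant):
    """Rank all unseen test videos against one unseen query by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    m = test_embs / np.linalg.norm(test_embs, axis=1, keepdims=True)
    order = np.argsort(-(m @ q))
    return ndcg_at_k([test_ids[i] for i in order], relevant)
```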
Findings:
- Smart negatives are significantly better than random negatives. To further prove this point, the authors tracked the % of negatives that remain unchanged after the hard-negative re-sampling step, and showed that a much larger % of smart negatives were retained (10% to 15%) than random negatives (close to 1/batch_size). Interestingly, the usage % of smart negatives actually increases steadily as training goes on. This could be because at the start, many smart negatives are too difficult for the model and do not qualify as semi-hard; as the model learns, it is able to handle more smart negatives and the usage % increases.
- Online semi-hard negative mining is important. The authors found that using smart negatives without the online mining step does not perform well. This could be because the smart negatives are too difficult for the model at the beginning, causing it to fail to learn. The authors suggest that an alternative to the online mining step is curriculum learning, where easier negatives are provided at the start and the difficulty is increased gradually.
- More difficult negatives work better. The authors tried training at the L0, L1, and L2 difficulty levels respectively, and found that training at L0 consistently does the best. This somewhat defeats the point of the hierarchical clustering, but I suppose it depends on the use case, and in other settings other difficulty levels may work better.
Takeaways
This paper proposes a scalable and easy-to-understand method for item representation learning. Obviously, it can be extended to other modalities such as text, so long as we find a reasonable way to represent the raw inputs. For text, we could probably directly fine-tune some BERT encoder using the same procedure.
Nevertheless, note that this paper alone would not provide optimal performance if we used its embeddings directly for retrieval. One big reason is that the relational graph is undirected, so we learn only symmetric relationships, whereas in retrieval, item sequence often matters. For example, people often progress from beginner videos to more advanced videos on a particular subject, and this method cannot capture such directed relationships. Hence the embeddings from this paper have to be fed into another retrieval or ranking model for optimal performance.