Chux's Notebook

This is a collection of my notes which I refer to on a regular basis. Hope it is also helpful for others stumbling by.

I think of these notes as mountaineering pegs. Often at the time of studying a particular paper or topic, the concepts are clear. Over time, however, only a vague impression remains, and I can no longer tussle with the issues. So these notes serve as pegs that I hope to use to re-scale an old hill, or at least scale it with less effort than starting from scratch.

Generally, I highlight main points like so, and put in-line code or numbers like so.

Built with mdBook.

About Me

I am working as a Lead Data Scientist at GovTech Singapore, on a team called JumpStart. We power search and recommender systems for Singapore Government products like the job searching portal MyCareersFuture or the CareersFinder tool to discover next steps for one's career.

My LinkedIn profile is here.

Current Focus

2025 Goals

GNN-based approach for search and recommendation
- Use past sequential history of items as user representation
- Transformer-based reranking
- Transformer-based user encoding for ANN retrieval
Multi-task, multi-purpose embeddings
- For retrieval and reranking
- Across various services (jobs, courses, skills)

Research

Optimizing LLM explanations based on implicit feedback
- How to optimize an LLM to provide better recommendation explanations by fine-tuning on implicit feedback?
Replacing BM25
- How to design a search system that matches BM25 performance at cold start and gradually improves with more data, without dropping below BM25 performance?
Precise Retrieval
- The common two tower approach to embedding retrieval leaves much to be desired
  - There is no natural score threshold at which items are deemed irrelevant. Traditionally, classifiers have a 0.5 score cut-off.
  - Embedding retrieval tends to retrieve unrelated items. This is a well documented problem. For example, Nike shoes retrieves Adidas Shoes.

Recent Trends in Search & Recommendations

This document summarizes some important ideas in both recommender systems and search systems that have emerged in the past few years. It aims to be a helpful read for someone trying to get acquainted with modern practices in these fields.

What we will cover:

Embedding learning. Learning embeddings for items and users is an upstream task whose outputs are used for downstream retrieval and ranking.
Recommendations
Search retrieval and ranking

Embedding Learning

Embedding learning is a foundational tool that powers all recommendation and search systems. Typically, users and items are represented by one or more embeddings that get fed into neural networks for prediction and recommendation tasks.

Whilst many off-the-shelf LLM embedding models exist today, the best performing embeddings are often context-specific. For example, from the perspective of a job-seeker, we may want the embeddings for Physical Education Teacher and Gym Trainer to be similar, as the focus is on similar job functions / skills. However, from the perspective of a HR personnel trying to do manpower planning, we may want Physical Education Teacher to be more similar to Executive at Singapore Sports School, since the focus is more on eligible pathways for internal rotation (btw, I made this example up). Hence, there is a need to effectively learn embedding models in a specific product setting. Lee 2020 describes the task like so:

(the goal is to) Learn embeddings that preserve the item-to-item similarities in a specific product context.

It is common practice to have an upstream pipeline to learn item embeddings that get re-used for multiple downstream tasks (let's call this a "universal" embedding). For example, YouTube uses a common set of video embeddings used by all its models [Lee 2020]. To adapt the embeddings to a specific task, the task model itself can always add a simple translation MLP layer to translate the universal embedding so that the embeddings can be more performant in each task setting.

Another way is to have each retrieval and ranker model learn its own embeddings for its specific task.

However, this leads to redundancy to store and update each set of embeddings, often in some in-memory feature store, which is costly.
This approach is also often not as performant as learning a universal set of embeddings and then adapting the embeddings to each task via a small learned MLP layer.
Training from a frozen set of universal embeddings is also much more compute efficient, as we do not need to forward/backward propagate into a potentially large language model that generates the embeddings from text or some other modality.

We can think of upstream training of an embedding model as analogous to the common practice of pre-training large language models in a semi-supervised manner on a large text corpus. The LLM is then fine-tuned for specific applications.

Although representing an item with multiple embeddings (e.g. using ColBERT) has been shown to outperform a single embedding, most companies seem to adopt a single embedding to reduce engineering complexity. For example, Pinterest briefly experimented with multiple embeddings per item but reverted to a single embedding to reduce costs.

There are 3 primary ways of embedding representation in the literature:

ID-Based: This method maps an item's ID directly to an embedding.
- $f (item_id) \to embedding$
- This is often used in a task-specific model, but not usually used for a universal embedding model shared by downstream tasks, as we cannot represent new items that have no user activity
Content-Based: This is the most common approach, generating embeddings from an item's content, such as its text, images, or audio
- $f (item_text, item_image, ...) \to embedding$
- Textual content is usually passed into an LLM to generate embeddings
- For other modalities, task-agnostic preprocessing may be employed. For example, Lee 2020 samples frames from a video and runs them through a ResNet model to create raw embeddings before they are fed into the main embedder model
Graph-Based: This method can be seen as an extension of the content-based approach. It generates an embedding from the attributes of an item's neighbors in a network.
- $f (attributes of neighbours of item) \to embedding$
- This is particularly effective in social network products where network interactions are crucial, such as on LinkedIn, Pinterest, or Facebook
- While powerful, this method is more computationally expensive at both training and inference times because it requires access to the graph to compute embeddings

Embeddings are typically trained in a contrastive learning manner, which simply means that we want to encourage related items (e.g. videos frequently co-watched) to have high similarity scores and unrelated items (e.g. randomly sampled video pairs) to have low scores.

Thus, we can think of contrastive learning as comprising 3 main components:

Positive Sampling. How we mine for related items from data is a non-trivial but oft-neglected topic. Usually some behavioural statistics are used, e.g. Lee 2020 uses videos frequently co-watched by users to determine positive pairs. The guiding principle is to err on the side of being stricter in the selection of positive samples to minimize the appearance of false positives in the data (which are more harmful than false negatives).
- In JumpStart, we have explored using the Pointwise Mutual Information (PMI) score to discard item pairs with a low PMI score. Recall that the PMI score is computed as: $\text{pmi}(item_a, item_b) = \log_2 \frac{p(item_a \text{ co-occurring with } item_b)}{p(item_a) \cdot p(item_b)}$
- Considering co-occurrence statistics focuses only on the numerator. Normalizing by the marginal probabilities of the items reduces the incidence of popular items being labelled as related when they are actually not
- In my experience, the bottom 10% of co-interacted items by pmi score can be safely discarded
- Another subtle issue is that power users contribute disproportionately to positive pairs.
  - If a given user interacts with $n$ items, then he/she generates $\binom{n}{2} = \frac{n(n-1)}{2}$ item co-occurrence pairs
  - The quadratic scaling in $n$ means that the item interactions from power users may dominate that of all other users, but these are also the least informative item pairs :'( as power users may have less specific item preferences
  - Hence it is advisable to discard power users beyond a particular percentile to improve positive sampling
Negative Sampling. Much of the research literature focuses on this issue. The reason is that as research data (e.g. Movielens) is fixed, the positive pairs are usually pre-determined in the data. But the negative samples are usually not pre-specified in the data, so methods are devised to sample:
- Random sampling. This is the simplest and most common approach: for a given anchor item, we sample negatives uniformly from the item catalogue, omitting the positive item(s) for the anchor
- Impressed but not positive. This is also a simple approach if we have access to item impression data. The impressed items are often hard negatives because they were deemed to be relevant by the existing recommendation system and surfaced to the user.
  - The potential danger is that the model may only learn to distinguish negatives in a specific retrieval setting and is not robust to changes in the retrieval model
- Hard negative mining. Various other methods exist to mine for hard negatives, e.g. using BM25 score or embedding similarity. We discuss two ideas on this below in a bit more detail.
- Usually, it is best to mix negatives from all of the above so that the model learns to distinguish positives from a wide range of negatives
Loss. Again, much ink has been spilt on this topic, but there are really only two main losses that we need to know:
- Triplet loss.
- Cross entropy (or InfoNCE) loss.
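To make the positive sampling ideas above concrete, here is a minimal sketch of PMI scoring combined with power-user filtering. The function name `pmi_scores`, the `max_items_per_user` cut-off and the toy histories are all made up for illustration, and probabilities are estimated as the share of (kept) users, which is only one of several reasonable choices.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(user_histories, max_items_per_user=50):
    """Compute PMI for item pairs that co-occur in user histories.

    Probabilities are estimated over users: p(a) is the share of kept users
    who interacted with a, and p(a, b) the share who interacted with both.
    """
    # Drop power users, whose n(n-1)/2 pairs would otherwise dominate the counts.
    kept = [set(h) for h in user_histories if len(set(h)) <= max_items_per_user]
    n_users = len(kept)
    item_counts = Counter(item for h in kept for item in h)
    pair_counts = Counter(
        pair for h in kept for pair in combinations(sorted(h), 2)
    )
    return {
        (a, b): math.log2(
            (c / n_users) / ((item_counts[a] / n_users) * (item_counts[b] / n_users))
        )
        for (a, b), c in pair_counts.items()
    }

histories = [
    ["popular", "niche_a", "niche_b"],
    ["popular", "niche_a", "niche_b"],
    ["popular", "other_1"],
    ["popular", "other_2"],
    ["x" + str(i) for i in range(100)],  # power user, discarded
]
scores = pmi_scores(histories)
```

In this toy data the popular item co-occurs with everything, so its PMI with `niche_a` is zero, while the two niche items get a positive score. Pairs can then be ranked by score and the lowest-scoring fraction discarded.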

WORK IN PROGRESS

To improve training, two ideas are critical:

Semi-hard Negative Mining: Originating from FaceNet, this technique addresses the problem of "false negatives" (items that are incorrectly labeled as negative) which can confuse the model. It focuses on training examples that are in a "goldilocks zone": not too easy and not too hard. Lee (2020) demonstrated that mining for the hardest semi-hard negatives within a mini-batch is a computationally cheap and critical step for model performance.
Smart Negative Sampling: Since random sampling often fails to find challenging negative examples, methods are devised to mine for them. This can involve using impression data, surrogate algorithms like BM25, or approximate-nearest-neighbor search. Lee (2020) also introduced using hierarchical clustering on a relational graph to find negatives from nearby clusters, which can be combined with semi-hard negative mining to prevent issues with false negatives.
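As a rough illustration of the "goldilocks zone", the sketch below picks the hardest semi-hard negative for one anchor, using the FaceNet-style band d(a, p) < d(a, n) < d(a, p) + margin. The function name, the margin value and the use of plain Euclidean distance are my own assumptions, not details taken from Lee (2020).

```python
import numpy as np

def hardest_semi_hard_negative(anchor, positive, candidates, margin=0.2):
    """Return the index of the closest candidate that is still farther than the positive.

    Semi-hard negatives satisfy d(a, p) < d(a, n) < d(a, p) + margin.
    Returns None when no candidate falls in that band.
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(candidates - anchor, axis=1)
    in_band = (d_neg > d_pos) & (d_neg < d_pos + margin)
    if not in_band.any():
        return None
    # Among semi-hard candidates, the hardest is the one nearest to the anchor.
    return int(np.argmin(np.where(in_band, d_neg, np.inf)))

anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])  # d(a, p) = 1.0
candidates = np.array([
    [0.5, 0.0],   # closer than the positive: likely a false negative, skipped
    [1.1, 0.0],   # semi-hard
    [1.15, 0.0],  # semi-hard but easier
    [3.0, 0.0],   # too easy
])
idx = hardest_semi_hard_negative(anchor, positive, candidates)
```

Candidates that sit closer to the anchor than the positive are skipped on purpose, since those are the ones most likely to be false negatives.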

From Matrix Factorization to Sequential Recommendations

Traditional matrix factorization methods learn a single, static user representation to predict their entire interaction history. This is problematic for users with diverse or evolving interests, as the model learns a "diluted" representation. While time-decayed ratings can help by giving more weight to recent interactions, the model still fails to learn the sequential nature of those interactions fully.

Sequential recommenders address this by mirroring the actual recommendation task: given a user's history up to time t, predict their next action at time t+1. The user's representation is therefore dynamic and based on their recent actions. There are three main approaches to this:

Markov Approach: A simple but often effective method that uses the last interacted item as the primary feature to predict the next one.
Pointwise Approach: This method involves "rolling back" features to what was available at the time of each action in the training data. While this explicitly computes every user-item interaction, it requires significant data-loading effort and careful feature engineering. Google chose this approach for its simplicity in distributed model training.
Sequential Modelling Approach: This is analogous to language modeling, where the model learns to predict the next item in a sequence, learning from many items simultaneously.
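The Markov approach is simple enough to sketch in a few lines: count item-to-item transitions and recommend the most frequent successors of the last item. The function names and the toy sessions are hypothetical.

```python
from collections import Counter, defaultdict

def fit_transitions(sequences):
    """Count how often each item is immediately followed by each other item."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for prev_item, next_item in zip(seq, seq[1:]):
            transitions[prev_item][next_item] += 1
    return transitions

def recommend_next(transitions, last_item, k=2):
    """Return up to k most frequent successors of the last interacted item."""
    return [item for item, _ in transitions[last_item].most_common(k)]

sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger"],
    ["laptop", "mouse"],
]
model = fit_transitions(sessions)
top = recommend_next(model, "phone")
```

Items that were never followed by anything (such as `mouse` here) simply get an empty recommendation list, which is where a fallback like popularity would come in.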

Case Studies in Sequential Recommendations

Covington (2016) - Pointwise Approach at YouTube:

YouTube's influential 2016 paper detailed a pointwise approach for both retrieval and ranking.

Retrieval: The model used features like the average embedding of the last 50 video watches and search queries, along with user features like gender. A small MLP was trained with a sampled softmax loss to predict the next video watch. A crucial finding was that predicting the next watch was far more effective in online A/B tests than predicting a randomly held-out watch from the user's history.

[Diagram: Pointwise Approach for Retrieval - Covington 2016 architecture from slide 13]

"Example Age" Feature: To capture the time-sensitive nature of video popularity, an "Example Age" feature was introduced. This simple feature encodes when an action took place, allowing the model to learn the "popularity lifecycle" of items, where popularity often spikes shortly after upload and then fades.

[Diagram: Chart showing Class Probability vs. Days Since Upload with and without the "Example Age" feature, from slide 15]

Ranking: The ranking model was similar to the retrieval model but incorporated many more features, particularly those related to user-item historical interactions (e.g., "how many videos from this channel has the user watched?"). The paper noted that even simple transformations of features, like square roots, improved performance, highlighting the reality of manual feature engineering.

[Diagram: Pointwise Approach for Ranking - Covington 2016 architecture from slide 16]

SASRec (2018) - The Transformer-Based Approach:

SASRec (Self-Attentive Sequential Recommendation) applied the transformer architecture to recommendations, treating a user's interaction history as a sequence of items to be predicted, much like words in a sentence.

Architecture: Using causal masking, the transformer embedding at each time step encodes all previous interactions. The training loss is typically a binary cross-entropy loss comparing the dot product of the model's prediction with the target item embedding against a random negative sample. This approach significantly outperforms traditional matrix factorization methods.
Position Embeddings: Since transformers have no inherent notion of sequence, SASRec adds a learned position embedding to each item in the input sequence. This helps the model weigh recent items more heavily, as shown by visualizations of the attention matrix.
Efficiency and Power: This paradigm is more efficient than the pointwise approach because it learns from N time steps simultaneously for each user. It also reduces the need for manual feature engineering, as the model can implicitly learn user-item interaction features.
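The effect of causal masking can be checked with a stripped-down attention layer. This is a sketch under heavy simplification: a single head with no learned query/key/value projections and no position embeddings, so it is not the SASRec architecture itself, only the masking idea.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head self-attention where position t attends only to positions <= t."""
    seq_len, dim = x.shape
    scores = x @ x.T / np.sqrt(dim)               # (seq_len, seq_len)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)    # block attention to later items
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
items = rng.normal(size=(5, 8))                   # 5 interactions, 8-dim embeddings
out, attn = causal_self_attention(items)

# Changing the last item must leave the encodings of earlier steps untouched.
changed = items.copy()
changed[-1] = rng.normal(size=8)
out_changed, _ = causal_self_attention(changed)
```

Because the encoding at step t never sees later items, every position in the sequence can serve as a training example in one forward pass, which is where the efficiency gain over the pointwise approach comes from.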

[Diagram: Visualization of Average Attention Matrix from SASRec, with and without Positional Embeddings, from slide 21]

A subsequent paper, BERT4Rec (2019), proposed a "masked token prediction" task similar to BERT, allowing the model to use future items as input. While it initially appeared superior, Klenitskiy (2023) showed this was due to a difference in loss functions; when both use the same sampled softmax loss, SASRec consistently performs better and trains faster. The conclusion is that SASRec with sampled softmax loss is the current industry standard.

PinnerFormer (2022) - Long-Term Sequential Modeling at Pinterest:

PinnerFormer evolved from the need to switch from real-time user embedding computation to less costly daily batch jobs.

Long-Term Loss: The key idea is that at each time step, instead of only predicting the very next item (like SASRec), the model predicts a random item from the user's future interactions over a longer window (e.g., 28 days). This creates more stable embeddings and surprisingly, PinnerFormer beats SASRec even when retrained in real-time.
Advanced Features: PinnerFormer uses pre-computed PinSage (graph-based) embeddings as a base. It distinguishes between different action types (e.g., "Pin Save" vs. "Long Click") by concatenating a learned "action" embedding to the item embedding. It also heavily relies on time features, using Time2Vec to encode timestamps, noting a significant performance drop without them.
Training Details: The model uses a combination of in-batch and random negatives with Log-Q correction, and ensures each user in a mini-batch is weighted equally to avoid bias from users with very long histories.
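The difference between the next-item target and the long-term target can be sketched as below. The event format, function names and the handling of empty windows are assumptions for illustration; the paper's actual sampling procedure has more detail than this.

```python
import random

DAY = 24 * 60 * 60

def sample_long_term_target(events, t, window_days=28, rng=random):
    """Pick a random item from the window after events[t], or None if there is none.

    events: list of (timestamp_seconds, item_id) sorted by timestamp.
    """
    anchor_ts = events[t][0]
    horizon = anchor_ts + window_days * DAY
    future = [item for ts, item in events[t + 1:] if ts <= horizon]
    return rng.choice(future) if future else None

def next_item_target(events, t):
    """SASRec-style target: the very next interaction."""
    return events[t + 1][1] if t + 1 < len(events) else None

events = [(0, "a"), (1 * DAY, "b"), (10 * DAY, "c"), (40 * DAY, "d")]
rng = random.Random(0)
target = sample_long_term_target(events, 0, rng=rng)
```

For the first event, the next-item target is always `b`, whereas the long-term target is drawn from `b` and `c` (with `d` falling outside the 28-day window).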

[Diagram: Illustration of PinnerFormer's "Long Term Loss" compared to SASRec from slide 25]

Multi-Task Learning in Recommendations

Recommendation systems often need to optimize for multiple objectives (e.g., clicks, saves, purchases). Several architectures address this, primarily for the pointwise approach:

ESMM (2018): This model explicitly encodes the relationship that a conversion can only happen after a click (P(convert) = P(click) * P(convert|click)). This allows the conversion model to benefit from the abundant click data while learning from the sparse conversion data.

[Diagram: ESMM architecture from slide 29]
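The ESMM factorization can be written out in a few lines. In this sketch the two logits stand in for the outputs of the click and conversion towers, and the loss is the sum of a click log loss and a click-and-convert log loss, both computed over all impressions. Function names are mine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def esmm_probabilities(ctr_logit, cvr_logit):
    """Combine the two tower outputs as p(convert) = p(click) * p(convert | click)."""
    p_click = sigmoid(ctr_logit)
    p_convert_given_click = sigmoid(cvr_logit)
    return p_click, p_click * p_convert_given_click

def esmm_loss(ctr_logit, cvr_logit, clicked, converted):
    """Sum of log losses on click and on click-and-convert."""
    p_click, p_convert = esmm_probabilities(ctr_logit, cvr_logit)

    def log_loss(p, y):
        return -(y * math.log(p) + (1 - y) * math.log(1 - p))

    return log_loss(p_click, clicked) + log_loss(p_convert, converted)

p_click, p_convert = esmm_probabilities(ctr_logit=0.0, cvr_logit=0.0)
loss = esmm_loss(0.0, 0.0, clicked=1, converted=0)
```

Note that the conversion-given-click tower never receives a direct label: it is trained only through the product, which is how it gets to learn from every impression rather than just the clicked ones.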

Shared-Bottom Architecture: A common approach where bottom layers of a neural network are shared across tasks, with separate "towers" or heads for each specific task prediction. The final recommendation is a manually tuned weighted average of the logits from each head.

[Diagram: Shared-Bottom Model architecture from slide 30]

Mixture of Experts (MoE): An improvement over the shared-bottom model, MoE uses specialized "expert" layers. For each task, a gating network learns a weighted average over the outputs of these experts, allowing the model to modularize and handle potentially conflicting tasks more effectively.

[Diagram: Multi-gate Mixture-of-Expert (MoE) model architecture from slide 31]
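A forward pass of the multi-gate variant might look like the sketch below: experts are shared, and each task has its own softmax gate over them. Shapes, names and the single-layer ReLU experts are simplifying assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mmoe_forward(x, expert_weights, gate_weights):
    """Return one mixed representation per task.

    x:              (batch, d_in)
    expert_weights: (n_experts, d_in, d_hidden), shared by all tasks
    gate_weights:   (n_tasks, d_in, n_experts), one gate per task
    """
    # Each expert is a single ReLU layer here.
    experts = np.maximum(np.einsum("bi,eih->beh", x, expert_weights), 0.0)
    # Per-task gate: softmax over experts, shape (batch, n_tasks, n_experts).
    gates = softmax(np.einsum("bi,tie->bte", x, gate_weights))
    # Weighted average of expert outputs for each task.
    return np.einsum("bte,beh->bth", gates, experts), gates

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
expert_weights = rng.normal(size=(3, 16, 8))
gate_weights = rng.normal(size=(2, 16, 3))
task_inputs, gates = mmoe_forward(x, expert_weights, gate_weights)
```

Each slice `task_inputs[:, t, :]` would then be fed into the tower for task t. With a single expert and no gates, this collapses back to the shared-bottom model.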

PinnerFormer's Approach: For sequential models, PinnerFormer took a simpler route, finding that treating all positive actions (repins, clicks, etc.) as equal signals was the best all-around strategy for their use case.

Training Objective	10s Closeup	10s Click	Repin	All
10s Closeup	0.27	0.02	0.09	0.17
10s Click	0.01	0.49	0.01	0.12
Repin	0.15	0.03	0.17	0.13
Multi-task	0.23	0.28	0.13	0.23

II. Trends in Search Systems

While related to recommendations, search systems have distinct challenges and characteristics in practice.

Search vs. Recommendation

Relevance is Stricter: In search, there are objectively correct and incorrect results. Irrelevant results must not be surfaced, as they erode user trust.
Latency is Critical: Users expect instant search results, and because queries are not known in advance, results cannot be pre-computed offline the way recommendations often are.
LLMs Have a Larger Impact: Search benefits more directly from advances in LLMs due to its text-centric nature and the overlap with fields like Information Retrieval and Question-Answering.

Search tasks exist on a spectrum from pure relevance (like academic Q&A) to pure personalization (which is essentially recommendation). E-commerce search lies in the middle, where relevance is key, but personalization plays a major role in user satisfaction given the large number of potentially relevant items for a broad query.

Academic vs. Industry Search

Academic research often focuses on Question-Answering (Q&A) datasets like MSMarco or TriviaQA, where queries are well-formed questions with a single correct answer. In this setting, pre-trained LLMs perform exceptionally well, and it is easier to beat traditional baselines like BM25.

Standard Academic Models:

Ranking (Cross-Encoders): The standard is a cross-encoder architecture (often called MonoBERT), where a pre-trained language model like BERT takes a [Query, Document] pair as input and outputs a relevance score. Later papers showed that using a sampled softmax loss with many random negatives is more effective than the original binary cross-entropy loss. As models scale, so does performance, with RankLlama showing stronger results than older BERT or T5-based rankers. A 2023 paper by Sun even showed that zero-shot prompting with ChatGPT, using a "permutation generation" method and sliding windows, can achieve state-of-the-art performance on Q&A tasks without any fine-tuning.

[Diagram: Standard Cross Encoder / MonoBERT architecture from slide 39]

Ranking Model	MRR@10 for MSMarco
monoBERT	37.2
RankT5	43.4
LLaMA2-7b	44.9

Retrieval (Two-Tower Models): The standard is Dense Passage Retrieval (DPR), which uses two separate encoders, one for the query and one for the document (passage). The model is trained with a sampled softmax loss to make the similarity between a query and its positive document (a dot product in the original DPR paper, though cosine similarity is also common) high, relative to negative documents. These negatives often include "in-batch" negatives (other positive documents in the same training batch) and "hard" negatives mined using BM25. To improve these retrievers, a common technique is distillation, where a powerful but slow cross-encoder (the "teacher") is used to train a faster two-tower model (the "student").

Retrieval Model	MRR@10 for MSMarco
BM25	18.4
ANCE (BERT-based)	33.0
LLaMA2-7b	41.2

[Diagram: Illustration of a Mini Batch for Dense Passage Retrieval from slide 42]
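The in-batch negatives trick boils down to a softmax over the query-by-document similarity matrix of a mini-batch, with the diagonal as the positives. Below is a minimal sketch; the `temperature` argument and the unit-normalized toy embeddings are my additions and not part of the original DPR recipe.

```python
import numpy as np

def in_batch_softmax_loss(query_emb, doc_emb, temperature=1.0):
    """Mean cross-entropy where row i's positive is doc i and all other docs are negatives."""
    logits = query_emb @ doc_emb.T / temperature          # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 32))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

aligned_queries = docs.copy()                       # each query matches its own doc
shuffled_queries = np.roll(docs, shift=1, axis=0)   # each query matches the wrong doc

loss_aligned = in_batch_softmax_loss(aligned_queries, docs, temperature=0.05)
loss_shuffled = in_batch_softmax_loss(shuffled_queries, docs, temperature=0.05)
```

The aligned batch gives a much lower loss than the shuffled one. Hard negatives (e.g. mined with BM25) would simply be appended as extra columns of the similarity matrix.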

Complexities of Industry Search Systems

Industry search systems are significantly more complex, needing to balance relevance, latency, and personalization.

Case Study: Pinterest Search (2024)

Teacher-Student Distillation: Pinterest fine-tunes a large teacher LLM (Llama-3-8B) on 300k human-labeled query-pin pairs. To improve the teacher's performance, the text representation for each pin is enriched with its title, description, AI-generated image captions, high-engagement historical queries, and common board titles it's saved to.

Model	Accuracy	AUROC 3+/4+/5+
SearchSAGE	0.503	0.878/0.845/0.826
mBERT_base	0.535	0.887/0.864/0.861
T5_base	0.569	0.909/0.884/0.886
mDeBERTaV3_base	0.580	0.917/0.892/0.895
XLM-RoBERTa_large	0.588	0.919/0.897/0.900
Llama-3-8B	0.602	0.930/0.904/0.908

Scaling Up with Distilled Data: This powerful teacher model then generates labels for 30 million un-labeled data points. A small, fast student MLP model is then trained on this massive distilled dataset. This student model recovers up to 91% of the teacher's accuracy and, because the teacher was multilingual, the student model also learns to handle multiple languages effectively despite being trained only on English human-annotated data.

Training Data	Accuracy	AUROC 3+/4+/5+
0.3M human labels	0.484	0.850/0.817/0.794
6M distilled labels	0.535	0.897/0.850/0.841
12M distilled labels	0.539	0.903/0.856/0.847
30M distilled labels	0.548	0.908/0.860/0.850

[Diagram: Teacher-Student distillation process at Pinterest from slide 47]

Case Study: Baidu Search (2021, 2023)

Bootstrapping Labels with Weak Signals: Instead of training a large teacher model, Baidu uses a simple decision tree model trained on "weak relevance signals" (e.g., click-through rates, long-click rates, skip rates) to predict a "calibrated relevance score" for unlabeled data. Training a cross-encoder on this calibrated score was shown to be far more effective than training on raw clicks alone.

[Diagram: Baidu's decision tree model over weak relevance signals from slide 48]

Query-Sensitive Summary: To avoid the latency of processing full documents, Baidu developed an algorithm to extract the most relevant sentences from a document based on word-match scores with the query. This summary is then used as input for the ranker, significantly boosting offline performance.
Modular Attention for Speed: To accelerate their cross-encoder, Baidu uses a modular attention mechanism. Self-attention is first applied independently within the [query-title] segment and the document summary segment for several layers, before a few final layers of full cross-attention are applied. This structure reduces computation and increases inference speed by 30%.
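To give a feel for the query-sensitive summary idea, here is a deliberately naive sketch that scores each sentence by how many of its words appear in the query and keeps the top few in their original order. The tokenizer, the scoring rule and the function names are my own stand-ins; Baidu's actual algorithm is more involved than this.

```python
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def query_sensitive_summary(query, document, max_sentences=2):
    """Keep the sentences that share the most words with the query, in original order."""
    query_terms = set(tokenize(query))
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    scored = [
        (sum(tok in query_terms for tok in tokenize(s)), i)
        for i, s in enumerate(sentences)
    ]
    # Highest score first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:max_sentences]
    return " ".join(sentences[i] for _, i in sorted(top, key=lambda pair: pair[1]))

doc = (
    "Our store opened in 1990. "
    "We sell running shoes for trail and road. "
    "Parking is free on weekends. "
    "Trail running shoes need good grip."
)
summary = query_sensitive_summary("trail running shoes", doc)
```

The ranker then only needs to read the short summary instead of the full document, which is where the latency saving comes from.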

[Diagram: Baidu's Modular Attention architecture from slide 50]

From Relevance to Satisfaction: A 2023 paper from Baidu argued that relevance alone is insufficient and that models must optimize for user satisfaction. They engineered features for quality (e.g., number of ads), authority (e.g., PageRank), and recency, and even built a query analyzer to determine if a query was authority or recency-sensitive. These features, along with relevance, were used to train a model that boosted performance by 2.5% over a pure relevance model. Notably, numeric satisfaction features were simply normalized and appended directly into the text input for the LLM to process.

Case Study: Google DCNv2 (2020)

Google's Deep & Cross Network V2 (DCNv2) addresses a weakness in standard MLP rankers: they are not ideal for explicitly modeling feature interactions (e.g., user_is_interested_in_topic_X AND item_is_about_topic_X). DCNv2 introduces explicit "cross layers" that are designed to generate these feature crosses efficiently before feeding them into a standard deep network. This architecture provides performance gains over a standard MLP with the same number of parameters.

[Diagram: Google DCNv2 architecture with Cross Layers from slide 53]
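A single cross layer computes $x_{l+1} = x_0 \odot (W x_l + b) + x_l$, which can be sketched directly in NumPy. The dimensions and random weights below are arbitrary, and the low-rank and mixture variants from the paper are left out.

```python
import numpy as np

def cross_layer(x0, xl, W, b):
    """One cross layer: x_{l+1} = x0 * (W @ xl + b) + xl, with an elementwise product."""
    return x0 * (xl @ W.T + b) + xl

rng = np.random.default_rng(0)
dim = 4
x0 = rng.normal(size=(2, dim))    # batch of 2 input feature vectors
W1, b1 = rng.normal(size=(dim, dim)), np.zeros(dim)
W2, b2 = rng.normal(size=(dim, dim)), np.zeros(dim)

x1 = cross_layer(x0, x0, W1, b1)  # contains 2nd-order terms x0[i] * x0[j]
x2 = cross_layer(x0, x1, W2, b2)  # contains terms up to 3rd order
```

Each extra layer raises the maximum interaction order by one, and the residual `+ xl` keeps the lower-order terms around. The stacked output is what gets passed on to the deep network.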

Case Studies: Embedding-Based Retrieval (EBR) at Facebook and Taobao

EBR systems are used for semantic search but can struggle with relevance compared to traditional keyword-based lexical matching.

Facebook's Hybrid Approach: Facebook integrates nearest-neighbor search directly into its in-house lexical search engine. A query can contain both standard lexical terms (e.g., location:seattle) and an nn operator that finds items within a certain embedding-distance radius. This ensures documents fulfill both lexical and semantic requirements. Facebook's query and document encoder towers use a mix of term-based (char-3-grams) and LLM-based representations.
Taobao's Relevance Filters: Taobao found their EBR system sometimes returned irrelevant but semantically close items (e.g., a search for "Nike shoes" returning "Adidas shoes"). Their solution was to apply explicit lexical boolean filters on top of the ANN search (e.g., ANN search + Brand:Nike + Product:Shoes). While this sacrifices some of the "fuzzy" semantic search capability, it guarantees relevance. Taobao also found that optimizing for engagement (clicks) does not guarantee relevance, and ultimately chose to accept slightly lower engagement to ensure a higher percentage of relevant results. Their query representation is enhanced by using cross-attention between the query and the user's historical interactions to personalize the query embedding.
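The effect of putting boolean filters on top of nearest-neighbor search can be shown with a toy catalogue. In this sketch a brute-force cosine ranking stands in for the ANN index, and the catalogue, vectors and attribute names are invented for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def filtered_search(query_vec, required, catalogue, k=2):
    """Rank by embedding similarity, keeping only items whose attributes match `required`."""
    eligible = [
        item for item in catalogue
        if all(item["attrs"].get(key) == value for key, value in required.items())
    ]
    ranked = sorted(eligible, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return [item["name"] for item in ranked[:k]]

catalogue = [
    {"name": "Nike running shoes", "vec": [0.9, 0.1],
     "attrs": {"brand": "nike", "product": "shoes"}},
    {"name": "Adidas running shoes", "vec": [0.95, 0.05],
     "attrs": {"brand": "adidas", "product": "shoes"}},
    {"name": "Nike socks", "vec": [0.2, 0.8],
     "attrs": {"brand": "nike", "product": "socks"}},
]
query_vec = [1.0, 0.0]  # pretend this is the embedding of "Nike shoes"
unfiltered = filtered_search(query_vec, {}, catalogue, k=1)
filtered = filtered_search(query_vec, {"brand": "nike", "product": "shoes"}, catalogue, k=1)
```

Without the filter the Adidas item wins on embedding similarity alone; with the brand and product filters only the Nike shoes can be returned.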

Experiment	Engagement %	Relevance %
Baseline	85.6%	71.2%
Baseline + personalization	86.4%	71.4%
Baseline + Lower temperature	85.5%	79.0%
Baseline + all	84.7%	80.0%

LightGBM Memory

TLDR: Solutions for memory issues during training of a LightGBM model:

Cast numeric values into np.float32 to save data space
Keep num_leaves <= 100 (or some reasonable number)
If feature dimension is large (e.g. M >= 1000), try colsample_bytree = 0.1, although this might not help too much if the bottleneck is during bin histogram construction (rather than the actual training)
If number of rows and features are both large (e.g. N >= 1_000_000 and M >= 1000, i.e. >= 4 GB) then the data itself is taking up a lot of memory. It would be worthwhile to put the data on disk and use lgb.Dataset by providing the file path as the data argument instead. Then, we should set two_round=True for the train method params. The explanation for two round is rather unclear, but it should help with memory when Dataset is loading from disk (rather than from a numpy.array in memory). For this option, I had some trouble getting it to work with categorical columns.
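The first tip is easy to verify: NumPy defaults to float64, so casting to float32 halves the footprint of the feature matrix at the cost of a small rounding error. The array shape below is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
data64 = rng.normal(size=(10_000, 50))  # NumPy defaults to float64
data32 = data64.astype(np.float32)      # same values, half the bytes

size64_mb = data64.nbytes / 1e6
size32_mb = data32.nbytes / 1e6
max_abs_error = float(np.abs(data64 - data32).max())
```

The rounding error is tiny relative to the values themselves, which matters little for tree models since numeric features get bucketed into bins anyway.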

For more details, refer to the experiments below.

Experiments

I often run into memory issues running LightGBM. So here are some experiments to measure memory usage and understand how hyperparameters can affect memory usage.

The function of interest is the fit method for the learn to rank task.

import lightgbm as lgb

def f():
    # `params`, `data`, `y` and `groups` are module-level variables (defined below).
    model = lgb.LGBMRanker(**params, objective="lambdarank")
    model.fit(
        X=data,
        y=y,
        group=groups,
    )

The memory usage is measured using the memory_profiler module, which checks the memory usage at .1 second intervals. The maximum is then taken to represent the maximum memory usage of the fit function. We also take note of the size of the data itself (using data.nbytes) and subtract that away to get closer to the LightGBM memory usage. Do note that this memory profiling is not very rigorous, so the results are best for relative comparison within each experiment rather than across experiments.

from memory_profiler import memory_usage

def run(params):
    # memory_usage samples the process every 0.1 s by default and reports MiB
    mem_usage = memory_usage(f)
    return max(mem_usage) / 1000 # approx. GB

We set the default parameters as follows and generate the data this way. For the experiments below, the default parameters are used unless specified otherwise.

import numpy as np

DEFAULT_PARAMS = {
    "N": 200000, # number of instances
    "M": 500, # feature dimension
    "n_estimators": 100,
    "num_leaves": 100,
    "histogram_pool_size": -1,
}
N, M = DEFAULT_PARAMS["N"], DEFAULT_PARAMS["M"]
data = np.random.randn(N, M)
groups = [20] * (N // 20) # assume each session has 20 rows
y = np.random.randint(2, size=N) # randomly choose 0 or 1

Large num_leaves can get very memory intensive. We should not need too many leaves, so generally using num_leaves <= 100 and increasing the number of estimators seems sensible to me.

num_leaves: 10, Maximum memory usage: 2.28 GB - 0.80 GB = 1.48 GB
num_leaves: 100, Maximum memory usage: 2.52 GB - 0.80 GB = 1.72 GB
num_leaves: 1000, Maximum memory usage: 4.04 GB - 0.80 GB = 3.24 GB

Increasing n_estimators doesn't seem to raise memory much, but increases run time because each tree is fitted sequentially on the residual errors, so it cannot be parallelized.

n_estimators: 10, Maximum memory usage: 2.28 GB - 0.80 GB = 1.48 GB
n_estimators: 100, Maximum memory usage: 2.53 GB - 0.80 GB = 1.73 GB
n_estimators: 1000, Maximum memory usage: 2.69 GB - 0.80 GB = 1.89 GB

Increasing N increases memory sublinearly. It seems that the data size itself will be more of a problem than the increase in LightGBM memory usage as N increases. For extremely large N, we can also set the subsample parameter to use only a fraction of the training instances for each step (i.e. stochastic rather than full gradient descent). By default subsample=1.0. Note that subsample only takes effect when subsample_freq is also set to a positive value.

N: 1,000, Maximum memory usage: 0.38 GB - 0.00 GB = 0.38 GB
N: 10,000, Maximum memory usage: 0.45 GB - 0.04 GB = 0.41 GB
N: 100,000, Maximum memory usage: 1.46 GB - 0.40 GB = 1.06 GB
N: 1,000,000, Maximum memory usage: 6.12 GB - 4.00 GB = 2.12 GB
N: 2,000,000, Maximum memory usage: 10.48 GB - 8.00 GB = 2.48 GB

In contrast to N, memory usage is quite sensitive to M and seems to increase linearly when M gets large. M=10,000 blows up my memory. I suppose this could be mitigated by setting colsample_bytree or colsample_bynode to sample a smaller subset.

M: 100, Maximum memory usage: 2.08 GB - 0.16 GB = 1.92 GB
M: 1000, Maximum memory usage: 4.92 GB - 1.60 GB = 3.32 GB
M: 2000, Maximum memory usage: 9.69 GB - 3.20 GB = 6.49 GB
M: 3000, Maximum memory usage: 14.35 GB - 4.80 GB = 9.55 GB

To deal with the high memory usage of large M, we can set colsample_bytree which samples a subset of columns before training each tree. This will help to mitigate the memory usage. For this experiment, we set M=2000 to simulate data with high number of dimensions.

colsample_bytree: 0.1, Maximum memory usage: 8.60 GB - 3.20 GB = 5.40 GB
colsample_bytree: 0.2, Maximum memory usage: 9.58 GB - 3.20 GB = 6.38 GB
colsample_bytree: 0.4, Maximum memory usage: 10.06 GB - 3.20 GB = 6.86 GB
colsample_bytree: 0.6, Maximum memory usage: 10.07 GB - 3.20 GB = 6.87 GB
colsample_bytree: 0.8, Maximum memory usage: 10.46 GB - 3.20 GB = 7.26 GB

In contrast, setting colsample_bynode does not help memory usage at all. Not too sure why, but I suppose since multiple nodes for the same tree can be split at the same time, the full feature set still has to be kept in memory.

colsample_bynode: 0.1, Maximum memory usage: 10.49 GB - 3.20 GB = 7.29 GB
colsample_bynode: 0.2, Maximum memory usage: 10.49 GB - 3.20 GB = 7.29 GB
colsample_bynode: 0.4, Maximum memory usage: 10.49 GB - 3.20 GB = 7.29 GB
colsample_bynode: 0.6, Maximum memory usage: 10.49 GB - 3.20 GB = 7.29 GB
colsample_bynode: 0.8, Maximum memory usage: 10.48 GB - 3.20 GB = 7.28 GB

Tweaking boosting and data_sample_strategy doesn't seem to affect memory usage too much. Using dart seems to require a bit more memory than the traditional gbdt.

data_sample_strategy: bagging, boosting: gbdt, Maximum memory usage: 8.90 GB - 3.20 GB = 5.70 GB
data_sample_strategy: goss, boosting: gbdt, Maximum memory usage: 9.58 GB - 3.20 GB = 6.38 GB
data_sample_strategy: bagging, boosting: dart, Maximum memory usage: 9.81 GB - 3.20 GB = 6.61 GB
data_sample_strategy: goss, boosting: dart, Maximum memory usage: 9.80 GB - 3.20 GB = 6.60 GB
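For reference, the sampling knobs discussed above can be collected into a single parameter dictionary. This is only a sketch: the values are illustrative assumptions rather than tuned settings, and bagging_freq is included because, as far as I understand the LightGBM documentation, row subsampling only takes effect when it is non-zero.

```python
# Illustrative LightGBM parameters aimed at lower memory usage.
# All values are assumptions for the sake of example, not tuned settings.
params = {
    "boosting": "gbdt",
    "num_leaves": 100,
    "n_estimators": 100,
    "max_bin": 255,           # number of histogram bins per feature
    "subsample": 0.5,         # alias of bagging_fraction: rows sampled per iteration
    "bagging_freq": 1,        # re-sample rows every iteration
    "colsample_bytree": 0.1,  # alias of feature_fraction: columns sampled per tree
}
```

In the experiments above, only colsample_bytree made a visible dent in memory, so that is the first knob I would reach for when M is large.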

Another bottleneck becomes apparent once we realize that LightGBM is a two-stage algorithm. In the first stage, LightGBM uses the full dataset to construct bins for each numeric variable (controlled by the max_bin argument) based on the distribution of its values. In the second stage, these discretized bins are then used to map and split the numeric variables during the actual training process to construct trees. From my understanding, the first stage cannot be chunked as it requires the full dataset, but the second stage can be chunked (as per any stochastic gradient descent algorithm) where a fraction of the dataset is loaded at each time. Hence, the real bottleneck appears to be the first stage, when the bins are constructed.

According to this thread, we can separate the memory usage between the two stages by using lgb.Dataset. First, we initialize the Dataset object and make sure to set free_raw_data=True (this tells it to free the original data array after the binning is done). Then, we trigger the actual dataset construction using dataset.construct(). Thereafter, we are free to delete the original data array to free up memory for the actual training. The following code illustrates this concept.

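import lightgbm as lgb

# Wrap the raw array; free_raw_data=True lets LightGBM drop it once binning is done
dataset = lgb.Dataset(data=data, label=y, group=groups, free_raw_data=True)
dataset.construct()  # stage one: build the bins from the full dataset
del data  # release our own reference to the raw array before training
booster = lgb.train(params=params, train_set=dataset)  # stage two: fit the trees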

TF-IDF

Term Frequency - Inverse Document Frequency is a well known method for representing a document as a bag of words. For a given corpus $C$ , we compute the IDF value for each word $w$ by taking $idf_{w} := \frac{1}{\log df_{w}}$ , with $df_{w}$ denoting the number of documents in $C$ containing the word $w$ (a more common definition is $idf_{w} := \log \frac{|C|}{df_{w}}$ , which also stays finite when $df_{w} = 1$ ). The document $d$ is represented by a vector whose length is the number of unique words in $C$ . Each element of the vector is the tf-idf value for a word, i.e. $tfidf_{w}^{d} := tf_{w}^{d} \cdot idf_{w}$ , where $tf_{w}^{d}$ represents the term frequency of the word $w$ in document $d$ . Sometimes, we l2 normalize the tf-idf vector so that the dot product between document vectors equals the cosine similarity between them (l1 normalization is also used, but the dot product is then no longer a cosine).
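A minimal sketch of the tf-idf pipeline in plain Python follows. Note that it swaps in the common $\log (|C| / df_{w})$ form of idf rather than the formula written in these notes, uses naive whitespace tokenization, and the function names are my own for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors); each vector is an l2-normalized tf-idf vector."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(tokenized)
    # df_w: number of documents containing word w
    df = Counter(w for toks in tokenized for w in set(toks))
    # Swapped-in idf variant: log(|C| / df_w)
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec))
        vectors.append([v / norm if norm > 0 else 0.0 for v in vec])
    return vocab, vectors

def cosine(u, v):
    # For l2-normalized vectors, the dot product is the cosine similarity
    return sum(a * b for a, b in zip(u, v))

docs = ["data scientist python", "data engineer python", "chef kitchen cooking"]
vocab, vecs = tfidf_vectors(docs)
```

Here the two data-related documents share words and get a positive cosine similarity, while the chef document shares nothing with them and scores zero.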

Bayesian Smoothing

We may want to apply some bayesian smoothing to the $t f_{w}^{d}$ terms to avoid spurious matches. For example, suppose that a rare word $w_{r}$ appears only in documents $d_{1}$ and $d_{2}$ in the entire corpus just by random chance. The $i d f_{w_{r}}$ will be a large value, and hence documents $d_{1}$ and $d_{2}$ will have a high cosine similarity just because of this rare word.

For the specific setting I am considering, we can deal with this problem using bayesian smoothing. The setting is as follows:

Each document $d$ represents a job, and each job is tagged to an occupation $o$
An occupation can have one or more jobs tagged to it
We wish to represent each occupation as a TF-IDF vector of words

To apply bayesian smoothing to this scenario, notice that we only need to smooth the term frequencies $t f_{w}^{d}$ . Since the IDF values $i d f_{w}$ are estimated across the whole corpus, we can assume that those are relatively reliable. And since term frequencies are counts, we can use a poisson random variable to represent them. See reference for a primer on the gamma-poisson bayesian inference model.

Specifically, we assume that $θ_{w}^{o}$ is the poisson parameter that dictates the term frequency of $w$ in any document belonging to $o$ , i.e. $tf_{w}^{d} \sim Poisson (θ_{w}^{o})$ . We treat the observed term frequency $tf_{w}^{d}$ for each document $d$ belonging to $o$ as a data sample to update our beliefs about $θ_{w}^{o}$ . We start with an uninformative $Gamma (a, b)$ prior for $θ_{w}^{o}$ (shape $a$ , rate $b$ ), and obtain the MAP estimate $\hat{θ}_{w}^{o}$ as below, with $df_{d \in o}$ denoting the number of documents that belong to occupation $o$ .

$\hat{θ}_{w}^{o} = \frac{a + \sum _{d \in o} t f _{w}^{d} - 1}{b + d f _{d \in o}}$

We can thus use this formula to obtain posterior estimates for each $\hat{θ}_{w}^{o}$ . One possible choice of the prior parameters $a$ and $b$ is to set $a$ to be the mean term frequency for word $w$ per document in the entire corpus, and to set $b := 1$ . This prior corresponds to $θ_{w}^{o}$ following a gamma distribution with mean $a$ and variance $a$ , which seems to be a reasonable choice that can be overridden by a reasonable amount of data.

The posterior variance, which may also be helpful in quantifying the confidence of this estimate, is: $Var (\hat{θ}_{w}^{o}) = \frac{a + \sum_{d \in o} tf_{w}^{d}}{(b + df_{d \in o})^{2}}$

Finally, after obtaining the posterior estimates for each $\hat{θ}_{w}^{o}$ , we can just use them as our term frequencies and multiply them by the IDF values as per normal. We can also apply l1 or l2 normalization thereafter to the tf-idf vectors. This method should produce tf-idf vectors that are more robust to the original problem of spurious matches.

For illustration, for a very rare word $w_{r}$ , $a$ will be a low value close to 0 (say 0.01). Suppose we were to observe $n$ number of new documents, each containing one occurrence of word $w_{r}$ . Then the posterior estimate of $\hat{θ}_{w}$ will update as follows:

n	$\hat{θ}_{w}$
1	0.005
2	0.337
3	0.503
4	0.602
$...$	$...$
20	0.905

As desired, the estimate for $\hat{θ}_{w}$ starts off at a very small value and gradually approaches the true value $1$ . This will help mitigate the effect of spurious matches. If we desire for the update to match the data more quickly, we can simply scale $a$ and $b$ down by some factor, e.g. now $a := \frac{a}{5} = 0.002$ and $b := \frac{b}{5} = 0.2$ . Then we have:

n	$\hat{θ}_{w}$
1	0.002
2	0.455
3	0.626
4	0.715
$...$	$...$
20	0.941

As a final detail, note that the update formula can result in negative estimates if $a < 1$ and $\sum_{d} t f_{w}^{d} = 0$ . The small negative value is probably not a big problem for our purposes, but we could also resolve it by setting the negative values to zero if desired.
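The update above is simple enough to sketch in a few lines of Python. The function names are hypothetical, and clipping negative estimates to zero is just the optional fix mentioned in the previous paragraph.

```python
def map_term_frequency(a, b, tf_sum, n_docs, clip=True):
    """MAP estimate of the Poisson rate under a Gamma(a, b) prior.

    tf_sum is the total count of the word across the n_docs documents of
    the occupation.
    """
    estimate = (a + tf_sum - 1) / (b + n_docs)
    # Optionally clip the small negative values that arise when a < 1
    # and the word was never observed
    return max(estimate, 0.0) if clip else estimate

def posterior_variance(a, b, tf_sum, n_docs):
    """Variance of the Gamma(a + tf_sum, b + n_docs) posterior."""
    return (a + tf_sum) / (b + n_docs) ** 2

# Rare word with prior mean a = 0.01, seen exactly once in each of n documents
for n in (1, 2, 3, 4, 20):
    print(n, map_term_frequency(0.01, 1.0, tf_sum=n, n_docs=n))
```

Scaling both a and b down by the same factor keeps the prior mean unchanged but lets the data dominate sooner, which is the behaviour shown in the second table.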

Cross Encoders

A cross encoder is a type of model architecture used for re-ranking a relatively small set of candidates (typically 1,000 or less) with great precision. In the Question-Answering or machine reading literature, the task typically involves finding the top matching documents to a given query. A typical benchmark is the MS MARCO dataset, where the goal is to find the top documents that are relevant to a given Bing query.

Basic Setup

Typically, the base model is some kind of pre-trained BERT model, and a classification head is added on top to output a probability. Each (query, document) pair is concatenated with [SEP] token in-between to form a sentence. The sentence is fed into the classification model to output a probability. The model is trained using binary cross-entropy loss against 0,1 labels (irrelevant or relevant).

This is the setup used by <Nogueira 2019>, possibly the first paper to propose the cross encoder. Some specifics for their setup:

The query is truncated to max 64 tokens, while the passage is truncated such that the concatenated sentence is max 512 tokens. They use the [CLS] embedding as input to a classifier head.
The loss for a single query is formulated as below. $s_{j}$ refers to the score from the classifier model for document $j$ , $J_{pos}$ refers to the documents that are relevant, and $J_{neg}$ refers to documents in the top 1,000 retrieved by BM25 that are not relevant. Note that this results in a very imbalanced dataset.

$L = - \sum_{j \in J_{pos}} \log (s_{j}) - \sum_{j \in J_{neg}} \log (1 - s_{j})$

The model is fine-tuned with a batch size of 128 sentence pairs for 100k batches.
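As a small illustration of the per-query loss in this setup, here is a plain Python sketch. The scores are made-up probabilities, and the eps guard against log(0) is my own addition rather than something taken from the paper.

```python
import math

def pointwise_bce_loss(pos_scores, neg_scores, eps=1e-12):
    """L = -sum_{j in J_pos} log(s_j) - sum_{j in J_neg} log(1 - s_j)."""
    loss = -sum(math.log(max(s, eps)) for s in pos_scores)
    loss -= sum(math.log(max(1.0 - s, eps)) for s in neg_scores)
    return loss

# One relevant passage versus a few BM25 negatives (scores are made up)
loss = pointwise_bce_loss(pos_scores=[0.9], neg_scores=[0.2, 0.1, 0.05])
```

With up to 1,000 negatives per positive, the second sum dominates the loss, which is the imbalance noted above.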

As opposed to bi-encoders (or dual encoders), which take a dot product between the query embedding and the document embedding, we cannot pre-compute embeddings in the cross encoder setting, because the cross encoder requires a forward pass on the concatenated (query, document) pair. Due to the bi-directional attention on the full concatenated sentence, we need the full sentence before we can compute the score, which requires the query that we only see at inference time. Hence, the cross encoder is limited to reranking a small set of candidates as it requires a full forward pass on each query, candidate_document pair separately.

Contrastive Loss

The vanilla binary cross entropy loss proposed above may be thought of as a pointwise loss, in which each document is either relevant or irrelevant in absolute terms. However, treating relevance as a relative concept often better reflects reality. For example, given the first page of search results for a Google query, most of the documents should be relevant to some extent, but some are more relevant than the rest (and get clicked on). Simply treating all clicks as relevant and all non-clicks as irrelevant naively ignores the context (i.e. the neighbouring search results) in which the clicks were generated. It assumes that across query sessions, the average level of relevance of the results is comparable. Treating relevance as a relative concept within the same query session weakens this assumption and hence often works better.

Thus <Gao 2021> proposes the Local Contrastive Estimation loss. For a given query q, a positive document $d_{q}^{+}$ is selected, and a few negative documents $d_{q}^{-}$ are sampled using a retriever (e.g. BM25). The contrastive loss then seeks to maximize the softmax probability of the positive document against the negative documents.

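$L_{LCE} = \frac{1}{|Q|} \sum_{q \in Q, G_{q}} - \log \frac{\exp (s_{q, d_{q}^{+}})}{\exp (s_{q, d_{q}^{+}}) + \sum_{d_{q}^{-} \in G_{q}} \exp (s_{q, d_{q}^{-}})}$

Here $G_{q}$ is the set of negatives sampled for query $q$ ; the positive's own term appears in the denominator so that the ratio is a proper softmax probability.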

It is confirmed in multiple experiments in Gao 2021 and Pradeep 2022 that LCE consistently out-performs point-wise cross entropy loss. Furthermore, the performance consistently improves as the number of negative documents per query (i.e. $∣ G_{q} ∣$ ) increases. In Gao 2021, up to 7 negatives (i.e. batch size of 8) were used. Pradeep 2022 shows that increasing the batch size up to 32 continues to yield gains consistently (albeit diminishingly).
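To make the contrastive objective concrete, here is a plain Python sketch of the loss for one query and its average over a batch of queries. It assumes the scores are raw logits from the cross encoder and that the softmax runs over the positive together with its sampled negatives; the function names are hypothetical.

```python
import math

def lce_loss(pos_score, neg_scores):
    """-log softmax probability of the positive among [positive] + negatives."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - pos_score

def batch_lce_loss(groups):
    """Average LCE over queries; each group is (pos_score, [neg_scores])."""
    return sum(lce_loss(p, negs) for p, negs in groups) / len(groups)
```

With all scores equal, the loss for one positive and k negatives is log(k + 1), so adding negatives makes the task strictly harder for an untrained model.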

Other details

Pradeep 2022's experiments show that using a stronger retrieval model (a ColBERT-based model) during inference generates slight gains in final performance (as opposed to BM25). Although Gao 2021 argues that it is also important to use the same retrieval model during model training (so that the cross encoder sees the same distribution of negatives during training and inference), Pradeep 2022 argues that the alignment is not as important as the stronger retrieval performance during inference.

References

SentenceTransformers

SentenceTransformers is a useful library for training various BERT-based models, including two-tower embedding models and cross encoder reranking models.

Cross Encoder

SentenceTransformers v4.0 updated their cross encoder training interface (see the v4.0 blogpost). Here we try to follow the key components for cross encoder training using their API.

The main class for training is CrossEncoderTrainer. We rely on a Huggingface datasets.Dataset class to provide training and validation data. CrossEncoderTrainer requires that the dataset format matches the chosen loss function.

The loss overview page provides a summary of cross encoder losses and the required dataset format. In general for cross encoder training, we have two sentences which are either positively or negatively related to each other. Which loss function we choose depends on the specific dataset format we possess.

BinaryCrossEntropyLoss

Use this loss if we have inputs in the form of (sentence_A, sentence_B) and a label of either 0: negative, 1: positive or a float score between 0-1. In the huggingface dataset, we would need to ensure that the label column is named label or score, and have two other input columns corresponding to sentence_A and sentence_B. For the sentence_transformers package in general, the order of columns matters, so we should set it to sentence_A, sentence_B, label.

Inspecting the source code would show that each sentence pair is tokenized and encoded by the cross encoder model. The cross encoder must output a single logit (i.e. initialized with num_labels=1). Thus we get a prediction vector $x$ of dim batch_size. The torch.nn.BCEWithLogitsLoss is then used to compute the binary cross entropy loss of the prediction logits against the actual labels $y$ , according to the standard bce loss, where $σ$ is the sigmoid function and $w_{i}$ is an optional per-element weight: $L (x_{i}, y_{i}) = - w_{i} [y_{i} \cdot \log σ (x_{i}) + (1 - y_{i}) \cdot \log (1 - σ (x_{i}))]$

This is a simple and effective loss. The user should ensure that the labels are well distributed (between $0$ and $1$ ) without any severe class imbalance.

CrossEntropyLoss

The CrossEntropyLoss is used for a classification task, where for a given input sentence pair (sentence_A, sentence_B), the label is a class. For example, we may have data where each sentence pair is tagged to a 1-5 rating scale. We need to instantiate the CrossEncoder class with num_labels=num_classes for this use case. This creates a prediction head that outputs one logit per class.

Looking at the source code, we see that this loss simply takes the prediction logits from the model (of dimension num_labels) and computes the torch.nn.CrossEntropyLoss against the actual labels.

Note that the cross entropy loss takes the following form. Given num_labels=C and logits of $x_{1}, ..., x_{C}$ , where the correct label is index $y$ , we have:

$L (x, y) = - \log \left( \frac{e^{x_{y}}}{\sum_{j = 1}^{C} e^{x_{j}}} \right)$
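The formula can be checked by hand with a few lines of plain Python. This is a sketch of the math only, not of the library internals, and the 5-class logits are made up.

```python
import math

def cross_entropy(logits, y):
    """-log softmax(logits)[y], computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[y]

# A sentence pair rated on a 1-5 scale, i.e. C = 5 classes indexed 0..4
logits = [0.1, 0.2, 2.5, 0.3, -1.0]
loss_if_correct = cross_entropy(logits, y=2)  # label matches the largest logit
loss_if_wrong = cross_entropy(logits, y=4)    # label is the smallest logit
```

The loss is small when the labelled class already has the largest logit and grows as the labelled class's logit falls behind the others.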

MultipleNegativesRankingLoss

This is basically InfoNCE loss or in-batch negatives loss. The inputs to this loss can take the following forms:

(anchor, positive) sentences
(anchor, positive, negative) sentences
(anchor, positive, negative_1, ..., negative_n) sentences

The documentation page has a nice description of what this loss does: Given an anchor, assign the highest similarity to the corresponding positive document out of every single positive and negative document in the batch.

Diving into the source code:

The inputs are list[list[str]], where the outer list corresponds to [anchor, positive, *negatives]. The inner list corresponds to the batch size.
scores of dimension (batch_size) are computed for each anchor, positive pair
get_in_batch_negatives is then called to mine negatives for each anchor.
- candidates (positive and negatives) are extracted at inputs[1:] and flattened into a long list
- A mask is created such that for each anchor, all the matching positive and negative candidates are masked out (not participating)
- The matching negatives do not participate because they will be added later on
- Amongst the remaining negatives, torch.multinomial is used to select self.num_negatives number of documents per anchor at random
- self.num_negatives defaults to 4
- These randomly selected negative texts are then returned as list[str]
For each negative in num_negatives mined in-batch negatives:
- score of dimension (batch_size) is computed for the anchor, negative pair
- The result is appended to scores
Similarly, for each hard matching negative:
- score of dimension (batch_size) is computed for the anchor, hard negative pair
- The result is appended to scores

Now scores is passed into calculate_loss:

Recall that scores is a list of tensors where the outer list is size 1 + num_rand_negatives + num_hard_negatives, and each tensor is of dimension batch_size
Thus torch.cat + transpose is called to make it (batch_size, 1 + num_rand_negatives + num_hard_negatives)
Note that for each row, the first column corresponds to the positive document
Hence the labels may be created as torch.zeros(batch_size)
Then torch.nn.CrossEntropyLoss()(scores, labels) may be called to get the loss

This sums up the loss computation for MultipleNegativesRankingLoss.
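The final calculate_loss step described above can be mimicked in plain Python. This sketch uses nested lists instead of tensors and assumes mean reduction (the torch default); it illustrates the described logic rather than copying the library code.

```python
import math

def cross_entropy(logits, label):
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits)) - logits[label]

def calculate_loss(scores):
    """scores: list of 1 + num_negatives lists, each of length batch_size.

    scores[0] holds the (anchor, positive) scores; the rest hold negatives.
    """
    # Transpose to (batch_size, 1 + num_negatives); column 0 is the positive
    rows = [list(row) for row in zip(*scores)]
    labels = [0] * len(rows)  # the correct "class" is always column 0
    losses = [cross_entropy(row, y) for row, y in zip(rows, labels)]
    return sum(losses) / len(losses)

# batch_size = 2, one positive and two negatives per anchor
scores = [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
loss = calculate_loss(scores)
```

Framing the ranking problem as classification with the positive always at index 0 is what lets the standard cross entropy loss be reused unchanged.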

CachedMultipleNegativesRankingLoss

Collaborative Filtering

Collaborative filtering is typically done with implicit feedback in the RecSys setting. In this setting, interactions are often very sparse. Most of the time, only positive signals are recorded, but a non-interaction could either mean (i) the user dislikes the item or (ii) the user was not exposed to the item. Hence, we cannot use algorithms like SVD, which treat a missing interaction as irrelevance.

A useful repository is https://github.com/recommenders-team/recommenders.

A generic and fairly common architecture for the collaborative filtering model is to embed each user and item into separate fixed size vectors, and use the cosine similarity between the vectors to represent a score. This score is fed into a cross entropy loss against the labelled relevance of user to item to train the embeddings.

Setup

Let $f (u)$ and $f (i)$ denote the $k$ dimensional embedding vector for user $u$ and item $i$ . Let the similarity function be $s (u, i)$ which is typically $f (u)^{T} f (i)$ , and distance function $d (u, i)$ which is typically $\| f (u) - f (i) \|_{2}^{2}$ . Then some common loss functions may be denoted as below.

Pointwise Losses are typically low-performing. For a given (u, i) pair, pointwise losses assume the presence of a 0, 1 label for relevance, and try to predict it. The typical pointwise loss is Binary Cross Entropy, which may be expressed as below, where $D$ is the set of observed (user, item) interactions:

$L_{BCE} = - \sum_{(u, i) \in D} \log σ (s (u, i)) - \sum_{(u, j) \notin D} \log (1 - σ (s (u, j)))$

Pairwise Losses assume the presence of training triplets (u, i, j) which correspond to user, positive item and negative item. A typical pairwise loss is Bayesian Personalized Ranking, as follows:

$L_{BPR} = - \sum_{(u, i, j) \in τ} \log σ [s (u, i) - s (u, j)]$
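Both losses can be written out in plain Python for small embedding tables. This is a sketch assuming dot-product similarity and explicitly supplied negatives; the helper names are my own.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x))
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def bce_loss(user_emb, item_emb, positives, negatives):
    """positives / negatives are lists of (u, i) index pairs."""
    loss = -sum(log_sigmoid(dot(user_emb[u], item_emb[i])) for u, i in positives)
    # log(1 - sigmoid(s)) equals log_sigmoid(-s)
    loss -= sum(log_sigmoid(-dot(user_emb[u], item_emb[j])) for u, j in negatives)
    return loss

def bpr_loss(user_emb, item_emb, triplets):
    """triplets are (u, i, j): user, positive item, negative item."""
    return -sum(
        log_sigmoid(dot(user_emb[u], item_emb[i]) - dot(user_emb[u], item_emb[j]))
        for u, i, j in triplets
    )
```

Note that BPR only cares about the gap between the positive and negative scores, whereas BCE pushes each score towards an absolute target.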

Weighted Matrix Factorization

This describes the Cornac implementation of WMF. The code:

Let $A \in R^{n \times m}$ describe a rating matrix of $n$ users and $m$ items. For simplicity, we may restrict $A_{ij} \in [0, 1]$ . Given a user embedding matrix $U \in R^{n \times k}$ and item embedding matrix $V \in R^{m \times k}$ , WMF computes the similarity score as the dot product $U \cdot V^{T} \in R^{n \times m}$ .

The general loss function is:

$L = \sum_{i, j : A_{ij} = 1} (A_{ij} - U_{i} \cdot V_{j}^{T})^{2} + b \sum_{i, j : A_{ij} = 0} (A_{ij} - U_{i} \cdot V_{j}^{T})^{2} + u \cdot \| U \|_{F}^{2} + v \cdot \| V \|_{F}^{2}$

The idea is to simply take the squared error from the true ratings matrix as our loss, but apply a lower weightage to elements in the rating matrix where the rating is zero (as these are usually unobserved / implicit negatives that we are less confident about). Usually b is set to 0.01. Regularization is performed on the user and item embedding matrices, with $u \in R, v \in R$ as hyperparameters to adjust the strength of regularization.

For cornac, this loss is adapted to the mini batch setting. Specifically, the algorithm is:

Draw a mini batch (default: B = 128) of items but use all the users
Compute the model predictions $P = U \cdot V_{batch}^{T} \in R^{n \times B}$
Compute squared error $E = (A_{batch} - P)^{2} \in R^{n \times B}$
Multiply matrix of weights (either 1 for positive ratings or b for negative ratings) element-wise with $E$
$loss = sum (E) + u \cdot \| U \|_{F}^{2} + v \cdot \| V_{batch} \|_{F}^{2}$

Note that Adam optimizer is used, and gradients are clipped between [-5, 5].
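The mini batch loss steps above can be sketched in plain Python with nested lists. This is not Cornac's code: u_reg and v_reg are my names for the regularization strengths $u$ and $v$ , their default values are assumptions, and the Adam update and gradient clipping are left out.

```python
def wmf_batch_loss(A_batch, U, V_batch, b=0.01, u_reg=0.01, v_reg=0.01):
    """Weighted squared error for one mini batch of items.

    A_batch: n x B ratings, U: n x k user embeddings,
    V_batch: B x k item embeddings for the items in the batch.
    """
    loss = 0.0
    for i, user_vec in enumerate(U):
        for j, item_vec in enumerate(V_batch):
            pred = sum(p * q for p, q in zip(user_vec, item_vec))
            weight = 1.0 if A_batch[i][j] > 0 else b  # down-weight zero ratings
            loss += weight * (A_batch[i][j] - pred) ** 2
    # Squared Frobenius norms of the embedding matrices
    loss += u_reg * sum(x * x for row in U for x in row)
    loss += v_reg * sum(x * x for row in V_batch for x in row)
    return loss
```

For example, with one user, two items, A_batch = [[1, 0]] and all embeddings equal to 1, the only unregularized error comes from the zero rating, and it is scaled down by b.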

Bilateral Variational Autoencoder (BiVAE)

Recommenders BiVAE Deep Dive BiVAE Paper

A working implementation of BiVAE is available on Cornac.

A variational autoencoder improves over traditional linear matrix factorization methods by using non-linearity and a probabilistic formulation. Given a user, the autoencoder encodes the data representing the entity into a vector in some latent space. A decoder then takes the vector in the latent space and decodes it into something close to the original data.

The difference between a VAE and a regular autoencoder is that a VAE doesn't learn a fixed vector representation, but rather a probability distribution in the latent space. This allows it to model noisy, sparse interaction data better.

Splitting

recommenders uses a few different types of data splitting:

Stratified splitting.

Evaluation

Evaluation is a non-trivial topic for recsys, and different approaches measure different things. Suppose we have a dataset of user-item interactions with a timestamp.

Random splitting simply takes a random split of say 75% for train and 25% for test. The problems with this approach:

No guarantee of overlap in users across train and test. If a user does not appear in the train set, it is not possible to recommend items for him/her in the test set.
Chronological overlap between train and test set, leading to data leakage issues.

Stratified splitting addresses the user overlap issue by ensuring that the number of rows per user in the train and test set are approximately 75% and 25% of the number of rows in the original data respectively. This ensures that we have sufficient training and test data for each user, so that the collaborative filtering algorithm has a fair chance of recommending items for each user.

However, stratified splitting still involves randomly assigning rows for each user into the train and test set, which does not address the chronological overlap issue. Temporal stratified splitting addresses this issue by assigning the 75% and 25% of train and test data based on chronological order. In other words, the oldest 75% of data for each user is assigned to the train set.

The extreme version of temporal stratified splitting is leave last out splitting, in which all but the latest row for each user is put into the train set. This is suitable for settings where the task is to predict the very next action which the user will take (e.g. which song will the user listen to next).

Note that temporal stratified splitting may potentially introduce temporal overlap between the train and test sets across users. That is, the train set period for user A may potentially overlap with the test set period for user B. Hence, if there are strong concerns with temporal effects in the dataset, we may need to be mindful of this.

Global temporal splitting addresses this issue by assigning the oldest 75% of data across all users to the train set. This addresses the data leakage issue and more closely resembles actual production setting. However, there is no guarantee on the amount of train/test data for each user. Hence we may need to drop rows where there exists test data for user A but no corresponding train data due to the global temporal cutoff.
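The two temporal strategies can be sketched in plain Python as follows. This is not the recommenders implementation; the function names and the (user, item, timestamp) tuple layout are assumptions for illustration.

```python
from collections import defaultdict

def temporal_stratified_split(rows, train_frac=0.75):
    """rows: list of (user, item, timestamp). Oldest train_frac per user -> train."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)
    train, test = [], []
    for user_rows in by_user.values():
        user_rows.sort(key=lambda r: r[2])
        cut = int(len(user_rows) * train_frac)
        train.extend(user_rows[:cut])
        test.extend(user_rows[cut:])
    return train, test

def global_temporal_split(rows, train_frac=0.75):
    """Oldest train_frac of all rows -> train; drop test users unseen in train."""
    ordered = sorted(rows, key=lambda r: r[2])
    cut = int(len(ordered) * train_frac)
    train, test = ordered[:cut], ordered[cut:]
    train_users = {r[0] for r in train}
    return train, [r for r in test if r[0] in train_users]
```

Leave last out splitting is the special case of the first function where only the final row per user goes to the test set.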

AB Testing

References

Examples

These summaries are based on reading PostHog's article on AB testing and studying the ones of interest further.

AB Testing at Airbnb

AB Testing is crucial because the outside world often has a larger effect on metrics than product changes. Factors like seasonality and the economy can cause metrics to fluctuate greatly, hence a controlled experiment is necessary to control for external factors and isolate the effects of the product change.

Airbnb has a complex booking flow of search -> contact -> accept -> book. While they track the AB impact of each stage, the main metric of interest is the search to book metric.

One pitfall is stopping the AB Test too early. Airbnb noticed that their AB tests tend to follow a pattern of hitting significance early on but returning to neutral when the test has run its full course. This is a phenomenon known as peeking, which is to repeatedly examine an ongoing AB test. It makes it much more likely to find a significant effect when there isn't, since we are doing a statistical test each time we peek. For Airbnb, they hypothesize that this is also a phenomenon caused by the long lead time it takes from search -> book, such that early converters have a disproportionately large influence at the beginning of the experiment.
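To get a feel for how much peeking inflates false positives, here is a small simulation sketch of AA tests (both arms share the same conversion rate). The setup is entirely my own assumption: a pooled two-proportion z-test, 20 evenly spaced peeks, and a 10% conversion rate.

```python
import math
import random

def z_test_p_value(conv_a, conv_b, n):
    """Two-sided p-value for a difference in proportions with n samples per arm."""
    p_pool = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    if se == 0:
        return 1.0
    z = abs(conv_a - conv_b) / n / se
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

def simulate_aa(n_total=2000, n_peeks=20, rate=0.1, alpha=0.05, n_sims=300, seed=0):
    """Return (share significant at any peek, share significant at the end)."""
    rng = random.Random(seed)
    step = n_total // n_peeks
    peek_hits = final_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        significant_at_any_peek = False
        for n in range(1, n_total + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            if n % step == 0 and z_test_p_value(conv_a, conv_b, n) < alpha:
                significant_at_any_peek = True
        peek_hits += significant_at_any_peek
        final_hits += z_test_p_value(conv_a, conv_b, n_total) < alpha
    return peek_hits / n_sims, final_hits / n_sims
```

Calling simulate_aa() returns the two false positive rates side by side. Since the final look is one of the peeks, the first rate can never be lower than the second, and in practice it tends to be several times larger.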

The natural solution to this problem is to conduct power analysis and determine the desired sample size prior to the experiment. However, Airbnb runs multiple AB tests at the same time, hence they required an automatic way to track ongoing AB tests and report when significance has been reached. Their solution back then was to create a dynamic p-value graph. The idea is that on day 1 of the experiment, we would require a very low p-value to declare success. As time goes on and more samples are collected, we can gradually relax the p-value threshold until it hits 5%. The shape of this graph is unique to their platform and they did extensive simulations to create it, so this solution is not very generalizable.

Another pitfall was assuming that the system is working. After running an AB test for shifting from more words to more pictures, the initial result was neutral. However, they investigated and found that most browsers had a significant positive effect except for Internet Explorer. It turned out that the change had some breaking effect on older IE browsers. After fixing that, the overall result became positive. Hence some investigation is warranted when the AB test results are unintuitive. However, one needs to be cautious of multiple-testing when investigating breakdowns, since we are conducting multiple statistical tests.

Airbnb has a strong AB testing culture - only 8% of AB tests are successful (see Ron Kohavi's LinkedIn Post).

AB Testing at Monzo

Monzo has a big AB testing culture - they ran 21 experiments in 6 months. Monzo has a bottom-up AB testing culture where anyone can write a proposal on Notion. Some of the best ideas come from the customer operations staff working on the frontlines. A proposal comprises the following sections:

What problem are you trying to solve?
Why should we solve it?
How should we solve it (optional)?
What is the ideal way to solve this problem (optional)?

Many proposals end up becoming AB experiments. Monzo prefers launching pellets rather than cannonballs. This means that each experiment comprises small changes, is quick to build, and helps the team learn quickly.

AB Testing at Convoy

Convoy argues that bayesian AB testing is more efficient than frequentist AB testing and allows them to push out product changes faster while still controlling risk.

The argument against frequentist AB testing is as follows. Under traditional AB testing, we define a null hypothesis using the control group (call it A), and declare a treatment (call it B) as successful if the treatment value has a significant p-value, i.e. it falls outside of the range of reasonable values under the null. Based on power analysis and an expected effect size, we predetermine the necessary sample size to achieve sufficient power, and once this sample size is reached, we have a binary success or failure result based on the p-value.

Convoy argues that this approach is safe but inefficient. This is because prior to the sample size being reached, we do not have a principled way of saying anything about the effectiveness of the treatment, even if it is performing better. Furthermore, frequentist AB testing gives us a binary result, but it does not quantify the size of the difference. Specifically, an insignificant test where E(A)=10%, E(B)=11% is quite different from E(A)=15%, E(B)=10%. For the former case, one can argue for launching B even if the p-value did not hit significance, whereas for the latter we should definitely not launch.

Bayesian analysis comes in to make the above intuition concrete. Suppose we are interested in the clickthrough rate (CTR) of variant A vs B. Bayesian analysis provides a distribution of the average CTR for each variant A, B at any point of the AB test, based on the results that it has seen thus far. These posterior distributions reflect both the mean of the data (how far apart $E (A)$ is from $E (B)$ ) and the variance of the data (how spread out the distributions are), allowing us to quantify how much we stand to gain if we were to pick either variant A or B at this point in time.

Concretely, they define a loss function as follows. Let $α$ and $β$ be the unobserved true CTR for variants A and B respectively, and let the variable $x$ denote which variant we decide to choose. Then our loss for choosing each variant can be expressed as:

$L (α, β, x) = \begin{cases} \max (β - α, 0) & \text{if } x = A \\ \max (α - β, 0) & \text{if } x = B \end{cases}$

In other words, the loss above expresses how much we stand to lose by picking the wrong variant based on incomplete evidence at this point in time. Of course, we do not know the true values of $α$ and $β$ , so we need to estimate the loss using our posterior distributions which we computed from data. We then compute the expected loss based on the posterior distributions $\hat{α} \sim A$ , $\hat{β} \sim B$ as such: $E (\hat{L}) = \int \int L (\hat{α}, \hat{β}, x) \cdot f (\hat{α}, \hat{β}) \, d \hat{α} \, d \hat{β}$

Here, $f (\hat{α}, \hat{β})$ is the joint posterior density, which I believe we can obtain by multiplying the two independent posterior densities of $\hat{α}$ and $\hat{β}$ together. We can also perform random draws from the posterior distributions to estimate this statistic. Finally, we make a decision by choosing a variant once its expected loss dips below a certain threshold, which is usually a very small value.
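The random-draw approach can be sketched with the standard library alone. This assumes independent Beta(1, 1) priors on each CTR, which is my own choice for the sketch rather than something stated in Convoy's article, and the click counts are made up.

```python
import random

def expected_losses(clicks_a, views_a, clicks_b, views_b, n_draws=20000, seed=0):
    """Monte Carlo estimate of the expected loss of choosing A or B."""
    rng = random.Random(seed)
    loss_a = loss_b = 0.0
    for _ in range(n_draws):
        # Draw each CTR from its Beta posterior (Beta(1, 1) prior assumed)
        alpha = rng.betavariate(1 + clicks_a, 1 + views_a - clicks_a)
        beta = rng.betavariate(1 + clicks_b, 1 + views_b - clicks_b)
        loss_a += max(beta - alpha, 0.0)  # what we give up if we pick A
        loss_b += max(alpha - beta, 0.0)  # what we give up if we pick B
    return loss_a / n_draws, loss_b / n_draws

loss_a, loss_b = expected_losses(clicks_a=100, views_a=1000, clicks_b=130, views_b=1000)
```

In this example B has the higher observed CTR, so the expected loss of choosing B is much smaller than that of choosing A, and we would ship B once its expected loss falls below the chosen threshold.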

The appeal of the bayesian approach is two-fold:

It allows us to make faster decisions. Suppose an experiment is wildly successful, and it is clear within a day that variant B is better. Bayesian analysis will be able to reveal this result, whereas frequentist analysis will tell us to wait longer (since we estimated the effect size to be smaller).
It allows us to control risk. Since we are making decisions based on minimizing risk (supposing we had picked the poorer variant), we may be sure that even if we are wrong, it will not severely degrade our product. So supposing that there is no significant engineering cost between variant A and B, we can more rapidly roll out new iterations with the assurance that on average our product will be improving.

Power Analysis

Reference: Probing into Minimum Sample Size by Mintao Wei

How do we determine the minimum sample size required to achieve a desired significance level and power?

The following table helps us understand how type I and type II errors come into play:

	Null Hypothesis: A is True	Alternate Hypothesis: B is True
Reject A	Type I Error	Good statistical power
Accept A	Good significance level	Type II Error

Type I Error refers to rejecting the null hypothesis when it is actually true, e.g. when we think that an AA test has significant difference. In short, it means we were too eager to deploy a poor variant. This should happen with probability $α$ , which is the significance level which we set (typically 5%). We have a better handle on type I error because the baseline conversion rate is typically known prior to an experiment.

Type II Error refers to failing to reject the null hypothesis when the alternate is actually true, i.e. we failed to get a significant effect on an improvement that is known to be better. In short, we were too conservative and failed to deploy a winning variant. In order to reason about type II error, we need to make a guess on the distribution of test variant B. Typically, this is done by assuming a minimum effect $δ$ we wish to detect, and setting $μ_{B} = μ_{A} + δ$ , and re-using the standard deviation from A. With these assumptions in place, we use $\text{power} = 1 - β$ to determine the type II error that should only occur with probability $β$ (typically 20%). Note that since $δ$ is the minimum effect we wish to detect, if the actual effect turned out to be larger, the type II error can only be smaller than our desired amount, which is ok.

Now we can derive the formula for the minimum sample size required to achieve the desired levels of type I and type II error respectively.

Let us define the baseline conversion rate as $p$ , and the minimum relative detectable effect rate as $d$ . Consequently, the minimum detectable delta is $δ = d \times p$ . Let the desired power level be $1 - β$ , and the desired significance level as $α$ . Assume the scenario where we are running an AA or AB test with two variants of sample size $N$ each.

Firstly, we write down the distribution of the sample mean difference supposing we knew the true population means and standard deviations. Let $E(X_{A}) = μ_{A}, \text{Var}(X_{A}) = σ_{A}^{2}$ and $E(X_{B}) = μ_{B}, \text{Var}(X_{B}) = σ_{B}^{2}$ . Note that $X_{A}, X_{B}$ may have arbitrary distributions, e.g. they could measure proportions, revenue etc.

Under the central limit theorem, the sample means will be distributed like so with $N_{A}, N_{B}$ samples: $\bar{X}_{A} \sim N(μ_{A}, \frac{σ_{A}^{2}}{N_{A}})$ , $\bar{X}_{B} \sim N(μ_{B}, \frac{σ_{B}^{2}}{N_{B}})$ . Importantly, the difference of the sample means will have the distribution below. Note that we add the variances together because $\text{Var}(B - A) = \text{Var}(B) + \text{Var}(A)$ for any two independent random variables $A, B$ .

$\bar{X}_{D} = \bar{X}_{B} - \bar{X}_{A} \sim N(μ_{B} - μ_{A}, \frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}) \qquad (1)$

Now we can start working from the desired $α, β$ levels to the minimum sample size. We need to ensure that both objectives below are achieved with our sample size $N_{A}, N_{B}$ :

Assuming null hypothesis to be true, ensure that type I error $\leq α$ .
Assuming alternate hypothesis to be true, ensure that type II error $\leq β$ (equivalently, power $\geq 1 - β$ ).

Let us define some notation first.

Let $z(ϕ)$ denote the critical value under the standard normal distribution such that $P(Z \leq z(ϕ)) = ϕ$ . This is basically the scipy.stats.norm.ppf function, e.g. $z(0.975) = 1.96$ .
We also want to denote the critical value under the distribution $\bar{X}_{D}$ of the sample mean difference under the null or alternate hypothesis (these are non-standard normal distributions). Let these be $z_{\bar{X}_{D} \mid H_{0}}(ϕ)$ and $z_{\bar{X}_{D} \mid H_{1}}(ϕ)$ respectively.


Illustration for Power Analysis Derivation

For objective 1, assuming the null hypothesis and using equation (1) above, we have $\bar{X}_{D} \mid H_{0} \sim N(0, \frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}})$ . Since $α$ is a two-tailed probability and we want the critical region on the right-side, let $α' = 1 - α/2$ . E.g. $α = 0.05$ implies $α' = 0.975$ . Then:

$z_{\bar{X}_{D} \mid H_{0}}(α') = z(α') \cdot \sqrt{\frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}} \qquad (2)$

Note that equation (2) above tells us the critical value such that we will reject the null hypothesis if the sample mean difference $\bar{X}_{D}$ is greater than this value. To satisfy objective 2, we must thus ensure that the probability of rejecting the null hypothesis under the alternate hypothesis is at least $\text{power} = 1 - β$ . In other words, we want $δ - z_{\bar{X}_{D} \mid H_{1}}(1 - β) \geq z_{\bar{X}_{D} \mid H_{0}}(α')$ . Assuming the alternate hypothesis and again using equation (1), we have $\bar{X}_{D} \mid H_{1} \sim N(δ, \frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}})$ . So then:

$\begin{aligned} δ - z_{\bar{X}_{D} \mid H_{1}}(1 - β) &\geq z_{\bar{X}_{D} \mid H_{0}}(α') \\ δ - z(1 - β) \times \sqrt{\frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}} &\geq z(α') \times \sqrt{\frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}} \end{aligned}$

For the purpose of getting a minimum $N$ , we assume $N = N_{A} = N_{B}$ . Then using this and squaring both sides with some rearranging gives us:

$N \geq \frac{(σ_{A}^{2} + σ_{B}^{2}) (z(1 - β) + z(α'))^{2}}{δ^{2}} \qquad (3)$

Which gives us the required minimum sample size equation. If we assume $σ_{A} = σ_{B}$ , as is often assumed because we do not know the variance of the treatment, then it simplifies to the following form (as seen in Ron Kohavi's paper).

$N \geq \frac{2 σ _{A}^{2} \cdot ( z ( 1 - β ) + z ( α ^{'} ) ) ^{2}}{δ ^{2}}$
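To make the formula concrete, here is a minimal sketch that evaluates the general form with both variances. It uses the standard library's `statistics.NormalDist().inv_cdf` in place of scipy.stats.norm.ppf so that it has no dependencies. The function name and the example numbers are illustrative assumptions.

```python
import math
from statistics import NormalDist

def z(phi: float) -> float:
    """Standard normal quantile, the same quantity as scipy.stats.norm.ppf(phi)."""
    return NormalDist().inv_cdf(phi)

def min_sample_size(sigma_a: float, sigma_b: float, delta: float,
                    alpha: float = 0.05, beta: float = 0.2) -> int:
    """Per-variant N satisfying N >= (sigma_a^2 + sigma_b^2) (z(1-beta) + z(alpha'))^2 / delta^2."""
    alpha_prime = 1 - alpha / 2  # two-tailed significance level
    n = (sigma_a**2 + sigma_b**2) * (z(1 - beta) + z(alpha_prime)) ** 2 / delta**2
    return math.ceil(n)

# Detecting a shift of 0.1 standard deviations at 5% significance and 80% power
print(min_sample_size(sigma_a=1.0, sigma_b=1.0, delta=0.1))
```

With these inputs the requirement comes out to roughly 1,570 samples per variant, which matches the common "16 σ² / δ²" rule of thumb.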

Bernoulli Events

The equation (3) above for the minimum sample size requires us to know the standard deviation under the null and alternate hypotheses. Usually, the standard deviation under the null is computed from historical data, and it is assumed that $σ_{A} = σ_{B}$ . However, if the event we are interested in may be represented as a bernoulli random variable (e.g. an impression is shown and user either clicks or does not click with some probability), the equation may be simplified.

Specifically, the variance of a bernoulli random variable with probability $p$ is $p \cdot (1 - p)$ . Thus, if $X_{A} \sim \text{Bernoulli}(p_{A})$ , then $\text{Var}(X_{A}) = p_{A} \cdot (1 - p_{A})$ , and likewise for $X_{B}$ .

So we can use $σ_{A}^{2} = p_{A} \cdot (1 - p_{A})$ and $σ_{B}^{2} = (p_{A} + δ) \cdot (1 - p_{A} - δ)$ and substitute these into equation (3). We will then be able to have a minimum sample size formula by just specifying $α$ , $β$ , baseline conversion $p_{A}$ and minimum relative difference $d$ . This is the formula used by Evan Miller's sample size calculator.
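The substitution can be sketched as below. This is a direct evaluation of the sample size formula with Bernoulli variances, not a reimplementation of any particular calculator, so small differences from published tools are possible. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def bernoulli_sample_size(p_a: float, d: float,
                          alpha: float = 0.05, beta: float = 0.2) -> int:
    """Per-variant N for a conversion-rate test with baseline p_a and relative lift d."""
    z = NormalDist().inv_cdf          # standard normal quantile
    delta = d * p_a                   # minimum detectable absolute difference
    p_b = p_a + delta
    var_a = p_a * (1 - p_a)           # Bernoulli variance under the baseline
    var_b = p_b * (1 - p_b)           # Bernoulli variance under the alternate
    n = (var_a + var_b) * (z(1 - beta) + z(1 - alpha / 2)) ** 2 / delta**2
    return math.ceil(n)

# 10% baseline conversion, detect a 10% relative lift (10% -> 11%)
print(bernoulli_sample_size(p_a=0.10, d=0.10))
```

Note how quickly the requirement grows as the lift shrinks: halving `d` roughly quadruples the sample size because of the $δ^{2}$ in the denominator.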

Imbalanced AB Test

Another common scenario is the case where we do not split 50-50, i.e. $N_{A} \neq N_{B}$ . In this case, suppose we have $N_{A} = p \times N_{B} = p \times n$ , where $p > 1$ (note that $p$ here is the allocation ratio, not the baseline conversion rate used earlier). For example, if we have a 90-10 split, then $p = 9$ . Then we get:

$\begin{aligned} δ &\geq (z(α') + z(1 - β)) \sqrt{\frac{σ_{A}^{2}}{N_{A}} + \frac{σ_{B}^{2}}{N_{B}}} \\ δ &\geq (z(α') + z(1 - β)) \sqrt{\frac{σ_{A}^{2} + p \cdot σ_{B}^{2}}{p n}} \\ n &\geq \frac{(z(α') + z(1 - β))^{2}}{δ^{2}} \left(\frac{σ_{A}^{2}}{p} + σ_{B}^{2}\right) \end{aligned}$

Note that $n$ gives the sample size required for $N_{B}$ , and the total sample size required across both groups is $n (p + 1)$ .
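A small sketch of the imbalanced case follows. It squares the sum of the two critical values so that the result reduces to the balanced formula when the ratio is 1. The function name and the `ratio` argument (the $p$ above) are illustrative.

```python
import math
from statistics import NormalDist

def imbalanced_sample_size(sigma_a: float, sigma_b: float, delta: float, ratio: int,
                           alpha: float = 0.05, beta: float = 0.2):
    """Return (N_A, N_B) where N_A = ratio * N_B."""
    z = NormalDist().inv_cdf
    z_sum_sq = (z(1 - alpha / 2) + z(1 - beta)) ** 2
    n_b = math.ceil(z_sum_sq / delta**2 * (sigma_a**2 / ratio + sigma_b**2))
    n_a = ratio * n_b
    return n_a, n_b

balanced = imbalanced_sample_size(1.0, 1.0, 0.1, ratio=1)
skewed = imbalanced_sample_size(1.0, 1.0, 0.1, ratio=9)  # 90-10 split
print(balanced, sum(balanced))
print(skewed, sum(skewed))
```

The smaller group needs fewer samples than in the balanced case, but the total across both groups is considerably larger, which is the price of an uneven split.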

Large Language Models

LLMs are generally used in an auto-regressive way, where the user supplies a prompt and the LLM returns a generated response. This framework makes it amenable to a wide range of tasks.

HuggingFace has a great blog post explaining how we can run LLMs on humble hardware. Typically, LLMs have billions of parameters. The following rules of thumb help us know how much VRAM we need to load the LLM. For a model with X billion parameters:

Loading in float32 requires 4X GB of VRAM
Loading in bfloat16 requires 2X GB of VRAM
Loading in int8 requires X GB of VRAM

Hence we see that we can load a ~7 billion parameter model with around 14GB of VRAM if loaded in bfloat16, which makes it feasible to run on GPUs like the Tesla T4 with 16GB of VRAM. This can be done by loading the model with from_pretrained(..., torch_dtype=torch.bfloat16). Most models are trained in bfloat16 anyway, so it makes sense to load them at that precision. Note that float16 takes the same 2X GB, and is the half-precision type with native hardware support on pre-Ampere GPUs like the T4.
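The rule of thumb is just bytes per parameter times the parameter count. A tiny sketch follows. It only counts the weights, so activations and the KV cache come on top of this figure. The helper name is hypothetical.

```python
# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}

def weights_vram_gb(params_billions: float, dtype: str) -> float:
    """Approximate GB needed just to hold the weights (1e9 params x bytes / 1e9 bytes per GB)."""
    return params_billions * BYTES_PER_PARAM[dtype]

for dtype in ("float32", "bfloat16", "int8"):
    print(dtype, weights_vram_gb(7, dtype), "GB")
```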

Popular open source LLMs of that size include mosaicml/mpt-7b, which can be easily downloaded and used via HuggingFace.

Quantization

It turns out that we can lower the precision of models even further than 16 bits if we use a quantization method (see e.g. Dettmers 2022, this paper is the basis for the package bitsandbytes used for quantization). The general idea is akin to encoding - we encode each number from a higher precision into a "codeword" in the lower precision (i.e. quantization). Numbers that are close to one another in the higher precision may get mapped to the same "codeword". When we want to use the encoded value, we look up the value in the higher precision that it maps to (i.e. de-quantization).

When applying this to quantizing a neural network, the steps involved are:

Quantize all model weights to target precision (e.g. int8)
Pass the input vector at bfloat16
At each layer, dequantize the weights and perform matmul in bfloat16
Quantize the weights again for storage

Hence while quantization lowers the memory footprint of the model, it may increase inference time. To use quantization, we need to do the following (also make sure bitsandbytes is pip installed). We can also pass load_in_4bit=True for 4bit quantization. More info on quantization usage is available at HuggingFace. An important note is that a GPU is required for quantization, at least in the bitsandbytes package.

model = AutoModelForCausalLM.from_pretrained(..., load_in_8bit=True)
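To make the quantize and de-quantize steps above concrete, here is a toy sketch using simple absmax scaling on a single weight vector. This is an assumption made for illustration: bitsandbytes uses more elaborate schemes (e.g. vector-wise scaling and special handling of outliers), so treat this only as the basic idea of mapping values to integer codewords and back.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integer codewords using a single scale factor."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    largest = max(abs(w) for w in weights) or 1.0   # avoid a zero scale for all-zero weights
    scale = largest / qmax
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Look up the higher-precision value each codeword stands for."""
    return [c * scale for c in codes]

def quantized_dot(x, codes, scale):
    """Dequantize the stored weights on the fly, then do the product in floating point."""
    return sum(xi * wi for xi, wi in zip(x, dequantize(codes, scale)))

weights = [0.12, -0.5, 0.33, 0.0]
codes, scale = quantize_absmax(weights)
print(codes)
print(dequantize(codes, scale))
print(quantized_dot([1.0, 2.0, 3.0, 4.0], codes, scale))
```

The recovered weights differ from the originals by at most half a quantization step, which is the rounding error we accept in exchange for the smaller memory footprint.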

Flash Attention

The self-attention mechanism is at the heart of transformer performance but is also a major bottleneck in terms of memory and computational cost. One of the most successful optimizations for the attention mechanism is Flash Attention (Dao 2022).

Suppose we have an input sequence of embeddings $X = (x_{1}, ..., x_{N})$ where $x_{i} \in R^{k}$ , such that $X \in R^{k \times N}$ . The transformer stores parameters $W_{q}, W_{k}, W_{v} \in R^{k \times d}$ , such that $Q = X^{T} \cdot W_{q}, K = X^{T} \cdot W_{k}, V = X^{T} \cdot W_{v}$ and $Q, K, V \in R^{N \times d}$ . The self-attention matrix $S = QK^{T} \in R^{N \times N}$ is then computed to represent the pairwise interaction between tokens at position $i$ ( $i^{th}$ row) and position $j$ ( $j^{th}$ column). The row-wise softmax is taken $P = \text{softmax}(S)$ to convert these into probabilities and finally the output is $O = P \cdot V$ .

Typically, $N$ is much larger than the hidden dimensions $k, d$ , as $N$ can be 2,048 or larger. Hence the $QK^{T} \in R^{N \times N}$ matrix is the bottleneck for memory and computation. Flash attention proposes to do this computation in a block-wise manner to reduce the memory usage. Furthermore, the algorithm also speeds up the computation compared to naive attention because the block-wise implementation minimizes the number of read-write operations between the faster SRAM and slower HBM.
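The block-wise idea can be illustrated for a single query row using a running ("online") softmax. This is a pure-Python sketch of the numerical trick only, under the simplifying assumption of one query and no scaling factor. The real algorithm tiles queries as well and runs as a fused GPU kernel. Function names are hypothetical.

```python
import math

def naive_attention_row(q, K, V):
    """softmax(q K^T) V, materialising all N scores at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, V)) / total for c in range(len(V[0]))]

def blockwise_attention_row(q, K, V, block_size=2):
    """Same result, but only one block of scores is held at any time."""
    running_max, running_sum = float("-inf"), 0.0
    acc = [0.0] * len(V[0])  # unnormalised weighted sum of value vectors
    for start in range(0, len(K), block_size):
        K_blk, V_blk = K[start:start + block_size], V[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in K_blk]
        new_max = max(running_max, max(scores))
        # Rescale what has been accumulated so far to the new running maximum
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, V_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vc for a, vc in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(naive_attention_row(q, K, V))
print(blockwise_attention_row(q, K, V))
```

Both functions print the same vector (up to floating point error), which shows that the full $N \times N$ matrix never needs to be stored to get exact attention.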

More details can be found in the notebook at Dao 2022 - FlashAttention. We can utilize flash attention like so:

%pip install optimum

model.to_bettertransformer()

Note that this is only supported for models that have implemented flash attention, e.g. gpt-neox, bloom etc.

Flash Attention is now supported natively within PyTorch as torch.nn.functional.scaled_dot_product_attention (see blog). The usage is like below. We need transformers>=4.36 and torch>=2.1.1 to use it.

import torch
from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

# Load in half precision, then swap in the fused attention implementation
model = AutoModelForCausalLM.from_pretrained("gpt2-large", torch_dtype=torch.float16)
model = BetterTransformer.transform(model, keep_original_model=False)

Position Representation

Recent innovations in position encoding have led to accuracy improvements for long input sequences. The initial attention papers used absolute position embeddings. Given an input sequence of embeddings $X = (x_{1}, ..., x_{N}) \in R^{d \times N}$ , absolute position embeddings $p_{1}, ..., p_{N} \in R^{d}$ are generated by the model. These position embeddings are added to the input sequence $\hat{X} = (x_{1} + p_{1}, ..., x_{N} + p_{N})$ , thereby allowing the model to use these position cues.

It turns out that fixed positional embeddings are not ideal because they require the model to learn a fixed, unique representation of each position $1, ..., N$ . This does not represent language well, because the word in position i in one sentence does not necessarily serve the same purpose as a word in the same position in another sentence. Rather, it is the relative distance between words that we want to encode in the model. Furthermore, training absolute position embeddings makes it difficult for our model to generalize to texts with longer sequences than what it was trained with.

Recent papers advocate for relative positional embeddings, with the following differences:

Relative position encoding rather than absolute position encoding
The encoding of relative position is done most naturally within the $Q K^{T}$ self-attention matrix, since that is where the relative degree of interaction between tokens at different positions is encoded
The encoding should be such that tokens further apart have a lower value in the self-attention matrix and tokens closer together have a higher value

Rotary Position Embeddings (RoPE) (Su 2021) proposes rotating the query and key vectors by angles proportional to their positions, so that their dot product depends only on the relative distance between the two positions. Specifically $\hat{q}_{i}^{T} \hat{k}_{j} = q_{i}^{T} R_{θ, i - j} k_{j}$ , where $R_{θ, i - j}$ is a rotation matrix that performs the rotation.
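The relative-distance property is easy to check numerically in two dimensions. The sketch below is a simplification: actual RoPE splits the hidden dimension into many 2D pairs, each with its own frequency, whereas here a single pair and a single made-up angle `theta` are assumed.

```python
import math

def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by the given angle."""
    x, y = vec
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def rope_score(q, k, i, j, theta=0.1):
    """Dot product after rotating q by i * theta and k by j * theta."""
    q_hat = rotate(q, i * theta)
    k_hat = rotate(k, j * theta)
    return q_hat[0] * k_hat[0] + q_hat[1] * k_hat[1]

q, k = (1.0, 0.5), (0.3, -0.8)
print(rope_score(q, k, i=3, j=1))  # positions two apart
print(rope_score(q, k, i=7, j=5))  # also two apart, so the same score
print(rope_score(q, k, i=3, j=2))  # one apart, so a different score
```

The first two scores agree because only the difference between positions survives in the dot product, while the absolute positions cancel out.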

Attention with Linear Biases (ALiBi) (Press 2022) proposes an even simpler method. It simply subtracts $m \cdot |i - j|$ from the entry at row $i$ and column $j$ of the self-attention matrix, where $m$ is a fixed scalar slope (specific to each attention head). Intuitively, it penalizes the attention proportional to the distance between the tokens. The study shows that this method outperforms RoPE as we extrapolate to longer sequences, and is conceptually simpler.
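A small sketch of the ALiBi penalty follows. It assumes the parametrisation in which a per-head slope `m` multiplies the distance, which is how I understand the paper. The slope value and function names are made up for illustration.

```python
def alibi_bias(n: int, m: float):
    """n x n matrix of penalties: entry (i, j) is -m * |i - j|."""
    return [[-m * abs(i - j) for j in range(n)] for i in range(n)]

def add_bias(scores, bias):
    """Add the penalty to the raw attention scores before the softmax."""
    return [[s + b for s, b in zip(s_row, b_row)] for s_row, b_row in zip(scores, bias)]

bias = alibi_bias(n=4, m=0.5)
for row in bias:
    print(row)

scores = [[1.0] * 4 for _ in range(4)]
print(add_bias(scores, bias)[3])  # the last token attends less to far-away tokens
```

Since the penalty depends only on $|i - j|$ , nothing about it is tied to the training sequence length, which is the intuition for why it extrapolates to longer sequences.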

Key-Value Cache

Most LLMs work in an auto-regressive manner, i.e. we provide an input sequence, generate the next token with the LLM, then append this token to the input sequence for the next iteration. Most LLMs are also trained with the causal language modelling objective and mask the upper triangle of the self-attention matrix, so that each query token $q_{i}$ can only interact with key token $k_{j}$ and value token $v_{j}$ if $j \leq i$ . This setup encourages us to cache results from previous time steps, since a lot of computation is repeated.

The following is based on how I imagine this to work, after reading Cameron R. Wolfe's LinkedIn post. During training, we compute the projections $Q, K, V \in R^{N \times d}$ , where $N$ is the maximum sequence length and $d$ is the hidden dimension. The final output $O = \text{softmax}(QK^{T}) V \in R^{N \times d}$ actually provides a $d$ -dimension representation of the model's prediction at each of the $N$ positions.

For next token generation, we can add a projection head, say $W_{p} \in R^{d \times p}$ , where $p$ represents the size of the vocabulary, such that $A = O W_{p} \in R^{N \times p}$ can represent the activations at each of the $N$ positions for the next token. Specifically, $A_{[0, :]}$ represents the predictions of position $1$ given input tokens $[0]$ , $A_{[1, :]}$ represents the predictions of position $2$ given input tokens $[0, 1]$ , and so on. These activations will then be fed into some cross-entropy loss such that activations at the correct token for each position get rewarded. This allows us to do efficient training, since we simultaneously provide losses for the prediction at each of the $N$ positions to the model for backpropagation.

However, when we are doing inference generation, we only need to predict for the final position of the input sequence (suppose it is position $c$ ), i.e. we are only interested in $A_{[c, :]}$ and $O_{[c, :]}$ . Hence for starters, we only need $q_{c} := Q_{[c, :]}$ instead of the entire $Q$ matrix, since only that row comes into play. However, we still need the entire $K$ and $V$ matrices, since we want $q_{c}$ to interact with all tokens in the input sequence. This is where the KV cache comes in - we cache the existing $K$ and $V$ matrices, so that we only need to project the final token of the input sequence $x_{c}^{T} W_{k} \in R^{1 \times d}$ and $x_{c}^{T} W_{v} \in R^{1 \times d}$ at each step and append it to the existing cached $K$ and $V$ . We can then compute $O_{c} = \text{softmax}(q_{c} K^{T}) V \in R^{1 \times d}$ .
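Here is a toy single-head sketch of that caching scheme, in line with how I imagine it to work. It is not how any particular library implements it. The class name, the tiny weight matrices and the inputs are all made up.

```python
import math

def matvec(x, W):
    """Project a token embedding x (length k) with a weight matrix W (k x d)."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def attend(q, K, V):
    """softmax(q K^T) V for a single query row q."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, V)) / total for c in range(len(V[0]))]

class CachedAttention:
    def __init__(self, W_q, W_k, W_v):
        self.W_q, self.W_k, self.W_v = W_q, W_k, W_v
        self.K, self.V = [], []  # the KV cache, one row per token seen so far

    def step(self, x_c):
        """Process only the newest token: project it, append to the cache, attend."""
        self.K.append(matvec(x_c, self.W_k))
        self.V.append(matvec(x_c, self.W_v))
        q_c = matvec(x_c, self.W_q)
        return attend(q_c, self.K, self.V)

W_q = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
W_k = [[0.2, 0.0], [-0.1, 0.4], [0.3, 0.1]]
W_v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
tokens = [[1.0, 0.0, 2.0], [0.5, 1.0, -1.0], [2.0, 2.0, 0.0]]

layer = CachedAttention(W_q, W_k, W_v)
for x in tokens:
    o_c = layer.step(x)
print(o_c)  # output for the final position, computed from cached K and V
```

Each step performs three small projections instead of re-projecting the whole sequence, at the cost of keeping `K` and `V` in memory.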

As one can imagine, this saves a lot of computation, but also increases memory costs. Kwon 2023 - PagedAttention shows that serving a 13B model on NVIDIA A100 with 40GB of memory:

$65\%$ of memory is model parameters
$> 30\%$ is the KV cache
A small amount of memory is used ephemerally for activation

The usage of KV cache is like so:

model = AutoModelForCausalLM.from_pretrained(...)
model.generate(..., use_cache=True)

How to fine-tune an LLM

trl RL example
- Fine tune a 20B GPT model on text generation on IMDB dataset (loaded in 8 bit)
- Since step 1 used PEFT, we need to merge the adapter weights with the base model
- Finally, use RLHF to generate positive movie reviews. They used a BERT IMDB sentiment classifier to generate rewards
DataCamp example
- Using SFTTrainer from the trl library to do supervised fine-tuning
PEFT - based on LoRA - PEFT is built by HuggingFace to support LoRA.

Automatic Prompt Optimization

Automatic prompt optimization is a research area of recent interest. This trend arises from the following observations:

Adjusting textual prompts can significantly change an LLM's ability to accurately complete tasks
The correlation between prompt and performance is not always the most obvious or intuitive to humans
The best prompt varies depending on the LLM model (or even different iterations of the same model)

LLMs are increasingly used to solve diverse problems by varying the instructions. This is often a simpler solution than maintaining many models that do various specific tasks. For example, in the recsys world, companies like LinkedIn and Netflix are moving toward foundational LLM models that can replace a collection of traditional recommender models. By adjusting the instructions to the foundation model, it can achieve comparable or even better performance to these task-specific models due to favourable scaling laws with the size of the model.

However, manually tuning the instruction for each task is brittle and model performance can easily change as the underlying foundation model is continually trained and improved, or swapped out. Also, we may often want to adapt a foundational model to a new task without performing further fine-tuning or training adapter weights. Hence, automatic prompt optimization becomes an important tool to maintain model performance in an operational setting.

This note focuses on gradient-free methods to optimize black-box LLMs, as they are often simpler and are applicable to LLM usage via API, which is still the most common LLM usage method outside of big tech companies. I also follow Wolfe 2024 heavily but simplify quite a bit for readability.

Main Idea

The main idea of automatic prompt optimization is to simply treat the task of prompt optimization as a standard machine learning problem. That is, we construct a training dataset and optimize our prompt to improve performance on it, and validate the effectiveness of the prompt on a held out validation set. The only difference is that we have to be creative in how we optimize the prompt, since we cannot use gradient descent to do so.

Specifically, the general setup that we consider is such:

We have a dataset of inputs and labels that can be split into train and validation sets
- Labels are optional if we are using another LLM as judge
For each training example, we can produce textual output from an LLM call: $\text{output} = \text{LLM}(\text{prompt}, \text{input})$
We have some evaluation function that returns a score for this example instance: $\text{score} = \text{Eval}(\text{output}, \text{label})$
- The evaluation function obviously depends on the specific task
- E.g. for simple tasks, the evaluation function can simply be the accuracy of $\text{output} == \text{label}$
- The evaluation function can also be another LLM for ambiguous tasks
We can thus compute the mean score across a set of training instances to evaluate the effectiveness of a given prompt
At each iteration, we try one or more new prompt(s) and evaluate their performance, then have some way of generating new prompts to try again
At the end, the best prompt is selected
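The setup above can be sketched as a generic loop. Everything here is a placeholder: `llm`, `evaluate` and `propose` are hypothetical callables the reader would supply (e.g. an API call, an exact-match scorer, and some prompt-rewriting strategy), and keeping the top few prompts per round is just one simple assumption about how new candidates are chosen.

```python
def mean_score(prompt, examples, llm, evaluate):
    """Average Eval(LLM(prompt, input), label) over a set of examples."""
    return sum(evaluate(llm(prompt, x), y) for x, y in examples) / len(examples)

def optimize_prompt(seed_prompts, train, valid, llm, evaluate, propose,
                    n_iters=3, keep=2):
    candidates = list(seed_prompts)
    for _ in range(n_iters):
        # Score every candidate on the training set and keep the best few
        ranked = sorted(candidates,
                        key=lambda p: mean_score(p, train, llm, evaluate),
                        reverse=True)
        survivors = ranked[:keep]
        # Generate new prompts from the survivors, dropping duplicates
        candidates = list(dict.fromkeys(survivors + [propose(p) for p in survivors]))
    # Final selection on the held-out validation set
    return max(candidates, key=lambda p: mean_score(p, valid, llm, evaluate))

# Toy stand-ins so that the loop can be run end to end
def toy_llm(prompt, text):
    return text.upper() if "upper" in prompt else text

def exact_match(output, label):
    return 1.0 if output == label else 0.0

def toy_propose(prompt):
    return "Convert the input to upper case."

best = optimize_prompt(["Repeat the input."], [("a", "A"), ("b", "B")], [("c", "C")],
                       toy_llm, exact_match, toy_propose)
print(best)
```

The methods below differ mainly in how the `propose` step is carried out.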

Methods

Here we dive into the different methods. Some are quite ingenious in how they use LLMs to elicit the best instructions. Note that we omit papers that perform prompt optimization through some gradient-based optimization (e.g. using reinforcement learning), as these cannot be used with LLMs accessed through API.

Instruction Induction

Honovich 2022 is an early paper that suggests taking a random subset of training (input, label) pairs, showing them to an LLM, and asking the LLM to guess the instruction that produces the label from the inputs.

For example, the LLM prompt may look something like:

Here are the input-output pairs:

Input: As soon as you can.
Output: At your earliest convenience.
...

The instruction was <PLS FILL IN>

The LLM may then guess something like The instruction was translate the inputs into more formal language. We can then use this instruction as the new optimized prompt.

Note that this method is a one step process, i.e. it does not iterate for further improvements. But I suppose we can sample multiple random subsets of training instances and generate prompts for each before picking the best prompt.
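A minimal sketch of building such an induction prompt from a random subset of training pairs is shown below. The template follows the example above, while the function name, the subset size and the second example pair are assumptions for illustration.

```python
import random

def build_induction_prompt(pairs, k=5, seed=0):
    """Show k random (input, output) pairs and ask the LLM to guess the instruction."""
    rng = random.Random(seed)
    sample = rng.sample(pairs, min(k, len(pairs)))
    lines = ["Here are the input-output pairs:", ""]
    for inp, out in sample:
        lines += [f"Input: {inp}", f"Output: {out}", ""]
    lines.append("The instruction was <PLS FILL IN>")
    return "\n".join(lines)

pairs = [
    ("As soon as you can.", "At your earliest convenience."),
    ("Sorry I couldn't make it.", "I apologize for my absence."),
]
print(build_induction_prompt(pairs, k=2))
```

Calling it with different seeds gives different subsets, which is one way to generate several candidate instructions before picking the best one.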

Bibliography

Fine-tuning

LLMs are typically trained with a next-token prediction task on large amounts of text in an unsupervised manner. Some of the behaviour in these texts is not desirable to imitate. For example, while Github is full of code repositories with common programming mistakes, we do not want the LLM to replicate such behaviour. Hence, a process of alignment is necessary to encourage the model to produce desired responses.

There are typically two stages to this fine-tuning: Supervised Fine-Tuning and Reinforcement Learning from Human Feedback.

Supervised Fine-Tuning (SFT). In this stage, pairs of (prompt, desired response) are provided to the LLM. The desired responses, written by human annotators, are often called "demonstrations". Some form of cross-entropy loss is then used to update the LLM to encourage it to generate the desired response. This is a straightforward approach that is similar to the next-token prediction task (except we are predicting the desired response given the prompt). In the InstructGPT paper, the authors show that an SFT-aligned 1.3B model (using ~13k training examples) generates outputs that humans prefer over those of a 175B GPT model, showing the importance of SFT.

The problem with SFT is that it trains the model to provide a very specific response to a particular prompt, which makes it hard for the model to generalize. It also fails to express the natural ambiguity of responses for an open-ended prompt (e.g. write a poem about AI). Hence, unless we can collect quality responses for a very wide variety of prompts, SFT is limited in its ability to generalize.

Reinforcement Learning from Human Feedback (RLHF). This is where RLHF comes in: given triplets of (prompt, preferred response, not preferred response), we train the model to generate the preferred response in a more generalizable way.

More recently, a method called Direct Preference Optimization is used to solve this problem without needing to do reinforcement learning. More on that here.

Useful Models

There are too many models on Huggingface, so I try to keep track of useful ones here. Generally the open_llm_leaderboard is a good place to start.

| Model Name | # Params (Billions) | openllm score | Comments |
|---|---|---|---|
| SOLAR-10.7B-Instruct-v1.0 | 10.7 | 74.2 | This model was trained using Supervised Finetuning (SFT) and Direct Preference Optimization (DPO) for single-turn chat. Seems to be derived from Llama. Verified that the outputs are sensible. |

Testing SOLAR-10.7B

Here are some results from testing out SOLAR-10.7B-Instruct-v1.0 on Google Colab, with a T4 GPU. First we install packages and load the model. Note that we are quantizing it to 4bits to save memory, and specifying the compute type of torch.float16 as that is the tensor type that this model was trained in. Using torch.float32 (which is the default) will incur unnecessary conversion cost and slow down inference.

%pip install transformers~=4.36 torch~=2.1.1 bitsandbytes accelerate optimum

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

MODEL_NAME = "Upstage/SOLAR-10.7B-Instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    quantization_config=quantization_config,
)

We define a generate function.

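def generate(prompt: str, max_length: int = 2048) -> str:
    # Wrap the prompt as a single-turn conversation and apply the model's chat template
    conversation = [{'role': 'user', 'content': prompt}]
    prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # max_length caps prompt plus generated tokens; use_cache enables the KV cache
    outputs = model.generate(**inputs, use_cache=True, max_length=max_length)
    # Decode the full sequence (prompt included) back into text
    output_text = tokenizer.decode(outputs[0])
    return output_text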

And now we can test it out on a few prompts.

Prompt: Can you explain web3 to a 5 year old? Response: Web3 is like a special world on the internet that lets people share and use things in a new and more open way. Imagine a playground where everyone can add their own swings, slides, and sandboxes without needing permission from a big boss. This makes it easier for people to work together and create new fun things online. (Inference time: 8s)

Prompt: Can a firefighter transit into the finance industry? Response: While it may not be a typical career transition, it is possible for a firefighter to transition into the finance industry. To make a successful move, the firefighter would need to acquire the necessary skills and qualifications relevant to the finance sector. This could involve pursuing additional education or training in finance, accounting, or business administration. Networking, gaining relevant work experience, and showcasing transferable skills from the firefighting profession can also help in making the transition.\n\nSome transferable skills from firefighting to finance could include strong problem-solving abilities, teamwork, decision-making under pressure, and the ability to manage risk. However, it's essential to note that the finance industry has its unique requirements, and the firefighter would need to demonstrate a strong commitment to learning and adapting to the new environment. (Inference time: 19s)

Seems to be pretty decent performance at a somewhat reasonable inference time out of the box.

Speeding up GPU inference.

Gemma Models

References:

Google released open source small models under the Gemma label. They come in two sizes: 2B and 7B parameters, each with base (pretrained) and instruction-tuned versions. All have a context length of 8k tokens. Gemma integration is supported with transformers==4.38. Note that user consent is needed to access Gemma models, so one needs to create a HuggingFace account and create an access token.

Sample Code

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# TOKEN is assumed to hold a HuggingFace access token defined elsewhere
# Use model_name="google/gemma-7b-it"
def load(model_name: str):
    tokenizer = AutoTokenizer.from_pretrained(model_name, token=TOKEN)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,  # nested quantization
            bnb_4bit_quant_type="nf4",  # normalized quantization
        ),
        token=TOKEN,
    )
    return model, tokenizer

Inference Speed

Tested out the inference speed of the model using the following prompt: Can you recommend good alternative occupations for a occupation_name? Please be concise.

An example is as follows, for an occupation counsellor:

Sure, here are some alternative occupations for a counsellor:

* **Social worker:** Similar to counselling, social workers help individuals and families facing various challenges.
* **Community organizer:** Organizes and leads community initiatives to address social issues.
* **Teacher:** Can be a good fit for those who enjoy working with children and adolescents.
* **Human resource manager:** HR managers handle employee relations and conflicts.
* **Public relations specialist:** Helps organizations communicate effectively with the public.

The inference speed is about 8-10 seconds with 4-bit quantization.

Gemma-2b

Testing out google/gemma-2b-it shows that it has poorer performance but faster inference speed at 2 billion parameters. Here is an example of the results for the same prompt.

* **Life coach**
* **Coach in education**
* **Career coach**
* **Executive coach**
* **Leadership coach**
* **Management consultant**
* **Business development consultant**
* **Financial advisor**
* **Investment banker**

The inference speed is as follows:

Single: 3s
Batch of 4: 14s
Batch of 8: 16s
Batch of 16: 20s
Batch of 32: OOM on T4 GPU

Phi-2

Mobius Labs fine-tuned the phi-2 model from Microsoft which seems promising, and released it under mobiuslabsgmbh/aanaphi2-v0.1. The output seems better than gemma-2b-it.

1. Social worker
2. Mental health therapist
3. School counselor
4. Employee assistance program (EAP) specialist
5. Rehabilitation counselor
6. Family therapist
7. Substance abuse counselor
8. Career counselor
9. Trauma-focused therapist
10. Child and adolescent therapist

These occupations involve working with individuals, families, and communities to promote mental health and well-being, and may provide similar skills and experiences to those of a counsellor.

bitsandbytes example notebook from HuggingFace

Flash attention, torch.compile, quantization.

Encoder vs Decoder

There are broadly two categories of LLMs: Encoder-Decoder architecture (typified by the original Transformer; BERT, often mentioned alongside it, keeps only the encoder) and Decoder only architecture (typified by GPT-2 series). There are some innate differences between the two that affect the type of application each is well suited for.

Encoder-Decoder Architecture

The Encoder-Decoder architecture was proposed in the original Attention is All You Need paper by Vaswani et al. As the name suggests, there is a separate encoder and decoder module. The design choice of two distinct modules is easier to understand when we consider that the original paper was designed for the machine translation task, such that the inputs may be in English and the outputs may be in French.

On the encoder-side, the encoder module sees the full input sequence and encodes it into a numeric representation matrix of size (seq_len, embed_dim). There is no input masking required because we assume we see the full English text before we begin translation. This encoding representation is passed to the decoder module as context.

On the decoder-side, we cannot allow the decoder module to see the full output sequence during training, since that would cause data leakage and the model to collapse. Note that the decoder simultaneously (in one forward pass) predicts a token for each position in the output sequence during training. Hence to avoid data leakage, causal masking is applied such that for each position being predicted, the decoder module is only allowed to "see" tokens in previous positions.

Note that this can be implemented simply by masking the attention matrix in each self-attention block in a triangular manner. The feedforward layers only map information from position i to position i, so there is no need to adjust that.
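The triangular masking can be sketched as follows. This is a pure-Python illustration that assumes position $i$ may see positions $0, ..., i$ (itself included). Real implementations typically add a large negative number to the masked scores before the softmax, which has the same effect.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where row i only sees columns 0..i; the rest get weight 0."""
    n = len(scores)
    probs = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]              # lower triangle, including the diagonal
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        probs.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return probs

scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]
for row in causal_softmax(scores):
    print([round(p, 3) for p in row])
```

The first row puts all of its weight on position 0, while the last row is an ordinary softmax over all three positions.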


Encoder-Decoder Architecture (From Attention is All You Need)

The decoder module itself creates a representation of the output sequence (with causal masking) of size (seq_len, embed_dim). Now for the tricky part of merging the encoder and decoder representations. Naively, we can simply concatenate the two to form a representation of (seq_len x 2, embed_dim), and add further self-attention blocks (with appropriate causal masking). However, this would increase the number of parameters of the model.

The paper instead chose to take the Q matrix from the decoder and the K, V matrices from the encoder. Since the attention mechanism is $\text{softmax}(Q \cdot K^{T}) V$ , this allows every token on the decoder representation to attend to every token on the encoder representation. The result is a weighted combination of the V vectors taken from the encoder representation which is eventually used to generate the tokens for each decoder position.
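The cross-attention step can be sketched in NumPy as follows. This is a single-head illustration under assumed toy shapes; the names cross_attention, W_q, W_k and W_v are hypothetical, and the scores are scaled by the square root of the dimension as in the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(decoder_states, encoder_states, W_q, W_k, W_v):
    """Queries come from the decoder; keys and values come from the encoder."""
    Q = decoder_states @ W_q  # (tgt_len, embed_dim)
    K = encoder_states @ W_k  # (src_len, embed_dim)
    V = encoder_states @ W_v  # (src_len, embed_dim)
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (tgt_len, src_len)
    return weights @ V, weights


rng = np.random.default_rng(0)
src_len, tgt_len, embed_dim = 5, 3, 8
encoder_states = rng.normal(size=(src_len, embed_dim))
decoder_states = rng.normal(size=(tgt_len, embed_dim))
W_q, W_k, W_v = (rng.normal(size=(embed_dim, embed_dim)) for _ in range(3))

out, weights = cross_attention(decoder_states, encoder_states, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (3, 8) (3, 5)
```

No causal mask is needed here, since every decoder position is allowed to see the whole input sequence.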

Here we can already observe some inductive biases baked into the encoder-decoder architecture. These comments are inspired by / taken from Hyung Won Chung's lecture:

There is a clear distinction between the tokens on the encoder-side and the decoder-side, such that tokens on either side cannot directly attend to each other until the final block when the mixing happens. This may be suitable for the translation task but not for generic chat completion, where it is quite arbitrary which point in the conversation we determine as the encoding and which point we determine as the decoding.
The V matrix is taken entirely from the encoder-side, which may again limit the expressiveness of the model (what if there are important tokens on the decoder-side that it would be useful to take the V values from?). One may argue that in translation, the tokens on the encoder-side capture the full meaning of the sentence to be translated, so it is comprehensive. But the same is not true of generic chat completions.
The separation of an encoder and decoder module suggests that there is significant difference between the tokens on the encoder-side and decoder-side. Again, this inductive bias

Contextualized Recommendations

Contextualized recommendations is an emerging use case from LLMs. The idea is that we use a traditional search and recommender system to generate recommendations, and use an LLM to craft an explanation for why each recommendation is relevant to the user in a very personalized way. Multiple companies have reported that this has driven engagement and clicks up significantly.

Spotify

Contextualized recommendations through personalized narratives using LLMs

Traditionally, Spotify users use just the cover art to decide whether to engage with a new music recommendation. Spotify wants to include a short one-liner to explain why a particular item might resonate with users. For example, Dead Rabbitts latest single is a metalcore adrenaline rush! or Relive U2's iconic 1993 Dublin concert with ZOO TV Live EP.

Spotify highlights some challenges they faced:

Ensuring a consistent generation style and tone
Avoiding harmful or inappropriate outputs
Mitigating hallucinations and inaccuracies
Understanding user preferences to deliver tailored meaningful explanations

Initial tests with zero-shot / few-shot Llama did not work too well. They adopted a human-in-the-loop approach:

Expert editors provide "golden examples" for instruction fine-tuning
Provide ongoing feedback to address errors in LLM output
- Artist attribution errors
- Tone inconsistencies
- Factual inaccuracies

The AB tests showed that explanations containing meaningful details about the artist or music led to significantly higher user engagement.

For LLM fine-tuning, they found that Llama 3.1 8B worked well and could be trained with multiple adapters for 10 different tasks. Throughout the training process, they used MMLU benchmark as a guardrail to ensure that the model's overall ability remained intact. Spotify uses vLLM for inference.

Our new AI powered LinkedIn

LinkedIn provides AI features for premium users. When users click on a job, they can ask questions like "Am I a good fit for the job?". The LLM will respond with a short bullet-pointed explanation on:

Whether the user is a good fit
Details from the user's profile that make them a good fit
Areas that the users are missing

MinGPT

This is a walkthrough of Karpathy's minGPT implementation.

`bpe.py`

This file contains code to implement a Byte Pair Encoding (BPE) encoder. It does not contain code for training; it just loads OpenAI's GPT-2 BPE encoding for inference.

Most text is represented in UTF-8 encoding, which is just a sequence of bytes (values 0 to 255). For example, the byte 0x21 corresponds to 33 in decimal, which corresponds to the character !. This means that all text can be treated as a sequence of byte values.

As a fallback, we first need token representations for individual byte values (in case we encounter out of vocab tokens with unknown byte sequences).
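A quick way to see the byte view of text, using only the standard library (the variable names are just for illustration):

```python
# Every string can be turned into a sequence of byte values (0 to 255) via UTF-8.
exclamation = "!".encode("utf-8")
print(list(exclamation))  # [33], i.e. 0x21

# A character outside ASCII becomes several bytes; this emoji takes four.
emoji = "\U0001F642"  # slightly smiling face
emoji_bytes = list(emoji.encode("utf-8"))
print(emoji_bytes)  # [240, 159, 153, 130]

# The byte sequence round-trips back to the original text.
restored = bytes(emoji_bytes).decode("utf-8")
print(restored == emoji)  # True
```

Since any text decomposes into these 256 possible byte values, having one token per byte value is enough to encode anything, even if no longer merged sequence matches.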

First, two files are downloaded in get_encoder, from https://openaipublic.blob.core.windows.net/gpt-2/models/124M/:

encoder.json is a dict of len 50257 mapping from a token to its index. This represents the entirety of the vocabulary
- The first 256 tokens represent the 256 byte values. Each token is some arbitrarily chosen character (just need to make sure it is printable)
- These tokens are a fallback to ensure that we can encode any text sequence. For example, if we encounter a new emoji with unknown byte sequence, at the very least we can encode each byte separately.
- The next 50k tokens map from a byte sequence of length 2 and above to an index, these are the BPE mined sequences of merged bytes
- The last token is <|endoftext|> which is a special token
vocab.bpe is a \n separated list of byte sequences that should be merged (50k of them)
- In contrast to the above, these sequences are not merged yet (e.g. a line is R ocket)
- We store these as a list[tuple] in bpe_merges

These two data are passed into the Encoder main class.

Statistics

Conformal Predictions

Exchangeability

In this section we discuss the difference between independence and exchangeability.

Definition. Independence.

Two events $A$ and $B$ from a sample space $S$ are said to be independent if $P(A \text{ and } B) = P(A) P(B)$ .

If $P(B) > 0$ , it is equivalent to say $P(A \mid B) = P(A)$ , since: $P(A \mid B) = \frac{P(A, B)}{P(B)} = \frac{P(A) P(B)}{P(B)} = P(A)$

Definition. Exchangeability

Two random variables $A_{1}$ and $A_{2}$ are said to be exchangeable if $P(A_{1} = a, A_{2} = b) = P(A_{1} = b, A_{2} = a)$ for all values $a$ and $b$ , which means there is indifference with respect to the order of events.

More generally, exchangeability of a sequence $A_{1}, ..., A_{n}$ means that the joint distribution is unchanged when we permute the order: $P(A_{1}, ..., A_{n}) = P(A_{\sigma(1)}, ..., A_{\sigma(n)}) \text{ for all permutations } \sigma$

The simplest way to understand is to use the example of drawing balls from an urn without replacement (example taken from Cordani 2006). Suppose we have an urn with 10 red balls and 5 white balls. Then the following tree shows the draw probabilities at each step:

graph TD;
    A((Start)) --> R1("R1 (10/15)");
    A --> W1("W1 (5/15)");

    R1 --> R2("R2 (9/14)");
    R1 --> W2("W2 (5/14)");

    W1 --> R2_2("R2 (10/14)");
    W1 --> W2_2("W2 (4/14)");

    R2 --> R3("R3 (8/13)");
    R2 --> W3("W3 (5/13)");

    W2 --> R3_2("R3 (9/13)");
    W2 --> W3_2("W3 (4/13)");

    R2_2 --> R3_3("R3 (9/13)");
    R2_2 --> W3_3("W3 (4/13)");

    W2_2 --> R3_4("R3 (10/13)");
    W2_2 --> W3_4("W3 (3/13)");
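The probabilities in the tree can be checked with exact fractions. The sketch below (the helper name sequence_probability is made up for illustration) multiplies the branch probabilities along a path, and shows that the draws are exchangeable but not independent.

```python
from fractions import Fraction
from itertools import permutations


def sequence_probability(seq, red=10, white=5):
    """Probability of drawing the colours in `seq` in order, without replacement."""
    prob = Fraction(1)
    for colour in seq:
        total = red + white
        if colour == "R":
            prob *= Fraction(red, total)
            red -= 1
        else:
            prob *= Fraction(white, total)
            white -= 1
    return prob


# Exchangeable: reordering the same draws does not change the joint probability.
p_rw = sequence_probability("RW")  # 10/15 * 5/14
p_wr = sequence_probability("WR")  # 5/15 * 10/14
print(p_rw, p_wr)  # 5/21 5/21

orderings = {"".join(p) for p in permutations("RRW")}
probs = {sequence_probability(s) for s in orderings}
print(probs)  # a single value, 15/91

# Not independent: the second draw depends on the first.
p_second_red = sequence_probability("RR") + sequence_probability("WR")
p_second_red_given_first_red = sequence_probability("RR") / sequence_probability("R")
print(p_second_red, p_second_red_given_first_red)  # 2/3 9/14
```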

Suppose we

References

Miscellaneous Notes

A collection of miscellaneous, useful notes.

f-strings

To surround a text with a symbol (say =) to a fixed length:

>>> text = "Title"
>>> print(f"{text:=^20}")

=======Title========

Vim

Command to interactively change each 'foo' to 'bar'. :%s runs the substitute command over every line of the file (% is the range covering the whole file), followed by the search and replace phrases respectively. Finally g means replace all occurrences on each line and c means ask for confirmation. Note that :s without % will only do the same for the current line.

:%s/foo/bar/gc

Find Files

To find files with the filename python using bash, searching recursively from the current directory (replace . with / to search the whole system), use:

find . -name python

We can add * before and/or after the filename to allow other characters before or after our keyword. The pattern should be quoted so that the shell does not expand it before find sees it:

find . -name "*python*"

To search not just in the filename but also in the full path (e.g. we only want to search in Desktop), we can do:

find . -wholename "*Desktop*python*"

Note that if we want to locate executable binaries, another useful command is whereis:

whereis cat
---
cat: /usr/bin/cat /usr/share/man/man1/cat.1.gz

Numpy Indexing

Suppose we have a 2D array X and would like to take a slice of certain rows and columns. We might try, for e.g., to take the first two rows of X and the 3rd/4th column of X, i.e. we expect to get a 2 by 2 matrix.

import numpy as np
X = np.random.randn(5, 5)
X[[0, 1], [2, 3]]

Unfortunately, this will return a 1D array of paired items array([X[0, 2], X[1, 3]]), which is not what we want. Instead, a slightly inefficient but clear way is to index each axis separately. It is not optimal because the first indexing step creates a temporary copy (indexing with a list copies the data rather than returning a view) before the second is applied.

X[[0, 1], :][:, [2, 3]]

Finally, the recommended way seems to be to use np.ix_. Doing np.ix_([0, 1], [2, 3]) creates a tuple of two elements.

idx = np.ix_([0, 1], [2, 3])
idx
>> (array([[0],
           [1]]),
>>  array([[2, 3]]))

Indexing the original array with this output X[idx] will then give us what we want.
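The three indexing styles can be compared side by side on a small array whose values make the positions easy to read off (np.arange is used here instead of random values purely for illustration):

```python
import numpy as np

X = np.arange(25).reshape(5, 5)  # X[i, j] == 5 * i + j

# Pairs up the index lists element-wise: picks X[0, 2] and X[1, 3]
paired = X[[0, 1], [2, 3]]
print(paired)  # [2 8]

# Index rows first, then columns: gives the 2 by 2 block
chained = X[[0, 1], :][:, [2, 3]]
print(chained)  # [[2 3], [7 8]]

# np.ix_ builds broadcastable index arrays that select the same block in one step
block = X[np.ix_([0, 1], [2, 3])]
print(block)  # [[2 3], [7 8]]
```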

asyncio

reference

asyncio is a single-threaded framework that does not use multi-threading or multi-processing to speed up tasks. Instead, a coordinator (or event loop) passes control from a coroutine that is waiting on a slow operation (e.g. asyncio.sleep or an I/O operation) to other coroutines that are ready to run. This passing of control occurs at the await keyword. When the awaited operation completes, the event loop is notified and control returns to where the coroutine left off. Note that a truly blocking call such as time.sleep does not yield control and will stall the whole event loop.

asyncio does not speed up CPU-bound tasks, due to its single-threaded design. It only works when the function being awaited is an I/O operation that is supported by asyncio. This includes stuff like:

HTTP (supported through aiohttp)
DB calls (e.g. aioredis)

asyncio is preferred for such tasks (e.g. making multiple I/O calls and doing something when they all return) over multi-processing because of its single-thread design, making it easier to debug.
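A minimal sketch of the "make multiple I/O calls and do something when they all return" pattern. Here asyncio.sleep stands in for a real I/O call, and fake_io is a made-up name; in practice the awaited call would come from an asyncio-aware client library.

```python
import asyncio
import time


async def fake_io(name: str, delay: float) -> str:
    # Stand-in for an awaitable I/O call; control returns to the event loop while waiting
    await asyncio.sleep(delay)
    return f"{name} done"


async def main() -> list:
    # Run the three calls concurrently and collect results in submission order
    return await asyncio.gather(
        fake_io("a", 0.2),
        fake_io("b", 0.2),
        fake_io("c", 0.2),
    )


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(results)  # ['a done', 'b done', 'c done']
print(f"{elapsed:.2f}s")  # close to 0.2s rather than 0.6s, since the waits overlap
```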

relatedness

In the context of information retrieval, Trey Grainger in AI-Powered Search suggests a relatedness measure to connect arbitrary entities together. Suppose we have a collection of jobs and each job is tagged with a set of skills. Suppose we wish to retrieve relevant skills to an arbitrary free text query $q$ .

The relatedness idea is to define a foreground of documents, e.g. based on a retrieval of documents using query $q$ which are related to the query, and to compare the attributes of the foreground against the background, i.e. all documents.

Mathematically, we can think of the foreground documents as a sample, and the background documents as the population. The strength of the relationship between each skill $t$ to the query $q$ may then be defined as the z-statistic of the one-sample z-test of proportions of the occurrence of skill $t$ in the foreground sample compared against the background population. A significantly greater occurrence in the sample compared to the population suggests a strong relationship between $t$ and $q$ , and vice versa. Specifically:

$z = \frac{\hat{p} - p}{\sqrt{\frac{p (1 - p)}{n}}}$

Where:

$\hat{p} = \frac{df(q, t)}{df(q)}$ is the sample proportion.
$df(q, t)$ is the number of documents in the foreground corresponding to query $q$ that contain skill $t$ .
$df(q)$ is the total number of documents in the foreground corresponding to query $q$ . It is also the number of samples $n$ .
$p = P(t)$ is the probability of skill $t$ appearing across all documents.

By performing a retrieval and ranking skills based on the z-statistic, we can compute a relationship between any arbitrary query and attribute of the documents (on the fly, if necessary). This functionality is implemented in Solr.
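As a rough sketch, the z-statistic can be computed directly from the document counts. The function name relatedness_z and the example counts are made up for illustration; this is not how Solr exposes the score.

```python
import math


def relatedness_z(df_q_t: int, df_q: int, p_background: float) -> float:
    """One-sample z-statistic for the proportion of foreground documents containing a skill.

    df_q_t: foreground documents (retrieved for the query) that contain the skill
    df_q: total foreground documents, i.e. the sample size n
    p_background: proportion of all documents that contain the skill
    """
    p_hat = df_q_t / df_q
    standard_error = math.sqrt(p_background * (1 - p_background) / df_q)
    return (p_hat - p_background) / standard_error


# Hypothetical counts: the skill appears in 80 of 200 retrieved jobs,
# but only in 10% of all jobs, so it is strongly related to the query.
z_related = relatedness_z(df_q_t=80, df_q=200, p_background=0.1)

# A skill that is no more common in the foreground than in the background scores zero.
z_unrelated = relatedness_z(df_q_t=20, df_q=200, p_background=0.1)

print(round(z_related, 2), z_unrelated)  # 14.14 0.0
```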

Package Versioning

link

SemVer (Semantic Versioning) is a widely used versioning convention that many Python packages follow. For a package version of e.g. 1.2.3:

The first number 1 is the major version. We update it when we make major API changes. A major of 0 indicates that the package is under initial development and anything may change.
The second number 2 is the minor version. We update it when we add functionality in a backward compatible manner.
The third number 3 is the patch version. We update it when we patch bugs in a backward compatible manner.

Poetry is a useful tool to manage package dependencies in a Python library. In the pyproject.toml file, we specify package dependencies in the following manner:

[tool.poetry.dependencies]
python = ">=3.8,<3.11"
requests = "^2.26.0"
pytz = "~2022.1"

The caret requirement (e.g. requests = "^2.26.0") means that SemVer compatible changes are allowed, i.e. an update is allowed if the version number does not modify the left-most non-zero digit. E.g.:

^1.2.3 means that 1.3.0 is allowed but not 2.0.0
^0.2.4 means that 0.2.5 is allowed but not 0.3.0

The tilde requirement is somewhat stricter. It specifies a minimal version but allows room to upgrade depending on how many digits are supplied. If major.minor.patch or major.minor is supplied, only patch-level changes are allowed. If major is supplied, minor and patch level changes are allowed.

~1.2.3 means that 1.2.7 is allowed but not 1.3.0
~1.2 means that 1.2.7 is allowed but not 1.3.0
~1 means that 1.3.0 is allowed but not 2.0.0
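The caret and tilde rules above can be sketched as a small range check. This is a simplified illustration for plain numeric versions only (no pre-releases or wildcards), and the helper names are hypothetical rather than part of Poetry's API.

```python
def as_tuple(version: str) -> tuple:
    """'1.2' -> (1, 2, 0); missing components are padded with zeros."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def upper_bound(spec: str) -> tuple:
    """Exclusive upper bound implied by a caret (^) or tilde (~) requirement."""
    op, version = spec[0], spec[1:]
    parts = [int(p) for p in version.split(".")]
    padded = parts + [0] * (3 - len(parts))
    if op == "~":
        # Only major given: bump major. Otherwise: bump minor.
        idx = 0 if len(parts) == 1 else 1
    elif op == "^":
        # Bump the left-most non-zero component (or the last one supplied if all are zero)
        nonzero = [i for i, p in enumerate(parts) if p != 0]
        idx = nonzero[0] if nonzero else len(parts) - 1
    else:
        raise ValueError(f"unsupported operator: {op}")
    return tuple(padded[:idx] + [padded[idx] + 1] + [0] * (2 - idx))


def satisfies(candidate: str, spec: str) -> bool:
    """True if `candidate` falls in the range lower <= candidate < upper."""
    return as_tuple(spec[1:]) <= as_tuple(candidate) < upper_bound(spec)


print(satisfies("1.3.0", "^1.2.3"), satisfies("2.0.0", "^1.2.3"))  # True False
print(satisfies("0.2.5", "^0.2.4"), satisfies("0.3.0", "^0.2.4"))  # True False
print(satisfies("1.2.7", "~1.2.3"), satisfies("1.3.0", "~1.2.3"))  # True False
print(satisfies("1.3.0", "~1"), satisfies("2.0.0", "~1"))          # True False
```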

The Poetry versioning specification depends on developers updating their library versions in a disciplined way. It allows us some flexibility in package versioning while avoiding updating a dependency to a version that breaks our code. Having more flexibility in specifying package dependencies also reduces the risk of dependency version conflicts.

Parallelism in Python

References: realpython, superfastpython

The Python Global Interpreter Lock (GIL) is a lock that allows only one thread to execute Python bytecode at a time. The reason for this design feature is that the CPython interpreter's memory management is not thread-safe, meaning that if the GIL did not exist, the interpreter could introduce race conditions and other fatal errors due to multiple threads modifying memory at the same time. The GIL also allowed other non-thread-safe C programs to be easily integrated into the Python ecosystem, which contributed to Python's success as a language.

The GIL also improves performance for single-threaded programs, as only a single lock needs to be managed: object reference counts can be updated without taking a separate lock per object. Earlier attempts to remove the GIL were not adopted, largely because they degraded single-threaded performance. More recently, PEP 703 made the GIL optional, and an experimental free-threaded build is available from Python 3.13.

For many C extensions (e.g. numpy), multi-threading is still possible because these extensions are allowed to manually release the GIL (as long as they restore things back to normal when the functions return). This allows us to still use multi-threading for CPU-intensive functions with the GIL around. Similarly for Rust, we can release the GIL to achieve parallelism.

Alternatively, we can use multiprocessing to create multiple processes (instead of threads). Each process contains its own Python interpreter (and GIL) and hence can run truly in parallel. The downside is that the overhead of creating and managing processes (and passing data between them) is much higher than for threads, so the speed-up from multiprocessing is smaller than what true multi-threading would offer.
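A small sketch of threads overlapping when the GIL is released. time.sleep is used as a stand-in for a call that releases the GIL while it waits (the function name blocking_call is made up); pure-Python CPU-bound work would not overlap like this and is where concurrent.futures.ProcessPoolExecutor would be the usual alternative.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_call(i: int) -> int:
    time.sleep(0.2)  # releases the GIL while waiting, so other threads can run
    return i * i


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map preserves input order in its results
    results = list(pool.map(blocking_call, range(4)))
elapsed = time.perf_counter() - start

print(results)  # [0, 1, 4, 9]
print(f"{elapsed:.2f}s")  # roughly 0.2s instead of 0.8s, since the four waits overlap
```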

Memory Profiling

It is often useful to profile the memory usage of our script. In Python, we can use memory_profiler to check the memory usage of our program line by line.

from memory_profiler import profile
import sys


@profile(stream=sys.stdout)
def f():
    a = [1] * (10**6)
    b = [2] * (2 * 10**7)
    del b


if __name__ == "__main__":
    f()

This will print the following useful information to stdout. Note that even before we did anything, there is background memory usage of 17MiB.

Filename: memory_profiling.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5     17.1 MiB     17.1 MiB           1   @profile(stream=sys.stdout)
     6                                         def f():
     7     24.5 MiB      7.5 MiB           1       a = [1] * (10**6)
     8    177.2 MiB    152.6 MiB           1       b = [2] * (2 * 10**7)
     9     24.8 MiB   -152.4 MiB           1       del b

We might also want to track memory usage of a function over time. We can use memory_usage instead for that.

import time
from memory_profiler import memory_usage
def g(a, n: int = 100):
    time.sleep(1)
    b = [a] * n
    time.sleep(1)
    del b
    time.sleep(1)

if __name__ == "__main__":
    usage = memory_usage((g, (1,), {"n": int(1e7)}), interval=0.5)
    print(usage)

This will give us an array like so, showing the spike in memory in the middle of g.

[17.375, 17.5234375, 17.5234375, 19.34765625, 93.59765625, 93.59765625, 17.53125, 17.53125, 17.53125]

CUDA

PyTorch may sometimes throw errors if the installed torch version does not match the installed CUDA version. To address this, we need to first check the CUDA version using the nvcc command:

/usr/local/cuda/bin/nvcc --version
---
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

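Then install the torch build that matches the CUDA version by setting the suffix of the index URL accordingly (the example below uses cu118, i.e. the CUDA 11.8 build):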

pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

List Comprehension

When doing nested list comprehensions in Python, the later loop becomes nested as the inner loop. So this:

l = [i * j for i in range(4) for j in range(5)]

Is equivalent to:

l = []
for i in range(4):
    for j in range(5):
        l.append(i*j)

This also explains why flattening a list of lists is of the following form:

list_of_lists = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
flattened = [item for sublist in list_of_lists for item in sublist]

When we translate it into a nested for loop, we can see that the inner loop should indeed be placed at the end:

flattened = []
for sublist in list_of_lists:
    for item in sublist:
        flattened.append(item)

Configure VsCode Snippets

When typing stuff in VsCode, we sometimes write the same boilerplate and want to have a shortcut to type it for us. For example, when writing latex, we often have to create an align environment like so:

$$
\begin{align*}
    Equations go here..
\end{align*}
$$

VsCode has a snippets function to help us with that.

Press Ctrl + Shift + P and type Configure User Snippets
Create a local or global snippets file and name it (e.g. charles.code-snippets)
Add the snippet definition like the below

{
	"LaTeX Align Environment": {
	  "prefix": "align",
	  "body": [
	    "$$",
	    "\\begin{align*}",
	    "\t$1",
	    "\\end{align*}",
	    "$$$0"
	  ],
	  "description": "Inserts a LaTeX align* environment."
	}
}

Note that the prefix is the keyword to trigger the template. To use the template, we simply enter insert mode in vim as per normal (e.g. using a or i or o), then type align. At this point, we would see the text as per normal. We can then press ctrl+space to show the snippets auto-completion, if there are any. Finally, we just press enter to get the template generated. The cursor will now come to rest at position $1 as indicated in the snippet.

Github in Colab

We may want to run .ipynb notebooks within a github project in colab. There are two steps to this.

First, to get the .ipynb itself into colab, we can either use the UI to manually upload the notebook, or if we are loading from Github, we can simply start with the url to the notebook https://github.com/charleslow/REPOSITORY/blob/master/NOTEBOOK.ipynb and change github to githubtocolab. This will automatically open up the notebook in colab.

However, this only gives us the standalone notebook, and we will not have access to the project directory and packages. Hence, in the notebook, we need to:

!git clone <Insert REPOSITORY URL>.git
%cd <folder name>

This will allow us to import packages and run the notebook as per normal.

Quartz Cron

We often use quartz cron expressions to schedule jobs. A cron expression is simply 6 or 7 values in a space-separated string, such as:

"0 0 0 1 * ?"

Which means run on the:

0th second of the
0th minute of the
0th hour of the
1st day of the month of
every month
and doesn't matter which day of the week

The specifics of each field are in the table below, from the cron trigger tutorial:

Field Name	Mandatory	Allowed Values	Allowed Special Characters
Seconds	YES	0-59	`, - * /`
Minutes	YES	0-59	`, - * /`
Hours	YES	0-23	`, - * /`
Day of month	YES	1-31	`, - * ? / L W`
Month	YES	1-12 or JAN-DEC	`, - * /`
Day of week	YES	1-7 or SUN-SAT	`, - * ? / L #`
Year	NO	empty, 1970-2099	`, - * /`

On the special characters:

* is commonly used, meaning "select all the values"
? is only used for Day of month or Day of week. Basically if we set one, we must set the other to ? to avoid a conflict
- is used to specify a range. E.g. 10-12 in the hour field means run on the 10th, 11th and 12th hour
, is used to specify a few values, e.g. 10,12 in the hour field means run on the 10th and 12th hour
/ is used to specify increments. e.g. 0/15 in the minutes field means run on minutes 0, 15, 30, 45
# is a special case for Day of the Week. Basically, MON#1 means the 1st occurrence of a Monday of the given month

That's about it! Or we can just use an LLM to generate the expression we need.
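To make the field order concrete, here is a small sketch that labels each value of an expression and expands an increment. It is not a Quartz parser: split_quartz and expand_increment are made-up helpers that cover only the cases described above.

```python
FIELD_NAMES = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week", "year"]


def split_quartz(expression: str) -> dict:
    """Label each space-separated value with its field name (year is optional)."""
    values = expression.split()
    if len(values) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(values)}")
    # zip stops at the shorter sequence, so "year" is absent when only 6 values are given
    fields = dict(zip(FIELD_NAMES, values))
    # Following the note on ?: one of the two day fields should be left as ?
    if "?" not in (fields["day_of_month"], fields["day_of_week"]):
        raise ValueError("one of day_of_month / day_of_week should be '?'")
    return fields


def expand_increment(value: str, upper: int) -> list:
    """Expand 'start/step' into the explicit values up to and including `upper`."""
    start, step = value.split("/")
    return list(range(int(start), upper + 1, int(step)))


fields = split_quartz("0 0 0 1 * ?")
print(fields)
print(expand_increment("0/15", 59))  # [0, 15, 30, 45]
```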

Bradley-Terry Model

Based on Wikipedia. The Bradley-Terry model (1952) is a probability model that allows us to infer scores for individual objects based on a dataset of pairwise comparisons between them. Specifically, it estimates the probability that $i ▹ j$ (i.e. $i$ is preferred to $j$ ) as:

$P (i ▹ j) = \frac{p _{i}}{p _{i} + p _{j}}$

Where $p_{i}$ is a positive real-valued score assigned to object $i$ (not necessarily a probability). Typically, $p_{i}$ is parametrized as an exponential score $p_{i} = e^{β_{i}}$ , and the goal is to learn the parameters $β_{i}$ from pairwise comparisons. This results in:

$P (i ▹ j) = \frac{e ^{β_{i}}}{e ^{β_{i}} + e ^{β_{j}}}$

Parameter Estimation

Parameter estimation is typically done using maximum likelihood. Starting with a set of pairwise comparisons between individual objects, let $w_{ij}$ be the number of times object $i$ beats object $j$ . Then the log-likelihood of a given set of parameters $p := [p_{1}, ..., p_{n}]$ ( $n$ denotes the number of objects) is as follows:

$L(p) = \ln \prod_{ij} P(i ▹ j)^{w_{ij}} = \sum_{i=1}^{n} \sum_{j=1}^{n} \ln \left(\frac{p_{i}}{p_{i} + p_{j}}\right)^{w_{ij}} = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} [\ln p_{i} - \ln (p_{i} + p_{j})]$

This log-likelihood can then be maximized by differentiating with respect to each $p_{i}$ and solving for the point where the derivatives are zero.

BT Model as a Sigmoid Function

We can also express the log-likelihood as a function of the difference in scores $β_{i} - β_{j}$ . Recall that the sigmoid function is $σ(x) = 1 / (1 + e^{-x})$ . Then: $L(p) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot \ln \left[\frac{e^{β_{i}}}{e^{β_{i}} + e^{β_{j}}}\right] = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \cdot \ln σ(β_{i} - β_{j})$

The derivation for the second line above is found at /Identities/sigmoid. This re-parametrization shows that the BT-model is really modelling the preference as a difference in scores and then running that through the sigmoid function to convert it into a probability. This means that we are basically running a pairwise logistic regression.
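The sigmoid form lends itself to a short numerical sketch. Below, plain gradient ascent on the log-likelihood stands in for the estimation step; the win counts, learning rate, step count and function names are all made up for illustration, and this is only one of several ways to fit the model.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bt_prob(beta_i: float, beta_j: float) -> float:
    """P(i beats j) in the exponential parametrization."""
    return math.exp(beta_i) / (math.exp(beta_i) + math.exp(beta_j))


def log_likelihood(beta, wins) -> float:
    n = len(beta)
    return sum(
        wins[i][j] * math.log(sigmoid(beta[i] - beta[j]))
        for i in range(n)
        for j in range(n)
        if i != j
    )


def fit(wins, lr: float = 0.01, steps: int = 2000):
    """Gradient ascent on the log-likelihood; wins[i][j] = number of times i beat j."""
    n = len(wins)
    beta = [0.0] * n
    for _ in range(steps):
        grad = [0.0] * n
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # d/d(beta_i) of w_ij * ln(sigmoid(beta_i - beta_j)) is w_ij * (1 - sigmoid(...)),
                # and the derivative with respect to beta_j is the negative of that
                g = wins[i][j] * (1.0 - sigmoid(beta[i] - beta[j]))
                grad[i] += g
                grad[j] -= g
        beta = [b + lr * g for b, g in zip(beta, grad)]
        # Scores are only identified up to a constant shift, so centre them
        mean = sum(beta) / n
        beta = [b - mean for b in beta]
    return beta


# Hypothetical comparison counts between three objects
wins = [
    [0, 8, 9],
    [2, 0, 7],
    [1, 3, 0],
]
beta = fit(wins)
print([round(b, 2) for b in beta])  # object 0 scores highest, object 2 lowest
print(abs(bt_prob(0.3, -0.2) - sigmoid(0.5)) < 1e-12)  # True: both forms agree
```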

Setting up WSL

We may need to uninstall and re-install WSL (Windows Subsystem for Linux) from time to time. Here is the step-by-step.

Tear down and delete all files. wsl --unregister Ubuntu-22.04
Re-install wsl. wsl --install Ubuntu-22.04
Set up keychain to auto-find ssh-agent (from Windows) and add keys
- Copy .ssh folder from Windows to wsl ~/.ssh
- Add to ~/.bashrc the following: eval $(keychain --eval --agents ssh id_rsa)
Install Docker
- Add users to docker to allow vscode access
- https://docs.docker.com/engine/install/linux-postinstall/
Essentials (get C linker)
- sudo apt install build-essential
Install rust
- https://www.rust-lang.org/tools/install
Install mdbook
- cargo install mdbook mdbook-katex

To Read

https://netflixtechblog.com/improve-your-next-experiment-by-learning-better-proxy-metrics-from-past-experiments-64c786c2a3ac
https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df
https://arxiv.org/abs/2407.02464
https://arxiv.org/abs/2305.12102
https://arxiv.org/abs/1709.03933
https://arxiv.org/abs/2101.08769
https://arxiv.org/abs/1205.2618 # BPR
https://arxiv.org/abs/2312.08520 # InfoNCE+
https://arxiv.org/abs/2308.06091 # MAWU
https://arxiv.org/abs/2109.12613 # CCL
https://arxiv.org/abs/2201.02327 # SSM
https://arxiv.org/abs/2206.12811 # DirectAU
KG-BERT
XR-Transformer
SetFit
Lora vs Fine tuning
Better Generalization with semantic IDs
Mutual Information Neural Estimation
Contrastive Learning with Hard Negatives
Decoupled Contrastive Learning
Improving Graph Collaborative Filtering with Neighborhood-enriched Contrastive Learning
Semi-Supervised Classification with Graph Convolutional Networks
Ranking Distillation
PinnerFormer
BERT4Rec
Retrieval as Attention
Generate Rather than Retrieve
PLM in Baidu Search
Embedding Based Retrieval in Facebook Search
Self Attentive Sequential Recommendation
Is BERT4Rec really better than SASRec?
Transformers4Rec
Pretrained Transformers for Text Ranking: BERT and Beyond
Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?
Modularized Transformer-based Ranking Framework
Pre-trained Language Model for Web-scale Retrieval in Baidu Search
Related Pins at Pinterest: The Evolution of a Real-World Recommender System
Improving Pinterest Search Relevance Using Large Language Models
Pixie: A System for Recommending 3+ Billion Items to 200+ Million Users in Real-Time
Towards Personalized and Semantic Retrieval: An End-to-End Solution for E-commerce Search via Embedding Learning
Ranking Relevance in Yahoo Search
Fine-Tuning LLaMA for Multi-Stage Text Retrieval
Revisiting Deep Learning Models for Tabular Data
OmniSearchSage: Multi-Task Multi-Entity Embeddings for Pinterest Search
Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction
Behavior Sequence Transformer for E-commerce Recommendation in Alibaba
Graph Convolutional Neural Networks for Web-Scale Recommender Systems
PinnerSage: Multi-Modal User Embedding Framework for Recommendations at Pinterest
ItemSage: Learning Product Embeddings for Shopping Recommendations at Pinterest
RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses
Pretrained Language Model based Web Search Ranking: From Relevance to Satisfaction
Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline
Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval
SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval
Bootstrap your own latent: A new approach to self-supervised Learning
Gradient Boosting Neural Networks: GrowNet
GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup
Improving Recommendation Systems & Search in the Age of LLMs
Scaling Law of Large Sequential Recommendation Models
Unified Embedding Based Personalized Retrieval in Etsy Search
360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation
Building a Scalable, Effective, and Steerable Search and Ranking Platform
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-V3 Technical Report
The Llama 3 Herd of Models
Adalflow
LLMs by dropping the text
Layer by layer
How much do LLMs memorize
TextGrad
LLMs as optimizers
DSPY
LLMs are human level prompt engineers
LLMs can self improve
RLPrompt
Rationale augmented Ensembles
Self Consistency Improves CoT
CoT Prompting Elicits Reasoning
LLMs are zero shot Reasoners
Instruction Induction
Efficient Transformers
Fantastically ordered prompts
Protegi
React
MedPrompt
DCNv1
DCNv2

Packages

KeyBERT, KeyLLM

References:

KeyBERT and KeyLLM are packages to perform unsupervised keyword extraction from text. KeyBERT relies on BERT-based models, and the main idea is to extract n-gram phrases which have high semantic similarity to the overall document embedding. Some additional features are:

Allow user to specify phrase length for extraction
Add diversification via MMR to get diverse phrases

KeyLLM taps on LLMs to enhance the keyword extraction. Basically, it creates a prompt to ask an LLM to extract keywords from a document. It integrates with KeyBERT such that we can use KeyBERT to cluster documents, and only run KeyLLM on one document per cluster to save costs. It can also use KeyBERT to suggest candidates and use the LLM to verify.

Pytorch Lightning

The following pseudo code captures almost everything we need to know about pytorch lightning. Taken from here.

def fit(self):
    configure_callbacks()

    if local_rank == 0:
        prepare_data()

    setup("fit")
    configure_model()
    configure_optimizers()

    on_fit_start()

    # the sanity check runs here

    on_train_start()
    for epoch in epochs:
        fit_loop()
    on_train_end()

    on_fit_end()
    teardown("fit")


def fit_loop():
    torch.set_grad_enabled(True)

    on_train_epoch_start()

    for batch in train_dataloader():
        on_train_batch_start()

        on_before_batch_transfer()
        transfer_batch_to_device()
        on_after_batch_transfer()

        out = training_step()

        on_before_zero_grad()
        optimizer_zero_grad()

        on_before_backward()
        backward()
        on_after_backward()

        on_before_optimizer_step()
        configure_gradient_clipping()
        optimizer_step()

        on_train_batch_end(out, batch, batch_idx)

        if should_check_val:
            val_loop()

    on_train_epoch_end()


def val_loop():
    on_validation_model_eval()  # calls `model.eval()`
    torch.set_grad_enabled(False)

    on_validation_start()
    on_validation_epoch_start()

    for batch_idx, batch in enumerate(val_dataloader()):
        on_validation_batch_start(batch, batch_idx)

        batch = on_before_batch_transfer(batch)
        batch = transfer_batch_to_device(batch)
        batch = on_after_batch_transfer(batch)

        out = validation_step(batch, batch_idx)

        on_validation_batch_end(out, batch, batch_idx)

    on_validation_epoch_end()
    on_validation_end()

    # set up for train
    on_validation_model_train()  # calls `model.train()`
    torch.set_grad_enabled(True)

Skills

This documents LinkedIn's approach to constructing a knowledge graph around skills.

November 2022 - Building LinkedIn's Skills Graph to Power a Skills-First World

An important model for LinkedIn is the skills graph - it maps 39k skills over 26 languages and over 347k aliases. The components of their structured skills graph:

39k nodes, each of which is a skill
Each skill has multiple aliases
Skills are connected via edges which specify a hierarchical relationship. (e.g. Marketing is a parent of Demand Generation which is a parent of Email Marketing)

Skills Usage. Skills are automatically extracted from job postings and member histories. Job postings are allowed to tag up to 10 skills. Users can see something like 4/10 skills match your profile to help them determine job fit.

The skills graph is constructed both using ML and manual review by taxonomists.

March 2023 - Building and maintaining the skills taxonomy that powers LinkedIn's Skills Graph

This article is more about how the Skills Graph is constructed.

Each skill node includes the following foundational details (e.g. Machine Learning):

Description of the skill: the study of computer algorithms...
Aliases: ML, ...
Skills type: Hard or Soft
Skill ID

Each skill is represented by a node in the graph and nodes are linked via edges called knowledge lineages. Skills can relate for various reasons. Edges are directed and represent a parent-child relationship (e.g. Software Development -> Back-end Software Development -> Python)

Each node can have multiple parents and/or children.

Multiple parents example: Both Back-end Software Development and Mobile Software Development are parents of Java
Multiple children example: Supply Chain Management has children Supply Chain Engineering, Logistics Management and Digital Supply Chain

The hierarchical relationship allows us to enrich skills understanding: if a person knows a particular skill, we may infer that they know something about all of its parent nodes and perhaps some of its "sibling" nodes.
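A minimal sketch of this kind of inference on a toy graph, using the example skills from these notes. The `PARENTS` mapping and both helper functions are hypothetical names for illustration (and the `Mobile Software Development -> Software Development` edge is my assumption); this is not LinkedIn's implementation.

```python
# Toy child -> parents mapping. The edge from "Mobile Software Development"
# to "Software Development" is an assumption made for this example.
PARENTS = {
    "Back-end Software Development": ["Software Development"],
    "Mobile Software Development": ["Software Development"],
    "Python": ["Back-end Software Development"],
    "Java": ["Back-end Software Development", "Mobile Software Development"],
}


def ancestors(skill, parents=PARENTS):
    """Return every skill reachable by following parent edges upwards."""
    seen = set()
    stack = list(parents.get(skill, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen


def siblings(skill, parents=PARENTS):
    """Return skills that share at least one parent with `skill`."""
    own = set(parents.get(skill, []))
    return {s for s, ps in parents.items() if s != skill and own & set(ps)}


print(sorted(ancestors("Java")))
print(sorted(siblings("Python")))
```

Because a node can have multiple parents, the upward walk keeps a `seen` set so that shared ancestors are only visited once.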

LinkedIn has certain quality guardrails on the structured skills, one of which is discouraging ambiguity. For example, an ambiguous skill like Networking may be mapped to skills like Computer Networking or Professional Networking. This type of ambiguous relationship is not allowed, and LinkedIn either removes such edges or disallows such a node. The meaning of a phrase is determined by analyzing how the skill is predominantly used on LinkedIn. In cases where a phrase can have divergent meanings, the skill is disambiguated by expanding the phrase. For example, Cadence is disambiguated to Cadence Software and Boundary to Boundary Line.

Architecture. The components to their ecosystem are as follows:

Human Curated KG. Presumably, this is a purely human-curated KG where all nodes and edges are taken as true, and constantly curated by human taxonomists.
AI-generated KG. This is generated using ML models which use the human curated KG as training and validation data. The model behind this is KGBERT, which will be briefly covered below.
Serving. Both the AI-generated KG and human-generated KG are made available to all LinkedIn services via a REST API for online serving and also on HDFS to power offline inference.

KGBERT is a model for automatic edge prediction, which can generate the AI-generated KG from the human curated one. The basic idea is that the human curated KG is used to generate training and validation data in the following form:

[CLS] Tensorflow [SEP] Machine Learning [SEP] -> label: child - parent

Two skills are concatenated to form the context. The skill is represented at random by its title, description or title + description. The context is fed into a BERT model and a linear + softmax layer is attached to the [CLS] token to generate a softmax probability over 3 options:

Parent -> Child
Child -> Parent
No relation

The positive labels are taken from edges in the human curated graph. The negative labels are generated by some heuristics:

Skill pairs from different industries
Niece / Nephew pairs e.g. Tensorflow vs Cognitive Computing
Sibling pairs with the same parent e.g. Tensorflow vs Pytorch
Loosely related pairs that are 3 or more steps apart
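A rough sketch of how such training examples might be assembled from a curated graph. Everything here is an assumption for illustration: the toy `EDGES`, the made-up descriptions for Tensorflow and Pytorch, the label strings, and the choice to generate only sibling negatives. A real BERT tokenizer would normally add the `[CLS]` and `[SEP]` tokens itself; the strings below just mirror the format shown above.

```python
import random

# Toy human-curated edges as (parent, child) pairs.
EDGES = [("Machine Learning", "Tensorflow"), ("Machine Learning", "Pytorch")]
# Descriptions are illustrative placeholders, not real taxonomy entries.
DESCRIPTIONS = {
    "Machine Learning": "the study of computer algorithms...",
    "Tensorflow": "an open-source ML library",
    "Pytorch": "an open-source deep learning library",
}


def render(skill, rng):
    """Represent a skill by its title, description, or both, chosen at random."""
    desc = DESCRIPTIONS[skill]
    return rng.choice([skill, desc, f"{skill} {desc}"])


def pair_text(left, right, rng):
    return f"[CLS] {render(left, rng)} [SEP] {render(right, rng)} [SEP]"


def make_examples(edges, rng):
    """Build (text, label) pairs: both edge directions plus sibling negatives."""
    examples, children = [], {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        examples.append((pair_text(parent, child, rng), "parent - child"))
        examples.append((pair_text(child, parent, rng), "child - parent"))
    for kids in children.values():
        for a in kids:
            for b in kids:
                if a != b:  # siblings share a parent but have no edge between them
                    examples.append((pair_text(a, b, rng), "no relation"))
    return examples


for text, label in make_examples(EDGES, random.Random(0)):
    print(label, "|", text)
```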

The nice thing about this architecture is the clean separation of the human curated graph from the AI-generated one. The human-in-the-loop setup allows humans to review the AI-generated graph and add new edges to the human curated graph, which in turn improves the AI-generated graph. This seems superior over mixing both AI-generated edges and human-curated edges in the same graph.

December 2023 - Extracting skills from content to fuel the LinkedIn Skills Graph

This post takes a deeper look into how skills are extracted from data by LinkedIn from various contexts, such as job listings or member profiles.

Note that skills can be mentioned either directly or indirectly (e.g. you are expected to know how to apply different techniques to extract information from data and communicate insights through meaningful visualizations). Hence, a simple span extraction approach will not be exhaustive in extracting skills. On the other hand, not constraining the problem to span extraction could lead to false positive errors, so our model has to be very accurate.


An overview of the skills extraction architecture

At a high level, the steps are:

The Skill Segmenter organizes the raw text into structured input, e.g. a resume is split into the qualification, career history etc. sections
The next step aims to generate a candidate list of skills from the context:
- 2a. The Text Skill Tagger is a trie-based model that simply finds spans in the text that exactly match a node in the Skills Taxonomy. This puts the burden on the Skills Taxonomy to have all the aliases that cover every possible utterance of the skill. On the upside, this model is very fast.
- 2b. The Semantic Tagger aims to overcome the coverage issues above and uses BERT semantic match to surface more candidates
- 2c. The skills from 2a. and 2b. are expanded into more skills using the Skills Graph. It seems like they add immediate parents, children and siblings of each skill.
Each skill candidate is scored against the context to generate a confidence score. This section has a Shared Model and Domain-specific Model.
- The Shared Model contains a context-encoder, which encodes the text surrounding the skill into an embedding, and an entity-encoder, which encodes surrounding skills, the job title, etc. All these embeddings are fed as context into the next stage. The Shared Model assumes that the relationship between a skill and its surrounding context is constant across the different domains, so it benefits from being a shared module. Their AB tests confirm this hypothesis and show that such multi-task training does increase lift in online metrics.
- Each domain (job posting, member profile, course) has its own scoring model. The embeddings from the Shared Model and each skill candidate are fed into the Domain-specific Model, which is presumably a form of cross encoder that generates a confidence score for each skill candidate. Finally, some threshold is applied and a final list of skills is generated.
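To make step 2a concrete, here is a small sketch of a token-level trie tagger with longest-match lookup. The post does not describe the actual tagger in detail, so the alias table, the tokenization, and the function names are all assumptions of mine.

```python
_END = object()  # sentinel key marking the end of a complete alias


def build_trie(aliases):
    """Build a token-level trie from {alias phrase: canonical skill}."""
    root = {}
    for phrase, skill in aliases.items():
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node[_END] = skill
    return root


def tag_skills(text, trie):
    """Return canonical skills whose aliases appear verbatim in `text`, preferring the longest match."""
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    found, i = [], 0
    while i < len(tokens):
        node, match, match_end = trie, None, i
        for j in range(i, len(tokens)):
            node = node.get(tokens[j])
            if node is None:
                break
            if _END in node:
                match, match_end = node[_END], j + 1
        if match is None:
            i += 1
        else:
            found.append(match)
            i = match_end  # skip past the matched span
    return found


ALIASES = {
    "machine learning": "Machine Learning",
    "ml": "Machine Learning",
    "email marketing": "Email Marketing",
    "marketing": "Marketing",
}
trie = build_trie(ALIASES)
print(tag_skills("Experience in ML and email marketing required.", trie))
print(tag_skills("communicate insights through meaningful visualizations", trie))
```

The second call finds nothing, which illustrates the coverage gap that the semantic tagger in step 2b is meant to address.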

LinkedIn needs to generate skill extractions within 100ms for a high volume of edits. Hence, they used knowledge distillation to reduce the BERT model's parameter count by 80%.

As we can see, the LinkedIn architecture is fairly complex and context specific. A natural question for smaller players is how we can tap on LLMs today to simplify parts of this architecture to achieve comparable performance.

Hash Collisions

Bloom embeddings are one way of handling a large number of user / item IDs. Instead of assigning each unique ID to a unique embedding, we may perform hashing to assign each unique ID to one or multiple bins. The embeddings at these positions are then taken and typically summed to get the final representation of each user / item.

One natural question arises: given n unique values and m bins, what is the expected number of bins that have 2 or more values assigned to them? Such collisions are especially a problem if we represent each unique value with only one embedding (i.e. num_hashes = 1). In practice, we would usually use 2 or more hashes to mitigate this problem.

We may approach the problem by observing that each item is hashed uniformly and has a $\frac{1}{m}$ chance of landing in a particular bin. The hashing is also independent, i.e. it is not affected by the hashing for any other item.

This means that the number of items assigned to a particular bin may be modeled as $X \sim Binomial (n, \frac{1}{m})$ , since we have n trials and a fixed probability of "success" of 1/m.

The probability we desire is $P (X \geq 2) = 1 - P (X = 0) - P (X = 1)$
The PMF is $P (X = k) = \binom{n}{k} (\frac{1}{m})^{k} (1 - \frac{1}{m})^{n - k}$ , so:
- $P (X = 0) = (1 - \frac{1}{m})^{n}$
- $P (X = 1) = \frac{n}{m} \cdot (1 - \frac{1}{m})^{n - 1}$

So we may put the above together to obtain $P (X \geq 2)$ . It then remains to get the expected number of colliding bins by $m \cdot P (X \geq 2)$ , by using the linearity of expectation.

For example, given $n = 50{,}000$ and $m = 1{,}000{,}000$ , the expected number of colliding bins is about 1,209 (out of 1 million bins), which is not insignificant.

Now note that since $n$ is large and $p = 1/ m$ is small, $X$ may be well approximated by $Y \sim Poisson (n / m)$ . Recall that the Poisson PMF is $P (Y = k) = λ^{k} \cdot e^{- λ} / k!$ . So:

$P (Y = 0) = e^{- n / m}$
$P (Y = 1) = \frac{n}{m} \cdot e^{- n / m}$

Using this approximation, we get the expected number of colliding bins as 1,209, which is a very good approximation. Hence the formula we want is: $\text{Expected bins with collision} = m \times (1 - e^{- n / m} - \frac{n}{m} \cdot e^{- n / m})$
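Both formulas are easy to evaluate directly. The sketch below (function names are my own) computes the expected number of colliding bins under the exact binomial model and under the Poisson approximation, so the two can be compared for any n and m.

```python
import math


def expected_colliding_bins_exact(n, m):
    """m * P(X >= 2) with X ~ Binomial(n, 1/m)."""
    p0 = (1 - 1 / m) ** n
    p1 = (n / m) * (1 - 1 / m) ** (n - 1)
    return m * (1 - p0 - p1)


def expected_colliding_bins_poisson(n, m):
    """m * P(Y >= 2) with Y ~ Poisson(n / m)."""
    lam = n / m
    return m * (1 - math.exp(-lam) - lam * math.exp(-lam))


n, m = 50_000, 1_000_000
print(round(expected_colliding_bins_exact(n, m), 1))
print(round(expected_colliding_bins_poisson(n, m), 1))
```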

Identities

Sigmoid

Sigmoid Relationship to Bradley-Terry Model

Suppose we have scores $s_{i}, s_{j}$ for objects $i$ and $j$ . Under the Bradley-Terry model, we can express the preference probability as follows:

$P (i ▹ j) = \frac{e ^{s_{i}}}{e ^{s_{i}} + e ^{s_{j}}} = \frac{1}{( e ^{s_{i}} + e ^{s_{j}} ) / e ^{s_{i}}} = \frac{1}{1 + e ^{s_{j}} / e ^{s_{i}}} = \frac{1}{1 + e ^{s_{j} - s_{i}}} = \frac{1}{1 + e ^{- (s_{i} - s_{j})}} = σ (s_{i} - s_{j})$

Where $σ (x) = 1/ (1 + e^{- x})$ is the sigmoid function.
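A quick numerical check of this identity, with arbitrary example scores:

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bradley_terry(s_i, s_j):
    """P(i preferred to j) under the Bradley-Terry model."""
    return math.exp(s_i) / (math.exp(s_i) + math.exp(s_j))


s_i, s_j = 1.2, -0.3  # arbitrary scores
print(bradley_terry(s_i, s_j), sigmoid(s_i - s_j))
```

Only the score difference matters, so adding the same constant to both scores leaves the preference probability unchanged.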

Statistics

Adding / Subtracting RVs

Let $X, Y$ be two independent random variables. Then $Var (X \pm Y) = Var (X) + Var (Y)$ .

First, show that $Var (X) = E (x^{2}) - μ_{X}^{2}$ where $x \sim X$ . Proof: $Var (X) = E [(x - μ_{X})^{2}] = E [x^{2}] - 2 μ_{X} E [x] + μ_{X}^{2} = E [x^{2}] - μ_{X}^{2}$
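The claim that variances add for both sums and differences can be checked exactly on a small example. The sketch below assumes two independent fair dice (my choice of example) and enumerates all 36 equally likely outcomes using exact fractions.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)


def variance(values):
    """Population variance E[V^2] - (E[V])^2 over equally likely outcomes."""
    values = list(values)
    n = len(values)
    mean = Fraction(sum(values), n)
    mean_sq = Fraction(sum(v * v for v in values), n)
    return mean_sq - mean ** 2


var_x = variance(FACES)
var_sum = variance(x + y for x, y in product(FACES, FACES))
var_diff = variance(x - y for x, y in product(FACES, FACES))
print(var_x, var_sum, var_diff)
```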

Reference: https://apcentral.collegeboard.org/courses/ap-statistics/classroom-resources/why-variances-add-and-why-it-matters

Summaries of individual papers that I have taken a deeper look into.

Weinberger 2009 - Hashing for Multitask Learning

Weinberger 2009 - Feature Hashing for Large Scale Multitask Learning

This paper proposes a method to represent a large feature set in a smaller space by using hashing. It shows analytically that with a sufficiently large hash dimension $m$ :

The inner product between instances is preserved, i.e. doing a dot product between instances in the hashed dimension approximates the true dot product in the original dimension
The same applies to learning a weight vector to generate predictions in the hashed space: the error of approximation goes to zero as $m$ increases

Setup

Consider having data points $x^{(1)}, ..., x^{(n)} \in R^{d}$ , where $d$ can be very large (e.g. millions). This setting is easily realized when we use, for example, word bi-grams and tri-grams as term-frequency features to perform some kind of text classification task. Such a large feature vector is unwieldy, and also inefficient since the feature vector is very sparse for a given text.

The hashing trick maps the high dimensional input vector to a smaller dimension feature space with the notation $ϕ : X \to R^{m}$ , such that $m \ll d$ .

We start with the following definitions:

Let $h$ be a hash function $h : N \to \{1, ..., m\}$
Let $E$ be a hash function $E : N \to \{\pm 1\}$

Note that while the definitions map from an input integer, we may apply them to texts as well, since any finite-length text may be assigned to a unique integer. This is typically done in practice by applying some hash algorithm to a given string, and then using the modulo function to restrict it to the desired range.

With this, and given two vectors $x, x^{'} \in R^{d}$ , we define the hash feature map:

$ϕ_{j}^{(h, E)} (x) = \sum_{i \in Z : h (i) = j, 1 \leq i \leq d} E (i) x_{i}$

Where $j \in \{1, ..., m\}$ is an index in the hashed dimension space, and $i \in \{1, ..., d\}$ is an index in the input dimension space. We get a hash collision if more than one $i$ term is hashed into a given position $j$ . For brevity, we may just write $ϕ := ϕ^{(h, E)}$ .
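As a concrete illustration, here is a minimal sketch of the signed hashing trick applied to token counts. Using md5 with two different key prefixes to play the roles of the bucket hash and the sign hash is my own assumption for the example, and positions are 0-indexed rather than 1-indexed.

```python
import hashlib


def _hash_int(key):
    """Deterministic integer hash of a string (md5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def hashed_features(token_counts, m):
    """Map {token: count} to a length-m vector.

    Position j accumulates sign(token) * count over all tokens hashed to j.
    """
    phi = [0.0] * m
    for token, count in token_counts.items():
        j = _hash_int("h:" + token) % m  # bucket hash
        sign = 1.0 if _hash_int("s:" + token) % 2 == 0 else -1.0  # sign hash
        phi[j] += sign * count
    return phi


phi = hashed_features({"free money": 2, "meeting": 1}, m=16)
print(phi)
```

Note that the map is linear in the counts, so the full d-dimensional vector never needs to be materialized.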

Analysis

With this setup, the paper aims to prove analytically that hashing in this manner preserves the characteristics of the original space. In other words, we can significantly reduce the dimension of our features but achieve the same predictive effect as the original space by doing the hashing trick. This also means that the detrimental effect of hash collisions is minimal with a sufficiently large $m$ .

We won't trace through all the results, just the important and simple ones.

Lemma 2 The hash kernel is unbiased, i.e. $E_{ϕ} [⟨ x, x^{'} ⟩_{ϕ}] = ⟨ x, x^{'} ⟩$ .

Proof. The proof simply starts by expanding the inner product in the hashed space as follows: $⟨ x, x^{'} ⟩_{ϕ} = \sum_{i = 1}^{d} \sum_{j = 1}^{d} E (i) E (j) \cdot x_{i} x_{j}^{'} \cdot δ_{h (i), h (j)}$

Where $δ_{h (i), h (j)}$ is an indicator variable which takes $1$ if $h (i) = h (j)$ (i.e. they are hashed to the same position) and $0$ otherwise.

To see that this expansion is true, consider a position in the hashed space, e.g. $k$ . The contribution of position $k$ to the inner product is the product of the two hashed values at that position: $[⟨ x, x^{'} ⟩_{ϕ}]_{k} = ( \sum_{i \in Z : h (i) = k, 1 \leq i \leq d} E (i) x_{i} ) ( \sum_{j \in Z : h (j) = k, 1 \leq j \leq d} E (j) x_{j}^{'} )$ Summing over all positions $k$ , we just need to move the summations to the left and use the $δ$ variable to denote the common hash positions where $x_{i}$ and $x_{j}^{'}$ interact (if i and j are hashed to different positions, they clearly do not interact in an inner product).

Now note that we can decompose the expectation over $ϕ$ into its independent constituents, i.e. $h$ and $E$ respectively (since the two hashes are independent):

$E_{ϕ} [⟨ x, x^{'} ⟩_{ϕ}] = E_{h} [E_{E} ⟨ x, x^{'} ⟩_{ϕ}]$

Now we just need to observe that the hashed values $E (i), E (j)$ are independent from all other terms in general, but also independent from each other whenever $i \neq j$ (provided our hash function is pairwise independent). Thus when $i \neq j$ , the expectation of the summand is:

$E_{E} [E (i)] \cdot E_{E} [E (j)] \cdot E_{E} [x_{i} x_{j}^{'} \cdot δ_{h (i), h (j)}]$

These are clearly $0$ because $E [E (i)] = 0$ . So the original summation reduces to:

$E_{ϕ} [⟨ x, x^{'} ⟩_{ϕ}] = E_{ϕ} [\sum_{i = 1}^{d} E (i)^{2} \cdot x_{i} x_{i}^{'}] = ⟨ x, x^{'} ⟩$
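For tiny dimensions, Lemma 2 can be verified by brute force: average the hashed inner product over every possible bucket assignment and every possible sign assignment, each equally likely. The vectors and sizes below are arbitrary choices of mine, and indices are 0-based.

```python
from fractions import Fraction
from itertools import product


def hashed_inner_product(x, x_prime, h, sign, m):
    """Inner product of the hashed vectors for one fixed choice of (h, sign)."""
    phi_x, phi_xp = [0] * m, [0] * m
    for i in range(len(x)):
        phi_x[h[i]] += sign[i] * x[i]
        phi_xp[h[i]] += sign[i] * x_prime[i]
    return sum(a * b for a, b in zip(phi_x, phi_xp))


def expected_hashed_inner_product(x, x_prime, m):
    """Exact average over all m^d bucket assignments and 2^d sign assignments."""
    d = len(x)
    total, count = 0, 0
    for h in product(range(m), repeat=d):
        for sign in product((-1, 1), repeat=d):
            total += hashed_inner_product(x, x_prime, h, sign, m)
            count += 1
    return Fraction(total, count)


x, x_prime = [1, 2, 3], [4, -5, 6]
true_value = sum(a * b for a, b in zip(x, x_prime))
print(expected_hashed_inner_product(x, x_prime, m=2), true_value)
```

Even with only 2 buckets for 3 features (so collisions are guaranteed), the average matches the true inner product; the sign hash is what makes the collision cross-terms cancel in expectation.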

Not only is the hashed inner product unbiased, it also has a variance that scales down in $O (\frac{1}{m})$ . The proof does a similar but more tedious expansion as the above, and assumes that $x, x^{'}$ have l2-norm of $1$ . Since the standard deviation is the square root of the variance, this suggests that the hashed inner product will be concentrated within $O (\frac{1}{\sqrt{m}})$ of the true value.

These results are sufficient to justify use of the hashed inner product space in practice. That is, we can perform recommendations in the hashed space with sufficiently large $m$ (we can tune that using validation error) to make the large feature space tractable. The paper goes on to prove more detailed bounds on the error and norm which are of less practical significance.

Multi-task Learning

The authors argue that this method is especially useful in the multi-task learning setting. Consider an email spam classification task where the vocab space is $V$ and the user space is $U$ . The parameter space is thus $V \times U$ , i.e. we wish to learn a user-specific weight vector $w_{u} \in R^{∣ V ∣}$ for each user $u$ , which allows us to personalize the spam filter for each user (different users have slightly differing definitions of what is spam).

The authors suggest the following approach:

Use the hashing trick to hash each term $v$ into the hashed space. e.g. data is passed into a global hash function $ϕ_{0}$ and assigned to a position
Each user gets his/her own hash function $ϕ_{u}$ . This may be implemented by using the same hash function but appending the user_id like so: user1_data, which hashes the same term into a new position.
We may thus represent each instance by $ϕ_{0} (x) + ϕ_{u} (x) \in R^{m}$ , capturing both a global element (some terms are universally spam-indicative) and a personalized element (some terms are specifically indicative for a user)
Finally, we learn a weight parameter $w_{h} \in R^{m}$ by training it in the hashed space
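The steps above can be sketched as follows. The md5-based hashing, the `user_id + "_" + token` key format, and the function names are assumptions for illustration rather than the paper's exact implementation.

```python
import hashlib


def _hash_int(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def personalized_features(tokens, user_id, m):
    """Global plus user-specific hashed features in one length-m vector.

    Each token is hashed once as-is (global copy) and once with a user prefix
    (personalized copy), so the two copies usually land in different positions.
    """
    phi = [0.0] * m
    for token in tokens:
        for key in (token, f"{user_id}_{token}"):
            j = _hash_int("h:" + key) % m
            sign = 1.0 if _hash_int("s:" + key) % 2 == 0 else -1.0
            phi[j] += sign
    return phi


def score(phi, w):
    """Linear score in the hashed space using a single shared weight vector."""
    return sum(p * wi for p, wi in zip(phi, w))


m = 32
phi = personalized_features(["free", "money"], "user1", m)
w = [0.0] * m  # weights would be learned in the hashed space
print(score(phi, w))
```

A single weight vector of size m serves all users, because the user prefix already routes the personalized copies of each term to their own positions.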

Empirically, for their task of $∣ V ∣ = 40$ million, $∣ U ∣ = 400{,}000$ , they found that performance starts to saturate with $m \approx 4$ million. This is a very small fraction of the total space $∣ V ∣ \times ∣ U ∣$ , showing the effectiveness of their method. Nevertheless, we should note that 4 million is still a rather large space.

Rendle 2009 - Bayesian Personalized Ranking

BPR: Bayesian Personalized Ranking from Implicit Feedback

This is one of the most cited papers for collaborative filtering. It proposes a pairwise learning algorithm for item recommendation from implicit feedback that remains one of the most performant to date. It is pitched as a rival method to Weighted Matrix Factorization, the other strong model of the day. BPR argues that it is superior to WMF because it explicitly optimizes for ranking.

Setup

Let $U$ , $I$ denote the set of all users and items respectively. We have access to implicit feedback data $S \subset U \times I$ . The task is to provide each user with a personalized ranking $>_{u} \subset I^{2}$ . Note that $I^{2}$ denotes the cartesian product of $I$ with itself, so it represents the set of all ordered pairs of elements of $I$ . The ordering $>_{u}$ is a subset of these pairs where a preference relationship is indicated by the model. For convenience we also denote:

$I_{u}^{+} := \{i \in I : (u, i) \in S\}$
$U_{i}^{+} := \{u \in U : (u, i) \in S\}$

In implicit feedback systems, only positive interactions between user and item are observed. The typical approach is to put the interactions in a matrix and fill in unobserved entries with $0$ . A model is then fitted to this matrix. The paper makes an interesting observation that this problem is ill-defined, since a perfectly expressive model that fits the training data perfectly will fail to make any prediction at all, since all unobserved entries will be given a score of $0$ . Hence regularization methods are employed to avoid such overfitting.

The idea of BPR is to avoid making a judgment on the pairwise preference between two items with the same observed status. That is, if items $i$ and $j$ have both been interacted with by user $u$ , we cannot judge if one is preferred over the other. Also, if $i$ and $j$ are both unobserved interactions, we cannot make such a judgement either. Thus the training data $D_{S} \subseteq U \times I \times I$ is defined as follows.

$D_{S} := \{(u, i, j) \mid i \in I_{u}^{+} \land j \in I \setminus I_{u}^{+}\}$

Note that this definition means that $(u, i, j) \in D_{S} ⟹ (u, j, i) \notin D_{S}$ , since $j$ cannot be positive for $u$ if it was included as a negative.
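A small sketch that builds the training triplets from a toy set of interactions and checks the antisymmetry property noted above. The toy data and the function name are mine.

```python
def build_triplets(interactions, items):
    """Return all (u, i, j) with i a positive item for u and j any non-positive item."""
    positives = {}
    for user, item in interactions:
        positives.setdefault(user, set()).add(item)
    return {
        (user, i, j)
        for user, pos in positives.items()
        for i in pos
        for j in items - pos
    }


S = {("u1", "a"), ("u1", "b"), ("u2", "c")}  # observed (user, item) pairs
I = {"a", "b", "c"}
D_S = build_triplets(S, I)
print(sorted(D_S))

# If (u, i, j) is a training triplet, then (u, j, i) never is.
print(all((u, j, i) not in D_S for (u, i, j) in D_S))
```

Enumerating all triplets is only feasible for toy data; in practice triplets are sampled, since the full set grows with the number of positives times the number of items per user.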

BPR Loss

The bayesian formulation for finding a ranking is to find the model parameters $Θ$ that maximize the following probability: $p (Θ∣ >_{u}) \propto p (>_{u} ∣Θ) \cdot p (Θ)$

We assume that:

All users act independently of each other
The ordering of each pair of items $(i, j)$ for a given user is independent of the ordering of every other pair

Then, we can write the likelihood across all users as: $\prod_{u \in U} p (>_{u} ∣Θ) = \prod_{(u, i, j) \in U \times I \times I} p (i >_{u} j ∣Θ)^{δ ((u, i, j) \in D_{S})} \cdot [1 - p (i >_{u} j ∣Θ)]^{δ ((u, j, i) \in D_{S})}$ Where $δ$ is the indicator function for the preference relationship. In other words, the likelihood of the overall ordering is the product of the likelihood of each $u, i, j$ triplet. For each $u, i, j$ triplet, the likelihood is given by the model's prediction for given label $0$ or $1$ .

Note that the factor for a triplet differs from $1$ only if $i >_{u} j$ or $j >_{u} i$ . Also, since we observed above that $(u, i, j) \in D_{S} ⟹ (u, j, i) \notin D_{S}$ and vice versa, only one of the two terms will come into play for each entry. The above formula can be simplified to: $\prod_{u \in U} p (>_{u} ∣Θ) = \prod_{(u, i, j) \in D_{S}} p (i >_{u} j ∣Θ)$

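Note: Not too sure about the above step, would have thought that the $1 - p$ term should also come into play when $j >_{u} i$ . One possible reading: if the ordering is total and antisymmetric, then $1 - p (j >_{u} i ∣Θ) = p (i >_{u} j ∣Θ)$ , so the $1 - p$ factor picked up at the reversed triplet $(u, j, i)$ is just a second copy of the factor already counted at $(u, i, j) \in D_{S}$ . Each pair is then counted twice, which squares the likelihood but does not change its maximizer.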

Model

Now we can model the preference probability using a model as: $p (i >_{u} j ∣Θ) := σ (\hat{y}_{u ij} (Θ))$ Where $\hat{y}_{u ij}$ denotes a real-valued score for the triplet $(u, i, j)$ , produced by any underlying model (BPR itself is agnostic to the choice). Note that $σ (x) := 1/ (1 + \exp (- x))$ is the sigmoid function.

To complete the bayesian modelling, a prior distribution $p (Θ)$ is introduced over the model parameters. For simplicity, we let $p (Θ)$ take the form of a multivariate normal distribution with zero-mean and all covariances zero, with equal variance for each parameter (i.e. $λ_{Θ} \in R$ ): $p (Θ) \sim N (0, λ_{Θ} I)$

The maximum a posteriori estimator for $Θ$ is thus given by the following: $\arg\max_{Θ} \ln [p (>_{u} ∣Θ) \cdot p (Θ)] = \arg\max_{Θ} \ln [\prod_{(u, i, j) \in D_{S}} σ (\hat{y}_{u ij}) \cdot p (Θ)] = \arg\max_{Θ} \sum_{(u, i, j) \in D_{S}} \ln σ (\hat{y}_{u ij}) + \ln p (Θ) = \arg\max_{Θ} \sum_{(u, i, j) \in D_{S}} \ln σ (\hat{y}_{u ij}) - λ_{Θ} \| Θ \|^{2}$

Note that in the last step, the log prior is translated into L2 regularization: for a zero-mean normal prior with equal variance on every parameter, $\ln p (Θ)$ is a negative multiple of $\| Θ \|^{2}$ plus a constant, so the penalty enters with a minus sign and the scale is absorbed into $λ_{Θ}$ . This is apparently well studied and I will explore the derivation at a later time. A possible resource is Yuki Shizuya - Understanding L1 and L2 regularization with analytical and probabilistic views.

The paper suggests using SGD updates (i.e. batch size of 1) by randomly sampling triplets with replacement from $D_{S}$ . This converges much faster than doing user-wise SGD.

Finally, we've assumed $\hat{y}$ is model-agnostic up to now. For matrix factorization, the authors suggest the following form: $\hat{y}_{u ij} = \hat{y}_{u i} - \hat{y}_{u j}$ In other words, we compute the prediction for each user, item pair (usually via dot product of embeddings, but can be anything) and take the difference.
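A minimal sketch of the regularized objective and one stochastic gradient ascent step for the matrix factorization case. The dictionaries `W` (user factors) and `H` (item factors), the learning rate, and the regularization strength are illustrative assumptions; regularization is applied only to the parameters touched by the sampled triplet.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def bpr_objective(triplets, W, H, lam):
    """Sum of ln sigmoid(y_ui - y_uj) minus lam * ||Theta||^2 (to be maximized)."""
    fit = sum(math.log(sigmoid(dot(W[u], H[i]) - dot(W[u], H[j]))) for u, i, j in triplets)
    reg = sum(v * v for M in (W, H) for row in M.values() for v in row)
    return fit - lam * reg


def sgd_step(u, i, j, W, H, lr, lam):
    """One gradient ascent step on a single (u, i, j) triplet, updating in place."""
    w_u, h_i, h_j = W[u][:], H[i][:], H[j][:]  # copies of the old values
    g = 1.0 - sigmoid(dot(w_u, h_i) - dot(w_u, h_j))  # d/dx ln sigmoid(x)
    for f in range(len(w_u)):
        W[u][f] += lr * (g * (h_i[f] - h_j[f]) - 2 * lam * w_u[f])
        H[i][f] += lr * (g * w_u[f] - 2 * lam * h_i[f])
        H[j][f] += lr * (-g * w_u[f] - 2 * lam * h_j[f])


W = {"u": [0.1, -0.2]}
H = {"i": [0.05, 0.1], "j": [-0.1, 0.2]}
triplets = [("u", "i", "j")]
before = bpr_objective(triplets, W, H, lam=0.0)
sgd_step("u", "i", "j", W, H, lr=0.1, lam=0.0)
after = bpr_objective(triplets, W, H, lam=0.0)
print(before, after)
```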

Cornac Implementation

Cornac has a Cython implementation of BPR that is fast (but not memory scalable when the number of users and items is large). num_threads can be increased for faster parallel processing (the Cython code releases the GIL).

At initialization:

User embedding U of shape n_users, k is drawn randomly from a standard uniform distribution where k is the embedding dimension
Then, we take $(U - 0.5) / k$ . I suppose this leads to small, centered values which leads to stable initial training. Normalizing by k ensures that the dot product between vectors does not explode.
Item embedding of shape n_items, k is initialized similarly
Optional bias vector of shape n_items is initialized to the zero vector

Note that train_set.X is a scipy.sparse.csr_matrix, which has 3 main vectors:

data: values of non-zero entries
indices: column indices of the non-zero values
indptr: index pointers to the start of each row in data and indices

The vector user_ids is constructed from the csr_matrix. It is of the same length as X.indices and represents the row indices of the non-zero values. Thus we have both row and column indices of the non-zero values.

The main training loop samples a (u, i, j) triplet randomly each turn. It does so by the following steps:

A random index i between 0 and len(user_ids) is generated
The user_id and item_i_id are obtained by indexing user_ids[i] and X.indices[i] respectively
Now a random index j between 0 and n_items is generated
The item_j_id is obtained by indexing neg_item_ids[j] where neg_item_ids = np.arange(n_items)
A check is performed on the sparse matrix to see whether item_j_id is a positive item for user_id. If it is, we skip this triplet.
Pointers to the start of the relevant u, i, j embeddings are then obtained
Note that 1 epoch comprises len(user_ids) number of triplets
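The sampling steps above can be sketched in plain Python on raw CSR arrays. This is my own illustrative re-implementation, not Cornac's Cython code; the function names and the toy matrix are assumptions.

```python
import random


def csr_row_ids(indptr):
    """Expand CSR `indptr` into one row (user) id per non-zero entry."""
    return [u for u in range(len(indptr) - 1) for _ in range(indptr[u], indptr[u + 1])]


def sample_triplet(user_ids, indptr, indices, n_items, rng):
    """Return (user, pos_item, neg_item), or None if the sampled j is a positive."""
    idx = rng.randrange(len(user_ids))  # random observed interaction
    user, item_i = user_ids[idx], indices[idx]
    item_j = rng.randrange(n_items)  # random candidate negative
    user_positives = indices[indptr[user]:indptr[user + 1]]
    if item_j in user_positives:
        return None  # skip this triplet
    return user, item_i, item_j


# Toy matrix: user 0 interacted with items 0 and 2, user 1 with item 3.
indptr, indices, n_items = [0, 2, 3], [0, 2, 3], 4
user_ids = csr_row_ids(indptr)
rng = random.Random(0)
for _ in range(5):
    print(sample_triplet(user_ids, indptr, indices, n_items, rng))
```

Because interactions are sampled uniformly from the non-zero entries, users with more interactions are picked proportionally more often.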

The Cython code then computes and manually applies the SGD updates derived in the paper. Will omit since we do not need to manually compute the updates if using autograd.

Note: This implementation of BPR assumes binary relevance (an item is interacted or not). It does not allow for finer-grained preference relationships (e.g. a floating point rating score to rank items), which in theory BPR does support.

From RankNET $\to$ LambdaRank $\to$ LambdaMART

Paper Link.

This is an overview paper that explains the model behind LambdaMART, a technique used to learn Gradient Boosted Trees that optimize for NDCG on recommendation-type of datasets.

RankNET

RankNET is a generic framework for training a learning-to-rank model. Given a differentiable model $f : R^{d} \mapsto R$ that produces a score from a d-dimensional input vector, RankNET is able to train $f$ such that for a given query session with items $i = 1, ... n$ and corresponding features $x_{i} \in R^{d} : i = 1, ..., n$ , $f$ learns to predict the relationship $f (x_{i}) > f (x_{j})$ for any two items $i, j$ in the query session when $i$ is a better recommendation than $j$ . The differentiable model $f$ is typically a neural network or boosted trees.

RankNET uses revealed pair-wise preferences within each query session to train $f$ . Specifically, suppose we have data for one query session as follows:

query	item	clicked
qid1	a	0
qid1	b	1
qid1	c	0

We can transform this data into a pairwise dataset as follows, where $y^{jk}$ denotes the preference relationship between $item^{j}$ and $item^{k}$ which we inferred from the click data ( $1$ if $item^{j}$ is preferred, $-1$ if $item^{k}$ is preferred, and $0$ if there is no preference). Note that the pairwise comparisons are only made within the same query session (e.g. qid1), as it reflects a given user's preferences in the context of a query and the items shown to them in that session.

query	$item^{j}$	$item^{k}$	$y^{jk}$
qid1	a	b	-1
qid1	a	c	0
qid1	b	a	1
qid1	b	c	1
qid1	c	a	0
qid1	c	b	-1
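A short sketch of this transformation, assuming the label is simply the sign of the difference in clicks between the two items (the function name and tuple layout are my own):

```python
from itertools import permutations


def to_pairwise(rows):
    """rows: (query, item, clicked) tuples -> (query, item_j, item_k, y_jk) tuples.

    Ordered pairs are formed only among items of the same query session.
    """
    by_query = {}
    for query, item, clicked in rows:
        by_query.setdefault(query, []).append((item, clicked))
    pairs = []
    for query, items in by_query.items():
        for (item_j, c_j), (item_k, c_k) in permutations(items, 2):
            y = (c_j > c_k) - (c_j < c_k)  # 1, 0 or -1
            pairs.append((query, item_j, item_k, y))
    return pairs


clicks = [("qid1", "a", 0), ("qid1", "b", 1), ("qid1", "c", 0)]
for row in to_pairwise(clicks):
    print(row)
```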

The pairwise setting is now more amenable to modelling (compared to directly optimizing for a good ranking), since we can now treat the task as a classification problem. For each row of the pairwise dataset, we only need to model the probability that $item^{j}$ is preferred (or not) to $item^{k}$ . This can be formalized using a cross entropy loss comparing the predicted preference of our model to the revealed preference in the dataset.

First, we model the predicted probability from the model. Given row $i$ of the pairwise dataset with items $item^{j}$ and $item^{k}$ , we model the predicted probability that $item^{j}$ is preferred to $item^{k}$ (using $▹$ to denote a preference relationship) by passing the difference between the predicted scores $\hat{y}_{i}^{j} := f (x_{i}^{j})$ and $\hat{y}_{i}^{k} := f (x_{i}^{k})$ through a sigmoid function, where $a > 0$ is a constant that sets the steepness of the sigmoid:

$\hat{P}_{i}^{jk} := \hat{P} (item_{i}^{j} ▹ item_{i}^{k}) = \frac{1}{1 + \exp (- a (\hat{y}_{i}^{j} - \hat{y}_{i}^{k}))}$

Now let us denote the revealed probability that $item^{j}$ is preferred to $item^{k}$ as $P_{i}^{jk}$ such that:

$P_{i}^{jk} = 1$ if we prefer item j to item k
$P_{i}^{jk} = 0.5$ if we have no preference between the two items
$P_{i}^{jk} = 0$ if we prefer item k to item j

The cross entropy loss of our model can then be expressed as:

$L = \sum_{i} [- P_{i}^{jk} \log \hat{P}_{i}^{jk} - (1 - P_{i}^{jk}) \log (1 - \hat{P}_{i}^{jk})]$

For convenience, let us denote $y_{i}^{jk} := 2 P_{i}^{jk} - 1$ (and conversely, $P_{i}^{jk} = \frac{y _{i}^{jk} + 1}{2}$ ), which translates into the following:

$y_{i}^{jk} = 1$ if we prefer item j to item k
$y_{i}^{jk} = 0$ if we have no preference between the two items
$y_{i}^{jk} = - 1$ if we prefer item k to item j

Let us also define the convenience variable $z := - \log \hat{P}_{i}^{jk} = \log [1 + \exp (- a (\hat{y}_{i}^{j} - \hat{y}_{i}^{k}))]$ . The cross entropy loss then simplifies to:

$L = \sum_{i} [- P_{i}^{jk} \log \hat{P}_{i}^{jk} - (1 - P_{i}^{jk}) \log (1 - \hat{P}_{i}^{jk})]$
$= \sum_{i} [\frac{1}{2} (1 + y_{i}^{jk}) z - \frac{1}{2} (1 - y_{i}^{jk}) [(- a) (\hat{y}_{i}^{j} - \hat{y}_{i}^{k}) - z]]$
$= \sum_{i} [z + \frac{a}{2} (1 - y_{i}^{jk}) (\hat{y}_{i}^{j} - \hat{y}_{i}^{k})]$

Note that in line 2 of the above, we use the useful identity $\log (1 - sigmoid (x)) = - x + \log (sigmoid (x))$ with $x = a (\hat{y}_{i}^{j} - \hat{y}_{i}^{k})$ , together with $\log \hat{P}_{i}^{jk} = - z$ . In line 3, the two $z$ terms of line 2 have coefficients $\frac{1}{2} (1 + y_{i}^{jk})$ and $\frac{1}{2} (1 - y_{i}^{jk})$ , which sum to $1$ and simply return $z$ .
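The simplification can be sanity-checked numerically for a single row of the pairwise dataset. In this sketch `s_j` and `s_k` stand for the two predicted scores, and the function names are mine.

```python
import math


def ranknet_loss_cross_entropy(s_j, s_k, p, a=1.0):
    """Cross entropy between revealed probability p and the predicted sigmoid probability."""
    p_hat = 1.0 / (1.0 + math.exp(-a * (s_j - s_k)))
    return -p * math.log(p_hat) - (1.0 - p) * math.log(1.0 - p_hat)


def ranknet_loss_simplified(s_j, s_k, y, a=1.0):
    """Simplified form z + (a / 2) * (1 - y) * (s_j - s_k), with y = 2p - 1."""
    d = s_j - s_k
    z = math.log(1.0 + math.exp(-a * d))
    return z + 0.5 * a * (1.0 - y) * d


for p in (0.0, 0.5, 1.0):
    y = 2 * p - 1
    print(p, ranknet_loss_cross_entropy(0.3, -1.2, p), ranknet_loss_simplified(0.3, -1.2, y))
```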

Having written out the loss function, we now need to differentiate the loss with respect to the model scores and parameters to obtain the gradient descent formula used to train the RankNET model. Differentiating $L$ wrt $\hat{y}_{i}^{j}$ and $\hat{y}_{i}^{k}$ gives:

$\frac{\partial L}{\partial \hat{y}_{i}^{j}} = \frac{a}{2} (1 - y_{i}^{jk}) + \frac{- a \cdot \exp (- a (\hat{y}_{i}^{j} - \hat{y}_{i}^{k}))}{1 + \exp (- a (\hat{y}_{i}^{j} - \hat{y}_{i}^{k}))}$
$= \frac{a}{2} (1 - y_{i}^{jk}) - \frac{a}{1 + \exp (a (\hat{y}_{i}^{j} - \hat{y}_{i}^{k}))}$
$= - \frac{\partial L}{\partial \hat{y}_{i}^{k}}$

Note that the first line of the above uses the result $\frac{d}{d x} \ln f (x) = \frac{f^{'} (x)}{f (x)}$ . We obtain line 2 by multiplying the right term by $\exp (a (\hat{y}_{i}^{j} - \hat{y}_{i}^{k}))$ in both the numerator and denominator. We obtain line 3 by observing that $L$ is a function of $d := \hat{y}_{i}^{j} - \hat{y}_{i}^{k}$ , such that $\frac{\partial L}{\partial \hat{y}_{i}^{j}} = \frac{\partial L}{\partial d} \cdot \frac{\partial d}{\partial \hat{y}_{i}^{j}} = \frac{\partial L}{\partial d}$ and likewise $\frac{\partial L}{\partial \hat{y}_{i}^{k}} = \frac{\partial L}{\partial d} \cdot \frac{\partial d}{\partial \hat{y}_{i}^{k}} = - \frac{\partial L}{\partial d}$ . The symmetry of the derivative wrt $j$ and $k$ will be important for the next section on factorizing RankNET to speed it up.
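The closed-form gradient can be checked against a central finite difference for a single pair. `s_j` and `s_k` stand for the two predicted scores; the helper names and test values are my own.

```python
import math


def pair_loss(s_j, s_k, y, a=1.0):
    """Per-pair RankNET loss in its simplified form."""
    d = s_j - s_k
    return math.log(1.0 + math.exp(-a * d)) + 0.5 * a * (1.0 - y) * d


def grad_wrt_s_j(s_j, s_k, y, a=1.0):
    """Closed-form derivative wrt the score of item j (negate for item k)."""
    return 0.5 * a * (1.0 - y) - a / (1.0 + math.exp(a * (s_j - s_k)))


def max_gradient_error(s_j, s_k, y, a=1.0, eps=1e-6):
    """Largest gap between the analytic and finite-difference gradients for both scores."""
    num_j = (pair_loss(s_j + eps, s_k, y, a) - pair_loss(s_j - eps, s_k, y, a)) / (2 * eps)
    num_k = (pair_loss(s_j, s_k + eps, y, a) - pair_loss(s_j, s_k - eps, y, a)) / (2 * eps)
    analytic = grad_wrt_s_j(s_j, s_k, y, a)
    return max(abs(num_j - analytic), abs(num_k + analytic))


for y in (-1, 0, 1):
    print(y, grad_wrt_s_j(0.7, -0.4, y, a=2.0), max_gradient_error(0.7, -0.4, y, a=2.0))
```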

Finally, we use the gradient to update individual parameters $w_{l} \in R$ of the model $f$ . In the below, $n$ denotes the number of data points in the pairwise dataset, so the update averages the gradient over rows. This update procedure rounds off the discussion on RankNET and is sufficient for training a generic differentiable model $f$ from ranking data. $w_{l} \leftarrow w_{l} - η \frac{\partial L}{\partial w_{l}} = w_{l} - \frac{η}{n} \sum_{i} [\frac{\partial L}{\partial \hat{y}_{i}^{j}} \cdot \frac{\partial \hat{y}_{i}^{j}}{\partial w_{l}} + \frac{\partial L}{\partial \hat{y}_{i}^{k}} \cdot \frac{\partial \hat{y}_{i}^{k}}{\partial w_{l}}]$

Factorizing RankNET

Schroff 2015 - FaceNET

Schroff 2015 - FaceNet: A Unified Embedding for Face Recognition and Clustering

This paper proposes the Triplet Loss for learning a face recognition system.

The task: given a large number of person identities, each with a number of associated images, learn a representation of an image in Euclidean space such that images belonging to the same person are close together and images of different persons are far apart. This paper was impactful because it improved the SOTA on face verification by a large margin.

At the time, representation learning typically involved training a CNN classification model to classify images to a known identity. A bottleneck layer of relatively low dimensionality in the middle of the network is chosen as the representation of an image. In contrast to this indirect method, this paper directly optimizes the representation using contrastive learning.

Setup

Let the embedding of image $x$ be represented by $f (x) \in R^{d}$ , and constrain $\| f (x) \|_{2} = 1$ . Now for a given person $i$ , we want the anchor image $x_{i}^{a}$ to be closer to other images of the same person $x_{i}^{p}$ (positive) than to images of any other person $x_{i}^{n}$ (negative), with some margin $α$ . That is, we desire a model $f$ such that:

$\| f (x_{i}^{a}) - f (x_{i}^{p}) \|_{2}^{2} + α < \| f (x_{i}^{a}) - f (x_{i}^{n}) \|_{2}^{2} \quad \forall (x_{i}^{a}, x_{i}^{p}, x_{i}^{n}) \in T$

Where $T$ denotes the set of all possible triplets in the dataset. The authors note that most triplets would easily satisfy the desired constraint, hence we need some way of picking good triplets to optimize the learning process.

The loss we desire to minimize is thus as below. Writing $d_{a p}$ and $d_{an}$ for the squared anchor-positive and anchor-negative distances, the intuition is that we want $d_{a p}$ to be small but $d_{an}$ to be large.

$L = \sum_{i = 1}^{N} [\| f (x_{i}^{a}) - f (x_{i}^{p}) \|_{2}^{2} - \| f (x_{i}^{a}) - f (x_{i}^{n}) \|_{2}^{2} + α]_{+}$

With this loss function, we see that we have 3 types of triplets:

Easy triplets where $α < d_{an} - d_{a p}$ . These will contribute 0 loss since $L = d_{a p} - d_{an} + α < 0$ will be a negative value, and the positive operator makes the loss 0. This makes sense because the model already categorizes these triplets well and there is nothing more to learn from these examples.
Semi-hard triplets where $0 < d_{an} - d_{a p} \leq α$ . These will contribute a small loss, since $0 \leq L = d_{a p} - d_{an} + α < α$ . These are triplets where the positive is already closer to the anchor than the negative (which is good), but the gap is still within the margin so we want to widen it. In other words, $d_{a p} < d_{an} \leq d_{a p} + α$ .
Hard triplets where $d_{an} - d_{a p} < 0$ . These will contribute a large loss, since $L = d_{a p} - d_{an} + α > α$ . These are triplets where the negative is closer to the anchor than the positive.
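
To make the three cases concrete, here is a minimal NumPy sketch of the triplet loss and of the easy / semi-hard / hard split. The function names `triplet_loss` and `triplet_type` are my own, and treating the boundary case $d_{an} = d_{a p}$ as hard is an assumption, since the notes above leave it unassigned.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha):
    """Per-triplet loss [d_ap - d_an + alpha]_+ using squared Euclidean distances."""
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)
    d_an = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(d_ap - d_an + alpha, 0.0)

def triplet_type(d_ap, d_an, alpha):
    """Classify a triplet by the gap d_an - d_ap relative to the margin alpha."""
    gap = d_an - d_ap
    if gap > alpha:
        return "easy"       # zero loss, nothing left to learn
    if gap > 0:
        return "semi-hard"  # correct ordering, but inside the margin
    return "hard"           # negative is at least as close as the positive

# Unit-norm toy embeddings: the positive coincides with the anchor,
# the negative is orthogonal to it, so this triplet is easy.
a = np.array([1.0, 0.0])
p = np.array([1.0, 0.0])
n = np.array([0.0, 1.0])
easy_loss = triplet_loss(a, p, n, alpha=0.2)
```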

Triplet Selection

As with most contrastive learning methods, the authors observe that the selection of triplets is important for learning well. In an ideal world, we would want to select the hardest negatives for each anchor, i.e.: $x_{i}^{n} = \arg\min_{x_{i}^{n}} \| f (x_{i}^{a}) - f (x_{i}^{n}) \|_{2}$

However, this is undesirable because:

We may end up mostly selecting mislabelled instances or corrupted images
It is computationally infeasible to search across the whole corpus
Trying to learn hard negatives at the start of training may cause learning to collapse

Hence, the authors propose:

Mini batch negatives. Instead of finding the hardest negatives across the entire corpus, they mine for the hardest negatives in the mini-batch. This improves computational efficiency and mitigates the mislabelled/corrupted data issue.
Curriculum Learning. At the start of training, the authors select easier triplets, and gradually increase the difficulty as training goes along. Presumably, this is done by thresholding negatives based on $d_{an} - d_{a p} > t$ and starting from high $t$ to low $t$ .
Semi-hard negatives. The authors mention that it is helpful to limit negatives to $d_{an} - d_{a p} > 0$ , but it is unclear whether they always do this or only at the start of training. The intuition is to cap the difficulty of the triplets, as triplets that are too difficult may actually hinder learning due to corrupted data or mislabelling.

Note: A later paper, Lee 2020 - Large Scale Video Representation Learning via Relational Graph Clustering, implies that FaceNET uses semi-hard negatives throughout the entire training process, which makes sense given that the paper warns against using hard negatives. In light of this, I presume FaceNET picks the hardest semi-hard negative in each mini batch as the negative for each triplet.
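
Below is a rough sketch of what in-batch semi-hard mining could look like under my reading above. It is not the authors' code: the function name `mine_semi_hard_negatives` is hypothetical, and picking the semi-hard negative closest to the anchor as "the hardest" is my assumption.

```python
import numpy as np

def mine_semi_hard_negatives(emb, labels, margin):
    """For each (anchor, positive) pair in the batch, pick the closest negative
    that is still further away than the positive but within the margin."""
    sq = np.sum(emb ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped at 0 for numerical safety
    dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T, 0.0)
    triplets = []
    for a in range(len(labels)):
        for p in range(len(labels)):
            if p == a or labels[p] != labels[a]:
                continue
            d_ap = dist[a, p]
            mask = (labels != labels[a]) & (dist[a] > d_ap) & (dist[a] < d_ap + margin)
            candidates = np.flatnonzero(mask)
            if len(candidates) == 0:
                continue  # no semi-hard negative for this pair in the batch
            neg = candidates[np.argmin(dist[a, candidates])]
            triplets.append((a, p, int(neg)))
    return triplets

emb = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.0], [1.2, 0.0], [3.0, 0.0]])
labels = np.array([0, 0, 1, 1, 1])
triplets = mine_semi_hard_negatives(emb, labels, margin=1.0)
```

For anchor 0 and positive 1 in this toy batch, item 2 is a hard negative (closer than the positive) and item 4 is an easy one (beyond the margin), so item 3 is selected.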

Application to Semantic Search

The triplet loss may be applied directly to semantic search. However, it is important to note that the paper assumes that for each person, all instances carrying a different label are negatives. This is a suitable assumption for face recognition as each image can only belong to one person, but it is not true for semantic search, where a document may be relevant for multiple queries. Hence, the mislabelling issue when mining hard negatives is amplified for semantic search. I imagine that the selection of accurate negatives for semantic search would require more verification and filtering.

Covington 2016 - Deep NNs for Youtube Recs

Deep Neural Networks for YouTube Recommendations

This is an old paper that documents YouTube's recommender system, but it is quite foundational in the RecSys world. It was published at a time when models were migrating to deep learning methods, and TensorFlow had been open sourced just the year before.

Retrieval

As with most recommender systems, there is a candidate generation and ranking step. The candidate generation surfaces "hundreds" of videos. The predecessor to this work is a matrix factorization approach trained under rank loss (assuming it is BPR). This work may be thought of as generalizing the MF approach to non-linear transformations of arbitrary features.

The paper views recommendation as an extreme multiclass classification problem, where the aim is to predict the probability that the video watch $w_{t}$ at time $t$ is video $i$ . In the below, $V$ is the corpus of videos, $U$ is a specific user and $C$ is a specific context. Also $u, v \in R^{N}$ are high dimensional embeddings representing user and item respectively. $P (w_{t} = i ∣ U, C) = \frac{e ^{v_{i}^{T} u}}{\sum _{j \in V} e ^{v_{j}^{T} u}}$

It is well known that the above is intractable with millions of classes, so this paper samples negative classes from the background distribution and then corrects for the sampling via importance weighting. This paper samples "several thousands of negatives" for each true label, and then minimizes the softmax loss for the true label against the negatives. Note: it is not specified what is the background distribution, it could either be a uniform distribution over items or the empirical marginalized impression probability of each item over a fixed time period. Also note that the modern approach is to use in-batch negatives for efficiency, as opposed to the purely random negatives approach here.
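
As a rough illustration, here is a sampled softmax loss with a sampling correction. The paper only says it corrects via importance weighting, so subtracting $\log q$ from each logit (the usual "logQ correction") is my assumption about one common way to do it. The names `sampled_softmax_loss` and `q` (the sampling probability of each item) are hypothetical.

```python
import numpy as np

def sampled_softmax_loss(u, item_emb, pos, neg_ids, q):
    """Softmax cross entropy of the true item against sampled negatives.
    Each logit is corrected by -log q(item), assuming q is the sampling probability."""
    ids = np.concatenate([[pos], neg_ids])
    logits = item_emb[ids] @ u - np.log(q[ids])
    logits = logits - logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]  # the true label sits at index 0

rng = np.random.default_rng(0)
num_items, dim = 1000, 16
item_emb = rng.normal(size=(num_items, dim))
u = rng.normal(size=dim)
q_uniform = np.full(num_items, 1.0 / num_items)
# Negatives drawn uniformly from every item except the positive (item 0)
neg_ids = rng.choice(np.arange(1, num_items), size=20, replace=False)
loss = sampled_softmax_loss(u, item_emb, pos=0, neg_ids=neg_ids, q=q_uniform)
```

With a uniform background distribution the correction is a constant shift and drops out of the softmax, so it only matters when negatives are sampled non-uniformly (e.g. by popularity).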

At serving time, a fast ANN search algorithm is used to find the top scoring items. This paper used a hashing algorithm in Weston 2013 - Label Partitioning for Sublinear Ranking for this, which is probably an outdated approach today.

Retrieval: Features

Each user is represented by features such as:

Watch history, represented by a variable-length sequence of video IDs. Each video is represented by an embedding, and then the element-wise average is taken such that the whole history is represented by a dense embedding. Note: this is prior to the attention mechanism, the modern approach is to run the sequence through self attention before taking the element-wise average to get a more nuanced representation.
Search history, represented by a variable-length sequence of search terms, treated similarly to watch history. Note that in the experiments, both watch and search history were capped at 50 terms each.
Categorical features such as Geographic location are embedded into a dense vector.
Simple binary and float features such as gender, log in status are input directly into the network as real values normalized to [0, 1].

The features are all concatenated together to form a long dense vector representing the raw user inputs, e.g. if the watch and search history embeddings each have dimension $n$ , concatenating them would give a vector of dimension $2 n$ and so on. The dense vector is passed into a fully connected feed forward network with shrinking layers to get a final fixed size dense vector $u$ representing the user. A similar approach is used for items.


Candidate Retrieval Architecture

Retrieval: Example Age Feature

Since video popularity typically spikes in the first few days after upload and dies down after, an important feature is the "example age" to capture the "freshness" of an item. Since we typically have a training window of several weeks, failing to account for the age of an item would mean that the model will learn to predict the average watch probability of a video over the entire training period. This is not ideal since it makes the model biased toward older videos (which have had more time for exposure) than newer videos.

The authors correct for this by including the age of the training example as a feature during training. Suppose our training window is 90 days, then this feature would range from 90 (or 1 after normalization) for training examples on the first day of the training window to 0 for training examples on the last day of the training window. At serving time, the value of this feature is set to 0 to reflect the fact that we are making predictions at the very end of the window. The authors show that this approach helps the model learn to predict the time-sensitive effect of popularity for a given item that matches the distribution in the data.

Note: This is different from incorporating the age of the item as a feature during training. I suppose while age of the item can capture the same information, it requires more work at serving time, since we need to set the feature value to the actual age of the item at the time of serving. In contrast, the above approach allows us to just set the feature to 0 which is more elegant.
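
A tiny sketch of how the example age feature might be computed, assuming timestamps are measured in days from the start of the training window. The function name and arguments are mine.

```python
def example_age(event_day, window_end_day, window_length_days):
    """Normalized age of a training example: 1.0 at the start of the window, 0.0 at the end."""
    return (window_end_day - event_day) / window_length_days

# Training: a watch logged on day 30 of a 90 day window
train_feature = example_age(event_day=30, window_end_day=90, window_length_days=90)

# Serving: always pretend we are at the very end of the window
serving_feature = 0.0
```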

Retrieval: Surrogate Problem

The paper emphasizes that recommendation involves solving a surrogate problem that is then transferred to live inference. Hence care must be taken to ensure that we do not overfit the surrogate problem which can hurt live metrics. Some insights are:

Generate training examples from all YouTube video watches (including those embedded on other sites), not just watches of YouTube's own recommendations, so that videos which users discover through other means can be propagated quickly.
Generate a fixed number of training examples per user, so that very active users do not dominate the loss function. Not sure how this works in practice, presumably a fixed number of examples per user (or just 1?) is sampled for each training epoch. This significantly improved live metrics.
Withhold information from the classifier. This is a counter-intuitive point that shows us how tricky it is to find a good surrogate problem. Given the structure of YouTube, users can find videos either from homepage recommendations or from searching for a video. In the latter case, if a user searches for say taylor swift, it is typically followed by user watches of taylor swift videos. However, we do not want to reproduce this behaviour for homepage recommendations, such that a user is always shown search results based on his / her last search query - this would perform very poorly! Hence the representation of search terms using a simple element-wise average (instead of using a sequence model like RNNs) actually leads to better live results.
Predict the next watched video. It is important that training is not performed on random held-out samples, but that the next video is always held out and a "rollback" of features available prior to the video watch is supplied to the classifier. This is because video watching is highly sequential in nature, e.g. episodic videos are watched in sequence and users explore artists starting with the popular videos before moving to niche ones, hence it is important that the surrogate problem captures this behaviour.
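
The "predict the next watch" setup can be sketched as follows. This is only my illustration of the rollback idea: the function name is hypothetical, and the history cap of 50 simply mirrors the cap mentioned in the features section.

```python
def next_watch_examples(watch_sequence, max_history=50):
    """Turn one user's chronological watch sequence into (history, label) pairs.
    The history only contains watches that happened before the label."""
    examples = []
    for t in range(1, len(watch_sequence)):
        history = watch_sequence[max(0, t - max_history):t]
        examples.append((history, watch_sequence[t]))
    return examples

examples = next_watch_examples(["ep1", "ep2", "ep3"])
```

In contrast, a random hold-out would let the model see `ep3` in the history while predicting `ep2`, which leaks future information.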

Retrieval: Ablation Studies

The paper explored a few settings. At depth 0, the MLP simply transforms the dense input vector in one step to the output dimension of 256, i.e. essentially a linear approach. For depth 1, an in-between ReLU layer of 256 is introduced; for depth 2, in-between ReLU layers of 512 -> 256 are introduced and so on. The performance increases significantly from depth 0 to depth 1, with diminishing benefits as we increase the model size further.

In terms of features, it is interesting that incorporating searches is very important for performance.

Watches only has MAP of around 6%
Watches + Searches has MAP of around 9%
Watches + Searches + Example Age has MAP of around 10%
All features has MAP of around 11%

Ranking

The difference between ranking and retrieval is that ranking aims to specialize predictions to the specific setting in which impressions are shown. In practice, this means that the ranker has access to many more variables about the video and the user's relationship to the video than the retriever. The general architecture for the ranking model is similar to the retrieval model, with the difference that the impressed video is also included as part of the features and the model simply outputs a logit for prediction.


Ranking Architecture

Ranking: Feature Engineering

Despite the promise of deep learning, this paper found that significant time was still spent engineering features due to the challenge of encoding features for recommender systems. The main challenge for the authors was in representing a sequence of temporal user interactions and how these actions relate to the video being impressed. Note: The modern approach seems to move away from such intense feature engineering. Instead, the past interactions and video being impressed are simply embedded and passed into a self attention mechanism. The model can then learn interaction features in the hidden layers. The downside of the modern approach is paying higher compute cost at inference time, but it may still be worthwhile given the significant engineering effort required for generating and retrieving hundreds of features manually.

The paper highlighted User-item previous interaction features as the most important and impactful. This includes features such as:

How many videos from this channel did the user watch before?
When was the last time the user watched a video on this topic? These continuous features describing past user interactions on related items are especially powerful
Features describing number of past impressions of this (user, item) pair are important for a responsive recommender system. For example, if the user was shown this item many times recently but did not interact with it, the recommender can learn to demote this item so that the user's recommendations can be "refreshed".

Note: For YouTube, they maintain such impression features at up-to-the-second latency because of its importance. For other use cases, a latency of say 15 minutes might still be quite helpful.
Feature scores from candidate generation step are also useful

Similar to the retrieval step, categorical features are represented by dense embeddings, with a separate embedding table per categorical feature. The vocabulary for each feature is determined by a single pass over the training data at training time, and the vocabulary size is capped at 1 million based on impression count. Out of vocabulary values during inference time are simply mapped to the zero embedding. Multivalent features (e.g. a sequence of video IDs for watch history) are averaged element-wise before being concatenated.

The paper also found that normalizing continuous features was critical for performance. Specifically, a feature value $x$ with empirical distribution $f$ will be normalized to $\tilde{x} = \int_{- \infty}^{x} df$ , which is the cumulative probability of the empirical distribution up to value $x$ . To make it tractable, the authors compute the quantiles of the empirical distribution and do linear interpolation between the quantiles. For e.g. if the empirical distribution of $x$ is (1, 2, 8, 10), then 1 will be mapped to 0.25, 2 to 0.5 and so on. Note: this is different from min max normalization, which only uses the min and max value.
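
Here is a small sketch of the quantile-based normalization, using `np.interp` for the linear interpolation between quantiles. The helper names and the choice of quantile knots are my own, chosen so that the (1, 2, 8, 10) example above is reproduced.

```python
import numpy as np

def fit_quantile_knots(values, num_quantiles):
    """Return (knots, levels): the feature value at each quantile level."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    n = len(sorted_values)
    steps = np.arange(1, num_quantiles + 1)
    # Index of the smallest value whose empirical CDF reaches level i / num_quantiles
    idx = (steps * n + num_quantiles - 1) // num_quantiles - 1
    return sorted_values[idx], steps / num_quantiles

def normalize(x, knots, levels):
    """Approximate the empirical CDF at x by interpolating between the knots."""
    return np.interp(x, knots, levels)

knots, levels = fit_quantile_knots([1, 2, 8, 10], num_quantiles=4)
normalized = normalize(np.array([1.0, 2.0, 8.0, 10.0]), knots, levels)
in_between = normalize(5.0, knots, levels)  # halfway between the knots at 2 and 8
```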

Ranking: Predicting Watch Time

The authors found that predicting watch time instead of a binary watch variable gets much better performance at live metrics. This is incorporated into the model in the form of weighted logistic regression. For positive examples (where the user did watch the video), the example is weighted by the amount of time the user spent watching the video. For negative examples (where the user did not watch), the example gets unit weight.
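
A minimal sketch of the weighted logistic loss described above. Dividing by the sum of weights is my own normalization choice, and the function name is hypothetical.

```python
import numpy as np

def weighted_logistic_loss(logits, watched, watch_time):
    """Logistic loss where positives are weighted by watch time and negatives by 1."""
    weights = np.where(watched == 1, watch_time, 1.0)
    log_p = -np.logaddexp(0.0, -logits)          # log sigmoid(logit)
    log_one_minus_p = -np.logaddexp(0.0, logits)  # log (1 - sigmoid(logit))
    losses = -(watched * log_p + (1 - watched) * log_one_minus_p)
    return np.sum(weights * losses) / np.sum(weights)

logits = np.array([0.0, 0.0, 2.0])
watched = np.array([1, 0, 1])
watch_time = np.array([5.0, 0.0, 30.0])  # e.g. seconds watched; ignored for negatives
loss = weighted_logistic_loss(logits, watched, watch_time)
```

As I recall from the paper, the odds $e^{logit}$ learned under this weighting approximate expected watch time, which is what gets used for ranking at serving time.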

Recommendations as Treatments

Paper Link.

Training and evaluation data for recommender systems is subject to selection bias, because the probability of observing a data point depends on (i) users interacting with items which they like and (ii) the recommender system pushing items which it thinks the user likes. This results in data that is Missing Not at Random (MNAR), which biases both model training and evaluation.

Consider users interacting with movies. Denote users $u \in \{1, ..., U\}$ and movies $i \in \{1, ..., I\}$ . Let $Y \in R^{U \times I}$ denote the true rating / interaction matrix between user and item, and $\hat{Y}$ the predicted matrix. Let $O \in \{0, 1\}^{U \times I}$ be the observability matrix of whether an interaction was observed, and $P \in R^{U \times I}$ be the probability of observation, i.e. $P_{u, i} = P (O_{u, i} = 1)$ . Ideally, we want the propensity matrix $P$ to be uniform across all entries, which gives us the Missing Completely at Random (MCAR) condition.

Evaluating only on observed entries is biased

Given a model $\hat{Y}$ , we often want to compute evaluation metrics $δ_{u, i} (\hat{Y}, Y)$ on how well $\hat{Y}$ approximates $Y$ . For example:

$δ_{u, i} (\hat{Y}, Y) = (Y_{u, i} - \hat{Y}_{u, i})^{2} \quad \text{(MSE)}$

$δ_{u, i} (\hat{Y}, Y) = \frac{Y_{u, i}}{\log (\text{rank} (\hat{Y}_{u, i}))} \quad \text{(DCG)}$

The Risk function may then be denoted:

$R (\hat{Y}) = \frac{1}{U \cdot I} \sum_{u} \sum_{i} δ_{u, i} (\hat{Y}, Y)$

However, $R (\hat{Y})$ cannot be computed because most entries in $Y$ are missing. Typically, we estimate $\hat{R}$ using only observed entries. This estimator is naive because it simply assumes the propensity matrix $P$ is uniform. This assumption is often false in recommender systems because a small set of popular items tend to get impressed a lot. Hence this estimator will favour models that lean towards the popular items compared to models that recommend rarer / new items.

$\hat{R}_{naive} (\hat{Y}) = \frac{1}{| \{(u, i) : O_{u, i} = 1\} |} \sum_{(u, i) : O_{u, i} = 1} δ_{u, i} (\hat{Y}, Y)$

Unbiased estimators

The main idea in this paper is to view this problem as analogous to estimating treatment effects in causal inference. Think of recommending an item as an intervention analogous to treating a patient with a specific drug. The goal is to estimate the outcome of a new recommendation (clicked or not) or new treatment (recovered or not), while most outcomes between u,i pairs are not known.

The key to resolving this bias is to understand the assignment mechanism that generated $O$ , namely the propensity matrix $P$ . We can then correct for the propensity. We need to assume that $P_{u, i} > 0\forall u, i$ , i.e. full support, because otherwise the IPS estimator below is undefined.

IPS Estimator

The main estimator is the Inverse Propensity Score (IPS) estimator. Assuming that the assignment mechanism $P$ is known:

$\hat{R}_{ips} (\hat{Y} \mid P) = \frac{1}{U \cdot I} \sum_{(u, i) : O_{u, i} = 1} \frac{δ_{u, i} (\hat{Y}, Y)}{P_{u, i}}$

We have simply weighted each score by its inverse propensity, so that popular items with a high chance of being shown get their score reduced proportionally (and likewise in the opposite direction for rare items). Note that the sum is normalized by the full matrix size $U \cdot I$ rather than by the number of observed entries, which is what makes the estimator unbiased.

We can show that the IPS estimator is unbiased as we take the expectation over the random observability matrix. That is, suppose we are allowed to sample an infinitely large number of observations based on $P_{u, i}$ , the average of the IPS estimator over these datasets will be the true risk function. This simply happens because the expected value of $O_{u, i}$ is simply the propensity, and if we know the propensity we can just cancel it out.

$E_{O} [\hat{R}_{ips} (\hat{Y} \mid P)] = \frac{1}{U \cdot I} \sum_{u} \sum_{i} E_{O_{u, i}} [\frac{δ_{u, i} (\hat{Y}, Y)}{P_{u, i}} \cdot O_{u, i}] = \frac{1}{U \cdot I} \sum_{u} \sum_{i} δ_{u, i} (\hat{Y}, Y) = R (\hat{Y})$
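
The unbiasedness argument can be checked with a small simulation. This is a toy setup of my own, not from the paper: per-entry losses `delta` are random, and the propensities are constructed so that entries with low loss are observed more often.

```python
import numpy as np

rng = np.random.default_rng(0)
U, I = 200, 100
delta = rng.uniform(0.0, 1.0, size=(U, I))  # per-entry loss of some fixed model
P = 0.05 + 0.9 * (1.0 - delta)              # low-loss entries are observed more often
true_risk = delta.mean()

naive_estimates, ips_estimates = [], []
for _ in range(50):
    O = rng.uniform(size=(U, I)) < P        # sample an observability matrix
    naive_estimates.append(delta[O].mean())
    ips_estimates.append((delta / P * O).sum() / (U * I))

naive_mean = np.mean(naive_estimates)
ips_mean = np.mean(ips_estimates)
```

Here the naive estimator should land well below the true risk (it mostly sees the easy entries), while the average of the IPS estimates should sit very close to `true_risk`.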

Comments

This paper only addresses exposure bias, but not position bias. In other words, it assumes that all impressions are equal, which is valid when a user is shown an ad, but not when a user is shown a list of search results. In the latter case, items in lower positions have a lower probability of being considered by the user, even though every item in the list is treated as observed.

Bateni 2017 - Affinity Clustering

Bateni 2017 - Affinity Clustering: Hierarchical Clustering at Scale

This paper proposes a hierarchical clustering algorithm which they call "affinity clustering". It is essentially Boruvka's algorithm but with slight modifications, and the main contribution of the paper is a distributed algorithm to perform such clustering at scale.

Much of this paper is devoted to theoretical analysis, but my interest here is just in implementing and understanding why the distributed algorithm is correct.

Naive Algorithm

Suppose we have an undirected graph $G = (V, E)$ , where $V$ is the set of nodes and $E$ is the set of undirected edges. The naive algorithm for affinity clustering (closely following Boruvka's algorithm) is as follows:

Start with every node in its own cluster $c_{i}$ , where $i = 1, ..., ∣ V ∣$ .
In each round, find the lowest weighted outgoing edge from each cluster $c_{i}$ and add the edge
- i.e. for each cluster $c_{i}$ , find $\arg\min_{u, v} \{\text{weight} (u, v) : u \in c_{i}, v \notin c_{i}\}$
- Note that this step ensures that the number of clusters at least halves, since each cluster is connected to another cluster
If the number of clusters becomes lower than $k$ , we undo the most recent edge added until we get $k$ clusters
At the end of each round, we have the desired $k$ number of clusters
The next round then commences with these obtained clusters

Note that this is essentially Boruvka's algorithm (minus the undo steps), which is guaranteed to find the Minimum Spanning Tree (MST) of the graph $G$ . This implies that if we start with $G^{'} := MST (G)$ , we can run the naive algorithm on the MST $G^{'}$ and get exactly the same clusters as running it on $G$ , because only the edges in $G^{'}$ come into play when running affinity clustering on $G$ . Since the number of edges in $G^{'}$ is $| V | - 1$ (by definition of an MST), we would be able to do this efficiently on a single machine (unless $| V |$ is much larger than a few million).
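
A single-machine sketch of the naive algorithm using union-find. Instead of literally undoing edges, it adds the selected edges of a round in increasing weight order and stops once $k$ clusters remain; this shortcut, like the function names, is my own simplification.

```python
def find(parent, x):
    """Find the cluster root of x with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def affinity_clustering(n, edges, k):
    """Cluster nodes 0..n-1 into k clusters. edges is a list of (weight, u, v)."""
    parent = list(range(n))
    num_clusters = n
    while num_clusters > k:
        # Cheapest outgoing edge of every current cluster
        best = {}
        for w, u, v in edges:
            cu, cv = find(parent, u), find(parent, v)
            if cu == cv:
                continue
            for c in (cu, cv):
                if c not in best or w < best[c][0]:
                    best[c] = (w, u, v)
        if not best:
            break  # graph is disconnected, no further merges possible
        for w, u, v in sorted(set(best.values())):
            cu, cv = find(parent, u), find(parent, v)
            if cu != cv and num_clusters > k:
                parent[cu] = cv
                num_clusters -= 1
    return [find(parent, x) for x in range(n)]

# A path graph with one heavy edge between nodes 2 and 3
edges = [(1, 0, 1), (2, 1, 2), (10, 2, 3), (3, 3, 4), (4, 4, 5)]
labels = affinity_clustering(6, edges, k=2)
```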

Therefore, we just need a distributed algorithm to find the MST of $G$ efficiently, and we can perform affinity clustering efficiently.

Efficient MST algorithm

The main algorithm is the distributed MST algorithm (see below). Some notations:

$m$ is the number of edges $∣ E ∣$ , $n$ is the number of nodes $∣ V ∣$
$c$ measures the average density of the graph, i.e. the average number of edges per node. It is defined as $c = \log_{n} (m / n)$ so that $c \in [0, 1]$ , where $c = 0$ implies that $| V | = | E |$ and $c = 1$ implies that $G$ is a fully connected graph.
$0 < ϵ < c$ is a density parameter controlling the final number of edges remaining before we run MST on a single machine. We can probably set it to $0.1$ or something like that. A higher $ϵ$ implies that we can run ReduceEdges for fewer steps, but we need more memory to run the final round of MST.


Distributed MST Algorithm

The algorithm is really quite simple. The main idea is that we can independently process each subgraph of $G$ comprising a random pair of vertex sets $V_{i} \cup U_{j}$ in a distributed fashion. Each worker finds the MST of the subgraph $G_{i, j}$ assigned to it, and any edge that is in $E (G_{i, j})$ but not in $MST (G_{i, j})$ may be removed from the global edge set $E (G)$ . In this way, we can whittle down the number of edges in $E (G)$ until it is a small set that can fit in memory (i.e. in the order of $O (n^{1 + ϵ})$ which is not much larger than the number of nodes). Then we can just run MST one final time on the master node.

So the algorithm really hinges on Lemma 6 in the paper, which tells us that removing edges in this distributed way will still give us the correct MST at the end.

Lemma 6. Let $G^{'} = (V^{'}, E^{'})$ be a (not necessarily connected) subgraph of the input graph $G$ . If an edge $e \in E^{'}$ is not in the MST of $G^{'}$ , then it is not in the MST of $G$ either.

My proof by contradiction. Suppose an edge $e \in E (G)$ exists between nodes $A, B$ . Let $G^{'}$ denote a subgraph containing the edge $e$ (and hence nodes $A, B$ ), and suppose for contradiction that $e \in MST (G)$ but $e \notin MST (G^{'})$ .

First cut $MST (G)$ by removing the edge $e$ such that $A, B$ are in different partitions $P_{A}, P_{B}$ (each partition is a set of nodes). Observe that since $e$ exists in $MST (G)$ , it must be the lowest weight edge connecting $P_{A}$ and $P_{B}$ , since otherwise we could have replaced $e$ with a lower weight edge to complete the MST.

Now consider the subgraph $G^{'}$ and partition the nodes in $V^{'}$ according to $P_{A}, P_{B}$ to form $P_{A}^{'} \subset P_{A}$ and $P_{B}^{'} \subset P_{B}$ . Consider the MST of $G^{'}$ (might be a Minimum Spanning Forest instead, if not all nodes can be connected), and observe that there must exist a path between $A$ and $B$ in $G^{'}$ . Now remove any edges that cross $P_{A}^{'}$ and $P_{B}^{'}$ (call this edge set $E_{R}$ ). Then add back all the edges in $E_{R}$ that do not connect a path between nodes $A$ and $B$ .

Now there must exist exactly one edge remaining in $E_{R}$ that connects a path from $A$ to $B$ , since (i) a path exists between $A$ and $B$ in $G$ and (ii) there cannot be more than one edge that does so, otherwise there would have been a cycle in $MST (G^{'})$ . We also know that this edge is not $e$ .

But $e$ is the minimum weight edge between $P_{A}$ and $P_{B}$ , and therefore it must also be the minimum weight edge between $P_{A}^{'}$ and $P_{B}^{'}$ . Hence this other edge could not have been in the MST of $G^{'}$ . We reach a contradiction.

Thus we are justified in removing edges in this way. This lemma is great because it allows us to independently process each subgraph, and if memory is of concern, we can also batch the number of subgraphs processed at each step according to the number of workers we have.
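
To convince myself of the lemma, here is a single-process sketch of one round of edge reduction. It assumes distinct edge weights (so the MST is unique) and takes each subgraph to be the one induced on $V_{i} \cup U_{j}$ , which is my reading of the algorithm. The names `kruskal` and `reduce_edges` are mine, and a real implementation would farm the inner loop out to workers.

```python
import random

def kruskal(nodes, edges):
    """Minimum spanning forest of (nodes, edges) as a set of (weight, u, v) tuples."""
    parent = {v: v for v in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = set()
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.add((w, u, v))
    return forest

def reduce_edges(nodes, edges, k, rng):
    """Drop every edge that is not in the MST of some subgraph induced on V_i union U_j."""
    left = {v: rng.randrange(k) for v in nodes}   # random partition V_1..V_k
    right = {v: rng.randrange(k) for v in nodes}  # independent partition U_1..U_k
    removed = set()
    for i in range(k):
        for j in range(k):
            sub_nodes = {v for v in nodes if left[v] == i or right[v] == j}
            sub_edges = [e for e in edges if e[1] in sub_nodes and e[2] in sub_nodes]
            removed |= set(sub_edges) - kruskal(sub_nodes, sub_edges)
    return [e for e in edges if e not in removed]

rng = random.Random(0)
nodes = list(range(30))
pairs = rng.sample([(u, v) for u in nodes for v in nodes if u < v], 200)
weights = rng.sample(range(10_000), len(pairs))  # distinct weights
edges = [(w, u, v) for w, (u, v) in zip(weights, pairs)]
reduced = reduce_edges(nodes, edges, k=3, rng=rng)
```

If the lemma holds, the MST computed on `reduced` should be identical to the MST computed on the full edge list.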

Lastly, note that $c$ is a dynamic parameter in the algorithm that measures the density (i.e. number of edges relative to number of nodes) of the graph at each step. Since we reduce the density of the graph in every step, $c$ is stepped down progressively. This also results in $k$ being stepped down progressively, where $k$ controls the number of subgraphs at each step. We start with a large $k$ , which requires less memory since small subgraphs implies that many edges are "chopped off". As the graph becomes less dense, we can afford to lower $k$ progressively and remove more edges.

Implementation

I can't seem to find an implementation of this algorithm, but it is probably easy to write a naive version of it using multiprocessing in python. We could store the edges in a sqlite database and distribute a batch of subgraphs at each turn, collect the edges to remove in the master and remove them from the db, and repeat.

Guo 2017 - DeepFM

Guo 2017 is an important innovation for recommender systems. The task tackled in this paper is to predict clickthrough rate (CTR) of recommended items.

The paper argues that it is important to learn implicit interaction signals between user and item features. Suppose we define the order of a signal as the number of features involved in deriving the signal. Then:

Recency of an item is an order-1 signal
Item category and timestamp is an order-2 signal, since e.g. demand for food items spikes at meal times
Gender, game category and user age is an order-3 signal, since e.g. male teenagers prefer shooting games

One can see that both low and high order signals are important for modelling CTR, and manual feature engineering is cumbersome to derive all such interaction rules. Hence we would like to have a deep learning model that can model such signals directly from raw user / item features and interaction data. This paper proposes to use Factorization Machines to model low order feature interactions and deep neural network to model high order feature interactions.

Setup

Suppose we have a dataset comprising $n$ instances of tuples $(χ, y)$ , where $χ$ is an $m$ -fields raw feature record where each field could be categorical or numerical. Each categorical field is represented as a vector of one-hot encoding, and each continuous field is represented as-is. Let $x_{field_{j}}$ denote the vector representation of field $j$ (where dimension of numericals is 1 and dimension of categoricals is the number of categories), and $x$ denote the flattened vector of the $x_{field_{j}}$ laid out horizontally.

DeepFM

DeepFM comprises two distinct models: the FM component and the deep component which are simply summed together: $\hat{y} = σ (y_{FM} + y_{deep})$ . We go into each component below.

The FM component is a factorization machine that captures order-1 and order-2 interactions between the raw features. We first project each feature field $x_{field_{j}}$ to a $k$ dimensional latent vector using a learned $dim_{field_{j}} \times k$ embedding matrix (in the paper, the authors set $k = 10$ ). The latent vector representation of field $j$ is denoted as $V_{j} \in R^{k}$ . We compute the FM output as follows:

$y_{FM} = \langle w, x \rangle + \sum_{i = 1}^{m} \sum_{j = i + 1}^{m} \langle V_{i}, V_{j} \rangle$

The first term represents the order-1 representation of the features as-is. The second term is a pairwise inner product between the embedding representations of each feature field, which represents order-2 interactions between the features.

The deep component tries to capture higher-order interactions between feature fields. This is done simply by laying out the embedding vectors $V_{j}$ horizontally, such that we form a flat input vector of size $k \times m$ , and we call this $a^{(0)}$ . This fixed size vector is then fed into a multi-layer perceptron (or dense neural network) with $H$ hidden layers to finally return a sigmoid output. The standard forward equation for layer $l$ of the MLP is denoted below. Note that the embedding layer is shared between the deep and FM networks, allowing the deep component to benefit from the faster learning of the FM component.

$a^{(l + 1)} = σ (W^{(l)} a^{(l)} + b^{(l)})$

$y_{deep} = σ (W^{(H + 1)} a^{(H)} + b^{(H + 1)})$

In the paper, the MLP is of the form 400->400->400. Dropout is set to 0.5.
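
Putting the two components together, a forward pass could be sketched in NumPy as below. This follows the equations in these notes rather than any reference implementation. Using ReLU in the hidden layers is my assumption, and all function and argument names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fm_second_order(V):
    """Sum of pairwise inner products <V_i, V_j> over all field pairs i < j."""
    m = len(V)
    return sum(V[i] @ V[j] for i in range(m) for j in range(i + 1, m))

def deepfm_forward(x, field_slices, w, emb, hidden, w_out, b_out):
    # Project every field to a k-dim latent vector (shared by FM and deep parts)
    V = np.stack([x[s] @ emb[j] for j, s in enumerate(field_slices)])  # shape (m, k)
    y_fm = w @ x + fm_second_order(V)
    a = V.reshape(-1)  # a^(0): the m latent vectors laid out flat
    for W, b in hidden:
        a = np.maximum(0.0, W @ a + b)  # hidden layers, ReLU assumed
    y_deep = sigmoid(w_out @ a + b_out)
    return sigmoid(y_fm + y_deep)

# Toy record: one categorical field with 3 categories, one numerical field
rng = np.random.default_rng(0)
k = 4
x = np.array([0.0, 1.0, 0.0, 0.7])
field_slices = [slice(0, 3), slice(3, 4)]
emb = [rng.normal(size=(3, k)), rng.normal(size=(1, k))]
w = rng.normal(size=4)
hidden = [(rng.normal(size=(8, 2 * k)), np.zeros(8))]
w_out, b_out = rng.normal(size=8), 0.0
y_hat = deepfm_forward(x, field_slices, w, emb, hidden, w_out, b_out)
```

The double loop in `fm_second_order` is quadratic in the number of fields; the usual trick is to compute it as half of (squared norm of the sum minus the sum of squared norms), which gives the same value in linear time.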

Implementation

Huawei has an implementation of DeepFM (amongst other models) in pytorch.

Hamilton 2017 - GraphSAGE

Paper.

This paper presents a framework to efficiently generate node embeddings, especially for previously unseen nodes. It calls this inductive learning, i.e. the model is able to generalize to new nodes, as opposed to previous frameworks which are transductive and only learn embeddings for seen nodes. For example, matrix factorization methods are transductive because we can only make predictions on a graph with fixed nodes, and need to be retrained when new nodes are added.

GraphSAGE (i.e. Sample and Aggregation) aggregates feature information from the neighbourhood of a node to represent a given node. Feature information can include structural information (e.g. degree of a node) or other content-based information about a node, e.g. description of a job for a job node etc.

Setup

We start with a graph $G (V, E)$ that is provided by data. We have input features $x_{v} \forall v \in V$ . For a user-chosen depth of $K$ , we have $K$ aggregator functions $AGG_{k}$ and $K$ weight matrices $W_{k}$ . We also have a neighbourhood function $N : V \to 2^{V}$ , which means it maps from a node $v \in V$ to a set of nodes in $V$ . Note that $2^{V}$ is the powerset of the set $V$ , i.e. the set of all possible subsets of $V$ . We wish to generate low-dimensional, dense vector representations of each node $z_{v}$ .

Forward Propagation

The algorithm for the forward propagation (in words) is as follows:

We start with hidden representations $h_{v}^{0} \leftarrow x_{v} \forall v \in V$ , i.e. at layer 0, we just use the input features to represent each node
At depth $k = 1$ , we perform a neighbourhood aggregation step at each node $v$ : $h_{N (v)}^{k} \leftarrow AGG_{k} ({h_{u}^{k - 1} : u \in N (v)})$
The aggregated vector is then passed through a dense layer to get the hidden representation at depth $k$ . Note that $σ$ represents a non-linear activation, such as ReLU: $h_{v}^{k} = σ (W_{k} \cdot CONCAT (h_{v}^{k - 1}, h_{N (v)}^{k}))$
We L2-normalize each vector $h_{v}^{k} \forall v \in V$
We repeat this process for depths $k = 1, ..., K$
We then take the last layer as the final embedding: $z_{v} \leftarrow h_{v}^{K}$

The intuition behind the forward propagation is that we use the neighbours of $v$ to represent it. Importantly, we also include the hidden representation of the current node from the previous depth (analogous to a residual connection). At each depth level $k$ , we increasingly pull more information from further reaches of the graph. Note that in the aggregation step, a subset of each node's neighbours are sampled uniformly (as opposed to taking the full neighbour set) to control the complexity of the algorithm.
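The forward pass above can be sketched in NumPy with the mean aggregator. This is a minimal sketch, not the authors' implementation: the function name `graphsage_forward`, the fixed sample size `num_samples`, sampling with replacement when a node has too few neighbours, and the assumption that every node has at least one neighbour are all illustrative choices.

```python
import numpy as np

def graphsage_forward(x, neighbours, weights, num_samples=2, seed=0):
    """Mean-aggregator GraphSAGE forward pass (illustrative sketch).

    x:          (num_nodes, d_in) input features
    neighbours: dict mapping node id -> list of neighbour ids (the function N)
    weights:    list of K matrices; W_k has shape (d_out_k, 2 * d_in_k)
    """
    rng = np.random.default_rng(seed)
    h = x.astype(float)                                  # h^0 = x
    for W in weights:                                    # depths k = 1..K
        h_next = np.zeros((h.shape[0], W.shape[0]))
        for v in range(h.shape[0]):
            nbrs = neighbours[v]
            # Uniformly sample a fixed-size neighbour set (assumption:
            # sample with replacement if the node has too few neighbours).
            sampled = rng.choice(nbrs, size=num_samples,
                                 replace=len(nbrs) < num_samples)
            h_nv = h[sampled].mean(axis=0)               # AGG_k: element-wise mean
            concat = np.concatenate([h[v], h_nv])        # CONCAT(h_v^{k-1}, h_N(v)^k)
            h_next[v] = np.maximum(W @ concat, 0.0)      # ReLU as sigma
        norms = np.linalg.norm(h_next, axis=1, keepdims=True)
        h = h_next / np.maximum(norms, 1e-12)            # L2-normalize each row
    return h                                             # z_v = h_v^K

# Toy graph with 4 nodes, one-hot features and K = 2 layers.
x = np.eye(4)
neighbours = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
w_rng = np.random.default_rng(1)
weights = [w_rng.normal(size=(8, 8)), w_rng.normal(size=(5, 16))]
z = graphsage_forward(x, neighbours, weights)
```

Because the weights are shared across nodes and only features and neighbour lists are needed, the same function can be called on a graph containing nodes that were never seen in training, which is the inductive property described above.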

Loss

To train the weights, we define an unsupervised loss based on how well the embeddings are able to reconstruct the graph. Specifically, we have a loss which:

Rewards positive pairs for having a high dot product
Penalizes negative pairs ( $v_{n}$ being sampled negatives according to the negative sampling distribution $P_{n}$ )

$J_{G} (z_{u}) = - \log (σ (z_{u}^{T} z_{v})) - Q \cdot E_{v_{n} \sim P_{n} (v)} \log (σ (- z_{u}^{T} z_{v_{n}}))$
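As a rough illustration of this loss for a single positive pair, the sketch below estimates the expectation term by summing over $Q$ sampled negatives. The function names and the toy vectors are assumptions for illustration only.

```python
import numpy as np

def log_sigmoid(a):
    # Numerically stable log(sigma(a)).
    return -np.logaddexp(0.0, -a)

def unsupervised_loss(z_u, z_v, z_neg):
    """z_u, z_v: (d,) embeddings of a positive pair.
    z_neg: (Q, d) embeddings of Q negatives drawn from P_n."""
    pos = -log_sigmoid(z_u @ z_v)
    # Q * E[log sigma(-z_u . z_vn)] is estimated by the sum over the Q samples.
    neg = -log_sigmoid(-(z_neg @ z_u)).sum()
    return pos + neg

z_u = np.array([1.0, 0.0])
# Positive aligned with z_u and negative pointing away: low loss.
good = unsupervised_loss(z_u, np.array([1.0, 0.0]), np.array([[-1.0, 0.0]]))
# Positive pointing away and negative aligned: high loss.
bad = unsupervised_loss(z_u, np.array([-1.0, 0.0]), np.array([[1.0, 0.0]]))
```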

Alternatively, we can also define a supervised loss based on classification cross entropy loss, with presumably some form of negative sampling. The authors did not elaborate on this.

Aggregation Methods

The authors explored a few ways to define the $AGG$ function to aggregate neighbour embeddings together:

GraphSAGE-mean: The element-wise mean of the neighbour embeddings is taken
GraphSAGE-GCN: Same as above, except that the current node's hidden representation from the previous depth $h_{v}^{k - 1}$ is not concatenated separately; it is instead averaged together with the neighbour embeddings. The experiments show that omitting the concatenation (which acts like a residual connection) actually leads to significant performance degradation.
GraphSAGE-LSTM: An LSTM is fitted over the sequence of embeddings. Since there is no inherent order to the neighbours, the authors randomize the ordering for each training sample
GraphSAGE-pool: Each neighbour embedding is passed through an additional dense layer, before an element-wise max-pool operation is carried out across the neighbours

Generally from the experiments, it seems that GraphSAGE-mean is sufficient.

Ma 2018 - Entire Space Multi-Task Model

Ma 2018 tackles the problem of building a post-click Conversion Rate (CVR) prediction model. Note that CVR is the task of predicting conversions from clicks, whilst click-through rate prediction (CTR) is predicting clicks from impressions.

In a typical recommender system, users follow the sequential process of impression -> click -> conversion, where conversion may refer to making a job application, purchase action etc. Usually, CVR models are built the same way as CTR models: a dataset of clicked impressions is prepared, and the converted items are labelled as relevant. A model is trained on this dataset and then used to make conversion predictions on all impressions. This paper argues that there are two problems with this approach:

Sample Selection Bias (SSB). The distribution of the training set (comprising only clicked impressions) differs greatly from the distribution of the testing set (comprising all impressions), and this distribution shift will hurt generalization performance of the trained model.
Data Sparsity (DS). The dataset for CVR (clicked impressions) is typically much smaller than the dataset for CTR (all impressions), and the lack of data makes model fitting difficult. The paper estimates that the CVR dataset is typically 4% of the size of the CTR dataset.

Setup

Denote the observed dataset to be $S = \{(x_{i}, y_{i} \to z_{i})\} ∣_{i = 1}^{N}$ , with each sample tuple $(x, y, z)$ representing one impression drawn from a distribution $D$ with domain $X \times Y \times Z$ . $X$ is the feature space, and $Y, Z$ are label spaces (i.e. 0 or 1). Each feature vector $x \in R^{d}$ captures all the user attributes, item attributes or user-item interaction for the impression event. The notation $y \to z$ represents the sequential nature where a click event $y$ must precede a conversion event $z$ .

We can denote the various prediction tasks as follows:

Post-view clickthrough: $pCTR = p (y = 1∣ x)$
Post-click conversion: $pCVR = p (z = 1∣ y = 1, x)$
Post-view click + conversion: $pCTCVR = p (z = 1, y = 1∣ x) = pCTR \times pCVR$

The conventional way of modelling pCVR is to construct a sample from only clicked impressions, i.e. $S_{c} = \{(x_{j}, z_{j}) ∣ y_{j} = 1\}$ , where clicked but not converted impressions are treated as negative samples. We can see that $S_{c} \subset S$ . As mentioned above, there are problems with this approach.

ESMM Model

The ESMM model breaks down the pCTCVR task into its constituents, and uses two neural networks to model pCTR and pCVR simultaneously. Based on the diagram, it seems to embed each user field and item field into a fixed-size embedding, where the user field embeddings are summed up element-wise to produce an overall user embedding. The same is done to produce an overall item embedding. The user and item embeddings are then concatenated together, and this combined embedding is fed into a dense layer to finally output a real score representing either pCVR or pCTR. The two scores are then multiplied together to form the final prediction of pCTCVR.

Importantly, the projection (or lookup) layer from raw features to embedding is shared between the two neural networks. This allows the pCVR network in particular to benefit from the richer sample data that the pCTR network enjoys and addresses the data sparsity issue.


ESMM Model Architecture (Figure 2 from ESMM Paper)

Finally, the model is trained with a multi-task objective. Specifically, the losses are computed on the dataset with all impressions. The output pCTR is compared against clicks using a cross-entropy loss, and the output pCTCVR is compared against conversions using a cross-entropy loss. This multi-task loss allows us to exploit the sequential nature of the data generation process, such that $f_{cvr}$ only needs to model the delta aspect that leads from a click to a conversion.

$L (θ_{cvr}, θ_{ctr}) = \sum_{i = 1}^{N} l (y_{i}, f_{ctr} (x_{i})) + \sum_{i = 1}^{N} l (z_{i}, f_{ctr} (x_{i}) \times f_{cvr} (x_{i}))$
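To make the objective concrete, here is a small NumPy sketch that evaluates this loss given the two network outputs. It only computes the loss value; `p_ctr` and `p_cvr` stand in for $f_{ctr} (x_{i})$ and $f_{cvr} (x_{i})$ , and the function names and toy numbers are assumptions.

```python
import numpy as np

def bce(label, prob, eps=1e-7):
    # Element-wise binary cross entropy l(label, prob).
    prob = np.clip(prob, eps, 1 - eps)
    return -(label * np.log(prob) + (1 - label) * np.log(1 - prob))

def esmm_loss(y, z, p_ctr, p_cvr):
    """y: click labels, z: conversion labels, both over ALL impressions.
    p_ctr, p_cvr: outputs of the two towers for the same impressions."""
    p_ctcvr = p_ctr * p_cvr                  # pCTCVR = pCTR * pCVR
    return np.sum(bce(y, p_ctr) + bce(z, p_ctcvr))

# Three impressions: click + convert, click only, no click.
y = np.array([1.0, 1.0, 0.0])
z = np.array([1.0, 0.0, 0.0])
good = esmm_loss(y, z, p_ctr=np.array([0.9, 0.9, 0.1]),
                 p_cvr=np.array([0.9, 0.1, 0.5]))
bad = esmm_loss(y, z, p_ctr=np.array([0.1, 0.1, 0.9]),
                p_cvr=np.array([0.1, 0.9, 0.5]))
```

Note that no loss term is applied to `p_cvr` directly, so the CVR tower is supervised only through the product, over the entire impression space.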

The authors show that modelling in this multi-task manner outperforms a setup where two models are trained independently to predict CTR and CVR respectively, and their product is taken to estimate pCTCVR. Unfortunately, we cannot replicate this joint-task learning setup with gradient tree-based models, at least not naively.

Details

The authors set the dimension of each embedding vector to be 18, and each MLP is 360 -> 200 -> 80 -> 2 dimensional. Adam solver is used with $β_{1} = 0.9, β_{2} = 0.999, ϵ = 10^{- 8}$ .

Kang 2018 - SASRec

Self-Attentive Sequential Recommendation

This paper uses a transformer model with causal self-attention to perform recommendations, by representing each user as a sequence of item embeddings and predicting the item interacted with at time t+1 using all the information up to time t.

Background

This paper came out shortly after the transformer (Attention is all you need) was invented. Up to this point, sequential recommendation was performed using Markov Chain methods or RNN-based methods. Since the self-attention mechanism of transformers is well suited to sequential modelling, this paper makes the natural adaptation of self-attention to the recommendation task.

Setup

In the setting of sequential recommendation, we have for each user a sequence of item interactions $S^{u} = (S_{1}^{u}, S_{2}^{u}, ..., S_{∣ S^{u} ∣}^{u})$ where each element represents an item. For computation reasons we may choose to truncate $S^{u}$ to the most recent $n$ interactions. For simplicity we may also denote $S^{u} = (s_{1}, ..., s_{n})$ . We have user and item sets $U, I$ . Let us also define:

$M \in R^{∣ I ∣ \times d}$ as the full item embedding matrix with latent dimension $d$
$P \in R^{n \times d}$ as the learned position embedding matrix

For each user, we receive $S^{u}$ and truncate it to the most recent $n$ items. If there are fewer items, we left-pad the sequence with a constant zero vector. This results in an input embedding matrix $E \in R^{n \times d}$ for the user.

Analogous to the language modelling task, the targets for each user are simply $S^{u}$ shifted to the left by one. In other words, the target at time step $t$ would be the item interacted with at time step $t + 1$ .

Model

Position Embeddings. We start by adding position embeddings to the user representation; absolute position embeddings are used here. Since this is a transformer model, the model has no notion of item order if we do not inject the position embedding, and would not be able to learn that more recent items contain more valuable information about the next item to predict. The authors later show that visualizing the self-attention heatmap reveals that without position embedding, all items are attended to similarly, but with position embedding the attention weights are concentrated near the diagonal, i.e. more recent items are attended to more strongly.

Specifically, we simply add the position embedding matrix to the input embedding matrix, such that: $\hat{E}_{t} = M_{s_{t}} + P_{t}$

Self attention. The standard scaled dot product attention is used to perform self attention on the input embedding. Specifically: $SA (\hat{E}) = Attention (\hat{E} W^{Q}, \hat{E} W^{K}, \hat{E} W^{V})$ , where $Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d}}) V$

Where $W^{Q}, W^{K}, W^{V} \in R^{d \times d}$ are the projection matrices. We then make sure to mask the softmax attention matrix in a causal manner, so that there can be no interaction between $Q_{i}$ and $K_{j}$ for all $j > i$ .
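A single-head version of this causally masked attention can be sketched in NumPy as follows. The function name and the random toy inputs are assumptions; dropout and multiple heads are left out for brevity.

```python
import numpy as np

def causal_self_attention(E_hat, W_q, W_k, W_v):
    """E_hat: (n, d) input embeddings with positions already added."""
    n, d = E_hat.shape
    Q, K, V = E_hat @ W_q, E_hat @ W_k, E_hat @ W_v
    scores = Q @ K.T / np.sqrt(d)                       # scaled dot product
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)    # True where j > i
    scores = np.where(mask, -np.inf, scores)            # block future positions
    scores = scores - scores.max(axis=1, keepdims=True) # stable softmax
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return attn @ V, attn

rng = np.random.default_rng(0)
n, d = 4, 3
E_hat = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = causal_self_attention(E_hat, W_q, W_k, W_v)

# Changing the last item must not change the outputs at earlier time steps.
E_changed = E_hat.copy()
E_changed[-1] += 1.0
out_changed, _ = causal_self_attention(E_changed, W_q, W_k, W_v)
```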

Feedforward network. A point-wise two-layer feedforward network is applied to the output of the self attention (i.e. $S \in R^{n \times d}$ ), like so: $FFN (S) = ReLU (S W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}$

Where $W^{(1)}, W^{(2)} \in R^{d \times d}$ and $b^{(1)}, b^{(2)} \in R^{d}$ . Note that in the feedforward networks, there remains no interaction between any $S_{i}, S_{j}$ at different time positions.

Stacking self attention layers. Now we stack the self attention layers and also apply (i) residual connection, (ii) dropout and (iii) LayerNorm each time. This is standard practice in transformers and leads to more stable training. $S^{(b)} = g(SA) (F^{(b - 1)})$ and $F^{(b)} = g(FFN) (S^{(b)})$

Where we define the composite function $g(SA)$ as follows, and $g(FFN)$ is defined similarly. $g(SA) (X) := X + Dropout (SA (LayerNorm (X)))$

Note: In modern transformers, the $LayerNorm$ is often replaced by the simpler $RMSNorm$ and the $ReLU$ function is often replaced by the $GELU$ function.

This gives us the full specification for one layer (layer $b$ ) of the transformer. Several layers are stacked to provide the full model.

Prediction. After the final layer $B$ , we have $F^{(B)} \in R^{n \times d}$ as our representation. The predicted score at each time step $t$ for any item $i \in I$ is made according to a simple dot product: $r_{i, t} = F_{t}^{(B)} M_{i}^{T}$

Training Loss

As discussed above, the target for each time step $t$ is simply the next item at time step $t + 1$ . Specifically, if we define $o_{t}$ as the target output at time step $t$ , we have:

$o_{t} = <pad>$ if $s_{t}$ is a padding item
$o_{t} = s_{t + 1}$ for $1 \leq t < n$

Note: Not sure if we need to predict the padding item, or just simply mask the loss at those positions. Similarly for time step $n$ , where we do not know the next item to predict after the last item in the sequence.

Finally, the binary cross entropy loss is chosen as the objective function for each time step $t$ . Specifically, a random negative item that user $u$ has not interacted with is sampled for each time step and used as the negative example. The loss is: $L = - \sum_{S^{u} \in \mathcal{S}} \sum_{t \in [1, ..., n]} [\log (σ (r_{o_{t}, t})) + \sum_{j \notin S^{u}} \log (1 - σ (r_{j, t}))]$
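The target construction and loss for one user can be sketched as below. This is one possible reading, not the paper's code: it assumes item id `0` is reserved for padding, keeps the most recent `n + 1` items so that the last input position has a known target, and masks the loss at padded positions instead of predicting a padding item. All function names are hypothetical.

```python
import numpy as np

PAD = 0  # assumption: item id 0 is reserved for padding

def build_inputs_and_targets(history, n):
    """history: list of item ids in time order. Returns two length-n arrays."""
    seq = history[-(n + 1):]                  # keep n + 1 most recent items
    inputs, targets = seq[:-1], seq[1:]       # targets = inputs shifted by one
    pad = [PAD] * (n - len(inputs))           # left-pad short sequences
    return np.array(pad + inputs), np.array(pad + targets)

def log_sigmoid(a):
    return -np.logaddexp(0.0, -a)

def sasrec_bce_loss(F, M, targets, negatives):
    """F: (n, d) final-layer states, M: (num_items, d) item embeddings.
    negatives: (n,) one sampled negative item id per time step."""
    pos = np.sum(F * M[targets], axis=1)      # r_{o_t, t}
    neg = np.sum(F * M[negatives], axis=1)    # r_{j, t}
    # log(1 - sigma(a)) equals log(sigma(-a)).
    per_step = -(log_sigmoid(pos) + log_sigmoid(-neg))
    return per_step[targets != PAD].sum()     # ignore padded positions

inputs, targets = build_inputs_and_targets([5, 3, 8], n=4)
rng = np.random.default_rng(0)
M = rng.normal(size=(10, 6))
F = rng.normal(size=(4, 6))
negatives = np.array([1, 1, 7, 2])
loss = sasrec_bce_loss(F, M, targets, negatives)
```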

Note: The binary cross entropy loss is chosen here with one sampled negative per time step. A later paper, Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?, shows that using softmax cross entropy loss with many sampled negatives leads to much better performance.

Experiments

For the experiments, two attention layers were used (i.e. $B = 2$ ). Item embeddings are shared between the input embedding layer and the prediction layer (for $M_{i}$ ). The latent dimension is set to $d = 50$ .

The ablation studies found that:

Increasing number of layers saturates at around $B = 3$
Using multi-head attention did not improve over single head attention
The absolute position embeddings generally improved performance relative to no position embeddings

Implementation

The authors endorsed a PyTorch implementation here.

Reimers 2019 - Sentence-BERT

Link to paper.

Sentence-BERT (or SBERT) was one of the first papers to suggest a way to fine-tune BERT models to generate useful embeddings that can be used for search / retrieval.

Prior to SBERT, BERT models were mainly used for sentence pair regression tasks by passing two sentences into the transformer network and adding a classification head on top to produce a float value. We can call this the cross-encoder approach. In other words, researchers only cared about the final prediction and did not make use of the embeddings, or the final vector representation of the inputs. This approach is suitable for reranking a small number of documents but not for nearest neighbour search in a corpus with millions of documents.

Naively, one can element-wise average the BERT embeddings at the final layer to produce a vector representation of the text. This vector representation can then be used for nearest neighbour search or clustering. However, because BERT was not explicitly trained for this objective, it results in rather bad sentence embeddings, often worse than GloVe embeddings.

Method

The SBERT paper presents 3 different training objectives, all of which perform well on embedding similarity tasks. The choice of objective depends on the dataset:

Classification objective. This is for tasks where the objective is to predict a label given two sentences A, B. We pass each sentence into the BERT network and a pooling layer to get two vector representations, $u$ and $v$ . The pooling layer can be (i) take the [CLS] token embedding, (ii) take the element-wise mean or (iii) take the element-wise max. We then create a concatenated vector $(u, v, ∣ u - v ∣)$ which is fed into a softmax classifier. The network is trained using cross-entropy loss.

Note that this siamese approach (where each sentence is passed into the same network) differs a little from the typical cross-encoder approach, where the sentences are concatenated as a string with the [SEP] token before being passed into the network. The latter approach is presumably more powerful because the attention mechanism can attend to all pairwise relationships
Regression objective. This is for tasks where the objective is to predict a float given two sentences A, B. Given the vectors $u$ and $v$ , the cosine similarity is simply taken to generate a float between $- 1$ and $1$ . The cosine similarity is then compared with the actual float value using mean-squared error to generate a loss.
Triplet objective. This is for tasks where each data point is a triplet (anchor sentence $a$ , positive sentence $p$ , negative sentence $n$ ). We then minimize the loss function, where $m$ is the margin: $\max (\lVert s_{a} - s_{p} \rVert - \lVert s_{a} - s_{n} \rVert + m, 0)$
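The three objectives operate on pooled sentence vectors, so their core computations can be sketched without the BERT encoder. In the sketch below the vectors are toy stand-ins for pooled outputs, the function names are hypothetical, and the default `margin=1.0` is only an illustrative value.

```python
import numpy as np

def classification_features(u, v):
    # Input to the softmax classifier: (u, v, |u - v|).
    return np.concatenate([u, v, np.abs(u - v)])

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def regression_loss(u, v, label):
    # Squared error between cosine similarity and the gold score.
    return (cosine_similarity(u, v) - label) ** 2

def triplet_loss(s_a, s_p, s_n, margin=1.0):
    # max(||s_a - s_p|| - ||s_a - s_n|| + m, 0) with Euclidean distance.
    d_pos = np.linalg.norm(s_a - s_p)
    d_neg = np.linalg.norm(s_a - s_n)
    return max(float(d_pos - d_neg + margin), 0.0)

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])
features = classification_features(u, v)
```

At retrieval time only the pooled vectors are used, so whichever objective is chosen, documents can be embedded once and indexed for nearest neighbour search.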

Ablation

Pooling strategy. Using [CLS] or mean seems to be largely similar. The authors saw some degradation using max for the regression objective.
Concatenation. For the classification objective, the concatenation strategy makes some difference. In particular, using $(u, v)$ yields $ρ = 0.66$ but $(u, v, ∣ u - v ∣)$ yields $ρ = 0.81$ . Thus the element-wise difference is important in yielding useful embeddings, probably because it can be used to push similar sentences together and dissimilar sentences apart. The authors also found that adding element-wise multiplication $u * v$ does not help.

Takeaway

It is interesting that the classification objective, which is close to a cross-encoder framework, is also able to learn useful embeddings by adding the difference operation $∣ u - v ∣$ . This suggests that we can train a classifier over sentence pairs and simultaneously get useful embeddings for nearest neighbour retrieval.

Yi 2019 - LogQ Correction for In Batch Sampling

Yi 2019 - Sampling Bias Corrected Neural Modelling for Large Corpus Item Recommendations

This paper proposes a way to perform logQ correction for the sampling bias introduced by in-batch negative sampling when training two tower models. The proposed algorithm is a streaming algorithm that estimates item frequencies, updating its estimates after seeing each mini-batch.

Setup

Let $x_{i}$ , $y_{j}$ denote a user and item respectively, where there are $i = 1, ..., N$ users and $j = 1, ..., M$ items. Let $u (.)$ and $v (.)$ denote user and item embedding functions that map each $x_{i}$ and $y_{j}$ to $R^{k}$ . These functions are typically:

Some sentence transformer model for texts
Some hash embedding in the collaborative filtering setting

The output of the model is the inner product of the embeddings, i.e. $s (x, y) = \langle u (x), v (y) \rangle$ . The goal is to train the model from a training dataset of $T$ user-item interactions, denoted by $\mathcal{T} := \{(x_{i}, y_{i}, r_{i})\}_{i = 1}^{T}$ , where $x_{i}$ , $y_{i}$ are the interacting query and item and $r_{i} \in R$ is the associated reward.

Typically $r_{i} = 1$ to denote an interaction
We can also use $r_{i}$ to denote some quality weight, e.g. time spent on product

Given a query $x$ , we typically model the conditional probability of picking item $y$ based on the softmax function. $θ$ parametrizes the embedding model: $P (y ∣ x; θ) = \frac{e ^{s (x, y)}}{\sum _{j \in [M]} e ^{s (x, y_{j})}}$

We then design the loss function as a weighted log likelihood of the training interactions: $L_{T} (θ) := - \frac{1}{T} \sum_{i \in [T]} r_{i} \cdot \log (P (y_{i} ∣ x_{i}; θ))$

In Batch Sampling

In practice, the denominator for $P$ above is not feasible to compute when the number of items $M$ is very large. The common practice is to sample only a subset of $B$ items that are drawn in a mini batch. Hence given a mini batch of $B$ pairs $\{(x_{i}, y_{i}, r_{i})\}_{i = 1}^{B}$ and for any $i \in [B]$ , the batch softmax becomes:

$P_{B} (y_{i} ∣ x_{i}; θ) = \frac{e ^{s (x_{i}, y_{i})}}{\sum _{j \in [B]} e ^{s (x_{i}, y_{j})}}$

Note that each $x_{i}, y_{i}$ refers to a positive pair. However, the batch softmax above is usually a very biased estimate of the full softmax. This is because our training data usually has a heavy bias toward popular items, hence popular items are far more likely than rare items to appear as in-batch negatives in the denominator.

In other words, our model trained with this biased likelihood function may have a low training loss against popular items in the denominator during training. But when used in retrieval, the model may be assigning high scores to rare items that should be negatives, just that our model did not have a chance to discriminate against them due to the biased sampling during training.

This issue underlies a common phenomenon when training such retrieval embedding models, where the reranking performance is good but retrieval performance is very bad. The reason is that reranking is often performed against popular items that the model sees often, but retrieval by definition searches across the whole item catalogue. Hence retrieval is (from this perspective) a harder task than reranking. Special attention must be paid during training to ensure that the model learns to discriminate well against all items in the catalogue, and this logQ correction is one of the methods at our disposal.

In Adaptive Importance Sampling to Accelerate Training of A Neural Probabilistic Language Model, the authors propose the following way to correct the biased batch softmax by correcting each score logit: $s^{c} (x_{i}, y_{j}) = s (x_{i}, y_{j}) - \log (p_{j})$

Where $p_{j}$ denotes the probability of sampling an item $j$ in a random batch. With this correction, we can denote the batch softmax as: $P_{B}^{c} (y_{i} ∣ x_{i}; θ) = \frac{e ^{s^{c} (x_{i}, y_{i})}}{e ^{s^{c} (x_{i}, y_{i})} + \sum _{j \in [B], j \neq i} e ^{s^{c} (x_{i}, y_{j})}}$

And finally we have the batch loss function as: $L_{B} (θ) := - \frac{1}{B} \sum_{i \in [B]} r_{i} \cdot \log (P_{B}^{c} (y_{i} ∣ x_{i}; θ))$
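The corrected batch loss is straightforward to compute once the embeddings and estimated probabilities are available. The NumPy sketch below assumes that row $i$ of `V` is the positive item for row $i$ of `U` and that every other row in the batch acts as a negative; the function name and toy inputs are hypothetical.

```python
import numpy as np

def logq_corrected_batch_loss(U, V, p, r=None):
    """U: (B, k) query embeddings, V: (B, k) item embeddings.
    p: (B,) estimated sampling probability p_j of each in-batch item.
    r: (B,) rewards, defaulting to 1 for every pair."""
    B = U.shape[0]
    r = np.ones(B) if r is None else r
    # s^c(x_i, y_j) = s(x_i, y_j) - log(p_j), applied to every column j.
    logits = U @ V.T - np.log(p)[None, :]
    logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds log P_B^c(y_i | x_i) for the positive pairs.
    return -np.mean(r * np.diag(log_softmax))

U = np.eye(2)
V = np.eye(2)
uncorrected = logq_corrected_batch_loss(U, V, p=np.array([1.0, 1.0]))
uniform = logq_corrected_batch_loss(U, V, p=np.array([0.5, 0.5]))
skewed = logq_corrected_batch_loss(U, V, p=np.array([0.9, 0.1]))
```

When all items are equally likely to be sampled, the correction shifts every logit by the same constant and the loss is unchanged; it only matters when the sampling probabilities differ across items.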

Estimating Sampling Probability in Stream Setting

Notably, the batch loss function does not require holding a fixed set of items in memory to serve as negative candidates, making it suitable for use in a streaming training data setting. Thus, the authors propose a method to estimate the sampling probability $p_{j}$ in a streaming fashion as well.

The first observation is that it is easier to track the number of steps (or batches) between two consecutive hits of item $j$ . e.g. if an item appears on average once every 50 batches, then $p = 1 / 50 = 0.02$ . The proposed algorithm is as follows:

Initialize Arrays $A, B$ with size $H$
Let $h (.)$ be a hash function from an item ID to $[H]$
At batch $t$ , sample a batch of items. For each item $y$ in the batch:
- $B [h (y)] \leftarrow (1 - α) \cdot B [h (y)] + α \cdot (t - A [h (y)])$
- $A [h (y)] \leftarrow t$
At inference time, the sampling probability for item $y$ will be $p_{y} = 1/ B [h (y)]$ .
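The streaming estimator above can be sketched as a small class. Several details here are my own assumptions and not taken from the paper: integer item ids hashed with Python's built-in `hash`, de-duplicating items within a batch, zero initialization of both arrays, and clamping the estimated gap to at least one batch so that the returned probability never exceeds 1.

```python
import numpy as np

class StreamingFrequencyEstimator:
    def __init__(self, H=1024, alpha=0.05):
        self.A = np.zeros(H)    # A[h]: last step at which the bucket was hit
        self.B = np.zeros(H)    # B[h]: moving average of steps between hits
        self.H, self.alpha = H, alpha

    def _h(self, item_id):
        # Assumption: integer ids, for which hash() is deterministic.
        return hash(item_id) % self.H

    def update(self, t, batch_items):
        for y in set(batch_items):   # assumption: count an item once per batch
            i = self._h(y)
            self.B[i] = (1 - self.alpha) * self.B[i] + self.alpha * (t - self.A[i])
            self.A[i] = t

    def prob(self, item_id):
        # p_y = 1 / B[h(y)]; the gap is clamped to >= 1 batch (assumption).
        return 1.0 / max(self.B[self._h(item_id)], 1.0)

est = StreamingFrequencyEstimator(H=64, alpha=0.05)
for t in range(1, 2001):
    # Item 7 appears once every 5 batches, item 8 in all other batches.
    est.update(t, [7] if t % 5 == 0 else [8])
p_rare = est.prob(7)
```

With a fixed-size hash table, distinct items can collide in the same bucket, which makes the estimated gap too small and the estimated probability too large for the colliding items; a larger `H` reduces this effect.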

Other Notes

The authors note that adding l2-normalization to embeddings improves model trainability and leads to better retrieval quality. Also, dividing each logit by a temperature $τ$ helps to sharpen the predictions. In their experiment, the best $τ$ is usually around 0.05 (i.e. logits get multiplied by 20x).

Zhao 2019 - Recommending What to Watch Next

Recommending What Video to Watch Next: A Multitask Ranking System

This paper covers YouTube's recommendation system for what video to watch next, given the current video. The main contributions of this paper are:

Handling position bias
Multi-gate Mixture of Experts (MMoE) to handle multi-task objectives

Background

This paper focuses on the ranking task of next video recommendation. There are specific challenges to their setting:

Multiple signals. For each video recommendation, there are engagement signals such as clicked, duration watched etc., and also user satisfaction signals such as liked, shared etc. Optimizing for these tasks is potentially conflicting, but multi-task learning also has the potential to learn shared representations across tasks.
Position bias. Position bias can degrade recommender performance since the model may learn signals that reflect where an item was shown and not how relevant it is. Tackling this issue can lead to online performance gains.
Large Scale. The industry scale of recommending from billions of videos means that training quality might need to be traded off for efficiency. For this reason, the paper opts for a neural network based point-wise ranking model (as opposed to pair-wise or list-wise training methods like InfoNCE). This is also why the paper focuses on an architecture that can efficiently share parameters between different input feature modalities and prediction tasks.

The candidate generation is performed using multiple retrieval models, each of which captures a different aspect. For example, one algorithm matches topics of the query video, whilst another is based on co-watch statistics. A sequence model similar to Covington 2016 - Deep Neural Networks for YouTube Recommendations is also used. The efficient gramian methods in Krichene 2018 - Efficient Training on Very Large Corpora via Gramian Estimation were also used.

For ranking, a few hundred candidates are retrieved and passed to the ranker. Point-wise ranking loss is chosen for simplicity and scalability.

Mixture of Experts

It is common for ranking systems to have multiple objectives, due to having multiple implicit feedback signals in a recommender system. There are some approaches to deal with this:

Explicitly modelling the relationship between the signals. E.g. in ESMM, the conversion probability head is modelled as a multiplication of the click probability with a separate head. In this way, the conversion head only needs to learn the "delta" signals that cause a click to turn into a conversion. However, such an approach only works when the relationship between tasks is known, and is also not very scalable when the number of tasks increases.
Shared Bottom Layer. The typical architecture is to have a shared representation layer at the bottom (from the inputs), and then to have a task specific head for each task connected to this shared bottom (see figure below). However, such hard parameter sharing has been shown to sometimes harm the learning when correlation between tasks is low.


Shared Bottom Architecture

The mixture of experts architecture proposed is covered in Ma 2018 - Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture of Experts. The main idea is that we have several expert layers that branch off from the shared bottom, and we also have a gating head (ending in a softmax over the experts) for each task. The representation for each task is thus a unique weighted combination (the weights coming from that task's gating head) of the expert outputs. This allows the model to "distribute work" to different experts to modularize information for each task. Note that the number of experts can differ from the number of tasks.

Specifically, focusing on task $k$ , let us denote:

$x \in R^{d}$ as the representation arising from the shared bottom layer
$f_{i} (x) \in R^{d}$ as the $i^{t h}$ expert where $i = 1, ..., n$ .
- In this paper, $f$ is simply a sequence of ReLU layers.
- Here I denote the expert representation as having the same dimension $d$ for simplicity but that does not need to be the case.
$g_{(i)}^{k} (x) \in R$ is the gating weight for task $k$ from the $i^{t h}$ expert
$g^{k} (x)$ is the vector form comprising $n$ elements for $n$ experts
- It is obtained by $g^{k} (x) = softmax (W_{k} x)$
- Where $W_{k} \in R^{n \times d}$ is the gating weight matrix for task $k$
$h_{k}$ is a final sequence of ReLU layers for task $k$

Starting from the shared bottom representation $x$ , for each task $k$ we perform:

Obtain the gating weight vector $g^{k} (x) = softmax (W_{k} x)$
Obtain the weighted representation for task $k$ as $f^{k} (x) = \sum_{i = 1}^{n} g_{(i)}^{k} (x) \cdot f_{i} (x)$
Obtain the final prediction logit for task $k$ as $y_{k} = h_{k} (f^{k} (x)) \in R$
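The three steps above can be sketched for a single example in NumPy. The experts and task heads are passed in as plain callables, and the random single-layer experts and linear heads used in the example are placeholders, not the paper's architecture.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mmoe_forward(x, experts, gates, heads):
    """x: (d,) shared-bottom representation.
    experts: list of n callables f_i.
    gates:   list of (n, d) matrices W_k, one per task.
    heads:   list of callables h_k, one per task."""
    expert_out = np.stack([f(x) for f in experts])   # (n, d_expert)
    logits = []
    for W_k, h_k in zip(gates, heads):
        g_k = softmax(W_k @ x)                       # (n,) gate weights for task k
        f_k = g_k @ expert_out                       # weighted sum over experts
        logits.append(h_k(f_k))                      # task logit y_k
    return np.array(logits)

rng = np.random.default_rng(0)
d, n_experts, n_tasks = 4, 3, 2
# Placeholder experts: one ReLU layer each (weights bound via default args).
experts = [lambda v, W=rng.normal(size=(d, d)): np.maximum(W @ v, 0.0)
           for _ in range(n_experts)]
gates = [rng.normal(size=(n_experts, d)) for _ in range(n_tasks)]
# Placeholder heads: a linear layer producing one logit.
heads = [lambda f, w=rng.normal(size=d): float(w @ f) for _ in range(n_tasks)]
x = rng.normal(size=d)
y = mmoe_forward(x, experts, gates, heads)
```

Adding a task only requires one more gate matrix and one more head, and adding an expert only adds one row to each gate matrix, which is what makes the design easy to scale in both directions.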

We see that this architecture is quite flexible in supporting an arbitrary number of experts and tasks. In contrast to ESMM, we do not need any knowledge of how the specific tasks relate to each other to specify this network.

The authors mention that in distributed training, there is a danger that the softmax gating network may experience an imbalanced expert distribution problem, where gating networks converge to have zero-utilization on most experts. They experienced this 20% of the time. To mitigate this issue, dropout is applied to the gating networks by setting each expert's softmax weight to 0 with probability 10%. The softmax vector $g^{k}$ is then renormalized to sum to 1 after the dropout. With this approach, they completely eliminated the polarization issue. It is not clear whether this issue happens in single-machine training, but the fix is easy to implement and seems like low-hanging fruit.
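A minimal sketch of this gate dropout is shown below. The handling of the corner case where every expert is dropped (the original weights are returned unchanged) is my own assumption, since the paper does not describe it, and the function name is hypothetical.

```python
import numpy as np

def gate_dropout(g, drop_prob=0.1, rng=None):
    """g: (n,) softmax gate weights for one task. Training-time only."""
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(g.shape) >= drop_prob     # drop each expert independently
    if not keep.any():                          # assumption: skip if all dropped
        return g
    dropped = g * keep
    return dropped / dropped.sum()              # renormalize to sum to 1

g = np.array([0.7, 0.2, 0.1])
out = gate_dropout(g, drop_prob=0.5, rng=np.random.default_rng(0))
```

Since a dominant expert is occasionally zeroed out, the task heads are forced to make use of the remaining experts, which is presumably why the polarization goes away.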

Position Bias

A common problem with implicit feedback data is position bias, i.e. the propensity of users to click items in higher positions due to the position rather than the relevance of the item. There are some ways to deal with this:

Inject the position as an input feature in model training to learn the position bias, and then set this feature to a fixed constant when serving
Learn the position bias explicitly from offline data or random experiments, then use methods like Inverse Propensity Scoring to regularize or normalize the predictions

The problem with the latter approach is that position bias is dynamic in real world systems, making it cumbersome to separately estimate it. Hence this paper adopts the former approach that is efficient and adapts to training data distribution changes as we train the main model.

This paper treats position bias as a bias term that is added to the prediction logit for each task $k$ , before the logit is fed into a final sigmoid to generate a click probability. The position bias term is generated by passing the position feature and other features like device info into a shallow tower. Device info is included because different position bias is observed on different types of devices.

Note: The paper is not clear on how the position and device are represented as features. Naively, I imagine that a small embedding (say of dim 32) is used to encode each position and each device value. Since the position bias is strongest at the top positions, we could cap the number of position embeddings to say 50, and assign all later positions to the final position embedding. Then, we can concatenate the position and device embedding to form an embedding of say dim 64, and feed that into a shallow ReLU layer to generate the position bias logit. Another way is to do element-wise multiplication of the position and device embedding.
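The guess in the note above can be sketched as follows. Everything here is an assumption for illustration (embedding sizes, the cap of 50 positions, the single hidden ReLU layer, random untrained weights), and the serving branch simply drops the bias term, which is one way of fixing the position input to a constant.

```python
import numpy as np

class ShallowPositionTower:
    def __init__(self, num_positions=50, num_devices=4, dim=32, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.num_positions = num_positions
        self.pos_emb = rng.normal(scale=0.1, size=(num_positions, dim))
        self.dev_emb = rng.normal(scale=0.1, size=(num_devices, dim))
        self.W1 = rng.normal(scale=0.1, size=(hidden, 2 * dim))
        self.w2 = rng.normal(scale=0.1, size=hidden)

    def bias_logit(self, position, device):
        # All positions beyond the cap share the final position embedding.
        pos = min(position, self.num_positions - 1)
        feat = np.concatenate([self.pos_emb[pos], self.dev_emb[device]])
        return float(self.w2 @ np.maximum(self.W1 @ feat, 0.0))

def click_probability(main_logit, tower, position, device, serving=False):
    # Training: add the position bias logit before the sigmoid.
    # Serving: drop the bias so that ranking depends on the main tower only.
    bias = 0.0 if serving else tower.bias_logit(position, device)
    return 1.0 / (1.0 + np.exp(-(main_logit + bias)))

tower = ShallowPositionTower()
p_train = click_probability(0.3, tower, position=2, device=1)
p_serve = click_probability(0.3, tower, position=2, device=1, serving=True)
```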

Training

The paper does not provide much detail on how exactly training is performed, but they mentioned that point-wise loss was chosen for simplicity. I would imagine the process to be something like:

Start with positive training examples where some subset of engagement or satisfaction metric were observed (e.g. user watched)
For each positive training example, sample one or more random negative(s) from items that were shown to the user in that context but that the user did not click
Train the model to predict the multi-task outcome for each of these examples in a point-wise manner

Experiments

Online experiments were performed on YouTube. AUC was chosen for classification tasks and squared error for regression tasks.

The paper found that using the shallow tower for position bias increases AUC by 0.24%
The Mixture of Experts approach is better than the shared bottom approach for the same number of model parameters.

Takeaways

This paper is scant on details on the actual ranker and features used, but makes a good case for the two proposed components of (i) mixture of experts and (ii) shallow tower for position bias.

Lee 2020 - Large Scale Video Representation Learning

Large Scale Video Representation Learning via Relational Graph Clustering

This paper proposes an efficient method to learn video representations from relational graph data (i.e. user item interactions). Specifically, it learns a small-ish embedding model that transforms raw audio-visual features for a given video into a vector representation that is useful for downstream recommendation tasks. This method seems to be in use at YouTube at least until 2023 as it was mentioned in Singh 2023 - Better Generalization with Semantic IDs.

This paper belongs to the genre of metric learning, i.e. learning a similarity metric between videos. Similar to FaceNet, it uses triplet contrastive loss to push related videos together and random videos apart. The contribution of the paper is a hierarchical clustering approach to sample smart negatives, which they show to be much more informative for learning than random negatives.

Setup

We start with a raw input representation of a given video $x \in R^{d}$ , where $x$ could be a concatenation of various raw input features or a representation from some off-the-shelf pretrained embedding model. We are also given a relational graph $G = (V, E)$ where each node is a video and each edge $(x_{1}, x_{2}) \in E$ represents some relationship between two videos. The edge weight can be binary or real numbers. We may obtain such edge relationships from implicit feedback, e.g. how frequently two videos are co-watched, co-clicked, co-searched, etc. The aim is to learn a representation $z \in R^{k}$ where $k$ is much smaller than $d$ , such that $z_{1}^{T} z_{2} \sim 1$ if they are related and $0$ otherwise.

Method

The relational graph is pre-processed with a hierarchical clustering algorithm. A hierarchical algorithm is chosen so that we can sample negatives with varying levels of difficulty later on. The paper uses Affinity Clustering, although it notes that any suitable clustering algorithm works as well. At a high level, affinity clustering at each step chooses the lowest outgoing edge-weight to add from each cluster. Some desirable properties of affinity clustering are (i) tends to produce clusters of similar size, which helps negative sampling to be consistent across clusters and (ii) easily parallelizable.

Once we have the relational graph $G$ and clustering $C$ , the training proceeds as follows:

Construct triplets:
- Sample a random anchor video from all videos
- Choose a positive video from its neighbours in $G$
- Sample a negative video at a desired level from sibling clusters in $C$
Compute the distances for each triplet $i$ :
- Anchor-positive distance: $d_{ap} := \lVert f (x_{i}^{a}) - f (x_{i}^{p}) \rVert^{2}$
- Anchor-negative distance: $d_{an} := \lVert f (x_{i}^{a}) - f (x_{i}^{n}) \rVert^{2}$
Perform online semi-hard negative mining:
- In each mini-batch, resample negatives to choose the hardest semi-hard negative in the mini-batch for each row
- Recall that a semi-hard negative is where $0 < d_{an} - d_{a p} \leq α$
- So we are choosing negatives such that $d_{an}$ is as close as possible to $d_{a p}$ without being lower
- Note that this means we need to compute $d_{an}$ for all anchor-negative pairs in the mini-batch
Optimize for the following objective: $\min \sum_{i = 1}^{N} [d_{ap}^{i} - d_{an}^{i} + α]_{+}$

This is essentially the same procedure as FaceNet, except for the smart and dynamic choice of negatives. Also note that since we only train with semi-hard negatives, the inner term satisfies $0 \leq d_{ap} - d_{an} + α < α$, so it is never negative and the $[\cdot]_{+}$ operator on the loss can actually be ignored.
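To make the mining step concrete, here is a minimal sketch of picking the hardest semi-hard negative for one anchor and computing the triplet loss. It works on plain Python lists rather than the embedding network's outputs, and the function names (`sq_dist`, `pick_semi_hard`, `triplet_loss`) are my own, not from the paper.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pick_semi_hard(anchor, positive, candidates, margin):
    """Return the hardest semi-hard negative, or None if no candidate qualifies.

    A candidate n is semi-hard when 0 < d_an - d_ap <= margin. The hardest
    one is the qualifying candidate with the smallest d_an.
    """
    d_ap = sq_dist(anchor, positive)
    semi_hard = [n for n in candidates if 0 < sq_dist(anchor, n) - d_ap <= margin]
    return min(semi_hard, key=lambda n: sq_dist(anchor, n), default=None)


def triplet_loss(anchor, positive, negative, margin):
    """Hinge triplet loss [d_ap - d_an + margin]_+ for a single triplet."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


anchor, positive = [0.0, 0.0], [1.0, 0.0]  # d_ap = 1.0
# Candidate negatives: too hard, semi-hard, semi-hard, too easy.
candidates = [[0.5, 0.0], [1.1, 0.0], [1.2, 0.0], [3.0, 0.0]]
negative = pick_semi_hard(anchor, positive, candidates, margin=0.5)
loss = triplet_loss(anchor, positive, negative, margin=0.5)
```

In this toy example the chosen negative sits just beyond the positive, so the loss lies strictly between 0 and the margin, as argued above.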

Denote the anchor video as A. The smart negative sampling from sibling clusters is defined as:

At L0, negatives are chosen from all the descendants of $P (A)$ from the affinity tree
At L1, negatives are chosen from all the descendants of $Gp (A) := P (P (A))$
And so on
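The level-based sampling can be sketched as follows, assuming the affinity tree is available as a simple `children` dictionary (a hypothetical representation; the paper does not specify the data structure). I also assume the anchor itself is excluded from the pool.

```python
import random


def ancestor(parent, node, hops):
    """Walk `hops` steps up the tree from `node`."""
    for _ in range(hops):
        node = parent[node]
    return node


def leaves_under(children, node):
    """Collect all leaf descendants (videos) under `node`."""
    if node not in children:
        return [node]
    out = []
    for child in children[node]:
        out.extend(leaves_under(children, child))
    return out


def sample_negative(parent, children, anchor, level, rng):
    """Sample a negative for `anchor` at difficulty `level` (0 is the hardest)."""
    root = ancestor(parent, anchor, level + 1)  # P(A) at L0, Gp(A) at L1, ...
    pool = [v for v in leaves_under(children, root) if v != anchor]
    return rng.choice(pool)


# Toy tree: root -> {c1, c2}, c1 -> {a, b}, c2 -> {c, d}
children = {"root": ["c1", "c2"], "c1": ["a", "b"], "c2": ["c", "d"]}
parent = {child: node for node, kids in children.items() for child in kids}
rng = random.Random(0)
l0_negative = sample_negative(parent, children, "a", level=0, rng=rng)
l1_negative = sample_negative(parent, children, "a", level=1, rng=rng)
```

At L0 the only candidate for anchor `a` is its sibling `b`, while at L1 the pool widens to every other leaf under the grandparent.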

Experiments

The embedding network is simply a fully connected MLP with layers of dimension 4,000 and 256 respectively. Each video is first preprocessed into raw audio-visual features (using techniques like the Fourier transform, pre-trained audio-visual ResNets etc.), which are then fed into the MLP to get the resulting embedding.

The relational graph is constructed by adding an edge between a pair of videos if they are frequently co-watched by multiple users. Specifics are not provided but I presume the rule is something like co-watched by at least n user pairs. In other words, the edge rule sounds fairly stringent to minimize the number of false edges. The videos are then split into training and test with a 7:3 split. Each mini-batch comprises 9,600 triplets and the margin is set to 0.5. The learning rate starts at 0.1 and decays by 0.98 for every 300k steps.

The evaluation simulates a cold start scenario, where both the query video and candidate videos are all taken from the unseen test set. For each query video, retrieval is performed using the model (using cosine similarity) on all videos from the test set. The NDCG and MAP are then computed based on whether we successfully retrieved relevant videos for the query video based on the true relational graph $G$ .

Findings:

Smart negatives are significantly better than random negatives. To further prove this point, the authors tracked the % of negatives that remain the same after the hard negative re-sampling step, and showed that a much larger % of smart negatives were retained (10% to 15%) than random negatives (close to 1 / batch_size). Interestingly, the usage % of smart negatives actually increases steadily as training goes on. This could be because at the start, many smart negatives are too difficult for the model and do not qualify as semi-hard. As the model learns, it is able to handle more smart negatives and the usage % increases.
Online semi-hard negative mining is important. The authors found that using smart negatives without the online mining step does not perform well. This could be because the smart negatives are too difficult for the model at the beginning and the model fails to learn. The authors suggest that an alternative to the online mining step is curriculum learning, where easier negatives are provided at the start and difficulty is increased gradually.
More difficult negatives work better. The authors tried training at L0, L1 and L2 difficulty levels respectively, and found that training at L0 consistently does the best. This kind of defeats the point of the hierarchical clustering, but I suppose it depends on the use case and in other settings other difficulty levels may work better.

Takeaways

This paper proposes a scalable and easy-to-understand method of item representation learning. It can be extended to other modalities such as text, so long as we find a reasonable way to represent the raw inputs. For text, we can probably directly fine-tune some BERT encoder using the same procedure.

Nevertheless, note that this paper alone would not provide optimal performance if we use the embeddings here directly for retrieval. One big reason is that we are considering an undirected relational graph, so we learn only symmetric relationships, whereas in retrieval item sequence often matters. For example, people often progress from beginner videos to more advanced videos for a particular subject, and this paper would not be able to capture such relationships. Hence the embeddings in this paper have to be fed into another retrieval or ranking model for optimal performance.

He 2020 - LightGCN

He 2020 - LightGCN is a simple and effective Graph Convolution Network for recommendation.

LightGCN is an adaptation of Graph Convolutional Neural Networks (GCN) to the task of recommendations. In a typical Convolutional Neural Network for vision, convolution aggregations (such as linear projections, pooling, averaging) are applied to a neighbourhood of pixels that are near to one another. The aggregations transform the raw pixel values into a hidden layer of "embedding" values, and the next layer of aggregations is applied to the hidden layer, allowing the CNN to learn more abstract features with each increasing layer. A GCN uses essentially the same idea, except that the neighbourhood of a node A is defined as the set of nodes that are connected to A by an edge. The GCN thus allows us to train node embeddings on all types of graph data, such as social networks, user-item interactions etc.

Setting

This paper tackles the task of collaborative filtering without features, i.e. making recommendations purely from the user and item id. Also, no negative samples are required - all we need is edges between users and items based on some form of implicit interaction.

Neural Graph Collaborative Filtering (NGCF)

The LightGCN model is essentially a simplification of the NGCF model, so the paper starts here. Btw, there are some overlaps between LightGCN authors and NGCF authors. The setup is as follows:

Each user and item are embedded from their id -> embedding
Let $e_{u}^{(0)}$ denote the ID embedding of user $u$ and $e_{i}^{(0)}$ denote the ID embedding of item $i$

NGCF uses the user-item interaction graph (derived from data) to propagate the embeddings as follows:

$e_{u}^{(k + 1)} = σ \left( W_{1}^{(k)} e_{u}^{(k)} + \sum_{i \in N_{u}} \frac{1}{\sqrt{\lvert N_{u} \rvert \lvert N_{i} \rvert}} \left( W_{1}^{(k)} e_{i}^{(k)} + W_{2}^{(k)} (e_{i}^{(k)} ⊙ e_{u}^{(k)}) \right) \right)$

$e_{i}^{(k + 1)} = σ \left( W_{1}^{(k)} e_{i}^{(k)} + \sum_{u \in N_{i}} \frac{1}{\sqrt{\lvert N_{u} \rvert \lvert N_{i} \rvert}} \left( W_{1}^{(k)} e_{u}^{(k)} + W_{2}^{(k)} (e_{i}^{(k)} ⊙ e_{u}^{(k)}) \right) \right)$

Some notes about the propagation equations above:

$e_{u}^{(k)}$ and $e_{i}^{(k)}$ denote the embedding of user $u$ and item $i$ respectively after $k$ layers of propagation
$σ$ is a non-linear activation function
$N_{u}$ denotes the set of items that interacted with user $u$ . For instance, it could be all the items the user purchased within the past 3 months. $N_{i}$ is the set of users defined in a similar way.
$W_{1}^{(k)}$ and $W_{2}^{(k)}$ are trainable weights

Intuitively for a given user, the equation propagates (i) the user embedding itself (order-1), (ii) the embeddings of neighbouring items (order-1) and (iii) the Hadamard interaction between the user and the neighbouring items (order-2). And likewise for the item embeddings. Note that no neighbour sampling is performed - the entire neighbour set is taken per node.

Finally, after training the network of $L$ layers, we obtain $L + 1$ embeddings for each user and item. The embeddings are concatenated as such: $e_{u} = [e_{u}^{(0)}, ..., e_{u}^{(L)}]$ and $e_{i} = [e_{i}^{(0)}, ..., e_{i}^{(L)}]$. If each layer's embedding has dimension $d$, then $e_{u}, e_{i}$ are vectors in $R^{d (L + 1)}$. Prediction scores for the match between user $u$ and item $i$ are then computed via the inner product $⟨ e_{u}, e_{i} ⟩$.

Problem With NGCF

The authors argue that NGCF is unnecessarily complicated because traditionally, the base embedding layer $e_{u}^{(0)}, e_{i}^{(0)}$ is derived from rich semantic features such as embedding the title of papers etc. This justifies the usage of the activation function $σ$ and the projection weights $W_{1}^{(k)}, W_{2}^{(k)}$ etc. to learn a transformation of the semantic features. In contrast, for the collaborative filtering setting, the embeddings are arbitrary numbers tied to each user or item ID. Hence, performing multiple non-linear transformations will not lead to better feature learning.

I'm not fully convinced by this argument, although the empirical results do support it. I agree with the argument to the extent that the base embedding layer is arbitrary, but imo NGCF can still represent a larger space of functions through its non-linear transformations. The problem seems to be more that (i) the richer feature representation is not very useful and (ii) the additional complexity makes the model harder to learn.

LightGCN Forward Propagation

In LightGCN, we essentially remove the non-linear activation and weight projections. The propagation equations simplify to the following:

$e_{u}^{(k + 1)} = \sum_{i \in N_{u}} \frac{e_{i}^{(k)}}{\sqrt{\lvert N_{u} \rvert \lvert N_{i} \rvert}} = \frac{1}{\sqrt{\lvert N_{u} \rvert}} \sum_{i \in N_{u}} \frac{e_{i}^{(k)}}{\sqrt{\lvert N_{i} \rvert}}$

$e_{i}^{(k + 1)} = \sum_{u \in N_{i}} \frac{e_{u}^{(k)}}{\sqrt{\lvert N_{u} \rvert \lvert N_{i} \rvert}} = \frac{1}{\sqrt{\lvert N_{i} \rvert}} \sum_{u \in N_{i}} \frac{e_{u}^{(k)}}{\sqrt{\lvert N_{u} \rvert}}$

The final representation of each node $v$ (whether user or item) is then a weighted sum of its hidden representation across all layers:

$e_{v} = \sum_{k = 0}^{K} α_{k} \cdot e_{v}^{(k)}$

Although $α_{k}$ could be a parameter to be optimized, the authors propose just setting $α_{k} = 1/ (K + 1)$ for simplicity.

Notably, the forward propagation does not include the self-connection from the previous layer, i.e. the update for $e_{u}^{(k + 1)}$ does not explicitly include $e_{u}^{(k)}$, which other papers like GraphSAGE argue is important. The authors argue that because they use a weighted sum of hidden representations across all layers, this is essentially equivalent to including self-connections, so explicit self-connections are no longer necessary.
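The propagation and layer combination can be sketched in plain Python as below. This is a toy illustration and not the authors' code: embeddings are dictionaries of lists, the normalization is assumed to be the symmetric $1 / \sqrt{\lvert N_{u} \rvert \lvert N_{i} \rvert}$ form from the LightGCN paper, and $α_{k} = 1 / (K + 1)$.

```python
import math


def propagate(user_emb, item_emb, edges):
    """One LightGCN layer: each node sums its neighbours' embeddings,
    scaled by 1 / sqrt(|N_u| * |N_i|). No weights, no activation."""
    dim = len(next(iter(user_emb.values())))
    deg_u = {u: 0 for u in user_emb}
    deg_i = {i: 0 for i in item_emb}
    for u, i in edges:
        deg_u[u] += 1
        deg_i[i] += 1
    new_u = {u: [0.0] * dim for u in user_emb}
    new_i = {i: [0.0] * dim for i in item_emb}
    for u, i in edges:
        norm = 1.0 / math.sqrt(deg_u[u] * deg_i[i])
        for d in range(dim):
            new_u[u][d] += norm * item_emb[i][d]
            new_i[i][d] += norm * user_emb[u][d]
    return new_u, new_i


def lightgcn_embeddings(user_emb, item_emb, edges, num_layers):
    """Average the layer 0..K embeddings, i.e. alpha_k = 1 / (K + 1)."""
    layers_u, layers_i = [user_emb], [item_emb]
    for _ in range(num_layers):
        u, i = propagate(layers_u[-1], layers_i[-1], edges)
        layers_u.append(u)
        layers_i.append(i)

    def mean(layers):
        dim = len(next(iter(layers[0].values())))
        return {k: [sum(layer[k][d] for layer in layers) / len(layers) for d in range(dim)]
                for k in layers[0]}

    return mean(layers_u), mean(layers_i)


user_emb = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_emb = {"i1": [1.0, 1.0], "i2": [2.0, 0.0]}
edges = [("u1", "i1"), ("u2", "i1"), ("u2", "i2")]
final_u, final_i = lightgcn_embeddings(user_emb, item_emb, edges, num_layers=1)
```

Note that the layer-0 embedding of each node still contributes to its final embedding through the average, which is the sense in which layer combination stands in for self-connections.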

Loss

The only trainable parameters of the model are the embeddings at the 0th layer, i.e. $E^{(0)}$ . The authors propose using Bayesian Personalized Ranking loss, which is a pairwise loss that encourages the score of a neighbour to be higher than the score of an unobserved, randomly sampled counterpart.

$L_{BPR} = - \sum_{u = 1}^{M} \sum_{i \in N_{u}} \sum_{j \notin N_{u}} \ln σ (\hat{y}_{u i} - \hat{y}_{u j}) + λ \lVert E^{(0)} \rVert^{2}$

In contrast to NGCF and other GCN approaches, the authors do not use dropout as a regularizer. Instead, they think the L2 regularization on the embedding layer is sufficient, as these are the only parameters in the model. Training of the model is done in a mini-batch manner, where batches of (user, item) tuples are drawn, negative items sampled, and the loss evaluated.
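A minimal sketch of the BPR objective on a mini-batch of (user, positive item, negative item) triples is shown below. The scores are plain inner products of toy embeddings, and the names `bpr_loss` and `l2_weight` are my own.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def bpr_loss(user_emb, item_emb, triples, l2_weight):
    """Sum of -ln sigmoid(y_ui - y_uj) over triples, plus L2 on all embeddings."""
    ranking = 0.0
    for u, pos, neg in triples:
        diff = dot(user_emb[u], item_emb[pos]) - dot(user_emb[u], item_emb[neg])
        ranking += math.log(1.0 + math.exp(-diff))  # equals -ln sigmoid(diff)
    l2 = sum(x * x for table in (user_emb, item_emb) for vec in table.values() for x in vec)
    return ranking + l2_weight * l2


user_emb = {"u": [1.0, 0.0]}
item_emb = {"p": [2.0, 0.0], "n": [0.0, 1.0]}
good = bpr_loss(user_emb, item_emb, [("u", "p", "n")], l2_weight=0.0)  # observed item ranked higher
bad = bpr_loss(user_emb, item_emb, [("u", "n", "p")], l2_weight=0.0)   # observed item ranked lower
```

The loss is small when the observed item outscores the sampled one and grows roughly linearly with the score gap when the order is wrong.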

Ablation Studies

The paper has a few ablation findings:

Symmetric normalization is important, i.e. it is important in the forward propagation to divide by $\sqrt{\lvert N_{u} \rvert \lvert N_{i} \rvert}$. Omitting either factor leads to performance degradation. Note that in GraphSAGE, the GraphSAGE-mean variant essentially does $\sum_{i \in N_{u}} e_{i}^{(k)} / \lvert N_{u} \rvert$, i.e. it only normalizes by the user degree. I suppose normalizing by the item degree as well penalizes popular items, so it could be useful.
Layer combination is important for robustness as we increase the number of layers, i.e. instead of just taking $e^{(K)}$ as the final embeddings, it is useful to take the element-wise mean of the embeddings at each layer. This might be analogous to the impact of including self connections.

Cornac Implementation

Cornac has a torch implementation of LightGCN:

The code relies on the dmlc/dgl package for constructing the bipartite user-item graph which will be used to compute neighbourhoods. The construct_graph function works as follows:

user_indices and item_indices are lists of the same length where each element at index i contains a pair of user, item that interacted
A dgl.heterograph is constructed with both directions:
- ("user", "user_item", "item") represents user -> item direction
- ("item", "item_user", "user") represents item -> user direction
- Hence there are two node types and two edge types in the graph
Starting with the user->item direction:
- src and dst are torch tensors containing the users and items respectively that interacted, both of length M
- dst_degree is a torch float tensor of length M containing the number of users interacting with each item in dst
- src_degree is a torch float tensor of length M containing the number of items interacting with each user in src
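The per-edge degree tensors can be illustrated without dgl or torch. The snippet below is not Cornac's code, just a plain-Python sketch of the same quantities on a hypothetical edge list. The final normalization weight (the product of degrees raised to the power -0.5) is my assumption of how the degrees get used.

```python
from collections import Counter

# Hypothetical interactions: edge m connects user_indices[m] to item_indices[m].
user_indices = [0, 0, 1, 2]
item_indices = [0, 1, 1, 1]

user_degree = Counter(user_indices)  # number of items each user interacted with
item_degree = Counter(item_indices)  # number of users each item interacted with

# Per-edge degrees, aligned with the edge list (length M = 4 here).
src_degree = [float(user_degree[u]) for u in user_indices]
dst_degree = [float(item_degree[i]) for i in item_indices]

# Assumed symmetric normalization weight for each edge.
edge_norm = [(s * d) ** -0.5 for s, d in zip(src_degree, dst_degree)]
```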

At model initialization, self.feature_dict is initialized with xavier initialization as follows. Note that because we have a heterograph, the nodes are defined as a dictionary of the form node_type: feature_tensor.

    self.feature_dict = {
        "user": user_embed, # (n_users, embed_dim)
        "item": item_embed, # (n_items, embed_dim)
    }

The GCNLayer class represents one layer of the message passing network.

Lewis 2020 - Retrieval Augmented Generation

Lewis 2020 - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

This paper proposes a way to fine-tune a parametric seq2seq transformer (e.g. BART) together with a non-parametric memory accessed through dense retrieval. The main idea is to extend the parametric memory (i.e. the "knowledge" that is stored within the LLM floating point parameters) of the seq2seq model by coupling it with retrieval of documents from a vector database (dubbed non-parametric memory, using Wikipedia articles in the original paper). We can update the non-parametric database's knowledge as the world changes.

The paper argues that this setup is ideal for knowledge-intensive tasks such as open question answering, fact verification etc., where it is impossible to store all knowledge in parametric memory.

Setup

Given an input sequence of tokens $x$ , we wish to retrieve contextual documents $z$ and use them as additional context when generating target sequence $y$ . We have two models:

Retriever $p_{η} (z \mid x)$ which returns top-k truncated probability distributions over text passages. "Truncated" presumably means that the probabilities are renormalized such that the top-k probabilities sum up to 1 for each retrieval.
Generator $p_{θ} (y_{i} ∣ x, z, y_{1 : i - 1})$ parametrized by $θ$ which generates tokens in an auto-regressive manner.

The goal is to train both models end-to-end on a fine-tuning corpus of input sequence / output sequence pairs. The loss objective is to minimize the negative log-likelihood $- \log p (y_{j} \mid x_{j})$.

Training

The paper proposes two different models to achieve the end-to-end prediction.

RAG-Sequence. In this setting, we retrieve k documents and keep using them to generate the entire target sequence.

$p_{\text{RAG-Sequence}} (y \mid x) \approx \sum_{z \in \text{top-}k (p_{η} (\cdot \mid x))} p_{η} (z \mid x) \, p_{θ} (y \mid x, z) = \sum_{z \in \text{top-}k (p_{η} (\cdot \mid x))} p_{η} (z \mid x) \prod_{i}^{N} p_{θ} (y_{i} \mid x, z, y_{1 : i - 1})$

Note that we are marginalizing (or taking a weighted combination) over the truncated distribution of $z$, implying that we trust each document according to its retrieval probability when computing the final probability of the target sequence.
The expression $p_{θ} (y_{i} ∣ x, z, y_{1 : i - 1})$ just means that we generate the target sequence with an input sequence that is a concatenation of $x$ , $z$ and $y_{1 : i - 1}$ .
The retrieval is done using a BERT encoder using Maximum Inner Product Search (MIPS). To avoid re-indexing, the document vectors are held constant whilst the query encoder is trained in the above end-to-end fashion.
There is no explicit supervision on what documents are to be retrieved. Intuitively, if a document is useful for generating the correct tokens, the loss objective would encourage $p_{η} (z ∣ x)$ to be larger, thus encouraging the retriever to retrieve more relevant documents.
Another interesting way to think about this setup: suppose the generator just returns the retrieved document (token for token) and k=1, and the input / output pairs are anchor-positive pairs in a standard retrieval setting. Then we can see that this matches the standard retrieval training objectives but without negative sampling. Thus it seems that the token prediction task is sufficient to generate negative signals for non-useful documents such that explicit negatives are not needed.
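The RAG-Sequence marginalization can be worked through on toy numbers. The sketch below assumes we already have the truncated retrieval probabilities and the per-token generator probabilities for each retrieved document. All values and names are made up for illustration.

```python
import math


def rag_sequence_prob(doc_probs, token_probs):
    """p(y|x) = sum_z p(z|x) * prod_i p(y_i | x, z, y_<i).

    doc_probs[z] is the truncated retrieval probability of document z.
    token_probs[z][i] is the generator probability of target token i given z.
    """
    return sum(p_z * math.prod(tokens) for p_z, tokens in zip(doc_probs, token_probs))


doc_probs = [0.75, 0.25]                   # top-2 retrieval probabilities, summing to 1
token_probs = [[0.5, 0.5], [0.25, 0.25]]   # 2 documents x 2 target tokens
p_seq = rag_sequence_prob(doc_probs, token_probs)
nll = -math.log(p_seq)                     # the loss for this (x, y) pair
```

Here the first document explains the target better (0.25 vs 0.0625 sequence probability), so increasing its retrieval probability would lower the loss, which is the signal that trains the retriever.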

RAG-token. In contrast, RAG-token retrieves k documents at each time step, allowing us to sample new documents according to the needs at each decoding step.

$p_{\text{RAG-Token}} (y \mid x) \approx \prod_{i}^{N} \sum_{z_{i} \in \text{top-}k (p_{η} (\cdot \mid x, y_{1 : i - 1}))} p_{η} (z_{i} \mid x) \, p_{θ} (y_{i} \mid x, z_{i}, y_{1 : i - 1})$

Note that the retrieved context is now $z_{i}$ which varies at each time step. The change in retrieval at each step seems to add complexity during training.
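For comparison with RAG-Sequence, here is a sketch of the per-token marginalization. As a simplifying assumption, it keeps the same k documents and retrieval probabilities at every step, so only the order of the sum and product changes. The numbers are made up.

```python
def rag_token_prob(doc_probs, token_probs):
    """p(y|x) = prod_i sum_z p(z|x) * p(y_i | x, z, y_<i)."""
    num_tokens = len(token_probs[0])
    prob = 1.0
    for i in range(num_tokens):
        # Marginalize over documents separately for each target token.
        prob *= sum(p_z * tokens[i] for p_z, tokens in zip(doc_probs, token_probs))
    return prob


doc_probs = [0.75, 0.25]                   # retrieval probabilities, reused at every step
token_probs = [[0.5, 0.5], [0.25, 0.25]]   # 2 documents x 2 target tokens
p_tok = rag_token_prob(doc_probs, token_probs)
```

On these inputs each token gets a marginal probability of 0.4375, so the sequence probability is their product, which differs from the sequence-level marginal computed over whole documents.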

Ablation

Increasing number of retrieved documents. 5 or 10 documents are used for retrieval. Ablation shows that performance increases monotonically (albeit diminishingly) for RAG-sequence with increasing number of retrieved documents.
Learned Retrieval is useful. The authors try freezing the retriever and compare it against the setting of allowing the retriever to learn. They find that generally learned retrieval improves results significantly.
RAG generates more diverse outputs. They measure the ratio of distinct tri-grams / total tri-grams and find that RAG-decoding compared to normal decoding is more diverse.

Wang 2020 - DCNv2

DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems

This paper proposes a neural network architecture called Deep and Cross Networks for effectively learning feature crosses compared to a standard feedforward MLP.

Background

Effective feature crossing is essential in many ML tasks, especially for search and recommendation. For example, a country feature crossed with language is more informative than either alone. Manually searching for good feature crosses is a combinatorial exercise which is intensive.

Deep neural networks in the form of MLPs are generally viewed as universal function approximators in the limiting case. But in the finite depth, they are often incapable of fully modelling feature crosses effectively (say using a simulated dataset).

Traditionally, Factorization Machines seek to overcome the feature combination issue by embedding each sparse feature with large dimensionality into a small dense vector. Crosses between features are then computed as dot products between their embeddings. However, we are limited in expressiveness to order-2 crosses, and the number of feature crosses can still be large with a large number of features (after densification). The other limitation of FMs is that we require the embedding dimension of each feature to be the same, which is limiting if our features have different cardinality needs.

The traditional approach is to mix implicit crossing (i.e. fitting a DNN to the features) with explicit crossing (say with the FM approach, or by multiplying raw features together etc.). The implicit network and explicit network usually run in parallel and their outputs are added together to form the final prediction.

DCNv1

The same authors proposed DCNv1 in 2017, and it is useful to see how it evolved. The method is as follows:

Let $x^{[0]}$ represent a concatenated feature vector at layer 0. That is, we lay out each dense or sparse feature sideways and concatenate them into a long feature vector. Let $x^{[0]} \in R^{d}$ .
We have a cross and deep layer in parallel
- The deep side is simply a standard MLP, i.e. $h_{1} = Relu (w^{[0]} x^{[0]} + b^{[0]})$
- The cross side is where the magic happens: $x^{[l + 1]} = x^{[0]} {x^{[l]}}^{T} w^{[l]} + b^{[l]} + x^{[l]}$
- Note that $w^{[l]}, b^{[l]} \in R^{d}$ are learnable parameters
- $x^{[0]} {x^{[l]}}^{T}$ is a $d \times d$ matrix of rank 1. At layer 1, it comprises all the pairwise crosses, e.g. $x_{1}^{[0]} \cdot x_{1}^{[l]}, x_{1}^{[0]} \cdot x_{2}^{[l]}, ...$
- Hence the transformation at each cross layer is rank 1.
As we increase the number of cross layers, we get feature crosses of increasing polynomial degree. After $l$ cross layers, we end up with a $d$-dimensional feature vector $x^{[l]} \in R^{d}$ which comprises complex weighted polynomial feature crosses of polynomial degrees $1, ..., l + 1$.
At the final layer, we have $x_{\text{wide}}^{[L]}, x_{\text{deep}}^{[L]} \in R^{d}$. The two vectors are concatenated to form a vector in $R^{2 d}$, which is then fed to a classifier head for final predictions.

DCNv2

The criticism of DCNv1 is that the transformation at each cross layer is rank 1 and hence not expressive enough. DCNv2 tries to make the cross layer more expressive while still making it parameter-efficient.

The cross layer formulation of DCNv2 is: $x^{[l + 1]} = x^{[0]} ⊙ (W^{[l]} x^{[l]} + b^{[l]}) + x^{[l]}$

Where $W^{[l]} \in R^{d \times d}, b^{[l]} \in R^{d}$ .

To see how it compares to DCNv1, we can let $W^{[l]} = u^{[l]} {v^{[l]}}^{T}$ be a rank 1 matrix and $b^{[l]} = 0$. Furthermore, we set $u^{[l]} = \mathbf{1}$, the all-ones vector. Then we have: $x^{[l + 1]} = x^{[0]} ⊙ (u^{[l]} {v^{[l]}}^{T} x^{[l]}) + x^{[l]} = ({v^{[l]}}^{T} x^{[l]}) (u^{[l]} ⊙ x^{[0]}) + x^{[l]} = ({v^{[l]}}^{T} x^{[l]}) x^{[0]} + x^{[l]}$

Note that in the second step, we use the fact that ${v^{[l]}}^{T} x^{[l]}$ is a scalar and move it out to the left. In the third step, since $u^{[l]} = \mathbf{1}$, we can remove it.

Similarly for DCNv1, we can pull out the scalar: $x^{[l + 1]} = x^{[0]} {x^{[l]}}^{T} w^{[l]} + b^{[l]} + x^{[l]} = ({x^{[l]}}^{T} w^{[l]}) x^{[0]} + b^{[l]} + x^{[l]}$

We thus see that DCNv2 ends up in exactly the same form as DCNv1 (with just a missing $b^{[l]}$ term).

This reformulation helps us see that DCNv1 is DCNv2 when $W^{[l]}$ is rank 1. Hence when we allow $W^{[l]}$ to be higher rank, we get more expressiveness than DCNv1.
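The equivalence can be checked numerically. The sketch below builds a rank 1 $W = \mathbf{1} v^{T}$ with zero bias and confirms that the DCNv2 layer gives the same output as the DCNv1 layer with $w = v$. Both functions are toy list-based versions written for this note.

```python
def cross_layer_v1(x0, xl, w, b):
    """DCNv1: x_{l+1} = x0 * (xl^T w) + b + xl."""
    scalar = sum(a * c for a, c in zip(xl, w))
    return [x0_j * scalar + b_j + xl_j for x0_j, b_j, xl_j in zip(x0, b, xl)]


def cross_layer_v2(x0, xl, W, b):
    """DCNv2: x_{l+1} = x0 * (W xl + b) + xl, with * as the Hadamard product."""
    Wx = [sum(w_jk * x_k for w_jk, x_k in zip(row, xl)) for row in W]
    return [x0_j * (wx_j + b_j) + xl_j for x0_j, wx_j, b_j, xl_j in zip(x0, Wx, b, xl)]


x0, xl = [1.0, 2.0], [3.0, 4.0]
v = [0.5, -1.0]
W_rank1 = [v[:] for _ in x0]  # 1 v^T: every row of W equals v
zero = [0.0, 0.0]

out_v2 = cross_layer_v2(x0, xl, W_rank1, zero)
out_v1 = cross_layer_v1(x0, xl, v, zero)
```

Replacing `W_rank1` with a full rank matrix gives each output coordinate its own mixing weights, which is the extra expressiveness DCNv2 buys.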

Stack vs Parallel

In addition to the parallel structure proposed in DCNv1, where a deep MLP runs parallel to the cross network and the final vector is concatenated together, DCNv2 proposes an alternative stacked architecture. In this formulation, we run through the cross layers first to get $x^{[l]} \in R^{d}$ . Then, this is fed into a deep MLP. The paper says that which architecture performs better depends on the task.

Loss

Finally, the loss is computed as standard binary cross entropy wrt the binary labels: $L = - \frac{1}{N} \sum_{i = 1}^{N} \left[ y_{i} \log (\hat{y}_{i}) + (1 - y_{i}) \log (1 - \hat{y}_{i}) \right] + λ \sum_{l} \lVert W_{l} \rVert_{2}^{2}$

Modifications

Using ranking models in production settings usually has strict latency requirements, hence it is important to reduce cost while maintaining accuracy. The paper thus proposes 3 modifications to make the model more efficient.

Modification 1: Low Rank Approximation

In practice, the weight matrix $W^{[l]}$ is usually effectively low rank, so it is well motivated to approximate it with the product of two smaller matrices $U^{[l]} {V^{[l]}}^{T}$, where $U^{[l]}, V^{[l]} \in R^{d \times r}$ and $r \ll d$. So we have: $x^{[l + 1]} = x^{[0]} ⊙ (U^{[l]} ({V^{[l]}}^{T} x^{[l]}) + b^{[l]}) + x^{[l]}$

For the experimental setting in the paper, $r = 64$ was the low rank threshold after which they reported diminishing returns for increasing rank.
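Below is a toy sketch of the low rank cross layer, with $d = 3$ and $r = 1$ chosen purely for illustration. The point is that we project down to $r$ dimensions and back up, never materializing the $d \times d$ matrix.

```python
def low_rank_cross(x0, xl, U, V, b):
    """x_{l+1} = x0 * (U (V^T xl) + b) + xl, with U and V of shape d x r."""
    d, r = len(xl), len(U[0])
    z = [sum(V[j][k] * xl[j] for j in range(d)) for k in range(r)]   # V^T xl, length r
    Uz = [sum(U[j][k] * z[k] for k in range(r)) for j in range(d)]   # back up to length d
    return [x0[j] * (Uz[j] + b[j]) + xl[j] for j in range(d)]


x0 = xl = [1.0, 2.0, 3.0]
U = [[1.0], [1.0], [1.0]]
V = [[1.0], [0.0], [2.0]]
out = low_rank_cross(x0, xl, U, V, b=[0.0, 0.0, 0.0])

d, r = 3, 1
full_params, low_rank_params = d * d, 2 * d * r  # 9 vs 6 weights in this toy case
```

The saving is modest at this size but scales as $d^{2}$ versus $2 d r$, so it becomes substantial once $d$ is in the hundreds or thousands.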

Modification 2: Mixture of Experts

Instead of just having one expert weight for each cross layer, they propose having multiple experts and then combining the expert outputs together using a gating mechanism. This is analogous to the multiple heads in multi-headed attention. The idea is that each expert can learn effective feature crosses in a certain subspace. The input-dependent gating mechanism can then select the appropriate experts for a given input.

We have:

$x^{[l + 1]} = \sum_{i = 1}^{K} G_{i} (x^{[l]}) E_{i} (x^{[l]}) + x^{[l]}$

$E_{i} (x^{[l]}) = x^{[0]} ⊙ (U_{i}^{[l]} ({V_{i}^{[l]}}^{T} x^{[l]}) + b^{[l]})$

Where:

$G_{i} (\cdot) : R^{d} \mapsto R$ is the gating function which represents the input-dependent weight of expert $i$ . It can be a learned softmax function.
$E_{i}$ is the expert $i$ . It is simply the earlier equation but with separate weights for each expert $i$ .
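The gating can be sketched as follows. I assume a linear gate followed by a softmax, which is one possible choice of $G$. The experts are passed in as callables, and two trivial stand-ins replace the low rank cross terms to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_cross(x0, xl, experts, gate_vectors):
    """x_{l+1} = sum_i G_i(xl) * E_i(xl) + xl.

    experts: list of callables E_i(x0, xl) returning a vector of length d.
    gate_vectors: one weight vector per expert, G = softmax of their dot products with xl.
    """
    gates = softmax([sum(g_j * x_j for g_j, x_j in zip(g, xl)) for g in gate_vectors])
    outputs = [expert(x0, xl) for expert in experts]
    return [sum(gate * out[j] for gate, out in zip(gates, outputs)) + xl[j]
            for j in range(len(xl))]


# Two toy experts standing in for the low rank cross terms E_i.
def expert_a(x0, xl):
    return [1.0 * a for a in x0]


def expert_b(x0, xl):
    return [3.0 * a for a in x0]


x0 = xl = [1.0, 2.0]
out = moe_cross(x0, xl, [expert_a, expert_b], gate_vectors=[[0.0, 0.0], [0.0, 0.0]])
```

With all-zero gate vectors the two experts are weighted equally, so the output is their average plus the residual $x^{[l]}$.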

Modification 3: Pre Activation Functions

With the low rank approximation, we effectively project the features to a low dimension and project it back up. Instead of immediately projecting back from dimension $r$ to $d$ , we can apply non-linear transformations. This allows the function to learn a richer set of representations.

$E_{i} (x^{[l]}) = x^{[0]} ⊙ (U_{i}^{[l]} \cdot g (C_{i}^{[l]} \cdot g ({V_{i}^{[l]}}^{T} x^{[l]})) + b^{[l]})$

Here, $g (\cdot)$ represents any non-linear activation function (like ReLU) and $C_{i}^{[l]} \in R^{r \times r}$ is a learned weight. In the paper, the sigmoid function was chosen.

In practice, the TensorFlow implementation seems to incorporate (i) the low rank approximation and (ii) the pre activation function, but does not do the mixture of experts.

Gao 2021 - GradCache

Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup

This paper demonstrates a method to perform contrastive learning for two-tower model training with an arbitrarily large batch size at a constant memory cost, at the expense of slightly longer compute time.

Background

The typical contrastive learning setup for learning a two tower retrieval model is explained by Karpukhin 2020 - Dense Passage Retrieval, in which we have batches of related (anchor, positive) passages. Contrastive learning sets out to maximize the similarity between the anchor and positive passage, and minimize the similarity between the anchor and all other passages in the mini batch. It has been consistently shown that using larger batch sizes is critical to the performance of this training method, as the number of negatives increases, providing more information to the contrastive learning process.

However, using large batch sizes is impractical for most researchers, since the memory cost scales linearly with the batch size. The DPR paper used 8x V100 GPUs to process a batch size of 128, which is not attainable for most outfits. Thus the method in this paper is of great practical significance.

Setup

We start with two classes or sets of data, $\mathcal{S}$ and $\mathcal{T}$. Typically $\mathcal{S}$ may represent a set of string queries and $\mathcal{T}$ may represent a set of document texts. We want to learn encoders $f$ and $g$ such that, given $s \in \mathcal{S}$ and $t \in \mathcal{T}$, the encoded representations $f (s)$ and $g (t)$ are close if related and far apart if not related.

Typically, we set up a contrastive loss as follows. Sample a mini batch of anchors $S \subset \mathcal{S}$ and corresponding targets $T \subset \mathcal{T}$, where each element $s_{i} \in S$ has a corresponding target $t_{r_{i}} \in T$. The rest of the samples in $T$ will be used as in-batch negatives.

We have an InfoNCE loss as follows: $L = - \frac{1}{\lvert S \rvert} \sum_{s_{i} \in S} \log \frac{\exp (f (s_{i})^{T} g (t_{r_{i}}) / τ)}{\sum_{t_{j} \in T} \exp (f (s_{i})^{T} g (t_{j}) / τ)}$

Let us also denote the parameters of $f$ as $Θ$ and the parameters of $g$ as $Λ$.
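A sketch of the InfoNCE loss on pre-computed embeddings is given below, assuming the positive for anchor $i$ sits at index $i$ of the target list (i.e. $r_{i} = i$) and every other target in the batch is a negative. The function name and the toy embeddings are mine.

```python
import math


def info_nce(anchor_embs, target_embs, temperature):
    """Mean InfoNCE loss with in-batch negatives.

    target_embs[i] is the positive for anchor_embs[i]. All other targets
    in the batch act as negatives for that anchor.
    """
    total = 0.0
    for i, f_s in enumerate(anchor_embs):
        logits = [sum(a * b for a, b in zip(f_s, g_t)) / temperature for g_t in target_embs]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchor_embs)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]], temperature=0.1)
misaligned = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]], temperature=0.1)
```

The loss is near zero when each anchor is closest to its own target and large when it is closest to another target in the batch.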

Analysis

Now we show how the gradient computation, and therefore training, can be broken down to mitigate the memory bottleneck. Importantly, note that the main bottleneck in such contrastive training is that the maximum memory requirement of the forward pass of the large BERT model scales linearly with the batch size. This is because we encode all texts in the mini-batch simultaneously and run backpropagation. Hence, we want a method that allows us to split the forward pass within a mini-batch into mini-mini-batches (let's call each one a tiny batch) while still getting the correct backpropagation gradients.

Applying the multivariate chain rule to the loss above, we have that:

$\frac{\partial L}{\partial Θ} = \sum_{s_{i} \in S} \frac{\partial L}{\partial f (s_{i})} \frac{\partial f (s_{i})}{\partial Θ}$

$\frac{\partial L}{\partial Λ} = \sum_{t_{j} \in T} \frac{\partial L}{\partial g (t_{j})} \frac{\partial g (t_{j})}{\partial Λ}$

From these simple statements, the paper makes two important observations:

The partial derivative $\frac{\partial f ( s _{i} )}{\partial Θ}$ only depends on $Θ$ and $s_{i}$ . It does not depend on any other anchor or passage. Thus, if we have access to the numerical value of $\frac{\partial L}{\partial f ( s _{i} )}$ , we can run backpropagation for $\frac{\partial L}{\partial Θ}$ independently from all other samples in an arbitrarily small batch.
The partial derivative $\frac{\partial L}{\partial f ( s _{i} )}$ requires only the numerical values of the encoded representations $f (s_{i})$ for all $s_{i} \in S$ and $g (t_{j})$ for all $t_{j} \in T$ . To compute these values, we don't actually need the computation graph states of the encoder $f$ , we just need the numerical values of all the embeddings.

The above statements are focused on $f$ , $s_{i}$ and $Θ$ , but similar statements hold for $g$ , $t_{j}$ and $Λ$ . The first statement above allows us to run the expensive gradient updates on a small batch of anchors or passages at a time, which avoids the memory bottleneck of running gradient updates on a large batch size for the large encoders. We can do this so long as we have access to the partial derivatives $\frac{\partial L}{\partial f ( s _{i} )}$ . The second statement shows us that computing these partial derivatives is not difficult, because we just need the encoded representations of each anchor and passage. Hence we can batch encode all the anchors and passages in the mini-batch (without gradients), and then use these values to compute this derivative.

Method

The above explanation directly informs the algorithm called GradCache. It works as follows. First, split the large batch into tiny batches which can fit into memory, denoted as $S = \{\hat{S_{1}}, \hat{S_{2}}, ...\}$ and $T = \{\hat{T_{1}}, \hat{T_{2}}, ...\}$.

Step 1: Graph-less Forward. We run a no-gradient forward pass of each encoder to obtain $f (s_{i}), g (t_{j})$ for all $s_{i} \in S, t_{j} \in T$ . We store all the encoded representations in memory.

Step 2: Gradient Caching. Using the pre-computed representations in step 1, we run a forward pass to obtain the loss $L$ . We then allow the autograd library to run a backward pass to get gradients $\frac{\partial L}{\partial f ( s _{i} )}, \frac{\partial L}{\partial g ( t _{j} )}$ for each representation $f (s_{i})$ or $g (t_{j})$ . Note that we are not involving the encoders in this step at all, so memory costs are minimal (just some dot products).

Denote $u_{i} = \frac{\partial L}{\partial f ( s _{i} )}$
Denote $v_{j} = \frac{\partial L}{\partial g ( t _{j} )}$
Store the representation gradient cache $\{u_{1}, u_{2}, ..., v_{1}, v_{2}, ...\}$

Step 3: Tiny Batch Gradient Accumulation. Recall earlier we said that so long as we have access to the partial derivatives from the loss to the embeddings, we can compute gradients for each $s_{i}$ or $t_{j}$ in arbitrarily tiny batches. This is the step where we do so. Specifically, we perform gradient accumulation one tiny batch at a time.

For the parameters $Θ$ of encoder $f$ : $\frac{\partial L}{\partial Θ} = \sum_{\hat{S_{j}} \in S} \sum_{s_{i} \in \hat{S_{j}}} \frac{\partial L}{\partial f ( s _{i} )} \frac{\partial f ( s _{i} )}{\partial Θ} = \sum_{\hat{S_{j}} \in S} \sum_{s_{i} \in \hat{S_{j}}} u_{i} \frac{\partial f ( s _{i} )}{\partial Θ}$

The $u_{i}$ s are simply looked up from the gradient cache. We perform encoder forward on a tiny batch at each time, multiply with $u_{i}$ and accumulate the gradients. Thus the memory requirement is limited to the encoder forward on a tiny batch, which can be arbitrarily small. Note that the final gradient computed and applied will be equivalent to the original gradients had we directly computed the loss for the large batch. We can see this from the double summation in the equation above, which simply equates to summing over all $s_{i} \in S$ .

Finally, we perform a similar gradient accumulation for the parameters $Λ$ of encoder $g$ . Once all the sub-gradients are accumulated, the optimizer step is taken to update model parameters as though we had processed the full batch in a single forward backward pass.
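To convince myself that the three steps really reproduce the full-batch gradient, here is a minimal numpy sketch. It assumes a toy encoder $f (s) = tanh (W s)$ and hand-written backward passes in place of autograd; the names `encode`, `loss_and_rep_grads`, `encoder_grad` and `gradcache_grad` are my own and are not taken from the GradCache library.

```python
import numpy as np

def encode(W, X):
    """Toy encoder f(x) = tanh(W x), applied row-wise to X of shape (n, d_in)."""
    return np.tanh(X @ W.T)

def loss_and_rep_grads(F, G):
    """InfoNCE loss with in-batch negatives, plus dL/dF and dL/dG (step 2)."""
    n = F.shape[0]
    logits = F @ G.T
    logits = logits - logits.max(axis=1, keepdims=True)
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(P[np.arange(n), np.arange(n)]))
    dlogits = (P - np.eye(n)) / n          # gradient of the loss w.r.t. the logits
    return loss, dlogits @ G, dlogits.T @ F

def encoder_grad(W, X, U):
    """Push cached gradients U = dL/df(x) back through tanh(W x) to get dL/dW."""
    F = np.tanh(X @ W.T)
    dZ = U * (1.0 - F ** 2)                # gradient w.r.t. the pre-activation
    return dZ.T @ X

def gradcache_grad(W, X, U, chunk):
    """Step 3: accumulate dL/dW one tiny batch at a time."""
    total = np.zeros_like(W)
    for start in range(0, X.shape[0], chunk):
        rows = slice(start, start + chunk)
        total += encoder_grad(W, X[rows], U[rows])
    return total

rng = np.random.default_rng(0)
S, T = rng.normal(size=(8, 5)), rng.normal(size=(8, 5))   # anchors, passages
W, V = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))   # the two encoders

F, G = encode(W, S), encode(V, T)            # step 1: graph-less forward
loss, U, V_cache = loss_and_rep_grads(F, G)  # step 2: gradient cache

full = encoder_grad(W, S, U)                 # whole batch in one go
chunked = gradcache_grad(W, S, U, chunk=2)   # four tiny batches of two
print(np.allclose(full, chunked))
```

Only the rows of `U` belonging to the current tiny batch are needed at any time, which is why the encoder memory cost is bounded by the tiny batch size.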

Results

The results are based on replicating the Dense Passage Retrieval paper results using a smaller GPU. Note that DPR was trained on 8 GPUs and batch size of 128.

DPR had a top-20 hit rate of 78.4
Using a batch size of 8, which was the largest batch size that fits in memory on an RTX 2080ti, the top-20 hit rate was 77.2
Using gradcache to simulate the batch size of 128, the top-20 hit rate was 79.3
Using gradcache to simulate a larger batch size of 512, the top-20 hit rate was 79.9

These results demonstrate the importance of a large batch size, and show that GradCache can effectively simulate one on a small GPU.

Takeaways

This paper uses a simple property of the chain rule to separate the computation of the InfoNCE loss gradients into two parts, thereby removing the memory bottleneck. This method is implemented as the CachedMultipleNegativesRankingLoss in SentenceTransformer, and is really useful for those of us without access to large memory GPUs. We can train a model exactly to the performance of an arbitrarily large batch size, at the cost of longer computation time.

Implementation

The author Luyu Gao has an implementation of GradCache in PyTorch. The source code of the main classes is here:

The main method is in cache_step which computes the loss for a mini batch. We follow the logic below:

Step 1: forward_no_grad is called on the model_inputs to get the encoded representations (or embeddings) of all the input texts:
- torch.no_grad is used as context manager to avoid gradients
- In a for loop over the sub-batches:
  - The model forward method is called on the sub-batch input tensors
  - RandContext context manager for this forward pass is initialized and stored in a list of rnd_states
    - We need to store the random state of both CPUs and GPUs for this forward pass, because we need to exactly replicate the random number state at this point in time later. These random states can affect the behaviour of certain nn layers, especially DropOut.
    - The RandContext object will be used as context manager later on
- The sub-batch representation tensors are concatenated together
- The model_reps and rnd_states are returned
  - model_reps is appended to a list all_model_reps
  - rnd_states is appended to a list all_rnd_states
Step 2: build_cache is called to build the cache of gradients. These are the gradients from the loss $L$ to the embeddings $f (s_{i}), g (t_{j})$ .
- compute_loss is called to forward pass from the embeddings to the loss
- backward is called to compute the gradients from the loss to the embeddings
- For each embedding r, r.grad is accessed to get the gradients
- The cache is thus [r.grad for r in reps]
Step 3: forward_backward is called to accumulate gradients
- Firstly, with state is called to restore the random context that we stored earlier. This ensures that the forward pass of the model to get the embedding (this time with gradients) exactly matches the earlier forward pass with no gradients.
- We obtain the embeddings y with gradients enabled. This corresponds to $f (s_{i})$ in the analysis above.
- We retrieve the gradients associated with each embedding that we stored earlier in step 2 (call it reps). This corresponds to $u_{i}$ in the analysis above.
- Now we dot product reps and y to form surrogate. backward is then called on surrogate to get the correct gradients.
  - This step is a bit tricky, let's look at it in a bit more detail.
  - Recall that our objective is to obtain $\frac{\partial L}{\partial f ( s _{i} )} \frac{\partial f ( s _{i} )}{\partial Θ}$
  - Recall that $u_{i}$ is the precomputed numerical value of $\frac{\partial L}{\partial f ( s _{i} )}$
  - Since $u_{i}$ is a constant, it will become a constant multiplier to the gradient on the backward pass
  - Hence by calling backward on surrogate, we will get gradients of the form $u_{i} \cdot \frac{\partial f ( s _{i} )}{\partial Θ}$ , which is what we want
- After the forward_backward function, the gradients will be accumulated in the .grad attribute of each model parameter, and the optimizer step can then be taken

RandContext itself is an interesting context manager to store and load pytorch's internal rng state.

On init, the current cpu and gpu rng states are captured:
- torch.get_rng_state() gets a byte tensor representing the cpu rng state
- torch.utils.checkpoint.get_device_states(*tensors) looks up the gpu device where the tensors are held, and returns both the gpu_devices and gpu_rng_states across all gpu devices
__enter__ is triggered when this class is used as context manager, to restore the earlier captured rng state.
- self._fork = torch.random.fork_rng is called to create a fork of the current torch rng state. This creates an isolated rng environment where we can restore the earlier rng environment without messing up the original rng environment that we entered from.
- self._fork.__enter__() is called to actually enter the forked state
- torch.set_rng_state now sets the cpu rng state to the earlier captured state
- torch.utils.checkpoint.set_device_states similarly sets the gpu rng state to the earlier captured state
__exit__ is triggered when the context manager is closed.
- self._fork.__exit__ is called to close the isolated rng environment
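
The capture-then-replay pattern can be illustrated without torch. The sketch below is an analogy using Python's stdlib `random` module, not the library's code: `RandState` and `noisy_forward` are hypothetical names, and the toy dropout stands in for a real encoder forward pass.

```python
import random

class RandState:
    """Capture the RNG state on init and replay it inside a `with` block."""

    def __init__(self):
        self.captured = random.getstate()

    def __enter__(self):
        # Remember the outer state so it can be put back on exit
        # (the role fork_rng plays in the torch version).
        self._outer = random.getstate()
        random.setstate(self.captured)

    def __exit__(self, exc_type, exc, tb):
        random.setstate(self._outer)
        return False

def noisy_forward(x, p=0.5):
    """Toy dropout: zero each value with probability p, rescale the rest."""
    return [0.0 if random.random() < p else v / (1 - p) for v in x]

random.seed(0)
x = [1.0, 2.0, 3.0, 4.0]

state = RandState()            # captured just before the first forward pass
first = noisy_forward(x)       # "no-grad" pass
random.random()                # unrelated RNG use in between

outer_before = random.getstate()
with state:
    second = noisy_forward(x)  # replayed pass sees the same dropout mask
outer_after = random.getstate()

print(first == second, outer_before == outer_after)
```

The second forward pass reproduces the first exactly, and the RNG stream outside the `with` block is left undisturbed.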

Gao 2021 - Simple Contrastive Learning of Sentence Embeddings

Paper Link

This paper proposes an unsupervised and a supervised approach to fine-tune sentence encoder models for semantic textual similarity (STS) tasks. The STS tasks refer to a set of tasks taken from SemEval 2012 to 2017 where, given two sentences, the model outputs a similarity score between 0 and 5, and is evaluated on how well its scores correlate with human-annotated scores.

Task Dataset

Example sentence pair with a score of 4 (taken from SemEval 2016):

In May 2010, the troops attempted to invade Kabul.
The US army invaded Kabul on May 7th last year, 2010

Example sentence pair with a score of 3:

John said he is considered a witness but not a suspect.
"He is not a suspect anymore." John said.

Unsupervised SimCSE

SimCSE follows the popular framework of contrastive learning using in-batch negatives, where pairs of related sentences are trained to have high similarity whilst having low similarity with other random sentences.

The idea of unsupervised SimCSE is simple: given a collection of sentences $\{x_{i}\}_{i = 1}^{m}$ , treat each sentence itself as its own positive pair, and use dropout noise to introduce random perturbations such that the self similarity is not trivially perfect.

Specifically, if we denote $h_{i}^{z}$ as the embedding for $x_{i}$ under dropout mask $z$ , then the loss for unsupervised SimCSE for a mini-batch of $N$ sentences is:

$l_{1} = - \log \frac{e^{sim (h_{i}^{z_{i}}, h_{i}^{z_{i}^{'}}) / τ}}{\sum_{j = 1}^{N} e^{sim (h_{i}^{z_{i}}, h_{j}^{z_{j}^{'}}) / τ}}$

Note that $τ$ is the temperature hyperparameter. Importantly, the authors found that setting $τ = 1$ with cosine similarity performs very poorly (64.0), compared to dot product (85.9). However, carefully tuning $τ$ can lead to similar performance ( $τ = 0.05$ had a score of 86.2).
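
A minimal numpy sketch of this loss, under a loud assumption: dropout is applied directly to fixed toy vectors rather than inside a BERT encoder, so it only illustrates the shape of the computation. The denominator compares sentence $i$ under one mask against every sentence $j$ under the other mask. The function names are my own.

```python
import numpy as np

def dropout(X, p, rng):
    """Zero each entry with probability p and rescale the rest (inverted dropout)."""
    mask = rng.random(X.shape) >= p
    return X * mask / (1.0 - p)

def simcse_unsup_loss(H1, H2, tau=0.05):
    """Mean over i of -log softmax_j(cos(H1[i], H2[j]) / tau) evaluated at j = i."""
    A = H1 / np.linalg.norm(H1, axis=1, keepdims=True)
    B = H2 / np.linalg.norm(H2, axis=1, keepdims=True)
    logits = (A @ B.T) / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                         # stand-in sentence vectors
H1, H2 = dropout(X, 0.1, rng), dropout(X, 0.1, rng)  # two dropout masks

loss_matched = simcse_unsup_loss(H1, H2)             # positives on the diagonal
loss_mismatched = simcse_unsup_loss(H1, H2[::-1])    # positives deliberately misaligned
print(loss_matched, loss_mismatched)
```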

We may view this procedure as data augmentation, analogous to how random pixel distortions and rotations are applied to images to improve computer vision models. The paper shows that this simple unsupervised method significantly outperforms other data augmentation methods. Note that the authors used the default 10% dropout for BERT models.

Supervised SimCSE

The supervised version follows a similar framework, although the positive pairs are taken from an external dataset. They chose the Natural Language Inference (NLI) datasets, where each example is a triplet of sentences. The premise is denoted $x_{i}$ , the entailment sentence is denoted $x_{i}^{+}$ and the contradiction sentence is denoted $x_{i}^{-}$ . The loss is then formulated as:

$l_{2} = - \log \frac{e^{sim (h_{i}, h_{i}^{+}) / τ}}{\sum_{j = 1}^{N} \left( e^{sim (h_{i}, h_{j}^{+}) / τ} + e^{sim (h_{i}, h_{j}^{-}) / τ} \right)}$
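
In code, the hard negatives simply become extra columns in the logits matrix. The sketch below assumes pre-computed embeddings (random toy vectors here) and cosine similarity; `simcse_sup_loss` is my own name, not the paper's.

```python
import numpy as np

def normalize(H):
    return H / np.linalg.norm(H, axis=1, keepdims=True)

def simcse_sup_loss(H, H_pos, H_neg=None, tau=0.05):
    """Row i is scored against all entailments, plus all contradictions if given."""
    H, H_pos = normalize(H), normalize(H_pos)
    logits = H @ H_pos.T / tau
    if H_neg is not None:
        logits = np.hstack([logits, H @ normalize(H_neg).T / tau])
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    n = H.shape[0]
    return float(-np.mean(log_prob[np.arange(n), np.arange(n)]))

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))                  # premises
H_pos = H + 0.1 * rng.normal(size=(4, 8))    # entailments, close to the premises
H_neg = rng.normal(size=(4, 8))              # contradictions (toy: random vectors)

without_hard = simcse_sup_loss(H, H_pos)
with_hard = simcse_sup_loss(H, H_pos, H_neg)
print(without_hard, with_hard)
```

Adding the contradiction columns can only enlarge the denominator, so the loss with hard negatives is never smaller than the loss without them for the same embeddings.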

Ablation Studies

The paper finds that including contradiction sentences as hard negatives yields a small but significant improvement in performance
The paper finds that using the [CLS] token or averaging embeddings across the first and last layer does not make much difference

Alignment and Uniformity

Wang and Isola 2020 propose two metrics for measuring the effectiveness of an embedding method on a set of documents:

Alignment. Given a distribution of positive pairs of documents $p_{pos}$ , alignment desires the expected distance between embeddings of each pair to be small:

$l_{align} = \mathbb{E}_{(x, x^{+}) \sim p_{pos}} \| f (x) - f (x^{+}) \|^{2}$

Uniformity. Given any two documents drawn from the corpus, uniformity metric should be small (i.e. distance between them is large).

$l_{uniform} = \log \mathbb{E}_{(x, y) \sim p_{data}} e^{- 2 \| f (x) - f (y) \|^{2}}$
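
Both metrics are short to compute. A minimal numpy sketch, assuming the embeddings are already normalized to unit length and estimating the uniformity expectation over all distinct pairs in the sample (function names are mine):

```python
import numpy as np

def alignment(F_x, F_pos):
    """Mean squared distance between each embedding and its positive."""
    return float(np.mean(np.sum((F_x - F_pos) ** 2, axis=1)))

def uniformity(F):
    """log of the mean Gaussian potential exp(-2 * ||f(x) - f(y)||^2) over pairs."""
    i, j = np.triu_indices(F.shape[0], k=1)
    sq_dist = np.sum((F[i] - F[j]) ** 2, axis=1)
    return float(np.log(np.mean(np.exp(-2.0 * sq_dist))))

# All points collapsed onto one direction vs. spread around the unit circle.
collapsed = np.tile([[1.0, 0.0]], (4, 1))
spread = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

print(uniformity(collapsed), uniformity(spread))
```

The collapsed set scores 0, the worst possible value, while the spread set scores well below 0, which is the sense in which a lower uniformity value is better.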

A common problem pointed out in training language models is anisotropy, in which embeddings are pushed into a narrow cone in the vector space, which severely limits their expressiveness. The anisotropy problem is naturally connected to uniformity, which aims at distributing embeddings evenly in the space. The authors argue that contrastive learning as proposed in this paper addresses this problem through some analysis (omitted for now).

Empirically, they show that the alignment metric for SimCSE is comparable to average BERT, but the uniformity measure is significantly lower, leading to much better performance in terms of accuracy.

Weng 2021 - Contrastive Representation Learning

Reference: Weng, Lilian. (May 2021). Contrastive representation learning. Lil’Log. https://lilianweng.github.io/posts/2021-05-31-contrastive.

This page just follows Lilian's notes on contrastive learning.

Contrastive learning seeks to learn representations such that related samples are close together and unrelated samples are far apart. A sample is typically represented by a 1D vector (or embedding). The literature on contrastive learning comes from a wide range of tasks such as semantic retrieval and computer vision, hence we may see the same ideas repeated in slightly different forms.

Li 2021 - TaoBao Embedding-Based Retrieval

Li 2021 - Embedding Based Product Retrieval in TaoBao Search

This paper explains TaoBao's embedding-based search retrieval system. Some interesting aspects of this paper:

Tackles the inherent trade-off between recall and relevance for embedding based retrieval
Shows a way to incorporate personalization into search retrieval
Ideas on enhancing representation of the search query

Context

TaoBao's product search has many stages. This paper focuses on the initial retrieval stage, when around 10k products are retrieved per query. The retrievers used are:

Traditional lexical based term match retriever
Item-based collaborative filtering retriever (presumably based on recent items)
Embedding based retriever

The retrieval results from each retriever are merged and de-duplicated, and then sent to participate in the next phase (pre-ranking). This paper is only focused on the embedding-based retriever.

Problem

Taobao tackles several problems in this paper:

Poor control of relevance for embedding based retrieval (EBR). Taobao reported that over time, the EBR contributed to an increase in complaints as non-relevant items started to get surfaced. For example, if a query is made for Nike Shoes, shoes from similar brands like Adidas may get surfaced.
Balancing relevance and personalization. Taobao notes that personalization is an important aspect of product search, and they propose a way of merging search relevance and personalization in this paper.

Setup

Let $U = \{u_{1}, ..., u_{N}\}$ denote a set of users, $Q = \{q_{1}, ..., q_{N}\}$ denote their corresponding queries, and $I = \{i_{1}, ..., i_{M}\}$ denote the collection of items. Let us divide the user $u$ 's historical sequence of items into 3 buckets:

Real time, i.e. most recent $T$ item interactions, denoted as $R^{u} = \{i_{1}^{u}, ..., i_{T}^{u}\}$
Short term, i.e. within last 10 days but older than real time, denoted as $S^{u} = \{i_{1}^{u}, ..., i_{T}^{u}\}$
Long term, i.e. within last 1 month but older than short term, denoted as $L^{u} = \{i_{1}^{u}, ..., i_{T}^{u}\}$

The search task is as follows. We are given the historical behaviours $(R^{u}, S^{u}, L^{u})$ of a given user and the submitted query $q_{u}$ at time $t$ . We also get access to a sequence of historical queries for the user $q_{his} = \{q_{1}^{u}, ..., q_{k}^{u}\}$ . Our task is to return a set of items $i \in I$ that satisfy the search request. Typically, we score each query, item pair according to score $z$ and return the top-K items: $z = F (ϕ (q_{u}, R^{u}, S^{u}, L^{u}), ψ (i))$

Where:

$F$ denotes the scoring function, typically the inner product function
$ϕ$ denotes the query + behaviour encoder
$ψ$ denotes the item encoder

Model Architecture

The model architecture in the paper is quite complicated, although understandable after some time. I don't think it is profitable to describe the architecture in full specificity, but there are useful ideas within.

At a high level, the model architecture is a standard two-tower setup, where the user query and behaviours are encoded into a user embedding, and the item is encoded into an item embedding. The dot product is used to score user-item pairs and ANN search is used for retrieval. The interesting details lie mostly within the user tower, in particular the way the query and behaviours are integrated.

Query Representation

We first tackle the Query Representation. Given a query string, the query encoder encodes it into $q_{mgs} \in \mathbb{R}^{6 \times d}$ , where $d$ is an arbitrary embedding dimension. The first dimension is 6 because the paper uses 6 different ways to encode the query (they call this multi-grained representation). The 6 representations are briefly explained as follows:

1-gram and 2-gram embeddings. The query string is tokenized into 1-gram or 2-gram and embeddings are looked up for each token. Mean pooling over tokens is done to get a single embedding.
Phrase embeddings. Similarly, the query string is tokenized into phrases using their query segmentation engine. For example, 红色 is a phrase and 连衣裙 is a phrase. The same embedding lookup and mean-pooling is done. Call this $q_{seg}$ embedding.
Phrase transformer embeddings. The phrase embeddings are also passed through a transformer before mean-pooling. I suppose the sequence of phrases matters enough to do this.
Historical query embeddings. This is the interesting part where the query interacts with historical queries.
- Specifically, the phrase embedding $q_{seg} \in \mathbb{R}^{d}$ and the historical query matrix $q_{his} \in \mathbb{R}^{k \times d}$ are used to form an attention matrix in the form of $attn = softmax (q_{seg} \cdot q_{his}^{T}) \in \mathbb{R}^{1 \times k}$ , which provides the relevance of each historical query to the current submitted query.
- The attention weights are then used to do a weighted average over the historical query embeddings $q_{his\_seq} = attn \cdot q_{his} \in \mathbb{R}^{d}$ .
- Hence we get a weighted representation of historical queries where more relevant queries to the current query are emphasized.
Mix embeddings. This is simply a sum over the above 5 embeddings
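
The historical query attention above is just a softmax-weighted average. A tiny numpy sketch with made-up 2-dimensional embeddings (`historical_query_embedding` is my own name):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def historical_query_embedding(q_seg, q_his):
    """Weight each historical query by its dot product with the current query."""
    attn = softmax(q_seg @ q_his.T)   # shape (k,)
    return attn, attn @ q_his         # weighted average, shape (d,)

q_seg = np.array([1.0, 0.0])          # current query phrase embedding
q_his = np.array([[1.0, 0.0],         # same direction as the current query
                  [0.0, 1.0],         # unrelated
                  [-1.0, 0.0]])       # opposite direction
attn, q_his_seq = historical_query_embedding(q_seg, q_his)
print(attn, q_his_seq)
```

The historical query pointing in the same direction as the current query receives the largest weight, so it dominates the pooled representation.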

Note that in the query representation, both token-based embeddings and phrase-based embeddings are used to provide a fine grained representation of the query.

User Behaviour Representation

Now we tackle the representation of user behaviours. There are some minor deviations on how they treat $R^{u}, S^{u}, L^{u}$ respectively, but broadly they follow the same structure. We refer to $R^{u}$ below but similar treatment applies to $S^{u}$ or $L^{u}$ .

Each item in the sequence $R^{u}$ is represented by an embedding
The item embedding comprises a concatenation of several embeddings, such as item ID embedding, item category embedding, brand embedding etc.
$R^{u} \in \mathbb{R}^{T \times d}$ is passed into a transformer with self attention to get a hidden representation $R_{self\_attn}^{u} \in \mathbb{R}^{T \times d}$
Similar to the treatment of historical query above, cross attention is performed with the query representation $q_{mgs}$ to get a query-weighted representation of $R^{u}$ , as follows:
- The attention matrix is formed by taking $attn = softmax (q_{mgs} \cdot (R_{self\_attn}^{u})^{T}) \in \mathbb{R}^{6 \times T}$ , which provides the relevance of each historical interacted item to the current submitted query.
- The attention weights are then used to do a weighted average over the historical item embeddings $H_{R} = attn \cdot R_{self\_attn}^{u} \in \mathbb{R}^{6 \times d}$ .
- Hence we get a weighted representation of historical item interactions where more relevant items to the current query are emphasized.
The same operations are applied to $S^{u}$ and $L^{u}$ to get $H_{S}$ and $H_{L}$ respectively.

Finally, the query $q_{mgs}$ and the representations of historical item interactions $H_{R}, H_{S}, H_{L}$ are passed into a self attention layer to get the final user representation, which captures both query semantics and historical information about the user: $H_{qu} = Transformer ([[CLS], q_{mgs}, H_{R}, H_{S}, H_{L}]) \in \mathbb{R}^{25 \times d}$

The embedding at the $[CLS]$ token is taken to represent the user. This has dimension $d$ .
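
A shape-level sketch of the user tower helps me keep track of where the 25 comes from. It uses random toy tensors, and the self attention transformers before and after the cross attention are deliberately omitted, so this is only the bookkeeping and not the paper's model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(q_mgs, seq):
    """Each of the 6 query views attends over the T behaviour embeddings."""
    attn = softmax(q_mgs @ seq.T)   # (6, T)
    return attn @ seq               # (6, d)

d, T = 8, 5
rng = np.random.default_rng(0)
q_mgs = rng.normal(size=(6, d))                        # multi-grained query
R, S, L = (rng.normal(size=(T, d)) for _ in range(3))  # behaviour sequences
cls = np.zeros((1, d))                                 # placeholder [CLS] embedding

tokens = np.vstack([cls,
                    q_mgs,
                    cross_attend(q_mgs, R),
                    cross_attend(q_mgs, S),
                    cross_attend(q_mgs, L)])
print(tokens.shape)   # 1 + 6 + 6 + 6 + 6 rows
```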

Item Representation

The item embedding is represented similarly to how each item was represented above. Mean pooling is done over the phrase segmentation of the item title, and added to the ID embedding of the item.

Loss Function

Now that we have described the model architecture, we proceed to training details. The authors used sampled softmax loss to train the model. The sampled softmax loss was compared against pairwise hinge loss and found to be superior. The hinge loss also has a downside of needing to tune the margin parameter, which can significantly affect results.

Specifically, for a given user representation $H_{qu}$ and item representation $H_{i}$ of item $i$ , let the predicted score for the match between user and item $i$ be denoted: $\hat{y} (i \mid q, u) = \frac{\exp ( F ( H_{qu} , H_{i} ) / τ )}{\sum_{i^{'} \in I} \exp ( F ( H_{qu} , H_{i^{'}} ) / τ )}$

Where $τ$ is the temperature parameter.

We can then denote the loss as the negative log likelihood of the positive interactions: $L = - \sum_{(q, u, i^{+}) \in D} \log \hat{y} (i^{+} \mid q, u)$

Adjustments

The authors propose two adjustments to the training procedure to improve performance.

Firstly, they propose to use temperature $τ$ to handle noisy relevance labels. They argue that the positive signals in e-commerce data from clicks and purchases are noisy, and will benefit from tuning the temperature. The argument is:

As $τ \to 0$ , the softmax function approaches a one-hot function of the item with the highest score. Thus the model will be encouraged to fit the training signals exactly, which runs the risk of overfitting to noisy labels.
Conversely, as $τ \to \infty$ , the softmax function approaches a uniform distribution over all items regardless of relevance. This encourages the model to underfit the training signals, since scoring positive items higher does not affect the loss by much.

Therefore, supposing the positive signals to be noisy, we should increase $τ$ to accommodate the noise. The authors found that $τ = 3$ performed the best for their data.
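
The two limits are easy to see numerically. A small stdlib-only sketch with made-up item scores:

```python
import math

def softmax_with_temperature(scores, tau):
    """Softmax over scores / tau, shifted by the max for numerical stability."""
    scaled = [s / tau for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]   # hypothetical F(H_qu, H_i) for three items

sharp = softmax_with_temperature(scores, tau=0.01)   # close to one-hot
flat = softmax_with_temperature(scores, tau=100.0)   # close to uniform
print(sharp, flat)
```

With a tiny $τ$ almost all probability mass sits on the top-scoring item, while with a large $τ$ the three items are nearly indistinguishable, so pushing a positive item's score up changes the loss very little.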

Note: This approach advocates for more smoothing of the softmax distribution, which is opposite to the findings from papers like SimCSE, which advocate for less smoothing (or a very low $τ = 0.05$ ). I suppose the optimal temperature depends on how noisy the positive signals are relative to how noisy the negative signals are.

Secondly, they propose a unique way of mining hard negative samples. Firstly, they mine hard negatives from the random negatives as per normal by choosing the top $N$ negative items with the highest inner product with the user representation. Let this matrix be called $H_{hard} \in \mathbb{R}^{N \times d}$ . Now, they linearly interpolate $H_{hard}$ with the positive item embedding $H_{i^{+}}$ as follows: $H_{mix} = α \cdot H_{i^{+}} + (1 - α) \cdot H_{hard}$

This $H_{mix}$ is then used in the loss function as the negative item embeddings. In their experiment, they found that $α$ between $0.4$ and $0.6$ worked well.
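
A minimal numpy sketch of the mining and mixing step, using random toy embeddings (`mixed_hard_negatives` is my own name). Because the inner product is linear, each mixed negative scores exactly $α$ of the way from its hard negative towards the positive item.

```python
import numpy as np

def mixed_hard_negatives(h_user, h_pos, H_neg, n_hard, alpha):
    """Pick the n_hard negatives closest to the user, then blend in the positive."""
    scores = H_neg @ h_user
    hard_idx = np.argsort(-scores)[:n_hard]   # highest inner product first
    H_hard = H_neg[hard_idx]
    H_mix = alpha * h_pos + (1.0 - alpha) * H_hard
    return H_mix, H_hard

rng = np.random.default_rng(0)
d = 4
h_user = rng.normal(size=d)        # user representation
h_pos = rng.normal(size=d)         # positive item embedding
H_neg = rng.normal(size=(10, d))   # random negatives

H_mix, H_hard = mixed_hard_negatives(h_user, h_pos, H_neg, n_hard=3, alpha=0.5)
print(H_mix @ h_user)
```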

Note: This approach seems to exacerbate the problem of false negatives, since we are using the positive item as part of the negative signal.

Finally, the authors propose a simple solution to the problem they raised of poor relevance matches surfaced by the EBR approach. The proposed relevance control measure is to simply add a boolean term-based filter to the EBR results. For example, if the query is Red Nike Shoes, the boolean query will be something like Color:Red AND Brand:Nike AND Product:Shoes. The boolean filter will thus filter out any irrelevant results.

The paper does not elaborate much on how the query filter is implemented. Taobao decides on the key query terms to filter based on their understanding of what constitutes essential components of the query (i.e. brand name, color, style etc.).
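
Since the paper gives no implementation detail, the following is purely a hypothetical sketch of what such a post-filter could look like: the item schema, attribute names and `boolean_filter` function are all my own inventions for illustration.

```python
def boolean_filter(candidates, required):
    """Keep only candidates whose attributes match every required key term."""
    return [
        item for item in candidates
        if all(item["attrs"].get(key) == value for key, value in required.items())
    ]

# Hypothetical EBR output for the query "Red Nike Shoes", sorted by score.
candidates = [
    {"id": 1, "score": 0.92, "attrs": {"color": "red", "brand": "nike", "product": "shoes"}},
    {"id": 2, "score": 0.90, "attrs": {"color": "red", "brand": "adidas", "product": "shoes"}},
    {"id": 3, "score": 0.85, "attrs": {"color": "blue", "brand": "nike", "product": "shoes"}},
]
# Color:Red AND Brand:Nike AND Product:Shoes
required = {"color": "red", "brand": "nike", "product": "shoes"}

kept = boolean_filter(candidates, required)
print([item["id"] for item in kept])
```

The Adidas item is dropped despite its high embedding score, which is exactly the Nike versus Adidas failure case the filter is meant to catch.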

Note: One may wonder what is the point of Embedding Based Retrieval (EBR) if we are going to slap a lexical boolean filter on it at the end. The answer is that Taobao is not using EBR as a way to achieve fuzzy semantic matching, as one might do when the lexical retrieval is too strict. Rather, it seems that due to the large number of items in Taobao's catalogue, the number of items passing the boolean filter already far exceeds the 10k limit. Thus they are using EBR as a way of surfacing the most relevant items within the subset of items that pass the boolean filter.

Results

Interestingly, Taobao evaluates the quality of the retrieval based on two metrics:

Recall@1k: This is a typical measure of how many positive items were retrieved by the algorithm
Relevant %: Taobao has an in-house relevance model that scores 0.915 AUC on human-annotated data on whether a query and a product are a good relevance match. They use this model to judge whether each retrieved product is relevant to its query, and report the % of results with good relevance.

The findings of the paper are:

The multi-grained representation of the query adds a small amount of recall
Tuning temperature $τ$ / using hard negatives adds significant amount of relevance
Adding all the improvements raises relevance but decreases recall slightly, showing that there is some inherent trade-off between relevance and recall (i.e. some users do indeed engage with non-relevant items to their query)

It seems that Taobao prioritizes relevance over recall, which makes sense as surfacing irrelevant results undermines the integrity of the search system. Furthermore, since they use a boolean filter for relevance control, irrelevant results surfaced by the EBR system would get filtered out anyway.

Takeaways

This paper tackles an important problem for EBR systems where irrelevant results get surfaced sometimes. While a boolean filter might make sense for Taobao's use case, other systems might need a different approach to avoid being overly strict in relevance control. It is also insightful to understand the inherent trade-off between recall and relevance when training the model.

Finally, the paper also demonstrates a way to incorporate personalization into search via cross attention mechanisms between the query representation and the historical item interactions of the user.

Zou 2021 - PLM Based Ranking in Baidu Search

Pre-trained Language Model based Ranking in Baidu Search

This paper explains how Baidu fine-tunes pre-trained language models (PLMs) to perform efficient ranking for search. Some interesting aspects:

A quick sentence extractor to perform query-sensitive summarization of documents on the fly
Calibrating relevance signals for each query-document pair using click features

Background

The paper highlights some challenges with using PLM-based cross encoder ranking for search:

Documents tend to be long, making it hard for a ranker to meet the extremely low latency requirements of search. Transformer based models have quadratic time complexity with sequence length, exacerbating the problem.
PLMs are typically trained on the language modelling task, which does not transfer directly to the task of scoring query-document pairs for ranking.
The score needs to be in a meaningful range to be easily blended with other ranking signals such as freshness, authority etc. Note: This portion might be specific to Baidu as they score documents in the 0-4 range.

Problem Formulation

Given a query $q$ and a set of retrieved documents $D = \{d_{i}\}_{i = 1}^{N}$ , we desire a scoring function $f : Q \times D \to \mathbb{R}$ , which maximizes some evaluation metric: $\max_{f} \mathbb{E}_{q, D, Y} Θ (Y, F (q, D))$

Where:

$Θ$ is an evaluation metric like DCG
$F (q, D) = \{f (q, d_{i})\}_{i = 1}^{N}$ is the set of document scores for this query
$Y = \{y_{i}\}_{i = 1}^{N}$ is the set of true labels. For Baidu, $y_{i}$ is in the range 0-4, corresponding to bad, fair, good, excellent, perfect respectively.

In learning to rank, we have a set of labelled query document pairs denoted as $Φ = \{ϕ_{q}\}$ , where $ϕ_{q} = \{q, D = \{d_{i}\}, Y = \{y_{i}\}, i = 1, ..., N\}$ is a set of labelled documents given a query $q$ . We minimize the learn to rank loss $L$ as a proxy to minimizing the non-differentiable metric $Θ$ : $L (f) = \frac{1}{|Z|} \sum_{\{q, D, Y\} \in Φ} l (Y, F (q, D))$
Where $l$ is the loss function and $Z$ is the normalizing factor.

Query-Weighted Summary Extraction

As mentioned above, documents are typically long and ill-suited for direct inputting into a BERT-based transformer. Thus the authors propose a fast query-sensitive algorithm to select a subset of sentences from each document to serve as the summary. The algorithm is as follows:

Given a query $q$ , document $d$ , decay factor $α$ and max number of sentences $k$
Tokenize $q$ into a set of words $W_{q}$
Tokenize the document $d$ into a set of sentences $S$
Initialize a vector of word importances as $ω_{w}$ for $w \in W_{q}$
For each sentence $s_{i} \in S$ :
- Tokenize $s_{i}$ into a set of words $W_{s_{i}}$
- Compute $score = \sum_{w \in W_{s_{i}} \cap W_{q}} ω_{w}$
Choose the sentence with the highest score $s^{*}$ and add to summary
Remove $s^{*}$ from $S$
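Decay used word importances in $ω$ accordingly, i.e. $ω_{w} \leftarrow α \cdot ω_{w}$ for all $w \in W_{s^{*}} \cap W_{q}$
Repeat the scoring, selection and decay steps until $k$ sentences have been added to the summary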

It is a simple and fast algorithm that chooses sentences with the highest word overlap with the query. Each subsequent sentence aims to capture a different set of words by decaying the weights of words already chosen. Baidu chooses $α = 0.5$ and eventually $k = 1$ in their experiments.
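
A stdlib-only sketch of the extraction loop follows. The assumptions are mine: whitespace tokenization, sentence splitting on full stops, and every query word starting with importance 1.0 (the notes above do not say how the importances are initialized).

```python
def extract_summary(query, document, alpha=0.5, k=2):
    """Greedily pick k sentences by weighted overlap with the query words."""
    query_words = set(query.lower().split())
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    weights = {w: 1.0 for w in query_words}   # assumed initial importances
    summary = []
    while sentences and len(summary) < k:
        def score(sentence):
            overlap = set(sentence.lower().split()) & query_words
            return sum(weights[w] for w in overlap)
        best = max(sentences, key=score)      # first sentence wins ties
        summary.append(best)
        sentences.remove(best)
        # Decay the query words just covered so later picks favour new words.
        for w in set(best.lower().split()) & query_words:
            weights[w] *= alpha
    return summary

query = "cheap red nike running shoes"
document = "Red nike shoes on sale. Nike shoes in red. Cheap running gear."

with_decay = extract_summary(query, document, alpha=0.5, k=2)
no_decay = extract_summary(query, document, alpha=1.0, k=2)
print(with_decay)
print(no_decay)
```

With decay, the second pick switches from a near-duplicate sentence about red Nike shoes to the sentence covering the so far unmatched words cheap and running.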

Ranking Architecture

The ranking architecture is a simple cross-encoder set up, except that the representation is split up with query, title on one side and the query-sensitive document summary on the other side. This avoids incurring quadratic time complexity on the fully concatenated text. In their experiments, they used 9 representation layers and 3 interaction layers, so that the full attention is only done for 3 layers. Baidu estimates that this setup reduces average time by 30%.

The representation at the [CLS] token is taken and probably attached with a softmax head to perform classification into the 0-4 range.


Fine-tuning Loss

The model is fine-tuned (the paper calls it pre-training, but I think it's more of a fine-tuning step) using triplet loss on a set of positive and negative pairs. The generation of positive and negative examples is covered below. Specifically, the PLM is fine-tuned to minimize the following loss: $l (Y, F (q, D)) = \sum_{y_{i} < y_{j}} [f (q, d_{i}) - f (q, d_{j}) + m]_{+}$

Where $y_{i}, y_{j}$ refer to the true labelled score of documents $i, j$ respectively, and $m$ is the margin.
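
Written out in plain Python for the documents of a single query, with made-up labels and scores and an arbitrary margin of 1.0:

```python
def pairwise_hinge_loss(labels, scores, margin=1.0):
    """Sum of [f_i - f_j + m]_+ over all pairs where document i is labelled below j."""
    loss = 0.0
    for y_i, f_i in zip(labels, scores):
        for y_j, f_j in zip(labels, scores):
            if y_i < y_j:
                loss += max(0.0, f_i - f_j + margin)
    return loss

labels = [0, 2, 4]   # relevance grades on the 0-4 scale

well_ordered = pairwise_hinge_loss(labels, [0.1, 1.5, 3.0])
reversed_order = pairwise_hinge_loss(labels, [3.0, 1.5, 0.1])
print(well_ordered, reversed_order)
```

The loss is zero once every better-labelled document outscores every worse-labelled one by at least the margin, and grows with each violated pair.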

Relevance Score Calibration

Naively, we can curate $y_{i}, y_{j}$ using clicks and non-clicks respectively. However, the paper highlights several issues with this approach:

Many clicks are in fact false positives, as they are caused by noisy clicks such as clickbait and accidental clicks
There is exposure bias in the ranking system favouring higher position items
Clicks do not necessarily imply relevance

Thankfully, there are many other signals accompanying each q, d pair, assuming that sufficient traffic has been observed for such a pair. Some examples from the paper include:

$\frac{\# \text{click}}{\# \text{skip}}$ , which measures the number of clicks vs number of users who were impressed but did not click.
$\frac{\# \text{click}}{\# \text{total clicks in the query}}$ , which measures the number of clicks on the item as a fraction of all other clicks the user made for that query. This helps to control for users who are just trigger-happy and click on many items.
$\frac{\# \text{long click}}{\# \text{click}}$ , which measures the number of long clicks (i.e. user clicked and made no other action for x seconds) against clicks.

These features contain signals about relevance that can be combined in a relevance model to infer the relevance of a q, d pair. To do this, Baidu crowd-sourced annotated ratings of a score from 0-4 for 70k query-document pairs. A simple decision tree-based model was then trained to infer the relevance score for a q, d pair given the relevance signal features discussed above. After training this model, it is then used to run inference on all q, d pairs (with sufficient observed traffic) to predict the relevance label and create a training dataset with a far more accurate relevance label.
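
As a sketch of the feature side only (the tree model itself is left out), the three example ratios could be computed from aggregated counts like this. The function name, argument names and counts are hypothetical, and returning 0.0 for an empty denominator is my own choice.

```python
def click_features(clicks, skips, total_query_clicks, long_clicks):
    """Turn raw counts for one (q, d) pair into the three example ratios."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "click_skip_ratio": ratio(clicks, skips),
        "click_share": ratio(clicks, total_query_clicks),
        "long_click_ratio": ratio(long_clicks, clicks),
    }

features = click_features(clicks=30, skips=60, total_query_clicks=120, long_clicks=12)
print(features)
```

A feature dictionary like this, one per q, d pair, would then be the input to the calibration model trained on the human-annotated ratings.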

Note: the paper also has a final step of fine-tuning the model according to crowd-sourced data, after the initial fine-tuning phase above with the calibrated relevance scores. Since this part is not particularly innovative nor feasible, I have omitted it here.

Experiments

They compared these innovations in an ablative manner from the original ERNIE system, which was a 12-layer transformer ranker trained using pairwise loss with human-labelled query-document pairs. The online ablation experiments show that:

The inclusion of the document summary in the document representation adds around 0.65% in DCG@2
The inclusion of fine-tuning on calibrated relevance scores increases the gains to 2.78%

The ablation experiments also showed that calibrated relevance scores are critical for effective fine-tuning. Using raw user clicks as the signal, the positive-negative ratio (PNR, i.e. the number of concordant q, d pairs over the number of discordant q, d pairs) was only 1.86, but with the calibrated scores it increased dramatically to 3.35.
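To make the PNR metric concrete, here is a minimal sketch for the documents of a single query. The function name and the choice to skip tied pairs are assumptions on my part; the paper may handle ties and aggregation across queries differently.

```python
from itertools import combinations


def positive_negative_ratio(labels, scores):
    """PNR for the documents of a single query.

    A pair of documents is concordant when the model score orders them the
    same way as the relevance labels, and discordant when the order is flipped.
    Pairs with tied labels or tied scores are skipped (an assumption).
    """
    concordant = discordant = 0
    for (l1, s1), (l2, s2) in combinations(zip(labels, scores), 2):
        if l1 == l2 or s1 == s2:
            continue
        if (l1 > l2) == (s1 > s2):
            concordant += 1
        else:
            discordant += 1
    return concordant / discordant if discordant else float("inf")


labels = [4, 2, 0, 1]            # human relevance ratings
scores = [0.9, 0.3, 0.4, 0.1]    # model scores for the same documents
print(positive_negative_ratio(labels, scores))  # 4 concordant / 2 discordant = 2.0
```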

Takeaways

This is an interesting paper on building an effective pure search ranker (i.e. no personalization), although we can always layer additional personalization on top of the initial relevance score. The benefit of training a pure search ranker is that we can observe multiple instances of the same q, d pair, which allows us to bootstrap a diverse array of weaker relevance signals into a much stronger relevance score. Even if we do not have human-annotated signals to train the decision tree, we often have a sparser but stronger signal (e.g. conversion) and many other weaker but more abundant signals. Hence, it may be possible to use the same approach to predict the stronger signal from the weaker ones.

I suppose another way to deal with the diverse signal problem is multi-task learning, but this approach allows us to avoid the complexity of multi-task learning by first pre-processing relevance signals.

The idea of an on-the-fly query-sensitive document summary also sounds useful, and can probably be implemented with the help of ElasticSearch. We could extend this approach by performing query expansion before doing the document summary, which may allow us to fetch more relevant sentences which do not have exact term overlap. Another way of doing this is to pre-generate document summaries using an LLM, but this assumes that one summary per document suits all queries, which may not be the case.

Dao 2022 - Flash Attention

Paper Link

This paper argues that the attention mechanism is slow because of reading / writing between GPU High Bandwidth Memory and GPU on-chip SRAM. The authors hence create a block-wise attention algorithm that minimizes such IO read / writes and speeds up attention significantly especially when the sequence length is long.

Brief Overview of Attention

Suppose we have an input sequence of embeddings $X = (x_{1}, ..., x_{N})$ where $x_{i} \in R^{k}$ , such that $X \in R^{k \times N}$ . Naively, we can compute activations by $V = X^{T} \cdot W_{v}$ , where $W_{v} \in R^{k \times d}$ , such that $V \in R^{N \times d}$ . However, this naive way of encoding our input sequence does not allow interaction between inputs at different positions (say $x_{i}$ with $x_{j}$ ). We can see this by observing that the first row of $V$ is only affected by the first column of $X$ (i.e. first encoding $x_{1}$ ), and likewise for all the other positions.

Attention addresses this problem by adding an interaction mechanism. Besides $W_{v}$ , we also create weight parameters $W_{q}, W_{k} \in R^{k \times d}$ . Given an input $X$ , we compute $Q, K, V \in R^{N \times d}$ as follows:

$Q = X^{T} \cdot W_{q}$
$K = X^{T} \cdot W_{k}$
$V = X^{T} \cdot W_{v}$

We then create an interaction matrix $S = Q K^{T} \in R^{N \times N}$ , and apply row-wise softmax to get $P = \text{softmax} (S) \in R^{N \times N}$ . $S$ can be thought of as a pairwise similarity matrix between the encoding at position $i$ and position $j$ that captures the degree of interaction. For example, in the sentence "the economy has been in decline", the value of $S_{1, 5}$ (assuming 0-index) measuring the interaction between "economy" and "decline" might be high.

Finally, we produce the output $O = P V \in R^{N \times d}$ , which is an activation output from the input sequence that has captured the interactions between tokens at different positions of the input. This simple mechanism has led to significant improvements in language modelling.
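The computation above can be sketched in a few lines of numpy. This follows the notation in these notes ( $X \in R^{k \times N}$ , weights in $R^{k \times d}$ ) and, like the notes, leaves out the $1 / \sqrt{d}$ scaling, masking and multiple heads; the function names are my own.

```python
import numpy as np


def softmax_rows(s):
    """Row-wise softmax, subtracting the row max for numerical stability."""
    s = s - s.max(axis=1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=1, keepdims=True)


def naive_attention(x, w_q, w_k, w_v):
    """x: (k, N) input embeddings; w_*: (k, d) weights. Returns (O, P)."""
    q = x.T @ w_q          # (N, d)
    k = x.T @ w_k          # (N, d)
    v = x.T @ w_v          # (N, d)
    s = q @ k.T            # (N, N) pairwise interaction scores
    p = softmax_rows(s)    # (N, N), each row sums to 1
    return p @ v, p        # O has shape (N, d)


rng = np.random.default_rng(0)
k_dim, d, n = 8, 4, 6
x = rng.normal(size=(k_dim, n))
w_q, w_k, w_v = (rng.normal(size=(k_dim, d)) for _ in range(3))
o, p = naive_attention(x, w_q, w_k, w_v)
print(o.shape, p.sum(axis=1))
```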

GPU Memory Hierarchy

The memory hierarchy is such that read/write speed is super fast on the SRAM but memory is highly limited. Hence, the N x N attention matrix is written/read repeatedly to/from HBM, resulting in IO being a bottleneck. The numbers are as such on an A100 GPU:

SRAM: 19 TB/s (20 MB RAM)
HBM: 1.5 TB/s (40 GB RAM)

Naive Attention Algorithm

The naive attention algorithm has many reads and writes to HBM. (ps: Not sure why we cannot persist the intermediate matrices on SRAM and complete the computations, but in any case the naive algorithm requires materializing the $N \times N$ matrices on SRAM which will quickly flood it. For example, a sequence length of 2,048 at float32 already takes up about 16.8MB for the $S$ matrix alone, or about 33.6MB for $S$ and $P$ together.)

Load $Q, K$ from HBM, compute $S = Q K^{T}$ , write $S$ to HBM
Read $S$ from HBM, compute $P = \text{softmax} (S)$ , write $P$ to HBM
Load $P, V$ from HBM, compute $O = P V$ , write $O$ to HBM

Flash Attention

The main idea is quite simple: instead of computing the full attention matrix, we use block-wise tiling to compute parts of it at a time. This reduces the memory required for each block and allows the whole computation to be done on SRAM while minimizing the amount of IO read from HBM, leading to faster compute time and lower memory usage on SRAM. The difficulty is in devising a block-wise softmax algorithm that yields the exact same result as computing it all at once.

Consider the naive softmax algorithm on an arbitrary vector $x \in R^{B}$ .

$\begin{aligned}
m (x) &:= \max_{i} x_{i} && (1) \\
p (x) &:= [e^{x_{1} - m (x)} \ \dots \ e^{x_{B} - m (x)}] && (2) \\
l (x) &:= \sum_{i} p (x)_{i} && (3) \\
\text{softmax} (x) &:= \frac{p (x)}{l (x)} && (4)
\end{aligned}$

Note that the maximum value $m (x)$ is subtracted for numerical stability to avoid overflow (underflow is ok because $e^{- \infty} = 0$ ). $p (x) \in R^{B}$ is the numerator and $l (x) \in R$ is the sum of all elements in $p (x)$ .

Now, the problem with the naive softmax algorithm in the context of attention is that we need an entire row of $S$ ( $N$ elements) to perform the row-wise softmax computation. This will not be available if we are performing block-wise computation, since we are splitting $Q, K \in R^{N \times d}$ row-wise into blocks of $Q_{i}, K_{j} \in R^{B \times d}$ . When we compute $S_{ij} := Q_{i} \cdot K_{j}^{T} \in R^{B \times B}$ , blocks of $S$ will be materialized in each pass, but not the entire row at a time.

Hence, we need a modified algorithm that allows us to compute chunks of the final output $O \in R^{N \times d}$ at a time by iterating block-wise through $S$ , such that the combination of the new chunk of $O$ at each step with the already written intermediate $O$ gives the correct result at the end. The key to realizing this algorithm is in decomposing the softmax step, as shown below.

Consider two vectors $x^{a}, x^{b} \in R^{B}$ . We can decompose the softmax of their concatenated vector $x = [x^{a} x^{b}] \in R^{2 B}$ as follows:

$\begin{aligned}
m (x) &= \max (m (x^{a}), m (x^{b})) && (5) \\
p (x) &= [e^{m (x^{a}) - m (x)} \cdot p (x^{a}) \quad e^{m (x^{b}) - m (x)} \cdot p (x^{b})] && (6) \\
l (x) &= e^{m (x^{a}) - m (x)} \cdot l (x^{a}) + e^{m (x^{b}) - m (x)} \cdot l (x^{b}) && (7) \\
\text{softmax} (x) &= \frac{p (x)}{l (x)} && (8)
\end{aligned}$

The first line of the above simply notes that the maximum of $x$ is the maximum over the subvector maximums $m (x^{a}), m (x^{b})$ . The second line notes that when computing $p (x^{a})$ we previously multiplied each $e^{x_{i}}$ for elements in $x^{a}$ by the factor $e^{- m (x^{a})}$ . To get the correct multiplier for the full vector $x$ , we need to divide away the previous multiplier and apply the new multiplier, i.e. $e^{- m (x)} / e^{- m (x^{a})} = e^{m (x^{a}) - m (x)}$ . The third line notes that the new denominator is the sum over each of the subvector sums, after we apply the correct multiplier from line 2.

The decomposition is simple but powerful. It implies that so long as we keep track of the intermediate statistics $m (x)$ and $l (x)$ , we can compute the softmax of a long vector $x$ by splitting $x$ into subvectors and operating on one subvector at a time.
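A quick numerical check of the decomposition, as a minimal numpy sketch (function names are my own): we compute the statistics $m, p, l$ on each half separately, merge them with the correction factors, and compare against the softmax of the concatenated vector.

```python
import numpy as np


def softmax_stats(x):
    """Return m(x), p(x), l(x) for a 1-D vector x."""
    m = x.max()
    p = np.exp(x - m)
    return m, p, p.sum()


def merge_softmax(x_a, x_b):
    """Softmax of concat(x_a, x_b) using only the per-chunk statistics."""
    m_a, p_a, l_a = softmax_stats(x_a)
    m_b, p_b, l_b = softmax_stats(x_b)
    m = max(m_a, m_b)
    # Rescale each chunk from its local max to the global max.
    p = np.concatenate([np.exp(m_a - m) * p_a, np.exp(m_b - m) * p_b])
    l = np.exp(m_a - m) * l_a + np.exp(m_b - m) * l_b
    return p / l


rng = np.random.default_rng(0)
x_a, x_b = rng.normal(size=4), rng.normal(size=4)
x = np.concatenate([x_a, x_b])
direct = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
print(np.allclose(merge_softmax(x_a, x_b), direct))
```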

Now we are ready for Algorithm 1: Flash Attention of the paper.

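$\begin{array}{lr}
\textbf{Require: } Q, K, V \in R^{N \times d} & \\
\text{01. Initialize } O = 0^{N \times d}, \ l = 0^{N}, \ m = - \infty^{N} & \\
\text{02. Divide } Q, K, V \text{ row-wise into } B \text{-sized blocks of } Q_{i}, K_{j}, V_{j} \in R^{B \times d} & \\
\text{03. Divide } O \text{ row-wise into } B \text{-sized blocks of } O_{i} \in R^{B \times d} & \\
\text{04. Divide } m, l \text{ into } B \text{-sized arrays of } m_{i}, l_{i} \in R^{B} & \\
\text{05. for } j \text{ in } 1, \dots : & \\
\text{06. } \quad \text{Load } K_{j}, V_{j} \text{ from HBM to SRAM} & \\
\text{07. } \quad \text{for } i \text{ in } 1, \dots : & \\
\text{08. } \quad \quad \text{Load } Q_{i}, O_{i}, l_{i}, m_{i} \text{ from HBM to SRAM} & \\
\text{09. } \quad \quad \text{Compute } S_{ij} = Q_{i} \cdot K_{j}^{T} \in R^{B \times B} & \\
\text{10. } \quad \quad \text{Compute } m_{ij} = \text{rowmax} (S_{ij}) \in R^{B} & (1) \\
\text{11. } \quad \quad \text{Compute } P_{ij} = e^{S_{ij} - m_{ij}} \in R^{B \times B} & (2) \\
\text{12. } \quad \quad \text{Compute } l_{ij} = \text{rowsum} (P_{ij}) \in R^{B} & (3) \\
\text{13. } \quad \quad \text{Compute } m_{i}^{new} = \text{elementwise-max} (m_{i}, m_{ij}) \in R^{B} & (5) \\
\text{14. } \quad \quad \text{Compute } l_{i}^{new} = e^{m_{i} - m_{i}^{new}} \cdot l_{i} + e^{m_{ij} - m_{i}^{new}} \cdot l_{ij} \in R^{B} & (7) \\
\text{15. } \quad \quad \text{Write } O_{i} \leftarrow (l_{i}^{new})^{- 1} \times [l_{i} \times e^{m_{i} - m_{i}^{new}} \times O_{i} + e^{m_{ij} - m_{i}^{new}} \times P_{ij} V_{j}] \text{ to HBM} & \\
\text{16. } \quad \quad \text{Write } l_{i} \leftarrow l_{i}^{new}, \ m_{i} \leftarrow m_{i}^{new} \text{ to HBM} & \\
\text{17. Return } O &
\end{array}$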

Note that we use $0^{N \times d}, 0^{N}$ to denote a zero matrix of size $N \times d$ and a zero array of size $N$ respectively. For simplicity, we divide $Q, K, V$ into equal $B$ -sized blocks but the paper allows different block sizes for $Q$ and $K, V$ . The equation numbers on the right in parentheses show which equations the lines correspond to above. Line 15 is a bit confusing because it combines multiple steps together. The next few paras try to unpack this.

Firstly, note that we are using the $\times$ operator to denote an element-wise broadcasted multiplication. For a vector $l_{i} \in R^{B}$ and matrices $P_{ij} \in R^{B \times B}, V_{j} \in R^{B \times d}$ , observe the associative property $(l_{i} \times P_{ij}) \cdot V_{j} = l_{i} \times (P_{ij} \cdot V_{j})$ , since each element of $l_{i}$ only affects the corresponding row in the final matrix. This allows us to apply the scaling to either $O_{i}$ or $P_{ij}$ and the result will be the same.

Next, see that the term $e^{m_{ij} - m_{i}^{new}} \times P_{ij} V_{j}$ is simply the corrected numerator of the softmax dotted with $V_{j}$ . Dividing this term by $l_{i}^{new}$ gives the output block for this particular $S_{ij}, V_{j}$ pair.

Similarly, the other term $l_{i} \times e^{m_{i} - m_{i}^{new}} \times O_{i}$ is the existing output that has been accumulated from previous steps $\{S_{ij}, V_{j} : j = 1, ...\}$ . Due to the associative property, we can also directly apply the scaling correction to $O_{i}$ . The $l_{i} / l_{i}^{new} \times e^{m_{i} - m_{i}^{new}}$ are scaling factors according to equations (6), (8) to correct the scaling of previous steps.

Finally, we should understand why there is a + in line 15. I find it easier to visualize if we set $B = 1$ . If we trace the matrix multiplications, we will observe that $O_{i} \in R^{1 \times d}$ is only affected by $Q_{i} \in R^{1 \times d}$ , i.e. it corresponds to only the query token in position $i$ . Now, $O_{i}$ represents the weighted average over all $N$ positions of the $V$ matrix where the weights are determined by the softmax over the interaction between $Q_{i}$ (representing one token) and all $N$ positions on the $K$ matrix. This weighted average is why it is a $+$ symbol: we are accumulating the weighted sum over $V_{j}$ into $O_{i}$ . The only complication is that we are applying the scaling corrections at each step.

Hopefully these explanations provide some intuition to the FlashAttention algorithm, which is quite a simple idea but makes a ton of difference practically. It should be easy to implement this algorithm in numpy if the reader wishes to understand it better.
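Here is one possible numpy sketch of the algorithm, following the same loop structure and variable names as Algorithm 1. It only reproduces the arithmetic: HBM and SRAM are not modelled, the sequence length is assumed to be divisible by the block size, and (as in these notes) the $1 / \sqrt{d}$ scaling, masking and dropout are left out.

```python
import numpy as np


def flash_attention(q, k, v, block=2):
    """Block-wise attention following the loop structure of Algorithm 1.

    q, k, v: (N, d) arrays with N divisible by `block` (an assumption made
    for brevity). Only the arithmetic is reproduced, not the memory movement.
    """
    n, d = q.shape
    o = np.zeros((n, d))
    l = np.zeros(n)
    m = np.full(n, -np.inf)
    for j in range(0, n, block):                      # outer loop over K_j, V_j
        k_j, v_j = k[j:j + block], v[j:j + block]
        for i in range(0, n, block):                  # inner loop over Q_i
            rows = slice(i, i + block)
            s_ij = q[rows] @ k_j.T                    # (B, B)
            m_ij = s_ij.max(axis=1)                   # (B,) block row max
            p_ij = np.exp(s_ij - m_ij[:, None])       # (B, B) block numerator
            l_ij = p_ij.sum(axis=1)                   # (B,) block row sum
            m_new = np.maximum(m[rows], m_ij)
            a = np.exp(m[rows] - m_new)               # rescales the old accumulators
            b = np.exp(m_ij - m_new)                  # rescales the new block
            l_new = a * l[rows] + b * l_ij
            o[rows] = ((l[rows] * a)[:, None] * o[rows]
                       + b[:, None] * (p_ij @ v_j)) / l_new[:, None]
            l[rows], m[rows] = l_new, m_new
    return o


def reference_attention(q, k, v):
    """Plain softmax(Q K^T) V for comparison."""
    s = q @ k.T
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
print(np.allclose(flash_attention(q, k, v, block=2), reference_attention(q, k, v)))
```

The block-wise result matches the reference for any block size that divides $N$ , which is the "exact same result" property the softmax decomposition was designed to give.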

Wei 2022 - CoT Prompting in LLMs

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

This is an early paper that sparked the explosion of research on in-context learning for LLMs. The main idea is to use few-shot examples with the Chain-Of-Thought reasoning, i.e. <Question>, <CoT reasoning>, <Answer> , as opposed to just the question and answer alone <Question>, <Answer> (as per Brown 2020 - Language Models are Few-Shot Learners).

Method

The method is simple - we include the CoT reasoning in the few-shot examples included in the prompt, inducing the LLM to also generate CoT reasoning before answering the question at inference time. An example prompt:

Q: Roger has 3 balls. He buys 2 more cans of balls, each with 3 balls.
How many balls does he have now?

A: Roger started with 3 balls. 2 cans of 3 balls each is 6 balls.
3 + 6 = 9. The answer is 9.

What are some advantages of CoT prompting?

Useable on black box LLMs. No fine-tuning is required, so we can readily apply it on off-the-shelf LLMs.
CoT allows decomposition of complex problems into intermediate steps. The conceptual idea is that this allows the model to offload additional computation to the intermediate tokens, analogous to human reasoning.
CoT reasoning offers interpretation of model behaviour. We have a suggestive trace of how the model arrived at that answer (though it would be naive to assume that the LLM uses the CoT trace reasoning exactly the way human logic operates).

Observations

An important sub-finding is that between CoT prompting and standard prompting (as per Brown 2020), CoT prompting's improvement gap significantly increases as we:

Increase the question difficulty (seems intuitive); and
Increase the model size. At the 8B parameter model size, the study showed no difference between CoT prompting and standard prompting. But the gap widened significantly at the 100B model size and widened further at the 500B model size.

Another interesting ablation study tried to isolate the improvements to a specific aspect of prompting:

Equation only. This would be something like 3 + 2 * 3 = 9 for the above example. This proved useful on some datasets with simpler steps, but was no different from standard prompting on GSM8K, which requires more semantic parsing of the question.
Variable compute only. One may argue that the exact tokens generated in the intermediate step do not matter much, and that all that matters is the additional compute the model performs to generate the intermediate tokens. Hence the authors prompt the model to generate dots ... as the reasoning step instead. This proves to not be helpful.
CoT after answer. Another argument is that including the CoT traces in the prompt improves in-context learning in and of themselves, meaning that the intermediate tokens are not actually necessary for improving the model's accuracy. The authors disprove this hypothesis by putting the CoT reasoning after the answer in the prompt, i.e. <Question, Answer, CoT reasoning>. This forces the LLM to generate the answer before it generates the CoT trace. This also proves to not be helpful.

Thus the ablation studies help to clarify that it is the intermediate natural language reasoning steps that help the model offload computation and improve the accuracy of its answers.

Results

The performance of CoT few-shot prompting compared to standard few-shot prompting on GSM8K is striking:

Using GPT-3 175B, performance increases from 15.6 to 46.9
Using PaLM 540B, performance increases from 17.9 to 56.9

Honovich 2022 - Instruction Induction

Instruction Induction: From Few Examples to Natural Language Task Descriptions

This is an early paper that suggests taking a random subset of training (input, label) pairs, showing them to an LLM, and asking the LLM to guess the instruction that produces the label from the inputs.

Method

For example, the LLM prompt may look something like:

Here are the input-output pairs:

Input: As soon as you can.
Output: At your earliest convenience.
... (+4 more input / output pairs)

The instruction was ...

The LLM may then guess something like The instruction was translate the inputs into more formal language. We can then use this instruction as the new optimized prompt.

Evaluation

The paper found that the execution accuracy of these prompts with the instruction fine-tuned version of GPT-3 (called InstructGPT) could reach similar levels to human-generated prompts on most simple tasks.

Notably, the authors were careful to control for the variation caused by the selection of few-shot examples:

An induce set of examples is held out from the training set
For each run, 5 input / output pairs are randomly drawn from the induce set, and used to generate a prompt
The evaluation accuracy of this prompt is recorded
The evaluation accuracy is averaged over 100 runs

In hindsight, by using the idea of self-consistency, the performance can likely be significantly improved by taking the majority vote over the 100 runs rather than the average accuracy.

Huang 2022 - LLMs can Self Improve

Large Language Models Can Self-Improve

The main idea of this paper is that we can improve the instruction tuning of LLMs for reasoning capabilities using their own synthetically generated data.

Method

We are given a pre-trained LLM and a question-only training dataset (e.g. GSM8K). We are also given a few Chain of Thought examples for few-shot prompting (an example comprises a question, a reasoning, and a correct answer).

The method is simple. For each question $x_{i}$ in the training set, we:

Sample $m$ reasoning paths and answers
Use majority voting from the $m$ answers to select the most consistent answer $\tilde{y}$ (this is called self-consistency in the literature). Importantly, to increase diversity:
- Set the temperature $T > 0$
- Apply mixed formats of prompts and answers
Keep all reasoning paths that lead to $\tilde{y}$ as our synthetic dataset
Fine-tune our LLM on the synthetic dataset using supervised fine-tuning
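The voting and filtering steps above can be sketched as follows. Sampling from the LLM is out of scope here, so the `samples` list is hard-coded, and the function name and output record format are assumptions of mine for illustration.

```python
from collections import Counter


def build_synthetic_examples(question, samples):
    """samples: list of (reasoning, answer) pairs sampled from the LLM.

    Returns the majority answer, its confidence (share of paths agreeing with
    it), and the reasoning paths kept for fine-tuning.
    """
    counts = Counter(answer for _, answer in samples)
    majority_answer, votes = counts.most_common(1)[0]
    confidence = votes / len(samples)
    kept = [
        {"question": question, "reasoning": r, "answer": a}
        for r, a in samples
        if a == majority_answer
    ]
    return majority_answer, confidence, kept


# Hypothetical sampled paths for one question (m = 4).
samples = [
    ("3 + 2 * 3 = 9", "9"),
    ("2 cans of 3 is 6, plus 3 is 9", "9"),
    ("3 + 2 = 5", "5"),
    ("3 + 6 = 9", "9"),
]
answer, confidence, kept = build_synthetic_examples(
    "How many balls does Roger have?", samples
)
print(answer, confidence, len(kept))  # 9 0.75 3
```

Note that `Counter.most_common` breaks ties by first occurrence; how ties should be handled is not something I have taken from the paper.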

Note that since we are using self-consistency to obtain "labels" for our synthetic dataset, we do not require training labels.

Observations

For this method to work, self-consistency needs to be a reliable way of getting accurate answers. The authors plot the confidence score of $\tilde{y}$ (% of paths leading to $\tilde{y}$ ) against the accuracy at that confidence level, and find that they are highly correlated. This implies that a highly consistent answer is a strong indication of correctness.

Generally performance increases as we increase the number of sampled paths $m$ . It seems to saturate at around $m = 32$ . Also, the ideal temperature is around 1.2, showing that diversity is important for this technique to work well.

Findings

The fine-tuned LLM significantly advances the SOTA performance:

GSM8K increases from 74.4 using self-consistency to 82.1 (using self-consistency on the fine-tuned LLM)

Tunstall 2022 - SetFit

Paper link.

The task which SetFit tackles is few-shot labelling of texts. Since labelled data is scarce, current SOTA methods rely on large language models and techniques like In Context Learning (ICL) or Parameter Efficient Fine Tuning (PEFT).

Wang 2022 - Self Consistency LLM

Self-Consistency Improves Chain of Thought Reasoning in Language Models

The main idea of this paper is that majority decoding (dubbed self-consistency) of sampled LLM responses significantly improves over CoT prompting (see Wei 2022).

Method

The idea is simple. As per (Wei 2022), we use CoT few shot examples in our prompt. Instead of just picking the answer from one run of the LLM, we sample many <CoT reasoning, Answer> pairs from the LLM with the same prompt. We then "marginalize out" the reasoning by just choosing the answer that occurs the most frequently.

Note that this method is unsupervised and the only cost to pay is the compute cost of multiple runs.

The name self-consistency comes from the idea that the most consistent answer given by the model is the most reliable one. The prior approach to task-based responses is to use greedy decoding, i.e. set $T = 0$ and get the highest likelihood answer. This paper shows that an ensemble of diverse answers is far more effective than the greedy approach.

We may think of this idea as analogous to random forests. Creating an ensemble over diverse weak learners improves performance significantly, and increasing the diversity of the learners (via column or row sampling) up to a certain point helps to improve performance.

Note that the idea of self-consistency decoding is orthogonal to the specific choice of CoT prompting. Specifically, self-consistency gives significant improvements with several forms of CoT prompting, including:

Few shot in-context CoT prompts (Wei 2022)
Zero shot Let's think step by step prompt (Kojima 2022)

Parameters

To generate diverse reasoning paths, the authors mainly refer to methods used in other papers:

For PaLM-540B they used T=0.7 and k=40 with top-k token truncation
For GPT-3 they used T=0.7 without top-k token truncation
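For reference, temperature scaling with optional top-k truncation over a vector of next-token logits can be sketched as below. This is a generic illustration with made-up logits and my own function name, not the decoding code of either model.

```python
import numpy as np


def sample_token(logits, temperature=0.7, top_k=None, rng=None):
    """Sample one token id from `logits` with temperature and optional top-k truncation."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None and top_k < logits.size:
        cutoff = np.sort(logits)[-top_k]              # k-th largest logit
        # Ties at the cutoff are all kept, so slightly more than k may survive.
        logits = np.where(logits >= cutoff, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(logits.size, p=probs))


rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5, -1.0, -3.0]   # hypothetical logits over a 5-token vocabulary
draws = [sample_token(logits, temperature=0.7, top_k=2, rng=rng) for _ in range(200)]
print(sorted(set(draws)))              # only the two highest-logit tokens can appear
```

Lower temperatures sharpen the distribution towards the greedy choice, while higher temperatures flatten it and increase the diversity of sampled reasoning paths.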

For the paper, the authors mainly sampled 40 paths.

Rafailov 2023 - Direct Preference Optimization

Here we trace the derivations from the DPO paper. Denote the model after SFT as $π^{SFT} (y ∣ x)$ , i.e. the policy $π^{SFT}$ is a probability function for each pair of inputs and answers (x, y). Naturally, we can use this policy to generate tokens by choosing the response with the highest probability (or approximate it in a greedy token-by-token manner).

To perform RLHF, we first need to build a reward model. First, we prompt the SFT model to obtain pairs of answers $(y_{1}, y_{2}) \sim π^{SFT} (y ∣ x)$ . These samples are presented to human labellers who record their preferences in the form $y_{w} ▹ y_{l} ∣ x$ , where $y_{w}$ wins and $y_{l}$ loses. These preferences are assumed to be generated from an underlying reward model $r^{*} (x, y)$ which we do not have access to.

We wish to learn this reward model. Since we only have access to pairwise preferences instead of the reward score, a common approach is to model the pairwise preferences using the Bradley-Terry model. Specifically, we assume that the observed human preference decisions are related to the underlying human reward model in the following way:

$p^{*} (y_{1} ▹ y_{2} ∣ x) = \frac{e ^{r^{*} (x, y_{1})}}{e ^{r^{*} (x, y_{1})} + e ^{r^{*} (x, y_{2})}} (1)$

Suppose we have a static dataset of comparisons $D = \{x^{(i)}, y_{w}^{(i)}, y_{l}^{(i)}\}_{i = 1}^{N}$ sampled from the human preference model $p^{*}$ . We can create a reward model $r_{ϕ} (x, y)$ and use the BT-model to express the negative log likelihood of $D$ . Note that we are using the expression of the BT-model as a sigmoid function, since $\frac{e^{a}}{e^{a} + e^{b}} = σ (a - b)$ . With this NLL expression, we can optimize for $r_{ϕ}$ using gradient descent to learn the reward model from $D$ . Notice that the heart of Equation (2) is essentially just a difference in reward between the winning answer $y_{w}$ and losing answer $y_{l}$ .

$L (r_{ϕ}, D) = - E_{(x, y_{w}, y_{l}) \sim D} [\log σ (r_{ϕ} (x, y_{w}) - r_{ϕ} (x, y_{l}))] (2)$

Note that $r_{ϕ}$ is usually initialized using the SFT model $π^{SFT}$ , but with an additional linear layer over the final transformer layer to generate a scalar value as the reward.
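As a small numerical illustration of Equation (2), the sketch below evaluates the loss on hypothetical reward scores. The function names and the numbers are made up; a real reward model would produce the scores from (x, y) pairs.

```python
import numpy as np


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z)) = -log(1 + exp(-z))."""
    return -np.logaddexp(0.0, -z)


def reward_model_nll(r_w, r_l):
    """Equation (2): mean of -log sigmoid(r(x, y_w) - r(x, y_l)) over the dataset."""
    return float(np.mean(-log_sigmoid(np.asarray(r_w) - np.asarray(r_l))))


# Hypothetical reward scores for three (winner, loser) pairs.
good = reward_model_nll(r_w=[2.0, 1.5, 0.5], r_l=[0.0, -1.0, 0.0])
bad = reward_model_nll(r_w=[0.0, -1.0, 0.0], r_l=[2.0, 1.5, 0.5])
print(good < bad)  # scoring winners above losers gives the lower loss
```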

Having learned the reward model, we then need to use the learned reward function to fine-tune $π^{SFT}$ using reinforcement learning. Specifically, we set $π_{re f} := π^{SFT}$ as the reference model, and initialize a new $π_{θ}$ model that we wish to train. Usually, $π_{θ}$ is also initialized as a copy of $π^{SFT}$ . The objective function we wish to maximize is:

$\max_{π_{θ}} E_{x \sim D, y \sim π_{θ} (y ∣ x)} [r_{ϕ} (x, y)] - β \cdot D_{KL} [π_{θ} (y ∣ x) ∣∣ π_{ref} (y ∣ x)] (3)$

Inspecting this objective, we see that we are trying to tune $π_{θ}$ such that it generates answers that maximize the learned reward function $r_{ϕ}$ , while at the same time ensuring that we do not deviate too far from the original reference model. $β$ is a hyperparameter controlling the degree of deviation. This penalty constraint serves to:

Ensure that we do not drift too far from the (x, y) distribution on which the reward model is accurate
Ensure that we maintain generational diversity and not just collapse into a single high-reward answer for a given prompt

Objective (3) cannot be optimized directly with gradient descent, because the expectation is taken over responses $y \sim π_{θ} (y ∣ x)$ sampled from the current policy, and sampling discrete tokens is not differentiable. Hence it is typically optimized with reinforcement learning, using PPO.

Direct Preference Optimization

The RLHF process in general is unstable, requires more memory / computation and requires tricks to make it work. Hence the authors of DPO set out to create an optimization procedure that:

Avoids fitting an explicit, standalone reward model
Avoids using reinforcement learning

DPO starts off with the KL-constrained reward maximization objective from Equation (3) above. The first step is to show that the optimal policy for this objective is of the following form for an arbitrary reward model $r (x, y)$ :

$π_{r} (y ∣ x) = \frac{1}{Z (x)} π_{ref} (y ∣ x) \exp (\frac{1}{β} r (x, y)), \quad \text{where partition fn } Z (x) = \sum_{y} π_{ref} (y ∣ x) \exp (\frac{1}{β} r (x, y)) \quad (4)$

The derivation for Equation (4) is as follows. For a given reward function $r (x, y)$ :

$\begin{aligned}
& \max_{π} E_{x \sim D, y \sim π (y ∣ x)} [r (x, y)] - β \cdot D_{KL} [π (y ∣ x) ∣∣ π_{ref} (y ∣ x)] \\
=& \max_{π} E_{x \sim D, y \sim π (y ∣ x)} [r (x, y) - β \log \frac{π (y ∣ x)}{π_{ref} (y ∣ x)}] \\
=& \min_{π} E_{x \sim D, y \sim π (y ∣ x)} [\log \frac{π (y ∣ x)}{π_{ref} (y ∣ x)} - \frac{1}{β} r (x, y)] \\
=& \min_{π} E_{x \sim D, y \sim π (y ∣ x)} [\log \frac{π (y ∣ x)}{\frac{1}{Z (x)} π_{ref} (y ∣ x) \exp (\frac{1}{β} r (x, y))} - \log Z (x)] \\
& \text{where } Z (x) = \sum_{y} π_{ref} (y ∣ x) \exp (\frac{1}{β} r (x, y))
\end{aligned}$

For line 2 above, recall that $D_{KL} (P ∣∣ Q)$ is the expected value of $\log P (Y) - \log Q (Y)$ if the random variable $Y$ is drawn from $P$ . Since the outer expectation is over draws from $π$ , we can break down the KL-divergence by bringing the log difference into the expectation. Line 3 simply divides by $- β$ and flips max to min. Line 4 uses $\log \exp$ to bring the reward term into the denominator of the left term, then introduces an arbitrary $Z (x)$ . Note that the two $Z (x)$ can be cancelled out if we brought them together, but we will be using them later on.

Now let us define the optimal policy $π^{*} (y ∣ x) = \frac{1}{Z ( x )} π_{re f} (y ∣ x) exp (\frac{1}{β} r (x, y))$ . We will need to prove that $π^{*}$ is indeed optimal. Note that $π^{*}$ is a valid probability distribution as:

$π^{*} (y ∣ x) \geq 0 \forall y$ ; and
$\sum_{y} π^{*} (y ∣ x) = 1$ , since the denominator is just the sum over $y$ of the numerator

Since $Z (x)$ is not a function of $y$ , we can sub in $π^{*}$ and take $lo g Z (x)$ out. The left term becomes a KL-divergence between $π$ which we are optimizing over and the optimal policy $π^{*}$ .

$\min_{π} E_{x \sim D} [E_{y \sim π (y ∣ x)} [\log \frac{π (y ∣ x)}{π^{*} (y ∣ x)}] - \log Z (x)] = \min_{π} E_{x \sim D} [D_{KL} (π (y ∣ x) ∣∣ π^{*} (y ∣ x)) - \log Z (x)]$

Finally, note that $Z (x)$ does not depend on $π$ , so we only need to consider the KL-divergence term. Gibbs' inequality tells us that KL-divergence is minimized at $0$ if and only if the two distributions are identical. This completes our derivation of (4) by showing that $π^{*}$ is indeed the optimal policy.

Now that we have completed the derivation, let's consider what Equation (4) is saying. It tells us that we have an analytical solution for the policy $π_{r}$ that optimizes (3), and that it can be expressed in terms of $π_{re f}$ (which we already have) and a given reward function $r (x, y)$ .

Since we previously learned a reward model $r_{ϕ}$ , we could simply plug that into (4) to get our optimal policy. Specifically, for a new input prompt $x$ , we can compute $π_{re f} (y ∣ x) exp (\frac{1}{β} r_{ϕ} (x, y))$ for all possible values of $y$ and pick the best $y$ . We can ignore $Z (x)$ since it is fixed for a given prompt $x$ . Intuitively, the new model scales the probability of high reward answers up with an exponential multiplier, and the degree of scaling is controlled by $β$ . However, this is not computationally practical as we need to evaluate over a very large space of $y ∣ x$ (i.e. all possible answers for a given prompt $x$ ).
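On a toy problem where $y$ only takes a handful of values, we can check Equation (4) numerically. The sketch below uses made-up values for the reference policy, rewards and $β$ for a single prompt, and compares the closed-form policy against randomly drawn policies on the KL-constrained objective.

```python
import numpy as np


def objective(pi, pi_ref, r, beta):
    """E_{y~pi}[r(y)] - beta * KL(pi || pi_ref) for a single prompt."""
    pi = np.clip(pi, 1e-12, 1.0)   # guard against log(0) for sampled policies
    return float(np.sum(pi * r) - beta * np.sum(pi * np.log(pi / pi_ref)))


def optimal_policy(pi_ref, r, beta):
    """Equation (4): pi_ref * exp(r / beta), normalized by Z."""
    unnorm = pi_ref * np.exp(r / beta)
    return unnorm / unnorm.sum()


rng = np.random.default_rng(0)
pi_ref = np.array([0.5, 0.3, 0.15, 0.05])   # hypothetical reference policy over 4 answers
r = np.array([0.0, 1.0, 2.0, 3.0])          # hypothetical rewards
beta = 0.5
pi_star = optimal_policy(pi_ref, r, beta)
best = objective(pi_star, pi_ref, r, beta)
# No randomly drawn policy should beat pi_star on the objective.
others = rng.dirichlet(np.ones(4), size=1000)
print(all(objective(p, pi_ref, r, beta) <= best + 1e-9 for p in others))
```

In this toy setting the objective value at the closed-form policy equals $β \log Z$ , which follows from substituting Equation (4) into the objective.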

Hence, we want to find a form of the optimal policy which does not involve the partition function $Z (x)$ nor the reward model $r (x, y)$ . We start by taking the log on both sides of Equation (4) and re-arranging:

$r (x, y) = β \log \frac{π _{r} ( y ∣ x )}{π _{ref} ( y ∣ x )} + β \log Z (x) (5)$

Since Equation (5) holds for any arbitrary reward model, we can use the optimal (unknown) human reward model back in Equation (1), $r := r^{*}$ . Also, note that $π_{r}$ in Equation (5) refers to the optimal policy under $r$ , so since we are using the optimal reward $r^{*}$ , we can call this optimal policy $π^{*}$ as well. Now we plug Equation (5) back into Equation (1).

$\begin{aligned}
p^{*} (y_{1} ▹ y_{2} ∣ x) &= \frac{\exp (r^{*} (x, y_{1}))}{\exp (r^{*} (x, y_{1})) + \exp (r^{*} (x, y_{2}))} \\
&= \frac{\exp (β \log \frac{π^{*} (y_{1} ∣ x)}{π_{ref} (y_{1} ∣ x)} + β \log Z (x))}{\exp (β \log \frac{π^{*} (y_{1} ∣ x)}{π_{ref} (y_{1} ∣ x)} + β \log Z (x)) + \exp (β \log \frac{π^{*} (y_{2} ∣ x)}{π_{ref} (y_{2} ∣ x)} + β \log Z (x))} \\
&= \frac{1}{1 + \exp (β \log \frac{π^{*} (y_{2} ∣ x)}{π_{ref} (y_{2} ∣ x)} - β \log \frac{π^{*} (y_{1} ∣ x)}{π_{ref} (y_{1} ∣ x)})} \\
&= σ (β \log \frac{π^{*} (y_{1} ∣ x)}{π_{ref} (y_{1} ∣ x)} - β \log \frac{π^{*} (y_{2} ∣ x)}{π_{ref} (y_{2} ∣ x)}) && (6)
\end{aligned}$

The derivation of the above is quite simple: we just need to note that it is of the form $\exp (a) / (\exp (a) + \exp (b)) = σ (a - b)$ , i.e. the expression of the BT-model as a sigmoid function. The great thing is that the pesky partition function $Z (x)$ cancels out because the BT-model simply ends up with the difference between two scores / rewards.

Equation (6) looks simple but is quite remarkable when we compare it to Equation (1), because we now have the probability of the human preference data in terms of just the reference policy $π_{re f}$ and the optimal policy $π^{*}$ , without the reward model at all! Of course, the reward model is implicitly embedded in the equation. If we stare at Equation (6), we see the implicit reward function is $r (x, y) = β lo g \frac{π ^{*} ( y ∣ x )}{π _{re f} ( y ∣ x )}$ .

The benefit of this new formulation is that we can write the maximum likelihood objective in terms of the optimal policy. We do not know the optimal policy, but we can parametrize it as $π_{θ}$ and use the maximum likelihood objective to train the model. Our new policy objective becomes:

$L_{DPO} (π_{θ}; π_{ref}) = - E_{(x, y_{w}, y_{l}) \sim D} [\log σ (β \log \frac{π _{θ} ( y _{w} ∣ x )}{π _{ref} ( y _{w} ∣ x )} - β \log \frac{π _{θ} ( y _{l} ∣ x )}{π _{ref} ( y _{l} ∣ x )})] (7)$
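Written as a loss to minimize (i.e. the negative log likelihood), the DPO objective only needs four sequence log-probabilities per preference pair. Below is a minimal numpy sketch with made-up log-probabilities; the function name and the default $β = 0.1$ are assumptions for illustration, and a real implementation would obtain the log-probabilities from the policy and reference LLMs.

```python
import numpy as np


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from sequence log-probabilities.

    logp_*     : log pi_theta(y | x) for the winning / losing responses
    ref_logp_* : log pi_ref(y | x) for the same responses
    """
    reward_w = beta * (np.asarray(logp_w) - np.asarray(ref_logp_w))  # implicit reward of y_w
    reward_l = beta * (np.asarray(logp_l) - np.asarray(ref_logp_l))  # implicit reward of y_l
    margin = reward_w - reward_l
    return float(np.mean(np.logaddexp(0.0, -margin)))  # -log sigmoid(margin)


# Policy identical to the reference: margin is 0, so the loss is log(2).
start = dpo_loss([-10.0], [-12.0], [-10.0], [-12.0])
# Policy that raised the winner's and lowered the loser's log-probability.
better = dpo_loss([-8.0], [-14.0], [-10.0], [-12.0])
print(start, better < start)
```

Since training usually starts with $π_{θ}$ as a copy of the reference, the loss starts at $\log 2$ and decreases as the policy shifts probability towards the winning responses relative to the reference.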

Application to Dual Encoder Retrieval

The DPO framework offers a way to fine-tune a policy according to human preferences, whilst ensuring stability against a reference model. In the original formulation, $π (y ∣ x)$ represents the probability of the generative model generating $y$ given an input prompt $x$ .

A related problem is that of fine-tuning embedding models for search retrieval, in what is known as the dual encoder framework. In this case, we also have preference data in the form of triplets (query, positive_passage, negative_passage), and we wish to fine-tune embeddings such that the dot-product or cosine similarity between (query, positive_passage) is high whilst that of (query, negative_passage) is low. In this formulation, we could let the policy $π (q, p)$ represent the normalized probability of a relevant match between query and passage. We can then borrow the framework of DPO to fine-tune our embeddings. The benefit of DPO compared to typical optimization objectives for dual encoder is the stability of the policy against the reference policy, which hopefully is a good form of regularization even when preference data is limited.

Specifically, we define a dual encoder policy $π (q, p)$ , where $q$ represents a query and $p$ represents a passage, like so:

Encode $q$ into a vector $\in R^{d}$ using a BERT model
Encode $p$ into a vector $\in R^{d}$ using a BERT model (can be same model as the query encoder)
Take the dot product
Run the dot product through a sigmoid layer to convert it into a probability

We can then apply this method to a reference encoder model and call it $π_{ref}(q, p)$, and optimize another encoder model against this reference model. Will need to conduct experiments to see if this method offers gains against the typical dual encoder objectives, such as Karpukhin 2020.
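As a rough illustration of the idea above, here is a minimal NumPy sketch of Equation (7) applied to the sigmoid-of-dot-product policy. It is an assumption-laden sketch, not a tested training recipe: random vectors stand in for the BERT embeddings, the function names and `beta=0.1` are hypothetical, and the objective is negated so that lower is better.

```python
import numpy as np

def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)) = -log(1 + exp(-x)).
    return -np.logaddexp(0.0, -x)

def policy_log_prob(q, p):
    # log pi(q, p): sigmoid of the dot product between query and passage embeddings.
    return log_sigmoid(np.sum(q * p, axis=-1))

def dpo_dual_encoder_loss(q, pos, neg, q_ref, pos_ref, neg_ref, beta=0.1):
    # Implicit reward of each passage: beta * log(pi_theta / pi_ref).
    reward_pos = beta * (policy_log_prob(q, pos) - policy_log_prob(q_ref, pos_ref))
    reward_neg = beta * (policy_log_prob(q, neg) - policy_log_prob(q_ref, neg_ref))
    # Negated Equation (7), averaged over the batch of triplets.
    return -np.mean(log_sigmoid(reward_pos - reward_neg))

rng = np.random.default_rng(0)
batch, dim = 4, 8
# Stand-ins for embeddings produced by the frozen reference encoder.
q_ref, pos_ref, neg_ref = (rng.normal(size=(batch, dim)) for _ in range(3))

# If the trainable encoder equals the reference encoder, both rewards are zero.
loss_at_init = dpo_dual_encoder_loss(q_ref, pos_ref, neg_ref, q_ref, pos_ref, neg_ref)

# Moving positive passages towards their queries raises their implicit reward.
pos_moved = pos_ref + 0.5 * q_ref
loss_improved = dpo_dual_encoder_loss(q_ref, pos_moved, neg_ref, q_ref, pos_ref, neg_ref)
```

At initialization the loss equals $\log 2$, since both implicit rewards are zero. It only drops when the trainable encoder ranks the positive passage higher relative to the reference than the negative one.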

References

Blecher 2023 - Nougat

Blecher 2023 - Nougat: Neural Optical Understanding for Academic Documents.

Nougat is an end-to-end system that converts a scientific PDF into a sequence of tokens in markdown format in an auto-regressive way.

Prior methods for Visual Document Understanding (VDU) usually rely on an external OCR service to generate intermediate outputs. In contrast, this method is end-to-end, and the text is generated directly from image embeddings in a decoder manner. Thus, the model is very simple and most of the work in this paper is in data preparation.

Model

Encoder. The encoder gets a variable size document image $x \in R^{3 \times H_{0} \times W_{0}}$ and applies crop / resizing to generate a fixed rectangle of size $(H, W)$. Smaller images are white padded. The fixed size image can then be passed into a Swin Transformer to output a sequence of embedded patches $z \in R^{d \times N}$ where $d$ is the latent dimension and $N$ is the number of patches.

Dong 2023 - MINE Loss

Revisiting Recommendation Loss Functions through Contrastive Learning

This paper compares several recommendation loss functions like BPR and CCL, and introduces two new losses: InfoNCE+ and MINE+.

Setup

Let $U, I$ denote the user and item sets. We denote that:

Each user $u \in U$ has interactions with $I_{u}^{+}$ items, and has no interactions with the remaining $I ∖ I_{u}^{+}$ set of items.
On the item side, $U_{i}^{+}$ denotes all users who interacted with item $i$
We can also denote $r_{u i} = 1$ if there was an interaction and $r_{u i} = 0$ otherwise

Let the latent embeddings $v_{u}, v_{i}$ represent user $u$ and item $i$ respectively. The similarity measure between them is then denoted $\hat{y}_{ui} = \langle v_{u}, v_{i} \rangle$.

BPR Loss

The most widely used loss is Bayesian Personalized Ranking:

$L_{BPR} = E_{u} E_{i \sim p_{u}^{+}} \sum_{j \sim p_{i}}^{N} - \log σ(\hat{y}_{ui} - \hat{y}_{uj}) = E_{u} E_{i \sim p_{u}^{+}} \sum_{j \sim p_{i}}^{N} \log(1 + \exp(\hat{y}_{uj} - \hat{y}_{ui}))$

Note that for each user, we take the expectation over the set of items relevant to him. We then sample $N$ negatives from the overall item distribution (usually uniformly at random).
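A small NumPy sketch of the BPR loss for a batch of users, assuming one positive item and $N$ sampled negatives per user. The scores are random stand-ins and the function name is hypothetical. It also checks numerically that $-\log σ(x)$ equals $\log(1 + e^{-x})$.

```python
import numpy as np

def bpr_loss(pos_scores, neg_scores):
    # pos_scores: shape (B,), score y_ui of one positive item per user.
    # neg_scores: shape (B, N), scores y_uj of N sampled negatives per user.
    diff = pos_scores[:, None] - neg_scores
    # -log(sigmoid(diff)) == log(1 + exp(-diff)), computed stably.
    return np.mean(np.sum(np.logaddexp(0.0, -diff), axis=1))

rng = np.random.default_rng(0)
pos_scores = rng.normal(size=5)
neg_scores = rng.normal(size=(5, 3))
loss = bpr_loss(pos_scores, neg_scores)

# Direct evaluation of the -log(sigmoid(.)) form, for comparison.
sigmoid = 1.0 / (1.0 + np.exp(-(pos_scores[:, None] - neg_scores)))
loss_direct = np.mean(np.sum(-np.log(sigmoid), axis=1))
```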

Softmax Loss

One common approach is to model $P (I = i ∣ U = u)$ as an extreme classification problem where $I$ is a very large set. The probability may then be modeled as a softmax: $P (I = i ∣ U = u) = \frac{e ^{v_{u}^{T} v_{i}}}{\sum _{j \in I} e ^{v_{u}^{T} v_{j}}}$

In practice, it is infeasible to compute over the large item set, so we sample negative candidates for the denominator. The sampling is then corrected via importance weighting.

Covington 2016 for such an approach.
Jean 2014 for importance weighting correction

Using this approach, the loss may be formulated as: $L_{soft} = - E_{u} \log \sum_{i \in I_{u}^{+}} \frac{e^{\hat{y}_{ui}}}{e^{\hat{y}_{ui}} + \sum_{j \sim p_{u}^{-}}^{N} e^{\hat{y}_{uj}}}$

Note that $p_{u}^{-}$ is a negative sampling distribution for each user $u$ , and is typically implemented as $p$ which is based on item popularity.

Contrastive Learning Loss (InfoNCE)

InfoNCE loss looks very similar to the sampled softmax loss, although the motivation is different. The key idea is to pull similar points closer and push dissimilar points apart. InfoNCE loss is the most famous contrastive learning loss:

$L_{info} = - E_{u} \sum_{i \in I_{u}^{+}} \log \frac{e^{\hat{y}_{ui}}}{e^{\hat{y}_{ui}} + \sum_{j \sim p_{i}^{-}}^{N} e^{\hat{y}_{uj}}} = - E_{u} \sum_{i \in I_{u}^{+}} \left[ \hat{y}_{ui} - \log \left( e^{\hat{y}_{ui}} + \sum_{j \sim p_{i}^{-}}^{N} e^{\hat{y}_{uj}} \right) \right]$

Note that the only difference from the sampled softmax loss is that the $lo g$ is inside the sum rather than outside. The InfoNCE loss has been shown to maximize the mutual information between user $u$ and item $i$ and minimize mutual information between unrelated pairs.
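To make the "log inside vs outside the sum" distinction concrete, here is a sketch for a single user with several positive items. The scores are random stand-ins and the function names are hypothetical. The outer expectation over users is omitted.

```python
import numpy as np

def per_positive_ratio(pos, neg):
    # pos: (P,) scores of one user's positive items; neg: (P, N) sampled negatives.
    # Returns e^{y_ui} / (e^{y_ui} + sum_j e^{y_uj}) for each positive item.
    return np.exp(pos) / (np.exp(pos) + np.exp(neg).sum(axis=1))

def sampled_softmax_loss(pos, neg):
    # The log is OUTSIDE the sum over positives.
    return -np.log(per_positive_ratio(pos, neg).sum())

def info_nce_loss(pos, neg):
    # The log is INSIDE the sum over positives.
    return -np.log(per_positive_ratio(pos, neg)).sum()

rng = np.random.default_rng(0)
pos = rng.normal(size=3)
neg = rng.normal(size=(3, 4))
loss_soft = sampled_softmax_loss(pos, neg)
loss_info = info_nce_loss(pos, neg)
```

With a single positive item the two losses coincide. With several positives, InfoNCE is never smaller, because each ratio is below 1, so the product of the ratios cannot exceed their sum.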

Empirical Exploration of InfoNCE+

The authors propose InfoNCE+, which just adds some hyperparameters to InfoNCE and tunes them empirically. InfoNCE+ adds $ϵ$ and $λ$:

$L_{info+} = - E_{u} \sum_{i \in I_{u}^{+}} (\hat{y}_{ui} - λ \cdot \log \mathcal{N}), \quad \mathcal{N} = ϵ \cdot e^{\hat{y}_{ui}} + \sum_{j \sim p_{u}^{-}}^{N} e^{\hat{y}_{uj}}$

Empirically, the authors find that setting $ϵ = 0$ and $λ = 1.1$ usually works best (tbh, the empirical evidence is not super convincing).

Theoretical Support for Removing Positive term from Denominator

As we can see, setting $ϵ = 0$ effectively removes the positive term $e^{\hat{y}_{ui}}$ from the denominator of the loss. This makes intuitive sense, as keeping the positive term in the denominator would constrain $\hat{y}_{ui}$ from increasing, which is the opposite of what we want.

This has theoretical backing as well, as explored in Decoupled Contrastive Learning - Yeh 2022. The DCL paper also shows that removing the positive term from the denominator leads to more stable training and less hyperparameter sensitivity.

The DCL loss is thus: $L_{DCL} = - E_{u} \sum_{i \in I_{u}^{+}} \left[ \hat{y}_{ui} - \log \sum_{j \sim p_{u}^{-}}^{N} e^{\hat{y}_{uj}} \right]$
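A sketch of how InfoNCE, InfoNCE+ and DCL relate, for a single user. It reads the InfoNCE+ denominator as $ϵ \cdot e^{\hat{y}_{ui}} + \sum_{j} e^{\hat{y}_{uj}}$, which is my interpretation of the formula. The function names and random scores are hypothetical.

```python
import numpy as np

def info_nce_plus_loss(pos, neg, eps=0.0, lam=1.1):
    # pos: (P,) positive scores of one user; neg: (P, N) sampled negative scores.
    denom = eps * np.exp(pos) + np.exp(neg).sum(axis=1)
    return -(pos - lam * np.log(denom)).sum()

def info_nce_loss(pos, neg):
    # eps = 1 and lam = 1 recovers plain InfoNCE.
    return info_nce_plus_loss(pos, neg, eps=1.0, lam=1.0)

def dcl_loss(pos, neg):
    # eps = 0 and lam = 1 drops the positive term from the denominator (DCL).
    return info_nce_plus_loss(pos, neg, eps=0.0, lam=1.0)

rng = np.random.default_rng(0)
pos = rng.normal(size=3)
neg = rng.normal(size=(3, 4))

loss_info = info_nce_loss(pos, neg)
loss_dcl = dcl_loss(pos, neg)
loss_plus = info_nce_plus_loss(pos, neg)  # eps = 0, lam = 1.1 as reported best
```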

The authors also show that this "decoupling" is justified from the Mutual Information Neural Estimator perspective. Specifically, the MINE paper shows that we can estimate the true mutual information between each user $u$ and item $i$ by the following optimization problem:

$I(u, i) = \sup_{v_{u}, v_{i}} E_{p_{u, i}}(\hat{y}_{ui}) - \log E_{p_{u} \otimes p_{i}}(e^{\hat{y}_{ui}})$

Intuitively, we want to maximize the above equation over the similarity function parametrized by the embeddings $v_{i}, v_{u}$ .

The first term takes an expectation of similarity scores over the joint user, item distribution where an interaction occurs (i.e. positive pairs).
The second term takes an expectation of exponentiated similarity scores over the product measure of marginal user and item distributions (i.e. assuming independence between user and item distribution).

MINE Loss

The authors then say that a "simple" adaptation of the MINE problem to the recommendation setting is formalized as the MINE loss:

$L_{mine} = - E_{u} E_{i \sim p_{i}^{+}} [\hat{y}_{ui} - \log E_{j \sim p_{i}}(e^{\hat{y}_{uj}})]$

Not too sure how this is derived from the above.

They also add a hyperparameter $λ$ to control the relative weightage of the positive and negative samples. This results in what they term as MINE+:

$L_{mine+} = - E_{u} E_{i \sim p_{i}^{+}} [\hat{y}_{ui} - λ \log E_{j \sim p_{i}}(e^{\hat{y}_{uj}})]$

Based on some ablation studies, they find that $λ$ between 1.1 and 1.2 usually works best.
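A minimal sketch of MINE+ with the inner expectation replaced by a sample mean over $N$ negatives. This is my assumption about how one would estimate it in practice. The scores are random stand-ins and the function name is hypothetical.

```python
import numpy as np

def mine_plus_loss(pos, neg, lam=1.1):
    # pos: (B,) positive scores; neg: (B, N) scores of N items sampled from p_i.
    # log E_j[e^{y_uj}] is estimated by a sample mean, computed via log-sum-exp.
    m = neg.max(axis=1)
    log_mean_exp = m + np.log(np.exp(neg - m[:, None]).mean(axis=1))
    return -np.mean(pos - lam * log_mean_exp)

rng = np.random.default_rng(0)
pos = rng.normal(size=6)
neg = rng.normal(size=(6, 10))

loss_mine = mine_plus_loss(pos, neg, lam=1.0)   # plain MINE
loss_mine_plus = mine_plus_loss(pos, neg)       # lam = 1.1
```

Under this sample-mean estimate and $λ = 1$, each term differs from the corresponding DCL term only by the constant $\log N$, since the log of a mean is the log of a sum minus $\log N$.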

The paper also offers some lower bound analysis and de-biasing of InfoNCE which I will not delve into for now.

Liu 2023 - Meaning Representations from Trajectories

Paper Link

This paper demonstrates a way to compute sentence similarities using auto-regressive LLMs with a decoder-only architecture.

In the encoder-only setting, characterized by BERT models, a sentence is typically represented by the embedding generated by the encoder at the final layer. Typically, either the [CLS] token embedding is taken, or the element-wise mean of the embeddings at the final layer is taken to represent the sentence. As Reimers 2019 demonstrates, the default BERT trained using the masked language modelling objective does not automatically have good embeddings suitable for Semantic Textual Similarity tasks. Hence the BERT model needs to be fine-tuned with some sentence similarity training task to generate good embeddings.

This paper is interesting as it does not require any such fine-tuning. The main idea is to represent a sentence (sequence of tokens) by the distribution of text continuations (denoted as trajectories) generated by the auto-regressive LLM on the context sentence. In other words, as the paper elegantly puts it, we represent the meaning of linguistic terms by their usage.

Setup

Let $A$ denote a finite vocabulary of tokens and $A^{*}$ indicate the set of all variable-length finite sequences of tokens in $A$ . A language model may be viewed as a map $M (\cdot ∣ \cdot) : A^{*} \times A^{*} \to [0, 1]$ that associates a prompt sequence $s \in A^{*}$ and a possible continuation sequence $t \in A^{*}$ with a likelihood score $M (t ∣ s) \in [0, 1]$ .

The paper chose inverse perplexity as the score: $M(t = (a_{1} ... a_{m}) ∣ s) := \prod_{i=1}^{m} P_{M}(a_{i} ∣ s a_{1} ... a_{i-1})^{1/m}$

Side note on inverse perplexity. We know that perplexity is as below. Hence the above score is indeed the inverse perplexity. $\text{Perplexity}(M) = \exp \left[ - \frac{1}{m} \sum_{i=1}^{m} \log P_{M}(a_{i} ∣ s a_{1} ... a_{i-1}) \right] = \exp \left[ \log \prod_{i=1}^{m} P_{M}(a_{i} ∣ s a_{1} ... a_{i-1})^{-1/m} \right] = \prod_{i=1}^{m} P_{M}(a_{i} ∣ s a_{1} ... a_{i-1})^{-1/m}$

Given two prompts $u$ and $v$, we denote their semantic representation under model $M$ as $M_{u}$ and $M_{v}$. The semantic distance between them may be denoted $d(M_{u}, M_{v})$ where $d$ is some distance function. The paper chose: $d(M_{u}, M_{v}) = E_{t \sim \frac{1}{2}(M_{u} + M_{v})} \left| \log M_{u}(t) - \log M_{v}(t) \right| = E_{t \sim \frac{1}{2}(M_{u} + M_{v})} \frac{1}{m} \left| \sum_{i=1}^{m} \log \frac{P_{M}(a_{i} ∣ u, a_{<i})}{P_{M}(a_{i} ∣ v, a_{<i})} \right|$

In other words, we pick one of the two prompts with equal probability, sample a continuation trajectory $t$ of length m from it, and compute the expected absolute difference in log-likelihood of that continuation under $M_{u}$ and $M_{v}$. Since it is not feasible to integrate the above expression over all possible trajectories $t$, we sample $n$ trajectories and compute an estimate of the expected difference.
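The estimator can be sketched with a toy language model. Everything below is illustrative: a random bigram model (hypothetical `TRANSITIONS` matrix) stands in for the LLM, so a prompt only influences continuations through its last token.

```python
import numpy as np

VOCAB = 5
rng = np.random.default_rng(0)
# Toy stand-in for an LLM: row k holds P(next token | last token = k).
TRANSITIONS = rng.dirichlet(np.ones(VOCAB), size=VOCAB)

def sample_trajectory(prompt, m, rng):
    # Sample m continuation tokens autoregressively after the prompt.
    last, tokens = prompt[-1], []
    for _ in range(m):
        last = int(rng.choice(VOCAB, p=TRANSITIONS[last]))
        tokens.append(last)
    return tokens

def log_score(prompt, trajectory):
    # Log of the inverse perplexity: mean log-probability of the continuation.
    last, total = prompt[-1], 0.0
    for token in trajectory:
        total += np.log(TRANSITIONS[last, token])
        last = token
    return total / len(trajectory)

def trajectory_distance(u, v, n=20, m=20, seed=0):
    rng = np.random.default_rng(seed)
    diffs = []
    for _ in range(n):
        # Draw t from the equal mixture of the two prompts' continuations.
        source = u if rng.random() < 0.5 else v
        t = sample_trajectory(source, m, rng)
        diffs.append(abs(log_score(u, t) - log_score(v, t)))
    return float(np.mean(diffs))

d_same = trajectory_distance([0, 1, 2], [3, 1, 2])  # prompts share the last token
d_diff = trajectory_distance([0, 1, 2], [0, 1, 4])  # prompts end differently
```

In this toy model, prompts ending in the same token are indistinguishable, so their distance is exactly zero. A real LLM conditions on the entire prompt.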

Results

For the experiments, the authors set n=20, m=20, and sampling temperature $λ$ =1.0.

Ablating on number of trajectories n shows that most of the performance is achieved with n=20, with marginal gains thereafter
Ablating on sequence length m shows that most of the performance is achieved with m=20, with marginal gains thereafter. Interestingly, there is a big improvement in performance from m=1 to m=10, showing that the distribution of the next token alone is not sufficiently discriminating as a semantic representation.
Ablating on sampling temperature $λ$ shows that 1.0 is optimal. Sampling from too diverse or too similar trajectories hurts the performance of this method.

The results show that this method is the best way to represent autoregressive models thus far, with even slightly better performance than encoder-based models like T5 on the STS task. The results are not as good as models explicitly trained with unsupervised contrastive learning (see SimCSE) for the STS task, but as the decoder model size is increased, the performance starts to reach similar levels.

Application to Semantic Retrieval

Unfortunately, this method is not suitable for semantic retrieval, as the distance measure requires sampling from both query and document distributions (if we treat query $q$ as one prompt and document $d$ as the other prompt). Note that the setting is Semantic Textual Similarity where query and documents come from similar distributions.

The authors create a proof-of-concept workaround by creating a fixed set of trajectories $T_{D}$ that is used instead of sampling $t \sim \frac{1}{2}(M_{u} + M_{v})$, where they select $n$ random documents from the dataset and create a sample trajectory from each document to form $T_{D}$. We can then precompute $M_{d}$ for each document $d$ on this set of trajectories $T_{D}$. At inference time, we just need to compute $M_{q}$ on the fixed set of trajectories, then compare $M_{q}$ against the pre-computed $M_{d}$ to get the distance scores. The authors show that this approximation obtains reasonable performance.

However, I note that in typical semantic search settings, query and document usually come from different distributions where queries are short and documents are long. Hence, the continuations of a query and document might not be directly comparable. However, the beauty of this method is that since it is based on continuations, we can use prompting to adjust the behaviour of the LLM to somewhat align distributions.

For example, we may on the document-side generate trajectories using a prompt like Given document d, generate a query. On the query-side, we may use something like Given query q, generate other similar queries. We can then compare trajectories generated in this fashion.

Another issue with the original workaround is latency. Even with a fixed trajectory set, we need to compute $M(t_{i} ∣ q)$ for every trajectory $t_{i} \in T_{D}$ at inference time, which is costly even for a small number of trajectories. One way to deal with this issue is to have a hybrid setup:

On the document side, we use the high-precision decoder to generate a set of n prospective queries from the documents (say n=1,000). Each document $d$ is represented as an embedding of size n where each position records the likelihood of prospective query $p_{i}$ given document $d$ using the decoder model.
We use a separate fine-tuned encoder model to precompute embeddings for each of the prospective queries.
On the query side, during inference, we use the same encoder model to embed the query and compute similarity against the n prospective queries. The similarity scores form a representation of size n for the query.
We may then compute similarity between the size n query and size n document representations and perform retrieval with standard nearest neighbour vector databases.

In contrast to standard embedding vectors where each position encodes an unknown quality, these representation vectors have clear semantic meaning in each position (indicating the strength of relation to a particular prospective query). This setup also allows us to trade-off precision against recall by tweaking the sparsity of the representation vectors using score thresholding or top-k thresholding on either (or both) the query and document side.
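A rough sketch of the hybrid setup above, with random numbers standing in for both the decoder likelihoods and the encoder embeddings. All names, sizes and the choice of top-k thresholding are hypothetical.

```python
import numpy as np

def top_k_sparsify(vectors, k):
    # Keep the k largest entries in each row and zero out the rest.
    out = np.zeros_like(vectors)
    idx = np.argpartition(vectors, -k, axis=1)[:, -k:]
    rows = np.arange(vectors.shape[0])[:, None]
    out[rows, idx] = vectors[rows, idx]
    return out

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_docs, n_prospective, dim = 6, 50, 16

# Stand-in for decoder likelihoods of each prospective query given each document.
doc_repr = rng.random((n_docs, n_prospective))
# Stand-ins for encoder embeddings of the prospective queries and one incoming query.
prospective_emb = rng.normal(size=(n_prospective, dim))
query_emb = rng.normal(size=(1, dim))

# Query representation: cosine similarity to each prospective query.
query_repr = normalize(query_emb) @ normalize(prospective_emb).T

k = 5
doc_sparse = top_k_sparsify(doc_repr, k)
query_sparse = top_k_sparsify(query_repr, k)

# Retrieval score: dot product between the sparse size-n representations.
scores = (query_sparse @ doc_sparse.T).ravel()
ranking = np.argsort(-scores)
```

A smaller `k` makes both sides sparser, so fewer documents share a non-zero position with the query. That trades recall for precision.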

Klenitskiy 2023 - BERT4Rec vs SASRec

Turning Dross Into Gold Loss: is BERT4Rec really better than SASRec?

This is a short paper that argues that the alleged performance gains of BERT4Rec over SASRec are not due to the masked cloze prediction task and bi-directional attention as the BERT4Rec authors claim, but rather due to the loss function.

BERT4Rec uses softmax cross entropy over the entire item catalog at each time step, whereas SASRec uses binary cross entropy against a single sampled negative at each time step. When the same softmax cross entropy is used for SASRec, it outperforms BERT4Rec consistently and trains faster.

Background

Sequential recommendation is currently a popular approach to recommender systems, with self-attention transformer models as the standard architecture. SASRec is the canonical example: it treats sequential recommendation as a causal modelling task, where the self attention mechanism is only allowed to attend to previous time steps when making the prediction at time step $t$.

BERT4Rec was proposed as an improvement to SASRec, and the claim was that introducing bi-directional attention (like BERT) and performing prediction on the cloze passage task (i.e. randomly masking x% of items) is able to lead to significant gains over SASRec. The argument is that the random masking is akin to data augmentation, as there are far more permutations of masked positions compared to just predicting the next item.

The authors point out two misgivings they have with this interpretation, which is in line with my own intuitions:

BERT4Rec task is only weakly related to the final goal of sequential recommendations, whereas SASRec tasks for training and prediction are perfectly aligned (i.e. just predict the next item). This is akin to only using the BERT encoder (without the decoder) for a language modelling task, which is quite strange.
BERT4Rec masks some items and only calculates losses for the subset of items, whereas SASRec computes losses for all items (except the last item) at once, getting more training signal from each training sequence

So we should expect SASRec to perform better and more efficiently than BERT4Rec. How then do we explain the performance discrepancy? The authors hypothesize that the performance difference is really due to the difference in loss functions between them, as explained below.

Setup

Start with a set of users $U$ and items $I$. Let each user $u \in U$ be represented by a sequence of item interactions $s_{u} = \{i_{1}^{(u)}, i_{2}^{(u)}, ..., i_{n_{u}}^{(u)}\}$. Each sequential deep learning model may be abstracted as an encoder of input sequence $s_{u}$, and the encoded sequence be denoted $H_{u} \in R^{n_{u} \times d}$, where $d$ is the latent dimension.

To make predictions, given the full item embedding matrix $E \in R^{∣ I ∣ \times d}$ , we take: $R_{u} = H_{u} E^{T} \in R^{n_{u} \times ∣ I ∣}$

The $(t, i)$ element of $R_{u}$, denoted $r_{t, i}^{(u)}$, represents the predicted relevance of item $i$ at time step $t$ for user $u$.

SASRec: Binary cross entropy loss. SASRec does not compute the full $R_{u}$ prediction matrix. Instead, for each true positive item at each time step, it randomly samples one negative item and computes the predictions $r_{t, i_{t}}^{(u)}$ and $r_{t, -}^{(u)}$. Then the loss is: $L_{BCE} = - \sum_{u \in U} \sum_{t=1}^{n_{u}} \left[ \log(σ(r_{t, i_{t}}^{(u)})) + \log(1 - σ(r_{t, -}^{(u)})) \right]$

BERT4Rec: Softmax Cross Entropy. In contrast, BERT4Rec computes the full prediction matrix $R_{u}$ for each user and computes the softmax over the entire item catalog for each masked item prediction. The cross entropy loss is thus: $L_{CE} = - \sum_{u \in U} \sum_{t \in T_{u}} \log \frac{\exp(r_{t, i_{t}}^{(u)})}{\sum_{i \in I} \exp(r_{t, i}^{(u)})}$

Note that for BERT4Rec, the inner summation is only over the time steps with masked items. If we were to translate this loss to SASRec, we would sum over all time steps.

Sampled Softmax. Finally, it may not be computationally feasible to compute the softmax over the full item catalog. Hence the authors propose that for each user sequence in a batch, we sample $N$ items the user has not interacted with and use the same set of negatives for each time step of that sequence. Let $I_{N}^{u-}$ denote this set of $N$ sampled negatives for user $u$. The loss is then: $L_{CE-sampled} = - \sum_{u \in U} \sum_{t=1}^{n_{u}} \log \frac{\exp(r_{t, i_{t}}^{(u)})}{\exp(r_{t, i_{t}}^{(u)}) + \sum_{i \in I_{N}^{u-}} \exp(r_{t, i}^{(u)})}$
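The three losses can be compared side by side for a single user sequence. This is only a sketch: the encoder outputs and item embeddings are random stand-ins, the function names are hypothetical, and the sum over users is omitted.

```python
import numpy as np

def log_sigmoid(x):
    return -np.logaddexp(0.0, -x)

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

def bce_loss(R, positives, negatives):
    # One sampled negative per time step, as in the original SASRec.
    t = np.arange(R.shape[0])
    # log(1 - sigmoid(x)) == log_sigmoid(-x)
    return -np.sum(log_sigmoid(R[t, positives]) + log_sigmoid(-R[t, negatives]))

def full_ce_loss(R, positives):
    # Softmax over the entire catalogue at every time step.
    t = np.arange(R.shape[0])
    return -np.sum(R[t, positives] - logsumexp(R, axis=1))

def sampled_ce_loss(R, positives, shared_negatives):
    # The same N negatives are reused at every time step of the sequence.
    t = np.arange(R.shape[0])
    pos = R[t, positives][:, None]      # (n_u, 1)
    neg = R[:, shared_negatives]        # (n_u, N)
    logits = np.concatenate([pos, neg], axis=1)
    return -np.sum(pos[:, 0] - logsumexp(logits, axis=1))

rng = np.random.default_rng(0)
n_steps, n_items, d = 4, 30, 8
H = rng.normal(size=(n_steps, d))       # encoded sequence H_u
E = rng.normal(size=(n_items, d))       # item embedding matrix
R = H @ E.T                             # relevance matrix R_u

positives = np.array([0, 1, 2, 3])      # true next item at each step
never_seen = np.arange(4, n_items)      # items the user has not interacted with
shared_negatives = rng.choice(never_seen, size=10, replace=False)
per_step_negatives = rng.choice(never_seen, size=n_steps)

loss_bce = bce_loss(R, positives, per_step_negatives)
loss_full = full_ce_loss(R, positives)
loss_sampled = sampled_ce_loss(R, positives, shared_negatives)
```

The sampled loss can never exceed the full-catalogue loss here, because its denominator sums over a subset of the items.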

Qn: This means that we have a different set of negatives for each user in a batch? Seems quite memory intensive.

Experiments

The authors use the full sequence of item interactions for each user. The last (most recent) item is held out as the test set, and the second last item is used as the validation set. Models are trained with early stopping on the validation set. The authors note that the common practice of sampling negative items for metric computation is not a robust one, as it introduces randomness into the metrics.

Note: I think the other reason sampling negatives is not robust is because it does not directly mirror the retrieval task, which requires choosing an item from the full catalogue.

The results show that SASRec trained with sampled softmax (3,000 negatives) is the best model, consistently beating BERT4Rec. It also takes only around $\frac{1}{2}$ to $\frac{1}{4}$ of BERT4Rec's training time. Hence the authors recommend sampled softmax SASRec as the de-facto standard instead of BERT4Rec.

The authors do note that for the smaller datasets, SASRec may overfit relatively quickly (validation performance peaks and then declines) and hence it is important to use early stopping. In contrast, BERT4Rec is more robust to overfitting and validation performance generally does not decline (it plateaus near the peak).

Singh 2023 - Semantic IDs for Recs

Better Generalization with Semantic IDs: A Case Study in Ranking for Recommendations

This paper proposes using semantic IDs instead of hashing to random IDs to represent users or items. Semantic IDs mean that collisions to the same ID are semantically meaningful, which addresses the collision problem and also offers better generalization in cold start scenarios. The paper shows that this approach (i) retains memorization ability on par with random hashing, (ii) generalizes much better to unseen items, and (iii) is much more computationally efficient as each user / item is represented simply by a 64-bit integer.

Background

In industry recommender systems, users or items are typically represented using random bloom hashing on a string ID (see Weinberger 2009) into an embedding table. Such IDs have no semantic meaning, so collisions in this approach degrade performance. Nevertheless, it is empirically clear that the item-level memorization is important for good recommender performance. On the other extreme, one may choose to avoid IDs entirely and simply represent items using their content embedding. While it is clear that this obviates the collision problem and improves cold-start performance, multiple papers have shown that performance will be degraded overall due to inability to match the memorization ability of ID-based recommenders.

The ranker used in experiments is from Google's video recommender. See Improving Training Stability for Multitask Ranking Models in Recommender Systems and Recommending What Video to Watch Next: A Multitask Ranking System.

Approach

The approach pre-supposes that every item (video in this case) already has a good learned embedding. For Google's case, they have a video encoder that represents each YouTube video as a 2048 dimensional vector that captures the topicality of the video. The encoder takes both audio and visual features as input. The model training is described in Large Scale Video Representation Learning via Relational Graph Clustering.

The approach has two stages:

Represent each item as a sequence of Semantic IDs. They employ a method called RQ-VAE to compress a dense content embedding into a few discrete tokens (represented as integers) which capture the semantic content.
Train the ranking model with Semantic IDs. Once trained, the RQ-VAE model is frozen and used to generate semantic IDs for each item. We then train embeddings for each semantic ID along with the rest of the ranking model.

Stage 1: RQ-VAE for Semantic IDs

The idea of RQ-VAE is to iteratively find tokens that match the content embedding. Once a token is found, the next token will try to match the residual embedding and so on.

Let $x \in R^{D}$ be the content embedding. The algorithm is as follows:

Use encoder $E$ to map the content embedding $x \in R^{D}$ to a latent vector $z \in R^{D^{'}}$
The residual quantizer of $L$ layers recursively quantizes $z$ into semantic IDs
- Each level $l$ has a codebook $C_{l} := \{e_{k}^{l}\}_{k=1}^{K}$ containing $K$ vectors (K=2048 in this paper)
- The residual $r_{l}$ for level $l$ (starting from $r_{1} := z$) is used to find the nearest codebook vector $e_{l}$
- The ID associated with $e_{l}$ is taken as the semantic ID
- We compute $r_{l + 1} := r_{l} - e_{l}$
- Thus we end up with a sequence of semantic IDs (e.g. (1, 4, 6, 2))
A decoder $D$ maps the quantized latent $\hat{z} := \sum_{l=1}^{L} e_{l}$ back to $\hat{x}$
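The residual quantization step can be sketched as follows. This only covers the lookup loop: the encoder, decoder and training are omitted, and the codebooks are random (untrained) stand-ins with hypothetical sizes.

```python
import numpy as np

def residual_quantize(z, codebooks):
    # codebooks: array of shape (L, K, D'), one codebook of K vectors per level.
    residual = z.copy()
    ids, z_hat = [], np.zeros_like(z)
    for codebook in codebooks:
        # Nearest codebook vector to the current residual.
        k = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        ids.append(k)
        z_hat += codebook[k]
        # The next level quantizes whatever is left over.
        residual = residual - codebook[k]
    return tuple(ids), z_hat, residual

rng = np.random.default_rng(0)
L, K, D_latent = 4, 16, 8
codebooks = rng.normal(size=(L, K, D_latent))
z = rng.normal(size=D_latent)

semantic_ids, z_hat, final_residual = residual_quantize(z, codebooks)
```

By construction, `z - z_hat` equals the final residual, so the quantized latent is good exactly when the last residual is small.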

The RQ-VAE model is trained with the following losses:

$L := L_{recon} + L_{rqvae}$
$L_{recon} := \| x - \hat{x} \|_{2}^{2}$ aims to reconstruct the content embedding $x$
$L_{rqvae} := \sum_{l=1}^{L} \left( β \cdot \| r_{l} - sg[e_{l}] \|^{2} + \| sg[r_{l}] - e_{l} \|^{2} \right)$ where $sg$ is the stop-gradient operator (which disables gradient updates to the term in the operator). This is to encourage $r_{l}$ and the codebook vector $e_{l}$ to move toward each other, so that we have a good codebook at the end.

Stage 2: Using Semantic IDs for Ranking

Now that each item $v$ is represented by a sequence of semantic IDs $(c_{1}^{v}, ..., c_{L}^{v})$ , we can treat each ID as a token. The most intuitive thing is to treat each token as a subword and assign a unique embedding to each token (this is the unigram approach below). Unfortunately, this simplistic approach is not the most ideal. The experiments reveal that these semantic tokens behave more like characters in NLP, and it is important for performance to assign unique embeddings to sequences of tokens (i.e. create subwords out of the characters).

Two approaches were tested:

Create n-grams out of the IDs. Suppose an item has semantic IDs (4, 6, 3, 2). A unigram approach would look up embeddings for each ID by itself. A bigram approach would look up embeddings for (46, 63, 32) and so on. Consequently, the embedding table size for a quantizer with L levels and K codes in each level is something like $(L - N + 1) \times K^{N}$ . This gets prohibitively expensive for larger N, so the experiments stop at bigram.
Sentence piece based. As with natural language, most n-grams occur rarely or not at all. Hence the n-gram approach is wasteful. The authors found that applying the sentence piece approach (I suppose using byte pair encoding?) is an effective way to learn the subwords that each get a unique embedding. The authors train the subword model on impressed items to a desired arbitrary vocab size. Note: this makes the approach less flexible, as we need to freeze both the quantizer and the set of subwords. However their experiments show that this is generally quite resilient.
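A small sketch of the n-gram lookup keys and the resulting table size. The helper names are hypothetical. Keys are tagged with their starting level, which is what the $(L - N + 1) \times K^{N}$ table size implies.

```python
def ngrams(semantic_ids, n):
    # All contiguous n-grams of the ID sequence, tagged with their start level.
    return [
        (pos, tuple(semantic_ids[pos:pos + n]))
        for pos in range(len(semantic_ids) - n + 1)
    ]

def ngram_table_size(num_levels, codebook_size, n):
    # (L - N + 1) positions, each with K^N possible n-grams.
    return (num_levels - n + 1) * codebook_size ** n

unigrams = ngrams((4, 6, 3, 2), 1)
bigrams = ngrams((4, 6, 3, 2), 2)

unigram_rows = ngram_table_size(4, 2048, 1)
bigram_rows = ngram_table_size(4, 2048, 2)
```

With L=4 levels and K=2048 codes, moving from unigrams to bigrams grows the table from 8,192 rows to about 12.6 million rows, which is why trigrams are already impractical.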

Results

The experiments use CTR AUC as the target metric. The model is trained on a window of N days and evaluated on its predictions for day N + 1. Cold start performance refers to the performance on the subset of items that were newly introduced on day N + 1. Note that each user is represented as a sequence of items, including past item history and the current item, and each item is represented by its own semantic IDs. The main findings of the experiments:

Sentencepiece (SPM) approach > n-gram approaches for overall performance. As the vocab size is scaled up for the SPM approach, it outperforms both unigram and bigram easily.
SPM > random hashing with the same vocab size for overall performance. It does not compromise on the memorization capacity of the model.
All Semantic ID approaches > random hashing in the cold start scenario (as expected)

The authors also conducted a set of experiments where items are represented by their content embedding directly. Due to the large size of the embedding, they did not include user past history for these experiments. These showed that content embedding approach is inferior to random hashing, unless the number of layers is increased significantly. This suggests that a larger model is indeed able to memorize better just using the content embeddings, but at a significant computational cost. Hence this justifies the use of semantic IDs as a more efficient way to balance between memorization and cold start performance.

Takeaways

Semantic IDs are a promising approach to balance memorization and cold start performance. However, the approach presupposes that we have good embeddings for each item. It also introduces significant engineering complexity in training and freezing a residual quantizer and a fixed subword vocabulary. Although the authors conduct experiments to show that this quantizer is quite robust to data distribution shift, there would probably come a time when we need to update the quantizer and the subword vocabulary and deal with the refreshing issue.

Yang 2023 - OPRO

Large Language Models as Optimizers

This paper proposes an automated prompt optimization technique that resembles iterative numerical optimization. Compared to previous efforts, this paper makes use of an LLM's ability to understand patterns in what the authors call optimization trajectories to optimize a prompt.

Method

We have access to a training set of <question, answer> pairs. The idea is simple:

Test out a prompt with the task description and evaluate the accuracy
Append previously tested prompts and their accuracies in ascending score order (keeping only the best 20 prompts in the trajectory)
Ask the LLM to write a new prompt that is different from the old ones and achieves a high score

Motivation

The authors motivate this by showing that LLMs can solve simple linear regression and travelling salesman problems. For linear regression:

A true generating equation of $y = w_{t r u e} x + b_{t r u e} + ϵ$ is chosen
50 data points are generated with gaussian noise
The goal is to get an LLM to guess what is the best (w, b) pair
The algorithm loops through the following:
- Start with 5 random (w, b) pairs, compute their mean-squared error
- Prompt the LLM with these pairs and tell it to suggest 8 new (w, b) pairs that further reduce the loss
- Evaluate the MSE of these new pairs and add them to the history
- Keep up to 20 pairs in history
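The loop above can be sketched without an actual LLM. In the sketch below, `propose` is a hypothetical stand-in that randomly perturbs the best pairs so far. In OPRO this step is an LLM prompted with the scored history, so this only illustrates the bookkeeping, not the paper's result. The ground-truth values and noise scale are also made up.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true, b_true = 3.0, -2.0                     # hypothetical ground truth
x = rng.uniform(-1, 1, size=50)
y = w_true * x + b_true + rng.normal(scale=0.1, size=50)

def mse(w, b):
    return float(np.mean((w * x + b - y) ** 2))

def propose(history, num_new=8):
    # Stand-in for the LLM call: perturb the three best pairs seen so far.
    best = sorted(history, key=lambda h: h[2])[:3]
    return [
        (w + rng.normal(scale=0.5), b + rng.normal(scale=0.5))
        for w, b, _ in (best[i % len(best)] for i in range(num_new))
    ]

# Start with 5 random (w, b) pairs and their errors.
history = [(w, b, mse(w, b)) for w, b in rng.uniform(-10, 10, size=(5, 2))]
initial_best = min(h[2] for h in history)

for _ in range(50):
    candidates = propose(history)
    history += [(w, b, mse(w, b)) for w, b in candidates]
    # Keep only the 20 lowest-error pairs in the history.
    history = sorted(history, key=lambda h: h[2])[:20]

w_best, b_best, final_best = history[0]
```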

It is not obvious that LLMs can accomplish this task, as it requires some understanding of the underlying generating equation and how to nudge the w, b values. However, the paper shows that LLMs can generally solve this task and reach the global optimum, and that better models (e.g. gpt-4 vs gpt-3.5) reach the optimum in fewer steps.

OPRO for Prompt Optimization

The general layout of a prompt is like so (note that the Comment headers are not part of the prompt):


### Comment: Meta Instruction for Optimization Trajectory ###
I have some texts with their corresponding scores. The texts are arranged in ascending order based on their scores, where higher scores indicate better quality.

### Comment: Optimization Trajectory ###
text:
Let's figure it out!
score:
61

... more history ...

### Comment: Meta Instruction for Few Shot Exemplars ### 
The following exemplars show how to apply your text: you replace <INS> in each input with your text, then read the input and give an output. We say your output is wrong if your output is different from the given output, and we say your output is correct if they are the same.

### Comment: Few Shot Exemplars ###
input:
Q: Alannah, Beatrix, and Queen are preparing for the new school year
and have been given books by their parents. Alannah has 20 more books than Beatrix. Queen has 1/5 times more books than Alannah. If Beatrix has 30 books, how many books do the three have together?
A: <INS>
output:
140

... more exemplars ...

### Comment: Final Meta Instruction ###
Write your new text that is different from the old ones and has a score as high as possible. Write the text in square brackets.

Optimization is tested out on GSM8K (grade school math problems) and Big Bench Hard (logical reasoning questions). At each step, the above prompt is used 8 times to generate 8 different prompts. These prompts are evaluated on the training set. The best scoring 20 prompts are kept in the optimization trajectory at any point in time. Other parameters:

Temperature = 1 when generating prompts for diversity
Temperature = 0 when evaluating each prompt on the training set with the scorer LLM for greedy decoding

The optimization trajectory looks like this, resembling actual optimization.


OPRO Optimization Trajectory (Figure 1 from paper)

Notes

The method is intuitive and compelling. The downside is that inclusion of both optimization trajectory and exemplars makes the method quite costly, although the optimization only needs to be done once to get the best prompt.

Rajput 2023 - Generative Retrieval

Recommender Systems with Generative Retrieval

Borisyuk 2024 - GNN at LinkedIn

Borisyuk 2024 - LiGNN: Graph Neural Networks at LinkedIn

This is an applied paper documenting learning points from LinkedIn's latest Graph Neural Network (GNN) models, geared toward industry practitioners.

The main architecture is based on GraphSAGE. The graph is heterogeneous, where nodes can represent a company or member or post etc., and edges can represent (i) engagement of a user with a post, (ii) affinity between a member and a creator, and (iii) whether a member or company has a particular attribute. Edges are weighted by the strength of the engagement, except attribute edges which all have a weight of 2.0.

The GNNs are trained as encoders for the various types of nodes, and the generated embeddings may be used for downstream link prediction tasks. They are trained using the GraphSAGE framework, which inductively generates the node embeddings using graph sampling and neighbourhood aggregation. The graph sampling is powered by the Microsoft DeepGNN Graph Engine, which is run on a Kubernetes CPU cluster.

Temporal GNN

One of the main innovations of this paper is to adapt the GraphSAGE framework to the temporal nature of a live recommender system, where items lose relevance very quickly with time and modelling the activity sequence is important.

Firstly, the typical SAGE encoder is applied to the member and its neighbours (e.g. connections, affinities etc.). This generates an embedding of d dimensions. Multi-head attention is applied to generate a few of these embeddings, e.g. H=4. In addition, sampling of the member's past activities (e.g. N=100) is performed, and each of these activity nodes is also encoded. These are the embeddings labelled A1, ..., A100 in the diagram below; note that the numbers correspond to chronological order. The input is now of dimension (H+N) x d.

We can see how this now resembles a sequential input embedding (e.g. from text) which is typically fed into transformers and trained with a causal prediction task. Indeed, here we feed the sequential input into a transformer and get an output of (H+N) x d as well (referring to H1, ..., H80 at the top). Positional encodings are added to the transformer so that the positions may be interpreted as a sequence. Finally, causal masking is used for training the transformer. This means that for the first H tokens, no masking is applied, but for the last N positions, each token can only attend to the first H tokens and the activity tokens that preceded it.
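To make the masking rule concrete, here is a small sketch that builds the attention mask for H member tokens followed by N activity tokens. Two readings are assumptions on my part: "no masking" for the first H tokens is taken to mean they can see every position, and each activity token is allowed to attend to itself as in standard causal attention. The function name is hypothetical.

```python
def temporal_attention_mask(num_heads, num_activities):
    """Return mask[i][j] == True when position i may attend to position j."""
    total = num_heads + num_activities
    mask = []
    for i in range(total):
        if i < num_heads:
            # Member tokens: no masking (assumed to mean full visibility).
            row = [True] * total
        else:
            # Activity tokens: all H member tokens, earlier activities, and itself.
            row = [j <= i for j in range(total)]
        mask.append(row)
    return mask

# Tiny example with H=2 member tokens and N=3 activity tokens.
mask = temporal_attention_mask(2, 3)
```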

The temporal modelling contributes around 4-5% lift in AUC which is significant.


Temporal GNN Framework (Figure 4 from paper)

Graph Densification

Since GNNs rely on neighbourhood aggregation, nodes with few interactions pose a cold start problem. To solve this problem, artificial edges are added between nodes with low out-degrees and nodes with high out-degrees based on content-based similarity. Specifically, their algorithm works like so:

Nodes with out-degree $\leq$ 30th percentile designated as low-degree
Nodes with out-degree $\geq$ 90th percentile designated as high-degree
Pre-embed all nodes with an LLM based on content
For each low-degree node, find its approximate top-50 neighbours amongst the high-degree set. Create an edge between each of them and the low-degree node with edge weight based on embedding similarity
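The four steps above can be sketched as follows. This is an illustrative sketch only: it uses exact cosine search where the paper uses approximate nearest neighbours, a nearest-rank percentile, and toy embeddings in place of LLM embeddings. All function names are hypothetical.

```python
import math

def percentile(values, q):
    """Nearest-rank percentile for q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def densify(out_degree, embeddings, top_k=50, low_q=30, high_q=90):
    """Return artificial edges (low_node, high_node, weight)."""
    low_cut = percentile(out_degree.values(), low_q)
    high_cut = percentile(out_degree.values(), high_q)
    low_nodes = [n for n, d in out_degree.items() if d <= low_cut]
    high_nodes = [n for n, d in out_degree.items() if d >= high_cut]
    edges = []
    for low in low_nodes:
        scored = [(cosine(embeddings[low], embeddings[h]), h)
                  for h in high_nodes if h != low]
        scored.sort(reverse=True)  # most similar high-degree nodes first
        edges.extend((low, h, sim) for sim, h in scored[:top_k])
    return edges

out_degree = {"a": 1, "b": 2, "c": 50, "d": 60, "e": 100}
embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.2],
              "d": [0.3, 1.0], "e": [1.0, 1.0]}
edges = densify(out_degree, embeddings, top_k=1)
```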

It seems that graph densification adds a small (0.2%-0.5%) lift to AUC both on offline and online metrics. It probably also increases the diversity of recommendations.

Personalized Page Rank Graph Sampling

Personalized Page Rank (PPR) is an adaptation of the well-known Page Rank algorithm and is widely used as a notion of node similarity. Specifically, given a source node s and a target node t, the PPR score $\pi(s, t)$ represents the probability that a random walk originating at s will terminate at t.

In training the GNN under the GraphSAGE framework, embeddings of neighbours to a node s are aggregated together to form the representation for s. A terse way of saying the same thing is that "GNNs propagate signals across graphs". Hence, how neighbours are sampled is crucial in determining the performance of the GNN.

The simple way is Random Sampling, which simply chooses neighbours randomly amongst nodes connected to node s. This can be done over a single hop or multiple hops, and is efficient but not the most performant. The better way is PPR sampling, which chooses neighbours weighted by their PPR score, with the search space limited to nodes connected to s over k hops. This is slower but a better measure of neighbourhood.
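A rough sketch of PPR sampling on a toy graph is below. It computes PPR by power iteration with a restart probability `alpha` (a hypothetical value), restricts candidates to nodes within k hops, and samples with replacement in proportion to the PPR score. How the paper computes or approximates PPR at scale is not covered in these notes, so treat this as an illustration of the idea only.

```python
import random

def personalized_pagerank(graph, source, alpha=0.15, iterations=50):
    """Power iteration for pi = alpha * e_source + (1 - alpha) * pi * P."""
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        nxt[source] += alpha
        for node, mass in scores.items():
            neighbours = graph[node]
            if not neighbours:
                nxt[source] += (1 - alpha) * mass  # dangling node: walk restarts
                continue
            share = (1 - alpha) * mass / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        scores = nxt
    return scores

def k_hop_neighbours(graph, source, k):
    frontier, seen = {source}, {source}
    for _ in range(k):
        frontier = {nb for node in frontier for nb in graph[node]} - seen
        seen |= frontier
    return seen - {source}

def ppr_sample(graph, source, k=2, num_samples=3, seed=0):
    scores = personalized_pagerank(graph, source)
    candidates = sorted(k_hop_neighbours(graph, source, k))
    weights = [scores[c] for c in candidates]
    return random.Random(seed).choices(candidates, weights=weights, k=num_samples)

graph = {"s": ["a", "b"], "a": ["c"], "b": ["s"], "c": ["s"], "d": ["s"]}
ppr = personalized_pagerank(graph, "s")
sampled = ppr_sample(graph, "s")
```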

From their experiments, 2-hop PPR sampling contributes 2.1% validation AUC lift. Adding more hops beyond 2 hops contributes marginally to performance, hence 2-hops is chosen for efficiency.

Other Tips and Tricks

The paper is full of practical tips on how to speed up training and improve stability. Here are some that stood out to me:

Adaptive Graph Sampling. IO time is correlated with the number of neighbours sampled. So they start training with a small number of neighbours and increase it only when the validation metric starts to stall. It seems that the final number of neighbours is capped at 200.

Mixed Precision. Mixed precision (float16) gave roughly a 10% speed-up in training. But they had to be careful to retain float32 in the last layer, otherwise it degraded the training process.

Data Bound. One reflection point they made is that GNN training is generally data-bound rather than compute-bound. So the focus of the paper was to make data loading and processing more efficient. For example, they implemented a shared memory queue, and used the multiprocessing package to have different processes pre-fetch data from the DeepGNN Graph Engine and pre-process it simultaneously.
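As an illustration of the adaptive graph sampling tip, a scheduler along these lines would grow the number of sampled neighbours only when validation stalls. The `patience`, `growth` and `min_delta` values are my own assumptions; only the cap of 200 comes from the notes.

```python
def next_fanout(current, val_history, patience=2, growth=2, cap=200, min_delta=1e-4):
    """Return the neighbour count to use for the next training phase."""
    if len(val_history) <= patience:
        return current  # not enough history to judge a stall
    recent_best = max(val_history[-patience:])
    earlier_best = max(val_history[:-patience])
    if recent_best - earlier_best < min_delta:
        return min(cap, current * growth)  # stalled: sample more neighbours
    return current

still_improving = next_fanout(10, [0.60, 0.65, 0.70])
stalled = next_fanout(10, [0.70, 0.70, 0.70])
capped = next_fanout(150, [0.70, 0.70, 0.70])
```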

Wang 2024 - LLM for Pinterest Search

This paper explains Pinterest's use of pre-trained LLMs to improve their search ranking model. They first fine-tuned a Llama-8B model on human-labelled search relevance data, then used the model to label large amounts of unlabelled data. The teacher-labelled data was used to train a smaller MLP, which led to significant improvement in search performance.

Background

Pinterest curates a human-labelled dataset for search relevance. For each query, a pin is classified into one of five ordered relevance levels:

L5: Excellent / Highly Relevant
L4: Good / Relevant
L3: Complementary / Marginally Relevant
L2: Poor / Irrelevant
L1: Highly Irrelevant

Teacher Model

The teacher model takes the form of a standard cross encoder setup. The teacher model is either a BERT-based LLM or a decoder-style LLM. The text inputted to the LLM is of the following format:

[CLS] Query [SEP] Pin Text

The embedding of the [CLS] token is taken for BERT-based models, whilst the embedding of the final non-padding token is taken for decoder-based models (such as Llama). The embedding is passed through several fully-connected layers, and the final output dimension corresponds to the 5 relevance levels. The LLMs are fine-tuned during training by minimizing the pointwise multi-class cross entropy loss.

To enhance the representation of the pin text for the teacher model, several features are concatenated together:

Pin titles and descriptions are used as-is
Synthetic image captions are generated using an off-the-shelf model called BLIP
High-engagement query tokens. Queries resulting in high engagement with a pin are aggregated over the past 2 years, and the top tokens are selected.
User-curated Board Titles. Users curate their pins onto boards. The board titles are aggregated across users and top tokens are selected.
Link Titles and Descriptions. A long click is defined as when a user clicks into a link to the webpage of a pin and stays there for 10 seconds. The incoming link url and description contain useful text which is mined.

A few off-the-shelf LLMs are tested as teacher models and the results are reported below.

Student Model

The teacher model is used to label large amounts of unlabelled search impressions data. Specifically, for each row of data, the teacher model generates a softmax probability distribution over the 5-point relevance labels. The student model is then taught to mimic the predicted probability distribution using cross-entropy loss.

Note: The exact loss function for knowledge distillation is not covered in the paper. But the classical way is to perform knowledge distillation using KL-divergence loss. An implementation of the loss can be found here. The basic idea is to minimize KL-divergence between the teacher predicted probability distribution over labels and the student predicted probability distribution.
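For reference, here is a minimal pure-Python sketch of both losses on a single example with 5 relevance classes. The logits are made up. It also shows why the two descriptions above agree in practice: cross-entropy against soft teacher labels equals the KL divergence plus the teacher's entropy, which is constant with respect to the student.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student) over the relevance labels."""
    return sum(t * math.log((t + eps) / (s + eps)) for t, s in zip(teacher, student))

def soft_cross_entropy(teacher, student, eps=1e-12):
    """Cross-entropy of the student against soft teacher labels."""
    return -sum(t * math.log(s + eps) for t, s in zip(teacher, student))

teacher = softmax([2.0, 1.0, 0.0, -1.0, -2.0])   # made-up teacher logits
student = softmax([0.0, 0.0, 0.0, 0.0, 0.0])     # untrained student
kl = kl_divergence(teacher, student)
ce = soft_cross_entropy(teacher, student)
teacher_entropy = -sum(t * math.log(t) for t in teacher)
```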

The student model is a simple feed forward network on a diverse set of features. It seems like some feature engineering is performed to optimize performance. Features include:

Query-side embeddings. Pinterest has classifiers to categorize the query into a query interest type, shopping interest type etc. These categorical features are embedded using an embedding table. The SearchSage embedding for the query is also included.
Pin-side embeddings. The PinSage pin embedding, SearchSage pin embedding etc. are included
Query-pin interaction features. Standard query-pin match scores like BM25 score, % of query terms matched are also included as features

All features are presumably concatenated together, and passed into the MLP network. As mentioned above, this student network is trained using knowledge distillation from the teacher predictions above.

Results

The teacher and student models are evaluated using the human-annotated search relevance dataset. The train set for the teacher model is around 280k rows of human-annotated data, and the remaining 30k rows are used for evaluation. The teacher model is then used to run inference on 30 million rows of unlabelled data, and its predictions are used to train the student model.

For the teacher model, scaling up the base model consistently produces better test performance (accuracy):

BERT-base: 53.5%
T5-base: 56.9%
DeBERTaV3-base: 58.0%
Llama3-8b: 60.2%

Note that the Llama3-8b model was trained using QLoRA.

The experiments also showed that:

Scaling up the amount of teacher-inferred data was crucial for improving student model performance
The teacher LLM was able to successfully transfer learning from a purely English human-annotated dataset to multiple other languages, thus allowing the student model to learn multi-lingual behaviour.
The inclusion of additional text features for the teacher model above helped to improve performance significantly.

Takeaways

This is a simple paper that shows the power of using LLMs to bootstrap learning from a smaller product-specific dataset to distill an effective student model. This is a useful paradigm because in search systems, we often have vast amounts of unlabelled impressions data that we can use a teacher model to run inference on.

The paper also shows that scaling up the LLM is able to produce non-trivial increases in search relevance performance on the human-annotated dataset.

Solatorio 2024 - GISTEmbed

GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

This paper proposes a way to learn embeddings using contrastive learning. The main idea is to use a guide model to filter out false negatives from training for better data quality and performance. GISTEmbed models are currently topping the MTEB benchmark.

Implementation

The method has an implementation in the SentenceTransformer package. We will walk through the implementation here.

Firstly, the loss is initialized with both a model to train and a guide model, both of which are SentenceTransformers.

In the forward step:

sentence_features is the input which is a list of dict[str, Tensor]
- The list is length 2 if we have anchor, positive
- The list is length 3 if we have anchor, positive, negative
The sentence_features is passed into both the model and the guide to get the respective embeddings
- For guide, we may need to re-tokenize it if the tokenizer of the guide differs from model
- This is done by using batch_decode and then tokenize again
Now we have anchor, positive and negative embeddings, each of shape (batch_size, embed_dim)
The sim_matrix is used to compute pairwise cosine similarities:
- The code is simply torch.nn.CosineSimilarity(dim=-1)(embed1.unsqueeze(1), embed2.unsqueeze(0))
- embed1 becomes shape (batch_size, 1, embed_dim) and embed2 becomes shape (1, batch_size, embed_dim)
- The similarity is compared at dimension -1
- Broadcasting ensures that the comparison is done pairwise, such that the result is of shape (batch_size, batch_size)
- This is a common way to do pairwise similarity
Now we obtain the pairwise similarity matrices:
- ap_sim, aa_sim, pp_sim, an_sim
- guided_ap_sim, guided_aa_sim, guided_pp_sim, guided_an_sim
The anchor positive similarity threshold is used to filter away potential false negatives
- This is simply the guided_ap_sim.diagonal() which corresponds to the similarity between the anchor and positive in each row
- Note that they use the guide model for determining the threshold
- This threshold is called guided_sim
mask_false_negatives is used to suppress false negatives
- Using the absolute strategy, cases where the guided_sim_mat > guided_sim - self.margin will be suppressed (set to -torch.inf, so they contribute nothing after the softmax)
- The idea is that negatives should not have a higher similarity than the threshold, otherwise there is a higher probability they are false negatives
- This function is applied to ap_sim, aa_sim, pp_sim and an_sim to mask false negatives
Finally, we have scores
- scores = torch.cat([ap_sim, aa_sim, pp_sim, an_sim], dim=1) / self.temperature
- This is of shape (batch_size, 4*batch_size)
We create labels which is torch.arange(batch_size)
- This is because the correct class for row i sits at column i, i.e. on the diagonal of the leading ap_sim block of the scores matrix
Finally the loss is computed via torch.nn.CrossEntropyLoss()(scores, labels)
- Each row is considered as a classification task where the label corresponds to the column position where the correct class is found
- The log softmax loss is then computed per row and averaged across rows
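To check my understanding of the walkthrough, here is a dependency-free sketch of the same logic for (anchor, positive) pairs only. It is not the SentenceTransformer code: embeddings are plain lists, the explicit-negative block (an_sim) is left out, and the default temperature and margin values are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def gist_loss(anchors, positives, guide_anchors, guide_positives,
              temperature=0.01, margin=0.0):
    n = len(anchors)
    # (model rows, model cols, guide rows, guide cols) for the ap, aa and pp blocks.
    blocks = [(anchors, positives, guide_anchors, guide_positives),
              (anchors, anchors, guide_anchors, guide_anchors),
              (positives, positives, guide_positives, guide_positives)]
    # Per-row threshold: guide similarity between anchor i and its own positive.
    guided_sim = [cosine(guide_anchors[i], guide_positives[i]) for i in range(n)]
    total = 0.0
    for i in range(n):
        logits = []
        for b, (rows, cols, g_rows, g_cols) in enumerate(blocks):
            for j in range(n):
                true_positive = (b == 0 and j == i)
                too_similar = cosine(g_rows[i], g_cols[j]) > guided_sim[i] - margin
                if too_similar and not true_positive:
                    logits.append(-math.inf)  # suspected false negative: masked
                else:
                    logits.append(cosine(rows[i], cols[j]) / temperature)
        # Cross-entropy with label i (the diagonal of the ap block).
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / n

anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
loss = gist_loss(anchors, positives, anchors, positives)
```

Note that self-similarities on the diagonals of the aa and pp blocks are masked automatically, because the guide scores them at 1.0, above any anchor-positive threshold below 1.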

Sanjabi 2025 - 360Brew

360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation

360Brew is a 150B parameter foundation LLM used to centralize many LinkedIn models into one model.

For V1.0, 360Brew focused on replacing ranking tasks since they are less bounded by computational constraints in practice compared to retrieval.

Current Paradigm

The current paradigm relies on bloom embedding of ID based features for both members and items. These large embedding tables are supplemented with other engineered features such as item attributes. Some challenges with the current paradigm are:

Cold start: ID based features cannot handle cold start, and thus a lot of work is done to learn good content-based representations
Feature interactions: Previously,

Zhang 2025 - Qwen3 Embedding

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

This short paper introduces the Qwen3 Embedding and reranker series, which are currently the strongest open source models for such tasks. The Qwen3 foundation decoder LLM serves as the backbone for fine-tuning and is also used to generate high quality synthetic training data. Note that the models are multi-lingual and publicly available under the Apache 2.0 license, which means they can be used commercially.

Characteristics

The embedding and reranking models come in 3 sizes: 0.6B, 4B and 8B.

0.6B: 28 layers, embedding dimension of 1024 for the embedder
4B: 36 layers, embedding dimension of 2560 for the embedder
8B: 36 layers, embedding dimension of 4096 for the embedder

Since the 4B and 8B models have the same number of layers, presumably the 8B model has larger hidden sizes.

All the models have 32K sequence length limit, and are instruction aware, meaning that we can adjust the instruction at the start of the prompt to adjust the behaviour of the embedder or reranker. For the embedding models, there is also MRL support (Matryoshka Representation Learning), meaning that we can use custom dimensions from the embeddings.
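As a quick illustration of what MRL support means in practice, a Matryoshka-style embedding can be shortened by keeping the leading dimensions and re-normalizing. The helper below is a hypothetical sketch with a toy vector, not part of any Qwen3 API.

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` values and rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [3.0, 4.0, 12.0]          # toy stand-in for a 1024-dim embedding
short = truncate_embedding(full, 2)
```

Re-normalizing matters when the downstream index assumes unit-length vectors, e.g. when dot product is used as cosine similarity.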

Embedder

The text embeddings are obtained by appending an [EOS] token at the end of every input sequence. The final embedding is derived from the hidden state of the last layer corresponding to this [EOS] token.

Input format for queries or documents is as follows:

{Instruction} {Query or Document}<|endoftext|>

The contrastive loss based on InfoNCE is used for training the embedder. Specifically, given a batch of $N$ training instances, the loss is defined as: $L_{\text{embedding}} = -\frac{1}{N} \sum_{i}^{N} \log \frac{e^{s(q_{i}, d_{i}^{+}) / \tau}}{Z_{i}}$

$s$ is the cosine similarity function, $\tau$ is the temperature and $Z_{i}$ is the normalization factor which includes the positive pair + various negative pairs: $Z_{i} = e^{s(q_{i}, d_{i}^{+}) / \tau} + \sum_{k}^{K} m_{ik} e^{s(q_{i}, d_{i,k}^{-}) / \tau} + \sum_{j \neq i}^{N} m_{ij} e^{s(q_{i}, q_{j}) / \tau} + \sum_{j \neq i} m_{ij} e^{s(d_{i}^{+}, d_{j}) / \tau}$

Comment on the above normalization factor:

The second term is the similarity between each anchor query and $K$ hard negatives $d_{i, k}^{-}$ per query. Note that as it is written, only the hard negatives in the same row are used as negatives for each anchor query, but in theory we could use all negatives in the mini-batch.
The third term is the similarity between pairs of queries. The assumption is that randomly selected queries should be unrelated to each other.
The last term is the similarity between the positive document in each row (i.e. $d_{i}^{+}$ ) and all other documents $d_{j}$ (including hard negatives).

The $m_{ij}$ and $m_{ik}$ are mask factors designed to reduce the impact of false negatives in the normalization factor $Z_{i}$. Specifically, given an anchor query or document $i$ and a potential negative query or document $j$: $m_{ij} = \begin{cases} 0 & \text{if } s_{ij} > s(q_{i}, d_{i}^{+}) + 0.1 \text{ or } d_{j} == d_{i}^{+} \\ 1 & \text{otherwise} \end{cases}$

This means that for each row, we use the similarity score between the query $q_{i}$ and $d_{i}^{+}$ as a dynamic threshold to filter out false negatives. For any term in $Z_{i}$ which has too high similarity exceeding this threshold (plus a small margin), we reject the false negative and mask it out. Note that this approach is reminiscent of triplet loss semi-hard masking or the GISTEmbed loss.
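The masking is easier to see on a single row. The sketch below computes the per-row loss given the positive similarity and a pooled list of negative similarities (hard negatives, other queries, other documents). The temperature value is an assumption, and the `d_j == d_i^+` duplicate check is omitted since the sketch only sees similarity scores.

```python
import math

def masked_infonce_row(pos_sim, negative_sims, tau=0.05, margin=0.1):
    """-log(exp(s+/tau) / Z_i), where Z_i drops negatives above s+ + margin."""
    kept = [s for s in negative_sims if s <= pos_sim + margin]  # m = 1 for these
    z = math.exp(pos_sim / tau) + sum(math.exp(s / tau) for s in kept)
    return -(pos_sim / tau - math.log(z))

no_negatives = masked_infonce_row(0.8, [])
with_easy_negative = masked_infonce_row(0.8, [0.3])
with_kept_hard_negative = masked_infonce_row(0.8, [0.85])
with_masked_negative = masked_infonce_row(0.8, [0.95])
```

A negative scoring 0.85 is still within the margin and is penalized heavily, while one scoring 0.95 is treated as a likely false negative and ignored.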

Reranker

The reranker is simpler, and training remains in the text paradigm. Specifically, the authors use the LLM chat template to incorporate instruction and frame the reranking task as a yes or no question:

<|im_start|>system
Judge whether the Document meets the requirements based on the Query and 
the Instruct provided. Note that the answer can only be "yes" or "no".<|im_end|>

<|im_start|>user
<Instruct>: {Instruction}
<Query>: {Query}
<Document>: {Document}<|im_end|>

<|im_start|>assistant
<think>\n\n</think>\n\n

Instead of fitting a classifier head, no change is made to the architecture. The reranking score is computed as the normalized likelihood of the next token being yes rather than no: $\text{score}(q, d) = \frac{e^{P(\text{yes} \mid I, q, d)}}{e^{P(\text{yes} \mid I, q, d)} + e^{P(\text{no} \mid I, q, d)}}$

The task then reduces to a supervised fine-tuning task, where the label is either yes for positives or no for negatives. The loss is simply the negative log probability of the correct label for each row (yes or no). $L_{\text{reranking}} = -\log p(l \mid q, d)$
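A small sketch of the scoring and loss follows. One assumption to flag: I treat the inputs as the model's next-token logits for the yes and no tokens and take a softmax over just those two, which is one common reading of the formula above. The loss here is likewise restricted to the two tokens, whereas SFT would normally normalize over the full vocabulary.

```python
import math

def rerank_score(yes_logit, no_logit):
    """Softmax over the two candidate next tokens; returns the 'yes' share."""
    peak = max(yes_logit, no_logit)
    yes = math.exp(yes_logit - peak)
    no = math.exp(no_logit - peak)
    return yes / (yes + no)

def reranking_loss(yes_logit, no_logit, label):
    """Negative log probability of the correct label ('yes' or 'no')."""
    p_yes = rerank_score(yes_logit, no_logit)
    return -math.log(p_yes if label == "yes" else 1.0 - p_yes)

neutral = rerank_score(0.0, 0.0)
relevant = rerank_score(2.0, 0.0)
```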

Multi-stage Training

Multi-stage training has emerged as a common practice for training text embedding models. The 3 stages used are as follows:

Stage 1: Large scale synthetic data. The Qwen3 32B model was used to synthesize training pairs of data across many tasks, such as retrieval, classification, semantic textual similarity.
- To create diversity, a document is taken from the Qwen3 training corpus, and top 5 similar documents are retrieved
- Qwen3 is presented with these documents and a user persona to generate a potential query
- Qwen3 is also instructed to vary the query type, length, difficulty and language for each query
- 150 million query - document pairs are generated this way
Stage 2: High quality synthetic data
- The 150 million pairs in Stage 1 are filtered down to 12 million high quality pairs
- Specifically, only query - document pairs with cosine similarity greater than 0.7 are selected
Stage 3: Model merging
- Model merging based on Spherical Linear Interpolation is used, which merges multiple model checkpoints saved during the fine tuning process.
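For intuition on Stage 3, here is spherical linear interpolation between two flattened weight vectors. This is only the two-checkpoint primitive on toy vectors; how the paper selects and combines multiple checkpoints is not covered in these notes.

```python
import math

def slerp(w0, w1, t):
    """Interpolate along the arc between w0 and w1 (t=0 gives w0, t=1 gives w1)."""
    dot = sum(a * b for a, b in zip(w0, w1))
    norm0 = math.sqrt(sum(a * a for a in w0))
    norm1 = math.sqrt(sum(b * b for b in w1))
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    if omega < 1e-6:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(w0, w1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(w0, w1)]

merged = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

Unlike a plain average, which would give [0.5, 0.5] here, the merged vector keeps unit norm.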

Note that all 3 stages were used for the embedder, but stage 1 was omitted for the reranker as it did not help. The ablation studies show that all 3 stages are crucial for final performance of the 0.6B embedding model.

Yan 2025 - LLM for Recsys

Recsys Keynote: Improving Recommendation Systems & Search in the Age of LLMs - Eugene Yan, Amazon

This talk covers the future of recsys and how LLMs can be incorporated. 3 challenges:

Cold start challenge of hash based item IDs
Lack of metadata
Task specific models duplicate engineering, increase maintenance cost and don't benefit from transfer learning
- Benefits: simplifies systems, reduces maintenance and transfer learning
- But there may be alignment tax

Kuaishou Example for semantic IDs

Challenge: hash based item IDs don't encode item content and struggle with cold start and sparsity problem.
Solution: Semantic IDs based on multimodal content

Kuaishou is a short video platform. The main problem they wanted to tackle is to help users discover new items faster.

Idea:

Train standard ID-based embeddings for user and items
Create cluster ID from concatenated content embeddings
- Text: BERT
- Video: ResNet
- Audio: VGGish
Run k-means on 100 million items into around 1k clusters
- Each cluster gets an ID and also an embedding
- Incorporate cluster embedding in final embedding
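A toy sketch of the lookup side of this idea: assign an item to its nearest content cluster, then combine the cluster's trainable embedding with the item's ID embedding. Concatenation is my assumption here; the talk only says the cluster embedding is incorporated. The centroids would come from the k-means step and are hard-coded for illustration.

```python
def nearest_cluster(content_embedding, centroids):
    """Index of the closest centroid by squared Euclidean distance."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(centroids)),
               key=lambda c: sq_dist(content_embedding, centroids[c]))

def item_embedding(id_embedding, content_embedding, centroids, cluster_table):
    cluster_id = nearest_cluster(content_embedding, centroids)
    # List concatenation: [ID embedding | cluster embedding] (assumed combination).
    return id_embedding + cluster_table[cluster_id]

centroids = [[0.0, 0.0], [10.0, 10.0]]   # hypothetical k-means output
cluster_table = [[0.1], [0.9]]           # one trainable embedding per cluster
final = item_embedding([1.0, 2.0], [9.0, 9.5], centroids, cluster_table)
```

A brand-new item has an untrained ID embedding but still inherits a trained cluster embedding from its content, which is where the cold start benefit comes from.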

Result:

+3.4% clicks
+3.0% likes
+3.6% cold start coverage (% of item impressions which are new items)
+1.2% cold start velocity (% of new items that were able to hit some threshold of views)
Example:
- trainable, multimodal, semantic IDs @ Kuaishou
- Short videos platform
- Problem: help users discover new items faster

Filtering Bad job recommendations at Indeed

Problem: poor user experience of email job recommendations and lost trust due to low quality job recommendations
Solution: Lightweight classifier trained from GPT-4o annotated data to filter bad recs

Process:

Start with evals - 250 labelled examples with confidence labels
Started with open LLMs like Llama2, but performance was very bad
Used GPT-4, which was very accurate but too slow and costly (22 secs)
Used GPT-3.5, but had poor precision (0.63) on job recommendations
Finetuned GPT-3.5 and got 0.9 precision at 1/4th of GPT-4 cost and latency
Distilled lightweight classifier on finetuned GPT-3.5 labels

Lightweight classifier achieved 0.86 AUC-ROC, with latency <200ms. Result was:

-18% bad recommendations
- Expected lower application rates because recommending fewer items
unsubscribed rate -5%
application rate +5%

Enriching exploratory search queries @ Spotify

Problem: Help users search for new items (podcasts, audiobooks) in a catalogue of known items (e.g. songs, artists)

How to solve cold start issue for new categories?
Exploratory search was essential to expand beyond music

Solution: Query recommendation system

Start creating queries from new items (e.g. podcast title, author etc.) and ask LLM to rewrite as natural language query

Unified Ranker for Search & Recsys @ Netflix

Joint Modeling of Search and Recommendations Via an Unified Contextual Recommender (UniCoRn)

Example of Stripe building a transformer based foundation model from sequence of transactions to identify fraud.

Problem: teams deal with complexity from bespoke models for search, similar item recs, pre-query recs

High operational cost and missed transfer learning opportunities

Unified Contextual Ranker (UniCoRn) takes in a unified input schema and returns a prediction. Unified inputs:

User ID
item ID
Search Query
Task

Some clever tricks to reframe item to item recommendations as search, by using last item title as query.

Unified model used for search, pre-query filtering, video to video recs and more. Able to match or exceed previous task based models. Can iterate much faster with a unified model.

Unified Embeddings @ Etsy

Problem: How to help users get better results with highly specific or broad queries, on ever-changing inventory.

Query mother's day gift does not match product vocabulary
Lexical retrieval does not account for user preferences

Solution: Unified embedding and retrieval model

Two tower architecture for user and product side
Add a quality vector on the product side (rating, freshness, conversion rate) concatenated to the product vector
Add a constant vector on the user side just to make dimensions match
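The effect of the concatenation trick is that a single dot product returns relevance plus a quality bonus. A minimal sketch with made-up numbers is below; the function names and the `quality_weight` parameter are hypothetical.

```python
def product_vector(embedding, rating, freshness, conversion_rate):
    """Product tower output with the quality features appended."""
    return embedding + [rating, freshness, conversion_rate]

def user_vector(embedding, quality_weight=1.0, num_quality_features=3):
    """User tower output padded with constants so dimensions match."""
    return embedding + [quality_weight] * num_quality_features

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

user = user_vector([1.0, 0.0])
product = product_vector([0.5, 0.2], rating=0.9, freshness=0.1, conversion_rate=0.3)
score = dot(user, product)  # 0.5 relevance + 1.3 quality bonus
```

Because the result is still one inner product, it works unchanged with an ANN index.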

Tandon 2025 - Gemini for YouTube

Teaching Gemini to Speak YouTube: Adapting LLMs for Video Recommendations to 2B+DAU - Devansh Tandon

In general, recommendations will have higher reach than search for most consumer apps.

The personalized recommendation problem is to learn f(user, context) = recs.

How to rethink the whole recommendation process in terms of Gemini?

Large Recommender Model (LRM): adapting Gemini for recommendation tasks
Start with a base Gemini checkpoint, adapted for YouTube recommendations
Align LRM to different tasks using adapters
- Video retrieval (home, watch next, shorts, search, music)
- Video ranking

To train LRM, need to develop video tokens

Each video is one token, so we can represent watch history
And then output some sequence of video recs

SemanticID: tokenize videos to create a language for YouTube

Creating the atomic language for YouTube videos, move away from hash based tokenizing
YouTube video -> extract features like title, description, transcript, audio, video frames -> embedding -> RQ-VAE to quantize embedding into semantic ID
E.g. a semantic ID would look like A228 B204 C196 D413 E589 ...
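The quantization step at the end of that pipeline can be sketched as residual quantization: at each level, pick the nearest codebook entry, subtract it, and quantize what is left at the next level. In a real RQ-VAE the encoder and codebooks are learned; here the codebooks are tiny hand-written lists and the letter-plus-index rendering simply mimics the example format above.

```python
def residual_quantize(vector, codebooks):
    """Return one code index per level, coarse to fine."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        distances = [sum((r - c) ** 2 for r, c in zip(residual, entry))
                     for entry in codebook]
        idx = distances.index(min(distances))
        codes.append(idx)
        # Pass on only what this level failed to explain.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def to_semantic_id(codes):
    return " ".join(f"{chr(ord('A') + level)}{code}" for level, code in enumerate(codes))

codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],   # level A: coarse
    [[0.0, 0.0], [0.5, 0.0]],   # level B: refinement
]
codes = residual_quantize([1.4, 1.0], codebooks)
semantic_id = to_semantic_id(codes)
```

Items with similar content share leading codes, which is what lets the model generalize across videos in a way hash-based IDs cannot.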

Continued pre-training: get LRM to understand both English and the semantic ID "language":

Synthetic data relationships:
- Link text and SID, e.g. Prompt: Video [A228 ...] has title and the response is Carlos Alcaraz vs Novak Djokovic
- Create lots of synthetic data doing this
User behaviours: SID sequences of video watches
- Prompt: A user has previously watched the following videos: [A110, ...], [F707, ...], [C94, ...], \n <mask1>
- Output: <mask1>: [B230, ...]

After doing this kind of training, we get an LRM that can reason across both English and YouTube video language.

e.g. Prompt: Video [A185, ...] is interesting to tennis fans since it is about wimbledon. Video [J110 ...] is interesting to
Model can respond Output: scientists since it is about AI

We can now perform generative retrieval with this new language. For example, we can just load context and history into the prompt, and get it to recommend new videos.

Prompt:
------
User: region US | 24 years female | device ANDROID | origin watch next|

Context video: channel Olympics | title WHAT a COMEBACK! Men's 400m | SemanticID_1 |

Watch history:
SID 1 Taylor Swift 100% 260.00s
SID 2 Kris Hui 40% 260.00s

Interestingly, compared to traditional recommenders which return mostly men's sports videos (due to the last video being Men's 400m), the generative system was able to recommend women's sprint events.

In general:

The LRM is able to learn very quickly, very data efficient
Handles the toughest recs when user information is very limited
But serving costs are very large at YouTube scale, so a lot of effort is focused on reducing TPU serving cost

A simple trick to reduce serving cost is to do batch recommendations

Build a simple video to recs table where given a seed video, get the LRM to output generic recommendations
Prompt is something like language {seed_lang} | duration {video_length} | age {video_age} | title {title} | channel {uploader} | {seed_sid}
Every day, take the top 20M videos by views in the last 7 days, and do batch inference
Use these batch recommendations to serve some users

Base Gemini is too large at YouTube scale, have to use smaller more efficient models to adapt for recsys.

LLM x RecSys Recipe

Tokenize content

Capture the content essence into an atomic, semantic token

Adapt LLM: english <> domain language

Adapt foundation model to reason across english & new tokens (bilingual)

Prompt with user information

Construct prompts with user information, activity, actions

Train surface/task-specific models

Hameed 2025 - 360Brew

360Brew: LLM-based Personalized Ranking and Recommendation - Hamed and Maziar, LinkedIn AI

Pain points on current LinkedIn ML:

Operational: costly low agility development lifecycle
Quality: disjoint optimization
Developer experience: slow to roll out changes to models one by one

Goal: build a foundational model capturing the lifetime member activity data that solves all LinkedIn matching problems

Zero shot capability: works well out of the box for next prediction tasks
- Measure how well the model does on new products
In-Context learning: Learning from few examples without needing to retrain
- How well does the model do on new users / items? [cold start]
Follow instruction from developers / users
- User control via prompts

Development

Building the LLM:

Need to convert user history into a prompt by verbalizing user information and activities
Provide instruction on what problem we are solving
At time of training use different styles for verbalization

Prompt looks something like:


## Instruction
You are provided a member's profile and a set of jobs, their description, and interactions that the member had with the jobs. For each past job, the member has taken one of the following actions: applied, viewed, dismissed, or did not interact. Your task is to analyze the job interaction data along with the member's profile to predict whether the member will apply, view, or dismiss a new job referred to as the "Question" job.

Note: Focus on skills, location, and years of experience more than other criteria.

## Member Profile
Current position: software engineer, current company: LinkedIn, Location: Sunnyvale, California.

## Past job interaction data
Member has applied to the following jobs: [Age: 2 days, Title: Software Engineer, Location: New York, Country: USA, Company: Meta, Description: . . . ]
Member has viewed the following jobs: [Age: 1 week, Title: Software Engineer, Location: Texas, Country: USA, Company: AMD, Description: . . . ]

## Question 1
Will the member apply to the following job: [Age: 1 day, Title: Software Engineer, Location: Seattle, Country: USA, Company: Apple, Description: . . . ]

## Question 2
Will the member apply to the following job: [Age: 5 days, Title: RF Engineer, Location: Bay Area, Country: USA, Company: Google, Description: . . . ]

So in contrast to YouTube's semantic IDs, LinkedIn encodes past interactions in textual form.
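A rough sketch of how such a prompt could be assembled from structured records is below. The helper names and the dictionary layout are hypothetical; only the section headers and sentence templates are copied from the example prompt above.

```python
def verbalize_job(job):
    """Render a job record as '[Key: value, ...]'."""
    return "[" + ", ".join(f"{key}: {value}" for key, value in job.items()) + "]"

def build_prompt(instruction, profile, history, question_jobs):
    lines = ["## Instruction", instruction, "",
             "## Member Profile", profile, "",
             "## Past job interaction data"]
    for action, jobs in history.items():
        rendered = ", ".join(verbalize_job(job) for job in jobs)
        lines.append(f"Member has {action} the following jobs: {rendered}")
    for number, job in enumerate(question_jobs, start=1):
        lines += ["", f"## Question {number}",
                  "Will the member apply to the following job: " + verbalize_job(job)]
    return "\n".join(lines)

prompt = build_prompt(
    instruction="Predict whether the member will apply to the Question job.",
    profile="Current position: software engineer, Location: Sunnyvale, California.",
    history={"applied to": [{"Age": "2 days", "Title": "Software Engineer"}],
             "viewed": [{"Age": "1 week", "Title": "Software Engineer"}]},
    question_jobs=[{"Age": "1 day", "Title": "Software Engineer"},
                   {"Age": "5 days", "Title": "RF Engineer"}],
)
```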

Development pipeline:

Start with OSS model
Continued pre-training
Supervised Finetuning
Alignment
Generate Brew-XL 150B model
Distill to Brew-mini
Prune and quantize to Brew-mini-turbo at 3B parameters
- Ablation studies show that it is critical to first go BIG, then go small

To make development cycle smooth, build in a lot of automation into the pipelines. Especially evaluation loop.

Three levers to improve model quality:

More (and better) data
- Prepare data to maximize accuracy, distribution of different type of data
Bigger model size
Context length
- Longer context length means deeper user activity
- Increasing context length initially improves performance to a certain point (around 20-30k tokens)
- Beyond that models don't generalize that well and performance degrades

Tasting

Performance of model is best for cold start users. Measure relative gain over production model:

5 maximum activities: +6%
10 maximum activities: +4%
100 maximum activities: +2%

Generalization to new domain. 360Brew model can generalize to out of domain tasks and surfaces and beat production models in those tasks.

Increases team agility to roll out new features without training new model

Serving

Three levers to improve efficiency:

Smaller model
- Distillation from big model to small model done using SFT + KD loss
- Gradual distillation is more effective than direct distillation, i.e. go 8B model to 6B model to 3B model etc.
- Pruning is done layerwise, gradual pruning
Quantization: mixed precision
- FP8 for all weights
- FP32 for language model head and logit processor. Important for recommendations, otherwise predictions collapse.
Sparsification
- Star attention (reduce attention quadratic cost)
  - Not every item needs to attend to every item
- When scoring, we can score multiple items at the same time (sounds like 500)
  - Need to make sure these items do not attend to each other
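A minimal sketch of the attention mask that multi-item scoring implies, assuming a layout of shared context tokens followed by the candidate items. The function name `multi_item_mask` and the layout are my own assumptions for illustration, not LinkedIn's implementation:

```python
def multi_item_mask(n_context, item_lengths):
    """Return mask[i][j] = True if token i may attend to token j.

    Assumed layout: [context tokens][item 1 tokens][item 2 tokens]...
    """
    total = n_context + sum(item_lengths)
    # Record which segment each token belongs to: -1 for context, else item index.
    segment = [-1] * n_context
    for idx, length in enumerate(item_lengths):
        segment += [idx] * length

    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(i + 1):  # causal: only look at positions <= i
            same_segment = segment[i] == segment[j]
            attends_context = segment[j] == -1
            mask[i][j] = same_segment or attends_context
    return mask


mask = multi_item_mask(n_context=3, item_lengths=[2, 2])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each item sees the shared context and its own earlier tokens, but never another item, so the scores should match scoring each item separately.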

Q&A:

They use 50-60 tasks out of domain to measure the effectiveness of the model in the eval loop.
Designed custom vLLM kernels to allow multi-item scoring by modifying the attention mask

NLP Course

Based on lectures from Graham Neubig's Advanced NLP Class in 2024.

Lecture 1: Introduction

Lecture 1 Link

A general framework for NLP systems: create a function to map an input X into an output Y, where X and/or Y involve language. Tasks include:

Language Modelling
Translation
Text Classification
Language Analysis
Image Captioning

Challenges of NLP

Low frequency words
Conjugation (e.g. adjectives modifying a word)
Negation
Metaphor, analogy
Other languages

Topics:

Language Modelling Fundamentals
Training and Inference Methods
Experimental Design and Evaluation
Advanced Training and Architectures
NLP Applications
Linguistics and Multi-Linguality

Lecture 2: Word Representation and Text Classification

Subword models aim to address the issue of low-frequency words. How many words are there in the English language? This is a bit of a trick question, because it depends on whether we consider company and companies to be different words. By using subwords we can greatly reduce the vocabulary size to around 60k, which is what modern tokenizers use.

Another way to address this is to use character-level models. The main issue of character-level models is that our sequences become very long, so for a fixed sequence length we cover a lot less text in the same batch size. Using subwords saves compute and memory.

Byte Pair Encoding (2015) is a very simple method to create subwords. It simply incrementally combines the most frequent token pairs together, starting at the character level. e.g. starting with the sentence newest, widest, we first combine es into a token, then est, and so on.
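A minimal sketch of the BPE merge loop on the newest, widest example. It assumes words are already split out and that ties go to the pair seen first; the function names are my own, not from any library:

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn BPE merges from a list of words, starting at the character level."""
    corpus = [tuple(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols in corpus:
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += 1
        if not pair_counts:
            break
        # Most frequent pair; ties go to the pair that was counted first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        corpus = [merge_pair(symbols, best) for symbols in corpus]
    return merges, corpus


merges, corpus = learn_bpe(["newest", "widest"], num_merges=2)
print(merges)  # [('e', 's'), ('es', 't')]
print(corpus)  # [('n', 'e', 'w', 'est'), ('w', 'i', 'd', 'est')]
```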

Another way to do this is to use Unigram Models (e.g. Kudo 2018). First we use a unigram LM that generates all words in the sequence independently. We then pick a vocabulary that maximizes the log likelihood of the corpus given a fixed vocabulary size. The optimization uses the EM algorithm.

Sentencepiece is a highly optimized library to train both these types of subword models.

Subword nuances:

Subword models are hard to use multilingually because, applied naively, they over-segment less common languages
Subword segments are sometimes arbitrary (e.g. is it es t or e st?)

PyTorch vs TensorFlow / JAX

PyTorch is more widely used
PyTorch favours dynamic execution over compile + execute

Training transformers typically uses a learning rate schedule. Start low, ramp up to a peak (warm-up), and decrease toward the end. Starting with a high learning rate tends to lead to weird models because transformers are sensitive to the early updates.
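A minimal sketch of such a schedule. The specific shape (linear warm-up followed by cosine decay) and all the default values are assumptions for illustration; the lecture notes only say start low, increase, then decrease:

```python
import math


def lr_at_step(step, peak_lr=3e-4, warmup_steps=1000, total_steps=10000):
    """Linear warm-up to `peak_lr`, then cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * min(1.0, progress)))


for step in (0, 500, 1000, 5500, 10000):
    print(step, f"{lr_at_step(step):.2e}")
```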

Lecture 3: Language Modelling

Lecture 3 Link

Generative vs Discriminative models:

Discriminative model: a model that calculates the probability of a latent trait given the data, i.e. $P (Y ∣ X)$ , where $Y$ is say the sentiment and $X$ is the text
Generative model: a model that calculates the probability of the data itself i.e. $P (X)$

Generative models for text are called language models. We can use LMs to generate sentences. We can also use them to score sentences (e.g. when a sentence is a concatenation of a question and answer pair). Can also use LMs to correct sentences.

Auto-regressive language models compute $P (X) = \prod_{i = 1}^{I} P (x_{i} ∣ x_{1}, ..., x_{i - 1})$ . Question: why do we do it auto-regressively instead of predicting the entire sequence $X$ at once? The problem is computational: predicting the next token has an output space of size $∣ V ∣ \sim 60,000$ , but predicting the entire sequence has an output space in the order of $∣ V ∣^{N}$ where $N$ is the length of the sequence, which is currently intractable. That being said, if we could model the entire sequence at once, it would probably be a lot more efficient than the auto-regressive way.

The simplest language model is a count-based unigram model. By making an independence assumption $P (x_{i} ∣ x_{1}, ..., x_{i - 1}) \approx P (x_{i})$ , we just ignore all the previous words and predict the probability of a word occurring.

The maximum likelihood estimation for the probability of a given token $x_{i}$ will simply be:

$P_{MLE} (x_{i}) = \frac{c_{train} (x_{i})}{\sum_{\tilde{x}} c_{train} (\tilde{x})}$

Detail: parametrizing in log space. Multiplication of probabilities is re-expressed as addition in log space, because otherwise, if we multiply 100 probabilities together for a sequence of 100 tokens, we will easily underflow the numeric precision.

$P (X) = \prod_{i = 1}^{N} P (x_{i})$

$\log P (X) = \sum_{i = 1}^{N} \log P (x_{i})$

Correspondingly, we can define the parameters $θ_{x_{i}} = \log P (x_{i})$ .
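A minimal sketch of the count-based unigram model with log-space parameters. The toy corpus and function names are made up for illustration, and unseen tokens are not handled (they would raise a `KeyError`):

```python
import math
from collections import Counter


def train_unigram(tokens):
    """Return theta[x] = log P_MLE(x), estimated from training counts."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: math.log(count / total) for token, count in counts.items()}


def log_prob(theta, sequence):
    """log P(X) as a sum of per-token log probabilities."""
    return sum(theta[token] for token in sequence)


train = "the cat sat on the mat".split()
theta = train_unigram(train)
print(math.exp(theta["the"]))           # 2 of the 6 training tokens
print(log_prob(theta, ["the", "cat"]))  # log(2/6) + log(1/6)

# Multiplying 1000 probabilities underflows to 0.0; summing logs does not.
print((1 / 6) ** 1000, 1000 * math.log(1 / 6))
```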

Moving on to higher-order n-gram models, the idea is to limit the context to the $n - 1$ words before the token we are predicting, and then count:

$P_{ML} (x_{i} ∣ x_{i - n + 1}, ..., x_{i - 1}) := \frac{c (x_{i - n + 1}, ..., x_{i})}{c (x_{i - n + 1}, ..., x_{i - 1})}$

e.g. P(example | this is an) = c(this is an example) / c(this is an).

Due to sparsity of data, we need to add smoothing to deal with zero counts, i.e. instead of just using tri-gram, we smooth tri-gram and bi-gram probabilities together:

$P (x_{i} ∣ x_{i - n + 1}, ..., x_{i - 1}) = λ P_{M L} (x_{i} ∣ x_{i - n + 1}, ..., x_{i - 1}) + (1 - λ) P (x_{i} ∣ x_{i - n + 2}, ..., x_{i - 1})$

e.g. $P (\text{example} ∣ \text{this is an}) = λ P_{ML} (\text{example} ∣ \text{this is an}) + (1 - λ) P_{ML} (\text{example} ∣ \text{is an})$ .
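A minimal sketch of the interpolation in the example above, mixing the estimate conditioned on "this is an" with the one conditioned on "is an". The toy corpus, the helper names and the value of `lam` are all made up for illustration:

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def p_ml(counts, context_counts, context, word):
    """Maximum likelihood estimate c(context + word) / c(context)."""
    denom = context_counts[context]
    return counts[context + (word,)] / denom if denom else 0.0


tokens = "this is an example and this is an apple and that is an example".split()
bi, tri, four = (ngram_counts(tokens, n) for n in (2, 3, 4))


def p_interpolated(word, context, lam=0.7):
    """Mix the longer-context estimate with the shorter-context one."""
    high = p_ml(four, tri, context, word)    # P_ML(word | this is an)
    low = p_ml(tri, bi, context[1:], word)   # P_ML(word | is an)
    return lam * high + (1 - lam) * low


# 0.7 * (1/2) + 0.3 * (2/3), i.e. about 0.55
print(p_interpolated("example", ("this", "is", "an")))
```

The shorter context has more counts behind it, so it keeps the estimate away from zero when the longer context is rare.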

More sophisticated smoothing techniques are studied in Goodman 1998: An Empirical Study of Smoothing Techniques for Language Modelling.

Problems:

Cannot share strength amongst similar words, e.g. car and bicycle
Cannot condition on context with intervening words, e.g. Dr Jane Smith vs Dr Gertrude Smith
Cannot handle long-distance dependencies, e.g. tennis and racquet in for tennis class he wanted to buy his own racquet

The standard toolkit for n-gram models is kenlm, which is extremely fast and scalable, and written in C++.

Evaluating Language Models

Log Likelihood:

$LL (X_{test}) = \sum_{X \in X_{test}} \log P (X)$

Per-word Log Likelihood:

$WLL (X_{test}) = \frac{1}{\sum_{X \in X_{test}} ∣ X ∣} \sum_{X \in X_{test}} \log P (X)$

Per-word Cross Entropy:

$H (X_{test}) = - \frac{1}{\sum_{X \in X_{test}} ∣ X ∣} \sum_{X \in X_{test}} \log_{2} P (X)$

Aside: Any probabilistic distribution can also be used to compress data. The entropy measure is closely related to the number of bits needed to store the data based on our language model.

Perplexity. Lower is better. Intuitively, perplexity is roughly the number of times we would need to sample from the model's probability distribution, on average, until we get the answer right.

$PPL (X_{test}) = 2^{H (X_{test})} = e^{- WLL (X_{test})}$
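A minimal sketch tying these metrics together, assuming we already have the natural-log probability $\log P (X)$ of each test sentence. The uniform toy model is made up so that the expected answer is easy to check by hand:

```python
import math


def evaluate(sentence_log_probs, sentence_lengths):
    """Compute LL, per-word LL, per-word cross entropy (bits) and perplexity.

    `sentence_log_probs` holds natural-log probabilities log P(X) per sentence.
    """
    ll = sum(sentence_log_probs)
    n_words = sum(sentence_lengths)
    wll = ll / n_words
    cross_entropy_bits = -wll / math.log(2)  # convert nats to bits
    ppl = math.exp(-wll)                     # equals 2 ** cross_entropy_bits
    return ll, wll, cross_entropy_bits, ppl


# A model that is uniform over a 4-word vocabulary, scored on two sentences.
vocab_size = 4
lengths = [3, 5]
log_probs = [n * math.log(1 / vocab_size) for n in lengths]
ll, wll, h, ppl = evaluate(log_probs, lengths)
print(h, ppl)  # about 2 bits per word and perplexity 4
```

A uniform model over 4 words has perplexity 4, which matches the "number of guesses" intuition.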

Other Desiderata of LMs

Calibration (Guo 2017). Formally, we want the model's probability for an answer to match the actual probability of getting it right. Typically we measure calibration by bucketing the model output probabilities and calculating expected calibration error:

$ECE = \sum_{m = 1}^{M} \frac{∣ B_{m} ∣}{n} ∣ \text{acc} (B_{m}) - \text{confidence} (B_{m}) ∣$

where $B_{m}$ is the $m$-th bucket of the data, corresponding to a confidence interval of the model's output, and $n$ is the total number of samples.
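A minimal sketch of the ECE computation, assuming equal-width confidence buckets over [0, 1]. The bucketing scheme and the toy predictions are assumptions for illustration:

```python
def expected_calibration_error(confidences, correct, n_buckets=10):
    """ECE with equal-width confidence buckets over [0, 1]."""
    buckets = [[] for _ in range(n_buckets)]
    for conf, is_correct in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bucket.
        index = min(int(conf * n_buckets), n_buckets - 1)
        buckets[index].append((conf, is_correct))

    n = len(confidences)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / n * abs(accuracy - avg_conf)
    return ece


confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [True, True, True, False, True, False]
# (4/6) * |0.75 - 0.95| + (2/6) * |0.5 - 0.55|, i.e. about 0.15
print(expected_calibration_error(confidences, correct))
```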

How do we calculate answer probabilities? e.g. the university is CMU and the university is Carnegie Mellon University should both be acceptable.

One way is to use paraphrases to substitute phrases (Jiang 2021).
Another way is to just ask the model to generate a confidence score, see Tian 2023 - Just ask for calibration

Another desirable characteristic is efficiency. Some metrics are:

Memory usage (load model only, peak memory usage)
Latency (to first token, to last token)
Throughput

Some efficiency tips:

On modern hardware doing 10 operations of size 1 is much slower than doing 1 operation of size 10
CPUs are like motorcycles and GPUs are like airplanes
Try to avoid memory moves between CPU and GPU, and if we need to move memory, do so as early as possible (as GPU operations are asynchronous).

Lecture 4: Sequence Modelling

Lecture 4 Link

NLP is full of sequential data, especially data containing long range dependencies. References can also be complicated, e.g. the trophy would not fit in the suitcase because it was too big. What does it refer to? These are called Winograd schemas.

Types of tasks:

Binary classification
Multi-class classification
Structured Prediction. e.g. predicting the parts-of-speech tag for each word in the sentence

Sequence labelling is an important category of tasks in NLP:

e.g. Parts of speech tagging
Lemmatization
Morphological tagging, e.g. PronType=prs

Span labelling:

Named entity recognition
Syntactic Chunking
Entity Linking
Semantic role labelling

We can treat span labelling as a sequence labelling task by using beginning, inside and outside (BIO) tags.
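A minimal sketch of that conversion, assuming spans are given as (start, end, label) with an exclusive end. The function name, sentence and labels are made up for illustration:

```python
def spans_to_bio(tokens, spans):
    """Convert labelled spans into one BIO tag per token.

    `spans` is a list of (start, end, label) with `end` exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


tokens = ["Graham", "Neubig", "teaches", "at", "Carnegie", "Mellon", "University"]
spans = [(0, 2, "PER"), (4, 7, "ORG")]
print(list(zip(tokens, spans_to_bio(tokens, spans))))
```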

Three major types of sequence modelling:

Recurrence
Convolution
Attention

Recurrent neural networks essentially unroll a computational graph through "time" (where time is the position in the sequence). This results in a vanishing gradient problem, where the gradient on the last token is largest, and the gradient becomes smaller as we move backwards towards the first token. This problem is not specific to RNNs but applies to computation graphs in general: if there is important information at some node, adding a direct connection from that node to the loss is one way to improve performance. This is the motivation for residual or skip-connections.

LSTM is one way of solving this problem. The basic idea is to make additive connections between time steps, which do not cause the gradient to vanish, and to have different "gates" that control the information flow. Additive connections help with the vanishing gradient problem because the derivative of a sum with respect to each input is 1, so the gradient passes through the addition unchanged.

Attention score functions:

Original paper in Bahdanau 2015 used a multi-layer perceptron for the attention function, which is very expressive but has extra parameters: $a (q, k) = w_{2}^{T} \tanh (W_{1} [q; k])$
Bilinear in Luong 2015: $a (q, k) = q^{T} W k$
Dot product: $a (q, k) = q^{T} k$
Scaled Dot product (Vaswani 2017): $a (q, k) = \frac{q^{T} k}{\sqrt{∣ k ∣}}$ , where $∣ k ∣$ is the dimension of the key vector

Note that the attention mechanism in Vaswani 2017 is essentially a bilinear function + scaling, because the weight matrices are involved in obtaining the query and key vectors.

Comparing RNN, convolution and attention, for a token at position 200 to attend to token at position 1:

RNN will take 199 steps through the computation graph
Convolution will take 20 steps if the convolution window is 10
Attention will take 1 step

This is an advantage of the attention representation of text. Another advantage of attention is the computation speed. For a given text of N tokens, in order to produce N-1 token predictions for the next word at each step:

RNN will have to sequentially run N-1 steps
Attention will do it in 1 step by masking the attention matrix suitably. This makes it much faster to train attention models.
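A minimal sketch of the masking idea in plain Python, assuming the scaled dot-product score with the $\sqrt{d}$ scaling of Vaswani 2017. Row $i$ of the result only places weight on positions up to $i$, which is what lets all positions be trained in one pass:

```python
import math


def causal_attention_weights(queries, keys):
    """Scaled dot-product attention weights with a causal mask."""
    dim = len(keys[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                scores.append(float("-inf"))  # masked: cannot see the future
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(dim))
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in causal_attention_weights(x, x):
    print([round(w, 3) for w in row])
```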

Truncated Backpropagation is one technique to reduce computation. e.g. in the context of RNN, we might forward propagate through positions 1-10 and compute the gradients. For the next positions 11-20, the hidden state from position 10 is used for forward propagation, but we do not backpropagate the gradients back to positions 1-10.

Lecture 5: Transformers

Lecture 5 Link

Transformers started as cross attention (Bahdanau 2014) and evolved to self-attention (Vaswani 2017). There are two main types of transformers:

Encoder-Decoder model, e.g. T5 or BART. Encoder-decoder models have a separate encoder and decoder module. The encoder takes the input sequence and passes it into a transformer block to return a "context" embedding. The decoder then takes in both the "context" embedding and the output sequence (e.g. the same sentence in French for translation, or the same sentence itself shifted right for self-attention) with appropriate masking and generates a probability for each position in the output sequence. Note that the encoder module has un-masked full attention over the entire input sequence, but the decoder module is masked.

The benefit of the encoder-decoder model is that we get an embedding representation of a given input sequence which can be used for downstream tasks. But it is also less flexible because we need to have a concept of the "input sequence" vs the "output sequence", which is suitable for tasks like translation but not for text generation tasks where there is no clear input/output.
Decoder only model, e.g. GPT or LLaMa. The decoder model simply takes in the input sequence and passes it through a transformer module, resulting in a probability for each position in the input sequence. Using appropriate causal masking, we can train the network to predict the next token at each position given only information about tokens at previous positions.

The benefit of the decoder model is fewer parameters than the encoder-decoder model. It's also more suitable for text generation tasks. In the encoder decoder framework, at each time step we need to recompute the encoder representation, because the representation at earlier positions can change with the addition of a new token. In the decoder only framework, the previous cached Q, K, V values do not change due to the causal masking, so we can re-use those in decoding the next time step. We should thus expect decoder only models to be faster at decoding.

Core concepts

Multi-head attention. The intuition for having multiple attention heads is that information from different parts of the sentence can be useful to disambiguate in different ways. Typically for a given word to attend to nearby words is useful for learning syntax, but attending to further words is useful for learning semantics.

Multi-head attention comprises multiple attention modules, each with their own weights $W_{Q_{i}}, W_{K_{i}}, W_{V_{i}}$ . The attention outputs are computed for each head, and then concatenated together. Since the resulting matrix will have $n_{head}$ times the size on one dimension, we pass it through a final linear transformation to get the desired dimension size (as though we only used one attention head).

In practice the attention weights across all the heads are concatenated together first before the matrix multiplication to vectorize the computation. The resulting matrix is then sliced and attention computed on each head, before concatenating together again for the final matrix multiplication.

Positional Encoding. There is no notion of position in the transformer. Positional encoding simply adds an embedding at each position to the word embedding to encode this information. Note that it is added right at the beginning to the raw token embeddings.

Sinusoidal Encoding was the original proposal in the Vaswani 2017 paper. The position embedding is a fixed vector at each position $t$ , where the $i^{th}$ element is $\sin (ω_{k} \cdot t)$ for even $i = 2 k$ and $\cos (ω_{k} \cdot t)$ for odd $i = 2 k + 1$ , and $ω_{k} := \frac{1}{10000^{2 k / d}}$ . Here $k$ is an index on the dimension going from $1, ..., \frac{d}{2}$ and $d$ is the dimension of the embedding.

The method is rather counter-intuitive but the basic idea is that we want the dot product between two embeddings to be high when the relative position is near and decay as we move away. The intuition is covered nicely in Kazemnejad 2019 and basically we can think of each positional embedding as analogous to a binary encoding represented by $0, 1$ bits at each dimension. Instead of a hard $0$ or $1$ , the sine and cosine functions provide a smoothed version so that we get a nice relative decay.

Note that this method does allow us to extrapolate the position embedding to longer sequences that we have not seen before in training.
Learnable Embeddings. This was proposed in Shaw 2018, which basically just allows the embedding at each position to be a learnable vector. The problem with this approach is that it becomes impossible to extrapolate to longer sequences at inference time.
Rotary Positional Encodings or RoPE. This was proposed in Su 2021. The fundamental idea is that we want the dot product of embeddings to result in a function of relative position.

Specifically, we desire that for given positions $m, n$ , and the respective word embeddings $x_{m}, x_{n}$ , the dot product of the resulting embeddings may be expressed purely as a function of their relative distance $m - n$ , and in so doing we lose notion of the absolute position entirely. $f_{q} (x_{m}, m) \cdot f_{k} (x_{n}, n) = g (x_{m}, x_{n}, m - n)$

The paper uses trigonometry and imaginary numbers to come up with a function that satisfies this property. The benefit of losing notion of absolute position entirely means that we can extrapolate to longer sequences that we have not seen before, and RoPE extrapolates better than sinusoidal embeddings. LLaMa uses RoPE embeddings.
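A minimal sketch of the sinusoidal and rotary encodings in plain Python, to check the relative-position behaviour described above. It assumes $k$ is 0-indexed and that RoPE rotates consecutive pairs of dimensions; the vectors are made up and this is not taken from any particular implementation:

```python
import math


def sinusoidal_encoding(t, d, base=10000.0):
    """Fixed position embedding: (sin, cos) pairs at d / 2 frequencies."""
    encoding = []
    for k in range(d // 2):  # assumption: k is 0-indexed in this sketch
        omega = 1.0 / base ** (2 * k / d)
        encoding += [math.sin(omega * t), math.cos(omega * t)]
    return encoding


def rope(x, position, base=10000.0):
    """RoPE: rotate each consecutive pair of dimensions by position * omega_k."""
    d = len(x)
    rotated = []
    for k in range(d // 2):
        angle = position / base ** (2 * k / d)
        a, b = x[2 * k], x[2 * k + 1]
        rotated += [a * math.cos(angle) - b * math.sin(angle),
                    a * math.sin(angle) + b * math.cos(angle)]
    return rotated


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Sinusoidal: similarity is highest at offset 0 and drops for nearby offsets.
d = 64
anchor = sinusoidal_encoding(50, d)
for offset in (0, 1, 2):
    print(offset, round(dot(anchor, sinusoidal_encoding(50 + offset, d)), 3))

# RoPE: the same relative distance gives the same score at any absolute position.
query = [0.3, -1.2, 0.8, 0.5]
key = [1.0, 0.4, -0.7, 0.9]
print(dot(rope(query, 3), rope(key, 5)))
print(dot(rope(query, 103), rope(key, 105)))
```

The last two prints should agree up to floating point error, since both pairs are 2 positions apart.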

Stability. The problem is that gradients can vanish or explode as we pass through the layers of an RNN or transformer. Layer normalization (Ba 2016) is the traditional way to deal with this issue. The intuition is that it normalizes the outputs of each attention layer to a consistent range, preventing too much variance in the scale of outputs.

$LayerNorm (X; g, b) = \frac{g}{σ _{x}} \cdot (X - μ_{x}) + b$

Here, $X \in R^{d \times n}$ is the output of an attention layer of embedding dimension $d$ and sequence length $n$ , $μ_{x}$ and $σ_{x}$ are the mean and standard deviation computed across the $d$ embedding dimensions (separately for each position), and $g$ and $b$ are learnable vectors of dimension $d$ . Hence, if $X$ has very large or small values, the normalization process standardizes the range of values. The parameters $g$ and $b$ allow the model flexibility to shift the values to a different part of the space.

A simplification of layer norm is RMSNorm (Zhang and Sennrich 2019). It removes the mean and bias terms but does not hurt performance empirically. It is used in LLaMA. The only learnable parameter per layer is $g$ .

$RMS (X) = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} x_{i}^{2}}$

$RMSNorm (X) = \frac{X}{RMS (X)} \cdot g$
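A minimal sketch of both normalizations applied to a single vector of feature values. The small `eps` term for numerical stability is an assumption that is commonly added in practice and is not part of the formulas above:

```python
import math


def layer_norm(x, g, b, eps=1e-5):
    """LayerNorm over the values of one vector, with gain `g` and bias `b`."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    std = math.sqrt(var + eps)
    return [gi * (v - mean) / std + bi for v, gi, bi in zip(x, g, b)]


def rms_norm(x, g, eps=1e-5):
    """RMSNorm: no mean subtraction and no bias, only a gain `g`."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [gi * v / rms for v, gi in zip(x, g)]


x = [10.0, 200.0, -50.0, 40.0]
ones, zeros = [1.0] * len(x), [0.0] * len(x)
print([round(v, 3) for v in layer_norm(x, ones, zeros)])
print([round(v, 3) for v in rms_norm(x, ones)])
```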

Residual Connections. Add an additive connection between input (to an attention layer) and the output. $Residual (X, f) = f (X) + X$

Where $f$ is the attention layer function. It prevents vanishing gradients (since we get $X$ ) and also allows $f$ to focus on learning the difference from the input. In the self-attention context, having the residual connection also prevents the need for tokens to attend to themselves, since that is provided by $X$ itself.

Post vs Pre Layer Norm (Xiong 2020). This paper found that applying LayerNorm after the attention layer (and the residual connection) broke the residual connection, because we have $LayerNorm (f (X) + X)$ , which means that we are no longer guaranteed to get the input $X$ . This led to some instability in transformer training. Instead, they found it is better to apply $f (LayerNorm (X)) + X$ , which led to more stable training, since the residual connection is preserved.

Activation functions.

Vaswani used $ReLU (x) = \max (0, x)$
LLaMA uses Swish or SiLU (Hendrycks and Gimpel 2016), which is $Swish (x; β) = x \cdot σ (β \cdot x)$ . It looks a lot like ReLU but avoids the zero gradient problem.
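A small sketch comparing the two activations. Swish with $β = 1$ is SiLU; note that negative inputs still give small non-zero outputs, unlike ReLU:

```python
import math


def relu(x):
    return max(0.0, x)


def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x). With beta = 1 this is SiLU."""
    return x / (1.0 + math.exp(-beta * x))


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, relu(x), round(swish(x), 4))
```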

Transformers are powerful but fickle - Vaswani 2017 used Adam with learning rate increase and decrease (warm up). This is no longer that necessary after the pre-layer norm (Xiong 2020).

AdamW (Loshchilov and Hutter 2017) is now more popular. It applies weight decay for regularization to Adam and corrects the previous implementation which applied the regularization incompletely. AdamW is thus theoretically more stable than Adam.

In summary, some comparisons. The Mamba paper found that the LLaMA architecture is 10x more efficient in terms of scaling law compared to the original Vaswani.

	Vaswani	LLaMA
Norm Position	Post	Pre
Norm Type	LayerNorm	RMSNorm
Non-linearity	ReLU	SiLU
Position-Encoding	Sinusoidal	RoPE

Lecture 6: Decoding Strategies

What is an LLM? It is a model that defines a conditional probability distribution of a sequence given some input sequence $X$ .

$P (Y ∣ X) = \prod_{j = 1}^{J} P (y_{j} ∣ X, y_{1}, ..., y_{j - 1})$

The nice thing about the conditional distribution is that we get some notion of the model's confidence about the next token to generate. The problem with the conditional distribution is hallucination, since models generally assign some small but non-zero probability to incorrect tokens, even if all the pre-training data is factual. See Kalai and Vempala 2023.

Ancestral Sampling is to sample the next token based on the indicated confidence by the model. The nice thing is that the resultant generations follow exactly the distribution of the model.

$y_{j} \sim P (y_{j} ∣ X, y_{1}, ..., y_{j - 1})$

The problem with ancestral sampling is the long tail problem. Most language models have a vocabulary of around 30k tokens, and the probabilities from the long tail add up, so that there's a somewhat good chance of sampling something really unlikely.

The obvious solution to this problem is to ignore the long tail and only sample from the top-k most probable tokens. This is top-k sampling. It restricts generation to tokens that the model is somewhat confident in.

Alternatively, we could sample only from the top-p probability mass. This is called top-p or nucleus sampling. It accounts for cases where a fixed k is undesirable: if most of the probability mass is on, say, only 3 tokens, we may want to sample from just those.

Another alternative is epsilon sampling, where we only sample tokens with some minimum probability. This ensures that we only sample from tokens where the model is somewhat confident.

Another strategy is to modify the "peakiness" of the distribution by controlling its temperature. This is done by modifying the scaling factor for the final softmax layer. This allows the user to put more weight on the top results (temperature < 1.0) for factual answers, or spread the weight out more by increasing the temperature for, say, story generation.
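A minimal sketch of these sampling tweaks as filters over a next-token distribution. The toy distribution and function names are made up; after filtering, one would sample from the renormalized result (e.g. with `random.choices`):

```python
import math


def renormalize(probs):
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}


def top_k(probs, k):
    """Keep the k most probable tokens."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    return renormalize({token: probs[token] for token in kept})


def top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative mass reaches p."""
    kept, mass = {}, 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept[token] = probs[token]
        mass += probs[token]
        if mass >= p:
            break
    return renormalize(kept)


def epsilon_filter(probs, eps):
    """Keep only tokens with probability of at least eps."""
    return renormalize({t: p for t, p in probs.items() if p >= eps})


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = {t: z / temperature for t, z in logits.items()}
    top = max(scaled.values())
    exps = {t: math.exp(z - top) for t, z in scaled.items()}
    return renormalize(exps)


probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "xylophone": 0.05}
print(top_k(probs, 2))
print(top_p(probs, 0.9))
print(epsilon_filter(probs, 0.1))
logits = {t: math.log(p) for t, p in probs.items()}
print(apply_temperature(logits, 0.5))  # sharper than the original
print(apply_temperature(logits, 2.0))  # flatter than the original
```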

Contrastive Decoding is a newer idea: we use a smaller model to improve the output of a larger model. Instead of just decoding from the larger (expert) model's distribution, we contrastively decode by choosing outputs that the expert model considers much more likely than the smaller (amateur) model does. The intuition is that both the big and small models are degenerate in similar ways (e.g. they keep repeating themselves), but the expert has knowledge which the smaller model does not have. Hence taking the difference $\log (p_{expert}) - \log (p_{amateur})$ helps to eliminate the degenerate cases and produce better results.
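A minimal sketch of the scoring step, using made-up next-token distributions for the two models. Practical versions typically also restrict candidates to tokens the expert itself finds plausible; that is omitted here:

```python
import math


def contrastive_choice(p_expert, p_amateur):
    """Pick the token maximizing log p_expert - log p_amateur."""
    scores = {
        token: math.log(p_expert[token]) - math.log(p_amateur[token])
        for token in p_expert
    }
    return max(scores, key=scores.get), scores


# Made-up distributions; both models over-weight the generic token "the".
p_expert = {"the": 0.50, "Paris": 0.40, "banana": 0.10}
p_amateur = {"the": 0.60, "Paris": 0.10, "banana": 0.30}
token, scores = contrastive_choice(p_expert, p_amateur)
print(token)  # Paris
```

Greedy decoding from the expert alone would pick "the"; the contrast favours the token where the expert knows something the amateur does not.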

Mode-Seeking Decoding Methods

Instead of sampling, another approach is to try to maximize the probability of generation, i.e. mode-seeking.

$\hat{Y} = \arg\max_{Y} P (Y ∣ X)$

Greedy decoding chooses the most likely token at each step. $y_{j} = \arg\max_{y^{'}} P (y^{'} ∣ X, y_{1}, ..., y_{j - 1})$

However, greedy decoding does not guarantee finding the most likely sequence. e.g. the is often the most likely word, but choosing the would exclude many other sentences that may be more likely. Hence Beam Search, which ensures that we don't miss a high-probability sequence "hidden" behind a lower-probability prefix. This is a form of breadth-first search, where we maintain a few options at any point in the search.

First, we explore the top 3 next tokens at time step 1
Next, we explore the top 3 next tokens at time step 2 from each branch, leading to 9 options
Then, we prune down to the top 3 paths from the 9 options and repeat the process for time step 3
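The steps above can be sketched with a toy next-token table (made up, and only covering two time steps). With a beam of 1 this reduces to greedy decoding, which misses the better sequence hidden behind the less likely first token:

```python
import math

# Toy next-token table (made up): maps a prefix to next-token probabilities.
MODEL = {
    (): {"the": 0.5, "a": 0.4, "cats": 0.1},
    ("the",): {"cat": 0.3, "dog": 0.3, "<eos>": 0.4},
    ("a",): {"cat": 0.9, "dog": 0.05, "<eos>": 0.05},
    ("cats",): {"cat": 0.1, "dog": 0.1, "<eos>": 0.8},
}


def beam_search(model, beam_size, steps):
    """Keep the `beam_size` best prefixes by total log probability."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            for token, p in model[prefix].items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Prune: sort all expansions and keep only the best `beam_size`.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


greedy = beam_search(MODEL, beam_size=1, steps=2)[0]
best = beam_search(MODEL, beam_size=3, steps=2)[0]
print(greedy[0], round(math.exp(greedy[1]), 3))  # ('the', '<eos>') 0.2
print(best[0], round(math.exp(best[1]), 3))      # ('a', 'cat') 0.36
```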

In practice, beam search often results in a very non-diverse set, i.e. sentences that are very similar to each other. Hence we may want to introduce diversity into the process. Diverse beam search modifies the scoring when pruning beams to avoid choosing overly similar beams. The similarity score can be as simple as word jaccard similarity between pairs. Stochastic beam search modifies the next token selection to sampling instead of using top greedy decodings.

Minimum Bayes Risk

The question is: Do we actually want the generations with the highest probability? In general, we do find that outputs with low probability tend to be worse than those with high probability. However, when we compare amongst the top outputs where probabilities are quite close, it becomes less clear. Say e.g. we performed beam search and found the top 3 sequences:

the cat sat down - 0.3
the cat ran away - 0.25
the cat sprinted off - 0.2

the cat sat down is the highest probability sequence, but the combined probability mass is higher for the idea that "the cat left the area". So the idea is that we might prefer generations that have high agreement with other sequences (high probability and low risk).

$\hat{y} = \arg\max_{y^{'} \in Y_{h}} \sum_{y \in Y_{e}} G (y, y^{'})$

In the equation above, $Y_{e}$ refers to a random sample from the model, say 100 samples. $Y_{h}$ refers to our hypothesis space, supposedly the top 10 outputs. The risk function $G$ measures the similarity of each candidate $y^{'}$ against the samples. For example, a risk function could be ROUGE score, which measures the n-gram overlap between two sequences. Generally, MBR is high performance but high cost - even if G is ROUGE-1 (i.e. unigram overlap), it significantly outperforms beam search or greedy search.
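A minimal sketch of MBR selection. The similarity function is a toy unigram overlap (Jaccard on word sets) standing in for ROUGE-1, and the samples are made up to mirror the cat example:

```python
def unigram_overlap(a, b):
    """Toy similarity: Jaccard overlap of the unique words in two sentences."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


def mbr_decode(hypotheses, samples, similarity):
    """Return the hypothesis with the highest total similarity to the samples."""
    return max(hypotheses, key=lambda h: sum(similarity(s, h) for s in samples))


# Made-up samples from the model; most of them say the cat left.
samples = [
    "the cat sat down",
    "the cat ran away",
    "the cat ran off",
    "the cat sprinted away",
    "the cat ran outside",
]
hypotheses = ["the cat sat down", "the cat ran away", "the cat sprinted off"]
print(mbr_decode(hypotheses, samples, unigram_overlap))  # the cat ran away
```

The cost is one similarity call per (hypothesis, sample) pair, which is where the "high cost" comes from.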

Other MBR variants: output ensembling. Post-ensemble (Kobayashi 2018) compares pairwise embedding similarity between outputs across models and chooses outputs with highest average similarity. Self-consistency (Wang 2023) prompts for an answer using chain of thought, samples multiple outputs, and extracts the answer from each sample (ignoring the explanations). The most frequently generated answer is then chosen.

Constrained Generation

Sometimes we want to impose some constraints on the outputs, e.g. we want the model to suggest some hobbies but we do not want climbing, or more properly, say we want to omit toxic texts. Options:

Ask the model to exclude climbing: often does not work
Logit Manipulation. Set the probability of the token(s) corresponding to climbing to 0 (i.e. set their logits to $- \infty$ ). This often fails because there may be many synonyms or ways to express the same thing, and it's impossible to enumerate them all
Sample and Discard. We set up a new discriminator model which predicts whether a sequence corresponds to the idea of climbing or not. This is an easier task and often people will initialize the model from the original language model and train it to predict this idea from a small set of fine-tuning data.

We can then get the generative model to generate a few samples and keep only samples where the predicted probability of the idea to avoid is low. Another variant of this is FUDGE (Yang and Klein 2021), where we multiply the generative probability of the next token $p (y_{j})$ by the discriminator probability that the new sequence will belong to the idea that we desire (e.g. formality). The chosen token will be that which maximizes the combined score.
RLHF. The alignment fine-tuning using RLHF may be viewed as a way to do constrained generation. An interesting paper that discusses RLHF as Bayesian inference is Korbak 2022.
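A minimal sketch contrasting logit manipulation with a FUDGE-style choice for a single decoding step. The logits and discriminator scores are made up, and a real discriminator would be a trained model scoring the prefix plus each candidate token:

```python
import math


def softmax(logits):
    top = max(logits.values())
    exps = {t: math.exp(z - top) for t, z in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def ban_tokens(logits, banned):
    """Logit manipulation: banned tokens get -inf, i.e. probability 0."""
    return {t: (float("-inf") if t in banned else z) for t, z in logits.items()}


def fudge_choice(p_generator, p_attribute):
    """FUDGE-style choice: maximize p(token) * p(attribute | prefix + token)."""
    return max(p_generator, key=lambda t: p_generator[t] * p_attribute[t])


# Made-up logits for the next hobby suggestion.
logits = {"climbing": 2.0, "bouldering": 1.5, "chess": 1.0, "reading": 0.5}
probs = softmax(ban_tokens(logits, {"climbing"}))
print(probs["climbing"], max(probs, key=probs.get))  # 0.0 bouldering

# Made-up discriminator scores: probability the text avoids the climbing idea.
p_avoid = {"climbing": 0.01, "bouldering": 0.05, "chess": 0.95, "reading": 0.95}
print(fudge_choice(softmax(logits), p_avoid))  # chess
```

Banning one token just promotes its near-synonym, while the discriminator-weighted choice steers away from the whole idea.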

Instead of fine-tuning, one way is to do reward-augmented decoding (Deng and Raffel 2023). The idea is to have a reward model that modifies the generative probabilities based on rewards. (?)

Human In the Loop Decoding

Some strategies to incorporate human intervention in the generation:

Interleaved text. Model generates some text, human continues, then back to the model etc.
Fine-grained replacement. Human selects some part of the generated text, and twiddles some knobs (e.g. "more descriptive" etc.)
Choosing outputs. Model generates a few options, and human chooses one.

We could also use a model in the loop. One idea is Tree of thought prompting, which is somewhat analogous to beam search. We have the model generate a few sentences at a time, with a few samples. An external model then judges and chooses the best branches to continue generation.

Practical Considerations

To increase decoding speed, one method is speculative decoding. We generally generate tokens with a small model, but when the small model is very uncertain, the small model will generate top-k samples and a large model will pick the next token. This can speed up generation significantly.

There are many libraries for fast decoding, e.g. vLLM, Outlines, disco etc. General takeaway is that a lot can be done at decoding time without needing to fine-tune the original model.

Lecture 7: Prompting Strategies

Basic Prompting. Append a textual string to the beginning of the sequence and let the model complete. Some models are trained as chatbots and require a specific prompting template, e.g. GPT, but in the backend it's simply formatted into a string and passed to the model. The important thing is that we need to follow the template that the model was trained on, otherwise performance could suffer.

Post-processing refers to formatting the returned output from the LLM. E.g. ChatGPT supports markdown rendering, so if you ask it to generate a table it will generate it and then render it as a markdown table. Another form of post-processing is output selection, i.e. we extract the part of the output that we are actually interested in. We could do so by extracting keywords (e.g. fantastic, terrible) for a sentiment analysis task.

An interesting phenomenon for predicting labels is that asking the model to predict positive or negative tends to do better than asking for 1-5 labels. The intuition behind this is to think about the data that the model was trained on. It is likely that the model has seen many more reviews with the words positive or excellent than with numeric labels, so it might do better with those types of labels.

Few-shot prompting (Brown 2021) basically injects a few examples of the task together with the instruction. One thing to take note of is that LLMs (especially smaller ones) are sensitive to small changes in the in-context examples:

Example ordering (Lu 2021)
Label balance (Zhang 2022)
Label coverage (Zhang 2022)

Effects of few-shot prompting are also sometimes counter-intuitive. For example, replacing correct labels with random labels sometimes barely hurts the accuracy of the task. This suggests that the few-shot prompts are more for getting the structure of the response correct rather than learning the desired labelling logic. Sometimes, more demonstrations can also hurt accuracy (this may be due to the longer context length confusing the model).

Chain of thought prompting (Wei 2022) basically tries to get the model to explain its reasoning before giving an answer. The original idea was to include reasoning steps in the few-shot prompts to get the model to do likewise, and the paper found that this significantly improved the accuracy of the model. One interpretation for why this works is that it provides the model with adaptive computation time to generate the correct answer, e.g. a simple question like 1+1= may be answered immediately but some complex logical question might require several steps, and the reasoning step allows the model to generate whatever it needs to get the answer right.

The next step is unsupervised chain of thought prompting (Kojima 2022), which basically found that we can get the same results by just appending let's think step by step to the prompt, without requiring the few-shot examples.
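To make the prompt formats concrete, here is a rough sketch of how a few-shot prompt and a zero-shot chain of thought prompt could be assembled as plain strings. The `Review:` / `Sentiment:` template and both helper names are assumptions for illustration; in practice the template should match whatever the model was trained on.

```python
def few_shot_prompt(instruction, examples, query):
    """Build an instruction followed by labelled demonstrations and the query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # the model completes from here
    return "\n".join(lines)


def zero_shot_cot_prompt(question):
    """Append the step-by-step trigger phrase instead of few-shot examples."""
    return f"Q: {question}\nA: Let's think step by step."


demo = few_shot_prompt(
    "Classify the sentiment of each review.",
    [("Loved every minute.", "positive"), ("A waste of time.", "negative")],
    "The plot was predictable.",
)
print(demo)
print(zero_shot_cot_prompt("If I have 3 apples and eat 1, how many are left?"))
```

Since example ordering and label balance matter, keeping the demonstrations as a list makes it easy to shuffle or rebalance them.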

Another idea is that structuring outputs as computer programs can help (Madaan 2022). e.g. if we want the model to output a Directed Acyclic Graph, we could represent the output as a python graph class object. The reason that this works is perhaps because programs are highly structured and there is a lot of code in the pre-training data. Another useful method is to get the model to output JSON format, which it has also seen a lot of.

Another idea is to have program-aided language models (Gao 2022). This allows the LLM to call a code interpreter or calculator to compute answers. This works especially well for numeric questions.
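A toy sketch of the program-aided idea: the model writes a small program and the interpreter, not the model, computes the number. No model is called here; `generated_program` is a hard-coded stand-in for LLM output, and the convention that the result lives in a variable named `answer` is my assumption.

```python
# Stand-in for text returned by an LLM (hard-coded; no model is called).
generated_program = """
apples_start = 23
apples_used = 20
apples_bought = 6
answer = apples_start - apples_used + apples_bought
"""


def run_program(source: str):
    """Execute the generated code and read back the `answer` variable.

    Builtins are stripped as a light precaution only; running untrusted
    model output needs a proper sandbox.
    """
    namespace = {}
    exec(source, {"__builtins__": {}}, namespace)
    return namespace["answer"]


print(run_program(generated_program))  # 9
```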

Prompt Engineering

One thing to take note of is that the format should match that of the trained model. e.g. leaving out the space after the colon Passage:text can lead to severe performance degradation. Changing the casing to PASSAGE: text can also degrade performance.

We can also do automatic prompt generation. e.g. use another model to paraphrase our prompt and then select the best response out of the samples. Another approach is gradient-based, where we try out different prompt words and choose the best prompt words based on some kind of loss. These types of methods can result in highly non-human sequences that somehow produce the best results, but they can also be exploited to elicit harmful responses.

Another method along these lines is Prefix tuning (Li and Liang 2021), where they train an embedding prefix that is prepended to the activations of each transformer layer for the task of interest, while the model weights stay frozen. This is akin to LoRA methods, which train additional weights that can be attached to the model.

One way to view prompting is to view it as a human-interpretable prior to the model, which can be easier than fine-tuning.

Lecture 8: Fine-tuning and Instruction Tuning

The general framework is that language models are pre-trained on the self-supervised language modelling task on a very large corpus, and then fine-tuned on a downstream task. There are two paradigms for doing this:

Multi-task learning. The standard multi-task learning framework is to train the multiple tasks (language modelling and downstream task) simultaneously, e.g. by alternating mini batches or combining losses together.

In Dery 2021, the paper argues that learning jointly on the language modelling and end-task produces better results than if we pre-trained and then fine tuned. The intuition is that the model will be learning representations that are useful for both tasks.

Another paper from Anthropic also shows that incorporating safety training at the beginning out-performs pre-training first and then fine-tuning to incorporate safety.
Pre-train then fine-tune. However, because it is so expensive to perform the language modelling, usually pre-train and fine-tune is the actual paradigm that we follow.

Full fine-tuning means to simply continue training the language model on the training data that we have. This is also called supervised fine-tuning. This can be prohibitively expensive. Rajbhandari 2019 showed that training a 65B parameter model with 16-bit mixed precision without any optimizations requires around 1TB of GPU memory, which is clearly not feasible.

The simplest solution is to scale out horizontally across multiple GPUs. DeepSpeed ZeRO (Rajbhandari 2019) is a popular framework for doing so:

Stage 1: partitioning the optimizer state does not hurt optimization speed much. In the paper's example (a 7.5B parameter model across 64 devices), this brings down memory per device from 120GB to 31GB.
Stage 2: in addition, it partitions the gradients. This brings memory further down to 16GB.
Stage 3: in addition, it partitions the parameters. This brings memory down to 1.9GB but severely impacts computation speed.

An alternative strategy is to only fine tune a smaller set of parameters. Adapters (Houlsby 2019) is one way to do this. The idea is to add an adapter block after each attention block. The adapter block down-projects the embedding dimensionality to something small like 16, passes it through a non-linearity, then up-projects back to the original dimension. Each adapter block only uses 2 x model_dim x adapter_dim parameters.
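A small NumPy sketch of an adapter block, to make the down-project / non-linearity / up-project shape concrete. The ReLU, the residual connection, the zero-initialised up-projection and the omission of biases are assumptions I chose for the sketch rather than details taken from the notes above.

```python
import numpy as np


def adapter_forward(h, w_down, w_up):
    """Down-project, apply a non-linearity, up-project, add the residual."""
    z = np.maximum(h @ w_down, 0.0)  # ReLU in the small adapter dimension
    return h + z @ w_up              # residual keeps the original signal


rng = np.random.default_rng(0)
model_dim, adapter_dim = 512, 16
w_down = rng.normal(scale=0.02, size=(model_dim, adapter_dim))
w_up = np.zeros((adapter_dim, model_dim))  # zero init: block starts as identity

h = rng.normal(size=(4, model_dim))        # 4 token embeddings
out = adapter_forward(h, w_down, w_up)

n_params = w_down.size + w_up.size         # 2 x model_dim x adapter_dim
print(out.shape, n_params)
```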

There are generally two benefits to parameter-efficient fine-tuning methods:

They are much more memory efficient. This is because we only need to back-propagate gradients on nodes which are on the computation path between the adapter blocks and the final loss function.
They are more robust to over-fitting on the small set of fine-tuning data

An extension to Adapters is Adapter Fusion (Pfeiffer 2020). The idea is that instead of a single adapter after each attention block, we have multiple adapters, each trained on a different task. We then add an AdapterFusion block, which is basically a multi-head attention over each adapter block, so that it can learn to choose which adapters to use automatically.

LoRA (Hu 2021) is very similar conceptually to adapters, with the important difference that it does not have any non-linearity. The idea is that we express the fine-tuned weights as follows (reference: Cameron Wolfe's blog post): $W_{ft} = W_{pt} + Δ W$

The goal is to learn $Δ W$ with a low rank adaptation, so that it is parameter-efficient. Suppose for simplicity that $W_{ft} \in R^{d \times d}$ . We may approximate $Δ W := A \cdot B$ , where $A \in R^{d \times r}, B \in R^{r \times d}$ . We can then simply freeze $W_{pt}$ and modify the forward pass to become $W_{pt} + A \cdot B$ , and fine-tune the parameters for $A$ and $B$ . $r$ can be as small as 8 or 2, leading to a very parameter-efficient fine-tuning method. $A$ is initialized with small random values, whilst $B$ is initialized as zero, ensuring that we begin the finetuning process with the model's original outputs.

The efficiency of LoRA compared to full fine-tuning is significant. For d=512, r=8, LoRA trains only around 3% of the parameters of the full weight matrix: $\frac{2 d r}{d ^{2}} = \frac{2 r}{d}$
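A NumPy sketch of the LoRA forward pass and the parameter ratio, using the same $A \in R^{d \times r}$, $B \in R^{r \times d}$ convention as above. The function name, the 0.01 initialisation scale and the random stand-ins for the "trained" factors are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8

W_pt = rng.normal(size=(d, d))           # frozen pre-trained weights
A = rng.normal(scale=0.01, size=(d, r))  # small random init
B = np.zeros((r, d))                     # zero init, so A @ B starts at 0


def lora_forward(x, W_pt, A, B):
    """Compute x (W_pt + A B) without materialising the d x d update."""
    return x @ W_pt + (x @ A) @ B


x = rng.normal(size=(4, d))
# At initialisation the output matches the pre-trained model exactly.
starts_identical = np.allclose(lora_forward(x, W_pt, A, B), x @ W_pt)

# Pretend training happened, then merge the update into a single matrix.
A_trained = rng.normal(scale=0.01, size=(d, r))
B_trained = rng.normal(scale=0.01, size=(r, d))
W_ft = W_pt + A_trained @ B_trained
merged_matches = np.allclose(x @ W_ft, lora_forward(x, W_pt, A_trained, B_trained))

ratio = (A.size + B.size) / W_pt.size    # 2dr / d^2 = 2r / d
print(starts_identical, merged_matches, ratio)
```

The merge step is why inference needs no extra layers: after training, only the single matrix `W_ft` is used.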

The reason LoRA has caught on is two-fold:

It does not require modifying the original model structure - we simply modify the state_dict of the original model by adding the $Δ W$
Adapters incur additional inference latency due to the additional layers. LoRA has no additional latency once $Δ W$ has been merged into the weights.

Lecture 8: Reinforcement Learning and Human Feedback

Fall 2024 - Reinforcement Learning and Human Feedback

So far we have optimized the model using maximum likelihood of the next token. This paradigm has some problems:

Problem #1: Some mistakes are worse than others, and we want to penalize more egregious mistakes more
Problem #2: The target labels in MLE can be bad (e.g. toxic internet stuff, disinformation)
Problem #3: Exposure bias. At training time, the prefix sequence is guaranteed to be good / normal. At generation, the prefix sequence can potentially become strange.
- One example is repeating words. In normal language, when a word is repeated twice, it is more likely that the word repeats again. Hence language models learn this behaviour.

So how do we measure how "good" an output is?

Objective assessment. Have an annotated correct answer and match against it. This approach is good for math problems or problems with objective answers.
- GSM8K (Cobbe 2021) is a famous math problem dataset
Human evaluation.

Database Course

Based on CMU 15-445/645 Intro to Database Systems course taught by Andy Pavlo and Jignesh Patel in Fall 2023. The youtube playlist.

Lecture 1

Course about how to design and implement a database management system. Textbook: Database System Concepts by Silberschatz, Korth and Sudarshan.

Agenda:

Database Systems Background
Relational Model
Relational Algebra
Alternative Data Models

A database is an organized collection of inter-related data that models some aspect of the real world. Databases are the core component of most computer applications. e.g. an excel spreadsheet is a database. SQLite is the most widely deployed database as it is used in every cellphone.

Flat File Strawman

Store our database as a csv file that we manage. e.g. each line corresponds to an artist, year, country etc. Problems:

Super slow to find the artist of interest as we use a for-loop to scan every line
Super slow to update or delete an artist
Data types are not stored on the csv file, we need to know which is an integer etc.
Concurrent writes to the file are not supported
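A sketch of the flat file strawman, to show where the pain comes from. The sample rows and the `find_year` helper are made up for illustration.

```python
import csv
import io

# Made-up sample data standing in for the csv file on disk.
ARTISTS_CSV = """name,year,country
Wu-Tang Clan,1992,USA
Notorious BIG,1992,USA
GZA,1990,USA
"""


def find_year(csv_text, artist):
    """Linear scan: every lookup reads the file line by line."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["name"] == artist:
            # The file stores only text, so the caller must know that
            # `year` is an integer and convert it by hand.
            return int(row["year"])
    return None


print(find_year(ARTISTS_CSV, "GZA"))
```

Every lookup is O(n), the type conversion lives in application code, and nothing stops two writers from clobbering the file at once.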

A Database Management System (DBMS) is software that allows applications to store and analyze information in a database. A general purpose DBMS supports the definition, creation, querying, update and administration of databases in accordance with some data model. Usually the first choice is postgres or sqlite.

A data model is a collection of concepts for describing the data in a database. A schema is a description of a particular collection of data, using a given data model. Examples of data models:

Relational
Key / Value
Graph
Document / XML / Object
Wide-Column / Column-family
Array / Matrix / Vectors
Hierarchical
Network
Multi-Value

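Numbering the list above from 1 to 9: item 1 is the most common. Items 2-5 are considered NoSQL models (a loose term). Item 6 is mostly used for machine learning and scientific data. Items 7-9 are obsolete.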

Relational Model

Early database applications were difficult to write. Every time the database schema or layout changed, IBM's developers would need to rewrite database programs. Ted Codd devised the relational model to address this problem. The relational model defines a database abstraction based on relations to reduce maintenance overhead. Key tenets:

Store database in simple data structures (relations)
Physical storage left up to the DBMS implementation
Access data through high-level language, DBMS figures out the best execution model

A relation is an unordered set that contains the relationship of attributes that represent entities. An n-ary relation is equivalent to a table with n columns. A tuple is a set of attribute values (also known as its domain) in the relation. The special value NULL is a member of every domain.

A relation's primary key uniquely identifies a single tuple. Some DBMSs automatically create an internal primary key if a table does not define one. Primary key is a constraint that the DBMS will enforce to ensure no duplicates exist. A foreign key specifies that an attribute from one relation maps to a tuple in another relation. E.g. If we have an artist table with the artist id, and an album table with an artist column, the artist column is a foreign key.

We can impose constraints on the database that must hold for any tuple. The DBMS will then prevent any modification that could violate those constraints. Unique and foreign key constraints are the most common. User-defined global constraints can be declared with e.g. CREATE ASSERTION in SQL.

Data Manipulation Languages (DML)

There are two broad methods to store and retrieve information from a database:

Procedural: the query specifies the high level strategy to find the desired result based on sets. This uses relational algebra.
Non-Procedural (Declarative): The query specifies only what data is wanted and not how to find it. This uses relational calculus.

Relational Algebra

Fundamental operations to retrieve and manipulate tuples in a relation. Based on set algebra (unordered lists with no duplicates). Each operator takes one or more relations as its inputs and outputs a new relation. We can thus chain operators together to create more complex operations. The operations are:

SELECT. Choose a subset of the tuples from a relation that satisfies a selection predicate (filter). Predicates act as filters to retain only tuples that fulfill the qualifying requirement. We can combine multiple predicates using conjunctions / disjunctions.
- Syntax: $σ_{predicate} (R)$
- Example: SELECT * FROM R WHERE id = 'a'
PROJECTION. Generate a relation with tuples that contains only the specified attributes. E.g. re-arrange ordering, manipulate values ( $+, -$ etc.) and remove unwanted attributes.
- Syntax: $π_{A 1, A 2, ...} (R)$
- Example: SELECT b_id - 100 FROM R
UNION. Generate a relation that contains all tuples that appear in one or both input relations. Note that R and S must have the same schema.
- Syntax: $(R ⋃ S)$
- Example: (SELECT * from R) UNION (SELECT * from S)
INTERSECTION. Generate a relation that contains only the tuples that appear in both of the input relations.
- Syntax: $(R ⋂ S)$
- Example: (SELECT * from R) INTERSECT (SELECT * from S)
DIFFERENCE. Generate a relation that contains only the tuples that appear in the first and not the second of the input relations.
- Syntax: $(R - S)$
- Example: (SELECT * from R) EXCEPT (SELECT * from S)
JOIN. Generate a relation that contains all tuples that are a combination of two tuples (one from each input relation) with a common value for one or more attributes.
- Syntax: $(R \bowtie S)$
- Example: SELECT * FROM R NATURAL JOIN S;
Rename $ρ$
Assignment $R \leftarrow S$
Duplicate Elimination $δ$
Aggregation $γ$
Sorting $τ$
Division $(R \div S)$

Relational algebra defines an ordering of the high level steps of how to compute a query. E.g.

$σ_{b_{id} = 102} (R \bowtie S)$ vs $R \bowtie (σ_{b_{id} = 102} (S))$ . The former will do a huge join before filtering, whereas the latter filters first before joining, which is much better.
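A toy Python sketch of the two orderings, treating relations as lists of dicts. The `select` and `natural_join` helpers and the sample tuples are assumptions for illustration. Both plans return the same tuples, but the filter-first plan feeds fewer tuples into the join.

```python
R = [{"a_id": "a1", "b_id": 101}, {"a_id": "a2", "b_id": 102}, {"a_id": "a3", "b_id": 103}]
S = [{"b_id": 101, "val": "x"}, {"b_id": 102, "val": "y"}, {"b_id": 102, "val": "z"}]


def select(rel, pred):
    """Selection: keep tuples that satisfy the predicate."""
    return [t for t in rel if pred(t)]


def natural_join(r, s):
    """Natural join: combine tuples that agree on all shared attributes."""
    out = []
    for t in r:
        for u in s:
            if all(t[k] == u[k] for k in t.keys() & u.keys()):
                out.append({**t, **u})
    return out


def as_set(rel):
    """Order-insensitive view of a relation, for comparing results."""
    return {tuple(sorted(t.items())) for t in rel}


def is_102(t):
    return t["b_id"] == 102


joined = natural_join(R, S)                 # full join first
plan_a = select(joined, is_102)             # then filter

filtered_s = select(S, is_102)              # filter first
plan_b = natural_join(R, filtered_s)        # then join on fewer tuples

print(len(joined), len(filtered_s), as_set(plan_a) == as_set(plan_b))
```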

Instead of specifying the exact operations, a DBMS allows us to state the high level answer we want, and the DBMS decides on the exact operations and the ordering to perform. This abstracts away the need for the user to know which operations are more efficient. Note that the relational model is independent of the query language implementation, although SQL is the de-facto standard (with many variants).

Document Data Model

A collection of record documents containing a hierarchy of named field / value pairs. A field's value can be a scalar type, array, or another document. Modern implementations use JSON. The main reason for this model is to avoid the object-relational impedance mismatch, i.e. relational databases store data in rows with relationships between tables, but in object oriented languages like Python data is stored in objects with nested attributes, which could result in inefficient queries when we try to map between the two. In contrast, Document Databases store data in a nested json which closely resembles the object-oriented approach, making it easier to work with. The down side is that we could end up storing a lot of duplicate data in the json objects.

Examples: MongoDB, RavenDB, DynamoDB etc.

Vector Data Model

One-dimensional arrays used for nearest neighbour search, used for semantic search on embeddings generated by transformer models. Native integration with modern ML tools etc. At their core, these are just systems that use specialized indexes (e.g. Meta FAISS) to perform NN search quickly.

Examples: Pinecone, Weaviate, Milvus etc.

Lecture 2: Modern SQL

In 1971, the first relational query language, called SQUARE, was created. In 1972, SEQUEL (Structured English Query Language) was created. SQL was added to the ANSI standard in 1986. The current standard is SQL:2023.

SQL:2023 - property graph queries, multi-dim arrays
SQL:2016 - JSON, polymorphic tables etc.

Relational languages:

Data Manipulation Language (DML)
Data Definition Language (DDL)
Data Control Language (DCL)

Important: SQL is based on bags (with duplicates) not sets (no duplicates).

We should try to do everything on the database, in one big query.

Example database:

student table: sid, login, gpa etc.
enrolled table: sid, cid (course id), grade
course table: cid, name

Aggregates:

AVG(col). e.g. SELECT AVG(s.gpa) FROM student as s
MIN(col)
MAX(col)
COUNT(col). e.g. SELECT COUNT(login) AS cnt FROM student WHERE login LIKE '%@cs'. Equivalently, COUNT(1). Counts the number of rows whose login matches the pattern.

Groupby: Get average gpa by course id.

SELECT AVG(s.gpa), e.cid
    FROM enrolled as e JOIN student AS s
    ON e.sid = s.sid
GROUP BY e.cid
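To try the aggregate and group-by queries without setting up a server, here is a sketch using Python's built-in sqlite3 module with an in-memory database. The rows are made up, and only the columns needed for the queries are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (sid INTEGER PRIMARY KEY, login TEXT, gpa REAL);
CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT);
INSERT INTO student VALUES (1, 'alice@cs', 4.0), (2, 'bob@cs', 3.0), (3, 'carol@ece', 3.5);
INSERT INTO enrolled VALUES (1, '15-445', 'A'), (2, '15-445', 'B'), (3, '15-721', 'A');
""")

# COUNT with a LIKE filter: logins ending in '@cs'.
cnt = conn.execute(
    "SELECT COUNT(login) FROM student WHERE login LIKE '%@cs'"
).fetchone()[0]

# Average gpa per course id; ORDER BY added so the output order is stable.
avg_by_course = conn.execute("""
    SELECT e.cid, AVG(s.gpa)
    FROM enrolled AS e JOIN student AS s ON e.sid = s.sid
    GROUP BY e.cid
    ORDER BY e.cid
""").fetchall()

print(cnt, avg_by_course)
```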

String operations.

LIKE is used for string matching. % matches any substring, including empty strings. _ matches any one character.
SQL-92 defines string functions.

Window functions. Perform a sliding calculation across a set of tuples that are related. Like an aggregation but tuples are not grouped into a single output tuple.

SELECT ... FUNC-NAME(...) OVER (...)
    FROM TABLE_NAME

e.g. Get row number per course id.

SELECT cid, sid,
    ROW_NUMBER() OVER (PARTITION BY cid)
    FROM enrolled
ORDER BY cid
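The row-number query can be run against an in-memory SQLite database as a quick check (window functions need SQLite 3.25 or newer). The rows are made up, and I added ORDER BY sid inside the OVER clause so that the numbering is deterministic; that is my addition, not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrolled (sid INTEGER, cid TEXT, grade TEXT);
INSERT INTO enrolled VALUES (1, '15-445', 'A'), (2, '15-445', 'B'), (3, '15-721', 'A');
""")

# Each tuple keeps its own output row; the numbering restarts per course.
rows = conn.execute("""
    SELECT cid, sid,
        ROW_NUMBER() OVER (PARTITION BY cid ORDER BY sid) AS row_num
    FROM enrolled
    ORDER BY cid, sid
""").fetchall()

for row in rows:
    print(row)
```

Unlike GROUP BY, the three input tuples produce three output tuples.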

Nested queries. Invoke a query inside of another query to compose more complex computations. These are often difficult for the DBMS to optimize. e.g. This one below is a join written as a nested query.

outer query ->    SELECT name FROM student WHERE
                    sid IN (SELECT sid FROM enrolled) <- inner query

e.g. Get the names of students in '15-445':

SELECT name FROM student
    WHERE sid IN (
        SELECT sid FROM enrolled
        WHERE cid = '15-445'
    )

Lateral joins. LATERAL operator allows us to reference tables preceding the expression. e.g. below, the second expression can reference t1.

SELECT * FROM (SELECT 1 AS X) AS t1,
    LATERAL (SELECT t1.x+1 AS y) AS t2;

Common Table Expressions. Provides a way to write auxiliary statements for use in a larger query, i.e. a temp table just for one query.

WITH cteName (col1, col2) as (
    SELECT 1, 2
)
SELECT col1 + col2 FROM cteName

Demonstration of CTEs: Find student record with the highest id that is enrolled in at least one course. We use the maxId in the temporary table below.

WITH cteSource (maxId) AS (
    SELECT MAX(sid) FROM enrolled
)
SELECT name FROM student, cteSource
    WHERE student.sid = cteSource.maxId

We can also use recursion with CTE. Print the sequence of numbers from 1 to 10.

WITH RECURSIVE cteSource (counter) AS (
    (SELECT 1)
    UNION
    (SELECT counter + 1 FROM cteSource
    WHERE counter < 10)
)
SELECT * FROM cteSource
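A quick check of the recursive CTE using Python's sqlite3 module. One dialect difference to note: SQLite does not accept parentheses around the individual SELECTs of a compound query, so they are dropped in this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Base case emits 1; the recursive case keeps adding 1 until counter hits 10.
rows = conn.execute("""
    WITH RECURSIVE cteSource (counter) AS (
        SELECT 1
        UNION
        SELECT counter + 1 FROM cteSource
        WHERE counter < 10
    )
    SELECT counter FROM cteSource
""").fetchall()

numbers = sorted(r[0] for r in rows)
print(numbers)
```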

Lecture 3: Database Storage 1

Class will shift gears towards how to implement a DBMS. Focus is on a single node machine first, then move to concurrency. Today's focus is on database storage.

Focus on a disk-oriented architecture, i.e. the DBMS assumes that the primary storage location of the database is on non-volatile disk. The DBMS's components manage the movement of data between volatile (e.g. RAM) and non-volatile storage.

Storage hierarchy. Higher rank is faster, smaller, expensive.

CPU Registers
CPU Caches
DRAM (i.e. memory)
SSD
HDD
Network Storage

Items 1-3 are volatile, i.e. random access, byte-addressable (we can fetch specific bytes). Items 4-6 are non-volatile, i.e. sequential access, block-addressable. For non-volatile storage, we need to fetch data one block at a time, and accessing data in one contiguous block sequentially is much faster than trying to access blocks scattered around. CPU refers to items 1 and 2, Memory refers to item 3, Disk refers to items 4-6.

Some newer storage categories: Fast network storage is in between HDD and SSD. Persistent Memory is between SSD and DRAM.

DBMSs need to exploit the sequential access on non-volatile memory to maximize speed of retrieval. Hence how we store data contiguously matters in the design of the DBMS.

Access times (latency, storage type, and the same latency rescaled so that 1 ns becomes 1 sec):

- 1 ns          L1 Cache Ref        1 sec
- 4 ns          L2 Cache Ref        4 sec
- 100 ns        DRAM                100 sec
- 16,000 ns     SSD                 4.4 hours
- 2,000,000 ns  HDD                 3.3 weeks
- 50,000,000 ns Network Storage     1.5 years
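The right-hand column can be reproduced with a few lines of arithmetic. This sketch assumes the rescaling is simply "treat each nanosecond as one second"; the `humanize` helper is hypothetical.

```python
LATENCY_NS = {
    "L1 Cache Ref": 1,
    "L2 Cache Ref": 4,
    "DRAM": 100,
    "SSD": 16_000,
    "HDD": 2_000_000,
    "Network Storage": 50_000_000,
}

UNITS = (("years", 365 * 24 * 3600), ("weeks", 7 * 24 * 3600), ("hours", 3600))


def humanize(ns):
    """Pretend 1 ns takes 1 second and pick the largest unit that fits."""
    seconds = float(ns)
    for unit, size in UNITS:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} sec"


for name, ns in LATENCY_NS.items():
    print(f"{name:16s} {ns:>12,} ns  ~ {humanize(ns)}")
```

By this calculation network storage comes out closer to 1.6 years, so the 1.5 years in the table looks rounded down.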

System Design Goals

The overall goal is to make it appear like we have more memory than we do:

We want to allow the DBMS to manage databases that exceed the amount of memory available.
Reading/ writing to disk is expensive, so it must be managed carefully to avoid large stalls.
Random access on disk is much slower, so DBMS wants to maximize sequential access.

Disk oriented DBMS

On disk, we have database files stored in pages. Each page is a block of data. When an execution engine makes a query, it loads a page into buffer pool in memory. The buffer pool returns a 64-bit pointer to the memory location where the page is located. The execution engine interprets the layout of the page, updates it. The new page is then written back into the database file on disk.

Why not use the OS?

The DBMS could potentially use memory mapping mmap to store the contents of a file (or page) in the address space of a program. The benefit is that we can tap on the OS to decide which pages to load into physical memory and when to swap pages out. The downside is that the lack of control over memory management can lead to stalling and bad performance.

How the OS works:

Suppose we have Page 1, ..., Page 4 on disk. There's only enough space in physical memory to load 2 pages.
The OS represents this to the program as virtual memory, where Page 1, ..., Page 4 are available
When the program touches Page 2, the OS will load Page 2 into physical memory and return a pointer to the program
Suppose the program now touches Page 3 and Page 4
The OS needs to decide which page to evict from physical memory
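The steps above can be sketched as a tiny page cache with room for 2 pages. The eviction policy here is least-recently-used, which is my assumption for the sketch; the notes do not say which policy the OS uses, and `TinyPageCache` is a hypothetical class.

```python
from collections import OrderedDict


class TinyPageCache:
    """Holds at most `capacity` pages; evicts the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page_id -> contents, oldest first

    def touch(self, page_id):
        """Access a page; return the id of the evicted page, if any."""
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # hit: mark as most recent
            return None
        evicted = None
        if len(self.pages) >= self.capacity:
            evicted, _ = self.pages.popitem(last=False)  # drop the oldest
        self.pages[page_id] = f"contents of page {page_id}"  # "load from disk"
        return evicted


cache = TinyPageCache(capacity=2)
events = [cache.touch(p) for p in (2, 3, 4)]
print(events, list(cache.pages))
```

Touching Page 4 forces Page 2 out. A DBMS that knows Page 2 is about to be needed again would rather make that choice itself.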

So the problems with using mmap to handle I/O for the DBMS:

Transaction safety. The OS can flush dirty pages to disk at any time, even mid-write. We can get corrupted data on disk.
I/O stalls. The DBMS does not know which pages are in memory, so the OS could stall due to page faults (fetching a page not in memory).
Error handling. Difficult to validate pages. Any access can cause a SIGBUS that the DBMS needs to handle.
Performance issues. The OS has its own scheduling and data structures, and can contend with the DBMS's priorities.

There are some systems that use mmap:

RavenDB
ElasticSearch
QuestDB
MongoDB (moved away from mmap)

So DBMS almost always wants to control things itself:

Flushing dirty pages to disk in the correct order
Specialized pre-fetching
Buffer replacement policy
Thread / process scheduling

Paper on this: Are you sure you want to Use MMAP in your DBMS?

For database storage, we need to deal with two problems:

How the DBMS represents the database in files on disk
How the DBMS manages its memory and moves data back and forth from disk

File Storage

The DBMS stores a database as one or more files on disk typically in a proprietary format. The OS doesn't know anything about the contents of these files. There are portable file formats (e.g. parquet etc.). Early systems in 1980s further used custom filesystems on raw block storage. Most newer DBMSs do not do this.

The storage manager is responsible for maintaining a database's files. It organizes the files as a collection of pages, and tracks the data read / written to pages, and tracks the available space in each page. A DBMS does not typically maintain multiple copies of a page on disk. This happens above or below the storage manager.

A page is a fixed-size block of data. It can contain tuples, meta-data, indexes, logs etc. Most systems do not mix page types. Some systems require a page to be self-contained, i.e. all the information needed to understand the data in the page is contained within the page itself (e.g. Oracle). Each page is given a unique identifier. The DBMS uses an indirection layer to map page IDs to physical locations (e.g. in memory or in S3 or whatever).

There are 3 different notions of what a page is:

Hardware page (typically 4KB). A hardware page is the largest block of data for which the storage device can guarantee failsafe writes (i.e. if it says it wrote 4KB, it actually happened).
OS Page (usually 4KB, but can be much larger)
Database Page (512B to 32KB)

Why might a larger page size be a good idea? Sequential data access. If we request a page of 16KB, it is one call to the OS and data is stored sequentially. If we request 4 pages of 4KB each, it is potentially random access. But the downside is that writing to a larger page is slower than writing to a small one.

Page Storage Architecture

How do DBMSs map page IDs to physical location? Different approaches:

Heap file organization <- most common
Tree file organization
Sequential / Sorted File Organization (ISAM)
Hashing file organization

A heap file is an unordered collection of pages with tuples that are stored in random order.

RL Course

Lecture notes based on David Silver's Reinforcement Learning course.

Textbooks:

An Introduction to Reinforcement Learning - Sutton and Barto
Algorithms for Reinforcement Learning, Szepesvari

Lecture 1: Introduction

Reinforcement Learning is at the centre of many disciplines such as engineering, psychology, economics, neuroscience because it deals with the science of decision making. E.g. in neuroscience, one major part is the dopamine neurotransmitter which resembles RL. Mathematics - operations research. Economics - bounded rationality.

What makes RL different from other ML paradigms?

There is no supervisor, only a reward signal. There is no clear correct action to take, only a reward
The feedback is delayed, not instantaneous
Time really matters in RL (non iid data)
Agent's actions affect the subsequent data it receives

Examples:

Fly stunt manoeuvres in a helicopter
Play backgammon
Manage an investment portfolio
Make a humanoid robot walk

Rewards. A reward $R_{t}$ is a scalar feedback signal. It indicates how well the agent is doing at step $t$ . The agent's job is to maximize cumulative reward. RL is based on the reward hypothesis.

Reward Hypothesis. All goals can be described by the maximization of expected cumulative reward.

Is a scalar feedback sufficient? David's argument is that in order to pick one action, we must always be able to compare and rank two actions. Hence we must have a way of ordering actions and it boils down to a scalar score.

Examples of rewards:

Fly stunt manoeuvres in a helicopter:
- +ve reward for following desired trajectory
- -ve reward for crashing
Backgammon:
- +ve / -ve reward for winning / losing a game
Manage investment portfolio
- +ve reward for each $ in bank

Sequential decision making. Goal: select actions to maximize total future reward. We have to plan ahead, because actions may have long term consequences. It may be better to sacrifice immediate reward to gain more long term reward. Greedy approach does not work in RL.

Formalism. At each step $t$ , the agent:

Executes action $A_{t}$
Receives observation $O_{t}$ from environment
Receives reward $R_{t}$ from environment

The environment:

Receives action $A_{t}$
Emits observation $O_{t}$
Emits reward $R_{t}$

The history is the sequence of observations, actions, rewards. $H_{t} = A_{1}, O_{1}, R_{1}, ..., A_{t}, O_{t}, R_{t}$

What happens next depends on the history:

The agent selects the next action based on the history
The environment selects observation and reward to emit

This is a very generic framework or formalism that can handle all types of real world scenarios.

State is the information used to determine what happens next. We do not want to have to load the entire history (e.g. stream of video frames) to make a decision. Formally, state is a function of the history: $S_{t} = f (H_{t})$

The simplest example of state is to e.g. just take the observation at the last timestamp (this worked for Atari games).

The environment state $S_{t}^{e}$ is the environment's private representation. What state the environment is in. That is, the data that the environment uses to pick the next observation / reward. The environment state is not usually visible to the agent or the algorithm.

The agent state $S_{t}^{a}$ is the agent's internal representation, i.e. it is the information used by reinforcement learning algorithms. It can be any function we choose of the history: $S_{t}^{a} = f (H_{t})$ .

A more mathematical definition of state. An information state (or Markov state) contains all useful information from the history.

Definition. A state $S_{t}$ is Markov if and only if $P (S_{t + 1} ∣ S_{t}) = P (S_{t + 1} ∣ S_{1}, ..., S_{t})$

The idea may be stated as "the future is independent of the past given the present". Once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future. In the helicopter example, the Markov state might be something like position, velocity, angle, angular velocity etc. Once we know all this information, it does not matter where the helicopter was 10 minutes ago, as we already have all the information we need to make an optimal decision next.

Two trivial statements to show that there always exists a Markov state:

The environment state $S_{t}^{e}$ is Markov by definition
The history $H_{t}$ is Markov, albeit not a useful one

Fully observable environments. Full observability means that agent directly observes the environment state. The nice case. $O_{t} = S_{t}^{a} = S_{t}^{e}$

Formally, this is a Markov decision process (MDP).

Partially observable environments. Partial observability means that agent indirectly observes the environment. e.g.

Robot with camera doesn't know its absolute location
A trading agent only observes current prices
A poker playing agent only observes public cards, not hidden cards

Now, the agent state is not the same as the environment state. Formally this is a partially observable Markov decision process (POMDP). So now we need the agent to construct its own state representation $S_{t}^{a}$ , e.g.:

Remember the complete history $S_{t}^{a} = H_{t}$
Build beliefs of the environment state, i.e. $S_{t}^{a} = (P (S_{t}^{e} = s_{1}), ..., P (S_{t}^{e} = s_{n}))$ . So we have a probability distribution over possible states that we believe the environment is in.
Recurrent neural network, i.e. $S_{t}^{a} = σ (S_{t - 1}^{a} W_{s} + O_{t} W_{o})$ . Use a linear transformation to combine the current observation with previous time step to get current time step.
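The recurrent option can be sketched in NumPy as below. The dimensions, the random weights and the choice of the logistic sigmoid for $σ$ are assumptions for illustration; in practice $W_{s}$ and $W_{o}$ would be learned.

```python
import numpy as np


def update_agent_state(prev_state, observation, W_s, W_o):
    """S_t = sigma(S_{t-1} W_s + O_t W_o), with sigma the logistic sigmoid."""
    pre_activation = prev_state @ W_s + observation @ W_o
    return 1.0 / (1.0 + np.exp(-pre_activation))


rng = np.random.default_rng(0)
state_dim, obs_dim = 4, 3
W_s = rng.normal(size=(state_dim, state_dim))
W_o = rng.normal(size=(obs_dim, state_dim))

state = np.zeros(state_dim)              # initial agent state
observations = rng.normal(size=(5, obs_dim))
for obs in observations:                 # fold each observation into the state
    state = update_agent_state(state, obs, W_s, W_o)

print(state)
```

The state stays a fixed size no matter how long the history gets, which is the point of not storing $H_{t}$ directly.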

Inside an RL agent

An RL agent may include these components:

Policy: the agent's behaviour function
Value function: how good is each state or action
Model: the agent's representation of the environment

Policy. A policy is the agent's behaviour. It is a map from state to action, e.g.

Deterministic policy, $a = π (s)$
Stochastic policy, $π (a ∣ s) = P (A = a ∣ S = s)$

Value function is a prediction of future reward, or the expected future reward. It is used to evaluate the goodness or badness of states. $v_{π} (s) = E_{π} [R_{t} + γ R_{t + 1} + γ^{2} R_{t + 2} + ...∣ S_{t} = s]$

Note that the value function is subscripted by $π$ . This indicates that the agent's expected future reward depends on its current policy. If the policy is causing a robot to fall a lot, the expected future reward is probably low.
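The quantity inside the expectation is the discounted return of one trajectory. A minimal sketch of computing it for a fixed list of rewards follows; the value function would then be the average of this over trajectories sampled under $π$ . The helper name is mine.

```python
def discounted_return(rewards, gamma):
    """Compute r_0 + gamma * r_1 + gamma^2 * r_2 + ... by folding from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


print(discounted_return([1, 1, 1], gamma=0.5))     # 1 + 0.5 + 0.25 = 1.75
print(discounted_return([-1, -1, -1], gamma=1.0))  # -3.0
```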

Model. A model predicts what the environment will do next. There are usually 2 parts to the model:

Transitions $P$ predicts the next state (dynamics of the environment)
Rewards $R$ predicts the next immediate reward

Formally, $P_{s s^{'}}^{a} = P (S^{'} = s^{'} ∣ S = s, A = a)$ and $R_{s}^{a} = E [R ∣ S = s, A = a]$

Note that having a model is optional, there are many model-free methods in RL.

Maze example. We have a grid maze. The rewards are $- 1$ per time step. The actions are N, E, S, W. The states are the agent's current location.

An example of a deterministic policy would be to have a fixed arrow direction for the action to take in any given grid that the agent is in.
An example of value function is to have a number in each position showing the number of steps it would take to get to the end (but negated as we lose value each time step)
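A sketch of the maze value function on a made-up 3x3 grid (`S` start, `G` goal, `#` wall). With a reward of -1 per step and no discounting, the value of a cell under the shortest-path policy is minus its distance to the goal, which a breadth-first search from the goal gives directly. The grid and the `maze_values` helper are assumptions for illustration.

```python
from collections import deque

MAZE = [
    "S.#",
    "..#",
    "#.G",
]


def maze_values(grid):
    """Return {(row, col): value} with value = -(steps to reach the goal)."""
    rows, cols = len(grid), len(grid[0])
    goal = next((r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "G")
    values = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (0, 1), (1, 0), (0, -1)):  # N, E, S, W
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in values:
                values[(nr, nc)] = values[(r, c)] - 1  # one more step away
                queue.append((nr, nc))
    return values


values = maze_values(MAZE)
print(values[(0, 0)])  # -4: the start is four steps from the goal
```

A deterministic policy then falls out for free: in each cell, move to the neighbour with the highest value.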

A taxonomy of RL agents:

Value based - only using a value function, the policy is implicit
Policy based - just have a policy function, no value function
Actor critic - have both a value function and policy function, and try to get best of both worlds

Another categorization is model-free vs model-based:

Model free means we just go straight to policy and/or value function
- No model
Model based approach. Try to model the environment and world first, then build the policy accordingly

Learning and Planning

Two fundamental problems in sequential decision making:

Reinforcement learning
- The environment is initially unknown
- The agent interacts with the environment
- The agent improves its policy
Planning
- A model of the environment is known and provided to the agent
- The agent performs internal computations with the model without any external interaction
- e.g. when playing an Atari game, suppose we give the agent access to an Atari emulator. The agent then knows that if it takes action $a$ , the emulator will move to a particular state, and so on. The agent can then plan ahead, build a search tree etc.

Exploration and Exploitation

Reinforcement learning is like trial and error learning. But it is possible to miss out on exploring steps that can lead to more reward.

Exploration means choosing to give up some known reward in order to find out more information about the environment. Exploitation exploits known information to maximize reward.

Examples:

Restaurant selection:
- Exploitation: go to your favourite restaurant
- Exploration: try a new restaurant
Online banner advertisements
- Exploitation: show the most successful advert
- Exploration: show a different and new advert

Lecture 2: Markov Decision Processes

Markov Decision Processes formally describe an environment for Reinforcement Learning.

The environment is fully observable, i.e. the current state fully characterizes the process
Almost all RL problems can be characterized as an MDP
Even continuous things like Optimal Control
Partially observable cases can be formulated as MDPs
Bandits are MDPs with one state

The Markov Property is central to MDPs. "The future is independent of the past given the present."

State Transition Matrix. For a Markov state $s$ and successor state $s^{'}$ , the state transition probability is defined as: $P_{s s^{'}} = P (S_{t + 1} = s^{'} ∣ S_{t} = s)$

The state transition matrix $P$ defines transition probabilities from all states $s$ to all successor states $s^{'}$ .

$P = \begin{bmatrix} P_{11} & \dots & P_{1 n} \\ \vdots & \ddots & \vdots \\ P_{n 1} & \dots & P_{nn} \end{bmatrix}$

Each row of the transition matrix sums to $1$ .

Markov Process. A Markov Process is a memoryless random process, i.e. a sequence of random states $S_{1}, S_{2}, ...$ with the Markov Property.

Definition. A Markov Process (or Markov Chain) is a tuple $< S, P >$ , where:
(i) $S$ is a (finite) set of states
(ii) $P$ is a state transition probability matrix
(iii) $P_{s s^{'}} = P (S_{t + 1} = s^{'} ∣ S_{t} = s)$

Example of a Markov Process. A student can transit from Class 1 to Class 2 to Class 3, Pass or Sleep or Pub based on transition probabilities. We can sample episodes for the markov chain. E.g. one episode may be C1 C2 C3 Pass Sleep.

The transition probability matrix may look something like the below. Note that Sleep is the terminal state, so its self-probability is 1.0.

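Rows are the current state and columns are the successor state:

$P = \begin{array}{c|ccccccc} & \text{C1} & \text{C2} & \text{C3} & \text{Pass} & \text{Pub} & \text{FB} & \text{Sleep} \\ \hline \text{C1} & 0 & 0.5 & 0 & 0 & 0 & 0.5 & 0 \\ \text{C2} & 0 & 0 & 0.8 & 0 & 0 & 0 & 0.2 \\ \text{C3} & 0 & 0 & 0 & 0.6 & 0.4 & 0 & 0 \\ \text{Pass} & 0 & 0 & 0 & 0 & 0 & 0 & 1.0 \\ \text{Pub} & 0.2 & 0.4 & 0.4 & 0 & 0 & 0 & 0 \\ \text{FB} & 0.1 & 0 & 0 & 0 & 0 & 0.9 & 0 \\ \text{Sleep} & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{array}$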
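Sampling episodes from this chain can be sketched in a few lines of Python. This is only an illustration: the dictionary `P` and the helper `sample_episode` are names I made up, and the probabilities are read off the transition matrix above.

```python
import random

# Transition probabilities of the student Markov chain, read off the matrix above.
# Each state maps to a list of (next_state, probability) pairs.
P = {
    "C1": [("C2", 0.5), ("FB", 0.5)],
    "C2": [("C3", 0.8), ("Sleep", 0.2)],
    "C3": [("Pass", 0.6), ("Pub", 0.4)],
    "Pass": [("Sleep", 1.0)],
    "Pub": [("C1", 0.2), ("C2", 0.4), ("C3", 0.4)],
    "FB": [("C1", 0.1), ("FB", 0.9)],
    "Sleep": [("Sleep", 1.0)],
}


def sample_episode(start="C1", terminal="Sleep", rng=random):
    """Sample states from the chain until the terminal state is reached."""
    state, episode = start, [start]
    while state != terminal:
        next_states, probs = zip(*P[state])
        state = rng.choices(next_states, weights=probs)[0]
        episode.append(state)
    return episode


rng = random.Random(0)  # seeded only so that the sketch is reproducible
for _ in range(3):
    print(" ".join(sample_episode(rng=rng)))
```

Each printed line is one episode, e.g. something of the form C1 C2 C3 Pass Sleep.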

A markov reward process is a markov chain with values.

Definition. A Markov reward process is a tuple $< S, P, R, γ >$ :

$S$ is a finite set of states

$P$ is a transition probability matrix, $P_{s s^{'}} = P [S_{t + 1} = s^{'} ∣ S_{t} = s]$

$R$ is a reward function, $R_{s} = E [R_{t + 1} ∣ S_{t} = s]$

$γ$ is a discount factor, $γ \in [0, 1]$

Definition. The return $G_{t}$ is the total discounted reward from time step $t$ .
$G_{t} = R_{t + 1} + γ R_{t + 2} + ... = \sum_{k = 0}^{\infty} γ^{k} R_{t + k + 1}$

There is no expectation because $G_{t}$ is one sample run of the Markov reward process. We'll take expectation later to get the expected return over infinite runs.

Note that the discount factor is the present value of future rewards. $0$ implies maximally short-sighted and $1$ implies maximally far-sighted.
The value of receiving reward $R$ after $k + 1$ time steps is $γ^{k} R$
This setup values immediate reward above future reward

Most Markov reward and decision processes are discounted. Why?

We do not have a perfect model, so the expected future rewards are more uncertain. Hence we put higher weights on immediate rewards.
Avoids infinite returns in cyclic Markov processes
If the reward is financial, immediate rewards earn more interest than delayed rewards
Animal / human behaviour shows preference for immediate rewards
It is sometimes possible to use undiscounted Markov reward processes, e.g. if all sequences terminate

The value function $v (s)$ gives the long-term value of state $s$ .

Definition. The state value function $v (s)$ of an MRP is the expected return starting from state $s$ :
$v (s) = E [G_{t} ∣ S_{t} = s]$

How do we compute the state value function? One way is to sample returns from the MRP, e.g. starting from $S_{1} = C1$ with $γ = 1/2$ :

C1 C2 C3 Pass Sleep: -2.25
C1 FB FB C1 C2 Sleep: -3.125
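The two sample returns above can be reproduced with a small sketch. The per-state rewards are not listed in these notes; the values in `R` below are the ones I recall from the lecture's student example, so treat them as an assumption. `discounted_return` is a hypothetical helper name.

```python
# Assumed rewards R_s for the student MRP (not stated in these notes).
R = {"C1": -2, "C2": -2, "C3": -2, "Pass": 10, "Pub": 1, "FB": -1, "Sleep": 0}


def discounted_return(episode, gamma):
    """G_1 = R_2 + gamma * R_3 + ..., where leaving a state s pays reward R[s]."""
    return sum((gamma ** k) * R[state] for k, state in enumerate(episode))


print(discounted_return(["C1", "C2", "C3", "Pass", "Sleep"], 0.5))      # -2.25
print(discounted_return(["C1", "FB", "FB", "C1", "C2", "Sleep"], 0.5))  # -3.125
```

Averaging such returns over many sampled episodes gives an estimate of $v (C1)$ .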

Consider if we set $γ = 0$ . Then the value function $v (s) = R_{s}$ , i.e. the value is just the immediate reward.

Now the important Bellman equation for MRPs: $v (s) = E [R_{t + 1} + γ v (S_{t + 1}) ∣ S_{t} = s]$

It essentially tells us that the value function can be decomposed into two parts:

Immediate reward $R_{t + 1}$
Discounted value of successor state $γ v (S_{t + 1})$

$\begin{aligned} v (s) &= E [G_{t} ∣ S_{t} = s] \\ &= E [R_{t + 1} + γ R_{t + 2} + γ^{2} R_{t + 3} + ... ∣ S_{t} = s] \\ &= E [R_{t + 1} + γ (R_{t + 2} + γ R_{t + 3} + ...) ∣ S_{t} = s] \\ &= E [R_{t + 1} + γ G_{t + 1} ∣ S_{t} = s] \\ &= E [R_{t + 1} + γ v (S_{t + 1}) ∣ S_{t} = s] \\ &= R_{s} + γ \sum_{s^{'}} P_{s s^{'}} v (s^{'}) \end{aligned}$

Note that in the second-to-last line, the argument inside $v (S_{t + 1})$ is a random variable, to express the fact that the state at time $t + 1$ is random.
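Note that both $G_{t + 1}$ and $v (S_{t + 1})$ are random variables: $G_{t + 1}$ is the (random) return from time step $t + 1$ onwards, and $v (S_{t + 1})$ is the value function evaluated at the (random) state at time step $t + 1$ .
$G_{t + 1}$ becomes $v (S_{t + 1})$ due to the law of iterated expectations. Recall that $E [X] = E [E [X ∣ Y]]$ . Conditioning additionally on the next state, $E [G_{t + 1} ∣ S_{t} = s] = E [E [G_{t + 1} ∣ S_{t + 1}, S_{t} = s] ∣ S_{t} = s]$ . By the Markov property, the inner expectation depends only on $S_{t + 1}$ and equals $v (S_{t + 1})$ by definition, which gives $E [G_{t + 1} ∣ S_{t} = s] = E [v (S_{t + 1}) ∣ S_{t} = s]$ .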

To dig into bellman equation a bit more. Use a 1-step look ahead search. We start at state $s$ , we look ahead one step and integrate over the probabilities of the next time step. Hence we get $v (s) = R_{s} + γ \sum_{s^{'}} P_{s s^{'}} v (s^{'})$ .

We can use the bellman equation to verify if our value function is correct. Taking the value at a particular state, we can check if it is indeed the sum of the immediate reward and the weighted sum of values in all possible next steps.

The Bellman equation can be expressed concisely using matrices, $v = R + γ P v$

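$\begin{bmatrix} v (1) \\ \vdots \\ v (n) \end{bmatrix} = \begin{bmatrix} R_{1} \\ \vdots \\ R_{n} \end{bmatrix} + γ \begin{bmatrix} P_{11} & \dots & P_{1 n} \\ \vdots & \ddots & \vdots \\ P_{n 1} & \dots & P_{nn} \end{bmatrix} \begin{bmatrix} v (1) \\ \vdots \\ v (n) \end{bmatrix}$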

The Bellman equation is a linear equation and can be solved directly using matrix inversion. $\begin{aligned} v &= R + γ P v \\ (I - γ P) v &= R \\ v &= (I - γ P)^{- 1} R \end{aligned}$

The complexity due to the matrix inversion is $O (n^{3})$ for $n$ states, which is not feasible for a large number of states. There are many iterative methods which are more efficient:

Dynamic programming
Monte Carlo evaluation
Temporal Difference learning
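For a small MRP the direct solution is still perfectly workable. Below is a minimal sketch that solves $(I - γ P) v = R$ for the student MRP with plain Gauss-Jordan elimination, then checks the result against the Bellman equation. The rewards `R`, the choice $γ = 0.9$ and the helper `solve_linear` are my own assumptions for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting, O(n^3)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]
P = [  # rows: current state, columns: successor state
    [0, 0.5, 0, 0, 0, 0.5, 0],
    [0, 0, 0.8, 0, 0, 0, 0.2],
    [0, 0, 0, 0.6, 0.4, 0, 0],
    [0, 0, 0, 0, 0, 0, 1.0],
    [0.2, 0.4, 0.4, 0, 0, 0, 0],
    [0.1, 0, 0, 0, 0, 0.9, 0],
    [0, 0, 0, 0, 0, 0, 1.0],
]
R = [-2, -2, -2, 10, 1, -1, 0]  # assumed rewards, not stated in these notes
gamma = 0.9                     # assumed discount factor
n = len(states)

# Build (I - gamma * P) and solve for v.
A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)] for i in range(n)]
v = solve_linear(A, R)

# Verify: v should satisfy v = R + gamma * P v at every state.
residual = max(
    abs(v[i] - (R[i] + gamma * sum(P[i][j] * v[j] for j in range(n))))
    for i in range(n)
)
for name, value in zip(states, v):
    print(f"{name:5s} {value:8.3f}")
print("max Bellman residual:", residual)
```

The residual check is the same verification idea described earlier: each value must equal the immediate reward plus the discounted, probability-weighted values of the next states.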

Markov Decision Process

So far the MRP has only been a building block. The MDP is what we really use. A Markov Decision Process (MDP) is a Markov reward process with decisions (actions). It is an environment in which all states are Markov.

Definition. A Markov Decision Process is a tuple $< S, A, P, R, γ >:$

$S$ is a finite set of states

$A$ is a finite set of actions

$P$ is a transition probability matrix with $P_{s s^{'}}^{a} = P [S_{t + 1} = s^{'} ∣ S_{t} = s, A_{t} = a]$

$R$ is a reward function, $R_{s}^{a} = E [R_{t + 1} ∣ S_{t} = s, A_{t} = a]$

$γ$ is a discount factor

Note that the transition probabilities and reward function now also depend on the action, which gives us some agency: we can choose actions to influence the rewards and values.

Definition. A policy $π$ is a distribution over actions given states,
$π (a ∣ s) = P [A_{t} = a ∣ S_{t} = s]$

A policy fully defines the behaviour of an agent. Some properties of a policy:

It only depends on the current state (not the history)
The policy does not depend on the time step $t$ (i.e. stationary)

We can still obtain the optimal policy because of the markov property - the current state captures all relevant information to make the optimal decision.

Definition. The state-value function $v_{π} (s)$ of an MDP is the expected return starting from state $s$ , and following policy $π$ : $v_{π} (s) = E_{π} [G_{t} ∣ S_{t} = s]$

Definition. The action-value function $q_{π} (s, a)$ of an MDP is the expected return starting from state $s$ , taking action $a$ , and following policy $π$ : $q_{π} (s, a) = E_{π} [G_{t} ∣ S_{t} = s, A_{t} = a]$

Bellman Expectation Equation. The state-value function can again be decomposed into immediate reward plus discounted value of the successor state. $v_{π} (s) = E_{π} [R_{t + 1} + γ v_{π} (S_{t + 1}) ∣ S_{t} = s]$

Similarly we can do so for the action-value function, by inserting the chosen action: $q_{π} (s, a) = E_{π} [R_{t + 1} + γ q_{π} (S_{t + 1}, A_{t + 1}) ∣ S_{t} = s, A_{t} = a]$

From a given state, we have a value function attached to that state, i.e. $v_{π} (s)$ . From this state, we have some possible actions to take. The policy determines the probability distribution over which action to take. With each action comes an action-value function $q_{π} (s, a)$ . Hence we have: $v_{π} (s) = \sum_{a \in A} π (a ∣ s) q_{π} (s, a)$

Another way to look at it: we start having already chosen a particular action. Having chosen the action, the environment determines the state we end up in (based on the transition probability matrix $P$ ). Hence we have: $q_{π} (s, a) = R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} v_{π} (s^{'})$

Now we can stitch these two perspectives together. Starting from a particular state, we can write $v_{π} (s)$ in terms of $q_{π}$ , then write $q_{π}$ in terms of $v_{π}$ again. This will allow us to get a recursive relationship of $v_{π} (s)$ in terms of $v_{π} (s^{'})$ and allow us to solve the equation.

The bellman expectation equation for $v_{π} (s)$ is thus:

$v_{π} (s) = \sum_{a \in A} π (a ∣ s) (R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} v_{π} (s^{'}))$

The math is expressing a simple idea: that the value at a particular state $s$ is the weighted sum of values from all possible actions we take under the current policy $π$ . The value of each action is in turn affected by the reward function and the transition probability that determines the state we end up in after taking a particular action.

Similarly, we can do the same by starting at an action instead of a state. The bellman expectation equation for $q_{π}$ is thus:

$q_{π} (s, a) = R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} \sum_{a^{'} \in A} π (a^{'} ∣ s^{'}) q_{π} (s^{'}, a^{'})$

Optimal Value Function

So far we have been defining the dynamic process of the MDP, but have not tried solving the optimization problem. We will turn to this now.

Definition. The optimal state-value function $v_{*} (s)$ is the maximum value function over all policies: $v_{*} (s) = \max_{π} v_{π} (s)$ . The optimal action-value function $q_{*} (s, a)$ is the maximum action-value function over all policies: $q_{*} (s, a) = \max_{π} q_{π} (s, a)$ .

The MDP problem is solved once we find $q_{*}$ . We thus need some algorithms to systematically find $q_{*}$ .

Define a partial ordering over policies: $π \geq π^{'} \text{ if } v_{π} (s) \geq v_{π^{'}} (s), \forall s$

Theorem. For any Markov Decision Process:

There exists an optimal policy $π_{*}$ that is better than or equal to all other policies, i.e. $π_{*} \geq π, \forall π$

All optimal policies achieve the optimal value function, i.e. $v_{π_{*}} (s) = v_{*} (s), \forall s$

All optimal policies achieve the optimal action-value function, i.e. $q_{π_{*}} (s, a) = q_{*} (s, a), \forall s, a$

How do we find the optimal policy? An optimal policy can be found trivially by maximizing over $q_{*} (s, a)$ , if we knew it. That is, we always pick the action $a$ with the highest $q_{*} (s, a)$ value. Hence if we have $q_{*}$ , we have $π_{*}$ . $π_{*} (a ∣ s) = \begin{cases} 1 & \text{if } a = \arg\max_{a^{'} \in A} q_{*} (s, a^{'}) \\ 0 & \text{otherwise} \end{cases}$
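In code, reading off a greedy policy from a table of action values is a one-liner. A minimal sketch, where the `q_star` numbers are made up purely for illustration and `greedy_policy` is a hypothetical helper:

```python
def greedy_policy(q):
    """Return a deterministic policy {state: action} that is greedy w.r.t. q.

    `q` maps each state to a dict {action: value}. Ties go to whichever
    action `max` encounters first.
    """
    return {s: max(action_values, key=action_values.get) for s, action_values in q.items()}


# Hypothetical q_* values for a two-state problem, for illustration only.
q_star = {
    "s1": {"left": 1.0, "right": 2.5},
    "s2": {"left": 0.3, "right": -1.0},
}
print(greedy_policy(q_star))  # {'s1': 'right', 's2': 'left'}
```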

Intuitively, we find the optimal policy by starting at the end (resting), and iteratively look backward. This is the same kind of intuition for the Bellman optimality equations.

The optimal value of being in a state $s$ is the highest value action we can take in that state. Note that we use $q_{*}$ instead of a generic $q$ because we are choosing from the optimal action-value function. $v_{*} (s) = \max_{a} q_{*} (s, a)$

The optimal value of an action $a$ is the immediate reward plus the discounted, weighted sum of values of states that we can end up in after taking the action. Note that in this step, we do not get to choose an action - the transition probabilities will determine what state we end up in after taking action $a$ : $q_{*} (s, a) = R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} v_{*} (s^{'})$

Finally, stitching these two equations together, we get the Bellman optimality equation for $v_{*}$ : $v_{*} (s) = \max_{a} [R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} v_{*} (s^{'})]$

How do we solve the bellman optimality equations? It is now non-linear due to the max function, so we cannot solve it with matrix inversion as before. There is no closed form solution in general, but there are many iterative solution methods:

Value iteration
Policy iteration
Q-learning
Sarsa

Intuition. The core idea behind the bellman equations is to break down a complex sequential decision problem into a series of simpler, recursive steps. Imagine we are at a particular point in time and in a particular state. The bellman equations tell us that if we can assume that we will act optimally for all future steps after this action, then the problem of finding the best current action becomes trivial - we simply choose the action that yields the highest expected value (based on assuming future optimality).

To actually start unravelling the equations and solving them, we start from the termination point of a process (where the assumption of future optimality trivially holds) and work backwards.

Lecture 3: Planning by Dynamic Programming

What is dynamic programming?

Dynamic: sequential or temporal component to the problem
Programming: optimising a mathematical "program", i.e. a policy

It is a method for solving complex problems, by breaking them into subproblems that are simpler to solve.

Dynamic Programming is a very general solution method for problems which have two properties:

Optimal substructure: the principle of optimality applies. The optimal solution can be decomposed into subproblems.
- e.g. to find the shortest path from A to B, we can find the shortest path from A to midpoint, and then find the shortest path from midpoint to B, and then combine the paths together.
Overlapping subproblems. The subproblems need to recur many times and solutions can be re-used.
- e.g. if we have the shortest path from midpoint to B, we can reuse that to find the shortest path from C to B if it traverses the midpoint as well.
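Both properties show up in a tiny shortest-path sketch. The graph below is hypothetical (node names follow the A, B, C, midpoint example, with `M` as the midpoint), and `lru_cache` plays the role of the cache of sub-solutions.

```python
from functools import lru_cache

# Hypothetical weighted DAG: graph[u] = {v: cost of edge u -> v}.
graph = {
    "A": {"M": 2, "C": 1},
    "C": {"M": 2},
    "M": {"B": 3},
    "B": {},
}
TARGET = "B"


@lru_cache(maxsize=None)
def shortest(u):
    """Cost of the shortest path from u to TARGET (inf if unreachable)."""
    if u == TARGET:
        return 0
    # Optimal substructure: best path = best first edge + best path from there.
    return min(
        (cost + shortest(v) for v, cost in graph[u].items()),
        default=float("inf"),
    )


print(shortest("A"))  # 5, via A -> M -> B
print(shortest("C"))  # 5, via C -> M -> B
print(shortest.cache_info().hits)  # > 0: the M -> B sub-solution was reused
```

The cached value of `shortest("M")` is the overlapping subproblem: it is computed once and reused by both A and C.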

Note that Markov Decision Processes satisfy both properties:

The bellman equations decompose the large problem into recursive steps
The value functions for a particular state cache the sub-solutions and are re-used

Planning by dynamic programming. Planning is a different problem from RL. Someone tells us the dynamics of the MDP, and we try to solve it.

Assume full knowledge of the MDP
Used for planning in an MDP
We can use this for prediction:
- Input: an MDP $< S, A, P, R, γ >$ and policy $π$
- Output: the value function $v_{π}$
We can also use this for control:
- Input: MDP $< S, A, P, R, γ >$
- Output: optimal value function $v_{*}$ , i.e. we want to find the best policy $π_{*}$

Policy Evaluation

Problem: we want to evaluate a given policy $π$ to see how good it is. The solution is to iteratively apply the Bellman expectation equation.

Let $v_{1}$ be a vector representing the value at all states
We use the bellman equation to update $v_{1} \to v_{2} \to ... \to v_{π}$ , i.e. it will converge to the true value function for this policy
Do this using synchronous backups:
- At each iteration step $k + 1$
- For all states $s \in S$
- Update $v_{k + 1} (s)$ using $v_{k} (s^{'})$ , i.e. we use the estimate of the value function in the previous iteration to form the new estimate of the value function
- $s^{'}$ is a successor state of $s$ , i.e. the next states we can reach using an action from $s$
It can be proven that this algorithm will converge to $v_{π}$

How exactly do we update $v_{k + 1} (s)$ ? We use the Bellman expectation equation from before. Intuitively, it is a one-step look ahead from the current state $s$ to compute the value for $s$ . $v_{k + 1} (s) = \sum_{a \in A} π (a ∣ s) (R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} v_{k} (s^{'}))$

Or in vector form: $v^{k + 1} = R^{π} + γ P^{π} v^{k}$

Small Gridworld example. Suppose we have a 4x4 grid, in which the top-left and bottom-right grids are terminal states. The reward for any state is $- 1$ , and we can walk NSEW from any spot. If we walk off the grid, the action just does nothing.

Now suppose we take a random walk ( $0.25$ probability of walking in each direction) and set $γ = 1$ . What would the value update algorithm look like?

We initialize all grids with value $0$
Recall that the update algorithm essentially adds the immediate reward and a discounted sum of the value function from the previous iteration
At $k = 1$ , every spot will be updated to $- 1$ because the reward is $- 1$ , and $v_{0} (s) = 0 \forall s$ . Except the two terminal states which remain at value $0$ by definition
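The steps above can be sketched as synchronous iterative policy evaluation on the 4x4 gridworld. This is a minimal illustration, not a definitive implementation; the names `step`, `evaluate_random_policy`, `theta` and `max_sweeps` are my own.

```python
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E as (row, col) offsets
STATES = [(r, c) for r in range(N) for c in range(N)]


def step(state, action):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = state[0] + action[0], state[1] + action[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else state


def evaluate_random_policy(gamma=1.0, theta=1e-10, max_sweeps=None):
    """Synchronous backups: v_{k+1} is computed entirely from v_k."""
    v = {s: 0.0 for s in STATES}
    sweeps = 0
    while max_sweeps is None or sweeps < max_sweeps:
        v_new = {}
        for s in STATES:
            if s in TERMINALS:
                v_new[s] = 0.0
            else:
                # Bellman expectation backup under the uniform random policy.
                v_new[s] = sum(0.25 * (-1.0 + gamma * v[step(s, a)]) for a in ACTIONS)
        delta = max(abs(v_new[s] - v[s]) for s in STATES)
        v = v_new
        sweeps += 1
        if delta < theta:
            break
    return v


v1 = evaluate_random_policy(max_sweeps=1)  # k = 1: every non-terminal cell is -1
v_pi = evaluate_random_policy()            # run until the values stop changing
for r in range(N):
    print(" ".join(f"{v_pi[(r, c)]:7.2f}" for c in range(N)))
```

The converged grid should match the familiar picture from Sutton and Barto, with values around $- 14$ next to the terminal cells and $- 22$ in the far corners.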

If we continue to update like that, it will converge to the true value function for $π$ . Also note that in this example, if we take a greedy approach to devise a new policy from this value function $v_{π}$ , we obtain the optimal policy.

Idea: A lousy policy can be used to devise a better policy after computing the value function.

Policy Iteration

We did policy evaluation in the previous section, i.e. finding the true value function $v_{π}$ for a given policy. Now in this section we want to optimize and find the best policy.

Two step process:

Policy Evaluation. Given a policy $π$ , we first evaluate the policy $π$ , finding $v_{π} (s) = E [R_{t + 1} + γ R_{t + 2} + ...∣ S_{t} = s]$
Policy improvement. We take a greedy approach and choose the best action at each state: $π^{'} = greedy (v_{π})$

Typically, we need many rounds of iteration of this process to converge. But the process of policy iteration always converges to $π_{*}$ . Specifically:

$v$ converges to $v_{*}$
$π$ converges to $π_{*}$
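The evaluate-then-improve loop can be sketched on the small gridworld from the previous section. Two assumptions of mine: I use $γ = 0.9$ so that evaluation converges even for a policy that never reaches a terminal cell, and I only switch an action when it is strictly better (by a small tolerance) so that ties do not cause flip-flopping.

```python
N, GAMMA = 4, 0.9  # GAMMA < 1 is an assumption so evaluation always converges
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
STATES = [(r, c) for r in range(N) for c in range(N)]


def step(s, a):
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s


def q_value(v, s, a):
    """One-step lookahead: reward of -1 plus discounted value of the next state."""
    return -1.0 + GAMMA * v[step(s, a)]


def evaluate(policy, theta=1e-10):
    """Iterative policy evaluation for a deterministic policy."""
    v = {s: 0.0 for s in STATES}
    while True:
        v_new = {s: 0.0 if s in TERMINALS else q_value(v, s, policy[s]) for s in STATES}
        if max(abs(v_new[s] - v[s]) for s in STATES) < theta:
            return v_new
        v = v_new


def improve(policy, v, tol=1e-9):
    """Greedy improvement; keep the current action unless another is strictly better."""
    new_policy = dict(policy)
    for s in STATES:
        if s in TERMINALS:
            continue
        best = max(ACTIONS, key=lambda a: q_value(v, s, a))
        if q_value(v, s, best) > q_value(v, s, policy[s]) + tol:
            new_policy[s] = best
    return new_policy


def policy_iteration():
    policy = {s: "N" for s in STATES}  # deliberately poor starting policy
    rounds = 0
    while True:
        v = evaluate(policy)
        new_policy = improve(policy, v)
        rounds += 1
        if new_policy == policy:
            return policy, v, rounds
        policy = new_policy


policy, v, rounds = policy_iteration()
print("rounds:", rounds)
for r in range(N):
    print(" ".join("." if (r, c) in TERMINALS else policy[(r, c)] for c in range(N)))
```

The printed arrows should all point along a shortest route to the nearest terminal corner.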

Somewhat realistic toy example from Sutton and Barto:

Suppose we have two locations, maximum of 20 cars at each location
Each day, we get to move up to 5 cars between locations overnight
Reward: $10 for each car rented (can only rent if we have enough cars)
Transitions: every day, a random number of cars are requested and returned at each location (governed by a poisson distribution)
- Location A: $\text{request} \sim \text{Poisson} (3)$ , $\text{return} \sim \text{Poisson} (3)$
- Location B: $\text{request} \sim \text{Poisson} (4)$ , $\text{return} \sim \text{Poisson} (2)$

Naturally we expect the optimal policy to involve moving cars from location A to location B. Using the policy iteration process, we get convergence to the optimal policy in 4 steps. Note that since this is a planning problem, we do know the underlying probability mechanisms, which allows us to compute the value function.

Now we can show that this policy iteration process converges to the optimal policy.

Theorem. Policy iteration process converges to the optimal policy.

Proof. First, consider a deterministic policy $a = π (s)$ . Now, we consider what happens if we change the policy by acting greedily wrt the value function of this policy, i.e.: $π^{'} (s) = \arg\max_{a \in A} q_{π} (s, a)$ . We see that taking the greedy action can only improve the policy, as expressed: $q_{π} (s, π^{'} (s)) = \max_{a \in A} q_{π} (s, a) \geq q_{π} (s, π (s)) = v_{π} (s)$ . Notes on the above statement:

On the 2nd line equality: Recall that $q_{π} (s, a)$ is the value at state $s$ if we took action $a$ at $s$ and then followed policy $π$ thereafter. So it follows that $q_{π} (s, π (s)) = v_{π} (s)$ , since we are just following policy $π$ in the current step + future steps

The statement is quite simply saying that taking the greedy action $π^{'} (s)$ for one step and then following $π$ is at least as good as following $π$ from the start. Hence choosing the highest value action can only improve the action value (quite trivial).

Now we want to go from this somewhat trivial statement to show that the value function itself must improve with every step of policy iteration (not trivial at all!).

The idea is to use a telescoping argument to show that this improves the value function, or $v_{π^{'}} (s) \geq v_{π} (s)$ : $\begin{aligned} v_{π} (s) &\leq q_{π} (s, π^{'} (s)) \\ &= E_{π^{'}} [R_{t + 1} + γ v_{π} (S_{t + 1}) ∣ S_{t} = s] \\ &\leq E_{π^{'}} [R_{t + 1} + γ q_{π} (S_{t + 1}, π^{'} (S_{t + 1})) ∣ S_{t} = s] \\ &\leq E_{π^{'}} [R_{t + 1} + γ R_{t + 2} + γ^{2} q_{π} (S_{t + 2}, π^{'} (S_{t + 2})) ∣ S_{t} = s] \\ &\leq E_{π^{'}} [R_{t + 1} + γ R_{t + 2} + ... ∣ S_{t} = s] \\ &= v_{π^{'}} (s) \end{aligned}$

Some notes on the above:

We start with the trivial inequality expressed above

The expression $E_{π^{'}}$ means taking expectation over possible trajectories under the policy where we take $π^{'} (s)$ in the current step, then follow policy $π$ for the rest of the trajectory

In line 2, we unpack $q_{π}$ according to the Bellman equation, which simply splits up the $q$ value into (i) the immediate reward and (ii) the expected value of our new state (expressed as a random variable $S$ ):

Note that $R_{t + 1}, R_{t + 2}, ...$ are rewards from taking the greedy action $π^{'} (s)$ at each step

Note that $v_{π} (S_{t + n})$ is the random variable expressing the value we have at time step $t + n$ , but evaluated under the previous policy $π$ . It is the previous $π$ instead of $π^{'}$ because we have access to the cached $v_{π}$ up to this point.

In line 3, we apply the trivial inequality again to show that taking the greedy step at the next state will again improve the value

In line 4, we again use the Bellman equation to unpack $q_{π}$

We keep repeating the two steps until termination. What we have at the end is simply the value function of our new policy $π^{'}$

We have shown that policy iteration must improve the value function with each iteration. Now what happens when improvements stop? We now have: $q_{π} (s, π^{'} (s)) = v_{π} (s)$

Since $q_{π} (s, π^{'} (s)) = \max_{a \in A} q_{π} (s, a)$ , we have: $v_{π} (s) = \max_{a \in A} q_{π} (s, a)$

This is simply the Bellman optimality equation. Satisfying the Bellman optimality equation means that $π$ is an optimal policy (will show later). Hence we have shown that we get $v_{π} (s) = v_{*} (s)$ .

Now, an observation is that policy iteration is quite wasteful. This is because we need to get the value function to converge fully to $v_{π}$ before we take the greedy step to improve the policy. In most cases, this is unnecessary because the greedy policy would already improve even with an imperfect value function.

Some ways to early stop policy evaluation to speed up this process:

Introduce a stopping condition once the value function does not change by much ( $ϵ$ -convergence of value function)
Stop policy evaluation after $k$ iterations
- In the extreme case, if we stop policy evaluation after $k = 1$ iteration, we get value iteration

Value Iteration

Before moving into value iteration, recall the fundamentals of dynamic programming. Observe that any optimal policy can be subdivided into two components:

An optimal first action $A_{*}$
Followed by an optimal policy from successor state $S^{'}$

Theorem. Principle of optimality.

A policy $π (a ∣ s)$ achieves the optimal value from state $s$ , i.e. $v_{π} (s) = v_{*} (s)$ , if and only if for any state $s^{'}$ reachable from $s$ , $π$ achieves the optimal value from state $s^{'}$ , i.e. $v_{π} (s^{'}) = v_{*} (s^{'})$

This theorem seems a bit of a truism, but it will be used to build the idea of value iteration.

Let us think of the value function as "caching" the solutions to subproblems. Now suppose we start "at the end" and assume we know the solution to all the subproblems $v_{*} (s^{'})$ where $s^{'}$ is all the states reachable from our current state $s$ .

Then we can solve immediately for $v_{*} (s)$ by doing a one-step lookahead to all these states $s^{'}$ : $v_{*} (s) \leftarrow \max_{a \in A} (R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} v_{*} (s^{'}))$

The statement above shows us how we can propagate the optimal value function from some states $s^{'}$ to a new state $s$ . So we can propagate the optimal value function across to all states as we continue to iterate.

The way to think about it (using the small gridworld as example) is that the termination point (trivially) starts off with the optimal value function. After one step of update, the states next to the termination point will now have the optimal value function, and then the states next to these, and so on until we propagate through all states.

Note the contrast with policy iteration. There, the policy evaluation step updates the value function across all states using the Bellman expectation equation. In value iteration, we update the value of each state by choosing the optimal action, i.e. using the Bellman optimality equation. This is the key difference between the two algorithms. The value iteration algorithm may be thought of as combining the (i) policy evaluation step and the (ii) greedy policy improvement step from policy iteration into one single step.

So we have seen that value iteration iteratively applies the Bellman optimality equation to update $v_{1} \to v_{2} \to ... \to v_{*}$ . (Note: Useful to compare this update statement with the Bellman expectation equation used for policy evaluation above). Convergence to $v_{*}$ will be proved later. The update equation is: $v_{k + 1} (s) = \max_{a \in A} (R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} v_{k} (s^{'}))$
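A minimal sketch of this update on the small gridworld (reward $- 1$ per step, $γ = 1$ ). The function and variable names are my own. Note how the only change from policy evaluation is the `max` over actions in place of the policy-weighted sum.

```python
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # N, S, W, E as (row, col) offsets
STATES = [(r, c) for r in range(N) for c in range(N)]


def step(s, a):
    """Deterministic move; walking off the grid leaves the state unchanged."""
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s


def value_iteration(gamma=1.0, theta=1e-12):
    v = {s: 0.0 for s in STATES}
    sweeps = 0
    while True:
        # Bellman optimality backup, applied synchronously to every state.
        v_new = {
            s: 0.0 if s in TERMINALS
            else max(-1.0 + gamma * v[step(s, a)] for a in ACTIONS)
            for s in STATES
        }
        sweeps += 1
        if max(abs(v_new[s] - v[s]) for s in STATES) < theta:
            return v_new, sweeps
        v = v_new


v_star, sweeps = value_iteration()
for r in range(N):
    print(" ".join(f"{v_star[(r, c)]:5.1f}" for c in range(N)))
```

The result is the negated number of steps to the nearest terminal corner, and the optimal values spread outwards from the terminal cells by one ring per sweep.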

Note that another difference between value iteration and policy iteration is that there is no explicit policy in value iteration. Since we are only doing one step of policy evaluation and then immediately taking the greedy step, the value function we have may not correspond to any real policy. But this does not stop the algorithm from converging.

The following table sums up the relationship between what we've learnt.

Problem	Bellman Equation	Algorithm
Prediction	Bellman Expectation Equation	Iterative Policy Evaluation
Control	Bellman Expectation Equation + Greedy Policy Improvement	Policy Iteration
Control	Bellman Optimality Equation	Value Iteration

Some notes:

These algorithms are based on the state value function $v_{π} (s)$
The complexity for $m$ actions and $n$ states is $O (m n^{2})$ per iteration
We could also apply the same algorithm to the action-value function $q_{π} (s, a)$
But the complexity worsens to $O (m^{2} n^{2})$ per iteration

Extensions to Dynamic Programming

The DP methods described so far used synchronous backups, i.e. we backup all states in parallel
Asynchronous DP backs up states individually, in any order, without updating all states in one step
This can significantly reduce computation
It is still guaranteed to converge, as long as every state continues to be selected for update

Now, three simple ideas:

In place Dynamic Programming
Prioritized Sweeping
Real time dynamic programming

In place dynamic programming. A simple idea where we update the value function in-place rather than store it in a separate array.

Original (store updates in a separate array):

For all $s \in S$ :
- $v_{new} (s) \leftarrow max_{a \in A} (R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} v_{old} (s^{'}))$
$v_{old} \leftarrow v_{new}$

New (in place updates right away):

For all $s \in S$ :
- $v (s) \leftarrow max_{a \in A} (R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} v (s^{'}))$

The method with in-place updates always uses the most recent values of $v$ , and thus often converges a lot faster.
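The two variants differ only in where the new value is written. A minimal sketch on the small gridworld (names are my own; the `in_place` flag switches between the two schemes):

```python
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
STATES = [(r, c) for r in range(N) for c in range(N)]


def step(s, a):
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s


def value_iteration(in_place, gamma=1.0, theta=1e-12):
    v = {s: 0.0 for s in STATES}
    sweeps = 0
    while True:
        # In place: write straight into v. Otherwise: write into a separate copy,
        # so every backup in this sweep still reads the old values.
        target = v if in_place else dict(v)
        delta = 0.0
        for s in STATES:
            if s in TERMINALS:
                continue
            new = max(-1.0 + gamma * v[step(s, a)] for a in ACTIONS)
            delta = max(delta, abs(new - v[s]))
            target[s] = new
        v = target
        sweeps += 1
        if delta < theta:
            return v, sweeps


v_sync, sweeps_sync = value_iteration(in_place=False)
v_inplace, sweeps_inplace = value_iteration(in_place=True)
print("sync sweeps:", sweeps_sync, "in-place sweeps:", sweeps_inplace)
print("same values:", v_sync == v_inplace)
```

On a grid this small the sweep counts can easily coincide. How much in-place updating helps depends on the order in which states are swept, which motivates the next idea.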

Prioritized Sweeping. Since we are doing immediate updates to $v$ , it raises the question: in what order should we update states?

One method is to use the magnitude of the Bellman error to guide state selection: $\text{error} (s) = \left| \max_{a \in A} (R_{s}^{a} + γ \sum_{s^{'} \in S} P_{s s^{'}}^{a} v (s^{'})) - v (s) \right|$
The idea is that states with the largest bellman error are those states whose value functions will change the most, which will significantly change the dynamics of the system, so we should update them first
This can be implemented efficiently by maintaining a priority queue
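A rough sketch of the priority-queue idea using Python's `heapq`, again on the small gridworld. This is my own simplified variant, not a reference implementation: stale queue entries are simply re-checked when popped, and after each backup the predecessors of the updated state are re-queued if their Bellman error is non-zero.

```python
import heapq

N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]
STATES = [(r, c) for r in range(N) for c in range(N)]


def step(s, a):
    r, c = s[0] + a[0], s[1] + a[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s


def backup(v, s, gamma):
    """Bellman optimality backup for a single state."""
    return max(-1.0 + gamma * v[step(s, a)] for a in ACTIONS)


def bellman_error(v, s, gamma):
    return abs(backup(v, s, gamma) - v[s])


def prioritized_sweeping(gamma=1.0, theta=1e-12):
    v = {s: 0.0 for s in STATES}
    # predecessors[s2]: non-terminal states that can reach s2 in one step.
    predecessors = {s: set() for s in STATES}
    for s in STATES:
        if s not in TERMINALS:
            for a in ACTIONS:
                predecessors[step(s, a)].add(s)
    # heapq is a min-heap, so store negated errors to pop the largest error first.
    heap = [(-bellman_error(v, s, gamma), s) for s in STATES if s not in TERMINALS]
    heapq.heapify(heap)
    backups = 0
    while heap:
        _, s = heapq.heappop(heap)
        if bellman_error(v, s, gamma) < theta:
            continue  # stale entry: this state no longer needs an update
        v[s] = backup(v, s, gamma)
        backups += 1
        # Changing v[s] can change the error of any state that leads into s.
        for p in predecessors[s]:
            err = bellman_error(v, p, gamma)
            if err >= theta:
                heapq.heappush(heap, (-err, p))
    return v, backups


v, backups = prioritized_sweeping()
print("single-state backups:", backups)
for r in range(N):
    print(" ".join(f"{v[(r, c)]:5.1f}" for c in range(N)))
```

When the queue empties, every state has (near) zero Bellman error, so the values satisfy the Bellman optimality equation.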

Real time dynamic programming. Idea is to select states that the real world agent is visiting to update.

Basically update the states that the agent is visiting right now
After each time step we have $S_{t}, A_{t}, R_{t + 1}$
So we update the state $S_{t}$

Important note: DP uses full-width backups, whether we are doing sync or async updates. This means that each backup considers every action and every successor state. Also, we need full knowledge of the MDP dynamics to compute it. For large problem spaces, even one single backup may be too expensive to compute. Hence in subsequent lectures we will consider sample backups.

Can use the contraction mapping theorem to prove convergence etc.

Model Free Prediction

Lecture 4: Model Free Prediction

Methods in which no one tells us the dynamics of the environment, as opposed to the planning setting before.

Introduction

Lecture 3: planning by dynamic programming, in which we solve a known MDP
Lecture 4: Model-free prediction, in which we estimate the value function of an unknown MDP
- Unknown in the sense that we do not have access to the environment, only the interactions that the agent has with the environment
Lecture 5: Model-free control (or optimization), in which we optimize the value function of an unknown MDP

Monte Carlo Reinforcement Learning

MC methods learn directly from episodes of experience.

It is model-free in that there is no knowledge of MDP transitions / rewards.
It only learns from complete episodes
MC uses the simplest idea: the value function of a state is the average return from that state over many many runs
Downside: MC can only be applied to episodic MDPs
- Episodic meaning that all episodes must terminate and we get a return value

The goal of MC Policy Evaluation is to learn $v_{π}$ from episodes of experience under policy $π$ :

An episode: $S_{1}, A_{1}, R_{2}, ..., S_{k} \sim π$

Recall that:

The return is the total discounted reward: $G_{t} = R_{t + 1} + γ R_{t + 2} + ... + γ^{T - t - 1} R_{T}$
And the value function is the expected return given that we start at a given state: $v_{π} (s) = E_{π} [G_{t} ∣ S_{t} = s]$

So the whole idea of MC policy evaluation is to replace the expected return function with the empirical mean return from observing many many episodes.

There are two main methods of performing this:

First visit MC policy evaluation
Every visit MC policy evaluation

Method 1: First visit MC policy evaluation. Algorithm for evaluating a given state $s$ is:

At the first time step $t$ where state $s$ is visited in a given episode:
- Increment counter $N (s) \leftarrow N (s) + 1$ . $N (s)$ is the count of episodes where $s$ was visited.
- Increment total return $S (s) \leftarrow S (s) + G_{t}$
- Value is estimated by mean return $V (s) = S (s) / N (s)$
By the law of large numbers, $V (s) \to v_{π} (s)$ as $N (s) \to \infty$

Note that $G_{t}$ above is the total discounted reward from time step $t$ onwards.

Method 2: Every visit MC policy evaluation. The algorithm is identical to first visit, with the only difference in that we perform the increment step every time we visit state $s$ .
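
The two methods can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function name `mc_policy_evaluation`, the `(state, reward)` episode format and the toy episode are all assumptions made for this example.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Estimate V(s) as the mean return observed from s.

    episodes: list of episodes, each a list of (state, reward) pairs,
    where reward is the reward received after leaving that state (R_{t+1}).
    """
    n_visits = defaultdict(int)        # N(s)
    total_return = defaultdict(float)  # S(s)
    for episode in episodes:
        # Compute G_t for every time step by sweeping backwards.
        returns = []
        g = 0.0
        for _, reward in reversed(episode):
            g = reward + gamma * g
            returns.append(g)
        returns.reverse()
        seen = set()
        for (state, _), g_t in zip(episode, returns):
            if first_visit and state in seen:
                continue  # first visit: count each state once per episode
            seen.add(state)
            n_visits[state] += 1
            total_return[state] += g_t
    return {s: total_return[s] / n_visits[s] for s in n_visits}

# Hypothetical toy episode that visits state "a" twice.
toy = [[("a", 1.0), ("a", 0.0), ("b", 2.0)]]
v_first = mc_policy_evaluation(toy, first_visit=True)
v_every = mc_policy_evaluation(toy, first_visit=False)
```

In the toy episode the returns from the two visits to "a" are 3 and 2, so first visit gives $V (a) = 3$ while every visit averages them to $2.5$.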

BlackJack example

There are 200 unique states:
- Current sum (12 to 21). If it's 11 or below, the action is automatically to twist.
- Dealer's showing card ace to 10
- Do I have a "useable" ace?
Actions:
- stick: stop receiving cards
- twist: take another card.
Reward for stick:
- +1 if our sum > dealer's sum
- 0 if our sum = dealer's sum
- -1 if our sum < dealer's sum
Reward for twist:
- -1 if our sum > 21
- 0 otherwise

We can use MC policy evaluation algorithm to play 10,000 episodes of blackjack and compute the value function of each state, under a given policy. For e.g. a naive policy is to stick if our sum >= 20, otherwise twist.

Incremental Mean

The mean of a sequence of values can be computed in an incremental algorithm: $\begin{aligned} μ_{k} &= \frac{1}{k} \sum_{j = 1}^{k} x_{j} \\ &= \frac{1}{k} (x_{k} + \sum_{j = 1}^{k - 1} x_{j}) \\ &= \frac{1}{k} (x_{k} + (k - 1) μ_{k - 1}) \\ &= μ_{k - 1} + \frac{1}{k} (x_{k} - μ_{k - 1}) \end{aligned}$

The last line shows that at each step, we just need to adjust the running mean $μ$ by a small quantity, which is the difference between the new observed value $x_{k}$ and the current mean $μ_{k - 1}$ . This is analogous to a gradient update.

We can apply this incremental mean algorithm to Monte Carlo updates. Recall that the value function $V (s)$ is the mean return over episodes. Hence we can change the above MC algorithm to an incremental mean update:

After observing a given episode $S_{1}, A_{1}, R_{2}, ..., S_{T}$ :
- For each state $S_{t}$ with return $G_{t}$ : $N (S_{t}) \leftarrow N (S_{t}) + 1$ , then $V (S_{t}) \leftarrow V (S_{t}) + \frac{1}{N (S_{t})} (G_{t} - V (S_{t}))$
- We may even replace the step size $\frac{1}{N (S_{t})}$ with a fixed step size $α$ . This is the usual approach in non-stationary problems, since it gradually forgets old episodes. It also means we do not need to keep track of old episodes or counts and can just keep updating $V (S_{t})$ : $V (S_{t}) \leftarrow V (S_{t}) + α (G_{t} - V (S_{t}))$
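
A small sketch of the two update rules above, assuming nothing beyond plain Python floats. The function names are made up for illustration. The first reproduces the ordinary mean exactly, while the second tracks an exponentially weighted running mean.

```python
def incremental_mean(values):
    """Running mean using mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k."""
    mu = 0.0
    for k, x in enumerate(values, start=1):
        mu += (x - mu) / k
    return mu

def constant_step_mean(values, alpha, init=0.0):
    """Same update with a fixed step size alpha instead of 1 / k."""
    v = init
    for x in values:
        v += alpha * (x - v)
    return v

mean_exact = incremental_mean([2.0, 4.0, 9.0])           # ordinary mean of the list
mean_tracked = constant_step_mean([1.0, 1.0], alpha=0.5)  # 0 -> 0.5 -> 0.75
```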

Temporal Difference Learning

TD methods are different from MC methods, in that we do not need to wait for full episodes to learn.

TD methods, like MC methods, learn directly from episodes of experience
TD methods, like MC methods, are also model-free
TD learns from incomplete episodes using bootstrapping
TD updates a guess towards a guess

Goal remains the same: learn $v_{π}$ online from experience under policy $π$ .

Simplest temporal difference learning algorithm: $T D (0)$ .

Update value $V (S_{t})$ toward estimated return $R_{t + 1} + γV (S_{t + 1})$ $V (S_{t}) \leftarrow V (S_{t}) + α (R_{t + 1} + γV (S_{t + 1}) - V (S_{t}))$
$R_{t + 1} + γV (S_{t + 1})$ is called the TD target
$R_{t + 1} + γV (S_{t + 1}) - V (S_{t})$ is called the TD error

Contrast this with incremental every-visit Monte Carlo which we saw earlier:

Update value $V (S_{t})$ toward actual return $G_{t}$ $V (S_{t}) \leftarrow V (S_{t}) + α (G_{t} - V (S_{t}))$
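
As a rough sketch, the TD(0) update for a table lookup value function fits in one small function. The name `td0_update`, the dict-based table and the `terminal` flag are assumptions for this example, not something from the lecture.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """Apply one TD(0) update to the table V in place and return the TD error."""
    # The value of a terminal state is taken to be 0.
    td_target = r + (0.0 if terminal else gamma * V[s_next])
    td_error = td_target - V[s]
    V[s] += alpha * td_error
    return td_error

V = {"a": 0.0, "b": 1.0}
delta = td0_update(V, "a", r=0.5, s_next="b", alpha=0.1, gamma=0.9)
```

Here the TD target is $0.5 + 0.9 \times 1.0 = 1.4$ , so $V (a)$ moves a tenth of the way from $0$ towards $1.4$ . Note that the update happens right after a single transition, without waiting for the episode to finish.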

Car Driving Example

An analogy for understanding the difference between MC and TD methods. Imagine we are driving the car home. At the start, we expect to take 30 mins for the journey. And then it rains, so we update our prediction to 40 mins. And so on. Eventually, the final journey takes 43 mins.

For MC method, we need to wait until the journey is complete, and then we update each earlier prediction along the way towards the actual outcome of 43 mins
For TD method, we can immediately update each prediction towards the next prediction, each step of the way

Bias Variance Trade-off

There is a bias variance trade-off between choosing MC or TD method for policy evaluation.

The return $G_{t} = R_{t + 1} + γ R_{t + 2} + .. + γ^{T - 1} R_{T}$ is an unbiased estimate of $v_{π} (S_{t})$
The oracle TD target $R_{t + 1} + γ v_{π} (S_{t + 1})$ is also an unbiased estimate of $v_{π} (S_{t})$
- We know this from the bellman expectation equation
- But it requires access to the oracle $v_{π} (S_{t + 1})$ which we do not have
The TD target $R_{t + 1} + γV (S_{t + 1})$ is a biased estimate of $v_{π} (S_{t})$
- This is because $V$ is our current estimate of the value function which can be wildly wrong
Observe that the TD target has much lower variance than the return:
- The return depends on many random actions, transitions, rewards through the entire run of the episode
- The TD target only depends on one random action, transition and reward
  - The value function $V$ is a deterministic function

So to summarize:

MC has high variance and zero bias
- So it has good convergence properties, even with function approximation later
- It is not very sensitive to the initial value
- Very simple to understand and use
TD has low variance but some bias
- Usually it is much more efficient than MC
- TD(0) can be proven to converge to $v_{π} (s)$ using a table lookup
- But with function approximation convergence is not always guaranteed
- More sensitive to the initial value

What is function approximation? This will be covered later on. But in general, we have been looking at $v (s)$ as a table lookup for each state. This is not feasible for problems with large state spaces, hence we need to learn a function to approximate $v (s)$ for all states.

MC vs TD empirical example

TD generally converges faster than MC
But if the step size $α$ is too large, TD may not fully converge as it will oscillate

So far we have seen that both MC and TD converge as the number of episodes goes to infinity.

That is, $V (s) \to v_{π} (s)$ as the number of episodes $\to \infty$
But what if we only have a limited number of $K$ episodes to learn from?
For example, what if we are repeatedly sampling episode $k \in [1, K]$ ?

AB Example

A simple example to illustrate difference between MC and TD in the finite data case. Suppose we have 6 episodes:

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 0

What would we think $V (A), V (B)$ are?

If we use MC, then $V (A) = 0, V (B) = 4/6$ . $V (A)$ is $0$ because we only encountered one episode involving state $A$ and the reward was $0$ .
If we use TD, then $V (A) = 4/6, V (B) = 4/6$ . $V (A)$ is $4/6$ because we observed a 100% probability of transiting from $A \to B$ , so the value of $A$ (without discounting) is the same as value of $B$ due to bootstrapping.

In more precise terms:

MC converges to the solution which minimizes the mean squared error
- i.e. minimizes the divergence from observed returns: $\min \sum_{k = 1}^{K} \sum_{t = 1}^{T_{k}} [g_{t}^{k} - V (s_{t}^{k})]^{2}$
- In the AB example above, this sets $V (A) = 0$
TD converges to the solution of the maximum likelihood Markov model
- i.e. it converges to the MDP that best fits the data
- In the AB example, $V (A) = 4/6$
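
The AB numbers can be checked with a short script. This is a sketch under assumptions: the episode format and function names are invented here. The TD answer is computed directly by fitting the maximum likelihood Markov model and solving it, rather than by actually running repeated TD(0) updates.

```python
from fractions import Fraction

# Each episode is a list of (state, reward) pairs, undiscounted (gamma = 1).
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 4 + [[("B", 0)]]

def batch_mc(episodes):
    """Mean observed return from each state."""
    returns = {}
    for ep in episodes:
        g = 0
        for state, reward in reversed(ep):
            g += reward
            returns.setdefault(state, []).append(g)
    return {s: Fraction(sum(gs), len(gs)) for s, gs in returns.items()}

def batch_td(episodes, sweeps=10):
    """Fit the maximum likelihood Markov model, then evaluate it by sweeps."""
    reward_sum, visits, transitions = {}, {}, {}
    for ep in episodes:
        for i, (state, reward) in enumerate(ep):
            visits[state] = visits.get(state, 0) + 1
            reward_sum[state] = reward_sum.get(state, 0) + reward
            if i + 1 < len(ep):  # a missing successor means termination
                nxt = ep[i + 1][0]
                counts = transitions.setdefault(state, {})
                counts[nxt] = counts.get(nxt, 0) + 1
    V = {s: Fraction(0) for s in visits}
    for _ in range(sweeps):
        for s in visits:
            expected_next = sum(
                Fraction(c, visits[s]) * V[n]
                for n, c in transitions.get(s, {}).items()
            )
            V[s] = Fraction(reward_sum[s], visits[s]) + expected_next
    return V

v_mc = batch_mc(episodes)
v_td = batch_td(episodes)
```

Both methods agree on $V (B) = 4/6$ , but only the model-based solution propagates that value back to $A$ .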

In summary:

TD exploits the markov property, so it is usually more efficient in markov environments, where we can rely on states to encode information
MC does not exploit the markov property, so it is usually more efficient in non-markov environments, e.g. partial observability etc.

So far we have looked at 3 types of backup:

Monte Carlo Backup: we sample one entire trajectory / episode from the agent's interactions with the environment till termination $V (S_{t}) \leftarrow V (S_{t}) + α (G_{t} - V (S_{t}))$
Temporal Difference Backup: we sample one step lookahead and then update parameters $V (S_{t}) \leftarrow V (S_{t}) + α (R_{t + 1} + γV (S_{t + 1}) - V (S_{t}))$
Dynamic Programming Backup: we look ahead one step, but because we have access to the environment, we can compute the expectation over all possible next steps. $V (S_{t}) \leftarrow E_{π} [R_{t + 1} + γV (S_{t + 1})]$

This gives us two dimensions to categorize our algorithms:

Bootstrapping: the update involves an estimate (e.g. our value function)
- MC does not bootstrap
- DP bootstraps
- TD bootstraps
Sampling: we use sampling instead of a full-width expectation / search
- MC samples
- DP does not sample
- TD samples

TD Lambda

TD Lambda is a generalization of the above trade-off. We let TD target look $n$ steps into the future before updating. If we look forward $\infty$ number of steps, it becomes monte carlo learning.

Specifically, for $n = 1, 2, \infty$ , our returns are:

$n = 1$ : $G_{t}^{(1)} = R_{t + 1} + γV (S_{t + 1})$
$n = 2$ : $G_{t}^{(2)} = R_{t + 1} + γ R_{t + 2} + γ^{2} V (S_{t + 2})$
$n = \infty$ : $G_{t}^{(\infty)} = R_{t + 1} + γ R_{t + 2} + ... + γ^{T - 1} R_{T}$ . We can see this corresponds to MC update, without use of value function $V$ at all

So the n-step return is: $G_{t}^{(n)} = R_{t + 1} + γ R_{t + 2} + ... + γ^{n - 1} R_{t + n} + γ^{n} V (S_{t + n})$

And the n-step TD learning update is: $V (S_{t}) \leftarrow V (S_{t}) + α (G_{t}^{(n)} - V (S_{t}))$

What is the best $n$ ? It is a highly sensitive parameter that depends on the problem, $α$ etc. Hence a proposal is made to average the $n$ -step returns over different values of $n$ . For example, we could average $\frac{1}{2} G_{t}^{(2)} + \frac{1}{2} G_{t}^{(4)}$ . This averaging would make the algorithm much more robust to the choice of $n$ .

The common way to perform a weighted average of returns is to use exponential $λ$ decay, such that returns with a longer look-ahead window are weighted less. This algorithm is called TD- $λ$ . Specifically: $G_{t}^{λ} = (1 - λ) \sum_{n = 1}^{\infty} λ^{n - 1} G_{t}^{(n)}$

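Note that in an episodic task, every $n$ -step return that reaches the end of the episode is equal to the actual return $G_{t}$ . So the weight given to the final, actual return is the sum to $\infty$ of the weights from that step $n$ onwards, i.e. it is a geometric series. It makes sense to put more weight on the final, actual return.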

This leads directly to forward-view TD( $λ$ ), where we sample trajectories of $n$ steps and update the value function according to: $V (S_{t}) \leftarrow V (S_{t}) + α (G_{t}^{λ} - V (S_{t}))$

Now, the forward view has a shortcoming, which is that we need to wait until we have sampled $n$ steps into the future, before we can update the value function. Thus it suffers similar downside to MC update, where we cannot update the value function immediately after each step.
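
To make the weighting concrete, here is a sketch of the $λ$ -return for a single finished episode. The list-based episode format and the function names are assumptions for illustration. `rewards[k]` stands for $R_{k + 1}$ and `values[k]` for $V (S_{k})$ .

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^(n) for an episode that terminates after len(rewards) steps."""
    T = len(rewards)
    end = min(t + n, T)
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if end < T:  # bootstrap only if the terminal state was not reached
        g += gamma ** n * values[end]
    return g

def lambda_return(rewards, values, t, gamma, lam):
    """Forward-view lambda-return G_t^lambda."""
    steps_left = len(rewards) - t
    g_lam = 0.0
    for n in range(1, steps_left):
        weight = (1 - lam) * lam ** (n - 1)
        g_lam += weight * n_step_return(rewards, values, t, n, gamma)
    # All n >= steps_left give the full return; their weights sum
    # to lam ** (steps_left - 1) (geometric series).
    g_lam += lam ** (steps_left - 1) * n_step_return(
        rewards, values, t, steps_left, gamma
    )
    return g_lam

rewards = [1.0, 2.0, 3.0]
values = [0.5, 0.4, 0.3]
g_td0 = lambda_return(rewards, values, t=0, gamma=0.9, lam=0.0)
g_mid = lambda_return(rewards, values, t=0, gamma=0.9, lam=0.5)
g_mc = lambda_return(rewards, values, t=0, gamma=0.9, lam=1.0)
```

With $λ = 0$ this collapses to the one-step TD target, and with $λ = 1$ it collapses to the Monte Carlo return.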

Backward View TD Lambda

One key idea is eligibility traces. In deciding to assign credit to past events for a current reward, there are generally two intuitive heuristics to use:

Frequency heuristic: assign credit to most frequent recent states
Recency heuristic: assign credit to most recent states

The eligibility trace combines both heuristics in a simple formula:

$E_{0} (s) = 0$
$E_{t} (s) = γλ E_{t - 1} (s) + 1 (S_{t} = s)$

The eligibility trace gives us a weight at a given time step for each state $s$ . This weight tells us how much credit we should assign to $s$ for a reward at the current time step.

The Backward View TD Lambda uses this idea:

Keep an eligibility trace for every state $s$
Update value V(s) for every state $s$ in proportion to the TD-error $δ_{t}$ and eligibility trace $E_{t} (s)$ : $δ_{t} = R_{t + 1} + γV (S_{t + 1}) - V (S_{t})$ , then $V (s) \leftarrow V (s) + α δ_{t} E_{t} (s)$

Observe that $δ_{t}$ is just our update for TD(0) with a single step look ahead, i.e. $G_{t}^{(1)} - V (S_{t})$ . Thus we can see that when $λ = 0$ , only the current state is updated, since $E_{t} (s) = 1 (S_{t} = s)$ . This results in the TD(0) update: $V (S_{t}) \leftarrow V (S_{t}) + α δ_{t}$ .

On the other extreme, when $λ = 1$ , all credit is deferred until the end of the episode (not sure I see this from the formula). Thus it is equivalent to MC update.

In fact, there is a theorem that the sum of offline updates is identical for both forward view and backward view TD-lambda. This is nice because the backward view with eligibility traces makes it easy to implement, as we never need to look forward into the future. We just need to keep track of eligibility traces at each time step, and then apply the update to all states at each step.
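
A numerical check may help with the $λ = 1$ case. The sketch below accumulates the backward-view updates over one episode and only applies them at the end (offline). The function name, the episode format and the numbers are assumptions for illustration. With $λ = 1$ the trace of a state visited at time $t$ decays only by $γ$ , so the summed TD errors $\sum_{k \geq t} γ^{k - t} δ_{k}$ telescope into $G_{t} - V (S_{t})$ , which is the MC error.

```python
def backward_td_lambda_offline(states, rewards, V, alpha, gamma, lam):
    """Backward-view TD(lambda) over one episode, with updates applied at the end.

    states[t] is S_t and rewards[t] is R_{t+1}. The state reached after the
    last reward is terminal with value 0.
    """
    trace = {s: 0.0 for s in V}  # eligibility trace E(s)
    total = {s: 0.0 for s in V}  # accumulated updates per state
    for t, s in enumerate(states):
        v_next = V[states[t + 1]] if t + 1 < len(states) else 0.0
        delta = rewards[t] + gamma * v_next - V[s]
        for x in trace:
            trace[x] *= gamma * lam  # decay every trace
        trace[s] += 1.0              # bump the trace of the current state
        for x in trace:
            total[x] += alpha * delta * trace[x]
    return {s: V[s] + total[s] for s in V}

V = {"a": 0.5, "b": 0.4, "c": 0.3}
episode_states = ["a", "b", "c"]
episode_rewards = [1.0, 2.0, 3.0]
updated_mc = backward_td_lambda_offline(
    episode_states, episode_rewards, V, alpha=0.1, gamma=0.9, lam=1.0
)
updated_td0 = backward_td_lambda_offline(
    episode_states, episode_rewards, V, alpha=0.1, gamma=0.9, lam=0.0
)
# Monte Carlo return from "a": 1 + 0.9 * 2 + 0.81 * 3 = 5.23
mc_updated_a = V["a"] + 0.1 * (5.23 - V["a"])
```

For $λ = 0$ the same function reduces to the TD(0) update, since the trace is non-zero only for the current state.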

Model Free Control

Lecture 5: Model Free Control

All lectures are building up to this point, to optimize a problem where we do not have access to the underlying MDP. For such problems, we either do not know the underlying MDP, or it is too big to use (e.g. game of Go).

On policy vs Off policy

On policy learning is to learn on the job. Learn about policy $π$ based on experience sampled from $π$
Off policy learning is to learn by observing others. Learn about policy $π$ by sampling another robot (or human) experience $μ$

Start with the simpler case, which is on policy learning. The basic framework is generalized policy iteration (recap), which alternates between:

Policy evaluation: estimate $v_{π}$
Policy improvement: generate a new $π^{'} \geq π$

Naive case: policy iteration with Monte-Carlo evaluation. Basically, we use MC policy evaluation to update our value function, and then do greedy policy improvement. Would this work?

No. The main problem is that previously when we had access to the underlying MDP, we could do greedy policy improvement because we had access to the transition dynamics. Specifically, when we do policy improvement, we want to compute: $π^{'} (s) = \arg\max_{a \in A} R_{s}^{a} + P_{s s^{'}}^{a} V (s^{'})$
However in model-free control, we do not have access to $P_{s s^{'}}^{a}$ , meaning that we do not know what probabilities determine the state we will end up in given action $a$ . So there is no clear way to do greedy policy improvement if we only have an estimate of $V (s) \forall s$ .
To deal with this issue, we can do greedy policy improvement over $Q (s, a)$ instead. Then we can simply take: $π^{'} (s) = \arg\max_{a \in A} Q (s, a)$

So now we do generalized policy iteration with action-value function.

Start with $Q, π$
Update action-value function $Q = q_{π}$
Greedily update policy to $π = \text{greedy} (Q)$

However, we still have another problem, which is the exploration issue. If we act greedily all the time, there is no guarantee that we will explore all states and thus find the optimal policy.

Toy Example: Greedy Action Selection

Choose between two doors:

Open left door: reward 0. $V (\text{left}) = 0$
Open right door: reward +1. $V (\text{right}) = + 1$
Open right door: reward +3. $V (\text{right}) = + 2$
Open right door: reward +2. $V (\text{right}) = + 2$

The greedy policy will lock us onto right door forever. But we will never know if the left door actually has higher mean return.

$ϵ$ -Greedy Exploration

The simplest idea for ensuring continual exploration.

Try all $m$ actions with non-zero probability
With probability $1 - ϵ$ choose the greedy action
With probability $ϵ$ choose an action at random

$π (a ∣ s) = \begin{cases} ϵ / m + 1 - ϵ & \text{if } a = a^{*} = \arg\max_{a^{'} \in A} Q (s, a^{'}) \\ ϵ / m & \text{otherwise} \end{cases}$

Note that $ϵ / m$ is added for the first case as well since the action chosen at random can also be the greedy action $a^{*}$ .
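
A minimal sketch of $ϵ$ -greedy action selection, assuming the Q-values for one state are given as a plain list indexed by action. The function names are made up for this example, and ties are broken towards the lowest index.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a | s) for every action under an epsilon-greedy policy."""
    m = len(q_values)
    greedy = max(range(m), key=lambda a: q_values[a])  # ties go to lowest index
    probs = [epsilon / m] * m        # every action gets epsilon / m
    probs[greedy] += 1.0 - epsilon   # the greedy action gets the remainder
    return probs

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Sample one action from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

probs = epsilon_greedy_probs([0.0, 1.0, 0.5, 0.2], epsilon=0.2)
greedy_pick = epsilon_greedy_action([0.0, 1.0], epsilon=0.0)
```

With $m = 4$ and $ϵ = 0.2$ , the greedy action gets probability $0.85$ and every other action gets $0.05$ .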

$ϵ$ -greedy policy is important because there is a theorem to assure us that we will indeed get a policy improvement on every step.

Theorem. For any $ϵ$ -greedy policy $π$ , the $ϵ$ -greedy policy $π^{'}$ with respect to $q_{π}$ is an improvement, i.e. $v_{π^{'}} (s) \geq v_{π} (s)$

Proof. $\begin{aligned} q_{π} (s, π^{'} (s)) &= \sum_{a \in A} π^{'} (a ∣ s) q_{π} (s, a) \\ &= ϵ / m \sum_{a \in A} q_{π} (s, a) + (1 - ϵ) \max_{a \in A} q_{π} (s, a) \\ &\geq ϵ / m \sum_{a \in A} q_{π} (s, a) + (1 - ϵ) \sum_{a \in A} \frac{π (a ∣ s) - ϵ / m}{1 - ϵ} q_{π} (s, a) \\ &= \sum_{a \in A} π (a ∣ s) q_{π} (s, a) = v_{π} (s) \end{aligned}$ Therefore from the policy improvement theorem, $v_{π^{'}} (s) \geq v_{π} (s)$ .

The key step in the proof is the transition from line 2 to line 3. The idea is that the maximum q-value (by choosing the greedy action) will be greater than or equal to any weighted average of $q_{π} (s, a)$ . Hence we choose a clever weighted average such that we can end up with $\sum_{a \in A} π (a ∣ s) q_{π} (s, a)$ in line 4.

Note that it is indeed a weighted average because of the following. The weights are non-negative, since an $ϵ$ -greedy policy $π$ gives every action a probability of at least $ϵ / m$ . They also sum to 1: $π (a ∣ s)$ must sum to 1 over all actions as it is a valid policy, and since there are $m$ unique actions, we multiply the constant $ϵ / m$ by $m$ . $\sum_{a \in A} \frac{π (a ∣ s) - ϵ / m}{1 - ϵ} = \frac{1}{1 - ϵ} \sum_{a \in A} [π (a ∣ s) - ϵ / m] = \frac{1}{1 - ϵ} [1 - m \cdot ϵ / m] = 1$

An idea that we encountered earlier. We do not need to fully evaluate the policy before we do a greedy improvement. In the context of Monte Carlo policy evaluation, in the extreme case, we can update the policy after every episode instead of gathering many episodes.

How can we guarantee that we find the optimal policy $π^{*}$ ? We need to ensure that our algorithm balances two things: (i) suitably explore all options and (ii) ensure that at the end, we converge on a greedy policy.

This leads us to GLIE, which is a property that we want our algo to have.

Definition Greedy in the Limit with Infinite Exploration (GLIE).

All state-action pairs are explored infinitely many times, i.e. $\lim_{k \to \infty} N_{k} (s, a) = \infty$

The policy converges on a greedy policy, i.e. $\lim_{k \to \infty} π_{k} (a ∣ s) = 1 (a = \arg\max_{a^{'} \in A} Q_{k} (s, a^{'}))$

One simple way to get GLIE is to use $ϵ$ -greedy with a decaying schedule for $ϵ_{k} = \frac{1}{k}$ .

GLIE Monte Carlo Control

This brings us to GLIE Monte Carlo control.

Algorithm GLIE Monte-Carlo Control.

Sample kth episode using policy $π$ : $\{S_{1}, A_{1}, R_{2}, ..., S_{T}\} \sim π$

For each state $S_{t}$ and action $A_{t}$ in the episode, update $N (S_{t}, A_{t}) \leftarrow N (S_{t}, A_{t}) + 1$ , then $Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + \frac{1}{N (S_{t}, A_{t})} (G_{t} - Q (S_{t}, A_{t}))$

Improve policy based on the new action-value function: $ϵ \leftarrow 1 / k$ , then $π \leftarrow ϵ \text{-greedy} (Q)$

MC vs TD Control

TD learning has several advantages over MC:
- Lower variance
- Online
- Can deal with incomplete sequences
Natural idea: use TD instead of MC in our control loop
- Apply TD to Q(S, A)
- Use $ϵ$ -greedy policy improvement
- Update every time step
- This is probably the most well known RL algorithm (Sarsa)

Sarsa policy evaluation update step: $Q (S, A) \leftarrow Q (S, A) + α (R + γ Q (S^{'}, A^{'}) - Q (S, A))$

Note that we are updating the Q value for one single state-action pair. We take action $A$ on state $S$ and observe reward $R$ and the resultant state $S^{'}$ . We then sample a next action $A^{'}$ from $S^{'}$ , and bootstrap by using $Q (S^{'}, A^{'})$ in the update target. So it corresponds to a one-step lookahead in TD.

So the on-policy control with Sarsa algo. For every time step:

Policy evaluation with Sarsa: $Q = q_{π}$
Policy improvement using $ϵ$ -greedy

Algorithm. Sarsa algorithm for on-policy control.

Initialize $Q (s, a), \forall s \in S, a \in A (s)$ arbitrarily

Repeat (for each episode):

Initialize $S$

Choose $A$ from $S$ using policy derived from $Q$ (e.g. $ϵ$ -greedy choice)

Repeat (for each step of episode):

Take action $A$ , observe $R, S^{'}$

Choose $A^{'}$ from $S^{'}$ using policy derived from $Q$ (e.g. $ϵ$ -greedy)

Update $Q (S, A) \leftarrow Q (S, A) + α [R + γ Q (S^{'}, A^{'}) - Q (S, A)]$

$S \leftarrow S^{'}$ , $A \leftarrow A^{'}$

Until $S$ is terminal

Note that this is a fundamentally on-policy algorithm, because the $A^{'}, S^{'}$ that we sample and use to bootstrap is also the next action and state we end up in.
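
Here is a sketch of the Sarsa loop on a made-up environment. Everything about the environment is an assumption for illustration: a corridor with states 0 to 3, where state 3 is terminal, the two actions move left or right, and every step costs a reward of -1.

```python
import random

N_STATES, GOAL = 4, 3   # hypothetical corridor: states 0..3, state 3 is terminal
ACTIONS = (0, 1)        # 0 = left, 1 = right

def step(state, action):
    """Deterministic dynamics with a reward of -1 per step."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return -1.0, nxt

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])

def sarsa(n_episodes=200, alpha=0.1, gamma=1.0, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(n_episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        while s != GOAL:
            r, s2 = step(s, a)
            a2 = epsilon_greedy(Q, s2, epsilon, rng)
            # Q at the terminal state is never updated, so it stays 0.
            Q[(s, a)] += alpha * (r + gamma * Q[(s2, a2)] - Q[(s, a)])
            s, a = s2, a2  # the sampled A' is the action actually taken next
    return Q

Q = sarsa()
```

The last line of the inner loop is what makes this on-policy: the action used in the bootstrap target is the same action the agent then executes. Moving right from state 2 always ends the episode with reward -1, so `Q[(2, 1)]` should settle at -1.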

Theorem. Sarsa converges to the optimal action value function, $Q (s, a) \to q_{*} (s, a)$ under the following conditions:

GLIE sequence of policies $π_{t} (a ∣ s)$

Robbins Monro sequence of step sizes $α_{t}$ : $\sum_{t = 1}^{\infty} α_{t} = \infty$ and $\sum_{t = 1}^{\infty} α_{t}^{2} < \infty$

$n$ -step Sarsa

As before, we saw that the $n$ -step algorithm gets the best of both worlds in between MC and TD. So we do the same here.

Consider the following $n$ -step returns for $n = 1, 2, \infty$ :

$n = 1$ , $q_{t}^{(1)} = R_{t + 1} + γ Q (S_{t + 1}, A_{t + 1})$
$n = 2$ , $q_{t}^{(2)} = R_{t + 1} + γ R_{t + 2} + γ^{2} Q (S_{t + 2}, A_{t + 2})$
$n = \infty$ , $q_{t}^{(\infty)} = R_{t + 1} + γ R_{t + 2} + ... + γ^{T - 1} R_{T}$

Define the $n$ -step Q-return: $q_{t}^{(n)} = R_{t + 1} + γ R_{t + 2} + ... + γ^{n - 1} R_{t + n} + γ^{n} Q (S_{t + n}, A_{t + n})$

Sarsa update $Q (s, a)$ towards the n-step Q-return: $Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α (q_{t}^{(n)} - Q (S_{t}, A_{t}))$

Forward View Sarsa( $λ$ )

As before, we saw that the n-step return itself is noisy and sensitive to hyperparameter choice of n and $α$ . So the better way is to average the value over all $n$ steps.

The $q^{λ}$ return combines all $n$ -step Q-returns $q_{t}^{(n)}$
Using weight $(1 - λ) λ^{n - 1}$ , we have: $q_{t}^{λ} = (1 - λ) \sum_{n = 1}^{\infty} λ^{n - 1} q_{t}^{(n)}$
And the forward view Sarsa( $λ$ ) is: $Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α (q_{t}^{λ} - Q (S_{t}, A_{t}))$

Backward View Sarsa( $λ$ )

Recall that we used eligibility traces to construct the backward view TD( $λ$ ). We do the same here, because the forward view algo is not an online algorithm: we need to wait until the end of the episode to do the update.

Just like TD( $λ$ ), we use eligibility traces in an online algorithm
But Sarsa( $λ$ ) has one eligibility trace for each state-action pair instead of just for every state: $E_{0} (s, a) = 0$ and $E_{t} (s, a) = γλ E_{t - 1} (s, a) + 1 (S_{t} = s, A_{t} = a)$
$Q (s, a)$ is updated for every state $s$ and action $a$
In proportion to TD-error $δ_{t}$ and eligibility trace $E_{t} (s, a)$ : $δ_{t} = R_{t + 1} + γ Q (S_{t + 1}, A_{t + 1}) - Q (S_{t}, A_{t})$ , then $Q (s, a) \leftarrow Q (s, a) + α δ_{t} E_{t} (s, a)$

Algorithm. Sarsa( $λ$ ) On Policy Algorithm.

Initialize $Q (s, a)$ arbitrarily, for all $s \in S, a \in A$

Repeat (for each episode):

$E (s, a) = 0$ , for all $s \in S, a \in A$

Initialize $S, A$

Repeat (for each step of episode):

Take action $A$ , observe $R, S^{'}$

Choose $A^{'}$ from $S^{'}$ , using policy derived from $Q$ (E.g. $ϵ$ -greedy)

$δ \leftarrow R + γ Q (S^{'}, A^{'}) - Q (S, A)$

$E (S, A) \leftarrow E (S, A) + 1$

For all $s \in S, a \in A$ :

$Q (s, a) \leftarrow Q (s, a) + α δ E (s, a)$

$E (s, a) \leftarrow γλ E (s, a)$

$S \leftarrow S^{'}, A \leftarrow A^{'}$

Until $S$ is terminal

Note that for a given step we have a single value of $δ$ which is our TD error, but we propagate that to all $s, a$ pairs based on the eligibility trace, as potentially every $s, a$ pair could have contributed to it.
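
The inner loop of the algorithm above can be sketched as a single function. The name `sarsa_lambda_step` and the dict layout keyed by `(state, action)` are assumptions for this example. The two-step episode at the bottom is made up to show credit flowing back to an earlier pair.

```python
def sarsa_lambda_step(Q, E, s, a, r, s2, a2, alpha, gamma, lam):
    """One step of backward-view Sarsa(lambda). Q and E are updated in place."""
    delta = r + gamma * Q[(s2, a2)] - Q[(s, a)]
    E[(s, a)] += 1.0                      # bump the trace of the visited pair
    for key in Q:
        Q[key] += alpha * delta * E[key]  # share the TD error via the traces
        E[key] *= gamma * lam             # then decay every trace
    return delta

# Hypothetical episode: x -> y -> end, with reward 1 only on the last step.
Q = {("x", 0): 0.0, ("y", 0): 0.0, ("end", 0): 0.0}
E = {key: 0.0 for key in Q}
sarsa_lambda_step(Q, E, "x", 0, 0.0, "y", 0, alpha=0.1, gamma=0.9, lam=0.5)
sarsa_lambda_step(Q, E, "y", 0, 1.0, "end", 0, alpha=0.1, gamma=0.9, lam=0.5)
```

After the second step, the pair `("x", 0)` also receives part of the reward, scaled by its decayed trace $γλ = 0.45$ .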

Off Policy Learning

So far we have been looking at on-policy learning. However it is often useful to do off policy learning, i.e. evaluate a target policy $π (a ∣ s)$ to compute $v_{π} (s)$ or $q_{π} (s, a)$ , while we follow the behaviour policy $μ (a ∣ s)$ : $\{S_{1}, A_{1}, R_{2}, ..., S_{T}\} \sim μ$ . Of course in this case, $μ \neq π$ .

Why is off policy learning useful?

We can learn from observing humans or other agents
We can re-use experience that was previously generated from old policies $π_{1}, π_{2}, ... π_{t - 1}$ , possibly in a batched manner
We can learn about the optimal policy while following the exploratory policy
We can learn about multiple policies while following one policy

First mechanism is importance sampling. The main idea is to estimate the expectation of a different distribution by re-weighting the distributions: $E_{X \sim P} [f (X)] = \sum P (X) f (X) = \sum Q (X) \frac{P ( X )}{Q ( X )} f (X) = E_{X \sim Q} [\frac{P ( X )}{Q ( X )} f (X)]$
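
The identity can be checked numerically on a small discrete example. The distributions `p` and `q`, the function `f` and the sample size are all made up for illustration.

```python
import random

xs = [0, 1, 2]
p = [0.7, 0.2, 0.1]  # target distribution P
q = [0.2, 0.3, 0.5]  # sampling distribution Q, non-zero wherever P is non-zero

def f(x):
    return x ** 2

# Exact expectation under P.
exact = sum(pi * f(x) for x, pi in zip(xs, p))

# Estimate the same quantity using only samples drawn from Q,
# re-weighting each sample by the ratio P(x) / Q(x).
rng = random.Random(0)
samples = rng.choices(xs, weights=q, k=20000)
ratio = {x: pi / qi for x, pi, qi in zip(xs, p, q)}
estimate = sum(ratio[x] * f(x) for x in samples) / len(samples)
```

The re-weighted average of samples from $Q$ should land close to the exact expectation under $P$ , which is $0.6$ here.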

We can apply importance sampling to Monte Carlo for Off policy monte carlo learning:

We use returns generated from behaviour policy $μ$ to evaluate $π$
Then we weight the return $G_{t}$ according to the ratio of probabilities between the two policies
We need to apply the correction at every time step along the whole episode, because the change in policy affects every time step $G_{t}^{π / μ} = \frac{π ( A _{t} ∣ S _{t} )}{μ ( A _{t} ∣ S _{t} )} \frac{π ( A _{t + 1} ∣ S _{t + 1} )}{μ ( A _{t + 1} ∣ S _{t + 1} )} ... \frac{π ( A _{T} ∣ S _{T} )}{μ ( A _{T} ∣ S _{T} )} G_{t}$
And then update the value towards the corrected return $V (S_{t}) \leftarrow V (S_{t}) + α (G_{t}^{π / μ} - V (S_{t}))$

While off policy MC learning is theoretically sound, there are some major problems which make it useless in practice:

Importance sampling dramatically increases variance, as we are adjusting over every time step, and the cumulative effect over the whole episode makes our estimate of $G_{t}^{π / μ}$ vary wildly
We also cannot use this adjustment if $μ$ is zero when $π$ is non-zero

So we have to use bootstrapping for importance sampling. This allows us to only adjust the probability for one time step. So we have importance sampling for off policy TD:

We use TD targets generated from $μ$ to evaluate $π$
For TD(0), We weight the TD target $R + γV (S^{'})$ by importance sampling
This means we only need a single importance sampling correction: $V (S_{t}) \leftarrow V (S_{t}) + α (\frac{π ( A _{t} ∣ S _{t} )}{μ ( A _{t} ∣ S _{t} )} (R_{t + 1} + γV (S_{t + 1})) - V (S_{t}))$
This has much lower variance than MC importance sampling, and could work if $μ$ and $π$ do not differ by too much over a single step

As we have seen, importance sampling leads to large variances. The best solution is known as Q-learning, which is specific to TD(0) or Sarsa(0).

Does not require any importance sampling
Allows off policy learning of action values $Q (s, a)$

Recall that $μ$ is the behaviour policy that our agent is actually following, and $π$ is a target policy that we want to learn from. The main idea is that in our Sarsa(0) update step, we update the Q-value towards the target policy $π$ , but allow our agent to continue following the behaviour policy $μ$ .

This allows the agent to explore the environment using $μ$ , but learn from the action-value function of $π$ . Specifically:

We choose each next action for the agent using behaviour policy $A_{t + 1} \sim μ (\cdot ∣ S_{t + 1})$
But use alternative successor action $A^{'} \sim π (\cdot ∣ S_{t + 1})$ in our Q-value update
So we update $Q (S_{t}, A_{t})$ using $A^{'}$ : $Q (S_{t}, A_{t}) \leftarrow Q (S_{t}, A_{t}) + α (R_{t + 1} + γ Q (S_{t + 1}, A^{'}) - Q (S_{t}, A_{t}))$

Note importantly that we are using $A^{'} \sim π$ in the Q value above, instead of $A_{t + 1}$ . This allows us to learn off policy.

Q-Learning (or SARSA-MAX)

A special case of Q-learning is the case where the target policy is greedy wrt $Q (s, a)$ . This is usually what people refer to as Q-learning.

We allow both behaviour and target policies to improve:

The target policy $π$ is greedy wrt $Q (s, a)$ , i.e. $π (S_{t + 1}) = \arg\max_{a^{'}} Q (S_{t + 1}, a^{'})$
The behaviour policy $μ$ is $ϵ$ -greedy wrt $Q (s, a)$ again
The learning target inside the Q-update then simplifies as follows: $\begin{aligned} & R_{t + 1} + γ Q (S_{t + 1}, A^{'}) \\ &= R_{t + 1} + γ Q (S_{t + 1}, \arg\max_{a^{'}} Q (S_{t + 1}, a^{'})) \\ &= R_{t + 1} + \max_{a^{'}} γ Q (S_{t + 1}, a^{'}) \end{aligned}$

Note that since we are following a greedy target policy, the action chosen will be the Q-maximizing one (line 2). Since we are choosing the Q-maximizing action, we get the maximum Q-value over all possible actions (line 3). This simplifies the equation quite a bit, and now it resembles the Bellman optimality equation.

This leads us to the well known Q-learning algorithm, which David calls Sarsa-max. The Q-update is: $Q (S, A) \leftarrow Q (S, A) + α (R + γ \max_{a^{'}} Q (S^{'}, a^{'}) - Q (S, A))$

There is a theorem that tells us that the Q-learning control algorithm converges to the optimal action-value function, i.e. $Q (s, a) \to q_{*} (s, a)$
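
The Q-learning update can be sketched as follows, assuming a dict-based table keyed by `(state, action)`. The function name and the toy numbers are made up. The only difference from the Sarsa update is that the bootstrap term takes a max over next actions instead of using the action that the behaviour policy actually samples.

```python
def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma, terminal=False):
    """Apply one Q-learning (Sarsa-max) update to the table Q in place."""
    # Bootstrap from the best next action, whatever the agent does next.
    best_next = 0.0 if terminal else max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = {("s", 0): 0.0, ("s", 1): 0.0, ("t", 0): 1.0, ("t", 1): 3.0}
q_learning_update(Q, "s", 0, r=1.0, s2="t", actions=(0, 1), alpha=0.5, gamma=0.9)
```

Here the target is $1 + 0.9 \times 3 = 3.7$ , so `Q[("s", 0)]` moves halfway there.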

To wrap up, here is a classification of some algorithms we have so far:

| | Full Backup (Dynamic Programming) | Sample Backup (Temporal Difference) |
|---|---|---|
| Bellman Expectation Equation for $v_{π} (s)$ | Iterative Policy Evaluation | TD Learning |
| Bellman Expectation Equation for $q_{π} (s, a)$ | Q-policy iteration | Sarsa |
| Bellman Optimality Equation for $q_{*} (s, a)$ | Q-value iteration | Q-learning |

Value Function Approximation

Lecture 6: Value Function Approximation

This lecture looks into approximating functions with neural networks to overcome the large state-action space problem.

RL often encounters large problems:

Backgammon: $10^{20}$ states
Go: $10^{170}$ states
Helicopter: continuous state space

We want to do policy evaluation and control efficiently in large state spaces. So far, we have represented $V$ or $Q$ with a lookup table:

Every state $s$ has an entry $V (s)$
Every state-action pair $s, a$ has an entry $Q (s, a)$

This is a problem for large MDPs:

Too many states or actions to store in memory
It is too slow or data inefficient to learn the value of each state individually

Solution for large MDPs:

Estimate the value function with function approximation using parameters $w$ :
- $\hat{v} (s, w) \approx v_{π} (s)$
- $\hat{q} (s, a, w) \approx q_{π} (s, a)$
Generalizes from seen states to unseen states
Update parameters of our function using MC or TD learning

Types of value function approximation (different architectures):

Feed a given state $s$ into a neural network with parameters $w$ . The network then spits out $\hat{v} (s, w)$ , which is our value estimate for being in state $s$
Have a neural network $\hat{q} (s, a, w)$ , which takes in a state-action pair and spits out the Q value
Sometimes, it is more efficient to have a neural network $\hat{q} (s, w)$ , such that we feed in a single state and we get Q-values for every possible action in a single forward pass, i.e. we get $\hat{q} (s, a_{1}, w), \hat{q} (s, a_{2}, w), ...$

Which function approximator? We focus on differentiable function approximators that we can easily optimize, e.g. linear combinations of features, neural networks. Furthermore, we want a training algorithm for a non-iid, non-stationary set of data, so it is not standard supervised learning.

Incremental Methods

Gradient Descent

Starting with gradient descent.

Let $J (w)$ be a differentiable function of parameter vector $w$
Define the gradient of $J (w)$ to be the vector $\nabla_{w} J (w)$ of partial derivatives, whose first entry is $\frac{\partial J (w)}{\partial w_{1}}$ and whose last entry is $\frac{\partial J (w)}{\partial w_{n}}$
To find a local minimum of $J (w)$ , we adjust the parameter $w$ in the -ve gradient direction: $\Delta w = - \frac{1}{2} α \nabla_{w} J (w)$

Goal: find parameter vector $w$ minimizing mean squared error between approximate value fn $\hat{v} (s, w)$ and true oracle value fn $v_{π} (s)$ (assuming we know the oracle): $J (w) = E_{π} [(v_{π} (S) - \hat{v} (S, w))^{2}]$

Gradient descent finds a local minimum: $\Delta w = - \frac{1}{2} α \nabla_{w} J (w) = α E_{π} [(v_{π} (S) - \hat{v} (S, w)) \nabla_{w} \hat{v} (S, w)]$

Stochastic gradient descent samples the gradient: $\Delta w = α (v_{π} (S) - \hat{v} (S, w)) \nabla_{w} \hat{v} (S, w)$

The nice thing about SGD is that it still converges under non-stationary environment. The expected update is equal to full gradient update.

Feature Vectors

To represent a state, we use a feature vector. $x (S) = \begin{pmatrix} x_{1} (S) \\ \vdots \\ x_{n} (S) \end{pmatrix}$

For example, the features (numeric) could be:

Distance of robot to landmarks
Trends in the stock market
Configuration of pawn on a chess board

Linear Value Function Approximation

Let us represent the value function using a linear combination of features (i.e. just a dot product between two vectors): $\hat{v} (S, w) = x (S)^{T} w = \sum_{j = 1}^{n} x_{j} (S) w_{j}$

The nice thing is that with a linear approximator, the objective $J (w)$ is quadratic in the parameters $w$ , so it is a convex optimization problem, i.e. SGD will converge on the global optimum: $J (w) = E_{π} [(v_{π} (S) - x (S)^{T} w)^{2}]$

The gradient update is really simple: $\nabla_{w} \hat{v} (S, w) = x (S)$ , so $\Delta w = α (v_{π} (S) - \hat{v} (S, w)) x (S)$

Note that we are just subbing the simple expression for $\nabla_{w} \hat{v} (S, w)$ into the general $\Delta w$ formula above. The update may be interpreted as step-size × prediction error × feature value. This means that features with high correlation with the prediction error will have large gradient updates intuitively.

We can think of table lookup as a special case of linear value function approximation. Suppose we use a table lookup feature (1-hot) as follows: $x^{table} (S) = \begin{pmatrix} 1 (S = s_{1}) \\ \vdots \\ 1 (S = s_{n}) \end{pmatrix}$

And suppose we have a parameter vector of size $n$ , such that we have one parameter for each state. Then we have: $\hat{v} (S, w) = \begin{pmatrix} 1 (S = s_{1}) \\ \vdots \\ 1 (S = s_{n}) \end{pmatrix} \cdot \begin{pmatrix} w_{1} \\ \vdots \\ w_{n} \end{pmatrix}$

And we can see that this reduces to a table lookup where the parameter $w_{j}$ represents the state value for each state $j$ .
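As a minimal sketch of the linear update with table lookup features (the helper names `one_hot`, `v_hat` and `sgd_update` are my own, and the target value is made up), we can check that an update only touches the weight of the visited state:

```python
def one_hot(state, n_states):
    """Table lookup feature vector: 1 at the index of `state`, 0 elsewhere."""
    return [1.0 if i == state else 0.0 for i in range(n_states)]


def v_hat(x, w):
    """Linear value estimate: dot product of features and weights."""
    return sum(xj * wj for xj, wj in zip(x, w))


def sgd_update(w, x, target, alpha):
    """One step of w <- w + alpha * (target - v_hat) * x."""
    error = target - v_hat(x, w)
    return [wj + alpha * error * xj for wj, xj in zip(w, x)]


w = [0.0, 0.0, 0.0]
x = one_hot(1, 3)
# Hypothetical target of 10.0 for state 1.
w = sgd_update(w, x, target=10.0, alpha=0.5)
print(w)  # only the weight for state 1 has moved
```

With 1-hot features, `v_hat` just reads off `w[state]`, which is the table lookup.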

Estimating the Oracle

So far, we have assumed that the true oracle value function $v_{π} (s)$ is available, but in RL there is no true label, only rewards. So in practice, we need to substitute a target for $v_{π} (s)$ :

For MC, the target is the return $G_{t}$ : $△ w = α (G_{t} - \hat{v} (S_{t}, w)) \nabla_{w} \hat{v} (S_{t}, w)$
For TD(0), the target is the TD target $R_{t + 1} + γ \hat{v} (S_{t + 1}, w)$ : $△ w = α (R_{t + 1} + γ \hat{v} (S_{t + 1}, w) - \hat{v} (S_{t}, w)) \nabla_{w} \hat{v} (S_{t}, w)$
For TD( $λ$ ), the target is the $λ$ -return $G_{t}^{λ}$ : $△ w = α (G_{t}^{λ} - \hat{v} (S_{t}, w)) \nabla_{w} \hat{v} (S_{t}, w)$

Monte Carlo with Value Function Approximation

We can think of our algorithm as supervised learning.

Treat the return $G_{t}$ as an unbiased noisy sample of the true value $v_{π} (S_{t})$
We therefore are applying supervised learning to "training data": $⟨ S_{1}, G_{1} ⟩, ⟨ S_{2}, G_{2} ⟩, ..., ⟨ S_{T}, G_{T} ⟩$
For example, using linear MC policy evaluation: $△ w = α (G_{t} - \hat{v} (S_{t}, w)) \nabla_{w} \hat{v} (S_{t}, w) = α (G_{t} - \hat{v} (S_{t}, w)) x (S_{t})$
MC evaluation converges to a local optimum even when using non-linear value function approximation

TD with Value Function Approximation

The same applies to TD learning, but we have some biased estimate:

The TD target $R_{t + 1} + γ \hat{v} (S_{t + 1}, w)$ is a biased sample of the true value $v_{π} (S_{t})$ - it's biased because our own value function is a biased estimate
We can still apply supervised learning to the "training data": $⟨ S_{1}, R_{2} + γ \hat{v} (S_{2}, w) ⟩, ⟨ S_{2}, R_{3} + γ \hat{v} (S_{3}, w) ⟩, ..., ⟨ S_{T - 1}, R_{T} ⟩$
For example using linear TD(0), where $δ$ is the TD error: $△ w = α (R + γ \hat{v} (S^{'}, w) - \hat{v} (S, w)) \nabla_{w} \hat{v} (S, w) = α δ x (S)$
There is a theorem showing that for linear TD(0), despite the bias, we will always converge (close) to the global optimum

Note: There is a little inconsistency in the above formula, once we start introducing bootstrapped approximations of the return. Recall that when we used the oracle $v_{π}$ to represent the target and took the derivative, only $\nabla_{w} \hat{v} (S, w)$ enters the derivative as we treat the oracle value as a constant.

However, once we introduce $\hat{v}$ itself to substitute the oracle function, we should technically include that term in the derivative as well. As it turns out, this is not a good idea and will not lead to convergence. There is some theoretical analysis for this to justify it.
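A minimal sketch of linear TD(0) as described above, where the bootstrapped target is treated as a constant when taking the gradient. The two-state chain, the function names and the step size are all assumptions for illustration:

```python
def dot(x, w):
    return sum(xj * wj for xj, wj in zip(x, w))


def one_hot(state, n_states):
    return [1.0 if i == state else 0.0 for i in range(n_states)]


def linear_td0(episodes, features, n_features, alpha=0.1, gamma=1.0):
    """Linear TD(0): only grad v_hat(S, w) = x(S) enters the update."""
    w = [0.0] * n_features
    for episode in episodes:
        for s, r, s_next in episode:
            x = features(s)
            # Value of the terminal state (s_next is None) is 0 by definition.
            v_next = 0.0 if s_next is None else dot(features(s_next), w)
            delta = r + gamma * v_next - dot(x, w)  # TD error
            w = [wj + alpha * delta * xj for wj, xj in zip(w, x)]
    return w


# Hypothetical chain: state 0 -> state 1 (reward 0) -> terminal (reward 1).
episode = [(0, 0.0, 1), (1, 1.0, None)]
w = linear_td0([episode] * 300, lambda s: one_hot(s, 2), n_features=2)
print(w)  # both weights approach the true value of 1.0
```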

TD( $λ$ ) with Value Function Approximation

And again, we can do the same with TD- $λ$ , since the $λ$ -return $G_{t}^{λ}$ is also a biased sample of the true value $v_{π} (s)$ :

The training data is now: $< S_{1}, G_{1}^{λ} >, < S_{2}, G_{2}^{λ} >, ..., < S_{T - 1}, G_{T - 1}^{λ} >$
The forward view linear TD( $λ$ ) is: $△ w = α (G_{t}^{λ} - \hat{v} (S_{t}, w)) \nabla_{w} \hat{v} (S_{t}, w) = α (G_{t}^{λ} - \hat{v} (S_{t}, w)) x (S_{t})$
The backward view linear TD( $λ$ ) is: $\begin{aligned} δ_{t} &= R_{t + 1} + γ \hat{v} (S_{t + 1}, w) - \hat{v} (S_{t}, w) \\ E_{t} &= γλ E_{t - 1} + x (S_{t}) \\ △ w &= α δ_{t} E_{t} \end{aligned}$
There is a theorem to show that the forward view and backward view linear TD( $λ$ ) are equivalent.

For the backward view, notice that the eligibility trace is now updated using the gradient wrt the parameter vector, namely $\nabla_{w} \hat{v} (S_{t}, w)$ , which is of the same dimensionality as $w$ . More precisely, the eligibility trace is the decaying accumulation of past gradients. In the linear case, this is an accumulation of the feature vector $x (S_{t})$ .

It is a bit unintuitive to understand why we use the accumulated gradient as the eligibility trace, but I suppose it is proved in the equivalence theorem between the forward and backward view. Perhaps we can just think of it as "the features which we see the most often will have high eligibility trace".
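To make the backward view concrete, here is a small sketch of linear TD( $λ$ ) with an accumulating eligibility trace. The chain environment, function names and hyperparameters are assumptions of mine, not from the lecture:

```python
def dot(x, w):
    return sum(xj * wj for xj, wj in zip(x, w))


def one_hot(state, n_states):
    return [1.0 if i == state else 0.0 for i in range(n_states)]


def linear_td_lambda(episodes, features, n_features,
                     alpha=0.1, gamma=1.0, lam=0.5):
    """Backward view linear TD(lambda) with an accumulating trace."""
    w = [0.0] * n_features
    for episode in episodes:
        trace = [0.0] * n_features  # eligibility trace E, reset every episode
        for s, r, s_next in episode:
            x = features(s)
            v_next = 0.0 if s_next is None else dot(features(s_next), w)
            delta = r + gamma * v_next - dot(x, w)
            # E_t = gamma * lambda * E_{t-1} + x(S_t)
            trace = [gamma * lam * e + xj for e, xj in zip(trace, x)]
            # delta_w = alpha * delta_t * E_t
            w = [wj + alpha * delta * e for wj, e in zip(w, trace)]
    return w


# Hypothetical chain: state 0 -> state 1 (reward 0) -> terminal (reward 1).
episode = [(0, 0.0, 1), (1, 1.0, None)]
w = linear_td_lambda([episode] * 300, lambda s: one_hot(s, 2), n_features=2)
print(w)  # both weights approach 1.0
```

On the second step the trace is `[0.5, 1.0]`, so the reward at the end also nudges the weight of state 0, which was visited one step earlier.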

Control with Value Function Approximation

Start with some random parameter vector $w$
Set policy based on some greedy function $π = ϵ\text{-greedy} (q_{w})$
Do policy evaluation $\hat{q} (\cdot, \cdot, w) \approx q_{π}$

First we need to do everything again wrt the action-value function instead of the value function to perform this algorithm. The steps are:

Approximate the action-value function $\hat{q} (S, A, w) \approx q_{π} (S, A)$
Minimize the mean squared error between approximate action value function and true oracle action value $q_{π} (S, A)$ : $J (w) = E_{π} [(q_{π} (S, A) - \hat{q} (S, A, w))^{2}]$
Use SGD to find a local minimum: $\begin{aligned} - \frac{1}{2} \nabla_{w} J (w) &= (q_{π} (S, A) - \hat{q} (S, A, w)) \nabla_{w} \hat{q} (S, A, w) \\ △ w &= α (q_{π} (S, A) - \hat{q} (S, A, w)) \nabla_{w} \hat{q} (S, A, w) \end{aligned}$
Again, we represent the state and action by a feature vector: $x (S, A) = \begin{pmatrix} x_{1} (S, A) \\ \vdots \\ x_{n} (S, A) \end{pmatrix}$
Represent action value function by a linear combination of features: $\hat{q} (S, A, w) = x (S, A)^{T} w = \sum_{j = 1}^{n} x_{j} (S, A) w_{j}$
Do an SGD update: $\begin{aligned} \nabla_{w} \hat{q} (S, A, w) &= x (S, A) \\ △ w &= α (q_{π} (S, A) - \hat{q} (S, A, w)) x (S, A) \end{aligned}$

Incremental Control Algorithms

Like prediction, we need to substitute a target for the unknown oracle $q_{π} (S, A)$ . We sub out all the $q_{π}$ for an approximate target:

For MC, the target is the return $G_{t}$ : $△ w = α (G_{t} - \hat{q} (S_{t}, A_{t}, w)) \nabla_{w} \hat{q} (S_{t}, A_{t}, w)$
For TD(0), the target is the TD target $R_{t + 1} + γ \hat{q} (S_{t + 1}, A_{t + 1}, w)$ : $△ w = α (R_{t + 1} + γ \hat{q} (S_{t + 1}, A_{t + 1}, w) - \hat{q} (S_{t}, A_{t}, w)) \nabla_{w} \hat{q} (S_{t}, A_{t}, w)$
For forward view TD( $λ$ ), the target is the action value $λ$ -return: $△ w = α (q_{t}^{λ} - \hat{q} (S_{t}, A_{t}, w)) \nabla_{w} \hat{q} (S_{t}, A_{t}, w)$
For backward view TD( $λ$ ), the equivalent update is: $\begin{aligned} δ_{t} &= R_{t + 1} + γ \hat{q} (S_{t + 1}, A_{t + 1}, w) - \hat{q} (S_{t}, A_{t}, w) \\ E_{t} &= γλ E_{t - 1} + \nabla_{w} \hat{q} (S_{t}, A_{t}, w) \\ △ w &= α δ_{t} E_{t} \end{aligned}$

Should we bootstrap? Empirically across many examples, we almost always have the case that:

MC takes too many steps because variance is too high
TD(0) always has a large efficiency gain compared to MC
There's always some $λ$ value in between which is better than TD(0)

Batch Methods

Motivation:

Gradient descent is simple and appealing
But it is not sample efficient (we throw a sample away as soon as we use it once)
Batch methods seek to find the best fitting value function, given the agent's experience ("training data")

Least Squares Prediction

The problem becomes the following:

Given our value function approximation $\hat{v} (s, w) \approx v_{π} (s)$
And experience $D$ consisting of <state, value> pairs

$D = \{ ⟨ s_{1}, v_{1}^{π} ⟩, ⟨ s_{2}, v_{2}^{π} ⟩, ..., ⟨ s_{T}, v_{T}^{π} ⟩ \}$

Find the parameters $w$ that give the best fitting function $\hat{v} (s, w)$

Least squares algorithms simply try to find $w$ that minimizes the sum of squares error between $\hat{v} (s_{t}, w)$ and target values $v_{t}^{π}$ : $LS (w) = \sum_{t = 1}^{T} (v_{t}^{π} - \hat{v} (s_{t}, w))^{2} = E_{D} [(v^{π} - \hat{v} (s, w))^{2}]$

SGD with Experience Replay

It turns out there is a really easy way to find the least squares solution, using experience replay. The idea is to just keep using the data over and over again, instead of throwing away every sample after each update.

Given experience consisting of: $D = \{ ⟨ s_{1}, v_{1}^{π} ⟩, ⟨ s_{2}, v_{2}^{π} ⟩, ..., ⟨ s_{T}, v_{T}^{π} ⟩ \}$

Repeat:

Sample state, value from experience: $⟨ s, v^{π} ⟩ \sim D$
Apply SGD update: $△ w = α (v^{π} - \hat{v} (s, w)) \nabla_{w} \hat{v} (s, w)$

It can be shown that this converges to the least squares solution: $w^{π} = \arg\min_{w} LS (w)$
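A small sketch of the replay loop, assuming a single scalar feature $x (s) = s$ so that the least squares solution can be computed by hand for comparison. The data points, step size and function names are made up for illustration:

```python
import random


def replay_sgd(experience, n_steps=5000, alpha=0.01, seed=0):
    """Repeatedly sample <feature, value> pairs from D and apply the SGD update."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(n_steps):
        x, v = rng.choice(experience)
        w += alpha * (v - x * w) * x  # v_hat(s, w) = x * w, gradient = x
    return w


def least_squares(experience):
    """Closed-form minimizer of sum_t (v_t - x_t * w)^2 for a scalar w."""
    return sum(x * v for x, v in experience) / sum(x * x for x, _ in experience)


# Hypothetical experience: noisy values that are roughly 2 * x.
experience = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
w_sgd = replay_sgd(experience)
w_ls = least_squares(experience)
print(w_sgd, w_ls)
```

With a constant step size the SGD iterate hovers near the least squares solution rather than landing on it exactly; a decaying step size would be needed for exact convergence.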

Experience Replay in Deep Q-Networks (DQN)

DQN (for atari games) uses experience replay and fixed Q-targets:

Take action $a_{t}$ according to $ϵ$ -greedy policy
Store transition $(s_{t}, a_{t}, r_{t + 1}, s_{t + 1})$ in replay memory $D$
Sample random mini batch of transitions $(s, a, r, s^{'})$
- Small batch size of 64 is sufficient
Maintain two neural networks that estimate Q-values:
- The old reference neural network is frozen periodically and used as the target
- Call its parameters $w^{-}$
- The actual neural network we are training has parameters $w$
Compute Q-learning targets wrt old, fixed parameters $w^{-}$
Optimize MSE between the Q-network under training and the Q-learning targets computed from the reference network: $L_{i} (w_{i}) = E_{s, a, r, s^{'} \sim D_{i}} [(r + γ \max_{a^{'}} Q (s^{'}, a^{'}; w_{i}^{-}) - Q (s, a; w_{i}))^{2}]$
This is essentially Q-learning with a one-step look ahead, but using the reference network instead of the current active network under training
Success of this method depends on its stability in training:
1. Experience replay helps to stabilize training as it randomly samples from past experience instead of getting batches of highly correlated data
2. Fixed Q-targets - fixing the reference neural network helps to stabilize the targets and thus training
The neural network is just a large convolutional neural network
- Input state $s$ is a stack of raw pixels from last 4 frames
- Output is $Q (s, a)$ for 18 joystick / button positions
- Reward is the change in score for that step
- Applied to a large number of Atari games

Linear Least Squares Prediction

Experience replay finds the least squares solution, but it takes many iterations. If we use a linear value function approximation, we can solve the least squares solution directly.

At the minimum of $LS (w)$ , the expected update must be zero: $E_{D} [△ w] = 0$

Writing out the expectation for the linear case: $\sum_{t = 1}^{T} x (s_{t}) (v_{t}^{π} - x (s_{t})^{T} w) = 0$

Solving for $w$ : $\begin{aligned} \sum_{t = 1}^{T} x (s_{t}) v_{t}^{π} &= \sum_{t = 1}^{T} x (s_{t}) x (s_{t})^{T} w \\ w &= \left( \sum_{t = 1}^{T} x (s_{t}) x (s_{t})^{T} \right)^{- 1} \sum_{t = 1}^{T} x (s_{t}) v_{t}^{π} \end{aligned}$

Note that the matrix being inverted has $∣ w ∣^{2}$ entries, where $∣ w ∣ = N$ is the size of the feature / parameter vector. Hence if the number of parameters is small, the $O (N^{3})$ cost of the inversion is acceptable
Using Sherman-Morrison to update the inverse incrementally as each sample arrives, the cost per update is reduced to $O (N^{2})$
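A rough sketch of the incremental solution. It assumes we start the inverse at $(ϵ I)^{- 1}$ for a small $ϵ$ (a regularization choice of mine, so the answer is only approximately the exact least squares solution), and the function names and data are hypothetical:

```python
def sherman_morrison_update(A_inv, x):
    """Return the inverse of (A + x x^T) given A_inv, in O(n^2).

    Assumes A_inv is symmetric, which holds for eps * I plus outer products.
    """
    n = len(x)
    Ax = [sum(A_inv[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = 1.0 + sum(x[i] * Ax[i] for i in range(n))
    return [[A_inv[i][j] - Ax[i] * Ax[j] / denom for j in range(n)]
            for i in range(n)]


def linear_least_squares(experience, n_features, eps=1e-6):
    """Solve w = (sum x x^T)^-1 (sum x v) one sample at a time."""
    A_inv = [[(1.0 / eps if i == j else 0.0) for j in range(n_features)]
             for i in range(n_features)]
    b = [0.0] * n_features
    for x, v in experience:
        A_inv = sherman_morrison_update(A_inv, x)
        b = [bi + xi * v for bi, xi in zip(b, x)]
    return [sum(A_inv[i][j] * b[j] for j in range(n_features))
            for i in range(n_features)]


# Hypothetical data generated exactly from w = [2, -1].
experience = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
w = linear_least_squares(experience, n_features=2)
print(w)  # close to [2.0, -1.0]
```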

Linear least squares prediction algorithms actually have better convergence properties.

Least Squares Policy Iteration

Policy evaluation is done using least squares Q-learning (linear or otherwise)
Policy improvement is done using greedy policy improvement as per normal

Policy Gradient

Lecture 7: Policy Gradient

Look at methods that update the policy directly, instead of working with value / action-value functions.

In the last lecture we approximated the value or action-value function using parameters $θ$ : $\begin{aligned} V_{θ} (s) &\approx V^{π} (s) \\ Q_{θ} (s, a) &\approx Q^{π} (s, a) \end{aligned}$
We generated a policy directly from the value function, e.g. using $ϵ$ -greedy algorithm
In this lecture, we will directly parametrise the policy: $π_{θ} (s, a) = P [a ∣ s, θ]$
Again, we will focus on model free reinforcement learning

Value-based vs Policy-based RL

Value Based
- Learn the value function
- Policy is implicit from the value function (e.g. $ϵ$ -greedy)
Policy Based
- No value function
- Learn policy directly
Actor-Critic
- Learn a value function
- Also learn a policy

Advantages of Policy-based RL

Better convergence properties than value based
- Value based methods sometimes swing or chatter around the optimum and do not converge
Effective in high dimensional or continuous action spaces
- We do not need to compute max action over Q-values
- E.g. if the action space is continuous, the maximization is not at all trivial and may be prohibitive
- This may be the main impediment to value-based RL
Can learn stochastic policies

Disadvantages of Policy-based RL

Typically converges to a local rather than global optimum
Evaluating a policy is typically inefficient and high variance

Why might we want to have a stochastic policy?

e.g. Rock paper scissors
Having a deterministic policy is easily exploited
A uniform random policy is optimal

Stochastic policy is also necessary in a case of state-aliasing, in which our state representation cannot exhaustively differentiate states from each other. In this case, we do not have a Markov Decision Process, and there may not exist a deterministic policy that is optimal. That is why we need a stochastic policy.

Policy Objective Functions

Our goal in policy gradient is to find the best $θ$ for a given policy $π_{θ} (s, a)$ with parameters $θ$ . But how do we measure the quality of a given policy?

There are 3 ways of measuring:

In episodic environments we can use the start value: $J_{1} (θ) = V^{π_{θ}} (s_{1}) = E_{π_{θ}} [v_{1}]$
In continuing environments we can use the average value: $J_{\text{average value}} (θ) = \sum_{s} d^{π_{θ}} (s) V^{π_{θ}} (s)$
Or the average reward per time step: $J_{\text{average reward}} (θ) = \sum_{s} d^{π_{θ}} (s) \sum_{a} π_{θ} (s, a) R_{s}^{a}$

In the above, $d^{π_{θ}} (s)$ is the stationary distribution of the markov chain for $π_{θ}$ . It tells us the amount of time we spend in each state, and so it provides the weighting required to get the average value or reward.

Policy Optimization

Policy based reinforcement learning is an optimisation problem:

Gradient free methods:
- Hill climbing
- Simplex / amoeba / Nelder Mead
- Genetic algorithms
Gradient methods are almost always more efficient:
- Gradient descent
- Conjugate gradient
- Quasi Newton

Finite Difference Policy Gradient

Let $J (θ)$ be any policy objective function
Policy gradient algorithms search for a local maximum in $J (θ)$ by ascending the gradient of the policy wrt parameters $θ$ $△ θ = α \nabla_{θ} J (θ)$
Where $\nabla_{θ} J (θ)$ is the policy gradient (a vector of partial derivatives along each dimension): $\nabla_{θ} J (θ) = \begin{pmatrix} \frac{\partial J ( θ )}{\partial θ _{1}} \\ \vdots \\ \frac{\partial J ( θ )}{\partial θ _{n}} \end{pmatrix}$

The simplest way to compute the policy gradient is to use finite differences:

For each dimension $k \in [1, n]$ :
- Estimate the kth partial derivative of objective function wrt $θ$
- By perturbing $θ$ by a small amount $ϵ$ in the kth dimension: $\frac{\partial J ( θ )}{\partial θ _{k}} \approx \frac{J ( θ + ϵ u _{k} ) - J ( θ )}{ϵ}$
- Where $u_{k}$ is a unit vector with 1 in the kth component and 0 elsewhere
This is not the most efficient algorithm, as it requires $n$ evaluations (once for each dimension) to compute a single gradient step
It is simple, noisy and inefficient, but sometimes works
Works for arbitrary policies, even if the policy is not differentiable
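A quick sketch of the finite difference estimate. Here `toy_objective` is a made-up stand-in for $J (θ)$ ; in practice each call would be a noisy policy rollout rather than a closed-form function:

```python
def finite_difference_gradient(J, theta, eps=1e-6):
    """Estimate dJ/dtheta_k by perturbing one dimension at a time."""
    base = J(theta)  # one baseline evaluation, reused for every dimension
    grad = []
    for k in range(len(theta)):
        perturbed = list(theta)
        perturbed[k] += eps  # theta + eps * u_k
        grad.append((J(perturbed) - base) / eps)
    return grad


def toy_objective(theta):
    """Hypothetical objective with its maximum at theta = [1, -2]."""
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 2.0) ** 2


grad = finite_difference_gradient(toy_objective, [0.0, 0.0])
print(grad)  # close to the analytic gradient [2, -4]
```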

Likelihood Ratios

We now want to compute the policy gradient analytically, assuming:

The policy $π_{θ}$ is differentiable whenever it is non-zero; and
We know the gradient $\nabla_{θ} π_{θ} (s, a)$
Likelihood ratio methods exploit the following identity (call it the log trick): $\nabla_{θ} π_{θ} (s, a) = π_{θ} (s, a) \frac{\nabla _{θ} π _{θ} ( s , a )}{π _{θ} ( s , a )} = π_{θ} (s, a) \nabla_{θ} \log π_{θ} (s, a)$
Note that we use the simple identity $\partial_{θ} \log f (θ) = \frac{\partial _{θ} f ( θ )}{f ( θ )}$
The new formulation is nicer to work with because we have $π_{θ} (s, a)$ on the left, which when integrated over, basically gives us the expectation over our policy $π_{θ}$
- This allows us to basically sample trajectories from the data and compute the gradient at each step
The score function is the quantity $\nabla_{θ} \log π_{θ} (s, a)$

Linear Softmax Policy

Use the softmax policy as a simple running example:

Weight actions using linear combination of features $ϕ (s, a)^{⊺} θ$
The probability of action is then proportional to the exponentiated weight: $π_{θ} (s, a) \propto e^{ϕ (s, a)^{⊺} θ}$
The score function is then: $\nabla_{θ} \log π_{θ} (s, a) = ϕ (s, a) - E_{π_{θ}} [ϕ (s, \cdot)]$
Note that we are omitting the derivation of the second term of the score function which is a bit more involved, as it involves differentiating the normalization factor (not shown above)
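A small sketch of the softmax policy and its score function for one state. The feature vectors are hypothetical, and `phi_s` is my own name for the list of $ϕ (s, a)$ over all actions $a$ :

```python
import math


def softmax_policy(theta, phi_s):
    """Action probabilities proportional to exp(phi(s, a)^T theta)."""
    prefs = [sum(p * t for p, t in zip(phi, theta)) for phi in phi_s]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def score(theta, phi_s, a):
    """grad log pi(s, a) = phi(s, a) - E_pi[phi(s, .)]."""
    probs = softmax_policy(theta, phi_s)
    expected = [sum(p * phi[k] for p, phi in zip(probs, phi_s))
                for k in range(len(theta))]
    return [phi_s[a][k] - expected[k] for k in range(len(theta))]


# Hypothetical features: two actions, each with its own indicator feature.
phi_s = [[1.0, 0.0], [0.0, 1.0]]
print(softmax_policy([0.0, 0.0], phi_s))  # uniform over the two actions
print(score([0.0, 0.0], phi_s, a=0))
```

The score is large when the chosen action's features differ a lot from the average features under the current policy.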

Gaussian Policy

In continuous action spaces, a Gaussian policy is natural

Let the mean of the gaussian be a linear combination of state features $μ (s) = ϕ (s)^{⊺} θ$
The variance may be fixed $σ^{2}$ or parametrized
The policy is gaussian (recall that we are in a continuous action space, so $a$ is a vector of floats): $a \sim N (μ (s), σ^{2})$
The score function is then: $\nabla_{θ} \log π_{θ} (s, a) = \frac{( a - μ ( s )) ϕ ( s )}{σ ^{2}}$
We can probably derive this score function by writing down the PDF of the gaussian distribution for $π_{θ} (s, a)$ and then taking the derivative
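
The derivation does go through as expected. A short sketch, assuming a scalar action and a fixed $σ$ , with $c$ collecting the terms of the log density that do not depend on $θ$ :

$\begin{aligned} \log π_{θ} (s, a) &= - \frac{(a - μ (s))^{2}}{2 σ^{2}} + c \\ \nabla_{θ} \log π_{θ} (s, a) &= \frac{a - μ (s)}{σ^{2}} \nabla_{θ} μ (s) = \frac{(a - μ (s)) ϕ (s)}{σ^{2}} \end{aligned}$

The last step uses $\nabla_{θ} μ (s) = ϕ (s)$ , since $μ (s) = ϕ (s)^{⊺} θ$ is linear in $θ$ .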

Policy Gradient Theorem: One-Step MDPs

Consider a simple class of one-step MDPs to simplify the math

Start in a state $s \sim d (s)$
Terminate after one step with reward $r = R_{s, a}$
This is a sort of contextual bandit

Use likelihood ratios to compute the policy gradient

First we pick our objective function, which is just the expected reward (averaged over our start state and action that we choose): $J (θ) = E_{π_{θ}} [r] = \sum_{s \in S} d (s) \sum_{a \in A} π_{θ} (s, a) R_{s, a}$
Then we take the derivative to do gradient ascent: $\begin{aligned} \nabla_{θ} J (θ) &= \sum_{s \in S} d (s) \sum_{a \in A} \nabla_{θ} π_{θ} (s, a) R_{s, a} \\ &= \sum_{s \in S} d (s) \sum_{a \in A} π_{θ} (s, a) \nabla_{θ} \log π_{θ} (s, a) R_{s, a} \\ &= E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) r] \end{aligned}$
Note that when taking the gradient of $π_{θ}$ , we use the log trick to rewrite it in line 2, and it becomes a new expectation again because we recover $π_{θ} (s, a)$ outside of the gradient. This shows the power of the log trick.

Policy Gradient Theorem

But we don't just want to do one-step MDPs, we want to generalize to multi-step MDPs

It turns out that we just need to replace the instantaneous reward $r$ with long term value $Q^{π} (s, a)$ (I suppose this means we need to model $Q$ as well)
Regardless of whether we use the (i) start state objective, (ii) average value objective or (iii) average reward objective, the policy gradient theorem holds

Theorem. Policy Gradient Theorem.

For any differentiable policy $π_{θ} (s, a)$ , and for any of the policy objective functions $J_{1}$ , $J_{\text{average value}}$ or $J_{\text{average reward}}$ , the policy gradient is: $\nabla_{θ} J (θ) = E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) Q^{π_{θ}} (s, a)]$

Monte Carlo Policy Gradient (REINFORCE)

The policy gradient theorem basically gives rise to a simple monte carlo policy gradient algorithm to find the optimal policy:

REINFORCE Algorithm.

Initialize $θ$ randomly

For each episode $\{ s_{1}, a_{1}, r_{2}, ..., s_{T - 1}, a_{T - 1}, r_{T} \} \sim π_{θ}$ do:

For $t = 1$ to $T - 1$ do:

$θ \leftarrow θ + α \nabla_{θ} \log π_{θ} (s_{t}, a_{t}) G_{t}$

Return $θ$

Note that:

We are doing monte carlo, i.e. we wait until the end of the episode before we go back to update the parameters for each time step.
We are doing SGD, so there is no expectation term
We use the return $G_{t}$ as an unbiased sample of $Q^{π_{θ}} (s_{t}, a_{t})$ . Recall that $G_{t}$ is the total discounted reward from time step $t$ until termination.
This is the simplest and oldest policy gradient algorithm.

Empirically, policy gradient methods have a nice learning curve without the jittery behaviour of value based methods. But, monte carlo methods take very very long (millions of steps) to converge due to high variance.
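A minimal sketch of REINFORCE with a softmax policy. To keep it short I assume a toy single-state task with one-step episodes (`sample_episode` is hypothetical), so $G_{t}$ is just the immediate reward here, although the return computation is written for general episodes:

```python
import math
import random


def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def sample_episode(probs, rng):
    """Hypothetical one-step task: action 1 pays reward 1, action 0 pays 0."""
    a = 0 if rng.random() < probs[0] else 1
    return [(a, float(a))]


def reinforce(n_episodes=1000, alpha=0.5, gamma=1.0, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # one preference per action
    for _ in range(n_episodes):
        episode = sample_episode(softmax(theta), rng)
        # Monte Carlo: compute the returns G_t once the episode has ended.
        g, returns = 0.0, []
        for _, r in reversed(episode):
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        for (a, _), g_t in zip(episode, returns):
            probs = softmax(theta)
            # Softmax score with indicator features: 1(k == a) - pi(k).
            for k in range(len(theta)):
                indicator = 1.0 if k == a else 0.0
                theta[k] += alpha * (indicator - probs[k]) * g_t
    return theta


theta = reinforce()
print(softmax(theta))  # most of the probability ends up on action 1
```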

Actor Critic Policy Gradient

The main problem with monte carlo policy gradient is the high variance of the return $G_{t}$ . Sometimes we get no reward, sometimes we get high reward.

The idea is thus to use a critic to estimate the action-value function $Q$ : $Q_{w} (s, a) \approx Q^{π_{θ}} (s, a)$

The name critic refers to the value function, which simply "watches" and evaluates the value of an action, whilst the actor is the policy itself which decides how we should act.

We maintain two sets of parameters:

Critic updates the action value function parameters $w$
Actor updates the policy parameters $θ$ , in the direction suggested by the critic

Actor-critic algorithms follow an approximate policy gradient: $\begin{aligned} \nabla_{θ} J (θ) &\approx E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) Q_{w} (s, a)] \\ △ θ &= α \nabla_{θ} \log π_{θ} (s, a) Q_{w} (s, a) \end{aligned}$

Notice that we just replace $G_{t}$ with $Q_{w} (s, a)$ , which is the value function of the critic model.

Estimating the Action-Value Function of the Critic

The critic is solving a familiar problem: policy evaluation. What is the value for policy $π_{θ}$ for current parameters $θ$ ? We have explored this previously, i.e.:

Monte Carlo policy evaluation
TD- $λ$
or least squares policy evaluation

This leads us to a simple actor-critic algorithm:

Critic: use linear value function approximation $Q_{w} (s, a) = ϕ (s, a)^{⊺} w$
- Update $w$ using linear TD(0)
Actor: update $θ$ using policy gradient

Q Actor Critic (QAC) Algorithm.

Initialize $s, θ$

Sample $a \sim π_{θ}$

for each step:

Sample reward $r = R_{s}^{a}$ ; sample transitions $s^{'} \sim P_{s}^{a}$

Sample action $a^{'} \sim π_{θ} (s^{'}, a^{'})$

$δ = r + γ Q_{w} (s^{'}, a^{'}) - Q_{w} (s, a)$

$θ \leftarrow θ + α \nabla_{θ} \log π_{θ} (s, a) Q_{w} (s, a)$

$w \leftarrow w + β δ ϕ (s, a)$

$a \leftarrow a^{'}, s \leftarrow s^{'}$

Note that the $δ$ line is simply the TD(0) update, and the $θ$ line is simply the policy gradient step.

Bias in Actor Critic Algorithms

The problem is that approximating the policy gradient introduces bias

A biased policy gradient may not find the right solution
Luckily, if we choose the value function approximation carefully, we can avoid introducing any bias
i.e. we can still follow the exact policy gradient (see Compatible Function Approximation theorem)

Reducing Variance using a Baseline

Here we move on to tricks to improve the training algorithm. The baseline method is the best known trick.

The main idea is to subtract a baseline function $B (s)$ from the policy gradient, such that we can reduce variance without changing expectation.

The reason is that since $B (s)$ does not depend on the action $a$ , its expectation when plugged in will be $0$ , like so: $\begin{aligned} E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) B (s)] &= \sum_{s \in S} d^{π_{θ}} (s) \sum_{a \in A} \nabla_{θ} π_{θ} (s, a) B (s) \\ &= \sum_{s \in S} d^{π_{θ}} (s) B (s) \nabla_{θ} \sum_{a \in A} π_{θ} (s, a) = 0 \end{aligned}$

Note that in line 2, since $π_{θ}$ is a probability, the right-most term sums to $1$ , which is a constant. Hence the gradient $\nabla_{θ}$ will become $0$ , and the expectation resolves to $0$ .
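A tiny numeric check of this claim, assuming a single state, two actions, a softmax policy with uniform probabilities and made-up rewards. It computes the exact mean and variance of the first component of the sampled gradient, with and without a baseline:

```python
def score(probs, a):
    """Softmax score with indicator features: 1(k == a) - pi(k)."""
    return [(1.0 if k == a else 0.0) - p for k, p in enumerate(probs)]


def grad_stats(probs, rewards, baseline):
    """Mean and variance of score(a)[0] * (R_a - B) under a ~ probs."""
    samples = [score(probs, a)[0] * (rewards[a] - baseline)
               for a in range(len(probs))]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var


probs = [0.5, 0.5]
rewards = [10.0, 11.0]  # hypothetical rewards for the two actions

mean_plain, var_plain = grad_stats(probs, rewards, baseline=0.0)
mean_base, var_base = grad_stats(probs, rewards, baseline=10.5)
print(mean_plain, var_plain)
print(mean_base, var_base)
```

The mean is the same in both cases, while the variance drops sharply once the baseline (here the average reward in the state) is subtracted.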

A good and popular choice of baseline is the state value function $B (s) = V^{π_{θ}} (s)$ .

So we can rewrite the policy gradient using the advantage function $A^{π_{θ}} (s, a)$ : $\begin{aligned} A^{π_{θ}} (s, a) &= Q^{π_{θ}} (s, a) - V^{π_{θ}} (s) \\ \nabla_{θ} J (θ) &= E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) A^{π_{θ}} (s, a)] \end{aligned}$
Intuitively, the advantage function tells us the additional benefit of an action over the baseline value of being in that state

Estimating the Advantage Function

The advantage function can significantly reduce the variance of the policy gradient. The naive way is to estimate $V^{π_{θ}} (s)$ and $Q^{π_{θ}} (s, a)$ separately, i.e.: $\begin{aligned} \hat{V} (s) &\approx V^{π_{θ}} (s) \\ Q_{w} (s, a) &\approx Q^{π_{θ}} (s, a) \\ A (s, a) &= Q_{w} (s, a) - \hat{V} (s) \end{aligned}$

Then, we use TD methods to update both value functions. However, this is not efficient in terms of parameters.

The better way is to observe that the TD error $δ^{π_{θ}}$ is an unbiased estimate of the advantage function. Therefore, we can plug in the TD error in our policy gradient update instead.

Recall that the TD error is $δ^{π_{θ}} = r + γ V^{π_{θ}} (s^{'}) - V^{π_{θ}} (s)$
And this is an unbiased estimate of the advantage function $E_{π_{θ}} [δ^{π_{θ}} ∣ s, a] = E_{π_{θ}} [r + γ V^{π_{θ}} (s^{'}) ∣ s, a] - V^{π_{θ}} (s) = Q^{π_{θ}} (s, a) - V^{π_{θ}} (s) = A^{π_{θ}} (s, a)$
Note that we have not taken any approximations above. We just showed that the expectation of the TD(0) error when following policy $π_{θ}$ corresponds to the advantage function. Interestingly, computing the TD error does not require estimating the $Q$ function, only the $V$ function. This gives us a simpler update
We can thus use the TD error to compute the policy gradient: $\nabla_{θ} J (θ) = E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) δ^{π_{θ}}]$
In practice, since we do not have the true value function $V^{π_{θ}}$ , we use an estimate $\hat{V}_{w}$ : $δ_{w} = r + γ \hat{V}_{w} (s^{'}) - \hat{V}_{w} (s)$
This approach only requires one set of critic parameters $w$

Improving the Actor Updates

Recall that we can estimate the value function $V_{θ} (s)$ from targets at different time scales to trade-off bias and variance:

For MC, the target is the return $G_{t}$ $△ θ = α (G_{t} - V_{θ} (s)) ϕ (s)$
For TD(0), the target is the TD target $△ θ = α (r + γ V_{θ} (s^{'}) - V_{θ} (s)) ϕ (s)$
For forward view TD( $λ$ ), the target is the lambda return $△ θ = α (v_{t}^{λ} - V_{θ} (s)) ϕ (s)$
For backward view TD( $λ$ ), we use eligibility traces: $\begin{aligned} δ_{t} &= r_{t + 1} + γ V (s_{t + 1}) - V (s_{t}) \\ e_{t} &= γλ e_{t - 1} + ϕ (s_{t}) \\ △ θ &= α δ_{t} e_{t} \end{aligned}$

Similarly, we can estimate the policy gradient for the actor at many time scales. The main target we want to estimate is $\nabla_{θ} J (θ) = E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) A^{π_{θ}} (s, a)]$

MC policy gradient uses the error from the return: $△ θ = α (G_{t} - \hat{V} (s_{t})) \nabla_{θ} \log π_{θ} (s_{t}, a_{t})$
Actor critic policy gradient uses the one-step TD error: $△ θ = α (r + γ \hat{V} (s_{t + 1}) - \hat{V} (s_{t})) \nabla_{θ} \log π_{θ} (s_{t}, a_{t})$
Forward view TD( $λ$ ) uses the $λ$ target: $△ θ = α (v_{t}^{λ} - \hat{V} (s_{t})) \nabla_{θ} \log π_{θ} (s_{t}, a_{t})$
Backward view TD( $λ$ ) uses eligibility traces. Note that the eligibility update now uses the score function instead of $ϕ$ which was the feature function: $\begin{aligned} δ_{t} &= r_{t + 1} + γ \hat{V} (s_{t + 1}) - \hat{V} (s_{t}) \\ e_{t + 1} &= λ e_{t} + \nabla_{θ} \log π_{θ} (s, a) \\ △ θ &= α δ_{t} e_{t} \end{aligned}$
The update can be performed online, unlike MC policy gradient

Natural Policy Gradient

The natural policy gradient is a parametrization independent approach. It finds the ascent direction that is closest to the vanilla gradient, when changing the policy by a small, fixed amount: $\nabla_{θ}^{nat} π_{θ} (s, a) = G_{θ}^{- 1} \nabla_{θ} π_{θ} (s, a)$

Where $G_{θ}$ is the Fisher information matrix: $G_{θ} = E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) \nabla_{θ} \log π_{θ} (s, a)^{⊺}]$
Notice that this obviates the need for a critic, as $G_{θ}$ is based on the actor itself

Summary

The policy gradient has many equivalent forms: $\begin{aligned} \nabla_{θ} J (θ) &= E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) G_{t}] && \text{REINFORCE} \\ &= E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) Q_{w} (s, a)] && \text{Q Actor-Critic} \\ &= E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) A_{w} (s, a)] && \text{Advantage Actor-Critic} \\ &= E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) δ] && \text{TD Actor-Critic} \\ &= E_{π_{θ}} [\nabla_{θ} \log π_{θ} (s, a) δ e] && \text{TD} (λ) \text{ Actor-Critic} \\ G_{θ}^{- 1} \nabla_{θ} J (θ) &= w && \text{Natural Actor-Critic} \end{aligned}$

Each formulation leads to a different SGD algorithm. We can learn the critic using policy evaluation (e.g. MC or TD learning) to estimate $Q_{π}, A_{π}$ or $V_{π}$ .

Learning & Planning

Lecture 8: Integrating Learning and Planning

A more conceptual lecture.

Introduction
Model based RL
Integrated Architectures
Simulation Based Search

Model Based RL

Lecture 7: Learn policy directly from experience
Lecture 4-6: Learn value function directly from experience
This lecture: learn model directly from experience
- The model describes an agent's understanding of the environment.
Use planning to construct a value function or policy
Integrate learning and planning into a single architecture

Taxonomy

Model free RL
- No model
- Learn value function / policy function from experience
Model-based RL
- Learn a model from experience
- Plan value function / policy function from model. Use model to look ahead and think and get rewards.

The learning diagram for Model-based RL is as follows:

graph LR;
    A("value/policy") -- acting --> B("experience");
    B("experience") -- model learning --> C("model");
    C("model") -- planning --> A("value/policy");

Advantages of Model-Based RL

Can efficiently learn model by supervised learning methods
- Consider a game like chess, where the model and game rule is quite simple
- But the value function is very hard, because moving one piece can dramatically change the value function
- So in some cases, the model is a more compact and useful representation of the problem
- The model simply predicts our next state, given the previous state and action. We can have a teacher which is the environment or game engine.
Can reason about model uncertainty

Disadvantage of Model-based RL:

First learn a model, then construct a value function, which gives us two sources of approximation error

What is a Model?

A model $M$ is a representation of an MDP $⟨ S, A, P, R ⟩$ parametrized by $η$ .

Assume state space $S$ and action space $A$ are known
So a model $M = ⟨ P_{η}, R_{η} ⟩$ represents state transitions $P_{η} \approx P$ and rewards $R_{η} \approx R$
And we can use the model to sample next state and reward: $\begin{aligned} S_{t + 1} &\sim P_{η} (S_{t + 1} ∣ S_{t}, A_{t}) \\ R_{t + 1} &\sim R_{η} (R_{t + 1} ∣ S_{t}, A_{t}) \end{aligned}$
Note that we just need to advance one time step to start learning
It is typical to assume conditional independence between state transitions and rewards, i.e. $P [S_{t + 1}, R_{t + 1} ∣ S_{t}, A_{t}] = P [S_{t + 1} ∣ S_{t}, A_{t}] P [R_{t + 1} ∣ S_{t}, A_{t}]$

Model Learning

The goal is to estimate the model $M_{η}$ from experience $\{ S_{1}, A_{1}, R_{2}, ..., S_{T} \}$

Observe that this is just a supervised learning problem: $\begin{aligned} S_{1}, A_{1} &\to R_{2}, S_{2} \\ S_{2}, A_{2} &\to R_{3}, S_{3} \\ &\vdots \\ S_{T - 1}, A_{T - 1} &\to R_{T}, S_{T} \end{aligned}$
Learning $s, a \to r$ is a regression problem
Learning $s, a \to s^{'}$ is a density estimation problem
Pick our favourite loss function, e.g. MSE, KL divergence
Pick parameters $η$ that minimize the empirical loss

Examples of models:

Table lookup model
Linear expectation model
Linear gaussian model
Gaussian process model
Deep belief network model

Table Lookup Model

Our model is an explicit MDP, i.e. $\hat{P}, \hat{R}$

We simply count the number of visits to each state action pair, i.e. $N (s, a)$ and record the probability of each resulting $s^{'}$ and mean reward $R_{t}$ : $\begin{aligned} \hat{P}_{s, s^{'}}^{a} &= \frac{1}{N ( s , a )} \sum_{t = 1}^{T} 1 (S_{t} = s, A_{t} = a, S_{t + 1} = s^{'}) \\ \hat{R}_{s}^{a} &= \frac{1}{N ( s , a )} \sum_{t = 1}^{T} 1 (S_{t} = s, A_{t} = a) R_{t} \end{aligned}$
An alternative is to do this in a non-parametric way:
- At each time step $t$ , record experience tuple $⟨ S_{t}, A_{t}, R_{t + 1}, S_{t + 1} ⟩$
- To sample model, each time we are in state $s, a$ , we randomly pick a tuple matching $⟨ s, a, \cdot, \cdot ⟩$

AB Example

Suppose we have $8$ episodes of experience:

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

From this data, we learn a model using the one-step supervised learning method like so:

graph LR;
    A("A") --  $r = 0$ , 100% --> B("B");
    B("B") --  $r = 1$ , 75% --> C("END");
    B("B") --  $r = 0$ , 25% --> D("END");
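
The counting behind this model can be sketched as follows. There is only one action in the AB example, so I drop the action from the keys; `learn_table_model` and the `"END"` marker are my own names:

```python
from collections import defaultdict


def learn_table_model(episodes):
    """Estimate P(reward, next state | state) by counting one-step transitions."""
    counts = defaultdict(lambda: defaultdict(int))
    for episode in episodes:
        for i, (s, r) in enumerate(episode):
            s_next = episode[i + 1][0] if i + 1 < len(episode) else "END"
            counts[s][(r, s_next)] += 1
    model = {}
    for s, outcomes in counts.items():
        n = sum(outcomes.values())  # N(s): number of visits to s
        model[s] = {outcome: c / n for outcome, c in outcomes.items()}
    return model


# The 8 episodes from the AB example, as (state, reward) steps.
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]
model = learn_table_model(episodes)
print(model["A"])  # A always goes to B with reward 0
print(model["B"])  # B terminates with reward 1 in 6 of its 8 visits
```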

Planning with a Model

Given a model $M_{η} = ⟨ P_{η}, R_{η} ⟩$ , now we want to solve the MDP $⟨ S, A, P_{η}, R_{η} ⟩$ . We can use our favourite planning algo:

Value iteration
Policy iteration
Tree search

Sample-based planning is one of the simplest approaches, but also one of the most powerful. The idea is to use the model only to generate samples. We sample experiences from the model: $\begin{aligned} S_{t + 1} &\sim P_{η} (S_{t + 1} ∣ S_{t}, A_{t}) \\ R_{t + 1} &\sim R_{η} (R_{t + 1} ∣ S_{t}, A_{t}) \end{aligned}$

We can then apply model free RL to samples, e.g. using

Monte carlo control
Sarsa
Q-learning etc.

Sample-based planning methods are often more efficient than full-width lookahead, because sampling concentrates computation on high-probability events instead of enumerating every possible successor.

Back To AB Example

After we have built the model, we can sample from it, e.g.

B, 1
B, 0
B, 1
A, 0, B, 1
B, 1
A, 0, B, 1
B, 1
B, 0

Based on this sample data, we apply monte carlo learning, and end up with: $V (A) = 1, V (B) = 0.75$
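
The values above can be reproduced with a short sketch of every-visit Monte Carlo evaluation over the eight sampled episodes listed earlier. The function name `mc_evaluate` and the undiscounted default `gamma=1.0` are assumptions made for illustration.

```python
from collections import defaultdict

# The eight sampled episodes, as (state, reward) steps.
sampled = [
    [("B", 1)], [("B", 0)], [("B", 1)],
    [("A", 0), ("B", 1)],
    [("B", 1)],
    [("A", 0), ("B", 1)],
    [("B", 1)], [("B", 0)],
]

def mc_evaluate(episodes, gamma=1.0):
    """Every-visit Monte Carlo: V(s) is the mean return observed from s."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk backwards so g accumulates the return from each step onward.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

values = mc_evaluate(sampled)
print(values["A"], values["B"])  # 1.0 0.75
```

Both sampled episodes that start in A happen to end with reward 1, which is why $V (A) = 1$ here even though the underlying model gives B a mean reward of 0.75.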

Planning with an Inaccurate Model

Given an imperfect model $⟨ P_{η}, R_{η} ⟩ \neq ⟨ P, R ⟩$ , we need to remember that the performance of model-based RL is limited to the optimal policy for the approximate MDP $⟨ S, A, P_{η}, R_{η} ⟩$ .

i.e. the model based RL is only as good as the estimated model
When the model is inaccurate, the planning process will compute a suboptimal policy

Solutions:

When model is wrong, use model-free RL
Reason explicitly about model uncertainty

Integrated Architectures

Bring together the best of model-based and model free architectures. We consider two sources of experience:

Real experience. Sampled from environment (true MDP): $S^{'} \sim P_{s, s^{'}}^{a}$ , $R = R_{s}^{a}$
Simulated experience. Sampled from model (approximate MDP): $S^{'} \sim P_{η} (S^{'} ∣ S, A)$ , $R = R_{η} (R ∣ S, A)$

Some taxonomy:

Model Free RL:
- No model
- Learn value function (or policy) from real experience
Model based RL
- Learn a model from real experience
- Plan value function (or policy) from simulated experience
Dyna
- Learn a model from real experience
- Learn and plan value function (or policy) from real and simulated experience

The Dyna architecture looks like this:

graph LR;
    A("value/policy") -- acting --> B("experience");
    B("experience") -- model learning --> C("model");
    C("model") -- planning --> A("value/policy");
    B("experience") -- direct RL --> A("value/policy");

Dyna-Q Algorithm

Initialize $Q (s, a)$ and $Model (s, a)$ for all $s \in S$ and $a \in A (s)$

Do forever:

1. $S \leftarrow \text{current state}$
2. $A \leftarrow ϵ\text{-greedy} (S, Q)$
3. Execute action $A$ ; observe reward $R$ and state $S^{'}$
4. $Q (S, A) \leftarrow Q (S, A) + α [R + γ \max_{a} Q (S^{'}, a) - Q (S, A)]$
5. $Model (S, A) \leftarrow R, S^{'}$
6. Repeat $n$ times:
   - $S \leftarrow \text{random previously observed state}$
   - $A \leftarrow \text{random action previously taken in } S$
   - $R, S^{'} \leftarrow Model (S, A)$
   - $Q (S, A) \leftarrow Q (S, A) + α [R + γ \max_{a} Q (S^{'}, a) - Q (S, A)]$

Note that:

Step 4 is the standard Q-learning update step
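Step 5 updates the model. In this tabular form it simply stores the observed $R, S^{'}$ for the pair $S, A$ , which assumes a deterministic environment; more generally this is a supervised learning step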
Step 6 is the thinking/planning step, where we "imagine" scenarios using our model and update Q $n$ times, without actually moving our agent in the real world
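
Here is a minimal sketch of the Dyna-Q loop on a toy problem. The corridor environment (`env_step`, states 0 to 5 with a reward of 1 at the goal), the hyperparameters, and the episodic restart at state 0 are all assumptions made for illustration. The planning step also samples a previously observed state-action pair uniformly, which is a slight simplification of sampling a state first and then an action.

```python
import random
from collections import defaultdict

N_STATES, GOAL = 6, 5          # hypothetical corridor: states 0..5, goal at 5
ACTIONS = (-1, +1)             # move left or right

def env_step(state, action):
    """Toy deterministic environment (an assumption for this sketch)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return reward, next_state

def epsilon_greedy(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if q[(state, a)] == best])

def dyna_q(n_planning, episodes=30, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)     # Q(s, a), initialised to 0
    model = {}                 # Model(s, a) -> (R, S')
    real_steps = 0
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            action = epsilon_greedy(q, state, epsilon, rng)           # step 2
            reward, next_state = env_step(state, action)              # step 3
            real_steps += 1
            target = reward + gamma * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (target - q[(state, action)])  # step 4
            model[(state, action)] = (reward, next_state)             # step 5
            for _ in range(n_planning):                               # step 6
                s, a = rng.choice(list(model))    # previously observed pair
                r, s2 = model[(s, a)]             # imagined outcome
                t = r + gamma * max(q[(s2, b)] for b in ACTIONS)
                q[(s, a)] += alpha * (t - q[(s, a)])
            state = next_state
    return q, real_steps

q, real_steps = dyna_q(n_planning=5)
greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)]
print(greedy)  # [1, 1, 1, 1, 1]: the greedy policy moves right everywhere
```

Calling `dyna_q` with different values of `n_planning` and comparing `real_steps` gives a rough feel for how much real experience planning saves on this toy problem.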

Experiments show that planning significantly speeds up convergence, requiring far fewer exploration steps in the real world to converge. So we are squeezing much more information out of what we have explored so far.

A variation of Dyna-Q is Dyna-Q+, which adds a bonus reward to state-action pairs that have not been tried for a long time, and so encourages exploration.

Simulation Based Search

Two key ideas: sampling and forward search.

Forward search algorithms select the best action by lookahead. A search tree is built with the current state $s_{t}$ at the root. We then use a model of the MDP to look ahead.

So we do not need to solve the entire MDP, just the sub-MDP starting from the current position. Solving the entire MDP is a waste of time. Note that this is in contrast to the Dyna-Q algorithm, where the "thinking" step starts by randomly visiting a previously observed state.

So we can simulate episodes of experience from now using our model, and apply model free RL to simulated episodes.

Simulation based search:

Simulate episodes of experience from now with the model $\{s_{t}^{k}, A_{t}^{k}, R_{t + 1}^{k}, ..., S_{T}^{k}\}_{k = 1}^{K} \sim M_{v}$
Apply model free RL to simulated episodes
- If we use Monte-carlo control, we get Monte-Carlo search
- If we use Sarsa for control, we get TD search

Simple Monte Carlo Search

Given a model $M_{v}$ and some simulation policy $π$ (how we pick actions in our imagination)
For each current action $a \in A$ :
- Simulate $K$ episodes using model from current real state $s_{t}$ : $\{s_{t}, a_{t}, R_{t + 1}^{k}, S_{t + 1}^{k}, A_{t + 1}^{k}, ..., S_{T}^{k}\}_{k = 1}^{K} \sim M_{v}, π$
- Note that after current $a_{t}$ , we follow policy $π$ for future actions
- Evaluate actions by mean return (Monte Carlo evaluation): $Q (s_{t}, a) = \frac{1}{K} \sum_{k = 1}^{K} G_{t} \to q_{π} (s_{t}, a)$
Select the current real action with maximum value: $a_{t} = \arg\max_{a \in A} Q (s_{t}, a)$
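
The procedure above can be sketched as follows. The corridor model (`model_step`, a reward of -1 per move until the goal), the uniform random simulation policy, and the step cap `max_steps` are assumptions chosen to keep the example small.

```python
import random

GOAL = 4
ACTIONS = (-1, +1)

def model_step(state, action):
    """Hypothetical known model: corridor 0..GOAL, reward -1 per move."""
    next_state = min(max(state + action, 0), GOAL)
    return -1.0, next_state

def rollout(state, first_action, policy, rng, max_steps=100):
    """Return of one simulated episode: take first_action, then follow policy."""
    g, action = 0.0, first_action
    for _ in range(max_steps):
        reward, state = model_step(state, action)
        g += reward
        if state == GOAL:
            break
        action = policy(state, rng)
    return g

def simple_mc_search(state, policy, k=200, seed=0):
    rng = random.Random(seed)
    # Q(s_t, a) is the mean return over K simulated episodes starting with a.
    q = {a: sum(rollout(state, a, policy, rng) for _ in range(k)) / k
         for a in ACTIONS}
    return max(q, key=q.get), q

def random_policy(state, rng):
    return rng.choice(ACTIONS)

best_action, q = simple_mc_search(state=3, policy=random_policy)
print(best_action)  # 1: stepping right from state 3 reaches the goal at once
```

Note that the simulation policy stays fixed throughout, so the estimates converge to $q_{π}$ for the random policy, not to the optimal values.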

Monte Carlo Tree Search (Evaluation)

MCTS differs in that we allow the policy to improve within our simulation (i.e. policy $π$ is not stationary within our simulation runs).

Given a model $M_{v}$
Simulate $K$ episodes using model from current real state $s_{t}$ using current simulation policy $π$ : $\{s_{t}, a_{t}, R_{t + 1}^{k}, S_{t + 1}^{k}, A_{t + 1}^{k}, ..., S_{T}^{k}\}_{k = 1}^{K} \sim M_{v}, π$
Build a search tree containing visited states and actions from the above simulations
Evaluate states $Q (s, a)$ by mean return of episodes starting from $s, a$ (i.e. monte carlo evaluation): $Q (s, a) = \frac{1}{N (s, a)} \sum_{k = 1}^{K} \sum_{u = t}^{T} 1 (S_{u} = s, A_{u} = a) G_{u} \to q_{π} (s, a)$
Note that:
- $N (s, a)$ is the number of times we visited the $s, a$ pair during our simulations.
- We are assuming our model is good enough so that we can rely on the simulated returns
- We are not really learning a persistent Q function. At each step, we run simulations to get fresh estimations of the Q values
After the search is finished, select the current real action with maximum value in the search tree: $a_{t} = \arg\max_{a \in A} Q (s_{t}, a)$

In MCTS, the simulation policy $π$ actually improves.

Each simulation comprises two phases (in-tree, out of tree)
- Tree policy (improves): pick actions to maximize $Q (S, A)$ by looking at the search tree and the node's children
- Default policy (fixed): in our simulation, when we run beyond the frontier of the search tree, we will pick actions randomly
Repeat (each simulation)
- Evaluate states $Q (S, A)$ by monte carlo evaluation
- Improve tree policy, e.g. by $ϵ$ -greedy $(Q)$
Essentially, this is monte carlo control applied to simulated experience
This method converges on the optimal search tree, i.e. $Q (S, A) \to q_{*} (S, A)$
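
A compact sketch of the two-phase simulation described above, under several assumptions: the model is a toy corridor (`model_step`), the "tree" is keyed by state rather than by full history (reasonable here because the toy states are Markov and recur), one new state is added to the tree per simulation, and unvisited in-tree pairs are treated as having $Q = 0$.

```python
import random

GOAL = 4
ACTIONS = (-1, +1)

def model_step(state, action):
    """Hypothetical known model: corridor 0..GOAL, reward -1 per move."""
    next_state = min(max(state + action, 0), GOAL)
    return -1.0, next_state

def mcts(root, n_sims=300, epsilon=0.1, max_steps=50, seed=0):
    rng = random.Random(seed)
    q, n = {}, {}                  # Q(s, a) and N(s, a) for in-tree pairs
    tree = {root}                  # states currently in the search tree
    for _ in range(n_sims):
        state, trajectory, new_leaf = root, [], None
        for _ in range(max_steps):
            if state in tree and rng.random() >= epsilon:
                # Tree policy: greedy with respect to the current Q.
                action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
            else:
                # Epsilon exploration in-tree, or default policy out of tree.
                action = rng.choice(ACTIONS)
                if state not in tree and new_leaf is None:
                    new_leaf = state
            reward, next_state = model_step(state, action)
            trajectory.append((state, action, reward))
            state = next_state
            if state == GOAL:
                break
        if new_leaf is not None:
            tree.add(new_leaf)     # grow the tree by one node per simulation
        g = 0.0
        for s, a, r in reversed(trajectory):
            g += r                 # undiscounted return from this step onward
            if s in tree:
                n[(s, a)] = n.get((s, a), 0) + 1
                old = q.get((s, a), 0.0)
                q[(s, a)] = old + (g - old) / n[(s, a)]   # running mean return
    best = max(ACTIONS, key=lambda a: q.get((root, a), float("-inf")))
    return best, q

best_action, q = mcts(root=2)
print(best_action, q.get((2, 1)), q.get((2, -1)))
```

Because the tree policy is greedy in the running estimates, later simulations from an in-tree state mostly follow the better action, which is the policy improvement that simple Monte Carlo search lacks.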

Case Study: Game of Go

Position evaluation in Go:

How good is a position $s$ ?
Reward function (undiscounted): $R_{t} = 0$ for all non-terminal steps $t < T$ , and $R_{T} = \begin{cases} 1 & \text{if black wins} \\ 0 & \text{if white wins} \end{cases}$
Policy $π = ⟨ π_{B}, π_{W} ⟩$ selects moves for both players
The value function (how good is position $s$ ): $\begin{aligned} v_{π} (s) &= E_{π} [R_{T} ∣ S = s] = P [\text{Black wins} ∣ S = s] \\ v_{*} (s) &= \max_{π_{B}} \min_{π_{W}} v_{π} (s) \end{aligned}$

How does simple monte carlo evaluation work?

Suppose we start with a certain board configuration $s$
We simulate many runs of games starting from $s$ with current policy $π$
The value function $v_{π} (s)$ would be the fraction of simulated games where black wins
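
The same idea can be sketched on a much smaller game than Go. The toy game below is an assumption made for illustration: players alternately take 1 or 2 stones from a pile, and whoever takes the last stone wins. Both players follow a uniformly random policy $π$, and the value of a position is the fraction of simulated games that Black wins.

```python
import random

def simulate_game(stones, black_to_move, rng):
    """Play one toy game with random moves. Returns 1 if Black wins, else 0."""
    while True:
        take = rng.choice([1, 2]) if stones >= 2 else 1
        stones -= take
        if stones == 0:
            # The player who just moved took the last stone and wins.
            return 1 if black_to_move else 0
        black_to_move = not black_to_move

def mc_value(stones, black_to_move=True, n_games=10_000, seed=0):
    """Estimate v_pi(s) as the fraction of simulated games won by Black."""
    rng = random.Random(seed)
    wins = sum(simulate_game(stones, black_to_move, rng) for _ in range(n_games))
    return wins / n_games

print(mc_value(1))   # 1.0: Black takes the only stone and wins
print(mc_value(2))   # roughly 0.5 under the random policy
```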
