This is a collection of my notes which I refer to on a regular basis. Hope it is also helpful for others stumbling by.

I think of these notes as mountaineering pegs. Often at the time of studying a particular paper or topic, the concepts are clear. Over time, however, only a vague impression remains, and I can no longer tussle with the issues. So these notes serve as pegs that I hope to use to re-scale an old hill, or at least scale it with less effort than starting from scratch.

Generally, I highlight main points like so, and put in-line code or numbers like so.

Built with mdBook.

LightGBM Memory

TLDR: Solutions for memory issues during training of a LightGBM model:

  1. Cast numeric values into np.float32 to save data space
  2. Keep num_leaves <= 100 (or some reasonable number)
  3. If feature dimension is large (e.g. M >= 1000), try colsample_bytree = 0.1, although this might not help too much if the bottleneck is during bin histogram construction (rather than the actual training)
  4. If the number of rows and features are both large (e.g. N >= 1_000_000 and M >= 1000, i.e. >= 4 GB in float32), then the data itself takes up a lot of memory. It is worthwhile to keep the data on disk and create the lgb.Dataset by providing the file path as the data argument instead. We should then also set two_round=True in the train params. The explanation for two_round is rather unclear, but it should help with memory when the Dataset is loading from disk (rather than from a numpy array in memory); see the sketch below. I had some trouble getting this option to work with categorical columns.
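A minimal sketch of option 4, assuming a CSV on disk with a header row and a column named label; the file name, the binary objective (chosen to keep the sketch simple) and the parameter values are illustrative, not taken from the original experiments.

import lightgbm as lgb

params = {"objective": "binary", "num_leaves": 100, "two_round": True}
train_set = lgb.Dataset(
    "train.csv",  # file path instead of an in-memory numpy array
    params={"header": True, "label_column": "name:label", "two_round": True},
)
booster = lgb.train(params=params, train_set=train_set)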

For more details, refer to the experiments below.

Experiments

I often run into memory issues running LightGBM, so here are some experiments to measure memory usage and understand how the hyperparameters affect it.

The function of interest is the fit method for the learn to rank task.

import lightgbm as lgb

def f():
    # data, y, groups and params are defined further below
    model = lgb.LGBMRanker(**params, objective="lambdarank")
    model.fit(
        X=data,
        y=y,
        group=groups,
    )

The memory usage is measured using the memory_profiler module, which samples the memory usage at 0.1 second intervals. The maximum is then taken to represent the peak memory usage of the fit function. We also take note of the size of the data itself (using data.nbytes) and subtract that away to get closer to the LightGBM memory usage. Do note that this memory profiling is not very rigorous, so the results are best used for relative comparison within each experiment rather than across experiments.

from memory_profiler import memory_usage

def run(params):
    mem_usage = memory_usage(f)  # memory_usage samples in MiB
    return max(mem_usage) / 1000  # convert to GB (approximately)

We set the default parameters as follows and generate the data this way. For the experiments below, the default parameters are used unless specified otherwise.

import numpy as np

DEFAULT_PARAMS = {
    "N": 200000,  # number of instances
    "M": 500,  # feature dimension
    "n_estimators": 100,
    "num_leaves": 100,
    "histogram_pool_size": -1,
}
N, M = DEFAULT_PARAMS["N"], DEFAULT_PARAMS["M"]
data = np.random.randn(N, M)
groups = [20] * int(N / 20)  # assume each session has 20 rows
y = np.random.randint(2, size=N)  # randomly choose 0 or 1

Large num_leaves can get very memory intensive. We should not need too many leaves, so generally using num_leaves <= 100 and increasing the number of estimators seems sensible to me.

  • num_leaves: 10, Maximum memory usage: 2.28 GB - 0.80 GB = 1.48 GB
  • num_leaves: 100, Maximum memory usage: 2.52 GB - 0.80 GB = 1.72 GB
  • num_leaves: 1000, Maximum memory usage: 4.04 GB - 0.80 GB = 3.24 GB

Increasing n_estimators doesn't seem to raise memory much, but increases run time because each tree is fitted sequentially on the residual errors, so it cannot be parallelized.

  • n_estimators: 10, Maximum memory usage: 2.28 GB - 0.80 GB = 1.48 GB
  • n_estimators: 100, Maximum memory usage: 2.53 GB - 0.80 GB = 1.73 GB
  • n_estimators: 1000, Maximum memory usage: 2.69 GB - 0.80 GB = 1.89 GB

Increasing N increases memory sublinearly. It seems that the data size itself will be more of a problem than the increase in LightGBM memory usage as N increases. For extremely large N, we can also set the subsample parameter to use only a fraction of the training instances for each step (i.e. stochastic rather than full gradient descent). By default subsample=1.0.

  • N: 1,000, Maximum memory usage: 0.38 GB - 0.00 GB = 0.38 GB
  • N: 10,000, Maximum memory usage: 0.45 GB - 0.04 GB = 0.41 GB
  • N: 100,000, Maximum memory usage: 1.46 GB - 0.40 GB = 1.06 GB
  • N: 1,000,000, Maximum memory usage: 6.12 GB - 4.00 GB = 2.12 GB
  • N: 2,000,000, Maximum memory usage: 10.48 GB - 8.00 GB = 2.48 GB

In contrast to N, memory usage is quite sensitive to M, and seems to increase linearly when M gets large. M=10,000 blows up my memory. I suppose this could be mitigated by setting colsample_bytree or colsample_bynode to sample a smaller subset of features.

  • M: 100, Maximum memory usage: 2.08 GB - 0.16 GB = 1.92 GB
  • M: 1000, Maximum memory usage: 4.92 GB - 1.60 GB = 3.32 GB
  • M: 2000, Maximum memory usage: 9.69 GB - 3.20 GB = 6.49 GB
  • M: 3000, Maximum memory usage: 14.35 GB - 4.80 GB = 9.55 GB

To deal with the high memory usage of large M, we can set colsample_bytree, which samples a subset of columns before training each tree. This helps to mitigate the memory usage. For this experiment, we set M=2000 to simulate data with a high number of dimensions.

  • colsample_bytree: 0.1, Maximum memory usage: 8.60 GB - 3.20 GB = 5.40 GB
  • colsample_bytree: 0.2, Maximum memory usage: 9.58 GB - 3.20 GB = 6.38 GB
  • colsample_bytree: 0.4, Maximum memory usage: 10.06 GB - 3.20 GB = 6.86 GB
  • colsample_bytree: 0.6, Maximum memory usage: 10.07 GB - 3.20 GB = 6.87 GB
  • colsample_bytree: 0.8, Maximum memory usage: 10.46 GB - 3.20 GB = 7.26 GB

In contrast, setting colsample_bynode does not help memory usage at all. Not too sure why, but I suppose since multiple nodes for the same tree can be split at the same time, the full feature set still has to be kept in memory.

  • colsample_bynode: 0.1, Maximum memory usage: 10.49 GB - 3.20 GB = 7.29 GB
  • colsample_bynode: 0.2, Maximum memory usage: 10.49 GB - 3.20 GB = 7.29 GB
  • colsample_bynode: 0.4, Maximum memory usage: 10.49 GB - 3.20 GB = 7.29 GB
  • colsample_bynode: 0.6, Maximum memory usage: 10.49 GB - 3.20 GB = 7.29 GB
  • colsample_bynode: 0.8, Maximum memory usage: 10.48 GB - 3.20 GB = 7.28 GB

Tweaking boosting and data_sample_strategy doesn't seem to affect memory usage too much. Using dart seems to require a bit more memory than the traditional gbdt.

  • data_sample_strategy: bagging, boosting: gbdt, Maximum memory usage: 8.90 GB - 3.20 GB = 5.70 GB
  • data_sample_strategy: goss, boosting: gbdt, Maximum memory usage: 9.58 GB - 3.20 GB = 6.38 GB
  • data_sample_strategy: bagging, boosting: dart, Maximum memory usage: 9.81 GB - 3.20 GB = 6.61 GB
  • data_sample_strategy: goss, boosting: dart, Maximum memory usage: 9.80 GB - 3.20 GB = 6.60 GB

Another bottleneck we can tackle comes from realizing that LightGBM is a two-stage algorithm. In the first stage, LightGBM uses the full dataset to construct bins for each numeric feature (controlled by the max_bin parameter) based on the optimal splits. In the second stage, these discretized bins are used to map and split the numeric features during the actual training process to construct trees. From my understanding, the first stage cannot be chunked as it requires the full dataset, but the second stage can be chunked (as per any stochastic gradient descent algorithm), where a fraction of the dataset is loaded at a time. Hence, the real bottleneck appears to be the first stage, when the bins are constructed.

According to this thread, we can separate the memory usage between the two stages by using lgb.Dataset. First, we initialize the Dataset object and make sure to set free_raw_data=True (this tells it to free the original data array after the binning is done). Then, we trigger the actual dataset construction using dataset.construct(). Thereafter, we are free to delete the original data array to free up memory for the actual training. The following code illustrates this concept.

dataset = lgb.Dataset(data=data, label=y, group=groups, free_raw_data=True)
dataset.construct()  # builds the binned dataset; the raw data is freed afterwards
del data  # drop our own reference so the original array can be garbage collected
lgb.train(params=params, train_set=dataset)

TF-IDF

Term Frequency - Inverse Document Frequency is a well known method for representing a document as a bag of words. For a given corpus $D$ of $N$ documents, we compute the IDF value for each word $t$ by taking $\text{idf}(t) = \log \frac{N}{n_t}$ (or a smoothed variant), with $n_t$ denoting the number of documents in $D$ containing the word $t$. A document $d$ is represented by a vector whose length corresponds to the number of unique words in $D$. Each element of the vector is the tf-idf value for the corresponding word, i.e. $\text{tf}(t, d) \cdot \text{idf}(t)$, where $\text{tf}(t, d)$ represents the term frequency of the word $t$ in document $d$. Sometimes, we may l1 or l2 normalize the tf-idf vector so that the dot product between document vectors represents the cosine similarity between them.

Bayesian Smoothing

We may want to apply some bayesian smoothing to the terms to avoid spurious matches. For example, suppose that a rare word $t$ appears only in documents $d_1$ and $d_2$ in the entire corpus, just by random chance. The $\text{idf}(t)$ will be a large value, and hence documents $d_1$ and $d_2$ will have a high cosine similarity just because of this rare word.

For the specific setting I am considering, we can deal with this problem using bayesian smoothing. The setting is as follows:

  • Each document represents a job, and each job is tagged to an occupation
  • An occupation can have one or more jobs tagged to it
  • We wish to represent each occupation as a TF-IDF vector of words

To apply bayesian smoothing to this scenario, notice that we only need to smooth the term frequencies $\text{tf}(t, o)$ of each word $t$ for each occupation $o$. Since the IDF values are estimated across the whole corpus, we can assume that those are relatively reliable. And since term frequencies are counts, we can use a Poisson random variable to represent them. See reference for a primer on the gamma-Poisson bayesian inference model.

Specifically, we assume that $\lambda_{t,o}$ is the Poisson parameter that dictates the term frequency of $t$ in any document belonging to occupation $o$, i.e. $\text{tf}(t, d) \sim \text{Poisson}(\lambda_{t,o})$ for each document $d$ tagged to $o$. We treat the observed term frequency of each document belonging to $o$ as a data sample to update our beliefs about $\lambda_{t,o}$. We start with an uninformative gamma prior $\text{Gamma}(\alpha, \beta)$ for $\lambda_{t,o}$, and obtain the MAP estimate for $\lambda_{t,o}$ as below, with $n_o$ denoting the number of documents that belong to occupation $o$:

$$ \hat{\lambda}_{t,o} = \frac{\alpha + \sum_{d \in o} \text{tf}(t, d) - 1}{\beta + n_o} $$

We can thus use this formula to obtain posterior estimates for each $\lambda_{t,o}$. One possible choice of the prior parameters $\alpha$ and $\beta$ is to set $\alpha$ to be the mean term frequency of word $t$ per document in the entire corpus, and to set $\beta = 1$. This prior corresponds to $\lambda_{t,o}$ following a gamma distribution with mean $\alpha$ and variance $\alpha$, which seems to be a reasonable choice that can be overridden by a reasonable amount of data.

The posterior variance, which may also be helpful in quantifying the confidence of this estimate, is:

$$ \text{Var}(\lambda_{t,o} \mid \text{data}) = \frac{\alpha + \sum_{d \in o} \text{tf}(t, d)}{\left(\beta + n_o\right)^2} $$

Finally, after obtaining the posterior estimates for each $\lambda_{t,o}$, we can just use them as our term frequencies and multiply them by the IDF values as per normal. We can also apply l1 or l2 normalization thereafter to the tf-idf vectors. This method should produce tf-idf vectors that are more robust to the original problem of spurious matches.

For illustration, take a very rare word $t$, for which $\alpha$ will be a low value close to 0 (say 0.01). Suppose we were to observe $n$ new documents for an occupation, each containing one occurrence of word $t$. Then the MAP estimate of $\lambda_{t,o}$ will update as follows:

| n | $\hat{\lambda}_{t,o}$ |
|---|---|
| 1 | 0.005 |
| 2 | 0.337 |
| 3 | 0.503 |
| 4 | 0.602 |
| 20 | 0.905 |

As desired, the estimate for $\lambda_{t,o}$ starts off at a very small value and gradually approaches the true value of 1. This will help mitigate the effect of spurious matches. If we desire for the update to match the data more quickly, we can simply scale $\alpha$ and $\beta$ down by some common factor, which keeps the prior mean the same but weakens its influence. Then we have:

| n | $\hat{\lambda}_{t,o}$ |
|---|---|
| 1 | 0.001 |
| 2 | 0.455 |
| 3 | 0.626 |
| 4 | 0.715 |
| 20 | 0.941 |

As a final detail, note that the update formula can result in a slightly negative estimate when $\alpha < 1$ and no occurrences of the word are observed (i.e. $\sum_{d \in o} \text{tf}(t, d) = 0$). The small negative value is probably not a big problem for our purposes, but we could also resolve it by clipping negative values to zero if desired.
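The MAP update above is simple enough to sketch in a few lines of Python; the function and variable names below are my own, and the loop reproduces the first table under the stated prior.

def smoothed_tf(term_freqs: list, alpha: float, beta: float = 1.0) -> float:
    """MAP estimate of the Poisson rate for one (word, occupation) pair."""
    n = len(term_freqs)  # number of documents tagged to the occupation
    estimate = (alpha + sum(term_freqs) - 1) / (beta + n)
    return max(estimate, 0.0)  # clip the small negative case to zero

for n in [1, 2, 3, 4, 20]:
    print(n, round(smoothed_tf([1] * n, alpha=0.01), 3))  # 0.005, 0.337, 0.503, 0.602, 0.905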

Cross Encoders

A cross encoder is a type of model architecture used for re-ranking a relatively small set of candidates (typically 1,000 or fewer) with great precision. In the Question-Answering or machine reading literature, the task typically involves finding the top matching documents for a given query. A typical benchmark is the MS MARCO dataset, which seeks to find the top documents that are relevant to a given Bing query.

Basic Setup

Typically, the base model is some kind of pre-trained BERT model, and a classification head is added on top to output a probability. Each (query, document) pair is concatenated with a [SEP] token in between to form a single sentence. The sentence is fed into the classification model to output a probability, and the model is trained using binary cross-entropy loss against 0/1 labels (irrelevant or relevant).

This is the setup used by <Nogueira 2019>, possibly the first paper to propose the cross encoder. Some specifics of their setup:

  • The query is truncated to max 64 tokens, while the passage is truncated such that the concatenated sentence is max 512 tokens. They use the [CLS] embedding as input to a classifier head.
  • The loss for a single query is formulated as below, where $s_j$ refers to the score from the classifier model for document $j$, $J_{pos}$ refers to the documents that are relevant, and $J_{neg}$ refers to the documents in the top 1,000 retrieved by BM25 that are not relevant. Note that this results in a very imbalanced dataset.

$$ L = -\sum_{j \in J_{pos}} \log(s_j) - \sum_{j \in J_{neg}} \log(1 - s_j) $$

  • The model is fine-tuned with a batch size of 128 sentence pairs for 100k batches.

As opposed to bi-encoders (or dual encoders), which take a dot product between the query embedding and the document embedding, we cannot pre-compute document embeddings in the cross encoder setting, because the cross encoder requires a forward pass on the concatenated (query, document) pair. Due to the bi-directional attention over the full concatenated sentence, we need the full sentence before we can compute the score, and this requires the query which we only see at inference time. Hence, the cross encoder is limited to re-ranking a small set of candidates, as it requires a full forward pass on each (query, candidate document) pair separately.

Contrastive Loss

The vanilla binary cross entropy loss proposed above may be thought of as a pointwise loss, in which each document is either relevant or irrelevant in absolute terms. However, treating relevance as a relative concept often better reflects reality. For example, given the first page of search results for a Google query, most of the documents should be relevant to some extent, but some are more relevant than the rest (and get clicked on). Simply treating all clicks as relevant and all non-clicks as irrelevant naively ignores the context (i.e. the neighbouring search results) in which the clicks were generated. It assumes that across query sessions, the average level of relevance of the results is comparable. Treating relevance as a relative concept within the same query session weakens this assumption and hence often works better.

Thus <Gao 2021> proposes the Local Contrastive Estimation loss. For a given query q, a positive document is selected, and a few negative documents are sampled using a retriever (e.g. BM25). The contrastive loss then seeks to maximize the softmax probability of the positive document against the negative documents.
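A minimal PyTorch sketch of this contrastive loss, assuming the positive document's score sits in column 0 of a score matrix; the shapes and names are my own illustration, not the paper's code.

import torch
import torch.nn.functional as F

def lce_loss(scores: torch.Tensor) -> torch.Tensor:
    """scores: (num_queries, 1 + num_negatives); column 0 holds the positive document's score."""
    labels = torch.zeros(scores.size(0), dtype=torch.long)  # the positive is always index 0
    return F.cross_entropy(scores, labels)  # softmax of the positive against the negatives

loss = lce_loss(torch.randn(2, 8))  # e.g. 2 queries, each with 1 positive and 7 negatives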

Multiple experiments in Gao 2021 and Pradeep 2022 confirm that LCE consistently outperforms the point-wise cross entropy loss. Furthermore, the performance consistently improves as the number of negative documents per query increases. In Gao 2021, up to 7 negatives (i.e. a batch size of 8) were used. Pradeep 2022 shows that increasing the batch size up to 32 continues to yield gains consistently (albeit diminishingly).

Other details

Pradeep 2022's experiments show that using a stronger retrieval model (a ColBERT-based model) during inference generates slight gains in final performance (as opposed to BM25). Although Gao 2021 argues that it is also important to use the same retrieval model during model training (so that the cross encoder sees the same distribution of negatives during training and inference), Pradeep 2022 argues that the alignment is not as important as the stronger retrieval performance during inference.

References

SentenceTransformers

Collaborative Filtering

This post is trying to grok my colleague Yuxuan's sharing on various losses for collaborative filtering. All credit goes to him.

Collaborative filtering is typically done with implicit feedback in the RecSys setting. In this setting, interactions are often very sparse. Most of the time, only positive signals are recorded, and a non-interaction could either mean (i) the user dislikes the item or (ii) the user was not exposed to the item. Hence, we cannot use algorithms like SVD which treat non-interactions as irrelevance.

A generic and fairly common architecture for the collaborative filtering model is to embed each user and item into separate fixed size vectors, and use the cosine similarity between the vectors to represent a score. This score is fed into a cross entropy loss against the labelled relevance of user to item to train the embeddings.

Setup

Let $\mathbf{u}$ and $\mathbf{i}$ denote the $d$-dimensional embedding vectors for user $u$ and item $i$. Let the similarity function be $s(u, i)$, typically the cosine similarity or dot product between $\mathbf{u}$ and $\mathbf{i}$, with a corresponding distance function $d(u, i)$. Then some common loss functions may be denoted as below.

Pointwise Losses are typically low-performing. For a given (u, i) pair, pointwise losses assume the presence of a 0/1 label $y_{ui}$ for relevance and try to predict it. The typical pointwise loss is Binary Cross Entropy, which may be expressed as below, with $\sigma$ denoting the sigmoid function:

$$ \mathcal{L}_{BCE} = -\left[ y_{ui} \log \sigma\big(s(u, i)\big) + (1 - y_{ui}) \log\big(1 - \sigma(s(u, i))\big) \right] $$

Pairwise Losses assume the presence of training triplets (u, i, j) which correspond to a user, a positive item and a negative item. A typical pairwise loss is Bayesian Personalized Ranking (BPR), as follows:

$$ \mathcal{L}_{BPR} = -\log \sigma\big(s(u, i) - s(u, j)\big) $$
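A minimal PyTorch sketch of the BPR loss over (user, positive, negative) embedding batches; the shapes and the dot-product scoring are my assumptions.

import torch
import torch.nn.functional as F

def bpr_loss(user: torch.Tensor, pos: torch.Tensor, neg: torch.Tensor) -> torch.Tensor:
    """user, pos, neg: (batch, dim) embedding tensors."""
    pos_scores = (user * pos).sum(dim=-1)  # s(u, i)
    neg_scores = (user * neg).sum(dim=-1)  # s(u, j)
    return -F.logsigmoid(pos_scores - neg_scores).mean()

loss = bpr_loss(torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 16))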

AB Testing

References

Examples

These summaries are based on reading PostHog's article on AB testing and studying the ones of interest further.

AB Testing at Airbnb

AB Testing is crucial because the outside world often has a larger effect on metrics than product changes. Factors like seasonality and the economy can cause metrics to fluctuate greatly, so a controlled experiment is necessary to control for external factors and isolate the effect of the product change.

Airbnb has a complex booking flow of search -> contact -> accept -> book. While they track the AB impact of each stage, the main metric of interest is the search to book metric.

One pitfall is stopping the AB Test too early. Airbnb noticed that their AB tests tend to follow a pattern of hitting significance early on but returning to neutral when the test has run its full course. This is a phenomenon known as peeking, which is to repeatedly examine an ongoing AB test. It makes it much more likely to find a significant effect when there isn't one, since we are doing a statistical test each time we peek. For Airbnb, they hypothesize that this is also a phenomenon caused by the long lead time it takes from search -> book, such that early converters have a disproportionately large influence at the beginning of the experiment.

The natural solution to this problem is to conduct power analysis and determine the desired sample size prior to the experiment. However, Airbnb runs multiple AB tests at the same time, hence they required an automatic way to track ongoing AB tests and report when significance has been reached. Their solution back then was to create a dynamic p-value curve. The idea is that on day 1 of the experiment, we would require a very low p-value to declare success. As time goes on and more samples are collected, we can gradually relax the p-value threshold until it hits 5%. The shape of this curve is unique to their platform and they did extensive simulations to create it, so this solution is not very generalizable.

Another pitfall was assuming that the system is working. After running an AB test for shifting from more words to more pictures, the initial result was neutral. However, they investigated and found that most browsers had a significant positive effect except for Internet Explorer. It turned out that the change had some breaking effect on older IE browsers. After fixing that, the overall result became positive. Hence some investigation is warranted when the AB test results are unintuitive. However, one needs to be cautious of multiple-testing when investigating breakdowns, since we are conducting multiple statistical tests.

Airbnb has a strong AB testing culture - only 8% of AB tests are successful (see Ron Kohavi's LinkedIn Post).

AB Testing at Monzo

Monzo has a big AB testing culture - they ran 21 experiments in 6 months. Their approach is bottom-up: anyone can write a proposal on Notion, and some of the best ideas come from the customer operations staff working on the frontlines. A proposal comprises the following sections:

  • What problem are you trying to solve?
  • Why should we solve it?
  • How should we solve it (optional)?
  • What is the ideal way to solve this problem (optional)?

Many proposals end up becoming AB experiments. Monzo prefers launching pellets rather than cannonballs. This means that each experiment comprises small changes, is quick to build, and helps the team learn quickly.

AB Testing at Convoy

Convoy argues that bayesian AB testing is more efficient than frequentist AB testing and allows them to push out product changes faster while still controlling risk.

The argument against frequentist AB testing is as follows. Under traditional AB testing, we define a null hypothesis using the control group (call it A), and declare a treatment (call it B) as successful if the treatment value has a significant p-value, i.e. it falls outside of the range of reasonable values under the null. Based on power analysis and an expected effect size, we predetermine the necessary sample size to achieve sufficient power, and once this sample size is reached, we have a binary success or failure result based on the p-value.

Convoy argues that this approach is safe but inefficient. This is because prior to the sample size being reached, we do not have a principled way of saying anything about the effectiveness of the treatment, even if it is performing better. Furthermore, frequentist AB testing gives us a binary result, but it does not quantify the size of the difference. Specifically, an insignificant test where E(A)=10%, E(B)=11% is quite different from E(A)=15%, E(B)=10%. For the former case, one can argue for launching B even if the p-value did not hit significance, whereas for the latter we should definitely not launch.

Bayesian analysis comes in to make the above intuition concrete. Suppose we are interested in the clickthrough rate (CTR) of variant A vs B. Bayesian analysis provides a distribution of the average CTR for each variant A, B at any point of the AB test, based on the results it has seen thus far. These posterior distributions reflect both the mean of the data (how far apart the mean CTR of A is from that of B) and the variance of the data (how spread out the distributions are), allowing us to quantify how much we stand to gain or lose if we were to pick either variant A or B at this point in time.

Concretely, they define a loss function as follows. Let $\theta_A$ and $\theta_B$ be the unobserved true CTRs for variants A and B respectively, and let the variable $x$ denote which variant we decide to choose. Then our loss for choosing each variant can be expressed as:

$$ \mathcal{L}(x = A) = \max(\theta_B - \theta_A, 0), \qquad \mathcal{L}(x = B) = \max(\theta_A - \theta_B, 0) $$

In other words, the loss above expresses how much we stand to lose by picking the unfortunately wrong variant based on incomplete evidence at this point in time. Of course, we do not know the true values of $\theta_A$ and $\theta_B$, so we need to estimate the loss using the posterior distributions which we computed from data. We then compute the expected loss based on the joint posterior $p(\theta_A, \theta_B)$, as such:

$$ \mathbb{E}\left[\mathcal{L}(x)\right] = \int \int \mathcal{L}(\theta_A, \theta_B, x) \, p(\theta_A, \theta_B) \, d\theta_A \, d\theta_B $$

Here, $p(\theta_A, \theta_B)$ is the joint posterior distribution, which I believe we can obtain by multiplying the two independent posteriors $p(\theta_A)$ and $p(\theta_B)$ together. We can also perform random draws from the posterior distributions to estimate this statistic. Finally, we make a decision by choosing a variant once its expected loss dips below a certain threshold, which is usually a very small value.
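A minimal sketch of estimating the expected loss by drawing from Beta posteriors, assuming a Beta(1, 1) prior and made-up click counts; none of these numbers come from Convoy.

import numpy as np

rng = np.random.default_rng(0)
# Beta(1, 1) prior updated with (clicks, non-clicks) observed so far for each variant.
theta_a = rng.beta(1 + 120, 1 + 1000 - 120, size=100_000)
theta_b = rng.beta(1 + 135, 1 + 1000 - 135, size=100_000)

expected_loss_a = np.maximum(theta_b - theta_a, 0).mean()  # expected loss if we choose A
expected_loss_b = np.maximum(theta_a - theta_b, 0).mean()  # expected loss if we choose B
# Ship the variant whose expected loss dips below our chosen threshold (e.g. 0.1% absolute CTR).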

The appeal of the bayesian approach is two-fold:

  1. It allows us to make faster decisions. Suppose an experiment is wildly successful, and it is clear within a day that variant B is better. Bayesian analysis will be able to reveal this result, whereas frequentist analysis will tell us to wait longer (since we estimated the effect size to be smaller).
  2. It allows us to control risk. Since we are making decisions based on minimizing risk (supposing we had picked the poorer variant), we may be sure that even if we are wrong, it will not severely degrade our product. So supposing that there is no significant engineering cost between variant A and B, we can more rapidly roll out new iterations with the assurance that on average our product will be improving.

Power Analysis

Reference: Probing into Minimum Sample Size by Mintao Wei

How to determine the minimum sample size required to achieve a certain significance level and power desired?

The following table helps us understand how type I and type II errors come into play:

| | Null Hypothesis: A is True | Alternate Hypothesis: B is True |
|---|---|---|
| Reject A | Type I Error | Good statistical power |
| Accept A | Good significance level | Type II Error |

Type I Error refers to rejecting the null hypothesis when it is actually true, e.g. when we think that an AA test has a significant difference. In short, it means we were too eager to deploy a poor variant. This should happen with probability $\alpha$, the significance level which we set (typically 5%). We have a better handle on the type I error because the baseline conversion rate is typically known prior to an experiment.

Type II Error refers to failing to reject the null hypothesis when the alternate is actually true, i.e. we failed to get a significant effect on an improvement that is known to be better. In short, we were too conservative and failed to deploy a winning variant. In order to reason about the type II error, we need to make a guess about the distribution of the test variant B. Typically, this is done by assuming a minimum effect $\delta$ we wish to detect, setting $\mu_B = \mu_A + \delta$, and re-using the standard deviation from A. With these assumptions in place, we can work out the type II error, which should only occur with probability $\beta$ (typically 20%). Note that since $\delta$ is the minimum effect we wish to detect, if the actual effect turns out to be larger, the type II error can only be smaller than our desired amount, which is fine.

Now we can derive the formula for the minimum sample size required to achieve the desired levels of type I and type II error respectively.

Let us define the baseline conversion rate as $\mu_A$ and the minimum relative detectable effect rate as $r$. Consequently, the minimum detectable delta is $\delta = r \mu_A$. Let the desired power level be $1 - \beta$, and the desired significance level be $\alpha$. Assume the scenario where we are running an AA or AB test with two variants of sample size $n$ each.

Firstly, we write down the distribution of the sample mean difference supposing we knew the true population means and standard deviations. Let $X_A \sim (\mu_A, \sigma_A^2)$ and $X_B \sim (\mu_B, \sigma_B^2)$. Note that $X_A, X_B$ may have arbitrary distributions, e.g. they could measure proportions, revenue etc.

Under the central limit theorem, the sample means will be distributed like so with $n$ samples each: $\bar{X}_A \sim \mathcal{N}(\mu_A, \sigma_A^2 / n)$ and $\bar{X}_B \sim \mathcal{N}(\mu_B, \sigma_B^2 / n)$. Importantly, the difference of the sample means will have the distribution below. Note that we add the variances together because $\text{Var}(X - Y) = \text{Var}(X) + \text{Var}(Y)$ for any two independent random variables $X, Y$.

$$ \bar{X}_B - \bar{X}_A \sim \mathcal{N}\left(\mu_B - \mu_A, \frac{\sigma_A^2 + \sigma_B^2}{n}\right) \tag{1} $$

Now we can start working from the desired levels to the minimum sample size. We need to ensure that both objectives below are achieved with our sample size $n$:

  1. Assuming the null hypothesis to be true, ensure that the type I error $\le \alpha$.
  2. Assuming the alternate hypothesis to be true, ensure that the type II error $\le \beta$.

Let us define some notation first.

  • Let $z_q$ denote the critical value under the standard normal distribution such that $P(Z \le z_q) = q$. This is basically the scipy.stats.norm.ppf function, e.g. $z_{0.975} = \text{norm.ppf}(0.975) \approx 1.96$.
  • We also want to denote the critical values under the distribution of the sample mean difference under the null and alternate hypotheses (these are non-standard normal distributions). Let these be $c_0$ and $c_1$ respectively.
Illustration for Power Analysis Derivation

For objective 1, assuming the null hypothesis ($\mu_B = \mu_A$) and using equation (1) above, we have $\bar{X}_B - \bar{X}_A \sim \mathcal{N}\left(0, \frac{\sigma_A^2 + \sigma_B^2}{n}\right)$. Since $\alpha$ is a two-tailed probability and we want the critical region on the right side, let $q = 1 - \alpha/2$. E.g. $\alpha = 0.05$ implies $q = 0.975$. Then:

$$ c_0 = z_{1 - \alpha/2} \sqrt{\frac{\sigma_A^2 + \sigma_B^2}{n}} \tag{2} $$

Note that equation (2) above gives the critical value such that we will reject the null hypothesis if the observed sample mean difference $\bar{X}_B - \bar{X}_A$ is greater than $c_0$. To satisfy objective 2, we must thus ensure that the probability of rejecting the null hypothesis is at least $1 - \beta$ when the alternate is true, i.e. we want $P(\bar{X}_B - \bar{X}_A > c_0) \ge 1 - \beta$. Assuming the alternate hypothesis and again using equation (1), we have $\bar{X}_B - \bar{X}_A \sim \mathcal{N}\left(\delta, \frac{\sigma_A^2 + \sigma_B^2}{n}\right)$, whose value with probability $1 - \beta$ to its right is $c_1 = \delta - z_{1-\beta} \sqrt{\frac{\sigma_A^2 + \sigma_B^2}{n}}$. So then we require:

$$ z_{1 - \alpha/2} \sqrt{\frac{\sigma_A^2 + \sigma_B^2}{n}} \le \delta - z_{1 - \beta} \sqrt{\frac{\sigma_A^2 + \sigma_B^2}{n}} $$

For the purpose of getting a minimum $n$, we assume equality. Then squaring both sides with some rearranging gives us:

$$ n \ge \frac{\left(z_{1 - \alpha/2} + z_{1 - \beta}\right)^2 \left(\sigma_A^2 + \sigma_B^2\right)}{\delta^2} \tag{3} $$

This gives us the required minimum sample size equation. If we assume $\sigma_A = \sigma_B = \sigma$, as is often done because we do not know the variance of the treatment, then it simplifies to the following form (as seen in Ron Kohavi's paper):

$$ n \ge \frac{2\left(z_{1 - \alpha/2} + z_{1 - \beta}\right)^2 \sigma^2}{\delta^2} $$

Bernoulli Events

The equation (3) above for the minimum sample size requires us to know the standard deviation under the null and alternate hypotheses. Usually, the standard deviation under the null is computed from historical data, and it is assumed that $\sigma_B = \sigma_A$. However, if the event we are interested in may be represented as a Bernoulli random variable (e.g. an impression is shown and the user either clicks or does not click with some probability), the equation may be simplified.

Specifically, the variance of a Bernoulli random variable with probability $p$ is $p(1 - p)$. Thus, if $X_A \sim \text{Bernoulli}(p_A)$, then $\sigma_A^2 = p_A(1 - p_A)$, and likewise for $X_B$.

So we can use $\sigma_A^2 = p_A(1 - p_A)$ and $\sigma_B^2 = p_B(1 - p_B)$ with $p_B = (1 + r) p_A$, and substitute these into equation (3). We will then have a minimum sample size formula that only requires specifying $\alpha$, $\beta$, the baseline conversion rate $p_A$ and the minimum relative difference $r$. This is the formula used by Evan Miller's sample size calculator.
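A minimal sketch of equation (3) for Bernoulli metrics, assuming scipy is available; the function name and defaults are mine, not from the referenced article.

import math
from scipy.stats import norm

def min_sample_size(p_a: float, rel_mde: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Minimum samples per variant to detect a relative lift of rel_mde over baseline p_a."""
    p_b = p_a * (1 + rel_mde)
    delta = p_b - p_a
    var_a, var_b = p_a * (1 - p_a), p_b * (1 - p_b)
    z_alpha = norm.ppf(1 - alpha / 2)  # two-tailed critical value, e.g. 1.96
    z_beta = norm.ppf(power)           # z_{1 - beta}, e.g. 0.84
    return math.ceil((z_alpha + z_beta) ** 2 * (var_a + var_b) / delta ** 2)

min_sample_size(p_a=0.2, rel_mde=0.05)  # detect a 5% relative lift on a 20% baseline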

Large Language Models

LLMs are generally used in an auto-regressive way, where the user supplies a prompt and the LLM returns a generated response. This framework makes it amenable to a wide range of tasks.

HuggingFace has a great blog post explaining how we can run LLMs on humble hardware. Typically, LLMs have billions of parameters. The following rules of thumb help us estimate how much memory we need to load the LLM. For a model with X billion parameters:

  • Loading in float32 requires 4X GB of VRAM
  • Loading in bfloat16 requires 2X GB of VRAM
  • Loading in int8 requires X GB of VRAM

Hence we see that we can load a ~7 billion parameters model with around 14GB of VRAM if loaded in bfloat16, which makes it feasible to run on GPUs like Tesla T4 with 16GB of VRAM. This can be done when loading the model with from_pretrained(..., torch_dtype=torch.bfloat16). Most models are trained in bfloat16 anyway, so it makes sense to load them at that precision.

A popular open-source LLM of that size is mosaicml/mpt-7b, which can be easily downloaded and used via HuggingFace.
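A minimal sketch of loading such a model in bfloat16, per the rule of thumb above; device_map="auto" assumes accelerate is installed, and mpt-7b's model card asks for trust_remote_code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mosaicml/mpt-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # ~2 bytes per parameter, so ~14 GB for 7B params
    device_map="auto",           # requires accelerate; places weights on the GPU
    trust_remote_code=True,      # mpt-7b uses custom modelling code
)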

Quantization

It turns out that we can lower the precision of models even further than 16 bits if we use a quantization method (see e.g. Dettmers 2022, this paper is the basis for the package bitsandbytes used for quantization). The general idea is akin to encoding - we encode each number from a higher precision into a "codeword" in the lower precision (i.e. quantization). Numbers that are close to one another in the higher precision may get mapped to the same "codeword". When we want to use the encoded value, we look up the value in the higher precision that it maps to (i.e. de-quantization).

When applying this to quantizing a neural network, the steps involved are:

  1. Quantize all model weights to target precision (e.g. int8)
  2. Pass the input vector at bfloat16
  3. At each layer, dequantize the weights and perform matmul in bfloat16
  4. Quantize the weights again for storage

Hence while quantization lowers the memory footprint of the model, it may increase inference time. To use quantization, we need to do the following (also make sure bitsandbytes is pip installed). We can also pass load_in_4bit=True for 4bit quantization. More info on quantization usage is available at HuggingFace. An important note is that a GPU is required for quantization, at least in the bitsandbytes package.

model = AutoModelForCausalLM.from_pretrained(..., load_in_8bit=True)

Flash Attention

The self-attention mechanism is at the heart of the transformer's performance but is also a major bottleneck in terms of memory and computational cost. One of the most successful optimizations of the attention mechanism is the Flash Attention paper (Dao 2022).

Suppose we have an input sequence of $N$ embeddings $X \in \mathbb{R}^{N \times d}$. The transformer stores parameters $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$, such that $Q = XW_Q$, $K = XW_K$ and $V = XW_V$. The self-attention matrix $S = QK^T \in \mathbb{R}^{N \times N}$ is then computed to represent the pairwise interaction between the tokens at position $i$ ($i$-th row) and position $j$ ($j$-th column). The row-wise softmax (with the usual $1/\sqrt{d}$ scaling) is taken to convert these scores into probabilities, and finally the output is $O = \text{softmax}(S)V$.

Typically, $N$ is much larger than the hidden dimension $d$, as $N$ can be 2,048 or larger. Hence the $N \times N$ matrix $S$ is the bottleneck for memory and computation. Flash attention proposes to do this computation in a block-wise manner to reduce memory usage. Furthermore, the algorithm also speeds up the computation compared to naive attention, because the block-wise implementation minimizes the number of read-write operations between the faster SRAM and the slower HBM.

More details can be found in the notebook at Dao 2022 - FlashAttention. We can utilize flash attention like so:

%pip install optimum

model.to_bettertransformer()

Note that this is only supported for models that have implemented flash attention, e.g. gpt-neox, bloom etc.

Flash Attention is now supported natively within PyTorch as torch.nn.functional.scaled_dot_product_attention (see blog). The usage is as below; we need transformers>=4.36 and torch>=2.1.1 to use it.

from optimum.bettertransformer import BetterTransformer
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2-large", torch_dtype=torch.float16)
model = BetterTransformer.transform(model, keep_original_model=False)

Position Representation

Recent innovations in position encoding have led to accuracy improvements for long input sequences. The initial attention papers used absolute position embeddings. Given an input sequence of embeddings $x_1, \ldots, x_N$, absolute position embeddings $p_1, \ldots, p_N$ are generated by the model and added to the input sequence (i.e. $x_i + p_i$), thereby allowing the model to use these position cues.

It turns out that fixed positional embeddings are not ideal because they require the model to learn a fixed, unique representation of each position $i$. This does not represent language well, because a word in position $i$ in one sentence does not necessarily serve the same purpose as a word in the same position in another sentence. Rather, it is the relative distance between words that we want to encode in the model. Furthermore, training absolute position embeddings makes it difficult for our model to generalize to texts with longer sequences than what it was trained on.

Recent papers advocate for relative positional embeddings, with the following differences:

  • Relative position encoding rather than absolute position encoding
  • The encoding of relative position is done most naturally within the self-attention matrix, since that is where the relative degree of interaction between tokens at different positions is encoded
  • The encoding should be such that tokens further apart have a lower value in the self-attention matrix and tokens closer together have a higher value

Rotary Position Embeddings (RoPE) (Su 2021) proposes rotating the query and key vectors by an angle proportional to their positions, such that their dot product depends on the relative distance between the two positions. Specifically, $q_i$ and $k_j$ are replaced by $R_{\Theta, i}\, q_i$ and $R_{\Theta, j}\, k_j$, where $R_{\Theta, m}$ is a rotation matrix whose angle is proportional to the position $m$.

Attention with Linear Biases (ALiBi) (Press 2022) proposes an even simpler method. It simply subtracts $m \cdot |i - j|$ from entry $(i, j)$ of the self-attention matrix, where $m$ is a fixed scalar (specific to each attention head). Intuitively, it penalizes the attention proportionally to the distance between the tokens. The study shows that this method outperforms RoPE as we extrapolate to longer sequences, and is conceptually simpler.
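A small sketch of the additive bias described above (my own illustration; in practice a causal mask is applied on top and $m$ differs per head):

import torch

def alibi_bias(seq_len: int, m: float) -> torch.Tensor:
    """Additive (seq_len, seq_len) bias: entry (i, j) is -m * |i - j|."""
    pos = torch.arange(seq_len)
    return -m * (pos[None, :] - pos[:, None]).abs()

scores = torch.randn(8, 8) + alibi_bias(8, m=0.25)  # add to raw attention scores before softmax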

Key-Value Cache

Most LLMs work in an auto-regressive manner, i.e. we provide an input sequence, generate the next token with the LLM, then append this token to the input sequence for the next iteration. Most LLMs are also trained with the causal language modelling objective, masking the upper triangle of the self-attention matrix so that the query token at position $i$ can only interact with the key and value tokens at positions $j \le i$. This setup encourages us to cache results from previous time steps, since a lot of computation is repeated.

The following is based on how I imagine this to work, after reading Cameron R. Wolfe's LinkedIn post. During training, we compute the projections $Q, K, V \in \mathbb{R}^{N \times d}$, where $N$ is the maximum sequence length and $d$ is the hidden dimension. The final output $O$ actually provides a $d$-dimensional representation of the model's prediction at each of the $N$ positions.

For next token generation, we can add a projection head, say $W \in \mathbb{R}^{d \times |V|}$, where $|V|$ represents the size of the vocabulary, such that $OW \in \mathbb{R}^{N \times |V|}$ represents the activations over the vocabulary at each of the $N$ positions. Specifically, row 1 represents the prediction for position 2 given input token 1, row 2 represents the prediction for position 3 given input tokens 1 to 2, and so on. These activations are then fed into a cross-entropy loss such that the activation at the correct token for each position gets rewarded. This allows us to do efficient training, since we simultaneously provide losses for the predictions at all $N$ positions to the model for backpropagation.

However, when we are doing inference generation, we only need the prediction for the final position of the input sequence (suppose it is position $t$), i.e. we are only interested in the last row of $O$ and of $OW$. Hence for starters, we only need the final query row $q_t$ instead of the entire $Q$ matrix, since only that row comes into play. However, we still need the entire $K$ and $V$ matrices, since we want $q_t$ to interact with all tokens in the input sequence. This is where the KV cache comes in: we cache the existing $K$ and $V$ matrices, so that at each step we only need to project the final token of the input sequence into $k_t$ and $v_t$ and append these to the cached $K$ and $V$. We can then compute $o_t = \text{softmax}(q_t K^T / \sqrt{d})\, V$.

As one can imagine, this saves a lot of computation, but also increases memory costs. Kwon 2023 - PagedAttention shows that serving a 13B model on NVIDIA A100 with 40GB of memory:

  • ~65% of memory is model parameters
  • ~30% is the KV cache
  • A small amount of memory is used ephemerally for activation

The usage of KV cache is like so:

model = AutoModelForCausalLM.from_pretrained(...)
model.generate(..., use_cache=True)

How to fine-tune an LLM

  • trl RL example
    • Fine tune a 20B GPT model on text generation on IMDB dataset (loaded in 8 bit)
    • Since step 1 used PEFT, we need to merge the adapter weights with the base model
    • Finally, use RLHF to generate positive movie reviews. They used a BERT IMDB sentiment classifier to generate rewards
  • DataCamp example
    • Using SFTTrainer from the trl library to do supervised fine-tuning
  • PEFT - based on LoRA - PEFT is built by HuggingFace to support LoRA.

Fine-tuning

LLMs are typically trained with the next-token prediction task on large amounts of text in an unsupervised manner. Some of the behaviour in these texts is not desirable to imitate. For example, while GitHub is full of code repositories with common programming mistakes, we do not want the LLM to replicate such behaviour. Hence, a process of alignment is necessary to encourage the model to produce desired responses.

There are typically two stages to this fine-tuning: Supervised Fine-Tuning and Reinforcement Learning from Human Feedback.

Supervised Fine-Tuning (SFT). In this stage, pairs of (prompt, desired response) are provided to the LLM. The desired responses, often called "demonstrations", are written by human annotators. Some form of cross-entropy loss is then used to update the LLM to encourage it to generate the desired response. This is a straightforward approach that is similar to the next-token prediction task (except we are predicting the desired response given the prompt). In the InstructGPT paper, the authors show that an SFT-aligned 1.3B model (using ~13k training examples) generates human-preferred outputs compared to a 175B GPT model, showing the importance of SFT.

The problem with SFT is that it trains the model to provide a very specific response to a particular prompt, which makes it hard for the model to generalize. It also fails to express the natural ambiguity of responses for an open-ended prompt (e.g. write a poem about AI). Hence, unless we can collect quality responses for a very wide variety of prompts, SFT is limited in its ability to generalize.

Reinforcement Learning from Human Feedback (RLHF). This is where RLHF comes in: given triplets of (prompt, preferred response, not preferred response), we train the model to generate the preferred response in a more generalizable way.

More recently, a method called Direct Preference Optimization is used to solve this problem without needing to do reinforcement learning. More on that here.

Useful Models

There are too many models on Huggingface, so I try to keep track of useful ones here. Generally the open_llm_leaderboard is a good place to start.

| Model Name | # Params (Billions) | openllm score | Comments |
|---|---|---|---|
| SOLAR-10.7B-Instruct-v1.0 | 10.7 | 74.2 | This model was trained using Supervised Finetuning (SFT) and Direct Preference Optimization (DPO) for single-turn chat. Seems to be derived from Llama. Verified that the outputs are sensible. |

Testing SOLAR-10.7B

Here are some results from testing out SOLAR-10.7B-Instruct-v1.0 on Google Colab with a T4 GPU. First we install packages and load the model. Note that we quantize it to 4 bits to save memory, and specify a compute dtype of torch.float16 as that is the tensor type the model was trained in. Using torch.float32 (the default) would incur unnecessary conversion cost and slow down inference.

%pip install transformers~=4.36 torch~=2.1.1 bitsandbytes accelerate optimum

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)

MODEL_NAME = "Upstage/SOLAR-10.7B-Instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    quantization_config=quantization_config,
)

We define a generate function.

def generate(prompt: str, max_length: int = 2048):  
    conversation = [{'role': 'user', 'content': prompt}] 
    prompt = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, use_cache=True, max_length=max_length)
    output_text = tokenizer.decode(outputs[0])
    return output_text

And now we can test it out on a few prompts.

Prompt: Can you explain web3 to a 5 year old? Response: Web3 is like a special world on the internet that lets people share and use things in a new and more open way. Imagine a playground where everyone can add their own swings, slides, and sandboxes without needing permission from a big boss. This makes it easier for people to work together and create new fun things online. (Inference time: 8s)

Prompt: Can a firefighter transit into the finance industry? Response: While it may not be a typical career transition, it is possible for a firefighter to transition into the finance industry. To make a successful move, the firefighter would need to acquire the necessary skills and qualifications relevant to the finance sector. This could involve pursuing additional education or training in finance, accounting, or business administration. Networking, gaining relevant work experience, and showcasing transferable skills from the firefighting profession can also help in making the transition.\n\nSome transferable skills from firefighting to finance could include strong problem-solving abilities, teamwork, decision-making under pressure, and the ability to manage risk. However, it's essential to note that the finance industry has its unique requirements, and the firefighter would need to demonstrate a strong commitment to learning and adapting to the new environment. (Inference time: 19s)

Seems to be pretty decent performance at a somewhat reasonable inference time out of the box.

Speeding up GPU inference.

Gemma Models

References:

Google released open source small models under the Gemma label. It comes in two sizes: 2B and 7B parameters, each with base (pretrained) and instruction-tuned versions. All have a context length of 8k tokens. Gemma integration is supported with transformers==4.38. Note that user consent is needed to access Gemma models, so one needs to create a HuggingFace account and create an access token.

Sample Code

# Use model_name="google/gemma-7b-it"
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

TOKEN = "..."  # HuggingFace access token (consent to the Gemma license first)

def load(model_name: str):
    tokenizer = AutoTokenizer.from_pretrained(model_name, token=TOKEN)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,  # nested quantization
            bnb_4bit_quant_type="nf4",  # normalized float 4 quantization
        ),
        token=TOKEN,
    )
    return model, tokenizer

Inference Speed

Tested out the inference speed of the model using the following prompt: Can you recommend good alternative occupations for a occupation_name? Please be concise.

An example is as follows, for an occupation counsellor:

Sure, here are some alternative occupations for a counsellor:

* **Social worker:** Similar to counselling, social workers help individuals and families facing various challenges.
* **Community organizer:** Organizes and leads community initiatives to address social issues.
* **Teacher:** Can be a good fit for those who enjoy working with children and adolescents.
* **Human resource manager:** HR managers handle employee relations and conflicts.
* **Public relations specialist:** Helps organizations communicate effectively with the public.

The inference speed is about 8-10 seconds with 4-bit quantization.

Gemma-2b

Testing out google/gemma-2b-it shows that it has poorer performance but faster inference speed at 2 billion parameters. Here is an example of the results for the same prompt.

* **Life coach**
* **Coach in education**
* **Career coach**
* **Executive coach**
* **Leadership coach**
* **Management consultant**
* **Business development consultant**
* **Financial advisor**
* **Investment banker**

The inference speed is as follows:

  • Single: 3s
  • Batch of 4: 14s
  • Batch of 8: 16s
  • Batch of 16: 20s
  • Batch of 32: OOM on T4 GPU

Phi-2

Mobius Labs fine-tuned the phi-2 model from Microsoft which seems promising, and released it under mobiuslabsgmbh/aanaphi2-v0.1. The output seems better than gemma-2b-it.

1. Social worker
2. Mental health therapist
3. School counselor
4. Employee assistance program (EAP) specialist
5. Rehabilitation counselor
6. Family therapist
7. Substance abuse counselor
8. Career counselor
9. Trauma-focused therapist
10. Child and adolescent therapist

These occupations involve working with individuals, families, and communities to promote mental health and well-being, and may provide similar skills and experiences to those of a counsellor.

Flash attention, torch.compile, quantization.

Encoder vs Decoder

There are broadly two categories of transformer LLM architectures: the Encoder-Decoder architecture (as in the original Transformer) and the Decoder-only architecture (typified by the GPT series). There are some innate differences between the two that affect the type of application each is well suited for.

Encoder-Decoder Architecture

The Encoder-Decoder architecture was proposed in the original Attention is All You Need paper by Vaswani et al. As the name suggests, there is a separate encoder and decoder module. The design choice of two distinct modules is easier to understand when we consider that the original paper was designed for the machine translation task, such that the inputs may be in English and the outputs may be in French.

On the encoder side, the encoder module sees the full input sequence and encodes it into a numeric representation matrix of size (seq_len, embed_dim). There is no input masking required because we assume we see the full English text before we begin translation. This encoded representation is passed to the decoder module as context.

On the decoder-side, we cannot allow the decoder module to see the full output sequence during training, since that would cause data leakage and the model to collapse. Note that the decoder simultaneously (in one forward pass) predicts a token for each position in the output sequence during training. Hence to avoid data leakage, causal masking is applied such that for each position being predicted, the decoder module is only allowed to "see" tokens in previous positions.

Note that this can be implemented simply by masking the attention matrix in each self-attention block in a triangular manner. The feedforward layers only map information from position i to position i, so there is no need to adjust that.

Encoder-Decoder Architecture (From Attention is All You Need)

The decoder module itself creates a representation of the output sequence (with causal masking) of size (seq_len, embed_dim). Now for the tricky part of merging the encoder and decoder representations. Naively, we can simply concatenate the two to form a representation of (seq_len x 2, embed_dim), and add further self-attention blocks (with appropriate causal masking). However, this would increase the number of parameters of the model.

The paper instead chose to take the Q matrix from the decoder and the K, V matrices from the encoder. Since the attention mechanism is $\text{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V$, this allows every token in the decoder representation to attend to every token in the encoder representation. The result is a weighted combination of the V vectors taken from the encoder representation, which is eventually used to generate the tokens for each decoder position.
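A minimal sketch of this cross-attention step, with shapes and names of my own choosing:

import math
import torch

def cross_attention(dec: torch.Tensor, enc: torch.Tensor,
                    w_q: torch.Tensor, w_k: torch.Tensor, w_v: torch.Tensor) -> torch.Tensor:
    """dec: (dec_len, d), enc: (enc_len, d); w_q, w_k, w_v: (d, d) projection matrices."""
    q, k, v = dec @ w_q, enc @ w_k, enc @ w_v
    attn = torch.softmax(q @ k.T / math.sqrt(q.size(-1)), dim=-1)  # (dec_len, enc_len)
    return attn @ v  # each decoder position is a weighted mix of encoder V vectors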

Here we can already observe some inductive biases baked into the encoder-decoder architecture. These comments are inspired by / taken from Hyung Won Chung's lecture:

  1. There is a clear distinction between the tokens on the encoder-side and the decoder-side, such that tokens on either side cannot directly attend to each other until the final block where the mixing happens. This may be suitable for the translation task but not for generic chat completion, where it is quite arbitrary which point in the conversation we treat as the encoding and which point we treat as the decoding.
  2. The V matrix is taken entirely from the encoder-side, which may again limit the expressiveness of the model (what if there are important tokens on the decoder-side that would be useful to take the V values from?). One may argue that in translation, the tokens on the encoder-side capture the full meaning of the sentence to be translated, so it is comprehensive. But the same is not true of generic chat completions.
  3. The separation of an encoder and a decoder module presumes a significant difference between the roles of the tokens on the encoder-side and the decoder-side. Again, this inductive bias suits translation-style tasks more than general-purpose generation.

Miscellaneous Notes

A collection of miscellaneous, useful notes.

f-strings

To surround a text with a symbol (say =) to a fixed length:

>>> text = "Title"
>>> print(f"{text:=^20}")

=======Title========

Vim

Command to interactively change each 'foo' to 'bar'. :%s triggers the substitute command across all lines, followed by the search and replace phrases respectively. Finally, g means replace all occurrences on a line and c means ask for confirmation. Note that :s alone will only do the same for the current line.

:%s/foo/bar/gc

Find Files

To find files under the current directory with the filename python using bash, use:

find . -name python

We can add * before and/or after the filename to allow other characters before or after our keyword (quoting the pattern so that the shell does not expand it first):

find . -name "*python*"

To search not just in the filename but also in the full path (e.g. we only want to search in Desktop), we can do:

find . -wholename "*Desktop*python*"

Note that if we want to locate executable binaries, another useful command is whereis:

whereis cat
---
cat: /usr/bin/cat /usr/share/man/man1/cat.1.gz

Numpy Indexing

Suppose we have a 2D array X and would like to take a slice of certain rows and columns. We might try, for example, to take the first two rows and the 3rd/4th columns of X, expecting to get a 2 by 2 matrix.

import numpy as np
X = np.random.randn(5, 5)
X[[0, 1], [2, 3]]

Unfortunately, this will return an array containing the two elements X[0, 2] and X[1, 3], which is not what we want. Instead, a slightly inefficient but clear way is to slice each axis separately. It is not optimal because the first slice creates a temporary array before the second slice is applied.

X[[0, 1], :][:, [2, 3]]

Finally, the recommended way seems to be to use np.ix_. Doing np.ix_([0, 1], [2, 3]) creates a tuple of two elements.

idx = np.ix_([0, 1], [2, 3])
idx
>> (array([[0],
           [1]]),
>>  array([[2, 3]]))

Indexing the original array with this output X[idx] will then give us what we want.

asyncio

reference

asyncio is a single-threaded framework that does not use multi-threading or multi-processing to speed up tasks. Instead, a coordinator (or event loop) passes control from a time-consuming blocking function (e.g. time.sleep or an I/O operation) to other functions to run. This passing of control occurs with the await keyword. When the blocking function is completed, it notifies the coordinator and control returns to where the blocking function left off.

asyncio does not speed up CPU-bound tasks, due to its single-threaded design. It only works when the function being awaited is an I/O operation that is supported by asyncio. This includes stuff like:

  • HTTP (supported through aiohttp)
  • DB calls (e.g. aioredis)

asyncio is preferred for such tasks (e.g. making multiple I/O calls and doing something when they all return) over multi-processing because of its single-thread design, making it easier to debug.
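A minimal sketch of awaiting multiple I/O-bound tasks concurrently; the 1-second sleep is a stand-in for a real awaitable I/O call (e.g. an aiohttp request), and the names are illustrative.

import asyncio

async def fetch(name: str) -> str:
    await asyncio.sleep(1)  # control is handed back to the event loop here
    return f"{name} done"

async def main() -> None:
    # All three "requests" run concurrently, so this takes ~1s rather than ~3s.
    results = await asyncio.gather(fetch("a"), fetch("b"), fetch("c"))
    print(results)

asyncio.run(main())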

relatedness

In the context of information retrieval, Trey Grainger in AI-Powered Search suggests a relatedness measure to connect arbitrary entities together. Suppose we have a collection of jobs and each job is tagged with a set of skills. Suppose we wish to retrieve skills relevant to an arbitrary free-text query $q$.

The relatedness idea is to define a foreground of documents related to the query (e.g. the set of documents retrieved using query $q$), and to compare the attributes of the foreground against the background, i.e. all documents.

Mathematically, we can think of the foreground documents as a sample, and the background documents as the population. The strength of the relationship between a skill $t$ and the query $q$ may then be defined as the z-statistic of a one-sample z-test of proportions, comparing the occurrence of skill $t$ in the foreground sample against the background population. A significantly greater occurrence in the sample compared to the population suggests a strong relationship between $t$ and $q$, and vice versa. Specifically:

$$ z = \frac{\hat{p} - p}{\sqrt{\frac{p(1 - p)}{n}}} $$

Where:

  • $\hat{p} = f_{t,q} / n$ is the sample proportion.
  • $f_{t,q}$ is the number of documents in the foreground corresponding to query $q$ that contain skill $t$.
  • $n$ is the total number of documents in the foreground corresponding to query $q$; it is also the number of samples.
  • $p$ is the probability of skill $t$ appearing across all documents (the background).

By performing a retrieval and ranking skills based on the z-statistic, we can compute a relationship between any arbitrary query and attribute of the documents (on the fly, if necessary). This functionality is implemented in solr.
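A minimal sketch of this z-statistic; the counts and names are my own illustration, not Solr's API.

import math

def relatedness(fg_with_skill: int, fg_total: int, bg_with_skill: int, bg_total: int) -> float:
    """One-sample z-test of proportions: foreground sample vs background population."""
    p_hat = fg_with_skill / fg_total  # proportion of foreground docs containing the skill
    p = bg_with_skill / bg_total      # background (population) proportion
    return (p_hat - p) / math.sqrt(p * (1 - p) / fg_total)

relatedness(fg_with_skill=40, fg_total=200, bg_with_skill=500, bg_total=100_000)  # large positive z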

Package Versioning

link

SemVer is the most common versioning convention for Python packages. For a package version of e.g. 1.2.3:

  • The first number 1 is the major version. We update it when we make major API changes. A major of 0 indicates that the package is under initial development and anything may change.
  • The second number 2 is the minor version. We update it when we add functionality in a backward compatible manner.
  • The third number 3 is the patch version. We update it when we patch bugs in a backward compatible manner.

Poetry is a useful tool to manage package dependencies in a Python library. In the pyproject.toml file, we specify package dependencies in the following manner:

[tool.poetry.dependencies]
python = ">=3.8,<3.11"
requests = "^2.26.0"
pytz = "~2022.1"

The caret requirement (e.g. requests = "^2.26.0") means that SemVer compatible changes are allowed, i.e. an update is allowed if the version number does not modify the left-most non-zero digit. E.g.:

  • ^1.2.3 means that 1.3.0 is allowed but not 2.0.0
  • ^0.2.4 means that 0.2.5 is allowed but not 0.3.0

The tilde requirement is somewhat stricter. It specifies a minimal version but allows room to upgrade depending on how many digits are supplied. If major.minor.patch or major.minor is supplied, only patch-level changes are allowed. If major is supplied, minor and patch level changes are allowed.

  • ~1.2.3 means that 1.2.7 is allowed but not 1.3.0
  • ~1.2 means that 1.2.7 is allowed but not 1.3.0
  • ~1 means that 1.3.0 is allowed but not 2.0.0

The poetry versioning specification depends on library developers updating their versions in a disciplined way. It gives us some flexibility in package versioning while avoiding upgrading a dependency to a version that breaks our code. Having more flexibility in specifying package dependencies also reduces the risk of version conflicts between dependencies.
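
The caret and tilde constraints simply translate into explicit version ranges. A small sketch using the packaging library (not part of Poetry) to check candidate versions against the equivalent bounds of the examples above:

from packaging.specifiers import SpecifierSet
from packaging.version import Version

caret = SpecifierSet(">=1.2.3,<2.0.0")  # equivalent bounds for ^1.2.3
tilde = SpecifierSet(">=1.2.3,<1.3.0")  # equivalent bounds for ~1.2.3

print(Version("1.3.0") in caret)  # True
print(Version("2.0.0") in caret)  # False
print(Version("1.2.7") in tilde)  # True
print(Version("1.3.0") in tilde)  # False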

Parallelism in Python

References: realpython, superfastpython

The Python Global Interpreter Lock (GIL) is a lock that allows only one thread to execute Python bytecode at a time. This design exists because CPython's memory management is not thread-safe: without the GIL, multiple threads modifying memory at the same time could cause race conditions and other fatal errors. The GIL also allowed non-thread-safe C libraries to be easily integrated into the Python ecosystem, which contributed to Python's success as a language.

The GIL also improves performance for single-threaded programs, as only a single lock needs to be managed (not sure how this works). There have been attempts to remove the GIL from Python, but so far none has made it into the default CPython build, largely because they degrade single-threaded performance.

For many C extensions (e.g. numpy), multi-threading is still possible because these extensions are allowed to manually release the GIL (as long as they restore things back to normal when the functions return). This allows us to still use multi-threading for CPU-intensive functions with the GIL around. Similarly for Rust, we can release the GIL to achieve parallelism.

Alternatively, we can use multiprocessing to create multiple processes (instead of threads). Each process contains its own Python interpreter (and GIL) and hence can run truly in parallel. The downside is that the overhead of creating, managing and communicating between processes is much higher than for threads, which dampens the benefits relative to multi-threading.
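
A minimal sketch of the difference for a CPU-bound, pure-Python function, using the standard library executors (timings are indicative only):

import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n: int = 5_000_000) -> int:
    # pure-Python loop: it holds the GIL, so threads cannot run it in parallel
    return sum(i * i for i in range(n))

def timed(executor_cls):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as ex:
        list(ex.map(cpu_bound, [5_000_000] * 4))
    return time.perf_counter() - start

if __name__ == "__main__":
    print("threads  :", timed(ThreadPoolExecutor))   # roughly serial (~4x one task)
    print("processes:", timed(ProcessPoolExecutor))  # closer to 1x, plus process overhead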

Memory Profiling

It is often useful to profile the memory usage of our script. In Python, we can use memory_profiler to check the memory usage of our program line by line.

from memory_profiler import profile
import sys


@profile(stream=sys.stdout)
def f():
    a = [1] * (10**6)
    b = [2] * (2 * 10**7)
    del b


if __name__ == "__main__":
    f()

This will print the following useful information to stdout. Note that even before we do anything, there is a background memory usage of about 17 MiB.

Filename: memory_profiling.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
     5     17.1 MiB     17.1 MiB           1   @profile(stream=sys.stdout)
     6                                         def f():
     7     24.5 MiB      7.5 MiB           1       a = [1] * (10**6)
     8    177.2 MiB    152.6 MiB           1       b = [2] * (2 * 10**7)
     9     24.8 MiB   -152.4 MiB           1       del b

We might also want to track memory usage of a function over time. We can use memory_usage instead for that.

import time
from memory_profiler import memory_usage
def g(a, n: int = 100):
    time.sleep(1)
    b = [a] * n
    time.sleep(1)
    del b
    time.sleep(1)

if __name__ == "__main__":
    usage = memory_usage((g, (1,), {"n": int(1e7)}), interval=0.5)
    print(usage)

This will give us an array like so, showing the spike in memory in the middle of g.

[17.375, 17.5234375, 17.5234375, 19.34765625, 93.59765625, 93.59765625, 17.53125, 17.53125, 17.53125]

CUDA

PyTorch may sometimes throw errors if the installed torch version does not match the installed CUDA version. To address this, we need to first check the CUDA version using the nvcc command:

/usr/local/cuda/bin/nvcc --version
---
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0

Then install the torch build that matches the CUDA version (adjust the index URL suffix accordingly, e.g. cu117 for CUDA 11.7, cu118 for 11.8):

pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
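
After installation, a quick sanity check using standard torch attributes confirms which CUDA build the wheel was compiled against and whether the GPU is visible:

import torch

print(torch.__version__)          # e.g. 2.x.x+cu118
print(torch.version.cuda)         # CUDA version the wheel was built against
print(torch.cuda.is_available())  # True if the driver and GPU are usable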

List Comprehension

When doing nested list comprehensions in Python, the later for clause becomes the inner loop. So this:

l = [i * j for i in range(4) for j in range(5)]

Is equivalent to:

l = []
for i in range(4):
    for j in range(5):
        l.append(i*j)

This also explains why flattening a list of lists is of the following form:

list_of_lists = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
flattened = [item for sublist in list_of_lists for item in sublist]

When we translate it into a nested for loop, we can see that the inner loop should indeed be placed at the end:

flattened = []
for sublist in list_of_lists:
    for item in sublist:
        flattened.append(item)

Bradley-Terry Model

Based on Wikipedia. The Bradley-Terry model (1952) is a probability model that allows us to infer scores for individual objects from a dataset of pairwise comparisons between them. Specifically, it estimates the probability that $i \succ j$ (i.e. object $i$ is preferred to object $j$) as:

$$P(i \succ j) = \frac{p_i}{p_i + p_j}$$

Where $p_i$ is a positive real-valued score assigned to object $i$ (not necessarily a probability). Typically, $p_i$ is parametrized as an exponential score $p_i = e^{\beta_i}$, and the goal is to learn the parameters $\beta_i$ from pairwise comparisons. This results in:

$$P(i \succ j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}}$$

Parameter Estimation

Parameter estimation is typically done using maximum likelihood. Starting with a set of pairwise comparisons between individual objects, let $w_{ij}$ be the number of times object $i$ beats object $j$. Then the (log) likelihood of a given set of parameters $p_1, \dots, p_n$ (where $n$ denotes the number of objects) is as follows:

$$\ln L(p_1, \dots, p_n) = \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} \ln \left( \frac{p_i}{p_i + p_j} \right)$$

This likelihood can then be maximized by differentiating with respect to each $p_i$ and setting the derivative to zero.

BT Model as a Sigmoid Function

We can also express the preference probability as a function of the difference in scores $\beta_i - \beta_j$. Recall that the sigmoid function is $\sigma(x) = \frac{1}{1 + e^{-x}}$. Then:

$$\begin{aligned} P(i \succ j) &= \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}} \\ &= \sigma(\beta_i - \beta_j) \end{aligned}$$

The derivation for the second line above is found at /Identities/sigmoid. This re-parametrization shows that the BT-model is really modelling the preference as a difference in scores and then running that through the sigmoid function to convert it into a probability. This means that we are basically running a pairwise logistic regression.
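
A minimal sketch of fitting the scores by gradient ascent on the log-likelihood; fit_bradley_terry is a hypothetical helper assuming a win-count matrix w where w[i, j] is the number of times object i beat object j:

import numpy as np

def fit_bradley_terry(w, n_iter=2000, lr=0.5):
    n = w.shape[0]
    beta = np.zeros(n)
    for _ in range(n_iter):
        p = np.exp(beta)
        prob = p[:, None] / (p[:, None] + p[None, :])  # prob[i, j] = P(i beats j)
        np.fill_diagonal(prob, 0.0)
        # gradient of the log-likelihood wrt beta_i:
        # sum_j w_ij * (1 - prob_ij) - sum_j w_ji * prob_ij
        grad = (w * (1 - prob)).sum(axis=1) - (w.T * prob).sum(axis=1)
        beta += lr * grad / max(w.sum(), 1)
        beta -= beta.mean()  # scores are only identified up to an additive constant
    return beta

# toy example: object 0 usually beats 1, which usually beats 2
w = np.array([[0, 8, 9], [2, 0, 7], [1, 3, 0]], dtype=float)
print(fit_bradley_terry(w))  # expect beta_0 > beta_1 > beta_2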

Setting up WSL

We may need to uninstall and re-install WSL (Windows Subsystem for Linux) from time to time. Here is the step-by-step.

  1. Tear down and delete all files. wsl --unregister Ubuntu-22.04
  2. Re-install wsl. wsl --install Ubuntu-22.04
  3. Set up keychain to auto-find ssh-agent (from Windows) and add keys
    • Copy .ssh folder from Windows to wsl ~/.ssh
    • Add to ~/.bashrc the following: eval $(keychain --eval --agents ssh id_rsa)
  4. Install Docker
    • Add users to docker to allow vscode access
    • https://docs.docker.com/engine/install/linux-postinstall/
  5. Essentials (get C linker)
    • sudo apt install build-essential
  6. Install rust
    • https://www.rust-lang.org/tools/install
  7. Install mdbook
    • cargo install mdbook mdbook-katex

To Read

  • https://netflixtechblog.com/improve-your-next-experiment-by-learning-better-proxy-metrics-from-past-experiments-64c786c2a3ac
  • https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df
  • https://arxiv.org/abs/2407.02464
  • https://arxiv.org/abs/2305.12102
  • https://arxiv.org/abs/1709.03933
  • https://arxiv.org/abs/2101.08769
  • https://arxiv.org/abs/1205.2618 # BPR
  • https://arxiv.org/abs/2312.08520 # InfoNCE+
  • https://arxiv.org/abs/2308.06091 # MAWU
  • https://arxiv.org/abs/2109.12613 # CCL
  • https://arxiv.org/abs/2201.02327 # SSM
  • https://arxiv.org/abs/2206.12811 # DirectAU
  • KG-BERT

Packages

KeyBERT, KeyLLM

References:

KeyBERT and KeyLLM are packages to perform unsupervised keyword extraction from text. KeyBERT relies on BERT-based models, and the main idea is to extract n-gram phrases which have high semantic similarity to the overall document embedding. Some additional features are:

  • Allow user to specify phrase length for extraction
  • Add diversification via MMR to get diverse phrases

KeyLLM taps on LLMs to enhance the keyword extraction. Basically, it creates a prompt to ask an LLM to extract keywords from a document. It integrates with KeyBERT such that we can use KeyBERT to cluster documents, and only run KeyLLM on one document per cluster to save costs. It can also use KeyBERT to suggest candidates and use the LLM to verify.

Skills

This documents LinkedIn's approach to constructing a knowledge graph around skills.

November 2022 - Building LinkedIn's Skills Graph to Power a Skills-First World

An important model for LinkedIn is the skills graph - it maps 39k skills over 26 languages and over 347k aliases. The components of their structured skills graph:

  • 39k nodes, each of which is a skill
  • Each skill has multiple aliases
  • Skills are connected via edges which specify a hierarchical relationship. (e.g. Marketing is a parent of Demand Generation which is a parent of Email Marketing)

Skills Usage. Skills are automatically extracted from job postings and member histories. Job postings are allowed to tag up to 10 skills. Users can see something like 4/10 skills match your profile to help them determine job fit.

The skills graph is constructed both using ML and manual review by taxonomists.

March 2023 - Building and maintaining the skills taxonomy that powers LinkedIn's Skills Graph

This article is more about how the Skills Graph is constructed.

Each skill node includes the following foundational details (e.g. Machine Learning):

  • Description of the skill: the study of computer algorithms...
  • Aliases: ML, ...
  • Skills type: Hard or Soft
  • Skill ID

Each skill is represented by a node in the graph and nodes are linked via edges called knowledge lineages. Skills can relate for various reasons. Edges are directed and represent a parent-child relationship (e.g. Software Development -> Back-end Software Development -> Python)

Each node can have multiple parents and/or children.

  • Multiple parents example: Both Back-end Software Development and Mobile Software Development are parents of Java
  • Multiple children example: Supply Chain Management has children Supply Chain Engineering, Logistics Management and Digital Supply Chain

The hierarchical relationship allows us to enrich skills understanding, since if a person knows a particular skill, we may infer that he knows something about all the parent nodes and perhaps some of the "sibling" nodes.

LinkedIn has certain quality guardrails on the structured skills, one of which is discouraging ambiguity. For example, an ambiguous skill like Networking may be mapped to skills like Computer Networking or Professional Networking. This type of ambiguous relationship is not allowed, and LinkedIn either removes such edges or disallows such a node. The meaning of a phrase is determined by analyzing how the skill is predominantly used on LinkedIn. In cases where a phrase can have divergent meanings, the skill is disambiguated by expanding the phrase. For example, Cadence is disambiguated to Cadence Software and Boundary to Boundary Line.

Architecture. The components to their ecosystem are as follows:

  1. Human Curated KG. Presumably, this is a purely human-curated KG where all nodes and edges are taken as true, and constantly curated by human taxonomists.
  2. AI-generated KG. This is generated using ML models which use the human curated KG as training and validation data. The model behind this is KGBERT, which will be briefly covered below.
  3. Serving. Both the AI-generated KG and human-generated KG are made available to all LinkedIn services via a REST API for online serving and also on HDFS to power offline inference.

KGBERT is a model for automatic edge prediction, which can generate the AI-generated KG from the human curated one. The basic idea is that the human curated KG is used to generate training and validation data in the following form:

[CLS] Tensorflow [SEP] Machine Learning [SEP] -> label: child - parent

Two skills are concatenated to form the context. The skill is represented at random by its title, description or title + description. The context is fed into a BERT model and a linear + softmax layer is attached to the [CLS] token to generate a softmax probability over 3 options:

  • Parent -> Child
  • Child -> Parent
  • No relation

The positive labels are taken from edges in the human curated graph. The negative labels are generated by some heuristics:

  • Skill pairs from different industries
  • Niece / Nephew pairs e.g. Tensorflow vs Cognitive Computing
  • Sibling pairs with the same parent e.g. Tensorflow vs Pytorch
  • Loosely related pairs that are 3 or more steps apart

The nice thing about this architecture is the clean separation of the human curated graph from the AI-generated one. The human-in-the-loop setup allows humans to review the AI-generated graph and add new edges to the human curated graph, which in turn improves the AI-generated graph. This seems superior over mixing both AI-generated edges and human-curated edges in the same graph.
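
Not LinkedIn's actual code, but a minimal sketch of how such a 3-way edge classifier could be set up with HuggingFace transformers (the model name and label ordering are placeholders):

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3  # parent->child, child->parent, no relation
)

# skill pair encoded as [CLS] skill_a [SEP] skill_b [SEP]
inputs = tokenizer("Tensorflow", "Machine Learning", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # untrained head here; fine-tune on human-curated edges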

December 2023 - Extracting skills from content to fuel the LinkedIn Skills Graph

This post takes a deeper look into how skills are extracted from data by LinkedIn from various contexts, such as job listings or member profiles.

Note that skills can be mentioned either directly or indirectly (e.g. you are expected to know how to apply different techniques to extract information from data and communicate insights through meaningful visualizations). Hence, a simple span extraction approach will not be exhaustive in extracting skills. On the other hand, not constraining the problem to span extraction could lead to false positive errors, so our model has to be very accurate.

LinkedIn Skills Extraction
An overview of the skills extraction architecture

At a high level, the steps are:

  1. The Skill Segmenter organizes the raw text into structured input, e.g. a resume is split into the qualification, career history etc. sections
  2. The next step aims to generate a candidate list of skills from the context:
    • 2a. The Text Skill Tagger is a trie-based model that simply finds matches in the text that exactly match a node in the Skills Taxonomy. This puts the burden on the Skills Taxonomy to have all the aliases that cover every possible utterance of the skill. However, this model is very fast
    • 2b. The Semantic Tagger aims to overcome the coverage issues above and uses BERT semantic match to surface more candidates
    • 2c. The skills from 2a. and 2b. are expanded into more skills using the Skills Graph. It seems like they add immediate parents, children and siblings of each skill.
  3. Each skill candidate is scored against the context to generate a confidence score. This section has a Shared Model and Domain-specific Model.
    • The Shared Model contains a context-encoder, which encodes the text surrounding the skill into an embedding, and an entity-encoder, which encodes surrounding skills, the job title, etc. All these embeddings are fed as context into the next stage. The Shared Model assumes that the relationship between the context surrounding each skill and the skill itself is constant across the different domains and thus benefits from a shared module. Their AB tests support this hypothesis and show that such multi-task training does yield a lift in online metrics.
    • Each domain (job posting, member profile, course) has its own scoring model. The embeddings from the Shared Model and each skill candidate are fed into the Domain-specific Model, which is presumably a form of cross encoder that generates a confidence score for each skill candidate. Finally, some threshold is applied and a final list of skills is generated.

LinkedIn needs to generate skill extractions within 100ms for a high volume of edits. Hence, they used knowledge distillation to shrink the BERT model by 80% in parameter count.

As we can see, the LinkedIn architecture is fairly complex and context specific. A natural question for smaller players is how we can tap on LLMs today to simplify parts of this architecture to achieve comparable performance.

Identities

Sigmoid

Sigmoid Relationship to Bradley-Terry Model

Suppose we have scores $\beta_i$ and $\beta_j$ for objects $i$ and $j$. Under the Bradley-Terry model, we can express the preference probability as follows:

$$P(i \succ j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}} = \frac{1}{1 + e^{-(\beta_i - \beta_j)}} = \sigma(\beta_i - \beta_j)$$

Where $\sigma(x) = \frac{1}{1 + e^{-x}}$ is the sigmoid function. The middle step follows from multiplying the numerator and denominator by $e^{-\beta_i}$.

Statistics

Adding / Subtracting RVs

Let $X, Y$ be two independent random variables. Then $\mathrm{Var}(X \pm Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

First, show that $\mathrm{Var}(aX) = a^2 \mathrm{Var}(X)$ where $a$ is a constant. Proof:

$$\mathrm{Var}(aX) = E\big[(aX)^2\big] - \big(E[aX]\big)^2 = a^2 E[X^2] - a^2 \big(E[X]\big)^2 = a^2 \mathrm{Var}(X)$$

Combining this with the additivity of variance for independent variables, $\mathrm{Var}(X - Y) = \mathrm{Var}(X) + \mathrm{Var}(-Y) = \mathrm{Var}(X) + (-1)^2 \mathrm{Var}(Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

Reference: https://apcentral.collegeboard.org/courses/ap-statistics/classroom-resources/why-variances-add-and-why-it-matters

Summaries of individual papers that I have taken a deeper look into.

Weinberger 2009 - Hashing for Multitask Learning

Weinberger 2009 - Feature Hashing for Large Scale Multitask Learning

This paper proposes a method to represent a large feature set in a smaller space by using hashing. It shows analytically that with a sufficiently large hash dimension $m$:

  • The inner product between instances is preserved, i.e. doing a dot product between instances in the hashed dimension approximates the true dot product in the original dimension
  • The same applies to learning a weight vector to generate predictions in the hashed space: the approximation error goes to zero as $m$ increases

Setup

Consider having data points $x \in \mathbb{R}^d$, where the input dimension $d$ can be very large (e.g. millions). This setting is easily realized when we use, for example, word bi-grams and tri-grams as term-frequency features to perform some kind of text classification task. Such a large feature vector is unwieldy, and also inefficient since the feature vector is very sparse for any given text.

The hashing trick maps the high dimensional input vector to a smaller feature space via a mapping $\phi: \mathbb{R}^d \to \mathbb{R}^m$, such that $m \ll d$.

We start with the following definitions:

  • Let $h: \mathbb{N} \to \{1, \dots, m\}$ be a hash function that maps each input index to a position in the hashed space
  • Let $\xi: \mathbb{N} \to \{-1, +1\}$ be a hash function that maps each input index to a sign

Note that while the definitions map from an input integer, we may apply them to texts as well, since any finite-length text may be assigned to a unique integer. This is typically done in practice by applying some hash algorithm to a given string, and then using the modulo function to restrict it to the desired range.

With this, and given two vectors $x, x'$, we define the hash feature map element-wise:

$$\phi_i(x) = \sum_{j:\, h(j) = i} \xi(j)\, x_j$$

Where $i$ is an index in the hashed dimension space, and $j$ is an index in the input dimension space. We get a hash collision if more than one input index $j$ is hashed into a given position $i$. For brevity, we may just write $\phi(x)$ for the full hashed vector.

Analysis

With this setup, the paper aims to prove analytically that hashing in this manner preserves the characteristics of the original space. In other words, we can significantly reduce the dimension of our features but achieve the same predictive effect as the original space by doing the hashing trick. This also means that the detrimental effect of hash collisions is minimal with a sufficiently large .

We won't trace through all the results, just the important and simple ones.

Lemma 2 The hash kernel is unbiased, i.e. $E_\phi\big[\langle \phi(x), \phi(x') \rangle\big] = \langle x, x' \rangle$.

Proof. The proof simply starts by expanding the inner product in the hashed space as follows:

$$\langle \phi(x), \phi(x') \rangle = \sum_{i} \sum_{j} \xi(i)\,\xi(j)\, x_i\, x'_j\, \delta_{ij}$$

Where $\delta_{ij}$ is an indicator variable which takes the value 1 if $h(i) = h(j)$ (i.e. they are hashed to the same position) and 0 otherwise.

To see that this expansion is true, consider a position in the hashed space, e.g. $p$. The contribution to the inner product at position $p$ is $\big( \sum_{i:\, h(i) = p} \xi(i)\, x_i \big) \big( \sum_{j:\, h(j) = p} \xi(j)\, x'_j \big)$. Summing over all positions and using $\delta_{ij}$ to mark the index pairs that share a hash position gives the expansion above (if $i$ and $j$ are hashed to different positions, they clearly do not interact in an inner product).

Now note that we can decompose the expectation over $\phi$ into its independent constituents, i.e. over the sign hash $\xi$ and the position hash $h$ respectively (since the two hash functions are independent):

$$E_\phi\big[\langle \phi(x), \phi(x') \rangle\big] = \sum_{i} \sum_{j} E_\xi\big[\xi(i)\,\xi(j)\big]\, E_h\big[\delta_{ij}\big]\, x_i\, x'_j$$

Now we just need to observe that the sign hashes are independent from all other terms in general, and also independent from each other whenever $i \neq j$ (provided our hash function is pairwise independent). Thus when $i \neq j$, the summand contains the factor:

$$E_\xi\big[\xi(i)\,\xi(j)\big] = E_\xi\big[\xi(i)\big]\, E_\xi\big[\xi(j)\big]$$

These are clearly zero because $E_\xi\big[\xi(i)\big] = 0$. So the original summation reduces to the $i = j$ terms:

$$E_\phi\big[\langle \phi(x), \phi(x') \rangle\big] = \sum_{i} E_\xi\big[\xi(i)^2\big]\, x_i\, x'_i = \sum_i x_i\, x'_i = \langle x, x' \rangle$$

Not only is the hashed inner product unbiased, it also has a variance that scales down as $O(1/m)$. The proof does a similar but more tedious expansion as the above, and assumes that $x, x'$ have l2-norm of 1. This suggests that the hashed inner product will be concentrated within $O(1/\sqrt{m})$ of the true value.

These results are sufficient to justify use of the hashed inner product space in practice. That is, we can perform recommendations in the hashed space with a sufficiently large $m$ (which we can tune using validation error) to make the large feature space tractable. The paper goes on to prove more detailed bounds on the error and norm which are of less practical significance.

Multi-task Learning

The authors argue that this method is especially useful in the multi-task learning setting. Consider an email spam classification task where the vocabulary space is $V$ and the user space is $U$. The parameter space is thus $|V| \times |U|$, i.e. we wish to learn a user-specific weight vector $w_u$ for each user $u$, which allows us to personalize the spam filter for each user (different users have slightly differing definitions of what is spam).

The authors suggest the following approach:

  • Use the hashing trick to hash each term into the hashed space, e.g. the token data is passed into a global hash function $h_0$ and assigned to a position
  • Each user additionally gets his/her own hash function $h_u$. This may be implemented by using the same hash function but combining the token with the user_id like so: user1_data, which hashes the same term into a new position.
  • We may thus represent each instance by $\phi_0(x) + \phi_u(x)$, capturing both a global element (some terms are universally spam-indicative) and a personalized element (some terms are specifically indicative for a particular user)
  • Finally, we learn a weight vector $w \in \mathbb{R}^m$ by training it in the hashed space

Empirically, for their email spam task, they found that performance starts to saturate with a hashed dimension of around 4 million. This is a very small fraction of the total parameter space, showing the effectiveness of their method. Nevertheless, we should note that 4 million is still a rather large space.
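
A minimal sketch of the personalized hashing trick in Python; the md5-based position and sign hashes are illustrative stand-ins, not the paper's implementation:

import hashlib
import numpy as np

def hashed_features(tokens, m=2**18, user_id=None):
    x = np.zeros(m)
    for tok in tokens:
        key = tok if user_id is None else f"{user_id}_{tok}"  # per-user hash h_u
        digest = hashlib.md5(key.encode()).hexdigest()
        position = int(digest[:8], 16) % m                    # position hash h(j)
        sign = 1 if int(digest[8:10], 16) % 2 == 0 else -1    # sign hash xi(j)
        x[position] += sign
    return x

tokens = ["cheap", "deal", "meeting"]
# combined global + personalized representation phi_0(x) + phi_u(x)
x = hashed_features(tokens) + hashed_features(tokens, user_id="user1")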

From RankNET to LambdaRank to LambdaMART

Paper Link.

This is an overview paper that explains the model behind LambdaMART, a technique used to learn Gradient Boosted Trees that optimize for NDCG on recommendation-type of datasets.

RankNET

RankNET is a generic framework for training a learn-to-rank model. Given a differentiable model $f$ that produces a score from an n-dimensional feature vector, RankNET trains $f$ such that, for a given query session with items $j, k$ and corresponding feature vectors $x_j, x_k$, the model learns to score $s_j = f(x_j)$ above $s_k = f(x_k)$ whenever item $j$ is a better recommendation than item $k$. The differentiable model is typically a neural network or boosted trees.

RankNET uses revealed pairwise preferences within each query session to train $f$. Specifically, suppose we have the data for one query session as follows:

| query | item | clicked |
|-------|------|---------|
| qid1  | a    | 0       |
| qid1  | b    | 1       |
| qid1  | c    | 0       |

We can transform this data into a pairwise dataset as follows, where $S_{jk}$ denotes the preference relationship between items $j$ and $k$ inferred from the click data. Note that the pairwise comparisons are only made within the same query session (e.g. qid1), as it reflects a given user's preferences in the context of a query and the items impressed to him/her in that session.

| query | item j | item k | $S_{jk}$ |
|-------|--------|--------|----------|
| qid1  | a      | b      | -1       |
| qid1  | a      | c      | 0        |
| qid1  | b      | a      | 1        |
| qid1  | b      | c      | 1        |

The pairwise setting is now more amenable to modelling (compared to directly optimizing for a good ranking), since we can treat the task as a classification problem. For each row of the pairwise dataset, we only need to model the probability that item $j$ is preferred (or not) to item $k$. This can be formalized using a cross entropy loss comparing the predicted preference of our model to the revealed preference in the dataset.

First, we model the predicted probability from the model. Given a row of the pairwise dataset with items $j$ and $k$, we model the predicted probability that $j$ is preferred to $k$ (using $j \rhd k$ to denote a preference relationship) by passing the difference between the predicted scores $s_j$ and $s_k$, scaled by a shape parameter $\sigma$, through a sigmoid function, like so:

$$P_{jk} \equiv P(j \rhd k) = \frac{1}{1 + e^{-\sigma (s_j - s_k)}}$$

Now let us denote the revealed probability that $j$ is preferred to $k$ as $\bar{P}_{jk}$, such that:

  • $\bar{P}_{jk} = 1$ if we prefer item j to item k
  • $\bar{P}_{jk} = 0.5$ if we have no preference between the two items
  • $\bar{P}_{jk} = 0$ if we prefer item k to item j

The cross entropy loss of our model can then be expressed as:

$$C = -\bar{P}_{jk} \log P_{jk} - (1 - \bar{P}_{jk}) \log(1 - P_{jk})$$

For convenience, let us denote $S_{jk} \in \{-1, 0, 1\}$ such that $\bar{P}_{jk} = \frac{1}{2}(1 + S_{jk})$ (and conversely, $S_{kj} = -S_{jk}$), which translates into the following:

  • $S_{jk} = 1$ if we prefer item j to item k
  • $S_{jk} = 0$ if we have no preference between the two items
  • $S_{jk} = -1$ if we prefer item k to item j

Let us also define the convenience variable $s_{jk} \equiv s_j - s_k$. The cross entropy loss then simplifies to:

$$C = \frac{1}{2}(1 - S_{jk})\,\sigma\, s_{jk} + \log\left(1 + e^{-\sigma\, s_{jk}}\right)$$

Note that the simplification uses the identities $\log P_{jk} = -\log(1 + e^{-\sigma s_{jk}})$ and $\log(1 - P_{jk}) = -\sigma s_{jk} - \log(1 + e^{-\sigma s_{jk}})$; substituting these into the cross entropy and cancelling terms returns the expression above.

Having written out the loss function, we now need to differentiate it with respect to the model scores and parameters to obtain the gradient descent updates used to train the RankNET model. Differentiating $C$ with respect to $s_j$ and $s_k$ gives:

$$\frac{\partial C}{\partial s_j} = \sigma \left( \frac{1}{2}(1 - S_{jk}) - \frac{1}{1 + e^{\sigma (s_j - s_k)}} \right) = -\frac{\partial C}{\partial s_k}$$

Note that the derivative of the log term uses $\frac{d}{dx}\log(1 + e^{-x}) = -\frac{1}{1 + e^{x}}$ (multiply the numerator and denominator by $e^{x}$). The symmetry $\frac{\partial C}{\partial s_k} = -\frac{\partial C}{\partial s_j}$ holds because $C$ depends on the scores only through the difference $s_j - s_k$. This symmetry will be important for the next section on factorizing RankNET to speed it up.

Finally, we use the gradient to update the individual parameters $w$ of the model via the chain rule, $\frac{\partial C}{\partial w} = \frac{\partial C}{\partial s_j}\frac{\partial s_j}{\partial w} + \frac{\partial C}{\partial s_k}\frac{\partial s_k}{\partial w}$, summed over all pairs in the pairwise dataset. This update procedure rounds off the discussion on RankNET and is sufficient for training a generic differentiable model from ranking data.
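
A small numpy sketch (not from the paper) of the pairwise cross-entropy loss and its gradient with respect to the score of item j; sigma is the shape parameter of the sigmoid as above:

import numpy as np

def ranknet_loss_and_grad(s_j, s_k, S_jk, sigma=1.0):
    # S_jk is +1 if j is preferred, 0 if tied, -1 if k is preferred
    diff = sigma * (s_j - s_k)
    loss = 0.5 * (1.0 - S_jk) * diff + np.log1p(np.exp(-diff))
    # dC/ds_j; by symmetry dC/ds_k = -dC/ds_j
    grad_sj = sigma * (0.5 * (1.0 - S_jk) - 1.0 / (1.0 + np.exp(diff)))
    return loss, grad_sj

print(ranknet_loss_and_grad(s_j=2.0, s_k=0.5, S_jk=1))  # small loss, small gradient
print(ranknet_loss_and_grad(s_j=0.5, s_k=2.0, S_jk=1))  # large loss; negative gradient, so a descent step raises s_j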

Factorizing RankNET

Schroff 2015 - FaceNET

Schroff 2015 - FaceNet: A Unified Embedding for Face Recognition and Clustering

This paper proposes the Triplet Loss for learning a face recognition system.

The task: given a large number of person identities, each with a number of associated images, learn a representation of an image in Euclidean space such that images belonging to the same person are close together and images of different persons are far apart. This paper was impactful because it improved the SOTA on face verification by a large margin.

At the time, representation learning approaches often trained a CNN classification model to classify images into known identities. A bottleneck layer of relatively low dimensionality in the middle of the network was taken as the representation of an image. In contrast to this indirect method, this paper directly optimizes the representation using contrastive learning.

Setup

Let the embedding of image $x$ be represented by $f(x) \in \mathbb{R}^d$, and constrain $\lVert f(x) \rVert_2 = 1$. Now, for a given person $i$, we want the anchor image $x_i^a$ to be closer to other images of the same person (positives $x_i^p$) than to images of any other person (negatives $x_i^n$), with some margin $\alpha$. That is, we desire a model such that:

$$\lVert f(x_i^a) - f(x_i^p) \rVert_2^2 + \alpha < \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 \quad \forall\, (x_i^a, x_i^p, x_i^n) \in \mathcal{T}$$

Where $\mathcal{T}$ denotes the set of all possible triplets in the dataset. The authors note that most triplets would easily satisfy the desired constraint, hence we need some way of picking good triplets to optimize the learning process.

The loss we desire to minimize is thus as below; the intuition is that we wish to have small distances for the anchor-positive pair $(x_i^a, x_i^p)$ but large distances for the anchor-negative pair $(x_i^a, x_i^n)$ (a code sketch follows the list):

$$L = \sum_{i} \left[ \lVert f(x_i^a) - f(x_i^p) \rVert_2^2 - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2 + \alpha \right]_+$$

Writing $d(x, y) = \lVert f(x) - f(y) \rVert_2^2$:

  • Easy triplets, where $d(x_i^a, x_i^n) > d(x_i^a, x_i^p) + \alpha$, will contribute 0 loss
  • Semi-hard triplets, where $d(x_i^a, x_i^p) < d(x_i^a, x_i^n) < d(x_i^a, x_i^p) + \alpha$, contribute small losses
  • Hard triplets, where $d(x_i^a, x_i^n) < d(x_i^a, x_i^p)$, contribute large losses
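
A minimal PyTorch sketch of this triplet loss over a batch of (already normalized) embeddings; the batch here is random data, not FaceNet's mining pipeline:

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # squared euclidean distances between the normalized embeddings
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_an = (anchor - negative).pow(2).sum(dim=1)
    # hinge: only triplets violating the margin contribute
    return F.relu(d_ap - d_an + margin).mean()

a, p, n = (F.normalize(torch.randn(8, 128), dim=1) for _ in range(3))
print(triplet_loss(a, p, n))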

Triplet Selection

As with most contrastive learning methods, the authors observe that the selection of triplets is important for learning well. In an ideal world, we would want to select the hardest negative for each anchor, i.e. $x_i^n = \arg\min_{x^n} \lVert f(x_i^a) - f(x^n) \rVert_2^2$ over the whole corpus.

However, this is undesirable because:

  • We may end up mostly selecting mislabelled instances or corrupted images
  • It is computationally infeasible to search across the whole corpus
  • Trying to learn hard negatives at the start of training may cause learning to collapse

Hence, the authors propose:

  • Mini batch negatives. Instead of finding the hardest negatives across the entire corpus, they mine for the hardest negatives in the mini-batch. This improves computational efficiency and mitigates the mislabelled/corrupted data issue.
  • Curriculum Learning. At the start of training, the authors select easier triplets, and gradually increase the difficulty as training goes along. Presumably, this is done by thresholding negatives on their distance to the anchor and gradually lowering the threshold to admit harder negatives.
  • Semi-hard negatives. The authors mention that it is helpful to limit negatives to semi-hard ones satisfying $d(x_i^a, x_i^p) < d(x_i^a, x_i^n)$, but it is unclear whether they always do this or only at the start of training. The intuition is to cap the difficulty of the triplets, as triplets that are too difficult may actually hinder learning.

The triplet loss may be applied directly to semantic search. However, it is important to note that the paper assumes that, for each person, all instances labelled with other identities are negatives. This is a suitable assumption for face recognition since each image can only belong to one person, but it is not true for semantic search, where a document may be relevant to multiple queries. Hence, the mislabelling issue when mining hard negatives is amplified for semantic search. I imagine that the selection of accurate negatives for semantic search would require more verification and filtering.

Recommendations as Treatments

Paper Link.

Training and evaluation data for recommender systems is subject to selection bias, because the probability of observing a data point depends on (i) users interacting with items which they like and (ii) the recommender system pushing items which it thinks the user likes. This makes the data Missing Not at Random (MNAR), which biases both model training and evaluation.

Consider users interacting with movies. Denote users $u \in \{1, \dots, U\}$ and movies $i \in \{1, \dots, I\}$. Let $Y$ denote the true rating / interaction matrix between users and items, and $\hat{Y}$ the predicted matrix. Let $O \in \{0, 1\}^{U \times I}$ be the observability matrix recording whether an interaction was observed, and $P$ be the matrix of observation probabilities, i.e. $P_{u,i} = \Pr(O_{u,i} = 1)$. Ideally, we want the propensity matrix $P$ to be uniform across all entries, which gives us the Missing Completely at Random (MCAR) condition.

Evaluating only on observed entries is biased

Given a model $\hat{Y}$, we often want to compute evaluation metrics on how well $\hat{Y}$ approximates $Y$. For example, the squared error $\delta_{u,i}(Y, \hat{Y}) = (Y_{u,i} - \hat{Y}_{u,i})^2$.

The risk function may then be denoted:

$$R(\hat{Y}) = \frac{1}{U \cdot I} \sum_{u=1}^{U} \sum_{i=1}^{I} \delta_{u,i}(Y, \hat{Y})$$

However, $R(\hat{Y})$ cannot be computed because most entries in $Y$ are missing. Typically, we estimate it using only the observed entries:

$$\hat{R}_{\text{naive}}(\hat{Y}) = \frac{1}{|\{(u,i): O_{u,i} = 1\}|} \sum_{(u,i):\, O_{u,i} = 1} \delta_{u,i}(Y, \hat{Y})$$

This estimator is naive because it implicitly assumes the propensity matrix is uniform. This assumption is often false in recommender systems because a small set of popular items tends to get impressed a lot. Hence this estimator will favour models that lean towards popular items over models that recommend rarer / newer items.

Unbiased estimators

The main idea in this paper is to view this problem as analogous to estimating treatment effects in causal inference. Think of recommending an item as an intervention analogous to treating a patient with a specific drug. The goal is to estimate the outcome of a new recommendation (clicked or not) or new treatment (recovered or not), while most outcomes between u,i pairs are not known.

The key to resolving this bias is to understand the assignment mechanism that generated $O$, namely the propensity matrix $P$. We can then correct for the propensity. We need to assume that $P_{u,i} > 0$ for all $(u, i)$, i.e. full support, because otherwise the IPS estimator below is undefined.

IPS Estimator

The main estimator is the Inverse Propensity Score (IPS) estimator. Assuming that the assignment mechanism (i.e. $P$) is known:

$$\hat{R}_{\text{IPS}}(\hat{Y} \mid P) = \frac{1}{U \cdot I} \sum_{(u,i):\, O_{u,i} = 1} \frac{\delta_{u,i}(Y, \hat{Y})}{P_{u,i}}$$

We have simply weighted each observed error term by its inverse propensity, so that popular items with a high chance of being shown have their contribution scaled down proportionally (and vice versa for rarely shown items).
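
A minimal numpy sketch of the IPS-corrected risk estimate for a squared-error delta, assuming the propensities P are known:

import numpy as np

def ips_risk(Y, Y_hat, O, P):
    # O is the 0/1 observability matrix; only observed entries contribute,
    # each weighted by its inverse propensity 1 / P[u, i]
    err = (Y - Y_hat) ** 2
    return (O * err / P).sum() / Y.size

U, I = 100, 50
Y = np.random.randint(0, 2, size=(U, I)).astype(float)
Y_hat = np.random.rand(U, I)
P = np.clip(np.random.rand(U, I), 0.05, 1.0)     # assumed known propensities
O = (np.random.rand(U, I) < P).astype(float)     # observations drawn per propensity
print(ips_risk(Y, Y_hat, O, P))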

We can show that the IPS estimator is unbiased by taking the expectation over the random observability matrix $O$. That is, if we could sample an infinitely large number of observation sets based on $P$, the average of the IPS estimator over these datasets would equal the true risk. This happens because the expected value of $O_{u,i}$ is simply the propensity $P_{u,i}$, and if we know the propensity we can cancel it out.

Comments

  • This paper only addresses exposure bias, but not position bias. In other words, it assumes that all impressions are equal, which is valid when a user is shown an ad, but not when a user is shown a list of search results. In the latter case, items in lower positions have a lower probability of being considered by the user, even though both are considered to be observed.

Guo 2017 - DeepFM

Guo 2017 is an important innovation for recommender systems. The task tackled in this paper is to predict clickthrough rate (CTR) of recommended items.

The paper argues that it is important to learn implicit interaction signals between user and item features. Suppose we define the order of a signal as the number of features involved in deriving the signal. Then:

  • Recency of an item is an order-1 signal
  • Item category and timestamp is an order-2 signal, since e.g. demand for food items spikes at meal times
  • Gender, game category and user age is an order-3 signal, since e.g. male teenagers prefer shooting games

One can see that both low and high order signals are important for modelling CTR, and manual feature engineering is cumbersome to derive all such interaction rules. Hence we would like to have a deep learning model that can model such signals directly from raw user / item features and interaction data. This paper proposes to use Factorization Machines to model low order feature interactions and deep neural network to model high order feature interactions.

Setup

Suppose we have a dataset comprising n instances of tuples $(\chi, y)$, where $\chi$ is an m-field raw feature record in which each field may be categorical or numerical. Each categorical field is represented as a one-hot vector, and each continuous field is represented as-is. Let $x_j$ denote the vector representation of field j (its dimension is 1 for numerical fields and equal to the number of categories for categorical fields), and let $x = [x_1, x_2, \dots, x_m]$ denote the flattened vector with the $x_j$ laid out horizontally.

DeepFM

DeepFM comprises two distinct components, the FM component and the deep component, whose outputs are simply summed and squashed: $\hat{y} = \text{sigmoid}(y_{FM} + y_{DNN})$. We go into each component below.

The FM component is a factorization machine that captures order-1 and order-2 interactions between the raw features. We first project each feature field to a k-dimensional latent vector using a learned embedding matrix (in the paper, the authors set k=10). The latent vector representation of field j is denoted $v_j$. We compute the FM output as follows:

$$y_{FM} = \langle w, x \rangle + \sum_{j_1 = 1}^{m} \sum_{j_2 = j_1 + 1}^{m} \langle v_{j_1}, v_{j_2} \rangle\, x_{j_1} \cdot x_{j_2}$$

The first term represents the order-1 representation of the features as-is. The second term is a pairwise inner product between the embedding representations of each feature field, which represents order-2 interactions between the features.
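
A small PyTorch sketch (not the paper's code) of the order-2 term, using the standard identity that the pairwise sum of inner products equals half of (square of sum minus sum of squares):

import torch

def fm_second_order(v):
    # v: (batch, m_fields, k) latent vectors of the active features (feature values folded in)
    sum_then_square = v.sum(dim=1) ** 2        # (sum_j v_j)^2, shape (batch, k)
    square_then_sum = (v ** 2).sum(dim=1)      # sum_j v_j^2, shape (batch, k)
    return 0.5 * (sum_then_square - square_then_sum).sum(dim=1)  # (batch,)

v = torch.randn(4, 10, 8)   # batch of 4, m=10 fields, k=8
print(fm_second_order(v))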

The deep component tries to capture higher-order interactions between feature fields. This is done simply by laying out the embedding vectors horizontally, forming a flat input vector $a^{(0)} = [e_1, e_2, \dots, e_m]$ of size $k \times m$. This fixed size vector is then fed into a multi-layer perceptron (or dense neural network) to finally return a sigmoid output, with the standard forward equation $a^{(l+1)} = \sigma(W^{(l)} a^{(l)} + b^{(l)})$ for layer $l$. Note that the embedding layer is shared between the deep and FM networks, allowing the deep component to benefit from the faster learning of the FM component.

In the paper, the MLP is of the form 400->400->400. Dropout is set to 0.5.

Implementation

Huawei has an implementation of DeepFM (amongst other models) in pytorch.

Hamilton 2017 - GraphSAGE

Paper.

This paper presents a framework to efficiently generate node embeddings, especially for previously unseen nodes. It calls this inductive learning, i.e. the model is able to generalize to new nodes, as opposed to previous frameworks which are transductive and only learn embeddings for seen nodes. For example, matrix factorization methods are transductive because we can only make predictions on a graph with fixed nodes, and need to be retrained when new nodes are added.

GraphSAGE (i.e. Sample and Aggregation) aggregates feature information from the neighbourhood of a node to represent a given node. Feature information can include structural information (e.g. degree of a node) or other content-based information about a node, e.g. description of a job for a job node etc.

Setup

We start with a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ provided by the data. We have input features $x_v$ for each node $v \in \mathcal{V}$. For a user-chosen depth $K$, we have aggregator functions $\text{AGGREGATE}_k$ and weight matrices $W^k$, for $k = 1, \dots, K$. We also have a neighbourhood function $\mathcal{N}: \mathcal{V} \to 2^{\mathcal{V}}$, which means it maps from a node to a set of nodes in $\mathcal{V}$. Note that $2^{\mathcal{V}}$ is the powerset of the set $\mathcal{V}$, i.e. the set of all possible subsets of $\mathcal{V}$. We wish to generate low-dimensional, dense vector representations $z_v$ for each node $v$.

Forward Propagation

The algorithm for the forward propagation (in words) is as follows:

  1. We start with hidden representations $h_v^0 = x_v$, i.e. at layer 0, we just use the input features to represent each node
  2. At depth $k$, we perform a neighbourhood aggregation step at each node $v$: $h_{\mathcal{N}(v)}^k = \text{AGGREGATE}_k\big(\{ h_u^{k-1} : u \in \mathcal{N}(v) \}\big)$
  3. The aggregated vector is concatenated with the node's own previous representation and passed through a dense layer to get the hidden representation at depth $k$. Note that $\sigma$ represents a non-linear activation, such as ReLU: $h_v^k = \sigma\big( W^k \cdot [\, h_v^{k-1} \,\Vert\, h_{\mathcal{N}(v)}^k \,] \big)$
  4. We L2-normalize each vector $h_v^k$
  5. We repeat this process for depths $k = 1, \dots, K$
  6. We then take the last layer as the final representation: $z_v = h_v^K$

The intuition behind the forward propagation is that we use the neighbours of a node to represent it. Importantly, we also include the hidden representation of the current node from the previous depth (analogous to a residual connection). At each depth level $k$, we increasingly pull in information from further reaches of the graph. Note that in the aggregation step, a subset of each node's neighbours is sampled uniformly (as opposed to taking the full neighbour set) to control the complexity of the algorithm.

Loss

To train the weights, we define an unsupervised loss based on how well the embeddings are able to reconstruct the graph. Specifically, for a node $u$ and a nearby node $v$ (e.g. one that co-occurs with $u$ on a fixed-length random walk), we have a loss

$$J(z_u) = -\log\big(\sigma(z_u^\top z_v)\big) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)}\big[\log\big(\sigma(-z_u^\top z_{v_n})\big)\big]$$

which:

  • Rewards positive pairs $(u, v)$ for having a high dot product
  • Penalizes negative pairs $(u, v_n)$, with $v_n$ being sampled negatives drawn from the negative sampling distribution $P_n$ and $Q$ the number of negative samples

Alternatively, we can also define a supervised loss based on classification cross entropy loss, with presumably some form of negative sampling. The authors did not elaborate on this.

Aggregation Methods

The authors explored a few ways to define the function to aggregate neighbour embeddings together:

  • GraphSAGE-mean: The element-wise mean of the neighbour embeddings is taken
  • GraphSAGE-GCN: Same as above, except that the current node's hidden representation from the previous depth is not included. The experiments show that omitting this residual connection actually leads to significant performance degradation.
  • GraphSAGE-LSTM: An LSTM is fitted over the sequence of embeddings. Since there is no inherent order to the neighbours, the authors randomize the ordering for each training sample
  • GraphSAGE-pool: An additional linear layer is added over the sequence of embeddings, before an element-wise max-pool operation is carried out

Generally from the experiments, it seems that GraphSAGE-mean is sufficient.

Ma 2018 - Entire Space Multi-Task Model

Ma 2018 tackles the problem of building a post-click Conversion Rate (CVR) prediction model. Note that post-click CVR is the task of predicting conversions from clicked impressions, whilst click-through rate (CTR) prediction is predicting clicks from impressions.

In a typical recommender system, users follow the sequential process of impression -> click -> conversion, where conversion may refer to making a job application, purchase action etc. Usually, CVR models are built the same way as CTR models: a dataset of clicked impressions is prepared, and the converted items are labelled as relevant. A model is trained on this dataset and then used to make conversion predictions on all impressions. This paper argues that there are two problems with this approach:

  1. Sample Selection Bias (SSB). The distribution of the training set (comprising only of clicked impressions) differs greatly from the distribution of the testing set (comprising all impressions), and this distribution shift will hurt generalization performance of the trained model.
  2. Data Sparsity (DS). The dataset for CVR (clicked impressions) is typically much less than the dataset for CTR (all impressions), and the lack of data makes model fitting difficult. The paper estimates that CVR dataset is typically 4% of that of CTR dataset.

Setup

Denote the observed dataset as $\mathcal{S} = \{(x_i, y_i \rightarrow z_i)\}_{i=1}^{N}$, with each sample tuple representing one impression drawn from a distribution $D$ with domain $\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$. $\mathcal{X}$ is the feature space, and $\mathcal{Y}, \mathcal{Z}$ are label spaces (i.e. 0 or 1). Each feature vector $x$ captures the user attributes, item attributes and user-item interactions for the impression event. The notation $y \rightarrow z$ represents the sequential nature where a click event $y$ must precede a conversion event $z$.

We can denote the various prediction tasks as follows:

  • Post-view clickthrough: $pCTR = P(y = 1 \mid x)$
  • Post-click conversion: $pCVR = P(z = 1 \mid y = 1, x)$
  • Post-view click + conversion: $pCTCVR = P(y = 1, z = 1 \mid x) = pCTR \times pCVR$

The conventional way of modelling pCVR is to construct a sample from only the clicked impressions, i.e. $\mathcal{S}_c = \{(x_j, z_j) \mid y_j = 1\}$, where clicked but not converted impressions are treated as negative samples. We can see that $\mathcal{S}_c \subset \mathcal{S}$. As mentioned above, there are problems with this approach.

ESMM Model

The ESMM model breaks down the pCTCVR task into its constituents, and uses two neural networks to model pCTR and pCVR simultaneously. Based on the diagram, it seems to embed each user field and item field into a fixed-size embedding, where the user field embeddings are summed up element-wise to produce an overall user embedding. The same is done to produce an overall item embedding. The user and item embeddings are then concatenated together, and this combined embedding is fed into a dense layer to finally output a real score representing either pCVR or pCTR. The two scores are then multiplied together to form the final prediction of pCTCVR.

Importantly, the projection (or lookup) layer from raw features to embedding is shared between the two neural networks. This allows the pCVR network in particular to benefit from the richer sample data that the pCTR network enjoys and addresses the data sparsity issue.

ESMM Architecture
ESMM Model Architecture (Figure 2 from ESMM Paper)

Finally, the model is trained with a multi-task objective. Specifically, the losses are computed on the dataset of all impressions. The output pCTR is compared against clicks using a cross-entropy loss, and the output pCTCVR is compared against conversions using a cross-entropy loss. This multi-task loss allows us to exploit the sequential nature of the data generation process, such that the pCVR network only needs to model the incremental step from a click to a conversion.
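
A minimal PyTorch sketch of the multi-task loss (the two towers themselves are omitted; esmm_loss is a hypothetical helper). p_ctr and p_cvr are the sigmoid outputs of the two networks over all impressions:

import torch
import torch.nn.functional as F

def esmm_loss(p_ctr, p_cvr, click, conversion):
    p_ctcvr = p_ctr * p_cvr                                   # pCTCVR = pCTR * pCVR
    loss_ctr = F.binary_cross_entropy(p_ctr, click)           # supervised by clicks
    loss_ctcvr = F.binary_cross_entropy(p_ctcvr, conversion)  # supervised by conversions
    return loss_ctr + loss_ctcvr

p_ctr, p_cvr = torch.rand(16), torch.rand(16)
click = torch.randint(0, 2, (16,)).float()
conversion = (click * torch.randint(0, 2, (16,))).float()    # conversions only follow clicks
print(esmm_loss(p_ctr, p_cvr, click, conversion))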

The authors show that modelling in this multi-task manner outperforms a setup where two models are trained independently to predict CTR and CVR respectively, and their product is taken to estimate pCTCVR. Unfortunately, we cannot replicate this joint-task learning setup with gradient tree-based models, at least not naively.

Details

The authors set the dimension of each embedding vector to be 18, and each MLP is 360 -> 200 -> 80 -> 2 dimensional. The Adam optimizer is used.

Reimers 2019 - Sentence-BERT

Link to paper.

Sentence-BERT (or SBERT) was one of the first papers to suggest a way to fine-tune BERT models to generate useful embeddings that can be used for search / retrieval.

Prior to SBERT, BERT models were mainly used for sentence pair regression tasks by passing two sentences into the transformer network and adding a classification head on top to produce a float value. We can call this the cross-encoder approach. In other words, researchers only cared about the final prediction and did not make use of the embeddings, or the final vector representation of the inputs. This approach is suitable for reranking a small number of documents but not for nearest neighbour search in a corpus with millions of documents.

Naively, one can element-wise average the BERT embeddings at the final layer to produce a vector representation of the text. This vector representation can then be used for nearest neighbour search or clustering. However, because BERT was not explicitly trained for this objective, it results in rather bad sentence embeddings, often worse than GloVe embeddings.

Method

The SBERT paper presents 3 different training objectives, all of which perform well on embedding similarity tasks. The choice of objective depends on the dataset:

  1. Classification objective. This is for tasks where the objective is to predict a label given two sentences A, B. We pass each sentence into the BERT network and a pooling layer to get two vector representations, $u$ and $v$. The pooling layer can be (i) take the [CLS] token embedding, (ii) take the element-wise mean or (iii) take the element-wise max. We then create a concatenated vector $(u, v, |u - v|)$ which is fed into a softmax classifier. The network is trained using cross-entropy loss.

    Note that this siamese approach (where each sentence is passed into the same network) differs a little from the typical cross-encoder approach, where the sentences are concatenated into a single string with the [SEP] token before being passed into the network. The latter approach is presumably more powerful because the attention mechanism can attend to all pairwise relationships

  2. Regression objective. This is for tasks where the objective is to predict a float given two sentences A, B. Given the vectors $u$ and $v$, the cosine similarity is simply taken to generate a float between $-1$ and $1$. The cosine similarity is then compared with the gold float value using mean-squared error to generate a loss.

  3. Triplet objective. This is for tasks where each data point is a triplet (anchor sentence $a$, positive sentence $p$, negative sentence $n$). With embeddings $s_a, s_p, s_n$, we then minimize the loss function, where $\epsilon$ is the margin: $\max(\lVert s_a - s_p \rVert - \lVert s_a - s_n \rVert + \epsilon,\, 0)$

Ablation

  1. Pooling strategy. Using [CLS] or mean seems to be largely similar. The authors saw some degradation using max for the regression objective.
  2. Concatenation. For the classification objective, the concatenation strategy makes some difference. In particular, concatenating only $(u, v)$ performs noticeably worse than concatenations that include the element-wise difference $|u - v|$. Thus the element-wise difference is important in yielding useful embeddings, probably because it can be used to push similar sentences together and dissimilar sentences apart. The authors also found that adding the element-wise product $u * v$ does not help.

Takeaway

It is interesting that the classification objective, which is close to a cross-encoder framework, is also able to learn useful embeddings once the difference operation $|u - v|$ is added. This suggests that we can train a cross encoder and simultaneously get useful embeddings for nearest neighbour retrieval.

He 2020 - LightGCN

He 2020 - LightGCN is a simple and effective Graph Convolution Network for recommendation.

LightGCN is an adaptation of Graph Convolutional Networks (GCN) to the task of recommendation. In a typical Convolutional Neural Network for vision, convolution aggregations (such as linear projections, pooling, averaging) are applied to a neighbourhood of pixels that are near one another. The aggregations transform the raw pixel values into a hidden layer of "embedding" values, and the next layer of aggregations is applied to the hidden layer, allowing the CNN to learn more abstract features with each additional layer. A GCN uses essentially the same idea, except that the neighbourhood of a node A is defined as the set of nodes connected to A by an edge. The GCN thus allows us to train node embeddings on all kinds of graphical data, such as social networks, user-item interactions etc.

Setting

This paper tackles the task of collaborative filtering without features, i.e. making recommendations purely from the user and item id. Also, no negative samples are required - all we need is edges between users and items based on some form of implicit interaction.

Neural Graph Collaborative Filtering (NGCF)

The LightGCN model is essentially a simplification of the NGCF model, so the paper starts here. Btw, there are some overlaps between LightGCN authors and NGCF authors. The setup is as follows:

  • Each user and item are embedded from their id -> embedding
  • Let $e_u^{(0)}$ denote the ID embedding of user $u$ and $e_i^{(0)}$ denote the ID embedding of item $i$

NGCF uses the user-item interaction graph (derived from data) to propagate the embeddings as follows:

$$e_u^{(k+1)} = \sigma\Big( W_1 e_u^{(k)} + \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|\,|\mathcal{N}_i|}} \big( W_1 e_i^{(k)} + W_2 (e_i^{(k)} \odot e_u^{(k)}) \big) \Big)$$

$$e_i^{(k+1)} = \sigma\Big( W_1 e_i^{(k)} + \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_u|\,|\mathcal{N}_i|}} \big( W_1 e_u^{(k)} + W_2 (e_u^{(k)} \odot e_i^{(k)}) \big) \Big)$$

Some notes about the propagation equations above:

  • $e_u^{(k)}$ and $e_i^{(k)}$ denote the embeddings of user $u$ and item $i$ respectively after k layers of propagation
  • $\sigma$ is a non-linear activation function
  • $\mathcal{N}_u$ denotes the set of items that user $u$ interacted with. For instance, it could be all the items the user purchased within the past 3 months. $\mathcal{N}_i$ is the set of users defined in a similar way.
  • $W_1$ and $W_2$ are trainable weight matrices

Intuitively, for a given user, the equation propagates (i) the user embedding itself (order-1), (ii) the embeddings of neighbouring items (order-1) and (iii) the hadamard interaction between the user and the neighbouring items (order-2). And likewise for the item embeddings. Note that no neighbourhood sampling is performed - the entire neighbour set is taken per node.

Finally, after training the network of $L$ layers, we obtain $L + 1$ embeddings for each user and item. The embeddings are concatenated as $e_u^* = e_u^{(0)} \Vert \dots \Vert e_u^{(L)}$ and $e_i^* = e_i^{(0)} \Vert \dots \Vert e_i^{(L)}$, where $\Vert$ denotes concatenation. Prediction scores for the match between user $u$ and item $i$ are then computed via the inner product $\hat{y}_{ui} = {e_u^*}^\top e_i^*$.

Problem With NGCF

The authors argue that NGCF is unnecessarily complicated because traditionally, the base embedding layer is derived from rich semantic features such as embedding the title of papers etc. This justifies the usage of the activation function and the projection weights etc. to learn a transformation of the semantic features. In contrast, for the collaborative filtering setting, the embeddings are arbitrary numbers tied to each user or item ID. Hence, performing multiple non-linear transformations will not lead to better feature learning.

Comment: I'm not fully convinced by this argument, although the empirical results do support it. I agree to the extent that the base embedding layer is arbitrary, but imo NGCF can still learn a bigger representation space through its non-linear transformations. The problem seems to be more that (i) the richer feature representation is not very useful and (ii) the additional complexity makes the model harder to learn.

LightGCN Forward Propagation

In LightGCN, we essentially remove the non-linear activation and weight projections. The propagation equations simplify to the following:

$$e_u^{(k+1)} = \sum_{i \in \mathcal{N}_u} \frac{1}{\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|}}\, e_i^{(k)}, \qquad e_i^{(k+1)} = \sum_{u \in \mathcal{N}_i} \frac{1}{\sqrt{|\mathcal{N}_i|}\sqrt{|\mathcal{N}_u|}}\, e_u^{(k)}$$

The final representation of each node (whether user or item) is then a weighted sum of its hidden representations across all layers:

$$e_u = \sum_{k=0}^{K} \alpha_k\, e_u^{(k)}, \qquad e_i = \sum_{k=0}^{K} \alpha_k\, e_i^{(k)}$$

Although $\alpha_k$ could be a parameter to be optimized, the authors propose simply setting $\alpha_k = \frac{1}{K + 1}$.

Noticeably, the forward propagation does not include the self-connection from the previous layer, i.e. the update for $e_u^{(k+1)}$ does not explicitly include $e_u^{(k)}$, which other papers like GraphSAGE argue is important. The authors argue that because they use a weighted sum of hidden representations across all layers, this is essentially equivalent to including self-connections, so the explicit self-connection is no longer necessary.

Loss

The only trainable parameters of the model are the embeddings at the 0th layer, i.e. $e_u^{(0)}$ and $e_i^{(0)}$. The authors propose using the BPR (Bayesian Personalized Ranking) loss, which is a pairwise loss that encourages the score of an observed neighbour to be higher than the score of an unobserved, randomly sampled counterpart.

In contrast to NGCF and other GCN approaches, the authors do not use dropout as a regularizer. Instead, they think the L2 regularization on the embedding layer is sufficient, as these are the only parameters in the model. Training of the model is done in a mini-batch manner, where batches of (user, item) tuples are drawn, negative items sampled, and the loss evaluated.

Ablation Studies

The paper has a few ablation findings:

  1. Symmetric normalization is important, i.e. in the forward propagation it is important to divide by $\sqrt{|\mathcal{N}_u|}\sqrt{|\mathcal{N}_i|}$. Omitting either term leads to performance degradation. Note that the GraphSAGE-mean variant essentially normalizes by the user-side degree only. I suppose normalizing by the item degree as well penalizes popular items, so it could be useful.
  2. Layer combination is important for robustness as we increase the number of layers, i.e. instead of just taking $e_u^{(K)}$ as the final embedding, it is useful to take the (weighted) element-wise mean of the embeddings across layers. This might be analogous to the impact of including self connections.

References

Lewis 2020 - Retrieval Augmented Generation

Lewis 2020 - Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

This paper proposes a way to fine-tune a parametric seq2seq transformer (e.g. GPT) with a non-parametric memory through dense retrieval. The main idea is to extend the parametric memory (i.e. the "knowledge" that is stored within the LLM's floating point parameters) of the seq2seq model by coupling it with retrieval of documents from a vector database (dubbed non-parametric memory, using Wikipedia articles in the original paper). We can update the non-parametric database's knowledge as the world changes.

The paper argues that this setup is ideal for knowledge-intensive tasks such as open question answering, fact verification etc., where it is impossible to store all knowledge in parametric memory.

Setup

Given an input sequence of tokens $x$, we wish to retrieve contextual documents $z$ and use them as additional context when generating the target sequence $y$. We have two models:

  • Retriever $p_\eta(z \mid x)$, which returns a top-k truncated probability distribution over text passages. Truncated probably means that the probability is renormalized such that the top-k probabilities sum to 1 for each retrieval.
  • Generator $p_\theta(y_i \mid x, z, y_{1:i-1})$, parametrized by $\theta$, which generates tokens in an auto-regressive manner.

The goal is to train both models end-to-end on a fine-tuning corpus of input sequence / output sequence pairs. The loss objective is to minimize the negative marginal log-likelihood $-\log p(y \mid x)$.

Training

The paper proposes two different models to achieve the end-to-end prediction.

  1. RAG-Sequence. In this setting, we retrieve k documents and keep using them to generate the entire target sequence.

  • Note that we are marginalizing (taking a weighted combination) over the truncated retrieval distribution $p_\eta(z \mid x)$, implying that we trust each document according to its retrieval probability in the final probability for generating each token.
  • The expression $p_\theta(y_i \mid x, z, y_{1:i-1})$ just means that we generate the target sequence with an input that is a concatenation of $x$, $z$ and the previously generated tokens.
  • The retrieval is done using a BERT encoder with Maximum Inner Product Search (MIPS). To avoid re-indexing, the document vectors are held constant whilst the query encoder is trained in the above end-to-end fashion.
  • There is no explicit supervision on which documents should be retrieved. Intuitively, if a document is useful for generating the correct tokens, the loss objective would encourage $p_\eta(z \mid x)$ to be larger, thus encouraging the retriever to retrieve more relevant documents.
  • Another interesting way to think about this setup: suppose the generator just returns the retrieved document (token for token) and k=1, and the input / output pairs are anchor-positive pairs in a standard retrieval setting. Then we can see that this matches the standard retrieval training objectives but without negative sampling. Thus it seems that the token prediction task is sufficient to generate negative signals for non-useful documents such that explicit negatives are not needed.
  2. RAG-Token. In contrast, RAG-Token retrieves k documents at each time step, allowing us to sample new documents according to the needs at each decoding step.

Note that the retrieved context now varies at each time step, since a different set of documents may be marginalized over for each generated token. The change in retrieval at each step seems to add complexity during training.
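
For reference, my rough reconstruction of the two marginalizations in the notation above (the precise expressions are in the paper):

p_RAG-Sequence(y|x) ≈ Σ_{z ∈ top-k} p_η(z|x) Π_i p_θ(y_i | x, z, y_{1:i-1})

p_RAG-Token(y|x) ≈ Π_i Σ_{z ∈ top-k} p_η(z|x) p_θ(y_i | x, z, y_{1:i-1})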

Ablation

  • Increasing number of retrieved documents. 5 or 10 documents are used for retrieval. Ablation shows that performance increases monotonically (albeit diminishingly) for RAG-sequence with increasing number of retrieved documents.
  • Learned Retrieval is useful. The authors try freezing the retriever and compare it against the setting of allowing the retriever to learn. They find that generally learned retrieval improves results significantly.
  • RAG generates more diverse outputs. They measure the ratio of distinct tri-grams / total tri-grams and find that RAG-decoding compared to normal decoding is more diverse.

Gao 2021 - Simple Contrastive Learning of Sentence Embeddings

Paper Link

This paper proposes an unsupervised and a supervised approach to fine-tune sentence encoder models to perform semantic textual similarity (STS) tasks. The STS tasks refer to a set of tasks taken from SemEval 2012 to 2017 where, given two sentences, the model outputs a similarity score between 1 and 5, and is evaluated by correlation with human-annotated scores.

Task Dataset

Example sentence pair with a score of 4 (taken from SemEval 2016):

  • In May 2010, the troops attempted to invade Kabul.
  • The US army invaded Kabul on May 7th last year, 2010

Example sentence pair with a score of 3:

  • John said he is considered a witness but not a suspect.
  • "He is not a suspect anymore." John said.

Unsupervised SimCSE

SimCSE follows the popular framework of contrastive learning using in-batch negatives, where pairs of related sentences are trained to have high similarity whilst having low similarity with other random sentences.

The idea of unsupervised SimCSE is simple: given a collection of sentences, treat each sentence itself as its own positive pair, and use dropout noise to introduce random perturbations such that the self similarity is not trivially perfect.

Specifically, if we denote h_i^z as the embedding for sentence x_i under dropout mask z, then the loss for unsupervised SimCSE for a mini-batch of N sentences is:

l_i = −log [ exp(sim(h_i^{z_i}, h_i^{z_i'}) / τ) / Σ_{j=1}^{N} exp(sim(h_i^{z_i}, h_j^{z_j'}) / τ) ]

Note that τ is the temperature hyperparameter. Importantly, the authors found that setting τ = 1 with cosine similarity performs very poorly (64.0), compared to dot product (85.9). However, carefully tuning τ can lead to similar performance (τ = 0.05 had a score of 86.2).

We may view this procedure as data augmentation, analogous to how random pixel distortions and rotations are applied to images to improve computer vision models. The paper shows that this simple unsupervised method significantly outperforms other data augmentation methods. Note that the authors used the default 10% dropout for BERT models.
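
To make the objective concrete, here is a minimal PyTorch-style sketch of the in-batch loss, assuming a hypothetical encode function that runs the encoder with dropout active (so that two calls on the same sentences produce two different dropout masks):

import torch
import torch.nn.functional as F

def unsup_simcse_loss(encode, sentences, tau=0.05):
    # encode: hypothetical function mapping a list of N sentences to an (N, d)
    # tensor of embeddings, with dropout enabled so repeated calls differ.
    z1 = encode(sentences)                      # first dropout pass
    z2 = encode(sentences)                      # second dropout pass (positives)
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau
    labels = torch.arange(len(sentences))       # positive for row i is column i
    return F.cross_entropy(sim, labels)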

Supervised SimCSE

The supervised version follows a similar framework, although the positive pairs are taken from an external dataset. They chose the Natural Language Inference (NLI) datasets, where each example is a triplet of sentences. The premise is denoted x_i, the entailment sentence is denoted x_i^+ and the contradiction sentence is denoted x_i^-. The loss is then formulated with the contradictions serving as additional in-batch negatives:

l_i = −log [ exp(sim(h_i, h_i^+) / τ) / Σ_{j=1}^{N} ( exp(sim(h_i, h_j^+) / τ) + exp(sim(h_i, h_j^-) / τ) ) ]

Ablation Studies

  • The paper finds that including contradiction sentences as hard negatives has a small but significant improvement in performance
  • The paper finds that using the [CLS] token or averaging embeddings across the first and last layer does not make much difference

Alignment and Uniformity

Wang and Isola 2020 propose two metrics for measuring the effectiveness of an embedding method on a set of documents:

  • Alignment. Given a distribution p_pos of positive pairs of documents, alignment desires the expected distance between embeddings of each pair to be small: l_align = E_{(x, x^+) ~ p_pos} ||f(x) − f(x^+)||².

  • Uniformity. Given any two documents drawn i.i.d. from the corpus, the uniformity metric l_uniform = log E_{x, y ~ p_data} exp(−2 ||f(x) − f(y)||²) should be small (i.e. the distance between them is large).

A common problem pointed out in training language models is anisotropy, in which embeddings are pushed into a narrow cone in the vector space, which severely limits their expressiveness. The anisotropy problem is naturally connected to uniformity, which aims at distributing embeddings evenly in the space. The authors argue that contrastive learning as proposed in this paper addresses this problem through some analysis (omitted for now).

Empirically, they show that the alignment metric for SimCSE is comparable to average BERT, but the uniformity measure is significantly lower, leading to much better performance in terms of accuracy.

Weng 2021 - Contrastive Representation Learning

Reference: Weng, Lilian. (May 2021). Contrastive representation learning. Lil’Log. https://lilianweng.github.io/posts/2021-05-31-contrastive.

This page just follows Lilian's notes on contrastive learning.

Contrastive learning seeks to learn representations such that related samples are close together and unrelated samples are far apart. A sample is typically represented by a 1D vector (or embedding). The literature on contrastive learning comes from a wide range of tasks such as semantic retrieval and computer vision, so we may see the same ideas repeated in slightly different forms.

Dao 2022 - Flash Attention

Paper Link

This paper argues that the attention mechanism is slow because of reading / writing between GPU High Bandwidth Memory and GPU on-chip SRAM. The authors hence create a block-wise attention algorithm that minimizes such IO read / writes and speeds up attention significantly especially when the sequence length is long.

Brief Overview of Attention

Suppose we have an input sequence of N token embeddings of dimension d, stacked as a matrix X of shape N × d. Naively, we can compute activations by A = XW, where W is a d × d weight matrix, such that A is also N × d. However, this naive way of encoding our input sequence does not allow interaction between inputs at different positions (say x_1 with x_2). We can see this by observing that the first output position is only affected by the first input encoding x_1, and likewise for all the other positions.

Attention addresses this problem by adding an interaction mechanism. Besides W (which now plays the role of the value weights W_V), we also create weight parameters W_Q and W_K. Given an input X, we compute Q = XW_Q, K = XW_K and V = XW_V.

We then create an interaction matrix S = QK^T of shape N × N, and apply row-wise softmax to get P. P can be thought of as a pairwise similarity matrix between the encoding at position i and position j that captures the degree of interaction. For example, in a sentence the economy has been in decline, the value of P[1, 5] (assuming 0-indexing) measuring the interaction between economy and decline might be high.

Finally, we produce the output O = PV of shape N × d, which is an activation output from the input sequence that has captured the interactions between tokens at different positions of the input. This simple mechanism has led to significant improvements in language modelling.
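
A small numpy sketch of the single-head computation described above (dimensions are illustrative):

import numpy as np

N, d = 6, 16                                   # sequence length, embedding dim
X = np.random.randn(N, d)                      # input encodings
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
S = Q @ K.T                                    # (N, N) interaction matrix
P = np.exp(S - S.max(axis=1, keepdims=True))   # row-wise softmax
P = P / P.sum(axis=1, keepdims=True)
O = P @ V                                      # output activations, shape (N, d)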

GPU Memory Hierarchy

The memory hierarchy is such that read/write speed is super fast on the SRAM but memory is highly limited. Hence, the N x N attention matrix is written/read repeatedly to/from HBM, resulting in IO being a bottleneck. The numbers are as such on an A100 GPU:

  • SRAM: 19 TB/s (20 MB RAM)
  • HBM: 1.5 TB/s (40 GB RAM)

Naive Attention Algorithm

The naive attention algorithm has many reads and writes to HBM. (ps: Not sure why we cannot persist the intermediate matrices on SRAM and complete the computations, but in any case the naive algorithm requires materializing the full N × N matrices, which will quickly flood SRAM. For example, a sequence length of 2,048 at float32 already takes up about 17 MB (2048 × 2048 × 4 bytes) for a single matrix, against 20 MB of SRAM).

  1. Load Q, K from HBM, compute S = QK^T, write S to HBM
  2. Read S from HBM, compute P = softmax(S), write P to HBM
  3. Load P, V from HBM, compute O = PV, write O to HBM

Flash Attention

The main idea is quite simple: instead of computing the full attention matrix, we use block-wise tiling to compute parts of it at a time. This reduces the memory required for each block and allows the whole computation to be done on SRAM while minimizing the amount of IO read from HBM, leading to faster compute time and lower memory usage on SRAM. The difficulty is in devising a block-wise softmax algorithm that yields the exact same result as computing it all at once.

Consider the naive softmax algorithm on an arbitrary vector x of length B:

m(x) = max_i x_i,   f(x) = [e^{x_1 − m(x)}, ..., e^{x_B − m(x)}],   l(x) = Σ_i f(x)_i,   softmax(x) = f(x) / l(x)

Note that the maximum value m(x) is subtracted for numerical stability to avoid overflow (underflow is ok because each shifted exponent is at most 0, so each term is at most 1). f(x) is the numerator and l(x) is the sum of all elements in f(x).
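
In numpy, the numerically stable version reads:

import numpy as np

def stable_softmax(x):
    m = x.max()            # m(x): maximum, subtracted for stability
    f = np.exp(x - m)      # f(x): shifted exponentials, each at most 1
    l = f.sum()            # l(x): normalizer
    return f / l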

Now, the problem with the naive softmax algorithm in the context of attention is that we need an entire row of S (N elements) to perform the row-wise softmax computation. This will not be available if we are performing block-wise computation, since we are splitting K and V row-wise into blocks. When we compute Q K_j^T for block j, only a block of columns of S is materialized in each pass, not an entire row at a time.

Hence, we need a modified algorithm that allows us to compute chunks of the final output O at a time by iterating block-wise through K and V, such that the combination of the new chunk at each step with the already written intermediate output gives the correct result at the end. The key to realizing this algorithm is in decomposing the softmax step, as shown below.

Consider two vectors x^{(1)} and x^{(2)}, with x = [x^{(1)}, x^{(2)}] their concatenation. We can decompose the softmax statistics of the concatenated vector as follows:

m(x) = max(m(x^{(1)}), m(x^{(2)}))
f(x) = [ e^{m(x^{(1)}) − m(x)} f(x^{(1)}),  e^{m(x^{(2)}) − m(x)} f(x^{(2)}) ]
l(x) = e^{m(x^{(1)}) − m(x)} l(x^{(1)}) + e^{m(x^{(2)}) − m(x)} l(x^{(2)})

The first line of the above simply notes that the maximum of x is the maximum over each of the subvector maximums. The second line notes that we previously subtracted the subvector maximum, say m(x^{(1)}) for those elements in x^{(1)}. To get the correct shift for the full vector x, we need to divide away the previous factor and apply the new one, i.e. multiply by e^{m(x^{(1)}) − m(x)}. The third line notes that the new denominator is the sum over each of the subvector sums, after we apply the correct multiplier from line 2.

The decomposition is simple but powerful. It implies that so long as we keep track of the intermediate statistics m(x) and l(x), we can compute the softmax of a long vector by splitting it into subvectors and operating over one subvector at a time.
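
A quick numpy check of the decomposition: compute the statistics of two subvectors separately, combine them as above, and confirm we recover the full softmax.

import numpy as np

def stats(x):
    m = x.max()
    f = np.exp(x - m)
    return m, f, f.sum()

x1, x2 = np.random.randn(5), np.random.randn(7)
m1, f1, l1 = stats(x1)
m2, f2, l2 = stats(x2)

m = max(m1, m2)                                                  # new global max
f = np.concatenate([np.exp(m1 - m) * f1, np.exp(m2 - m) * f2])   # corrected numerators
l = np.exp(m1 - m) * l1 + np.exp(m2 - m) * l2                    # corrected denominator

x = np.concatenate([x1, x2])
expected = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
assert np.allclose(f / l, expected)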

Now we are ready for Algorithm 1: Flash Attention of the paper.



Note that we use 0_{N×d} and 0_N to denote a zero matrix of size N × d and a zero array of size N respectively. For simplicity, we divide Q, K, V into equal-sized blocks, but the paper allows different block sizes for Q versus K and V. The equation numbers on the right in parentheses show which equations the lines correspond to above. Equation line 15 is a bit confusing because it combines multiple steps together. The next few paras try to unpack this.

Firstly, note that we are using ⊙ to denote an element-wise broadcasted multiplication. For a vector v and conformable matrices A and C, observe the associative property (v ⊙ A)C = v ⊙ (AC), since each element of v only affects the corresponding row in the final matrix. This allows us to apply the scaling either before or after the matrix multiplication and the result will be the same.

Next, see that the right-hand term is simply the corrected softmax numerator for the current block multiplied with the value block V_j. Dividing this term by the updated normalizer l_i^new gives the output contribution for this particular (i, j) block pair.

Similarly, the other term is the existing output O_i that has been accumulated from the previous blocks. Due to the associative property, we can also directly apply the scaling correction to O_i. These are the scaling factors according to equations (6), (8), which correct the scaling applied at previous steps.

Finally, we should understand why there is a + in equation 15. I find it easier to visualize if we set the query block size to 1, so that each block of Q is a single row Q_i. If we trace the matrix multiplications, we will observe that O_i is only affected by Q_i, i.e. it corresponds to only the query token in position i. Now, O_i represents the weighted average over all positions of the V matrix, where the weights are determined by the softmax over the interaction between Q_i (representing one token) and all positions of the K matrix. This weighted average is why there is a + symbol: we are accumulating the weighted sum over the V_j blocks into O_i. The only complication is that we are applying the scaling corrections at each step.

Hopefully these explanations provide some intuition to the FlashAttention algorithm, which is quite a simple idea but makes a ton of difference practically. It should be easy to implement this algorithm in numpy if the reader wishes to understand it better.
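
As a starting point, here is a minimal numpy sketch of the block-wise computation (a single head, with none of the IO machinery or other optimizations from the paper), using the running-max / running-sum recurrence described above:

import numpy as np

def blockwise_attention(Q, K, V, block=64):
    N, d = Q.shape
    O = np.zeros((N, d))
    l = np.zeros(N)                        # running softmax denominator per query
    m = np.full(N, -np.inf)                # running max per query
    for j in range(0, N, block):
        Kj, Vj = K[j:j+block], V[j:j+block]
        S = Q @ Kj.T / np.sqrt(d)          # scores for this block of keys
        m_new = np.maximum(m, S.max(axis=1))
        P = np.exp(S - m_new[:, None])     # corrected numerator for this block
        l = np.exp(m - m_new) * l + P.sum(axis=1)
        O = np.exp(m - m_new)[:, None] * O + P @ Vj
        m = m_new
    return O / l[:, None]

# sanity check against the naive full-matrix implementation
Q, K, V = (np.random.randn(128, 16) for _ in range(3))
S = Q @ K.T / np.sqrt(16)
P = np.exp(S - S.max(axis=1, keepdims=True))
reference = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), reference)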

Tunstall 2022 - SetFit

Paper link.

The task which SetFit tackles is few-shot labelling of texts. Since labelling data is scarce, current SOTA methods rely on large language models and techniques like In Context Learning (ICL) or Parameter Efficient Fine Tuning (PEFT).

Rafailov 2023 - Direct Preference Optimization

Here we trace the derivations from the DPO paper. Denote the model after SFT as π_SFT, i.e. the policy π_SFT(y | x) is a probability function over each pair of input x and answer y. Naturally, we can use this policy to generate tokens by choosing the response with the highest probability (or approximating it in a greedy token-by-token manner).

To perform RLHF, we first need to build a reward model. First, we prompt the SFT model to obtain pairs of answers (y_1, y_2) for each prompt x. These samples are presented to human labellers who record their preferences in the form y_w ≻ y_l | x, where y_w wins and y_l loses. These preferences are assumed to be generated from an underlying reward model r*(x, y) which we do not have access to.

We wish to learn this reward model. Since we only have access to pairwise preferences instead of the reward score, a common approach is to model the pairwise preferences using the Bradley-Terry (BT) model. Specifically, we assume that the observed human preference decisions are related to the underlying human reward model in the following way:

p*(y_1 ≻ y_2 | x) = exp(r*(x, y_1)) / [ exp(r*(x, y_1)) + exp(r*(x, y_2)) ]    (1)

Suppose we have a static dataset D = {(x_i, y_w^i, y_l^i)} of comparisons sampled from the human preference model p*. We can create a reward model r_φ and use the BT-model to express the negative log likelihood of D:

L_R(r_φ, D) = −E_{(x, y_w, y_l) ~ D} [ log σ( r_φ(x, y_w) − r_φ(x, y_l) ) ]    (2)

Note that we are using the expression of the BT-model as a sigmoid function. With this NLL expression, we can optimize r_φ using gradient descent to learn the reward model from D. Notice that the heart of Equation (2) is essentially just a difference in reward between the winning answer y_w and the losing answer y_l.

Note that r_φ is usually initialized using the SFT model π_SFT, but with an additional linear layer over the final transformer layer to generate a scalar value as the reward.

Having learned the reward model, we then need to use the learned reward function to fine-tune the policy using reinforcement learning. Specifically, we set π_ref = π_SFT as the reference model, and initialize a new model π_θ that we wish to train. Usually, π_θ is also initialized as a copy of π_SFT. The objective function we wish to maximize is:

max_{π_θ}  E_{x ~ D, y ~ π_θ(y|x)} [ r_φ(x, y) ]  −  β D_KL[ π_θ(y|x) || π_ref(y|x) ]    (3)

Inspecting this objective, we see that we are trying to tune π_θ such that it generates answers that maximize the learned reward function r_φ, while at the same time ensuring that we do not deviate too far from the original reference model. β is a hyperparameter controlling the degree of deviation. This penalty constraint serves to:

  1. Ensure that we do not drift too far from the (x, y) distribution on which the reward model is accurate
  2. Ensure that we maintain generational diversity and not just collapse into a single high-reward answer for a given prompt

Objective (3) may not be directly optimized with gradient descent, because we need to sample generations y at each step from the current policy, and the discrete sampling step is not differentiable (not sure if I fully understand this). Hence typically this is optimized using reinforcement learning with PPO.

Direct Preference Optimization

The RLHF process in general is unstable, requires more memory and computation, and needs various tricks to make it work. Hence the authors of DPO set out to create an optimization procedure that:

  1. Avoids fitting an explicit, standalone reward model
  2. Avoids using reinforcement learning

DPO starts off with the KL-constrained reward maximization objective from Equation (3) above. The first step is to show that the optimal policy for this objective has the following form for an arbitrary reward function r:

π_r(y|x) = (1 / Z(x)) π_ref(y|x) exp( (1/β) r(x, y) ),  where  Z(x) = Σ_y π_ref(y|x) exp( (1/β) r(x, y) )    (4)

The derivation for Equation (4) is as follows. For a given reward function r:

max_π E_{x~D, y~π}[ r(x, y) ] − β D_KL[ π(y|x) || π_ref(y|x) ]
  = max_π E_{x~D} E_{y~π(y|x)} [ r(x, y) − β log( π(y|x) / π_ref(y|x) ) ]
  = min_π E_{x~D} E_{y~π(y|x)} [ log( π(y|x) / π_ref(y|x) ) − (1/β) r(x, y) ]
  = min_π E_{x~D} E_{y~π(y|x)} [ log( π(y|x) / ( (1/Z(x)) π_ref(y|x) exp((1/β) r(x, y)) ) ) − log Z(x) ]

For line 2 above, recall that the KL-divergence is the expected value of log( π(y|x) / π_ref(y|x) ) when the random variable y is drawn from π. Since the outer expectation is over draws from π, we can break down the KL-divergence by bringing the log difference into the expectation. Line 3 simply divides by β and flips max to min. Line 4 uses exp to bring the reward term into the denominator of the left term, then introduces an arbitrary partition function Z(x). Note that the two Z(x) terms can be cancelled out if we brought them together, but we will be using them later on.

Now let us define the optimal policy π*(y|x) = (1/Z(x)) π_ref(y|x) exp((1/β) r(x, y)). We will need to prove that π* is indeed optimal. Note that π* is a valid probability distribution as:

  • π*(y|x) ≥ 0 for all y; and
  • Σ_y π*(y|x) = 1, since the denominator Z(x) is just the sum over y of the numerator

Since Z(x) is not a function of y, we can sub π* into the objective and take log Z(x) out of the inner expectation. The left term becomes a KL-divergence between π(y|x), which we are optimizing over, and the optimal policy π*(y|x).

Finally, note that Z(x) does not depend on π, so we only need to consider the KL-divergence term. Gibbs' inequality tells us that the KL-divergence is minimized at 0 if and only if the two distributions are identical. This completes our derivation of (4) by showing that π* is indeed the optimal policy.

Now that we have completed the derivation, let's consider what Equation (4) is saying. It tells us that we have an analytical solution for the policy that optimizes (3), and that it can be expressed in terms of π_ref (which we already have) and a given reward function r.

Since we previously learned a reward model r_φ, we could simply plug that into (4) to get our optimal policy. Specifically, for a new input prompt x, we can compute π*(y|x) for all possible values of y and pick the best y. We can ignore Z(x) since it is fixed for a given prompt x. Intuitively, the new model scales up the probability of high-reward answers with an exponential multiplier, and the degree of scaling is controlled by β. However, this is not computationally practical as we need to evaluate over a very large space of y (i.e. all possible answers for a given prompt x).

Hence, we want to find a form of the optimal policy which involves neither the partition function Z(x) nor the reward model r. We start with Equation (4), taking log on both sides and re-arranging:

r(x, y) = β log( π_r(y|x) / π_ref(y|x) ) + β log Z(x)    (5)

Since Equation (5) holds for any arbitrary reward model, we can apply it to the optimal (unknown) human reward model r* from Equation (1). Also, note that π_r in Equation (5) refers to the optimal policy under r, so since we are using the optimal reward r*, we can call this optimal policy π* as well. Now we plug Equation (5) back into Equation (1).

p*(y_w ≻ y_l | x) = σ( β log( π*(y_w|x) / π_ref(y_w|x) ) − β log( π*(y_l|x) / π_ref(y_l|x) ) )    (6)

The derivation of the above is quite simple: we just need to note that it is of the form σ( r*(x, y_w) − r*(x, y_l) ), i.e. the expression of the BT-model as a sigmoid function. The great thing is that the pesky partition function Z(x) cancels out, because the BT-model simply ends up with the difference between two scores / rewards.

Equation (6) looks simple but is quite remarkable when we compare it to Equation (1), because we now have the probability of the human preference data in terms of just the reference policy π_ref and the optimal policy π*, without the reward model at all! Of course, the reward model is implicitly embedded in the equation. If we stare at Equation (6), we see the implicit reward function is r(x, y) = β log( π*(y|x) / π_ref(y|x) ), up to a term that depends only on x.

The benefit of this new formulation is that we can write the maximum likelihood objective in terms of the optimal policy. We do not know the optimal policy, but we can parametrize it as π_θ and use the maximum likelihood objective to train the model. Our new policy objective becomes:

L_DPO(π_θ; π_ref) = −E_{(x, y_w, y_l) ~ D} [ log σ( β log( π_θ(y_w|x) / π_ref(y_w|x) ) − β log( π_θ(y_l|x) / π_ref(y_l|x) ) ) ]
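
A minimal sketch of this loss, assuming we already have per-example log probabilities of the winning and losing answers (summed over tokens) under both the policy and the reference model:

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Each argument is a tensor of log pi(y|x) values for the winning (w)
    # and losing (l) answers.
    reward_w = beta * (policy_logp_w - ref_logp_w)   # implicit reward of winner
    reward_l = beta * (policy_logp_l - ref_logp_l)   # implicit reward of loser
    return -F.logsigmoid(reward_w - reward_l).mean()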

Application to Dual Encoder Retrieval

The DPO framework offers a way to fine-tune a policy according to human preferences, whilst ensuring stability against a reference model. In the original formulation, π_θ(y|x) represents the probability of the generative model generating answer y given an input prompt x.

A related problem is that of fine-tuning embedding models for search retrieval, in what is known as the dual encoder framework. In this case, we also have preference data in the form of triplets (query, positive_passage, negative_passage), and we wish to fine-tune embeddings such that the dot-product or cosine similarity between (query, positive_passage) is high whilst that of (query, negative_passage) is low. In this formulation, we could let the policy represent the normalized probability of a relevant match between query and passage. We can then borrow the framework of DPO to fine-tune our embeddings. The benefit of DPO compared to typical optimization objectives for dual encoder is the stability of the policy against the reference policy, which hopefully is a good form of regularization even when preference data is limited.

Specifically, we define a dual encoder policy π(p|q), where q represents a query and p represents a passage, like so:

  1. Encode q into a vector using a BERT model
  2. Encode p into a vector using a BERT model (can be the same model as the query encoder)
  3. Take the dot product of the two vectors
  4. Run the dot product through a sigmoid layer to convert it into a probability

We can then apply this method to a reference encoder model and call it π_ref, and optimize another encoder model π_θ against this reference model. Will need to conduct experiments to see if this method offers gains against typical dual encoder objectives, such as Karpukhin 2020.
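
A rough sketch of what this could look like, assuming hypothetical policy_encoder and ref_encoder modules that map a piece of text to a vector (this is my proposed adaptation, not something from the DPO paper):

import torch
import torch.nn.functional as F

def log_policy(encoder, query, passage):
    # pi(p | q) as defined above: dot product squashed through a sigmoid,
    # returned in log space.
    q_vec = encoder(query)
    p_vec = encoder(passage)
    return F.logsigmoid(q_vec @ p_vec)

def dual_encoder_dpo_loss(policy_encoder, ref_encoder, query, pos, neg, beta=0.1):
    logp_w = log_policy(policy_encoder, query, pos)
    logp_l = log_policy(policy_encoder, query, neg)
    with torch.no_grad():                      # reference encoder stays frozen
        ref_w = log_policy(ref_encoder, query, pos)
        ref_l = log_policy(ref_encoder, query, neg)
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -F.logsigmoid(margin)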

References

Blecher 2023 - Nougat

Blecher 2023 - Nougat: Neural Optical Understanding for Academic Documents.

Nougat is an end-to-end system that converts a scientific PDF into a sequence of tokens in markdown format in an auto-regressive way.

Prior methods for Visual Document Understanding (VDU) usually rely on an external OCR service to generate intermediate outputs. In contrast, this method is end-to-end, and the text is generated directly from image embeddings in a decoder manner. Thus, the model is very simple and most of the work in this paper is in data preparation.

Model

  1. Encoder. The encoder gets a variable-size document image and applies cropping / resizing to generate a fixed rectangle of size (H, W). Smaller images are white-padded. The fixed-size image can then be passed into a Swin Transformer to output a sequence of N embedded patches of dimension d, where d is the latent dimension and N is the number of patches.

Borisyuk 2024 - GNN at LinkedIn

Borisyuk 2024 - LiGNN: Graph Neural Networks at LinkedIn

This is an applied paper documenting learning points from LinkedIn's latest Graph Neural Network (GNN) models, geared toward industry practitioners.

The main architecture is based on GraphSAGE. The graph is heterogeneous, where nodes can represent a company or member or post etc., and edges can represent (i) engagement of a user with a post, (ii) affinity between a member and a creator, and (iii) whether a member or company has a particular attribute. Edges are weighted by the strength of the engagement, except attribute edges which all have a weight of 2.0.

The GNN's are trained as encoders for the various types of nodes, and the generated embeddings may be used for downstream link prediction tasks. They are trained using the GraphSAGE framework, which inductively generates the node embeddings using graph sampling and neighbourhood aggregation. The graph sampling is powered by the Microsoft DeepGNN Graph Engine, which is run on a Kubernetes CPU cluster.

Temporal GNN

One of the main innovations of this paper is to adapt the GraphSAGE framework to the temporal nature of a live recommender system, where items lose relevance very quickly with time and modelling the activity sequence is important.

Firstly, the typical SAGE encoder is applied to the member and its neighbours (e.g. connections, affinities etc.). This generates an embedding of d dimensions. Multi-head attention is applied to generate a few of these embeddings, e.g. H=4. In addition, sampling of the member's past activities (e.g. N=100) is performed, and each of these activity nodes is also encoded. These are the embeddings labelled A1, ..., A100 in the diagram below; note that the numbers correspond to chronological order. The input is now of dimension (H+N) x d.

We can see how this now resembles a sequential input embedding (e.g. from text) which is typically fed into transformers and trained with a causal prediction task. Indeed, here we feed the sequential input into a transformer and get an output of dimension (H+N) x d as well (referring to H1, ..., H80 at the top). Positional encodings are added so that the positions may be interpreted as a sequence. Finally, causal masking is used for training the transformer. This means that for the first H tokens, no masking is applied, but for the last N positions, each token can only attend to the first H tokens and the activity tokens that preceded it.

The temporal modelling contributes around 4-5% lift in AUC which is significant.

Temporal GNN Framework (Figure 4 from paper)

Graph Densification

Since GNNs rely on neighbourhood aggregation, nodes with few interactions pose a cold start problem. To solve this problem, artificial edges are added between nodes with low out-degrees and nodes with high out-degrees based on content-based similarity. Specifically, their algorithm works like so:

  1. Nodes with out-degree below the 30th percentile are designated as low-degree
  2. Nodes with out-degree above the 90th percentile are designated as high-degree
  3. Pre-embed all nodes with an LLM based on content
  4. For each low-degree node, find its approximate top-50 neighbours amongst the high-degree set. Create an edge between each of them and the low-degree node with edge weight based on embedding similarity

It seems that graph densification adds a small (0.2%-0.5%) lift to AUC both on offline and online metrics. It probably also increases the diversity of recommendations.

Personalized Page Rank Graph Sampling

Personalized Page Rank (PPR) is an adaptation of the well-known Page Rank algorithm and is widely used as a notion of node similarity. Specifically, given a source node s and a target node t, the PPR score represents the probability that a random walk originating at s will terminate at t.

In training the GNN under the GraphSAGE framework, embeddings of neighbours to a node s are aggregated together to form the representation for s. A terse way of saying the same thing is that "GNNs propagate signals across graphs". Hence, how neighbours are sampled is crucial in determining the performance of the GNN.

The simple way is Random Sampling, which simply chooses neighbours randomly amongst nodes connected to node s. This can be done over a single hop or multiple hops, and is efficient but not the most performant. The better way is PPR sampling, which chooses neighbours weighted by their PPR score, with the search space limited to nodes connected to s over k hops. This is slower but a better measure of neighbourhood.

From their experiments, 2-hop PPR sampling contributes 2.1% validation AUC lift. Adding more hops beyond 2 hops contributes marginally to performance, hence 2-hops is chosen for efficiency.

Other Tips and Tricks

The paper is full of practical tips on how to speed up training and improve stability. Here are some that stood out to me:

Adaptive Graph Sampling. IO time is correlated with the number of neighbours sampled. So they start training with a small number of neighbours and increase it only when the validation metric starts to stall. It seems that the final number of neighbours is capped at 200.

Mixed Precision. Mixed precision (float16) gave roughly a 10% speed up in training. But they had to be careful to retain float32 in the last layer, otherwise it degraded the training process.

Data Bound. One reflection point they made is that GNN training is generally data-bound rather than compute-bound. So the focus of the paper was on making data loading and processing more efficient. For example, they implemented a shared memory queue, and use the multiprocessing package to have different processes pre-fetch data from the DeepGNN Graph Engine and pre-process it simultaneously.

Liu 2023 - Meaning Representations from Trajectories

Paper Link

This paper demonstrates a way to compute sentence similarities using auto-regressive LLMs with a decoder-only architecture.

In the encoder-decoder setting, characterized by BERT models, a sentence is typically represented by the embedding generated by the encoder at the final layer. Typically, either the [CLS] token embedding is taken, or the element-wise mean of the embeddings at the final layer is taken to represent the sentence. As Reimers 2019 demonstrates, the default BERT trained using masked language modelling objective does not automatically have good embeddings suitable for Semantic Textual Similarity tasks. Hence the BERT model needs to be fine-tuned with some sentence similarity training task to generate good embeddings.

This paper is interesting as it does not require any such fine-tuning. The main idea is to represent a sentence (sequence of tokens) by the distribution of text continuations (denoted as trajectories) generated by the auto-regressive LLM on the context sentence. In other words, as the paper elegantly puts it, we represent the meaning of linguistic terms by their usage.

Setup

Let V denote a finite vocabulary of tokens and V* indicate the set of all variable-length finite sequences of tokens in V. A language model M may be viewed as a map that associates a prompt sequence p and a possible continuation sequence c with a likelihood score M(c | p).

The paper chose inverse perplexity as the score, i.e. the geometric mean of the per-token probabilities of the continuation given the prompt:

M(c | p) = ( Π_{i=1}^{m} P(c_i | p, c_{<i}) )^{1/m}

Side note on inverse perplexity. We know that the perplexity of a continuation is ( Π_{i=1}^{m} P(c_i | p, c_{<i}) )^{−1/m}. Hence the above score is indeed the inverse perplexity.

Given two prompts p_1 and p_2, we denote their semantic representations under model M as M(· | p_1) and M(· | p_2). The semantic distance between them may be denoted d(M(· | p_1), M(· | p_2)), where d is some distance function. The paper chose an expectation over sampled continuation trajectories, described next.

In other words, we sample a continuation prompt with equal probability from either p_1 or p_2, generate a continuation sequence of length m, and compute the expected difference in log-likelihood of the generated continuation under the two prompts. Since it is not feasible to integrate this expression over all possible trajectories, we sample n trajectories and compute an estimate of the expected difference.
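
A sketch of the Monte Carlo estimate, assuming hypothetical helpers sample_continuation(prompt, length) and log_likelihood(prompt, continuation) that wrap the LLM:

import random

def trajectory_distance(p1, p2, n=20, m=20):
    # Estimate the expected |log M(c|p1) - log M(c|p2)| with continuations c
    # sampled from a 50/50 mixture of the two prompts' distributions.
    total = 0.0
    for _ in range(n):
        prompt = random.choice([p1, p2])
        c = sample_continuation(prompt, length=m)     # hypothetical LLM call
        total += abs(log_likelihood(p1, c) - log_likelihood(p2, c))
    return total / n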

Results

For the experiments, the authors set n=20, m=20, and sampling temperature =1.0.

  • Ablating on number of trajectories n shows that most of the performance is achieved with n=20, with marginal gains thereafter
  • Ablating on sequence length m shows that most of the performance is achieved with m=20, with marginal gains thereafter. Interestingly, there is a big improvement in performance from m=1 to m=10, showing that the distribution of the next token alone is not sufficiently discriminating as a semantic representation.
  • Ablating on sampling temperature shows that 1.0 is optimal. Sampling from too diverse or too similar trajectories hurts the performance of this method.

The results show that this method is the best way to represent autoregressive models thus far, with even slightly better performance than encoder-based models like T5 on the STS task. The results are not as good as models explicitly trained with unsupervised contrastive learning (see SimCSE) for the STS task, but as the decoder model size is increased, the performance starts to reach similar levels.

Application to Semantic Retrieval

Unfortunately, this method is not suitable for semantic retrieval, as the distance measure requires sampling from both query and document distributions (if we treat query as one prompt and document as the other prompt). Note that the setting is Semantic Textual Similarity where query and documents come from similar distributions.

The authors create a proof-of-concept workaround by creating a fixed set of trajectories T that is used instead of sampling from the prompts: they select random documents from the dataset and create a sample trajectory from each document to form T. We can then precompute the likelihood scores for each document over this set of trajectories T. At inference time, we just need to compute the query's likelihoods on the fixed set of trajectories, then compare the distances against the pre-computed document scores. The authors show that this approximation obtains reasonable performance.

However, I note that in typical semantic search settings, query and document usually come from different distributions where queries are short and documents are long. Hence, the continuations of a query and document might not be directly comparable. However, the beauty of this method is that since it is based on continuations, we can use prompting to adjust the behaviour of the LLM to somewhat align distributions.

For example, we may on the document-side generate trajectories using a prompt like Given document d, generate a query. On the query-side, we may use something like Given query q, generate other similar queries. We can then compare trajectories generated in this fashion.

Another issue with the original workaround is that of latency. Even with a fixed trajectory set, we need to compute at inference time, which is costly even for small number of trajectories. One way to deal with this issue is to have a hybrid setup:

  • On the document side, we use the high-precision decoder to generate a prospective set of queries of size n (say n=1,000) from the documents. Each document is represented as an embedding of size n, where each position records the likelihood of the prospective query given the document under the decoder model.
  • We use a separate fine-tuned encoder model to precompute embeddings for each of the prospective queries.
  • On the query side, during inference, we use the same encoder model to embed the query and compute similarity against the n prospective queries. The similarity scores form a representation of size n for the query.
  • We may then compute similarity between the size n query and size n document representations and perform retrieval with standard nearest neighbour vector databases.

In contrast to standard embedding vectors where each position encodes an unknown quality, these representation vectors have clear semantic meaning in each position (indicating the strength of relation to a particular prospective query). This setup also allows us to trade-off precision against recall by tweaking the sparsity of the representation vectors using score thresholding or top-k thresholding on either (or both) the query and document side.

NLP Course

Based on lectures from Graham Neubig's Advanced NLP Class in 2024.

Lecture 1: Introduction

Lecture 1 Link

A general framework for NLP systems: create a function to map an input X into an output Y, where X and/or Y involve language. Tasks include:

  • Language Modelling
  • Translation
  • Text Classification
  • Language Analysis
  • Image Captioning

Challenges of NLP

  • Low frequency words
  • Conjugation (e.g. adjectives modifying a word)
  • Negation
  • Metaphor, analogy
  • Other languages

Topics:

  1. Language Modelling Fundamentals
  2. Training and Inference Methods
  3. Experimental Design and Evaluation
  4. Advanced Training and Architectures
  5. NLP Applications
  6. Linguistics and Multi-Linguality

Lecture 2: Word Representation and Text Classification

Subword models aim to address the issue of low-frequency words. How many words are there in the English language? This is a bit of a trick question: do we consider company and companies different words? By using subwords we can greatly reduce the vocabulary size to around 60k, which is what modern tokenizers use.

Another way to address this is to use character-level models. The main issue of character-level models is that our sequences become very long, so for a fixed sequence length we cover a lot less text in the same batch size. Using subwords saves compute and memory.

Byte Pair Encoding (2015) is a very simple method to create subwords. It simply incrementally combines the most frequent token pairs together, starting at the character level. e.g. starting with the sentence newest, widest, we first combine es into a token, then est, and so on.

Another way to do this is to use Unigram Models (e.g. Kudo 2018). First we use a unigram LM that generates all words in the sequence independently. We then pick a vocabulary that maximizes the log likelihood of the corpus given a fixed vocabulary size. The optimization process is using the EM algorithm.

Sentencepiece is a highly optimized library to train both these types of subword models.

Subword nuances:

  • Subword models are hard to use multilingually because they will over-segment less common languages naively
  • Subword segments are sometimes arbitrary (e.g. is it es + t or e + st?)

Pytorch vs Tensorflow / JAX

  • PyTorch is more widely used
  • PyTorch favours dynamic execution vs compile-then-execute

Training transformers typically uses a learning rate schedule. Start low, increase in the middle and decrease toward the end. Starting with a high learning rate will lead to weird models because transformers are sensitive.

Lecture 3: Language Modelling

Lecture 3 Link

Generative vs Discriminative models:

  • Discriminative model: a model that calculates the probability of a latent trait given the data, i.e. P(Y | X), where Y is say the sentiment and X is the text
  • Generative model: a model that calculates the probability of the data itself, i.e. P(X)

Generative models for text are called language models. We can use LMs to generate sentences. We can also use them to score sentences (e.g. when a sentence is a concatenation of a question and answer pair). Can also use LMs to correct sentences.

Auto-regressive language models compute P(x_i | x_1, ..., x_{i-1}) one token at a time. Question: why do we do it auto-regressively instead of computing the entire sequence at once? The problem is computational - predicting the next token has a space of |V|, but predicting the entire sequence is in the order of |V|^N where N is the length of the sequence, which is currently intractable. That being said, if we could model the entire sequence at once, it would probably be a lot more efficient than the auto-regressive way.

The simplest language model is the count-based unigram model. By making an independence assumption P(x_i | x_1, ..., x_{i-1}) ≈ P(x_i), we just ignore all the previous words and predict the probability of a word occurring.

The maximum likelihood estimate for the probability of a given token is simply its count divided by the total number of tokens: P(x) = c(x) / Σ_{x'} c(x').

Detail: parametrizing in log space. Multiplication of probabilities are re-expressed as additions in log space, because otherwise, if we multiply 100 probabilities together for a sequence of 100 tokens, we will easily underflow the numeric precision.

Correspondingly, we can define the model parameters as the log probabilities of each token.

Moving on to higher order n-gram models, the idea is to condition on the previous n-1 words before the token we are predicting, and then count:

e.g. P(example | this is an) = c(this is an example) / c(this is an).

Due to sparsity of data, we need to add smoothing to deal with zero counts, i.e. instead of just using the tri-gram estimate, we interpolate tri-gram and bi-gram probabilities together:

e.g. P_smooth(x_i | x_{i-2}, x_{i-1}) = λ P_tri(x_i | x_{i-2}, x_{i-1}) + (1 - λ) P_bi(x_i | x_{i-1}).

More sophisticated smoothing techniques are studied in Goodman 1998: An Empirical Study of Smoothing Techniques for Language Modelling.
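
A toy count-based sketch of the interpolation idea (here a bigram smoothed with a unigram; corpus is a list of tokenized sentences):

from collections import Counter

def train_counts(corpus):
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams

def interpolated_prob(w, prev, unigrams, bigrams, lam=0.8):
    p_uni = unigrams[w] / sum(unigrams.values())
    p_bi = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

corpus = [["this", "is", "an", "example"], ["this", "is", "another", "example"]]
unigrams, bigrams = train_counts(corpus)
print(interpolated_prob("example", "an", unigrams, bigrams))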

Problems:

  1. Cannot share strength amongst similar words, e.g. car and bicycle
  2. Cannot condition on context with intervening words, e.g. Dr Jane Smith vs Dr Gertrude Smith
  3. Cannot handle long-distance dependencies, e.g. tennis and racquet in for tennis class he wanted to buy his own racquet

The standard toolkit for n-gram models is kenlm which is extremely fast and scalable, written in c++.

Evaluating Language Models

  1. Log Likelihood: LL = Σ_{E in test set} log P(E)

  2. Per-word Log Likelihood: WLL = (1 / Σ|E|) Σ_E log P(E), i.e. the log likelihood divided by the total number of words

  3. Per-word Cross Entropy: H = −(1 / Σ|E|) Σ_E log_2 P(E)

Aside: Any probabilistic distribution can also be used to compress data. The entropy measure is closely related to the number of bits needed to store the data based on our language model.

  4. Perplexity: ppl = 2^H = e^{−WLL}. Lower is better. Perplexity can be interpreted as the expected number of times we would need to sample from the probability distribution until we get the answer right.
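
As a small sketch, given the per-token probabilities a model assigns to a test corpus, the metrics above can be computed like so:

import numpy as np

token_probs = np.array([0.2, 0.05, 0.4, 0.1, 0.3])   # toy per-token model probabilities

log_likelihood = np.log(token_probs).sum()
per_word_ll = log_likelihood / len(token_probs)
cross_entropy = -per_word_ll / np.log(2)             # bits per word
perplexity = np.exp(-per_word_ll)                    # equivalently 2 ** cross_entropy

print(per_word_ll, cross_entropy, perplexity)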

Other Desiderata of LMs

Calibration (Guo 2017). Formally, we want the model probability of the answer to match the actual probability of getting it right. Typically we measure calibration by bucketing the model output probabilities and calculating the expected calibration error:

ECE = Σ_{m=1}^{M} (|B_m| / n) · | acc(B_m) − conf(B_m) |

where B_m represents a sub-segment (bucket) of the data which corresponds to a confidence interval from the model.
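
A sketch of the bucketed computation, given model confidences and correctness indicators for a set of predictions:

import numpy as np

def expected_calibration_error(confidences, correct, n_buckets=10):
    # confidences: model probability assigned to its predicted answer, in [0, 1]
    # correct: 1 if the prediction was right, else 0
    confidences, correct = np.asarray(confidences), np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap           # weighted by bucket size
    return ece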

How do we calculate answer probabilities? e.g. the university is CMU and the university is Carnegie Mellon University should both be acceptable.

Another desirable characteristic is efficiency. Some metrics are:

  • Memory usage (load model only, peak memory usage)
  • Latency (to first token, to last token)
  • Throughput

Some efficiency tips:

  • On modern hardware doing 10 operations of size 1 is much slower than doing 1 operation of size 10
  • CPUs are like motorcycles and GPUs are like airplanes
  • Try to avoid memory moves between CPU and GPU, and if we need to move memory, do so as early as possible (as GPU operations are asynchronous).

Lecture 4: Sequence Modelling

Lecture 4 Link

NLP is full of sequential data, especially data containing long range dependencies. References can also be complicated, e.g. the trophy would not fit in the suitcase because it was too big. What does "it" refer to? These are called Winograd schemas.

Types of tasks:

  • Binary classification
  • Multi-class classification
  • Structured Prediction. e.g. predicting the parts-of-speech tag for each word in the sentence

Sequence labelling is an important category of tasks in NLP:

  • e.g. Parts of speech tagging
  • Lemmatization
  • Morphological tagging, e.g. PronType=prs

Span labelling:

  • Named entity resolution
  • Syntactic Chunking
  • Entity Linking
  • Semantic role labelling

We can treat span labelling as a sequence labelling task by using beginning, inside and outside (BIO) tags.

Three major types of sequence modelling:

  1. Recurrence
  2. Convolutional
  3. Attentional

Recurrent neural networks essentially unroll a computational graph through "time" (where time is the position in the sequence). This results in a vanishing gradient problem, where the gradient on the last token is largest, and the gradient becomes smaller as we move backwards towards the first token. This problem is not just present in RNNs but in computation graphs in general: if there is important information, adding direct connections from the important nodes to the loss is one way to improve performance. This is the motivation for residual or skip-connections.

The LSTM is one way of solving this problem - the basic idea is to make additive connections between time steps, so that the gradient does not vanish. The idea is to have different "gates" that control the information flow, and to use additive connections. Additive connections help with the vanishing gradient problem because addition passes the gradient through unmodified (?).

Attention score functions:

  • Original paper in Bahdanau 2015 used a multi-layer perceptron for the attention function, which is very expressive but has extra parameters

  • Bilinear (Luong 2015): a(q, k) = q^T W k

  • Dot product: a(q, k) = q^T k

  • Scaled dot product (Vaswani 2017): a(q, k) = q^T k / sqrt(d_k)

Note that the attention mechanism in Vaswani 2017 is essentially a bilinear function + scaling, because the weight matrices are involved in obtaining the query and key vectors.

Comparing RNN, convolution and attention, for a token at position 200 to attend to token at position 1:

  • RNN will take 199 steps through the computation graph
  • Convolution will take 20 steps if the convolution window is 10
  • Attention will take 1 step

This is an advantage of the attention representation of text. Another advantage of attention is the computation speed. For a given text of N tokens, in order to produce N-1 token predictions for the next word at each step:

  • RNN will have to sequentially run N-1 steps
  • Attention will do it in 1 step by masking the attention matrix suitably. This makes it much faster to train attention models.

Truncated Backpropagation is one technique to reduce computation. e.g. in the context of RNN, we might forward propagate through positions 1-10 and compute the gradients. For the next positions 11-20, the hidden state from position 10 is used for forward propagation, but we do not backpropagate the gradients back to positions 1-10.

Lecture 5: Transformers

Lecture 5 Link

Transformers started as cross attention (Bahdanau 2014) and evolved to self-attention (Vaswani 2017). There are two main types of transformers:

  1. Encoder-Decoder model, e.g. T5 or BART. Encoder-decoder models have a separate encoder and decoder module. The encoder takes the input sequence and passes it into a transformer block to return a "context" embedding. The decoder then takes in both the "context" embedding and the output sequence (e.g. the same sentence in French for translation, or the same sentence itself shifted right for self-attention) with appropriate masking and generates a probability for each position in the output sequence. Note that the encoder module has un-masked full attention over the entire input sequence, but the decoder module is masked.

    The benefit of the encoder-decoder model is that we get an embedding representation of a given input sequence which can be used for downstream tasks. But it is also less flexible because we need to have a concept of the "input sequence" vs the "output sequence", which is suitable for tasks like translation but not for text generation tasks where there is no clear input/output.

  2. Decoder only model, e.g. GPT or LLaMa. The decoder model simply takes in the input sequence and passes it through a transformer module, resulting in a probability for each position in the input sequence. Using appropriate causal masking, we can train the network to predict the next token at each position given only information about tokens at previous positions.

    The benefit of the decoder model is fewer parameters than the encoder-decoder model. It's also more suitable for text generation tasks. In the encoder decoder framework, at each time step we need to recompute the encoder representation, because the representation at earlier positions can change with the addition of a new token. In the decoder only framework, the previous cached Q, K, V values do not change due to the causal masking, so we can re-use those in decoding the next time step. We should thus expect decoder only models to be faster at decoding.

Core concepts

Multi-head attention. The intuition for having multiple attention heads is that information from different parts of the sentence can be useful to disambiguate in different ways. Typically for a given word to attend to nearby words is useful for learning syntax, but attending to further words is useful for learning semantics.

Multi-head attention basically comprises multiple attention modules, each with their own weights W_Q, W_K, W_V. The attention outputs are computed for each head and then concatenated together. Since the resulting matrix will have H times the size on one dimension, we pass it through a final linear transformation to get the desired dimension size (as though we only used one attention head).

In practice the attention weights across all the heads are concatenated together first before the matrix multiplication to vectorize the computation. The resulting matrix is then sliced and attention computed on each head, before concatenating together again for the final matrix multiplication.

Positional Encoding. There is no notion of position in the transformer. Positional encoding simply adds an embedding at each position to the word embedding to encode this information. Note that it is added right at the beginning to the raw token embeddings.

  1. Sinusoidal Encoding was the original proposal in the Vaswani 2017 paper. The position embedding is a fixed vector at each position t, where element 2i is sin(t / 10000^{2i/d}) and element 2i+1 is cos(t / 10000^{2i/d}). Here i is an index on the dimension and d is the dimension of the embedding.

     The method is rather counter-intuitive but the basic idea is that we want the dot product between two embeddings to be high when the relative position is near and decay as we move away. The intuition is covered nicely in Kazemnejad 2019 and basically we can think of each positional embedding as analogous to a binary encoding represented by bits at each dimension. Instead of a hard 0 or 1, the sin and cos functions provide a smoothed version so that we get a nice relative decay.

    Note that this method does allow us to extrapolate the position embedding to longer sequences that we have not seen before in training.

  2. Learnable Embeddings. This was proposed in Shaw 2018, which basically just allows the embedding at each position to be a learnable vector. The problem of this approach is that it becomes impossible to extrapolate to longer sequences in inference time.

  3. Rotary Positional Encodings or RoPE. This was proposed in Su 2021. The fundamental idea is that we want the dot product of embeddings to result in a function of relative position.

     Specifically, we desire that for given positions m and n and the respective word embeddings x_m and x_n, the dot product of the resulting position-encoded embeddings may be expressed purely as a function of x_m, x_n and their relative distance m - n, and in so doing we lose the notion of absolute position entirely.

    The paper uses trigonometry and imaginary numbers to come up with a function that satisfies this property. The benefit of losing notion of absolute position entirely means that we can extrapolate to longer sequences that we have not seen before, and RoPE extrapolates better than sinusoidal embeddings. LLaMa uses RoPE embeddings.

Stability. Problem of gradient vanishing or exploding as we pass through the layers of an rnn or transformer. Layer normalization (Ba 2016) is the traditional way to deal with this issue. The intuition is that it normalizes the outputs of each attention layer to a consistent range, preventing too much variance in the scale of outputs.

LayerNorm(x) = γ ⊙ (x − μ) / (σ + ε) + β. Here, x is the output of an attention layer with embedding dimension d (normalized separately at each position in the sequence), μ and σ are the mean and standard deviation of x computed over the d embedding dimensions, and γ and β are learnable vectors of dimension d. Hence, if x has very large or small values, the normalization process standardizes the range of values. The parameters γ and β allow the model flexibility to shift the values to a different part of the space.

A simplification of layer norm is RMSNorm (Zhang and Sennrich 2019). It removes the mean-centering and the bias term, normalizing by the root mean square of x instead, and does not hurt performance empirically. It is used in LLaMA. The only learnable parameter per layer is the gain vector γ.
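
A numpy sketch of the two normalizations for a single position's activation vector x (gamma and beta are the learnable parameters):

import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu, sigma = x.mean(), x.std()
    return gamma * (x - mu) / (sigma + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return gamma * x / rms                 # no re-centering and no bias term

x = np.random.randn(8)
print(layer_norm(x, np.ones(8), np.zeros(8)))
print(rms_norm(x, np.ones(8)))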

Residual Connections. Add an additive connection between the input x (to an attention layer) and the output: output = x + f(x)

where f is the attention layer function. It prevents vanishing gradients (since the gradient through the identity path is 1) and also allows f to focus on learning the difference from the input. In the self-attention context, having the residual connection also prevents the need for tokens to attend to themselves, since that is provided by x itself.

Post vs Pre Layer Norm (Xiong 2020). This paper found that applying LayerNorm after the attention layer and the residual connection, i.e. LayerNorm(x + f(x)), broke the residual connection, because we are no longer guaranteed to pass the input x through unchanged. This led to some instability in transformer training. Instead, they found it is better to apply x + f(LayerNorm(x)), which leads to more stable training, since the residual connection is preserved.

Activation functions.

  • Vaswani 2017 used ReLU
  • LLaMA uses Swish or SiLU (Hendrycks and Gimpel 2016), which is SiLU(x) = x · σ(x). It looks a lot like ReLU but avoids the zero gradient problem.

Transformers are powerful but fickle - Vaswani 2017 used Adam with learning rate increase and decrease (warm up). This is no longer that necessary after the pre-layer norm (Xiong 2020).

AdamW (Loshchilov and Hutter 2017) is now more popular. It applies weight decay for regularization to Adam and corrects the previous implementation which applied the regularization incompletely. AdamW is thus theoretically more stable than Adam.

In summary, some comparisons. The Mamba paper found that the LLaMA architecture is 10x more efficient in terms of scaling law compared to the original Vaswani.

                    Vaswani      LLaMA
Norm Position       Post         Pre
Norm Type           LayerNorm    RMSNorm
Non-linearity       ReLU         SiLU
Position Encoding   Sinusoidal   RoPE

Lecture 6: Decoding Strategies

What is an LLM? It is a model that defines a conditional probability distribution P(Y | X) over an output sequence Y given some input sequence X.

The nice thing about the conditional distribution is that we get some notion of the model's confidence about the next token to generate. The problem with the conditional distribution is hallucination, since models generally assign some small but non-zero probability to incorrect tokens, even if all the pre-training data is factual. See Kalai and Vempala 2023.

Ancestral Sampling is to sample the next token based on the indicated confidence by the model. The nice thing is that the resultant generations follow exactly the distribution of the model.

The problem with ancestral sampling is the long tail. Most language models have vocabularies of around 30k tokens, and the probabilities in the long tail add up, so there is a somewhat good chance of sampling something really unlikely.

The obvious solution to this problem is to ignore the long tail and only sample from the top-k most probable tokens. This is top-k sampling. This results in only tokens that the model is somewhat confident in.

Alternatively, we could sample only from the top-p probability mass. This is called top-p or nucleus sampling. This accounts for the case where top-k sampling is not so desirable: if most of the probability mass is on, say, only 3 tokens, we may only want to sample from those.

Another alternative is epsilon sampling, where we only sample tokens with some minimum probability. This ensures that we only sample from tokens where the model is somewhat confident.

Another strategy is to modify the "peakiness" of the distribution by controlling the sampling temperature. This is done by modifying the scaling factor for the final softmax layer. This allows the user to put more weight on the top results (temperature < 1.0) for factual answers, or spread the weight out more by increasing the temperature for, say, story generation.
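
A sketch of how these strategies reshape a next-token distribution before sampling (the vocabulary size and parameter values are illustrative):

import numpy as np

def apply_temperature(logits, temperature=1.0):
    z = logits / temperature
    probs = np.exp(z - z.max())
    return probs / probs.sum()

def top_k(probs, k=50):
    cutoff = np.sort(probs)[-k]
    trimmed = np.where(probs >= cutoff, probs, 0.0)
    return trimmed / trimmed.sum()

def top_p(probs, p=0.9):
    order = np.argsort(probs)[::-1]
    keep = np.cumsum(probs[order]) <= p
    keep[0] = True                          # always keep the most likely token
    trimmed = np.zeros_like(probs)
    trimmed[order[keep]] = probs[order[keep]]
    return trimmed / trimmed.sum()

logits = np.random.randn(30_000)            # toy vocabulary of 30k tokens
probs = top_p(apply_temperature(logits, temperature=0.7))
token = np.random.choice(len(probs), p=probs)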

Contrastive Decoding is a newer idea - the idea is that we use a smaller model to improve the performance of a larger model. Instead of just decoding from the larger model's distribution, we contrastively decode by choosing outputs where the expert model thinks are more likely than the smaller model. The intuition is that both the big and small model are degenerate in similar ways (e.g. keep repeating itself), but the expert has knowledge which the smaller model does not have. Hence taking the probability difference helps to eliminate the degenerate cases and produce better results.

Mode-Seeking Decoding Methods

Instead of sampling, another approach is to try to maximize the probability of generation, i.e. mode-seeking.

Greedy decoding chooses the most likely token at each step.

However, greedy decoding does not guarantee finding the most likely sequence. e.g. the is often the most likely word, but choosing the would exclude many other sentences that may be more likely. Hence Beam Search, which ensures that we don't miss a high-probability sequence "hidden" behind a lower-probability prefix. This is a form of breadth-first search, where we maintain a few options at any point in the search.

  • First, we explore the top 3 next tokens at time step 1
  • Next, we explore the top 3 next tokens at time step 2 from each branch, leading to 9 options
  • Then, we prune down to the top 3 paths from the 9 options and repeat the process for time step 3

In practice, beam search often results in a very non-diverse set, i.e. sentences that are very similar to each other. Hence we may want to introduce diversity into the process. Diverse beam search modifies the scoring when pruning beams to avoid choosing overly similar beams. The similarity score can be as simple as word jaccard similarity between pairs. Stochastic beam search modifies the next token selection to sampling instead of using top greedy decodings.

Minimum Bayes Risk

The question is: Do we actually want the generations with the highest probability? In general, we do find that outputs with low probability tend to be worse than those with high probability. However, when we compare amongst the top outputs where probabilities are quite close, it becomes less clear. Say e.g. we performed beam search and found the top 3 sequences:

the cat sat down - 0.3
the cat ran away - 0.25
the cat sprinted off - 0.2

the cat sat down is the highest probability sequence, but the combined probability mass is higher for the idea that "the cat left the area". So the idea is that we might prefer generations that have high agreement with other sequences (high probability and low risk).

The MBR rule is roughly $\hat{y} = \arg\max_{h \in \mathcal{H}} \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} G(h, s)$, where $\mathcal{S}$ is a set of random samples from the model (say 100 samples), $\mathcal{H}$ is our hypothesis space (say the top 10 outputs), and $G$ is a gain function (the negative of a risk function) that measures the similarity of each candidate against the samples. For example, $G$ could be the ROUGE score, which measures the n-gram overlap between two sequences. Generally, MBR is high performance but high cost: even if $G$ is ROUGE-1 (i.e. unigram overlap), it significantly outperforms beam search or greedy search.
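A toy sketch of MBR selection using a crude unigram-overlap gain (both helpers are made up for illustration):

```python
def unigram_overlap(a, b):
    """Crude ROUGE-1-style gain: fraction of a's unigrams that also appear in b."""
    a, b = a.split(), b.split()
    common = sum(min(a.count(w), b.count(w)) for w in set(a))
    return common / max(len(a), 1)

def mbr_select(hypotheses, samples, gain=unigram_overlap):
    """Pick the hypothesis with the highest average gain against the samples."""
    return max(hypotheses, key=lambda h: sum(gain(h, s) for s in samples) / len(samples))

hyps = ["the cat sat down", "the cat ran away", "the cat sprinted off"]
best = mbr_select(hyps, samples=hyps)   # in practice `samples` would be ~100 model samples
```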

Other MBR-like variants involve output ensembling. Post-ensemble (Kobayashi 2018) compares pairwise embedding similarity between outputs across models and chooses the output with the highest average similarity. Self-consistency (Wang 2023) prompts for an answer using chain of thought, samples multiple outputs, extracts the answer from each sample (ignoring the explanations), and picks the most frequently generated answer.

Constrained Generation

Sometimes we want to impose constraints on the outputs, e.g. we want the model to suggest some hobbies but exclude climbing, or more realistically, we want to avoid toxic text. Options:

  • Ask the model to exclude climbing: often does not work

  • Logit Manipulation. Set the logits for the token(s) corresponding to climbing to negative infinity (i.e. probability zero) so they can never be sampled (see the sketch after this list). This often fails because there may be many synonyms or ways to express the same thing, and it's impossible to enumerate them all

  • Sample and Discard. We set up a new discriminator model which predicts whether a sequence corresponds to the idea of climbing or not. This is an easier task and often people will initialize the model from the original language model and train it to predict this idea from a small set of fine-tuning data.

    We can then get the generative model to generate a few samples and keep only samples where the predicted probability of the idea to avoid is low. Another variant of this is FUDGE (Yang and Klein 2021), where we multiply the generative probability of the next token by the discriminator probability that the new sequence will belong to the idea that we desire (e.g. formality). The chosen token will be that which maximizes the combined score.

  • RLHF. The alignment fine-tuning using RLHF may be viewed as a way to do constrained generation. An interesting paper that discusses RLHF as Bayesian inference is Korbak 2022.

    Instead of fine-tuning, one way is to do reward-augmented decoding (Deng and Raffel 2023). The idea is to have a reward model that modifies the generative probabilities based on rewards. (?)
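A small sketch of the logit-manipulation and FUDGE-style reweighting ideas from the list above, assuming the generator and discriminator scores are already available as numpy arrays:

```python
import numpy as np

def ban_tokens(logits, banned_token_ids):
    """Hard constraint: set banned tokens to -inf so they can never be sampled."""
    logits = np.array(logits, dtype=np.float64)
    logits[banned_token_ids] = -np.inf
    return logits

def fudge_next_token(gen_logprobs, disc_logprobs):
    """Soft constraint (FUDGE-style sketch): combine the generator's log-prob of each
    candidate token with a discriminator's log-prob that the extended sequence has
    the desired attribute, then take the best combined score."""
    return int(np.argmax(gen_logprobs + disc_logprobs))
```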

Human In the Loop Decoding

Some strategies to incorporate human intervention in the generation:

  • Interleaved text. Model generates some text, human continues, then back to the model etc.
  • Fine-grained replacement. Human selects some part of the generated text, and twiddles some knobs (e.g. "more descriptive" etc.)
  • Choosing outputs. Model generates a few options, and human chooses one.

We could also use a model in the loop. One idea is Tree of thought prompting, which is somewhat analogous to beam search. We have the model generate a few sentences at a time, with a few samples. An external model then judges and chooses the best branches to continue generation.

Practical Considerations

To increase decoding speed, one method is speculative decoding: a small draft model cheaply proposes a few tokens ahead, and the large model verifies them, accepting the proposed tokens where it agrees and correcting the first disagreement. This can speed up generation significantly.
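A sketch of the draft-and-verify loop, with draft_next and verify_prefix as assumed stand-ins for the small and large models:

```python
def speculative_decode(draft_next, verify_prefix, prompt, n_draft=4, max_len=50):
    """draft_next(seq) proposes one token from the small model; verify_prefix(seq, proposed)
    returns how many proposed tokens the large model accepts plus one corrected token."""
    seq = list(prompt)
    while len(seq) < max_len:
        proposed = []
        for _ in range(n_draft):                   # cheap: the small model drafts a few tokens
            proposed.append(draft_next(seq + proposed))
        n_ok, fix = verify_prefix(seq, proposed)   # one check by the large model
        seq += proposed[:n_ok] + [fix]             # keep the agreed prefix plus the big model's token
    return seq
```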

There are many libraries for fast decoding, e.g. vLLM, Outlines, disco etc. General takeaway is that a lot can be done at decoding time without needing to fine-tune the original model.

Lecture 7: Prompting Strategies

Basic Prompting. Append a textual string to the beginning of the sequence and let the model complete it. Some models are trained as chatbots and require a specific prompting template, e.g. GPT, but in the backend it's simply formatted into a string and passed to the model. The important thing is that we need to follow the template that the model was trained on, otherwise performance could suffer.

Post-processing refers to formatting the returned output from the LLM. E.g. ChatGPT supports markdown rendering, so if you ask it to generate a table it will generate it and then render it as a markdown table. Another form of post-processing is output selection, i.e. we extract the part of the output that we are actually interested in. We could do so by extracting keywords (e.g. fantastic, terrible) for a sentiment analysis task.

An interesting phenomenon when predicting labels is that asking the model to predict positive or negative tends to work better than asking for 1-5 numeric labels. The intuition is to think about the data the model was trained on: it has likely seen many more reviews with words like positive or excellent than with numeric labels, so it does better with those types of labels.

Few-shot prompting (Brown 2020) basically injects a few examples of the task together with the instruction. One thing to take note of is that LLMs (especially smaller ones) are sensitive to small changes in the in-context examples:

  • Example ordering (Lu 2021)
  • Label balance (Zhang 2022)
  • Label coverage (Zhang 2022)

Effects of few-shot prompting are also sometimes counter-intuitive. For example, replacing correct labels with random labels sometimes barely hurts the accuracy of the task. This suggests that the few-shot prompts are more for getting the structure of the response correct rather than teaching the desired labelling logic. Sometimes, more demonstrations can also hurt accuracy (possibly because the longer context confuses the model).

Chain of thought prompting (Wei 2022) basically tries to get the model to explain its reasoning before giving an answer. The original idea was to include reasoning steps in the few-shot prompts to get the model to do likewise, and this was found to significantly improve accuracy. One interpretation for why this works is that it provides the model with adaptive computation time to generate the correct answer: a simple question like 1+1= may be answered immediately, but a complex logical question might require several steps, and the reasoning step lets the model generate whatever intermediate work it needs to get the answer right.

The next step is zero-shot chain of thought prompting (Kojima 2022), which found that we can get much of the same benefit by just appending let's think step by step to the prompt, without requiring the few-shot examples.
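The zero-shot version is literally just string concatenation; a sketch with an assumed generate text-completion function (the second call to extract the final answer is one common way to do it):

```python
def zero_shot_cot(question, generate):
    """Zero-shot chain-of-thought sketch: trigger reasoning, then ask for the answer."""
    stem = f"Q: {question}\nA: Let's think step by step."
    reasoning = generate(stem)
    return generate(f"{stem} {reasoning}\nTherefore, the answer is")
```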

Another idea is that structuring outputs as computer programs can help (Madaan 2022). e.g. if we want the model to output a Directed Acyclic Graph, we could represent the output as a Python graph class object. This perhaps works because programs are highly structured and there is a lot of code in the pre-training data. Another useful trick is to get the model to output JSON, which it has also seen a lot of.

Another idea is to have program-aided language models (Gao 2022). This allows the LLM to call a code interpreter or calculator to compute answers. This works especially well for numeric questions.
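A sketch of the program-aided idea, again with an assumed generate function; executing model-generated code like this should only be done in a sandbox:

```python
def program_aided_answer(generate, question):
    """Ask the model for Python that computes the answer, then run it locally."""
    prompt = (
        "# Write Python that stores the numeric answer in a variable called `answer`.\n"
        f"# Question: {question}\n"
    )
    code = generate(prompt)
    scope = {}
    exec(code, scope)            # the interpreter does the arithmetic, not the LLM
    return scope.get("answer")
```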

Prompt Engineering

One thing to take note of is that the prompt format should match what the model was trained on. e.g. leaving out the space after the colon Passage:text can lead to severe performance degradation. Changing the casing to PASSAGE: text can also degrade performance.

We can also do automatic prompt generation. e.g. use another model to paraphrase our prompt and then select the best response out of the samples. Another approach is gradient-based, where we try out different prompt words and choose the best prompt words based on some kind of loss. These types of methods can result in highly non-human sequences that somehow produce the best results, but they can also be exploited to elicit harmful responses.

Another method along these lines is prefix tuning (Li and Liang 2021), which trains a continuous prefix of embeddings that is prepended at each transformer layer for the task of interest. This is akin to LoRA-style methods, which train a small set of additional weights that can be attached to the model.

One way to view prompting is to view it as a human-interpretable prior to the model, which can be easier than fine-tuning.

Lecture 8: Fine-tuning and Instruction Tuning

The general framework is that language models are pre-trained on the semi-supervised language modelling task on a very large corpus, and then fine-tuned on a downstream task. There are two paradigms for doing this:

  • Multi-task learning. The standard multi-task learning framework is to train on multiple tasks (language modelling and the downstream task) simultaneously, e.g. by alternating mini-batches or combining the losses.

    Dery 2021 argues that learning jointly on language modelling and the end task produces better results than pre-training and then fine-tuning. The intuition is that the model learns representations that are useful for both tasks.

    Another paper from Anthropic also shows that incorporating safety training at the beginning out-performs pre-training first and then fine-tuning to incorporate safety.

  • Pre-train then fine-tune. However, because it is so expensive to perform the language modelling, usually pre-train and fine-tune is the actual paradigm that we follow.

Full fine-tuning means to simply continue training the language model on the training data that we have. This is also called supervised fine-tuning. This can be prohibitively expensive: Rajbhandari 2019 showed that training a 65B-parameter model with 16-bit mixed precision and no optimizations requires around 1TB of GPU memory, which is clearly not feasible.

The simplest solution is to scale up horizontally across multiple GPUs. DeepSpeed ZeRO (Rajbhandari 2019) is a popular framework for doing so:

  • Stage 1: partitioning the optimizer state does not hurt optimization speed much. This brings down memory per device from 120GB to 31GB across 12 devices.
  • Stage 2: in addition, it partitions the gradients. This brings memory further down to 16GB.
  • Stage 3: in addition, it partitions the parameters. This brings memory down to 1.9GB but severely impacts computation speed.

An alternative strategy is to fine-tune only a smaller set of parameters. Adapters (Houlsby 2019) are one way to do this. The idea is to add an adapter block after each attention block. The adapter block down-projects the embedding dimensionality to something small like 16, passes it through a non-linearity, then up-projects back to the original dimension. Each adapter block only uses about 2 x model_dim x adapter_dim parameters.
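A PyTorch-style sketch of such a block (dimensions are illustrative):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, plus a residual.
    Roughly 2 x model_dim x adapter_dim parameters (ignoring biases)."""
    def __init__(self, model_dim=768, adapter_dim=16):
        super().__init__()
        self.down = nn.Linear(model_dim, adapter_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(adapter_dim, model_dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # residual around the bottleneck
```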

There are generally two benefits to parameter-efficient fine-tuning methods:

  • They are much more memory efficient. This is because we only need to back-propagate gradients through nodes on the computation path between the adapter blocks and the final loss function
  • They are more robust to over-fitting on the small set of fine-tuning data

An extension to Adapters is Adapter Fusion (Pfeiffer 2020). The idea is that instead of a single adapter after each attention block, we have multiple adapters, each trained on a different task. We then add an AdapterFusion block, which is basically a multi-head attention over each adapter block, so that it can learn to choose which adapters to use automatically.

LoRA (Hu 2021) is very similar conceptually to adapters, with the important difference that it does not have any non-linearity. The idea is that we express the fine-tuned weights as $W' = W + \Delta W$ (reference: Cameron Wolfe's blog post).

The goal is to learn $\Delta W$ with a low-rank factorization, so that it is parameter-efficient. Suppose for simplicity that $W \in \mathbb{R}^{d \times d}$. We may approximate $\Delta W \approx AB$, where $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times d}$. We can then simply freeze $W$ and modify the forward pass to become $h = Wx + ABx$, and fine-tune only the parameters of $A$ and $B$. $r$ can be as small as 8 or 2, leading to a very parameter-efficient fine-tuning method. $A$ is initialized with small random values, whilst $B$ is initialized as zero, ensuring that we begin the fine-tuning process with the model's original outputs.

The parameter efficiency of LoRA compared to full fine-tuning is significant: for $d = 512$ and $r = 8$, the LoRA factors have $2dr = 8192$ parameters vs $d^2 = 262144$ for the full matrix, i.e. around 3%.
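A PyTorch-style sketch of wrapping a frozen linear layer with the low-rank update (the wrapper class and initialization scale are illustrative, not the reference implementation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen W plus a trainable low-rank update AB; B starts at zero so the
    initial output matches the original layer."""
    def __init__(self, linear: nn.Linear, r=8):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False                          # freeze W (and bias)
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.randn(d_out, r) * 0.01)  # small random init
        self.B = nn.Parameter(torch.zeros(r, d_in))          # zero init => AB = 0 at the start

    def forward(self, x):
        return self.linear(x) + x @ self.B.T @ self.A.T      # Wx + (AB)x
```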

The reason LoRA has caught on is two-fold:

  1. It does not require modifying the original model structure - we can simply update the state_dict of the original model by merging the learned low-rank update $AB$ into the original weight matrices
  2. Adapters incur additional inference latency due to the extra layers, whereas LoRA adds no inference latency at all once the update is merged

Database Course

Based on CMU 15-445/645 Intro to Database Systems course taught by Andy Pavlo and Jignesh Patel in Fall 2023. The youtube playlist.

Lecture 1

Course about how to design and implement a database management system. Textbook: Database System Concepts by Silberschatz, Korth and Sudarshan.

Agenda:

  • Database Systems Background
  • Relational Model
  • Relational Algebra
  • Alternative Data Models

A database is an organized collection of inter-related data that models some aspect of the real world. Databases are the core component of most computer applications. e.g. an excel spreadsheet is a database. SQLite is the most popular database as it is used in every cellphone.

Flat File Strawman

Store our database as a csv file that we manage. e.g. each line corresponds to an artist, year, country etc. Problems:

  • Super slow to find the artist of interest, as we need to scan the whole file with a for-loop
  • Super slow to update or delete an artist
  • Data types are not stored in the csv file; we need to know externally which column is an integer etc.
  • Concurrent writes to the file are not supported

A Database Management System (DBMS) is software that allows applications to store and analyze information in a database. A general-purpose DBMS supports the definition, creation, querying, updating and administration of databases in accordance with some data model. Usually the first choice is Postgres or SQLite.

A data model is a collection of concepts for describing the data in a database. A schema is a description of a particular collection of data, using a given data model. Examples of data models:

  1. Relational
  2. Key / Value
  3. Graph
  4. Document / XML / Object
  5. Wide-Column / Column-family
  6. Array / Matrix / Vectors
  7. Hierarchical
  8. Network
  9. Multi-Value

1 is the most common. 2-4 are considered NoSQL models (a loose term). 7-9 are obsolete.

Relational Model

Early database applications were difficult to write: every time the database schema or layout changed, IBM would need to rewrite its database programs. Ted Codd devised the relational model to address this problem. The relational model defines a database abstraction based on relations to reduce maintenance overhead. Key tenets:

  • Store database in simple data structures (relations)
  • Physical storage left up to the DBMS implementation
  • Access data through high-level language, DBMS figures out the best execution model

A relation is an unordered set containing the relationships of the attributes that represent entities. An n-ary relation is equivalent to a table with n columns. A tuple is a set of attribute values (also known as its domain) in the relation. The special value NULL is a member of every domain.

A relation's primary key uniquely identifies a single tuple. Some DBMSs automatically create an internal primary key if a table does not define one. Primary key is a constraint that the DBMS will enforce to ensure no duplicates exist. A foreign key specifies that an attribute from one relation maps to a tuple in another relation. E.g. If we have an artist table with the artist id, and an album table with an artist column, the artist column is a foreign key.

We can impose constraints on the database that must hold for any tuple. DBMS will then prevent any modification that could violate those constraints. Unique and foreign key constraints are the most common. e.g. CREATE ASSERTION in SQL.

Data Manipulation Languages (DML)

There are two broad methods to store and retrieve information from a database:

  1. Procedural: the query specifies the high level strategy to find the desired result based on sets. This uses relational algebra.
  2. Non-Procedural (Declarative): The query specifies only what data is wanted and not how to find it. This uses relational calculus.

Relational Algebra

Fundamental operations to retrieve and manipulate tuples in a relation. Based on set algebra (unordered lists with no duplicates). Each operator takes one or more relations as its inputs and outputs a new relation. We can thus chain operators together to create more complex operations. The operations are:

  • SELECT. Choose a subset of the tuples from a relation that satisfies a selection predicate (filter). Predicates act as filters to retain only tuples that fulfill the qualifying requirement. We can combine multiple predicates using conjunctions / disjunctions.
    • Syntax: $\sigma_{predicate}(R)$
    • Example: SELECT * FROM R WHERE id = 'a'
  • PROJECTION. Generate a relation with tuples that contain only the specified attributes. E.g. re-arrange ordering, manipulate values (e.g. arithmetic on attributes), and remove unwanted attributes.
    • Syntax: $\pi_{A_1, \ldots, A_n}(R)$
    • Example: SELECT b_id - 100 FROM R
  • UNION. Generate a relation that contains all tuples that appear in one or both input relations. Note that R and S must have the same schema.
    • Syntax: $R \cup S$
    • Example: (SELECT * FROM R) UNION (SELECT * FROM S)
  • INTERSECTION. Generate a relation that contains only the tuples that appear in both of the input relations.
    • Syntax: $R \cap S$
    • Example: (SELECT * FROM R) INTERSECT (SELECT * FROM S)
  • DIFFERENCE. Generate a relation that contains only the tuples that appear in the first and not the second of the input relations.
    • Syntax: $R - S$
    • Example: (SELECT * FROM R) EXCEPT (SELECT * FROM S)
  • JOIN. Generate a relation that contains all tuples that are a combination of two tuples (one from each input relation) with a common value for one or more attributes.
    • Syntax: $R \bowtie S$
    • Example: SELECT * FROM R NATURAL JOIN S;
  • Rename
  • Assignment
  • Duplicate Elimination
  • Aggregation
  • Sorting
  • Division

Relational algebra defines an ordering of the high level steps of how to compute a query. E.g.

  • $\sigma_{p}(R \bowtie S)$ vs $R \bowtie \sigma_{p}(S)$ for some predicate $p$. The former will do a huge join before filtering, whereas the latter filters first before joining, which is much better.

Instead of specifying the exact operations, DBMS allow us to state the high level answer we want, and the DBMS decides on the exact operations and the ordering to perform. This abstracts away the need for the user to know which operations are more efficient. Note that the relational model is independent of the query language implementation, although SQL is the de-facto standard (with many variants).

Document Data Model

A collection of record documents containing a hierarchy of named field / value pairs. A field's value can be a scalar type, an array, or another document. Modern implementations use JSON. The main reason for this model is to avoid the relational-object impedance mismatch: relational databases store data in rows with relationships between tables, but object-oriented languages like Python store data in objects with nested attributes, which can lead to inefficient queries when mapping between the two. In contrast, document databases store data as nested JSON which closely resembles the object-oriented representation, making it easier to work with. The downside is that we can end up storing a lot of duplicate data in the JSON objects.

  • Examples: MongoDB, RavenDB, DynamoDB etc.

Vector Data Model

One-dimensional arrays used for nearest neighbour search, used for semantic search on embeddings generated by transformer models. Native integration with modern ML tools etc. At their core, these are just systems that use specialized indexes (e.g. Meta FAISS) to perform NN search quickly.

  • Examples: Pinecone, Weaviate, Milvus etc.

Lecture 2: Modern SQL

In 1971, the first relational query language, SQUARE, was created. In 1972, SEQUEL (Structured English Query Language) was created. SQL was added to the ANSI standard in 1986. The current standard is SQL:2023.

  • SQL:2023 - property graph queries, multi-dim arrays
  • SQL:2016 - JSON, polymorphic tables etc.

Relational languages:

  • Data Manipulation Language (DML)
  • Data Definition Language (DDL)
  • Data Control Language (DCL)

Important: SQL is based on bags (with duplicates), not sets (no duplicates).

We should try to do everything on the database, in one big query.

Example database:

  • student table: sid, login, gpa etc.
  • enrolled table: sid, cid (course id), grade
  • course table: cid, name

Aggregates:

  • AVG(col). e.g. SELECT AVG(s.gpa) FROM student as s
  • MIN(col)
  • MAX(col)
  • COUNT(col). e.g. SELECT COUNT(login) AS cnt FROM student WHERE login LIKE '%@cs'. Equivalently, COUNT(1). Counts the number of rows whose login matches the pattern.

Groupby: Get average gpa by course id.

SELECT AVG(s.gpa), e.cid
    FROM enrolled as e JOIN student AS s
    ON e.sid = s.sid
GROUP BY e.cid

String operations.

  • LIKE is used for string matching. % matches any substring, including empty strings. _ matches any one character.
  • SQL-92 defines string functions.

Window functions. Perform a sliding calculation across a set of tuples that are related. Like an aggregation but tuples are not grouped into a single output tuple.

SELECT ... FUNC-NAME(...) OVER (...)
    FROM TABLE_NAME

e.g. Get row number per course id.

SELECT cid, sid,
    ROW_NUMBER() OVER (PARTITION BY cid)
    FROM enrolled
ORDER BY cid

Nested queries. Invoke a query inside of another query to compose more complex computations. These are often difficult for the DBMS to optimize. e.g. This one below is a join written as a nested query.

outer query ->    SELECT name FROM student WHERE
                    sid IN (SELECT sid FROM enrolled) <- inner query

e.g. Get the names of students in '15-445':

SELECT name FROM student
    WHERE sid IN (
        SELECT sid FROM enrolled
        WHERE cid = '15-445'
    )

Lateral joins. LATERAL operator allows us to reference tables preceding the expression. e.g. below, the second expression can reference t1.

SELECT * FROM (SELECT 1 AS X) AS t1,
    LATERAL (SELECT t1.x+1 AS y) AS t2;

Common Table Expressions. Provides a way to write auxiliary statements for use in a larger query, i.e. a temp table just for one query.

WITH cteName (col1, col2) as (
    SELECT 1, 2
)
SELECT col1 + col2 FROM cteName

Demonstration of CTEs: Find student record with the highest id that is enrolled in at least one course. We use the maxId in the temporary table below.

WITH cteSource (maxId) AS (
    SELECT MAX(sid) FROM enrolled
)
SELECT name FROM student, cteSource
    WHERE student.sid = cteSource.maxId

We can also use recursion with CTE. Print the sequence of numbers from 1 to 10.

WITH RECURSIVE cteSource (counter) AS (
    (SELECT 1)
    UNION
    (SELECT counter + 1 FROM cteSource
    WHERE counter < 10)
)
SELECT * FROM cteSource

Lecture 3: Database Storage 1

Class will shift gears towards how to implement a DBMS. Focus is on a single node machine first, then move to concurrency. Today's focus is on the bottom layer of the system: how the DBMS represents the database in files on disk.

Focus on a disk-oriented architecture, i.e. the DBMS assumes that the primary storage location of the database is on non-volatile disk. The DBMS's components manage the movement of data between volatile (e.g. RAM) and non-volatile storage.

Storage hierarchy. Higher rank is faster, smaller, expensive.

  1. CPU Registers
  2. CPU Caches
  3. DRAM (i.e. memory)
  4. SSD
  5. HDD
  6. Network Storage

Items 1-3 are volatile, i.e. random access, byte-addressable (we can fetch specific bytes). Items 4-6 are non-volatile, i.e. sequential access, block-addressable. For non-volatile storage, we need to fetch data one block at a time, and accessing data in one contiguous block sequentially is much faster than trying to access blocks scattered around. CPU refers to items 1 and 2, Memory refers to item 3, Disk refers to items 4-6.

Some newer storage categories: fast network storage sits between HDD and SSD, and persistent memory sits between SSD and DRAM.

DBMSs need to exploit the sequential access on non-volatile memory to maximize speed of retrieval. Hence how we store data contiguously matters in the design of the DBMS.

Access times:

- 1 ns          L1 Cache Ref        1 sec
- 4 ns          L2 Cache Ref        4 sec
- 100 ns        DRAM                100 sec
- 16,000 ns     SSD                 4.4 hours
- 2,000,000 ns  HDD                 3.3 weeks
- 50,000,000 ns Network Storage     1.5 years

System Design Goals

The overall goal is to make it appear like we have more memory than we do:

  1. We want to allow the DBMS to manage databases that exceed the amount of memory available.
  2. Reading/ writing to disk is expensive, so it must be managed carefully to avoid large stalls.
  3. Random access on disk is much slower, so DBMS wants to maximize sequential access.

Disk oriented DBMS

On disk, we have database files stored as pages; each page is a block of data. When the execution engine makes a query, the DBMS loads the required page into the buffer pool in memory. The buffer pool returns a 64-bit pointer to the memory location where the page resides. The execution engine interprets the layout of the page and updates it, and the new page is then written back to the database file on disk.
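A toy Python sketch of that flow, with a dict standing in for the disk file and an LRU policy standing in for the buffer replacement policy (all names are illustrative):

```python
from collections import OrderedDict

class BufferPool:
    """Toy buffer pool: fetch pages from 'disk' on demand, evict LRU when full."""
    def __init__(self, disk, capacity=2):
        self.disk = disk                    # dict-like {page_id: page_bytes} standing in for the file
        self.capacity = capacity
        self.frames = OrderedDict()         # page_id -> page bytes, kept in LRU order

    def get_page(self, page_id):
        if page_id not in self.frames:
            if len(self.frames) >= self.capacity:
                evicted_id, page = self.frames.popitem(last=False)  # evict least recently used
                self.disk[evicted_id] = page                        # write it back (assume dirty)
            self.frames[page_id] = self.disk[page_id]               # "read" the page from disk
        self.frames.move_to_end(page_id)                            # mark as most recently used
        return self.frames[page_id]
```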

Why not use the OS?

The DBMS could potentially use memory mapping mmap to store the contents of a file (or page) in the address space of a program. The benefit is that we can tap on the OS to decide which pages to load into physical memory and when to swap pages out. The downside is that the lack of control over memory management can lead to stalling and bad performance.

How the OS works:

  1. Suppose we have Page 1, ..., Page 4 on disk. There's only enough space in physical memory to load 2 pages.
  2. The OS represents this to the program as virtual memory, where Page 1, ..., Page 4 are available
  3. When the program touches Page 2, the OS will load Page 2 into physical memory and return a pointer to the program
  4. Suppose the program now touches Page 3 and Page 4
  5. The OS needs to decide which page to evict from physical memory

So the problems with using mmap to handle I/O for the DBMS:

  1. Transaction safety. The OS can flush dirty pages to disk at any time, even mid-write. We can get corrupted data on disk.
  2. I/O stalls. The DBMS does not know which pages are in memory, so the OS could stall due to page faults (fetching a page not in memory).
  3. Error handling. Difficult to validate pages. Any access can cause a SIGBUS that the DBMS needs to handle.
  4. Performance issues. The OS has its own scheduling and data structures, and can contend with the DBMS's priorities.

There are some companies that use mmap:

  • RavenDB
  • ElasticSearch
  • QuestDB
  • MongoDB (moved away from mmap)

So DBMS almost always wants to control things itself:

  • Flushing dirty pages to disk in the correct order
  • Specialized pre-fetching
  • Buffer replacement policy
  • Thread / process scheduling

Paper on this: Are you sure you want to Use MMAP in your DBMS?

For database storage, we need to deal with two problems:

  1. How the DBMS represents the database in files on disk
  2. How the DBMS manages its memory and moves data back and forth from disk

File Storage

The DBMS stores a database as one or more files on disk typically in a proprietary format. The OS doesn't know anything about the contents of these files. There are portable file formats (e.g. parquet etc.). Early systems in 1980s further used custom filesystems on raw block storage. Most newer DBMSs do not do this.

The storage manager is responsible for maintaining a database's files. It organizes the files as a collection of pages, tracks the data read / written to pages, and tracks the available space in each page. A DBMS does not typically maintain multiple copies of a page on disk; this happens above or below the storage manager.

A page is a fixed-size block of data. It can contain tuples, meta-data, indexes, logs etc. Most systems do not mix page types. Some systems require a page to be self-contained, i.e. all the information needed to understand the data in the page is contained within the page itself (e.g. Oracle). Each page is given a unique identifier, and the DBMS uses an indirection layer to map page IDs to physical locations (e.g. in memory or in S3 or whatever).

There are 3 different notions of what a page is:

  1. Hardware page (typically 4KB). A hardware page is the largest block of data that the storage device can guarantee failsafe writes (i.e. if it says it wrote 4KB, it actually happened).
  2. OS Page (usually 4KB, but can be much larger)
  3. Database Page (512B to 32KB)

Why might a larger page size be a good idea? Sequential data access. If we request a page of 16KB, it is one call to the OS and data is stored sequentially. If we request 4 pages of 4KB each, it is potentially random access. But the downside is that writing to a larger page is slower than writing to a small one.

Page Storage Architecture

How do DBMSs map page IDs to physical location? Different approaches:

  • Heap file organization <- most common
  • Tree file organization
  • Sequential / Sorted File Organization (ISAM)
  • Hashing file organization

A heap file is an unordered collection of pages with tuples that are stored in random order.