Gao et al. 2021 - SimCSE: Simple Contrastive Learning of Sentence Embeddings

Paper Link

This paper proposes unsupervised and supervised approaches for fine-tuning sentence encoder models to produce sentence embeddings, evaluated on semantic textual similarity (STS) tasks. The STS tasks are a set of tasks from SemEval 2012 to 2017 in which, given two sentences, a system outputs a similarity score (human annotations use a 0-5 scale) and is evaluated by Spearman correlation with the human scores.
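
As a rough illustration of the evaluation protocol (the scores below are made up, not from the paper), a system's similarity scores are compared to the gold annotations by rank correlation:

```python
from scipy.stats import spearmanr

# Hypothetical model similarities (e.g. cosine similarities) for four sentence pairs,
# and the corresponding gold human scores on the 0-5 STS scale.
model_scores = [0.92, 0.35, 0.71, 0.10]
gold_scores = [4.2, 1.0, 3.1, 0.4]

rho, _ = spearmanr(model_scores, gold_scores)
print(f"Spearman correlation: {rho:.3f}")  # 1.0 here, since the two rankings agree exactly
```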

Task Dataset

Example sentence pair with a score of 4 (taken from SemEval 2016):

  • In May 2010, the troops attempted to invade Kabul.
  • The US army invaded Kabul on May 7th last year, 2010

Example sentence pair with a score of 3:

  • John said he is considered a witness but not a suspect.
  • "He is not a suspect anymore." John said.

Unsupervised SimCSE

SimCSE follows the popular contrastive learning framework with in-batch negatives, where pairs of related sentences are trained to have high similarity while having low similarity with the other sentences in the mini-batch.

The idea of unsupervised SimCSE is simple: given a collection of sentences $\{x_i\}_{i=1}^{m}$, treat each sentence as its own positive pair, and use dropout noise to introduce random perturbations so that the self-similarity is not trivially perfect.

Specifically, if we denote $\mathbf{h}_i^{z}$ as the embedding of sentence $x_i$ under dropout mask $z$, then the unsupervised SimCSE loss for $x_i$ in a mini-batch of $N$ sentences is:

$$\ell_i = -\log \frac{e^{\mathrm{sim}\left(\mathbf{h}_i^{z_i},\, \mathbf{h}_i^{z_i'}\right)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}\left(\mathbf{h}_i^{z_i},\, \mathbf{h}_j^{z_j'}\right)/\tau}}$$

where $z_i$ and $z_i'$ are two independently sampled dropout masks for the same sentence and $\mathrm{sim}(\cdot, \cdot)$ is cosine similarity.

Note that $\tau$ is the temperature hyperparameter. Importantly, the authors found that cosine similarity with $\tau = 1$ performs very poorly (64.0), compared to the plain dot product (85.9). However, carefully tuning the temperature closes the gap ($\tau = 0.05$ scored 86.2).

We may view this procedure as data augmentation, analogous to how random pixel distortions and rotations are applied to images to improve computer vision models. The paper shows that this simple unsupervised method significantly outperforms other data augmentation methods. Note that the authors used the default 10% dropout for BERT models.
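
As a minimal PyTorch sketch of this objective (not the authors' actual training code; `encoder` is a placeholder that maps a batch of sentences to embeddings), the trick is simply to run the same batch through the encoder twice with dropout enabled:

```python
import torch
import torch.nn.functional as F

def unsup_simcse_loss(encoder, sentences, tau=0.05):
    """One unsupervised SimCSE step: encode the same batch twice so that
    two different dropout masks produce two slightly different embeddings."""
    encoder.train()                      # keep dropout active
    h1 = encoder(sentences)              # (N, d) embeddings, first dropout mask
    h2 = encoder(sentences)              # (N, d) embeddings, second dropout mask

    # Cosine similarity between every pair across the two views: (N, N) matrix.
    sim = F.cosine_similarity(h1.unsqueeze(1), h2.unsqueeze(0), dim=-1) / tau

    # The positive for sentence i is its own second view (the diagonal);
    # all other sentences in the batch act as in-batch negatives.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```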

Supervised SimCSE

The supervised version follows a similar framework, although the positive pairs are taken from an external dataset. The authors chose the Natural Language Inference (NLI) datasets, where each example is a triplet of sentences: the premise is denoted $x_i$, the entailment sentence $x_i^+$ and the contradiction sentence $x_i^-$. Writing $\mathbf{h}_i$, $\mathbf{h}_i^+$ and $\mathbf{h}_i^-$ for their embeddings, the loss is then formulated as:

$$\ell_i = -\log \frac{e^{\mathrm{sim}\left(\mathbf{h}_i,\, \mathbf{h}_i^+\right)/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}\left(\mathbf{h}_i,\, \mathbf{h}_j^+\right)/\tau} + e^{\mathrm{sim}\left(\mathbf{h}_i,\, \mathbf{h}_j^-\right)/\tau} \right)}$$

so that the contradiction sentences serve as additional hard negatives alongside the in-batch negatives.
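
A corresponding sketch for the supervised objective (again illustrative only, assuming precomputed embeddings for premises, entailments, and contradictions):

```python
import torch
import torch.nn.functional as F

def sup_simcse_loss(h, h_pos, h_neg, tau=0.05):
    """Supervised SimCSE loss: h, h_pos, h_neg are (N, d) embeddings of the
    premises, entailment sentences, and contradiction (hard negative) sentences."""
    # Similarities of each premise against all entailments and all contradictions.
    sim_pos = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1)  # (N, N)
    sim_neg = F.cosine_similarity(h.unsqueeze(1), h_neg.unsqueeze(0), dim=-1)  # (N, N)

    # The denominator runs over all entailments and all contradictions in the batch;
    # the positive logit for row i is sim_pos[i, i].
    logits = torch.cat([sim_pos, sim_neg], dim=1) / tau                         # (N, 2N)
    labels = torch.arange(h.size(0), device=h.device)
    return F.cross_entropy(logits, labels)
```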

Ablation Studies

  • The paper finds that including contradiction sentences as hard negatives yields a small but significant improvement in performance
  • The paper finds that using the [CLS] token or averaging embeddings across the first and last layer does not make much difference

Alignment and Uniformity

Wang and Isola 2020 propose two metrics for measuring the effectiveness of an embedding method on a set of documents:

  • Alignment. Given a distribution $p_{\text{pos}}$ of positive pairs of documents, alignment measures the expected distance between the embeddings $f(\cdot)$ of each pair, which should be small:

    $$\ell_{\text{align}} = \mathbb{E}_{(x,\, x^+) \sim p_{\text{pos}}} \left\lVert f(x) - f(x^+) \right\rVert^2$$

  • Uniformity. Given any two documents drawn independently from the corpus distribution $p_{\text{data}}$, the uniformity metric (the log of the average Gaussian potential between their embeddings) should also be small, i.e. embeddings should on average be far apart and spread evenly over the hypersphere:

    $$\ell_{\text{uniform}} = \log \mathbb{E}_{x,\, y \,\sim\, p_{\text{data}}} \, e^{-2 \lVert f(x) - f(y) \rVert^2}$$
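
As a rough sketch (following Wang and Isola's definitions, not code taken from the paper), both metrics can be computed from batches of L2-normalized embeddings:

```python
import torch

def alignment(x, x_pos, alpha=2):
    """Alignment: mean (squared, for alpha=2) distance between embeddings of
    positive pairs. x and x_pos are (N, d) L2-normalized embeddings where
    (x[i], x_pos[i]) is a positive pair."""
    return (x - x_pos).norm(p=2, dim=1).pow(alpha).mean()

def uniformity(x, t=2):
    """Uniformity: log of the mean Gaussian potential over all pairs of
    embeddings in x; lower values mean the embeddings are spread more evenly."""
    sq_dists = torch.pdist(x, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()
```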

A commonly noted problem in training language models is anisotropy, in which embeddings are pushed into a narrow cone of the vector space, severely limiting their expressiveness. The anisotropy problem is naturally connected to uniformity, which aims at distributing embeddings evenly in the space. Through an analysis of the contrastive objective (omitted here), the authors argue that contrastive learning as proposed in this paper directly addresses this problem.

Empirically, they show that SimCSE's alignment is comparable to that of average BERT embeddings, while its uniformity loss is significantly lower, which translates into much better STS performance.