Liu 2023 - Meaning Representations from Trajectories

Paper Link

This paper demonstrates a way to compute sentence similarities using auto-regressive LLMs with a decoder-only architecture.

In the encoder-only setting, exemplified by BERT models, a sentence is typically represented by embeddings from the encoder's final layer: either the [CLS] token embedding or the element-wise mean of all token embeddings at that layer. As Reimers 2019 demonstrates, a BERT model trained only with the masked language modelling objective does not automatically produce embeddings suitable for Semantic Textual Similarity (STS) tasks. Hence the BERT model needs to be fine-tuned on a sentence similarity task to generate good embeddings.
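For reference, a minimal sketch of the usual mean-pooling recipe over final-layer token embeddings, using Hugging Face transformers (the checkpoint name is only an example):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Example only: any BERT-style encoder checkpoint works here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def mean_pool_embedding(sentence: str) -> torch.Tensor:
    """Represent a sentence by the attention-masked mean of final-layer token embeddings."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state       # (1, seq_len, dim)
    mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (1, dim)
```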

This paper is interesting as it does not require any such fine-tuning. The main idea is to represent a sentence (a sequence of tokens) by the distribution of text continuations (referred to as trajectories) that the auto-regressive LLM generates when conditioned on that sentence. In other words, as the paper elegantly puts it, we represent the meaning of linguistic terms by their usage.

Setup

Let $\mathcal{V}$ denote a finite vocabulary of tokens and $\mathcal{V}^*$ indicate the set of all variable-length finite sequences of tokens in $\mathcal{V}$. A language model may be viewed as a map $f: \mathcal{V}^* \times \mathcal{V}^* \to \mathbb{R}$ that associates a prompt sequence $s \in \mathcal{V}^*$ and a possible continuation sequence $y \in \mathcal{V}^*$ with a likelihood score $f(s, y)$.

The paper chose inverse perplexity as the score:

$$f(s, y) = p(y \mid s)^{1/|y|} = \left( \prod_{i=1}^{|y|} p(y_i \mid s, y_{<i}) \right)^{1/|y|}$$

Side note on inverse perplexity. We know that perplexity is defined as below, hence the above score is indeed the inverse perplexity.

$$\mathrm{PPL}(y \mid s) = p(y \mid s)^{-1/|y|} = \exp\left( -\frac{1}{|y|} \sum_{i=1}^{|y|} \log p(y_i \mid s, y_{<i}) \right)$$
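As a concrete sketch, assuming we already have the per-token conditional log-probabilities of the continuation under the decoder, the score can be computed as:

```python
import math

def inverse_perplexity(token_log_probs: list[float]) -> float:
    """Score f(s, y): length-normalized likelihood of continuation y given prompt s.

    token_log_probs[i] is assumed to be log p(y_i | s, y_{<i}) from the decoder.
    """
    avg_log_prob = sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_log_prob)  # p(y|s)^(1/|y|), i.e. 1 / perplexity
```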

Given two prompts $s$ and $t$, we denote their semantic representations under model $f$ as $f(s, \cdot)$ and $f(t, \cdot)$. The semantic distance between them may be denoted $d\big(f(s,\cdot),\, f(t,\cdot)\big)$, where $d$ is some distance function. The paper chose:

$$d(s, t) = \mathbb{E}_{y \sim \frac{1}{2}\left(p(\cdot \mid s) + p(\cdot \mid t)\right)} \Big[ \big| \log f(s, y) - \log f(t, y) \big| \Big]$$

In other words, we sample a continuation trajectory $y$ of length $m$ with equal probability from either prompt, and compute the expected absolute difference in the log-likelihood that each prompt assigns to the generated continuation. Since it is not feasible to integrate the above expression over all possible trajectories $y$, we sample $n$ trajectories and compute a Monte Carlo estimate of the expectation.
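A minimal sketch of this Monte Carlo estimate, assuming two hypothetical helpers: `sample_continuation(prompt, m)` (samples m tokens from the decoder given the prompt) and `log_score(prompt, continuation)` (returns log f(prompt, continuation), i.e. the log of the inverse perplexity above):

```python
import random

def trajectory_distance(s: str, t: str, n: int = 20, m: int = 20) -> float:
    """Estimate d(s, t) by sampling n continuations of length m tokens.

    Each trajectory is drawn with equal probability from either prompt, then
    scored under both prompts; the distance estimate is the mean absolute
    difference in log-likelihood.
    """
    total = 0.0
    for _ in range(n):
        prompt = random.choice([s, t])         # mixture 1/2 (p(.|s) + p(.|t))
        y = sample_continuation(prompt, m)      # hypothetical decoder sampling helper
        total += abs(log_score(s, y) - log_score(t, y))
    return total / n
```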

Results

For the experiments, the authors set n=20, m=20, and a sampling temperature of 1.0.

  • Ablating on the number of trajectories n shows that most of the performance is achieved with n=20, with marginal gains thereafter.
  • Ablating on sequence length m shows that most of the performance is achieved with m=20, with marginal gains thereafter. Interestingly, there is a big improvement in performance from m=1 to m=10, showing that the distribution of the next token alone is not sufficiently discriminating as a semantic representation.
  • Ablating on sampling temperature shows that 1.0 is optimal. Sampling from too diverse or too similar trajectories hurts the performance of this method.

The results show that this is the best-performing way so far to extract sentence representations from autoregressive models, even slightly outperforming encoder-based models like T5 on the STS task. The results are not as good as models explicitly trained with unsupervised contrastive learning for STS (see SimCSE), but as the decoder model size increases, performance approaches similar levels.

Application to Semantic Retrieval

Unfortunately, this method is not directly suitable for semantic retrieval, as the distance measure requires sampling continuations from both the query and each document (if we treat the query as one prompt and the document as the other), so document-side representations cannot be precomputed independently of the query. Note that the paper's setting is Semantic Textual Similarity, where the two sides come from similar distributions.

The authors create a proof-of-concept workaround by using a fixed set of trajectories $T$ instead of sampling from the prompts: they select random documents from the dataset and create a sample trajectory from each to form $T$. We can then precompute the scores $f(d, t)$ for each document $d$ over this fixed set $T$. At inference time, we only need to compute $f(q, t)$ for the query $q$ over the same fixed set, then compare against the precomputed document scores to get the distance scores. The authors show that this approximation obtains reasonable performance.
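A sketch of this approximation, reusing the hypothetical `log_score` helper from above; `trajectories` is the fixed set T built offline from randomly chosen documents, and `corpus` is an assumed mapping from document ids to document texts:

```python
import numpy as np

def representation(prompt: str, trajectories: list[str]) -> np.ndarray:
    """Represent a prompt by its log-scores over the fixed trajectory set T."""
    return np.array([log_score(prompt, t) for t in trajectories])

# Offline: precompute a representation for every document.
doc_reps = {doc_id: representation(doc_text, trajectories)
            for doc_id, doc_text in corpus.items()}

# Online: score a query against all documents via the same fixed trajectories.
def retrieve(query: str, k: int = 10) -> list:
    q_rep = representation(query, trajectories)
    dists = {doc_id: np.abs(q_rep - rep).mean() for doc_id, rep in doc_reps.items()}
    return sorted(dists, key=dists.get)[:k]   # k documents with smallest distance
```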

However, I note that in typical semantic search settings, queries and documents usually come from different distributions: queries are short while documents are long. Hence, the continuations of a query and of a document might not be directly comparable. That said, the beauty of this method is that, since it is based on continuations, we can use prompting to adjust the behaviour of the LLM and roughly align the two distributions.

For example, on the document side we may generate trajectories using a prompt like "Given document d, generate a query." On the query side, we may use something like "Given query q, generate other similar queries." We can then compare trajectories generated in this fashion.
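As an illustration of this prompting idea (the templates below are my own hypothetical examples, not from the paper; `doc_text` and `query_text` are assumed inputs):

```python
# Hypothetical prompt templates to roughly align query-side and document-side
# continuation distributions before sampling trajectories.
DOC_PROMPT = "Given the following document, generate a query a user might ask.\n\nDocument: {doc}\nQuery:"
QUERY_PROMPT = "Given the following query, generate other similar queries.\n\nQuery: {query}\nSimilar query:"

doc_trajectories = [sample_continuation(DOC_PROMPT.format(doc=doc_text), m=20)
                    for _ in range(20)]
query_trajectories = [sample_continuation(QUERY_PROMPT.format(query=query_text), m=20)
                      for _ in range(20)]
```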

Another issue with the original workaround is latency. Even with a fixed trajectory set, we still need to run the decoder at inference time to compute $f(q, t)$ for every trajectory $t \in T$, which is costly even for a small number of trajectories. One way to deal with this is a hybrid setup (see the sketch after the list below):

  • On the document side, we use the high-precision decoder to generate a set of n prospective queries from the documents (say n=1,000). Each document is represented as an embedding of size n, where position i records the likelihood of prospective query i given the document under the decoder model.
  • We use a separate fine-tuned encoder model to precompute embeddings for each of the prospective queries.
  • On the query side, during inference, we use the same encoder model to embed the query and compute similarity against the n prospective queries. The similarity scores form a representation of size n for the query.
  • We may then compute similarity between the size n query and size n document representations and perform retrieval with standard nearest neighbour vector databases.
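A rough sketch of this hybrid setup. All names here (`decoder_log_score`, `encoder_embed`, `prospective_queries`) are hypothetical placeholders of my own, not from the paper; the encoder embeddings are assumed to be unit-normalized so that dot products are cosine similarities:

```python
import numpy as np

def doc_representation(doc_text: str, prospective_queries: list[str]) -> np.ndarray:
    """Size-n document representation: decoder log-likelihood of each prospective query given the doc."""
    return np.array([decoder_log_score(doc_text, q) for q in prospective_queries])

def query_representation(query_text: str, prospective_query_embs: np.ndarray) -> np.ndarray:
    """Size-n query representation: encoder similarity of the query to each prospective query."""
    q_emb = encoder_embed(query_text)          # (dim,), precomputed encoder embedding
    return prospective_query_embs @ q_emb      # (n,) similarity scores

def score(query_rep: np.ndarray, doc_rep: np.ndarray) -> float:
    """Compare the size-n query and document representations (dot product as one simple choice)."""
    return float(query_rep @ doc_rep)
```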

In contrast to standard embedding vectors, where each position encodes an opaque quantity, these representation vectors have a clear semantic meaning in each position (the strength of the relation to a particular prospective query). This setup also allows us to trade off precision against recall by tweaking the sparsity of the representation vectors, using score thresholding or top-k thresholding on either (or both) the query and document sides.
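For instance, a simple top-k sparsification of these size-n representations might look like the sketch below, with k as the tunable knob:

```python
import numpy as np

def topk_sparsify(rep: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k strongest prospective-query scores; zero out the rest."""
    sparse = np.zeros_like(rep)
    idx = np.argsort(rep)[-k:]   # indices of the k largest scores
    sparse[idx] = rep[idx]
    return sparse
```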