Kaiser 2018 - Fast Decoding in Latent Space

Fast Decoding in Sequence Models using Discrete Latent Variables

This paper uses the ideas from VQ-VAE to do fast decoding in latent space.

Main Idea

Transformers parallelize well during training, but are slow at inference due to their autoregressive nature. This paper proposes to generate autoregressively in a shorter discrete latent space instead, which can be much faster depending on the compression:

  • First auto-encode target sequence into discrete latent space which is a shorter sequence
  • Train a decoder to generate in discrete latent space autoregressively
  • Decode output sequence from the shorter latent sequence in parallel

Here is an infographic from nano banana that captures the idea of generation in latent space:

Fast Generation in Latent Space

Setup

When generating sequential output, autoregressive models generate in a canonical order (e.g. left to right). This is because the model is trained to predict each token given all previous tokens:

$$p(y) = \prod_{i=1}^{n} p(y_i \mid y_1, \ldots, y_{i-1})$$

During training, because the ground truth is known, all positions can be predicted in parallel. During decoding, however, each token can only be generated after the previous one, which is a fundamental limitation on speed.

The proposal is to first encode the original sequence $y = (y_1, \ldots, y_n)$ into a shorter sequence $l = (l_1, \ldots, l_m)$ of discrete tokens, where $m$ is much smaller than $n$. This latent sequence is still autoregressively predicted, but is subsequently decoded in parallel back into the original token space. In the experiments, $m = n/8$.
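As a quick worked example (my numbers, not the paper's): with a target of length $n = 64$ and an $8\times$ compression,

$$m = \frac{n}{8} = 8,$$

so inference needs only $8$ sequential steps in latent space, followed by a single parallel pass that produces all $64$ output tokens.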

Latent Transformer

The latent transformer is described in terms of a machine translation task. Given an input sequence $x$ in English and a corresponding output sequence $y$ in German, our input-output pair is $(x, y)$.

The high level architecture is:

  • The function $\mathrm{ae}(y)$ will autoencode $y$ into a shorter sequence of discrete latent tokens $l = \mathrm{ae}(y)$ using some discretization technique
  • The latent prediction model $\mathrm{lp}$, which is a transformer, will autoregressively predict $l$ based on $x$
  • The decoder $\mathrm{ad}$ is a parallel model that will decode $y$ from $l$ and the input sequence $x$
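As a rough illustration, here is a minimal PyTorch sketch of the resulting two-stage inference loop. The names `lp` and `ad`, the call signatures, and the greedy argmax decoding are my assumptions, not the paper's code:

```python
import torch

@torch.no_grad()
def fast_decode(lp, ad, x, latent_len, bos_id=0):
    """Two-stage inference: autoregressive over the short latent sequence,
    then one parallel decode back into token space."""
    # 1) Autoregressively predict the m latent tokens from the source x.
    latents = torch.full((x.size(0), 1), bos_id, dtype=torch.long, device=x.device)
    for _ in range(latent_len):
        logits = lp(x, latents)                          # (batch, cur_len, K)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        latents = torch.cat([latents, next_tok], dim=1)
    latents = latents[:, 1:]                             # drop the start token

    # 2) Decode the full target sequence from (l, x) in a single parallel pass.
    y_logits = ad(latents, x)                            # (batch, n, vocab)
    return y_logits.argmax(dim=-1)
```

The autoencoder $\mathrm{ae}$ only appears at training time, where it provides the latent targets for $\mathrm{lp}$ and the latent inputs for $\mathrm{ad}$.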

Loss

The reconstruction loss $\ell_r$ compares the encoded-then-decoded output $\mathrm{ad}(\mathrm{ae}(y), x)$ to the original target sequence $y$:

$$\ell_r = -\log p\big(y \mid \mathrm{ae}(y), x\big)$$

The latent prediction loss $\ell_{lp}$ compares the true latents $l = \mathrm{ae}(y)$ generated by the autoencoder to the latents generated by the prediction model $\mathrm{lp}(x)$:

$$\ell_{lp} = -\log p\big(\mathrm{ae}(y) \mid x\big)$$

The final loss is the sum of the two:

$$\ell = \ell_r + \ell_{lp}$$
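A minimal sketch of this combined objective, assuming both models emit per-position logits (tensor shapes and names are mine):

```python
import torch.nn.functional as F

def latent_transformer_loss(dec_logits, y, lp_logits, l):
    """dec_logits: (batch, n, vocab) from ad(ae(y), x); y: (batch, n) target tokens.
    lp_logits: (batch, m, K) from lp(x); l: (batch, m) latents produced by ae(y)."""
    l_r = F.cross_entropy(dec_logits.transpose(1, 2), y)   # reconstruction loss
    l_lp = F.cross_entropy(lp_logits.transpose(1, 2), l)   # latent prediction loss
    return l_r + l_lp
```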

Autoencoder

Now we describe each part in more detail. The diagram below summarizes the architecture of the autoencoder $\mathrm{ae}$.

Autoencoder architecture

The input to the autoencoder is of shape length x hidden_size. The aim is to shorten the sequence length by downsampling using convolutions.

We first pass through a residual block which aims to learn local features using convolutions:

  • Pass through a ReLU
  • Pass through a 1D convolution with k=3, s=1. This means we look over a window of 3 tokens at a time, moving 1 token per step.
    • Since stride is 1, the sequence length is unchanged
  • Pass through layer norm + add the residual connection

Then we pass through a standard self attention block.

Then we pass c times through a downsampling convolution:

  • k=2 means that we look over a window of 2 tokens
  • s=2 means that we stride over 2 steps at a time: this effectively halves the sequence length
  • Doing this c times means we get a $2^c$ reduction in length

Finally we pass through the bottleneck function, which is described below. Essentially, the bottleneck function converts the hidden_size representation at each position into a discrete token, and then replaces it with the embedding of that discrete token via a lookup.
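Here is a rough PyTorch sketch of this compression path, under my own assumptions about layer sizes and padding rather than the paper's exact hyperparameters (the bottleneck itself is covered in later sections):

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """ReLU -> 1D conv (k=3, s=1, same padding) -> layer norm, plus a skip connection."""
    def __init__(self, d):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(d)

    def forward(self, x):                                    # x: (batch, length, d)
        h = self.conv(torch.relu(x).transpose(1, 2)).transpose(1, 2)
        return self.norm(h) + x                              # stride 1: length unchanged

class CompressionEncoder(nn.Module):
    """Residual conv block -> self-attention -> c down-convs that each halve the length."""
    def __init__(self, d, c=3, num_heads=8):
        super().__init__()
        self.res = ResidualConvBlock(d)
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.down = nn.ModuleList(
            nn.Conv1d(d, d, kernel_size=2, stride=2) for _ in range(c)
        )

    def forward(self, x):                                    # x: (batch, n, d)
        h = self.res(x)
        h = h + self.attn(h, h, h)[0]                        # standard self-attention block
        for conv in self.down:                               # after c steps: length n / 2**c
            h = conv(h.transpose(1, 2)).transpose(1, 2)
        return h                                             # fed into the discretization bottleneck
```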

Decoder

The decoder decodes from latent space back into token space.

Decoder architecture

We first pass through c up-steps, each of which will double the sequence length of the encoded sequence:

  • The same residual block from before is used
  • Self attention layer is used
  • An up-conv step is performed:
    • A standard feed forward MLP is used that doubles the hidden dimension
    • A reshape operation is done to translate the extra hidden dimension into sequence length instead
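A small sketch of just this up-conv step (the residual and attention layers are the same as in the encoder sketch; the exact projection is my assumption):

```python
import torch.nn as nn

class UpConv(nn.Module):
    """Doubles the sequence length: a feed-forward layer doubles the hidden size,
    then a reshape folds the extra hidden dimension back into the time axis."""
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, 2 * d)

    def forward(self, h):                                    # h: (batch, m, d)
        h = self.proj(h)                                     # (batch, m, 2d)
        return h.reshape(h.size(0), -1, h.size(-1) // 2)     # (batch, 2m, d)
```

Applying this $c$ times restores the original sequence length $n = m \cdot 2^c$.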

Finally, the decompressed sequence of shape length x hidden_size is fed into a standard transformer decoder to decode the target sequence $y$.

  • Note that for the first 10k steps of training, instead of feeding the decompressed sequence, the authors pretrain this decoder transformer model by feeding the target sequence (i.e. training it like a standard transformer decoder). This is to warm up the head so that gradients to upstream components are reasonable.

Discretization Bottleneck

This is actually the main part of the paper - how do we discretize our token sequence into latent space so that we can predict the latent tokens in an autoregressive manner?

The bottleneck function takes a target token representation $y$ and maps it to a discrete token in $\{1, \ldots, K\}$:

  • First, take $y$ and encode it: $\mathrm{enc}(y) \in \mathbb{R}^D$, where $D$ is the dimension of the latent space.
  • Next, pass the encoding through a bottleneck to produce a discrete latent token $l$
  • Look up an embedding table using $l$ to produce an embedding $\mathrm{dec}(l)$ to pass to the decoder

Note that the bottleneck function is applied to each sequence position in parallel, independently.
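A minimal sketch of this interface, with names of my choosing; the `quantize` step is what the Gumbel-softmax and vector-quantization sections below instantiate:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """enc(y) -> discrete code l in {0..K-1} -> embedding dec(l), applied to each position."""
    def __init__(self, hidden_size, latent_dim, num_codes):
        super().__init__()
        self.enc = nn.Linear(hidden_size, latent_dim)          # enc(y), dimension D
        self.codebook = nn.Embedding(num_codes, hidden_size)   # lookup table for dec(l)

    def quantize(self, z):
        raise NotImplementedError  # Gumbel-softmax or vector quantization, see below

    def forward(self, y):                      # y: (batch, m, hidden_size)
        z = self.enc(y)                        # (batch, m, D)
        code = self.quantize(z)                # (batch, m) discrete latent tokens
        return code, self.codebook(code)       # embeddings passed to the decoder
```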

Gumbel Softmax

The gumbel-softmax trick is found in Jang 2016. It is a popular discretization technique.

First, we project the encoder output using a learned projection matrix $W$, such that we get logits $e$:

$$e = W \, \mathrm{enc}(y) \in \mathbb{R}^K$$

We can simply extract the discrete code for the decoder as:

$$l = \arg\max_i e_i$$

For evaluation and inference, we can simply use $l$ to look up an embedding table to get $\mathrm{dec}(l)$. However, we cannot use this approach for training, as the argmax and embedding lookup are non-differentiable.

For training, the Gumbel-softmax trick is used to make the whole thing differentiable.

  • First, we draw i.i.d. samples $g_1, \ldots, g_K$ from the standard Gumbel distribution: $g_i = -\log(-\log u_i)$ with $u_i \sim \mathrm{Uniform}(0, 1)$
  • Then we compute the weight vector $w$ using a softmax with temperature $\tau$, where for each latent dimension $i$: $w_i = \dfrac{\exp\big((e_i + g_i)/\tau\big)}{\sum_j \exp\big((e_j + g_j)/\tau\big)}$
  • Now we simply compute the input to the decoder as the weighted sum of the embedding table $E$: $\mathrm{dec} = \sum_i w_i E_i$

For low temperatures $\tau$, $w$ would be close to a 1-hot vector representing the arg-max index, which is what we use for evaluation and inference. Setting $\tau$ too high would cause divergence between training and inference, which is not ideal.

Why do we add the Gumbel noise $g$? Ideally, during training we want the encoder logits $e$ to define a categorical distribution from which we sample the discrete code $l$. This ensures that we have:

  • Exploration from the stochasticity
  • Discretization: the decoder input is forced to be (close to) a single embedding rather than a smooshed average of several embeddings.

The gumbel-max trick tells us that choosing $l = \arg\max_i (e_i + g_i)$ results in an index distributed exactly according to the categorical distribution given by the softmax of the logits $e$, which is what we want.

Since the gumbel-max is not differentiable, we simply relax it into a softmax operation, giving us the gumbel-softmax.
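A hedged sketch of this relaxation (the temperature value, shapes, and the plain matrix-product projection are my choices):

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_bottleneck(z, proj, embeddings, tau=0.5, training=True):
    """z: (batch, m, D) encoder outputs; proj: (D, K) learned projection;
    embeddings: (K, d) table whose rows are the decoder input embeddings."""
    logits = z @ proj                                    # e = W enc(y), shape (batch, m, K)
    if not training:
        code = logits.argmax(dim=-1)                     # hard discrete latent l
        return code, embeddings[code]                    # plain embedding lookup
    # Training: add Gumbel noise and relax the argmax into a softmax.
    u = torch.rand_like(logits).clamp_min(1e-9)
    g = -torch.log(-torch.log(u))                        # Gumbel(0, 1) samples
    w = F.softmax((logits + g) / tau, dim=-1)            # soft, nearly one-hot weights
    dec_input = w @ embeddings                           # weighted sum of the embedding table
    return w.argmax(dim=-1), dec_input
```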

Vector Quantization

The VQ-VAE is covered in VQ-VAE. Here is a brief recap.

  • The encoder output $\mathrm{enc}(y)$ is used to do a nearest-neighbour lookup on the embedding vectors to give us:
    • $l = \arg\min_k \lVert \mathrm{enc}(y) - e_k \rVert_2$ and $\mathrm{dec}(l) = e_l$
    • Where $e_1, \ldots, e_K$ are the codebook (embedding) vectors
  • The loss is the reconstruction loss + the commitment loss between encoder output and codebook vectors: $\ell = \ell_r + \beta \lVert \mathrm{enc}(y) - \mathrm{dec}(l) \rVert_2^2$

It turns out that the gradient update can be replaced by an exponential moving average over:

  • The codebook embeddings $e_j$; and
  • The count $c_j$ measuring the number of times $e_j$ is selected as a nearest neighbour

For a given mini-batch of target encodings $\mathrm{enc}(y_1), \ldots, \mathrm{enc}(y_B)$ with nearest-neighbour codes $l_1, \ldots, l_B$, we can update the exponential moving average for the counts like so:

$$c_j \leftarrow \lambda \, c_j + (1 - \lambda) \sum_{i=1}^{B} \mathbb{1}[\, l_i = j \,]$$

And the codebook embeddings are updated towards the average encoder output assigned to $e_j$:

$$e_j \leftarrow \lambda \, e_j + (1 - \lambda) \sum_{i=1}^{B} \frac{\mathbb{1}[\, l_i = j \,] \, \mathrm{enc}(y_i)}{c_j}$$

$\lambda$ is a decay parameter and is set to $0.999$ in the experiments.
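A hedged sketch of the nearest-neighbour lookup and these EMA updates (batching and numerical-stability details are simplified and are my own choices):

```python
import torch
import torch.nn.functional as F

def vq_lookup(z, codebook):
    """z: (N, D) encoder outputs; codebook: (K, D). Returns nearest codes and their embeddings."""
    codes = torch.cdist(z, codebook).argmin(dim=-1)   # l = argmin_k ||enc(y) - e_k||
    return codes, codebook[codes]

@torch.no_grad()
def ema_update(z, codes, codebook, counts, lam=0.999):
    """Exponential-moving-average update of the usage counts and codebook embeddings."""
    one_hot = F.one_hot(codes, codebook.size(0)).to(z.dtype)          # (N, K)
    counts.mul_(lam).add_((1 - lam) * one_hot.sum(dim=0))             # EMA of selection counts
    batch_sum = one_hot.t() @ z                                       # per-code sum of enc outputs, (K, D)
    codebook.mul_(lam).add_((1 - lam) * batch_sum / counts.clamp_min(1e-6).unsqueeze(1))
```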

Note. The paper also covers two other discretization techniques, namely improved semantic hashing and decomposed vector quantization (to counteract codebook collapse), but I don't cover them for now.

Results

The results show that the model with the discretization bottleneck is not as good as a standard transformer, but it beats all previous discrete methods and decodes faster than a normal transformer.

Ablation results show that:

  • Increasing the codebook size improves performance
  • Increasing the compression ratio $n/m$ (i.e. shortening the latent sequence more aggressively) speeds up inference but also clearly hurts performance

Hence it seems like there is no free lunch here.