Lee 2022 - RQ-VAE

Autoregressive Image Generation using Residual Quantization

Vector quantization (as in VQ-VAE) is used to represent an image as a sequence of discrete codes. After quantization, an autoregressive model (AR model, e.g. a transformer) is used to predict the codes.

This paper aims to reduce the sequence length of the discrete codes, since a shorter sequence is more computationally efficient for the AR model. However, reducing the sequence length (i.e. reducing the spatial resolution) runs into a rate-distortion trade-off. The dilemma is as follows:

  • Increasing the codebook size improves the resolution of the discretization, preserving reconstruction quality
  • But increasing the codebook size also increases the probability of codebook collapse, where only a subset of the codes is actually used

The main idea of RQ-VAE is to reduce the spatial resolution while approximating the feature map at each location more precisely. Instead of increasing the codebook size, we use the codebook to recursively quantize the feature map in a coarse-to-fine manner. That is, the feature representation at each location is the sum of the selected codebook vectors, one per level. For a codebook of size $K$ used over $D$ levels, we can represent up to $K^D$ distinct vectors while learning just $K$ codebook vectors.

  • The paper claims that due to this precise approximation, they can use just 8 x 8 latents to represent a 256 x 256 image. That is a reduction of 1024x!

Note: other papers usually use a distinct set of vectors for each codebook level, but in this paper they use a shared codebook for all levels. This is probably an implementation detail / hyperparameter to be tuned.

Method

There are two main stages:

  • Stage 1: Residual Quantized VAE.
  • Stage 2: RQ-Transformer.

Stage 1: Residual Quantized VAE

Let the codebook $\mathcal{C}$ be a finite set of pairs $(k, \mathbf{e}(k))$ of a code $k \in \{1, \dots, K\}$ and its code embedding $\mathbf{e}(k) \in \mathbb{R}^{n_z}$.

  • $K$ is the codebook size
  • $n_z$ is the code embedding dimension

Given a vector $\mathbf{z} \in \mathbb{R}^{n_z}$, let $\mathcal{Q}(\mathbf{z}; \mathcal{C})$ denote the discrete code of $\mathbf{z}$:

$$\mathcal{Q}(\mathbf{z}; \mathcal{C}) = \arg\min_{k} \|\mathbf{z} - \mathbf{e}(k)\|_2^2$$

The normal VQ-VAE flow is as follows (a small sketch of the quantization step follows the list):

  • Start with an input image $\mathbf{X} \in \mathbb{R}^{H_0 \times W_0 \times 3}$, where $(H_0, W_0)$ are the original image dimensions
  • The encoder $E$ extracts a feature map $\mathbf{Z} = E(\mathbf{X}) \in \mathbb{R}^{H \times W \times n_z}$, where $(H, W) = (H_0 / f, W_0 / f)$ are the original dimensions downsampled by a factor of $f$ to form the latent feature map
  • Applying the discretization independently to each position, we obtain a code map $\mathbf{M} \in [K]^{H \times W}$: $M_{hw} = \mathcal{Q}(\mathbf{Z}_{hw}; \mathcal{C})$
  • We also obtain the quantized feature map $\hat{\mathbf{Z}} \in \mathbb{R}^{H \times W \times n_z}$: $\hat{\mathbf{Z}}_{hw} = \mathbf{e}(M_{hw})$
  • Finally, we decode the quantized feature map to reconstruct the image: $\hat{\mathbf{X}} = G(\hat{\mathbf{Z}})$, where $G$ is the decoder
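
A minimal sketch of this nearest-neighbour quantization step, assuming a NumPy feature map and codebook (names like `vq` are illustrative, not from the paper's code):

```python
import numpy as np

def vq(Z, codebook):
    """Nearest-neighbour quantization of a feature map.

    Z:        (H, W, nz) encoder output
    codebook: (K, nz) code embeddings e(k)
    Returns the code map M (H, W) and the quantized map Z_hat (H, W, nz).
    """
    # Squared distance between every position and every code: (H, W, K)
    dists = ((Z[..., None, :] - codebook[None, None, :, :]) ** 2).sum(axis=-1)
    M = dists.argmin(axis=-1)   # M[h, w] = argmin_k ||Z_hw - e(k)||^2
    Z_hat = codebook[M]         # Z_hat[h, w] = e(M[h, w])
    return M, Z_hat
```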

For Residual Quantization (RQ), we instead represent $\mathbf{z}$ as an ordered tuple of $D$ codes:

$$\mathcal{RQ}(\mathbf{z}; \mathcal{C}, D) = (k_1, \dots, k_D) \in [K]^D$$

where $k_d$ is the discrete latent code of $\mathbf{z}$ at depth $d$. Specifically, we obtain $k_d$ by recursively quantizing the residual. Starting with the residual $\mathbf{r}_0 = \mathbf{z}$, we compute, for $d = 1, \dots, D$:

$$k_d = \mathcal{Q}(\mathbf{r}_{d-1}; \mathcal{C}), \qquad \mathbf{r}_d = \mathbf{r}_{d-1} - \mathbf{e}(k_d)$$

Finally, we represent $\hat{\mathbf{z}}$, the quantized embedding of $\mathbf{z}$, as the sum of the code embeddings across all depths:

$$\hat{\mathbf{z}} = \sum_{d=1}^{D} \mathbf{e}(k_d)$$
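
A minimal sketch of this residual quantization recursion for a single vector, reusing the shared codebook at every depth (illustrative names, not the paper's implementation):

```python
import numpy as np

def rq(z, codebook, D):
    """Residual-quantize a vector z (nz,) into D codes from a shared codebook."""
    residual = z.copy()
    codes = []
    z_hat = np.zeros_like(z)
    for _ in range(D):
        k = int(((residual[None, :] - codebook) ** 2).sum(axis=-1).argmin())
        codes.append(k)
        z_hat += codebook[k]      # coarse-to-fine approximation so far
        residual -= codebook[k]   # the next depth quantizes what is left
    return codes, z_hat           # (k_1, ..., k_D) and sum_d e(k_d)
```

Each depth only has to encode the error left over by the previous depths, which is why the embeddings selected at deeper levels tend to have smaller norms.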

Note: by using a shared codebook and summing up the embeddings at each depth, this resembles bloom hashing (see Weinberger 2009). The difference is that we recursively / sequentially encode the residuals, whereas bloom hashing simultaneously applies multiple independent hashes and sums up those embeddings at the hashed positions.

However, as the experiments indicate, the embedding norms get smaller as we go deeper into the codebook levels, since it learns a coarse-to-fine encoding. So it seems to make better sense to have separate codebook vectors for each level.

The RQ-VAE flow is thus the same as the VQ-VAE flow above, except for the discretization step (a sketch of the full roundtrip follows the list):

  • Start with an input image $\mathbf{X}$
  • The encoder extracts $\mathbf{Z} = E(\mathbf{X}) \in \mathbb{R}^{H \times W \times n_z}$
  • Residual-quantize each position to get the code map $\mathbf{M} \in [K]^{H \times W \times D}$ (note the additional dimension $D$, since we now have $D$ codes per position): $\mathbf{M}_{hw} = \mathcal{RQ}(\mathbf{Z}_{hw}; \mathcal{C}, D)$
  • Finally, decode: $\hat{\mathbf{X}} = G(\hat{\mathbf{Z}})$, where $\hat{\mathbf{Z}}_{hw} = \sum_{d=1}^{D} \mathbf{e}(M_{hwd})$
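
Putting the pieces together, a sketch of the RQ-VAE roundtrip, assuming an encoder `E`, a decoder `G`, and the illustrative `rq` helper above:

```python
import numpy as np

def rq_vae_roundtrip(X, E, G, codebook, D):
    Z = E(X)                               # (H, W, nz) feature map
    H, W, _ = Z.shape
    M = np.zeros((H, W, D), dtype=int)     # code map: D codes per position
    Z_hat = np.zeros_like(Z)
    for h in range(H):
        for w in range(W):
            M[h, w], Z_hat[h, w] = rq(Z[h, w], codebook, D)
    X_hat = G(Z_hat)                       # reconstruction from the summed embeddings
    return X_hat, M
```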

RQ-VAE Training

The loss is:

$$\mathcal{L} = \mathcal{L}_{\text{recon}} + \beta \, \mathcal{L}_{\text{commit}}, \qquad \mathcal{L}_{\text{recon}} = \|\mathbf{X} - \hat{\mathbf{X}}\|_2^2, \qquad \mathcal{L}_{\text{commit}} = \sum_{d=1}^{D} \|\mathbf{Z} - \text{sg}[\hat{\mathbf{Z}}^{(d)}]\|_2^2$$

where $\text{sg}[\cdot]$ is the stop-gradient operator and $\hat{\mathbf{Z}}^{(d)}$ is the quantized feature map using only the codes up to depth $d$.

The first loss is the reconstruction error of the RQ-VAE roundtrip. The second is the commitment loss, which pushes the encoder to reduce the quantization errors (see the sketch after this list).

  • Note that the commitment loss sums up the quantization error at each level of the codebook, not just the final quantization error at the end
  • Also note that the authors did not include the codebook loss from VQ-VAE that updates the codebook; instead, they update the codebook with an Exponential Moving Average (EMA) of the encoder outputs assigned to each code
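
A hedged sketch of the training loss in PyTorch-style code; `Z_hat_per_depth` and `beta` are assumed names, the stop-gradient is implemented with `.detach()`, and the EMA codebook update is not shown:

```python
import torch

def rq_vae_loss(X, X_hat, Z, Z_hat_per_depth, beta=0.25):
    """
    X, X_hat:        original and reconstructed images
    Z:               encoder feature map
    Z_hat_per_depth: list of D partial quantizations, element d = sum_{i<=d} e(k_i)
    beta:            commitment weight (illustrative value)
    """
    recon = ((X - X_hat) ** 2).mean()
    # The commitment loss sums the quantization error at every depth, with a
    # stop-gradient on the quantized maps so that only the encoder is pushed.
    commit = sum(((Z - Z_hat_d.detach()) ** 2).mean() for Z_hat_d in Z_hat_per_depth)
    return recon + beta * commit
```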

Stage 2: RQ-Transformer

Given that we have learned a code map $\mathbf{M} \in [K]^{H \times W \times D}$, the goal is now to autoregressively predict the code map (laid out in raster-scan order). Specifically, let us lay out $\mathbf{M}$ as a 2D array of codes $\mathbf{S} \in [K]^{T \times D}$, where $T = HW$. Thus each row of $\mathbf{S}$ (call it $\mathbf{S}_t$) contains $D$ codes:

$$\mathbf{S}_t = (S_{t1}, \dots, S_{tD}), \qquad t = 1, \dots, T$$

Note that $\mathbf{S}$ is laid out in raster-scan order, meaning that we read the first row of latent positions left to right, then move to the next row, and so on.

Thus the goal of stage 2 is to learn an autoregressive model of the joint probability $p(\mathbf{S})$, where each code is predicted given all prior codes (i.e. causal masking). Note that we predict all $D$ codes for each position before moving on to the next:

$$p(\mathbf{S}) = \prod_{t=1}^{T} \prod_{d=1}^{D} p\left(S_{td} \mid S_{<t,\,\cdot}\,,\; S_{t,<d}\right)$$

The natural approach is to just fit an autoregressive transformer decoder to this unrolled sequence (of length $TD$, where $T = HW$). This incurs an attention complexity of $O(T^2 D^2)$. However, the paper argues that this is inefficient as it does not exploit the structure of the tokens, and proposes a customized RQ-Transformer for this situation.
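
As a quick illustration (plain NumPy, assuming the code map `M` from stage 1), the raster-scan layout is just a row-major reshape:

```python
import numpy as np

def raster_scan(M):
    """Flatten an (H, W, D) code map into the (T, D) array S, with T = H * W."""
    H, W, D = M.shape
    return M.reshape(H * W, D)   # row-major reshape == raster-scan order

# e.g. an 8 x 8 x 4 code map becomes a 64 x 4 array of codes
```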

The idea is that there are two orthogonal axes, so we can decouple them and use a specialized transformer to handle each one:

  • The first two dimensions, $h$ and $w$ (flattened into the index $t$), encode the spatial position
  • The last dimension, $d$, encodes the depth
(Figure: RQ-Transformer Architecture)

Spatial Transformer

The spatial transformer encodes the spatial position dimension and marginalizes away the depth dimension. It is concerned with the "big picture". Specifically, the input at position $t$ to the spatial transformer is:

$$\mathbf{u}_t = \text{PE}_T(t) + \sum_{d=1}^{D} \mathbf{e}\left(S_{t-1,\,d}\right) \quad \text{for } t > 1$$

Note that:

  • $\text{PE}_T(t)$ is the positional encoding for spatial position $t$
  • We re-use the codebook embeddings $\mathbf{e}(\cdot)$ from the RQ-VAE to represent inputs to the spatial transformer
  • The depth dimension is summed (marginalized) out, so that the spatial transformer only needs to encode $T$ positions
  • $\mathbf{u}_1$ is a special case for the start of sequence and has its own learnable embedding

The sequence of inputs $(\mathbf{u}_1, \dots, \mathbf{u}_t)$ is passed into a causally masked transformer to produce spatial context vectors like so:

$$\mathbf{h}_t = \text{Transformer}_{\text{spatial}}(\mathbf{u}_1, \dots, \mathbf{u}_t)$$
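
A sketch of how the spatial transformer inputs could be assembled (illustrative names: `pos_emb_T` for the spatial positional embeddings and `sos` for the learnable start-of-sequence embedding):

```python
import numpy as np

def spatial_inputs(S, codebook, pos_emb_T, sos):
    """
    S:         (T, D) codes in raster-scan order
    codebook:  (K, nz) shared code embeddings e(k)
    pos_emb_T: (T, nz) spatial positional embeddings PE_T
    sos:       (nz,)   learnable start-of-sequence embedding
    Returns the (T, nz) inputs u_1, ..., u_T.
    """
    u = pos_emb_T.copy()
    u[0] += sos                            # u_1 is the start-of-sequence special case
    # u_t for t > 1: positional embedding plus the D code embeddings of position t-1
    u[1:] += codebook[S[:-1]].sum(axis=1)  # (T-1, D, nz) -> (T-1, nz)
    return u
```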

Depth Transformer

Given the context vector $\mathbf{h}_t$ provided by the spatial transformer, the depth transformer predicts the very short sequence of $D$ codes $(S_{t1}, \dots, S_{tD})$ for position $t$. Thus, the depth transformer can be a smaller stack of transformer layers.

  • Note: in the experiments, the depth transformer generally has far fewer layers than the spatial transformer

Specifically, at spatial position $t$ and depth $d$, the input to the depth transformer is the sum of the code embeddings up to depth $d-1$ (plus a depth positional encoding):

$$\mathbf{v}_{td} = \text{PE}_D(d) + \sum_{d'=1}^{d-1} \mathbf{e}\left(S_{t,\,d'}\right) \quad \text{for } d > 1, \qquad \mathbf{v}_{t1} = \text{PE}_D(1) + \mathbf{h}_t$$

The short sequence of inputs $(\mathbf{v}_{t1}, \dots, \mathbf{v}_{td})$ is passed into a causally masked transformer + classifier head to predict the code at each depth. The same depth transformer is reused across all spatial positions $t$.
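
A companion sketch for the depth transformer inputs at a single spatial position $t$ (again with illustrative names; `pos_emb_D` is the depth positional embedding):

```python
import numpy as np

def depth_inputs(S_t, h_t, codebook, pos_emb_D):
    """
    S_t:       (D,)    codes at spatial position t
    h_t:       (nz,)   spatial context vector for position t
    codebook:  (K, nz) shared code embeddings e(k)
    pos_emb_D: (D, nz) depth positional embeddings PE_D
    Returns the (D, nz) inputs v_{t,1}, ..., v_{t,D}.
    """
    v = pos_emb_D.copy()
    v[0] += h_t                              # v_{t,1} starts from the spatial context
    # v_{t,d} for d > 1: depth positional embedding + code embeddings up to depth d-1
    cum = np.cumsum(codebook[S_t], axis=0)   # cum[i] = sum of e(S_t[0..i])
    v[1:] += cum[:-1]
    return v
```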

The autoregressive loss is simply the negative log-likelihood of the correct latent code sequence:

$$\mathcal{L}_{\text{AR}} = \mathbb{E}\left[-\sum_{t=1}^{T} \sum_{d=1}^{D} \log p\left(S_{td} \mid S_{<t,\,\cdot}\,,\; S_{t,<d}\right)\right]$$

Computation Savings

As observed above, fitting a naive transformer to the full sequence of length $TD$ will incur an attention complexity of $O(T^2 D^2)$. Compare this to the RQ-Transformer:

  • Spatial transformer: $O(T^2)$, since we marginalize away the depth dimension
  • Depth transformer: $O(T D^2)$, since we re-use the depth transformer $T$ times on sequences of length $D$
  • RQ-Transformer total: $O(T^2 + T D^2)$

With $T = 512$ and $D = 4$, we have around 15x lower computational complexity (checked in the snippet after this list):

  • Naive approach: 4,194,304
  • RQ transformer: 270,336
  • Ratio: ~15x!
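
A quick check of the numbers above (the values of $T$ and $D$ are inferred from the figures quoted in the list):

```python
T, D = 512, 4
naive = (T * D) ** 2           # full attention over the unrolled length-TD sequence
rq = T ** 2 + T * D ** 2       # spatial transformer + depth transformer reused T times
print(naive, rq, naive / rq)   # 4194304 270336 ~15.5
```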

Other Tricks

There are two other tricks used in this paper, presumably to improve performance. The aim is to reduce the impact of exposure bias, which is the divergence between the samples seen during training and those produced during inference. During training, due to teacher forcing, the model always conditions on correct tokens at each position. During inference, since we generate each position autoregressively, errors can compound and lead the model to predict from a sequence that deviates far from what it has seen during training.

The main idea is to introduce some uncertainty into both the labels and the inputs of the training process. Let us define a categorical distribution $\mathcal{Q}_{\tau}(k \mid \mathbf{z}; \mathcal{C})$ over codes $k$, conditioned on the encoder embedding $\mathbf{z}$, with probability:

$$\mathcal{Q}_{\tau}(k \mid \mathbf{z}; \mathcal{C}) = \frac{\exp\left(-\|\mathbf{z} - \mathbf{e}(k)\|_2^2 / \tau\right)}{\sum_{k'=1}^{K} \exp\left(-\|\mathbf{z} - \mathbf{e}(k')\|_2^2 / \tau\right)}$$

Thus the probability of a discrete code $k$ increases exponentially as its embedding $\mathbf{e}(k)$ gets closer to the encoder embedding $\mathbf{z}$. Note that as $\tau$ approaches $0$, $\mathcal{Q}_{\tau}$ gets sharper and converges to the one-hot (argmin) distribution.

This soft approximation of the argmin is used for soft labelling of the objective when training the RQ-Transformer. Specifically, note that at position $t$ and depth $d$, we quantize by doing a nearest-neighbour search for the residual vector $\mathbf{r}_{t,\,d-1}$. Hence the hard target for RQ-Transformer training is the one-hot label of $\mathcal{Q}(\mathbf{r}_{t,\,d-1}; \mathcal{C})$. Instead, we replace it with the soft distribution $\mathcal{Q}_{\tau}(\cdot \mid \mathbf{r}_{t,\,d-1}; \mathcal{C})$.

We also do stochastic sampling of the latent codes when generating training samples for the RQ-Transformer. Instead of deterministically selecting the nearest code when encoding the raw images to latent codes, we sample each code from the categorical distribution $\mathcal{Q}_{\tau}(\cdot \mid \mathbf{r}_{d-1}; \mathcal{C})$.
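
A small sketch of the soft code distribution and how the two tricks use it (illustrative NumPy; `tau` is the temperature):

```python
import numpy as np

def code_distribution(r, codebook, tau):
    """Categorical distribution Q_tau(k | r) over codes for a residual r (nz,)."""
    logits = -((r[None, :] - codebook) ** 2).sum(axis=-1) / tau  # -||r - e(k)||^2 / tau
    logits -= logits.max()                                       # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Soft labelling: use p as the target distribution in the cross-entropy
# instead of the one-hot argmin label.
# Stochastic sampling: draw the code instead of taking the argmin, e.g.
#   k = np.random.choice(len(p), p=code_distribution(r, codebook, tau))
```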

The ablation studies show that using soft labelling + stochastic sampling significantly improves performance.

Results and Hyperparameters

General hyperparameters:

  • Codebook size: $K$ = 16,384 (shared across all depths)
  • Codebook levels (depth): $D$ = 4
  • Latent dimension size:
  • Temperature:

Generally, RQ-VAE matches or outperforms other autoregressive image generation models like VQ-GAN. It is also considerably faster at generation and uses less memory.