Doersch 2016 - Tutorial on VAEs
Tutorial on Variational Autoencoders
Tutorial on VAEs from a computer vision perspective.
Introduction
Generative modelling deals with models of distributions $P(X)$, defined over data points $X$. For example, $X$ may represent the pixels of an image. $P(X)$ may indicate that a set of pixels that looks like a real image has high probability, whereas images looking like random noise get low probability.
For generative modelling to be useful, we don't just want an unconditional distribution. Instead, we want to generate images from a conditional distribution, e.g. images that look like another image or that follow a certain caption.
Latent Variable Models
Generative modelling usually starts with a latent variable. For example, in digit image generation, we may first want to choose a particular digit (say 5) to generate. The image model then knows to generate pixels corresponding to this digit (as opposed to deciding on the fly). $z$ is typically chosen to denote the latent variable (which can be multi-dimensional).
Formally, we have a vector of latent variables $z$ which we can easily sample from according to some probability density function $P(z)$ defined over a space $\mathcal{Z}$. In VAEs, this distribution is typically a multivariate standard Gaussian distribution.
We now want to make sure that we have some function that maps from the latent variable space $\mathcal{Z}$ to the output space $\mathcal{X}$. Say we have a family of deterministic functions $f(z; \theta)$ (typically some neural network) which is parametrized by a vector $\theta \in \Theta$, where $f: \mathcal{Z} \times \Theta \to \mathcal{X}$. In this setup, randomness is injected by the random variable $z$, such that $f(z; \theta)$ is a random variable in the space $\mathcal{X}$.
To actually generate images that look like $X$, we wish to optimize $\theta$ such that we can sample $z$ from $P(z)$ and, with high probability, $f(z; \theta)$ will look like the $X$'s in our data.
How do we formalize this notion? Naively, $f(z; \theta)$ just gives a random point prediction. We need some probability distribution $P(X|z; \theta)$ which tells us how likely a given training sample $X$ is under the generative model. By the law of total probability, we can then write down the maximum likelihood objective:

$$P(X) = \int P(X|z; \theta)\, P(z)\, dz$$
The choice of $P(X|z; \theta)$ for VAEs is often Gaussian. Specifically, the most standard choice is:

$$P(X|z; \theta) = \mathcal{N}\big(X \mid f(z; \theta),\, \sigma^2 I\big)$$
This means that the function $f(z; \theta)$ gives us the Gaussian mean, and the covariance matrix is fixed at the diagonal $\sigma^2 I$ (where $\sigma$ is a hyperparameter). Notice that this Gaussian expression allows us to express the idea that we just need to produce samples that look like $X$, but $f(z; \theta)$ does not have to exactly match some $X$.
This smoothness is critical for generative modelling. If we specify $P(X|z)$ to be the Dirac delta function (i.e. all the probability mass is on the specific output produced by $f(z; \theta)$), it would be impossible to learn from data, as the density $P(X|z)$ is zero almost everywhere.
Also note that while the Gaussian is the most common choice, it does not have to be so. We just need $P(X|z)$ to be computable and continuous in $\theta$. For example, if $X$ is binary, then $P(X|z)$ can be a Bernoulli distribution parametrized by $f(z; \theta)$ (although it's unfathomable why one would do this).
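To make this requirement concrete, here is a minimal PyTorch sketch of both log-densities. The function names are my own, and `f_z` stands in for the decoder output $f(z; \theta)$:

```python
import torch

def log_p_x_given_z_gaussian(x, f_z, sigma=0.1):
    """log N(x | f(z; θ), σ² I), dropping the additive constant."""
    return -torch.sum((x - f_z) ** 2, dim=-1) / (2 * sigma ** 2)

def log_p_x_given_z_bernoulli(x, f_z, eps=1e-7):
    """log Bernoulli(x | f(z; θ)) for binary x, with f_z in (0, 1)."""
    f_z = f_z.clamp(eps, 1 - eps)  # avoid log(0)
    return torch.sum(x * torch.log(f_z) + (1 - x) * torch.log(1 - f_z), dim=-1)
```

Both are computable and differentiable in the decoder output, which is all the framework demands.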
Variational Autoencoders
In order to maximize $P(X)$ above, there are two big problems to solve:
- How do we define the latent variables $z$, and what information do they encode?
- How do we deal with the integral over $z$?
Problem 1: How to represent Latents?
For the first problem, VAEs basically prescribe minimal structure on the latents $z$, and say that the samples of $z$ can be drawn from a simple distribution, say $\mathcal{N}(0, I)$. Then, the onus falls on the model $f(z; \theta)$ to map from a simple Gaussian distribution to the complex distribution which describes our data. We know empirically that this is not a problem for a sufficiently large neural network.
Problem 2: How to deal with integral over P(z)?
For the second problem, the naive approach is to deal with the integral via sampling. That is, we sample a large number of latents $\{z_1, \ldots, z_n\}$ from $P(z)$ and compute $P(X) \approx \frac{1}{n} \sum_i P(X|z_i)$. The problem with this approach is that in high-dimensional spaces, we need a very large $n$ to get an accurate estimate of $P(X)$. This is because for most instances of $z$, $P(X|z)$ will be very close to $0$.
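A minimal numpy sketch of this brute-force estimator; the decoder `f` and the data point `x` are hypothetical stand-ins, and `sigma` is the decoder hyperparameter from above:

```python
import numpy as np

def naive_log_px(x, f, z_dim, n_samples=100_000, sigma=0.1):
    """Estimate log P(x) = log E_{z~P(z)}[P(x|z)] by brute-force sampling.

    For most z, P(x|z) is vanishingly small, so in high dimensions
    n_samples must be enormous before this estimate becomes accurate.
    """
    zs = np.random.randn(n_samples, z_dim)  # z_i ~ P(z) = N(0, I)
    # log N(x | f(z), σ² I), dropping the additive constant
    log_ps = np.array([-np.sum((x - f(z)) ** 2) / (2 * sigma ** 2) for z in zs])
    m = log_ps.max()                        # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(log_ps - m)))
```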
The key idea behind VAEs is to speed up sampling by attempting to sample values of $z$ that are likely to have produced $X$, and compute $P(X)$ just from those. To do this, we need a new function $Q(z|X)$ which, when given an image $X$, produces a distribution over the $z$ values likely to have produced $X$. If the space of $z$ values that are likely under $Q$ is smaller than the space of $z$ values that are likely under the prior $P(z)$, we will be able to estimate $E_{z \sim Q}\, P(X|z)$ much more cheaply.
However, by introducing a new arbitrary distribution $Q(z)$, we are no longer sampling $z$ under $P(z)$, so we cannot directly obtain $P(X)$ from it. Thus we need some way to relate $E_{z \sim Q}\, P(X|z)$ and $P(X)$. This relationship is one of the cornerstones of variational Bayesian methods.
Let us start by defining the KL divergence between the posterior $P(z|X)$ and some arbitrary distribution $Q(z)$ (which may or may not depend on $X$):

$$D\big[Q(z)\,\|\,P(z|X)\big] = E_{z \sim Q}\big[\log Q(z) - \log P(z|X)\big]$$
Recall that KL divergence is asymmetric, and measures how different two probability distributions are. In this case, the expectation is taken over the distribution of $z$ values under $Q$.
Now apply Bayes' rule, $\log P(z|X) = \log P(X|z) + \log P(z) - \log P(X)$, and do some rearranging:

$$D\big[Q(z)\,\|\,P(z|X)\big] = E_{z \sim Q}\big[\log Q(z) - \log P(X|z) - \log P(z)\big] + \log P(X)$$

$$\log P(X) - D\big[Q(z)\,\|\,P(z|X)\big] = E_{z \sim Q}\big[\log P(X|z)\big] - D\big[Q(z)\,\|\,P(z)\big]$$
Some comments on the above:
- $\log P(X)$ comes out of the expectation as a constant because it does not depend on $z$.
- We can group $\log Q(z)$ and $\log P(z)$ together into their own KL divergence term $D[Q(z)\,\|\,P(z)]$.
So far, we have not made any assumption on the arbitrary distribution $Q(z)$. In the context of trying to maximize $P(X)$, it makes sense to construct a $Q$ which does depend on $X$. So we make that dependency on $X$ explicit. Let's call this the ELBO equation:

$$\log P(X) - D\big[Q(z|X)\,\|\,P(z|X)\big] = E_{z \sim Q(z|X)}\big[\log P(X|z)\big] - D\big[Q(z|X)\,\|\,P(z)\big]$$
This equation is core to the VAE, so we should understand it deeply.
- The left hand side represents what we want to optimize:
- $\log P(X)$ was the original maximum likelihood objective: we want our model to produce images that look like $X$
- $D[Q(z|X)\,\|\,P(z|X)]$ is the error, or deviation, of our tractable, estimated distribution $Q(z|X)$ from the true, intractable oracle distribution $P(z|X)$. This term is always greater than or equal to $0$, and is $0$ if and only if $Q(z|X) = P(z|X)$.
- The right hand side is called the Evidence Lower Bound (ELBO). In Bayesian statistics, the marginal $P(X)$ is called the evidence, because our data is evidence for how good our model is. The RHS is a lower bound for our evidence precisely because the KL divergence term $D[Q(z|X)\,\|\,P(z|X)] \ge 0$, which implies $\log P(X) \ge E_{z \sim Q}[\log P(X|z)] - D[Q(z|X)\,\|\,P(z)]$.
- We cannot directly optimize $\log P(X)$, but we can do the next best thing, which is to optimize the tractable RHS, given an appropriate choice of $Q$.
- As we increase the capacity of $Q$, the "error" term should become smaller and smaller, so the RHS will more accurately estimate the evidence (and lead to better optimization).
Notice how the RHS now resembles an auto-encoder (see the sketch after this list):
- $Q(z|X)$ "encodes" $X$ into a latent $z$
- $P(X|z)$ "decodes" $z$ back to reconstruct $X$
Optimizing the Objective
Now we need to perform SGD on the RHS. First we need to specify $Q(z|X)$. The usual choice is:

$$Q(z|X) = \mathcal{N}\big(z \mid \mu(X; \vartheta),\, \Sigma(X; \vartheta)\big)$$
where $\mu$ and $\Sigma$ are both neural networks with parameters $\vartheta$ that map a given $X$ deterministically into a mean vector and a variance vector respectively. We only need a vector for the variance because $\Sigma$ is typically constrained to be a diagonal matrix.
Because we chose both to be multivariate Gaussians, the KL divergence between them may now be computed in closed form ($k$ is the dimensionality of the distribution):

$$D\big[\mathcal{N}(\mu_0, \Sigma_0)\,\|\,\mathcal{N}(\mu_1, \Sigma_1)\big] = \frac{1}{2}\left(\mathrm{tr}\big(\Sigma_1^{-1}\Sigma_0\big) + (\mu_1 - \mu_0)^{\top}\Sigma_1^{-1}(\mu_1 - \mu_0) - k + \log\frac{\det\Sigma_1}{\det\Sigma_0}\right)$$

$$D\big[\mathcal{N}(\mu(X), \Sigma(X))\,\|\,\mathcal{N}(0, I)\big] = \frac{1}{2}\left(\mathrm{tr}\big(\Sigma(X)\big) + \mu(X)^{\top}\mu(X) - k - \log\det\big(\Sigma(X)\big)\right)$$
The first equation above gives the general case, but since our $P(z) = \mathcal{N}(0, I)$, it reduces to the second. The functions $\mu(X)$ and $\Sigma(X)$ express the fact that the parameters of the normal distribution $Q(z|X)$ are determined by $X$.
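In code, the reduced form is a one-liner. This sketch assumes the diagonal $\Sigma(X)$ is parametrized by its log, as in the hypothetical encoder above:

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """D[N(μ(X), Σ(X)) || N(0, I)] for diagonal Σ(X) given as log-variances:
    0.5 * Σ_j (Σ(X)_jj + μ_j² − 1 − log Σ(X)_jj), summed over the k dims."""
    return 0.5 * torch.sum(logvar.exp() + mu ** 2 - 1.0 - logvar, dim=-1)
```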
Reparametrization Trick
So we have the second term on the RHS, $D[Q(z|X)\,\|\,P(z)]$, expressed as a closed-form function of $\mu(X)$ and $\Sigma(X)$, which can be optimized via SGD. What about the first term $E_{z \sim Q}[\log P(X|z)]$?
This term is trickier, because it involves two steps. Suppose we approximate the expectation with a single sample per SGD step. Then we have to:
- Sample a $z \sim Q(z|X)$
- Compute $\log P(X|z)$ using the decoder
The first sampling step is not an operation that can be backpropagated through, so we cannot optimize this equation as-is. This is where the reparametrization trick comes into play.
The reparametrization trick essentially moves the stochasticity of the sampling step out of the model's forward pass and into the data layer. Instead of sampling $z$ directly, we first sample an intermediate $\epsilon \sim \mathcal{N}(0, I)$ and treat it as data. Then, we compute $z = \mu(X) + \Sigma^{1/2}(X) \cdot \epsilon$ deterministically. This achieves the same effect as the direct sampling approach, but the difference is that the parameters of $Q$ have entered the equation in a deterministic way that can be backpropagated.
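A sketch of the trick, again assuming a diagonal $\Sigma(X)$ stored as log-variances:

```python
import torch

def reparametrize(mu, logvar):
    """z = μ(X) + Σ(X)^{1/2} · ε, with ε ~ N(0, I) treated as an input.

    ε carries all the stochasticity, so gradients flow to μ(X) and Σ(X)
    through plain deterministic arithmetic.
    """
    eps = torch.randn_like(mu)  # sampled outside the parameter graph
    return mu + torch.exp(0.5 * logvar) * eps
```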
With the trick, we now have fully specified the optimization objective. The equation we take the gradient of is:

$$E_{X \sim D}\Big[E_{\epsilon \sim \mathcal{N}(0, I)}\big[\log P\big(X \mid z = \mu(X) + \Sigma^{1/2}(X)\cdot\epsilon\big)\big] - D\big[Q(z|X)\,\|\,P(z)\big]\Big]$$

where $D$ is our dataset.
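Putting the pieces together, one sample of this objective (negated, since optimizers minimize) might look like the following sketch, reusing the hypothetical modules and helpers above:

```python
def vae_loss(x, encoder, decoder, sigma=1.0):
    """Negative ELBO for one batch (the quantity to minimize with SGD)."""
    mu, logvar = encoder(x)
    z = reparametrize(mu, logvar)           # z = μ(X) + Σ^{1/2}(X)·ε
    x_hat = decoder(z)
    # −log P(X|z) for the Gaussian decoder, up to an additive constant
    recon = torch.sum((x - x_hat) ** 2, dim=-1) / (2 * sigma ** 2)
    kl = kl_to_standard_normal(mu, logvar)  # D[Q(z|X) || P(z)]
    return (recon + kl).mean()

# One SGD step (optimizer setup elided):
#   loss = vae_loss(x_batch, encoder, decoder)
#   loss.backward(); optimizer.step(); optimizer.zero_grad()
```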
Test Time Inference
At test time, when we want to generate new samples, we simply sample a new $z \sim \mathcal{N}(0, I)$, then feed it into the decoder $f(z; \theta)$ to get a new $X$.
Suppose we want to evaluate the probability of this new test example, i.e. $P(X)$. This is not tractable for the reasons we discussed earlier, as it involves an integral over the distribution of $z$. However, we can make use of the ELBO concept and use the RHS of the ELBO equation as an approximation to $\log P(X)$. There is still an expectation over $z$, but because sampling $z \sim Q(z|X)$ gives an expectation that converges much faster than sampling $z \sim P(z)$, we can get a good sense of the probability by sampling $z$ a few times.
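A sketch of this estimate, reusing the hypothetical helpers from earlier. Note that it lower-bounds $\log P(X)$ rather than computing it exactly:

```python
@torch.no_grad()
def estimate_log_px(x, encoder, decoder, n_samples=16, sigma=1.0):
    """Approximate log P(x) by the ELBO, averaging a few samples z ~ Q(z|x)."""
    mu, logvar = encoder(x)
    kl = kl_to_standard_normal(mu, logvar)      # closed form, no sampling needed
    recon = []
    for _ in range(n_samples):
        z = reparametrize(mu, logvar)           # z ~ Q(z|x)
        x_hat = decoder(z)
        recon.append(-torch.sum((x - x_hat) ** 2, dim=-1) / (2 * sigma ** 2))
    return torch.stack(recon).mean(dim=0) - kl  # E_Q[log P(x|z)] − KL
```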
Extra Info
This section tackles 3 questions to help our understanding:
- How much error is introduced by the additional term $D[Q(z|X)\,\|\,P(z|X)]$?
- How is the VAE framework linked to Minimum Description Length?
- Do VAEs have regularization parameters analogous to sparsity penalties?
Q1. Error from Lower Bound
Given that we are optimizing for the RHS and not directly for $\log P(X)$, how much error does the additional term $D[Q(z|X)\,\|\,P(z|X)]$ introduce?
Since we assumed that $Q(z|X)$ takes the form of a high-dimensional Gaussian, $P(z|X)$ must also take the form of a Gaussian for the KL divergence term to go to $0$. However, this is not necessarily the case: we make no assumption on the distribution $P(z|X)$. The hope is that if $f$ is sufficiently high-capacity, then there exists some $f$ which both (i) maximizes $\log P(X)$ and (ii) results in a Gaussian-like $P(z|X)$. If such a function exists, then our objective will find it, thanks to the $D[Q(z|X)\,\|\,P(z|X)]$ term.
Q2. Minimum Description Length interpretation
(I don't understand this part.)
Another way to look at the RHS of the ELBO equation is in terms of information theory. $-\log P(X)$ may be seen as the total number of bits required to construct a given $X$ under our model using an ideal encoding. The RHS views this as a two-step process to construct $X$.
- We first use some bits to construct $z$. $D[Q(z|X)\,\|\,P(z)]$ may be viewed as the expected information (in bits) required to convert an uninformative sample from $P(z)$ into a sample from $Q(z|X)$.
- In the second step, $-\log P(X|z)$ measures the amount of information required to reconstruct $X$ from $z$ under an ideal encoding.
Hence the total number of bits, $-\log P(X)$, is the sum of these two steps, minus a penalty we pay for $Q$ being a sub-optimal encoding, $D[Q(z|X)\,\|\,P(z|X)]$.
Q3. Regularization Effect
It is interesting to view the $D[Q(z|X)\,\|\,P(z)]$ term as regularization, since it encourages our distribution $Q(z|X)$ to be similar to a simple distribution $P(z)$.
In a usual sparse autoencoder, we have a parameter $\lambda$ in an objective function of the form:

$$\|\phi(\psi(X)) - X\|^2 + \lambda\,\|\psi(X)\|_0$$
That is, for encoder $\psi$ and decoder $\phi$, we encourage the encoding $\psi(X)$ to be sparse. Similarly, the KL divergence term encourages our encoder to be simple.
Where does a similar parameter like $\lambda$ enter into the ELBO equation? Recall that we chose $P(X|z) = \mathcal{N}(X \mid f(z; \theta), \sigma^2 I)$. It turns out that $\sigma$ plays a similar role to $\lambda$, as we shall see.
Using the PDF of the normal distribution, we have that $\log P(X|z) = -\frac{1}{2\sigma^2}\|X - f(z; \theta)\|^2 + C$, where $C$ is a constant that does not depend on $\theta$ and can be ignored during optimization. In the ELBO equation, $\sigma$ appears in the first term of the RHS but not the second term. Hence, by varying $\sigma$, we can control the relative weighting between the two terms: a lower $\sigma$ implies less regularization, and a larger $\sigma$ implies more regularization.