Doersch 2016 - Tutorial on VAEs
Tutorial on Variational Autoencoders
Tutorial on VAEs from a computer vision perspective.
Introduction
Generative modelling deals with models of distributions $P(X)$, defined over data points $X \in \mathcal{X}$. For example, $X$ may represent the pixels of an image. $P(X)$ may assign high probability to sets of pixels that look like real images, whereas images that look like random noise get low probability.
For generative modelling to be useful, we don't just want an unconditional distribution. Instead, we want to generate images from a conditional distribution, e.g. images that look like another image or that match a given caption.
Latent Variable Models
Generative modelling usually starts with a latent variable. For example, in digit image generation, we may first want to choose a particular digit (say 5) to generate. The image model then knows to generate pixels corresponding to that digit (as opposed to deciding on the fly). The symbol $z$ is typically used to denote the latent variable (which can be multi-dimensional).
Formally, we have a vector of latent variables $z$ in a space $\mathcal{Z}$, which we can easily sample according to some probability density function $P(z)$ defined over $\mathcal{Z}$. In VAEs, this distribution is typically a multivariate standard Gaussian, $P(z) = \mathcal{N}(0, I)$.
We now want some function that maps from the latent space $\mathcal{Z}$ to the output space $\mathcal{X}$. Say we have a family of deterministic functions $f(z; \theta)$ (typically some neural network), parametrized by a vector $\theta \in \Theta$, where $f: \mathcal{Z} \times \Theta \to \mathcal{X}$. In this setup, randomness is injected by the random variable $z$, such that $f(z; \theta)$ is a random variable in the space $\mathcal{X}$.
To actually generate images that look like $X$, we wish to optimize $\theta$ such that we can sample $z$ from $P(z)$ and, with high probability, $f(z; \theta)$ will look like the $X$'s in our data.
How do we formalize this notion? Naively, $f(z; \theta)$ just gives a random point prediction. We need some probability distribution $P(X|z; \theta)$ which tells us how likely a given training sample $X$ is under the generative model. By the law of total probability, we can then write down the maximum likelihood objective:

$$P(X) = \int P(X|z; \theta) \, P(z) \, dz$$
The choice of $P(X|z; \theta)$ for VAEs is often Gaussian. Specifically, the most standard choice is:

$$P(X|z; \theta) = \mathcal{N}\!\left(X \mid f(z; \theta), \, \sigma^2 I\right)$$
This means that the function $f(z; \theta)$ gives us the Gaussian mean, and the covariance matrix is the fixed diagonal matrix $\sigma^2 I$ (where $\sigma$ is a hyperparameter). Notice that this Gaussian expression captures the idea that we just need to produce samples that look like $X$; $f(z; \theta)$ does not have to exactly match some $X$.
This smoothness is critical for generative modelling: if we specify $P(X|z; \theta)$ to be the Dirac delta function (i.e. all the probability mass is on the specific output produced by $f(z; \theta)$), it would be impossible to learn from data, as the likelihood would be zero almost everywhere.
Also note that while Gaussian is the most common choice, it does not have to be so. We just need $P(X|z; \theta)$ to be computable and continuous in $\theta$. For example, if $X$ is binary, then $P(X|z; \theta)$ can be a Bernoulli distribution parametrized by $f(z; \theta)$ (although it's unfathomable why one would do this).
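To make the role of $\sigma$ concrete, here is a minimal sketch (my own illustration, not from the tutorial) of the Gaussian log-likelihood $\log P(X|z; \theta)$ for a single data point, where `f_z` stands in for the decoder output $f(z; \theta)$:

```python
import numpy as np

def gaussian_log_likelihood(x, f_z, sigma):
    """log N(x | f_z, sigma^2 * I) for a single data point.

    x, f_z : 1-D arrays of the same length (e.g. flattened image pixels)
    sigma  : scalar hyperparameter controlling the output noise
    """
    d = x.shape[0]
    return (
        -0.5 * d * np.log(2 * np.pi * sigma**2)
        - 0.5 * np.sum((x - f_z) ** 2) / sigma**2
    )

# A decoder output close to x scores much higher than one far away,
# but both still get nonzero likelihood -- the smoothness argued for above.
x = np.ones(784) * 0.5
print(gaussian_log_likelihood(x, x + 0.01, sigma=0.1))  # high (close to x)
print(gaussian_log_likelihood(x, x + 1.00, sigma=0.1))  # very low (far from x)
```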
Variational Autoencoders
In order to maximize $P(X)$ above, there are two big problems to solve:
- How do we define the latent variables $z$, and what information do they encode?
- How do we deal with the integral over $z$?
Problem 1: How to represent Latents?
For the first problem, VAEs prescribe minimal structure on the latents $z$, and say that samples of $z$ can be drawn from a simple distribution, say $\mathcal{N}(0, I)$. The onus then falls on the model $f(z; \theta)$ to map from a simple Gaussian distribution to the complex distribution which describes our data. We know empirically that this is not a problem for a sufficiently large neural network.
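As a quick illustration of this idea (not from the tutorial), here is a toy, untrained decoder `f` that warps samples from $\mathcal{N}(0, I)$ into a higher-dimensional "data" space; all names and sizes are arbitrary choices for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer "decoder" f(z; theta): maps 2-D latents to 784-dim outputs.
# The weights are random here purely for illustration; learning them is the whole point.
W1, b1 = rng.normal(size=(2, 128)), np.zeros(128)
W2, b2 = rng.normal(size=(128, 784)), np.zeros(784)

def f(z):
    h = np.tanh(z @ W1 + b1)                   # nonlinearity lets f warp the Gaussian
    return 1 / (1 + np.exp(-(h @ W2 + b2)))    # squash outputs into pixel range [0, 1]

# Sample latents from the simple prior P(z) = N(0, I) and push them through f.
z = rng.standard_normal(size=(16, 2))
samples = f(z)                                  # shape (16, 784): 16 generated "images"
print(samples.shape, samples.min(), samples.max())
```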
Problem 2: How to deal with integral over P(z)?
For the second problem, the naive approach is to approximate the integral via sampling. That is, we sample a large number of latents $\{z_1, \dots, z_n\}$ from $P(z)$ and compute $P(X) \approx \frac{1}{n} \sum_i P(X|z_i)$. The problem with this approach is that in high-dimensional spaces, we need a very large $n$ to get an accurate estimate of $P(X)$. This is because for most samples of $z$, $P(X|z)$ will be very close to zero.
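Here is a toy sketch of why the naive estimator is wasteful; the decoder `f`, the value of `sigma`, and the data point `x` are made up for this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a 1-D latent, a fixed "decoder" f, and a Gaussian observation model.
f = lambda z: 10.0 * z                 # hypothetical decoder, just a scaling
sigma = 0.1
x = 5.0                                # the "data point" we want P(x) for

def p_x_given_z(x, z):
    return np.exp(-0.5 * (x - f(z)) ** 2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)

# Naive estimate: P(x) ~= mean over z ~ N(0, 1) of P(x | z).
z = rng.standard_normal(100_000)
weights = p_x_given_z(x, z)
print("naive estimate of P(x):", weights.mean())
# Only a small fraction of the prior samples carry noticeable weight.
print("fraction with non-negligible weight:",
      np.mean(weights > 0.01 * weights.max()))
```

Even in this 1-D example only a small fraction of prior samples contribute meaningfully; with a high-dimensional latent and data space the fraction collapses much further.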
The key idea behind VAEs is to speed up this sampling by attempting to sample values of $z$ that are likely to have produced $X$, and compute $P(X)$ just from those. To do this, we need a new function $Q(z|X)$ which, when given an image $X$, produces a distribution over $z$ values likely to have produced $X$. If the space of $z$ values that are likely under $Q$ is smaller than the space of $z$ values that are likely under the prior $P(z)$, we will be able to estimate $E_{z \sim Q}[P(X|z)]$ much more cheaply.
However, by introducing a new arbitrary distribution $Q(z)$, we are no longer sampling $z$ under the prior $P(z)$, so we cannot directly obtain $P(X)$ from it. Thus we need some way to relate $E_{z \sim Q}[P(X|z)]$ and $P(X)$. This relationship is one of the cornerstones of variational Bayesian methods.
Let us start by defining the KL divergence between the posterior $P(z|X)$ and some arbitrary distribution $Q(z)$ (which may or may not depend on $X$):

$$\mathcal{D}\left[Q(z) \,\|\, P(z|X)\right] = E_{z \sim Q}\left[\log Q(z) - \log P(z|X)\right]$$
Recall that KL divergence is asymmetric, and measures how different two probability distributions are. In this case, the expectation is taken over the distribution of $z$ values under $Q$.
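As a quick numeric illustration of the asymmetry (the two Gaussians below are arbitrary choices; the closed-form KL between univariate Gaussians is standard):

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL[ N(mu1, s1^2) || N(mu2, s2^2) ] for univariate Gaussians."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

# The two directions give different answers, so the order of arguments matters.
print(kl_gauss(0.0, 1.0, 2.0, 0.5))  # KL[P || Q]
print(kl_gauss(2.0, 0.5, 0.0, 1.0))  # KL[Q || P]  (a different value)
```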
Now apply Bayes rule to $P(z|X)$ inside the expectation, and do some rearranging:

$$\mathcal{D}\left[Q(z) \,\|\, P(z|X)\right] = E_{z \sim Q}\left[\log Q(z) - \log P(X|z) - \log P(z)\right] + \log P(X)$$

which rearranges to:

$$\log P(X) - \mathcal{D}\left[Q(z) \,\|\, P(z|X)\right] = E_{z \sim Q}\left[\log P(X|z)\right] - \mathcal{D}\left[Q(z) \,\|\, P(z)\right]$$
Some comments on the above:
- $\log P(X)$ comes out of the expectation as a constant because it does not depend on $z$.
- We can group $\log Q(z)$ and $\log P(z)$ together into their own KL divergence term, $\mathcal{D}[Q(z) \,\|\, P(z)]$.
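As a sanity check (my own toy example, not from the tutorial), here is a conjugate-Gaussian model in which every term of the identity above has a closed form, so we can verify that the two sides agree:

```python
import numpy as np

# Toy conjugate-Gaussian model:
#   P(z) = N(0, 1),  P(X|z) = N(z, s2),  so  P(X) = N(0, 1 + s2)
#   and the true posterior is P(z|X) = N(X / (1 + s2), s2 / (1 + s2)).
s2 = 0.5
X = 1.3
post_mu, post_var = X / (1 + s2), s2 / (1 + s2)

# Pick an arbitrary Gaussian Q(z) = N(m, v); it need not match the posterior.
m, v = -0.4, 0.7

def kl_gauss(mu1, var1, mu2, var2):
    """KL[ N(mu1, var1) || N(mu2, var2) ] for univariate Gaussians."""
    return 0.5 * (np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

log_P_X = -0.5 * np.log(2 * np.pi * (1 + s2)) - X**2 / (2 * (1 + s2))

# E_{z~Q}[log P(X|z)] in closed form, using E[(X - z)^2] = (X - m)^2 + v.
expected_log_lik = -0.5 * np.log(2 * np.pi * s2) - ((X - m) ** 2 + v) / (2 * s2)

lhs = log_P_X - kl_gauss(m, v, post_mu, post_var)   # log P(X) - KL[Q || P(z|X)]
rhs = expected_log_lik - kl_gauss(m, v, 0.0, 1.0)   # E_Q[log P(X|z)] - KL[Q || P(z)]
print(lhs, rhs)                                      # the two sides agree
```

Since $Q$ here was chosen arbitrarily, this also previews the next step: the identity holds for any $Q$, including one that is conditioned on $X$.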
So far, we have not made any assumption on the arbitrary distribution $Q(z)$. In the context of trying to maximize $P(X)$, it makes sense to construct a $Q$ which does depend on $X$, namely one that puts its mass on the $z$ values likely to have produced $X$. So we make that dependency on $X$ explicit:

$$\log P(X) - \mathcal{D}\left[Q(z|X) \,\|\, P(z|X)\right] = E_{z \sim Q(z|X)}\left[\log P(X|z)\right] - \mathcal{D}\left[Q(z|X) \,\|\, P(z)\right]$$
This equation is core to the VAE, so we should understand it deeply.
- The left hand side represents what we want to optimize:
  - $\log P(X)$ is the original maximum likelihood objective: we want our model to produce images that look like $X$.
  - $\mathcal{D}[Q(z|X) \,\|\, P(z|X)]$ is the error or deviation of our tractable, estimated distribution $Q(z|X)$ from the true, intractable posterior $P(z|X)$. This term is always greater than or equal to $0$, and is $0$ if and only if $Q(z|X) = P(z|X)$.
- The right hand side is called the Evidence Lower Bound (ELBO). In Bayesian statistics, the marginal $P(X)$ is called the evidence, because our data is evidence for how good our model is. The RHS is a lower bound on the evidence precisely because the KL divergence term $\mathcal{D}[Q(z|X) \,\|\, P(z|X)] \geq 0$, which implies $\log P(X) \geq \text{RHS}$.
- We cannot directly optimize $\log P(X)$, but we can do the next best thing, which is to optimize the tractable RHS, given an appropriate choice of $Q$ (see the sketch after this list).
- As we increase the capacity of $Q$, the "error" term $\mathcal{D}[Q(z|X) \,\|\, P(z|X)]$ should become smaller and smaller, so the RHS will more tightly approximate the evidence (and lead to better optimization of $\log P(X)$).
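Part of what makes the RHS tractable in practice is the KL term: when $Q(z|X)$ is chosen to be a diagonal Gaussian $\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$ and $P(z) = \mathcal{N}(0, I)$, it has a simple closed form. A quick sketch (the specific $\mu$, $\sigma$ values are arbitrary), checked against a Monte Carlo estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gaussian_to_standard(mu, sigma):
    """Closed-form KL[ N(mu, diag(sigma^2)) || N(0, I) ]."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# Arbitrary example values for a 3-dimensional latent.
mu = np.array([0.5, -1.0, 0.2])
sigma = np.array([0.8, 1.2, 0.5])

# Monte Carlo check: KL = E_{z~Q}[log Q(z) - log P(z)].
z = mu + sigma * rng.standard_normal(size=(200_000, 3))
log_q = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2), axis=1)
log_p = np.sum(-0.5 * np.log(2 * np.pi) - z**2 / 2, axis=1)

print(kl_diag_gaussian_to_standard(mu, sigma))  # closed form
print(np.mean(log_q - log_p))                   # Monte Carlo estimate (should be close)
```

In practice this KL term is computed analytically like this, while the expectation term on the RHS is estimated by sampling $z$ from $Q(z|X)$.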
Notice how the RHS now resembles an auto-encoder:
- "encodes" into a latent
- "decodes" back to reconstruct