Variational Autoencoder (VAE)


Variational Autoencoders (VAEs) CITE[kingma-2013] are generative models: more specifically, probabilistic directed graphical models whose posterior is approximated by an autoencoder-like neural network. Traditional variational approaches rely on slower iterative fixed-point equations. VAEs, being neural networks, can instead be trained with gradient-based methods while still enabling approximate inference. In recent years, VAEs have proven to be not only elegant but also capable of state-of-the-art results in generative modeling.


To understand VAEs, we recommend familiarity with the concepts in

  • Probability: A sound understanding of conditional and marginal probabilities and Bayes' theorem is desirable.
  • Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.

Follow the above links to first get acquainted with the corresponding concepts.

Autoencoder: a quick recap

Autoencoders are neural networks with two separately parametrized components — an encoder and a decoder. In the traditional autoencoder, the encoder arrives at intermediate transforms, the so-called encodings and passes those on to the decoder. The decoder then transforms these encodings back to the input. Trained jointly, the decoder encourages the encoder to learn meaningful representations of the data. Mathematically, let's represent the input as \( \vx \), the encodings as \( \vz \), and the output of the decoder as \( \dash{\vx} \). The autoencoder process can be summarized as this chain of events.

$$ \vx \overset{\text{encoder}(\vx;\mTheta_e)}{\Longrightarrow} \vz \overset{\text{decoder}(\vz;\mTheta_d)}{\Longrightarrow} \dash{\vx} $$

where, \( \mTheta_e \) and \( \mTheta_d \) are the separate parametrizations of the encoder and decoder networks, respectively.
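The chain above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the single-layer encoder and decoder, the dimensions, and the random untrained weights are assumptions made for the example, not an architecture prescribed here.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, theta_e):
    """Map an input x to an encoding z (one tanh layer, an assumed architecture)."""
    W, b = theta_e
    return np.tanh(W @ x + b)

def decoder(z, theta_d):
    """Map an encoding z back to the input space (one linear layer)."""
    W, b = theta_d
    return W @ z + b

# Hypothetical sizes: 8-dimensional inputs compressed to 3-dimensional encodings.
input_dim, code_dim = 8, 3

# Separate parametrizations for the encoder and the decoder.
theta_e = (rng.normal(size=(code_dim, input_dim)), np.zeros(code_dim))
theta_d = (rng.normal(size=(input_dim, code_dim)), np.zeros(input_dim))

x = rng.normal(size=input_dim)
z = encoder(x, theta_e)        # x  ==> z
x_dash = decoder(z, theta_d)   # z  ==> reconstruction of x

# Training would adjust theta_e and theta_d jointly to reduce this error.
reconstruction_error = np.mean((x - x_dash) ** 2)
```

With untrained weights the reconstruction error is large; joint training of both parameter sets is what drives it down.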

Variational autoencoder

Variational autoencoders (VAEs) are generative models with latent variables, much like Gaussian mixture models (GMMs). The encoder in a VAE infers the latent variables that may have generated the observed data point. The decoder then attempts to draw, from the latent variables inferred by the encoder, a sample that is approximately the same as the input.

In VAEs, the encoder component is known as the recognition model or the inference model. The encoder is a conditional Bayesian network of the form \( q(\vz|\vx) \). Given an input example \( \vx \), the recognition model produces \( \vz \) as a sample from the approximate posterior \( q(\vz | \vx) \).

The decoder component is known as the generative model or generator network. The decoder is a generative model of the form \( p(\vx|\vz)p(\vz) \), a form that you may recognize from GMMs. The decoder uses the latent variable to compute the conditional distribution \( p(\vx | \vz) \) over the data, that is, the likelihood of \( \vx \) given \( \vz \).

The inference model and generator model events in a VAE can be summarized as follows.

$$ \vx \overset{\text{inference}(\vx;\mTheta_e)}{\Longrightarrow} q(\vz|\vx) \overset{\text{generator}(\vz;\mTheta_d)}{\Longrightarrow} p(\vx|\vz) $$
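As a rough sketch of this chain, the inference model can return the parameters of a distribution rather than a single encoding, and the generator can return the parameters of \( p(\vx|\vz) \). The code below assumes, purely for illustration, that both distributions are Gaussian with diagonal covariance and that each network is a single linear layer; the names `W_mu`, `W_log_var`, and `W` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, latent_dim = 8, 2  # hypothetical sizes

# Separate (untrained) parameters for the inference and generator models.
theta_e = {
    "W_mu": 0.1 * rng.normal(size=(latent_dim, input_dim)),
    "W_log_var": 0.1 * rng.normal(size=(latent_dim, input_dim)),
}
theta_d = {"W": 0.1 * rng.normal(size=(input_dim, latent_dim))}

def inference(x, theta_e):
    """Return the mean and variance of an assumed diagonal Gaussian q(z | x)."""
    mu = theta_e["W_mu"] @ x
    var = np.exp(theta_e["W_log_var"] @ x)  # exp keeps the variance positive
    return mu, var

def generator(z, theta_d):
    """Return the mean of an assumed Gaussian p(x | z)."""
    return theta_d["W"] @ z

x = rng.normal(size=input_dim)
mu, var = inference(x, theta_e)                          # x ==> q(z | x)
z = mu + np.sqrt(var) * rng.standard_normal(latent_dim)  # z ~ q(z | x)
x_mean = generator(z, theta_d)                           # z ==> p(x | z)
```

The difference from the traditional autoencoder is that the encoding is a random draw from \( q(\vz|\vx) \), not a deterministic function of \( \vx \).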

The encoder and the decoder are easily extended to multiple levels or layers of latent variables. For example, the encoder may model a multilevel conditional Bayesian network of the form \( q(\vz_0|\vz_1)\ldots q(\vz_L|\vx) \), where \( \vz_l \) denotes the latent variables at level \( l=0,\ldots,L \). Similarly, the decoder may represent a multilevel Bayesian network of the form \( p(\vx | \vz_L) p(\vz_L | \vz_{L-1}) \ldots p(\vz_1 | \vz_0) p(\vz_0) \).

Just like the traditional autoencoder, the encoder and decoder work together to infer useful representations of the data, in a probabilistic setting.

The variational lower bound

A good generative model should assign a high log-likelihood to the observed data, the so-called evidence. In other words, for a data point \( \vx \), it is desirable to have a high value of \( \log p_{\text{model}}(\vx) \). Expanding this in terms of the latent variables, we get

\begin{aligned} \log p_{\text{model}}(\vx) &= \log \left(\int_{\vz} p_{\text{model}}(\vx,\vz) d\vz \right) \\\\ &= \log \left(\int_{\vz} p_{\text{model}}(\vx,\vz) \frac{q(\vz|\vx)}{q(\vz|\vx)} d\vz \right) \\\\ &= \log \left(\int_{\vz} \frac{p_{\text{model}}(\vx,\vz)}{q(\vz|\vx)} q(\vz|\vx) d\vz \right) \\\\ &= \log \left(\expect{\vz \sim q(\vz|\vx)}{\frac{p_{\text{model}}(\vx,\vz)}{q(\vz|\vx)}} \right) \\\\ &\ge \expect{\vz \sim q(\vz|\vx)}{\log \left(\frac{p_{\text{model}}(\vx,\vz)}{q(\vz|\vx)} \right)} \\\\
&= \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx,\vz)} - \expect{\vz \sim q(\vz|\vx)}{\log q(\vz|\vx)} \\\\ &= \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx,\vz) } + \entropy{q(\vz|\vx)} \end{aligned}

In these steps, we first expanded the marginal into an integral over the joint probability. Then, after multiplying and dividing by \( q(\vz|\vx) \) and rewriting the integral as an expectation, we applied Jensen's inequality to lower bound the log-likelihood in the 5th step. Finally, in the last step, we represented \( -\expect{r}{\log r} \) as the entropy \( \entropy{r} \).

This last bound is known as the variational lower bound that we have studied as part of the variational inference approaches. It is also known as the evidence lower bound (ELBO) because it is a lower bound on the log evidence \( \log p_{\text{model}}(\vx) \). Variational inference approaches maximize the value of this lower bound.
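A small numerical check can make the bound concrete. The sketch below assumes a hypothetical latent variable that takes only three values, so that the evidence can be computed exactly by summation instead of integration; the probabilities in `prior` and `likelihood` are made up for illustration.

```python
import numpy as np

# Hypothetical toy model with a discrete latent variable z in {0, 1, 2}.
prior = np.array([0.5, 0.3, 0.2])          # p(z)
likelihood = np.array([0.10, 0.60, 0.30])  # p(x | z) for one fixed observed x
joint = prior * likelihood                 # p(x, z)
log_evidence = np.log(joint.sum())         # log p(x), exact for this toy model

def elbo(q):
    """E_q[log p(x, z)] + H(q) for a discrete q(z | x) with full support."""
    return np.sum(q * np.log(joint)) - np.sum(q * np.log(q))

# Evaluate the bound for many randomly drawn approximate posteriors q.
rng = np.random.default_rng(0)
random_qs = rng.dirichlet(np.ones(3), size=1000)
gaps = np.array([log_evidence - elbo(q) for q in random_qs])

# The bound is tight when q equals the true posterior p(z | x).
posterior = joint / joint.sum()
gap_at_posterior = log_evidence - elbo(posterior)
```

Every entry of `gaps` is non-negative, and the gap vanishes at the true posterior. The gap equals the KL-divergence from \( q(\vz|\vx) \) to the true posterior, which is why maximizing the bound pulls \( q \) toward it.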

Intuitively, the first term, \( \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx,\vz)} \), encourages the model to achieve a high joint likelihood of the latent variables and the observed variables. Maximizing the second term, \( \entropy{q(\vz|\vx)} \), encourages the placement of probability mass on many values of \( \vz \), as opposed to collapsing to a single most likely value of \( \vz \).

We can write this lower bound as a function of the parameters of \( q \), as \( \loss(q) \). Variational autoencoders are trained to maximize the value of \( \loss(q) \).

$$ \loss(q) = \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx,\vz) } + \entropy{q(\vz|\vx)} $$

The VAE loss: practically

For implementation though, we like to keep things simple. Let's expand the variational lower bound further.

\begin{aligned} \loss(q) &= \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx,\vz) } + \entropy{q(\vz|\vx)} \\\\ &= \expect{\vz \sim q(\vz|\vx)}{\log \left( p_{\text{model}}(\vx | \vz)p_{\text{model}}(\vz) \right) } - \expect{\vz \sim q(\vz|\vx)}{\log q(\vz|\vx)} \\\\ &= \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx | \vz)} + \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vz) } - \expect{\vz \sim q(\vz|\vx)}{\log q(\vz|\vx)} \\\\ &= \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx | \vz)} - \expect{\vz \sim q(\vz|\vx)}{\log \frac{q(\vz|\vx)}{p_{\text{model}}(\vz)}} \\\\ &= \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx | \vz)} - D_{\text{KL}}\left(q(\vz|\vx) || p_{\text{model}}(\vz) \right) \end{aligned}

where, \( D_{\text{KL}} (q || p) \) is the KL-divergence between the distributions \( q \) and \( p \).
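The equivalence of the first and last lines of this expansion can be verified numerically. The sketch below assumes a hypothetical three-valued latent variable and made-up probabilities, so that every expectation is an exact finite sum.

```python
import numpy as np

# Hypothetical discrete toy model; all numbers are illustrative.
prior = np.array([0.5, 0.3, 0.2])          # p(z)
likelihood = np.array([0.10, 0.60, 0.30])  # p(x | z) for one fixed observed x
q = np.array([0.2, 0.5, 0.3])              # some approximate posterior q(z | x)

# First line of the expansion: E_q[log p(x, z)] + H(q).
entropy = -np.sum(q * np.log(q))
joint_form = np.sum(q * np.log(likelihood * prior)) + entropy

# Last line of the expansion: E_q[log p(x | z)] - KL(q || p(z)).
kl = np.sum(q * np.log(q / prior))
reconstruction_form = np.sum(q * np.log(likelihood)) - kl
```

Both expressions evaluate to the same number; only the grouping of terms differs. The second grouping is the one that maps onto the familiar pieces of an autoencoder loss.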

From the autoencoder perspective, the first term in this expansion, \( \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx | \vz)} \), is the expected log-likelihood of the input under the distribution that the decoder produces from the encoding \( \vz \). In other words, this term is akin to the negative of the reconstruction error that we have studied and applied as a loss function in autoencoders. If we were training a variational autoencoder on image data with a Gaussian decoder of fixed variance, maximizing this term would amount, up to a scale factor and an additive constant, to minimizing the mean squared error between the input image and the output of the decoder.

Because it enters with a negative sign, the second term, \( D_{\text{KL}}\left(q(\vz|\vx) || p_{\text{model}}(\vz) \right) \), must be small to maximize \( \loss(q) \). This term encourages the approximate posterior distribution \( q(\vz | \vx) \) to stay close to the model prior \( p_{\text{model}}(\vz) \).
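The relation between the log-likelihood term and the mean squared error can be checked directly. The sketch below assumes a Gaussian decoder likelihood with a fixed isotropic variance `sigma**2`; that choice, the function name, and the toy "images" are assumptions for illustration.

```python
import numpy as np

def gaussian_log_likelihood(x, x_hat, sigma=1.0):
    """log N(x; x_hat, sigma^2 I), with x_hat playing the role of the decoder output."""
    d = x.size
    squared_error = np.sum((x - x_hat) ** 2)
    return -0.5 * squared_error / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

rng = np.random.default_rng(0)
d = 16
x = rng.random(d)                              # a toy flattened "image"
x_hat_a = x + 0.05 * rng.standard_normal(d)    # a close reconstruction
x_hat_b = x + 0.50 * rng.standard_normal(d)    # a poor reconstruction

mse_a = np.mean((x - x_hat_a) ** 2)
mse_b = np.mean((x - x_hat_b) ** 2)
ll_a = gaussian_log_likelihood(x, x_hat_a)
ll_b = gaussian_log_likelihood(x, x_hat_b)

# With sigma = 1, the log-likelihood is -(d / 2) * MSE plus a constant,
# so differences in log-likelihood are rescaled differences in MSE.
ll_difference = ll_a - ll_b
scaled_mse_difference = -0.5 * d * (mse_a - mse_b)
```

Here `ll_difference` and `scaled_mse_difference` agree, so under this assumed decoder, ranking reconstructions by log-likelihood is the same as ranking them by mean squared error.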

A simple example VAE

Consider a VAE to be trained on image data. For simplicity, we will choose \( q(\vz | \vx) \) to be a Gaussian distribution. We will also assume the model prior \( p_{\text{model}}(\vz) \) to be the standard Gaussian \( \Gauss(0,1) \).

In this case, the encoder network could be a stack of convolutional layers ending with a fully connected layer that has two kinds of outputs: the mean and the variance of the distribution \( q \). This means that, for a given input \( \vx \), the encoder will return two outputs, \( \vmu_\vx \) and \( \mSigma_\vx \), the mean and variance of the distribution \( q(\vz|\vx) \), respectively.

We can now sample \( \vz \) from the distribution \( q(\vz|\vx) \), that is from \( \Gauss(\vmu_\vx, \mSigma_\vx) \).
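One common way to draw such samples is to scale and shift standard normal noise, as sketched below. The code assumes a diagonal covariance, so the variance is stored as a vector, and it uses made-up values standing in for the encoder outputs \( \vmu_\vx \) and \( \mSigma_\vx \); the function name `sample_q` is hypothetical.

```python
import numpy as np

def sample_q(mu, var, rng, num_samples=1):
    """Draw samples from N(mu, diag(var)) as mu + sqrt(var) * eps, eps ~ N(0, I)."""
    eps = rng.standard_normal((num_samples, mu.size))
    return mu + np.sqrt(var) * eps

rng = np.random.default_rng(0)

# Illustrative stand-ins for the encoder outputs for one input x.
mu_x = np.array([1.0, -2.0])
var_x = np.array([0.25, 4.0])

z = sample_q(mu_x, var_x, rng, num_samples=200_000)

# The empirical moments of the samples should be close to mu_x and var_x.
empirical_mean = z.mean(axis=0)
empirical_var = z.var(axis=0)
```

Writing the sample as a deterministic function of the mean, the variance, and independent noise is convenient for gradient-based training, because the randomness no longer depends on the encoder parameters.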

The decoder is also a fully convolutional network that takes the samples \( \vz \) as input. The output of the decoder has the same dimensionality as the input to the encoder, the original image.

For training this network, the loss includes the reconstruction error term and the KL-divergence term that we described earlier in the section on VAE loss.

  • The reconstruction error is simply the mean squared error between the input image and the output of the decoder.
  • The KL-divergence term compares the approximating conditional distribution \( q(\vz|\vx) \) to the prior distribution \( p_{\text{model}}(\vz) \). In this case, it is the KL-divergence of \( \Gauss(\vmu_\vx, \mSigma_\vx) \) to \( \Gauss(0,1) \).
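For Gaussians, the KL-divergence term has a well-known closed form. Assuming a diagonal covariance with per-dimension variances \( \sigma_j^2 \) and a standard Gaussian prior, it is \( \frac{1}{2} \sum_j \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right) \). The sketch below implements this formula and compares it with a Monte Carlo estimate; the function name and the example values are illustrative.

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))

rng = np.random.default_rng(0)

# Illustrative stand-ins for the encoder outputs for one input x.
mu = np.array([0.5, -1.0])
var = np.array([0.5, 2.0])

kl_exact = kl_diag_gaussian_to_standard(mu, var)

# Monte Carlo check: KL = E_q[log q(z) - log p(z)] with z ~ q.
z = mu + np.sqrt(var) * rng.standard_normal((200_000, mu.size))
log_q = -0.5 * np.sum((z - mu) ** 2 / var + np.log(2 * np.pi * var), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_monte_carlo = np.mean(log_q - log_p)
```

The closed form avoids sampling noise in this part of the loss, so only the reconstruction term needs to be estimated from samples of \( \vz \).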

Training a VAE

Training the VAE involves inferring the parameters of the model that maximize the variational lower bound. These parameters and training data are task-dependent. For example, in the case of the simple VAE example we described in the previous section, the parameters to be trained are the weight matrices of the fully convolutional network and those of the fully-connected output units. The training set will be image data.

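As with all neural networks, training a VAE follows the usual process. First we define a task-dependent loss for the predictions of the model. In the case of a VAE, the loss to minimize is the negative of the variational lower bound: the reconstruction error plus the KL-divergence term. Subject to this loss, we utilize a gradient-based optimization strategy such as stochastic gradient descent (SGD) or one of its variants to fit the model parameters to the available training data. The gradients are computed using backpropagation. Because \( \vz \) is sampled, the sample is commonly written as a deterministic function of the encoder outputs and independent noise (the reparameterization trick of CITE[kingma-2013]), so that backpropagation can pass through the sampling step.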
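To make the recipe concrete, here is a minimal sketch of the training loop for a deliberately tiny VAE. Everything specific in it is an assumption for illustration: the synthetic data, the linear encoder and decoder (instead of convolutional networks), the unit-variance Gaussian decoder that turns the reconstruction term into half the squared error, the plain gradient descent update, and the hand-written backpropagation that a deep learning framework would normally provide automatically.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 512 points in 5-D generated from 2 hidden factors.
n, d, k = 512, 5, 2
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
X += 0.1 * rng.standard_normal((n, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Linear encoder (A, a give the mean; C, c the log-variance), linear decoder (B, b).
params = {
    "A": 0.1 * rng.standard_normal((k, d)), "a": np.zeros(k),
    "C": np.zeros((k, d)), "c": np.zeros(k),
    "B": 0.1 * rng.standard_normal((d, k)), "b": np.zeros(d),
}

def loss_and_grads(p, X, eps):
    """Negative ELBO (squared error + KL) and its gradients for one batch."""
    n = X.shape[0]
    mu = X @ p["A"].T + p["a"]
    log_var = X @ p["C"].T + p["c"]
    std = np.exp(0.5 * log_var)
    z = mu + std * eps                       # reparameterized sample from q(z|x)
    x_hat = z @ p["B"].T + p["b"]
    r = x_hat - X
    recon = 0.5 * np.sum(r**2, axis=1)
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
    loss = np.mean(recon + kl)

    # Backpropagation, written out by hand for this small model.
    g_z = r @ p["B"]
    g_mu = g_z + mu
    g_lv = 0.5 * g_z * eps * std + 0.5 * (np.exp(log_var) - 1.0)
    grads = {
        "B": r.T @ z / n, "b": r.mean(axis=0),
        "A": g_mu.T @ X / n, "a": g_mu.mean(axis=0),
        "C": g_lv.T @ X / n, "c": g_lv.mean(axis=0),
    }
    return loss, grads

# Fixed noise so that the loss before and after training is comparable.
eval_eps = np.random.default_rng(1).standard_normal((n, k))
initial_loss, _ = loss_and_grads(params, X, eval_eps)

learning_rate = 0.05
for step in range(2000):
    eps = rng.standard_normal((n, k))        # fresh noise at every step
    _, grads = loss_and_grads(params, X, eps)
    for name in params:
        params[name] -= learning_rate * grads[name]

final_loss, _ = loss_and_grads(params, X, eval_eps)
```

Comparing `initial_loss` with `final_loss` shows whether the negative lower bound went down. In practice, one would use mini-batches, an adaptive optimizer, and automatic differentiation instead of the manual gradients shown here.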

The recipe for training is standard. No special derivations are needed, unlike those necessitated in the case of variational inference on probabilistic graphical models. The neural architectures make it extremely easy with a standardized training process.
