The variational lower bound
For a generative model, a good model should assign a high log-likelihood to the observed data; this likelihood of the data under the model is the so-called evidence.
In other words, for a data point \( \vx \), it is desirable to have a high value of \( \log p_{\text{model}}(\vx) \). Expanding this in terms of the latent variables \( \vz \), we get,
\begin{aligned}
\log p_{\text{model}}(\vx) &= \log \left(\int_{\vz} p_{\text{model}}(\vx,\vz) d\vz \right) \\\\
&= \log \left(\int_{\vz} p_{\text{model}}(\vx,\vz) \frac{q(\vz|\vx)}{q(\vz|\vx)} d\vz \right) \\\\
&= \log \left(\int_{\vz} \frac{p_{\text{model}}(\vx,\vz)}{q(\vz|\vx)} q(\vz|\vx) d\vz \right) \\\\
&= \log \left(\expect{\vz \sim q(\vz|\vx)}{\frac{p_{\text{model}}(\vx,\vz)}{q(\vz|\vx)}} \right) \\\\
&\ge \expect{\vz \sim q(\vz|\vx)}{\log \left(\frac{p_{\text{model}}(\vx,\vz)}{q(\vz|\vx)} \right)} \\\\
&= \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx,\vz)} - \expect{\vz \sim q(\vz|\vx)}{\log q(\vz|\vx)} \\\\
&= \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx,\vz)} + \entropy{q(\vz|\vx)}
\end{aligned}
In these steps, we first expanded the marginal into an integral over the joint probability. Then, after multiplying and dividing by \( q(\vz|\vx) \) and rewriting the integral as an expectation, we applied Jensen's inequality in the fifth step: since the logarithm is concave, the log of an expectation is at least the expectation of the log, which lower bounds the log-likelihood.
Finally, in the last step, we represented \( -\expect{\vz \sim q(\vz|\vx)}{\log q(\vz|\vx)} \) as the entropy \( \entropy{q(\vz|\vx)} \).
This last bound is known as the variational lower bound, which we have studied as part of the variational inference approaches.
It is also known as the evidence lower bound (ELBO) because it is a lower bound on the log of the evidence \( p_{\text{model}}(\vx) \).
Variational inference approaches maximize the value of this lower bound.
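To see the bound in action, here is a minimal numerical sketch (the toy model and all names in it are illustrative assumptions, not from the text above): a linear-Gaussian model with \( p(\vz) = \mathcal{N}(0, 1) \) and \( p(\vx|\vz) = \mathcal{N}(\vz, 1) \), for which the evidence \( p(\vx) = \mathcal{N}(\vx; 0, 2) \) is available in closed form, so a Monte Carlo estimate of the lower bound can be compared against \( \log p(\vx) \) directly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy model (an illustrative assumption, not from the text):
#   p(z) = N(0, 1),  p(x|z) = N(z, 1)  =>  evidence p(x) = N(x; 0, 2)
x = 1.5
log_evidence = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))

def elbo(mu_q, sigma_q, n_samples=100_000):
    """Monte Carlo estimate of E_q[log p(x, z)] + H[q] for q(z|x) = N(mu_q, sigma_q^2)."""
    z = rng.normal(mu_q, sigma_q, size=n_samples)                    # z ~ q(z|x)
    log_joint = norm.logpdf(z, loc=0.0, scale=1.0) + norm.logpdf(x, loc=z, scale=1.0)
    entropy_q = 0.5 * np.log(2.0 * np.pi * np.e * sigma_q ** 2)      # closed-form Gaussian entropy
    return log_joint.mean() + entropy_q

# Any choice of q stays below log p(x); the exact posterior N(x/2, 1/2) makes it tight.
print("log p(x)          :", log_evidence)
print("ELBO, arbitrary q :", elbo(mu_q=0.0, sigma_q=1.0))
print("ELBO, posterior q :", elbo(mu_q=x / 2.0, sigma_q=np.sqrt(0.5)))
```

With an arbitrary \( q \) the estimate sits strictly below \( \log p(\vx) \); with the exact posterior \( q(\vz|\vx) = \mathcal{N}(\vx/2, 1/2) \) the bound becomes tight up to Monte Carlo noise.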
Intuitively, maximizing the first term, \( \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx,\vz)} \), encourages the model to assign a high joint likelihood to the observed variables and the latent variables sampled from \( q \).
Maximizing the second term, \( \entropy{q(\vz|\vx)} \), encourages \( q \) to place probability mass on many values of \( \vz \), rather than collapsing onto a single most likely value of \( \vz \).
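To make this anti-collapse effect concrete with a standard special case (the Gaussian form here is an illustration, not part of the derivation above): if \( q(\vz|\vx) \) is a \( d \)-dimensional diagonal Gaussian with standard deviations \( \sigma_1, \dots, \sigma_d \), its entropy has the closed form
$$ \entropy{q(\vz|\vx)} = \frac{d}{2}\log(2\pi e) + \sum_{i=1}^{d} \log \sigma_i, $$
which tends to \( -\infty \) as any \( \sigma_i \to 0 \), so maximizing the bound penalizes \( q \) for collapsing onto a single point.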
We can write this lower bound as a function of the parameters of \( q \), as \( \loss(q) \).
Variational autoencoders are trained to maximize the value of \( \loss(q) \).
$$ \loss(q) = \expect{\vz \sim q(\vz|\vx)}{\log p_{\text{model}}(\vx,\vz) } + \entropy{q(\vz|\vx)} $$
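As a sketch of what maximizing \( \loss(q) \) looks like in practice (assuming PyTorch and reusing the toy linear-Gaussian model from the earlier sketch; the parameterization below is an illustrative assumption, not a prescription from the text), one can perform stochastic gradient ascent on \( \loss(q) \) over the parameters of a Gaussian \( q(\vz|\vx) \):

```python
import torch

torch.manual_seed(0)

# Same toy model as above: p(z) = N(0, 1), p(x|z) = N(z, 1), observed x.
x = torch.tensor(1.5)

def log_joint(z):
    # log p(x, z) = log p(z) + log p(x|z)
    return (torch.distributions.Normal(0.0, 1.0).log_prob(z)
            + torch.distributions.Normal(z, 1.0).log_prob(x))

# Parameters of q(z|x) = N(mu, sigma^2); sigma parameterized via its log for positivity.
mu = torch.zeros((), requires_grad=True)
log_sigma = torch.zeros((), requires_grad=True)
optimizer = torch.optim.Adam([mu, log_sigma], lr=0.05)

for step in range(2000):
    q = torch.distributions.Normal(mu, log_sigma.exp())
    z = q.rsample((128,))                            # reparameterized samples keep gradients
    neg_elbo = -(log_joint(z).mean() + q.entropy())  # minimize the negative of L(q)
    optimizer.zero_grad()
    neg_elbo.backward()
    optimizer.step()

# For this conjugate model the optimum is the true posterior N(x/2, 1/2).
print("learned q: mu =", mu.item(), "sigma =", log_sigma.exp().item())
```

In a variational autoencoder, the mean and scale of \( q(\vz|\vx) \) would instead be produced by an encoder network from \( \vx \), and \( p_{\text{model}}(\vx|\vz) \) by a decoder network, but the quantity being maximized is this same \( \loss(q) \).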