## From likelihood to KL-divergence

A good generative model should assign a high log-likelihood to the observed data, the so-called **evidence**.
In other words, for a data point \( \vx \), we want \( \log p(\vx) \) to be high.
Expanding this in terms of the latent variables \( \vz \), we get

\begin{aligned}
\log p(\vx) &= \log \left(\int_{\vz} p(\vx,\vz) d\vz \right) \\\\
&= \log \left(\int_{\vz} p(\vx,\vz) \frac{q(\vz)}{q(\vz)} d\vz \right) \\\\
&= \log \left(\int_{\vz} \frac{p(\vx,\vz)}{q(\vz)} q(\vz) d\vz \right) \\\\
&= \log \left(\expect{\vz \sim q(\vz)}{\frac{p(\vx,\vz)}{q(\vz)}} \right) \\\\
\end{aligned}

In these steps, we first expanded the marginal into an integral over the joint probability.
If \( \vz \) follows a discrete probability distribution, replace the integral with a summation.
The rest of the analysis is unchanged.

Then, in the second step, we retained equality by multiplying and dividing by an arbitrary distribution \( q(\vz) \), which only needs to be nonzero wherever \( p(\vx,\vz) \) is nonzero.
Finally, after rearranging the terms, we identified this quantity as an expectation over the distribution \( q(\vz) \), because \( \expect{a \sim p(a)}{f(b,a)} = \int_{a} f(b,a) p(a) da \).
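As a quick sanity check of this identity, here is a minimal sketch for a discrete latent variable, where the integral becomes a sum. The joint probabilities and the distribution `q` are made-up numbers chosen only for illustration; they do not come from any particular model.

```python
import math

# Hypothetical joint probabilities p(x, z) for one fixed observation x
# and a discrete latent z in {0, 1, 2}. These values are assumptions.
joint = [0.10, 0.25, 0.05]

# An arbitrary distribution q(z), positive wherever p(x, z) is positive.
q = [0.2, 0.5, 0.3]

# Marginal likelihood p(x) by summing the joint over z.
p_x = sum(joint)

# The same quantity written as an expectation under q of p(x, z) / q(z).
p_x_as_expectation = sum(q_z * (p_xz / q_z) for p_xz, q_z in zip(joint, q))

print(p_x, p_x_as_expectation)
```

Both expressions agree up to floating-point error, whichever valid `q` is plugged in.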

Now, note that the logarithm is a concave function. So, by Jensen's inequality, \( \log \expect{}{a} \ge \expect{}{\log a} \) for any positive random variable \( a \).

\begin{aligned}
\log p(\vx) &= \log \left(\expect{\vz \sim q(\vz)}{\frac{p(\vx,\vz)}{q(\vz)}} \right) \\\\
&\ge \expect{\vz \sim q(\vz)}{\log \left(\frac{p(\vx,\vz)}{q(\vz)} \right)} \\\\
\end{aligned}

Thus, using Jensen's inequality, we have lower bounded the log-likelihood.
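The sketch below evaluates the right-hand side of this inequality for a few candidate distributions on a toy discrete model. The joint probabilities, the candidate distributions, and the helper name `lower_bound` are all hypothetical and used only for illustration.

```python
import math

# Hypothetical joint probabilities p(x, z) for a fixed x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
log_evidence = math.log(sum(joint))


def lower_bound(q):
    """Return E_{z ~ q}[log(p(x, z) / q(z))] for a discrete q with full support."""
    return sum(q_z * math.log(p_xz / q_z) for p_xz, q_z in zip(joint, q))


# A few arbitrary candidate distributions q(z); each sums to one.
candidates = [
    [0.2, 0.5, 0.3],
    [1 / 3, 1 / 3, 1 / 3],
    [0.7, 0.2, 0.1],
]

bounds = [lower_bound(q) for q in candidates]
for q, bound in zip(candidates, bounds):
    print(q, bound, "<=", log_evidence)
```

Every candidate yields a value no larger than \( \log p(\vx) \), as Jensen's inequality predicts, though how far below it sits depends on the choice of \( q(\vz) \).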

Trudging along, with some basic mathematical manipulation, we get,

\begin{aligned}
\log p(\vx) &\ge \expect{\vz \sim q(\vz)}{\log \left(\frac{p(\vx,\vz)}{q(\vz)} \right)} \\\\
&= \expect{\vz \sim q(\vz)}{\log \left(\frac{p(\vx)p(\vz|\vx)}{q(\vz)} \right)} \\\\
&= \expect{\vz \sim q(\vz)}{\log p(\vx)} + \expect{\vz \sim q(\vz)}{\log \frac{p(\vz|\vx)}{q(\vz)}}\\\\
&= \expect{\vz \sim q(\vz)}{\log p(\vx)} - D_{\text{KL}}\left( q(\vz) || p(\vz|\vx) \right) \\\\
\label{eqn:lik-to-kldiv}
\end{aligned}

where we have replaced \( \expect{\vz \sim q(\vz)}{\log \frac{p(\vz|\vx)}{q(\vz)}} \) with the negative KL-divergence term \( -D_{\text{KL}}\left( q(\vz) || p(\vz|\vx) \right) \) because \( D_{\text{KL}}(q || p) = \expect{a \sim q}{\log \frac{q(a)}{p(a)}} \), and inverting the ratio inside the logarithm flips the sign.

The first term, \( \expect{\vz \sim q(\vz)}{\log p(\vx) } \), in the last equation is a constant with respect to the distribution \( q(\vz) \); it is simply \( \log p(\vx) \), which does not depend on \( \vz \).
Thus, to maximize the overall sum, which is the lower bound on the evidence log-likelihood \( \log p(\vx) \), we need to minimize the second term, the KL-divergence.

In fact, if we are able to drive the KL-divergence term to zero, which happens exactly when \( q(\vz) = p(\vz|\vx) \), the bound holds with equality, as is easy to check.
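This is easy to check numerically. The sketch below uses a toy discrete model with made-up joint probabilities; the helper names `lower_bound` and `kl_divergence` are hypothetical. It compares the gap between \( \log p(\vx) \) and the bound against the KL-divergence to the true posterior, then repeats the check with \( q(\vz) \) set to that posterior.

```python
import math

# Hypothetical joint probabilities p(x, z) for a fixed x and z in {0, 1, 2}.
joint = [0.10, 0.25, 0.05]
p_x = sum(joint)
log_evidence = math.log(p_x)

# True posterior p(z | x) = p(x, z) / p(x).
posterior = [p_xz / p_x for p_xz in joint]


def lower_bound(q):
    """Return E_{z ~ q}[log(p(x, z) / q(z))] for a discrete q with full support."""
    return sum(q_z * math.log(p_xz / q_z) for p_xz, q_z in zip(joint, q))


def kl_divergence(q, p):
    """Return D_KL(q || p) for discrete distributions with full support."""
    return sum(q_a * math.log(q_a / p_a) for q_a, p_a in zip(q, p))


# An arbitrary approximate distribution q(z).
q = [0.2, 0.5, 0.3]

# The slack in the bound should match the KL-divergence to the posterior.
gap = log_evidence - lower_bound(q)
kl = kl_divergence(q, posterior)

# With q equal to the posterior, the slack should vanish.
tight_gap = log_evidence - lower_bound(posterior)

print(gap, kl, tight_gap)
```

For this toy model the slack in the bound matches the KL-divergence, and it vanishes (up to floating-point error) when `q` is the posterior.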

Therefore, minimizing the KL-divergence of the approximate distribution \( q(\vz) \) to the true conditional density \( p(\vz|\vx) \) is the key to making this lower bound on the log-likelihood of the evidence as tight as possible.