  • Evidence Lower Bound (ELBO)
    Data/Machine learning 2021. 8. 24. 18:21

    In the variational auto-encoder (VAE) (my blog posting link), the probability of the latent variable $z$ given an observation $x$ can be written as follows using Bayes' theorem:

    $$ p(z | x) = \frac{ p(x|z) p(z) }{ p(x) } $$

    where $p(z|x)$ is the posterior. However, this posterior is difficult to compute because of the marginal likelihood $p(x)$ in the denominator. $p(x)$ can be expanded as below using the law of total probability:

    $$ p(x) = \int p(x|z) p(z) dz $$
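    To make this expansion concrete, here is a minimal sketch (a toy example of my own, not the VAE itself): a conjugate model with prior $z \sim \mathcal{N}(0, 1)$ and likelihood $x|z \sim \mathcal{N}(z, 1)$, for which the marginal happens to have the closed form $p(x) = \mathcal{N}(x; 0, 2)$, so a simple Monte Carlo estimate of the integral can be checked against it.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def normal_pdf(x, mean, var):
        return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

    # Toy conjugate model (an illustrative assumption, not a real VAE):
    #   prior       p(z)   = N(0, 1)
    #   likelihood  p(x|z) = N(z, 1)
    x_obs = 1.0

    # Monte Carlo estimate of p(x) = ∫ p(x|z) p(z) dz  (law of total probability)
    z_samples = rng.standard_normal(200_000)           # z ~ p(z)
    p_x_mc = normal_pdf(x_obs, z_samples, 1.0).mean()  # average p(x|z) over the samples

    # In this conjugate case the marginal is known exactly: p(x) = N(0, 2)
    p_x_exact = normal_pdf(x_obs, 0.0, 2.0)

    print(p_x_mc, p_x_exact)  # agree up to Monte Carlo noise (both ≈ 0.22)
    ```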

    In general, this integral is intractable (for example, when $p(x|z)$ is a neural network decoder, as in a VAE). Therefore, we use variational inference (VI) to approximate $p(z|x)$ with a tractable distribution $q_\lambda(z|x)$. That is, we solve

    $$ \mathrm{argmin}_{\lambda} KL[ q_\lambda(z|x) || p(z|x) ]  \tag{1} $$ 

    where $\lambda = \{ \mu_{\lambda_0}, \sigma_{\lambda_0}, \cdots \}$ is the set of variational parameters (for example, the mean and standard deviation of a Gaussian $q_\lambda$). By minimizing the KL divergence in $Eq. (1)$, we make the variational distribution approximate the true posterior.

    The KL divergence in $Eq. (1)$ can be re-expressed in terms of the Evidence Lower Bound (ELBO). First, write it out as an expectation over $q_\lambda$:


    $$ KL[q_\lambda(z|x) || p(z|x)] = \mathbb{E}_{q_\lambda} \left[ \log \frac{q_\lambda(z|x)}{p(z|x)} \right]  \tag{2} $$
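    As a quick numerical sanity check of the KL divergence itself (a small sketch added for illustration, using two discrete distributions; the definition it implements is recalled in the next paragraph):

    ```python
    import numpy as np

    def kl_divergence(p, q):
        """Discrete KL[p || q] = sum_i p_i * log(p_i / q_i)."""
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sum(p * np.log(p / q)))

    p = np.array([0.8, 0.2])
    q = np.array([0.5, 0.5])

    print(kl_divergence(p, q))  # ≈ 0.19, strictly positive
    print(kl_divergence(q, p))  # ≈ 0.22, a different value: KL is not symmetric
    print(kl_divergence(p, p))  # 0.0: the divergence vanishes only when p = q
    ```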

    Note that the KL divergence is defined as $ KL[p || q] = \sum_i p_i \log \frac{p_i}{q_i} $; it is always non-negative and equals zero only when $p = q$. In $Eq. (2)$, $p(z|x)$ is intractable. Therefore, we expand $Eq. (2)$ further (see https://www.youtube.com/watch?v=GbCAwVVKaHY&ab_channel=SmartDesignLab%40KAISTYT and https://process-mining.tistory.com/161):

    $$ KL[q_\lambda(z|x) || p(z|x)] = \mathbb{E}_{q_\lambda} \left[ \log \frac{q_\lambda(z|x)}{p(z|x)} \right] $$

    $$ = \mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] - \mathbb{E}_{q_\lambda} [ \log p(z|x) ] $$

    $$ = \mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] - \mathbb{E}_{q_\lambda} \left[ \log \frac{p(z|x) p(x)}{p(x)} \right] $$

    $$ = \mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] - \mathbb{E}_{q_\lambda} \left[ \log \frac{p(z,x)}{p(x)} \right] $$

    $$ = \mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] - \mathbb{E}_{q_\lambda}[\log p(z,x)] + \mathbb{E}_{q_\lambda}[\log p(x)] $$

    $$ = \mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] - \mathbb{E}_{q_\lambda}[\log p(z,x)] + \int q_\lambda (z|x) \log p(x) dz $$

    $$ = \mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] - \mathbb{E}_{q_\lambda}[\log p(z,x)] + \log p(x) \int q_\lambda (z|x) dz $$

    Note that any probability density integrates to 1, i.e., $\int q_\lambda(z|x) dz = 1$, so the last term simplifies:

    $$ = \mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] - \mathbb{E}_{q_\lambda}[\log p(z,x)] + \log p(x) $$

    Summarizing up to this point:

    $$ KL[q_\lambda(z|x) || p(z|x)] = \mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] - \mathbb{E}_{q_\lambda}[\log p(z,x)] + \log p(x) \tag{3} $$

    The third term, $\log p(x)$, is the marginal log-likelihood (or log evidence): the total probability of the data, taking into account all hidden variables and parameters. It can also be read as the (log) probability of generating the observed data under the model. We now rewrite $Eq. (3)$ with respect to $\log p(x)$:

    $$ \log p(x) = - \mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] + \mathbb{E}_{q_\lambda}[\log p(z,x)] + KL[q_\lambda(z|x) || p(z|x)] \tag{4}$$

    Here, $KL[q_\lambda(z|x) || p(z|x)]$ is intractable; we have known this from the beginning, and that is why we have been expanding it. However, since the KL divergence can never be negative (i.e., $KL[\cdot || \cdot] \geq 0$), we can drop this term and obtain a lower bound. Then, $Eq. (4)$ can be written as follows:

    $$ \log p(x) \geq -\mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] + \mathbb{E}_{q_\lambda}[\log p(z,x)] \tag{5}$$

    The right-hand side (RHS) of $Eq.(5)$ is called the "Evidence Lower Bound (ELBO)", since it is a lower bound on the log evidence $\log p(x)$.
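    As a concrete check, the ELBO in $Eq.(5)$ can be estimated by sampling $z \sim q_\lambda(z|x)$ and averaging $\log p(z, x) - \log q_\lambda(z|x)$. The sketch below does this in the toy conjugate Gaussian model introduced earlier (my own illustrative assumption), where $\log p(x)$ is known exactly, so we can confirm that the estimate indeed stays below it.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def log_normal_pdf(z, mean, var):
        return -0.5 * np.log(2 * np.pi * var) - 0.5 * (z - mean) ** 2 / var

    # Toy model: p(z) = N(0, 1), p(x|z) = N(z, 1)  =>  p(x) = N(0, 2)
    x_obs = 1.0
    log_px_exact = log_normal_pdf(x_obs, 0.0, 2.0)

    # An arbitrary variational Gaussian q_lambda(z|x) = N(m, s^2)
    m, s = 0.3, 0.8

    # Monte Carlo estimate of Eq. (5): E_q[log p(z, x)] - E_q[log q_lambda(z|x)]
    z = m + s * rng.standard_normal(200_000)                          # z ~ q_lambda
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x_obs, z, 1.0)
    elbo_mc = np.mean(log_joint - log_normal_pdf(z, m, s ** 2))

    print(elbo_mc, log_px_exact)  # ≈ -1.57 vs -1.52: the ELBO sits below log p(x)
    ```

    The Monte Carlo estimate fluctuates around the true ELBO, which by $Eq.(5)$ can never exceed $\log p(x)$.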

    One important observation from $Eq.(4)$ is that $\log p(x)$ does not depend on the variational parameter $\lambda$, so the first two terms on the RHS ($- \mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] + \mathbb{E}_{q_\lambda}[\log p(z,x)]$) and the third term ($KL[q_\lambda(z|x) || p(z|x)]$) always sum to the same constant. Therefore, maximizing the ELBO implicitly minimizes $KL[q_\lambda(z|x) || p(z|x)]$, which is exactly the objective in $Eq. (1)$.
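    The sketch below makes this constant-sum argument explicit in the same toy Gaussian model (using closed-form expressions derived only for this illustration): as the variational parameters $(m, s)$ change, the ELBO and the KL term change, but their sum always equals $\log p(x)$, and the KL vanishes exactly when $q_\lambda$ matches the true posterior $\mathcal{N}(x/2, 1/2)$.

    ```python
    import numpy as np

    # Toy model as before: p(z) = N(0, 1), p(x|z) = N(z, 1)
    # => p(x) = N(0, 2) and the true posterior is p(z|x) = N(x/2, 1/2).
    x = 1.0
    log_px = -0.5 * np.log(2 * np.pi * 2.0) - x ** 2 / 4.0

    def kl_gauss(m1, v1, m2, v2):
        """Closed form of KL[N(m1, v1) || N(m2, v2)]."""
        return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

    def elbo(m, s):
        """E_q[log p(x|z)] - KL[q || p(z)] for q = N(m, s^2)."""
        expected_loglik = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s ** 2)
        return expected_loglik - kl_gauss(m, s ** 2, 0.0, 1.0)

    # Several settings of lambda = (m, s); the last one is the true posterior.
    for m, s in [(0.0, 1.0), (0.3, 0.8), (0.5, np.sqrt(0.5))]:
        kl_post = kl_gauss(m, s ** 2, x / 2.0, 0.5)       # KL[q || p(z|x)]
        print(f"m={m:.2f} s={s:.2f}  ELBO={elbo(m, s):.4f}  KL={kl_post:.4f}  "
              f"sum={elbo(m, s) + kl_post:.4f}")

    print("log p(x) =", round(log_px, 4))  # every row's sum equals this value
    ```

    In the last row the KL is zero and the ELBO attains its maximum, $\log p(x)$ itself.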

    We can re-write the ELBO from $Eq.(5)$ further:

    $$ ELBO = -\mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] + \mathbb{E}_{q_\lambda}[\log p(z,x)] $$

    $$ = -\mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] + \mathbb{E}_{q_\lambda}[\log p(x|z) p(z) ] $$

    $$ = -\mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] + \mathbb{E}_{q_\lambda}[\log p(x|z)] + \mathbb{E}_{q_\lambda}[\log p(z)] $$

    $$ = \mathbb{E}_{q_\lambda}[\log p(x|z)] -\mathbb{E}_{q_\lambda} [ \log q_\lambda (z|x) ] + \mathbb{E}_{q_\lambda}[\log p(z)] $$

    $$ = \mathbb{E}_{q_\lambda}[\log p(x|z)] - \mathbb{E}_{q_\lambda} \left[ \log \frac{ q_\lambda (z|x) }{ p(z) } \right] $$

    Summarizing the ELBO up to this point:

    $$ ELBO = \mathbb{E}_{q_\lambda}[\log p(x|z)] - \mathbb{E}_{q_\lambda} \left[ \log \frac{ q_\lambda (z|x) }{ p(z) } \right] = \mathbb{E}_{q_\lambda}[\log p(x|z)] - KL[ q_\lambda(z|x) || p(z) ] \tag{6} $$

    In $Eq.(6)$, the first term is the expected reconstruction (log-likelihood) term and the second term is the KL divergence between the approximate posterior and the prior. Do not confuse this with the KL divergence we started from: that was the divergence between the approximate posterior and the true posterior $p(z|x)$, whereas the KL divergence in $Eq.(6)$ is between the approximate posterior and the prior $p(z)$.

    We can choose the prior as we like; for example, a standard normal distribution, which is the usual choice in a VAE.
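    With a standard normal prior $p(z) = \mathcal{N}(0, 1)$ and a Gaussian $q_\lambda(z|x) = \mathcal{N}(\mu, \sigma^2)$, the KL term in $Eq.(6)$ even has a closed form, $\frac{1}{2}(\mu^2 + \sigma^2 - 1 - \log \sigma^2)$. The sketch below (added as a quick check) compares this closed form against a Monte Carlo estimate of $\mathbb{E}_{q_\lambda}\left[\log \frac{q_\lambda(z|x)}{p(z)}\right]$.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def log_normal_pdf(z, mean, var):
        return -0.5 * np.log(2 * np.pi * var) - 0.5 * (z - mean) ** 2 / var

    def kl_to_standard_normal(mu, sigma):
        """Closed form of KL[N(mu, sigma^2) || N(0, 1)]."""
        return 0.5 * (mu ** 2 + sigma ** 2 - 1.0 - np.log(sigma ** 2))

    mu, sigma = 0.3, 0.8

    # Monte Carlo estimate of E_q[log q(z|x) - log p(z)] with z ~ q
    z = mu + sigma * rng.standard_normal(200_000)
    kl_mc = np.mean(log_normal_pdf(z, mu, sigma ** 2) - log_normal_pdf(z, 0.0, 1.0))

    print(kl_to_standard_normal(mu, sigma), kl_mc)  # both ≈ 0.088
    ```

    In a VAE, the encoder outputs $\mu$ and $\sigma$ for each data point, and this closed-form KL term is what keeps the approximate posterior close to the prior during training.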

     

     

     
