In variational Bayesian methods, the **evidence lower bound** (often abbreviated **ELBO**, also sometimes called the **variational lower bound**^{[1]} or **negative variational free energy**) is a useful lower bound on the log-likelihood of some observed data.

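Let $X$ and $Z$ be random variables, jointly distributed with distribution $p_\theta$. For example, $p_\theta(X)$ is the marginal distribution of $X$, and $p_\theta(Z \mid X)$ is the conditional distribution of $Z$ given $X$. Then, for any sample $x \sim p_\theta$, and any distribution $q_\phi(\cdot \mid x)$ over $Z$, we have

$$\ln p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$

The left-hand side is called the *evidence* for $x$, and the right-hand side is called the *evidence lower bound* for $x$, or *ELBO*. We refer to this inequality as the *ELBO inequality*.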

In the terminology of variational Bayesian methods, the distribution $p_\theta(X)$ is called the *evidence*. Some authors use the term *evidence* to mean $\ln p_\theta(X)$, others call $\ln p_\theta(X)$ the *log-evidence*, and some use the terms *evidence* and *log-evidence* interchangeably.

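There is no generally fixed notation for the ELBO. In this article we use

$$L(\phi, \theta; x) := \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$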

Suppose we have an observable random variable $X$, and we want to find its true distribution $p^*$. This would allow us to generate data by sampling, and to estimate probabilities of future events. In general, it is impossible to find $p^*$ exactly, forcing us to search for a good **approximation**.

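That is, we define a sufficiently large parametric family $\{p_\theta\}_{\theta \in \Theta}$ of distributions, then solve for $\min_\theta L(p_\theta, p^*)$ for some loss function $L$. One possible way to solve this is to consider a small variation from $p_\theta$ to $p_{\theta + \delta\theta}$, and solve for $L(p_\theta, p^*) - L(p_{\theta + \delta\theta}, p^*) = 0$. This is a problem in the calculus of variations, thus it is called the **variational method**.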

Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider *implicitly parametrized* probability distributions:

- First, define a simple distribution $p(z)$ over a latent random variable $Z$. Usually a normal distribution or a uniform distribution suffices.
- Next, define a family of complicated functions $f_\theta$ (such as a deep neural network) parametrized by $\theta$.
- Finally, define a way to convert any $f_\theta(z)$ into a simple distribution over the observable random variable $X$. For example, let $f_\theta(z) = (f_1(z), f_2(z))$ have two outputs, then we can define the corresponding distribution over $X$ to be the normal distribution $\mathcal{N}(f_1(z), e^{f_2(z)})$.

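This defines a family of joint distributions $p_\theta$ over $(X, Z)$. It is very easy to sample $(x, z) \sim p_\theta$: simply sample $z \sim p$, then compute $f_\theta(z)$, and finally sample $x \sim p_\theta(\cdot \mid z)$ using $f_\theta(z)$.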
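The three sampling steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function `f_theta` below is a hypothetical one-hidden-unit stand-in for a deep network, the prior is taken to be standard normal, and the parameter values are arbitrary.

```python
import numpy as np

def f_theta(z, theta):
    """Hypothetical decoder: maps latent z to (f_1, f_2) = (mean, log-variance) of x."""
    w1, b1, w2, b2 = theta
    h = np.tanh(w1 * z + b1)      # a tiny one-hidden-unit "network"
    return h, w2 * h + b2

def sample_joint(n, theta, rng):
    """Draw n pairs (x, z) from p_theta by sampling z first, then x given z."""
    z = rng.standard_normal(n)                  # step 1: z ~ p(z) = N(0, 1)
    mean, log_var = f_theta(z, theta)           # step 2: compute f_theta(z)
    std = np.exp(0.5 * log_var)                 # variance is e^{f_2(z)}
    x = mean + std * rng.standard_normal(n)     # step 3: x ~ N(f_1(z), e^{f_2(z)})
    return x, z

rng = np.random.default_rng(0)
theta = (2.0, 0.5, 0.3, -1.0)    # arbitrary illustrative parameters
x, z = sample_joint(5, theta, rng)
print(np.column_stack([z, x]))
```

Sampling is cheap because it only ever runs the model forwards, from $z$ to $x$; nothing here requires the marginal $p_\theta(x)$.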

In other words, we have a **generative model** for both the observable and the latent.
Now, we consider a distribution $p_\theta$ good if it is a close approximation of $p^*$:

$$p_\theta(X) \approx p^*(X).$$

Since the distribution on the right side is over $X$ only, the distribution on the left side must marginalize the latent variable $Z$ away.

In general, it is impossible to perform the integral $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, forcing us to perform another approximation.

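Since $p_\theta(x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)}$ (Bayes' rule), it suffices to find a good approximation of $p_\theta(z \mid x)$. So define another distribution family $q_\phi(z \mid x)$ and use it to approximate $p_\theta(z \mid x)$. This is a **discriminative model** for the latent.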

The entire situation is summarized in the following table:

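| $X$: observable | $X, Z$ | $Z$: latent |
|---|---|---|
| $p^*(x) \approx p_\theta(x) \approx \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}$, approximable | | $p(z)$, easy |
| | $p_\theta(x \mid z)\, p(z)$, easy | |
| $p_\theta(z \mid x) \approx q_\phi(z \mid x)$, approximable | | $p_\theta(x \mid z)$, easy |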

In **Bayesian** language, $X$ is the observed evidence, and $Z$ is the latent/unobserved. The distribution $p$ over $Z$ is the *prior distribution* over $Z$, $p_\theta(x \mid z)$ is the likelihood function, and $p_\theta(z \mid x)$ is the *posterior distribution* over $Z$.

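Given an observation $x$, we can *infer* what $z$ likely gave rise to $x$ by computing $p_\theta(z \mid x)$. The usual Bayesian method is to estimate the integral $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, then compute $p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$ by Bayes' rule. This is expensive to perform in general, but if we can simply find a good approximation $q_\phi(z \mid x) \approx p_\theta(z \mid x)$ for most $x, z$, then we can infer $z$ from $x$ cheaply. Thus, the search for a good $q_\phi$ is also called **amortized inference**.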

All in all, we have found a problem of **variational Bayesian inference**.

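A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:

$$\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x)),$$

where $H(p^*) = -\mathbb{E}_{x \sim p^*}[\ln p^*(x)]$ is the entropy of the true distribution. So if we can maximize $\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$, we can minimize $D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x))$, and consequently find an accurate approximation $p_\theta \approx p^*$.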

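To maximize $\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$, we simply sample many $x_i \sim p^*(x)$, $i = 1, \ldots, N$, then use the sample average

$$\max_\theta \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] \approx \max_\theta \frac{1}{N} \sum_{i=1}^{N} \ln p_\theta(x_i).$$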

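In order to maximize $\sum_i \ln p_\theta(x_i)$, it is necessary to find $\ln p_\theta(x)$:

$$\ln p_\theta(x) = \ln \int p_\theta(x \mid z)\, p(z)\, dz.$$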

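This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:

$$\int p_\theta(x \mid z)\, p(z)\, dz = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right],$$

where $q_\phi(z \mid x)$ is a sampling distribution over $z$ that we use to perform the Monte Carlo integration.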
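As a numerical illustration of this identity, the sketch below estimates $p_\theta(x)$ by importance sampling in a toy model chosen so that the exact answer is known. The model is an assumption made for the example, not something taken from the article: $p(z) = \mathcal{N}(0, 1)$ and $p_\theta(x \mid z) = \mathcal{N}(z, 1)$, so that $p_\theta(x) = \mathcal{N}(0, 2)$. The sampling distribution is centred on the true posterior mean $x/2$ but deliberately made too wide.

```python
import numpy as np

def log_normal_pdf(v, mean, var):
    """Log-density of N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def importance_estimate(x, n, rng, q_mean, q_var):
    """Estimate p(x) as the average of p(x, z_i) / q(z_i | x) with z_i ~ q(. | x)."""
    z = q_mean + np.sqrt(q_var) * rng.standard_normal(n)
    log_w = (log_normal_pdf(z, 0.0, 1.0)            # ln p(z)
             + log_normal_pdf(x, z, 1.0)            # ln p(x | z)
             - log_normal_pdf(z, q_mean, q_var))    # - ln q(z | x)
    return np.exp(log_w).mean()

rng = np.random.default_rng(0)
x = 1.5
estimate = importance_estimate(x, 200_000, rng, q_mean=x / 2, q_var=1.0)
exact = np.exp(log_normal_pdf(x, 0.0, 2.0))   # closed form, available only in this toy model
print(estimate, exact)
```

The estimate is unbiased for any $q_\phi$ whose support covers that of the posterior, but its variance depends strongly on how well $q_\phi(\cdot \mid x)$ matches $p_\theta(\cdot \mid x)$.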

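So we see that if we sample $z \sim q_\phi(\cdot \mid x)$, then $\frac{p_\theta(x, z)}{q_\phi(z \mid x)}$ is an unbiased estimator of $p_\theta(x)$. Unfortunately, this does not give us an unbiased estimator of $\ln p_\theta(x)$, because $\ln$ is nonlinear. Indeed, by Jensen's inequality,

$$\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \geq \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$

In fact, all the obvious estimators of $\ln p_\theta(x)$ are biased downwards, because no matter how many samples $z_i \sim q_\phi(\cdot \mid x)$ we take, Jensen's inequality gives

$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N} \sum_{i=1}^{N} \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right)\right] \leq \ln \mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\frac{1}{N} \sum_{i=1}^{N} \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right] = \ln p_\theta(x).$$

Subtracting the right side, we see that the problem comes down to a biased estimator of zero:

$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N} \sum_{i=1}^{N} \frac{p_\theta(z_i \mid x)}{q_\phi(z_i \mid x)}\right)\right] \leq 0.$$

By the delta method, we have

$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N} \sum_{i=1}^{N} \frac{p_\theta(z_i \mid x)}{q_\phi(z_i \mid x)}\right)\right] \approx -\frac{1}{2N} \mathbb{V}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(z \mid x)}{q_\phi(z \mid x)}\right] = O(N^{-1}).$$

If we continue with this, we would obtain the importance-weighted autoencoder. Here we instead return to the simplest case, $N = 1$.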
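The downward bias, and the way it shrinks as the number of samples $N$ grows, can be checked numerically. The sketch below assumes a toy model that is not from the article: $p(z) = \mathcal{N}(0, 1)$ and $p_\theta(x \mid z) = \mathcal{N}(z, 1)$, so that $\ln p_\theta(x)$ is available in closed form. For simplicity the sampling distribution is taken to be the prior, which makes each importance weight equal to $p_\theta(x \mid z_i)$.

```python
import numpy as np

def log_normal_pdf(v, mean, var):
    """Log-density of N(mean, var) evaluated at v."""
    return -0.5 * (np.log(2 * np.pi * var) + (v - mean) ** 2 / var)

def mean_log_estimate(x, n_samples, n_repeats, rng):
    """Monte Carlo estimate of E[ln (1/N) sum_i p(x, z_i) / q(z_i | x)] with q = prior."""
    z = rng.standard_normal((n_repeats, n_samples))   # z_i ~ q(. | x) = N(0, 1)
    w = np.exp(log_normal_pdf(x, z, 1.0))             # p(x, z) / q(z | x) = p(x | z)
    return np.log(w.mean(axis=1)).mean()              # average the log of each N-sample mean

rng = np.random.default_rng(0)
x = 1.5
log_evidence = log_normal_pdf(x, 0.0, 2.0)   # exact ln p(x) for this toy model
gaps = {n: log_evidence - mean_log_estimate(x, n, 20_000, rng) for n in (1, 10, 100)}
for n, gap in gaps.items():
    print(f"N = {n:3d}   ln p(x) - E[ln estimate] = {gap:.4f}")
```

Each gap is positive, as Jensen's inequality requires, and it decreases roughly like $1/N$ for larger $N$, in line with the delta-method approximation. At $N = 1$ the gap is exactly the quantity that the ELBO leaves on the table.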

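The tightness of the inequality has a closed form:

$$\ln p_\theta(x) - \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)) \geq 0.$$

We have thus obtained the ELBO function:

$$L(\phi, \theta; x) := \ln p_\theta(x) - D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)).$$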

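For fixed $x$, the optimization $\max_{\theta, \phi} L(\phi, \theta; x)$ simultaneously attempts to maximize $\ln p_\theta(x)$ and minimize $D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x))$. If the parametrizations for $p_\theta$ and $q_\phi$ are flexible enough, we would obtain some $\hat\phi, \hat\theta$ such that we have simultaneously

$$\ln p_{\hat\theta}(x) \approx \max_\theta \ln p_\theta(x); \quad q_{\hat\phi}(\cdot \mid x) \approx p_{\hat\theta}(\cdot \mid x).$$

Since

$$\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x)),$$

we have

$$\ln p_{\hat\theta}(x) \approx \max_\theta \left[-H(p^*) - D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x))\right],$$

and so

$$\hat\theta \approx \arg\min_\theta D_{\mathrm{KL}}(p^*(x) \,\|\, p_\theta(x)).$$

In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model $p_{\hat\theta} \approx p^*$ and an accurate discriminative model $q_{\hat\phi}(\cdot \mid x) \approx p_{\hat\theta}(\cdot \mid x)$.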
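The $\phi$ half of this optimization can be illustrated in a model small enough to solve exactly. The sketch below makes several assumptions for the example: $\theta$ is held fixed, the latent takes three values, the probability tables are made up, and $q_\phi(\cdot \mid x)$ is a softmax over three logits trained by plain gradient ascent with a hand-derived gradient. It is not the stochastic-gradient procedure used in practice with neural networks.

```python
import numpy as np

# Hypothetical model: a 3-state latent and one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])           # p(z)
likelihood = np.array([0.10, 0.40, 0.70])   # p_theta(x | z) at the observed x
joint = prior * likelihood                  # p_theta(x, z)
log_evidence = np.log(joint.sum())          # ln p_theta(x)
posterior = joint / joint.sum()             # p_theta(z | x)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def elbo(q):
    """ELBO for a discrete latent: sum_z q(z) ln [p(x, z) / q(z)]."""
    return np.sum(q * (np.log(joint) - np.log(q)))

phi = np.zeros(3)                           # logits of q_phi(. | x); start uniform
initial_elbo = elbo(softmax(phi))
for _ in range(2000):
    q = softmax(phi)
    log_ratio = np.log(joint) - np.log(q)
    grad = q * (log_ratio - np.sum(q * log_ratio))   # exact gradient of the ELBO in phi
    phi += 1.0 * grad                                # gradient ascent step

q = softmax(phi)
print("q_phi(z|x):    ", q)
print("p_theta(z|x):  ", posterior)
print("ELBO, ln p(x): ", elbo(q), log_evidence)
```

Because this variational family contains the exact posterior, the ELBO climbs all the way to $\ln p_\theta(x)$ and $q_\phi(\cdot \mid x)$ converges to $p_\theta(\cdot \mid x)$. With a restricted family the final gap would be the remaining KL-divergence.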

The ELBO has many possible expressions, each with a different emphasis.

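$$\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \int q_\phi(z \mid x) \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\, dz$$

This form shows that if we sample $z \sim q_\phi(\cdot \mid x)$, then $\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}$ is an unbiased estimator of the ELBO.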

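$$\ln p_\theta(x) - D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x))$$

This form shows that the ELBO is a lower bound on the evidence $\ln p_\theta(x)$, and that maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL-divergence from $p_\theta(\cdot \mid x)$ to $q_\phi(\cdot \mid x)$.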

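$$\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[\ln p_\theta(x \mid z)] - D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p)$$

This form shows that maximizing the ELBO simultaneously attempts to keep $q_\phi(\cdot \mid x)$ close to $p$ and to concentrate $q_\phi(\cdot \mid x)$ on those $z$ that maximize $\ln p_\theta(x \mid z)$. That is, the approximate posterior $q_\phi(\cdot \mid x)$ balances between staying close to the prior $p$ and moving towards the maximum likelihood $\arg\max_z \ln p_\theta(x \mid z)$.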

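$$H(q_\phi(\cdot \mid x)) + \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[\ln p_\theta(z \mid x)] + \ln p_\theta(x)$$

This form shows that maximizing the ELBO simultaneously attempts to keep the entropy of $q_\phi(\cdot \mid x)$ high and to concentrate $q_\phi(\cdot \mid x)$ on those $z$ that maximize $\ln p_\theta(z \mid x)$. That is, the approximate posterior $q_\phi(\cdot \mid x)$ balances between being a uniform distribution and moving towards the maximum a posteriori $\arg\max_z \ln p_\theta(z \mid x)$.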
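That these expressions agree is a matter of rearranging logarithms, and it can be confirmed directly when the latent variable is discrete and every term can be computed exactly. The probability tables below are made-up values for a single fixed observation $x$; they are assumptions of this sketch, not part of the article.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])           # p(z)
likelihood = np.array([0.10, 0.40, 0.70])   # p_theta(x | z) at the observed x
q = np.array([0.2, 0.5, 0.3])               # an arbitrary q_phi(z | x)

joint = prior * likelihood                  # p_theta(x, z)
evidence = joint.sum()                      # p_theta(x)
posterior = joint / evidence                # p_theta(z | x)

def kl(a, b):
    """KL-divergence D_KL(a || b) between two discrete distributions."""
    return np.sum(a * np.log(a / b))

def entropy(a):
    return -np.sum(a * np.log(a))

form_1 = np.sum(q * np.log(joint / q))                       # E_q[ln p(x, z) / q(z | x)]
form_2 = np.log(evidence) - kl(q, posterior)                 # evidence minus KL to posterior
form_3 = np.sum(q * np.log(likelihood)) - kl(q, prior)       # reconstruction minus KL to prior
form_4 = entropy(q) + np.sum(q * np.log(posterior)) + np.log(evidence)

print(form_1, form_2, form_3, form_4, np.log(evidence))
```

All four values coincide, and they sit below $\ln p_\theta(x)$ by exactly $D_{\mathrm{KL}}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x))$.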

Suppose we take $N$ independent samples from $p^*$ and collect them in the dataset $D = \{x_1, \ldots, x_N\}$; then we have the empirical distribution $q_D(x) = \frac{1}{N} \sum_i \delta_{x_i}$.

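Fitting $p_\theta(x)$ to $q_D(x)$ can be done, as usual, by maximizing the log-likelihood $\ln p_\theta(D)$:

$$D_{\mathrm{KL}}(q_D(x) \,\|\, p_\theta(x)) = -\frac{1}{N} \sum_i \ln p_\theta(x_i) - H(q_D) = -\frac{1}{N} \ln p_\theta(D) - H(q_D).$$

Now, by the ELBO inequality, we can bound $\ln p_\theta(D)$, and thus

$$D_{\mathrm{KL}}(q_D(x) \,\|\, p_\theta(x)) \leq -\frac{1}{N} L(\phi, \theta; D) - H(q_D).$$

The right-hand side simplifies to a KL-divergence, and so we get:

$$D_{\mathrm{KL}}(q_D(x) \,\|\, p_\theta(x)) \leq -\frac{1}{N} \sum_i L(\phi, \theta; x_i) - H(q_D) = D_{\mathrm{KL}}(q_{D,\phi}(x, z) \,\|\, p_\theta(x, z)),$$

where $q_{D,\phi}(x, z) = q_D(x)\, q_\phi(z \mid x)$.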

This result can be interpreted as a special case of the data processing inequality.

In this interpretation, maximizing $L(\phi, \theta; D) = \sum_i L(\phi, \theta; x_i)$ is minimizing $D_{\mathrm{KL}}(q_{D,\phi}(x, z) \,\|\, p_\theta(x, z))$, which upper-bounds the real quantity of interest $D_{\mathrm{KL}}(q_D(x) \,\|\, p_\theta(x))$ via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence.^{[3]}
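The bound can be verified exactly when both $x$ and $z$ are discrete, since all sums are then finite. In the sketch below the dataset, the model tables and the table for $q_\phi(z \mid x)$ are made-up values chosen only for illustration; the joint $q_{D,\phi}(x, z)$ is formed as $q_D(x)\, q_\phi(z \mid x)$.

```python
import numpy as np

data = np.array([0, 1, 1, 0, 1, 1, 1, 0])               # dataset D, with x in {0, 1}
q_data = np.bincount(data, minlength=2) / len(data)      # empirical distribution q_D(x)

prior = np.array([0.5, 0.3, 0.2])                        # p(z), z in {0, 1, 2}
lik = np.array([[0.9, 0.6, 0.3],                         # p_theta(x=0 | z)
                [0.1, 0.4, 0.7]])                        # p_theta(x=1 | z)
p_joint = lik * prior                                    # p_theta(x, z), shape (2, 3)
p_x = p_joint.sum(axis=1)                                # marginal p_theta(x)

q_z_given_x = np.array([[0.6, 0.3, 0.1],                 # q_phi(z | x=0)
                        [0.2, 0.3, 0.5]])                # q_phi(z | x=1)
q_joint = q_data[:, None] * q_z_given_x                  # q_{D,phi}(x, z)

def kl(a, b):
    """KL-divergence D_KL(a || b) between discrete distributions of equal shape."""
    return np.sum(a * np.log(a / b))

elbo_per_x = np.sum(q_z_given_x * np.log(p_joint / q_z_given_x), axis=1)
mean_elbo = elbo_per_x[data].mean()                      # (1/N) sum_i L(phi, theta; x_i)
entropy_data = -np.sum(q_data * np.log(q_data))          # H(q_D)

kl_marginal = kl(q_data, p_x)          # the quantity of interest
kl_joint = kl(q_joint, p_joint)        # the quantity the ELBO actually minimizes
upper_bound = -mean_elbo - entropy_data
print(kl_marginal, kl_joint, upper_bound)
```

Here `kl_joint` and `upper_bound` are the same number computed two ways, and both are at least `kl_marginal`: discarding $z$ can only bring the two distributions closer together.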