Mnemonic for closed-form Bayesian univariate inference with Gaussians
The following note helps me remember the closed-form solution for Bayesian inference using Gaussian distributions, which comes in handy very often. See Bishop p.98 (2.141 and 2.142) for the closed-form parameter updates for univariate Bayesian inference with a Gaussian likelihood of known variance $\sigma^2$ and a conjugate Gaussian prior on the mean. Let’s start with the rule for the posterior mean:
$$\mu_{new} = \frac{\sigma^2}{N \sigma_{0}^2 + \sigma^2} \mu_0 + \frac{N \sigma_{0}^2}{N \sigma_{0}^2 + \sigma^2} \mu_{MLE}$$
Where we know that the maximum likelihood estimate for the mean is the sample mean:
$$\mu_{MLE} = \frac{1}{N} \sum_{n=1}^N x_{n}$$
See the bottom of this post for a derivation of the maximum likelihood estimate for the mean. Notice that the mean update has the following shape, balancing between the prior mean and the data sample mean:
$$\mu_{new} = \lambda \mu_{0} + (1-\lambda) \mu_{MLE}$$
With
$$ \lambda = \frac{\sigma^2}{N \sigma_{0}^2 + \sigma^2}$$
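As a quick sanity check, here is a minimal sketch with toy numbers of my own (not from Bishop) showing that the mixture form and the original two-term formula give the same posterior mean:

```python
import numpy as np

# Toy numbers (my own): known likelihood variance, prior mean/variance, small sample.
sigma2 = 2.0                 # known variance of the likelihood, sigma^2
mu0, sigma0_2 = 0.0, 1.5     # prior mean and variance
x = np.array([1.2, 0.7, 1.9, 1.1])
N, mu_mle = len(x), x.mean()

# Mixing coefficient lambda = sigma^2 / (N * sigma_0^2 + sigma^2)
lam = sigma2 / (N * sigma0_2 + sigma2)

# Both forms of the posterior mean agree.
mu_new = lam * mu0 + (1 - lam) * mu_mle
mu_new_direct = (sigma2 * mu0 + N * sigma0_2 * mu_mle) / (N * sigma0_2 + sigma2)
print(np.isclose(mu_new, mu_new_direct))   # True
```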
I find this form a bit hard to remember by heart. We can also think of the weighting factor $\lambda$ as the prior precision divided by the posterior precision. This is not immediately obvious, but the formula for $\mu_{new}$ already hints at it: as the posterior precision grows relative to the prior precision, $\lambda$ shrinks and $(1-\lambda)$ grows, so the posterior mean moves towards the maximum likelihood solution $\mu_{MLE}$ and the posterior density becomes more concentrated around it.
Let’s first write down the posterior variance from Bishop and see how we can use it to support above intuition:
$$\sigma_{new}^2 = ( \frac{1}{\sigma_{0}^2} + \frac{N}{\sigma^2} )^{-1}$$
This formula is much easier to remember in terms of precision $\tau = \frac{1}{\sigma^2}$:
$$\tau_{new} = \tau_{0} + N \tau$$
Which intuitively reads: the posterior precision is the prior precision plus N times the precision of the likelihood function. The multiplication by N captures the equally intuitive result that the more observations you make, the more “certain” the posterior distribution becomes.
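In code the precision form is a one-liner; this sketch just continues the toy numbers from the previous snippet:

```python
# Continuing the toy numbers from the previous snippet.
sigma2, sigma0_2, N = 2.0, 1.5, 4

tau0, tau = 1.0 / sigma0_2, 1.0 / sigma2   # prior and likelihood precision
tau_new = tau0 + N * tau                   # posterior precision: tau_0 + N * tau
sigma_new_2 = 1.0 / tau_new                # equals (1/sigma_0^2 + N/sigma^2)^(-1)
print(sigma_new_2)
```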
It’s not immediately obvious that $\lambda = \frac{\tau_{0}}{\tau_{new}}$, so let’s write it out:
$$\frac{\tau_{0}}{\tau_{new}} = \frac{\tau_{0}}{ \tau_{0} + N \tau }$$ $$= \frac{ \frac{1}{\sigma_{0}^2} }{ \frac{1}{\sigma_{0}^2} + \frac{N}{\sigma^2}}$$
Multiplying numerator and denominator by $\sigma_{0}^2$ gives:
$$= \frac{1}{ 1 + \frac{N\sigma_{0}^2}{\sigma^2} }$$
Multiplying numerator and denominator by $\sigma^2$ gives:
$$= \frac{\sigma^2}{ \sigma^2 + N\sigma_{0}^2 } = \lambda$$
QED.
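For a concrete check with made-up numbers (mine, not from Bishop): take $\sigma_{0}^2 = 1$, $\sigma^2 = 4$ and $N = 8$, so $\tau_{0} = 1$ and $\tau = \frac{1}{4}$. Then
$$\frac{\tau_{0}}{\tau_{new}} = \frac{1}{1 + 8 \cdot \frac{1}{4}} = \frac{1}{3} \qquad \text{and} \qquad \frac{\sigma^2}{\sigma^2 + N\sigma_{0}^2} = \frac{4}{4 + 8} = \frac{1}{3}$$
so both routes give the same $\lambda$.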
This result suggests that it’s efficient to first compute the new precision (or variance), and then use it to compute $\lambda$ in the formula for the posterior mean.
So if you want to easily remember the parameter update rules, in natural language (a short code sketch follows this list):
- The posterior precision is the prior precision plus N times the likelihood precision
- The posterior mean is a mix between the prior mean and the mean of the observed data sample
- The mixing coefficient of the prior mean is the prior precision divided by the posterior precision, which we called $\lambda$
- $(1-\lambda)$ is the mixing coefficient of the sample mean.
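Putting the bullet points together, here is a minimal sketch of the full update (the function name `gaussian_posterior` and the toy data are my own); it computes the posterior precision first and then reuses it for $\lambda$, as suggested above:

```python
import numpy as np

def gaussian_posterior(x, mu0, sigma0_2, sigma2):
    """Posterior mean and variance for a Gaussian likelihood with known
    variance sigma2 and a conjugate Gaussian prior N(mu0, sigma0_2)."""
    N = len(x)
    tau0, tau = 1.0 / sigma0_2, 1.0 / sigma2
    tau_new = tau0 + N * tau                  # posterior precision
    lam = tau0 / tau_new                      # weight on the prior mean
    mu_new = lam * mu0 + (1 - lam) * np.mean(x)
    return mu_new, 1.0 / tau_new

# Usage with the same toy numbers as before.
x = np.array([1.2, 0.7, 1.9, 1.1])
print(gaussian_posterior(x, mu0=0.0, sigma0_2=1.5, sigma2=2.0))
```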
MLE for the mean
First find the expression for the Gaussian log likelihood:
$$\ln p(D|\mu, \sigma^2) = \ln \prod_{n=1}^N \mathcal{N}(x_n | \mu, \sigma^2)$$ $$= \sum_{n=1}^N \ln \mathcal{N}(x_n | \mu, \sigma^2)$$ $$= \sum_{n=1}^N \ln\left( \frac{1}{\sqrt{ 2\pi \sigma^2}} \exp\left( - \frac{(x_n - \mu)^2}{2 \sigma^2}\right)\right)$$ $$= N \ln\left( \frac{1}{\sqrt{2\pi \sigma^2}}\right) - \sum_{n=1}^N \frac{(x_n - \mu)^2}{2 \sigma^2}$$ $$= -\frac{1}{2 \sigma^2} \sum_{n=1}^N (x_n - \mu)^2 - \frac{N}{2} \ln(\sigma^2) - \frac{N}{2}\ln(2\pi)$$
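The algebra above can be spot-checked numerically; the sketch below, with toy data of my own, compares the closed-form expression on the last line against summing `scipy.stats.norm.logpdf`:

```python
import numpy as np
from scipy.stats import norm

x = np.array([1.2, 0.7, 1.9, 1.1])   # assumed toy data
mu, sigma2 = 1.0, 2.0
N = len(x)

# Sum of per-point Gaussian log densities.
ll_direct = norm.logpdf(x, loc=mu, scale=np.sqrt(sigma2)).sum()

# Closed-form expression from the last line of the derivation.
ll_closed = (-np.sum((x - mu) ** 2) / (2 * sigma2)
             - N / 2 * np.log(sigma2) - N / 2 * np.log(2 * np.pi))

print(np.isclose(ll_direct, ll_closed))   # True
```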
The next step is to find $\mu_{MLE}$ by differentiating with respect to $\mu$. The derivative w.r.t. $\mu$ is:
$$\frac{1}{\sigma^2} \sum_{n=1}^N (x_n - \mu)$$
Setting to zero gives the MLE:
$$\mu_{MLE} = \frac{1}{N}\sum_{n=1}^N x_n$$
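As a final sanity check (again with assumed toy data), numerically maximizing the log likelihood over $\mu$ lands on the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1.2, 0.7, 1.9, 1.1])   # assumed toy data
sigma2 = 2.0

# Negative log likelihood as a function of mu (mu-independent constants dropped).
nll = lambda mu: np.sum((x - mu) ** 2) / (2 * sigma2)

res = minimize_scalar(nll)
print(res.x, x.mean())   # both approximately 1.225
```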