Posterior inference, MAP, LMS estimation, and the linear least-squares estimator
In the Bayesian framework, the unknown parameter \(\Theta\) is treated as a random variable with a prior distribution \(f_\Theta(\theta)\). After observing data \(X = x\), we update to the posterior distribution via Bayes' rule:
\[f_{\Theta \mid X}(\theta \mid x) = \frac{f_\Theta(\theta)\,f_{X \mid \Theta}(x \mid \theta)}{f_X(x)} \propto f_\Theta(\theta)\,f_{X \mid \Theta}(x \mid \theta)\]
The posterior is proportional to prior times likelihood. The denominator \(f_X(x) = \int f_\Theta(\theta) f_{X|\Theta}(x|\theta)\,d\theta\) is a normalising constant that does not depend on \(\theta\).
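A minimal grid-approximation sketch of this update, assuming a Bernoulli likelihood with a uniform prior and hypothetical data (7 successes in 10 trials); the numbers are illustrative, not from the text above:

```python
import numpy as np

# Grid approximation of Bayes' rule for a Bernoulli likelihood with a
# uniform prior (hypothetical data: k = 7 successes in n = 10 trials).
theta = np.linspace(0.0005, 0.9995, 1000)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                   # uniform f_Theta(theta)
likelihood = theta**7 * (1 - theta)**3        # f_{X|Theta}(x | theta)
unnorm = prior * likelihood                   # prior times likelihood
evidence = (unnorm * dtheta).sum()            # f_X(x), independent of theta
posterior = unnorm / evidence
print((posterior * dtheta).sum())             # integrates to 1
```

Dividing by the evidence only rescales the curve, which is why it plays no role in comparing values of \(\theta\).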
\[\hat{\theta}_{\text{MAP}} = \arg\max_\theta f_{\Theta \mid X}(\theta \mid x) = \arg\max_\theta \bigl[f_\Theta(\theta)\,f_{X \mid \Theta}(x \mid \theta)\bigr]\]
MAP selects the mode of the posterior, i.e. the value of \(\Theta\) with the highest posterior density given the data. Taking logs (which does not change the argmax):
\[\hat{\theta}_{\text{MAP}} = \arg\max_\theta \bigl[\log f_\Theta(\theta) + \log f_{X \mid \Theta}(x \mid \theta)\bigr]\]
Compared with MLE, MAP adds the log-prior as a regularisation term; with a uniform prior, MAP reduces to MLE.
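The log-domain maximisation can be sketched numerically and checked against the closed-form Beta posterior mode; the Beta(2, 2) prior and 7-of-10 data are illustrative assumptions:

```python
import numpy as np

# MAP via log-prior + log-likelihood on a grid, for a Beta(2, 2) prior
# and 7 successes in 10 Bernoulli trials (illustrative numbers).
a, b, k, n = 2, 2, 7, 10
theta = np.linspace(0.001, 0.999, 100_000)
log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
log_lik = k * np.log(theta) + (n - k) * np.log(1 - theta)
theta_map = theta[np.argmax(log_prior + log_lik)]
closed_form = (a + k - 1) / (a + b + n - 2)   # Beta posterior mode
print(theta_map, closed_form)                 # both near 2/3
```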
\[\hat{\theta}_{\text{LMS}} = \mathbb{E}[\Theta \mid X = x]\]
The LMS estimator minimises the mean squared error \(\mathbb{E}[(\Theta - \hat{\theta})^2 \mid X = x]\). It is the posterior mean — the expected value of \(\Theta\) under the posterior distribution.
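A sketch of the posterior mean by numerical integration, reusing the same hypothetical Beta(2, 2) prior and 7-of-10 data as in the MAP example; the closed-form answer \((a+k)/(a+b+n)\) serves as a check:

```python
import numpy as np

# LMS estimate (posterior mean) by numerical integration, for a
# Beta(2, 2) prior and 7 successes in 10 trials (illustrative numbers).
a, b, k, n = 2, 2, 7, 10
theta = np.linspace(0.0, 1.0, 200_001)
dtheta = theta[1] - theta[0]
unnorm = theta**(a + k - 1) * (1 - theta)**(b + n - k - 1)
posterior = unnorm / (unnorm * dtheta).sum()
lms = (theta * posterior * dtheta).sum()      # E[Theta | X = x]
print(lms, (a + k) / (a + b + n))             # both near 9/14
```

Note that the LMS estimate (9/14) differs from the MAP estimate (2/3) here because the posterior is not symmetric.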
| Estimator | Definition | Optimality |
|---|---|---|
| MAP | mode of \(f_{\Theta|X}\) | Maximises posterior density |
| LMS | mean of \(f_{\Theta|X}\) | Minimises MSE |
| Median of posterior | median of \(f_{\Theta|X}\) | Minimises mean absolute error |
For symmetric unimodal posteriors (e.g. Gaussian), all three coincide.
The LMS estimator requires knowledge of the full joint distribution. The Linear Least Mean Squares (LLMS) estimator restricts to linear estimators \(\hat{\Theta} = aX + b\):
\[\hat{\Theta}_{\text{LLMS}} = \mathbb{E}[\Theta] + \frac{\text{Cov}(\Theta, X)}{\text{Var}(X)}\bigl(X - \mathbb{E}[X]\bigr)\]
Equivalently: \(\hat{\Theta}_{\text{LLMS}} = \mathbb{E}[\Theta] + \rho\,\frac{\sigma_\Theta}{\sigma_X}\bigl(X - \mathbb{E}[X]\bigr)\), where \(\rho = \text{Corr}(\Theta, X)\).
The coefficients are:
\[a = \frac{\text{Cov}(\Theta, X)}{\text{Var}(X)}, \qquad b = \mathbb{E}[\Theta] - a\,\mathbb{E}[X]\]
When \(\rho = \pm 1\), the relationship is perfectly linear and the error is zero; when \(\rho = 0\), \(X\) carries no linear information and the estimate falls back to the prior mean \(\mathbb{E}[\Theta]\).
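These coefficients can be estimated from simulated moments. A minimal sketch, assuming the toy model \(X = \Theta + W\) with \(\Theta\) and \(W\) independent standard normals (so \(a = 1/2\), \(b = 0\)); the model is an assumption for illustration:

```python
import numpy as np

# LLMS coefficients from sample moments of a toy model:
# X = Theta + W, with Theta ~ N(0,1) and W ~ N(0,1) independent.
rng = np.random.default_rng(0)
m = 1_000_000
theta = rng.normal(0.0, 1.0, m)
x = theta + rng.normal(0.0, 1.0, m)
a_hat = np.cov(theta, x)[0, 1] / np.var(x)    # Cov(Theta, X) / Var(X)
b_hat = theta.mean() - a_hat * x.mean()
print(a_hat, b_hat)                           # near 0.5 and 0.0
```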
Given observations \(\mathbf{X} = (X_1, \ldots, X_n)\), the LLMS estimator is:
\[\hat{\Theta}_{\text{LLMS}} = \mathbb{E}[\Theta] + \mathbf{c}^\top(\mathbf{X} - \mathbb{E}[\mathbf{X}])\]
where \(\mathbf{c} = \text{Cov}(\mathbf{X}, \mathbf{X})^{-1}\,\text{Cov}(\mathbf{X}, \Theta)\). These are the normal equations familiar from linear regression.
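A sketch of the vector coefficients, assuming a hypothetical setup of two noisy looks \(X_i = \Theta + W_i\) with \(\Theta, W_1, W_2\) independent standard normals; here the exact answer is \(\mathbf{c} = (1/3, 1/3)\):

```python
import numpy as np

# Vector LLMS: two noisy observations X_i = Theta + W_i, with Theta,
# W_1, W_2 all independent standard normals (illustrative setup).
rng = np.random.default_rng(1)
m = 1_000_000
theta = rng.normal(size=m)
X = theta + rng.normal(size=(2, m))           # rows are X_1, X_2
cov_XX = np.cov(X)                            # 2x2 covariance of X
cov_Xtheta = np.array([np.cov(X[i], theta)[0, 1] for i in range(2)])
c = np.linalg.solve(cov_XX, cov_Xtheta)       # Cov(X,X)^{-1} Cov(X,Theta)
print(c)                                      # near [1/3, 1/3]
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which is the standard numerically stable choice.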
A prior is conjugate to a likelihood if the posterior is in the same family as the prior. This makes Bayesian updating analytically tractable.
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Binomial(\(n,p\)) | Beta(\(a,b\)) | Beta(\(a+k, b+n-k\)) |
| Poisson(\(\lambda\)) | Gamma(\(a,b\)), rate \(b\) | Gamma(\(a+\sum x_i,\ b+n\)) |
| Normal(\(\mu, \sigma^2\)) known \(\sigma^2\) | \(\mathcal{N}(\mu_0, \sigma_0^2)\) | Normal (precision-weighted average) |
With a Beta(\(a,b\)) prior and observing \(k\) successes in \(n\) trials, the posterior mean is \((a+k)/(a+b+n)\) — a weighted combination of the prior mean and the observed fraction \(k/n\).
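The weighted-combination identity can be verified directly; the Beta(2, 2) prior and 7-of-10 data are illustrative numbers:

```python
# Beta-Binomial conjugate update: the posterior mean equals a weighted
# average of the prior mean and the sample fraction (illustrative numbers).
a, b = 2.0, 2.0                               # Beta(a, b) prior
k, n = 7, 10                                  # observed successes / trials
prior_mean = a / (a + b)
post_mean = (a + k) / (a + b + n)
w = (a + b) / (a + b + n)                     # weight on the prior mean
blended = w * prior_mean + (1 - w) * (k / n)
print(post_mean, blended)                     # identical: 9/14
```

As \(n\) grows, the prior weight \(w\) shrinks and the posterior mean approaches the observed fraction \(k/n\).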