Posterior inference, MAP, LMS estimation, and the linear least-squares estimator
In the Bayesian framework, the unknown parameter \(\Theta\) is treated as a random variable with a prior distribution \(f_\Theta(\theta)\). After observing data \(X = x\), we update to the posterior distribution via Bayes' rule:
\[f_{\Theta \mid X}(\theta \mid x) = \frac{f_\Theta(\theta)\,f_{X \mid \Theta}(x \mid \theta)}{f_X(x)} \propto f_\Theta(\theta)\,f_{X \mid \Theta}(x \mid \theta)\]
The posterior is proportional to prior times likelihood. The denominator \(f_X(x) = \int f_\Theta(\theta) f_{X|\Theta}(x|\theta)\,d\theta\) is a normalising constant that does not depend on \(\theta\).
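A minimal grid-approximation sketch of this update, assuming a Bernoulli likelihood with a uniform prior and hypothetical data (7 successes in 10 trials); the numbers are illustrative, not from the text above:

```python
import numpy as np

# Grid approximation of Bayes' rule for a Bernoulli likelihood with a
# uniform prior (hypothetical data: k = 7 successes in n = 10 trials).
theta = np.linspace(0.0005, 0.9995, 1000)
dtheta = theta[1] - theta[0]
prior = np.ones_like(theta)                   # uniform f_Theta(theta)
likelihood = theta**7 * (1 - theta)**3        # f_{X|Theta}(x | theta)
unnorm = prior * likelihood                   # prior times likelihood
evidence = (unnorm * dtheta).sum()            # f_X(x), independent of theta
posterior = unnorm / evidence
print((posterior * dtheta).sum())             # integrates to 1
```

Dividing by the evidence only rescales the curve, which is why it plays no role in comparing values of \(\theta\).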
\[\hat{\theta}_{\text{MAP}} = \arg\max_\theta f_{\Theta \mid X}(\theta \mid x) = \arg\max_\theta \bigl[f_\Theta(\theta)\,f_{X \mid \Theta}(x \mid \theta)\bigr]\]
MAP selects the mode of the posterior, i.e. the value of \(\Theta\) with the highest posterior density given the data. Taking logs (which does not change the argmax):
\[\hat{\theta}_{\text{MAP}} = \arg\max_\theta \bigl[\log f_\Theta(\theta) + \log f_{X \mid \Theta}(x \mid \theta)\bigr]\]
Compared with MLE, MAP adds the log-prior as a regularisation term; with a uniform prior, MAP reduces to MLE.
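The log-domain maximisation can be sketched numerically and checked against the closed-form Beta posterior mode; the Beta(2, 2) prior and 7-of-10 data are illustrative assumptions:

```python
import numpy as np

# MAP via log-prior + log-likelihood on a grid, for a Beta(2, 2) prior
# and 7 successes in 10 Bernoulli trials (illustrative numbers).
a, b, k, n = 2, 2, 7, 10
theta = np.linspace(0.001, 0.999, 100_000)
log_prior = (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
log_lik = k * np.log(theta) + (n - k) * np.log(1 - theta)
theta_map = theta[np.argmax(log_prior + log_lik)]
closed_form = (a + k - 1) / (a + b + n - 2)   # Beta posterior mode
print(theta_map, closed_form)                 # both near 2/3
```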
\[\hat{\theta}_{\text{LMS}} = \mathbb{E}[\Theta \mid X = x]\]
The LMS estimator minimises the mean squared error \(\mathbb{E}[(\Theta - \hat{\theta})^2 \mid X = x]\). It is the posterior mean — the expected value of \(\Theta\) under the posterior distribution.
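A sketch of the posterior mean by numerical integration, reusing the same hypothetical Beta(2, 2) prior and 7-of-10 data as in the MAP example; the closed-form answer \((a+k)/(a+b+n)\) serves as a check:

```python
import numpy as np

# LMS estimate (posterior mean) by numerical integration, for a
# Beta(2, 2) prior and 7 successes in 10 trials (illustrative numbers).
a, b, k, n = 2, 2, 7, 10
theta = np.linspace(0.0, 1.0, 200_001)
dtheta = theta[1] - theta[0]
unnorm = theta**(a + k - 1) * (1 - theta)**(b + n - k - 1)
posterior = unnorm / (unnorm * dtheta).sum()
lms = (theta * posterior * dtheta).sum()      # E[Theta | X = x]
print(lms, (a + k) / (a + b + n))             # both near 9/14
```

Note that the LMS estimate (9/14) differs from the MAP estimate (2/3) here because the posterior is not symmetric.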
| Estimator | Definition | Optimality |
|---|---|---|
| MAP | mode of \(f_{\Theta|X}\) | Maximises posterior density |
| LMS | mean of \(f_{\Theta|X}\) | Minimises MSE |
| Median of posterior | median of \(f_{\Theta|X}\) | Minimises mean absolute error |
For symmetric unimodal posteriors (e.g. Gaussian), all three coincide.
The LMS estimator requires knowledge of the full joint distribution. The Linear Least Mean Squares (LLMS) estimator restricts to linear estimators \(\hat{\Theta} = aX + b\):
\[\hat{\Theta}_{\text{LLMS}} = \mathbb{E}[\Theta] + \frac{\text{Cov}(\Theta, X)}{\text{Var}(X)}\bigl(X - \mathbb{E}[X]\bigr)\]
Equivalently: \(\hat{\Theta}_{\text{LLMS}} = \mathbb{E}[\Theta] + \rho\,\frac{\sigma_\Theta}{\sigma_X}\bigl(X - \mathbb{E}[X]\bigr)\), where \(\rho = \text{Corr}(\Theta, X)\).
The coefficients are:
\[a = \frac{\text{Cov}(\Theta, X)}{\text{Var}(X)}, \qquad b = \mathbb{E}[\Theta] - a\,\mathbb{E}[X]\]
When \(\rho = \pm 1\), the relationship is perfectly linear and the error is zero; when \(\rho = 0\), \(X\) carries no linear information and the estimate falls back to the prior mean \(\mathbb{E}[\Theta]\).
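These coefficients can be estimated from simulated moments. A minimal sketch, assuming the toy model \(X = \Theta + W\) with \(\Theta\) and \(W\) independent standard normals (so \(a = 1/2\), \(b = 0\)); the model is an assumption for illustration:

```python
import numpy as np

# LLMS coefficients from sample moments of a toy model:
# X = Theta + W, with Theta ~ N(0,1) and W ~ N(0,1) independent.
rng = np.random.default_rng(0)
m = 1_000_000
theta = rng.normal(0.0, 1.0, m)
x = theta + rng.normal(0.0, 1.0, m)
a_hat = np.cov(theta, x)[0, 1] / np.var(x)    # Cov(Theta, X) / Var(X)
b_hat = theta.mean() - a_hat * x.mean()
print(a_hat, b_hat)                           # near 0.5 and 0.0
```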
Given observations \(\mathbf{X} = (X_1, \ldots, X_n)\), the LLMS estimator is:
\[\hat{\Theta}_{\text{LLMS}} = \mathbb{E}[\Theta] + \mathbf{c}^\top(\mathbf{X} - \mathbb{E}[\mathbf{X}])\]
where \(\mathbf{c} = \text{Cov}(\mathbf{X}, \mathbf{X})^{-1}\,\text{Cov}(\mathbf{X}, \Theta)\). These are the normal equations familiar from linear regression.
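A sketch of the vector coefficients, assuming a hypothetical setup of two noisy looks \(X_i = \Theta + W_i\) with \(\Theta, W_1, W_2\) independent standard normals; here the exact answer is \(\mathbf{c} = (1/3, 1/3)\):

```python
import numpy as np

# Vector LLMS: two noisy observations X_i = Theta + W_i, with Theta,
# W_1, W_2 all independent standard normals (illustrative setup).
rng = np.random.default_rng(1)
m = 1_000_000
theta = rng.normal(size=m)
X = theta + rng.normal(size=(2, m))           # rows are X_1, X_2
cov_XX = np.cov(X)                            # 2x2 covariance of X
cov_Xtheta = np.array([np.cov(X[i], theta)[0, 1] for i in range(2)])
c = np.linalg.solve(cov_XX, cov_Xtheta)       # Cov(X,X)^{-1} Cov(X,Theta)
print(c)                                      # near [1/3, 1/3]
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse, which is the standard numerically stable choice.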
A prior is conjugate to a likelihood if the posterior is in the same family as the prior. This makes Bayesian updating analytically tractable.
| Likelihood | Conjugate Prior | Posterior |
|---|---|---|
| Binomial(\(n,p\)) | Beta(\(a,b\)) | Beta(\(a+k, b+n-k\)) |
| Poisson(\(\lambda\)) | Gamma(\(a,b\)), rate \(b\) | Gamma(\(a+\sum x_i,\ b+n\)) |
| Normal(\(\mu, \sigma^2\)) known \(\sigma^2\) | \(\mathcal{N}(\mu_0, \sigma_0^2)\) | Normal (precision-weighted average) |
With a Beta(\(a,b\)) prior and observing \(k\) successes in \(n\) trials, the posterior mean is \((a+k)/(a+b+n)\) — a weighted combination of the prior mean and the observed fraction \(k/n\).
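The weighted-combination identity can be verified directly; the Beta(2, 2) prior and 7-of-10 data are illustrative numbers:

```python
# Beta-Binomial conjugate update: the posterior mean equals a weighted
# average of the prior mean and the sample fraction (illustrative numbers).
a, b = 2.0, 2.0                               # Beta(a, b) prior
k, n = 7, 10                                  # observed successes / trials
prior_mean = a / (a + b)
post_mean = (a + k) / (a + b + n)
w = (a + b) / (a + b + n)                     # weight on the prior mean
blended = w * prior_mean + (1 - w) * (k / n)
print(post_mean, blended)                     # identical: 9/14
```

As \(n\) grows, the prior weight \(w\) shrinks and the posterior mean approaches the observed fraction \(k/n\).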