Maximum Likelihood Estimation

The likelihood function, MLE derivation, Fisher information, and asymptotic normality

1. Introduction

Maximum Likelihood Estimation is the workhorse of parametric statistics. The core idea is simple: given observed data, find the parameter value that makes the data most probable. Despite this simplicity, MLE has deep theoretical guarantees — it is consistent, asymptotically normal, and achieves the Cramér-Rao lower bound in large samples.

2. The Likelihood Function

Discrete Case

For an i.i.d. sample \(x_1, \ldots, x_n\), the likelihood of the model is:

\[L_n(x_1,\ldots,x_n;\theta) = \prod_{i=1}^n \mathbb{P}_\theta[X_i = x_i]\]

Continuous Case

When \(\mathbb{P}_\theta\) has density \(f_\theta\):

\[L(x_1,\ldots,x_n;\theta) = \prod_{i=1}^n f_\theta(x_i)\]

The likelihood is a function of \(\theta\) for fixed data — not a probability over \(\theta\). In practice we work with the log-likelihood \(\ell(\theta) = \log L(\theta) = \sum_i \log f_\theta(x_i)\), which converts products into sums and is numerically stable.
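A quick numerical illustration of why the log is essential (a sketch in Python, not from the notes): a product of thousands of densities underflows to zero in floating point, while the sum of log-densities stays finite.

```python
import math
import random

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]

def gauss_pdf(x, mu=0.0, sigma=1.0):
    # Gaussian density f_theta(x)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Naive product of densities underflows to exactly 0.0 ...
L = 1.0
for x in xs:
    L *= gauss_pdf(x)

# ... while the log-likelihood (a sum of logs) remains a finite number.
ell = sum(math.log(gauss_pdf(x)) for x in xs)

print(L)    # 0.0 due to floating-point underflow
print(ell)  # a large negative but finite number
```

This is why optimisers always work with \(\ell(\theta)\) rather than \(L(\theta)\).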

3. The MLE

Maximum Likelihood Estimator

\[\hat{\theta}_n^{\text{MLE}} = \arg\max_{\theta \in \Theta}\, L(x_1,\ldots,x_n;\theta) = \arg\max_{\theta \in \Theta}\, \ell(\theta)\]

Figure 1 — The MLE is the parameter value at which the log-likelihood is maximised.

Example: Gaussian MLE

For \(X_1,\ldots,X_n \sim \mathcal{N}(\mu, \sigma^2)\), taking the derivative of the log-likelihood and setting it to zero gives \(\hat{\mu}^{\text{MLE}} = \bar{X}_n\) and \(\widehat{\sigma^2}^{\text{MLE}} = \frac{1}{n}\sum_i(X_i - \bar{X}_n)^2\). The MLE for the mean is the sample mean (unbiased), but the MLE for the variance divides by \(n\) rather than \(n-1\), so it is biased.
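These closed forms are easy to verify numerically (a sketch in Python, with a true mean of 3 and true variance of 4 chosen for illustration):

```python
import random

random.seed(1)
n = 100_000
data = [random.gauss(3.0, 2.0) for _ in range(n)]  # N(3, 4) sample

# MLE for the mean: the sample mean.
mu_hat = sum(data) / n

# MLE for the variance: divides by n (biased), unlike the usual
# unbiased estimator, which divides by n - 1.
sigma2_hat = sum((x - mu_hat) ** 2 for x in data) / n

print(mu_hat)      # close to the true mean 3.0
print(sigma2_hat)  # close to the true variance 4.0
```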

4. Consistency of MLE

Under mild regularity conditions, the MLE is consistent:

\[\hat{\theta}_n^{\text{MLE}} \xrightarrow{\mathbb{P}} \theta^* \quad \text{as } n \to \infty\]

Intuition: maximising \(\frac{1}{n}\ell(\theta) = \frac{1}{n}\sum \log f_\theta(X_i)\) is equivalent (by the LLN) to maximising \(\mathbb{E}_{\theta^*}[\log f_\theta(X)]\), which is maximised at \(\theta = \theta^*\) by the properties of KL divergence.
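Consistency can be seen empirically (a sketch in Python, using the Bernoulli model, where the MLE happens to be the sample mean): the estimation error shrinks as \(n\) grows.

```python
import random

random.seed(2)
p_star = 0.3  # true parameter

# For Bernoulli(p), maximising p^(sum x) (1-p)^(n - sum x) gives the
# sample mean as MLE; by consistency its error should vanish as n grows.
errors = []
for n in (100, 10_000, 1_000_000):
    sample = [1 if random.random() < p_star else 0 for _ in range(n)]
    p_hat = sum(sample) / n
    errors.append(abs(p_hat - p_star))

print(errors)  # typically shrinking toward 0 as n increases
```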

5. Fisher Information

Fisher Information

Let \(\ell(\theta) = \log L_1(X, \theta)\) denote the log-likelihood of a single observation; its gradient \(\nabla\ell(\theta)\) is called the score. The Fisher information is the covariance matrix of the score:

\[I(\theta) = \mathbb{E}\!\left[\nabla\ell(\theta)\nabla\ell(\theta)^\top\right] - \mathbb{E}[\nabla\ell(\theta)]\,\mathbb{E}[\nabla\ell(\theta)]^\top = -\mathbb{E}\!\left[\nabla^2\ell(\theta)\right]\]

In one dimension: \(I(\theta) = \text{Var}[\ell'(\theta)] = -\mathbb{E}[\ell''(\theta)]\).

Fisher information measures how much information the data carries about \(\theta\). High Fisher information means the likelihood is sharply peaked — the data is highly informative. Low Fisher information means the likelihood is flat — many parameter values are nearly equally plausible.

The Cramér-Rao bound states that for any unbiased estimator \(\hat{\theta}\) built from \(n\) i.i.d. observations: \(\text{Var}(\hat{\theta}) \geq 1/(n\,I(\theta))\), where \(I(\theta)\) is the per-observation Fisher information. The MLE achieves this bound asymptotically.
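For the Bernoulli model the bound is attained exactly, not just asymptotically, which makes it a convenient check (a simulation sketch in Python; \(I(p) = 1/(p(1-p))\) follows from \(\ell(p) = X\log p + (1-X)\log(1-p)\)):

```python
import random

random.seed(3)
p = 0.3
n = 50
trials = 20_000

# Bernoulli(p): per-observation Fisher information is I(p) = 1/(p(1-p)),
# so the Cramér-Rao bound for n observations is p(1-p)/n.
fisher_info = 1.0 / (p * (1 - p))
cr_bound = 1.0 / (n * fisher_info)

# Empirical variance of the MLE (the sample mean) over many repetitions.
estimates = []
for _ in range(trials):
    successes = sum(1 for _ in range(n) if random.random() < p)
    estimates.append(successes / n)
mean_est = sum(estimates) / trials
emp_var = sum((e - mean_est) ** 2 for e in estimates) / trials

print(cr_bound)  # 0.0042
print(emp_var)   # close to the bound: the Bernoulli MLE attains it exactly
```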

6. Asymptotic Normality of MLE

Asymptotic Normality of MLE

Under regularity conditions, with \(\theta^*\) the true parameter:

\[\sqrt{n}\!\left(\hat{\theta}_n^{\text{MLE}} - \theta^*\right) \xrightarrow{(d)} \mathcal{N}\!\left(0,\, I(\theta^*)^{-1}\right)\]

Two remarkable facts: (1) the MLE converges at rate \(1/\sqrt{n}\), the fastest possible for regular estimators; (2) its asymptotic variance equals the Cramér-Rao lower bound — the MLE is asymptotically efficient.
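The theorem can be checked by simulation (a sketch in Python, using the exponential model as an example: for \(\text{Exp}(\lambda)\), the MLE is \(1/\bar{X}_n\) and \(I(\lambda) = 1/\lambda^2\), so \(\sqrt{n}(\hat{\lambda}_n - \lambda)/\lambda\) should be approximately standard normal):

```python
import math
import random

random.seed(4)
lam = 2.0      # true rate of an Exponential(lambda)
n = 500
trials = 2_000

# For Exp(lambda): MLE = 1 / sample mean, and I(lambda) = 1 / lambda^2,
# so sqrt(n) * (MLE - lambda) is approximately N(0, lambda^2).
zs = []
for _ in range(trials):
    xbar = sum(random.expovariate(lam) for _ in range(n)) / n
    mle = 1.0 / xbar
    zs.append(math.sqrt(n) * (mle - lam) / lam)  # standardised: ~ N(0, 1)

mean_z = sum(zs) / trials
var_z = sum((z - mean_z) ** 2 for z in zs) / trials
print(mean_z, var_z)  # roughly 0 and 1, as asymptotic normality predicts
```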

7. Multivariate Gaussian Distribution

A random vector \(X \sim \mathcal{N}_d(\mu, \Sigma)\) has density:

\[f(x) = \frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)\]

where \(\mu \in \mathbb{R}^d\) is the mean vector and \(\Sigma \in \mathbb{R}^{d\times d}\) is the positive definite covariance matrix.
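A standard way to sample from this density (a sketch in Python with NumPy; the example \(\mu\) and \(\Sigma\) are arbitrary) is \(X = \mu + Lz\) with \(z \sim \mathcal{N}_d(0, I)\) and \(LL^\top = \Sigma\) the Cholesky factorisation, since then \(\text{Cov}(X) = L I L^\top = \Sigma\):

```python
import numpy as np

rng = np.random.default_rng(7)

mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])  # positive definite

# X = mu + L z with z ~ N(0, I) and L the Cholesky factor of Sigma,
# so that Cov(X) = L L^T = Sigma.
L = np.linalg.cholesky(Sigma)
z = rng.standard_normal((100_000, 2))
X = mu + z @ L.T

print(np.round(X.mean(axis=0), 2))           # close to mu
print(np.round(np.cov(X, rowvar=False), 2))  # close to Sigma
```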

Covariance Matrix

For a random vector \(X = (X^{(1)}, \ldots, X^{(d)})^\top\):

\[\Sigma = \text{Cov}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\top\right]\]

For matrices \(A, B\): \(\text{Cov}(AX + B) = A\,\text{Cov}(X)\,A^\top\). The diagonal entries are variances; off-diagonal entries are covariances.
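The affine rule \(\text{Cov}(AX + B) = A\,\text{Cov}(X)\,A^\top\) is easy to verify empirically (a sketch in Python with NumPy; the particular \(A\), \(b\), and \(\Sigma\) are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# A 3-dimensional random vector with a known covariance matrix.
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=200_000)

# The affine map Y_i = A x_i + b, written row-wise as Y = X A^T + b.
A = np.array([[1.0, -1.0, 0.0],
              [0.0,  2.0, 1.0]])
b = np.array([3.0, -2.0])
Y = X @ A.T + b

# Empirical covariance of Y should match A Sigma A^T; the shift b drops out.
emp = np.cov(Y, rowvar=False)
theory = A @ Sigma @ A.T
print(np.round(emp, 2))
print(np.round(theory, 2))
```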

Multivariate CLT

If \(X_1, \ldots, X_n\) are i.i.d. copies of a random vector with mean \(\mu\) and covariance \(\Sigma\):

\[\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{(d)} \mathcal{N}_d(0, \Sigma)\]
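A simulation sketch of the multivariate CLT (in Python with NumPy; the vector \(X = (U, U+V)\) with \(U, V \sim \text{Uniform}(0,1)\) i.i.d. is an arbitrary non-Gaussian example with \(\mu = (1/2, 1)\) and \(\Sigma = \frac{1}{12}\begin{pmatrix}1 & 1\\ 1 & 2\end{pmatrix}\)):

```python
import numpy as np

rng = np.random.default_rng(6)

# Non-Gaussian 2-d vectors: X = (U, U + V), U, V ~ Uniform(0, 1) i.i.d.
# Mean mu = (0.5, 1.0); Cov(U, U + V) = Var(U) = 1/12, Var(U + V) = 2/12.
mu = np.array([0.5, 1.0])
Sigma = np.array([[1/12, 1/12],
                  [1/12, 2/12]])

n, trials = 500, 5_000
U = rng.random((trials, n))
V = rng.random((trials, n))
X_bar = np.stack([U.mean(axis=1), (U + V).mean(axis=1)], axis=1)

# The multivariate CLT: sqrt(n) * (X_bar - mu) is approximately N_2(0, Sigma).
Z = np.sqrt(n) * (X_bar - mu)
covZ = np.cov(Z, rowvar=False)
print(np.round(covZ, 3))  # close to Sigma = [[0.083, 0.083], [0.083, 0.167]]
```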