The likelihood function, MLE derivation, Fisher information, and asymptotic normality
Maximum Likelihood Estimation is the workhorse of parametric statistics. The core idea is simple: given observed data, find the parameter value that makes the data most probable. Despite this simplicity, MLE has deep theoretical guarantees — it is consistent, asymptotically normal, and achieves the Cramér-Rao lower bound in large samples.
The likelihood of the model given an i.i.d. sample \(x_1, \ldots, x_n\) (discrete case):
\[L_n(x_1,\ldots,x_n;\theta) = \prod_{i=1}^n \mathbb{P}_\theta[X_i = x_i]\]
When \(\mathbb{P}_\theta\) has density \(f_\theta\):
\[L(x_1,\ldots,x_n;\theta) = \prod_{i=1}^n f_\theta(x_i)\]
The likelihood is a function of \(\theta\) for fixed data — not a probability over \(\theta\). In practice we work with the log-likelihood \(\ell(\theta) = \log L(\theta) = \sum_i \log f_\theta(x_i)\), which converts products into sums and is numerically stable.
\[\hat{\theta}_n^{\text{MLE}} = \arg\max_{\theta \in \Theta}\, L(x_1,\ldots,x_n;\theta) = \arg\max_{\theta \in \Theta}\, \ell(\theta)\]
For \(X_1,\ldots,X_n \sim \mathcal{N}(\mu, \sigma^2)\), taking the derivative of the log-likelihood and setting it to zero gives \(\hat{\mu}^{\text{MLE}} = \bar{X}_n\) and \(\widehat{\sigma^2}^{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X}_n)^2\). The MLE for the mean is the sample mean (unbiased), but the MLE for the variance divides by \(n\) rather than \(n-1\), so it is biased.
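A minimal numerical sketch of these closed-form MLEs (the seed, sample size, and true parameters are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # illustrative Gaussian sample

def log_likelihood(x, mu, sigma2):
    """Gaussian log-likelihood ell(mu, sigma^2) = sum_i log f(x_i)."""
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

# Closed-form MLEs from the zero-derivative conditions above
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)   # divides by n, not n - 1

# Sanity check: the MLE should beat nearby parameter values
best = log_likelihood(x, mu_hat, sigma2_hat)
assert best >= log_likelihood(x, mu_hat + 0.1, sigma2_hat)
assert best >= log_likelihood(x, mu_hat, sigma2_hat * 1.1)
```

The assertions only probe neighbouring parameter values; the calculus argument above is what guarantees a global maximum.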
Under mild regularity conditions, the MLE is consistent:
\[\hat{\theta}_n^{\text{MLE}} \xrightarrow{\mathbb{P}} \theta^* \quad \text{as } n \to \infty\]
Intuition: maximising \(\frac{1}{n}\ell(\theta) = \frac{1}{n}\sum_i \log f_\theta(X_i)\) is equivalent (by the LLN) to maximising \(\mathbb{E}_{\theta^*}[\log f_\theta(X)]\), which is maximised at \(\theta = \theta^*\) by the properties of the KL divergence.
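Consistency can be seen in simulation. A small sketch with a Bernoulli model, whose MLE is simply the sample mean (the parameter value and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.3  # true Bernoulli parameter (illustrative)

# For Bernoulli(p), the MLE is the sample mean; watch the error
# |p_hat - theta*| shrink as n grows
errors = []
for n in [100, 10_000, 1_000_000]:
    x = rng.binomial(1, theta_star, size=n)
    errors.append(abs(x.mean() - theta_star))

assert errors[-1] < 0.01   # at n = 10^6 the MLE is very close to theta*
```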
Define the log-likelihood of a single observation, \(\ell(\theta) = \log L_1(X;\theta)\); its gradient \(\nabla\ell(\theta)\) is the score function. The Fisher information is the covariance matrix of the score:
\[I(\theta) = \mathbb{E}\!\left[\nabla\ell(\theta)\nabla\ell(\theta)^\top\right] - \mathbb{E}[\nabla\ell(\theta)]\,\mathbb{E}[\nabla\ell(\theta)]^\top = -\mathbb{E}\!\left[\nabla^2\ell(\theta)\right]\]
Under regularity conditions the score has mean zero at the true parameter, so the second term vanishes there.
In one dimension: \(I(\theta) = \text{Var}[\ell'(\theta)] = -\mathbb{E}[\ell''(\theta)]\).
Fisher information measures how much information the data carries about \(\theta\). High Fisher information means the likelihood is sharply peaked — the data is highly informative. Low Fisher information means the likelihood is flat — many parameter values are nearly equally plausible.
The Cramér-Rao bound states that for any unbiased estimator \(\hat{\theta}\) built from a sample of size \(n\): \(\text{Var}(\hat{\theta}) \geq 1/(nI(\theta))\), with \(I\) the per-observation Fisher information. The MLE achieves this bound asymptotically.
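The identity \(\text{Var}[\ell'(\theta)] = -\mathbb{E}[\ell''(\theta)]\) can be checked by Monte Carlo. A sketch for the Bernoulli model, where \(I(p) = 1/(p(1-p))\) is known in closed form (the parameter and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.3                                      # illustrative true parameter
x = rng.binomial(1, p, size=1_000_000)

# Per-observation derivatives of ell(p) = X log p + (1 - X) log(1 - p)
score = x / p - (1 - x) / (1 - p)            # ell'(p)
second = -x / p**2 - (1 - x) / (1 - p)**2    # ell''(p)

fisher_exact = 1 / (p * (1 - p))             # I(p) for Bernoulli(p)
assert abs(score.var() - fisher_exact) < 0.1      # Var[ell'] ~ I(p)
assert abs(-second.mean() - fisher_exact) < 0.1   # -E[ell''] ~ I(p)
```

Both empirical quantities land near \(1/(0.3 \cdot 0.7) \approx 4.76\), matching the two expressions for \(I(p)\).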
Under regularity conditions, with \(\theta^*\) the true parameter:
\[\sqrt{n}\!\left(\hat{\theta}_n^{\text{MLE}} - \theta^*\right) \xrightarrow{(d)} \mathcal{N}\!\left(0,\, I(\theta^*)^{-1}\right)\]
Two remarkable facts: (1) the MLE converges at rate \(1/\sqrt{n}\), the fastest possible for regular estimators; (2) its asymptotic variance equals the Cramér-Rao lower bound — the MLE is asymptotically efficient.
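Asymptotic normality is also easy to probe by simulation. A sketch again using Bernoulli, where the MLE is the sample mean and \(I(p)^{-1} = p(1-p)\) (settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 500, 20_000                # illustrative settings

# Each replication: an MLE p_hat from a sample of size n
p_hats = rng.binomial(n, p, size=reps) / n
z = np.sqrt(n) * (p_hats - p)                # sqrt(n)(theta_hat - theta*)

# Its empirical variance should be close to 1/I(p) = p(1-p)
assert abs(z.var() - p * (1 - p)) < 0.02
```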
A random vector \(X \sim \mathcal{N}_d(\mu, \Sigma)\) has density:
\[f(x) = \frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)\]
where \(\mu \in \mathbb{R}^d\) is the mean vector and \(\Sigma \in \mathbb{R}^{d\times d}\) is the positive definite covariance matrix.
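The density translates directly into code. A sketch of a helper (`mvn_density` is a name introduced here, not from the text) evaluated at the mean, where the exponent vanishes:

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Evaluate the N_d(mu, Sigma) density at x, straight from the formula."""
    d = mu.size
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])       # illustrative PD matrix

# At x = mu the exponent is zero, so f(mu) = 1 / ((2 pi)^{d/2} det(Sigma)^{1/2})
expected = 1 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
assert np.isclose(mvn_density(mu, mu, Sigma), expected)
```

Note the use of `np.linalg.solve` rather than explicitly inverting \(\Sigma\), which is both cheaper and numerically safer.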
For a random vector \(X = (X^{(1)}, \ldots, X^{(d)})^\top\):
\[\Sigma = \text{Cov}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\top\right]\]
For a matrix \(A\) and constant vector \(b\): \(\text{Cov}(AX + b) = A\,\text{Cov}(X)\,A^\top\). The diagonal entries are variances; off-diagonal entries are covariances.
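The affine transformation rule can be verified empirically. A sketch comparing the empirical covariance of transformed samples against \(A\Sigma A^\top\) (the matrices and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Samples with a known covariance (illustrative PD matrix)
Sigma = np.array([[1.0, 0.3, 0.0], [0.3, 2.0, 0.5], [0.0, 0.5, 1.5]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=500_000)

A = np.array([[1.0, -1.0, 0.0], [0.5, 0.5, 2.0]])
b = np.array([10.0, -3.0])

Y = X @ A.T + b                      # each row is A x_i + b
emp = np.cov(Y, rowvar=False)        # empirical Cov(AX + b)
theory = A @ Sigma @ A.T             # the identity above (the shift b drops out)
assert np.allclose(emp, theory, atol=0.1)
```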
If \(X_1, \ldots, X_n\) are i.i.d. copies of a random vector with mean \(\mu\) and covariance \(\Sigma\), the multivariate Central Limit Theorem gives:
\[\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{(d)} \mathcal{N}_d(0, \Sigma)\]
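As with the scalar case, this can be checked by simulating many independent sample means and comparing the covariance of the rescaled means to \(\Sigma\) (dimensions, seed, and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 20_000                        # illustrative settings
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])

# reps independent samples of size n; shape (reps, n, 2)
X = rng.multivariate_normal(mu, Sigma, size=(reps, n))
z = np.sqrt(n) * (X.mean(axis=1) - mu)       # sqrt(n)(X_bar - mu), shape (reps, 2)

# Covariance of the rescaled sample means should approach Sigma
assert np.allclose(np.cov(z, rowvar=False), Sigma, atol=0.1)
```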