The likelihood function, MLE derivation, Fisher information, and asymptotic normality
Maximum Likelihood Estimation is the workhorse of parametric statistics. The core idea is simple: given observed data, find the parameter value that makes the data most probable. Despite this simplicity, MLE has deep theoretical guarantees — it is consistent, asymptotically normal, and achieves the Cramér-Rao lower bound in large samples.
The likelihood of the model given an i.i.d. sample \(x_1, \ldots, x_n\) (discrete case):
\[L_n(x_1,\ldots,x_n;\theta) = \prod_{i=1}^n \mathbb{P}_\theta[X_i = x_i]\]
When \(\mathbb{P}_\theta\) has density \(f_\theta\):
\[L(x_1,\ldots,x_n;\theta) = \prod_{i=1}^n f_\theta(x_i)\]
The likelihood is a function of \(\theta\) for fixed data — not a probability over \(\theta\). In practice we work with the log-likelihood \(\ell(\theta) = \log L(\theta) = \sum_i \log f_\theta(x_i)\), which converts products into sums and is numerically stable.
\[\hat{\theta}_n^{\text{MLE}} = \arg\max_{\theta \in \Theta}\, L(x_1,\ldots,x_n;\theta) = \arg\max_{\theta \in \Theta}\, \ell(\theta)\]
For \(X_1,\ldots,X_n \sim \mathcal{N}(\mu, \sigma^2)\), taking the derivative of the log-likelihood and setting it to zero gives \(\hat{\mu}^{\text{MLE}} = \bar{X}_n\) and \(\widehat{\sigma^2}^{\text{MLE}} = \frac{1}{n}\sum_{i=1}^n(X_i - \bar{X}_n)^2\). The MLE for the mean is the sample mean (unbiased), but the MLE for the variance divides by \(n\) rather than \(n-1\), so it is biased.
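A minimal numerical sketch of these closed-form MLEs (the seed, sample size, and true parameters are illustrative choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # illustrative Gaussian sample

def log_likelihood(x, mu, sigma2):
    """Gaussian log-likelihood ell(mu, sigma^2) = sum_i log f(x_i)."""
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - np.sum((x - mu) ** 2) / (2 * sigma2)

# Closed-form MLEs from the zero-derivative conditions above
mu_hat = x.mean()
sigma2_hat = np.mean((x - mu_hat) ** 2)   # divides by n, not n - 1

# Sanity check: the MLE should beat nearby parameter values
best = log_likelihood(x, mu_hat, sigma2_hat)
assert best >= log_likelihood(x, mu_hat + 0.1, sigma2_hat)
assert best >= log_likelihood(x, mu_hat, sigma2_hat * 1.1)
```

The assertions only probe neighbouring parameter values; the calculus argument above is what guarantees a global maximum.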
Under mild regularity conditions, the MLE is consistent:
\[\hat{\theta}_n^{\text{MLE}} \xrightarrow{\mathbb{P}} \theta^* \quad \text{as } n \to \infty\]
Intuition: maximising \(\frac{1}{n}\ell(\theta) = \frac{1}{n}\sum_i \log f_\theta(X_i)\) is equivalent (by the LLN) to maximising \(\mathbb{E}_{\theta^*}[\log f_\theta(X)]\), which is maximised at \(\theta = \theta^*\) by the properties of the KL divergence.
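Consistency can be seen in simulation. A small sketch with a Bernoulli model, whose MLE is simply the sample mean (the parameter value and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_star = 0.3  # true Bernoulli parameter (illustrative)

# For Bernoulli(p), the MLE is the sample mean; watch the error
# |p_hat - theta*| shrink as n grows
errors = []
for n in [100, 10_000, 1_000_000]:
    x = rng.binomial(1, theta_star, size=n)
    errors.append(abs(x.mean() - theta_star))

assert errors[-1] < 0.01   # at n = 10^6 the MLE is very close to theta*
```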
Define the log-likelihood of a single observation, \(\ell(\theta) = \log L_1(X;\theta)\); its gradient \(\nabla\ell(\theta)\) is the score function. The Fisher information is the covariance matrix of the score:
\[I(\theta) = \mathbb{E}\!\left[\nabla\ell(\theta)\nabla\ell(\theta)^\top\right] - \mathbb{E}[\nabla\ell(\theta)]\,\mathbb{E}[\nabla\ell(\theta)]^\top = -\mathbb{E}\!\left[\nabla^2\ell(\theta)\right]\]
Under regularity conditions the score has mean zero at the true parameter, so the second term vanishes there.
In one dimension: \(I(\theta) = \text{Var}[\ell'(\theta)] = -\mathbb{E}[\ell''(\theta)]\).
Fisher information measures how much information the data carries about \(\theta\). High Fisher information means the likelihood is sharply peaked — the data is highly informative. Low Fisher information means the likelihood is flat — many parameter values are nearly equally plausible.
The Cramér-Rao bound states that for any unbiased estimator \(\hat{\theta}\) built from a sample of size \(n\): \(\text{Var}(\hat{\theta}) \geq 1/(nI(\theta))\), with \(I\) the per-observation Fisher information. The MLE achieves this bound asymptotically.
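The identity \(\text{Var}[\ell'(\theta)] = -\mathbb{E}[\ell''(\theta)]\) can be checked by Monte Carlo. A sketch for the Bernoulli model, where \(I(p) = 1/(p(1-p))\) is known in closed form (the parameter and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
p = 0.3                                      # illustrative true parameter
x = rng.binomial(1, p, size=1_000_000)

# Per-observation derivatives of ell(p) = X log p + (1 - X) log(1 - p)
score = x / p - (1 - x) / (1 - p)            # ell'(p)
second = -x / p**2 - (1 - x) / (1 - p)**2    # ell''(p)

fisher_exact = 1 / (p * (1 - p))             # I(p) for Bernoulli(p)
assert abs(score.var() - fisher_exact) < 0.1      # Var[ell'] ~ I(p)
assert abs(-second.mean() - fisher_exact) < 0.1   # -E[ell''] ~ I(p)
```

Both empirical quantities land near \(1/(0.3 \cdot 0.7) \approx 4.76\), matching the two expressions for \(I(p)\).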
Under regularity conditions, with \(\theta^*\) the true parameter:
\[\sqrt{n}\!\left(\hat{\theta}_n^{\text{MLE}} - \theta^*\right) \xrightarrow{(d)} \mathcal{N}\!\left(0,\, I(\theta^*)^{-1}\right)\]
Two remarkable facts: (1) the MLE converges at rate \(1/\sqrt{n}\), the fastest possible for regular estimators; (2) its asymptotic variance equals the Cramér-Rao lower bound — the MLE is asymptotically efficient.
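Asymptotic normality is also easy to probe by simulation. A sketch again using Bernoulli, where the MLE is the sample mean and \(I(p)^{-1} = p(1-p)\) (settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 500, 20_000                # illustrative settings

# Each replication: an MLE p_hat from a sample of size n
p_hats = rng.binomial(n, p, size=reps) / n
z = np.sqrt(n) * (p_hats - p)                # sqrt(n)(theta_hat - theta*)

# Its empirical variance should be close to 1/I(p) = p(1-p)
assert abs(z.var() - p * (1 - p)) < 0.02
```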
A random vector \(X \sim \mathcal{N}_d(\mu, \Sigma)\) has density:
\[f(x) = \frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)\]
where \(\mu \in \mathbb{R}^d\) is the mean vector and \(\Sigma \in \mathbb{R}^{d\times d}\) is the positive definite covariance matrix.
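The density translates directly into code. A sketch of a helper (`mvn_density` is a name introduced here, not from the text) evaluated at the mean, where the exponent vanishes:

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Evaluate the N_d(mu, Sigma) density at x, straight from the formula."""
    d = mu.size
    diff = x - mu
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    return np.exp(-0.5 * quad) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])       # illustrative PD matrix

# At x = mu the exponent is zero, so f(mu) = 1 / ((2 pi)^{d/2} det(Sigma)^{1/2})
expected = 1 / (2 * np.pi * np.sqrt(np.linalg.det(Sigma)))
assert np.isclose(mvn_density(mu, mu, Sigma), expected)
```

Note the use of `np.linalg.solve` rather than explicitly inverting \(\Sigma\), which is both cheaper and numerically safer.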
For a random vector \(X = (X^{(1)}, \ldots, X^{(d)})^\top\):
\[\Sigma = \text{Cov}(X) = \mathbb{E}\!\left[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^\top\right]\]
For a matrix \(A\) and constant vector \(b\): \(\text{Cov}(AX + b) = A\,\text{Cov}(X)\,A^\top\). The diagonal entries are variances; off-diagonal entries are covariances.
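The affine transformation rule can be verified empirically. A sketch comparing the empirical covariance of transformed samples against \(A\Sigma A^\top\) (the matrices and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Samples with a known covariance (illustrative PD matrix)
Sigma = np.array([[1.0, 0.3, 0.0], [0.3, 2.0, 0.5], [0.0, 0.5, 1.5]])
X = rng.multivariate_normal(mean=np.zeros(3), cov=Sigma, size=500_000)

A = np.array([[1.0, -1.0, 0.0], [0.5, 0.5, 2.0]])
b = np.array([10.0, -3.0])

Y = X @ A.T + b                      # each row is A x_i + b
emp = np.cov(Y, rowvar=False)        # empirical Cov(AX + b)
theory = A @ Sigma @ A.T             # the identity above (the shift b drops out)
assert np.allclose(emp, theory, atol=0.1)
```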
If \(X_1, \ldots, X_n\) are i.i.d. copies of a random vector with mean \(\mu\) and covariance \(\Sigma\), the multivariate Central Limit Theorem gives:
\[\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{(d)} \mathcal{N}_d(0, \Sigma)\]
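As with the scalar case, this can be checked by simulating many independent sample means and comparing the covariance of the rescaled means to \(\Sigma\) (dimensions, seed, and parameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 20_000                        # illustrative settings
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])

# reps independent samples of size n; shape (reps, n, 2)
X = rng.multivariate_normal(mu, Sigma, size=(reps, n))
z = np.sqrt(n) * (X.mean(axis=1) - mu)       # sqrt(n)(X_bar - mu), shape (reps, 2)

# Covariance of the rescaled sample means should approach Sigma
assert np.allclose(np.cov(z, rowvar=False), Sigma, atol=0.1)
```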