Matching sample moments to population moments, and a unified framework for loss-based estimators
Maximum likelihood is not the only way to construct estimators. The Method of Moments (MoM) is older, simpler, and sometimes computationally tractable when MLE is not. M-Estimation generalises both MLE and MoM into a single framework of optimisation-based estimators.
The \(k\)-th population moment is \(m_k(\theta) = \mathbb{E}_\theta[X_1^k]\), for \(1 \leq k \leq d\).
The \(k\)-th sample moment is \(\hat{m}_k = \bar{X^k} = \frac{1}{n}\sum_{i=1}^n X_i^k\).
By the LLN: \(\hat{m}_k \xrightarrow{\mathbb{P}} m_k(\theta^*)\). By the multivariate CLT: \(\sqrt{n}\big((\hat{m}_1,\ldots,\hat{m}_d) - (m_1(\theta^*),\ldots,m_d(\theta^*))\big) \xrightarrow{(d)} \mathcal{N}_d(0,\, \Sigma(\theta^*))\), where \(\Sigma(\theta^*)\) is the covariance matrix of the vector \((X_1, X_1^2, \ldots, X_1^d)\).
The idea: if \(\theta \in \mathbb{R}^d\), match \(d\) sample moments to their population counterparts and solve.
Define \(M: \Theta \to \mathbb{R}^d\) by \(M(\theta) = (m_1(\theta), \ldots, m_d(\theta))\). Assuming \(M\) is one-to-one:
\[\hat{\theta}_n^{\text{MM}} = M^{-1}(\hat{m}_1, \ldots, \hat{m}_d)\]
For \(X \sim \mathcal{N}(\mu, \sigma^2)\), we have \(m_1 = \mu\) and \(m_2 = \mu^2 + \sigma^2\). Setting sample moments equal to population moments and solving gives \(\hat{\mu} = \hat{m}_1 = \bar{X}\) and \(\hat{\sigma}^2 = \hat{m}_2 - \hat{m}_1^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2\).
These happen to match the MLE here, but that's not always the case.
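A minimal sketch of this example in Python, assuming simulated Gaussian data; the variable names are illustrative only:

```python
import numpy as np

# Method of Moments for N(mu, sigma^2), using the moment equations above:
# m1 = mu and m2 = mu^2 + sigma^2, inverted as mu = m1, sigma^2 = m2 - m1^2.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

m1_hat = np.mean(x)       # first sample moment
m2_hat = np.mean(x**2)    # second sample moment

mu_hat = m1_hat                   # invert M: mu = m1
sigma2_hat = m2_hat - m1_hat**2   # invert M: sigma^2 = m2 - m1^2

print(mu_hat, sigma2_hat)  # should be close to 2.0 and 1.5**2 = 2.25
```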
By the multivariate Delta method, if \(M^{-1}\) is differentiable:
\[\sqrt{n}(\hat{\theta}_n^{\text{MM}} - \theta^*) \xrightarrow{(d)} \mathcal{N}_d(0,\, \Gamma(\theta^*))\]
where \(\Gamma(\theta) = \left[\frac{\partial M^{-1}}{\partial m}(M(\theta))\right]\Sigma(\theta)\left[\frac{\partial M^{-1}}{\partial m}(M(\theta))\right]^\top\), the Jacobian of \(M^{-1}\) is evaluated at \(M(\theta)\), and \(\Sigma(\theta)\) is the covariance matrix of \((X_1, X_1^2, \ldots, X_1^d)\) from the CLT above.
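A Monte Carlo check of this covariance for the Gaussian example, assuming the closed form \(\Gamma(\theta^*) = \operatorname{diag}(\sigma^2,\, 2\sigma^4)\) obtained by multiplying the Jacobian of \(M^{-1}\) with \(\Sigma = \text{Cov}(X, X^2)\); the simulation settings are illustrative:

```python
import numpy as np

# Simulate many MoM estimates of (mu, sigma^2) and compare n * their empirical
# covariance with the Delta-method covariance diag(sigma^2, 2*sigma^4).
rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.5
n, reps = 500, 20_000

samples = rng.normal(mu, sigma, size=(reps, n))
m1 = samples.mean(axis=1)
m2 = (samples**2).mean(axis=1)
theta_hat = np.column_stack([m1, m2 - m1**2])  # (mu_hat, sigma2_hat) per replication

emp_cov = n * np.cov(theta_hat, rowvar=False)  # approximates Gamma(theta*)
print(emp_cov)
print(np.diag([sigma**2, 2 * sigma**4]))       # theoretical diag(2.25, 10.125)
```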
| | MLE | MoM |
|---|---|---|
| Consistency | Yes (regularity conditions) | Yes |
| Asymptotic efficiency | Yes (Cramér-Rao) | Not in general |
| Computation | Can be hard (non-convex) | Often easier (solve equations) |
| Misspecification | May fail badly | More robust |
M-Estimation unifies MLE, MoM, and many robust estimators under one framework: find the parameter that minimises an average loss over the data.
Let \(\rho: E \times \mathcal{M} \to \mathbb{R}\) be a loss function. The population minimiser is:
\[\mu^* = \arg\min_{\mu \in \mathcal{M}}\, Q(\mu) \quad \text{where } Q(\mu) = \mathbb{E}[\rho(X_1, \mu)]\]
The M-estimator plugs in the empirical average:
\[\hat{\mu}_n = \arg\min_{\mu \in \mathcal{M}}\, Q_n(\mu) \quad \text{where } Q_n(\mu) = \frac{1}{n}\sum_{i=1}^n\rho(X_i, \mu)\]
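A minimal sketch of a one-dimensional M-estimator, assuming `scipy` is available; the helper `m_estimate` and its interface are illustrative, not a standard API:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Generic M-estimator: minimise the empirical average loss Q_n(mu).
def m_estimate(x, rho, bounds):
    q_n = lambda mu: np.mean(rho(x, mu))  # Q_n(mu) = average loss over the data
    return minimize_scalar(q_n, bounds=bounds, method="bounded").x

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.0, size=5_000)

# Squared loss recovers the sample mean; absolute loss recovers the sample median.
print(m_estimate(x, lambda xi, mu: (xi - mu)**2, bounds=(-10, 10)))
print(m_estimate(x, lambda xi, mu: np.abs(xi - mu), bounds=(-10, 10)))
```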
With \(\rho(x, \mu) = -\log L_1(x, \mu)\), \(Q(\mu) = -\mathbb{E}[\log f_\mu(X)]\). The population minimiser is \(\mu^* = \theta^*\) (by KL divergence minimisation). So MLE is an M-estimator with the negative log-likelihood as the loss.
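A sketch of this correspondence, assuming \(\text{Exp}(\lambda)\) data with density \(\lambda e^{-\lambda x}\); minimising the average negative log-likelihood numerically should agree with the closed-form MLE \(\hat{\lambda} = 1/\bar{X}\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# MLE as an M-estimator: the loss is the negative log-likelihood of one observation.
rng = np.random.default_rng(3)
x = rng.exponential(scale=1 / 2.5, size=5_000)  # true lambda = 2.5

neg_loglik = lambda lam: np.mean(-(np.log(lam) - lam * x))  # Q_n(lambda)
lam_hat = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded").x

print(lam_hat, 1 / np.mean(x))  # the two should agree closely
```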
For \(\alpha \in (0,1)\), the check function is:
\[C_\alpha(u) = \begin{cases} -(1-\alpha)u & u < 0 \\ \alpha u & u \geq 0 \end{cases}\]
With loss \(\rho(x, \mu) = C_\alpha(x - \mu)\), the population minimiser \(\mu^*\) is the \(\alpha\)-quantile of \(\mathbb{P}\). M-estimation with the check loss therefore gives a robust quantile estimator that requires no distributional assumptions.
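A minimal sketch of check-loss quantile estimation, assuming a standard Normal sample and comparing against `np.quantile`; the sample size and bounds are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Quantile estimation by minimising the average check loss rho(x, mu) = C_alpha(x - mu).
rng = np.random.default_rng(4)
x = rng.normal(size=10_000)
alpha = 0.9

def check(u, alpha):
    return np.where(u < 0, -(1 - alpha) * u, alpha * u)

q_n = lambda mu: np.mean(check(x - mu, alpha))
mu_hat = minimize_scalar(q_n, bounds=(-5, 5), method="bounded").x

print(mu_hat, np.quantile(x, alpha))  # both estimate the 0.9-quantile (about 1.28)
```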
Under regularity conditions (uniqueness of \(\mu^*\), invertibility of the Hessian \(J(\mu^*)\)):
\[\sqrt{n}(\hat{\mu}_n - \mu^*) \xrightarrow{(d)} \mathcal{N}\!\left(0,\, J(\mu^*)^{-1}K(\mu^*)J(\mu^*)^{-1}\right)\]
where \(J(\mu) = \mathbb{E}\!\left[\frac{\partial^2\rho(X_1,\mu)}{\partial\mu\partial\mu^\top}\right]\) (Hessian) and \(K(\mu) = \text{Cov}\!\left(\frac{\partial\rho(X_1,\mu)}{\partial\mu}\right)\).
In the negative log-likelihood case, \(J(\theta^*) = K(\theta^*) = I(\theta^*)\) (the Fisher information), so the sandwich reduces to \(I(\theta^*)^{-1}\), recovering the MLE's asymptotic variance.
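A sketch of the plug-in sandwich estimator for the simplest case, the squared loss \(\rho(x, \mu) = (x - \mu)^2\), where the M-estimator is the sample mean and the sandwich should be close to \(\text{Var}(X)\); the data-generating choice is illustrative:

```python
import numpy as np

# Plug-in sandwich variance for rho(x, mu) = (x - mu)^2.
# Here d(rho)/d(mu) = -2(x - mu) and d^2(rho)/d(mu)^2 = 2, so J_hat = 2 and
# K_hat is the empirical variance of the per-observation gradients (scores).
rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=20_000)  # Var(X) = 4

mu_hat = np.mean(x)                 # M-estimator for the squared loss
scores = -2 * (x - mu_hat)          # gradients of the loss at mu_hat
J_hat = 2.0                         # Hessian of the loss (constant here)
K_hat = np.var(scores)              # covariance of the gradient
sandwich = K_hat / (J_hat * J_hat)  # J^{-1} K J^{-1} in one dimension

print(sandwich, np.var(x))          # both approximate Var(X) = 4
```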