Matching sample moments to population moments, and a unified framework for loss-based estimators
Maximum likelihood is not the only way to construct estimators. The Method of Moments (MoM) is older, simpler, and sometimes computationally tractable when MLE is not. M-Estimation generalises both MLE and MoM into a single framework of optimisation-based estimators.
The \(k\)-th population moment is \(m_k(\theta) = \mathbb{E}_\theta[X_1^k]\), for \(1 \leq k \leq d\).
The \(k\)-th sample moment is \(\hat{m}_k = \bar{X^k} = \frac{1}{n}\sum_{i=1}^n X_i^k\).
By the LLN: \(\hat{m}_k \xrightarrow{\mathbb{P}} m_k(\theta^*)\). By the multivariate CLT: \(\sqrt{n}\big((\hat{m}_1,\ldots,\hat{m}_d) - (m_1(\theta^*),\ldots,m_d(\theta^*))\big) \xrightarrow{(d)} \mathcal{N}_d(0,\, \Sigma(\theta^*))\), where \(\Sigma(\theta^*)\) is the covariance matrix of the vector \((X_1, X_1^2, \ldots, X_1^d)\).
The idea: if \(\theta \in \mathbb{R}^d\), match \(d\) sample moments to their population counterparts and solve.
Define \(M: \Theta \to \mathbb{R}^d\) by \(M(\theta) = (m_1(\theta), \ldots, m_d(\theta))\). Assuming \(M\) is one-to-one:
\[\hat{\theta}_n^{\text{MM}} = M^{-1}(\hat{m}_1, \ldots, \hat{m}_d)\]
For \(X \sim \mathcal{N}(\mu, \sigma^2)\), we have \(m_1 = \mu\) and \(m_2 = \mu^2 + \sigma^2\). Setting sample moments equal to population moments and solving gives \(\hat{\mu} = \hat{m}_1 = \bar{X}\) and \(\hat{\sigma}^2 = \hat{m}_2 - \hat{m}_1^2 = \frac{1}{n}\sum_{i=1}^n X_i^2 - \bar{X}^2\).
These happen to match the MLE here, but that's not always the case.
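A minimal sketch of this example in Python, assuming simulated Gaussian data; the variable names are illustrative only:

```python
import numpy as np

# Method of Moments for N(mu, sigma^2), using the moment equations above:
# m1 = mu and m2 = mu^2 + sigma^2, inverted as mu = m1, sigma^2 = m2 - m1^2.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)

m1_hat = np.mean(x)       # first sample moment
m2_hat = np.mean(x**2)    # second sample moment

mu_hat = m1_hat                   # invert M: mu = m1
sigma2_hat = m2_hat - m1_hat**2   # invert M: sigma^2 = m2 - m1^2

print(mu_hat, sigma2_hat)  # should be close to 2.0 and 1.5**2 = 2.25
```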
By the multivariate Delta method, if \(M^{-1}\) is differentiable:
\[\sqrt{n}(\hat{\theta}_n^{\text{MM}} - \theta^*) \xrightarrow{(d)} \mathcal{N}_d(0,\, \Gamma(\theta^*))\]
where \(\Gamma(\theta) = \left[\frac{\partial M^{-1}}{\partial m}(M(\theta))\right]\Sigma(\theta)\left[\frac{\partial M^{-1}}{\partial m}(M(\theta))\right]^\top\), the Jacobian of \(M^{-1}\) is evaluated at \(M(\theta)\), and \(\Sigma(\theta)\) is the covariance matrix of \((X_1, X_1^2, \ldots, X_1^d)\) from the CLT above.
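A Monte Carlo check of this covariance for the Gaussian example, assuming the closed form \(\Gamma(\theta^*) = \operatorname{diag}(\sigma^2,\, 2\sigma^4)\) obtained by multiplying the Jacobian of \(M^{-1}\) with \(\Sigma = \text{Cov}(X, X^2)\); the simulation settings are illustrative:

```python
import numpy as np

# Simulate many MoM estimates of (mu, sigma^2) and compare n * their empirical
# covariance with the Delta-method covariance diag(sigma^2, 2*sigma^4).
rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.5
n, reps = 500, 20_000

samples = rng.normal(mu, sigma, size=(reps, n))
m1 = samples.mean(axis=1)
m2 = (samples**2).mean(axis=1)
theta_hat = np.column_stack([m1, m2 - m1**2])  # (mu_hat, sigma2_hat) per replication

emp_cov = n * np.cov(theta_hat, rowvar=False)  # approximates Gamma(theta*)
print(emp_cov)
print(np.diag([sigma**2, 2 * sigma**4]))       # theoretical diag(2.25, 10.125)
```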
| | MLE | MoM |
|---|---|---|
| Consistency | Yes (regularity conditions) | Yes |
| Asymptotic efficiency | Yes (Cramér-Rao) | Not in general |
| Computation | Can be hard (non-convex) | Often easier (solve equations) |
| Misspecification | May fail badly | More robust |
M-Estimation unifies MLE, MoM, and many robust estimators under one framework: find the parameter that minimises an average loss over the data.
Let \(\rho: E \times \mathcal{M} \to \mathbb{R}\) be a loss function. The population minimiser is:
\[\mu^* = \arg\min_{\mu \in \mathcal{M}}\, Q(\mu) \quad \text{where } Q(\mu) = \mathbb{E}[\rho(X_1, \mu)]\]
The M-estimator plugs in the empirical average:
\[\hat{\mu}_n = \arg\min_{\mu \in \mathcal{M}}\, Q_n(\mu) \quad \text{where } Q_n(\mu) = \frac{1}{n}\sum_{i=1}^n\rho(X_i, \mu)\]
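A minimal sketch of a one-dimensional M-estimator, assuming `scipy` is available; the helper `m_estimate` and its interface are illustrative, not a standard API:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Generic M-estimator: minimise the empirical average loss Q_n(mu).
def m_estimate(x, rho, bounds):
    q_n = lambda mu: np.mean(rho(x, mu))  # Q_n(mu) = average loss over the data
    return minimize_scalar(q_n, bounds=bounds, method="bounded").x

rng = np.random.default_rng(2)
x = rng.normal(3.0, 1.0, size=5_000)

# Squared loss recovers the sample mean; absolute loss recovers the sample median.
print(m_estimate(x, lambda xi, mu: (xi - mu)**2, bounds=(-10, 10)))
print(m_estimate(x, lambda xi, mu: np.abs(xi - mu), bounds=(-10, 10)))
```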
With \(\rho(x, \mu) = -\log L_1(x, \mu)\), \(Q(\mu) = -\mathbb{E}[\log f_\mu(X)]\). The population minimiser is \(\mu^* = \theta^*\) (by KL divergence minimisation). So MLE is an M-estimator with the negative log-likelihood as the loss.
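A sketch of this correspondence, assuming \(\text{Exp}(\lambda)\) data with density \(\lambda e^{-\lambda x}\); minimising the average negative log-likelihood numerically should agree with the closed-form MLE \(\hat{\lambda} = 1/\bar{X}\):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# MLE as an M-estimator: the loss is the negative log-likelihood of one observation.
rng = np.random.default_rng(3)
x = rng.exponential(scale=1 / 2.5, size=5_000)  # true lambda = 2.5

neg_loglik = lambda lam: np.mean(-(np.log(lam) - lam * x))  # Q_n(lambda)
lam_hat = minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded").x

print(lam_hat, 1 / np.mean(x))  # the two should agree closely
```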
For \(\alpha \in (0,1)\), the check function is:
\[C_\alpha(u) = \begin{cases} -(1-\alpha)u & u < 0 \\ \alpha u & u \geq 0 \end{cases}\]
With loss \(\rho(x, \mu) = C_\alpha(x - \mu)\), the population minimiser \(\mu^*\) is the \(\alpha\)-quantile of \(\mathbb{P}\). M-estimation with the check loss therefore gives a robust quantile estimator that requires no distributional assumptions.
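A minimal sketch of check-loss quantile estimation, assuming a standard Normal sample and comparing against `np.quantile`; the sample size and bounds are illustrative:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Quantile estimation by minimising the average check loss rho(x, mu) = C_alpha(x - mu).
rng = np.random.default_rng(4)
x = rng.normal(size=10_000)
alpha = 0.9

def check(u, alpha):
    return np.where(u < 0, -(1 - alpha) * u, alpha * u)

q_n = lambda mu: np.mean(check(x - mu, alpha))
mu_hat = minimize_scalar(q_n, bounds=(-5, 5), method="bounded").x

print(mu_hat, np.quantile(x, alpha))  # both estimate the 0.9-quantile (about 1.28)
```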
Under regularity conditions (uniqueness of \(\mu^*\), invertibility of the Hessian \(J(\mu^*)\)):
\[\sqrt{n}(\hat{\mu}_n - \mu^*) \xrightarrow{(d)} \mathcal{N}\!\left(0,\, J(\mu^*)^{-1}K(\mu^*)J(\mu^*)^{-1}\right)\]
where \(J(\mu) = \mathbb{E}\!\left[\frac{\partial^2\rho(X_1,\mu)}{\partial\mu\partial\mu^\top}\right]\) (Hessian) and \(K(\mu) = \text{Cov}\!\left(\frac{\partial\rho(X_1,\mu)}{\partial\mu}\right)\).
In the negative log-likelihood case, \(J(\theta^*) = K(\theta^*) = I(\theta^*)\) (the Fisher information), so the sandwich reduces to \(I(\theta^*)^{-1}\), recovering the MLE's asymptotic variance.
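A sketch of the plug-in sandwich estimator for the simplest case, the squared loss \(\rho(x, \mu) = (x - \mu)^2\), where the M-estimator is the sample mean and the sandwich should be close to \(\text{Var}(X)\); the data-generating choice is illustrative:

```python
import numpy as np

# Plug-in sandwich variance for rho(x, mu) = (x - mu)^2.
# Here d(rho)/d(mu) = -2(x - mu) and d^2(rho)/d(mu)^2 = 2, so J_hat = 2 and
# K_hat is the empirical variance of the per-observation gradients (scores).
rng = np.random.default_rng(5)
x = rng.exponential(scale=2.0, size=20_000)  # Var(X) = 4

mu_hat = np.mean(x)                 # M-estimator for the squared loss
scores = -2 * (x - mu_hat)          # gradients of the loss at mu_hat
J_hat = 2.0                         # Hessian of the loss (constant here)
K_hat = np.var(scores)              # covariance of the gradient
sandwich = K_hat / (J_hat * J_hat)  # J^{-1} K J^{-1} in one dimension

print(sandwich, np.var(x))          # both approximate Var(X) = 4
```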