What a statistical model is, identifiability, and the properties that make a good estimator
Before we can estimate anything, we need a formal framework for what we're trying to do. A statistical model specifies the family of distributions we believe generated our data. An estimator is a function of the data that produces a guess for the unknown parameter. The question is: what makes an estimator good?
A statistical experiment is described by a pair \((E, (\mathbb{P}_\theta)_{\theta \in \Theta})\) where \(E\) is the sample space, \(\Theta\) is the parameter space, and \(\mathbb{P}_\theta\) is the distribution of the data when the true parameter is \(\theta\).
We observe a sample \(X_1, \ldots, X_n \sim \mathbb{P}_{\theta^*}\) for some unknown true parameter \(\theta^*\). The goal of statistics is to recover information about \(\theta^*\) from the data.
The parameter \(\theta\) is identifiable if the map \(\theta \mapsto \mathbb{P}_\theta\) is injective: \(\theta \neq \theta' \implies \mathbb{P}_\theta \neq \mathbb{P}_{\theta'}\).
If the model is not identifiable, two different parameter values produce exactly the same distribution — we cannot distinguish them even with infinite data. Identifiability is a necessary condition for consistent estimation.
Example: In a mixture model \(\alpha \mathcal{N}(\mu_1, 1) + (1-\alpha)\mathcal{N}(\mu_2, 1)\), swapping labels \((\alpha, \mu_1, \mu_2) \leftrightarrow (1-\alpha, \mu_2, \mu_1)\) gives the same distribution. The model is not identifiable without additional constraints (e.g., \(\mu_1 < \mu_2\)).
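The label-swapping symmetry is easy to check numerically. A minimal sketch (the `mixture_pdf` function and the specific parameter values are illustrative, not from the text):

```python
import math

def mixture_pdf(x, alpha, mu1, mu2):
    """Density of the mixture alpha*N(mu1, 1) + (1 - alpha)*N(mu2, 1)."""
    phi = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    return alpha * phi(x - mu1) + (1 - alpha) * phi(x - mu2)

# The two label-swapped parameterisations give identical densities everywhere,
# so no amount of data can tell them apart:
for x in [-2.0, 0.0, 0.5, 3.0]:
    p1 = mixture_pdf(x, 0.3, -1.0, 2.0)
    p2 = mixture_pdf(x, 0.7, 2.0, -1.0)  # labels swapped
    assert abs(p1 - p2) < 1e-12
```

Imposing \(\mu_1 < \mu_2\) picks exactly one representative from each swapped pair, restoring injectivity.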
A statistic is any measurable function of the data \(X_1, \ldots, X_n\). Examples: \(\bar{X}_n\), \(\max_i X_i\), \(\sum_i X_i^2\).
An estimator is a statistic \(\hat{\theta}_n = \hat{\theta}(X_1, \ldots, X_n)\) used to estimate the unknown parameter \(\theta^*\). Its expression must not depend on \(\theta^*\): it has to be computable from the data alone.
The bias of an estimator is \(\text{bias}(\hat{\theta}_n) = \mathbb{E}[\hat{\theta}_n] - \theta^*\).
An estimator is unbiased if \(\text{bias}(\hat{\theta}_n) = 0\), i.e., it gets the right answer on average. Unbiasedness is not always possible and is sometimes not even desirable (biased estimators can have lower total error).
Example: The sample mean \(\bar{X}_n\) is unbiased for \(\mu\). The sample variance \(S_n^2 = \frac{1}{n}\sum_i(X_i - \bar{X}_n)^2\) is biased: \(\mathbb{E}[S_n^2] = \frac{n-1}{n}\sigma^2\), so the unbiased version divides by \(n-1\).
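The bias of the \(1/n\) variance estimator shows up clearly in simulation. A quick Monte Carlo sketch (sample size and repetition count chosen for illustration):

```python
import random

random.seed(0)
n, reps, sigma2 = 5, 200_000, 1.0

total = 0.0
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n)]
    xbar = sum(xs) / n
    total += sum((x - xbar) ** 2 for x in xs) / n  # divide by n, not n - 1
mean_Sn2 = total / reps

# E[S_n^2] = (n-1)/n * sigma^2 = 0.8 here, noticeably below the true 1.0
print(mean_Sn2)
```

With \(n = 5\) the average of \(S_n^2\) lands near \(0.8\), matching \(\frac{n-1}{n}\sigma^2\); multiplying by \(\frac{n}{n-1}\) removes the bias.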
The quadratic risk (mean squared error) of an estimator is \(R(\hat{\theta}_n) = \mathbb{E}\!\left[(\hat{\theta}_n - \theta^*)^2\right] = \text{Var}(\hat{\theta}_n) + \text{bias}(\hat{\theta}_n)^2\).
This bias-variance decomposition is fundamental. Reducing bias often increases variance and vice versa. The optimal estimator minimises total risk, which may accept some bias in exchange for lower variance.
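The decomposition can be verified directly by simulation. A sketch using a deliberately shrunk (hence biased) version of the sample mean — the shrinkage factor `c` is an arbitrary illustration, not from the text:

```python
import random

random.seed(1)
theta, n, reps, c = 2.0, 10, 100_000, 0.8

ests = []
for _ in range(reps):
    xs = [random.gauss(theta, 1) for _ in range(n)]
    ests.append(c * sum(xs) / n)  # shrunk estimator: biased, but lower variance

m = sum(ests) / reps
bias = m - theta
var = sum((e - m) ** 2 for e in ests) / reps
mse = sum((e - theta) ** 2 for e in ests) / reps

# Empirically, MSE = variance + bias^2 (an exact algebraic identity):
assert abs(mse - (var + bias ** 2)) < 1e-8
```

Note the identity holds exactly for the empirical moments, not just in expectation, since \(\frac{1}{N}\sum (e_i - \theta)^2 = \frac{1}{N}\sum (e_i - \bar{e})^2 + (\bar{e} - \theta)^2\).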
An estimator \(\hat{\theta}_n\) is weakly consistent for \(\theta^*\) if:
\(\hat{\theta}_n \xrightarrow{\mathbb{P}} \theta^* \quad \text{as } n \to \infty\)
Consistency is a minimum requirement: as we collect more data, we should eventually recover the truth. A sufficient condition is that \(\text{bias}(\hat{\theta}_n) \to 0\) and \(\text{Var}(\hat{\theta}_n) \to 0\): then the quadratic risk tends to zero, and Markov's inequality applied to \((\hat{\theta}_n - \theta^*)^2\) gives convergence in probability.
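Weak consistency of the sample mean can be seen by estimating \(\mathbb{P}(|\bar{X}_n - \mu| > \varepsilon)\) for growing \(n\). A small sketch (the tolerance \(\varepsilon\) and sample sizes are illustrative):

```python
import random

random.seed(2)
mu, eps, reps = 3.0, 0.1, 1_000

def miss_rate(n):
    """Monte Carlo estimate of P(|X_bar_n - mu| > eps)."""
    miss = 0
    for _ in range(reps):
        xbar = sum(random.gauss(mu, 1) for _ in range(n)) / n
        miss += abs(xbar - mu) > eps
    return miss / reps

# The miss probability shrinks toward 0 as n grows (weak consistency):
print(miss_rate(10), miss_rate(100), miss_rate(1000))
```

For \(n = 10\) the sample mean misses the \(\pm 0.1\) band most of the time; by \(n = 1000\) it almost never does.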
An estimator \(\hat{\theta}_n\) is asymptotically normal if:
\(\sqrt{n}(\hat{\theta}_n - \theta^*) \xrightarrow{(d)} \mathcal{N}(0, \sigma^2)\)
for some asymptotic variance \(\sigma^2\). We write \(\hat{\theta}_n \approx \mathcal{N}(\theta^*, \sigma^2/n)\) for large \(n\).
Asymptotic normality is the key result that enables confidence intervals and hypothesis tests. We don't need to know the exact distribution of the estimator — only its limiting behaviour.
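One way to see this in action: standardise the sample mean of decidedly non-Gaussian data and check that it behaves like \(\mathcal{N}(0, 1)\). A sketch with uniform samples (sample size and repetition count are illustrative):

```python
import math
import random

random.seed(3)
n, reps = 200, 10_000
# X_i ~ Uniform(0, 1): mu = 1/2, sigma^2 = 1/12
mu, sigma = 0.5, math.sqrt(1 / 12)

inside = 0
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    z = math.sqrt(n) * (xbar - mu) / sigma  # approximately N(0, 1) by the CLT
    inside += abs(z) <= 1.96

print(inside / reps)  # close to 0.95, the standard normal coverage of [-1.96, 1.96]
```

Even though each \(X_i\) is uniform, the standardised mean lands in \([-1.96, 1.96]\) about \(95\%\) of the time — exactly the calibration that confidence intervals rely on.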
Jensen's inequality appears throughout statistics — it shows, for instance, that the KL divergence between two distributions is non-negative (the inequality underpinning maximum likelihood estimation), and that certain estimators are necessarily biased. For example, for a positive random variable \(X\), \(\mathbb{E}[1/X] \geq 1/\mathbb{E}[X]\) since \(x \mapsto 1/x\) is convex on \((0, \infty)\) — so \(1/\bar{X}_n\) overestimates \(1/\mu\) on average.
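The inequality \(\mathbb{E}[1/X] \geq 1/\mathbb{E}[X]\) is simple to confirm numerically. A sketch with \(X \sim \text{Uniform}(1, 3)\), chosen so both sides have closed forms (\(\mathbb{E}[1/X] = \tfrac{\ln 3}{2} \approx 0.549\) versus \(1/\mathbb{E}[X] = 0.5\)):

```python
import random

random.seed(4)
reps = 100_000

# X ~ Uniform(1, 3): E[X] = 2, so 1/E[X] = 0.5, while E[1/X] = ln(3)/2 ~ 0.549
xs = [random.uniform(1, 3) for _ in range(reps)]
mean_inv = sum(1 / x for x in xs) / reps  # E[1/X]
inv_mean = 1 / (sum(xs) / reps)           # 1/E[X]

assert mean_inv > inv_mean  # Jensen: 1/x is convex on (0, inf)
print(mean_inv, inv_mean)
```

The gap \(\approx 0.049\) is the Jensen gap; it vanishes only when \(X\) is degenerate, which is why \(1/\bar{X}_n\) is biased for any nondegenerate sample.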