Statistical Models & Estimation

What a statistical model is, identifiability, and the properties that make a good estimator

1. Introduction

Before we can estimate anything, we need a formal framework for what we're trying to do. A statistical model specifies the family of distributions we believe generated our data. An estimator is a function of the data that produces a guess for the unknown parameter. The question is: what makes an estimator good?

2. The Statistical Model

Statistical Model

A statistical experiment is described by a pair \((E, (\mathbb{P}_\theta)_{\theta \in \Theta})\) where \(E\) is the sample space, \(\Theta\) is the parameter space, and \(\mathbb{P}_\theta\) is the distribution of the data when the true parameter is \(\theta\).

We observe a sample \(X_1, \ldots, X_n \sim \mathbb{P}_{\theta^*}\) for some unknown true parameter \(\theta^*\). The goal of statistics is to recover information about \(\theta^*\) from the data.
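As a concrete sketch of this setup, consider a hypothetical Bernoulli coin-flip experiment with \(\Theta = [0, 1]\); the specific numbers below are assumptions chosen for the illustration, not taken from the text:

```python
import numpy as np

# Hypothetical experiment: coin flips modelled as Bernoulli(theta),
# parameter space Theta = [0, 1]. Values below are illustrative only.
rng = np.random.default_rng(0)
theta_star = 0.3                 # the unknown "true" parameter
n = 10_000
X = rng.binomial(1, theta_star, size=n)   # X_1, ..., X_n ~ P_{theta*}

# Any inference procedure may use X alone, never theta_star itself:
theta_hat = X.mean()
```

Note that `theta_star` appears only in the simulation step; the estimator on the last line is a function of the data alone, exactly as the framework requires.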

Types of Models

A model is parametric if \(\Theta \subseteq \mathbb{R}^d\) for some finite \(d\) (e.g., \(\{\mathcal{N}(\mu, \sigma^2) : \mu \in \mathbb{R}, \sigma^2 > 0\}\)), nonparametric if \(\Theta\) is infinite-dimensional (e.g., all distributions admitting a density), and semiparametric if the parameter of interest is finite-dimensional but the model also contains an infinite-dimensional nuisance component.

3. Identifiability

Definition

The parameter \(\theta\) is identifiable if the map \(\theta \mapsto \mathbb{P}_\theta\) is injective: \(\theta \neq \theta' \implies \mathbb{P}_\theta \neq \mathbb{P}_{\theta'}\).

If the model is not identifiable, two different parameter values produce exactly the same distribution — we cannot distinguish them even with infinite data. Identifiability is a necessary condition for consistent estimation.

Example: In a mixture model \(\alpha \mathcal{N}(\mu_1, 1) + (1-\alpha)\mathcal{N}(\mu_2, 1)\), swapping labels \((\alpha, \mu_1, \mu_2) \leftrightarrow (1-\alpha, \mu_2, \mu_1)\) gives the same distribution. The model is not identifiable without additional constraints (e.g., \(\mu_1 < \mu_2\)).
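The label-swapping symmetry can be checked numerically: the two distinct parameter vectors below define the identical density, so the map \(\theta \mapsto \mathbb{P}_\theta\) is not injective. (The helper `mixture_pdf` and the chosen values are illustrative, not from the text.)

```python
import numpy as np

def mixture_pdf(x, alpha, mu1, mu2):
    """Density of the mixture alpha*N(mu1, 1) + (1 - alpha)*N(mu2, 1)."""
    phi = lambda z: np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)
    return alpha * phi(x - mu1) + (1 - alpha) * phi(x - mu2)

x = np.linspace(-5.0, 5.0, 201)
p1 = mixture_pdf(x, 0.3, -1.0, 2.0)   # (alpha, mu1, mu2)
p2 = mixture_pdf(x, 0.7, 2.0, -1.0)   # labels swapped: (1-alpha, mu2, mu1)

# Two different parameters, one distribution: not identifiable.
same = np.allclose(p1, p2)
```

Imposing the constraint \(\mu_1 < \mu_2\) removes exactly this symmetry and restores injectivity.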

4. Statistics and Estimators

Statistic

Any measurable function of the data \(X_1, \ldots, X_n\). Examples: \(\bar{X}_n\), \(\max_i X_i\), \(\sum_i X_i^2\).

Estimator

A statistic \(\hat{\theta}_n = \hat{\theta}(X_1, \ldots, X_n)\) used to estimate the unknown parameter \(\theta^*\). Its expression must not depend on \(\theta^*\).

5. Bias

Bias

\(\text{bias}(\hat{\theta}_n) = \mathbb{E}[\hat{\theta}_n] - \theta^*\)

An estimator is unbiased if \(\text{bias}(\hat{\theta}_n) = 0\), i.e., it gets the right answer on average. Unbiasedness is not always possible and is sometimes not even desirable (biased estimators can have lower total error).

Example: The sample mean \(\bar{X}_n\) is unbiased for \(\mu\). The sample variance \(S_n^2 = \frac{1}{n}\sum(X_i - \bar{X})^2\) is biased — the unbiased version divides by \(n-1\).
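The bias of the \(1/n\) variance estimator can be seen in a quick Monte Carlo sketch; since \(\mathbb{E}[S_n^2] = \frac{n-1}{n}\sigma^2\), the \(1/n\) version should undershoot by the factor \((n-1)/n\). The values \(n = 5\) and \(\sigma^2 = 4\) are arbitrary choices for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 5, 200_000
sigma2 = 4.0   # true variance (illustrative value)

X = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
xbar = X.mean(axis=1, keepdims=True)

s2_biased = ((X - xbar) ** 2).sum(axis=1) / n          # divides by n
s2_unbiased = ((X - xbar) ** 2).sum(axis=1) / (n - 1)  # divides by n - 1

mean_biased = s2_biased.mean()       # close to (n-1)/n * sigma2 = 3.2
mean_unbiased = s2_unbiased.mean()   # close to sigma2 = 4.0
```

With \(n = 5\) the shortfall is large (a factor of \(4/5\)); as \(n\) grows the two estimators converge, which is why the \(1/n\) version is still consistent despite its bias.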

6. Quadratic Risk

Quadratic Risk (MSE)

\(R(\hat{\theta}_n) = \mathbb{E}\!\left[(\hat{\theta}_n - \theta^*)^2\right] = \text{Var}(\hat{\theta}_n) + \text{bias}(\hat{\theta}_n)^2\)

This bias-variance decomposition is fundamental. Reducing bias often increases variance and vice versa. The optimal estimator minimises total risk, which may accept some bias in exchange for lower variance.
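The decomposition can be checked directly on simulated draws of an estimator. Below, a deliberately biased shrinkage estimator \(\lambda \bar{X}_n\) (all numeric values are assumptions for the example) trades bias for variance, and the empirical MSE splits exactly into empirical variance plus squared empirical bias:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, reps = 2.0, 1.0, 20, 100_000   # illustrative values
lam = 0.8   # shrinkage factor: lam * Xbar is biased toward 0

X = rng.normal(mu, sigma, size=(reps, n))
theta_hat = lam * X.mean(axis=1)

mse = ((theta_hat - mu) ** 2).mean()
bias2 = (theta_hat.mean() - mu) ** 2
var = theta_hat.var()   # population (ddof=0) variance

# mse == var + bias2 holds as an algebraic identity on the sample,
# mirroring R(theta_hat) = Var(theta_hat) + bias(theta_hat)^2.
```

Here \(\text{bias} = (\lambda - 1)\mu = -0.4\) and \(\text{Var} = \lambda^2\sigma^2/n = 0.032\), so the risk is about \(0.192\); the unbiased choice \(\lambda = 1\) has zero bias but variance \(0.05\), so the biased estimator wins on total risk in this configuration.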

[Figure 1 — Bias-variance trade-off: as model complexity increases, variance rises and bias² falls; total MSE is minimised at intermediate complexity.]

7. Consistency

Consistency

An estimator \(\hat{\theta}_n\) is weakly consistent for \(\theta^*\) if:

\(\hat{\theta}_n \xrightarrow{\mathbb{P}} \theta^* \quad \text{as } n \to \infty\)

Consistency is a minimum requirement: as we collect more data, we should eventually recover the truth. A sufficient condition: \(\text{bias}(\hat{\theta}_n) \to 0\) and \(\text{Var}(\hat{\theta}_n) \to 0\).
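Convergence in probability can be made visible by estimating \(\mathbb{P}(|\bar{X}_n - \theta^*| > \varepsilon)\) at increasing \(n\); the miss rate should shrink toward zero. The parameter values below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, eps, reps = 1.5, 0.1, 5_000   # illustrative values

# Weak consistency of the sample mean: P(|Xbar_n - mu| > eps) -> 0.
rates = []
for n in (10, 100, 1000):
    X = rng.normal(mu, 1.0, size=(reps, n))
    rates.append(float((np.abs(X.mean(axis=1) - mu) > eps).mean()))
# rates decreases as n grows, heading to 0
```

This is the weak law of large numbers in action; the sufficient condition in the text applies here because \(\bar{X}_n\) is unbiased and \(\text{Var}(\bar{X}_n) = \sigma^2/n \to 0\).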

8. Asymptotic Normality

Asymptotic Normality

An estimator \(\hat{\theta}_n\) is asymptotically normal if:

\(\sqrt{n}(\hat{\theta}_n - \theta^*) \xrightarrow{(d)} \mathcal{N}(0, \sigma^2)\)

for some asymptotic variance \(\sigma^2\). We write \(\hat{\theta}_n \approx \mathcal{N}(\theta^*, \sigma^2/n)\) for large \(n\).

Asymptotic normality is the key result that enables confidence intervals and hypothesis tests. We don't need to know the exact distribution of the estimator — only its limiting behaviour.
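As a sketch of how asymptotic normality yields confidence intervals, the simulation below builds the standard large-sample 95% interval for a Bernoulli proportion, \(\hat{\theta}_n \pm 1.96\sqrt{\hat{\theta}_n(1-\hat{\theta}_n)/n}\), and checks its coverage; the numeric settings are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_star, n, reps = 0.4, 500, 10_000   # illustrative values

X = rng.binomial(1, theta_star, size=(reps, n))
theta_hat = X.mean(axis=1)

# CLT-based 95% interval: theta_hat +/- 1.96 * estimated standard error.
se = np.sqrt(theta_hat * (1 - theta_hat) / n)
coverage = float((np.abs(theta_hat - theta_star) <= 1.96 * se).mean())
# coverage lands close to the nominal 0.95
```

The interval uses only the limiting normal approximation and a plug-in variance estimate, never the exact finite-sample distribution of \(\hat{\theta}_n\).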

9. Jensen's Inequality

Jensen's Inequality

For a convex function \(g\), \(\mathbb{E}[g(X)] \geq g(\mathbb{E}[X])\); the inequality reverses for concave \(g\). Jensen's inequality appears throughout statistics: applied to the concave function \(\log\), it gives \(\mathbb{E}[\log X] \leq \log \mathbb{E}[X]\) (the inequality behind the non-negativity of the KL divergence), and it shows that certain estimators are necessarily biased. For example, \(\mathbb{E}[1/X] \geq 1/\mathbb{E}[X]\) for positive \(X\), since \(1/x\) is convex on \((0, \infty)\); consequently the plug-in estimator \(1/\bar{X}_n\) overestimates \(1/\mu\) on average.
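The gap \(\mathbb{E}[1/X] \geq 1/\mathbb{E}[X]\) is easy to exhibit numerically; the choice of a Uniform(1, 3) distribution below is an arbitrary positive example, for which \(\mathbb{E}[1/X] = \tfrac{1}{2}\log 3 \approx 0.549\) while \(1/\mathbb{E}[X] = 0.5\):

```python
import numpy as np

rng = np.random.default_rng(5)
# X strictly positive, so 1/x is convex on the support (illustrative choice).
X = rng.uniform(1.0, 3.0, size=1_000_000)

lhs = (1.0 / X).mean()   # Monte Carlo estimate of E[1/X]
rhs = 1.0 / X.mean()     # 1 / E[X]
# Jensen: lhs >= rhs, strictly here since X is non-degenerate.
```

The strict gap (about 0.049 in this example) is exactly the bias one incurs by estimating \(1/\mu\) with \(1/\bar{X}\) when the sample is small.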