What a statistical model is, identifiability, and the properties that make a good estimator
Before we can estimate anything, we need a formal framework for what we're trying to do. A statistical model specifies the family of distributions we believe generated our data. An estimator is a function of the data that produces a guess for the unknown parameter. The question is: what makes an estimator good?
A statistical experiment is described by a pair \((E, (\mathbb{P}_\theta)_{\theta \in \Theta})\) where \(E\) is the sample space, \(\Theta\) is the parameter space, and \(\mathbb{P}_\theta\) is the distribution of the data when the true parameter is \(\theta\).
We observe a sample \(X_1, \ldots, X_n \sim \mathbb{P}_{\theta^*}\) for some unknown true parameter \(\theta^*\). The goal of statistics is to recover information about \(\theta^*\) from the data.
The parameter \(\theta\) is identifiable if the map \(\theta \mapsto \mathbb{P}_\theta\) is injective: \(\theta \neq \theta' \implies \mathbb{P}_\theta \neq \mathbb{P}_{\theta'}\).
If the model is not identifiable, two different parameter values produce exactly the same distribution — we cannot distinguish them even with infinite data. Identifiability is a necessary condition for consistent estimation.
Example: In a mixture model \(\alpha \mathcal{N}(\mu_1, 1) + (1-\alpha)\mathcal{N}(\mu_2, 1)\), swapping labels \((\alpha, \mu_1, \mu_2) \leftrightarrow (1-\alpha, \mu_2, \mu_1)\) gives the same distribution. The model is not identifiable without additional constraints (e.g., \(\mu_1 < \mu_2\)).
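The label-swapping symmetry is easy to check numerically. A minimal sketch (the `mixture_pdf` function and the specific parameter values are illustrative, not from the text):

```python
import math

def mixture_pdf(x, alpha, mu1, mu2):
    """Density of the mixture alpha*N(mu1, 1) + (1 - alpha)*N(mu2, 1)."""
    phi = lambda z: math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    return alpha * phi(x - mu1) + (1 - alpha) * phi(x - mu2)

# The two label-swapped parameterisations give identical densities everywhere,
# so no amount of data can tell them apart:
for x in [-2.0, 0.0, 0.5, 3.0]:
    p1 = mixture_pdf(x, 0.3, -1.0, 2.0)
    p2 = mixture_pdf(x, 0.7, 2.0, -1.0)  # labels swapped
    assert abs(p1 - p2) < 1e-12
```

Imposing \(\mu_1 < \mu_2\) picks exactly one representative from each swapped pair, restoring injectivity.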
A statistic is any measurable function of the data \(X_1, \ldots, X_n\). Examples: \(\bar{X}_n\), \(\max_i X_i\), \(\sum_i X_i^2\).
An estimator is a statistic \(\hat{\theta}_n = \hat{\theta}(X_1, \ldots, X_n)\) used to estimate the unknown parameter \(\theta^*\). Its expression must not depend on \(\theta^*\): it has to be computable from the data alone.
The bias of an estimator is \(\text{bias}(\hat{\theta}_n) = \mathbb{E}[\hat{\theta}_n] - \theta^*\).
An estimator is unbiased if \(\text{bias}(\hat{\theta}_n) = 0\), i.e., it gets the right answer on average. Unbiasedness is not always possible and is sometimes not even desirable (biased estimators can have lower total error).
Example: The sample mean \(\bar{X}_n\) is unbiased for \(\mu\). The sample variance \(S_n^2 = \frac{1}{n}\sum_i(X_i - \bar{X}_n)^2\) is biased: \(\mathbb{E}[S_n^2] = \frac{n-1}{n}\sigma^2\), so the unbiased version divides by \(n-1\).
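The bias of the \(1/n\) variance estimator shows up clearly in simulation. A quick Monte Carlo sketch (sample size and repetition count chosen for illustration):

```python
import random

random.seed(0)
n, reps, sigma2 = 5, 200_000, 1.0

total = 0.0
for _ in range(reps):
    xs = [random.gauss(0, 1) for _ in range(n)]
    xbar = sum(xs) / n
    total += sum((x - xbar) ** 2 for x in xs) / n  # divide by n, not n - 1
mean_Sn2 = total / reps

# E[S_n^2] = (n-1)/n * sigma^2 = 0.8 here, noticeably below the true 1.0
print(mean_Sn2)
```

With \(n = 5\) the average of \(S_n^2\) lands near \(0.8\), matching \(\frac{n-1}{n}\sigma^2\); multiplying by \(\frac{n}{n-1}\) removes the bias.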
The quadratic risk (mean squared error) of an estimator is \(R(\hat{\theta}_n) = \mathbb{E}\!\left[(\hat{\theta}_n - \theta^*)^2\right] = \text{Var}(\hat{\theta}_n) + \text{bias}(\hat{\theta}_n)^2\).
This bias-variance decomposition is fundamental. Reducing bias often increases variance and vice versa. The optimal estimator minimises total risk, which may accept some bias in exchange for lower variance.
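The decomposition can be verified directly by simulation. A sketch using a deliberately shrunk (hence biased) version of the sample mean — the shrinkage factor `c` is an arbitrary illustration, not from the text:

```python
import random

random.seed(1)
theta, n, reps, c = 2.0, 10, 100_000, 0.8

ests = []
for _ in range(reps):
    xs = [random.gauss(theta, 1) for _ in range(n)]
    ests.append(c * sum(xs) / n)  # shrunk estimator: biased, but lower variance

m = sum(ests) / reps
bias = m - theta
var = sum((e - m) ** 2 for e in ests) / reps
mse = sum((e - theta) ** 2 for e in ests) / reps

# Empirically, MSE = variance + bias^2 (an exact algebraic identity):
assert abs(mse - (var + bias ** 2)) < 1e-8
```

Note the identity holds exactly for the empirical moments, not just in expectation, since \(\frac{1}{N}\sum (e_i - \theta)^2 = \frac{1}{N}\sum (e_i - \bar{e})^2 + (\bar{e} - \theta)^2\).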
An estimator \(\hat{\theta}_n\) is weakly consistent for \(\theta^*\) if:
\(\hat{\theta}_n \xrightarrow{\mathbb{P}} \theta^* \quad \text{as } n \to \infty\)
Consistency is a minimum requirement: as we collect more data, we should eventually recover the truth. A sufficient condition is that \(\text{bias}(\hat{\theta}_n) \to 0\) and \(\text{Var}(\hat{\theta}_n) \to 0\): then the quadratic risk tends to zero, and Markov's inequality applied to \((\hat{\theta}_n - \theta^*)^2\) gives convergence in probability.
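Weak consistency of the sample mean can be seen by estimating \(\mathbb{P}(|\bar{X}_n - \mu| > \varepsilon)\) for growing \(n\). A small sketch (the tolerance \(\varepsilon\) and sample sizes are illustrative):

```python
import random

random.seed(2)
mu, eps, reps = 3.0, 0.1, 1_000

def miss_rate(n):
    """Monte Carlo estimate of P(|X_bar_n - mu| > eps)."""
    miss = 0
    for _ in range(reps):
        xbar = sum(random.gauss(mu, 1) for _ in range(n)) / n
        miss += abs(xbar - mu) > eps
    return miss / reps

# The miss probability shrinks toward 0 as n grows (weak consistency):
print(miss_rate(10), miss_rate(100), miss_rate(1000))
```

For \(n = 10\) the sample mean misses the \(\pm 0.1\) band most of the time; by \(n = 1000\) it almost never does.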
An estimator \(\hat{\theta}_n\) is asymptotically normal if:
\(\sqrt{n}(\hat{\theta}_n - \theta^*) \xrightarrow{(d)} \mathcal{N}(0, \sigma^2)\)
for some asymptotic variance \(\sigma^2\). We write \(\hat{\theta}_n \approx \mathcal{N}(\theta^*, \sigma^2/n)\) for large \(n\).
Asymptotic normality is the key result that enables confidence intervals and hypothesis tests. We don't need to know the exact distribution of the estimator — only its limiting behaviour.
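One way to see this in action: standardise the sample mean of decidedly non-Gaussian data and check that it behaves like \(\mathcal{N}(0, 1)\). A sketch with uniform samples (sample size and repetition count are illustrative):

```python
import math
import random

random.seed(3)
n, reps = 200, 10_000
# X_i ~ Uniform(0, 1): mu = 1/2, sigma^2 = 1/12
mu, sigma = 0.5, math.sqrt(1 / 12)

inside = 0
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    z = math.sqrt(n) * (xbar - mu) / sigma  # approximately N(0, 1) by the CLT
    inside += abs(z) <= 1.96

print(inside / reps)  # close to 0.95, the standard normal coverage of [-1.96, 1.96]
```

Even though each \(X_i\) is uniform, the standardised mean lands in \([-1.96, 1.96]\) about \(95\%\) of the time — exactly the calibration that confidence intervals rely on.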
Jensen's inequality appears throughout statistics — it shows, for instance, that the KL divergence between two distributions is non-negative (the inequality underpinning maximum likelihood estimation), and that certain estimators are necessarily biased. For example, for a positive random variable \(X\), \(\mathbb{E}[1/X] \geq 1/\mathbb{E}[X]\) since \(x \mapsto 1/x\) is convex on \((0, \infty)\) — so \(1/\bar{X}_n\) overestimates \(1/\mu\) on average.
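The inequality \(\mathbb{E}[1/X] \geq 1/\mathbb{E}[X]\) is simple to confirm numerically. A sketch with \(X \sim \text{Uniform}(1, 3)\), chosen so both sides have closed forms (\(\mathbb{E}[1/X] = \tfrac{\ln 3}{2} \approx 0.549\) versus \(1/\mathbb{E}[X] = 0.5\)):

```python
import random

random.seed(4)
reps = 100_000

# X ~ Uniform(1, 3): E[X] = 2, so 1/E[X] = 0.5, while E[1/X] = ln(3)/2 ~ 0.549
xs = [random.uniform(1, 3) for _ in range(reps)]
mean_inv = sum(1 / x for x in xs) / reps  # E[1/X]
inv_mean = 1 / (sum(xs) / reps)           # 1/E[X]

assert mean_inv > inv_mean  # Jensen: 1/x is convex on (0, inf)
print(mean_inv, inv_mean)
```

The gap \(\approx 0.049\) is the Jensen gap; it vanishes only when \(X\) is degenerate, which is why \(1/\bar{X}_n\) is biased for any nondegenerate sample.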