Testing whether data follows a hypothesised distribution — KS test, chi-squared test, and the empirical CDF
So far we've tested hypotheses about parameters within a fixed distributional family. Goodness-of-fit tests ask a harder question: does the data follow a specific distribution at all? They make no assumption that the data lie in a parametric family; in that sense they are nonparametric.
Given i.i.d. observations \(X_1,\ldots,X_n\), the empirical CDF is:
\[F_n(t) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i \leq t] = \frac{\#\{i : X_i \leq t\}}{n}\]
The empirical CDF is a step function that jumps by \(1/n\) at each observation. It completely characterises the sample and is the natural nonparametric estimator of the true CDF \(F\).
The Glivenko-Cantelli theorem makes this precise:
\[\sup_{t \in \mathbb{R}} |F_n(t) - F(t)| \xrightarrow{a.s.} 0 \quad \text{as } n \to \infty\]
The empirical CDF converges to the true CDF uniformly over all \(t\), almost surely. This is a strong result: not just pointwise convergence, but simultaneous convergence everywhere. It justifies using \(F_n\) as a nonparametric estimate of \(F\).
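As a quick illustration, \(F_n\) can be evaluated directly from its definition. This is a minimal sketch assuming numpy; the helper `ecdf` is illustrative, not from the text:

```python
import numpy as np

def ecdf(sample, t):
    """Empirical CDF F_n(t) = #{i : X_i <= t} / n, vectorised over t."""
    sample = np.asarray(sample)
    t = np.atleast_1d(t)
    return np.mean(sample[:, None] <= t, axis=0)

x = np.array([0.2, 0.5, 0.5, 0.9])   # n = 4
print(ecdf(x, [0.0, 0.5, 1.0]))      # jumps in multiples of 1/n = 0.25
```

Note that at a tied value (here 0.5, observed twice) the step function jumps by a multiple of \(1/n\).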
Glivenko-Cantelli gives convergence. Donsker's theorem gives the rate and the limiting distribution of the scaled process:
If \(F\) is continuous:
\[\sqrt{n}\,\sup_{t \in \mathbb{R}} |F_n(t) - F(t)| \xrightarrow{(d)} \sup_{0 \leq t \leq 1} |\mathbf{B}(t)|\]
where \(\mathbf{B}(t)\) is a Brownian bridge on \([0,1]\).
This limiting distribution is distribution-free — it doesn't depend on \(F\). This makes KS-type tests universally applicable.
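The distribution-free property can be checked by simulation: for continuous \(F\), the probability integral transform makes \(F(X_i)\) Uniform\((0,1)\), so the sup-distance has the same law whatever \(F\) generated the data. A sketch assuming numpy (the helper `ks_stat` and the Monte Carlo sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 2000

def ks_stat(u):
    """sup_t |F_n(t) - t| for a sample u with hypothesised Uniform(0,1) CDF."""
    u = np.sort(u)
    i = np.arange(1, n + 1)
    return np.max(np.maximum(np.abs(i / n - u), np.abs((i - 1) / n - u)))

# KS distance to the true CDF for two different continuous F:
# Exponential(1) data pushed through F(x) = 1 - e^{-x}, and Uniform(0,1) data.
d_exp = [ks_stat(1 - np.exp(-rng.exponential(size=n))) for _ in range(reps)]
d_unif = [ks_stat(rng.uniform(size=n)) for _ in range(reps)]

# The two empirical 95% quantiles agree up to Monte Carlo error.
print(np.quantile(d_exp, 0.95), np.quantile(d_unif, 0.95))
```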
Test \(H_0: F = F^0\) (data follows a specific hypothesised CDF \(F^0\)) vs \(H_1: F \neq F^0\).
\[T_n = \sqrt{n}\,\sup_{t \in \mathbb{R}} |F_n(t) - F^0(t)|\]
For ordered sample \(X_{(1)} \leq \cdots \leq X_{(n)}\):
\[T_n = \sqrt{n}\,\max_{1 \leq i \leq n} \max\!\left(\left|\frac{i-1}{n} - F^0(X_{(i)})\right|, \left|\frac{i}{n} - F^0(X_{(i)})\right|\right)\]
Reject \(H_0\) at level \(\alpha\) if \(T_n > q_\alpha\) where \(q_\alpha\) is the \((1-\alpha)\)-quantile of the Kolmogorov-Smirnov distribution. The critical values are tabulated or computed from the Brownian bridge distribution.
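In practice the statistic and p-value come from `scipy.stats.kstest`; the hand computation below follows the ordered-sample formula above. A sketch assuming scipy and simulated data; note that scipy reports the unscaled distance \(D_n = T_n/\sqrt{n}\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)             # H0: F^0 = standard normal CDF

# Ordered-sample formula for the sup-distance D_n = T_n / sqrt(n)
xs = np.sort(x)
F0 = stats.norm.cdf(xs)
i = np.arange(1, len(x) + 1)
D_n = np.max(np.maximum(np.abs(i / len(x) - F0),
                        np.abs((i - 1) / len(x) - F0)))

# scipy computes the same distance, plus its p-value
result = stats.kstest(x, stats.norm.cdf)
print(D_n, result.statistic, result.pvalue)
```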
Key property: If \(F^0\) is fully specified (no estimated parameters), the test is exact for continuous \(F^0\). If parameters are estimated from the data, the standard critical values no longer apply: the naive test is conservative, and corrections such as the Lilliefors tables (for testing normality with estimated mean and variance) are needed.
The chi-squared test applies to discrete or binned data. Observe \(X_1,\ldots,X_n \stackrel{iid}{\sim} \mathbb{P}_\mathbf{p}\) taking values in \(\{a_1,\ldots,a_K\}\). Test \(H_0: \mathbf{p} = \mathbf{p}^0\) vs \(H_1: \mathbf{p} \neq \mathbf{p}^0\).
\[T_n = n\sum_{j=1}^K \frac{(\hat{p}_j - p_j^0)^2}{p_j^0} \xrightarrow{(d)} \chi^2_{K-1} \quad \text{under } H_0\]
where \(\hat{p}_j = N_j/n\) is the observed proportion and \(N_j = \#\{i: X_i = a_j\}\).
The test has \(K-1\) (not \(K\)) degrees of freedom because the proportions sum to 1, removing one degree of freedom. A standard rule of thumb for the chi-squared approximation is that every expected count satisfies \(n p_j^0 \geq 5\).
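A worked example for a fair-die null, a minimal sketch assuming scipy (the counts are made-up data):

```python
import numpy as np
from scipy import stats

counts = np.array([18, 22, 16, 25, 19, 20])  # made-up N_j over n = 120 rolls
n = counts.sum()
p0 = np.full(6, 1 / 6)                       # H0: fair die
p_hat = counts / n

# Pearson statistic T_n = n * sum_j (p_hat_j - p0_j)^2 / p0_j
T = n * np.sum((p_hat - p0) ** 2 / p0)
pvalue = stats.chi2.sf(T, df=len(counts) - 1)   # K - 1 degrees of freedom

# scipy agrees (expected counts all equal 20 >= 5, so the approximation is fine)
T_scipy, p_scipy = stats.chisquare(counts, f_exp=n * p0)
print(T, pvalue)
```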
The likelihood ratio test offers an alternative to chi-squared for categorical data. Its statistic is:
\[T_n = 2\!\left(\ell_n(\hat{\mathbf{p}}) - \ell_n(\mathbf{p}^0)\right) = 2\sum_{j=1}^K N_j \log\frac{\hat{p}_j}{p_j^0} \xrightarrow{(d)} \chi^2_{K-1}\]
Both tests have the same asymptotic distribution. The Pearson chi-squared statistic is simpler to compute; the likelihood ratio test has better power in some settings and connects more directly to information-theoretic ideas.
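On illustrative die-roll counts (made-up data, a sketch assuming numpy), the two statistics are numerically close under \(H_0\):

```python
import numpy as np

counts = np.array([18, 22, 16, 25, 19, 20])  # made-up counts, n = 120, K = 6
n = counts.sum()
p0 = np.full(6, 1 / 6)
p_hat = counts / n

# Likelihood ratio statistic 2 * sum_j N_j log(p_hat_j / p0_j)
# (well defined here since every N_j > 0; a zero count contributes 0 by convention)
G = 2 * np.sum(counts * np.log(p_hat / p0))

# Pearson statistic for comparison: both are asymptotically chi^2_{K-1}
T = n * np.sum((p_hat - p0) ** 2 / p0)
print(G, T)
```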