Testing whether data follows a hypothesised distribution — KS test, chi-squared test, and the empirical CDF
So far we've tested hypotheses about parameters within a fixed distributional family. Goodness-of-fit tests ask a harder question: does the data follow a specific distribution at all? They make no assumption that the data lie in a parametric family; in that sense they are nonparametric.
Given i.i.d. observations \(X_1,\ldots,X_n\), the empirical CDF is:
\[F_n(t) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}[X_i \leq t] = \frac{\#\{i : X_i \leq t\}}{n}\]
The empirical CDF is a step function that jumps by \(1/n\) at each observation. It completely characterises the sample and is the natural nonparametric estimator of the true CDF \(F\).
The Glivenko-Cantelli theorem makes this precise:
\[\sup_{t \in \mathbb{R}} |F_n(t) - F(t)| \xrightarrow{a.s.} 0 \quad \text{as } n \to \infty\]
The empirical CDF converges to the true CDF uniformly over all \(t\), almost surely. This is a strong result: not just pointwise convergence, but simultaneous convergence everywhere. It justifies using \(F_n\) as a nonparametric estimate of \(F\).
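As a quick illustration, \(F_n\) can be evaluated directly from its definition. This is a minimal sketch assuming numpy; the helper `ecdf` is illustrative, not from the text:

```python
import numpy as np

def ecdf(sample, t):
    """Empirical CDF F_n(t) = #{i : X_i <= t} / n, vectorised over t."""
    sample = np.asarray(sample)
    t = np.atleast_1d(t)
    return np.mean(sample[:, None] <= t, axis=0)

x = np.array([0.2, 0.5, 0.5, 0.9])   # n = 4
print(ecdf(x, [0.0, 0.5, 1.0]))      # jumps in multiples of 1/n = 0.25
```

Note that at a tied value (here 0.5, observed twice) the step function jumps by a multiple of \(1/n\).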
Glivenko-Cantelli gives convergence. Donsker's theorem gives the rate and the limiting distribution of the scaled process:
If \(F\) is continuous:
\[\sqrt{n}\,\sup_{t \in \mathbb{R}} |F_n(t) - F(t)| \xrightarrow{(d)} \sup_{0 \leq t \leq 1} |\mathbf{B}(t)|\]
where \(\mathbf{B}(t)\) is a Brownian bridge on \([0,1]\).
This limiting distribution is distribution-free — it doesn't depend on \(F\). This makes KS-type tests universally applicable.
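The distribution-free property can be checked by simulation: for continuous \(F\), the probability integral transform makes \(F(X_i)\) Uniform\((0,1)\), so the sup-distance has the same law whatever \(F\) generated the data. A sketch assuming numpy (the helper `ks_stat` and the Monte Carlo sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 50, 2000

def ks_stat(u):
    """sup_t |F_n(t) - t| for a sample u with hypothesised Uniform(0,1) CDF."""
    u = np.sort(u)
    i = np.arange(1, n + 1)
    return np.max(np.maximum(np.abs(i / n - u), np.abs((i - 1) / n - u)))

# KS distance to the true CDF for two different continuous F:
# Exponential(1) data pushed through F(x) = 1 - e^{-x}, and Uniform(0,1) data.
d_exp = [ks_stat(1 - np.exp(-rng.exponential(size=n))) for _ in range(reps)]
d_unif = [ks_stat(rng.uniform(size=n)) for _ in range(reps)]

# The two empirical 95% quantiles agree up to Monte Carlo error.
print(np.quantile(d_exp, 0.95), np.quantile(d_unif, 0.95))
```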
Test \(H_0: F = F^0\) (data follows a specific hypothesised CDF \(F^0\)) vs \(H_1: F \neq F^0\).
\[T_n = \sqrt{n}\,\sup_{t \in \mathbb{R}} |F_n(t) - F^0(t)|\]
For ordered sample \(X_{(1)} \leq \cdots \leq X_{(n)}\):
\[T_n = \sqrt{n}\,\max_{1 \leq i \leq n} \max\!\left(\left|\frac{i-1}{n} - F^0(X_{(i)})\right|, \left|\frac{i}{n} - F^0(X_{(i)})\right|\right)\]
Reject \(H_0\) at level \(\alpha\) if \(T_n > q_\alpha\) where \(q_\alpha\) is the \((1-\alpha)\)-quantile of the Kolmogorov-Smirnov distribution. The critical values are tabulated or computed from the Brownian bridge distribution.
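In practice the statistic and p-value come from `scipy.stats.kstest`; the hand computation below follows the ordered-sample formula above. A sketch assuming scipy and simulated data; note that scipy reports the unscaled distance \(D_n = T_n/\sqrt{n}\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=100)             # H0: F^0 = standard normal CDF

# Ordered-sample formula for the sup-distance D_n = T_n / sqrt(n)
xs = np.sort(x)
F0 = stats.norm.cdf(xs)
i = np.arange(1, len(x) + 1)
D_n = np.max(np.maximum(np.abs(i / len(x) - F0),
                        np.abs((i - 1) / len(x) - F0)))

# scipy computes the same distance, plus its p-value
result = stats.kstest(x, stats.norm.cdf)
print(D_n, result.statistic, result.pvalue)
```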
Key property: If \(F^0\) is fully specified (no estimated parameters), the test is exact for continuous \(F^0\). If parameters are estimated from the data, the standard critical values no longer apply: the naive test is conservative, and corrections such as the Lilliefors tables (for testing normality with estimated mean and variance) are needed.
The chi-squared test applies to discrete or binned data. Observe \(X_1,\ldots,X_n \stackrel{iid}{\sim} \mathbb{P}_\mathbf{p}\) taking values in \(\{a_1,\ldots,a_K\}\). Test \(H_0: \mathbf{p} = \mathbf{p}^0\) vs \(H_1: \mathbf{p} \neq \mathbf{p}^0\).
\[T_n = n\sum_{j=1}^K \frac{(\hat{p}_j - p_j^0)^2}{p_j^0} \xrightarrow{(d)} \chi^2_{K-1} \quad \text{under } H_0\]
where \(\hat{p}_j = N_j/n\) is the observed proportion and \(N_j = \#\{i: X_i = a_j\}\).
The test has \(K-1\) (not \(K\)) degrees of freedom because the proportions sum to 1, removing one degree of freedom. A standard rule of thumb for the chi-squared approximation is that every expected count satisfies \(n p_j^0 \geq 5\).
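A worked example for a fair-die null, a minimal sketch assuming scipy (the counts are made-up data):

```python
import numpy as np
from scipy import stats

counts = np.array([18, 22, 16, 25, 19, 20])  # made-up N_j over n = 120 rolls
n = counts.sum()
p0 = np.full(6, 1 / 6)                       # H0: fair die
p_hat = counts / n

# Pearson statistic T_n = n * sum_j (p_hat_j - p0_j)^2 / p0_j
T = n * np.sum((p_hat - p0) ** 2 / p0)
pvalue = stats.chi2.sf(T, df=len(counts) - 1)   # K - 1 degrees of freedom

# scipy agrees (expected counts all equal 20 >= 5, so the approximation is fine)
T_scipy, p_scipy = stats.chisquare(counts, f_exp=n * p0)
print(T, pvalue)
```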
The likelihood ratio test offers an alternative to chi-squared for categorical data. Its statistic is:
\[T_n = 2\!\left(\ell_n(\hat{\mathbf{p}}) - \ell_n(\mathbf{p}^0)\right) = 2\sum_{j=1}^K N_j \log\frac{\hat{p}_j}{p_j^0} \xrightarrow{(d)} \chi^2_{K-1}\]
Both tests have the same asymptotic distribution. The Pearson chi-squared statistic is simpler to compute; the likelihood ratio test has better power in some settings and connects more directly to information-theoretic ideas.
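On illustrative die-roll counts (made-up data, a sketch assuming numpy), the two statistics are numerically close under \(H_0\):

```python
import numpy as np

counts = np.array([18, 22, 16, 25, 19, 20])  # made-up counts, n = 120, K = 6
n = counts.sum()
p0 = np.full(6, 1 / 6)
p_hat = counts / n

# Likelihood ratio statistic 2 * sum_j N_j log(p_hat_j / p0_j)
# (well defined here since every N_j > 0; a zero count contributes 0 by convention)
G = 2 * np.sum(counts * np.log(p_hat / p0))

# Pearson statistic for comparison: both are asymptotically chi^2_{K-1}
T = n * np.sum((p_hat - p0) ** 2 / p0)
print(G, T)
```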