Probability Foundations

Law of Large Numbers, Central Limit Theorem, Hoeffding's Inequality, and the Gaussian Distribution

1. Introduction

Statistics and probability are deeply intertwined but answer different questions. Probability starts from a known data-generating process and asks: what outcomes should we expect? Statistics starts from observed data and asks: what process could have generated this?

This notebook covers the foundational probability results that underpin all of statistical inference — how sample averages behave, how they concentrate around their mean, and why the Gaussian distribution appears everywhere.

2. The Law of Large Numbers

Let \(X_1, X_2, \ldots, X_n\) be i.i.d. random variables with \(\mathbb{E}[X] = \mu\) and \(\text{Var}(X) = \sigma^2\). The sample mean is:

\[\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i\]
Weak Law of Large Numbers

\(\bar{X}_n \xrightarrow{\mathbb{P}} \mu \quad \text{as } n \to \infty\)

The sample average converges in probability to the true mean. No matter how noisy individual observations are, averaging enough of them drives the error to zero. This is why we can estimate population parameters from samples at all.
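The convergence can be seen in a quick simulation. The sketch below (standard library only; the Bernoulli parameter p = 0.3 and the sample sizes are illustrative choices, not from the text) prints the absolute error of the sample mean as n grows:

```python
import random

random.seed(42)
p = 0.3  # true mean of a Bernoulli(0.3) variable; an illustrative choice

def sample_mean(n):
    # average of n i.i.d. Bernoulli(p) indicator draws
    return sum(random.random() < p for _ in range(n)) / n

# |X̄_n − μ| shrinks as n grows, per the weak LLN
for n in (10, 1_000, 100_000):
    print(n, abs(sample_mean(n) - p))
```

The error is not monotone run-to-run (individual samples are random), but it reliably shrinks on the order of \(1/\sqrt{n}\), which is exactly the rate the CLT makes precise in the next section.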

Figure 1 — The sample mean concentrates around μ as n grows. The band of variability collapses.

3. The Central Limit Theorem

The LLN tells us the sample mean converges to μ. The CLT tells us how fast and in what shape.

Central Limit Theorem

\(\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow{(d)} \mathcal{N}(0,1) \quad \text{as } n \to \infty\)

Equivalently, \(\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{(d)} \mathcal{N}(0, \sigma^2)\). The normalised sample mean is approximately Gaussian regardless of the original distribution of \(X_i\). This is why Gaussian assumptions are so pervasive — they emerge naturally from averaging.

Practical implication: for large \(n\), we can treat \(\bar{X}_n\) as approximately \(\mathcal{N}(\mu, \sigma^2/n)\), enabling confidence intervals and hypothesis tests without knowing the true distribution.
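This approximation can be checked empirically. The sketch below (standard library only; the choice of Uniform(0, 1) draws and the sizes n = 200, reps = 2000 are illustrative assumptions) standardises many independent sample means and counts how often they land inside ±1.96, which should be close to 0.95 if the Gaussian approximation holds:

```python
import math
import random

random.seed(0)
n, reps = 200, 2000                 # illustrative sizes
mu, sigma = 0.5, math.sqrt(1 / 12)  # mean and sd of Uniform(0, 1)

# fraction of standardised sample means falling inside ±1.96
inside = 0
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    z = math.sqrt(n) * (xbar - mu) / sigma
    inside += abs(z) <= 1.96
print(inside / reps)  # close to 0.95
```

Note that the underlying draws are uniform, not Gaussian; it is the averaging that produces the Gaussian shape.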

Figure 2 — As n increases (panels for n = 1, 5, 30, 100), the distribution of the sample mean approaches a Gaussian regardless of the original distribution.

4. Hoeffding's Inequality

The CLT is asymptotic. Hoeffding gives a finite-sample guarantee on how far \(\bar{X}_n\) can deviate from \(\mu\).

Hoeffding's Inequality

Let \(X_1, \ldots, X_n\) be i.i.d. with \(\mathbb{E}[X]=\mu\) and \(X \in [a,b]\) almost surely. Then for all \(\epsilon > 0\):

\[\mathbb{P}\!\left(|\bar{X}_n - \mu| \geq \epsilon\right) \leq 2\exp\!\left(\frac{-2n\epsilon^2}{(b-a)^2}\right)\]

Key observations:

- The bound is non-asymptotic: it holds for every finite \(n\), not just in the limit.
- The deviation probability decays exponentially in \(n\) and in \(\epsilon^2\).
- Only boundedness of \(X\) is required; no variance or other distributional assumptions are needed.
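The bound can be compared against the true deviation frequency in a simulation. In the sketch below (standard library only; Bernoulli(0.5) draws with n = 100, ε = 0.1, and 5000 repetitions are illustrative assumptions), the empirical frequency of large deviations should sit below the Hoeffding bound:

```python
import math
import random

random.seed(1)
n, eps, reps = 100, 0.1, 5000   # illustrative sizes
a, b, p = 0.0, 1.0, 0.5         # Bernoulli(0.5) is bounded in [0, 1]

# Hoeffding's upper bound on P(|X̄_n − μ| ≥ eps)
bound = 2 * math.exp(-2 * n * eps**2 / (b - a) ** 2)

# empirical frequency of deviations of at least eps
exceed = sum(
    abs(sum(random.random() < p for _ in range(n)) / n - p) >= eps
    for _ in range(reps)
) / reps
print(exceed, "<=", bound)
```

The gap is typically large: Hoeffding holds for every bounded distribution simultaneously, so for any particular one it is usually conservative.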

5. The Gaussian Distribution

A random variable \(X \sim \mathcal{N}(\mu, \sigma^2)\) has probability density:

\[f_{\mu,\sigma^2}(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\]

with \(\mathbb{E}[X] = \mu\) and \(\text{Var}(X) = \sigma^2\).
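The density formula above can be implemented and sanity-checked directly. This sketch (standard library only; the grid over [−8, 8] with step 0.001 is an illustrative numerical choice) evaluates the peak height of the standard normal and verifies numerically that the density integrates to 1:

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # density of N(mu, sigma^2) evaluated at x
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# peak height of the standard normal is 1/sqrt(2*pi) ≈ 0.3989
print(gaussian_pdf(0.0))

# Riemann-sum sanity check: the density integrates to ~1 over [-8, 8]
step = 0.001
total = sum(gaussian_pdf(-8 + i * step) * step for i in range(16_000))
print(total)
```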

Key Properties

- Affine invariance: if \(X \sim \mathcal{N}(\mu, \sigma^2)\), then \(aX + b \sim \mathcal{N}(a\mu + b, a^2\sigma^2)\).
- Standardisation: \(Z = (X - \mu)/\sigma \sim \mathcal{N}(0,1)\), so every Gaussian probability reduces to a standard normal one.
- Symmetry: the density is symmetric about \(\mu\), so \(\mathbb{P}(X \leq \mu) = 1/2\).

Quantiles

The quantile of order \(1-\alpha\) of a random variable \(X\) is the number \(q_\alpha\) such that \(\mathbb{P}(X \leq q_\alpha) = 1-\alpha\). For \(Z \sim \mathcal{N}(0,1)\): \(\mathbb{P}(|Z| > q_{\alpha/2}) = \alpha\). Common values: \(q_{0.025} \approx 1.96\), \(q_{0.05} \approx 1.645\).
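These quantiles are available from Python's standard library via `statistics.NormalDist`, so no lookup table or external package is needed:

```python
from statistics import NormalDist

z = NormalDist()               # standard normal, from the Python stdlib

q_025 = z.inv_cdf(1 - 0.025)   # q_{0.025}: P(Z <= q) = 0.975
q_05 = z.inv_cdf(1 - 0.05)     # q_{0.05}:  P(Z <= q) = 0.95
print(round(q_025, 3), round(q_05, 3))  # 1.96 1.645
```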

Figure 3 — The central 1−α probability mass lies between −qα/2 and qα/2, with α/2 in each tail.

6. Why It Matters

These three results form the backbone of all frequentist statistics:

- The LLN guarantees that estimators built from sample averages are consistent.
- The CLT supplies their approximate sampling distribution, justifying confidence intervals and hypothesis tests.
- Hoeffding's inequality converts this asymptotic picture into explicit finite-sample guarantees.