Distance Measures

Total Variation Distance and KL Divergence — how to measure the gap between probability distributions

1. Introduction

To reason about how close an estimator's distribution is to the truth, or how distinguishable two probability distributions are, we need a way to measure the "distance" between distributions. Two canonical measures are the total variation (TV) distance and the Kullback–Leibler (KL) divergence. They have different interpretations, different properties, and are used in different contexts.

2. Total Variation Distance

Total Variation Distance

The TV distance between \(\mathbb{P}\) and \(\mathbb{Q}\) is:

\[\text{TV}(\mathbb{P},\mathbb{Q}) = \max_{A \subseteq E} |\mathbb{P}(A) - \mathbb{Q}(A)|\]

This is the largest possible difference in probability that any event can have under the two distributions.

Computational Formulas

Discrete case: If \(E\) is countable:

\[\text{TV}(\mathbb{P},\mathbb{Q}) = \frac{1}{2}\sum_{x \in E} |p(x) - q(x)|\]

Continuous case: If \(E\) is continuous and the distributions have densities \(f_\theta, f_{\theta'}\):

\[\text{TV}(\mathbb{P}_\theta, \mathbb{P}_{\theta'}) = \frac{1}{2}\int_E |f_\theta(x) - f_{\theta'}(x)|\, dx\]
Figure 1 — TV distance measures the maximum gap between probabilities; the overlap region between the two densities contributes to similarity.
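As a quick numerical sanity check, the snippet below evaluates both formulas on a pair of made-up three-point PMFs (the numbers are illustrative, not from the text): the maximum over all events and the half-\(L^1\) sum agree.

```python
# Check that max_A |P(A) - Q(A)| equals (1/2) * sum_x |p(x) - q(x)| on a toy example.
from itertools import chain, combinations

E = [0, 1, 2]
p = {0: 0.5, 1: 0.3, 2: 0.2}   # illustrative PMF for P
q = {0: 0.2, 1: 0.3, 2: 0.5}   # illustrative PMF for Q

def all_events(space):
    """Every subset A of the finite sample space."""
    return chain.from_iterable(combinations(space, r) for r in range(len(space) + 1))

# Definition: largest gap in probability over all events A.
tv_max = max(abs(sum(p[x] for x in A) - sum(q[x] for x in A)) for A in all_events(E))

# Computational formula: half the L1 distance between the PMFs.
tv_l1 = 0.5 * sum(abs(p[x] - q[x]) for x in E)

print(tv_max, tv_l1)   # both ≈ 0.3
```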

Properties of TV Distance

TV is a genuine metric on probability distributions: it is symmetric, satisfies the triangle inequality, and is zero exactly when the two distributions coincide. Its operational meaning: if a single observation is drawn from either \(\mathbb{P}\) or \(\mathbb{Q}\) with equal probability, the best possible binary test misclassifies its source with probability \(\frac{1}{2}(1 - \text{TV}(\mathbb{P},\mathbb{Q}))\).
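A small Monte Carlo sketch of this interpretation, again with made-up PMFs: nature picks \(\mathbb{P}\) or \(\mathbb{Q}\) with equal probability, the test guesses whichever distribution makes the observed point more likely, and the empirical error rate lands near \(\frac{1}{2}(1 - \text{TV})\).

```python
# Monte Carlo check of the testing interpretation of TV (illustrative PMFs).
import random

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))

random.seed(0)
errors = 0
n = 200_000
for _ in range(n):
    from_p = random.random() < 0.5                   # nature picks P or Q with equal probability
    x = random.choices(range(3), weights=p if from_p else q)[0]
    guess_p = p[x] >= q[x]                           # guess the distribution that makes x more likely
    errors += (guess_p != from_p)

print(errors / n, 0.5 * (1 - tv))   # both close to 0.35
```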

3. KL Divergence

KL Divergence (Relative Entropy)

\[\text{KL}(\mathbb{P}_\theta \,\|\, \mathbb{P}_{\theta'}) = \begin{cases} \sum_{x \in E} p_\theta(x) \log\dfrac{p_\theta(x)}{p_{\theta'}(x)} & E \text{ discrete} \\[10pt] \int_E f_\theta(x) \log\dfrac{f_\theta(x)}{f_{\theta'}(x)}\, dx & E \text{ continuous} \end{cases}\]

KL divergence measures how much information is lost when \(\mathbb{P}_{\theta'}\) is used to approximate the true distribution \(\mathbb{P}_\theta\); it is also called relative entropy.
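A minimal sketch of the discrete formula (illustrative PMFs, natural logarithm, so the result is in nats):

```python
# KL divergence between two discrete PMFs.
from math import log

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

def kl(p, q):
    """KL(p || q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * log(px / qx) for px, qx in zip(p, q) if px > 0)

print(kl(p, q))   # ≈ 0.275 nats
```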

Properties of KL Divergence

KL divergence is always non-negative and equals zero exactly when the two distributions coincide, but it is neither symmetric nor a metric. Because of the asymmetry, \(\text{KL}(\mathbb{P}\|\mathbb{Q})\) ("forward KL") and \(\text{KL}(\mathbb{Q}\|\mathbb{P})\) ("reverse KL") have different uses: forward KL is minimised in MLE, while reverse KL appears in variational inference.
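A quick illustration of the asymmetry with made-up PMFs; the forward and reverse divergences come out different:

```python
# KL is not symmetric: forward and reverse divergences generally differ.
from math import log

def kl(p, q):
    return sum(px * log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]

print(kl(p, q))   # forward KL(P || Q) ≈ 0.335 nats
print(kl(q, p))   # reverse KL(Q || P) ≈ 0.382 nats
```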

4. Connecting TV and KL

The two measures are related by Pinsker's inequality:

\[\text{TV}(\mathbb{P},\mathbb{Q}) \leq \sqrt{\frac{1}{2}\text{KL}(\mathbb{P}\,\|\,\mathbb{Q})}\]

So small KL divergence implies small TV distance. The converse does not hold: the KL divergence can be infinite even when the TV distance is small, for example when \(\mathbb{Q}\) assigns zero probability to an event that has small but positive probability under \(\mathbb{P}\).
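A short numerical check of Pinsker's inequality, plus a made-up example of the failed converse in which \(\mathbb{Q}\) puts zero mass on a point that \(\mathbb{P}\) does not:

```python
# Verify Pinsker's inequality numerically, and show the converse failing.
from math import log, sqrt, inf

def tv(p, q):
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

def kl(p, q):
    return sum(px * log(px / qx) if qx > 0 else inf for px, qx in zip(p, q) if px > 0)

p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]
print(tv(p, q), sqrt(0.5 * kl(p, q)))   # 0.4 <= 0.409...: Pinsker holds

# Converse fails: TV is tiny but KL is infinite, because q puts zero mass where p does not.
p2 = [0.99, 0.01]
q2 = [1.0, 0.0]
print(tv(p2, q2), kl(p2, q2))           # 0.01, inf
```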

|  | Total Variation | KL Divergence |
| --- | --- | --- |
| Symmetric | Yes | No |
| Metric | Yes | No |
| Bounded | [0, 1] | [0, ∞) |
| Main use | Hypothesis testing, model distinguishability | MLE, information theory, variational methods |

5. Why KL Appears in MLE

Maximum likelihood estimation minimises the KL divergence between the empirical distribution and the model. Maximising the log-likelihood:

\[\hat{\theta}_n = \arg\max_\theta \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)\]

is equivalent to minimising \(\text{KL}(\hat{\mathbb{P}}_n \,\|\, \mathbb{P}_\theta)\), where \(\hat{\mathbb{P}}_n\) is the empirical distribution. This is why MLE has such clean information-theoretic properties.
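Spelling out the equivalence for the discrete case (a standard two-line expansion, using the fact that \(\hat{\mathbb{P}}_n\) puts mass \(1/n\) on each observation):

\[\text{KL}(\hat{\mathbb{P}}_n \,\|\, \mathbb{P}_\theta) = \sum_{x} \hat{p}_n(x)\log \hat{p}_n(x) - \sum_{x} \hat{p}_n(x)\log p_\theta(x) = -H(\hat{\mathbb{P}}_n) - \frac{1}{n}\sum_{i=1}^n \log p_\theta(X_i)\]

The first term (the negative entropy of the empirical distribution) does not depend on \(\theta\), so minimising the KL divergence over \(\theta\) is exactly the same problem as maximising the average log-likelihood.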