Measuring dependence, conditional expectation as a random variable, and the law of total variance
\[\text{Cov}(X,Y) = \mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\]
Covariance measures how \(X\) and \(Y\) vary together. Positive: they tend to move in the same direction. Negative: opposite directions. Zero: uncorrelated (but not necessarily independent).
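The identity \(\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y]\) can be sanity-checked by simulation. A minimal sketch with NumPy; the construction \(Y = X + \text{noise}\) (which forces \(\text{Cov}(X,Y) = \text{Var}(X) = 1\)), the sample size, and the seed are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Dependent pair: Y = X + independent noise, so Cov(X, Y) = Var(X) = 1.
x = rng.standard_normal(n)
y = x + rng.standard_normal(n)

# Cov(X, Y) = E[XY] - E[X] E[Y], estimated from the sample.
cov_xy = np.mean(x * y) - x.mean() * y.mean()
```

With 100,000 samples the estimate lands close to the exact value 1.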
For \(n\) random variables (identical distribution, same pairwise covariance):
\[\text{Var}(X_1 + \cdots + X_n) = n\,\text{Var}(X_1) + 2\binom{n}{2}\text{Cov}(X_1, X_2)\]
\[\text{Corr}(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}\]
Correlation is a dimensionless, scale-invariant measure of linear dependence: \(-1 \leq \text{Corr}(X,Y) \leq 1\). \(\text{Corr} = \pm 1\) iff one is an exact linear function of the other. For any constants \(a,b,c,d\) with \(ac > 0\): \(\text{Corr}(aX+b, cY+d) = \text{Corr}(X,Y)\).
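Both the variance-of-a-sum formula and the affine invariance of correlation can be checked numerically. A sketch using equicorrelated normals (the dimension \(n = 5\), common covariance \(\rho = 0.3\), affine coefficients, and seed are all assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, rho, trials = 5, 0.3, 200_000

# n identically distributed standard normals with common pairwise
# covariance: Var(X_i) = 1, Cov(X_i, X_j) = rho for i != j.
cov = np.full((n, n), rho)
np.fill_diagonal(cov, 1.0)
samples = rng.multivariate_normal(np.zeros(n), cov, size=trials)

# Var(X_1 + ... + X_n) = n Var(X_1) + 2 C(n,2) Cov(X_1, X_2)
sum_var = samples.sum(axis=1).var()
predicted = n * 1.0 + 2 * (n * (n - 1) // 2) * rho  # 5 + 20 * 0.3 = 11

# Correlation is unchanged by affine maps aX + b, cY + d with a, c > 0.
x, y = samples[:, 0], samples[:, 1]
r = np.corrcoef(x, y)[0, 1]
r_scaled = np.corrcoef(3 * x + 5, 0.5 * y - 2)[0, 1]
```

The affine invariance holds exactly (up to floating-point rounding), while the variance check matches the formula up to Monte Carlo error.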
When we condition on a random variable \(Y\) rather than a fixed value, \(\mathbb{E}[X \mid Y]\) is itself a random variable — a function of \(Y\). Write \(\mathbb{E}[X \mid Y] = g(Y)\) where \(g(y) = \mathbb{E}[X \mid Y=y]\).
\[\mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X \mid Y]]\]
This is the law of total expectation (iterated expectations): the expected value of the conditional expectation equals the unconditional expectation. The outer \(\mathbb{E}\) averages over the randomness in \(Y\).
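A quick simulation makes the law concrete. A sketch under an assumed two-stage model (a fair die roll \(Y\) followed by \(X \sim \text{Binomial}(Y, 0.5)\), so \(g(Y) = \mathbb{E}[X \mid Y] = Y/2\); the example and seed are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Stage 1: Y is a fair die roll. Stage 2: X | Y ~ Binomial(Y, 0.5),
# so the conditional expectation is the random variable g(Y) = Y / 2.
y = rng.integers(1, 7, size=n)
x = rng.binomial(y, 0.5)

e_x = x.mean()          # direct estimate of E[X]
e_gy = (y / 2).mean()   # estimate of E[E[X | Y]] = E[g(Y)]
```

Both estimates converge to \(\mathbb{E}[Y]/2 = 3.5/2 = 1.75\), as the law predicts.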
\[\text{Var}(X) = \mathbb{E}[\text{Var}(X \mid Y)] + \text{Var}(\mathbb{E}[X \mid Y])\]
The law of total variance splits \(\text{Var}(X)\) into two parts: \(\mathbb{E}[\text{Var}(X \mid Y)]\), the average spread of \(X\) remaining within each group \(Y = y\), and \(\text{Var}(\mathbb{E}[X \mid Y])\), the spread of the group means across groups. This is exactly the same within-group/between-group decomposition as ANOVA in statistics.
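The decomposition can be verified numerically whenever both conditional moments are known in closed form. A sketch under an assumed model (die roll \(Y\), then \(X \sim \text{Binomial}(Y, 0.5)\), giving \(\mathbb{E}[X \mid Y] = Y/2\) and \(\text{Var}(X \mid Y) = Y/4\); the example and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300_000

# Given a fair die roll Y, X ~ Binomial(Y, 0.5):
#   E[X | Y] = Y / 2  and  Var(X | Y) = Y * 0.5 * 0.5 = Y / 4.
y = rng.integers(1, 7, size=n)
x = rng.binomial(y, 0.5)

total = x.var()            # Var(X)
within = (y / 4).mean()    # E[Var(X | Y)]  (average within-group spread)
between = (y / 2).var()    # Var(E[X | Y])  (spread of the group means)
```

The two terms sum to the total variance up to Monte Carlo error (exactly, \(0.875 + 35/48 \approx 1.604\)).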
Let \(Y = X_1 + \cdots + X_N\) where \(N\) is itself a random variable independent of the i.i.d. \(X_i\). Then:
\[\mathbb{E}[Y] = \mathbb{E}[N]\,\mathbb{E}[X]\]
\[\text{Var}(Y) = \mathbb{E}[N]\,\text{Var}(X) + (\mathbb{E}[X])^2\,\text{Var}(N)\]
Derived by conditioning on \(N\) and applying the laws of total expectation and total variance.
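Both identities for the random sum can be checked by simulation. A sketch with assumed distributions (\(N \sim \text{Poisson}(4)\), \(X_i\) exponential with mean 2, so \(\mathbb{E}[Y] = 8\) and \(\text{Var}(Y) = 4\cdot4 + 4\cdot4 = 32\); the choices and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
trials = 200_000
lam, mean_x = 4.0, 2.0  # N ~ Poisson(4), X_i ~ Exponential(mean 2)

n_counts = rng.poisson(lam, size=trials)

# A sum of N i.i.d. Exponential(mean_x) variables is Gamma(N, scale=mean_x);
# guard N = 0 by sampling with shape 1 and masking the result back to 0.
shape = np.where(n_counts > 0, n_counts, 1)
y = np.where(n_counts > 0, rng.gamma(shape, scale=mean_x), 0.0)

mean_y = y.mean()  # E[N] E[X] = 4 * 2 = 8
var_y = y.var()    # E[N] Var(X) + (E[X])^2 Var(N) = 16 + 16 = 32
```

The Gamma shortcut avoids an explicit loop over trials; a direct loop summing `n_counts[i]` exponential draws gives the same distribution.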
Cauchy–Schwarz inequality: \(|\mathbb{E}[XY]| \leq \sqrt{\mathbb{E}[X^2]\,\mathbb{E}[Y^2]}\). Applied to the centered variables \(X - \mathbb{E}[X]\) and \(Y - \mathbb{E}[Y]\), it yields \(|\text{Corr}(X,Y)| \leq 1\).
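A quick numerical check of the inequality; the dependent pair below and the seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.standard_normal(100_000)
y = 0.5 * x + rng.standard_normal(100_000)

# |E[XY]| <= sqrt(E[X^2] E[Y^2])
lhs = abs(np.mean(x * y))
rhs = np.sqrt(np.mean(x**2) * np.mean(y**2))

# The centered version bounds the correlation coefficient.
r = np.corrcoef(x, y)[0, 1]
```

Here `lhs` is about 0.5 while `rhs` is about 1.12, and \(|r| \leq 1\) as required.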
Markov's inequality: for \(X \geq 0\) and \(a > 0\), \(\mathbb{P}(X \geq a) \leq \mathbb{E}[X]/a\).
Chebyshev's inequality: \(\mathbb{P}(|X - \mu| \geq c) \leq \sigma^2/c^2\) for any \(c > 0\), where \(\mathbb{E}[X]=\mu\) and \(\text{Var}(X)=\sigma^2\).
Chebyshev follows from Markov applied to the nonnegative variable \((X-\mu)^2\) with threshold \(c^2\). Both bounds are distribution-free: they hold for every distribution with the required moments, which is exactly why they are usually loose for any particular one.
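Both bounds, and their looseness, are easy to see numerically. A sketch using an exponential variable (\(\mathbb{E}[X] = \text{Var}(X) = 1\), \(X \geq 0\); the thresholds and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
# Exponential with mean 1: E[X] = 1, Var(X) = 1, and X >= 0.
x = rng.exponential(scale=1.0, size=500_000)

a, c = 3.0, 2.0

# Markov: P(X >= a) <= E[X] / a.  True tail is e^{-3} ~ 0.05; bound is ~ 1/3.
p_tail = (x >= a).mean()
markov = x.mean() / a

# Chebyshev: P(|X - mu| >= c) <= sigma^2 / c^2.  Bound is ~ 0.25.
p_dev = (np.abs(x - 1.0) >= c).mean()
cheby = x.var() / c**2
```

Both empirical probabilities sit well below their bounds, illustrating how conservative distribution-free inequalities are for a specific distribution.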