Week 3 · Lecture Notes

Essentials of Probability Theory

Amirhossein Taghvaei · University of Washington
Learning Objectives
  • Identify continuous and discrete random variables and their probability density functions.
  • Compute expectations and variance. Apply rules of expectations and variance for a linear function of random variables.
  • Use the Gaussian PDF to compute mean and variance. Compute the distribution of a linear function of a Gaussian random variable.
  • Apply the law of large numbers and central limit theorem to analyze empirical averages.
  • Compute the covariance matrix between two vector-valued random variables.

Section 1 · Motivating Example

Consider estimating a random variable $X$ generated by numpy.random.randn(). We do not observe $X$ directly, but observe a noisy version:

$$Y = X + W$$

where $W$ is also generated by numpy.random.randn(), independent of $X$. The goal is to form an estimate $\hat{X}$ given the observation $Y$.

Guiding Questions

1. What formula would you choose for $\hat{X}$? Is $\hat{X} = Y$ a good choice?

2. How do you evaluate the performance of your estimate?

3. If you restrict yourself to linear estimates $\hat{X} = kY$, what is the optimal $k$? Can a nonlinear estimate do better?

Answering these questions requires the framework of probability theory, developed in this lecture.
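
A quick simulation makes these questions concrete. The sketch below (the gains $k = 1$ and $k = 1/2$ are illustrative choices, not derived yet) compares two linear estimates by their empirical mean-squared error:

```python
import numpy as np

n = 10**6
x = np.random.randn(n)          # hidden signal X, standard Gaussian
w = np.random.randn(n)          # independent noise W, standard Gaussian
y = x + w                       # observation Y = X + W

# Compare linear estimates X_hat = k * Y via empirical mean-squared error
for k in [1.0, 0.5]:
    mse = np.mean((x - k * y) ** 2)
    print(f"k = {k}: empirical MSE = {mse:.3f}")
```

The run suggests that $\hat{X} = Y$ is not the best linear choice: halving the gain roughly halves the error. The tools developed below explain why.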

Section 2 · Discrete Random Variables

A discrete random variable $X$ is characterized by a set of possible outcomes $\{x_1, x_2, \ldots, x_K\}$ and associated probabilities $\{p_1, \ldots, p_K\}$ such that $P(X = x_k) = p_k$. The probabilities must satisfy:

  • $p_k \geq 0$ for all $k$.
  • $\sum_k p_k = 1$.

Intuitively, $P(X = x_k)$ is the fraction of times the outcome $x_k$ occurs over $N$ independent repetitions of the experiment, as $N \to \infty$.

Example 2.1 · Bernoulli

A Bernoulli random variable with parameter $p \in [0,1]$ has $K=2$, outcomes $\{0, 1\}$, and probabilities $P(X=0) = 1-p$, $P(X=1) = p$.
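
A minimal check of the frequency interpretation above, using the standard threshold construction $X = 1$ if $U \leq p$ (a sketch; the value of $p$ is illustrative):

```python
import numpy as np

p = 0.3                              # Bernoulli parameter (illustrative)
u = np.random.rand(10**5)            # U ~ Unif[0, 1]
x = (u <= p).astype(int)             # X = 1 if U <= p, else X = 0
print(np.mean(x == 1))               # fraction of ones, approx p = 0.3
print(np.mean(x == 0))               # fraction of zeros, approx 1 - p = 0.7
```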

Example 2.2 · Poisson

Let $X$ be the number of emails received in $T$ hours, where emails arrive at average rate $\lambda$ per hour. Then $X$ takes values in $\{0, 1, 2, \ldots\}$ with:

$$P(X = k) = e^{-\lambda T} \frac{(\lambda T)^k}{k!}$$

This is called the Poisson distribution with parameter $\lambda T$.
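
The formula can be checked against simulation (a sketch; it assumes numpy's built-in Poisson sampler, and the rate and horizon are illustrative):

```python
import numpy as np
from math import exp, factorial

lam, T = 4.0, 2.0                                 # 4 emails/hour over 2 hours
samples = np.random.poisson(lam * T, size=10**5)

for k in range(5):
    empirical = np.mean(samples == k)
    formula = exp(-lam * T) * (lam * T) ** k / factorial(k)
    print(f"k = {k}: empirical {empirical:.4f}, formula {formula:.4f}")
```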

Exercise 1

Show that $\sum_{k=0}^\infty P(X=k) = 1$ for the Poisson random variable.

Section 3 · Continuous Random Variables

A continuous random variable $X$ takes values in $\mathbb{R}$ and is characterized by a probability density function $p(x)$ satisfying:

$$P(X \in [a,b]) = \int_a^b p(x)\,dx$$

A valid density must have $p(x) \geq 0$ and $\int_{-\infty}^\infty p(x)\,dx = 1$. Note that $P(X = x) = 0$ for any single point — densities assign probability to intervals, not points.

Example 3.1 · Uniform

$X \sim \text{Unif}[\alpha, \beta]$ has density $p(x) = \frac{1}{\beta - \alpha}$ for $x \in [\alpha, \beta]$, zero otherwise. It can be generated from $U \sim \text{Unif}[0,1]$ via $X = (\beta - \alpha)U + \alpha$.
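
The generation recipe reads directly in code (a sketch with illustrative endpoints):

```python
import numpy as np

alpha, beta = -2.0, 5.0                # illustrative endpoints
u = np.random.rand(10**5)              # U ~ Unif[0, 1]
x = (beta - alpha) * u + alpha         # X ~ Unif[alpha, beta]
print(x.min(), x.max())                # all samples lie within [-2, 5]
```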

Example 3.2 · Exponential

$X$ has an exponential distribution with parameter $\lambda > 0$ if:

$$p(x) = \lambda e^{-\lambda x}, \quad x \geq 0$$

It models waiting times when the probability of an event does not change over time — e.g. lightbulb lifetimes, radioactive decay, or server request arrivals.
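
Although not stated above, a standard way to generate an exponential from a uniform is inverse-transform sampling, $X = -\ln(1-U)/\lambda$ (a sketch; the rate is illustrative):

```python
import numpy as np

lam = 0.5                              # illustrative rate
u = np.random.rand(10**5)              # U ~ Unif[0, 1]
x = -np.log(1.0 - u) / lam             # inverse-transform sampling
print(np.mean(x))                      # approx 1/lam = 2 (see Example 4.4)
```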

Section 4 · Expectation and Variance

The expectation is a probability-weighted average of outcomes:

$$E[X] = \begin{cases} \displaystyle\sum_{k=1}^K x_k p_k & \text{(discrete)} \\[1em] \displaystyle\int_{-\infty}^\infty x\, p(x)\,dx & \text{(continuous)} \end{cases}$$
Example 4.1 · Die Roll

$E[X] = \frac{1}{6}(1+2+3+4+5+6) = 3.5$

Example 4.2 · Poisson

$E[X] = \lambda T$ — the average number of emails expected after time $T$.

Example 4.3 · Uniform

$E[X] = \displaystyle\int_\alpha^\beta \frac{x}{\beta-\alpha}\,dx = \frac{\alpha+\beta}{2}$

Example 4.4 · Exponential

$E[X] = \displaystyle\int_0^\infty x\lambda e^{-\lambda x}\,dx = \frac{1}{\lambda}$ — the average waiting time when events occur at rate $\lambda$.

Remark 4.5

In this course we rarely compute expectations by direct integration. Instead, we simplify combinations of random variables to individual terms using the following rules.

Rules for Expectation
  • $E[\alpha X] = \alpha E[X]$ for any constant $\alpha \in \mathbb{R}$.
  • $E[X + Y] = E[X] + E[Y]$ for any two random variables $X, Y$.
  • $E[\alpha] = \alpha$ for any constant $\alpha$.

For any random variable $X$ and function $f$, the expectation of $f(X)$ is:

$$E[f(X)] = \begin{cases} \displaystyle\sum_k f(x_k)p_k & \text{(discrete)} \\[0.5em] \displaystyle\int_{-\infty}^\infty f(x)p(x)\,dx & \text{(continuous)} \end{cases}$$
Example 4.6

For an exponential random variable: $E[X^2] = \int_0^\infty x^2 \lambda e^{-\lambda x}\,dx = \frac{2}{\lambda^2}$.
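
A Monte Carlo check of this integral, reusing the inverse-transform sampler from Section 3 (a sketch):

```python
import numpy as np

lam = 0.5
x = -np.log(1.0 - np.random.rand(10**6)) / lam   # exponential samples
print(np.mean(x**2))                             # approx 2 / lam**2 = 8
```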

The variance measures spread in the outcomes:

$$\text{var}(X) = E[(X - E[X])^2]$$
Exercise 2

Use the rules of expectation to show $\text{var}(X) = E[X^2] - E[X]^2$.

Example 4.7 · Bernoulli

$\text{var}(X) = p - p^2 = p(1-p)$.

Example 4.8 · Exponential

$\text{var}(X) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}$.

Rules for Variance
  • $\text{var}(\alpha X) = \alpha^2 \text{var}(X)$ for any $\alpha \in \mathbb{R}$.
  • $\text{var}(X + \alpha) = \text{var}(X)$ for any constant $\alpha$.

Section 5 · Gaussian Random Variable

A standard Gaussian $Z \sim \mathcal{N}(0,1)$ has density:

$$p(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$$

with $E[Z] = 0$ and $\text{var}(Z) = 1$. More generally, $X \sim \mathcal{N}(\mu, \sigma^2)$ has density:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

with $E[X] = \mu$ and $\text{var}(X) = \sigma^2$. The parameter $\sigma > 0$ is the standard deviation.

A Gaussian remains Gaussian under linear transformations: if $X \sim \mathcal{N}(\mu, \sigma^2)$, then $Y = \alpha X + \beta \sim \mathcal{N}(\alpha\mu + \beta,\, \alpha^2\sigma^2)$.
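
This property is also the standard recipe for sampling: starting from $Z \sim \mathcal{N}(0,1)$, the transform $X = \mu + \sigma Z$ produces $\mathcal{N}(\mu, \sigma^2)$ (a sketch; the parameter values are illustrative):

```python
import numpy as np

mu, sigma = 3.0, 2.0
z = np.random.randn(10**6)          # Z ~ N(0, 1)
x = mu + sigma * z                  # X ~ N(mu, sigma^2) by the affine property
print(np.mean(x), np.var(x))        # approx 3.0 and 4.0
```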

Exercise 3

Use the rules of expectation and variance to compute the mean and variance of $Y = \alpha X + \beta$ when $X \sim \mathcal{N}(\mu, \sigma^2)$. Determine $\alpha$ and $\beta$ so that $Y \sim \mathcal{N}(0, 1)$.

Two useful confidence intervals for $X \sim \mathcal{N}(\mu, \sigma^2)$:

  • $P(|X - \mu| \leq \sigma) \approx 0.68$, i.e. $X \in [\mu - \sigma,\; \mu + \sigma]$ about 68% of the time.
  • $P(|X - \mu| \leq 2\sigma) \approx 0.95$, i.e. $X \in [\mu - 2\sigma,\; \mu + 2\sigma]$ about 95% of the time.

These confidence intervals are useful for verifying estimation algorithm performance.
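
The coverage levels are easy to verify empirically (a sketch, reusing the affine sampler above):

```python
import numpy as np

mu, sigma = 3.0, 2.0
x = mu + sigma * np.random.randn(10**6)
print(np.mean(np.abs(x - mu) <= sigma))        # approx 0.68
print(np.mean(np.abs(x - mu) <= 2 * sigma))    # approx 0.95
```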

Section 6 · Jointly Distributed Random Variables

For two random variables $X$ and $Y$, we associate a joint density $p_{X,Y}(x,y)$ such that:

$$P\bigl((X,Y) \in [a,b]\times[c,d]\bigr) = \int_a^b\int_c^d p_{X,Y}(x,y)\,dy\,dx$$

The marginal densities are recovered by integrating out the other variable:

$$p_X(x) = \int_{-\infty}^\infty p_{X,Y}(x,y)\,dy, \qquad p_Y(y) = \int_{-\infty}^\infty p_{X,Y}(x,y)\,dx$$
Example 6.1

Let $p(x,y) = x + y$ for $(x,y) \in [0,1]\times[0,1]$. Then:

$$E[X] = \frac{7}{12}, \quad E[X^2] = \frac{5}{12}, \quad E[XY] = \frac{1}{3}$$ $$p_X(x) = x + \frac{1}{2}, \quad p_Y(y) = y + \frac{1}{2}$$
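
These moments can be verified by Monte Carlo. One option, offered here as an aside, is rejection sampling: since $p(x,y) = x + y \leq 2$ on the unit square, draw $(x,y)$ uniformly and keep it with probability $(x+y)/2$:

```python
import numpy as np

n = 10**6
x, y = np.random.rand(n), np.random.rand(n)
keep = np.random.rand(n) <= (x + y) / 2.0    # accept with probability p(x,y)/2
x, y = x[keep], y[keep]

print(np.mean(x))        # approx 7/12 = 0.583
print(np.mean(x**2))     # approx 5/12 = 0.417
print(np.mean(x * y))    # approx 1/3  = 0.333
```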

Section 7 · Independent Random Variables

Two random variables $X$ and $Y$ are independent if knowing one provides no information about the other. Formally, $X \perp Y$ if and only if:

$$p_{X,Y}(x,y) = p_X(x)\, p_Y(y)$$

Independence implies that $f(X)$ and $g(Y)$ are also independent for any functions $f$ and $g$, and:

$$E[f(X)g(Y)] = E[f(X)]\, E[g(Y)]$$

Section 8 · Covariance, Correlation, and Geometry

The covariance between $X$ and $Y$ is:

$$\text{cov}(X,Y) = E[(X-E[X])(Y-E[Y])] = E[XY] - E[X]E[Y]$$

Covariance has a geometric interpretation as an inner product on the space of zero-mean random variables. The induced norm is $\|X\| = \sqrt{\text{var}(X)}$, and the correlation plays the role of the cosine of the angle between the centered variables:

$$\text{cor}(X,Y) = \frac{\text{cov}(X,Y)}{\sqrt{\text{var}(X)\,\text{var}(Y)}} \in [-1, 1]$$

Correlation equal to $\pm 1$ implies a deterministic affine relationship $Y = \alpha X + \beta$. Correlation zero means $X$ and $Y$ are uncorrelated. Note that:

$$\text{independence} \implies \text{uncorrelated} \quad \text{(but not conversely)}$$
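
A classical counterexample for the converse (an aside, not from the examples above): take $X \sim \mathcal{N}(0,1)$ and $Y = X^2$. They are clearly dependent, yet $\text{cov}(X, X^2) = E[X^3] = 0$:

```python
import numpy as np

x = np.random.randn(10**6)
y = x**2                            # fully determined by X, hence dependent
print(np.corrcoef(x, y)[0, 1])      # approx 0: uncorrelated nonetheless
```
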
Rules for Covariance
  • $\text{cov}(X,X) = \text{var}(X)$
  • $\text{cov}(X,Y) = \text{cov}(Y,X)$
  • $\text{cov}(X, \alpha Y) = \alpha\,\text{cov}(X,Y)$
  • $\text{cov}(X, Y+\alpha) = \text{cov}(X,Y)$
  • $\text{cov}(X, Y+Z) = \text{cov}(X,Y) + \text{cov}(X,Z)$
  • $\text{cov}(X,Y) = 0$ when $X$ and $Y$ are independent.
Example 8.1

For $p(x,y) = x+y$ on $[0,1]^2$, using $E[XY] = \frac{1}{3}$, $E[X] = E[Y] = \frac{7}{12}$, and $\text{var}(X) = \text{var}(Y) = \frac{5}{12} - \left(\frac{7}{12}\right)^2 = \frac{11}{144}$:

$$\text{cov}(X,Y) = \frac{1}{3} - \left(\frac{7}{12}\right)^2 = -\frac{1}{144}, \qquad \text{cor}(X,Y) = \frac{-1/144}{11/144} = -\frac{1}{11}$$

The negative correlation makes sense: conditioning on a large $x$ makes the density $p(x,y) = x+y$ nearly flat in $y$, whereas a small $x$ tilts it toward larger $y$ values, so $Y$ tends to be smaller when $X$ is large.
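
Reusing the rejection sampler from Example 6.1, the value $-1/11 \approx -0.091$ can be confirmed numerically (a sketch):

```python
import numpy as np

n = 10**6
x, y = np.random.rand(n), np.random.rand(n)
keep = np.random.rand(n) <= (x + y) / 2.0
x, y = x[keep], y[keep]
print(np.corrcoef(x, y)[0, 1])      # approx -1/11 = -0.0909
```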

Exercise 4

Show that $\text{var}(X+Y) = \text{var}(X) + \text{var}(Y)$ if and only if $\text{cov}(X,Y) = 0$.

Exercise 5

Assume $X$ and $Y$ are independent. Show that $\text{var}(X+Y) = \text{var}(X) + \text{var}(Y)$.

Exercise 6

Consider the motivating example with $Y = X + W$, $X \sim \mathcal{N}(\mu, \sigma^2)$, and $W \sim \mathcal{N}(0, r^2)$ independent of $X$. Find the formula for $\text{cor}(X,Y)$ and explain how it depends on $\mu$, $\sigma$, and $r$.

Section 9 · Vector-Valued Random Variables

Stacking $n$ random variables $X_1, \ldots, X_n$ gives a vector-valued random variable $X \in \mathbb{R}^n$. Its expectation is defined componentwise:

$$E[X] = \begin{bmatrix} E[X_1] \\ \vdots \\ E[X_n] \end{bmatrix}$$

For $X \in \mathbb{R}^n$ and $Y \in \mathbb{R}^m$, the $n \times m$ covariance matrix is:

$$\text{Cov}(X,Y) = E\bigl[(X - E[X])(Y - E[Y])^\top\bigr]$$

The $(i,j)$-th entry is $\text{cov}(X_i, Y_j)$. We write $\text{Cov}(X) := \text{Cov}(X,X)$.

Rules for Covariance Matrix
  • $\text{Cov}(X,Y) = \text{Cov}(Y,X)^\top$
  • $\text{Cov}(AX + b, Y) = A\,\text{Cov}(X,Y)$
  • $\text{Cov}(X, Y+Z) = \text{Cov}(X,Y) + \text{Cov}(X,Z)$
  • $\text{Cov}(X+Y) = \text{Cov}(X) + \text{Cov}(Y)$ whenever $\text{Cov}(X,Y) = 0$
  • $\text{Cov}(X,Y) = 0$ when $X$ and $Y$ are independent.

Section 10 · Gaussian Random Vectors

$X \sim \mathcal{N}(\mu, \Sigma)$ is a Gaussian random vector with mean $\mu \in \mathbb{R}^n$ and positive definite covariance $\Sigma \in \mathbb{R}^{n\times n}$ if:

$$p_X(x) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}}\, \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\right)$$

with $E[X] = \mu$ and $\text{Cov}(X) = \Sigma$. When $\Sigma = \text{diag}(\sigma_1^2, \ldots, \sigma_n^2)$, all components are independent, since $p_X(x) = \prod_i p_{X_i}(x_i)$.

Properties of Gaussian Vectors
  • If $X \sim \mathcal{N}(\mu, \Sigma)$ and $Y = AX + b$, then $Y \sim \mathcal{N}(A\mu + b,\; A\Sigma A^\top)$.
  • If $X \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $Y \sim \mathcal{N}(\mu_2, \Sigma_2)$ are independent, then $X + Y \sim \mathcal{N}(\mu_1+\mu_2,\; \Sigma_1+\Sigma_2)$.
  • Two Gaussian vectors $X$ and $Y$ are independent if and only if $\text{Cov}(X,Y) = 0$.
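
The first property above yields the standard recipe for sampling a Gaussian vector: factor $\Sigma = LL^\top$ by Cholesky and set $X = \mu + LZ$ with $Z \sim \mathcal{N}(0, I)$ (a sketch; the particular $\mu$ and $\Sigma$ are illustrative):

```python
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])           # positive definite
L = np.linalg.cholesky(Sigma)            # Sigma = L @ L.T

z = np.random.randn(10**5, 2)            # each row is Z ~ N(0, I)
x = mu + z @ L.T                         # each row is X ~ N(mu, Sigma)

print(x.mean(axis=0))                    # approx mu
print(np.cov(x.T))                       # approx Sigma
```
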
Example 10.1 · Observation Model

Let $X \sim \mathcal{N}(\mu, \Sigma)$ and $Y = HX + W$ with $W \sim \mathcal{N}(0, R)$ independent of $X$. Then:

$$\text{Cov}(X, Y) = \text{Cov}(X, HX) + \text{Cov}(X, W) = \text{Cov}(X,X)\,H^\top + 0 = \Sigma H^\top$$ $$\text{Cov}(Y, Y) = H\Sigma H^\top + R$$

These quantities will be central to the optimal estimation formulas developed later in the course.
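
A numerical sanity check of these formulas, building on the Cholesky sampler above ($H$ and $R$ are illustrative choices):

```python
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
H = np.array([[1.0, 0.0]])                   # observe the first component
R = np.array([[0.25]])

n = 10**5
x = mu + np.random.randn(n, 2) @ np.linalg.cholesky(Sigma).T
w = np.random.randn(n, 1) @ np.linalg.cholesky(R).T
y = x @ H.T + w                              # rows satisfy Y = HX + W

xc, yc = x - x.mean(axis=0), y - y.mean(axis=0)
print(xc.T @ yc / (n - 1))                   # approx Sigma @ H.T = [[2.0], [0.5]]
print(np.cov(y.T))                           # approx H Sigma H^T + R = 2.25
```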

Exercise 7

Let $X \sim \mathcal{N}(\mu, \Sigma)$ be an $n$-dimensional Gaussian. Find $A$ and $b$ such that $Y = AX + b \sim \mathcal{N}(0, I)$.

Section 11 · Central Limit Theorem and Law of Large Numbers

Let $\{X_1, \ldots, X_N\}$ be $N$ independent copies of a random variable $X$ with $E[X] = \mu$ and $\text{var}(X) = \sigma^2$. Define the empirical average $Y_N = \frac{1}{N}\sum_{i=1}^N X_i$.

Computing the mean-squared error of $Y_N$ as an estimate of $\mu$, using $E[Y_N] = \mu$ and the independence of the $X_i$:

$$E\bigl[(Y_N - \mu)^2\bigr] = \frac{1}{N^2}\sum_{i=1}^N \text{var}(X_i) = \frac{\sigma^2}{N}$$
Law of Large Numbers

As $N \to \infty$, the empirical average converges to the true mean:

$$\lim_{N\to\infty} Y_N = \mu \quad \text{(with probability one)}$$
Central Limit Theorem

The normalized empirical average converges in distribution to a standard Gaussian:

$$\frac{\sqrt{N}}{\sigma}(Y_N - \mu) \xrightarrow{d} \mathcal{N}(0,1) \quad \text{as } N \to \infty$$

This holds regardless of the distribution of the $X_i$, provided the variance is finite. It is the fundamental reason why Gaussian distributions arise naturally: uncertainties are typically sums of many small independent contributions.
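
Both statements are easy to probe numerically (a sketch using uniform samples, whose distribution is deliberately non-Gaussian; $N$ and the sample counts are illustrative):

```python
import numpy as np

N = 1000                                     # terms per empirical average
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)         # mean and std of Unif[0, 1]

# Law of large numbers: one long empirical average approaches mu
print(np.random.rand(10**6).mean())          # approx 0.5

# Central limit theorem: many normalized averages look standard Gaussian
y = np.random.rand(10**4, N).mean(axis=1)    # 10^4 realizations of Y_N
z = np.sqrt(N) / sigma * (y - mu)
print(z.mean(), z.std())                     # approx 0 and 1
print(np.mean(np.abs(z) <= 1.0))             # approx 0.68, matching N(0, 1)
```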

Section 12 · Programming Exercise

The goal is to become familiar with generating random variables, computing expectations, and verifying the central limit theorem. Throughout, use only numpy.random.rand() for random number generation (which produces $U \sim \text{Unif}[0,1]$).

Deliverables
  (a) Let $X$ be a biased coin toss with $P(\text{heads}) = 2 \cdot P(\text{tails})$. Assign 1 to heads, 0 to tails. Express $X$ in terms of $U \sim \text{Unif}[0,1]$ in the form $X = a$ if $U \leq c$, else $X = b$. Determine $a$, $b$, and $c$.
  (b) Define $Y_n = \sum_{i=1}^n X_i$ (the number of heads in $n$ tosses). Generate $N = 10^4$ realizations of $Y_n$. Plot histograms for $n = 1, 2, 10, 100$.
  (c) Derive formulas for $E[Y_n]$ and $\text{var}(Y_n)$ in terms of $n$. Compare with the numerical estimates $\frac{1}{N}\sum_i Y_n^i$ and $\frac{1}{N}\sum_i (Y_n^i - \bar{Y}_n)^2$.
  (d) Express $Z_n$ as a function of $Y_n$ such that $Z_n \approx \mathcal{N}(0,1)$ for large $n$. Verify by generating $Z_n$ for $n=100$, plotting a histogram, and overlaying the $\mathcal{N}(0,1)$ density.