Information Theory: Entropy & Channel Capacity
Shannon entropy, mutual information, and the noisy-channel coding theorem.
Shannon entropy of a discrete random variable $X$ with PMF $p(x)$:
$$H(X) = -\sum_x p(x) \log_2 p(x) \quad \text{(bits)}.$$Properties: $H(X) \geq 0$, with equality iff $X$ is deterministic; maximized at the uniform distribution on $|X|$ outcomes, giving $H = \log_2 |X|$. Joint and conditional entropy satisfy $H(X, Y) = H(X) + H(Y \mid X)$.
Mutual information:
$$I(X; Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X \mid Y) \geq 0.$$Symmetric, nonnegative, and zero iff $X \perp Y$.
Source coding theorem: $H(X)$ is the minimum average codeword length (in bits/symbol) for lossless compression of i.i.d. $X$. Noisy-channel coding theorem: the maximum reliable rate over a memoryless channel is
$$C = \max_{p(x)} I(X; Y) \quad \text{(channel capacity)}.$$For the binary symmetric channel with bit-flip probability $p$, $C_{\rm BSC} = 1 - H_2(p)$ where $H_2(p) = -p \log_2 p - (1-p)\log_2(1-p)$ is the binary entropy.