Information Theory

Information theory provides mathematical tools for quantifying information, uncertainty, and the relationships between random variables. Originally developed for communication systems, it's now fundamental to machine learning.

The Big Picture

Information theory answers questions like:

  • How much uncertainty is in a distribution?
  • How different are two distributions?
  • How much does knowing X tell us about Y?

These concepts are central to understanding loss functions, model evaluation, and feature selection.


Entropy

Definition

Entropy measures the uncertainty or "unpredictability" of a random variable:

$$H(X) = -\sum_x p(x) \log p(x) = -\mathbb{E}[\log p(X)]$$

Intuition: How many bits do we need, on average, to encode samples from this distribution?

Key Properties

  • Non-negative: $H(X) \geq 0$
  • Maximum for uniform distribution: If all outcomes equally likely, uncertainty is maximized
  • Minimum for deterministic: $H(X) = 0$ when the outcome is certain (a point mass on a single value)

Examples

Fair coin: $H = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2} = 1$ bit

Biased coin (p=0.9): $H = -0.9\log 0.9 - 0.1\log 0.1 \approx 0.47$ bits

More predictable → lower entropy!
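
These numbers are easy to reproduce. The sketch below uses a hypothetical `entropy` helper (not any particular library's function) to check all three cases:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution, skipping zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))   # fair coin      -> 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin    -> ~0.469 bits
print(entropy([1.0, 0.0]))   # deterministic  -> 0.0 bits
```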


Cross-Entropy

Definition

Cross-entropy measures the average number of bits needed to encode data from distribution p using a code optimized for distribution q:

$$H(p, q) = -\sum_x p(x) \log q(x)$$

Intuition: How well does q approximate p?

Key Properties

  • $H(p, q) \geq H(p)$, with equality iff p = q
  • Cross-entropy is what we minimize in classification!

Connection to Machine Learning

When training a classifier:

  • p = true distribution (one-hot labels)
  • q = predicted distribution (softmax outputs)

Cross-entropy loss: $$\mathcal{L} = -\sum_c y_c \log \hat{y}_c$$

For one-hot labels, this simplifies to: $-\log \hat{y}_{true}$
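
A minimal sketch of this loss for a single example, assuming raw logits and an integer class label (the helper names are illustrative, not any particular framework's API):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def cross_entropy_loss(logits, true_class):
    """-log q(true class): cross-entropy between a one-hot label and the softmax output."""
    q = softmax(logits)
    return -np.log(q[true_class])

logits = np.array([2.0, 0.5, -1.0])   # model scores for 3 classes
print(cross_entropy_loss(logits, 0))  # small loss: correct class has the largest logit
print(cross_entropy_loss(logits, 2))  # large loss: correct class has the smallest logit
```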


Joint and Conditional Entropy

Joint Entropy

Uncertainty about both X and Y together: $$H(X, Y) = -\sum_{x,y} p(x, y) \log p(x, y)$$

Conditional Entropy

Remaining uncertainty about Y after observing X: $$H(Y | X) = \sum_x p(x) H(Y | X = x) = -\sum_{x,y} p(x,y) \log p(y|x)$$

Chain Rule

$$H(X, Y) = H(X) + H(Y | X)$$

Intuition: Total uncertainty = uncertainty in X + remaining uncertainty in Y given X.
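
The chain rule can be checked numerically on a small joint table; the sketch below uses a made-up 2x2 joint distribution:

```python
import numpy as np

# Made-up joint distribution p(x, y); rows index X, columns index Y.
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])

def H(p):
    """Entropy in bits of any array of probabilities summing to 1."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

p_x = p_xy.sum(axis=1)                        # marginal p(x)
H_joint = H(p_xy)                             # H(X, Y)
H_x = H(p_x)                                  # H(X)

# H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), with p(y|x) = p(x,y) / p(x)
p_y_given_x = p_xy / p_x[:, None]
H_y_given_x = -np.sum(p_xy * np.log2(p_y_given_x))

print(H_joint, H_x + H_y_given_x)             # the two values match
```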


Perplexity

Definition

Perplexity is the exponentiated cross-entropy (base 2 if $H$ is measured in bits, base $e$ if in nats): $$\text{Perplexity}(p, q) = 2^{H(p, q)}$$

Or, for a sequence of N tokens scored by a model $q$: $$\text{Perplexity} = \sqrt[N]{\prod_{i=1}^N \frac{1}{q(x_i \mid x_{<i})}}$$

Interpretation: The weighted average number of choices (branching factor) the model is uncertain between.

Use in Language Models

  • Lower perplexity = better model
  • Perplexity of 1 = perfect prediction
  • Perplexity of V (vocabulary size) = uniform random guessing over the vocabulary

Example: Perplexity of 50 means the model is, on average, choosing between 50 equally likely next words.
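
A sketch of how perplexity is computed in practice from a model's per-token probabilities (the probabilities below are made up):

```python
import numpy as np

# Hypothetical model probabilities q(x_i | x_<i) assigned to each observed token.
token_probs = np.array([0.2, 0.05, 0.5, 0.1])

# Perplexity = exp(mean negative log-likelihood per token).
nll_per_token = -np.mean(np.log(token_probs))
perplexity = np.exp(nll_per_token)

# Equivalent form: N-th root of the product of inverse probabilities.
perplexity_alt = np.prod(1.0 / token_probs) ** (1.0 / len(token_probs))

print(perplexity, perplexity_alt)  # identical up to floating-point error
# A uniform model over a vocabulary of size V would score perplexity exactly V.
```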


KL Divergence

Definition

Kullback-Leibler divergence measures how different distribution q is from distribution p:

$$D_{KL}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$$

Intuition: Extra bits needed when using code for q to encode data from p.

Key Properties

  • Non-negative: $D_{KL}(p \| q) \geq 0$ (Gibbs' inequality)
  • Zero iff p = q: Perfect match
  • Asymmetric: $D_{KL}(p \| q) \neq D_{KL}(q \| p)$ in general
  • Not a true distance (asymmetric, no triangle inequality)
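
A small sketch of the definition and its asymmetry, using two made-up three-outcome distributions:

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q))  # D_KL(p || q)
print(kl(q, p))  # D_KL(q || p): a different number in general
print(kl(p, p))  # 0.0: identical distributions
```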

Connection to MLE

When we minimize NLL, we're minimizing: $$\text{NLL} = -\frac{1}{N}\sum_i \log q(x_i)$$

If samples come from empirical distribution $\hat{p}$: $$\text{NLL} = H(\hat{p}, q) = H(\hat{p}) + D_{KL}(\hat{p} \| q)$$

Since $H(\hat{p})$ is constant, minimizing NLL = minimizing KL divergence!
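
A numerical sketch of this decomposition (made-up samples and model distribution, natural log throughout): the average NLL equals $H(\hat{p}) + D_{KL}(\hat{p} \| q)$ exactly.

```python
import numpy as np

samples = np.array([0, 0, 1, 2, 0, 1])           # made-up categorical data over 3 classes
q = np.array([0.6, 0.3, 0.1])                    # model's predicted distribution

# Empirical distribution p_hat from the samples.
p_hat = np.bincount(samples, minlength=3) / len(samples)

nll = -np.mean(np.log(q[samples]))               # average negative log-likelihood
H_p_hat = -np.sum(p_hat[p_hat > 0] * np.log(p_hat[p_hat > 0]))
kl = np.sum(p_hat[p_hat > 0] * np.log(p_hat[p_hat > 0] / q[p_hat > 0]))

print(nll, H_p_hat + kl)                         # the two values match
```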

Forward vs. Reverse KL

Forward KL $D_{KL}(p \| q)$: p is the reference

  • Mode-covering: q tries to cover all of p's mass
  • Penalizes q for missing modes of p

Reverse KL $D_{KL}(q \| p)$: q is the reference

  • Mode-seeking: q concentrates on modes of p
  • Okay to miss some modes, but penalizes placing mass where p has none
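
A discrete sketch of the two behaviors: $p$ is bimodal and $q$ concentrates on one mode. The forward KL is infinite because $q$ assigns zero mass to a mode of $p$; the reverse KL stays finite because $q$ only places mass where $p$ does.

```python
import numpy as np

def kl(a, b):
    """D_KL(a || b) in nats; returns inf if b = 0 somewhere a > 0."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    mask = a > 0
    if np.any(b[mask] == 0):
        return np.inf
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

p = np.array([0.48, 0.04, 0.48])   # bimodal "true" distribution
q = np.array([0.90, 0.10, 0.00])   # approximation concentrated on the first mode

print(kl(p, q))   # forward KL: infinite, q misses a mode of p
print(kl(q, p))   # reverse KL: finite, q stays inside p's support
```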

Mutual Information

Definition

How much does knowing X tell us about Y?

$$I(X; Y) = D_{KL}\big(p(x, y) \,\|\, p(x)p(y)\big)$$

Or equivalently: $$I(X; Y) = H(X) - H(X | Y) = H(Y) - H(Y | X)$$

Intuition: Reduction in uncertainty about X from knowing Y.

Key Properties

  • Symmetric: $I(X; Y) = I(Y; X)$
  • Non-negative: $I(X; Y) \geq 0$
  • Zero iff independent: $I(X; Y) = 0 \Leftrightarrow X \perp Y$

As Generalized Correlation

Mutual information captures any dependence (not just linear), making it more general than Pearson correlation.
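
A small sketch of this point: below, $Y = X^2$ has zero Pearson correlation with $X$ but high mutual information (the sample values are made up).

```python
import numpy as np
from collections import Counter

# Y = X^2 is a deterministic but nonlinear function of X: the Pearson correlation
# is exactly 0, yet the mutual information is large.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2
print(np.corrcoef(x, y)[0, 1])           # 0.0: no linear relationship

# Estimate I(X;Y) in bits from the empirical joint distribution of the samples.
n = len(x)
n_xy = Counter(zip(x.tolist(), y.tolist()))
n_x = Counter(x.tolist())
n_y = Counter(y.tolist())

mi = sum((c / n) * np.log2((c / n) / ((n_x[a] / n) * (n_y[b] / n)))
         for (a, b), c in n_xy.items())
print(mi)                                # ~1.52 bits: strong (nonlinear) dependence
```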

Data Processing Inequality

If X → Y → Z forms a Markov chain: $$I(X; Z) \leq I(X; Y)$$

Implication: Processing cannot increase information. You can only lose information through transformations!
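
A sketch of the inequality for a simple chain: $Y$ is a noisy copy of $X$ and $Z$ is a noisy copy of $Y$ (two binary symmetric channels with flip probability 0.1, a made-up setup).

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in bits from a joint probability table p(x, y)."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask]))

# X -> Y -> Z: two binary symmetric channels in series (flip probability 0.1 each).
p_x = np.array([0.5, 0.5])
flip = np.array([[0.9, 0.1],
                 [0.1, 0.9]])             # p(y|x), and also p(z|y)

p_xy = p_x[:, None] * flip                # joint p(x, y)
p_xz = p_xy @ flip                        # joint p(x, z) = sum_y p(x, y) p(z|y)

print(mutual_information(p_xy))           # I(X; Y)
print(mutual_information(p_xz))           # I(X; Z): smaller, as the inequality predicts
```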


Fano's Inequality

Statement

For any estimator $\hat{X}$ of X based on Y: $$H(X \mid Y) \leq H_b(P_e) + P_e \log(|\mathcal{X}| - 1)$$

Where $P_e = P(\hat{X} \neq X)$ is the error probability, $H_b$ is the binary entropy function, and $|\mathcal{X}|$ is the number of possible values of X.

Implications

  • Lower bound on error: If $H(X|Y)$ is high, error must be high
  • Feature selection: Features with high mutual information with the target reduce $H(\text{target} \mid \text{features})$, lowering the achievable classification error
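
A sketch of how the bound can be used: given $H(X \mid Y)$ and the number of classes, a simple grid search finds the smallest error probability consistent with Fano's inequality (the numbers are illustrative).

```python
import numpy as np

def binary_entropy(p):
    """H_b(p) in bits, with the convention 0 log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def fano_error_lower_bound(H_x_given_y, num_classes):
    """Smallest P_e consistent with Fano's inequality (grid-search sketch)."""
    for p_e in np.linspace(0.0, 1.0, 10001):
        if binary_entropy(p_e) + p_e * np.log2(num_classes - 1) >= H_x_given_y:
            return p_e
    return 1.0

# With 10 classes and 2 bits of residual uncertainty, no classifier can do better than this:
print(fano_error_lower_bound(2.0, 10))   # ~0.34
```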

Applications in ML

Loss Functions

Cross-entropy loss minimizes KL divergence between true and predicted distributions.

Variational Inference

Approximate intractable posterior by minimizing KL divergence.

Information Bottleneck

Find representations that maximally compress input while retaining relevant information about the output.

Data Augmentation

Spreads the training distribution's probability mass over a larger region of input space, acting as a regularizer that reduces overfitting.


Summary

| Concept | Formula | Meaning |
| --- | --- | --- |
| Entropy | $H(X) = -\mathbb{E}[\log p(X)]$ | Uncertainty in X |
| Cross-Entropy | $H(p, q) = -\mathbb{E}_p[\log q]$ | Bits to encode p using q |
| KL Divergence | $D_{KL}(p \Vert q) = H(p,q) - H(p)$ | Extra bits; difference between distributions |
| Mutual Information | $I(X;Y) = H(X) - H(X \mid Y)$ | Information shared between X and Y |
| Perplexity | $2^{H(p,q)}$ | Effective vocabulary size |