N-Grams and Language Models
Language models assign probabilities to sequences of words. They answer: "How likely is this sentence?" This fundamental capability underlies spell checking, machine translation, speech recognition, and text generation.
The Big Picture
Key Question: What's the probability of a word sequence?
$$P(\text{"the cat sat on the mat"})$$
Why This Matters:
- Spell checking: P("the cat") > P("the kat")
- Machine translation: Choose more fluent translation
- Speech recognition: Distinguish "recognize speech" from "wreck a nice beach"
- Text generation: Sample likely continuations
Language Model Fundamentals
Joint Probability of Words
We want to compute: $$P(w_1, w_2, ..., w_n)$$
Using the chain rule of probability: $$P(w_1, w_2, ..., w_n) = P(w_1) \times P(w_2|w_1) \times P(w_3|w_1,w_2) \times ... \times P(w_n|w_1,...,w_{n-1})$$
More compactly: $$P(w_1, w_2, ..., w_n) = \prod_{i=1}^{n} P(w_i | w_{1:i-1})$$
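For example, applying the chain rule to a three-word sentence:

$$P(\text{the cat sat}) = P(\text{the}) \times P(\text{cat} \mid \text{the}) \times P(\text{sat} \mid \text{the cat})$$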
Problem: Conditioning on the full history requires a separate parameter for every possible word sequence, and no corpus is large enough to estimate them all.
The Markov Assumption
Key insight: Approximate by using only recent history.
For a bigram model (2-gram): $$P(w_n | w_1, w_2, ..., w_{n-1}) \approx P(w_n | w_{n-1})$$
Assumption: The next word depends only on the previous word.
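Under the bigram assumption, the same sentence factors as:

$$P(\text{the cat sat}) \approx P(\text{the}) \times P(\text{cat} \mid \text{the}) \times P(\text{sat} \mid \text{cat})$$

(Sentence-boundary tokens, introduced below, refine this further.)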
N-Gram Models
| Model | Conditioning Context | Example |
|---|---|---|
| Unigram | None | P(the) |
| Bigram | Previous 1 word | P(cat \| the) |
| Trigram | Previous 2 words | P(sat \| the cat) |
| 4-gram | Previous 3 words | P(on \| the cat sat) |
Trade-off: Larger n = better context but more parameters and sparser data.
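As a concrete illustration, here is a minimal sketch of extracting n-grams from a tokenized sentence (the function name `extract_ngrams` is ours, not from any particular library):

```python
def extract_ngrams(tokens, n):
    """Return all n-grams in a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "cat", "sat", "on", "the", "mat"]
print(extract_ngrams(tokens, 2))
# [('the', 'cat'), ('cat', 'sat'), ('sat', 'on'), ('on', 'the'), ('the', 'mat')]
```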
Estimating N-Gram Probabilities
Maximum Likelihood Estimation
Use relative frequency (counting):
Bigram probability: $$P(w_n | w_{n-1}) = \frac{\text{count}(w_{n-1}, w_n)}{\text{count}(w_{n-1})}$$
Example: Computing P(sat | the) from a corpus:
- Count "the sat" occurrences: 100
- Count "the ___" occurrences: 10,000
- P(sat | the) = 100/10,000 = 0.01
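A minimal sketch of this counting approach in Python (the helper `bigram_prob` is illustrative, and recounts the corpus on every call for clarity):

```python
from collections import Counter

def bigram_prob(corpus_tokens, prev_word, word):
    """MLE bigram probability: count(prev, word) / count(prev)."""
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    unigram_counts = Counter(corpus_tokens)
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

tokens = "the cat sat on the mat".split()
print(bigram_prob(tokens, "the", "cat"))  # 1 occurrence of "the cat" / 2 of "the" = 0.5
```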
Handling Sentence Boundaries
Add special tokens:
- <s>: Beginning of sentence (BOS)
- </s>: End of sentence (EOS)
Example: "<s> the cat sat </s>"
This allows us to model:
- P(the | <s>) — how likely is "the" to start a sentence?
- P(</s> | sat) — how likely is "sat" to end a sentence?
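A small sketch of adding these boundary tokens before counting (the helper name is illustrative):

```python
def pad_sentence(tokens, bos="<s>", eos="</s>"):
    """Wrap a tokenized sentence in beginning/end-of-sentence markers."""
    return [bos] + tokens + [eos]

print(pad_sentence("the cat sat".split()))
# ['<s>', 'the', 'cat', 'sat', '</s>']
```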
Practical Note: Log Probabilities
Multiplying many small probabilities → numerical underflow!
Solution: Work in log space: $$\log(p_1 \times p_2) = \log(p_1) + \log(p_2)$$
Add log probabilities instead of multiplying probabilities.
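A quick illustration of scoring a sequence in log space (the four probabilities are made-up values, not from a real model):

```python
import math

probs = [0.2, 0.01, 0.05, 0.1]  # hypothetical per-word probabilities

log_prob = sum(math.log(p) for p in probs)  # add logs instead of multiplying
print(log_prob)            # ≈ -11.51
print(math.exp(log_prob))  # recovers the product, ≈ 1e-05
```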
Perplexity
What Is Perplexity?
The standard metric for evaluating language models.
Definition: $$\text{PP}(W) = P(w_1, w_2, ..., w_n)^{-1/n}$$
Equivalently: $$\text{PP}(W) = \sqrt[n]{\prod_{i=1}^{n} \frac{1}{P(w_i | w_{1:i-1})}}$$
Intuition: Weighted Branching Factor
Perplexity ≈ average number of equally likely choices at each step.
Example:
- Perplexity of 100 ≈ model is choosing between 100 equally likely words
- Perplexity of 10 ≈ model is choosing between 10 equally likely words
Lower perplexity = better model (more confident predictions).
Connection to Information Theory
Perplexity relates to entropy: $$\text{PP}(W) = 2^{H(W)}$$
Where entropy H measures uncertainty: $$H(W) = -\frac{1}{n} \log_2 P(w_1, ..., w_n)$$
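A sketch of computing perplexity directly from the definition, using base-2 logs so the connection to entropy is explicit (the probabilities are illustrative):

```python
import math

def perplexity(word_probs):
    """Perplexity = 2 ** (average negative log2 probability per word)."""
    n = len(word_probs)
    avg_neg_log2 = -sum(math.log2(p) for p in word_probs) / n
    return 2 ** avg_neg_log2

print(perplexity([0.1, 0.1, 0.1, 0.1]))  # ≈ 10: like choosing among 10 equally likely words
```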
Important Caveats
Perplexities are only comparable when:
- Using the same vocabulary
- Using the same test set
Adding rare words to vocabulary increases perplexity (more choices).
The Unknown Word Problem
The Problem
What if we see a word we've never seen before? $$P(\text{unigoogleable} | w_{n-1}) = 0$$
Zero probability breaks everything:
- Product becomes zero
- Perplexity becomes infinite
Solution: <UNK> Token
- Define a vocabulary (e.g., most frequent 50,000 words)
- Replace all other words with <UNK>
- Treat <UNK> as just another word
Training: "I saw a brachiosaurus" → "I saw a <UNK>" Testing: Same replacement
Caveat: Smaller vocabulary = lower perplexity (fewer choices). Not always fair to compare!
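A minimal sketch of the replacement step, assuming the vocabulary has already been chosen:

```python
def replace_oov(tokens, vocab, unk="<UNK>"):
    """Map any token outside the vocabulary to the <UNK> symbol."""
    return [tok if tok in vocab else unk for tok in tokens]

vocab = {"i", "saw", "a", "cat"}
print(replace_oov("i saw a brachiosaurus".split(), vocab))
# ['i', 'saw', 'a', '<UNK>']
```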
Smoothing
The Zero Probability Problem
Even with <UNK>, we'll see new n-grams: $$P(\text{cat} | \text{the green}) = 0 \text{ (if never seen together)}$$
Smoothing redistributes probability mass to unseen events.
Laplace (Add-One) Smoothing
Idea: Pretend we saw everything at least once.
Unigram: $$P(w_i) = \frac{\text{count}(w_i) + 1}{N + V}$$
Bigram: $$P(w_i | w_j) = \frac{\text{count}(w_j, w_i) + 1}{\text{count}(w_j) + V}$$
Where V is the vocabulary size and N is the total number of word tokens in the corpus.
Problem: Add-1 is too aggressive — steals too much from seen events.
Solution: Add-k smoothing (k < 1, e.g., k = 0.01).
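A sketch of add-k bigram smoothing following the formula above; the tiny corpus and the names are illustrative, and in practice V would be the size of the full vocabulary (including <UNK>):

```python
from collections import Counter

def addk_bigram_prob(bigram_counts, unigram_counts, vocab_size, prev_word, word, k=0.01):
    """(count(prev, word) + k) / (count(prev) + k * V)."""
    return (bigram_counts[(prev_word, word)] + k) / (unigram_counts[prev_word] + k * vocab_size)

tokens = "the cat sat on the mat".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)

print(addk_bigram_prob(bigrams, unigrams, len(unigrams), "the", "dog"))
# ≈ 0.005: the unseen bigram "the dog" gets a small non-zero probability
```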
Backoff
Idea: If no evidence for trigram, use bigram. If no bigram, use unigram.
$$P_{BO}(w_i|w_{i-2},w_{i-1}) = \begin{cases} P(w_i|w_{i-2},w_{i-1}) & \text{if count}(w_{i-2},w_{i-1},w_i) > 0 \\ \alpha \cdot P_{BO}(w_i|w_{i-1}) & \text{otherwise} \end{cases}$$
Where α is a normalization factor; to keep the result a valid probability distribution, some probability mass must also be discounted from the observed n-grams (as in Katz backoff).
Interpolation
Idea: Always mix all n-gram levels.
$$P(w_i | w_{i-2}, w_{i-1}) = \lambda_1 P(w_i) + \lambda_2 P(w_i | w_{i-1}) + \lambda_3 P(w_i | w_{i-2}, w_{i-1})$$
Where $\lambda_1 + \lambda_2 + \lambda_3 = 1$.
Learn λ values from held-out data.
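A sketch of linear interpolation with fixed weights; the λ values here are arbitrary and would normally be tuned on held-out data:

```python
def interpolated_prob(p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram estimates with weights that sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Illustrative component probabilities for P(cat | the green)
print(interpolated_prob(p_uni=0.001, p_bi=0.02, p_tri=0.0))
# 0.0061: non-zero even though the trigram was never seen
```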
Kneser-Ney Smoothing
State-of-the-art for n-gram models.
Key insight: Use continuation probability — how likely is a word to appear in new contexts?
$$P_{KN}(w_i | w_{i-1}) = \frac{\max(\text{count}(w_{i-1}, w_i) - d, 0)}{\sum_v \text{count}(w_{i-1}, v)} + \lambda(w_{i-1}) P_{\text{continuation}}(w_i)$$
Intuition:
- Words like "Francisco" have high count but appear after "San" — low continuation probability
- Words like "the" appear in many contexts — high continuation probability
Efficiency Considerations
N-gram models can be huge! Practical tricks:
| Technique | Purpose |
|---|---|
| Quantization | Store probabilities with fewer bits |
| Tries | Efficient storage and lookup |
| String hashing | Reduce memory for n-grams |
| Bloom filters | Fast membership testing |
| Stupid Backoff | Simple, fast approximation |
Summary
| Concept | Key Idea |
|---|---|
| Language Model | Assign probabilities to word sequences |
| N-gram | Approximate using last n-1 words |
| MLE | Estimate from counts (relative frequency) |
| Perplexity | Model evaluation metric (lower = better) |
| Smoothing | Handle unseen n-grams |
| Kneser-Ney | Best n-gram smoothing technique |
The Bigger Picture
N-gram models are:
- Simple and interpretable
- Capture local syntax (word order)
- Fast to train and use
But they have limitations:
- Fixed context (can't capture long-range dependencies)
- Sparse data (even trigrams need lots of data)
- No semantic understanding (just counting patterns)
This motivates neural language models (covered in later chapters).