Sequence Architectures: RNNs, LSTMs, and Attention
Language is inherently sequential. This chapter covers neural architectures designed to process sequences: from basic RNNs to LSTMs to the attention mechanism that revolutionized NLP.
The Big Picture
The Problem: Language has dependencies across arbitrary distances.
- "The cat that I saw yesterday was cute" (subject-verb agreement)
- Standard feedforward networks have fixed input size
The Solution: Architectures with memory that process sequences step by step.
The Evolution:
FFNNs (fixed context) → RNNs (memory) → LSTMs/GRUs (better memory) → Attention (direct connections) → Transformers (pure attention, parallel)
Why Not Feedforward Networks?
Limitation: Fixed context window.
A bigram FFNN can only see the previous word. But language has long-range dependencies:
- "The students who did well on the exam were happy"
- Verb agrees with "students", not "exam"
We need: Variable-length context.
Recurrent Neural Networks (RNNs)
The Core Idea
Add a recurrent connection — the hidden state from the previous step feeds into the current step.
$$h_t = g(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$ $$y_t = W_{hy} h_t + b_y$$
The hidden state $h_t$ acts as memory!
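A minimal NumPy sketch of a single recurrent step, following the equations above; the layer sizes and random weights are purely illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, W_hy, b_h, b_y):
    """One RNN time step: update the hidden state, then emit an output."""
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)  # g = tanh
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

# Illustrative sizes (assumption): 4-dim input, 8-dim hidden, 3-dim output.
rng = np.random.default_rng(0)
d_x, d_h, d_y = 4, 8, 3
params = (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_x)),
          rng.normal(size=(d_y, d_h)), np.zeros(d_h), np.zeros(d_y))

h = np.zeros(d_h)                       # initial hidden state
for x in rng.normal(size=(5, d_x)):     # a toy 5-step input sequence
    h, y = rnn_step(x, h, *params)      # the same weights are reused every step
```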
Intuition
Think of reading a sentence word by word:
- You update your understanding as each word arrives
- Your current understanding depends on what you've read so far
- That's exactly what $h_t$ does
Training: Backpropagation Through Time (BPTT)
- Unroll the network across time steps
- Forward pass: Compute all hidden states and outputs
- Backward pass: Compute gradients through the unrolled graph
- Update: Sum gradients for shared weights
For long sequences: Use truncated BPTT (limit how far back gradients flow).
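One common way to implement truncated BPTT in PyTorch is to detach the hidden state at chunk boundaries so gradients stop flowing past the chunk; this is a minimal sketch with made-up sizes and random data.

```python
import torch
import torch.nn as nn

# Toy setup: random "data" just to show the shape of the training loop (assumption).
seq_len, chunk, d_in, d_h = 100, 20, 8, 16
data = torch.randn(seq_len, 1, d_in)          # (time, batch, features)
targets = torch.randn(seq_len, 1, d_h)

rnn = nn.RNN(d_in, d_h)
opt = torch.optim.SGD(rnn.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

h = torch.zeros(1, 1, d_h)
for start in range(0, seq_len, chunk):
    x = data[start:start + chunk]
    y = targets[start:start + chunk]
    h = h.detach()                            # truncate: no gradient flows past this chunk
    out, h = rnn(x, h)
    loss = loss_fn(out, y)
    opt.zero_grad()
    loss.backward()                           # BPTT through at most `chunk` steps
    opt.step()
```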
RNN Language Model
$$e_t = E x_t \quad \text{(word embedding)}$$ $$h_t = g(W_{hh} h_{t-1} + W_{he} e_t)$$ $$P(w_{t+1}) = \text{softmax}(W_{hy} h_t)$$
Training: Teacher forcing, i.e. feed the true previous word as input rather than the model's own prediction.
Weight tying: Share parameters between input embedding E and output layer.
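A minimal PyTorch sketch of an RNN language model with teacher forcing and weight tying; the vocabulary size, model width, and toy batch are assumptions for illustration.

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    """Minimal RNN language model with tied input/output embeddings."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)    # E
        self.rnn = nn.RNN(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size, bias=False)
        self.out.weight = self.embed.weight               # weight tying

    def forward(self, tokens):                            # tokens: (batch, time)
        e = self.embed(tokens)                            # e_t = E x_t
        h, _ = self.rnn(e)
        return self.out(h)                                # logits over the next word

# Teacher forcing: inputs are the true words, targets are the same words shifted by one.
model = RNNLM()
batch = torch.randint(0, 1000, (2, 12))                   # toy token ids (assumption)
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 1000), batch[:, 1:].reshape(-1))
```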
RNN Task Variants
Sequence Labeling (Many-to-Many, aligned)
Task: Label each token (NER, POS tagging).
Input: John loves New York
Output: B-PER O B-LOC I-LOC
Predict at each step based on hidden state.
Sequence Classification (Many-to-One)
Task: Classify entire sequence (sentiment analysis).
Input: This movie was great!
Output: POSITIVE
Use final hidden state (or pooled states) for classification.
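A minimal many-to-one sketch: run the RNN over the sequence and classify from its final hidden state. Sizes and random inputs are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Many-to-one: encode the whole sequence, classify from the final hidden state.
d_in, d_h, n_classes = 8, 16, 2            # illustrative sizes (assumption)
rnn = nn.RNN(d_in, d_h, batch_first=True)
classifier = nn.Linear(d_h, n_classes)

x = torch.randn(4, 12, d_in)               # (batch, tokens, features)
_, h_n = rnn(x)                            # h_n: (1, batch, d_h), the final hidden state
logits = classifier(h_n.squeeze(0))        # (batch, n_classes), e.g. POSITIVE/NEGATIVE
```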
Sequence Generation (One-to-Many or Many-to-Many)
Task: Generate text.
Input: <BOS>
Output: The cat sat on the mat <EOS>
Autoregressive: Each output becomes next input.
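A minimal sketch of greedy autoregressive decoding with an untrained RNN cell: each predicted token is fed back as the next input. The vocabulary size, BOS/EOS ids, and length cap are assumptions.

```python
import torch
import torch.nn as nn

vocab_size, d_model, BOS, EOS = 1000, 64, 1, 2   # toy vocabulary (assumption)
embed = nn.Embedding(vocab_size, d_model)
cell = nn.RNNCell(d_model, d_model)
out = nn.Linear(d_model, vocab_size)

token = torch.tensor([BOS])
h = torch.zeros(1, d_model)
generated = []
for _ in range(20):                      # cap the generated length
    h = cell(embed(token), h)
    token = out(h).argmax(dim=-1)        # greedy: pick the most likely next word
    if token.item() == EOS:
        break
    generated.append(token.item())       # this token becomes the next input
```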
RNN Architectures
Stacked RNNs
Multiple RNN layers:
Layer 3: h3_t = f(h3_{t-1}, h2_t)
Layer 2: h2_t = f(h2_{t-1}, h1_t)
Layer 1: h1_t = f(h1_{t-1}, x_t)
Benefit: Different abstraction levels at each layer.
Bidirectional RNNs
Process sequence both ways:
- Forward: $\vec{h}_t$ (left to right)
- Backward: $\overleftarrow{h}_t$ (right to left)
- Combined: $h_t = [\vec{h}_t; \overleftarrow{h}_t]$
Benefit: Each position has access to full context.
Limitation: Can't be used for autoregressive generation (the backward pass needs future words that haven't been generated yet).
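A quick PyTorch sketch showing shapes for a stacked, bidirectional LSTM (num_layers stacks the layers, bidirectional=True adds the backward pass); the sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_in, d_h = 8, 16
lstm = nn.LSTM(d_in, d_h, num_layers=2, bidirectional=True, batch_first=True)

x = torch.randn(4, 10, d_in)              # (batch, time, features)
output, (h_n, c_n) = lstm(x)

# Each position's output concatenates forward and backward states: [h_fwd; h_bwd]
print(output.shape)                       # torch.Size([4, 10, 32])  -> 2 * d_h
print(h_n.shape)                          # torch.Size([4, 4, 16])   -> (layers * directions, batch, d_h)
```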
The Vanishing Gradient Problem
The Problem
Gradients shrink exponentially as they flow backward through time:
$$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_T} \cdot \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$
If each factor $\frac{\partial h_t}{\partial h_{t-1}}$ consistently has magnitude below 1, the product shrinks exponentially → vanishing gradients.
Consequence: RNNs struggle to learn long-range dependencies.
Why It Happens
- Sigmoid derivative: max 0.25
- Tanh derivative: max 1.0
- Repeated multiplication through many steps → exponential decay
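A toy numeric illustration of the decay: multiply a gradient by a per-step factor below 1 over many time steps. The factor 0.9 is an arbitrary assumption standing in for a typical per-step derivative magnitude.

```python
# Repeatedly shrinking a gradient signal by a factor < 1 (e.g. tanh' <= 1, sigmoid' <= 0.25).
steps = 50
per_step = 0.9                      # assumed typical |dh_t/dh_{t-1}|
grad = 1.0
for _ in range(steps):
    grad *= per_step
print(grad)                         # ~0.005: the signal from 50 steps back is nearly gone
```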
Solutions
- Gradient clipping (for exploding gradients)
- Better architectures: LSTM, GRU
- Skip connections: Allow gradients to flow directly
LSTM (Long Short-Term Memory)
The Innovation
Add explicit memory management through gates.
Two types of state:
- Cell state $c_t$: Long-term memory (conveyor belt)
- Hidden state $h_t$: Working memory / output
The Three Gates
| Gate | Purpose | Controls |
|---|---|---|
| Forget | What to erase from memory | $f_t$ |
| Input | What new info to add | $i_t$ |
| Output | What to expose as output | $o_t$ |
LSTM Equations
Forget gate (what to keep from old memory): $$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
Input gate (how much of new info to add): $$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
Candidate values (new info to potentially add): $$\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
Cell state update (the key equation!): $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
Output gate (what to output): $$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
Hidden state: $$h_t = o_t \odot \tanh(c_t)$$
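A NumPy sketch that transcribes the gate equations above into a single LSTM step; sizes and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts keyed by gate name ('f', 'i', 'c', 'o')."""
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])        # forget gate
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])        # input gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])  # candidate values
    c_t = f * c_prev + i * c_tilde                              # additive cell update
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])        # output gate
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Illustrative sizes (assumption): 4-dim input, 8-dim hidden.
rng = np.random.default_rng(0)
d_x, d_h = 4, 8
W = {k: rng.normal(size=(d_h, d_x)) for k in 'fico'}
U = {k: rng.normal(size=(d_h, d_h)) for k in 'fico'}
b = {k: np.zeros(d_h) for k in 'fico'}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_x), h, c, W, U, b)
```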
Why LSTM Works
The cell state update is additive: $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
Gradients can flow through unchanged (when forget gate ≈ 1).
GRU (Gated Recurrent Unit)
Simplified LSTM with fewer parameters:
- Combines forget and input gates into update gate
- No separate cell state
Often performs comparably to LSTM with less computation.
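The GRU equations are not spelled out above; assuming the standard formulation (update gate $z_t$, reset gate $r_t$), a NumPy sketch of one step looks like this.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU step (standard formulation assumed: z = update, r = reset)."""
    z = sigmoid(W['z'] @ x_t + U['z'] @ h_prev + b['z'])        # update gate
    r = sigmoid(W['r'] @ x_t + U['r'] @ h_prev + b['r'])        # reset gate
    h_tilde = np.tanh(W['h'] @ x_t + U['h'] @ (r * h_prev) + b['h'])
    return (1 - z) * h_prev + z * h_tilde                       # interpolate old and new

# Illustrative sizes (assumption): 4-dim input, 8-dim hidden.
rng = np.random.default_rng(0)
d_x, d_h = 4, 8
W = {k: rng.normal(size=(d_h, d_x)) for k in 'zrh'}
U = {k: rng.normal(size=(d_h, d_h)) for k in 'zrh'}
b = {k: np.zeros(d_h) for k in 'zrh'}
h = gru_step(rng.normal(size=d_x), np.zeros(d_h), W, U, b)
```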
Attention Mechanism
The Bottleneck Problem
In encoder-decoder models, all information about the source sequence must pass through a single fixed-size vector (the final encoder state).
Problem: Information gets lost for long sequences.
The Attention Solution
Let the decoder look at all encoder states when making each prediction.
How Attention Works
For each decoder step:
Score: Compute similarity between decoder state and each encoder state $$e_{ij} = \text{score}(s_{i-1}, h_j)$$
Normalize: Convert scores to weights (softmax) $$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$
Combine: Weighted sum of encoder states $$c_i = \sum_j \alpha_{ij} h_j$$
Use: Context vector informs prediction $$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
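A NumPy sketch of one decoder attention step, covering the score, normalize, and combine stages above; dot-product scoring and the toy sizes are assumptions.

```python
import numpy as np

def attend(s_prev, enc_states):
    """One decoder attention step: dot-product scores, softmax, weighted sum."""
    scores = enc_states @ s_prev                 # e_ij = score(s_{i-1}, h_j)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # alpha_ij (softmax over encoder positions)
    return weights @ enc_states, weights         # context c_i and the attention weights

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(6, 8))             # 6 encoder positions, 8-dim states
s_prev = rng.normal(size=8)                      # previous decoder state
context, alpha = attend(s_prev, enc_states)      # c_i then feeds the next decoder state
```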
Scoring Functions
| Type | Formula |
|---|---|
| Dot product | $s^T h$ |
| Scaled dot product | $\frac{s^T h}{\sqrt{d}}$ |
| MLP | $v^T \tanh(W_1 s + W_2 h)$ |
Benefits of Attention
- Long-range dependencies: Direct connections regardless of distance
- Interpretability: Attention weights show what the model focuses on
- Alignment: Helpful for translation (which source words map to which target words)
Self-Attention and Transformers
From Attention to Self-Attention
Regular attention: Query from decoder, keys/values from encoder.
Self-attention: Query, keys, values all from same sequence.
$$\text{output}_i = \text{Attention}(x_i, (x_1, x_1), (x_2, x_2), ..., (x_n, x_n))$$
Each position attends to all positions in the same sequence.
Scaled Dot-Product Attention
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Where:
- Q = queries (what I'm looking for)
- K = keys (what I offer for matching)
- V = values (what I actually provide)
- $\sqrt{d_k}$ scaling prevents softmax saturation
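A NumPy sketch of scaled dot-product self-attention for a whole sequence at once; the learned projections that would normally produce Q, K, V are omitted, and the sizes are assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(QK^T / sqrt(d_k)) V, computed for all positions at once."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # scaling avoids saturation
    scores -= scores.max(axis=-1, keepdims=True)           # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over key positions
    return weights @ V

# Self-attention: Q, K, V all come from the same sequence.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                               # 5 tokens, d_k = 16
out = scaled_dot_product_attention(X, X, X)                # (5, 16)
```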
Multi-Head Attention
Run multiple attention operations in parallel: $$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$ $$\text{MultiHead} = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O$$
Benefit: Each head can capture different types of relationships.
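A sketch using PyTorch's built-in multi-head attention module for self-attention; the model width, head count, and toy batch are assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads = 64, 8                    # each head works in d_model / n_heads dims
mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(2, 10, d_model)             # (batch, tokens, d_model)
out, weights = mha(x, x, x)                 # self-attention: Q = K = V = x
print(out.shape)                            # torch.Size([2, 10, 64])
print(weights.shape)                        # torch.Size([2, 10, 10]) (averaged over heads)
```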
Positional Encoding
Attention is permutation invariant — doesn't know word order!
Solution: Add position information to embeddings.
Sinusoidal encoding: $$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d})$$ $$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d})$$
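A NumPy sketch of the sinusoidal encoding above; the sequence length and model dimension are illustrative assumptions.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(n_positions)[:, None]                 # (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[None, :]             # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(50, 64)               # added to the token embeddings
```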
BERT Architecture
Base model:
- 12 attention heads
- 12 layers
- Hidden size 768 (12 heads × 64 dims per head)
- 110M parameters
Summary
| Architecture | Memory | Long-range | Parallelizable |
|---|---|---|---|
| RNN | Hidden state | Limited | No |
| LSTM | Cell + hidden | Better | No |
| Attention RNN | + context | Good | Partially |
| Transformer | Attention only | Excellent | Yes |
Key Takeaways
- RNNs process sequences with memory but struggle with long dependencies
- LSTMs use gates to control information flow, mitigating vanishing gradients
- Attention provides direct connections between any positions
- Transformers replace recurrence with pure attention, enabling parallelism
- Modern NLP is dominated by Transformer-based models (BERT, GPT, etc.)