Vector Semantics and Word Embeddings
How do we represent word meaning computationally? This chapter covers the evolution from sparse count-based vectors to dense neural embeddings — one of the most important advances in NLP.
The Big Picture
The Problem: Computers need numerical representations of words.
Key Insight (Distributional Hypothesis):
"You shall know a word by the company it keeps" — J.R. Firth
Words that appear in similar contexts have similar meanings.
The Evolution:
One-hot vectors (sparse, no similarity) → Count-based vectors (sparse, some similarity) → Neural embeddings (dense, learned similarity)
Challenges of Lexical Semantics
Why is meaning hard?
| Challenge | Example |
|---|---|
| Word forms | sing, sang, sung (same lemma "sing") |
| Polysemy | "bank" = river bank or financial bank |
| Synonymy | couch ≈ sofa (same meaning) |
| Relatedness | coffee ~ cup (not synonyms, but related) |
| Semantic frames | "A bought from B" ≈ "B sold to A" |
| Connotation | "slender" vs. "skinny" (same denotation, different feeling) |
Vector Space Models
The Core Idea
Represent words as vectors in a high-dimensional space where:
- Similar words are close together
- Dissimilar words are far apart
Document Vectors (Term-Document Matrix)
| | Doc1 | Doc2 | Doc3 |
|---|---|---|---|
| cat | 3 | 0 | 1 |
| dog | 2 | 4 | 0 |
| pet | 1 | 2 | 1 |
- Rows: Words (vocabulary of size V)
- Columns: Documents (D documents)
- Cell: Count of word in document
Use case: Information retrieval (find similar documents).
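A minimal sketch of building such a matrix (toy corpus assumed for illustration):

```python
from collections import Counter

docs = ["the cat sat", "the dog ran and the cat ran", "my pet dog"]  # toy corpus
counters = [Counter(d.split()) for d in docs]          # per-document word counts
vocab = sorted({w for d in docs for w in d.split()})

# Rows: words; columns: documents; cells: count of word in document
matrix = [[c[w] for c in counters] for w in vocab]
for word, row in zip(vocab, matrix):
    print(f"{word:>4} {row}")
```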
Word Vectors (Term-Term Matrix)
| | cat | dog | pet | food |
|---|---|---|---|---|
| cat | - | 15 | 20 | 8 |
| dog | 15 | - | 25 | 12 |
| pet | 20 | 25 | - | 10 |
- Rows and Columns: Words
- Cell: Co-occurrence count (how often the two words appear together within a context window)
Result: Each word is a V-dimensional vector.
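A minimal sketch of collecting window-based co-occurrence counts (a window of 2 is assumed, matching the skip-gram example later):

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each word pair occurs within `window` tokens."""
    counts = defaultdict(int)
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(w, tokens[j])] += 1
    return counts

counts = cooccurrence_counts("my pet cat and my pet dog".split())
print(counts[("pet", "cat")])  # 1: "cat" falls inside "pet"'s window once
```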
Measuring Similarity
Cosine Similarity
Normalized dot product — measures angle between vectors:
$$\cos(\theta) = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}| \cdot |\vec{b}|} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \cdot \sqrt{\sum_i b_i^2}}$$
Interpretation:
- cos = 1: Identical direction (most similar)
- cos = 0: Perpendicular (unrelated)
- cos = -1: Opposite direction (only possible when components can be negative; raw count vectors are non-negative, so their cosine stays in [0, 1])
Why cosine over Euclidean?
- Handles different vector magnitudes
- A long document and a short document can still be similar
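A minimal NumPy sketch of the formula, reusing the toy counts for "cat" and "dog" from the term-document table:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Normalized dot product: the cosine of the angle between a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat = np.array([3.0, 0.0, 1.0])   # document counts for "cat" (from the table)
dog = np.array([2.0, 4.0, 0.0])   # document counts for "dog"
print(round(cosine_similarity(cat, dog), 2))  # 0.42: moderately similar
```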
For Unit Vectors
When vectors are normalized (length 1), expanding $||\vec{a} - \vec{b}||^2 = ||\vec{a}||^2 - 2\,\vec{a} \cdot \vec{b} + ||\vec{b}||^2$ gives: $$||\vec{a} - \vec{b}||^2 = 2(1 - \cos\theta)$$
Euclidean distance and cosine become equivalent: they rank neighbors identically!
TF-IDF Weighting
Raw counts have problems:
- Common words ("the", "is") dominate
- Rare but meaningful words get drowned out
Term Frequency (TF)
How often does the word appear in the document?
Raw TF: $\text{tf}_{t,d} = \text{count}(t, d)$
Log TF (dampens large counts): $$\text{tf}_{t,d} = \log(1 + \text{count}(t, d))$$
Inverse Document Frequency (IDF)
How rare is the word across documents?
$$\text{idf}_t = \log\left(\frac{N}{\text{df}_t}\right)$$
Where:
- N = total number of documents
- df_t = number of documents containing term t
Effect: Common words (low IDF) get downweighted.
TF-IDF
Combine both: $$w_{t,d} = \text{tf}_{t,d} \times \text{idf}_t$$
High TF-IDF: Word appears often in this document but rarely overall → distinctive!
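A minimal sketch combining the log-dampened TF and the IDF above (toy documents assumed):

```python
import math

def tf_idf(term, doc, docs):
    """w_{t,d} = log(1 + count(t, d)) * log(N / df_t)."""
    tf = math.log(1 + doc.count(term))
    df = sum(term in d for d in docs)   # number of documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "and", "the", "dog"]]
print(tf_idf("the", docs[2], docs))  # 0.0: in every doc, idf = log(3/3) = 0
print(tf_idf("cat", docs[2], docs))  # ~0.28: rarer across docs, so weighted up
```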
Pointwise Mutual Information (PMI)
The Intuition
Do two words appear together more often than we'd expect by chance?
$$\text{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x) \cdot P(y)}$$
Interpretation:
- PMI > 0: Words co-occur more than expected (associated)
- PMI = 0: Words co-occur as expected (independent)
- PMI < 0: Words co-occur less than expected (avoid each other)
From Counts
$$\text{PMI}(x, y) = \log_2 \frac{\text{count}(x, y) \cdot N}{\text{count}(x) \cdot \text{count}(y)}$$
Positive PMI (PPMI)
Negative PMI values are unreliable: showing that two words co-occur less often than chance demands far more data than showing a positive association, so negative values are clipped to zero.
$$\text{PPMI}(x, y) = \max(0, \text{PMI}(x, y))$$
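A minimal NumPy sketch: PPMI from a word-word co-occurrence count matrix, with P(x) and P(y) estimated from the row and column marginals:

```python
import numpy as np

def ppmi(counts: np.ndarray) -> np.ndarray:
    """Positive PMI for every cell of a co-occurrence count matrix."""
    total = counts.sum()
    p_xy = counts / total
    p_x = counts.sum(axis=1, keepdims=True) / total   # P(x): row marginals
    p_y = counts.sum(axis=0, keepdims=True) / total   # P(y): column marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.where(p_xy > 0, np.log2(p_xy / (p_x * p_y)), 0.0)
    return np.maximum(pmi, 0.0)   # clip negative PMI to zero
```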
From Sparse to Dense: Word2Vec
The Problem with Count Vectors
- Very high dimensional (vocabulary size)
- Very sparse (mostly zeros)
- No generalization between similar words
The Neural Solution
Learn dense, low-dimensional vectors (typically 100-300 dimensions).
Key properties:
- Similar words have similar vectors
- Relationships are captured geometrically
Static vs. Contextual Embeddings
| Type | Same word = same vector? | Examples |
|---|---|---|
| Static | Yes | Word2Vec, GloVe, FastText |
| Contextual | No (depends on context) | ELMo, BERT, GPT |
Skip-Gram with Negative Sampling (SGNS)
The most popular Word2Vec algorithm.
The Task
Given a target word, predict surrounding context words.
Example: "The quick brown fox jumps"
- Target: "brown"
- Context (window=2): "The", "quick", "fox", "jumps"
Training Setup
- Positive examples: (target, context) pairs from real text
- Negative examples: (target, random_word) pairs — fake associations
The Objective
Maximize probability of real pairs, minimize probability of fake pairs:
$$L = \log \sigma(v_w \cdot v_c) + \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\left[\log \sigma(-v_w \cdot v_{c_i})\right]$$
Where:
- $\sigma$ is sigmoid function
- $v_w$ is target word vector
- $v_c$ is context word vector
- k is number of negative samples (typically 5-20)
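A minimal sketch of this objective for one (target, context) pair, negated so it reads as a loss to minimize (embeddings are plain NumPy vectors here; the gradient updates are omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_w, v_c, negative_vs):
    """Negated SGNS objective: one real pair plus k sampled negatives."""
    pos = np.log(sigmoid(v_w @ v_c))                  # pull the real pair together
    neg = sum(np.log(sigmoid(-v_w @ v_n)) for v_n in negative_vs)  # push fakes apart
    return -(pos + neg)                               # minimize with SGD
```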
Negative Sampling Distribution
Don't sample negatives uniformly (rare words would be drawn far too often) or by raw frequency (common words would dominate). Word2Vec uses a smoothed unigram distribution:
$$P(w) \propto \text{freq}(w)^{0.75}$$
The 0.75 power smooths the distribution (gives rare words a better chance than pure frequency).
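A tiny numeric check of the smoothing (toy frequencies assumed):

```python
import numpy as np

freqs = np.array([100.0, 10.0, 1.0])   # toy corpus frequencies
raw = freqs / freqs.sum()              # [0.90, 0.09, 0.01]
smoothed = freqs ** 0.75
smoothed /= smoothed.sum()             # [0.83, 0.15, 0.03]: rare words gain mass
negatives = np.random.choice(len(freqs), size=5, p=smoothed)  # draw k=5 negatives
```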
Two Embeddings Per Word
Each word has:
- Target embedding: When it's the center word
- Context embedding: When it appears in context
Final embedding is often their sum or average.
Enhancements and Variations
FastText (Subword Embeddings)
Problem: What about unknown words like "ungooglable"?
Solution: Represent words as bag of character n-grams.
"where" → {<wh, whe, her, ere, re>}
Word vector = sum of n-gram vectors.
Benefit: Can handle any word, even unseen ones!
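A minimal sketch of the n-gram extraction (trigrams only for brevity; FastText actually uses a range of n, typically 3 to 6):

```python
def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Boundary-marked character n-grams, plus the whole word itself."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams + [padded]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```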
GloVe (Global Vectors)
Combines advantages of count-based and neural methods.
Uses global co-occurrence statistics, fit with a weighted least-squares objective: $$J = \sum_{i,j} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
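Here $X_{ij}$ counts how often words i and j co-occur, and $f$ downweights rare, noisy pairs while capping the influence of very frequent ones. A sketch of the weighting function from the GloVe paper (x_max = 100, alpha = 0.75):

```python
def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """f(X_ij): rises smoothly for rare pairs, flat at 1.0 beyond x_max."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```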
Often comparable to Word2Vec in practice.
Word Analogies
Famous Word2Vec property:
"king" - "man" + "woman" ≈ "queen"
Find the word that completes the analogy a:b :: a':?
$$b' = \arg\min_{x} \text{distance}(x, b - a + a')$$
Works for:
- Gender: king:queen :: man:woman
- Capitals: Paris:France :: Tokyo:Japan
- Tense: walking:walked :: swimming:swam
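A minimal sketch of that nearest-neighbor search, scored with cosine similarity (an equivalent ranking for unit vectors, per the earlier identity). It assumes embeddings is a dict mapping words to NumPy vectors; the input words are conventionally excluded from the candidates:

```python
import numpy as np

def analogy(a: str, b: str, a2: str, embeddings: dict) -> str:
    """Solve a : b :: a2 : ? by nearest neighbor to b - a + a2."""
    target = embeddings[b] - embeddings[a] + embeddings[a2]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, a2):                      # standard trick: skip inputs
            continue
        sim = (vec @ target) / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# With good embeddings, analogy("man", "king", "woman", vectors) ≈ "queen"
```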
Bias in Word Embeddings
The Problem
Word embeddings learn biases present in training data.
Examples:
- "doctor" closer to "man" than "woman"
- "homemaker" closer to "woman" than "man"
- Names associated with certain ethnic groups linked to negative words
Types of Harm
Allocation harm: System makes unfair decisions
- Resume screening favoring male-associated names
Representation harm: Reinforces stereotypes
- Search results, autocomplete suggestions
Mitigation Strategies
- Debias during training or post-hoc
- Careful data curation
- Evaluation for fairness
Summary
| Representation | Pros | Cons |
|---|---|---|
| Count-based (TF-IDF) | Interpretable, simple | Sparse, high-dimensional |
| PMI | Captures associations | Sparse, noisy for rare words |
| Word2Vec | Dense, captures analogy | Static, no context |
| FastText | Handles OOV words | Still static |
| Contextual | Word sense disambiguation | Computationally expensive |
Key Takeaways
- Words can be represented as vectors in semantic space
- Distributional similarity = semantic similarity
- Dense embeddings outperform sparse for most tasks
- Context matters — motivates contextual embeddings (BERT, etc.)
- Beware of biases inherited from training data