Recommendation Systems
Recommendation systems predict user preferences for items (movies, products, songs, etc.). They power personalization across the internet — from Netflix to Amazon to Spotify.
The Big Picture
The problem: Users interact with a tiny fraction of available items. Can we predict what they'd like?
Key challenge: The user-item matrix is extremely sparse (>99% missing).
Goal: Fill in the missing entries (predict ratings) or rank items for each user.
Types of Feedback
Explicit Feedback
Users directly express preferences:
- Star ratings (1-5)
- Thumbs up/down
- Reviews
Pros: Clear signal about preferences. Cons: Sparse (users rarely rate), not missing at random (users rate things they care about).
Implicit Feedback
Inferred from user behavior:
- Clicks, views, purchases
- Time spent
- Add to cart
Pros: Abundant, always available. Cons: Noisy, positive-only (absence doesn't mean dislike).
Collaborative Filtering
The Core Idea
Users "collaborate" to help recommend items:
- Users with similar preferences in the past will have similar preferences in the future
- Items liked by similar users are likely to be liked by the target user
User-Based CF
For target user u and item i: $$\hat{y}{ui} = \frac{\sum{u'} \text{sim}(u, u') \cdot y_{u'i}}{\sum_{u'} \text{sim}(u, u')}$$
Steps:
- Find users similar to u (based on rating patterns)
- Aggregate their ratings for item i
Item-Based CF
For target user u and item i: $$\hat{y}{ui} = \frac{\sum{i'} \text{sim}(i, i') \cdot y_{ui'}}{\sum_{i'} \text{sim}(i, i')}$$
Steps:
- Find items similar to i (based on who rated them)
- Aggregate user u's ratings for those items
In practice: Item-based often preferred (item similarities more stable than user similarities).
Challenges
- Sparsity: Few common ratings for similarity calculation
- Scalability: Computing all pairwise similarities is expensive
- Cold start: No history for new users/items
Matrix Factorization
The Idea
The rating matrix R can be approximated as a product of low-rank matrices: $$R \approx U \cdot V^T$$
Where:
- U is user matrix (N × K): Each row is a user's "embedding"
- V is item matrix (M × K): Each row is an item's "embedding"
- K << N, M (typically K = 10-200)
Interpretation
Each dimension captures a latent "factor":
- Movie factors might capture: Action content, Romance level, Production year...
- User factors capture: Preference for action, romance, etc.
Prediction: $\hat{y}_{ui} = u_u^T v_i$ (dot product of embeddings)
Training
Minimize squared error on observed ratings: $$L = \sum_{(u,i) \in \text{observed}} (y_{ui} - u_u^T v_i)^2 + \lambda(|U|^2 + |V|^2)$$
Note: Can't use SVD directly (missing values). Use:
- Alternating Least Squares (ALS): Fix U, solve for V; fix V, solve for U
- SGD: Stochastic gradient descent on observed entries
Adding Biases
Users and items have inherent tendencies: $$\hat{y}_{ui} = \mu + b_u + c_i + u_u^T v_i$$
Where:
- μ: Global average rating
- $b_u$: User bias (some users rate higher on average)
- $c_i$: Item bias (some items are generally liked more)
Probabilistic Matrix Factorization
Bayesian Approach
Model ratings as: $$p(y_{ui} | u_u, v_i) = \mathcal{N}(\mu + b_u + c_i + u_u^T v_i, \sigma^2)$$
Add priors on embeddings: $$u_u \sim \mathcal{N}(0, \sigma_u^2 I)$$ $$v_i \sim \mathcal{N}(0, \sigma_v^2 I)$$
Benefits:
- Principled handling of uncertainty
- Regularization from priors
- Can incorporate side information
Bayesian Personalized Ranking (BPR)
For Implicit Feedback
Problem: With implicit data, we only have positive examples.
Approach: Learn to rank positives above negatives.
Assumption: User prefers items they interacted with over items they didn't.
The Loss
For triplet (user, positive item, negative item): $$L = -\log \sigma(f(u, i^+) - f(u, i^-))$$
Where f is the prediction score (e.g., $u_u^T v_i$).
Training: Sample triplets and optimize.
Factorization Machines
Beyond Matrix Factorization
Matrix factorization only captures user-item interactions.
Factorization Machines generalize to any features: $$f(x) = \mu + \sum_{j=1}^d w_j x_j + \sum_{j < k} (v_j \cdot v_k) x_j x_k$$
Where:
- x: Feature vector (one-hot user, one-hot item, plus any other features)
- v_j: Embedding for feature j
Advantages
- Can incorporate side information: User demographics, item attributes, context (time, location)
- Handles cold start better
- Same framework for different feature types
Connection to MF
When features are just user and item IDs: $$f = \mu + b_u + c_i + u_u^T v_i$$
Exactly matrix factorization with biases!
The Cold Start Problem
The Challenge
New users or items have no interaction history.
Solutions
Content-based: Use features instead of collaborative signal
- New user: Ask preferences, use demographics
- New item: Use item attributes, description
Hybrid methods: Combine collaborative and content-based
Active learning: Ask strategic questions to new users
Transfer learning: Leverage data from related domains
Exploration-Exploitation Trade-off
The Problem
If we only recommend what users already like, they never discover new interests.
Counterfactual: Users might love items they never see!
Approaches
Multi-Armed Bandits:
- Thompson Sampling: Sample from posterior, act greedily
- UCB: Optimism in the face of uncertainty
Contextual Bandits: Personalized exploration based on user features
Diversity: Ensure recommendations aren't all the same type
Deep Learning for RecSys
Neural Collaborative Filtering
Replace dot product with neural network: $$\hat{y}{ui} = f{neural}([u_u; v_i])$$
Benefit: Captures non-linear interactions.
Sequential Recommendations
Model user's sequence of interactions with RNN/Transformer: $$h_t = f(x_t, h_{t-1})$$
Benefit: Captures temporal dynamics.
Two-Tower Models
Separate encoders for users and items:
- Fast serving (pre-compute item embeddings)
- Easy to add features
Evaluation Metrics
For Rating Prediction
- RMSE: $\sqrt{\frac{1}{N}\sum (y - \hat{y})^2}$
- MAE: $\frac{1}{N}\sum |y - \hat{y}|$
For Ranking
- Precision@K: Fraction of top-K recommendations that are relevant
- Recall@K: Fraction of relevant items in top-K
- NDCG: Accounts for position (higher rank = more important)
- MAP: Mean average precision
Offline vs. Online
Offline: Historical data, fast to compute but may not reflect real preferences.
Online (A/B testing): Real users, gold standard but expensive and slow.
Summary
| Method | Key Idea | Best For |
|---|---|---|
| User-Based CF | Similar users → similar items | Small, stable user base |
| Item-Based CF | Similar items | Most practical CF |
| Matrix Factorization | Low-rank approximation | General purpose |
| BPR | Learn to rank | Implicit feedback |
| Factorization Machines | Feature interactions | Side information |
| Neural Methods | Non-linear patterns | Large data, complex patterns |
Practical Recommendations
- Start simple: Matrix factorization with biases often hard to beat
- Add biases: Critical for good performance
- Handle implicit correctly: Use ranking losses, not rating losses
- Address cold start: Hybrid methods or feature-based fallback
- Evaluate carefully: Ranking metrics often more meaningful than RMSE
- Consider fairness: Avoid filter bubbles, ensure diverse recommendations