Introduction to Machine Learning
Machine learning is the science of getting computers to learn from data without being explicitly programmed. This chapter introduces the fundamental concepts, problem types, and challenges that define the field.
What is Machine Learning?
Tom Mitchell's Definition:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
In plain English: A machine learning system gets better at a task as it sees more data.
Example: A spam filter
- Task (T): Classify emails as spam or not spam
- Experience (E): A dataset of labeled emails
- Performance (P): Accuracy on new emails
Supervised Learning
In supervised learning, we have input-output pairs and want to learn the mapping between them.
The Setup
- Inputs (X): Also called features, covariates, or predictors
- Example: Pixel values of an image, words in an email
- Outputs (Y): Also called labels, targets, or responses
- Example: "cat" or "dog", spam or not spam
Goal: Learn a function $f: X \rightarrow Y$ that generalizes to new, unseen examples.
Classification
When the output is a discrete category:
$$L(\theta) = \frac{1}{N}\sum_{i=1}^N \mathbb{I}\{y_i \neq f(x_i, \theta)\}$$
This is the misclassification rate — the fraction of examples we get wrong.
Key concepts:
- Empirical Risk: Average loss on training data
- Empirical Risk Minimization (ERM): Find parameters that minimize training loss
- Generalization: The real goal is to perform well on new data, not just training data
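A minimal numpy sketch of this loss, with hypothetical labels and predictions (not from the text):

```python
import numpy as np

# Toy labels and model predictions (hypothetical values for illustration).
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

# Misclassification rate: the mean of the indicator terms in the formula above.
misclassification_rate = np.mean(y_true != y_pred)
print(misclassification_rate)  # 0.4 -- 2 of 5 examples are wrong
```

ERM then searches over parameters to drive this training loss down; generalization asks how the same quantity behaves on data the model has not seen.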
Dealing with Uncertainty
Models can't predict with 100% certainty. There are two types of uncertainty:
Model Uncertainty (Epistemic)
- Arises from lack of knowledge about the true mapping
- Can be reduced with more data
- Example: We don't know if a blurry image is a cat or dog
Data Uncertainty (Aleatoric)
- Arises from inherent randomness in the data
- Cannot be reduced even with infinite data
- Example: A coin flip is inherently random
Probabilistic Predictions
Instead of just predicting a class, predict a probability distribution over classes:
$$p(y | x, \theta)$$
Why probabilities?
- Quantify confidence
- Enable better decision making
- Allow principled handling of uncertainty
Negative Log-Likelihood (NLL):
$$\text{NLL}(\theta) = -\frac{1}{N}\sum_{i=1}^N \log p(y_i | f(x_i, \theta))$$
Minimizing NLL is equivalent to Maximum Likelihood Estimation (MLE) — finding parameters that make the observed data most probable.
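A minimal sketch of the NLL computation, assuming the model outputs a probability for each class (the values below are hypothetical):

```python
import numpy as np

# Predicted class probabilities for 3 examples over 2 classes
# (hypothetical model outputs; each row sums to 1).
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
y = np.array([0, 1, 1])  # true class indices

# NLL: average negative log-probability assigned to the true class.
nll = -np.mean(np.log(probs[np.arange(len(y)), y]))
print(nll)  # ~0.415; confident correct predictions push this toward 0
```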
Regression
When the output is a continuous value:
$$L(\theta) = \frac{1}{N}\sum_{i=1}^N (y_i - f(x_i, \theta))^2$$
This is Mean Squared Error (MSE) — the average squared difference between predictions and true values.
Connection to Probability: If we assume Gaussian noise:
$$p(y | x, \theta) = \mathcal{N}(y | f(x, \theta), \sigma^2)$$
Then minimizing NLL is equivalent to minimizing MSE!
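To see why, take the negative log of the Gaussian density above, treating $\sigma^2$ as fixed:
$$-\log \mathcal{N}(y \mid f(x, \theta), \sigma^2) = \frac{(y - f(x, \theta))^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$$
Averaged over the data, the second term does not depend on $\theta$, so the parameters that minimize the NLL are exactly those that minimize the MSE.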
Types of Regression Models
| Model | Description | Flexibility |
|---|---|---|
| Linear | $f(x) = w^Tx + b$ | Low |
| Polynomial | Includes $x^2, x^3$, etc. | Medium |
| Neural Networks | Nested nonlinear functions | High |
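To illustrate the flexibility column, here is a small numpy sketch that fits polynomials of increasing degree to hypothetical noisy data (the sine target and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)  # toy data

# Fit models of increasing flexibility to the same data.
for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, deg=degree)            # least-squares fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)  # training MSE
    print(degree, round(mse, 4))
```

Training error falls as flexibility grows, which sets up the overfitting discussion below.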
Overfitting and Generalization
The Overfitting Problem
A model that perfectly fits training data but fails on new data is overfitting. It has memorized the training set rather than learning the underlying pattern.
Signs of overfitting:
- Training error much lower than test error
- Model is very complex relative to data size
- Model captures noise as if it were signal
Understanding the Errors
Population Risk: Theoretical expected loss on the true data generating process $$R(\theta) = \mathbb{E}_{(x,y) \sim p^*}[L(y, f(x, \theta))]$$
Empirical Risk: Average loss on training data $$\hat{R}(\theta) = \frac{1}{N}\sum_{i=1}^N L(y_i, f(x_i, \theta))$$
Generalization Gap: Difference between population and empirical risk $$\text{Gap} = R(\theta) - \hat{R}(\theta)$$
A large gap indicates overfitting.
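In practice the population risk is unknown, so a held-out test set is used as a stand-in. A sketch under that assumption, reusing a deliberately flexible polynomial model on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=n)

x_train, y_train = make_data(15)
x_test, y_test = make_data(200)  # large held-out set as a population proxy

coeffs = np.polyfit(x_train, y_train, deg=9)  # deliberately flexible model
train_risk = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)  # empirical risk
test_risk = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)     # population-risk proxy
print("estimated gap:", test_risk - train_risk)  # large positive => overfitting
```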
The U-Shaped Test Error Curve
Error
│
│ ╲                     ╱  Test error
│  ╲__             ___╱   (U-shaped)
│     ╲__       __╱
│        ╲_____╱
│ ╲
│  ╲__   Training error
│     ╲__________________ (keeps decreasing)
│
└─────────────────────────→ Model Complexity
  Simple            Complex
- Underfitting (left): Model too simple, high bias
- Sweet spot (middle): Good balance
- Overfitting (right): Model too complex, high variance
No Free Lunch Theorem
There is no single best model that works for all problems.
Every model makes assumptions about the data. When those assumptions match reality, the model works well. When they don't, it fails.
Implication: Understanding your problem domain is crucial for choosing the right model.
Unsupervised Learning
In unsupervised learning, we only have inputs X — no labels.
Goal: Discover hidden structure in data.
Common Tasks
Clustering: Group similar data points together
- Example: Customer segmentation, document grouping (a minimal sketch follows this list)
Dimensionality Reduction: Find lower-dimensional representations
- Example: Compress images while preserving important information
Density Estimation: Model the probability distribution $p(x)$
- Example: Anomaly detection (low probability = anomalous)
Self-Supervised Learning: Create proxy tasks from unlabeled data
- Example: Predict missing words in text (BERT), predict next frame in video
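As a minimal sketch of the clustering task, here is k-means on two hypothetical point clouds; the scikit-learn dependency and the synthetic data are assumptions, not something the text prescribes:

```python
import numpy as np
from sklearn.cluster import KMeans  # assumed dependency

rng = np.random.default_rng(0)
# Two hypothetical "customer segments" as 2-D point clouds.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(3, 0.5, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))  # roughly 50 points in each discovered cluster
```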
Evaluation Challenge
Without labels, how do we evaluate? Common approaches:
- Likelihood of held-out data
- Performance on downstream tasks
- Human evaluation of quality
Reinforcement Learning
An agent learns to interact with an environment to maximize cumulative reward.
Key differences from supervised learning:
- No explicit labels — only reward signals
- Rewards are often delayed (sparse feedback)
- Agent's actions affect future states (sequential decision making)
Analogy:
- Supervised learning = learning with a teacher who gives correct answers
- Reinforcement learning = learning with a critic who only says "good" or "bad"
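One classic instantiation of learning from such a critic is tabular Q-learning; the sketch below (a specific algorithm the text does not name) uses a hypothetical 5-state chain where reward arrives only at the goal, illustrating delayed feedback:

```python
import numpy as np

n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9             # learning rate, discount factor
rng = np.random.default_rng(0)

for _ in range(500):                # episodes, each starting at state 0
    s = 0
    while s != n_states - 1:
        a = int(rng.integers(n_actions))  # random exploration; Q-learning is off-policy
        s_next = max(s - 1, 0) if a == 0 else s + 1
        r = 1.0 if s_next == n_states - 1 else 0.0  # the only reward signal
        # Update toward reward plus discounted value of the best next action.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.argmax(Q[:-1], axis=1))    # [1 1 1 1]: learned policy moves right
```

Even though reward appears only at the final transition, the update propagates value backward through the chain, which is exactly the delayed-feedback difficulty noted above.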
Data Preprocessing
Text Data
Raw text needs transformation before ML models can process it.
Bag of Words (BoW)
- Represent document as vector of word counts
- Loses word order but captures content
Problem: Common words ("the", "a") dominate counts.
TF-IDF (Term Frequency - Inverse Document Frequency): $$\text{TF-IDF} = \log(1 + \text{TF}) \times \text{IDF}$$
Where:
- TF = term frequency (how often word appears in document)
- IDF = $\log\frac{N}{1 + \text{DF}}$, where N is the total number of documents and DF is the number of documents containing the word
Effect: Downweight common words, upweight distinctive words.
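A small numpy sketch of the formula on a hypothetical document-term count matrix:

```python
import numpy as np

# Hypothetical counts: rows = documents, columns = vocabulary terms.
counts = np.array([[2, 1, 0],
                   [3, 0, 0],
                   [1, 0, 4],
                   [2, 2, 0]], dtype=float)
N = counts.shape[0]

tf = np.log1p(counts)              # log(1 + TF)
df = (counts > 0).sum(axis=0)      # documents containing each term
idf = np.log(N / (1 + df))         # log(N / (1 + DF))
tfidf = tf * idf
print(tfidf.round(3))  # term 0 (in every doc) gets weight <= 0; rare term 2 is boosted
```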
Word Embeddings: Map words to dense vectors that capture semantic meaning
- Similar words have similar vectors
- "king" - "man" + "woman" ≈ "queen"
Handling Unknown Words:
- UNK token: Replace rare/unseen words with a special token
- Subword units (BPE): Break words into common pieces
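A minimal sketch of the UNK-token convention (the vocabulary and token spelling are arbitrary):

```python
# Replace out-of-vocabulary words with a special token.
vocab = {"the", "cat", "sat", "on", "mat"}

def tokenize(text, vocab):
    return [w if w in vocab else "<unk>" for w in text.lower().split()]

print(tokenize("The cat sat on the xylophone", vocab))
# ['the', 'cat', 'sat', 'on', 'the', '<unk>']
```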
Missing Data
How data is missing matters!
| Type | Description | Example | Handling |
|---|---|---|---|
| MCAR | Missing Completely At Random | Random sensor failures | Easier to handle |
| MAR | Missing At Random (depends on observed data) | Older people less likely to report income | Model the missingness |
| NMAR | Not Missing At Random | Sick people skip health surveys | Most challenging |
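As a simple baseline, mean imputation is often acceptable under MCAR but can bias results under MAR/NMAR, where the missingness itself carries signal. A sketch with hypothetical values:

```python
import numpy as np

# Hypothetical feature column with missing entries encoded as NaN.
x = np.array([4.0, np.nan, 2.0, 5.0, np.nan, 3.0])

# Mean imputation: fill gaps with the mean of the observed values.
x_imputed = np.where(np.isnan(x), np.nanmean(x), x)
print(x_imputed)  # [4.  3.5 2.  5.  3.5 3. ]
```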
Summary
| Concept | Key Insight |
|---|---|
| ML Definition | Learning improves with experience |
| Supervised Learning | Learn input-output mapping from labeled data |
| Classification | Predict discrete categories |
| Regression | Predict continuous values |
| Probabilistic View | Quantify uncertainty with probability distributions |
| Overfitting | Memorizing training data instead of learning patterns |
| Generalization | The ultimate goal: perform well on new data |
| Unsupervised Learning | Find structure without labels |
| RL | Learn from reward signals through interaction |
The probabilistic perspective unifies these concepts — learning is inference under uncertainty.