Self-Supervised and Semi-Supervised Learning
These techniques leverage unlabeled data to improve learning. In a world where labels are expensive but data is abundant, these methods are increasingly important.
The Big Picture
The label bottleneck: Labeling data is expensive and time-consuming.
The opportunity: Vast amounts of unlabeled data are available.
The goal: Learn useful representations from unlabeled data that transfer to downstream tasks.
Data Augmentation
The Idea
Create modified versions of training examples that preserve the label.
Effect: Expands training set, improves robustness.
Common Augmentations
Images:
- Rotation, flipping, cropping
- Color jitter, brightness changes
- Random erasing, cutout
Text:
- Synonym replacement
- Back-translation
- Random insertion/deletion
Audio:
- Pitch shifting
- Time stretching
- Adding noise
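As a concrete example, the image augmentations above can be composed into a single random pipeline. This is a minimal sketch using torchvision; the specific transforms and parameter values are illustrative, not canonical.

```python
from torchvision import transforms

# Illustrative augmentation pipeline: every call yields a different random view
# of the same image, and the label is assumed unchanged by these transforms.
train_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),              # random crop, resized to 224x224
    transforms.RandomHorizontalFlip(p=0.5),         # flip half of the time
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),     # brightness/contrast/saturation/hue jitter
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),               # cutout-style random erasing (on the tensor)
])
```

Applying the pipeline inside the dataset's `__getitem__` means every epoch sees fresh views of each example.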
Theoretical View: Vicinal Risk Minimization
Instead of minimizing risk at exact training points, minimize in a vicinity around them:
$$R_{\text{vic}}(f) = \frac{1}{n} \sum_{i=1}^{n} \int L(y_i, f(x'))\, p(x' \mid x_i)\, dx'$$
Where $p(x' \mid x_i)$ is the augmentation (vicinity) distribution around training point $x_i$.
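In practice the integral is approximated by sampling: averaging the loss over a few augmented copies of each batch is a Monte Carlo estimate of the vicinal risk. A minimal sketch, assuming `augment` draws one sample from $p(x' \mid x)$:

```python
import torch

def vicinal_loss(model, loss_fn, x, y, augment, n_samples=4):
    # Monte Carlo estimate of the vicinal risk: average the loss over a few
    # augmented samples x' ~ p(x'|x); the labels y are assumed unchanged.
    losses = [loss_fn(model(augment(x)), y) for _ in range(n_samples)]
    return torch.stack(losses).mean()
```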
Transfer Learning
The Problem
Task A has lots of data; Task B has little data.
The Solution
- Pretrain on large source dataset (Task A)
- Fine-tune on small target dataset (Task B)
Fine-tuning Strategies
Feature extraction: Freeze pretrained layers, train only new head.
- Best when: Very small target data, similar domains
Full fine-tuning: Update all parameters.
- Best when: More target data, different domains
Gradual unfreezing: Start from top layers, progressively unfreeze.
- Often best practice
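A minimal PyTorch sketch of the feature-extraction strategy, assuming a torchvision ResNet-18 backbone and a hypothetical 10-class target task:

```python
import torch.nn as nn
from torchvision import models

num_target_classes = 10  # hypothetical target task

# Load an ImageNet-pretrained backbone (Task A).
model = models.resnet18(weights="IMAGENET1K_V1")

# Feature extraction: freeze all pretrained parameters...
for param in model.parameters():
    param.requires_grad = False

# ...and replace the classification head; only this new layer is trained (Task B).
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

# For full fine-tuning, skip the freezing loop and use a small learning rate;
# for gradual unfreezing, re-enable requires_grad layer by layer during training.
```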
Parameter-Efficient Fine-tuning
For large models, updating all parameters is expensive.
Adapters: Small bottleneck layers inserted between frozen transformer blocks.
LoRA: Low-rank updates to weight matrices.
Prompt tuning: Learn soft prompts while keeping model frozen.
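To illustrate the LoRA idea, the sketch below wraps a frozen linear layer with a trainable low-rank update; the rank `r` and scaling `alpha` are illustrative choices, not values from any particular paper or library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # the pretrained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Only A and B (a tiny fraction of the parameters) receive gradient updates.
lora_layer = LoRALinear(nn.Linear(768, 768), r=8)
```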
Self-Supervised Learning
The Core Idea
Create supervisory signals from the data itself — no human labels needed.
Pretext Tasks
Reconstruction: Predict masked or corrupted parts
- Autoencoders
- Masked language modeling (BERT)
- Masked image modeling (MAE)
Contrastive: Learn to distinguish similar from dissimilar
- Positive pairs: Same image, different augmentations
- Negative pairs: Different images
Predictive: Predict properties of the data
- Rotation prediction (images)
- Next word prediction (language)
Contrastive Learning
The Framework
- Create two views of same example (via augmentation)
- Push their representations together
- Push representations of different examples apart
SimCLR (a Simple framework for Contrastive Learning of visual Representations)
Pretraining:
- Take image x
- Apply two augmentations: $x_1, x_2$
- Encode both: $z_1 = g(f(x_1)), z_2 = g(f(x_2))$
- Contrastive loss (NT-Xent), computed over the $2N$ augmented examples in a batch: $$L = -\log \frac{\exp(\text{sim}(z_1, z_2)/\tau)}{\sum_{k \neq 1} \exp(\text{sim}(z_1, z_k)/\tau)}$$ where the denominator sums over all other embeddings in the batch and $\tau$ is a temperature
Fine-tuning: Discard projection head g, fine-tune encoder f.
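A minimal PyTorch sketch of the NT-Xent loss, assuming `z1[i]` and `z2[i]` are the two augmented views of image `i` after the projection head:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, tau=0.5):
    # z1, z2: (N, d) projections of two augmented views of the same N images.
    n = z1.shape[0]
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)      # (2N, d), unit-norm rows
    sim = z @ z.T / tau                                      # cosine similarity / temperature
    self_mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude self-similarity
    # The positive for row i is row i + N (and vice versa); all other rows are negatives.
    targets = torch.cat([torch.arange(n, 2 * n, device=z.device),
                         torch.arange(0, n, device=z.device)])
    return F.cross_entropy(sim, targets)

loss = nt_xent_loss(torch.randn(128, 64), torch.randn(128, 64))
```

Cross-entropy over each similarity row is exactly the $-\log$ ratio above, averaged over all $2N$ anchors.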
Key Insights
- Projection head is crucial during pretraining but discarded after
- Large batch sizes help (more negatives)
- Strong augmentation is important
Challenges
- Need many negative examples
- Batch size dependence
- Risk of feature collapse
Non-Contrastive Methods
BYOL (Bootstrap Your Own Latent)
No negative examples needed!
Architecture:
- Online network (student): Updated by gradient descent
- Target network (teacher): Exponential moving average of online
Loss: The online network's predictor must match the target network's representation of the other view (mean squared error between L2-normalized vectors).
Why no collapse? Asymmetric architecture + momentum updates prevent trivial solutions.
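A minimal sketch of the two BYOL-specific ingredients, the prediction loss and the momentum (EMA) update of the target network; in practice the loss is applied symmetrically to both views.

```python
import torch
import torch.nn.functional as F

def byol_loss(p_online, z_target):
    # MSE between L2-normalized online prediction and target projection
    # (equivalently 2 - 2 * cosine similarity). No gradient flows into the target.
    p = F.normalize(p_online, dim=1)
    z = F.normalize(z_target.detach(), dim=1)
    return (2 - 2 * (p * z).sum(dim=1)).mean()

@torch.no_grad()
def ema_update(target_net, online_net, momentum=0.99):
    # Target parameters are an exponential moving average of the online parameters.
    for t, o in zip(target_net.parameters(), online_net.parameters()):
        t.mul_(momentum).add_(o, alpha=1 - momentum)
```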
Masked Autoencoders (MAE)
Inspired by BERT's success:
- Mask a large portion of the image (typically 75%!)
- Encode visible patches
- Decode to reconstruct masked patches
- For downstream: Use only encoder
Why mask so much? Forces learning high-level features, not just copying.
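A minimal sketch of the random patch-masking step, assuming the image has already been split into patch embeddings; the 75% ratio follows the description above.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    # patches: (batch, num_patches, dim). Keep a random 25% of patches per example;
    # only these visible patches are passed to the encoder.
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))
    ids_shuffle = torch.rand(b, n).argsort(dim=1)        # random permutation per example
    ids_keep = ids_shuffle[:, :n_keep]                   # indices of the visible patches
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, ids_shuffle                          # ids_shuffle lets the decoder restore order

visible, _ = random_masking(torch.randn(8, 196, 768))    # 196 patches -> 49 visible
```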
Semi-Supervised Learning
Use both labeled and unlabeled data together.
Self-Training
- Train on labeled data
- Predict on unlabeled data (pseudo-labels)
- Add confident predictions to training set
- Repeat
Connection to EM: Pseudo-labels are the E-step!
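A minimal self-training loop, sketched with scikit-learn; the confidence threshold and number of rounds are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, n_rounds=5, threshold=0.95):
    X, y, pool = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    model = LogisticRegression(max_iter=1000).fit(X, y)       # 1. train on labeled data
    for _ in range(n_rounds):
        if len(pool) == 0:
            break
        preds = model.predict(pool)                           # 2. pseudo-label unlabeled data
        conf = model.predict_proba(pool).max(axis=1)
        keep = conf >= threshold                              # 3. keep only confident predictions
        if not keep.any():
            break
        X = np.vstack([X, pool[keep]])
        y = np.concatenate([y, preds[keep]])
        pool = pool[~keep]
        model = LogisticRegression(max_iter=1000).fit(X, y)   # 4. retrain and repeat
    return model
```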
Noisy Student Training
Self-training + noise:
- Train teacher on labeled data
- Generate pseudo-labels for unlabeled data
- Train student on all data with noise (dropout, augmentation)
- Student becomes new teacher; repeat
Key insight: Noise makes student more robust than teacher.
Consistency Regularization
Model predictions should be consistent under small input changes: $$L = L_{supervised} + \lambda \cdot d(f(x), f(\text{aug}(x)))$$
FixMatch: Combine pseudo-labeling with consistency.
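A minimal sketch of a FixMatch-style unlabeled loss term: the weakly augmented view provides a pseudo-label, the strongly augmented view is trained to match it, and low-confidence examples are masked out. The threshold is illustrative.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_weak, x_strong, threshold=0.95):
    with torch.no_grad():
        probs = F.softmax(model(x_weak), dim=1)        # predictions on the weak view
        conf, pseudo = probs.max(dim=1)                # pseudo-labels and their confidence
        mask = (conf >= threshold).float()             # trust only confident pseudo-labels
    per_example = F.cross_entropy(model(x_strong), pseudo, reduction="none")
    return (mask * per_example).mean()

# Total objective: L_supervised + lambda_u * consistency_loss(model, x_weak, x_strong)
```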
Label Propagation
Graph-Based Approach
- Build graph: Nodes = data points, edges = similarity
- Propagate labels from labeled to unlabeled nodes
- Use resulting labels for training
Algorithm
- T: Transition matrix (normalized edge weights)
- Y: Label matrix (N × C)
- Iterate $Y \leftarrow TY$ until convergence, clamping the rows of labeled points back to their known labels after each step
- Use propagated labels for supervised learning
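A minimal NumPy sketch of this iteration:

```python
import numpy as np

def label_propagation(W, Y_init, labeled_mask, n_iters=100):
    # W: (N, N) symmetric similarity matrix; Y_init: (N, C) with one-hot rows for
    # labeled points and zeros for unlabeled ones; labeled_mask: boolean (N,).
    T = W / W.sum(axis=1, keepdims=True)           # row-normalized transition matrix
    Y = Y_init.copy()
    for _ in range(n_iters):
        Y = T @ Y                                  # propagate one step
        Y[labeled_mask] = Y_init[labeled_mask]     # clamp the known labels
    return Y.argmax(axis=1)                        # hard labels for supervised training
```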
Assumptions
- Similar points should have similar labels
- Cluster structure in data reflects label structure
Generative Self-Supervised Learning
Variational Autoencoders (VAE)
Generative model:
- Sample latent: $z \sim p(z)$
- Generate: $x \sim p(x|z)$
Training: Maximize the ELBO (Evidence Lower Bound) $$\log p(x) \geq \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\mathrm{KL}}(q(z|x) \,\|\, p(z))$$
Use for SSL: Learn representations z for downstream tasks.
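A minimal sketch of the negative ELBO for a Gaussian encoder $q(z|x) = \mathcal{N}(\mu, \mathrm{diag}(e^{\log\sigma^2}))$ with a standard normal prior, plus the reparameterization trick; the Gaussian-decoder (MSE) reconstruction term is an assumption.

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_recon, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)] up to a constant, for a Gaussian decoder.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # KL term: D_KL(q(z|x) || N(0, I)) in closed form.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

def reparameterize(mu, logvar):
    # z = mu + sigma * eps keeps the sampling step differentiable.
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```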
GANs
Generator: Maps noise to data. Discriminator: Distinguishes real from fake.
Semi-supervised extension: Discriminator predicts K classes + "fake".
Active Learning
The Idea
If we must label data, label the most informative examples.
Strategies
Uncertainty sampling: Label examples where model is least confident. $$x^* = \arg\max_x H[p(y|x)]$$
BALD (Bayesian Active Learning by Disagreement): Label where posterior samples of the model (ensemble members or MC-dropout passes) disagree most about the prediction while each is individually confident, i.e., maximize the mutual information between the prediction and the model parameters.
Query by committee: Multiple models vote; label where they disagree.
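A minimal sketch of uncertainty sampling by predictive entropy; `probs` is assumed to hold the model's class probabilities on the unlabeled pool.

```python
import numpy as np

def entropy_query(probs, k=10):
    # probs: (N, C) predicted class probabilities for the unlabeled pool.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)   # H[p(y|x)] per example
    return np.argsort(-entropy)[:k]                            # indices of the k most uncertain points
```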
Few-Shot Learning
The Challenge
Learn to classify new classes from very few examples (1-5 per class).
Meta-Learning Approach
Train model to learn quickly:
- Training: Many "episodes" with different class subsets
- Testing: New classes, few examples each
Metric Learning Approach
Learn embedding space where similarity = class membership.
- Prototypical networks: Classify by nearest class prototype
- Matching networks: Weighted nearest neighbor
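A minimal sketch of prototypical-network classification: class prototypes are the mean support embeddings, and queries are scored by negative distance to each prototype.

```python
import torch

def prototypical_logits(support, support_labels, query, n_classes):
    # support: (S, d) embeddings with integer labels in [0, n_classes); query: (Q, d).
    prototypes = torch.stack([
        support[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])                                             # (n_classes, d) class means
    dists = torch.cdist(query, prototypes)         # (Q, n_classes) Euclidean distances
    return -dists                                  # closer prototype -> higher logit
```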
Weak Supervision
When Labels Are Imperfect
- Noisy labels (some wrong)
- Soft labels (probability distributions)
- Aggregate from multiple labelers
Label Smoothing
Instead of hard labels [0, 1, 0]: $$y_{smooth} = (1 - \epsilon) \cdot y + \epsilon / K$$
Prevents overconfidence, improves calibration.
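A minimal example: PyTorch's cross-entropy supports label smoothing directly, and the smoothed target can also be formed by hand.

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # epsilon = 0.1

def smooth_labels(y_onehot, eps=0.1):
    # (1 - eps) * one-hot + eps / K, matching the formula above.
    k = y_onehot.shape[-1]
    return (1 - eps) * y_onehot + eps / k
```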
Summary
| Method | Uses Unlabeled Data | Key Idea |
|---|---|---|
| Data Augmentation | No (extends labeled) | Transform while preserving label |
| Transfer Learning | Pre-training stage | Leverage large datasets |
| Contrastive Learning | Yes | Pull similar, push dissimilar |
| Non-Contrastive | Yes | Predict across views (no negatives) |
| Semi-Supervised | Yes (with some labels) | Pseudo-labels + consistency |
| Active Learning | Selects what to label | Query most informative |
Practical Recommendations
- Always use data augmentation — almost free improvement
- Start with pretrained models — rarely worth training from scratch
- Try self-training for semi-supervised (simple and effective)
- Large-scale pretraining for best representations (if compute allows)
- Match pretraining and downstream domains for best transfer