SSL
Data Augmentation
- Artificially modified versions of input vectors that could plausibly appear in real-world data
- Improves accuracy and makes the model more robust
- Shifts empirical risk minimization to vicinal risk minimization
- Minimize risk in the vicinity of each input data point (see the mixup-style sketch below)
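A minimal sketch of vicinal risk minimization in the mixup style (assuming PyTorch; the helper name and the Beta(alpha, alpha) mixing are illustrative choices, not from the notes):

```python
import torch

def mixup_batch(x, y_onehot, alpha=0.2):
    """Create 'vicinal' training points by blending each example with a random partner."""
    lam = torch.distributions.Beta(alpha, alpha).sample()   # mixing coefficient in (0, 1)
    perm = torch.randperm(x.size(0))                        # random partner for each example
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    return x_mix, y_mix

# Usage: x_mix, y_mix = mixup_batch(images, labels_onehot)
# and the loss is then computed against the mixed (soft) targets.
```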
Transfer Learning
- Some data poor tasks may have structural similarity to other data rich tasks
- Transferring information from one dataset to another via shared parameters of a model
- Pretrain the model on a large source dataset
- Fine tune the model on a small target dataset
- Chop off the head of the pretrained model and add a new, task-specific one
- The pretrained parameters may be frozen during fine-tuning
- If the parameters aren't frozen, use small learning rates (sketch below)
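A sketch of the freeze-and-replace-head recipe, assuming PyTorch and a recent torchvision; the dataset, class count, and learning rates are placeholders:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained on a large source dataset (ImageNet); adapt to a small 10-class target task.
backbone = models.resnet18(weights="IMAGENET1K_V1")
for p in backbone.parameters():
    p.requires_grad = False                               # freeze pretrained parameters

backbone.fc = nn.Linear(backbone.fc.in_features, 10)      # chop off the head, add a new one

# Only the new head is trained; if unfreezing the backbone instead, use a small
# learning rate (e.g. 1e-4) so the pretrained weights are not destroyed.
optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
```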
Adapters
- Modify the model structure to customize feature extraction
- For example: Add MLPs after transformer blocks and initialize them for identity mappings
- Far fewer parameters to learn during fine-tuning
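A sketch of an adapter block under the usual residual-bottleneck assumption; zero-initializing the up-projection makes the whole block start as an identity mapping (dimensions are illustrative):

```python
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck MLP inserted after a (frozen) transformer block."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # zero-initialized up-projection means the
        nn.init.zeros_(self.up.bias)     # residual branch contributes nothing at the start

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))   # identity + small learned correction
```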
Pre-training
- Can be supervised or unsupervised.
- Supervised
- ImageNet classification is a common supervised pretraining task
- Less helpful when the target domain is unrelated to the source domain
- Acts more like a speedup trick that provides a good initialization
- Unsupervised
- Use unlabeled dataset
- Minimize reconstruction error
- Self-supervised
- Labels are created algorithmically from an unlabeled dataset
- Cloze Task
- Fill in the blanks
- Proxy Tasks
- Create representations
- Siamese Neural Networks
- Capture relationship between inputs
- Contrastive Tasks
- Use data augmentation
- Ensure that similar inputs have closer representations
- SimCLR
- A Simple framework for Contrastive Learning of visual Representations
- Pretraining
- Take an unlabeled image X
- Apply two data augmentations A and A' to X to get two views A(X) and A'(X)
- Apply encoder to both views
- Apply projection head to both encodings
- Contrastive loss over the mini-batch: the two views of the same image form the positive pair, other images serve as negatives (NT-Xent sketch below)
- Fine-tuning
- Fine-tune the encoder together with a new task-specific head
- The original projection head is discarded
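A sketch of the SimCLR-style contrastive (NT-Xent) loss, assuming PyTorch; z1 and z2 are the projection-head outputs of the two views of the same mini-batch:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss over a mini-batch: the two views of an image are the positive pair,
    every other view in the batch acts as a negative."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # 2N normalized projections
    sim = z @ z.t() / temperature                                # pairwise cosine similarities
    sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                         # identify the positive pair
```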
- NLP Encoders
- Contrastively trained language models
- SimCSE: Simple Contrastive Learning of Sentence Embeddings
- Dropout acts as data augmentation
- Cosine Similarity between representations
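A sketch of the unsupervised SimCSE objective; `encoder` stands in for any sentence encoder left in train mode so that dropout produces two slightly different embeddings of the same batch (names and the temperature are illustrative):

```python
import torch
import torch.nn.functional as F

def simcse_loss(encoder, batch, temperature=0.05):
    """Dropout acts as the data augmentation: two forward passes give two 'views'."""
    z1 = encoder(batch)    # encoder must be in train mode,
    z2 = encoder(batch)    # so dropout differs between the two passes
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    targets = torch.arange(z1.size(0), device=z1.device)   # positive pair = same sentence
    return F.cross_entropy(sim, targets)
```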
- Non-Contrastive Learning
- BYOL
- Bootstrap your own latent (BYOL)
- Two copies of encoder
- Student: online network, Teacher: target network
- Teacher parameters are an exponential moving average (EMA) of the student's
- MSE loss between the student's prediction and the teacher's projection (EMA sketch below)
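A minimal sketch of the BYOL target update, assuming PyTorch; the toy encoder only illustrates the two copies and the EMA rule:

```python
import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))  # online network
teacher = copy.deepcopy(student)                                            # target network
for p in teacher.parameters():
    p.requires_grad = False      # the teacher is never updated by gradients

@torch.no_grad()
def ema_update(teacher, student, tau=0.996):
    """teacher <- tau * teacher + (1 - tau) * student, applied after each student step."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(tau).add_(ps, alpha=1 - tau)
```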
Semi-Supervised Learning
- Learn from labeled + unlabeled data in tandem
- Self-Training
- Train the model on labeled data
- Run inference on unlabeled data to get predictions (pseudo-labels, machine generated)
- Combine machine generated and human generated labels
- Self-training is similar to the EM algorithm, where generating pseudo-labels corresponds to the E-step
- Noisy Student Training
- Adds noise to the student model to improve generalization
- Add noise via dropout, stochastic depth, and data augmentation (pseudo-labeling sketch below)
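A sketch of the pseudo-labeling step referenced above, assuming a trained classifier `model` and an iterable of unlabeled batches; the confidence threshold is a common heuristic, not something the notes specify:

```python
import torch

def pseudo_label(model, unlabeled_batches, threshold=0.9):
    """Run inference on unlabeled data and keep only confident machine-generated labels."""
    model.eval()
    pseudo = []
    with torch.no_grad():
        for x in unlabeled_batches:
            probs = torch.softmax(model(x), dim=1)
            conf, label = probs.max(dim=1)
            keep = conf >= threshold
            pseudo.append((x[keep], label[keep]))
    return pseudo   # combine these with the human-labeled data and retrain a (noised) student
```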
- Consistency Regularization
- Model's prediction shouldn't change much for small changes to the input
- Can be implemented by penalizing differences between the model's predictions on augmented versions of the same input (sketch below)
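A sketch of a consistency loss under the common choice of a KL penalty between predictions on two augmentations of the same input (MSE on the probabilities is another option):

```python
import torch.nn.functional as F

def consistency_loss(model, x_aug1, x_aug2):
    """Predictions on two augmented views of the same input should agree."""
    target = F.softmax(model(x_aug1), dim=1).detach()   # one view serves as the target
    log_pred = F.log_softmax(model(x_aug2), dim=1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```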
- Label Propagation
- Graph: nodes are input data points and edges denote similarity
- Use graph clustering to group related nodes
- Class labels are assigned to unlabeled data, based on the cluster distribution
- Node labels: the class labels attached to the data points
- Propagate labels so that there is minimal label disagreement between a node and its neighbours
- Produces label guesses for unlabeled data that can then be used for supervised learning
- Details:
- M labeled points, N unlabeled points
- T: (M+N) x (M+N) transition matrix of normalized edge weights
- Y: (M+N) x C label matrix holding each point's class distribution
- Propagate labels by iterating $Y \leftarrow TY$ until convergence (sketch below)
- Success depends on calculating similarity between data points
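A NumPy sketch of the $Y \leftarrow TY$ iteration; clamping the labeled rows after each step and row-renormalizing are standard choices that the notes leave implicit:

```python
import numpy as np

def label_propagation(T, Y, labeled_mask, n_iter=100):
    """T: (M+N)x(M+N) row-normalized similarity matrix, Y: (M+N)xC label distributions,
    labeled_mask: boolean vector marking the M labeled points."""
    Y = Y.copy()
    Y_known = Y[labeled_mask].copy()
    for _ in range(n_iter):
        Y = T @ Y                                        # propagate labels along edges
        Y /= Y.sum(axis=1, keepdims=True) + 1e-12        # keep each row a distribution
        Y[labeled_mask] = Y_known                        # clamp the known labels
    return Y    # rows of unlabeled points are the propagated label guesses
```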
Generative Models
- Natural way of using unlabeled data by learning a model of the data-generating process
- Variational Autoencoders
- Models joint distribution of data (x) and latent variables (z)
- First sample $z \sim p(z)$, then sample $x \sim p(x|z)$
- Encoder: approximates the posterior $p(z|x)$
- Decoder: approximates the likelihood $p(x|z)$
- Maximize the evidence lower bound (ELBO) on the data likelihood, derived via Jensen's inequality (sketch below)
- Use VAEs to learn representations for downstream tasks
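A sketch of the negative ELBO and the reparameterization step, assuming a Gaussian encoder $q(z|x)$ with outputs $\mu, \log\sigma^2$ and a standard normal prior (the MSE reconstruction term corresponds to a Gaussian decoder):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps so the sampling step stays differentiable."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def negative_elbo(x, x_recon, mu, logvar):
    """Reconstruction term + KL(q(z|x) || p(z)) with p(z) = N(0, I)."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```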
- Generative Adversarial Networks
- Generator: Maps latent distribution to data space
- Discriminator: Distinguish between outputs of generator and true distribution
- Modify the discriminator to predict the class labels plus a "fake" class, rather than just real vs. fake (semi-supervised GAN)
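A tiny sketch of the modified discriminator head: K real classes plus one extra "fake" class (the feature dimension and K are placeholders):

```python
import torch.nn as nn

K = 10                                      # number of real classes
feature_dim = 256                           # output size of the discriminator's feature extractor
disc_head = nn.Linear(feature_dim, K + 1)   # logits for the K classes plus a "fake" class
```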
Active Learning
- Identify the true predictive mapping by querying as few labeled data points as possible
- Query synthesis: the model requests the label for any input it synthesizes
- Pool-based: the model selects data points to label from a pool of unlabeled points
- Maximum Entropy Sampling
- Uncertainty in predicted label
- Fails when examples are ambiguous or mislabeled
- Bayesian Active Learning by Disagreement (BALD)
- Select examples on which models sampled from the posterior make highly diverse (disagreeing) predictions (scores sketched below)
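Sketches of the two acquisition scores, assuming class-probability tensors; for BALD, mc_probs would come from several stochastic forward passes (e.g. MC dropout), an assumption not spelled out in the notes:

```python
import torch

def max_entropy_score(probs):
    """Uncertainty of the predicted label: pick points with the highest entropy."""
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

def bald_score(mc_probs):
    """mc_probs: (num_samples, batch, classes). High when the sampled models disagree:
    entropy of the mean prediction minus the mean entropy of individual predictions."""
    mean_p = mc_probs.mean(dim=0)
    h_of_mean = -(mean_p * mean_p.clamp_min(1e-12).log()).sum(dim=1)
    mean_of_h = -(mc_probs * mc_probs.clamp_min(1e-12).log()).sum(dim=2).mean(dim=0)
    return h_of_mean - mean_of_h
```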
Few-Shot Learning
- Learn to predict from very few labeled examples
- One-Shot Learning: Learn to predict from a single example
- Zero-Shot Learning: Learn to predict without labeled examples
- The model has to generalize to labels unseen at training time
- Works by learning a distance metric (nearest-prototype sketch below)
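A nearest-prototype sketch of the learned-distance-metric idea (prototypical-network style; the embeddings are assumed to come from a pretrained encoder):

```python
import torch

def nearest_prototype(support_emb, support_labels, query_emb, num_classes):
    """Build one prototype per class from the few labeled (support) examples and
    assign each query the label of its closest prototype."""
    protos = torch.stack([support_emb[support_labels == c].mean(dim=0)
                          for c in range(num_classes)])
    dists = torch.cdist(query_emb, protos)     # Euclidean distances to each prototype
    return dists.argmin(dim=1)
```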
Weak Supervision
- Exact labels are not available for the data points
- Instead, a distribution over labels is given for each case
- Soft labels / label smoothing
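A small sketch of turning hard labels into soft (smoothed) label distributions; the smoothing amount eps is illustrative:

```python
import torch

def smooth_labels(hard_labels, num_classes, eps=0.1):
    """Put 1 - eps on the annotated class and spread eps uniformly over the others."""
    soft = torch.full((hard_labels.size(0), num_classes), eps / (num_classes - 1))
    soft.scatter_(1, hard_labels.unsqueeze(1), 1 - eps)
    return soft
```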
Semi-Supervised and Self-Supervised Learning
Semi-Supervised Learning (SSL): Leveraging both labeled and unlabeled data
- Motivation: Labels are expensive, unlabeled data is abundant
- Assumption: Underlying data distribution contains useful structure
Data Augmentation
- Creates artificial training examples through transformations
- Preserves semantic content while changing surface features
- Common augmentations:
- Image domain: rotations, flips, color jitter, cropping
- Text domain: synonym replacement, back-translation
- Audio domain: pitch shifting, time stretching
- Theoretical framework: Vicinal risk minimization
- Minimize risk in local neighborhoods around training examples
- Improves robustness and generalization
Transfer Learning
- Leverages knowledge from data-rich domains to improve performance in data-poor domains
- Process:
- Pretrain model on large source dataset (e.g., ImageNet, Common Crawl)
- Adapt model to target task with smaller dataset
- Options for adaptation:
- Feature extraction: Freeze pretrained layers, train only new head
- Fine-tuning: Update all or subset of pretrained parameters
- Parameter-efficient fine-tuning:
- Adapters: Small bottleneck layers added between frozen transformer blocks
- LoRA: Low-rank adaptation of weight matrices
- Prompt tuning: Learn soft prompts while keeping model parameters frozen
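A minimal LoRA sketch, assuming PyTorch: the pretrained linear weight is frozen and only a low-rank update B·A is trained (rank and scaling are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a pretrained nn.Linear; only the low-rank factors A and B are trainable."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # frozen pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init => no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```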
Self-Supervised Learning
Creates supervisory signals from unlabeled data
Pretext tasks:
- Reconstruction tasks: Autoencoders, masked language modeling
- Context prediction: Predict arrangement of shuffled patches
- Contrastive tasks: Learn similar representations for related inputs
Contrastive Learning
Learn representations by comparing similar and dissimilar examples
SimCLR framework:
- Generate two views of each image via augmentation
- Encode both views with shared encoder
- Apply projection head to map encodings to space for contrastive loss
- Contrastive loss: Maximize similarity between positive pairs (same image) and minimize similarity between negative pairs (different images)
- For downstream tasks, discard projection head and fine-tune encoder
Key challenges:
- Hard negative mining: Finding informative negative examples
- Batch size dependence: Performance scales with number of negatives
- Feature collapse: Trivial solutions that ignore semantic content
Non-Contrastive Methods
BYOL (Bootstrap Your Own Latent):
- Teacher-student architecture with no negative examples
- Student network predicts teacher network outputs
- Teacher parameters updated via exponential moving average of student
- Avoids collapse through asymmetric architecture and predictor networks
Masked Autoencoders:
- Inspired by BERT's success in NLP
- Mask significant portions of input (e.g., 75% of image patches)
- Train encoder-decoder to reconstruct original input
- For downstream tasks, use only encoder
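A sketch of the random patch masking at the heart of a masked autoencoder (patch embedding and the decoder are omitted; the 75% ratio follows the example above):

```python
import torch

def random_mask_patches(patches, mask_ratio=0.75):
    """patches: (batch, num_patches, dim). Keep a random 25% of patch tokens per image;
    the encoder sees only the visible tokens and the decoder reconstructs the rest."""
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1 - mask_ratio))
    keep_idx = torch.rand(batch, num_patches).argsort(dim=1)[:, :num_keep]
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
    return visible, keep_idx
```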
Practical Considerations
- Pretraining often provides:
- Better initialization for optimization
- More generalizable features
- Sample efficiency: Fewer labeled examples needed
- Domain gap between pretraining and target task affects transfer effectiveness
- Large pretrained models may contain useful knowledge but require careful adaptation