Logistic Regression
Logistic regression is the foundational discriminative model for classification. Despite its name, it's a classification algorithm (not regression) that directly models posterior class probabilities.
The Big Picture
Unlike generative models (which model how data is generated per class), logistic regression directly models: $$p(y | x, \theta)$$
This "discriminative" approach focuses on the decision boundary rather than the full data distribution.
Binary Logistic Regression
The Model
For binary classification with $y \in \{0, 1\}$:
$$p(y = 1 | x, \theta) = \sigma(w^T x + b)$$
Where σ is the sigmoid function: $$\sigma(a) = \frac{1}{1 + e^{-a}}$$
The probability of class 0 is simply: $$p(y = 0 | x, \theta) = 1 - \sigma(w^T x + b) = \sigma(-(w^T x + b))$$
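A minimal NumPy sketch of the model (the values of `w`, `b`, and `x` are hypothetical). The sigmoid is split by the sign of its argument so `exp` never overflows, and the identity $\sigma(-a) = 1 - \sigma(a)$ is checked at the end:

```python
import numpy as np

def sigmoid(a):
    """Numerically stable sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    a = np.asarray(a, dtype=float)
    out = np.empty_like(a)
    pos = a >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-a[pos]))   # exp of a non-positive number
    ea = np.exp(a[~pos])                       # exp of a negative number
    out[~pos] = ea / (1.0 + ea)
    return out

# Hypothetical parameters and input.
w, b = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 3.0])
a = w @ x + b
p1, p0 = sigmoid([a]), sigmoid([-a])           # p(y=1|x) and p(y=0|x)
assert np.isclose(p0 + p1, 1.0)                # sigma(-a) = 1 - sigma(a)
```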
Alternative Notation
For $y \in \{-1, +1\}$: $$p(y | x, \theta) = \sigma(y \cdot (w^T x + b))$$
This compact form handles both classes with one equation.
The Decision Boundary
Predict $y = 1$ if $p(y = 1 | x) > 0.5$, which occurs when: $$w^T x + b > 0$$
This is a hyperplane in feature space — logistic regression produces linear decision boundaries.
Geometric interpretation:
- w is the normal vector to the decision boundary
- b determines the offset from the origin
- The signed distance $(w^T x + b) / \|w\|$ from the boundary tracks the model's confidence (see the sketch below)
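A small sketch of the decision rule and the signed distance, again with hypothetical parameters and points:

```python
import numpy as np

def predict(X, w, b):
    """Predict y = 1 exactly when w^T x + b > 0."""
    return (X @ w + b > 0).astype(int)

def signed_distance(X, w, b):
    """Signed Euclidean distance of each row of X from the hyperplane."""
    return (X @ w + b) / np.linalg.norm(w)

w, b = np.array([1.0, -2.0]), 0.25                    # hypothetical
X = np.array([[0.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])
print(predict(X, w, b))            # hard labels from the linear boundary
print(signed_distance(X, w, b))    # larger |distance| => higher confidence
```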
Maximum Likelihood Estimation
The Likelihood
For dataset $\{(x_i, y_i)\}_{i=1}^N$: $$L(\theta) = \prod_{i=1}^N p(y_i | x_i, \theta)$$
Negative Log-Likelihood (Binary Cross-Entropy)
$$\text{NLL}(\theta) = -\sum_{i=1}^N [y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)]$$
Where $\hat{y}_i = \sigma(w^T x_i + b)$.
For $y \in \{-1, +1\}$ notation: $$\text{NLL}(\theta) = \sum_{i=1}^N \log(1 + \exp(-y_i(w^T x_i + b)))$$
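Both label conventions give the same loss. A sketch that computes each stably with `np.logaddexp` (using $\log(1 + e^t) = \operatorname{logaddexp}(0, t)$) and checks agreement on hypothetical scores:

```python
import numpy as np

def nll_01(y, s):
    """NLL for y in {0,1}: sum_i [log(1 + e^{s_i}) - y_i s_i]."""
    return np.sum(np.logaddexp(0.0, s) - y * s)

def nll_pm1(y_pm, s):
    """NLL for y in {-1,+1}: sum_i log(1 + e^{-y_i s_i})."""
    return np.sum(np.logaddexp(0.0, -y_pm * s))

s = np.array([2.0, -0.5, 1.3])     # hypothetical scores w^T x_i + b
y = np.array([1.0, 0.0, 1.0])
assert np.isclose(nll_01(y, s), nll_pm1(2 * y - 1, s))
```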
Computing the Gradient
The gradient has a beautiful form: $$\nabla_w \text{NLL} = \sum_{i=1}^N (\hat{y}_i - y_i) x_i = X^T(\hat{y} - y)$$
Intuition: The gradient is a weighted sum of input vectors, where the weights are the prediction errors.
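A sketch of this gradient on random, purely illustrative data, with a finite-difference sanity check on one coordinate:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(X, y, w, b):
    s = X @ w + b
    return np.sum(np.logaddexp(0.0, s) - y * s)

def nll_grad(X, y, w, b):
    """Gradient of the NLL: inputs weighted by prediction errors."""
    err = sigmoid(X @ w + b) - y        # y_hat - y, one error per example
    return X.T @ err, err.sum()         # (grad wrt w, grad wrt b)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=5).astype(float)
gw, _ = nll_grad(X, y, np.zeros(3), 0.0)

# Finite-difference check on the first coordinate of grad_w.
eps = 1e-6
num = (nll(X, y, eps * np.eye(3)[0], 0.0) - nll(X, y, np.zeros(3), 0.0)) / eps
assert np.isclose(gw[0], num, atol=1e-4)
```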
Optimization
Good news: The NLL is convex (Hessian is positive semi-definite).
Methods:
- Gradient Descent / SGD: Simple, works for large datasets
- Newton's Method: Faster convergence for smaller problems
- IRLS: Iteratively Reweighted Least Squares — Newton's method specialized for logistic regression (sketched below)
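A compact IRLS sketch under two simplifying assumptions: the bias is absorbed by appending a column of ones, and a tiny damping term keeps the Newton system solvable when the data is nearly separable:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(X, y, n_iter=20):
    """Newton's method (IRLS) for logistic regression.
    The bias is absorbed by appending a column of ones to X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        mu = sigmoid(Xb @ w)                  # current predicted probabilities
        S = mu * (1.0 - mu)                   # Newton weights
        g = Xb.T @ (mu - y)                   # gradient
        H = Xb.T @ (Xb * S[:, None])          # Hessian X^T S X
        H += 1e-8 * np.eye(H.shape[0])        # damping for near-separable data
        w -= np.linalg.solve(H, g)            # Newton step
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
noisy = X @ np.array([1.5, -2.0]) + 0.3 + rng.normal(size=200)
y = (noisy > 0).astype(float)                 # hypothetical noisy labels
w_hat = irls(X, y)                            # last entry is the fitted bias
```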
Regularization
The Overfitting Problem
MLE can overfit, especially with:
- High-dimensional features
- Small datasets
- Linearly separable data (weights → ∞)
L2 Regularization (Ridge)
Add a Gaussian prior on the weights: $$p(w) = \mathcal{N}(w \mid 0, \lambda^{-1} I)$$
Regularized objective: $$L(w) = \text{NLL}(w) + \lambda \|w\|_2^2$$
Effect: Penalizes large weights, improves generalization.
L1 Regularization (Lasso)
Use Laplace prior for sparse solutions: $$L(w) = \text{NLL}(w) + \lambda \|w\|_1$$
Effect: Some weights become exactly zero — automatic feature selection.
Practical Notes
- Standardize features before applying regularization (the penalty treats all coordinates as if they were on the same scale)
- Don't regularize the bias term
- Choose λ via cross-validation (see the sketch after this list)
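One way to follow all three notes with scikit-learn, assuming it is available: standardization lives inside the pipeline, the regularization strength is chosen by cross-validation (scikit-learn parameterizes it as $C = 1/\lambda$), and its default solver leaves the intercept unpenalized:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Standardize inside the pipeline, then search a grid of 10 values of
# C = 1 / lambda with 5-fold cross-validation.
clf = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=1000),
)
clf.fit(X, y)
```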
Multinomial Logistic Regression
Extending to Multiple Classes
For $y \in \{1, 2, \dots, C\}$:
$$p(y = c | x, \theta) = \frac{\exp(w_c^T x + b_c)}{\sum_{j=1}^C \exp(w_j^T x + b_j)} = \text{softmax}(a)_c$$
Where $a_c = w_c^T x + b_c$ are the logits.
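A stable softmax sketch (shift by the max logit before exponentiating); the weight matrix, bias, and input are hypothetical:

```python
import numpy as np

def softmax(a):
    """Stable softmax: shift by the max logit before exponentiating."""
    z = a - a.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

W = np.array([[0.5, -1.0],        # hypothetical C x D weights (C=3, D=2)
              [1.0,  0.2],
              [-0.3, 0.7]])
bias = np.array([0.1, 0.0, -0.1])
x = np.array([1.0, 2.0])
probs = softmax(W @ x + bias)     # p(y = c | x) for each class c
assert np.isclose(probs.sum(), 1.0)
```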
Overparameterization
Note: We have C sets of weights, but only C-1 are needed. The softmax is unchanged when the same constant is added to every logit, so one class's weights can be pinned at zero and used as the reference.
For binary case with softmax: $$p(y = 0 | x) = \frac{e^{a_0}}{e^{a_0} + e^{a_1}} = \sigma(a_0 - a_1)$$
This reduces to standard logistic regression with $w = w_0 - w_1$.
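A quick numerical check of this reduction, with hypothetical logits:

```python
import numpy as np

a0, a1 = 1.7, -0.4                                # hypothetical logits
p0_softmax = np.exp(a0) / (np.exp(a0) + np.exp(a1))
p0_sigmoid = 1.0 / (1.0 + np.exp(-(a0 - a1)))     # sigma(a0 - a1)
assert np.isclose(p0_softmax, p0_sigmoid)
```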
Maximum Entropy Classifier
When features depend on both x and the class c: $$p(y = c | x, w) \propto \exp(w^T \phi(x, c))$$
This is common in NLP where features might include "word X appears AND class is Y".
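An illustrative sketch with an entirely hypothetical joint feature map: word counts are replicated once per class, so each weight encodes a (word, class) conjunction:

```python
import numpy as np

vocab = ["cheap", "meeting"]          # toy vocabulary
classes = ["spam", "ham"]

def phi(x, c):
    """Joint feature map: one block of word counts per class, so a weight
    can encode 'word "cheap" appears AND class is spam'."""
    f = np.zeros(len(vocab) * len(classes))
    i = classes.index(c) * len(vocab)
    f[i:i + len(vocab)] = x
    return f

def maxent_probs(x, w):
    """p(y = c | x, w) proportional to exp(w^T phi(x, c))."""
    scores = np.array([w @ phi(x, c) for c in classes])
    e = np.exp(scores - scores.max())
    return e / e.sum()

w = np.array([2.0, -1.0, -2.0, 1.0])  # hypothetical learned weights
x = np.array([1.0, 0.0])              # document contains "cheap" once
print(maxent_probs(x, w))             # heavily favors "spam"
```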
Handling Special Cases
Hierarchical Classification
When classes have taxonomy (e.g., animal → mammal → dog):
Label smearing: Propagate positive labels to parent categories.
Approach: Multi-label classification where an example can belong to multiple levels.
Many Classes
Hierarchical softmax: Organize classes in a tree; predict by traversing tree.
- Reduces computation from O(C) to O(log C)
- Put frequent classes near root
Class Imbalance
When some classes are much more common:
Resampling strategies: Draw training examples from class $c$ with probability $$p_c = \frac{N_c^q}{\sum_j N_j^q}$$ where $N_c$ is the number of examples in class $c$ and $q$ controls the rebalancing:
- q = 1: Instance-balanced (original distribution)
- q = 0: Class-balanced (equal weight per class)
- q = 0.5: Square-root sampling (a compromise between the two; all three are sketched below)
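A sketch of the resulting sampling distribution for hypothetical class counts:

```python
import numpy as np

def sampling_probs(counts, q):
    """Per-class sampling probabilities p_c = N_c^q / sum_j N_j^q."""
    counts = np.asarray(counts, dtype=float)
    p = counts ** q
    return p / p.sum()

counts = [1000, 100, 10]             # hypothetical class counts
print(sampling_probs(counts, 1.0))   # instance-balanced: [0.90, 0.09, 0.01]
print(sampling_probs(counts, 0.0))   # class-balanced: uniform over classes
print(sampling_probs(counts, 0.5))   # square-root compromise
```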
Robust Logistic Regression
Handling Outliers and Label Noise
Standard logistic regression is sensitive to mislabeled examples.
Mixture model approach: $$p(y | x) = \pi \cdot \text{Ber}(y | 0.5) + (1 - \pi) \cdot \text{Ber}(y | \sigma(w^T x + b))$$
Mix predictions with uniform noise — mislabeled points have less impact.
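A sketch of the mixture likelihood, treating $\pi$ as the assumed mislabeling rate. The likelihood of any example is floored at $\pi/2$, so a confidently wrong point cannot produce an unbounded loss:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def robust_likelihood(y, s, pi=0.1):
    """p(y|x) = pi * Ber(y|0.5) + (1 - pi) * Ber(y|sigma(s)),
    where pi is the assumed fraction of mislabeled points."""
    p1 = pi * 0.5 + (1.0 - pi) * sigmoid(s)    # p(y = 1 | x)
    return np.where(y == 1, p1, 1.0 - p1)

# A confidently wrong prediction is floored at pi/2 = 0.05, so its
# negative log-likelihood (and its gradient) stays bounded.
print(robust_likelihood(np.array([1.0]), np.array([-10.0])))  # ~0.05, not ~4.5e-5
```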
Bi-tempered Logistic Loss
Two modifications for robustness:
- Tempered cross-entropy: Handles outliers far from boundary
- Tempered softmax: Handles noise near boundary
Probit Regression
Replace sigmoid with Gaussian CDF (probit function): $$p(y = 1 | x) = \Phi(w^T x + b)$$
Similar in shape to the sigmoid but with lighter, faster-decaying (Gaussian) tails. Probit regression also arises naturally from a latent-variable view: threshold a noisy score $w^T x + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$ at zero.
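A quick comparison sketch, assuming SciPy is available. The rescaled probit $\Phi(\sqrt{\pi/8}\, a)$ is a standard close approximation to $\sigma(a)$, which makes the difference easy to see numerically:

```python
import numpy as np
from scipy.stats import norm

a = np.linspace(-4.0, 4.0, 9)
logistic = 1.0 / (1.0 + np.exp(-a))
probit = norm.cdf(a)                  # the probit function Phi(a)

# Phi(sqrt(pi/8) * a) closely tracks sigma(a); the maximum gap over
# this grid is under two percent.
print(np.max(np.abs(logistic - norm.cdf(np.sqrt(np.pi / 8.0) * a))))
```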
Summary
| Aspect | Key Points |
|---|---|
| Model | $p(y=1 \mid x) = \sigma(w^T x + b)$ |
| Loss | Binary cross-entropy (NLL) |
| Optimization | Convex — guaranteed global optimum |
| Boundary | Linear (hyperplane) |
| Regularization | L2 (shrink) or L1 (sparse) |
| Multiclass | Softmax over C classes |
| Robustness | Mixture models, tempered losses |
When to Use Logistic Regression
Good for:
- Binary and multiclass classification
- When interpretability matters (coefficients are meaningful)
- As a baseline before trying complex models
- When computational resources are limited
Limitations:
- Linear decision boundaries
- May underfit complex data
- Sensitive to outliers (without modifications)
Pro tip: Start with logistic regression. If it works well, you may not need anything more complex!