Feed Forward Neural Networks
Neural networks are powerful function approximators that learn hierarchical representations of data
From Linear Models to Neural Networks
- Linear Models perform an affine transformation of inputs: $f(x, \theta) = Wx + b$
- To increase expressivity, we can transform inputs: $f(x, \theta) = W \phi(x) + b$
- Neural networks learn these transformations automatically by composing multiple functions:
- $f(x, \theta) = f_L(f_{L-1}(\cdots f_1(x) \cdots))$
- Each layer extracts progressively more abstract features
Architecture Components
- Layers: Groups of neurons that transform inputs to outputs
- Connections: Weighted links between neurons in adjacent layers
- Activation Functions: Non-linear functions applied to neuron outputs
- Sigmoid: $\sigma(x) = \frac{1}{1+e^{-x}}$, outputs in range (0,1)
- Suffers from vanishing gradient for large magnitude inputs
- TanH: $\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$, outputs in range (-1,+1)
- Centered at zero but still suffers from vanishing gradients
- ReLU: $\max(0, x)$, non-saturating activation function
- Solves vanishing gradient problem for positive inputs
- May cause "dying ReLU" problem (neurons that always output 0)
- Leaky ReLU: $\max(\alpha x, x)$ where $\alpha$ is small (e.g., 0.01)
- Addresses the dying ReLU problem
- GELU: Smooth approximation of ReLU that performs well in modern networks
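As a quick reference, here is a minimal NumPy sketch of the activations listed above; the GELU uses the common tanh-based approximation rather than the exact Gaussian CDF.

```python
import numpy as np

# Minimal NumPy versions of the activation functions above.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # outputs in (0, 1)

def tanh(x):
    return np.tanh(x)                         # outputs in (-1, +1)

def relu(x):
    return np.maximum(0.0, x)                 # non-saturating for x > 0

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)      # small slope alpha for x < 0

def gelu(x):
    # Common tanh-based approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3, 3, 7)
print(sigmoid(x), relu(x), leaky_relu(x), gelu(x), sep="\n")
```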
The XOR Problem
- Classic example showing limitation of linear models
- XOR function cannot be represented by a single linear boundary
- Requires at least one hidden layer to solve
- Demonstrates how composition of functions increases expressivity
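A minimal sketch of the XOR example, assuming PyTorch; the hidden width (8), learning rate, and number of steps are arbitrary illustrative choices. One hidden layer with a non-linearity is enough to fit XOR, while a purely linear model cannot.

```python
import torch
import torch.nn as nn

# XOR inputs and targets: not linearly separable.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# One hidden layer (width 8 is an arbitrary choice) separates XOR.
model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(500):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

# Typically recovers [0, 1, 1, 0] after training.
print(torch.sigmoid(model(X)).round())
```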
Universal Approximation Theorem
- An MLP with a single hidden layer of sufficient width can approximate any continuous function
- In practice, deeper networks (more layers) are more parameter-efficient than wider ones
- Deep networks learn hierarchical representations with increasing abstraction
Backpropagation: Learning Algorithm for Neural Networks
- Efficiently computes gradients of loss with respect to all parameters
- Based on chain rule of calculus applied to computational graphs
- Forward pass: Compute network output and loss
- Backward pass: Propagate gradients from output to input
- Implementation via automatic differentiation frameworks (PyTorch, TensorFlow)
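A minimal sketch of the forward/backward pattern, assuming PyTorch: the forward pass computes the output and loss, and `loss.backward()` populates `.grad` for every parameter via reverse-mode autodiff. The model and data here are placeholders.

```python
import torch
import torch.nn as nn

x = torch.randn(16, 10)               # placeholder batch of 16 inputs
targets = torch.randint(0, 3, (16,))  # placeholder targets for 3 classes
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 3))

logits = model(x)                                         # forward pass
loss = nn.functional.cross_entropy(logits, targets)       # compute loss
loss.backward()                                           # backward pass (backprop)

for name, p in model.named_parameters():
    print(name, p.grad.shape)         # gradient available for every parameter
```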
Backpropagation Algorithm
- Compute gradient of a loss function wrt parameters in each layer
- Equivalent to repeated application of chain rule
- Autodiff: Automatic Differentiation on Computation Graph
- Suppose $f = f_1 \circ f_2 \circ f_3 \circ f_4$
- Jacobian $\mathbf J_f$ needs to be calculated for backprop
- Row Form: $\nabla f_i(\mathbf x)^\top$ is the $i$-th row of the Jacobian
- Computed efficiently using reverse (backward) mode, via vector-Jacobian products
- Column Form: $\frac{\partial \mathbf f}{\partial x_i}$ is the $i$-th column of the Jacobian
- Computed efficiently using forward mode, via Jacobian-vector products (see the sketch below)
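A small sketch of the two modes using `torch.autograd.functional` (assuming PyTorch): a Jacobian-vector product (forward mode) extracts a column of the Jacobian, while a vector-Jacobian product (reverse mode) extracts a row.

```python
import torch
from torch.autograd.functional import jvp, vjp, jacobian

# f: R^3 -> R^2, so the Jacobian is 2 x 3.
def f(x):
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

x = torch.tensor([1.0, 2.0, 3.0])

# Forward mode: J @ v gives the i-th column of J when v is the i-th basis vector.
_, col0 = jvp(f, x, torch.tensor([1.0, 0.0, 0.0]))

# Reverse mode: u^T @ J gives the i-th row of J (the gradient of f_i)
# when u is the i-th basis vector.
_, row0 = vjp(f, x, torch.tensor([1.0, 0.0]))

print(jacobian(f, x))   # full 2x3 Jacobian for comparison
print(col0, row0)       # first column, first row
```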
Derivatives
- Cross-Entropy Layer
- $z = \text{CrossEntropyWithLogitsLoss}(y, x)$
- $z = -\sum_c y_c \log(p_c)$
- $p_c = \frac{\exp(x_c)}{\sum_i \exp(x_i)}$
- $\frac{\partial z}{\partial x_c} = \sum_i \frac{\partial z}{\partial p_i} \times \frac{\partial p_i}{\partial x_c}$
- Term with $i = c$
- $\frac{\partial z}{\partial p_c} \times \frac{\partial p_c}{\partial x_c} = \frac{-y_c}{p_c} \times p_c (1 - p_c) = -y_c (1 - p_c)$
- Terms with $i \ne c$
- $\frac{\partial z}{\partial p_i} \times \frac{\partial p_i}{\partial x_c} = \frac{-y_i}{p_i} \times (-p_i p_c) = y_i p_c$
- Adding up, using $\sum_i y_i = 1$ (checked numerically in the sketch below)
- $-y_c(1-p_c) + \sum_{i \ne c} y_i p_c = p_c \sum_i y_i - y_c = p_c - y_c$
- ReLU
- $\phi(x) = \max(x,0)$
- $\phi'(x) = \mathbb{I}\{x > 0\}$
- Adjoint
- $\frac{\partial o}{\partial x_j} = \sum_{i \in \text{children}(j)} \frac{\partial o}{\partial x_i} \times \frac{\partial x_i}{\partial x_j}$
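A quick numerical check of the derivation above, assuming PyTorch: the autograd gradient of the cross-entropy loss with respect to the logits should match $p - y$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5, requires_grad=True)                    # logits for 5 classes
y = F.one_hot(torch.tensor(2), num_classes=5).float()     # one-hot target

p = F.softmax(x, dim=0)
z = -(y * torch.log(p)).sum()                             # cross-entropy loss
z.backward()

print(x.grad)            # autograd gradient dz/dx
print(p.detach() - y)    # analytic result p - y (should match)
```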
Training Neural Networks
- Maximize the likelihood by minimizing the negative log-likelihood: $\min_\theta L(\theta)$ with $L(\theta) = -\log p(D \mid \theta)$
- Calculate gradients using backprop and use an optimizer to tune the parameters
- Objective function is not convex and there is no guarantee to find a global minimum
- Vanishing Gradients
- Gradients become very small
- Stacked layers diminish the error signals
- Difficult to solve completely; common mitigations:
- Use activation functions that don't saturate
- Switch to architectures with additive operations
- Layer Normalization
- Exploding Gradients
- Gradients become very large
- Stacked layers amplify the error signals
- Controlled via gradient clipping
- Exploding / Vanishing gradients are related to the eigenvalues of the Jacobian matrix
- Chain Rule
- $\frac{\partial L}{\partial z_l} = \frac{\partial L}{\partial z_{l+1}} \times \frac{\partial z_{l+1}}{\partial z_l}$
- $g_l = J_l^\top g_{l+1}$, where $J_l = \frac{\partial z_{l+1}}{\partial z_l}$ is the layer Jacobian and $g_l = \frac{\partial L}{\partial z_l}$
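A minimal sketch of gradient clipping in PyTorch: the global gradient norm is rescaled to at most `max_norm` between the backward pass and the optimizer step. The model, data, and `max_norm=1.0` are placeholder choices.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(8, 10), torch.randn(8, 1)   # placeholder batch
loss = nn.functional.mse_loss(model(x), y)

opt.zero_grad()
loss.backward()
# Rescale gradients so their global norm is at most 1.0, then step.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```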
Non-Saturating Activations
- Sigmoid
- $f(x) = \frac{1}{1 + e^{-x}} = z$
- $f'(x) = z (1 - z)$
- If z is close to 0 or 1, the derivative vanishes
- ReLU
- $f(x) = \max(0, x)$
- $f'(x) = \mathbb{I}\{x > 0\}$
- Gradient does not vanish as long as the input is positive
- Can still encounter the dead ReLU problem when the pre-activation is always negative (e.g., after large negative weights or biases)
- Leaky ReLU
- $f(x, \alpha) = \max(\alpha x, x), \quad 0 < \alpha < 1$
- Slope is 1 for positive inputs
- Slope is $\alpha$ for negative inputs
- If alpha is learnable, then we get parametric ReLU
- ELU, SELU are smooth versions of ReLU
- Swish Activation
- $f(x) = x \sigma(x)$
- $f'(x) = f(x) + \sigma(x) (1 - f(x))$
- The slope has additive operations
Residual Connections
- It's easier to learn a small perturbation of the input than an entirely new mapping
- $z_{l+1} = z_l + F_l(z_l)$, where $F_l$ is the learned residual (activation) branch
- The skip connection doesn't add more parameters
- Unrolling the recursion: $z_L = z_l + \sum_{i=l}^{L-1} F_i(z_i)$
- $\frac{\partial L}{\partial \theta_l} = \frac{\partial L}{\partial z_l} \times \frac{\partial z_l}{\partial \theta_l}$
- $\frac{\partial L}{\partial \theta_l} = \frac{\partial z_l}{\partial \theta_l} \times \frac{\partial L}{\partial z_L} \left(1 + \sum_{i=l}^{L-1} F_i'(z_i)\right)$
- The gradient at layer $l$ therefore contains a term $\frac{\partial L}{\partial z_L}$ that flows directly from the output through the skip connections, independent of the intermediate layers, so it does not vanish
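A minimal residual block sketch, assuming PyTorch; the two-layer residual branch and the width of 64 are arbitrary choices.

```python
import torch
import torch.nn as nn

# Output is the input plus a learned perturbation F(x),
# so gradients have a direct path back to x.
class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.f(x)   # skip connection adds no extra parameters

stack = nn.Sequential(*[ResidualBlock(64) for _ in range(10)])
x = torch.randn(4, 64)
print(stack(x).shape)          # torch.Size([4, 64])
```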
Initialization
- Sampling parameters from standard normal distribution with fixed variance can result in exploding gradients
- Suppose the activations are linear and the weights are sampled i.i.d. with zero mean and variance $\sigma^2$, independently of the inputs
- $o_i = \sum_j w_j x_{ij}$
- $E(o_i) = \sum_j E(w_j) E(x_{ij}) = 0$
- $V(o) \propto n_{in} \sigma^2$
- Similarly for gradients:
- $V(o') \propto n_{out} \sigma^2$
- To prevent the expected variance from blowing up
- $\sigma^2 = \frac{2}{n_{\text{in}} + n_{\text{out}}}$
- Xavier Initialization, Glorot Initialization
- LeCun initialization ($\sigma^2 = 1/n_{\text{in}}$) is recovered when $n_{\text{in}} = n_{\text{out}}$; He initialization uses $\sigma^2 = 2/n_{\text{in}}$, designed for ReLU
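A short sketch of Xavier/Glorot initialization, assuming PyTorch: `nn.init.xavier_normal_` draws weights with variance $2/(n_{\text{in}} + n_{\text{out}})$, which we compare against the empirical standard deviation. The layer sizes are arbitrary.

```python
import math
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)              # n_in = 256, n_out = 128 (arbitrary)
nn.init.xavier_normal_(layer.weight)     # built-in Glorot (normal) initialization
nn.init.zeros_(layer.bias)

expected_std = math.sqrt(2.0 / (256 + 128))
print(layer.weight.std().item(), expected_std)   # empirical vs target std
```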
Regularization
- Early Stopping
- Stop training when error on validation set stops reducing
- Restricts how much information the optimization algorithm can transfer from the training examples to the parameters
- Weight Decay
- Impose prior on parameters and then use MAP estimation
- Encourages smaller weights
- Sparse DNNs
- Encourage sparse weights, enabling model compression (often combined with quantization)
- Dropout
- Turn off the outgoing connections of each unit with probability $p$ during training
- Prevents complex co-adaptation
- Each unit should learn to perform well on its own
- At test time, keeping dropout turned on and averaging over multiple stochastic forward passes approximates an ensemble of networks (Monte Carlo Dropout)
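A minimal sketch combining dropout and weight decay, assuming PyTorch; the architecture and hyperparameters are placeholders. The optimizer's `weight_decay` term plays the role of the Gaussian-prior/L2 penalty, and `model.eval()` disables dropout at test time.

```python
import torch
import torch.nn as nn

# nn.Dropout randomly zeroes activations during training;
# weight_decay in the optimizer penalizes large weights.
model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(64, 2),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

model.train()                  # dropout active during training
x = torch.randn(8, 20)         # placeholder batch
loss = nn.functional.cross_entropy(model(x), torch.randint(0, 2, (8,)))
opt.zero_grad()
loss.backward()
opt.step()

model.eval()                   # dropout disabled at test time
print(model(x).shape)
```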