Feed-Forward Neural Networks
Neural networks are powerful function approximators that learn hierarchical representations. This chapter covers the fundamentals of deep learning.
The Big Picture
Key insight: Compose simple functions to create complex ones. $$f(x) = f_L(f_{L-1}(...f_2(f_1(x))...))$$
Each layer transforms its input, extracting progressively more abstract features.
From Linear Models to Neural Networks
Limitations of Linear Models
Linear models: $f(x) = Wx + b$
Problem: Can only represent linear decision boundaries.
Feature Engineering
One solution: Transform features first: $$f(x) = W\phi(x) + b$$
Where $\phi(x)$ are hand-crafted features (polynomials, interactions, etc.).
Problem: Requires domain expertise; doesn't scale.
The Neural Network Solution
Learn the features automatically! $$f(x) = W_L \cdot \sigma(W_{L-1} \cdot \sigma(...\sigma(W_1 x + b_1)...) + b_{L-1}) + b_L$$
Each layer learns a useful transformation.
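To make the composition concrete, here is a minimal NumPy sketch of a forward pass through a network with two hidden ReLU layers; the layer sizes and random initialization are illustrative, not prescribed by the text.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, params):
    """Compose layer transformations: W3·σ(W2·σ(W1 x + b1) + b2) + b3."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = relu(W1 @ x + b1)   # first learned feature transformation
    h2 = relu(W2 @ h1 + b2)  # second, more abstract transformation
    return W3 @ h2 + b3      # linear read-out layer

rng = np.random.default_rng(0)
sizes = [4, 16, 16, 3]       # illustrative: 4 inputs, two hidden layers, 3 outputs
params = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    params += [rng.normal(0, 1 / np.sqrt(n_in), (n_out, n_in)), np.zeros(n_out)]

y = mlp_forward(rng.normal(size=4), params)
print(y.shape)  # (3,)
```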
Activation Functions
Non-linear functions applied after each layer. Without them, the network would collapse to a single linear transformation.
Sigmoid
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
- Output: (0, 1)
- Problem: Vanishing gradients (saturates for large |x|)
- Problem: Not zero-centered
Tanh
$$\tanh(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$
- Output: (-1, 1)
- Zero-centered (better than sigmoid)
- Still has vanishing gradient problem
ReLU (Rectified Linear Unit)
$$\text{ReLU}(x) = \max(0, x)$$
- Output: [0, ∞)
- Pros: Non-saturating, computationally efficient, sparse activations
- Cons: "Dead ReLU" — neurons can get stuck at 0
Leaky ReLU
$$\text{LeakyReLU}(x) = \max(\alpha x, x)$$
- Small slope α (e.g., 0.01) for negative inputs
- Prevents dead neurons
- Parametric ReLU (PReLU): Learn α
GELU (Gaussian Error Linear Unit)
$$\text{GELU}(x) = x \cdot \Phi(x)$$
Where Φ is the Gaussian CDF.
- Smooth approximation of ReLU
- Used in transformers (BERT, GPT)
Swish
$$\text{Swish}(x) = x \cdot \sigma(x)$$
- Self-gated
- Works well in deep networks
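The activations above take only a few lines of NumPy; in this sketch the Gaussian CDF for GELU is computed via SciPy's error function, and α = 0.01 for Leaky ReLU is an illustrative default.

```python
import numpy as np
from scipy.special import erf  # Gaussian CDF via the error function

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def tanh(x):       return np.tanh(x)
def relu(x):       return np.maximum(0.0, x)
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def gelu(x):       return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))  # x · Φ(x)
def swish(x):      return x * sigmoid(x)                           # self-gated

x = np.linspace(-3, 3, 7)
for f in (sigmoid, tanh, relu, leaky_relu, gelu, swish):
    print(f.__name__, np.round(f(x), 3))
```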
The XOR Problem
A classic example showing why we need hidden layers.
XOR truth table:
| x₁ | x₂ | y |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No single line can separate the classes!
With one hidden layer of just two neurons, a neural network can solve XOR (a short training sketch follows this list):
- The first layer creates two linear separators
- The second layer combines them
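A minimal PyTorch sketch of this construction, assuming a 2-unit tanh hidden layer and illustrative hyperparameters (Adam, lr = 0.1, 2000 steps); with so few units, an unlucky initialization can occasionally stall, so rerun with another seed if needed.

```python
import torch
import torch.nn as nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(2, 2),   # hidden layer: two linear separators
    nn.Tanh(),
    nn.Linear(2, 1),   # output layer: combines them
)
opt = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(2000):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

# If training stalls in a local minimum, try a different seed or wider hidden layer.
print(torch.sigmoid(model(X)).round().squeeze())  # expected: 0., 1., 1., 0.
```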
Universal Approximation Theorem
Statement: A feed-forward network with a single hidden layer of sufficient width and a suitable non-linear activation can approximate any continuous function on a compact domain to arbitrary precision.
Implication: Neural networks are extremely powerful function approximators.
In practice: Deep (many layers) is often better than wide (many neurons per layer):
- More parameter efficient
- Learns hierarchical representations
- Better generalization
Backpropagation
The algorithm that makes training deep networks possible.
The Chain Rule
For composed functions $f = f_L \circ f_{L-1} \circ ... \circ f_1$ (matching the layered form above), the gradient of the loss with respect to layer-$l$ parameters factors as: $$\frac{\partial L}{\partial \theta_l} = \frac{\partial L}{\partial z_L} \cdot \frac{\partial z_L}{\partial z_{L-1}} \cdot ... \cdot \frac{\partial z_{l+1}}{\partial z_l} \cdot \frac{\partial z_l}{\partial \theta_l}$$
Forward Pass
Compute activations layer by layer, storing intermediate values.
Backward Pass
Compute gradients layer by layer, from output to input: $$\frac{\partial L}{\partial z_l} = \frac{\partial L}{\partial z_{l+1}} \cdot \frac{\partial z_{l+1}}{\partial z_l}$$
Automatic Differentiation
Modern frameworks (PyTorch, TensorFlow) build a computational graph and compute gradients automatically.
- Forward mode: Efficient when there are few inputs and many outputs
- Reverse mode (backprop): Efficient when there are many inputs and few outputs (the typical ML setting)
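A small example of reverse-mode autodiff in PyTorch: the forward pass records the computational graph, and a single `backward()` call populates `.grad` for every tensor that requires gradients; the shapes and toy loss here are arbitrary.

```python
import torch

W = torch.randn(3, 2, requires_grad=True)
b = torch.zeros(3, requires_grad=True)
x = torch.randn(2)

z = torch.relu(W @ x + b)   # forward pass builds the computational graph
loss = (z ** 2).sum()       # scalar toy loss

loss.backward()             # reverse-mode pass: one sweep gives all gradients
print(W.grad.shape, b.grad.shape)  # torch.Size([3, 2]) torch.Size([3])
```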
Example: Cross-Entropy Gradient
For softmax + cross-entropy: $$\frac{\partial L}{\partial a_c} = p_c - y_c$$
Beautifully simple: just the prediction error!
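This identity is easy to check numerically. The sketch below compares $p - y$ against PyTorch autograd on random logits (the 5 classes and the target class are arbitrary choices).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)               # a_c: raw scores for 5 classes
y = F.one_hot(torch.tensor(2), num_classes=5).float()     # true class = 2

loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([2]))
loss.backward()

p = F.softmax(logits.detach(), dim=0)
print(torch.allclose(logits.grad, p - y, atol=1e-6))       # True: ∂L/∂a = p − y
```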
Example: ReLU Gradient
$$\frac{\partial}{\partial x}\text{ReLU}(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$$
The gradient flows unchanged through positive regions and is blocked in negative regions.
Training Challenges
Vanishing Gradients
Problem: Gradients shrink exponentially in deep networks.
Cause: Chained derivatives of saturating activations (sigmoid, tanh).
Solutions:
- Use non-saturating activations (ReLU and variants)
- Residual connections
- Careful initialization
- Batch/layer normalization
Exploding Gradients
Problem: Gradients grow exponentially.
Solutions:
- Gradient clipping: $g \leftarrow \min\left(1, \frac{\tau}{\|g\|}\right) g$ (see the sketch after this list)
- Careful initialization
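Clipping by global norm is built into PyTorch; the sketch below uses an illustrative threshold of τ = 1.0 on a toy model.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()

# Rescale all gradients so their global norm is at most tau (tau = 1.0, illustrative).
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(total_norm)  # norm before clipping
```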
Mathematical Perspective
Gradient through L layers, with $J_l = \frac{\partial z_{l+1}}{\partial z_l}$ and $g_L = \frac{\partial L}{\partial z_L}$: $$\frac{\partial L}{\partial z_1} = \prod_{l=1}^{L-1} J_l \cdot g_L$$
If eigenvalues of Jacobians are:
- < 1: Gradients vanish
- > 1: Gradients explode
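A quick NumPy illustration: pushing a vector through 50 layers whose Jacobians have eigenvalues 0.9 or 1.1 (deliberately simple diagonal Jacobians) shrinks or inflates its norm exponentially.

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=64)          # gradient arriving at the top layer

for scale in (0.9, 1.1):
    J = scale * np.eye(64)       # Jacobian with all eigenvalues equal to `scale`
    v = g.copy()
    for _ in range(50):          # backpropagate through 50 identical layers
        v = J @ v
    print(scale, np.linalg.norm(v) / np.linalg.norm(g))  # ≈ 0.9**50 vs 1.1**50
```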
Residual Connections
Key innovation (ResNet): Add skip connections.
$$z_{l+1} = z_l + f_l(z_l)$$
Benefits:
- Gradients flow directly through skip connection
- Learn small perturbations instead of full transformations
- Enables training very deep networks (100+ layers)
Gradient flow: $$\frac{\partial L}{\partial z_l} = \frac{\partial L}{\partial z_L}\left(1 + \frac{\partial}{\partial z_l}\sum_{i=l}^{L-1} f_i(z_i)\right)$$
The "1" term ensures gradients always flow, even if the other term vanishes.
Initialization
Poor initialization can prevent learning entirely.
The Problem
If weights are too large or too small:
- Activations explode or vanish
- Gradients explode or vanish
Xavier/Glorot Initialization
For linear or tanh activations: $$w \sim \mathcal{N}\left(0, \frac{2}{n_{in} + n_{out}}\right)$$
Maintains variance of activations and gradients across layers.
He Initialization
For ReLU activations: $$w \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)$$
Accounts for ReLU killing half the activations.
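Both schemes are available in `torch.nn.init`; the sketch below applies He initialization to the hidden ReLU layers and Xavier to the output layer of a toy model.

```python
import torch.nn as nn
import torch.nn.init as init

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

for m in model:
    if isinstance(m, nn.Linear):
        init.kaiming_normal_(m.weight, nonlinearity='relu')  # He: Var = 2 / n_in
        init.zeros_(m.bias)

# The output layer has no ReLU after it, so re-initialize it with Xavier.
init.xavier_normal_(model[-1].weight)                         # Var = 2 / (n_in + n_out)
```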
Regularization
Early Stopping
Stop training when validation error starts increasing.
- Implicit regularization
- Prevents overfitting
Weight Decay (L2)
Add penalty on weight magnitudes: $$L = L_{data} + \lambda \sum_l \|W_l\|^2$$
Equivalent to Gaussian prior on weights (MAP estimation).
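In practice the penalty is usually handled by the optimizer: PyTorch's `weight_decay` argument adds an L2 term to the gradients (the value 1e-4 is illustrative; `torch.optim.AdamW` applies decoupled weight decay instead).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
# L2 / weight decay handled by the optimizer (λ = 1e-4 is illustrative).
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
```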
Dropout
Randomly "drop" neurons during training with probability p.
$$\tilde{h}_i = \begin{cases} 0 & \text{with probability } p \\ h_i / (1-p) & \text{otherwise} \end{cases}$$
Interpretation:
- Prevents co-adaptation of neurons
- Approximate ensemble of subnetworks
- At test time: use full network (or Monte Carlo dropout for uncertainty)
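A sketch of the inverted-dropout rule above next to PyTorch's built-in `nn.Dropout`; the drop probability p = 0.5 is illustrative.

```python
import torch
import torch.nn as nn

def inverted_dropout(h, p=0.5, training=True):
    """Zero each unit with probability p and rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return h
    mask = (torch.rand_like(h) >= p).float()
    return mask * h / (1.0 - p)

h = torch.ones(10)
print(inverted_dropout(h, p=0.5))

drop = nn.Dropout(p=0.5)   # built-in equivalent; disabled by model.eval()
print(drop(h))
```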
Data Augmentation
Create modified versions of training data:
- Images: rotations, flips, crops, color jitter
- Text: synonym replacement, back-translation
Layer Normalization
Normalize activations to stabilize training:
$$\hat{z} = \frac{z - \mu}{\sqrt{\sigma^2 + \epsilon}}$$ $$\tilde{z} = \gamma \hat{z} + \beta$$
Where μ and σ² are the mean and variance of the activations being normalized, and γ and β are learnable parameters.
Batch Norm: Normalize across the batch dimension
Layer Norm: Normalize across the feature dimension (better for RNNs, Transformers)
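A manual layer norm matching the formulas above, checked against `nn.LayerNorm` (the feature size 16 is arbitrary).

```python
import torch
import torch.nn as nn

def layer_norm(z, gamma, beta, eps=1e-5):
    mu = z.mean(dim=-1, keepdim=True)                  # mean over features
    var = z.var(dim=-1, keepdim=True, unbiased=False)  # variance over features
    return gamma * (z - mu) / torch.sqrt(var + eps) + beta

z = torch.randn(4, 16)
gamma, beta = torch.ones(16), torch.zeros(16)

ref = nn.LayerNorm(16)(z)   # built-in version (γ = 1, β = 0 at initialization)
print(torch.allclose(layer_norm(z, gamma, beta), ref, atol=1e-5))  # True
```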
Summary
| Component | Purpose |
|---|---|
| Layers | Transform representations |
| Activations | Add non-linearity |
| Backprop | Compute gradients efficiently |
| Residual connections | Enable deep networks |
| Normalization | Stabilize training |
| Dropout | Prevent overfitting |
| Initialization | Start training successfully |
Practical Recipe
- Start with standard architecture (ResNet, etc.)
- Use ReLU or GELU activations
- Xavier/He initialization
- Adam optimizer
- Batch/Layer normalization
- Dropout if overfitting
- Data augmentation for images
- Early stopping based on validation loss
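A skeleton training loop assembling these pieces, with placeholder data, architecture, and hyperparameters; it is a sketch of the recipe, not a tuned implementation.

```python
import torch
import torch.nn as nn

# Placeholder data: replace with a real dataset / DataLoader.
X_train, y_train = torch.randn(512, 20), torch.randint(0, 3, (512,))
X_val, y_val = torch.randn(128, 20), torch.randint(0, 3, (128,))

model = nn.Sequential(
    nn.Linear(20, 64), nn.LayerNorm(64), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(64, 64), nn.LayerNorm(64), nn.GELU(), nn.Dropout(0.1),
    nn.Linear(64, 3),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float('inf'), 10, 0
for epoch in range(200):
    model.train()
    opt.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    opt.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0   # in practice, also checkpoint the model here
    else:
        bad_epochs += 1
        if bad_epochs >= patience:           # early stopping on validation loss
            break
```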