Convolutional Neural Networks (CNNs)
MLPs not effective for images
- Different sized inputs
- Translational invariance difficult to achieve
- Weight matrix prohibitive in size
Convolutional Neural Networks
- Replace matrix multiplication with convolution operator
- Divide image into overlapping 2d patches
- Perform template matching based on filters with learned parameters
- Number of parameters significantly reduced
- Translation invariance easy to achieve
Convolution Operators
- Convolution between two functions
- $[f \star g](z) = \int f(u)\, g(z-u)\, du$
- Similar to cross-correlation operator
- $[w \star x]_i = \sum_{u=0}^{L-1} w_u x_{i+u}$
- Convolution in 2D
- $[W \star X]_{i,j} = \sum_{u=0}^{H-1}\sum_{v=0}^{W-1} w_{u,v}\, x_{i+u,\,j+v}$
- 2D convolution amounts to template matching / feature detection
- The output is called a feature map
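A minimal NumPy sketch of the 2D operation above (the cross-correlation that deep-learning libraries call "convolution"); the function name `conv2d_valid` is my own and the loops are written for clarity, not speed.

```python
import numpy as np

def conv2d_valid(x, w):
    """Valid 2D cross-correlation (what DL frameworks call convolution).
    x: (x_h, x_w) image, w: (f_h, f_w) filter.
    Output: (x_h - f_h + 1, x_w - f_w + 1) feature map."""
    x_h, x_w = x.shape
    f_h, f_w = w.shape
    out = np.zeros((x_h - f_h + 1, x_w - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * x[i:i + f_h, j:j + f_w])  # template match at (i, j)
    return out

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.array([[1., 0.], [0., -1.]])   # toy "template"
print(conv2d_valid(x, w).shape)       # (4, 4)
```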
- Convolution is matrix multiplication
- The corresponding weight matrix is Toeplitz like
- $y = Cx$
- $C = \begin{bmatrix} w_1 & w_2 & 0 & w_3 & w_4 & 0 & 0 & 0 & 0 \\ 0 & w_1 & w_2 & 0 & w_3 & w_4 & 0 & 0 & 0 \\ \vdots \end{bmatrix}$
- Compared to a typical MLP weight matrix, $C$ is sparse and its nonzero weights are shared (tied) across rows
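To make the "convolution is matrix multiplication" point concrete, here is a sketch that builds the Toeplitz-like matrix $C$ for a 2x2 filter on a 3x3 image; `conv_as_matrix` is an illustrative name of my own.

```python
import numpy as np

def conv_as_matrix(w, x_h, x_w):
    """Build the sparse Toeplitz-like matrix C so that C @ x.ravel()
    equals the valid 2D cross-correlation of x with the filter w."""
    f_h, f_w = w.shape
    o_h, o_w = x_h - f_h + 1, x_w - f_w + 1
    C = np.zeros((o_h * o_w, x_h * x_w))
    for i in range(o_h):
        for j in range(o_w):
            for u in range(f_h):
                for v in range(f_w):
                    C[i * o_w + j, (i + u) * x_w + (j + v)] = w[u, v]
    return C

x = np.arange(9, dtype=float).reshape(3, 3)
w = np.array([[1., 2.], [3., 4.]])
C = conv_as_matrix(w, 3, 3)
y = (C @ x.ravel()).reshape(2, 2)   # same result as the sliding-window loop
print(C)                            # mostly zeros, filter weights repeated per row
```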
- Valid Convolution
- Filter Size: $(f_h, f_w)$
- Image Size: $(x_h, x_w)$
- Output Size: $(x_h - f_h + 1, x_w - f_w + 1)$
- Padding
- Filter Size: $(f_h, f_w)$
- Image Size: $(x_h, x_w)$
- Padding Size: $(p_h, p_w)$
- Output Size: $(x_h + 2p_h - f_h + 1, x_w + 2p_w - f_w + 1)$
- If $2p_h = f_h - 1$ and $2p_w = f_w - 1$ ("same" padding), the output size equals the input size
- Strided Convolution
- Skip every sth input to reduce redundancy
- Filter Size: $(f_h, f_w)$
- Image Size: $(x_h, x_w)$
- Padding Size: $(p_h, p_w)$
- Stride Size: $(s_h, s_w)$
- Output Size: $\left( \left\lfloor {x_h + 2p_h - f_h + s_h \over s_h} \right\rfloor, \left\lfloor {x_w + 2p_w - f_w + s_w \over s_w} \right\rfloor \right)$
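The three output-size formulas above collapse into one expression; a small helper (name `conv_output_size` is my own) to sanity-check them per spatial dimension:

```python
import math

def conv_output_size(x, f, p=0, s=1):
    """floor((x + 2p - f + s) / s), which equals floor((x + 2p - f) / s) + 1."""
    return math.floor((x + 2 * p - f + s) / s)

print(conv_output_size(32, 3))            # valid convolution: 30
print(conv_output_size(32, 3, p=1))       # 2p = f - 1, "same" size: 32
print(conv_output_size(32, 3, p=1, s=2))  # strided: 16
```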
- Multiple channels
- Input images have 3 channels
- Define a kernel for each input channel
- Weight is a 3D tensor
- $z_{i,j} = \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} \sum_{c=0}^{C-1} x_{si+u,\, sj+v,\, c}\, w_{u,v,c}$
- In order to detect multiple features, extend the dimension of weight matrix
- Weight is a 4D tensor
- $z_{i,j,d} = \sum_{u=0}^{H-1} \sum_{v=0}^{W-1} \sum_{c=0}^{C-1} x_{si+u,\, sj+v,\, c}\, w_{u,v,c,d}$
- Output is a hyper column formed by concatenation of feature maps
- Special Case: (1x1) point wise convolution
- Filter is of size 1x1.
- Only the number of channels change from input to output
- $z_{i,j,d} = \sum_{c=0}^{C-1} x_{i,j,c}\, w_{0,0,c,d}$
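A sketch of the multi-channel and pointwise cases with `np.einsum`, matching the index sums above; the (H, W, C) / (f_h, f_w, C, D) shape convention and the function name are my own assumptions.

```python
import numpy as np

def conv2d_multichannel(x, w):
    """x: (H, W, C) input, w: (f_h, f_w, C, D) weight tensor -> (o_h, o_w, D) output."""
    f_h, f_w, C, D = w.shape
    o_h, o_w = x.shape[0] - f_h + 1, x.shape[1] - f_w + 1
    z = np.zeros((o_h, o_w, D))
    for i in range(o_h):
        for j in range(o_w):
            # sum over u, v, c for every output channel d
            z[i, j] = np.einsum('uvc,uvcd->d', x[i:i + f_h, j:j + f_w], w)
    return z

x = np.random.randn(8, 8, 3)
print(conv2d_multichannel(x, np.random.randn(3, 3, 3, 16)).shape)  # (6, 6, 16)

# 1x1 (pointwise) convolution: only the channel dimension is mixed,
# z[i, j, d] = sum_c x[i, j, c] * w[0, 0, c, d]
w_point = np.random.randn(1, 1, 3, 16)
print(np.einsum('ijc,cd->ijd', x, w_point[0, 0]).shape)            # (8, 8, 16)
```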
- Pooling Layers
- Convolution preserves information about the location of input features, i.e. it is translation-equivariant
- To achieve translation invariance, use a pooling operation
- Max Pooling
- Maximum over incoming values
- Average Pooling
- Average over incoming values
- Global Average Pooling
- Convert the (H,W,D) feature maps into (1,1,D) output layer
- Usually used to compute features before passing them to a fully connected layer
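A sketch of max pooling and global average pooling on (H, W, D) feature maps; non-overlapping windows are assumed for simplicity, and the function names are mine.

```python
import numpy as np

def max_pool2d(x, k=2):
    """Non-overlapping max pooling over k x k windows; x has shape (H, W, D)."""
    H, W, D = x.shape
    x = x[:H - H % k, :W - W % k]                        # drop any ragged border
    return x.reshape(H // k, k, W // k, k, D).max(axis=(1, 3))

def global_avg_pool(x):
    """(H, W, D) feature maps -> (1, 1, D), e.g. before a fully connected head."""
    return x.mean(axis=(0, 1), keepdims=True)

x = np.random.randn(8, 8, 16)
print(max_pool2d(x).shape)       # (4, 4, 16)
print(global_avg_pool(x).shape)  # (1, 1, 16)
```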
- Dilated Convolution
- Convolution with holes
- Takes every rth input (r is the dilation rate)
- Equivalent to a filter with zeros inserted between its taps
- Increases the receptive field
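One way to see dilation is as zero-insertion into the filter; a minimal sketch under that view (helper name is my own):

```python
import numpy as np

def dilate_filter(w, r):
    """Insert r - 1 zeros between the taps of a 2D filter; r = 1 is a no-op."""
    f_h, f_w = w.shape
    w_d = np.zeros((r * (f_h - 1) + 1, r * (f_w - 1) + 1))
    w_d[::r, ::r] = w
    return w_d

w = np.ones((3, 3))
print(dilate_filter(w, 2).shape)  # (5, 5): same 9 weights, larger receptive field
```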
- Transposed Convolution
- Produces a larger output from a smaller input
- Pad the input with zeros and then run the filter
- Depthwise Convolution
- Each input channel is convolved with its own 2D filter, with no summation across channels
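A minimal depthwise-convolution sketch under the same (H, W, C) convention as above; following it with the 1x1 pointwise convolution from earlier gives a depthwise-separable convolution.

```python
import numpy as np

def depthwise_conv2d(x, w):
    """Each input channel is filtered independently, no sum across channels.
    x: (H, W, C), w: (f_h, f_w, C) -> (o_h, o_w, C)."""
    f_h, f_w, C = w.shape
    o_h, o_w = x.shape[0] - f_h + 1, x.shape[1] - f_w + 1
    z = np.zeros((o_h, o_w, C))
    for i in range(o_h):
        for j in range(o_w):
            z[i, j] = np.einsum('uvc,uvc->c', x[i:i + f_h, j:j + f_w], w)
    return z

print(depthwise_conv2d(np.random.randn(8, 8, 3), np.random.randn(3, 3, 3)).shape)  # (6, 6, 3)
```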
Normalization
- Vanishing / Exploding gradient issues in deeper models
- Add extra layers to standardize the statistics of hidden units
- Batch Normalization
- Zero mean and unit variance across the samples in a minibatch
- $\hat z_n = {z_n - \mu_b \over \sqrt{\sigma^2_b +\epsilon}}$
- $\tilde z_n = \gamma \hat z_n + \beta$
- $\gamma, \beta$ are learnable parameters
- When applied to the input layer, BN is close to the usual standardization process
- For other layers, as model trains, the mean and variance change
- Internal Covariate Shift
- At test time, inference may run on streaming data, i.e. one example at a time
- Solution: after training, re-compute the mean and variance across the entire training set and then freeze these statistics
- Sometimes, after recomputing, the BN parameters are fused into the weights of the preceding layer, resulting in a fused BN layer
- BN struggles when batch size is small
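A training-mode batch-norm sketch for (N, D) activations, directly implementing the two equations above; `batch_norm` is an illustrative name, and test time would instead use the frozen statistics.

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    """Standardize each unit across the minibatch, then rescale and shift."""
    mu = z.mean(axis=0)                    # minibatch mean per unit
    var = z.var(axis=0)                    # minibatch variance per unit
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta            # gamma, beta are learnable

z = np.random.randn(32, 64) * 3.0 + 5.0
out = batch_norm(z, gamma=np.ones(64), beta=np.zeros(64))
print(out.mean(axis=0)[:3].round(3), out.std(axis=0)[:3].round(3))  # ~0 and ~1 per unit
```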
- Layer Normalization
- Pool over channel, height and width
- Match on batch index
- Instance Normalization
- Pool over height and width
- Match on batch and channel indices
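The three schemes differ only in which axes the statistics are pooled over; a sketch on NHWC activations (the axis choices are the point, the rest is illustrative):

```python
import numpy as np

x = np.random.randn(32, 14, 14, 64)   # (batch N, height H, width W, channels C)

# Batch norm:    pool over N, H, W -> one (mean, var) per channel
bn_mu = x.mean(axis=(0, 1, 2), keepdims=True)   # shape (1, 1, 1, 64)

# Layer norm:    pool over C, H, W -> one (mean, var) per example
ln_mu = x.mean(axis=(1, 2, 3), keepdims=True)   # shape (32, 1, 1, 1)

# Instance norm: pool over H, W    -> one (mean, var) per (example, channel)
in_mu = x.mean(axis=(1, 2), keepdims=True)      # shape (32, 1, 1, 64)
```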
- Normalizer-Free Networks
- Adaptive gradient clipping
Common Architectures
- ResNet
- Uses residual blocks to learn small perturbations of their inputs
- Residual Block: conv:BN:ReLU:conv:BN
- Use padding and 1x1 convolutions to ensure shapes match so the additive skip connection is valid
- DenseNet
- Concatenate (rather than add) the output with the input
- $x \rightarrow [x, f_1(x), f_2(x, f_1(x)), f_3(x, f_1(x), f_2(x, f_1(x)))]$
- Computationally expensive
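To contrast the two wirings, a toy sketch where `block` is a stand-in (a random ReLU'd linear map, not a real conv:BN:ReLU branch): ResNet adds the block output to its input, DenseNet concatenates it.

```python
import numpy as np

rng = np.random.default_rng(0)

def block(x, out_dim):
    """Stand-in for a conv/BN/ReLU block: a random linear map, for wiring only."""
    return np.maximum(x @ rng.normal(size=(x.shape[-1], out_dim)) * 0.1, 0.0)

x = rng.normal(size=(4, 64))

# ResNet: output is *added* to the input (shapes must match),
# so each block only needs to learn a small perturbation of x.
res = x + block(x, 64)

# DenseNet: output is *concatenated* with all previous features,
# so the feature dimension grows with depth (64 -> 96 -> 128 here).
h = np.concatenate([x, block(x, 32)], axis=-1)
h = np.concatenate([h, block(h, 32)], axis=-1)
print(res.shape, h.shape)  # (4, 64) (4, 128)
```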
- Neural Architecture Search
- EfficientNetV2
- ResNet
Adversarial Examples
- White-Box Attacks
- Gradient-based: the attacker has access to the model's parameters and gradients
- Add a small perturbation to the input that changes the classifier's prediction
- Targeted attacks push the prediction toward a chosen target class
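An FGSM-style white-box sketch on a toy linear classifier, where the input gradient is available in closed form; the model, epsilon, and names here are illustrative assumptions, not a specific published attack setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy classifier p(y=1|x) = sigmoid(w.x + b); the attacker has white-box
# access, i.e. it can compute gradients of the loss w.r.t. the input x.
w, b = rng.normal(size=20), 0.0
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x, y = rng.normal(size=20), 1.0
# Gradient of the cross-entropy loss w.r.t. the input: (p - y) * w
grad_x = (sigmoid(w @ x + b) - y) * w

# Small perturbation in the direction of the gradient's sign
eps = 0.25
x_adv = x + eps * np.sign(grad_x)

print(sigmoid(w @ x + b), sigmoid(w @ x_adv + b))  # confidence in the true class drops
```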
- Black-Box Attack
- Gradient Free
- Design fooling images as opposed to adversarial images