Reading time: 20 minutes
Optimizations make the training and inference of Neural Networks run faster on a particular hardware infrastructure. The challenge is to apply them without degrading the network's accuracy.
The main types of Neural Network optimizations are:
- Pruning, which removes weights (connections) from a Neural Network and in turn reduces the number of computations required. Element-wise pruning removes individual weights using:
  - magnitude thresholding
  - sensitivity thresholding
  - target sparsity level
  - activation statistics
- Structured pruning, which removes whole groups of weights at once:
  - 2D (kernel-wise)
  - 3D (filter-wise)
  - 4D (layer-wise)
  - channel-wise structured pruning
  - column-wise structured pruning
  - row-wise structured pruning
  - ranking of structures by activation criteria
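- Pruning control:
  - Soft pruning (weights are masked but can recover) and hard pruning (weights are removed permanently)
  - Dual weight copies (the loss is computed with the masked weights, while updates are applied to the full, unmasked weights)
  - Model thinning to permanently remove pruned neurons and connections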
- Compression scheduling: flexible scheduling of pruning, regularization, and learning rate decay
  - One-shot and iterative pruning
  - Automatic gradual schedule for pruning individual connections and complete structures
- Element-wise and filter-wise pruning sensitivity analysis, which measures how much accuracy drops as each layer is pruned to increasing sparsity levels
- Regularization:
  - L1-norm element-wise regularization
  - Group Lasso or group variance regularization
- Quantization:
  - Post-training quantization
  - Quantization-aware training
- Training with knowledge distillation, in which a smaller student network learns to reproduce the outputs of a larger teacher network, combined with quantization methods
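- Early Exit, in which inputs that an intermediate classifier already handles confidently leave the network without passing through the remaining layers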
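The element-wise pruning criteria listed above can be sketched in a few lines. This is a minimal sketch on a flat list of weights, not any particular library's implementation. The function names are hypothetical. The sensitivity rule used here (threshold = sensitivity × standard deviation of the layer's weights) is one common formulation and is an assumption.

```python
import statistics


def magnitude_prune(weights, threshold):
    """Magnitude thresholding: keep weights with |w| >= threshold, zero the rest."""
    mask = [1 if abs(w) >= threshold else 0 for w in weights]
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


def sensitivity_threshold(weights, sensitivity):
    """Sensitivity thresholding (assumed form): threshold = sensitivity * std(weights)."""
    return sensitivity * statistics.pstdev(weights)


def prune_to_sparsity(weights, sparsity):
    """Target sparsity level: zero the `sparsity` fraction of weights with the smallest |w|."""
    n_prune = round(sparsity * len(weights))
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    mask = [1] * len(weights)
    for i in by_magnitude[:n_prune]:
        mask[i] = 0
    pruned = [w if m else 0.0 for w, m in zip(weights, mask)]
    return pruned, mask


weights = [0.9, -0.05, 0.3, -0.7, 0.01, 0.2, -0.4, 0.08]
_, mask = magnitude_prune(weights, 0.1)
print(mask)  # [1, 0, 1, 1, 0, 1, 1, 0]
_, mask = prune_to_sparsity(weights, 0.5)
print(mask)  # [1, 0, 1, 1, 0, 0, 1, 0]
```

Magnitude and sensitivity thresholding differ only in how the threshold is chosen. A target sparsity level instead fixes how many weights are removed, whatever their values.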
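Structured pruning removes whole filters rather than single weights. The sketch below ranks the filters of one convolution layer by the L1 norm of their weights. This criterion is a common choice and is assumed here; the list above also mentions activation-based criteria. The sketch zeroes the weakest filters and then "thins" the layer by physically dropping them. Each filter is represented as a flat list of its weights, and all names are hypothetical.

```python
def filter_l1_norms(filters):
    """L1 norm of each output filter; filters[f] is the flattened weights of filter f."""
    return [sum(abs(w) for w in f) for f in filters]


def prune_filters(filters, fraction):
    """Zero the `fraction` of filters with the smallest L1 norm (shape is preserved)."""
    norms = filter_l1_norms(filters)
    n_prune = int(fraction * len(filters))
    weakest = set(sorted(range(len(filters)), key=lambda f: norms[f])[:n_prune])
    pruned = [[0.0] * len(f) if i in weakest else list(f) for i, f in enumerate(filters)]
    kept = [i for i in range(len(filters)) if i not in weakest]
    return pruned, kept


def thin(filters, kept):
    """Model thinning: physically drop pruned filters so the layer becomes smaller."""
    return [filters[i] for i in kept]


filters = [
    [0.5, -0.5, 0.1, 0.0],
    [0.01, 0.02, -0.01, 0.0],
    [1.0, 1.0, -1.0, 0.5],
    [0.1, -0.1, 0.05, 0.05],
]
pruned, kept = prune_filters(filters, 0.5)
print(kept)  # [0, 2]
print(len(thin(filters, kept)))  # 2
```

In a real network, thinning also has to remove the matching input channels of the following layer. This sketch ignores that step.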
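Soft pruning and dual weight copies can work together. A dense copy of the weights is kept, and the forward pass (and therefore the loss) uses only the masked weights. The gradient update, however, is applied to the dense copy, so a pruned weight can grow back and be unmasked when the mask is recomputed. The sketch assumes a toy single-output linear model with squared-error loss. It passes the gradient with respect to the masked weights straight to the dense copy, which is an assumption about how the two copies interact. The names are hypothetical.

```python
def magnitude_mask(dense, threshold):
    """Recompute the pruning mask from the dense weights (soft pruning)."""
    return [1 if abs(w) >= threshold else 0 for w in dense]


def forward(dense, mask, x):
    """The forward pass only sees the masked weights."""
    return sum(w * m * xi for w, m, xi in zip(dense, mask, x))


def sgd_step(dense, mask, x, target, lr):
    """One SGD step on 0.5 * (y - target)**2, updating the dense (unmasked) copy."""
    err = forward(dense, mask, x) - target
    # The gradient w.r.t. each masked weight is err * x_i; it is applied to every dense
    # weight, including currently pruned ones, so they keep training in the background.
    return [w - lr * err * xi for w, xi in zip(dense, x)]


dense = [0.5, 0.05]
mask = magnitude_mask(dense, 0.1)  # [1, 0]: the second weight is pruned
dense = sgd_step(dense, mask, x=[1.0, 1.0], target=1.0, lr=0.5)
print(dense)  # [0.75, 0.3]
print(magnitude_mask(dense, 0.1))  # [1, 1]: the pruned weight came back
```

Hard pruning would instead freeze the mask and permanently zero the corresponding dense weights, so they could never recover.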
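The automatic gradual schedule listed above is commonly implemented as the cubic "automated gradual pruning" (AGP) rule of Zhu and Gupta, which is assumed here. The target sparsity at step t is s_f + (s_i - s_f)(1 - (t - t_0)/(t_end - t_0))^3, where s_i is the initial sparsity and s_f the final sparsity. Under this rule, sparsity rises quickly at first and levels off near the final target, which gives the network time to recover. The function name and its step-based parameters are hypothetical.

```python
def agp_sparsity(step, start_step, end_step, initial_sparsity, final_sparsity):
    """Target sparsity at `step` under a cubic gradual-pruning schedule."""
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)  # goes from 0 to 1
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


for step in (0, 25, 50, 75, 100):
    print(step, round(agp_sparsity(step, 0, 100, 0.0, 0.9), 4))
```

With a final sparsity of 0.9, the schedule already asks for 0.7875 at the halfway point (87.5% of the final target). The remaining pruning is spread over the second half of training.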
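The regularizers listed above add a penalty to the training loss. L1 regularization sums absolute weight values and pushes individual weights toward zero. Group Lasso sums the L2 norms of weight groups (for example, whole filters) and pushes entire groups toward zero together, which complements structured pruning. This is a minimal sketch with hypothetical names, where `strength` is the regularization coefficient. Some formulations also scale each group term by the square root of the group size; that variant is omitted here.

```python
import math


def l1_penalty(weights, strength):
    """L1-norm element-wise regularization: strength * sum(|w|)."""
    return strength * sum(abs(w) for w in weights)


def group_lasso_penalty(groups, strength):
    """Group Lasso: strength * sum over groups of each group's L2 norm."""
    return strength * sum(math.sqrt(sum(w * w for w in g)) for g in groups)


def regularized_loss(data_loss, filters, l1_strength, group_strength):
    """Total loss = data loss + both penalties, treating each filter as one group."""
    flat = [w for f in filters for w in f]
    return (data_loss
            + l1_penalty(flat, l1_strength)
            + group_lasso_penalty(filters, group_strength))


filters = [[3.0, 4.0], [0.0, 0.0]]
print(group_lasso_penalty(filters, 1.0))  # 5.0
print(regularized_loss(1.0, filters, 0.1, 1.0))
```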
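Post-training quantization maps trained floating-point values to low-bit integers without retraining. The sketch below uses symmetric linear quantization with a scale derived from the largest absolute value observed, a simple max-based calibration that is assumed here. Other schemes exist, such as asymmetric quantization with a zero point. Quantization-aware training typically simulates the same quantize/dequantize round trip during training so that the network learns to tolerate it. The names are hypothetical.

```python
def symmetric_scale(values, num_bits=8):
    """Scale so the largest |value| maps to the largest signed integer (127 for 8 bits)."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer step and clamp to the signed range [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in q_values]


weights = [0.5, -1.27, 0.0, 1.0, 0.333]
scale = symmetric_scale(weights)
q = quantize(weights, scale)
restored = dequantize(q, scale)
print(q)  # 8-bit integers in [-127, 127]
print(max(abs(a - b) for a, b in zip(weights, restored)))  # at most scale / 2
```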
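Knowledge distillation trains a small student network to match the softened output distribution of a larger teacher. A common loss, assumed here, has two parts. The first is the KL divergence between the teacher and student softmax outputs at temperature T, scaled by T². The second is the ordinary cross-entropy on the true label, and `alpha` balances the two. `temperature` and `alpha` are illustrative hyperparameters, and the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by `temperature`."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * cross-entropy(label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps)) for pt, ps in zip(p_teacher, p_student))
    hard_ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard_ce


student = [2.0, 1.0, 0.1]
teacher = [3.0, 0.5, -1.0]
print(distillation_loss(student, teacher, label=0))
```

When the student already matches the teacher, the KL term vanishes and only the weighted cross-entropy on the true label remains.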
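Early Exit attaches intermediate classifiers to a network so that inputs the model is already confident about can stop early and skip the remaining layers. This is a minimal sketch that rests on two assumptions. First, each stage returns its hidden state plus the logits of its exit classifier. Second, confidence is measured as the maximum softmax probability and compared against a fixed threshold. The toy stages are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(x, stages, threshold):
    """Run stages in order; return (class, exit_index) at the first confident exit."""
    hidden = x
    for i, stage in enumerate(stages):
        hidden, logits = stage(hidden)
        probs = softmax(logits)
        confidence = max(probs)
        # The last exit always answers, even when it is not confident.
        if confidence >= threshold or i == len(stages) - 1:
            return probs.index(confidence), i
    raise ValueError("stages must not be empty")


def stage_one(h):
    # Hypothetical cheap stage: its exit head favours class 0 when h is large.
    return h, [h, 0.0]


def stage_two(h):
    # Hypothetical deeper stage with its own exit head.
    return h * 10, [0.0, h * 10]


stages = [stage_one, stage_two]
print(early_exit_predict(5.0, stages, threshold=0.9))  # (0, 0): exits after stage one
print(early_exit_predict(0.1, stages, threshold=0.9))  # (1, 1): needs both stages
```

Raising the threshold sends more inputs through the deeper stages, trading speed for accuracy.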