In this article at OpenGenus.org, we have explored the concept of Capacity Mismatch in Deep Learning and Knowledge Distillation and discussed some solutions to it.
Capacity Mismatch is one of the main issues in Knowledge Distillation, and it highlights the fact that current Deep Learning models are not universal: no single architecture suits tasks of every complexity.
Table of contents:
- Capacity Mismatch in Deep Learning
- Capacity Mismatch in Knowledge Distillation
Capacity Mismatch in Deep Learning
Capacity Mismatch in Deep Learning is the problem that arises when the complexity (number of parameters) of a Deep Learning model is not suited to the complexity of the task at hand. This means the model is either too simple or too complex for the problem, and either case results in low accuracy.
Following are the two cases:
- Model is of undercapacity (too simple)
- As the model is too simple, it cannot learn all the features or patterns in the data. This results in high bias, as the simple model tends to learn and focus only on the features it finds easier to learn.
- Model is of overcapacity (too complex)
- The direct result of using an overly complex model is overfitting, which leads to high variance and poor generalization. With early stopping during training, accuracy is often preserved, but the excess parameters still make inference slower, so model throughput tends to be low.
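The two failure modes above can be illustrated with a small polynomial-fitting sketch (an illustrative analogy, not from the article): a degree-1 model is undercapacity for quadratic data, while a degree-15 model is overcapacity and starts fitting the noise.

```python
import numpy as np

# Toy capacity-mismatch demo: fit polynomials of different degrees
# to noisy quadratic data. Degrees and noise level are illustrative.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 2 * x**2 + 0.1 * rng.standard_normal(x.shape)  # true relation is quadratic

def train_mse(degree):
    """Least-squares polynomial fit; returns mean squared training error."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    return float(np.mean((pred - y) ** 2))

underfit = train_mse(1)    # undercapacity: a line cannot capture the curve
matched  = train_mse(2)    # capacity matches the task
overfit  = train_mse(15)   # overcapacity: driven down by fitting the noise
```

The undercapacity model shows a large training error (high bias), while the overcapacity model drives training error below even the matched model, a classic sign that it is memorizing noise rather than generalizing.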
The solutions to address this are:
- Changing the model architecture, or increasing/decreasing the number of layers
- Applying regularization techniques such as dropout, weight decay, and early stopping
- Using cross validation and analyzing the learning curve closely
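As an illustration of one of these techniques, here is a minimal sketch of patience-based early stopping. The per-epoch validation losses below are simulated values; in a real training loop they would come from evaluating the model on a held-out set after each epoch.

```python
def train_with_early_stopping(val_losses, patience=3):
    """Stop when validation loss has not improved for `patience` epochs.

    `val_losses` is an iterable of per-epoch validation losses
    (a stand-in for real evaluation results).
    Returns (best_loss, epochs_run).
    """
    best = float("inf")
    bad_epochs = 0
    epochs_run = 0
    for loss in val_losses:
        epochs_run += 1
        if loss < best:
            best = loss
            bad_epochs = 0      # improvement: reset the patience counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break           # model is starting to overfit; stop here
    return best, epochs_run

# Simulated curve: improves for 5 epochs, then overfits.
losses = [1.0, 0.8, 0.6, 0.5, 0.45, 0.47, 0.49, 0.52, 0.6, 0.7]
best, ran = train_with_early_stopping(losses, patience=3)
```

With patience 3, training stops at epoch 8 and keeps the best validation loss of 0.45 from epoch 5, instead of running all 10 epochs into the overfitting region.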
Capacity Mismatch in Knowledge Distillation
In the case of Knowledge Distillation, capacity mismatch refers to the problem where the teacher model is significantly more complex than the student model. Due to the large gap in model capacity, some classes become undistillable classes that the student model cannot learn, so the knowledge transferred during distillation remains limited.
The mismatch can also arise when the student model is more complex than the teacher model, but in practice the student's capacity should be kept smaller than the teacher's, since the goal of knowledge distillation is model compression.
Capacity mismatch can arise because:
- The student model is too simple
- The architectures of the student and teacher models are mismatched
- The loss functions of the student and teacher models differ
The solutions are to ensure that:
- The student and teacher model architectures are compatible
- Adaptive loss functions are used
- Regularization techniques are applied to prevent overfitting of the student model
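To make the loss-function point concrete, below is a NumPy sketch of the classic temperature-scaled distillation loss (a softened KL term plus a hard-label cross-entropy term). The logits, temperature `T`, and weight `alpha` are illustrative values chosen for this example, not from the article.

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T gives softer probabilities."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """alpha * hard-label CE  +  (1 - alpha) * T^2 * KL(teacher || student)."""
    p_t = softmax(teacher_logits, T)        # softened teacher targets
    p_s = softmax(student_logits, T)        # softened student predictions
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1).mean()
    hard = softmax(student_logits)          # T = 1 for the hard-label loss
    ce = -np.log(hard[np.arange(len(labels)), labels]).mean()
    return alpha * ce + (1 - alpha) * (T ** 2) * kl

# Illustrative logits for a 3-class problem, batch of 2.
teacher = np.array([[5.0, 1.0, 0.5], [0.2, 4.0, 0.1]])
student = np.array([[2.0, 1.5, 0.5], [0.3, 2.0, 0.2]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
```

When teacher and student capacities are badly mismatched, the KL term stays large for the undistillable classes; adaptive schemes adjust `T`, `alpha`, or the targets per class to keep this term informative for the student.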
With this article at OpenGenus.org, you should now have a strong understanding of Capacity Mismatch in Deep Learning and Knowledge Distillation.