> [Reducing the Teacher-Student Gap via Spherical Knowledge Distillation](https://arxiv.org/abs/2010.07485)
>
> Jia Guo, Minghao Chen, Yao Hu, Chen Zhu, Xiaofei He, Deng Cai
>
> 2022, under review

Because the student has limited capacity, its performance can unexpectedly drop when it is distilled from an oversized teacher. Spherical Knowledge Distillation (SKD) explicitly eliminates the confidence gap between teacher and student, easing this capacity-gap problem. SKD achieves a significant improvement over the previous SOTA when distilling ResNet18 on ImageNet1k.
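
The following is a minimal PyTorch-style sketch of the idea, not the implementation used in this repo (which is configured through `DistillationSKDLoss` below). It assumes that teacher and student logits are zero-centered and rescaled to a shared L2 norm before the usual temperature-softened KL term; the helper names and the `scale` and `temperature` values are illustrative assumptions.

```python
# Hedged sketch: project teacher and student logits onto a common sphere so the
# teacher's larger logit magnitude (its higher confidence) no longer has to be
# matched, then apply a standard temperature-softened KL distillation term.
import torch
import torch.nn.functional as F


def spherical_logits(logits: torch.Tensor, scale: float) -> torch.Tensor:
    """Zero-center each logit vector and rescale it to a fixed L2 norm."""
    centered = logits - logits.mean(dim=1, keepdim=True)
    return scale * centered / centered.norm(dim=1, keepdim=True).clamp_min(1e-6)


def skd_kl_loss(student_logits: torch.Tensor,
                teacher_logits: torch.Tensor,
                temperature: float = 4.0,
                scale: float = 10.0) -> torch.Tensor:
    """KL divergence between softened distributions of sphere-projected logits."""
    s = spherical_logits(student_logits, scale) / temperature
    t = spherical_logits(teacher_logits, scale) / temperature
    return F.kl_div(F.log_softmax(s, dim=1), F.softmax(t, dim=1),
                    reduction="batchmean") * temperature ** 2


if __name__ == "__main__":
    student = torch.randn(8, 1000)        # small-magnitude student logits
    teacher = 5.0 * torch.randn(8, 1000)  # larger-magnitude teacher logits
    print(skd_kl_loss(student, teacher))  # confidence gap is removed before the KL term
```

Removing the norm (confidence) mismatch lets a low-capacity student match the teacher's relative class preferences without having to reproduce its absolute confidence.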

The SKD configuration is shown below. In the `Arch` field, you need to define both the student model and the teacher model. The teacher model's parameters are frozen, and its pretrained weights are loaded. In the `Loss` field, you need to define `DistillationSKDLoss` (the SKD loss between student and teacher). Note that the SKD loss already contains the KL divergence loss with the teacher and the CE loss with the ground-truth labels, so `DistillationGTCELoss` does not need to be defined separately.
```yaml
# model architecture
Arch:
  name: "DistillationModel"
  # if not null, its lengths should be same as models
  pretrained_list:
  # if not null, its lengths should be same as models