提交 8ae48bb4 编写于 作者: littletomatodonkey's avatar littletomatodonkey

add 84.0r50_vd doc

上级 0feef337
...@@ -22,7 +22,7 @@ ...@@ -22,7 +22,7 @@
<img src="./docs/images/models/V100_benchmark/v100.fp32.bs1.main_fps_top1_s.jpg" width="700"> <img src="./docs/images/models/V100_benchmark/v100.fp32.bs1.main_fps_top1_s.jpg" width="700">
</div> </div>
上图对比了一些最新的面向服务器端应用场景的模型,在使用V100,FP32和TensorRT,batch size为1时的预测时间及其准确率,图中准确率82.4%的ResNet50_vd_ssld和83.7%的ResNet101_vd_ssld,是采用PaddleClas提供的SSLD知识蒸馏方案训练的模型。图中相同颜色和符号的点代表同一系列不同规模的模型。不同模型的简介、FLOPS、Parameters以及详细的GPU预测时间(包括不同batchsize的T4卡预测速度)请参考文档教程中的[**模型库章节**](https://paddleclas.readthedocs.io/zh_CN/latest/models/models_intro.html) 上图对比了一些最新的面向服务器端应用场景的模型,在使用V100,FP32和TensorRT,batch size为1时的预测时间及其准确率,图中准确率83.0%的ResNet50_vd_ssld_v2和83.7%的ResNet101_vd_ssld,是采用PaddleClas提供的SSLD知识蒸馏方案训练的模型,其中v2表示在训练时添加了AutoAugment数据增广策略。图中相同颜色和符号的点代表同一系列不同规模的模型。不同模型的简介、FLOPS、Parameters以及详细的GPU预测时间(包括不同batchsize的T4卡预测速度)请参考文档教程中的[**模型库章节**](https://paddleclas.readthedocs.io/zh_CN/latest/models/models_intro.html)
<div align="center"> <div align="center">
<img <img
......
...@@ -10,7 +10,7 @@ With enough training data, increasing parameters of the neural network by buildi ...@@ -10,7 +10,7 @@ With enough training data, increasing parameters of the neural network by buildi
Parameter redundancy exists in deep neural networks. There are several methods to compress the model suck as pruning ,quantization, knowledge distillation, etc. Knowledge distillation refers to using the teacher model to guide the student model to learn specific tasks, ensuring that the small model has a relatively large effect improvement with the computation cost unchanged, and even obtains similar accuracy with the large model [1]. Combining some of the existing distillation methods [2,3], PaddleClas provides a simple semi-supervised label knowledge distillation solution (SSLD). Top-1 Accuarcy on ImageNet1k dataset has an improvement of more than 3% based on ResNet_vd and MobileNet series, which can be shown as below. Parameter redundancy exists in deep neural networks. There are several methods to compress the model suck as pruning ,quantization, knowledge distillation, etc. Knowledge distillation refers to using the teacher model to guide the student model to learn specific tasks, ensuring that the small model has a relatively large effect improvement with the computation cost unchanged, and even obtains similar accuracy with the large model [1]. Combining some of the existing distillation methods [2,3], PaddleClas provides a simple semi-supervised label knowledge distillation solution (SSLD). Top-1 Accuarcy on ImageNet1k dataset has an improvement of more than 3% based on ResNet_vd and MobileNet series, which can be shown as below.
![](../../../images/distillation/distillation_perform.png) ![](../../../images/distillation/distillation_perform_s.jpg)
# 2. SSLD # 2. SSLD
...@@ -97,6 +97,12 @@ Finetuning is carried out on ImageNet1k dataset to restore distribution between ...@@ -97,6 +97,12 @@ Finetuning is carried out on ImageNet1k dataset to restore distribution between
| ResNet50_vd | 60 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 82.39% | | ResNet50_vd | 60 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 82.39% |
| ResNet101_vd | 30 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 83.73% | | ResNet101_vd | 30 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 83.73% |
## 3.4 Data agmentation and Fix strategy
* Based on experiments mentioned above, we add AutoAugment [4] during training process, and reduced l2_decay from 4e-5 t 2e-5. Finally, the Top-1 accuracy on ImageNet1k dataset can reach 82.99%, with 0.6% improvement compared to the standard SSLD distillation strategy.
* For image classsification tasks, The model accuracy can be further improved when the test scale is 1.15 times that of training[5]. For the 82.99% ResNet50_vd pretrained model, it comes to 83.7% using 320x320 for the evaluation. We use Fix strategy to finetune the model with the training scale set as 320x320. During the process, the pre-preocessing pipeline is same for both training and test. All the weights except the fully connected layer are freezed. Finally the top-1 accuracy comes to **84.0%**.
# 4. Application of the distillation model # 4. Application of the distillation model
...@@ -225,3 +231,7 @@ python -m paddle.distributed.launch \ ...@@ -225,3 +231,7 @@ python -m paddle.distributed.launch \
[2] Bagherinezhad H, Horton M, Rastegari M, et al. Label refinery: Improving imagenet classification through label progression[J]. arXiv preprint arXiv:1805.02641, 2018. [2] Bagherinezhad H, Horton M, Rastegari M, et al. Label refinery: Improving imagenet classification through label progression[J]. arXiv preprint arXiv:1805.02641, 2018.
[3] Yalniz I Z, Jégou H, Chen K, et al. Billion-scale semi-supervised learning for image classification[J]. arXiv preprint arXiv:1905.00546, 2019. [3] Yalniz I Z, Jégou H, Chen K, et al. Billion-scale semi-supervised learning for image classification[J]. arXiv preprint arXiv:1905.00546, 2019.
[4] Cubuk E D, Zoph B, Mane D, et al. Autoaugment: Learning augmentation strategies from data[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2019: 113-123.
[5] Touvron H, Vedaldi A, Douze M, et al. Fixing the train-test resolution discrepancy[C]//Advances in Neural Information Processing Systems. 2019: 8250-8260.
...@@ -8,7 +8,7 @@ ...@@ -8,7 +8,7 @@
深度神经网络一般有较多的参数冗余,目前有几种主要的方法对模型进行压缩,减小其参数量。如裁剪、量化、知识蒸馏等,其中知识蒸馏是指使用教师模型(teacher model)去指导学生模型(student model)学习特定任务,保证小模型在参数量不变的情况下,得到比较大的性能提升,甚至获得与大模型相似的精度指标[1]。PaddleClas融合已有的蒸馏方法[2,3],提供了一种简单的半监督标签知识蒸馏方案(SSLD,Simple Semi-supervised Label Distillation),基于ImageNet1k分类数据集,在ResNet_vd以及MobileNet系列上的精度均有超过3%的绝对精度提升,具体指标如下图所示。 深度神经网络一般有较多的参数冗余,目前有几种主要的方法对模型进行压缩,减小其参数量。如裁剪、量化、知识蒸馏等,其中知识蒸馏是指使用教师模型(teacher model)去指导学生模型(student model)学习特定任务,保证小模型在参数量不变的情况下,得到比较大的性能提升,甚至获得与大模型相似的精度指标[1]。PaddleClas融合已有的蒸馏方法[2,3],提供了一种简单的半监督标签知识蒸馏方案(SSLD,Simple Semi-supervised Label Distillation),基于ImageNet1k分类数据集,在ResNet_vd以及MobileNet系列上的精度均有超过3%的绝对精度提升,具体指标如下图所示。
![](../../../images/distillation/distillation_perform.png) ![](../../../images/distillation/distillation_perform_s.jpg)
# 二、SSLD 蒸馏策略 # 二、SSLD 蒸馏策略
...@@ -17,10 +17,8 @@ ...@@ -17,10 +17,8 @@
SSLD的流程图如下图所示。 SSLD的流程图如下图所示。
![](../../../images/distillation/ppcls_distillation.png) ![](../../../images/distillation/ppcls_distillation.png)
首先,我们从ImageNet22k中挖掘出了近400万张图片,同时与ImageNet-1k训练集整合在一起,得到了一个新的包含500万张图片的数据集。然后,我们将学生模型与教师模型组合成一个新的网络,该网络分别输出学生模型和教师模型的预测分布,与此同时,固定教师模型整个网络的梯度,而学生模型可以做正常的反向传播。最后,我们将两个模型的logits经过softmax激活函数转换为soft label,并将二者的soft label做JS散度作为损失函数,用于蒸馏模型训练。下面以MobileNetV3(该模型直接训练,精度为75.3%)的知识蒸馏为例,介绍该方案的核心关键点(baseline为79.12%的ResNet50_vd模型蒸馏MobileNetV3,训练集为ImageNet1k训练集,loss为cross entropy loss,迭代轮数为120epoch,精度指标为75.6%)。 首先,我们从ImageNet22k中挖掘出了近400万张图片,同时与ImageNet-1k训练集整合在一起,得到了一个新的包含500万张图片的数据集。然后,我们将学生模型与教师模型组合成一个新的网络,该网络分别输出学生模型和教师模型的预测分布,与此同时,固定教师模型整个网络的梯度,而学生模型可以做正常的反向传播。最后,我们将两个模型的logits经过softmax激活函数转换为soft label,并将二者的soft label做JS散度作为损失函数,用于蒸馏模型训练。下面以MobileNetV3(该模型直接训练,精度为75.3%)的知识蒸馏为例,介绍该方案的核心关键点(baseline为79.12%的ResNet50_vd模型蒸馏MobileNetV3,训练集为ImageNet1k训练集,loss为cross entropy loss,迭代轮数为120epoch,精度指标为75.6%)。
* 教师模型的选择。在进行知识蒸馏时,如果教师模型与学生模型的结构差异太大,蒸馏得到的结果反而不会有太大收益。相同结构下,精度更高的教师模型对结果也有很大影响。相比于79.12%的ResNet50_vd教师模型,使用82.4%的ResNet50_vd教师模型可以带来0.4%的绝对精度收益(`75.6%->76.0%`)。 * 教师模型的选择。在进行知识蒸馏时,如果教师模型与学生模型的结构差异太大,蒸馏得到的结果反而不会有太大收益。相同结构下,精度更高的教师模型对结果也有很大影响。相比于79.12%的ResNet50_vd教师模型,使用82.4%的ResNet50_vd教师模型可以带来0.4%的绝对精度收益(`75.6%->76.0%`)。
...@@ -103,6 +101,14 @@ SSLD的流程图如下图所示。 ...@@ -103,6 +101,14 @@ SSLD的流程图如下图所示。
| ResNet101_vd | 30 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 83.73% | | ResNet101_vd | 30 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 83.73% |
## 3.4 数据增广以及基于Fix策略的微调
* 基于前文所述的实验结论,我们在训练的过程中加入自动增广(AutoAugment)[4],同时进一步减小了l2_decay(4e-5->2e-5),最终ResNet50_vd经过SSLD蒸馏策略,在ImageNet1k上的精度可以达到82.99%,相比之前不加数据增广的蒸馏策略再次增加了0.6%。
* 对于图像分类任务,在测试的时候,测试尺度为训练尺度的1.15倍左右时,往往在不需要重新训练模型的情况下,模型的精度指标就可以进一步提升[5],对于82.99%的ResNet50_vd在320x320的尺度下测试,精度可达83.7%,我们进一步使用Fix策略,即在320x320的尺度下进行训练,使用与预测时相同的数据预处理方法,同时固定除FC层以外的所有参数,最终在320x320的预测尺度下,精度可以达到**84.0%**
## 3.4 实验过程中的一些问题 ## 3.4 实验过程中的一些问题
### 3.4.1 bn的计算方法 ### 3.4.1 bn的计算方法
...@@ -266,3 +272,7 @@ sh tools/run.sh ...@@ -266,3 +272,7 @@ sh tools/run.sh
[2] Bagherinezhad H, Horton M, Rastegari M, et al. Label refinery: Improving imagenet classification through label progression[J]. arXiv preprint arXiv:1805.02641, 2018. [2] Bagherinezhad H, Horton M, Rastegari M, et al. Label refinery: Improving imagenet classification through label progression[J]. arXiv preprint arXiv:1805.02641, 2018.
[3] Yalniz I Z, Jégou H, Chen K, et al. Billion-scale semi-supervised learning for image classification[J]. arXiv preprint arXiv:1905.00546, 2019. [3] Yalniz I Z, Jégou H, Chen K, et al. Billion-scale semi-supervised learning for image classification[J]. arXiv preprint arXiv:1905.00546, 2019.
[4] Cubuk E D, Zoph B, Mane D, et al. Autoaugment: Learning augmentation strategies from data[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2019: 113-123.
[5] Touvron H, Vedaldi A, Douze M, et al. Fixing the train-test resolution discrepancy[C]//Advances in Neural Information Processing Systems. 2019: 8250-8260.
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册