add kd doc (#1997)

* add kd doc
* fix
* add ssld doc
* fix ssld
* fix ssld
* Update knowledge_distillation.md
* fix doc
* fix dist export
* fix
* add dist doc
* fix speed info
* Update ssld.md
* Update ssld.md

794af8c0 · littletomatodonkey · fed4ea69
11 changed files
--- a/docs/zh_CN/PULC/PULC_traffic_sign.md
+++ b/docs/zh_CN/PULC/PULC_traffic_sign.md
@@ -44,14 +44,14 @@

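 | Model | Top-1 Acc (%) | Latency (ms) | Storage (M) | Strategy |
 |-------|-----------|----------|---------------|---------------|
-| SwinTransformer_tiny  | 98.11 | 87.19  | 111 | ImageNet pretrained model |
-| MobileNetV3_large_x1_0  | 97.79 | 5.59  | 23 | ImageNet pretrained model |
-| PPLCNet_x1_0  | 97.78 | 2.67  | 8.2 | ImageNet pretrained model |
-| PPLCNet_x1_0  | 97.84 | 2.67  | 8.2 | SSLD pretrained model |
-| PPLCNet_x1_0  | 98.14 | 2.67  | 8.2 | SSLD pretrained model + EDA strategy |
-| <b>PPLCNet_x1_0</b>  | <b>98.35</b> | <b>2.67</b>  | <b>8.2</b> | SSLD pretrained model + EDA strategy + SKL-UGI knowledge distillation |
-
-As the table shows, the SwinTransformer_tiny backbone is accurate but slow at inference. Switching to the lightweight MobileNetV3_large_x1_0 makes inference much faster, at a noticeable cost in accuracy. Switching to PPLCNet_x1_0 costs a further 0.01% in accuracy but improves speed by about 2x. On top of that, using the SSLD pretrained model raises accuracy by about 0.06% without changing inference speed; adding the EDA strategy raises it by another 0.3%; and finally SKL-UGI knowledge distillation adds a further 0.21%. At this point PPLCNet_x1_0 is more accurate than SwinTransformer_tiny and 32 times faster. The PULC training, inference, and deployment methods are described in detail below.
+| SwinTransformer_tiny  | 98.11 | 89.45  | 111 | ImageNet pretrained model |
+| MobileNetV3_large_x1_0  | 97.79 | 4.81  | 23 | ImageNet pretrained model |
+| PPLCNet_x1_0  | 97.78 | 2.10  | 8.2 | ImageNet pretrained model |
+| PPLCNet_x1_0  | 97.84 | 2.10  | 8.2 | SSLD pretrained model |
+| PPLCNet_x1_0  | 98.14 | 2.10  | 8.2 | SSLD pretrained model + EDA strategy |
+| <b>PPLCNet_x1_0</b>  | <b>98.35</b> | <b>2.10</b>  | <b>8.2</b> | SSLD pretrained model + EDA strategy + SKL-UGI knowledge distillation |
+
+As the table shows, the SwinTransformer_tiny backbone is accurate but slow at inference. Switching to the lightweight MobileNetV3_large_x1_0 makes inference much faster, at a noticeable cost in accuracy. Switching to PPLCNet_x1_0 costs a further 0.01% in accuracy but improves speed by about 1x (roughly twice as fast). On top of that, using the SSLD pretrained model raises accuracy by about 0.06% without changing inference speed; adding the EDA strategy raises it by another 0.3%; and finally SKL-UGI knowledge distillation adds a further 0.21%. At this point PPLCNet_x1_0 is more accurate than SwinTransformer_tiny and about 41 times faster. The PULC training, inference, and deployment methods are described in detail below.
 
 **Note:**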


--- a/docs/zh_CN/PULC/PULC_vehicle_attr.md
+++ b/docs/zh_CN/PULC/PULC_vehicle_attr.md
@@ -44,15 +44,15 @@

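 | Model | ma (%) | Latency (ms) | Storage (M) | Strategy |
 |-------|-----------|----------|---------------|---------------|
-| Res2Net200_vd_26w_4s  | 91.36 | 66.58  | 293 | ImageNet pretrained model |
-| ResNet50  | 89.98 | 12.74  | 92 | ImageNet pretrained model |
-| MobileNetV3_large_x1_0  | 89.77 | 5.59  | 23 | ImageNet pretrained model |
-| PPLCNet_x1_0  | 89.57 | 2.56  | 8.2 | ImageNet pretrained model |
-| PPLCNet_x1_0  | 90.07 | 2.56  | 8.2 | SSLD pretrained model |
-| PPLCNet_x1_0  | 90.59 | 2.56  | 8.2 | SSLD pretrained model + EDA strategy |
-| <b>PPLCNet_x1_0</b>  | <b>90.81</b> | <b>2.56</b>  | <b>8.2</b> | SSLD pretrained model + EDA strategy + SKL-UGI knowledge distillation |
-
-As the table shows, the Res2Net200_vd_26w_4s backbone is accurate but slow at inference. Switching to the lightweight MobileNetV3_large_x1_0 makes inference much faster, at a noticeable cost in accuracy. Switching to PPLCNet_x1_0 costs a further 0.2% in accuracy but improves speed by about 2x. On top of that, using the SSLD pretrained model raises accuracy by about 0.5% without changing inference speed; adding the EDA strategy raises it by another 0.52%; and finally SKL-UGI knowledge distillation adds a further 0.23%. At this point PPLCNet_x1_0 is only 0.55% less accurate than Res2Net200_vd_26w_4s but 26 times faster. The PULC training, inference, and deployment methods are described in detail below.
+| Res2Net200_vd_26w_4s  | 91.36 | 79.46  | 293 | ImageNet pretrained model |
+| ResNet50  | 89.98 | 12.83  | 92 | ImageNet pretrained model |
+| MobileNetV3_large_x1_0  | 89.77 | 5.09  | 23 | ImageNet pretrained model |
+| PPLCNet_x1_0  | 89.57 | 2.36  | 8.2 | ImageNet pretrained model |
+| PPLCNet_x1_0  | 90.07 | 2.36  | 8.2 | SSLD pretrained model |
+| PPLCNet_x1_0  | 90.59 | 2.36  | 8.2 | SSLD pretrained model + EDA strategy |
+| <b>PPLCNet_x1_0</b>  | <b>90.81</b> | <b>2.36</b>  | <b>8.2</b> | SSLD pretrained model + EDA strategy + SKL-UGI knowledge distillation |
+
+As the table shows, the Res2Net200_vd_26w_4s backbone is accurate but slow at inference. Switching to the lightweight MobileNetV3_large_x1_0 makes inference much faster, at a noticeable cost in accuracy. Switching to PPLCNet_x1_0 costs a further 0.2% in accuracy but improves speed by about 1x (roughly twice as fast). On top of that, using the SSLD pretrained model raises accuracy by about 0.5% without changing inference speed; adding the EDA strategy raises it by another 0.52%; and finally SKL-UGI knowledge distillation adds a further 0.23%. At this point PPLCNet_x1_0 is only 0.55% less accurate than Res2Net200_vd_26w_4s but 32 times faster. The PULC training, inference, and deployment methods are described in detail below.
 
 **Note:**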


--- a/docs/zh_CN/advanced_tutorials/knowledge_distillation.md
+++ b/docs/zh_CN/advanced_tutorials/knowledge_distillation.md
--- a/docs/zh_CN/advanced_tutorials/ssld.md
+++ b/docs/zh_CN/advanced_tutorials/ssld.md
+
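+# SSLD Knowledge Distillation in Practice
+
+## Contents
+
+- [1. Algorithm Introduction](#1)
+    - [1.1 Introduction to Knowledge Distillation](#1.1)
+    - [1.2 SSLD Distillation Strategy](#1.2)
+- [2. SSLD Pretrained Model Zoo](#2)
+- [3. Using SSLD](#3)
+    - [3.1 Fine-tuning from an SSLD Model](#3.1)
+    - [3.2 Knowledge Distillation with the SSLD Scheme](#3.2)
+- [4. References](#4)
+
+
+
+<a name="1"></a>
+
+## 1. Algorithm Introduction
+
+<a name="1.1"></a>
+
+### 1.1 Introduction to Knowledge Distillation
+
+PaddleClas combines existing knowledge distillation methods [2,3] into a simple semi-supervised label distillation scheme (SSLD, Simple Semi-supervised Label Distillation). On the ImageNet1k classification dataset, it improves absolute accuracy by more than 3% across the ResNet_vd and MobileNet series; the detailed metrics are shown in the figure below.
+
+<div align="center">
+<img src="../../images/distillation/distillation_perform_s.jpg"  width = "800" />
+</div>
+
+<a name="1.2"></a>
+
+### 1.2 SSLD Distillation Strategy
+
+The SSLD pipeline is shown in the figure below.
+
+<div align="center">
+<img src="../../images/distillation/ppcls_distillation.png"  width = "800" />
+</div>
+
+First, we mined nearly 4 million images from ImageNet22k and merged them with the ImageNet-1k training set, giving a new dataset of 5 million images. Next, we combine the student and teacher models into a single network that outputs the predicted distributions of both; the teacher's gradients are frozen throughout, while the student is trained with normal backpropagation. Finally, the logits of both models are passed through softmax to obtain soft labels, and the JS divergence between the two soft labels is used as the loss for distillation training.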
+
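The loss described above can be illustrated with a small pure-Python sketch. It is not the PaddleClas implementation: the function names `softmax`, `kl_divergence`, and `js_divergence` are hypothetical, and the plain Jensen-Shannon divergence is assumed, whereas the experiment table mentions an improved variant whose details are not given here.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (a soft label)."""
    peak = max(logits)  # subtract the max so exp() stays numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(student_logits, teacher_logits):
    """Jensen-Shannon divergence between student and teacher soft labels."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    mix = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)
```

In actual training only the student would receive gradients from this loss, since the teacher is frozen; no ground-truth label appears anywhere in the computation.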
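+Taking knowledge distillation of MobileNetV3 (75.3% accuracy when trained directly) as an example, the key optimizations of the scheme are listed below.
+
+
+| Experiment ID | Strategy               | Top-1 acc |
+|:------:|:---------:|:--------:|
+| 1    | baseline         | 75.60%    |
+| 2    | Switch to teacher weights with 82.4% accuracy | 76.00%    |
+| 3    | Use the improved JS divergence loss    | 76.20%    |
+| 4    | Increase training to 360 epochs  | 77.10%    |
+| 5    | Add 4 million mined unlabeled images      | 78.50%    |
+| 6    | Fine-tune on ImageNet1k | 78.90%    |
+
+* Note: the baseline training settings are
+    * Training data: ImageNet1k dataset
+    * Loss function: Cross Entropy Loss
+    * Epochs: 120
+
+
+A key feature of the SSLD distillation scheme is that it needs no ground-truth labels for the images, so the dataset can be extended arbitrarily. Given limited compute resources, we only extend the distillation training set with ImageNet22k here. In the SSLD distillation task, we use the `Top-k per class` data sampling scheme [3]. The steps are as follows.
+
+(1) Training set deduplication. We first deduplicate ImageNet22k against the ImageNet1k validation set using SIFT feature similarity matching, so that the added ImageNet22k training images do not contain ImageNet1k validation images; 4511 similar images were removed in total. Some of the filtered similar images are shown below.
+
+<div align="center">
+<img src="../../images/distillation/22k_1k_val_compare_w_sift.png"  width = "600" />
+</div>
+
+(2) Soft labels for the large dataset. For the deduplicated ImageNet22k dataset, we run the `ResNeXt101_32x16d_wsl` model to obtain a soft label for each image.
+
+(3) Top-k data selection. ImageNet1k has 1000 classes. For each class, we pick the `k` highest-scoring images that belong to that class, which yields a dataset of at most `1000*k` images (some classes may end up with fewer than `k` images).
+
+(4) Merge this dataset with the ImageNet1k training set to form the final dataset used for distillation, 5 million images in total.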
+
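Step (3) can be sketched as follows. This is an illustrative sketch only: `select_top_k_per_class` is a hypothetical helper, soft labels are assumed to be plain lists of per-class scores keyed by image ID, and "belongs to a class" is assumed to mean that the class has the highest score for that image.

```python
import heapq
from collections import defaultdict


def select_top_k_per_class(soft_labels, k):
    """Pick up to k highest-scoring images for each predicted class.

    soft_labels maps image_id -> list of class scores (assumed format).
    Returns a dict mapping class index -> list of image IDs, best first.
    """
    per_class = defaultdict(list)
    for image_id, scores in soft_labels.items():
        # Assumption: an image belongs to its highest-scoring class.
        best_class = max(range(len(scores)), key=scores.__getitem__)
        per_class[best_class].append((scores[best_class], image_id))

    selected = {}
    for cls, scored in per_class.items():
        # Classes with fewer than k candidates simply keep all of them.
        top = heapq.nlargest(k, scored)
        selected[cls] = [image_id for _, image_id in top]
    return selected
```

Because each class contributes at most `k` images, the result never exceeds `num_classes * k` images, which matches the bound stated in step (3).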
+
+<a name="2"></a>
+
+## 2. SSLD Pretrained Model Zoo
+
+
+The mobile pretrained models are listed below.
+
+| Model                     | FLOPs(M) | Params(M) | top-1 acc | SSLD top-1 acc | Accuracy gain | Download |
+|-------------------|----------|-----------|----------|---------------|--------|------|
+| PPLCNetV2_base          | 604.16     | 6.54       | 77.04%   | 80.10%        | +3.06% |   [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/PPLCNetV2_base_ssld_pretrained.pdparams)   |
+| PPLCNet_x2_5            | 906.49    | 9.04        | 76.60%   | 80.82%        | +4.22% |  [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/PPLCNet_x2_5_ssld_pretrained.pdparams)    |
+| PPLCNet_x1_0            | 160.81    | 2.96         | 71.32%   | 74.39%        | +3.07% |   [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/PPLCNet_x1_0_ssld_pretrained.pdparams)   |
+| PPLCNet_x0_5            | 47.28    | 1.89       | 63.14%   | 66.10%        | +2.96% |   [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/PPLCNet_x0_5_ssld_pretrained.pdparams)   |
+| PPLCNet_x0_25           | 18.43    | 1.52       | 51.86%   | 53.43%        | +1.57% |   [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/PPLCNet_x0_25_ssld_pretrained.pdparams)   |
+| MobileNetV1             | 578.88     | 4.19      | 71.00%   | 77.90%        | +6.90% |   [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/MobileNetV1_ssld_pretrained.pdparams)   |
+| MobileNetV2             | 327.84      | 3.44      | 72.20%   | 76.74%        | +4.54% |   [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/MobileNetV2_ssld_pretrained.pdparams)   |
+| MobileNetV3_large_x1_0  | 229.66     | 5.47      | 75.30%   | 79.00%        | +3.70% |   [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/MobileNetV3_large_x1_0_ssld_pretrained.pdparams)   |
+| MobileNetV3_small_x1_0  | 63.67    | 2.94      | 68.20%   | 71.30%        | +3.10% |   [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/MobileNetV3_small_x1_0_ssld_pretrained.pdparams)   |
+| MobileNetV3_small_x0_35 | 14.56    | 1.66      | 53.00%   | 55.60%        | +2.60% |   [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/MobileNetV3_small_x0_35_ssld_pretrained.pdparams)   |
+| GhostNet_x1_3_ssld      | 236.89     | 7.30       | 75.70%   | 79.40%        | +3.70% |   [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/GhostNet_x1_3_ssld_pretrained.pdparams)   |
+
+* Note: `top-1 acc` is the accuracy of the model trained in the usual way, and `SSLD top-1 acc` is the accuracy obtained with the SSLD knowledge distillation training strategy.
+
+
+The server-side pretrained models are listed below.
+
+| Model                | FLOPs(G) | Params(M) | top-1 acc | SSLD top-1 acc | Accuracy gain | Download                                                                                  |
+|----------------------|----------|-----------|----------|---------------|--------|-------------------------------------------------------------------------------------------|
+| PPHGNet_base         | 25.14    | 71.62     | -   | 85.00% | - | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/PPHGNet_base_ssld_pretrained.pdparams) |
+| PPHGNet_small        | 8.53     | 24.38     | 81.50%   | 83.80% | +2.30% | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/PPHGNet_small_ssld_pretrained.pdparams) |
+| PPHGNet_tiny         | 4.54     | 14.75     | 79.83%   | 81.95% | +2.12% | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/PPHGNet_tiny_ssld_pretrained.pdparams) |
+| ResNet50_vd          | 8.67     | 25.58     | 79.10%   | 83.00% | +3.90% | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ResNet50_vd_ssld_pretrained.pdparams) |
+| ResNet101_vd         | 16.1     | 44.57     | 80.20%   | 83.70% | +3.50% | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/ResNet101_vd_ssld_pretrained.pdparams) |
+| ResNet34_vd          | 7.39     | 21.82     | 76.00%   | 79.70% | +3.70% | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/ResNet34_vd_ssld_pretrained.pdparams) |
+| Res2Net50_vd_26w_4s  | 8.37     | 25.06     | 79.80%   | 83.10% | +3.30% | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/Res2Net50_vd_26w_4s_ssld_pretrained.pdparams) |
+| Res2Net101_vd_26w_4s | 16.67    | 45.22     | 80.60%   | 83.90% | +3.30% | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/Res2Net101_vd_26w_4s_ssld_pretrained.pdparams) |
+| Res2Net200_vd_26w_4s | 31.49    | 76.21     | 81.20%   | 85.10% | +3.90% | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/Res2Net200_vd_26w_4s_ssld_pretrained.pdparams) |
+| HRNet_W18_C          | 4.14     | 21.29     | 76.90%   | 81.60% | +4.70% | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/_ssld_pretrained.pdparams) |
+| HRNet_W48_C          | 34.58    | 77.47     | 79.00%   | 83.60% | +4.60% | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/HRNet_W48_C_ssld_pretrained.pdparams) |
+| SE_HRNet_W64_C       | 57.83    | 128.97    | -        | 84.70% |   -     | [Link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/legendary_models/SE_HRNet_W64_C_ssld_pretrained.pdparams) |
+
+
+<a name="3"></a>
+
+## 3. Using SSLD
+
+<a name="3.1"></a>
+
+### 3.1 Fine-tuning from an SSLD Model
+
+To use a pretrained model directly, add `-o Arch.pretrained=True -o Arch.use_ssld=True` to the training command. This loads the SSLD-based pretrained model, as in the example below.
+
+```shell
+# Single machine, single GPU training
+python3 tools/train.py -c ppcls/configs/ImageNet/ResNet/ResNet50_vd.yaml -o Arch.pretrained=True -o Arch.use_ssld=True
+# Single machine, multi-GPU training
+python3 -m paddle.distributed.launch --gpus="0,1,2,3" tools/train.py -c ppcls/configs/ImageNet/ResNet/ResNet50_vd.yaml -o Arch.pretrained=True -o Arch.use_ssld=True
+```
+
+<a name="3.2"></a>
+
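+### 3.2 Knowledge Distillation with the SSLD Scheme
+
+Unlike most other knowledge distillation algorithms, SSLD does not depend on data annotation, and introducing unlabeled data can further improve model accuracy.
+
+Unlabeled data must be organized in exactly the same way as labeled data: place the files in the same directory as the current labeled dataset and record their label as `0`. Assuming the resulting label file is named `train_list_unlabel.txt`, the following command generates the label file used for SSLD training.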
+
+```shell
+cat train_list.txt train_list_unlabel.txt > train_list_all.txt
+```
+
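+For more on data labels for image classification tasks, see [PaddleClas image classification dataset format](../data_preparation/classification_dataset.md#1-数据集格式说明).
+
+PaddleClas integrates PULC, a practical ultra-lightweight image classification solution. It covers the use of SSLD ImageNet pretrained models as well as a more general knowledge distillation scheme for unlabeled data. For details, see the [PULC ultra-lightweight image classification tutorial](../PULC/PULC_train.md).
+
+<a name="4"></a>
+
+## 4. References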
+
+[1] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.
+
+[2] Bagherinezhad H, Horton M, Rastegari M, et al. Label refinery: Improving imagenet classification through label progression[J]. arXiv preprint arXiv:1805.02641, 2018.
+
+[3] Yalniz I Z, Jégou H, Chen K, et al. Billion-scale semi-supervised learning for image classification[J]. arXiv preprint arXiv:1905.00546, 2019.
+
+[4] Touvron H, Vedaldi A, Douze M, et al. Fixing the train-test resolution discrepancy[C]//Advances in Neural Information Processing Systems. 2019: 8250-8260.
--- a/docs/zh_CN/algorithm_introduction/knowledge_distillation.md
+++ b/docs/zh_CN/algorithm_introduction/knowledge_distillation.md
@@ -42,7 +42,7 @@

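 PaddleClas proposes SSLD [6], a simple and practical knowledge distillation algorithm. It removes the dependence on ground-truth labels during training and, combined with a large amount of unlabeled data, yields distilled pretrained models whose accuracy improves by 3% on average across 15 models.
 
-The standard distillation approach above uses a large model as the teacher to guide the student. DML (Deep Mutual Learning) [7] was developed later: two models with the same architecture learn from each other. Compared with KD and other algorithms that rely on a large teacher model, DML removes that dependence, so the distillation training pipeline is simpler and models can be produced more efficiently.
+The standard distillation approach above uses a large model as the teacher to guide the student. DML (Deep Mutual Learning) [7] was developed later: two models with the same architecture learn from each other. Compared with KD and other algorithms that rely on a large teacher model, DML removes that dependence, so the distillation training pipeline is simpler and models can be produced more efficiently.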

 <a name='3.2'></a>
 ### 3.2 Feature based distillation

--- a/docs/zh_CN/models_training/distributed_training.md
+++ b/docs/zh_CN/models_training/distributed_training.md
@@ -42,11 +42,21 @@ python3 -m paddle.distributed.launch \

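 ## 3. Performance Benchmarks
 
-* On 4 machines with 8 V100 GPUs each, models were trained with the [SSLD knowledge distillation training strategy](../advanced_tutorials/knowledge_distillation.md) (5 million images). The training time and multi-machine speedup for different models are shown below.
+* On a single machine with 8 V100 GPUs, models were trained with the [SSLD knowledge distillation training strategy](../advanced_tutorials/ssld.md) (5 million images). The training time and single-machine 8-GPU speedup for different models are shown below.
+
+
+| Model   | Accuracy | Time, 1 machine x 1 GPU | Time, 1 machine x 8 GPUs | Speedup |
+|:---------:|:--------:|:--------:|:--------:|:------:|
+| PPHGNet-base_ssld   | 85.00% | 133.2d | 18.96d  | **7.04** |
+| PPLCNetv2-base_ssld | 80.10% | 31.6d   | 6.4d  | **4.93** |
+| PPLCNet_x0_25_ssld  | 53.43% | 21.8d   | 6.2d  | **3.99** |
+
+
+* On 4 machines with 8 V100 GPUs each, models were trained with the [SSLD knowledge distillation training strategy](../advanced_tutorials/ssld.md) (5 million images). The training time and multi-machine speedup for different models are shown below.
 
 
 | Model   | Accuracy | Time, 1 machine x 8 GPUs | Time, 4 machines x 8 GPUs | Speedup |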
 |:---------:|:--------:|:--------:|:--------:|:------:|
-| PPHGNet-base_ssld   | 85.00% | 15.74d | 4.86d  | **3.23** |
+| PPHGNet-base_ssld   | 85.00% | 18.96d | 4.86d  | **3.90** |
 | PPLCNetv2-base_ssld | 80.10% | 6.4d   | 1.67d  | **3.83** |
 | PPLCNet_x0_25_ssld  | 53.43% | 6.2d   | 1.78d  | **3.48** |
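The speedup column is the ratio of the two training times in each row. The sketch below recomputes it for the multi-machine table and also derives a scaling efficiency, which is an added illustrative quantity (speedup divided by the increase in hardware), not a figure reported in the table. The function names are hypothetical.

```python
def speedup(baseline_days, scaled_days):
    """How many times faster the scaled-out run finishes."""
    return baseline_days / scaled_days


def scaling_efficiency(baseline_days, scaled_days, resource_factor):
    """Fraction of ideal linear scaling achieved (1.0 means perfect)."""
    return speedup(baseline_days, scaled_days) / resource_factor


# (model, days on 1 machine x 8 GPUs, days on 4 machines x 8 GPUs)
rows = [
    ("PPHGNet-base_ssld", 18.96, 4.86),
    ("PPLCNetv2-base_ssld", 6.4, 1.67),
    ("PPLCNet_x0_25_ssld", 6.2, 1.78),
]

for name, one_machine, four_machines in rows:
    s = speedup(one_machine, four_machines)
    e = scaling_efficiency(one_machine, four_machines, resource_factor=4)
    print(f"{name}: speedup {s:.2f}, efficiency {e:.0%}")
```

Going from one machine to four multiplies the hardware by 4, so a speedup close to 4 indicates that communication overhead is small for that model.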
--- a/ppcls/configs/ImageNet/Distillation/PPLCNet_x2_5_dml.yaml
+++ b/ppcls/configs/ImageNet/Distillation/PPLCNet_x2_5_dml.yaml
+# global configs
+Global:
+  checkpoints: null
+  pretrained_model: null
+  output_dir: ./output_lcnet_x2_5_dml
+  device: gpu
+  save_interval: 1
+  eval_during_train: True
+  eval_interval: 1
+  epochs: 100
+  print_batch_step: 10
+  use_visualdl: False
+  # used for static mode and model export
+  image_shape: [3, 224, 224]
+  save_inference_dir: ./inference
+
+AMP:
+  scale_loss: 128.0
+  use_dynamic_loss_scaling: True
+  # O1: mixed fp16
+  level: O1
+
+# model architecture
+Arch:
+  name: "DistillationModel"
+  class_num: &class_num 1000
+  # if not null, its lengths should be same as models
+  pretrained_list:
+  # if not null, its lengths should be same as models
+  freeze_params_list:
+  - False
+  - False
+  infer_model_name: "Student"
+  models:
+    - Teacher:
+        name: PPLCNet_x2_5
+        class_num: *class_num
+        pretrained: False
+    - Student:
+        name: PPLCNet_x2_5
+        class_num: *class_num
+        pretrained: False
+ 
+# loss function config for training/eval process
+Loss:
+  Train:
+    - DistillationGTCELoss:
+        weight: 1.0
+        model_names: ["Student", "Teacher"]
+    - DistillationDMLLoss:
+        weight: 1.0
+        model_name_pairs:
+        - ["Student", "Teacher"]
+  Eval:
+    - CELoss:
+        weight: 1.0
+
+
+Optimizer:
+  name: Momentum
+  momentum: 0.9
+  lr:
+    name: Cosine
+    learning_rate: 0.4
+    warmup_epoch: 5
+  regularizer:
+    name: 'L2'
+    coeff: 0.00004
+
+
+# data loader for train and eval
+DataLoader:
+  Train:
+    dataset:
+      name: ImageNetDataset
+      image_root: ./dataset/ILSVRC2012/
+      cls_label_path: ./dataset/ILSVRC2012/train_list.txt
+      transform_ops:
+        - DecodeImage:
+            to_rgb: True
+            channel_first: False
+        - RandCropImage:
+            size: 224
+        - RandFlipImage:
+            flip_code: 1
+        - NormalizeImage:
+            scale: 1.0/255.0
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+
+    sampler:
+      name: DistributedBatchSampler
+      batch_size: 128
+      drop_last: False
+      shuffle: True
+    loader:
+      num_workers: 8
+      use_shared_memory: True
+
+  Eval:
+    dataset: 
+      name: ImageNetDataset
+      image_root: ./dataset/ILSVRC2012/
+      cls_label_path: ./dataset/ILSVRC2012/val_list.txt
+      transform_ops:
+        - DecodeImage:
+            to_rgb: True
+            channel_first: False
+        - ResizeImage:
+            resize_short: 256
+        - CropImage:
+            size: 224
+        - NormalizeImage:
+            scale: 1.0/255.0
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+    sampler:
+      name: DistributedBatchSampler
+      batch_size: 256
+      drop_last: False
+      shuffle: False
+    loader:
+      num_workers: 8
+      use_shared_memory: True
+
+Infer:
+  infer_imgs: docs/images/inference_deployment/whl_demo.jpg
+  batch_size: 10
+  transforms:
+    - DecodeImage:
+        to_rgb: True
+        channel_first: False
+    - ResizeImage:
+        resize_short: 256
+    - CropImage:
+        size: 224
+    - NormalizeImage:
+        scale: 1.0/255.0
+        mean: [0.485, 0.456, 0.406]
+        std: [0.229, 0.224, 0.225]
+        order: ''
+    - ToCHWImage:
+  PostProcess:
+    name: Topk
+    topk: 5
+    class_id_map_file: ppcls/utils/imagenet1k_label_list.txt
+
+Metric:
+    Train:
+    - DistillationTopkAcc:
+        model_key: "Student"
+        topk: [1, 5]
+    Eval:
+    - DistillationTopkAcc:
+        model_key: "Student"
+        topk: [1, 5]
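To make the `DistillationDMLLoss` entry in this config more concrete, here is a pure-Python sketch of a mutual-learning loss between two sets of logits. The PaddleClas implementation is not part of this diff, so the averaged symmetric KL form and the names `softmax`, `kl_divergence`, and `dml_loss` are assumptions based on the Deep Mutual Learning idea, not the library's code.

```python
import math


def softmax(logits):
    """Convert logits to probabilities with a max-shift for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon guarding against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dml_loss(logits_a, logits_b):
    """Symmetric KL between two models' predictions (assumed DML form)."""
    p = softmax(logits_a)
    q = softmax(logits_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

In the config above neither model is frozen (`freeze_params_list` is `False` for both), so a loss of this kind would pull the two networks toward each other while `DistillationGTCELoss` ties each of them to the ground-truth labels.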
--- a/ppcls/configs/ImageNet/Distillation/PPLCNet_x2_5_ssld.yaml
+++ b/ppcls/configs/ImageNet/Distillation/PPLCNet_x2_5_ssld.yaml
+# global configs
+Global:
+  checkpoints: null
+  pretrained_model: null
+  output_dir: ./output_r50_vd_distill
+  device: gpu
+  save_interval: 1
+  eval_during_train: True
+  eval_interval: 1
+  epochs: 100
+  print_batch_step: 10
+  use_visualdl: False
+  # used for static mode and model export
+  image_shape: [3, 224, 224]
+  save_inference_dir: ./inference
+  to_static: True
+
+AMP:
+  scale_loss: 128.0
+  use_dynamic_loss_scaling: True
+  # O1: mixed fp16
+  level: O1
+
+# model architecture
+Arch:
+  name: "DistillationModel"
+  class_num: &class_num 1000
+  # if not null, its lengths should be same as models
+  pretrained_list:
+  # if not null, its lengths should be same as models
+  freeze_params_list:
+  - True
+  - False
+  infer_model_name: "Student"
+  models:
+    - Teacher:
+        name: ResNet50_vd
+        class_num: *class_num
+        pretrained: True
+        use_ssld: True
+    - Student:
+        name: PPLCNet_x2_5
+        class_num: *class_num
+        pretrained: False
+ 
+# loss function config for training/eval process
+Loss:
+  Train:
+    - DistillationDMLLoss:
+        weight: 1.0
+        model_name_pairs:
+        - ["Student", "Teacher"]
+  Eval:
+    - CELoss:
+        weight: 1.0
+
+
+Optimizer:
+  name: Momentum
+  momentum: 0.9
+  lr:
+    name: Cosine
+    learning_rate: 0.2
+    warmup_epoch: 5
+  regularizer:
+    name: 'L2'
+    coeff: 0.00004
+
+
+# data loader for train and eval
+DataLoader:
+  Train:
+    dataset:
+      name: ImageNetDataset
+      image_root: ./dataset/ILSVRC2012/
+      cls_label_path: ./dataset/ILSVRC2012/train_list.txt
+      transform_ops:
+        - DecodeImage:
+            to_rgb: True
+            channel_first: False
+        - RandCropImage:
+            size: 224
+        - RandFlipImage:
+            flip_code: 1
+        - NormalizeImage:
+            scale: 1.0/255.0
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+
+    sampler:
+      name: DistributedBatchSampler
+      batch_size: 128
+      drop_last: False
+      shuffle: True
+    loader:
+      num_workers: 8
+      use_shared_memory: True
+
+  Eval:
+    dataset: 
+      name: ImageNetDataset
+      image_root: ./dataset/ILSVRC2012/
+      cls_label_path: ./dataset/ILSVRC2012/val_list.txt
+      transform_ops:
+        - DecodeImage:
+            to_rgb: True
+            channel_first: False
+        - ResizeImage:
+            resize_short: 256
+        - CropImage:
+            size: 224
+        - NormalizeImage:
+            scale: 1.0/255.0
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+    sampler:
+      name: DistributedBatchSampler
+      batch_size: 256
+      drop_last: False
+      shuffle: False
+    loader:
+      num_workers: 8
+      use_shared_memory: True
+
+Infer:
+  infer_imgs: docs/images/inference_deployment/whl_demo.jpg
+  batch_size: 10
+  transforms:
+    - DecodeImage:
+        to_rgb: True
+        channel_first: False
+    - ResizeImage:
+        resize_short: 256
+    - CropImage:
+        size: 224
+    - NormalizeImage:
+        scale: 1.0/255.0
+        mean: [0.485, 0.456, 0.406]
+        std: [0.229, 0.224, 0.225]
+        order: ''
+    - ToCHWImage:
+  PostProcess:
+    name: Topk
+    topk: 5
+    class_id_map_file: ppcls/utils/imagenet1k_label_list.txt
+
+Metric:
+    Train:
+    - DistillationTopkAcc:
+        model_key: "Student"
+        topk: [1, 5]
+    Eval:
+    - DistillationTopkAcc:
+        model_key: "Student"
+        topk: [1, 5]
--- a/ppcls/engine/engine.py
+++ b/ppcls/engine/engine.py
@@ -441,6 +441,8 @@ class Engine(object):

                if isinstance(out, list):
                    out = out[0]
+                if isinstance(out, dict) and "Student" in out:
+                    out = out["Student"]
                if isinstance(out, dict) and "logits" in out:
                    out = out["logits"]
                if isinstance(out, dict) and "output" in out:
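The unwrapping order in this hunk can be read as a small standalone function. The sketch below is illustrative only: `unwrap_output` is a hypothetical name, and the body of the final `"output"` branch, which is cut off by the hunk, is assumed to follow the same pattern as the branches before it.

```python
def unwrap_output(out):
    """Reduce a model output to the tensor-like value used for prediction."""
    if isinstance(out, list):
        out = out[0]
    # Distillation models return one entry per sub-model; keep the student.
    if isinstance(out, dict) and "Student" in out:
        out = out["Student"]
    if isinstance(out, dict) and "logits" in out:
        out = out["logits"]
    # Assumed body: the diff hunk ends before this branch's assignment.
    if isinstance(out, dict) and "output" in out:
        out = out["output"]
    return out
```

Checking for `"Student"` before `"logits"` matters: a distillation output such as `{"Student": {"logits": ...}, "Teacher": {...}}` has no top-level `"logits"` key, so without the new branch the dict would pass through unchanged.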

--- a/ppcls/loss/afdloss.py
+++ b/ppcls/loss/afdloss.py
@@ -97,8 +97,6 @@ class Attention(nn.Layer):
        super().__init__()
        self.qk_dim = qk_dim
        self.n_t = n_t
-        # self.linear_trans_s = LinearTransformStudent(qk_dim, t_shapes, s_shapes, unique_t_shapes)
-        # self.linear_trans_t = LinearTransformTeacher(qk_dim, t_shapes)

        self.p_t = self.create_parameter(
            shape=[len(t_shapes), qk_dim],

--- a/tools/search_strategy.py
+++ b/tools/search_strategy.py
@@ -59,7 +59,7 @@ def search_strategy():
    configs = config.get_config(
        args.config, overrides=args.override, show=False)
    base_config_file = configs["base_config_file"]
-    distill_config_file = configs["distill_config_file"]
+    distill_config_file = configs.get("distill_config_file", None)
    model_name = config.get_config(base_config_file)["Arch"]["name"]
    gpus = configs["gpus"]
    gpus = ",".join([str(i) for i in gpus])
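The effect of switching from indexing to `.get` can be shown with a plain dictionary. This is a sketch under the assumption that `configs` behaves like a `dict`; `read_config_files` is a hypothetical helper, not part of `tools/search_strategy.py`.

```python
def read_config_files(configs):
    """Return (base, distill) config paths; the distill entry is optional."""
    base = configs["base_config_file"]  # still required: raises KeyError if absent
    distill = configs.get("distill_config_file", None)  # optional: None if absent
    return base, distill
```

With this change a search config that omits `distill_config_file` no longer fails at load time; code further down would need to handle the `None` case itself.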