80 changed files
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -33,6 +33,3 @@
    - id: trailing-whitespace
      files: \.(md|yml)$
    - id: check-case-conflict
-    - id: flake8
-      args: ['--ignore=E265']
--- a/README.md
+++ b/README.md
+简体中文 | [English](README_en.md)
 # PaddleClas
 **文档教程**: https://paddleclas.readthedocs.io
@@ -20,7 +22,7 @@
    <img src="./docs/images/models/V100_benchmark/v100.fp32.bs1.main_fps_top1_s.jpg" width="700">
 </div>
-上图对比了一些最新的面向服务器端应用场景的模型，在使用V100，FP32和TensorRT，batch size为1时的预测时间及其准确率，图中准确率82.4%的ResNet50_vd_ssld和83.7%的ResNet101_vd_ssld，是采用PaddleClas提供的SSLD知识蒸馏方案训练的模型。图中相同颜色和符号的点代表同一系列不同规模的模型。不同模型的简介、FLOPS、Parameters以及详细的GPU预测时间(包括不同batchsize的T4卡预测速度)请参考文档教程中的[**模型库章节**](https://paddleclas.readthedocs.io/zh_CN/latest/models/models_intro.html)。
+上图对比了一些最新的面向服务器端应用场景的模型，在使用V100，FP32和TensorRT，batch size为1时的预测时间及其准确率，图中准确率83.0%的ResNet50_vd_ssld_v2和83.7%的ResNet101_vd_ssld，是采用PaddleClas提供的SSLD知识蒸馏方案训练的模型，其中v2表示在训练时添加了AutoAugment数据增广策略。图中相同颜色和符号的点代表同一系列不同规模的模型。不同模型的简介、FLOPS、Parameters以及详细的GPU预测时间(包括不同batchsize的T4卡预测速度)请参考文档教程中的[**模型库章节**](https://paddleclas.readthedocs.io/zh_CN/latest/models/models_intro.html)。
 <div align="center">
 <img
@@ -30,7 +32,7 @@ src="./docs/images/models/mobile_arm_top1.png" width="700">
 上图对比了一些最新的面向移动端应用场景的模型，在骁龙855（SD855）上预测一张图像的时间和其准确率，包括MobileNetV1系列、MobileNetV2系列、MobileNetV3系列和ShuffleNetV2系列。图中准确率79%的MV3_large_x1_0_ssld（M是MobileNet的简称），71.3%的MV3_small_x1_0_ssld、76.74%的MV2_ssld和77.89%的MV1_ssld，是采用PaddleClas提供的SSLD蒸馏方法训练的模型。MV3_large_x1_0_ssld_int8是进一步进行INT8量化的模型。不同模型的简介、FLOPS、Parameters和模型存储大小请参考文档教程中的[**模型库章节**](https://paddleclas.readthedocs.io/zh_CN/latest/models/models_intro.html)。
 - TODO
- [ ] EfficientLite、GhostNet、RegNet论文指标复现和性能评估
+- [ ] EfficientLite、GhostNet、RegNet、ResNeSt的论文指标复现和性能评估
 ## 高阶优化支持
 除了提供丰富的分类网络结构和预训练模型，PaddleClas也支持了一系列有助于图像分类任务效果和效率提升的算法或工具。
@@ -97,10 +99,6 @@ PaddleClas的安装说明、模型训练、预测、评估以及模型微调（f
 近年来，学术界和工业界广泛关注图像中目标检测任务，而图像分类的网络结构以及预训练模型效果直接影响目标检测的效果。PaddleDetection使用PaddleClas的82.39%的ResNet50_vd的预训练模型，结合自身丰富的检测算子，提供了一种面向服务器端应用的目标检测方案，PSS-DET (Practical Server Side Detection)。该方案融合了多种只增加少许计算量，但是可以有效提升两阶段Faster RCNN目标检测效果的策略，包括检测模型剪裁、使用分类效果更优的预训练模型、DCNv2、Cascade RCNN、AutoAugment、Libra sampling以及多尺度训练。其中基于82.39%的R50_vd_ssld预训练模型，与79.12%的R50_vd的预训练模型相比，检测效果可以提升1.5%。在COCO目标检测数据集上测试PSS-DET，当V100单卡预测速度为61FPS时，mAP是41.6%，预测速度为20FPS时，mAP是47.8%。详情请参考[**通用目标检测章节**](https://paddleclas.readthedocs.io/zh_CN/latest/application/object_detection.html)。
-<div align="center">
-<img
-src="./docs/images/det/pssdet.png" width="500">
-</div>
 - TODO
 - [ ] PaddleClas在OCR任务中的应用

--- a/README_en.md
+++ b/README_en.md
+[简体中文](README.md) | English
+# PaddleClas
+**Book**: https://paddleclas-en.readthedocs.io/en/latest/
+**Quick start PaddleClas in 30 minutes**: https://paddleclas-en.readthedocs.io/en/latest/tutorials/quick_start_en.html
+## Introduction
+PaddleClas is a toolset for image classification tasks prepared for the industry and academia. It helps users train better computer vision models and apply them in real scenarios.
+<div align="center">
+    <img src="./docs/images/main_features_s_en.png" width="700">
+</div>
+## Rich model zoo
+Based on ImageNet1k dataset, PaddleClas provides 23 series of image classification networks such as ResNet, ResNet_vd, Res2Net, HRNet, and MobileNetV3 with brief introductions, reproduction configurations and training tricks. At the same time, the corresponding 117 image classification pretrained models are also available. The GPU inference time of the server-side models are evaluated based on TensorRT. The CPU inference time and storage size of the mobile-side models are evaluated on the Snapdragon 855 (SD855). For more detailed information on the supported pretrained models and their download links, please refer to [**models introduction tutorial**](https://paddleclas-en.readthedocs.io/en/latest/models/models_intro_en.html).
+<div align="center">
+    <img src="./docs/images/models/V100_benchmark/v100.fp32.bs1.main_fps_top1_s.jpg" width="700">
+</div>
+The above figure shows some of the latest server-side pretrained models. It can be seen from the figure that when using V100 GPU with FP32 and TensorRT, the `Top1` accuracy of the ResNet50_vd_ssld pretrained model on ImageNet1k-val dataset is **83.0%** and that of the ResNet101_vd_ssld pretrained model is 83.7%. These pretrained models are obtained with the SSLD knowledge distillation solution provided by PaddleClas. Marks of the same color and symbol in the figure represent models of different sizes in the same series. For an introduction to the different models, their FLOPS, Params and detailed GPU inference time (including the inference speed on T4 GPU with different batch sizes), please refer to the documentation tutorial: [https://paddleclas-en.readthedocs.io/en/latest/models/models_intro_en.html](https://paddleclas-en.readthedocs.io/en/latest/models/models_intro_en.html)
+<div align="center">
+<img
+src="./docs/images/models/mobile_arm_top1.png" width="700">
+</div>
+The above figure shows the performance of some commonly used mobile-side models, including the MobileNetV1, MobileNetV2, MobileNetV3 and ShuffleNetV2 series. The inference time is tested on Snapdragon 855 (SD855) with the batch size set to 1. The `Top1` accuracies of the MV3_large_x1_0_ssld, MV3_small_x1_0_ssld, MV2_ssld and MV1_ssld pretrained models on the ImageNet1k-val dataset are 79%, 71.3%, 76.74% and 77.89%, respectively (M is short for MobileNet). MV3_large_x1_0_ssld_int8 is a quantized pretrained model for MV3_large_x1_0. More details about the mobile-side models can be found in the [**models introduction tutorial**](https://paddleclas-en.readthedocs.io/en/latest/models/models_intro_en.html).
+- TODO
+- [ ] Reproduction and performance evaluation of EfficientLite, GhostNet, RegNet and ResNeSt.
+## High-level support
+In addition to providing rich classification network structures and pretrained models, PaddleClas  supports a series of algorithms or tools that help to improve the effectiveness and efficiency of image classification tasks.
+### Knowledge distillation (SSLD)
+Knowledge distillation refers to using a teacher model to guide a student model in learning a specific task, so that the small model improves substantially with the computation cost unchanged and can even reach accuracy similar to that of the large model. PaddleClas provides a Simple Semi-supervised Label Distillation method (SSLD). With this method, different models gain about 3% absolute Top1 accuracy on the ImageNet1k-val dataset. The following figure shows the models' performance after SSLD.
+<div align="center">
+<img
+src="./docs/images/distillation/distillation_perform_s.jpg" width="700">
+</div>
+Taking the ImageNet1k dataset as an example, the following figure shows the SSLD knowledge distillation method framework. The key points of the method include the choice of teacher model, loss calculation method, iteration number, use of unlabeled data, and ImageNet1k dataset finetune. For detailed introduction and experiments, please refer to [**knowledge distillation tutorial**](https://paddleclas-en.readthedocs.io/en/latest/advanced_tutorials/distillation/distillation_en.html)
+<div align="center">
+<img
+src="./docs/images/distillation/ppcls_distillation_s.jpg" width="700">
+</div>
+### Data augmentation
+For a certain image classification task, data augmentation is a commonly used regularization method, which can effectively improve the effect of image classification, especially for scenarios where the data is insufficient or the network is large. Commonly used data augmentation can be divided into 3 categories, `transformation`, `cropping` and `aliasing`, as shown below. The image transformation refers to performing some transformations on the entire image, such as AutoAugment and RandAugment. The image cropping refers to the transformation of the image to block some areas in a certain way, such as CutOut, RandErasing, HideAndSeek and GridMask. Image aliasing refers to the transformation of multiple images to alias a new image, such as Mixup and Cutmix.
+<div align="center">
+<img
+src="./docs/images/image_aug/image_aug_samples_s_en.jpg" width="800">
+</div>
+PaddleClas provides the reproduction of the above 8 data augmentation algorithms and the evaluation of the effect in a unified environment. The following figure shows the performance of different data augmentation methods based on ResNet50. Compared with the standard transformation, using data augmentation, the recognition accuracy  can be increased by up to 1%. For more detailed introduction of data augmentation methods, please refer to the [**data augmentation tutorial**](https://paddleclas-en.readthedocs.io/en/latest/advanced_tutorials/image_augmentation/ImageAugment_en.html).
+<div align="center">
+<img
+src="./docs/images/image_aug/main_image_aug_s.jpg" width="600">
+</div>
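As an illustration of the image aliasing category described above, here is a minimal Mixup sketch in plain Python. It is not PaddleClas's implementation: the function name `mixup`, the flat-list image representation and the default `alpha=0.2` are assumptions made for this example. It blends two images and their one-hot labels with a weight drawn from a Beta distribution.

```python
import random


def mixup(image_a, label_a, image_b, label_b, alpha=0.2, lam=None):
    """Blend two samples; images and one-hot labels are flat lists of floats.

    `lam` is the mixing weight. When it is not given, it is sampled from
    Beta(alpha, alpha), which is the usual choice for Mixup.
    """
    if lam is None:
        lam = random.betavariate(alpha, alpha)
    mixed_image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_image, mixed_label, lam


# Two tiny hypothetical "images" with two pixels each, and one-hot labels.
mixed_image, mixed_label, used_lam = mixup(
    [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25
)
print(mixed_image, mixed_label)
```

Passing `lam` explicitly keeps the example deterministic; in training it would normally be sampled once per batch.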
+## Quick start
+Based on flowers102 dataset, one can easily experience different networks, pretrained models and SSLD knowledge distillation method in PaddleClas. More details can be seen in [**Quick start PaddleClas in 30 minutes**](https://paddleclas-en.readthedocs.io/en/latest/tutorials/quick_start_en.html).
+## Getting started
+For installation, model training, inference, evaluation and finetuning in PaddleClas, you can refer to the [**getting started tutorial**](https://paddleclas-en.readthedocs.io/en/latest/tutorials/index.html).
+## Featured extension and application
+### A classification pretrained model for 100,000 categories
+The models trained on the ImageNet1K dataset are often used as pretrained models for other classification tasks in practical applications, where training data is scarce. However, there are only 1,000 categories in the ImageNet1K dataset, so the feature representation capability of the pretrained model is limited. Therefore, Baidu developed a tag system including 100,000 categories, with semantic information and different granularity. Through manual or semi-supervised methods, more than 55,000,000 training images have been collected. It is the largest image classification system in China and even worldwide. PaddleClas provides the ResNet50_vd model trained on this dataset. The following table compares the ImageNet pretrained model with the 100,000-category pretrained model in some practical application scenarios. Using the 100,000-category pretrained model, the recognition accuracy can be increased by up to 30%.
+| Dataset   | Dataset statistics | ImageNet pretrained model | 100,000-categories' pretrained model |
+|:--:|:--:|:--:|:--:|
+| Flowers    | class_num:102<br/>train/val:5789/2396      | 0.7779        | 0.9892        |
+| Hand-painted stick figures | class_num:18<br/>train/val:1007/432        | 0.8785        | 0.9107        |
+| Leaves  | class_num:6<br/>train/val:5256/2278        | 0.8212        | 0.8385        |
+| Container vehicle | class_num:115<br/>train/val:4879/2094       | 0.623         | 0.9524        |
+| Chair    | class_num:5<br/>train/val:169/78         | 0.8557        | 0.9077        |
+| Geology    | class_num:4<br/>train/val:671/296         | 0.5719        | 0.6781        |
+The 100,000 categories' pretrained model can be downloaded here: [download link](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_10w_pretrained.tar). More details can be seen in [**Transfer learning tutorial**](https://paddleclas-en.readthedocs.io/en/latest/application/transfer_learning_en.html).
+### Object detection
+In recent years, object detection tasks have attracted a lot of attention in academia and industry. An ImageNet classification model is often used as the pretrained model in object detection, and it directly affects the detection results. Based on the 82.39% ResNet50_vd pretrained model, PaddleDetection provides a Practical Server-side Detection solution, PSS-DET. The solution contains many strategies that effectively improve performance at limited extra computation cost, such as model pruning, a better pretrained model, deformable convolution, Cascade RCNN, AutoAugment, Libra sampling and multi-scale training. Compared with the 79.12% ImageNet1k pretrained model, the 82.39% model improves the COCO mAP by 1.5% without any extra computation cost. Using PSS-DET, the inference speed on a single V100 GPU reaches 20FPS when COCO mAP is 47.8%, and 61FPS when COCO mAP is 41.6%. For more details, please refer to the [**Object Detection tutorial**](https://paddleclas-en.readthedocs.io/en/latest/application/object_detection_en.html).
+- TODO
+- [ ] Application of PaddleClas in OCR tasks.
+- [ ] Application of PaddleClas in face detection and recognition tasks.
+## Industrial-grade application deployment tools
+PaddlePaddle provides a series of practical tools to conveniently deploy the PaddleClas models for industrial applications. For more details, please refer to [**extension tutorial**](https://paddleclas.readthedocs.io/zh_CN/latest/extension/index.html)
+- TensorRT inference
+- Paddle-Lite
+- Paddle-Serving
+- Model quantization
+- Multi-machine training
+- Paddle Hub
+## Competition support
+PaddleClas stems from Baidu's visual business applications and its exploration of frontier visual capabilities. It has helped us achieve leading results in many key competitions, and continues to promote more frontier visual solutions and real-world applications. For more details, please refer to the [**competition support tutorial**](https://paddleclas.readthedocs.io/zh_CN/latest/competition_support.html).
+- 1st place in 2018 Kaggle Open Images V4 object detection challenge
+- A-level certificate of three tasks: printed text OCR, face recognition and landmark recognition in the first multimedia information recognition technology competition
+- 2nd place in 2019 Kaggle Open Images V5 object detection challenge
+- 2nd place in Kaggle Landmark Retrieval Challenge 2019
+- 2nd place in Kaggle Landmark Recognition Challenge 2019
+## License
+PaddleClas is released under the <a href="https://github.com/PaddlePaddle/PaddleClas/blob/master/LICENSE">Apache 2.0 license</a>
+## Updates
+## Contributing
+Contributions are highly welcomed and we would really appreciate your feedback!!
--- a/configs/EfficientNet/EfficientNetB0.yaml
+++ b/configs/EfficientNet/EfficientNetB0.yaml
@@ -83,4 +83,4 @@ VALID:
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - ToCHWImage:
\ No newline at end of file
--- a/configs/GhostNet/GhostNet_x0_5.yaml
+++ b/configs/GhostNet/GhostNet_x0_5.yaml
+mode: 'train'
+ARCHITECTURE:
+    name: 'GhostNet_x0_5'
+pretrained_model: ""
+model_save_dir: "./output/"
+classes_num: 1000
+total_images: 1281167
+save_interval: 1
+validate: True
+valid_interval: 1
+epochs: 360
+topk: 5
+image_shape: [3, 224, 224]
+use_mix: False
+ls_epsilon: 0.1
+LEARNING_RATE:
+    function: 'CosineWarmup'          
+    params:                   
+        lr: 0.8               
+OPTIMIZER:
+    function: 'Momentum'
+    params:
+        momentum: 0.9
+    regularizer:
+        function: 'L2'
+        factor: 0.0000400
+TRAIN:
+    batch_size: 2048
+    num_workers: 4
+    file_list: "./dataset/ILSVRC2012/train_list.txt"
+    data_dir: "./dataset/ILSVRC2012/"
+    shuffle_seed: 0
+    transforms:
+        - DecodeImage:
+            to_rgb: True
+            to_np: False
+            channel_first: False
+        - RandCropImage:
+            size: 224
+        - RandFlipImage:
+            flip_code: 1
+        - NormalizeImage:
+            scale: 1./255.
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+        - ToCHWImage:
+VALID:
+    batch_size: 64
+    num_workers: 4
+    file_list: "./dataset/ILSVRC2012/val_list.txt"
+    data_dir: "./dataset/ILSVRC2012/"
+    shuffle_seed: 0
+    transforms:
+        - DecodeImage:
+            to_rgb: True
+            to_np: False
+            channel_first: False
+        - ResizeImage:
+            resize_short: 256
+        - CropImage:
+            size: 224
+        - NormalizeImage:
+            scale: 1.0/255.0
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+        - ToCHWImage:
--- a/configs/GhostNet/GhostNet_x1_0.yaml
+++ b/configs/GhostNet/GhostNet_x1_0.yaml
+mode: 'train'
+ARCHITECTURE:
+    name: 'GhostNet_x1_0'
+pretrained_model: ""
+model_save_dir: "./output/"
+classes_num: 1000
+total_images: 1281167
+save_interval: 1
+validate: True
+valid_interval: 1
+epochs: 360
+topk: 5
+image_shape: [3, 224, 224]
+use_mix: False
+ls_epsilon: 0.1
+LEARNING_RATE:
+    function: 'CosineWarmup'          
+    params:                   
+        lr: 0.4               
+OPTIMIZER:
+    function: 'Momentum'
+    params:
+        momentum: 0.9
+    regularizer:
+        function: 'L2'
+        factor: 0.0000400
+TRAIN:
+    batch_size: 1024
+    num_workers: 4
+    file_list: "./dataset/ILSVRC2012/train_list.txt"
+    data_dir: "./dataset/ILSVRC2012/"
+    shuffle_seed: 0
+    transforms:
+        - DecodeImage:
+            to_rgb: True
+            to_np: False
+            channel_first: False
+        - RandCropImage:
+            size: 224
+        - RandFlipImage:
+            flip_code: 1
+        - NormalizeImage:
+            scale: 1./255.
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+        - ToCHWImage:
+VALID:
+    batch_size: 64
+    num_workers: 4
+    file_list: "./dataset/ILSVRC2012/val_list.txt"
+    data_dir: "./dataset/ILSVRC2012/"
+    shuffle_seed: 0
+    transforms:
+        - DecodeImage:
+            to_rgb: True
+            to_np: False
+            channel_first: False
+        - ResizeImage:
+            resize_short: 256
+        - CropImage:
+            size: 224
+        - NormalizeImage:
+            scale: 1.0/255.0
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+        - ToCHWImage:
--- a/configs/GhostNet/GhostNet_x1_3.yaml
+++ b/configs/GhostNet/GhostNet_x1_3.yaml
+mode: 'train'
+ARCHITECTURE:
+    name: 'GhostNet_x1_3'
+pretrained_model: ""
+model_save_dir: "./output/"
+classes_num: 1000
+total_images: 1281167
+save_interval: 1
+validate: True
+valid_interval: 1
+epochs: 360
+topk: 5
+image_shape: [3, 224, 224]
+use_mix: False
+ls_epsilon: 0.1
+LEARNING_RATE:
+    function: 'CosineWarmup'          
+    params:                   
+        lr: 0.4               
+OPTIMIZER:
+    function: 'Momentum'
+    params:
+        momentum: 0.9
+    regularizer:
+        function: 'L2'
+        factor: 0.0000400
+TRAIN:
+    batch_size: 1024
+    num_workers: 4
+    file_list: "./dataset/ILSVRC2012/train_list.txt"
+    data_dir: "./dataset/ILSVRC2012/"
+    shuffle_seed: 0
+    transforms:
+        - DecodeImage:
+            to_rgb: True
+            to_np: False
+            channel_first: False
+        - RandCropImage:
+            size: 224
+        - RandFlipImage:
+            flip_code: 1
+        - AutoAugment:
+        - NormalizeImage:
+            scale: 1./255.
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+        - ToCHWImage:
+VALID:
+    batch_size: 64
+    num_workers: 4
+    file_list: "./dataset/ILSVRC2012/val_list.txt"
+    data_dir: "./dataset/ILSVRC2012/"
+    shuffle_seed: 0
+    transforms:
+        - DecodeImage:
+            to_rgb: True
+            to_np: False
+            channel_first: False
+        - ResizeImage:
+            resize_short: 256
+        - CropImage:
+            size: 224
+        - NormalizeImage:
+            scale: 1.0/255.0
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+        - ToCHWImage:
--- a/docs/en/advanced_tutorials/distillation/distillation_en.md
+++ b/docs/en/advanced_tutorials/distillation/distillation_en.md
+# Introduction of model compression methods
+In recent years, deep neural networks have been proven to be an extremely effective method to solve problems in the fields of computer vision and natural language processing. With a suitable network structure and training process, deep learning methods perform better than traditional methods.
+With enough training data, increasing the number of parameters by building a reasonable network can significantly improve model performance. But this also increases model complexity, which incurs too much computation cost in real scenarios.
+Parameter redundancy exists in deep neural networks. There are several methods to compress a model, such as pruning, quantization and knowledge distillation. Knowledge distillation refers to using a teacher model to guide a student model in learning a specific task, so that the small model improves substantially with the computation cost unchanged and can even reach accuracy similar to that of the large model [1]. Combining some of the existing distillation methods [2,3], PaddleClas provides a simple semi-supervised label knowledge distillation solution (SSLD). Top-1 accuracy on the ImageNet1k dataset improves by more than 3% for the ResNet_vd and MobileNet series, as shown below.
+![](../../../images/distillation/distillation_perform_s.jpg)
+# SSLD
+## Introduction
+The following figure shows the framework of SSLD.
+![](../../../images/distillation/ppcls_distillation.png)
+First, we select nearly 4 million images from the ImageNet22k dataset and merge them with the ImageNet-1k training set to get a new dataset containing 5 million images. Then, we combine the student model and the teacher model into a new network, which outputs the predictions of the student model and the teacher model, respectively. The parameters of the teacher model are frozen, so no gradient updates are applied to it. Finally, we use JS divergence loss as the loss function for the training process. Here we take the MobileNetV3 distillation task as an example and introduce the key points of SSLD.
+* Choice of the teacher model. During knowledge distillation, the result may not be optimal if the structures of the teacher model and the student model are too different. Under the same structure, a teacher model with higher accuracy leads to better performance for the student model. Compared with the 79.12% ResNet50_vd teacher model, using the 82.4% teacher model brings a 0.4% improvement in Top-1 accuracy (`75.6%-> 76.0%`).
+* Improvement of the loss function. The most commonly used loss function for classification is cross entropy loss. We find that when using soft labels for training, KL divergence loss brings almost no improvement over cross entropy loss, but JS divergence loss improves the accuracy by 0.2% (`76.0%-> 76.2%`). The loss function in SSLD is therefore JS divergence loss.
+* More training epochs. The baseline experiment uses only 120 epochs. We achieve a 0.9% improvement by setting it to 360 (`76.2%-> 77.1%`).
+* No need for labeled data, which makes it convenient to expand the training data. Labels are not used when computing the loss function, so unlabeled data can also be used to train the network. This label-free distillation strategy greatly raises the upper performance limit of student models (`77.1%-> 78.5%`).
+* ImageNet1k finetune. The ImageNet1k training set is used for finetuning, which brings a 0.4% accuracy improvement (`78.5%-> 78.9%`).
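The JS divergence loss mentioned in the key points above can be sketched in plain Python as follows. This is a minimal illustration of the standard Jensen-Shannon divergence between two softmax outputs, not the PaddleClas implementation; the function names and the example logits are assumptions, and the library's loss may differ in scaling or reduction.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    midpoint = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


# Hypothetical teacher and student logits for a 3-class problem.
teacher_probs = softmax([4.0, 1.0, 0.5])
student_probs = softmax([2.5, 1.5, 0.2])
loss = js_divergence(student_probs, teacher_probs)
print(loss)
```

Unlike KL divergence, this quantity is symmetric in its two arguments and bounded above by `ln 2`. Only the teacher's soft output is needed as the target, which is why no ground-truth label enters the loss.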
+## Data selection
+* An important feature of the SSLD distillation scheme is that it needs no labeled images, so the dataset size can be arbitrarily expanded. Considering the limitation of computing resources, we only expand the training set of the distillation task based on the ImageNet22k dataset. For SSLD, we use the `Top-k per class` data sampling scheme [3]. The specific steps are as follows.
+    * Deduplication of the training set. We first deduplicate the ImageNet22k dataset against the ImageNet1k validation set using SIFT feature similarity matching, to prevent the added ImageNet22k training images from containing ImageNet1k validation images. In the end, we removed 4511 similar images. Some of the filtered similar images are shown below.
+    ![](../../../images/distillation/22k_1k_val_compare_w_sift.png)
+    * Obtain the soft labels of the ImageNet22k dataset. For the deduplicated ImageNet22k dataset, we use the `ResNeXt101_32x16d_wsl` model to make predictions and obtain the soft label of each image.
+    * Top-k data selection. There are 1000 categories in the ImageNet1k dataset. For each category, we pick the images in that category with the top-k highest scores, and finally generate a dataset with no more than `1000 * k` images (some categories may contain fewer than k images).
+    * The selected images are merged with the ImageNet1k training set to form the new dataset used for the final distillation model training, which contains 5 million images in all.
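The `Top-k per class` selection step listed above can be sketched as follows. This is one plausible reading of the description, not the authors' script: each image is assigned to the class with its highest soft-label score, and the k best-scoring images are kept per class. The function name and the toy soft labels are hypothetical.

```python
from collections import defaultdict


def select_top_k_per_class(soft_labels, k):
    """Keep at most k images per class, ranked by the predicted class score.

    soft_labels maps an image id to its list of per-class scores.
    Returns a dict: class index -> list of selected image ids (best first).
    """
    by_class = defaultdict(list)
    for image_id, scores in soft_labels.items():
        best_class = max(range(len(scores)), key=scores.__getitem__)
        by_class[best_class].append((scores[best_class], image_id))

    selected = {}
    for class_index, scored in by_class.items():
        scored.sort(reverse=True)  # highest score first
        selected[class_index] = [image_id for _, image_id in scored[:k]]
    return selected


# Toy soft labels for four images over two classes.
soft_labels = {
    "a.jpg": [0.9, 0.1],
    "b.jpg": [0.8, 0.2],
    "c.jpg": [0.6, 0.4],
    "d.jpg": [0.3, 0.7],
}
selected = select_top_k_per_class(soft_labels, k=2)
print(selected)
```

In this toy run class 1 ends up with a single image, mirroring the note that some categories contain fewer than k images, so the total stays at or below `num_classes * k`.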
+# Experiments
+The distillation solution that PaddleClas provides combines regular training with finetuning. Given a suitable teacher model, the large dataset (5 million images) is used for regular training and the ImageNet1k dataset is used for finetuning.
+## Choice of teacher model
+In order to verify the influence of the model size difference between the teacher model and the student model on the distillation results as well as the teacher model accuracy, we conducted several experiments. The training strategy is unified as follows: `cosine_decay_warmup, lr = 1.3, epoch = 120, bs = 2048`, and the student models are all trained from scratch.
+|Teacher Model | Teacher Top1 | Student Model | Student Top1|
+|- |:-: |:-: | :-: |
+| ResNeXt101_32x16d_wsl | 84.2% | MobileNetV3_large_x1_0 | 75.78% |
+| ResNet50_vd | 79.12% | MobileNetV3_large_x1_0 | 75.60% |
+| ResNet50_vd | 82.35% | MobileNetV3_large_x1_0 | 76.00% |
+It can be seen from the table that:
+> When the teacher model structure is the same, the higher the teacher model accuracy, the better the final student model will be.
+>
+> The size difference between the teacher model and the student model should not be too large, otherwise it will decrease the accuracy of the distillation results.
+Therefore, during distillation, for the ResNet series student model, we use `ResNeXt101_32x16d_wsl` as the teacher model; for the MobileNet series student model, we use `ResNet50_vd_SSLD` as the teacher model.
+## Distillation using large-scale dataset
+Training is carried out on the large-scale dataset with 5 million images. The following table shows the detailed settings for the different models.
+|Student Model | num_epoch  | l2_decay | batch size/gpu cards |  base lr | learning rate decay | top1 acc |
+| - |:-: |:-: | :-: |:-: |:-: |:-: |
+| MobileNetV1 | 360 | 3e-5 | 4096/8  | 1.6 | cosine_decay_warmup | 77.65% |
+| MobileNetV2 | 360 | 1e-5 | 3072/8  | 0.54 | cosine_decay_warmup | 76.34% |
+| MobileNetV3_large_x1_0 | 360 | 1e-5 |  5760/24 | 3.65625 | cosine_decay_warmup | 78.54% |
+| MobileNetV3_small_x1_0 | 360 | 1e-5 |  5760/24 | 3.65625 | cosine_decay_warmup | 70.11% |
+| ResNet50_vd | 360 | 7e-5 | 1024/32 | 0.4 | cosine_decay_warmup | 82.07% |
+| ResNet101_vd | 360 | 7e-5 | 1024/32 | 0.4 | cosine_decay_warmup | 83.41% |
+## Finetuning using ImageNet1k
+Finetuning is carried out on the ImageNet1k dataset to restore a consistent distribution between the training set and the test set. The following table shows the detailed finetuning settings.
+|Student Model | num_epoch  | l2_decay | batch size/gpu cards |  base lr | learning rate decay |  top1 acc |
+| - |:-: |:-: | :-: |:-: |:-: |:-: |
+| MobileNetV1 | 30 | 3e-5 | 4096/8 | 0.016 | cosine_decay_warmup | 77.89%  |
+| MobileNetV2 | 30 | 1e-5 | 3072/8  | 0.0054 | cosine_decay_warmup | 76.73% |
+| MobileNetV3_large_x1_0 | 30 | 1e-5 |  2048/8 | 0.008 | cosine_decay_warmup | 78.96% |
+| MobileNetV3_small_x1_0 | 30 | 1e-5 |  6400/32 | 0.025 | cosine_decay_warmup | 71.28% |
+| ResNet50_vd | 60 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 82.39% |
+| ResNet101_vd | 30 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 83.73% |
+## Data augmentation and Fix strategy
+* Based on the experiments mentioned above, we add AutoAugment [4] during training and reduce l2_decay from 4e-5 to 2e-5. With these changes, the Top-1 accuracy on the ImageNet1k dataset reaches 82.99%, a 0.6% improvement over the standard SSLD distillation strategy.
+* For image classification tasks, the model accuracy can be further improved when the test scale is 1.15 times that of training [5]. For the 82.99% ResNet50_vd pretrained model, the accuracy reaches 83.7% when evaluated at 320x320. We use the Fix strategy to finetune the model with the training scale set to 320x320. During this process, the preprocessing pipeline is the same for both training and test, and all weights except the fully connected layer are frozen. Finally, the Top-1 accuracy reaches **84.0%**.
+# Application of the distillation model
+## Instructions
+* Adjust the learning rate of the middle layers. The middle-layer feature maps of a model obtained by distillation are more refined. Therefore, when the distillation model is used as the pretrained model in other tasks, keeping the same learning rate as before can easily destroy the features, while reducing the learning rate of the whole model slows down convergence. We therefore adjust the learning rate of the middle layers separately. Specifically:
+    * For ResNet50_vd, we set up a learning rate list. The three conv2d layers before the residual blocks share one learning rate multiplier, and the conv2d layers of each of the four residual blocks have their own learning rate multipliers, so 5 values need to be set in the list. In our experiments, when finetuning a classification model in transfer learning, the learning rate list `[0.1,0.1,0.2,0.2,0.3]` performs better in most tasks, while in object detection tasks, `[0.05, 0.05, 0.05, 0.1, 0.15]` brings greater accuracy gains.
+    * For MobileNetV3_large_x1_0, because it contains 15 blocks, we let every 3 blocks share a learning rate, so 5 learning rate values are required. We find that in classification and detection tasks, the learning rate list `[0.25, 0.25, 0.5, 0.5, 0.75]` performs better in most tasks.
+* Appropriate l2 decay. Different l2 decay values are set for different models during training. To prevent overfitting, l2 decay is often set larger for large models. L2 decay is set to `1e-4` for ResNet50, and `1e-5 ~ 4e-5` for MobileNet series models. L2 decay also needs to be adjusted when the model is applied to other tasks. Taking Faster_RCNN_MobileNetV3_FPN as an example, we found that modifying l2 decay alone can bring up to 0.5% accuracy (mAP) improvement on the COCO2017 dataset.
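The grouping of blocks into shared learning rate multipliers described above can be sketched as follows. This is a minimal illustration, not PaddleClas code: the helper name `block_lr_multipliers` and the base learning rate of `0.01` are assumptions made for the example.

```python
def block_lr_multipliers(num_blocks, lr_mult_list):
    """Expand a short multiplier list so consecutive blocks share one value.

    With 15 blocks and 5 multipliers, every 3 consecutive blocks share
    the same learning rate multiplier.
    """
    group_size, remainder = divmod(num_blocks, len(lr_mult_list))
    if remainder != 0:
        raise ValueError("num_blocks must be a multiple of len(lr_mult_list)")
    return [lr_mult_list[i // group_size] for i in range(num_blocks)]


# MobileNetV3_large_x1_0 case from the text: 15 blocks, 5 multipliers.
multipliers = block_lr_multipliers(15, [0.25, 0.25, 0.5, 0.5, 0.75])

# Hypothetical base learning rate; the effective rate per block is base * multiplier.
base_lr = 0.01
effective_lrs = [base_lr * m for m in multipliers]
print(multipliers)
```

The earliest blocks get the smallest multipliers, so the refined low-level features are changed least while the later blocks and the task head adapt faster.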
+## Transfer learning
+* To verify the effect of the SSLD pretrained model in transfer learning, we carried out experiments on 10 small datasets. To keep the experiments comparable, we use the standard preprocessing pipeline used for ImageNet1k training. For the distillation model, we also add a simple search over the learning rate of the middle layers of the distillation pretrained model.
+* For ResNet50_vd, the Top-1 accuracy of the baseline pretrained model is 79.12%, and the other parameters are obtained by grid search. For the distillation pretrained model, we add the learning rate of the middle layers to the search space. The following table shows the results.
+| Dataset | Model | Baseline Top1 Acc | Distillation Model Finetune |
+|- |:-: |:-: | :-: |
+| Oxford102 flowers | ResNet50_vd | 97.18% | 97.41% |
+| caltech-101 | ResNet50_vd | 92.57% | 93.21% |
+| Oxford-IIIT-Pets | ResNet50_vd | 94.30% | 94.76% |
+| DTD | ResNet50_vd | 76.48% | 77.71% |
+| fgvc-aircraft-2013b | ResNet50_vd | 88.98% | 90.00% |
+| Stanford-Cars | ResNet50_vd | 92.65% | 92.76% |
+| SUN397 | ResNet50_vd | 64.02% | 68.36% |
+| cifar100 | ResNet50_vd | 86.50% | 87.58% |
+| cifar10 | ResNet50_vd | 97.72% | 97.94% |
+| Food-101 | ResNet50_vd | 89.58% | 89.99% |
+* It can be seen that on the above 10 datasets, combined with an appropriate middle-layer learning rate, the distillation pretrained model brings an average accuracy improvement of about 1%.
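The per-dataset gains can be checked with a few lines of Python. The numbers are copied from the table above; averaging them gives a little under one percentage point, with SUN397 contributing the largest gain.

```python
# (baseline Top-1, distillation fine-tune Top-1) in percent, copied from the table
results = {
    "Oxford102 flowers": (97.18, 97.41),
    "caltech-101": (92.57, 93.21),
    "Oxford-IIIT-Pets": (94.30, 94.76),
    "DTD": (76.48, 77.71),
    "fgvc-aircraft-2013b": (88.98, 90.00),
    "Stanford-Cars": (92.65, 92.76),
    "SUN397": (64.02, 68.36),
    "cifar100": (86.50, 87.58),
    "cifar10": (97.72, 97.94),
    "Food-101": (89.58, 89.99),
}

# Gain in percentage points per dataset, rounded to the table's precision
gains = {name: round(distill - base, 2) for name, (base, distill) in results.items()}
average_gain = sum(gains.values()) / len(gains)
largest = max(gains, key=gains.get)
print(f"average gain: {average_gain:.2f} points, largest on {largest}")
```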
+## Object detection
+Based on the two-stage Faster/Cascade RCNN model, we verify the effect of the pretrained model obtained by distillation.
+* ResNet50_vd
+The training scale and test scale are both set to 640x640, and some of the ablation studies are as follows.
+| Model | train/test scale | pretrain top1 acc | feature map lr | coco mAP |
+|- |:-: |:-: | :-: | :-: |
+| Faster RCNN R50_vd FPN | 640/640 | 79.12% | [1.0,1.0,1.0,1.0,1.0] | 34.8% |
+| Faster RCNN R50_vd FPN | 640/640 | 79.12% | [0.05,0.05,0.1,0.1,0.15] | 34.3% |
+| Faster RCNN R50_vd FPN | 640/640 | 82.18% | [0.05,0.05,0.1,0.1,0.15] | 36.3% |
+It can be seen here that for the baseline pretrained model, excessive adjustment of the middle-layer learning rates actually reduces the performance of the detection model. Based on this distillation model, we also provide a practical server-side detection solution. The detailed configuration and training code are open source; for more details, please refer to [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/rcnn_enhance).
+# Practice
+This section introduces the SSLD distillation experiments in detail based on the ImageNet-1K dataset. If you want to try this method quickly, you can refer to [**Quick start PaddleClas in 30 minutes**](../../tutorials/quick_start.md), which uses the Flowers102 dataset.
+## Configuration
+### Distill ResNet50_vd using ResNeXt101_32x16d_wsl
+Configuration of distilling `ResNet50_vd` using `ResNeXt101_32x16d_wsl` is as follows.
+```yaml
+ARCHITECTURE:
+    name: 'ResNeXt101_32x16d_wsl_distill_ResNet50_vd'
+pretrained_model: "./pretrained/ResNeXt101_32x16d_wsl_pretrained/"
+# pretrained_model:
+#     - "./pretrained/ResNeXt101_32x16d_wsl_pretrained/"
+#     - "./pretrained/ResNet50_vd_pretrained/"
+use_distillation: True
+```
+### Distill MobileNetV3_large_x1_0 using ResNet50_vd_ssld
+The detailed configuration is as follows.
+```yaml
+ARCHITECTURE:
+    name: 'ResNet50_vd_distill_MobileNetV3_large_x1_0'
+pretrained_model: "./pretrained/ResNet50_vd_ssld_pretrained/"
+# pretrained_model:
+#     - "./pretrained/ResNet50_vd_ssld_pretrained/"
+#     - "./pretrained/ResNet50_vd_pretrained/"
+use_distillation: True
+```
+## Begin to train the network
+If everything is ready, users can begin to train the network using the following command.
+```bash
+export PYTHONPATH=path_to_PaddleClas:$PYTHONPATH
+python -m paddle.distributed.launch \
+    --selected_gpus="0,1,2,3" \
+    --log_dir=R50_vd_distill_MV3_large_x1_0 \
+    tools/train.py \
+        -c ./configs/Distillation/R50_vd_distill_MV3_large_x1_0.yaml
+```
+## Note
+* Before using SSLD, users need to first train a teacher model on the target dataset. The teacher model is used to guide the training of the student model.
+* When using SSLD, users need to set `use_distillation` in the configuration file to `True`. In addition, because the student model learns from soft labels that already carry knowledge information, the `label_smoothing` option needs to be turned off.
+* If the student model is not loaded with a pretrained model, the other training hyperparameters can follow those used to train the student model on ImageNet-1k. If the student model is loaded with a pretrained model, the learning rate can be adjusted to `1/100~1/10` of the standard learning rate.
+* In SSLD distillation, the student model only learns the soft label, which makes training more difficult. It is recommended to decrease the value of `l2_decay` appropriately to obtain higher accuracy on the validation set.
+* To add unlabeled training data, users only need to add the extra data to the training list text file.
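To make "the student model only learns the soft label" concrete, the sketch below computes a generic soft-label cross-entropy between teacher and student outputs in plain Python. It is an illustration under assumptions: the function names are hypothetical, and the loss actually used by PaddleClas for SSLD may be defined differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_cross_entropy(student_logits, teacher_logits):
    """Cross-entropy of the student distribution against the teacher's soft labels."""
    teacher_probs = softmax(teacher_logits)
    student_probs = softmax(student_logits)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


teacher = [2.0, 1.0, 0.1]
matching_loss = soft_label_cross_entropy([2.0, 1.0, 0.1], teacher)
mismatched_loss = soft_label_cross_entropy([0.1, 1.0, 2.0], teacher)
print(matching_loss, mismatched_loss)
```

No ground-truth label appears in the loss, which is why unlabeled images can be added to the training list, and why `label_smoothing`, which modifies hard labels, has nothing to act on.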
+> If this document is helpful to you, welcome to star our project: [https://github.com/PaddlePaddle/PaddleClas](https://github.com/PaddlePaddle/PaddleClas)
+# Reference
+[1] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.
+[2] Bagherinezhad H, Horton M, Rastegari M, et al. Label refinery: Improving imagenet classification through label progression[J]. arXiv preprint arXiv:1805.02641, 2018.
+[3] Yalniz I Z, Jégou H, Chen K, et al. Billion-scale semi-supervised learning for image classification[J]. arXiv preprint arXiv:1905.00546, 2019.
+[4] Cubuk E D, Zoph B, Mane D, et al. Autoaugment: Learning augmentation strategies from data[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2019: 113-123.
+[5] Touvron H, Vedaldi A, Douze M, et al. Fixing the train-test resolution discrepancy[C]//Advances in Neural Information Processing Systems. 2019: 8250-8260.
--- a/docs/en/advanced_tutorials/distillation/index.rst
+++ b/docs/en/advanced_tutorials/distillation/index.rst
@@ -4,4 +4,4 @@ distillation
 .. toctree::
   :maxdepth: 3
-   distillation.md
+   distillation_en.md
--- a/docs/en/advanced_tutorials/image_augmentation/ImageAugment_en.md
+++ b/docs/en/advanced_tutorials/image_augmentation/ImageAugment_en.md
--- a/docs/en/advanced_tutorials/image_augmentation/index.rst
+++ b/docs/en/advanced_tutorials/image_augmentation/index.rst
@@ -4,4 +4,4 @@ image_augmentation
 .. toctree::
   :maxdepth: 3
-   ImageAugment.md
+   ImageAugment_en.md
--- a/docs/en/application/index.rst
+++ b/docs/en/application/index.rst
@@ -4,5 +4,5 @@ application
 .. toctree::
   :maxdepth: 2
-   transfer_learning.md
+   transfer_learning_en.md
-   object_detection.md
+   object_detection_en.md
--- a/docs/en/application/object_detection_en.md
+++ b/docs/en/application/object_detection_en.md
+# General object detection
+## Practical Server-side detection method based on RCNN
+### Introduction
+* In recent years, object detection tasks have attracted widespread attention. [PaddleClas](https://github.com/PaddlePaddle/PaddleClas) open-sourced the ResNet50_vd_SSLD pretrained model based on ImageNet (Top1 Acc 82.4%). Based on this pretrained model and the rich operators in PaddleDetection, PaddleDetection provides PSS-DET (Practical Server-side detection). The inference speed reaches 61 FPS on a single V100 GPU at a COCO mAP of 41.6%, and 20 FPS at a COCO mAP of 47.8%.
+* We take the standard `Faster RCNN ResNet50_vd FPN` as an example. The following table shows ablation study of PSS-DET.
+| Trick | Train scale | Test scale |  COCO mAP | Infer speed/FPS |
+|- |:-: |:-: | :-: | :-: |
+| `baseline` | 640x640 | 640x640 | 36.4% | 43.589 |
+| +`test proposal=pre/post topk 500/300` | 640x640 | 640x640 | 36.2% | 52.512 |
+| +`fpn channel=64` | 640x640 | 640x640 | 35.1% | 67.450 |
+| +`ssld pretrain` | 640x640 | 640x640 | 36.3% | 67.450 |
+| +`ciou loss` | 640x640 | 640x640 | 37.1% | 67.450 |
+| +`DCNv2` | 640x640 | 640x640 | 39.4% | 60.345 |
+| +`3x, multi-scale training` | 640x640 | 640x640 | 41.0% | 60.345 |
+| +`auto augment` | 640x640 | 640x640 | 41.4% | 60.345 |
+| +`libra sampling` | 640x640 | 640x640 | 41.6% | 60.345 |
+Based on the ablation experiments, Cascade RCNN and a larger inference scale (1000x1500) are used for better performance. The final COCO mAP is 47.8%, and the following figure shows the `mAP-Speed` curves for some common detectors.
+![pssdet](../../images/det/pssdet.png)
+**Note**
+> For a fair comparison, the inference time of PSS-DET models measured on a V100 GPU is converted to Titan V GPU time by multiplying it by 1.2.
+For more detailed information, you can refer to [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/rcnn_server_side_det).
+## Practical Mobile-side detection method based on RCNN
+* This part is coming soon!
--- a/docs/en/application/transfer_learning_en.md
+++ b/docs/en/application/transfer_learning_en.md
+# Transfer learning in image classification
+Transfer learning is an important part of machine learning and is widely used in fields such as text and images. Here we mainly introduce transfer learning in the field of image classification, often called domain transfer, for example transferring an ImageNet classification model to a specific image classification task such as flower classification.
+## Hyperparameter search
+ImageNet is a widely used dataset for image classification, and a series of empirical hyperparameters that yield high accuracy have been summarized for it. However, these hyperparameters may not be optimal on a specific dataset. Two commonly used hyperparameter search methods can help us obtain better model hyperparameters.
+### Grid search
+Grid search, also called exhaustive search, determines the optimal value by evaluating every candidate in the search space and keeping the best one. The method is simple and effective, but it requires huge computing resources when the search space is large.
+### Bayesian search
+Bayesian search, also called Bayesian optimization, starts from a randomly selected group of hyperparameters in the search space. A Gaussian process is then updated with the performance of the hyperparameters tried so far and is used to compute the expected mean and variance of candidate points. The larger the expected mean, the greater the probability of being close to the optimal solution; the larger the expected variance, the greater the uncertainty. Choosing a hyperparameter point with a large expected mean is usually called `exploitation`, and choosing one with a large variance is called `exploration`. An acquisition function is defined to balance the expected mean and variance. The hyperparameter point it selects is regarded as the position most likely to be optimal.
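The trade-off between expected mean and variance can be sketched with an upper confidence bound (UCB), one common acquisition function. This is only an illustration: the text does not say which acquisition function was used in the experiments, and the candidate names and values below are made up.

```python
import math


def upper_confidence_bound(mean, variance, kappa=2.0):
    """Score a candidate: a larger kappa puts more weight on uncertainty."""
    return mean + kappa * math.sqrt(variance)


# Hypothetical posterior (expected accuracy, variance) for two candidate points
candidates = {
    "lr=0.01": (0.92, 0.0001),   # high mean, low uncertainty
    "lr=0.03": (0.90, 0.0025),   # lower mean, high uncertainty
}

explore = max(candidates, key=lambda k: upper_confidence_bound(*candidates[k], kappa=2.0))
exploit = max(candidates, key=lambda k: upper_confidence_bound(*candidates[k], kappa=0.0))
print("kappa=2.0 picks", explore)
print("kappa=0.0 picks", exploit)
```

With `kappa=0.0` the score reduces to the expected mean (pure exploitation); raising `kappa` favors the uncertain candidate (exploration).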
+Based on the above two search schemes, we carry out experiments with a fixed set of parameters and with the two search schemes on 8 open source datasets. Following the experimental scheme in [1], we search over 4 hyperparameters. The search space and the experimental results are as follows:
+- Fixed scheme.
+```
+lr=0.003, l2 decay=1e-4, label smoothing=False, mixup=False
+```
+- Search space of the hyperparameters.
+```
+lr: [0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001]
+l2 decay: [1e-3, 3e-4, 1e-4, 3e-5, 1e-5, 3e-6, 1e-6]
+label smoothing: [False, True]
+mixup: [False, True]
+```
+Grid search takes 196 trials, while Bayesian search takes about 10 times fewer. The baseline is trained with the ImageNet1k pretrained ResNet50_vd model and the fixed scheme. The following table shows the results.
+| Dataset             | Fixed scheme | Grid search | Grid search trials | Bayesian search | Bayesian search trials |
+| ------------------ | -------- | -------- | -------- | -------- | ---------- |
+| Oxford-IIIT-Pets   | 93.64%   | 94.55%   | 196 | 94.04%     | 20         |
+| Oxford-102-Flowers | 96.08%   | 97.69%   | 196 |  97.49%     | 20         |
+| Food101            | 87.07%   | 87.52%   | 196 |  87.33%     | 23         |
+| SUN397             | 63.27%   | 64.84%   | 196 |  64.55%     | 20         |
+| Caltech101         | 91.71%   | 92.54%   | 196 |  92.16%     | 14         |
+| DTD                | 76.87%   | 77.53%   | 196 |  77.47%     | 13         |
+| Stanford Cars      | 85.14%   | 92.72%   | 196 |  92.72%     | 25         |
+| FGVC Aircraft      | 80.32%   | 88.45%   | 196 |  88.36%     | 20         |
+- The above experiments verify that, compared with grid search, Bayesian search reduces the accuracy by only about 0% to 0.5% while cutting the number of searches by about 10 times.
+- The search space can be expanded easily using Bayesian search.
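The figure of 196 grid search trials follows directly from the search space listed above (7 x 7 x 2 x 2). A minimal sketch that enumerates the grid, using dictionary keys chosen here for illustration:

```python
import itertools

# Search space copied from the listing above; key names are illustrative
search_space = {
    "lr": [0.1, 0.03, 0.01, 0.003, 0.001, 0.0003, 0.0001],
    "l2_decay": [1e-3, 3e-4, 1e-4, 3e-5, 1e-5, 3e-6, 1e-6],
    "label_smoothing": [False, True],
    "mixup": [False, True],
}

names = list(search_space)
# Every combination of one value per hyperparameter is one training run
grid = [dict(zip(names, values)) for values in itertools.product(*search_space.values())]
print(len(grid))
print(grid[0])
```

Each entry of `grid` corresponds to one full fine-tuning run, which is why the cost grows multiplicatively whenever a hyperparameter is added to the search space.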
+## Large-scale image classification
+In practical applications, due to the lack of training data, the classification model trained on the ImageNet1k data set is often used as the pretrained model for other image classification tasks. In order to further help solve practical problems, based on ResNet50_vd, Baidu open sourced a self-developed large-scale classification pretrained model, in which the training data contains 100,000 categories and 43 million pictures.
+We conducted transfer learning experiments on 6 self-collected datasets, using a set of fixed parameters and a grid search method. The number of training epochs was set to 20, the ResNet50_vd model was selected, and the ImageNet pre-training accuracy was 79.12%. The dataset statistics and the model accuracy comparison are as follows:
+Fixed scheme:
+```
+lr=0.001, l2 decay=1e-4, label smoothing=False, mixup=False
+```
+| Dataset          | Statistics                                 | **Pretrained model on ImageNet <br />Top-1(fixed)/Top-1(search)** | **Pretrained model on large-scale dataset<br />Top-1(fixed)/Top-1(search)** |
+| --------------- | ----------------------------------------- | -------------------------------------------------------- | --------------------------------------------------------- |
+| Flowers         | class:102<br />train:5789<br />valid:2396 | 0.7779/0.9883                                            | 0.9892/0.9954                                             |
+| Hand-painted stick figures       | class:18<br />train:1007<br />valid:432   | 0.8795/0.9196                                            | 0.9107/0.9219                                             |
+| Leaves     | class:6<br />train:5256<br />valid:2278   | 0.8212/0.8482                                            | 0.8385/0.8659                                             |
+| Container vehicle       | class:115<br />train:4879<br />valid:2094 | 0.6230/0.9556                                            | 0.9524/0.9702                                             |
+| Chair         | class:5<br />train:169<br />valid:78      | 0.8557/0.9688                                            | 0.9077/0.9792                                             |
+| Geology         | class:4<br />train:671<br />valid:296     | 0.5719/0.8094                                            | 0.6781/0.8219                                             |
+- The above experiments verified that for fixed parameters, compared with the pretrained model on ImageNet, using the large-scale classification model as a pretrained model can help us improve the model performance on a new dataset in most cases. Parameter search can be further helpful to the model performance.
+## Reference
+[1] Kornblith, Simon, Jonathon Shlens, and Quoc V. Le. "Do better imagenet models transfer better?." *Proceedings of the IEEE conference on computer vision and pattern recognition*. 2019.
+[2] Kolesnikov, Alexander, et al. "Large Scale Learning of General Visual Representations for Transfer." *arXiv preprint arXiv:1912.11370* (2019).
--- a/docs/en/change_log.md
+++ b/docs/en/change_log.md
-# Release Notes
-* 2020.04.14: first commit
--- a/docs/en/competition_support_en.md
+++ b/docs/en/competition_support_en.md
+### Competition Support
+PaddleClas stems from Baidu's visual business applications and its exploration of frontier visual capabilities. It has helped us achieve leading results in many key competitions, and continues to promote more frontier visual solutions and real-world applications.
+* 1st place in 2018 Kaggle Open Images V4 object detection challenge
+* 2nd place in 2019 Kaggle Open Images V5 object detection challenge
+    * The report is available here: [https://arxiv.org/pdf/1911.07171.pdf](https://arxiv.org/pdf/1911.07171.pdf)
+    * The pretrained model and code are available here: [source code](https://github.com/PaddlePaddle/PaddleDetection/blob/master/docs/featured_model/OIDV5_BASELINE_MODEL.md)
+* 2nd place in Kaggle Landmark Retrieval Challenge 2019
+    * The report is available here: [https://arxiv.org/abs/1906.03990](https://arxiv.org/abs/1906.03990)
+    * The pretrained model and code are available here: [source code](https://github.com/PaddlePaddle/Research/tree/master/CV/landmark)
+* 2nd place in Kaggle Landmark Recognition Challenge 2019
+    * The report is available here: [https://arxiv.org/abs/1906.03990](https://arxiv.org/abs/1906.03990)
+    * The pretrained model and code are available here: [source code](https://github.com/PaddlePaddle/Research/tree/master/CV/landmark)
+* A-level certificate of three tasks: printed text OCR, face recognition and landmark recognition in the first multimedia information recognition technology competition
--- a/docs/en/extension/index.rst
+++ b/docs/en/extension/index.rst
+extension
+================================
+.. toctree::
+   :maxdepth: 1
+   paddle_inference_en.md
+   paddle_mobile_inference_en.md
+   paddle_quantization_en.md
+   multi_machine_training_en.md
+   paddle_hub_en.md
+   paddle_serving_en.md
--- a/docs/en/extension/multi_machine_training_en.md
+++ b/docs/en/extension/multi_machine_training_en.md
+# Distributed Training
+Distributed training of deep neural networks is highly efficient in PaddlePaddle,
+and it is one of PaddlePaddle's core technical advantages.
+On image classification tasks, distributed training can achieve an almost linear speedup.
+[Fleet](https://github.com/PaddlePaddle/Fleet) is the high-level API for distributed training in PaddlePaddle.
+By using Fleet, a user can easily move from single-machine PaddlePaddle code to distributed code.
+In order to support both single-machine training and multi-machine training,
+[PaddleClas](https://github.com/PaddlePaddle/PaddleClas) uses the Fleet API interface.
+For more information about distributed training,
+please refer to [Fleet API documentation](https://github.com/PaddlePaddle/Fleet/blob/develop/README.md).
--- a/docs/en/extension/paddle_hub_en.md
+++ b/docs/en/extension/paddle_hub_en.md
+# Paddle Hub
+[PaddleHub](https://github.com/PaddlePaddle/PaddleHub) is a pre-trained model application tool for PaddlePaddle.
+Developers can conveniently use the high-quality pre-trained model combined with Fine-tune API to quickly complete the whole process from model migration to deployment.
+All the pre-trained models of [PaddleClas](https://github.com/PaddlePaddle/PaddleClas) have been collected by PaddleHub.
+For further details, please refer to [PaddleHub website](https://www.paddlepaddle.org.cn/hub).
--- a/docs/en/extension/paddle_inference_en.md
+++ b/docs/en/extension/paddle_inference_en.md
+# Prediction Framework
+## Introduction
+Models for Paddle are stored in many different forms, which can be roughly divided into two categories:
+1. persistable model (the models saved by fluid.io.save_persistables)
+    The weights are saved as a checkpoint, which can be loaded to resume training. Each scattered weight file corresponds to one persistable variable in the model. These variables carry no structure information, so the weights must be used together with the model structure.
+    ```
+    resnet50-vd-persistable/
+    ├── bn2a_branch1_mean
+    ├── bn2a_branch1_offset
+    ├── bn2a_branch1_scale
+    ├── bn2a_branch1_variance
+    ├── bn2a_branch2a_mean
+    ├── bn2a_branch2a_offset
+    ├── bn2a_branch2a_scale
+    ├── ...
+    └── res5c_branch2c_weights
+    ```
+2. inference model (the models saved by fluid.io.save_inference_model)
+    A model saved by this function can be used for inference directly. Compared with the persistable model, the model structure is additionally saved, so the model with its trained weights can be reconstructed from the saved files. As shown in the following listing, the structure information is saved in `model`.
+    ```
+    resnet50-vd-persistable/
+    ├── bn2a_branch1_mean
+    ├── bn2a_branch1_offset
+    ├── bn2a_branch1_scale
+    ├── bn2a_branch1_variance
+    ├── bn2a_branch2a_mean
+    ├── bn2a_branch2a_offset
+    ├── bn2a_branch2a_scale
+    ├── ...
+    ├── res5c_branch2c_weights
+    └── model
+    ```
+    For convenience, all weight files are saved into a single `params` file when saving the inference model on Paddle, as shown below:
+    ```
+    resnet50-vd
+    ├── model
+    └── params
+    ```
+Both the training engine and the prediction engine in Paddle support model inference. Because back propagation is not performed during inference, the prediction engine can apply customized optimizations (such as layer fusion and kernel selection) to achieve low latency and high throughput. The training engine supports both the persistable model and the inference model, while the prediction engine only supports the inference model, so three inference methods are derived:
+1. prediction engine + inference model
+2. training engine + persistable model
+3. training engine + inference model
+Regardless of the inference method, it basically includes the following main steps:
+* Build the engine
+* Prepare the data to be predicted
+* Perform the prediction
+* Analyze the result
+The inference methods differ mainly in two steps: building the engine and performing the prediction. The following sections introduce them in detail.
+## Model Transformation
+During training, we usually save some checkpoints (persistable models). These are just model weight files and cannot be directly loaded by the prediction engine to predict, so we usually find suitable checkpoints after the training and convert them to inference model. There are two main steps: 1. Build a training engine, 2. Save the inference model, as shown below.
+```python
+import paddle.fluid as fluid
+from ppcls.modeling.architectures.resnet_vd import ResNet50_vd
+# Placeholder: location of the saved persistable model
+persistable_model_path = "path/to/persistable_model"
+place = fluid.CPUPlace()
+exe = fluid.Executor(place)
+startup_prog = fluid.Program()
+infer_prog = fluid.Program()
+# Build the network in its own program so the weights can be loaded into it
+with fluid.program_guard(infer_prog, startup_prog):
+    with fluid.unique_name.guard():
+        image = fluid.data(name='image', shape=[None, 3, 224, 224], dtype='float32')
+        # Assumes ResNet50_vd() returns a model object that exposes net()
+        out = ResNet50_vd().net(input=image, class_dim=1000)
+infer_prog = infer_prog.clone(for_test=True)
+# Load the trained weights, then save structure and weights together
+fluid.load(program=infer_prog, model_path=persistable_model_path, executor=exe)
+fluid.io.save_inference_model(
+        dirname='./output/',
+        feeded_var_names=[image.name],
+        main_program=infer_prog,
+        target_vars=[out],
+        executor=exe,
+        model_filename='model',
+        params_filename='params')
+```
+A complete example is provided in `tools/export_model.py`. Just execute the following command to complete the conversion:
+```bash
+python tools/export_model.py \
+    --m=the name of model \
+    --p=the path of persistable model \
+    --o=the saved path of model and params
+```
+## Prediction engine + inference model
+A complete example is provided in `tools/infer/predict.py`. Just execute the following command to complete the prediction:
+```
+python ./predict.py \
+    -i=./test.jpeg \
+    -m=./resnet50-vd/model \
+    -p=./resnet50-vd/params \
+    --use_gpu=1 \
+    --use_tensorrt=True
+```
+Parameter description:
+* `image_file` (short form `i`): the path of the image to predict, such as `./test.jpeg`.
+* `model_file` (short form `m`): the path of the model file, such as `./resnet50-vd/model`.
+* `params_file` (short form `p`): the path of the weights file, such as `./resnet50-vd/params`.
+* `batch_size` (short form `b`): the batch size, such as `1`.
+* `ir_optim`: whether to use `IR` optimization, default: True.
+* `use_tensorrt`: whether to use the TensorRT prediction engine, default: True.
+* `gpu_mem`: the initial allocation of GPU memory, in MB.
+* `use_gpu`: whether to use GPU, default: True.
+* `enable_benchmark`: whether to use benchmark, default: False.
+* `model_name`: the name of the model.
+NOTE:
+When benchmark is enabled, TensorRT is used by default to make predictions on Paddle.
+Building the prediction engine:
+```python
+from paddle.fluid.core import AnalysisConfig
+from paddle.fluid.core import create_paddle_predictor
+# Placeholders: paths of the exported inference model files
+model_file = "./resnet50-vd/model"
+params_file = "./resnet50-vd/params"
+config = AnalysisConfig(model_file, params_file)
+config.enable_use_gpu(8000, 0)  # initial GPU memory in MB, GPU device id
+config.disable_glog_info()
+config.switch_ir_optim(True)
+config.enable_tensorrt_engine(
+        precision_mode=AnalysisConfig.Precision.Float32,
+        max_batch_size=1)
+# zero-copy execution requires disabling the feed and fetch ops
+config.switch_use_feed_fetch_ops(False)
+predictor = create_paddle_predictor(config)
+```
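+Prediction execution:
+```python
+import numpy as np
+input_names = predictor.get_input_names()
+input_tensor = predictor.get_input_tensor(input_names[0])
+# Random data stands in for a preprocessed image batch
+input_data = np.random.randn(1, 3, 224, 224).astype("float32")
+input_tensor.reshape([1, 3, 224, 224])
+input_tensor.copy_from_cpu(input_data)
+predictor.zero_copy_run()
+# Copy the first output back to host memory as a numpy array
+output_names = predictor.get_output_names()
+output_tensor = predictor.get_output_tensor(output_names[0])
+output_data = output_tensor.copy_to_cpu()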
+```
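The last of the four steps listed earlier, result analysis, usually means turning the class scores into a short ranked list. The sketch below shows one way to do this in plain Python, assuming the scores of one image are available as a flat list; the helper name `top_k` and the sample scores are illustrative and not part of `predict.py`.

```python
def top_k(scores, k=5):
    """Return the k highest-scoring (class_id, score) pairs, best first."""
    ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up scores for a 4-class problem
scores = [0.05, 0.70, 0.10, 0.15]
for class_id, score in top_k(scores, k=2):
    print(f"class {class_id}: {score:.2f}")
```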
+More information about the parameters can be found in the [Paddle Python prediction API](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/inference_deployment/inference/python_infer_cn.html). If you need to run prediction in a production environment, we recommend the [Paddle C++ prediction API](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/inference_deployment/inference/native_infer.html). A rich set of pre-compiled prediction libraries is provided on the official website: [Paddle C++ prediction library](https://www.paddlepaddle.org.cn/documentation/docs/zh/advanced_guide/inference_deployment/inference/build_and_install_lib_cn.html).
+By default, Paddle's wheel package does not include the TensorRT prediction engine. If you need to use TensorRT for prediction optimization, you need to compile the corresponding wheel package yourself. For the compilation method, please refer to Paddle's compilation guide: [Paddle compilation](https://www.paddlepaddle.org.cn/documentation/docs/zh/install/compile/fromsource.html).
+## Training engine + persistable model prediction
+A complete example is provided in `tools/infer/infer.py`. Just execute the following command to complete the prediction:
+```bash
+python tools/infer/infer.py \
+    --i=the path of images which are needed to predict \
+    --m=the name of model \
+    --p=the path of persistable model \
+    --use_gpu=True
+```
+Parameter description:
+* `image_file` (short form `i`): the path of the image to predict, such as `./test.jpeg`.
+* `model` (short form `m`): the name of the model, such as `ResNet50_vd`.
+* `pretrained_model` (short form `p`): the path of the persistable model, such as `./pretrained/ResNet50_vd_pretrained/`.
+* `use_gpu`: whether to use GPU, default: True.
+Training engine construction:
+Since the persistable model does not contain the structural information of the model, it is necessary to construct the network structure first, and then load the weights to build the training engine.
+```python
+import paddle.fluid as fluid
+from ppcls.modeling.architectures.resnet_vd import ResNet50_vd
+# Placeholder: location of the saved persistable model
+persistable_model_path = "path/to/persistable_model"
+place = fluid.CPUPlace()
+exe = fluid.Executor(place)
+startup_prog = fluid.Program()
+infer_prog = fluid.Program()
+# The network structure must be built before the weights can be loaded
+with fluid.program_guard(infer_prog, startup_prog):
+    with fluid.unique_name.guard():
+        image = fluid.data(name='image', shape=[None, 3, 224, 224], dtype='float32')
+        # Assumes ResNet50_vd() returns a model object that exposes net()
+        out = ResNet50_vd().net(input=image, class_dim=1000)
+infer_prog = infer_prog.clone(for_test=True)
+fluid.load(program=infer_prog, model_path=persistable_model_path, executor=exe)
+```
+Perform inference：
+```python
+outputs = exe.run(infer_prog,
+        feed={image.name: data},
+        fetch_list=[out.name],
+        return_numpy=False)
+```
+For the above parameter descriptions, please refer to the official website [fluid.Executor](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/executor_cn/Executor_cn.html)
+## Training engine + inference model prediction
+A complete example is provided in `tools/infer/py_infer.py`. Just execute the following command to complete the prediction:
+```bash
+python tools/infer/py_infer.py \
+    --i=the path of images \
+    --d=the path of saved model \
+    --m=the path of saved model file \
+    --p=the path of saved weight file \
+    --use_gpu=True
+```
+Parameter description:
+* `image_file` (short form `i`): the path of the image to predict, such as `./test.jpeg`.
+* `model_file` (short form `m`): the path of the model file, such as `./resnet50_vd/model`.
+* `params_file` (short form `p`): the path of the weights file, such as `./resnet50_vd/params`.
+* `model_dir` (short form `d`): the folder of the model, such as `./resnet50_vd`.
+* `use_gpu`: whether to use GPU, default: True.
+Training engine construction:
+Since the inference model contains the model structure, there is no need to construct the network first. The model file and weights file are loaded directly to build the training engine.
+```python
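+import paddle.fluid as fluid
+# Placeholders: the model folder and the file names inside it
+model_dir = "./resnet50_vd"
+model_file = "model"
+params_file = "params"
+place = fluid.CPUPlace()
+exe = fluid.Executor(place)
+# load_inference_model restores both the network structure and the weights
+[program, feed_names, fetch_lists] = fluid.io.load_inference_model(
+        model_dir,
+        exe,
+        model_filename=model_file,
+        params_filename=params_file)
+compiled_program = fluid.compiler.CompiledProgram(program)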
+```
+> `load_inference_model` supports both a collection of scattered weight files and a single weight file.
+Perform inference：
+```python
+outputs = exe.run(compiled_program,
+        feed={feed_names[0]: data},
+        fetch_list=fetch_lists,
+        return_numpy=False)
+```
+For the above parameter descriptions, please refer to the official website [fluid.Executor](https://www.paddlepaddle.org.cn/documentation/docs/zh/api_cn/executor_cn/Executor_cn.html)
--- a/docs/en/extension/paddle_mobile_inference_en.md
+++ b/docs/en/extension/paddle_mobile_inference_en.md
+# Paddle-Lite
+## Introduction
+[Paddle-Lite](https://github.com/PaddlePaddle/Paddle-Lite) is a lightweight inference engine that is fully functional, easy to use and performs well. Its lightweight design is reflected in using fewer bits to represent the weights and activations of the neural network, which greatly reduces the model size and addresses the limited storage space of mobile devices. Its overall inference speed is also better than that of other frameworks.
+In [PaddleClas](https://github.com/PaddlePaddle/PaddleClas), we use Paddle-Lite to [evaluate the performance on the mobile device](../models/Mobile.md). In this section, we use the `MobileNetV1` model trained on the `ImageNet1k` dataset as an example to introduce how to use `Paddle-Lite` to evaluate the model speed on mobile devices (evaluated on SD855).
+## Evaluation Steps
+### Export the Inference Model
+* First, convert the model saved during training into an inference model, which can be exported with `tools/export_model.py` as follows.
+```shell
+python tools/export_model.py -m MobileNetV1 -p pretrained/MobileNetV1_pretrained/ -o inference/MobileNetV1
+```
+Finally, the `model` and `params` files are saved in `inference/MobileNetV1`.
+### Download Benchmark Binary File
+* Use the adb (Android Debug Bridge) tool to connect the Android phone and the PC, then develop and debug. After installing adb and ensuring that the PC and the phone are successfully connected, use the following command to view the ARM version of the phone and select the pre-compiled library based on ARM version.
+```shell
+adb shell getprop ro.product.cpu.abi
+```
+* Download Benchmark_bin File
+```shell
+wget -c https://paddle-inference-dist.bj.bcebos.com/PaddleLite/benchmark_0/benchmark_bin_v8
+```
+If the ARM version is v7, the v7 benchmark_bin file should be downloaded instead, using the following command.
+```shell
+wget -c https://paddle-inference-dist.bj.bcebos.com/PaddleLite/benchmark_0/benchmark_bin_v7
+```
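The choice between the two binaries can be scripted. The following is a minimal sketch, assuming the ABI string printed by `adb shell getprop ro.product.cpu.abi` is one of the standard Android names `arm64-v8a` or `armeabi-v7a`. The helper name `benchmark_bin_url` is hypothetical and is not part of PaddleClas or Paddle-Lite.

```python
# Hypothetical helper: map the ABI reported by adb to a benchmark binary URL.
BASE_URL = "https://paddle-inference-dist.bj.bcebos.com/PaddleLite/benchmark_0"

# Standard Android ABI names mapped to the two binaries listed above.
ABI_TO_BINARY = {
    "arm64-v8a": "benchmark_bin_v8",
    "armeabi-v7a": "benchmark_bin_v7",
}


def benchmark_bin_url(abi: str) -> str:
    """Return the download URL of the benchmark binary for the given ABI."""
    abi = abi.strip()  # adb output usually ends with a newline
    if abi not in ABI_TO_BINARY:
        raise ValueError(f"unsupported ABI: {abi!r}")
    return f"{BASE_URL}/{ABI_TO_BINARY[abi]}"


if __name__ == "__main__":
    print(benchmark_bin_url("arm64-v8a"))
```

Any other ABI raises a `ValueError` rather than silently picking a binary, since only the v7 and v8 builds are listed here.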
+### Inference benchmark
+After the PC and mobile phone are successfully connected, use the following command to start the model evaluation.
+```
+sh tools/lite/benchmark.sh ./benchmark_bin_v8 ./inference result_armv8.txt true
+```
+Here `./benchmark_bin_v8` is the path of the benchmark binary file, `./inference` is the directory that contains all the models to be evaluated, `result_armv8.txt` is the result file, and the final parameter `true` means that the models are optimized before evaluation. The result file `result_armv8.txt` is saved in the current folder. Its content is as follows.
+```
+PaddleLite Benchmark
+Threads=1 Warmup=10 Repeats=30
+MobileNetV1                           min = 30.89100    max = 30.73600    average = 30.79750
+Threads=2 Warmup=10 Repeats=30
+MobileNetV1                           min = 18.26600    max = 18.14000    average = 18.21637
+Threads=4 Warmup=10 Repeats=30
+MobileNetV1                           min = 10.03200    max = 9.94300     average = 9.97627
+```
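+The output lists the inference time of the model for different numbers of threads. The values are per-inference latencies in milliseconds (lower is better), so they decrease as the number of threads grows. Taking the single-thread case as an example, the average latency of MobileNetV1 on SD855 is `30.79750ms`.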
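To compare thread counts or models programmatically, the result file can be parsed with a few lines of Python. This is only a sketch: it assumes the file follows exactly the layout shown above (a `Threads=...` header followed by one line per model), and the function name `parse_benchmark` is hypothetical.

```python
import re

# Matches a result line such as:
# MobileNetV1    min = 30.89100    max = 30.73600    average = 30.79750
RESULT_RE = re.compile(
    r"^(?P<model>\S+)\s+min\s*=\s*(?P<min>[\d.]+)"
    r"\s+max\s*=\s*(?P<max>[\d.]+)\s+average\s*=\s*(?P<avg>[\d.]+)\s*$"
)
# Matches a header line such as: Threads=1 Warmup=10 Repeats=30
THREADS_RE = re.compile(r"^Threads=(?P<threads>\d+)\b")


def parse_benchmark(text):
    """Return {(model, threads): average} parsed from the result text."""
    results = {}
    threads = None
    for line in text.splitlines():
        line = line.strip()
        header = THREADS_RE.match(line)
        if header:
            threads = int(header.group("threads"))
            continue
        row = RESULT_RE.match(line)
        if row and threads is not None:
            results[(row.group("model"), threads)] = float(row.group("avg"))
    return results


SAMPLE = """\
PaddleLite Benchmark
Threads=1 Warmup=10 Repeats=30
MobileNetV1                           min = 30.89100    max = 30.73600    average = 30.79750
Threads=2 Warmup=10 Repeats=30
MobileNetV1                           min = 18.26600    max = 18.14000    average = 18.21637
Threads=4 Warmup=10 Repeats=30
MobileNetV1                           min = 10.03200    max = 9.94300     average = 9.97627
"""

if __name__ == "__main__":
    for (model, threads), avg in sorted(parse_benchmark(SAMPLE).items()):
        print(model, threads, avg)
```

In practice, `SAMPLE` would be replaced by the content of `result_armv8.txt`.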
+### Model Optimization and Speed Evaluation
+* In the Inference benchmark section above, we mentioned that the model is optimized before evaluation. Alternatively, you can optimize the model first and then load the optimized model directly for speed evaluation.
+* Paddle-Lite provides multiple strategies to automatically optimize the original training model, including quantization, subgraph fusion, hybrid scheduling, kernel optimization and so on. To make the optimization more convenient, Paddle-Lite provides the opt tool, which completes the optimization steps automatically and outputs a lightweight, optimized model that can be executed by Paddle-Lite. The tool can be downloaded from the [Paddle-Lite Model Optimization Page](https://paddle-lite.readthedocs.io/zh/latest/user_guides/model_optimize_tool.html). Here we take `MacOS` as the development environment, download the [opt_mac](https://paddlelite-data.bj.bcebos.com/model_optimize_tool/opt_mac) model optimization tool, and use the following commands to optimize the model.
+```shell
+model_file="../MobileNetV1/model"
+param_file="../MobileNetV1/params"
+opt_models_dir="./opt_models"
+mkdir ${opt_models_dir}
+./opt_mac --model_file=${model_file} \
+    --param_file=${param_file} \
+    --valid_targets=arm \
+    --optimize_out_type=naive_buffer \
+    --prefer_int8_kernel=false \
+    --optimize_out=${opt_models_dir}/MobileNetV1
+```
+Here `model_file` and `param_file` are the paths of the exported model file and parameter file, respectively. After the conversion succeeds, `MobileNetV1.nb` is saved in `opt_models`.
+Use the benchmark_bin file to load the optimized model for evaluation. The command is as follows.
+```shell
+bash benchmark.sh ./benchmark_bin_v8 ./opt_models result_armv8.txt
+```
+The result is saved in `result_armv8.txt` and is shown below.
+```
+PaddleLite Benchmark
+Threads=1 Warmup=10 Repeats=30
+MobileNetV1_lite              min = 30.89500    max = 30.78500    average = 30.84173
+Threads=2 Warmup=10 Repeats=30
+MobileNetV1_lite              min = 18.25300    max = 18.11000    average = 18.18017
+Threads=4 Warmup=10 Repeats=30
+MobileNetV1_lite              min = 10.00600    max = 9.90000     average = 9.96177
+```
+Taking the single-thread case as an example, the average latency of MobileNetV1 on SD855 is `30.84173ms`.
+For a more detailed explanation of the parameters and of Paddle-Lite usage, please refer to the [Paddle-Lite docs](https://paddle-lite.readthedocs.io/zh/latest/).
--- a/docs/en/extension/paddle_quantization_en.md
+++ b/docs/en/extension/paddle_quantization_en.md
+# Model Quantization
+Int8 quantization is one of the key features in [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim).
+It supports two quantization-aware training strategies, the **dynamic strategy** and the **static strategy**,
+layer-wise and channel-wise quantization,
+and deployment of the models generated by PaddleSlim with PaddleLite.
+Using this toolkit, [PaddleClas](https://github.com/PaddlePaddle/PaddleClas) quantized the mobilenet_v3_large_x1_0 model, whose accuracy after distillation is 78.9%.
+After quantization, the prediction time on SD855 drops from 19.308ms to 14.395ms.
+The storage size is reduced from 21M to 10M.
+The top1 recognition accuracy is 75.9%.
+For specific training methods, please refer to [PaddleSlim quant aware](https://paddlepaddle.github.io/PaddleSlim/quick_start/quant_aware_tutorial.html).
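The trade-off reported above can be made explicit with a little arithmetic. The sketch below uses only the figures quoted in this section. The variable names are illustrative, and reading `M` as a storage size in the same unit before and after quantization is an assumption.

```python
# Figures quoted above for mobilenet_v3_large_x1_0 on SD855.
latency_before_ms = 19.308   # before quantization
latency_after_ms = 14.395    # after int8 quantization
size_before_m = 21           # storage size before, unit as given in the text ("M")
size_after_m = 10            # storage size after
top1_before = 78.9           # top1 accuracy (%) of the distilled model
top1_after = 75.9            # top1 accuracy (%) after quantization

# Relative speedup and latency reduction.
speedup = latency_before_ms / latency_after_ms
latency_reduction = 1 - latency_after_ms / latency_before_ms

# Storage ratio and absolute accuracy drop in percentage points.
size_ratio = size_after_m / size_before_m
top1_drop = top1_before - top1_after

if __name__ == "__main__":
    print(f"speedup: {speedup:.2f}x")
    print(f"latency reduction: {latency_reduction:.1%}")
    print(f"size ratio: {size_ratio:.2f}")
    print(f"top1 drop: {top1_drop:.1f} points")
```

In short, quantization here trades about three points of top1 accuracy for roughly a quarter less latency and about half the storage.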
--- a/docs/en/extension/paddle_serving_en.md
+++ b/docs/en/extension/paddle_serving_en.md
+# Model Service Deployment
+## Overview
+[Paddle Serving](https://github.com/PaddlePaddle/Serving) aims to help deep-learning researchers easily deploy online inference services. It supports one-click deployment for industrial use, high concurrency, efficient communication between client and server, and client development in multiple programming languages.
+This section takes HTTP inference service deployment as an example to show how to use PaddleServing to deploy model services in PaddleClas.
+## Serving Install
+The Serving official website recommends using docker to install and deploy the Serving environment. First, pull the docker image and create a Serving-based docker container.
+```shell
+nvidia-docker pull hub.baidubce.com/paddlepaddle/serving:0.2.0-gpu
+nvidia-docker run -p 9292:9292 --name test -dit hub.baidubce.com/paddlepaddle/serving:0.2.0-gpu
+nvidia-docker exec -it test bash
+```
+Inside the docker container, install the Serving-related packages.
+```shell
+pip install paddlepaddle-gpu
+pip install paddle-serving-client
+pip install paddle-serving-server-gpu
+```
+* If the installation is too slow, you can append `-i https://pypi.tuna.tsinghua.edu.cn/simple` to the pip command to speed it up.
+* If you want to deploy a CPU service, install the CPU version of Serving with the following command.
+```shell
+pip install paddle-serving-server
+```
+### Export Model
+Export the Serving model with `tools/export_serving_model.py`. Taking ResNet50_vd as an example, the command is as follows.
+```shell
+python tools/export_serving_model.py -m ResNet50_vd -p ./pretrained/ResNet50_vd_pretrained/ -o serving
+```
+The client configuration, the model parameters and the structure file are then saved in `ppcls_client_conf` and `ppcls_model`.
+### Service Deployment and Request
+* Use the following command to start the Serving service.
+```shell
+python tools/serving/image_service_gpu.py serving/ppcls_model workdir 9292
+```
+`serving/ppcls_model` is the path of the Serving model just saved, `workdir` is the working directory, and `9292` is the port of the service.
+* Use the following script to send a recognition request to the Serving service and get the result.
+```
+python tools/serving/image_http_client.py  9292 ./docs/images/logo.png
+```
+`9292` is the port that the request is sent to, which must match the port the Serving service was started on, and `./docs/images/logo.png` is the test image. The top1 label and its probability are returned.
+* For more Serving deployment options, such as the RPC inference service, please refer to the Serving official website: [https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/imagenet](https://github.com/PaddlePaddle/Serving/tree/develop/python/examples/imagenet)
--- a/docs/en/faq_en.md
+++ b/docs/en/faq_en.md
+# FAQ
+>>
+* Q: Why are the metrics different for different cards?
+* A: PaddleClas uses Fleet by default. Each GPU card acts as a single trainer and processes different images, which causes small differences in the final result. Single-card evaluation is recommended to get accurate results if you use `tools/eval.py`. You can also use `tools/eval_multi_platform.py` to evaluate models on multiple GPU cards, which is also supported on Windows and CPU.
+>>
+* Q: Why is `Mixup` or `Cutmix` not used even though I have already added the data operation to the configuration file?
+* A: When using `Mixup` or `Cutmix`, you also need to add `use_mix: True` in the configuration file to make it work properly.
+>>
+* Q: During evaluation and inference, the pretrained model path is assigned, but the weights cannot be loaded. Why?
+* A: The prefix of the pretrained model is needed. For example, if the pretrained weights are located in `output/ResNet50_vd/19`, with the filename `output/ResNet50_vd/19/ppcls.pdparams`, then `pretrained_model` in the configuration file needs to be `output/ResNet50_vd/19/ppcls`.
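The prefix rule described in the answer above can be expressed as a small helper. This is a minimal sketch for illustration only: the function name `pretrained_prefix` is hypothetical and is not part of PaddleClas.

```python
import os


def pretrained_prefix(weights_path: str) -> str:
    """Strip the `.pdparams` suffix to get the value for `pretrained_model`."""
    prefix, ext = os.path.splitext(weights_path)
    if ext != ".pdparams":
        raise ValueError(f"expected a .pdparams file, got: {weights_path!r}")
    return prefix


if __name__ == "__main__":
    # Prints the prefix to put in the configuration file.
    print(pretrained_prefix("output/ResNet50_vd/19/ppcls.pdparams"))
```

The returned string is what goes into `pretrained_model` in the configuration file.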
+>>
+* Q: Why are the metrics about 0.3% lower than those shown in the model zoo for the `EfficientNet` series of models?
+* A: The resize method is `Cubic` for `EfficientNet` (`interpolation` is set to 2 in OpenCV), while other models use `Bilinear` (`interpolation` is left as None, the OpenCV default). Therefore, you need to set the interpolation explicitly in `ResizeImage`. The following configuration is a demo for EfficientNet.
+```
+VALID:
+    batch_size: 16
+    num_workers: 4
+    file_list: "./dataset/ILSVRC2012/val_list.txt"
+    data_dir: "./dataset/ILSVRC2012/"
+    shuffle_seed: 0
+    transforms:
+        - DecodeImage:
+            to_rgb: True
+            to_np: False
+            channel_first: False
+        - ResizeImage:
+            resize_short: 256
+            interpolation: 2
+        - CropImage:
+            size: 224
+        - NormalizeImage:
+            scale: 1.0/255.0
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+        - ToCHWImage:
+```
+>>
+* Q: What should I do if I want to convert weights from the `pdparams` format to the format of earlier versions (before Paddle 1.7.0), which consists of scattered files?
+* A: You can use `fluid.load` to load the `pdparams` weights and `fluid.io.save_vars` to save them as scattered files.
--- a/docs/en/index.rst
+++ b/docs/en/index.rst
@@ -11,9 +11,7 @@ Welcome to PaddleClas！
   advanced_tutorials/index
   application/index
   extension/index
-   competition_support.md
+   competition_support_en.md
-   model_zoo.md
+   update_history_en.md
-   change_log.md
+   faq_en.md
-   faq.md
-:math:`PaddlePaddle2020`
--- a/docs/en/models/DPN_DenseNet_en.md
+++ b/docs/en/models/DPN_DenseNet_en.md
+# DPN and DenseNet series
+## Overview
+DenseNet is a network structure proposed in 2017, and its paper won the CVPR best paper award. The network introduces a new cross-layer connected block called the dense-block. Compared to the bottleneck in ResNet, the dense-block uses a more aggressive dense connection scheme: all layers are connected to each other, and each layer takes the outputs of all preceding layers as additional input. DenseNet stacks dense-blocks into a densely connected network. The dense connections make backpropagation easier, so the network is easier to train and converges more easily. DPN stands for Dual Path Networks, a network that combines DenseNet and ResNeXt. The authors show that DenseNet can extract new features from previous levels, while ResNeXt essentially reuses features that have already been extracted. They further find that ResNeXt has a high feature reuse rate but low redundancy, while DenseNet can create new features but with high redundancy. Combining the advantages of the two structures, the authors designed the DPN network, which achieves better results than ResNeXt and DenseNet under the same FLOPS and parameters.
+The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.
+![](../../images/models/T4_benchmark/t4.fp32.bs4.DPN.flops.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.DPN.params.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.DPN.png)
+![](../../images/models/T4_benchmark/t4.fp16.bs4.DPN.png)
+PaddleClas currently provides 10 open-source pretrained models for these two series. Their metrics are shown in the figures above. Under the same FLOPS and parameters, DPN has higher accuracy than DenseNet. However, because DPN has more branches, its inference speed is slower than that of DenseNet. DenseNet264 is the deepest of the DenseNet networks and therefore has the most parameters, while DenseNet161 is the widest, which gives it the largest FLOPS and the highest accuracy in the series. In terms of inference speed, DenseNet161, despite its large FLOPS and high accuracy, is faster than DenseNet264, so it has a clear advantage over DenseNet264.
+For the DPN series, the larger the FLOPS and parameters of a model, the higher its accuracy. DPN107 is the widest network in the series, so it has the largest number of parameters and FLOPS.
+## Accuracy, FLOPS and Parameters
+| Models      | Top1   | Top5   | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
+|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| DenseNet121 | 0.757  | 0.926  | 0.750             |                   | 5.690        | 7.980             |
+| DenseNet161 | 0.786  | 0.941  | 0.778             |                   | 15.490       | 28.680            |
+| DenseNet169 | 0.768  | 0.933  | 0.764             |                   | 6.740        | 14.150            |
+| DenseNet201 | 0.776  | 0.937  | 0.775             |                   | 8.610        | 20.010            |
+| DenseNet264 | 0.780  | 0.939  | 0.779             |                   | 11.540       | 33.370            |
+| DPN68       | 0.768  | 0.934  | 0.764             | 0.931             | 4.030        | 10.780            |
+| DPN92       | 0.799  | 0.948  | 0.793             | 0.946             | 12.540       | 36.290            |
+| DPN98       | 0.806  | 0.951  | 0.799             | 0.949             | 22.220       | 58.460            |
+| DPN107      | 0.809  | 0.953  | 0.802             | 0.951             | 35.060       | 82.970            |
+| DPN131      | 0.807  | 0.951  | 0.801             | 0.949             | 30.510       | 75.360            |
+## Inference speed based on V100 GPU
+| Models                               | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
+|-------------|-----------|-------------------|--------------------------|
+| DenseNet121 | 224       | 256               | 4.371                    |
+| DenseNet161 | 224       | 256               | 8.863                    |
+| DenseNet169 | 224       | 256               | 6.391                    |
+| DenseNet201 | 224       | 256               | 8.173                    |
+| DenseNet264 | 224       | 256               | 11.942                   |
+| DPN68       | 224       | 256               | 11.805                   |
+| DPN92       | 224       | 256               | 17.840                   |
+| DPN98       | 224       | 256               | 21.057                   |
+| DPN107      | 224       | 256               | 28.685                   |
+| DPN131      | 224       | 256               | 28.083                   |
+## Inference speed based on T4 GPU
+| Models      | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
+|-------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
+| DenseNet121 | 224       | 256               | 4.16436                      | 7.2126                       | 10.50221                     | 4.40447                      | 9.32623                      | 15.25175                     |
+| DenseNet161 | 224       | 256               | 9.27249                      | 14.25326                     | 20.19849                     | 10.39152                     | 22.15555                     | 35.78443                     |
+| DenseNet169 | 224       | 256               | 6.11395                      | 10.28747                     | 13.68717                     | 6.43598                      | 12.98832                     | 20.41964                     |
+| DenseNet201 | 224       | 256               | 7.9617                       | 13.4171                      | 17.41949                     | 8.20652                      | 17.45838                     | 27.06309                     |
+| DenseNet264 | 224       | 256               | 11.70074                     | 19.69375                     | 24.79545                     | 12.14722                     | 26.27707                     | 40.01905                     |
+| DPN68       | 224       | 256               | 11.7827                      | 13.12652                     | 16.19213                     | 11.64915                     | 12.82807                     | 18.57113                     |
+| DPN92       | 224       | 256               | 18.56026                     | 20.35983                     | 29.89544                     | 18.15746                     | 23.87545                     | 38.68821                     |
+| DPN98       | 224       | 256               | 21.70508                     | 24.7755                      | 40.93595                     | 21.18196                     | 33.23925                     | 62.77751                     |
+| DPN107      | 224       | 256               | 27.84462                     | 34.83217                     | 60.67903                     | 27.62046                     | 52.65353                     | 100.11721                    |
+| DPN131      | 224       | 256               | 28.58941                     | 33.01078                     | 55.65146                     | 28.33119                     | 46.19439                     | 89.24904                     |
--- a/docs/en/models/EfficientNet_and_ResNeXt101_wsl_en.md
+++ b/docs/en/models/EfficientNet_and_ResNeXt101_wsl_en.md
+# EfficientNet and ResNeXt101_wsl series
+## Overview
+EfficientNet is a lightweight NAS-based network released by Google in 2019. EfficientNetB7 set a new state of the art for classification accuracy on ImageNet-1k at that time. In the paper, the authors point out that traditional methods for improving neural network performance mainly adjust the width of the network, the depth of the network, or the resolution of the input image.
+However, the authors found through experiments that balancing these three dimensions is essential for improving both accuracy and efficiency.
+They therefore summarized, through a series of experiments, how to scale the three dimensions together in a balanced way.
+Based on this scaling method, the authors built 7 networks, B1-B7, on top of EfficientNetB0. With the same FLOPS and parameters, their accuracy reached the state of the art.
+ResNeXt is an improved version of ResNet proposed by Facebook in 2016. In 2019, Facebook researchers studied the accuracy limit of this series of networks on ImageNet through weakly-supervised learning. To distinguish them from the earlier ResNeXt networks, these networks carry the suffix WSL, the abbreviation of weakly-supervised-learning. To obtain stronger feature extraction capability, the researchers further enlarged the network width. The largest network, ResNeXt101_32x48d_wsl, has 800 million parameters. It was trained on 940 million weakly-labeled images and then fine-tuned on ImageNet-1k. Its top-1 accuracy on ImageNet-1k reaches 85.4%, which is also the highest accuracy at a resolution of 224x224 on ImageNet-1k so far. In Fix-ResNeXt, the authors used a larger image resolution and applied a special Fix strategy to address the inconsistency of image preprocessing between training and testing, which gives ResNeXt101_32x48d_wsl a higher accuracy. Because it uses the Fix strategy, the model is named Fix-ResNeXt101_32x48d_wsl.
+The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.
+![](../../images/models/T4_benchmark/t4.fp32.bs4.EfficientNet.flops.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.EfficientNet.params.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs1.EfficientNet.png)
+![](../../images/models/T4_benchmark/t4.fp16.bs1.EfficientNet.png)
+PaddleClas currently provides 14 open-source pretrained models for these two series. The figures above show that the EfficientNet series has a clear advantage. The ResNeXt101_wsl series uses more data, and its final accuracy is also higher. EfficientNetB0_small removes the SE block from EfficientNetB0, which gives it faster inference speed.
+## Accuracy, FLOPS and Parameters
+| Models                        | Top1   | Top5   | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
+|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| ResNeXt101_<br>32x8d_wsl      | 0.826  | 0.967  | 0.822             | 0.964             | 29.140       | 78.440            |
+| ResNeXt101_<br>32x16d_wsl     | 0.842  | 0.973  | 0.842             | 0.972             | 57.550       | 152.660           |
+| ResNeXt101_<br>32x32d_wsl     | 0.850  | 0.976  | 0.851             | 0.975             | 115.170      | 303.110           |
+| ResNeXt101_<br>32x48d_wsl     | 0.854  | 0.977  | 0.854             | 0.976             | 173.580      | 456.200           |
+| Fix_ResNeXt101_<br>32x48d_wsl | 0.863  | 0.980  | 0.864             | 0.980             | 354.230      | 456.200           |
+| EfficientNetB0                | 0.774  | 0.933  | 0.773             | 0.935             | 0.720        | 5.100             |
+| EfficientNetB1                | 0.792  | 0.944  | 0.792             | 0.945             | 1.270        | 7.520             |
+| EfficientNetB2                | 0.799  | 0.947  | 0.803             | 0.950             | 1.850        | 8.810             |
+| EfficientNetB3                | 0.812  | 0.954  | 0.817             | 0.956             | 3.430        | 11.840            |
+| EfficientNetB4                | 0.829  | 0.962  | 0.830             | 0.963             | 8.290        | 18.760            |
+| EfficientNetB5                | 0.836  | 0.967  | 0.837             | 0.967             | 19.510       | 29.610            |
+| EfficientNetB6                | 0.840  | 0.969  | 0.842             | 0.968             | 36.270       | 42.000            |
+| EfficientNetB7                | 0.843  | 0.969  | 0.844             | 0.971             | 72.350       | 64.920            |
+| EfficientNetB0_<br>small      | 0.758  | 0.926  |                   |                   | 0.720        | 4.650             |
+## Inference speed based on V100 GPU
+| Models                               | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
+|-------------------------------|-----------|-------------------|--------------------------|
+| ResNeXt101_<br>32x8d_wsl      | 224       | 256               | 19.127                   |
+| ResNeXt101_<br>32x16d_wsl     | 224       | 256               | 23.629                   |
+| ResNeXt101_<br>32x32d_wsl     | 224       | 256               | 40.214                   |
+| ResNeXt101_<br>32x48d_wsl     | 224       | 256               | 59.714                   |
+| Fix_ResNeXt101_<br>32x48d_wsl | 320       | 320               | 82.431                   |
+| EfficientNetB0                | 224       | 256               | 2.449                    |
+| EfficientNetB1                | 240       | 272               | 3.547                    |
+| EfficientNetB2                | 260       | 292               | 3.908                    |
+| EfficientNetB3                | 300       | 332               | 5.145                    |
+| EfficientNetB4                | 380       | 412               | 7.609                    |
+| EfficientNetB5                | 456       | 488               | 12.078                   |
+| EfficientNetB6                | 528       | 560               | 18.381                   |
+| EfficientNetB7                | 600       | 632               | 27.817                   |
+| EfficientNetB0_<br>small      | 224       | 256               | 1.692                    |
+## Inference speed based on T4 GPU
+| Models                    | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
+|---------------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
+| ResNeXt101_<br>32x8d_wsl      | 224       | 256               | 18.19374                     | 21.93529                     | 34.67802                     | 18.52528                     | 34.25319                     | 67.2283                      |
+| ResNeXt101_<br>32x16d_wsl     | 224       | 256               | 18.52609                     | 36.8288                      | 62.79947                     | 25.60395                     | 71.88384                     | 137.62327                    |
+| ResNeXt101_<br>32x32d_wsl     | 224       | 256               | 33.51391                     | 70.09682                     | 125.81884                    | 54.87396                     | 160.04337                    | 316.17718                    |
+| ResNeXt101_<br>32x48d_wsl     | 224       | 256               | 50.97681                     | 137.60926                    | 190.82628                    | 99.01698256                  | 315.91261                    | 551.83695                    |
+| Fix_ResNeXt101_<br>32x48d_wsl | 320       | 320               | 78.62869                     | 191.76039                    | 317.15436                    | 160.0838242                  | 595.99296                    | 1151.47384                   |
+| EfficientNetB0            | 224       | 256               | 3.40122                      | 5.95851                      | 9.10801                      | 3.442                        | 6.11476                      | 9.3304                       |
+| EfficientNetB1            | 240       | 272               | 5.25172                      | 9.10233                      | 14.11319                     | 5.3322                       | 9.41795                      | 14.60388                     |
+| EfficientNetB2            | 260       | 292               | 5.91052                      | 10.5898                      | 17.38106                     | 6.29351                      | 10.95702                     | 17.75308                     |
+| EfficientNetB3            | 300       | 332               | 7.69582                      | 16.02548                     | 27.4447                      | 7.67749                      | 16.53288                     | 28.5939                      |
+| EfficientNetB4            | 380       | 412               | 11.55585                     | 29.44261                     | 53.97363                     | 12.15894                     | 30.94567                     | 57.38511                     |
+| EfficientNetB5            | 456       | 488               | 19.63083                     | 56.52299                     | -                            | 20.48571                     | 61.60252                     | -                            |
+| EfficientNetB6            | 528       | 560               | 30.05911                     | -                            | -                            | 32.62402                     | -                            | -                            |
+| EfficientNetB7            | 600       | 632               | 47.86087                     | -                            | -                            | 53.93823                     | -                            | -                            |
+| EfficientNetB0_small      | 224       | 256               | 2.39166                      | 4.36748                      | 6.96002                      | 2.3076                       | 4.71886                      | 7.21888                      |
--- a/docs/en/models/HRNet_en.md
+++ b/docs/en/models/HRNet_en.md
+# HRNet series
+## Overview
+HRNet is a neural network proposed by Microsoft Research Asia in 2019. Unlike previous convolutional neural networks, it maintains high resolution even in the deep layers of the network, so the predicted keypoint heat maps are more accurate, including in their spatial position. In addition, the network performs particularly well in other resolution-sensitive visual tasks, such as detection and segmentation.
+The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.
+![](../../images/models/T4_benchmark/t4.fp32.bs4.HRNet.flops.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.HRNet.params.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.HRNet.png)
+![](../../images/models/T4_benchmark/t4.fp16.bs4.HRNet.png)
+PaddleClas currently provides 7 open-source pretrained models of this series, and their metrics are shown in the figures above. The anomalous accuracy of HRNet_W48_C may be due to fluctuations in training.
+## Accuracy, FLOPS and Parameters
+| Models      | Top1   | Top5   | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
+|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| HRNet_W18_C | 0.769  | 0.934  | 0.768             | 0.934             | 4.140        | 21.290            |
+| HRNet_W30_C | 0.780  | 0.940  | 0.782             | 0.942             | 16.230       | 37.710            |
+| HRNet_W32_C | 0.783  | 0.942  | 0.785             | 0.942             | 17.860       | 41.230            |
+| HRNet_W40_C | 0.788  | 0.945  | 0.789             | 0.945             | 25.410       | 57.550            |
+| HRNet_W44_C | 0.790  | 0.945  | 0.789             | 0.944             | 29.790       | 67.060            |
+| HRNet_W48_C | 0.790  | 0.944  | 0.793             | 0.945             | 34.580       | 77.470            |
+| HRNet_W64_C | 0.793  | 0.946  | 0.795             | 0.946             | 57.830       | 128.060           |
+## Inference speed based on V100 GPU
+| Models      | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
+|-------------|-----------|-------------------|--------------------------|
+| HRNet_W18_C | 224       | 256               | 7.368                    |
+| HRNet_W30_C | 224       | 256               | 9.402                    |
+| HRNet_W32_C | 224       | 256               | 9.467                    |
+| HRNet_W40_C | 224       | 256               | 10.739                   |
+| HRNet_W44_C | 224       | 256               | 11.497                   |
+| HRNet_W48_C | 224       | 256               | 12.165                   |
+| HRNet_W64_C | 224       | 256               | 15.003                   |
+## Inference speed based on T4 GPU
+| Models      | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
+|-------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
+| HRNet_W18_C | 224       | 256               | 6.79093                      | 11.50986                     | 17.67244                     | 7.40636                      | 13.29752                     | 23.33445                     |
+| HRNet_W30_C | 224       | 256               | 8.98077                      | 14.08082                     | 21.23527                     | 9.57594                      | 17.35485                     | 32.6933                      |
+| HRNet_W32_C | 224       | 256               | 8.82415                      | 14.21462                     | 21.19804                     | 9.49807                      | 17.72921                     | 32.96305                     |
+| HRNet_W40_C | 224       | 256               | 11.4229                      | 19.1595                      | 30.47984                     | 12.12202                     | 25.68184                     | 48.90623                     |
+| HRNet_W44_C | 224       | 256               | 12.25778                     | 22.75456                     | 32.61275                     | 13.19858                     | 32.25202                     | 59.09871                     |
+| HRNet_W48_C | 224       | 256               | 12.65015                     | 23.12886                     | 33.37859                     | 13.70761                     | 34.43572                     | 63.01219                     |
+| HRNet_W64_C | 224       | 256               | 15.10428                     | 27.68901                     | 40.4198                      | 17.57527                     | 47.9533                      | 97.11228                     |
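
To compare batch sizes, the latencies can be converted to throughput. This is a rough sketch that assumes each reported time covers one whole batch, which the table does not state explicitly. The helper name is hypothetical, and the values are the HRNet_W18_C FP16 entries from the table above.

```python
def throughput_images_per_second(batch_latency_ms, batch_size):
    """Images processed per second, given the latency of one batch in milliseconds."""
    return batch_size / (batch_latency_ms / 1000.0)


# HRNet_W18_C, FP16, T4: batch size -> latency in ms (from the table above).
hrnet_w18_fp16 = {1: 6.79093, 4: 11.50986, 8: 17.67244}

for batch_size, latency_ms in hrnet_w18_fp16.items():
    rate = throughput_images_per_second(latency_ms, batch_size)
    print(f"batch size {batch_size}: {rate:.1f} images/s")
```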
--- a/docs/en/models/Inception_en.md
+++ b/docs/en/models/Inception_en.md
+# Inception series
+## Overview
+GoogLeNet is a neural network architecture designed by Google in 2014. Together with VGG, it was one of the two top performers in that year's ImageNet challenge. GoogLeNet introduced the Inception structure and stacked Inception modules to reach a depth of 22 layers, making it the first convolutional network to exceed 20 layers. Because the Inception structure uses 1x1 convolutions to reduce the number of channels, and global pooling replaces the traditional stack of multiple fc layers for processing features, GoogLeNet has far fewer FLOPS and parameters than VGG, which made it a standout network design at the time.
+Xception is an improvement on InceptionV3 that Google proposed after the Inception series. In Xception, the authors replaced the traditional convolution operation with depthwise separable convolution, which greatly reduced the network's FLOPS and number of parameters while improving accuracy. In DeeplabV3+, the authors further improved Xception and increased its number of layers, designing the Xception65 and Xception71 networks.
+InceptionV4 is a neural network designed by Google in 2016. Residual structures were all the rage at the time, but the authors believed that high performance could be achieved using the Inception structure alone. InceptionV4 uses more Inception modules to achieve higher accuracy on ImageNet-1k.
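
The parameter saving from 1x1 channel reduction and global pooling can be checked with simple arithmetic. This is a minimal sketch that counts weights only (no biases). The channel sizes are illustrative assumptions, and the helper name is hypothetical.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights in a square convolution, ignoring biases."""
    return in_channels * out_channels * kernel_size * kernel_size


# Illustrative sizes: 192 input channels, reduced to 16, then a 5x5 conv to 32.
in_ch, reduce_ch, out_ch = 192, 16, 32

direct = conv_weights(in_ch, out_ch, 5)
reduced = conv_weights(in_ch, reduce_ch, 1) + conv_weights(reduce_ch, out_ch, 5)
print("5x5 conv applied directly:", direct)
print("1x1 reduction then 5x5   :", reduced)

# Classifier input: flattening a 7x7x1024 feature map versus global pooling to 1024.
num_classes = 1000
fc_on_flattened = 7 * 7 * 1024 * num_classes
fc_on_pooled = 1024 * num_classes
print("fc on flattened features :", fc_on_flattened)
print("fc after global pooling  :", fc_on_pooled)
```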
+The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figures below.
+![](../../images/models/T4_benchmark/t4.fp32.bs4.Inception.flops.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.Inception.params.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.Inception.png)
+![](../../images/models/T4_benchmark/t4.fp16.bs4.Inception.png)
+The figures above show the relationship between accuracy and the other indicators for the Xception series and InceptionV4. Among them, Xception_deeplab follows the structure in the paper, while Xception is an improved model developed by PaddleClas that raises accuracy by about 0.6% with inference speed basically unchanged. Details of the improved model are being updated.
+## Accuracy, FLOPS and Parameters
+| Models             | Top1   | Top5   | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
+|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| GoogLeNet          | 0.707  | 0.897  | 0.698             |                   | 2.880        | 8.460             |
+| Xception41         | 0.793  | 0.945  | 0.790             | 0.945             | 16.740       | 22.690            |
+| Xception41<br>_deeplab | 0.796  | 0.944  |                   |                   | 18.160       | 26.730            |
+| Xception65         | 0.810  | 0.955  |                   |                   | 25.950       | 35.480            |
+| Xception65<br>_deeplab | 0.803  | 0.945  |                   |                   | 27.370       | 39.520            |
+| Xception71         | 0.811  | 0.955  |                   |                   | 31.770       | 37.280            |
+| InceptionV4        | 0.808  | 0.953  | 0.800             | 0.950             | 24.570       | 42.680            |
+## Inference speed based on V100 GPU
+| Models                 | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
+|------------------------|-----------|-------------------|--------------------------|
+| GoogLeNet              | 224       | 256               | 1.807                    |
+| Xception41             | 299       | 320               | 3.972                    |
+| Xception41_<br>deeplab | 299       | 320               | 4.408                    |
+| Xception65             | 299       | 320               | 6.174                    |
+| Xception65_<br>deeplab | 299       | 320               | 6.464                    |
+| Xception71             | 299       | 320               | 6.782                    |
+| InceptionV4            | 299       | 320               | 11.141                   |
+## Inference speed based on T4 GPU
+| Models             | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
+|--------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
+| GoogLeNet          | 299       | 320               | 1.75451                      | 3.39931                      | 4.71909                      | 1.88038                      | 4.48882                      | 6.94035                      |
+| Xception41         | 299       | 320               | 2.91192                      | 7.86878                      | 15.53685                     | 4.96939                      | 17.01361                     | 32.67831                     |
+| Xception41_<br>deeplab | 299       | 320               | 2.85934                      | 7.2075                       | 14.01406                     | 5.33541                      | 17.55938                     | 33.76232                     |
+| Xception65         | 299       | 320               | 4.30126                      | 11.58371                     | 23.22213                     | 7.26158                      | 25.88778                     | 53.45426                     |
+| Xception65_<br>deeplab | 299       | 320               | 4.06803                      | 9.72694                      | 19.477                       | 7.60208                      | 26.03699                     | 54.74724                     |
+| Xception71         | 299       | 320               | 4.80889                      | 13.5624                      | 27.18822                     | 8.72457                      | 31.55549                     | 69.31018                     |
+| InceptionV4        | 299       | 320               | 9.50821                      | 13.72104                     | 20.27447                     | 12.99342                     | 25.23416                     | 43.56121                     |
--- a/docs/en/models/Mobile_en.md
+++ b/docs/en/models/Mobile_en.md
+# Mobile and Embedded Vision Applications Network series
+## Overview
+MobileNetV1 is a network released by Google in 2017 for mobile and embedded devices. The network replaces the traditional convolution operation with depthwise separable convolution, that is, the combination of a depthwise convolution and a pointwise convolution. Compared with the traditional convolution operation, this combination greatly reduces the number of parameters and the amount of computation. MobileNetV1 can also be used for object detection, image segmentation and other vision tasks.
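
The saving from depthwise separable convolution follows directly from the weight counts. Below is a minimal sketch that ignores biases. The layer sizes (128 input channels, 256 output channels, 3x3 kernel) are illustrative assumptions, and the function names are hypothetical.

```python
def standard_conv_weights(in_channels, out_channels, kernel_size):
    """Weights of a traditional convolution: every output channel sees every input channel."""
    return in_channels * out_channels * kernel_size * kernel_size


def depthwise_separable_weights(in_channels, out_channels, kernel_size):
    """Weights of a depthwise conv (one k x k filter per input channel) plus a 1x1 pointwise conv."""
    depthwise = in_channels * kernel_size * kernel_size
    pointwise = in_channels * out_channels
    return depthwise + pointwise


standard = standard_conv_weights(128, 256, 3)
separable = depthwise_separable_weights(128, 256, 3)
print("traditional convolution:", standard)
print("depthwise separable    :", separable)
print("ratio                  :", separable / standard)
```

The ratio equals 1/out_channels + 1/kernel_size^2, so with 3x3 kernels the separable form needs roughly one eighth to one ninth of the weights.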
+MobileNetV2 is a lightweight network proposed by Google as the successor to MobileNetV1. MobileNetV2 introduced linear bottlenecks and inverted residual blocks as its basic building blocks, and the network architecture is constructed by stacking many of these blocks. As a result, it achieves higher classification accuracy with only about half the FLOPS of MobileNetV1.
+The ShuffleNet series is a family of lightweight networks proposed by MEGVII. So far there are two typical structures in this series, ShuffleNetV1 and ShuffleNetV2. The Channel Shuffle operation in ShuffleNet exchanges information between groups while still allowing end-to-end training. In the ShuffleNetV2 paper, the authors propose four criteria for designing lightweight networks, and design ShuffleNetV2 according to these criteria and the shortcomings of ShuffleNetV1.
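
The Channel Shuffle operation can be pictured as a reshape, transpose and flatten over the channel axis. The sketch below applies it to a flat list of channel labels rather than a real tensor, purely for illustration. The function name is hypothetical.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so that each new group mixes channels from every old group."""
    num_channels = len(channels)
    if num_channels % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = num_channels // groups
    # View the list as a [groups][per_group] grid, transpose it, then flatten.
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


# Six channels in two groups: [a0 a1 a2] and [b0 b1 b2].
labels = ["a0", "a1", "a2", "b0", "b1", "b2"]
print(channel_shuffle(labels, groups=2))
```

After the shuffle, neighbouring channels come from different groups, so the next grouped convolution sees information from all of them.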
+MobileNetV3 is a lightweight network based on NAS, proposed by Google in 2019. To further improve performance, the relu and sigmoid activation functions were replaced with hard_swish and hard_sigmoid respectively, and several other improvements were introduced to reduce the amount of computation.
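
For reference, the hard activations are piecewise-linear approximations of their smooth counterparts. The sketch below assumes the formulation from the MobileNetV3 paper, hard_sigmoid(x) = ReLU6(x + 3) / 6 and hard_swish(x) = x * hard_sigmoid(x). Framework implementations may use different slope and offset defaults, so this is not necessarily what a given operator computes.

```python
import math


def relu6(x):
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x):
    """Piecewise-linear approximation of sigmoid (MobileNetV3 paper form)."""
    return relu6(x + 3.0) / 6.0


def hard_swish(x):
    """Piecewise-linear approximation of swish, x * sigmoid(x)."""
    return x * hard_sigmoid(x)


def swish(x):
    """Smooth reference: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


for value in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(value, round(hard_swish(value), 4), round(swish(value), 4))
```

The hard versions avoid the exponential, which is the part that is expensive on mobile hardware.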
+![](../../images/models/mobile_arm_top1.png)
+![](../../images/models/mobile_arm_storage.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.mobile_trt.flops.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.mobile_trt.params.png)
+Currently PaddleClas has open-sourced 32 pretrained models in the mobile series, and their indicators are shown in the figures above. As the figures show, newer lightweight models tend to perform better, and MobileNetV3 represents the latest lightweight neural network architecture. In MobileNetV3, the authors used a 1x1 convolution after global-avg-pooling to obtain higher accuracy. This operation significantly increases the number of parameters but has little impact on the amount of computation, so MobileNetV3 has little advantage when models are judged by storage size, but its smaller amount of computation gives it a faster inference speed. In addition, the SSLD distillation models in our model library perform excellently, setting new accuracy marks for current lightweight models from several perspectives. Because the MobileNetV3 structure is complex and has many branches, which is not GPU friendly, its GPU inference speed is not as good as that of MobileNetV1.
+## Accuracy, FLOPS and Parameters
+| Models                               | Top1    | Top5    | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
+|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| MobileNetV1_x0_25                    | 0.514   | 0.755   | 0.506             |                   | 0.070        | 0.460             |
+| MobileNetV1_x0_5                     | 0.635   | 0.847   | 0.637             |                   | 0.280        | 1.310             |
+| MobileNetV1_x0_75                    | 0.688   | 0.882   | 0.684             |                   | 0.630        | 2.550             |
+| MobileNetV1                          | 0.710   | 0.897   | 0.706             |                   | 1.110        | 4.190             |
+| MobileNetV1_ssld                     | 0.779   | 0.939   |                   |                   | 1.110        | 4.190             |
+| MobileNetV2_x0_25                    | 0.532   | 0.765   |                   |                   | 0.050        | 1.500             |
+| MobileNetV2_x0_5                     | 0.650   | 0.857   | 0.654             | 0.864             | 0.170        | 1.930             |
+| MobileNetV2_x0_75                    | 0.698   | 0.890   | 0.698             | 0.896             | 0.350        | 2.580             |
+| MobileNetV2                          | 0.722   | 0.907   | 0.718             | 0.910             | 0.600        | 3.440             |
+| MobileNetV2_x1_5                     | 0.741   | 0.917   |                   |                   | 1.320        | 6.760             |
+| MobileNetV2_x2_0                     | 0.752   | 0.926   |                   |                   | 2.320        | 11.130            |
+| MobileNetV2_ssld                     | 0.7674  | 0.9339  |                   |                   | 0.600        | 3.440             |
+| MobileNetV3_large_<br>x1_25          | 0.764   | 0.930   | 0.766             |                   | 0.714        | 7.440             |
+| MobileNetV3_large_<br>x1_0           | 0.753   | 0.923   | 0.752             |                   | 0.450        | 5.470             |
+| MobileNetV3_large_<br>x0_75          | 0.731   | 0.911   | 0.733             |                   | 0.296        | 3.910             |
+| MobileNetV3_large_<br>x0_5           | 0.692   | 0.885   | 0.688             |                   | 0.138        | 2.670             |
+| MobileNetV3_large_<br>x0_35          | 0.643   | 0.855   | 0.642             |                   | 0.077        | 2.100             |
+| MobileNetV3_small_<br>x1_25          | 0.707   | 0.895   | 0.704             |                   | 0.195        | 3.620             |
+| MobileNetV3_small_<br>x1_0           | 0.682   | 0.881   | 0.675             |                   | 0.123        | 2.940             |
+| MobileNetV3_small_<br>x0_75          | 0.660   | 0.863   | 0.654             |                   | 0.088        | 2.370             |
+| MobileNetV3_small_<br>x0_5           | 0.592   | 0.815   | 0.580             |                   | 0.043        | 1.900             |
+| MobileNetV3_small_<br>x0_35          | 0.530   | 0.764   | 0.498             |                   | 0.026        | 1.660             |
+| MobileNetV3_large_<br>x1_0_ssld      | 0.790   | 0.945   |                   |                   | 0.450        | 5.470             |
+| MobileNetV3_large_<br>x1_0_ssld_int8 | 0.761   |         |                   |                   |              |                   |
+| MobileNetV3_small_<br>x1_0_ssld      | 0.713   | 0.901   |                   |                   | 0.123        | 2.940             |
+| ShuffleNetV2                         | 0.688   | 0.885   | 0.694             |                   | 0.280        | 2.260             |
+| ShuffleNetV2_x0_25                   | 0.499   | 0.738   |                   |                   | 0.030        | 0.600             |
+| ShuffleNetV2_x0_33                   | 0.537   | 0.771   |                   |                   | 0.040        | 0.640             |
+| ShuffleNetV2_x0_5                    | 0.603   | 0.823   | 0.603             |                   | 0.080        | 1.360             |
+| ShuffleNetV2_x1_5                    | 0.716   | 0.902   | 0.726             |                   | 0.580        | 3.470             |
+| ShuffleNetV2_x2_0                    | 0.732   | 0.912   | 0.749             |                   | 1.120        | 7.320             |
+| ShuffleNetV2_swish                   | 0.700   | 0.892   |                   |                   | 0.290        | 2.260             |
+## Inference speed and storage size based on SD855
+| Models                               | Batch Size=1(ms) | Storage Size(M) |
+|:--:|:--:|:--:|
+| MobileNetV1_x0_25                    | 3.220            | 1.900           |
+| MobileNetV1_x0_5                     | 9.580            | 5.200           |
+| MobileNetV1_x0_75                    | 19.436           | 10.000          |
+| MobileNetV1                          | 32.523           | 16.000          |
+| MobileNetV1_ssld                     | 32.523           | 16.000          |
+| MobileNetV2_x0_25                    | 3.799            | 6.100           |
+| MobileNetV2_x0_5                     | 8.702            | 7.800           |
+| MobileNetV2_x0_75                    | 15.531           | 10.000          |
+| MobileNetV2                          | 23.318           | 14.000          |
+| MobileNetV2_x1_5                     | 45.624           | 26.000          |
+| MobileNetV2_x2_0                     | 74.292           | 43.000          |
+| MobileNetV2_ssld                     | 23.318           | 14.000          |
+| MobileNetV3_large_x1_25          | 28.218           | 29.000          |
+| MobileNetV3_large_x1_0           | 19.308           | 21.000          |
+| MobileNetV3_large_x0_75          | 13.565           | 16.000          |
+| MobileNetV3_large_x0_5           | 7.493            | 11.000          |
+| MobileNetV3_large_x0_35          | 5.137            | 8.600           |
+| MobileNetV3_small_x1_25          | 9.275            | 14.000          |
+| MobileNetV3_small_x1_0           | 6.546            | 12.000          |
+| MobileNetV3_small_x0_75          | 5.284            | 9.600           |
+| MobileNetV3_small_x0_5           | 3.352            | 7.800           |
+| MobileNetV3_small_x0_35          | 2.635            | 6.900           |
+| MobileNetV3_large_x1_0_ssld      | 19.308           | 21.000          |
+| MobileNetV3_large_x1_0_ssld_int8 | 14.395           | 10.000          |
+| MobileNetV3_small_x1_0_ssld      | 6.546            | 12.000          |
+| ShuffleNetV2                         | 10.941           | 9.000           |
+| ShuffleNetV2_x0_25                   | 2.329            | 2.700           |
+| ShuffleNetV2_x0_33                   | 2.643            | 2.800           |
+| ShuffleNetV2_x0_5                    | 4.261            | 5.600           |
+| ShuffleNetV2_x1_5                    | 19.352           | 14.000          |
+| ShuffleNetV2_x2_0                    | 34.770           | 28.000          |
+| ShuffleNetV2_swish                   | 16.023           | 9.100           |
+## Inference speed based on T4 GPU
+| Models            | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
+|-----------------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|-----------------------|
+| MobileNetV1_x0_25           | 0.68422               | 1.13021               | 1.72095               | 0.67274               | 1.226                 | 1.84096               |
+| MobileNetV1_x0_5            | 0.69326               | 1.09027               | 1.84746               | 0.69947               | 1.43045               | 2.39353               |
+| MobileNetV1_x0_75           | 0.6793                | 1.29524               | 2.15495               | 0.79844               | 1.86205               | 3.064                 |
+| MobileNetV1                 | 0.71942               | 1.45018               | 2.47953               | 0.91164               | 2.26871               | 3.90797               |
+| MobileNetV1_ssld            | 0.71942               | 1.45018               | 2.47953               | 0.91164               | 2.26871               | 3.90797               |
+| MobileNetV2_x0_25           | 2.85399               | 3.62405               | 4.29952               | 2.81989               | 3.52695               | 4.2432                |
+| MobileNetV2_x0_5            | 2.84258               | 3.1511                | 4.10267               | 2.80264               | 3.65284               | 4.31737               |
+| MobileNetV2_x0_75           | 2.82183               | 3.27622               | 4.98161               | 2.86538               | 3.55198               | 5.10678               |
+| MobileNetV2                 | 2.78603               | 3.71982               | 6.27879               | 2.62398               | 3.54429               | 6.41178               |
+| MobileNetV2_x1_5            | 2.81852               | 4.87434               | 8.97934               | 2.79398               | 5.30149               | 9.30899               |
+| MobileNetV2_x2_0            | 3.65197               | 6.32329               | 11.644                | 3.29788               | 7.08644               | 12.45375              |
+| MobileNetV2_ssld            | 2.78603               | 3.71982               | 6.27879               | 2.62398               | 3.54429               | 6.41178               |
+| MobileNetV3_large_x1_25     | 2.34387               | 3.16103               | 4.79742               | 2.35117               | 3.44903               | 5.45658               |
+| MobileNetV3_large_x1_0      | 2.20149               | 3.08423               | 4.07779               | 2.04296               | 2.9322                | 4.53184               |
+| MobileNetV3_large_x0_75     | 2.1058                | 2.61426               | 3.61021               | 2.0006                | 2.56987               | 3.78005               |
+| MobileNetV3_large_x0_5      | 2.06934               | 2.77341               | 3.35313               | 2.11199               | 2.88172               | 3.19029               |
+| MobileNetV3_large_x0_35     | 2.14965               | 2.7868                | 3.36145               | 1.9041                | 2.62951               | 3.26036               |
+| MobileNetV3_small_x1_25     | 2.06817               | 2.90193               | 3.5245                | 2.02916               | 2.91866               | 3.34528               |
+| MobileNetV3_small_x1_0      | 1.73933               | 2.59478               | 3.40276               | 1.74527               | 2.63565               | 3.28124               |
+| MobileNetV3_small_x0_75     | 1.80617               | 2.64646               | 3.24513               | 1.93697               | 2.64285               | 3.32797               |
+| MobileNetV3_small_x0_5      | 1.95001               | 2.74014               | 3.39485               | 1.88406               | 2.99601               | 3.3908                |
+| MobileNetV3_small_x0_35     | 2.10683               | 2.94267               | 3.44254               | 1.94427               | 2.94116               | 3.41082               |
+| MobileNetV3_large_x1_0_ssld | 2.20149               | 3.08423               | 4.07779               | 2.04296               | 2.9322                | 4.53184               |
+| MobileNetV3_small_x1_0_ssld | 1.73933               | 2.59478               | 3.40276               | 1.74527               | 2.63565               | 3.28124               |
+| ShuffleNetV2                | 1.95064               | 2.15928               | 2.97169               | 1.89436               | 2.26339               | 3.17615               |
+| ShuffleNetV2_x0_25          | 1.43242               | 2.38172               | 2.96768               | 1.48698               | 2.29085               | 2.90284               |
+| ShuffleNetV2_x0_33          | 1.69008               | 2.65706               | 2.97373               | 1.75526               | 2.85557               | 3.09688               |
+| ShuffleNetV2_x0_5           | 1.48073               | 2.28174               | 2.85436               | 1.59055               | 2.18708               | 3.09141               |
+| ShuffleNetV2_x1_5           | 1.51054               | 2.4565                | 3.41738               | 1.45389               | 2.5203                | 3.99872               |
+| ShuffleNetV2_x2_0           | 1.95616               | 2.44751               | 4.19173               | 2.15654               | 3.18247               | 5.46893               |
+| ShuffleNetV2_swish          | 2.50213               | 2.92881               | 3.474                 | 2.5129                | 2.97422               | 3.69357               |
--- a/docs/en/models/Others_en.md
+++ b/docs/en/models/Others_en.md
+# Other networks
+## Overview
+In 2012, the AlexNet network proposed by Alex Krizhevsky et al. won the ImageNet competition by a wide margin over the runner-up, and convolutional neural networks, and deep learning more broadly, attracted wide attention. AlexNet used relu as the activation function of the CNN to address the vanishing gradient problem of sigmoid in deep networks. During training, Dropout was used to randomly drop a portion of the neurons, which reduces overfitting. In the network, overlapping max pooling replaces the average pooling commonly used in CNNs at the time, which avoids the blurring effect of average pooling and improves feature richness. In a sense, AlexNet ignited the research and application of neural networks.
+SqueezeNet achieved the same accuracy as AlexNet on ImageNet-1k with only 1/50 of the parameters. The core of the network is the Fire module, which uses 1x1 convolutions to reduce the channel dimension and thus greatly reduces the number of parameters. The authors built SqueezeNet by stacking a large number of Fire modules.
+VGG is a convolutional neural network developed by researchers at Oxford University's Visual Geometry Group and DeepMind. The network explores the relationship between the depth of a convolutional neural network and its performance. By repeatedly stacking small 3x3 convolution kernels and 2x2 max pooling layers, a multi-layer convolutional neural network was successfully constructed and achieved good convergence accuracy. VGG was the runner-up in the ILSVRC 2014 classification task and the winner of the localization task.
+DarkNet53 was designed for object detection by the author of YOLO in the YOLOv3 paper. The network consists mainly of 1x1 and 3x3 convolution kernels and has 53 layers in total, hence the name DarkNet53.
+## Accuracy, FLOPS and Parameters
+| Models                    | Top1   | Top5   | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
+|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| AlexNet                   | 0.567  | 0.792  | 0.5720            |                   | 1.370        | 61.090            |
+| SqueezeNet1_0             | 0.596  | 0.817  | 0.575             |                   | 1.550        | 1.240             |
+| SqueezeNet1_1             | 0.601  | 0.819  |                   |                   | 0.690        | 1.230             |
+| VGG11                     | 0.693  | 0.891  |                   |                   | 15.090       | 132.850           |
+| VGG13                     | 0.700  | 0.894  |                   |                   | 22.480       | 133.030           |
+| VGG16                     | 0.720  | 0.907  | 0.715             | 0.901             | 30.810       | 138.340           |
+| VGG19                     | 0.726  | 0.909  |                   |                   | 39.130       | 143.650           |
+| DarkNet53                 | 0.780  | 0.941  | 0.772             | 0.938             | 18.580       | 41.600            |
+| ResNet50_ACNet            | 0.767  | 0.932  |                   |                   | 10.730       | 33.110            |
+| ResNet50_ACNet<br>_deploy | 0.767  | 0.932  |                   |                   | 8.190        | 25.550            |
+## Inference speed based on V100 GPU
+| Models                 | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
+|---------------------------|-----------|-------------------|----------------------|
+| AlexNet                   | 224       | 256               | 1.176                |
+| SqueezeNet1_0             | 224       | 256               | 0.860                |
+| SqueezeNet1_1             | 224       | 256               | 0.763                |
+| VGG11                     | 224       | 256               | 1.867                |
+| VGG13                     | 224       | 256               | 2.148                |
+| VGG16                     | 224       | 256               | 2.616                |
+| VGG19                     | 224       | 256               | 3.076                |
+| DarkNet53                 | 256       | 256               | 3.139                |
+| ResNet50_ACNet<br>_deploy | 224       | 256               | 5.626                |
+## Inference speed based on T4 GPU
+| Models                | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
+|-----------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
+| AlexNet               | 224       | 256               | 1.06447                      | 1.70435                      | 2.38402                      | 1.44993                      | 2.46696                      | 3.72085                      |
+| SqueezeNet1_0         | 224       | 256               | 0.97162                      | 2.06719                      | 3.67499                      | 0.96736                      | 2.53221                      | 4.54047                      |
+| SqueezeNet1_1         | 224       | 256               | 0.81378                      | 1.62919                      | 2.68044                      | 0.76032                      | 1.877                        | 3.15298                      |
+| VGG11                 | 224       | 256               | 2.24408                      | 4.67794                      | 7.6568                       | 3.90412                      | 9.51147                      | 17.14168                     |
+| VGG13                 | 224       | 256               | 2.58589                      | 5.82708                      | 10.03591                     | 4.64684                      | 12.61558                     | 23.70015                     |
+| VGG16                 | 224       | 256               | 3.13237                      | 7.19257                      | 12.50913                     | 5.61769                      | 16.40064                     | 32.03939                     |
+| VGG19                 | 224       | 256               | 3.69987                      | 8.59168                      | 15.07866                     | 6.65221                      | 20.4334                      | 41.55902                     |
+| DarkNet53             | 256       | 256               | 3.18101                      | 5.88419                      | 10.14964                     | 4.10829                      | 12.1714                      | 22.15266                     |
+| ResNet50_ACNet        | 256       | 256               | 3.89002                      | 4.58195                      | 9.01095                      | 5.33395                      | 10.96843                     | 18.70368                     |
+| ResNet50_ACNet_deploy | 224       | 256               | 2.6823                       | 5.944                        | 7.16655                      | 3.49161                      | 7.78374                      | 13.94361                     |
--- a/docs/en/models/ResNet_and_vd_en.md
+++ b/docs/en/models/ResNet_and_vd_en.md
+# ResNet and ResNet_vd series
+## Overview
+The ResNet series was proposed in 2015 and won the ILSVRC2015 competition with a top5 error rate of 3.57%. The network introduced the residual structure and was built by stacking multiple residual blocks. Experiments show that using residual blocks effectively improves both convergence speed and accuracy.
+Joyce Xu of Stanford University calls ResNet one of three architectures that "really redefine the way we think about neural networks." Because of ResNet's outstanding performance, more and more researchers and engineers in academia and industry have improved its structure. Well-known variants include wide-resnet, resnet-vc, resnet-vd and Res2Net. The number of parameters and FLOPs of resnet-vc and resnet-vd are almost the same as those of ResNet, so we group them into the ResNet series here.
+The models of the ResNet series released this time include 14 pre-trained models including ResNet50, ResNet50_vd, ResNet50_vd_ssld, and ResNet200_vd. At the training level, ResNet adopted the standard training process for training ImageNet, while the rest of the improved model adopted more training strategies, such as cosine decay for the decline of learning rate and the regular label smoothing method,mixup was added to the data preprocessing, and the total number of iterations increased from 120 epoches to 200 epoches.
+Among them, ResNet50_vd_v2 and ResNet50_vd_ssld adopted knowledge distillation, which further improved the accuracy of the model while keeping the structure unchanged. Specifically, the teacher model of ResNet50_vd_v2 is ResNet152_vd (top1 accuracy 80.59%), the training set is imagenet-1k, the teacher model of ResNet50_vd_ssld is ResNeXt101_32x16d_wsl (top1 accuracy 84.2%), and the training set is the combination of 4 million data mined by imagenet-22k and ImageNet-1k . The specific methods of knowledge distillation are being continuously updated.
+The FLOPS, parameters, and inference time on the T4 GPU of this series of models are shown in the figure below.
+![](../../images/models/T4_benchmark/t4.fp32.bs4.ResNet.flops.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.ResNet.params.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.ResNet.png)
+![](../../images/models/T4_benchmark/t4.fp16.bs4.ResNet.png)
+As the curves above show, the more layers a model has, the higher its accuracy, but the number of parameters, the amount of computation, and the latency increase accordingly. By using a stronger teacher and more data, ResNet50_vd_ssld further raises top-1 accuracy on the ImageNet-1k validation set to 82.39%, surpassing the previous accuracy of the ResNet50 series models.
+## Accuracy, FLOPS and Parameters
+| Models           | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
+|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| ResNet18         | 0.710           | 0.899           | 0.696                    | 0.891                    | 3.660     | 11.690    |
+| ResNet18_vd      | 0.723           | 0.908           |                          |                          | 4.140     | 11.710    |
+| ResNet34         | 0.746           | 0.921           | 0.732                    | 0.913                    | 7.360     | 21.800    |
+| ResNet34_vd      | 0.760           | 0.930           |                          |                          | 7.390     | 21.820    |
+| ResNet50         | 0.765           | 0.930           | 0.760                    | 0.930                    | 8.190     | 25.560    |
+| ResNet50_vc      | 0.784           | 0.940           |                          |                          | 8.670     | 25.580    |
+| ResNet50_vd      | 0.791           | 0.944           | 0.792                    | 0.946                    | 8.670     | 25.580    |
+| ResNet50_vd_v2   | 0.798           | 0.949           |                          |                          | 8.670     | 25.580    |
+| ResNet101        | 0.776           | 0.936           | 0.776                    | 0.938                    | 15.520    | 44.550    |
+| ResNet101_vd     | 0.802           | 0.950           |                          |                          | 16.100    | 44.570    |
+| ResNet152        | 0.783           | 0.940           | 0.778                    | 0.938                    | 23.050    | 60.190    |
+| ResNet152_vd     | 0.806           | 0.953           |                          |                          | 23.530    | 60.210    |
+| ResNet200_vd     | 0.809           | 0.953           |                          |                          | 30.530    | 74.740    |
+| ResNet50_vd_ssld | 0.824           | 0.961           |                          |                          | 8.670     | 25.580    |
+| ResNet50_vd_ssld_v2 | 0.830           | 0.964           |                          |                          | 8.670     | 25.580    |
+| Fix_ResNet50_vd_ssld_v2 | 0.840           | 0.970           |                          |                          | 17.696     | 25.580    |
+| ResNet101_vd_ssld | 0.837           | 0.967           |                          |                          | 16.100    | 44.570     |
+* Note: `ResNet50_vd_ssld_v2` is obtained by adding AutoAugment to the training process of the `ResNet50_vd_ssld` training strategy. `Fix_ResNet50_vd_ssld_v2` freezes all parameters of `ResNet50_vd_ssld_v2` except the FC layer and fine-tunes on the ImageNet-1k dataset at a resolution of 320x320.
+## Inference speed based on V100 GPU
+| Models                 | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
+|------------------|-----------|-------------------|--------------------------|
+| ResNet18         | 224       | 256               | 1.499                    |
+| ResNet18_vd      | 224       | 256               | 1.603                    |
+| ResNet34         | 224       | 256               | 2.272                    |
+| ResNet34_vd      | 224       | 256               | 2.343                    |
+| ResNet50         | 224       | 256               | 2.939                    |
+| ResNet50_vc      | 224       | 256               | 3.041                    |
+| ResNet50_vd      | 224       | 256               | 3.165                    |
+| ResNet50_vd_v2   | 224       | 256               | 3.165                    |
+| ResNet101        | 224       | 256               | 5.314                    |
+| ResNet101_vd     | 224       | 256               | 5.252                    |
+| ResNet152        | 224       | 256               | 7.205                    |
+| ResNet152_vd     | 224       | 256               | 7.200                    |
+| ResNet200_vd     | 224       | 256               | 8.885                    |
+| ResNet50_vd_ssld | 224       | 256               | 3.165                    |
+| ResNet101_vd_ssld  | 224       | 256             | 5.252                  |
+## Inference speed based on T4 GPU
+| Models            | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
+|-------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
+| ResNet18          | 224       | 256               | 1.3568                       | 2.5225                       | 3.61904                      | 1.45606                      | 3.56305                      | 6.28798                      |
+| ResNet18_vd       | 224       | 256               | 1.39593                      | 2.69063                      | 3.88267                      | 1.54557                      | 3.85363                      | 6.88121                      |
+| ResNet34          | 224       | 256               | 2.23092                      | 4.10205                      | 5.54904                      | 2.34957                      | 5.89821                      | 10.73451                     |
+| ResNet34_vd       | 224       | 256               | 2.23992                      | 4.22246                      | 5.79534                      | 2.43427                      | 6.22257                      | 11.44906                     |
+| ResNet50          | 224       | 256               | 2.63824                      | 4.63802                      | 7.02444                      | 3.47712                      | 7.84421                      | 13.90633                     |
+| ResNet50_vc       | 224       | 256               | 2.67064                      | 4.72372                      | 7.17204                      | 3.52346                      | 8.10725                      | 14.45577                     |
+| ResNet50_vd       | 224       | 256               | 2.65164                      | 4.84109                      | 7.46225                      | 3.53131                      | 8.09057                      | 14.45965                     |
+| ResNet50_vd_v2    | 224       | 256               | 2.65164                      | 4.84109                      | 7.46225                      | 3.53131                      | 8.09057                      | 14.45965                     |
+| ResNet101         | 224       | 256               | 5.04037                      | 7.73673                      | 10.8936                      | 6.07125                      | 13.40573                     | 24.3597                      |
+| ResNet101_vd      | 224       | 256               | 5.05972                      | 7.83685                      | 11.34235                     | 6.11704                      | 13.76222                     | 25.11071                     |
+| ResNet152         | 224       | 256               | 7.28665                      | 10.62001                     | 14.90317                     | 8.50198                      | 19.17073                     | 35.78384                     |
+| ResNet152_vd      | 224       | 256               | 7.29127                      | 10.86137                     | 15.32444                     | 8.54376                      | 19.52157                     | 36.64445                     |
+| ResNet200_vd      | 224       | 256               | 9.36026                      | 13.5474                      | 19.0725                      | 10.80619                     | 25.01731                     | 48.81399                     |
+| ResNet50_vd_ssld  | 224       | 256               | 2.65164                      | 4.84109                      | 7.46225                      | 3.53131                      | 8.09057                      | 14.45965                     |
+| ResNet50_vd_ssld_v2  | 224       | 256               | 2.65164                      | 4.84109                      | 7.46225                      | 3.53131                      | 8.09057                      | 14.45965                     |
+| Fix_ResNet50_vd_ssld_v2  | 320       | 320               | 3.42818                      | 7.51534                      | 13.19370                      | 5.07696                      | 14.64218                      | 27.01453                     |
+| ResNet101_vd_ssld | 224       | 256               | 5.05972                      | 7.83685                      | 11.34235                     | 6.11704                      | 13.76222                     | 25.11071                     |
--- a/docs/en/models/SEResNext_and_Res2Net_en.md
+++ b/docs/en/models/SEResNext_and_Res2Net_en.md
+# SEResNeXt and Res2Net series
+## Overview
+ResNeXt, one of the typical variants of ResNet, was presented at CVPR 2017. Before that, methods for improving model accuracy mainly focused on deepening or widening the network, which increased the number of parameters and the amount of computation and slowed inference accordingly. ResNeXt introduced the concept of cardinality. Through experiments, the authors found that increasing the number of channel groups is more effective than increasing depth or width: it improves accuracy without increasing parameter complexity and can reduce the number of parameters at the same time, which makes ResNeXt a more successful variant of ResNet.
+SENet won the 2017 ImageNet classification competition. It proposes a new SE (Squeeze-and-Excitation) structure that can be transferred to any other network. The SE structure learns a per-channel scale that strengthens important features and weakens unimportant ones, so that the extracted features are more targeted.
+Res2Net is a brand-new improvement of ResNet proposed in 2019. The solution can be easily integrated with other excellent modules. Without increasing the amount of calculation, the performance on ImageNet, CIFAR-100 and other data sets exceeds ResNet. Res2Net, with its simple structure and superior performance, further explores the multi-scale representation capability of CNN at a more fine-grained level. Res2Net reveals a new dimension to improve model accuracy, called scale, which is an essential and more effective factor in addition to the existing dimensions of depth, width, and cardinality. The network also performs well in other visual tasks such as object detection and image segmentation.
+The FLOPS, parameters, and inference time on the T4 GPU for this series of models are shown in the figures below.
+![](../../images/models/T4_benchmark/t4.fp32.bs4.SeResNeXt.flops.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.SeResNeXt.params.png)
+![](../../images/models/T4_benchmark/t4.fp32.bs4.SeResNeXt.png)
+![](../../images/models/T4_benchmark/t4.fp16.bs4.SeResNeXt.png)
+At present, PaddleClas has open-sourced a total of 24 pretrained models across these three categories; their metrics are shown in the figures above. The figures show that, at the same FLOPS and parameter count, the improved models tend to have higher accuracy, but their inference speed is often lower than that of the ResNet series. Res2Net, on the other hand, performs better: compared with the group operation in ResNeXt and the SE structure in SEResNet, Res2Net tends to achieve better accuracy at the same FLOPS, parameter count, and inference speed.
+## Accuracy, FLOPS and Parameters
+| Models                | Top1   | Top5   | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(G) | Parameters<br>(M) |
+|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
+| Res2Net50_26w_4s      | 0.793  | 0.946  | 0.780             | 0.936             | 8.520        | 25.700            |
+| Res2Net50_vd_26w_4s   | 0.798  | 0.949  |                   |                   | 8.370        | 25.060            |
+| Res2Net50_14w_8s      | 0.795  | 0.947  | 0.781             | 0.939             | 9.010        | 25.720            |
+| Res2Net101_vd_26w_4s  | 0.806  | 0.952  |                   |                   | 16.670       | 45.220            |
+| Res2Net200_vd_26w_4s  | 0.812  | 0.957  |                   |                   | 31.490       | 76.210            |
+| ResNeXt50_32x4d       | 0.778  | 0.938  | 0.778             |                   | 8.020        | 23.640            |
+| ResNeXt50_vd_32x4d    | 0.796  | 0.946  |                   |                   | 8.500        | 23.660            |
+| ResNeXt50_64x4d       | 0.784  | 0.941  |                   |                   | 15.060       | 42.360            |
+| ResNeXt50_vd_64x4d    | 0.801  | 0.949  |                   |                   | 15.540       | 42.380            |
+| ResNeXt101_32x4d      | 0.787  | 0.942  | 0.788             |                   | 15.010       | 41.540            |
+| ResNeXt101_vd_32x4d   | 0.803  | 0.951  |                   |                   | 15.490       | 41.560            |
+| ResNeXt101_64x4d      | 0.784  | 0.945  | 0.796             |                   | 29.050       | 78.120            |
+| ResNeXt101_vd_64x4d   | 0.808  | 0.952  |                   |                   | 29.530       | 78.140            |
+| ResNeXt152_32x4d      | 0.790  | 0.943  |                   |                   | 22.010       | 56.280            |
+| ResNeXt152_vd_32x4d   | 0.807  | 0.952  |                   |                   | 22.490       | 56.300            |
+| ResNeXt152_64x4d      | 0.795  | 0.947  |                   |                   | 43.030       | 107.570           |
+| ResNeXt152_vd_64x4d   | 0.811  | 0.953  |                   |                   | 43.520       | 107.590           |
+| SE_ResNet18_vd        | 0.733  | 0.914  |                   |                   | 4.140        | 11.800            |
+| SE_ResNet34_vd        | 0.765  | 0.932  |                   |                   | 7.840        | 21.980            |
+| SE_ResNet50_vd        | 0.795  | 0.948  |                   |                   | 8.670        | 28.090            |
+| SE_ResNeXt50_32x4d    | 0.784  | 0.940  | 0.789             | 0.945             | 8.020        | 26.160            |
+| SE_ResNeXt50_vd_32x4d | 0.802  | 0.949  |                   |                   | 10.760       | 26.280            |
+| SE_ResNeXt101_32x4d   | 0.791  | 0.942  | 0.793             | 0.950             | 15.020       | 46.280            |
+| SENet154_vd           | 0.814  | 0.955  |                   |                   | 45.830       | 114.290           |
+## Inference speed based on V100 GPU
+| Models                 | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
+|-----------------------|-----------|-------------------|--------------------------|
+| Res2Net50_26w_4s      | 224       | 256               | 4.148                    |
+| Res2Net50_vd_26w_4s   | 224       | 256               | 4.172                    |
+| Res2Net50_14w_8s      | 224       | 256               | 5.113                    |
+| Res2Net101_vd_26w_4s  | 224       | 256               | 7.327                    |
+| Res2Net200_vd_26w_4s  | 224       | 256               | 12.806                   |
+| ResNeXt50_32x4d       | 224       | 256               | 10.964                   |
+| ResNeXt50_vd_32x4d    | 224       | 256               | 7.566                    |
+| ResNeXt50_64x4d       | 224       | 256               | 13.905                   |
+| ResNeXt50_vd_64x4d    | 224       | 256               | 14.321                   |
+| ResNeXt101_32x4d      | 224       | 256               | 14.915                   |
+| ResNeXt101_vd_32x4d   | 224       | 256               | 14.885                   |
+| ResNeXt101_64x4d      | 224       | 256               | 28.716                   |
+| ResNeXt101_vd_64x4d   | 224       | 256               | 28.398                   |
+| ResNeXt152_32x4d      | 224       | 256               | 22.996                   |
+| ResNeXt152_vd_32x4d   | 224       | 256               | 22.729                   |
+| ResNeXt152_64x4d      | 224       | 256               | 46.705                   |
+| ResNeXt152_vd_64x4d   | 224       | 256               | 46.395                   |
+| SE_ResNet18_vd        | 224       | 256               | 1.694                    |
+| SE_ResNet34_vd        | 224       | 256               | 2.786                    |
+| SE_ResNet50_vd        | 224       | 256               | 3.749                    |
+| SE_ResNeXt50_32x4d    | 224       | 256               | 8.924                    |
+| SE_ResNeXt50_vd_32x4d | 224       | 256               | 9.011                    |
+| SE_ResNeXt101_32x4d   | 224       | 256               | 19.204                   |
+| SENet154_vd           | 224       | 256               | 50.406                   |
+## Inference speed based on T4 GPU
+| Models                | Crop Size | Resize Short Size | FP16<br>Batch Size=1<br>(ms) | FP16<br>Batch Size=4<br>(ms) | FP16<br>Batch Size=8<br>(ms) | FP32<br>Batch Size=1<br>(ms) | FP32<br>Batch Size=4<br>(ms) | FP32<br>Batch Size=8<br>(ms) |
+|-----------------------|-----------|-------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|------------------------------|
+| Res2Net50_26w_4s      | 224       | 256               | 3.56067                      | 6.61827                      | 11.41566                     | 4.47188                      | 9.65722                      | 17.54535                     |
+| Res2Net50_vd_26w_4s   | 224       | 256               | 3.69221                      | 6.94419                      | 11.92441                     | 4.52712                      | 9.93247                      | 18.16928                     |
+| Res2Net50_14w_8s      | 224       | 256               | 4.45745                      | 7.69847                      | 12.30935                     | 5.4026                       | 10.60273                     | 18.01234                     |
+| Res2Net101_vd_26w_4s  | 224       | 256               | 6.53122                      | 10.81895                     | 18.94395                     | 8.08729                      | 17.31208                     | 31.95762                     |
+| Res2Net200_vd_26w_4s  | 224       | 256               | 11.66671                     | 18.93953                     | 33.19188                     | 14.67806                     | 32.35032                     | 63.65899                     |
+| ResNeXt50_32x4d       | 224       | 256               | 7.61087                      | 8.88918                      | 12.99674                     | 7.56327                      | 10.6134                      | 18.46915                     |
+| ResNeXt50_vd_32x4d    | 224       | 256               | 7.69065                      | 8.94014                      | 13.4088                      | 7.62044                      | 11.03385                     | 19.15339                     |
+| ResNeXt50_64x4d       | 224       | 256               | 13.78688                     | 15.84655                     | 21.79537                     | 13.80962                     | 18.4712                      | 33.49843                     |
+| ResNeXt50_vd_64x4d    | 224       | 256               | 13.79538                     | 15.22201                     | 22.27045                     | 13.94449                     | 18.88759                     | 34.28889                     |
+| ResNeXt101_32x4d      | 224       | 256               | 16.59777                     | 17.93153                     | 21.36541                     | 16.21503                     | 19.96568                     | 33.76831                     |
+| ResNeXt101_vd_32x4d   | 224       | 256               | 16.36909                     | 17.45681                     | 22.10216                     | 16.28103                     | 20.25611                     | 34.37152                     |
+| ResNeXt101_64x4d      | 224       | 256               | 30.12355                     | 32.46823                     | 38.41901                     | 30.4788                      | 36.29801                     | 68.85559                     |
+| ResNeXt101_vd_64x4d   | 224       | 256               | 30.34022                     | 32.27869                     | 38.72523                     | 30.40456                     | 36.77324                     | 69.66021                     |
+| ResNeXt152_32x4d      | 224       | 256               | 25.26417                     | 26.57001                     | 30.67834                     | 24.86299                     | 29.36764                     | 52.09426                     |
+| ResNeXt152_vd_32x4d   | 224       | 256               | 25.11196                     | 26.70515                     | 31.72636                     | 25.03258                     | 30.08987                     | 52.64429                     |
+| ResNeXt152_64x4d      | 224       | 256               | 46.58293                     | 48.34563                     | 56.97961                     | 46.7564                      | 56.34108                     | 106.11736                    |
+| ResNeXt152_vd_64x4d   | 224       | 256               | 47.68447                     | 48.91406                     | 57.29329                     | 47.18638                     | 57.16257                     | 107.26288                    |
+| SE_ResNet18_vd        | 224       | 256               | 1.61823                      | 3.1391                       | 4.60282                      | 1.7691                       | 4.19877                      | 7.5331                       |
+| SE_ResNet34_vd        | 224       | 256               | 2.67518                      | 5.04694                      | 7.18946                      | 2.88559                      | 7.03291                      | 12.73502                     |
+| SE_ResNet50_vd        | 224       | 256               | 3.65394                      | 7.568                        | 12.52793                     | 4.28393                      | 10.38846                     | 18.33154                     |
+| SE_ResNeXt50_32x4d    | 224       | 256               | 9.06957                      | 11.37898                     | 18.86282                     | 8.74121                      | 13.563                       | 23.01954                     |
+| SE_ResNeXt50_vd_32x4d | 224       | 256               | 9.25016                      | 11.85045                     | 25.57004                     | 9.17134                      | 14.76192                     | 19.914                       |
+| SE_ResNeXt101_32x4d   | 224       | 256               | 19.34455                     | 20.6104                      | 32.20432                     | 18.82604                     | 25.31814                     | 41.97758                     |
+| SENet154_vd           | 224       | 256               | 49.85733                     | 54.37267                     | 74.70447                     | 53.79794                     | 66.31684                     | 121.59885                    |
--- a/docs/en/models/Tricks_en.md
+++ b/docs/en/models/Tricks_en.md
+# Tricks for Training
+## Choice of Optimizers:
+Since the rise of deep learning, many researchers have worked on optimizers. The purpose of an optimizer is to make the loss function as small as possible, and thereby find suitable parameters for a given task. At present, the main optimizers used in model training are SGD, RMSProp, Adam, AdaDelta, and so on. SGD with momentum is widely used in academia and industry, so most of the models we release are trained with it. SGD with momentum has two disadvantages: it converges slowly, and its initial learning rate is difficult to set. However, if the initial learning rate is set properly and the model is trained for enough iterations, a model trained with SGD with momentum can reach higher accuracy than one trained with other optimizers. Optimizers with adaptive learning rates, such as Adam and RMSProp, tend to converge faster, but their final accuracy is slightly worse. If you want faster convergence, we recommend an optimizer with an adaptive learning rate; if you want higher accuracy, we recommend SGD with momentum.
+## Choice of Learning Rate and Learning Rate Declining Strategy:
+The choice of learning rate depends on the optimizer, the dataset, and the task. Here we mainly discuss the learning rate and the learning rate decay strategy for training on ImageNet-1K with SGD with momentum as the optimizer.
+### Concept of Learning Rate:
+The learning rate is the hyperparameter that controls the learning speed. The lower the learning rate, the more slowly the loss value changes. A low learning rate ensures that you will not miss any local minimum, but it also means slow convergence, especially when training is stuck on a gradient plateau.
+### Learning Rate Decline Strategy:
+If the same learning rate is used throughout training, the model cannot reach its highest accuracy, so the learning rate should be adjusted during training. In the early stage of training, the weights are in a randomly initialized state and the gradients tend to point downhill, so a relatively large learning rate can be used for faster convergence. In the late stage of training, the weights are close to their optimal values, which cannot be reached with a relatively large learning rate, so a smaller learning rate should be used. Many researchers use the piecewise_decay strategy, in which the learning rate declines in steps. For example, when training ResNet50, we set the initial learning rate to 0.1 and reduce it to 1/10 of its value every 30 epochs, for a total of 120 epochs. Besides piecewise_decay, researchers have proposed other ways to decrease the learning rate, such as polynomial_decay, exponential_decay, and cosine_decay. Among them, cosine_decay has become the preferred method for improving model accuracy because it requires no hyperparameter tuning and is relatively robust. The learning rate curves of cosine_decay and piecewise_decay are shown in the following figures. Over the entire training process, cosine_decay keeps a relatively large learning rate, so it converges more slowly, but its final accuracy is better than that of piecewise_decay.
+![](../../images/models/lr_decay.jpeg)
+In addition, the figures show that cosine_decay spends fewer epochs at a small learning rate, which affects the final accuracy. To get the most out of cosine_decay, we recommend using it with a larger number of epochs, such as 200.
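The two schedules discussed above can be compared with a minimal sketch. It assumes per-epoch updates and the common half-cosine formula; the function names `piecewise_decay` and `cosine_decay` here are illustrative helpers, not the PaddleClas implementation, which may differ in details such as per-step updates.

```python
import math


def piecewise_decay(epoch, base_lr=0.1, step=30, gamma=0.1):
    """Stepwise schedule: multiply the learning rate by gamma every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


def cosine_decay(epoch, total_epochs, base_lr=0.1):
    """Half-cosine schedule: decay smoothly from base_lr toward 0 over total_epochs."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


if __name__ == "__main__":
    # Compare both schedules at a few epochs of a 120-epoch run.
    for epoch in (0, 30, 45, 60, 90, 119):
        print(epoch, piecewise_decay(epoch), cosine_decay(epoch, 120))
```

For most of the run the cosine schedule stays above the stepwise one, and it only reaches small values near the end, which matches the observation that it benefits from more epochs.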
+### Warmup Strategy
+If a large batch_size is used to train a neural network, we recommend adopting a warmup strategy. As the name suggests, warmup lets the model warm up first: instead of using the initial learning rate directly at the beginning of training, we train with a gradually increasing learning rate. Once it reaches the initial learning rate, the decay method described in the learning rate decline strategy above takes over. Experiments show that the warmup strategy improves accuracy when the batch size is large. For models trained with a large batch_size, such as MobileNetV3, we set the number of warmup epochs to 5 by default; that is, the learning rate increases from 0 to the initial learning rate over the first 5 epochs, and then learning rate decay begins.
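The warmup behavior described above can be sketched as follows. This is a minimal illustration that assumes a linear ramp over whole epochs followed by cosine decay over the remaining epochs; `lr_with_warmup` is a hypothetical helper, and the real schedule may ramp per iteration instead.

```python
import math


def lr_with_warmup(epoch, total_epochs, base_lr=0.1, warmup_epochs=5):
    """Linear warmup from 0 to base_lr, then cosine decay for the remaining epochs."""
    if epoch < warmup_epochs:
        # Warmup phase: the learning rate grows linearly with the epoch index.
        return base_lr * epoch / warmup_epochs
    # Decay phase: progress runs from 0 to 1 over the epochs after warmup.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 1, 2, 5, 100, 199):
        print(epoch, lr_with_warmup(epoch, total_epochs=200))
```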
+## Choice of Batch_size
+Batch_size is an important hyperparameter in training neural networks: it determines how much data is sent to the network for training at a time. In paper [1], the authors found experimentally that when the learning rate is scaled linearly with batch_size, the convergence accuracy is hardly affected. When training on ImageNet, an initial learning rate of 0.1 with a batch_size of 256 is commonly chosen, so depending on the actual model size and memory, you can set the learning rate to 0.1\*k and batch_size to 256\*k.
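The linear scaling rule above amounts to one line of arithmetic. The sketch below assumes the reference point given in the text (learning rate 0.1 at batch_size 256); `scale_learning_rate` is an illustrative name, not a PaddleClas function.

```python
def scale_learning_rate(batch_size, base_lr=0.1, base_batch_size=256):
    """Scale the learning rate linearly with batch size (lr = 0.1 * k for batch 256 * k)."""
    return base_lr * batch_size / base_batch_size


if __name__ == "__main__":
    for batch_size in (128, 256, 512, 1024):
        print(batch_size, scale_learning_rate(batch_size))
```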
+## Choice of Weight_decay
+Overfitting is a common term in machine learning. Put simply, the model performs well on the training data but poorly on the test data. Convolutional neural networks also suffer from overfitting, and many regularization methods have been proposed to avoid it. Among them, weight_decay is one of the most widely used. L2 regularization (weight_decay) is added to the final loss function; with its help, the weights of the network tend toward smaller values, the parameters of the whole network tend toward 0, and the generalization performance of the model improves accordingly. Across deep learning frameworks, this setting denotes the coefficient of L2 regularization; in Paddle it is named L2_decay, so it is called L2_decay below. The larger the coefficient, the more the model tends to underfit. When training on ImageNet, this parameter is set to 1e-4 for most networks. For some small networks, such as the MobileNet family, the value is set to 1e-5 ~ 4e-5 to avoid underfitting. The setting also depends on the specific dataset: when the dataset is large, the network tends to underfit, so the value can be reduced appropriately; when the dataset is small, the network tends to overfit, so the value can be increased appropriately. The following table shows the accuracy of MobileNetV1_x0_25 with different l2_decay values on ImageNet-1k. Since MobileNetV1_x0_25 is a relatively small network, a large l2_decay makes it tend to underfit, so for this network 3e-5 is a better choice than 1e-4.
+| Model                | L2_decay | Train acc1/acc5 | Test acc1/acc5 |
+|:--:|:--:|:--:|:--:|
+| MobileNetV1_x0_25 | 1e-4     | 43.79%/67.61%   | 50.41%/74.70%  |
+| MobileNetV1_x0_25 | 3e-5     | 47.38%/70.83%   | 51.45%/75.45%  |
+In addition, the setting of L2_decay also depends on whether other regularization is used during training. If the data augmentation during training is more complicated, which makes training more difficult, L2_decay can be reduced appropriately. The following table shows the accuracy of ResNet50 with different l2_decay values on ImageNet-1K. After training becomes more difficult, a smaller l2_decay improves the top-1 accuracy of the model.
+| Model       | L2_decay | Train acc1/acc5 | Test acc1/acc5 |
+|:--:|:--:|:--:|:--:|
+| ResNet50 | 1e-4     | 75.13%/90.42%   | 77.65%/93.79%  |
+| ResNet50 | 7e-5     | 75.56%/90.55%   | 78.04%/93.74%  |
+In summary, l2_decay can be adjusted according to the specific task and model. A larger l2_decay is usually recommended for simple tasks or larger models, and a smaller l2_decay for complex tasks or smaller models.
+## Choice of Label_smoothing
+Label_smoothing is a regularization method in deep learning; its full name is Label Smoothing Regularization (LSR). In a traditional classification task, the loss is the cross-entropy between the true one-hot label and the output of the neural network. Label smoothing turns the true one-hot label into a smooth label, so that the network no longer learns from hard labels but from soft labels that carry probability values: the position corresponding to the true category has the largest probability, and the other positions have very small probabilities. The specific calculation can be found in [2]. Label smoothing has a parameter epsilon that describes how much the label is softened. The larger epsilon is, the smaller the probability of the true category and the smoother the label; the smaller epsilon is, the closer the label is to a hard label. When training on ImageNet-1K, epsilon is usually set to 0.1. In experiments with ResNet50_vd, accuracy is higher with label_smoothing than without it, as the following table shows.
+| Model          | Use_label_smoothing | Test acc1 |
+|:--:|:--:|:--:|
+| ResNet50_vd | 0                   | 77.9%     |
+| ResNet50_vd | 1                   | 78.4%     |
+However, because label smoothing acts as a regularizer, on relatively small models the accuracy improvement is not obvious and accuracy may even decrease. The following table shows the accuracy of ResNet18 on ImageNet-1K with and without label smoothing; after using label smoothing, the accuracy of ResNet18 decreases.
+| Model       | Use_label_smoothing | Train acc1/acc5 | Test acc1/acc5 |
+|:--:|:--:|:--:|:--:|
+| ResNet18 | 0                  | 69.81%/87.70%   | 70.98%/89.92%  |
+| ResNet18 | 1                  | 68.00%/86.56%   | 70.81%/89.89%  |
+In summary, using label_smoothing with larger models can effectively improve accuracy, while using it with smaller models may reduce accuracy, so before deciding whether to use label_smoothing, you need to evaluate the size of the model and the difficulty of the task.
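The softening controlled by epsilon can be illustrated with a short Python sketch. It follows the formulation in [2], where the soft label is (1 - epsilon) times the one-hot label plus epsilon spread uniformly over all classes; the function name `smooth_labels` is hypothetical, and the PaddleClas implementation may differ in detail.

```python
def smooth_labels(label, num_classes, epsilon=0.1):
    """Return the soft label (1 - epsilon) * one_hot + epsilon / num_classes."""
    uniform = epsilon / num_classes      # mass given to every class
    soft = [uniform] * num_classes
    soft[label] += 1.0 - epsilon         # the true class keeps most of the mass
    return soft


# A 5-class example with the commonly used epsilon of 0.1.
print(smooth_labels(label=2, num_classes=5, epsilon=0.1))
```

With epsilon set to 0 the function returns the original hard one-hot label, which matches the description that a smaller epsilon gives a harder label.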
+## Change the Crop Area and Stretch Transformation Degree of the Images for Small Models
+In the standard preprocessing of ImageNet-1k data, the random_crop function defines two values, scale and ratio, which determine the size of the image crop and the degree of stretching of the image, respectively. The default range of scale is 0.08-1 (lower_scale-upper_scale), and the default range of ratio is 3/4-4/3 (lower_ratio-upper_ratio). When training small networks, such strong data augmentation makes the network underfit, which lowers accuracy. To improve accuracy, you can weaken the data augmentation, that is, increase the crop area of the images or reduce the degree of stretching. This is done by increasing lower_scale or by narrowing the gap between lower_ratio and upper_ratio. The following table lists the accuracy of MobileNetV2_x0_25 trained with different lower_scale values. Both the training accuracy and the validation accuracy improve after the crop area is increased.
+| Model                | Scale Range | Train_acc1/acc5 | Test_acc1/acc5 |
+|:--:|:--:|:--:|:--:|
+| MobileNetV2_x0_25 | [0.08,1]  | 50.36%/72.98%   | 52.35%/75.65%  |
+| MobileNetV2_x0_25 | [0.2,1]   | 54.39%/77.08%   | 53.18%/76.14%  |
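To make the roles of scale and ratio concrete, the sketch below samples a crop size from the two ranges. It is an illustrative approximation only: the function name `sample_crop_size`, the retry count, and the uniform sampling of the aspect ratio are assumptions, and the random_crop operator in PaddleClas may sample differently.

```python
import math
import random


def sample_crop_size(width, height, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3),
                     rng=random, max_tries=10):
    """Sample a crop (w, h) whose area fraction lies in `scale` and
    whose aspect ratio w/h lies in `ratio`."""
    area = width * height
    for _ in range(max_tries):
        target_area = area * rng.uniform(*scale)   # fraction of the image kept
        aspect = rng.uniform(*ratio)               # degree of stretching
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            return w, h
    return width, height  # fall back to the whole image


random.seed(0)
print(sample_crop_size(224, 224))                      # default, strong augmentation
print(sample_crop_size(224, 224, scale=(0.2, 1.0)))    # weaker: larger minimum crop
```

Raising the lower bound of `scale` from 0.08 to 0.2 removes the smallest crops, which is the weaker setting compared in the table above.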
+## Use Data Augmentation to Improve Accuracy
+In general, the size of the dataset is critical to performance, but annotating images is often expensive, so annotated images are often scarce. In this case, data augmentation is particularly important. The standard data augmentation for training on ImageNet-1k mainly uses two methods, random_crop and random_flip. In recent years, however, more and more data augmentation methods have been proposed, such as cutout, mixup, cutmix, and AutoAugment. Experiments show that these methods can effectively improve model accuracy. The following table lists the performance of ResNet50 with 8 different data augmentation methods. Compared to the baseline, all of them improve the accuracy of ResNet50, and cutmix is the most effective among them. More on data augmentation can be found here: [**Data Augmentation**](https://paddleclas.readthedocs.io/zh_CN/latest/advanced_tutorials/image_augmentation/ImageAugment.html).
+| Model       | Data Augmentation     | Test top-1 |
+|:--:|:--:|:--:|
+| ResNet50 | Baseline           | 77.31%     |
+| ResNet50 | Auto-Augment   | 77.95%     |
+| ResNet50 | Mixup          | 78.28%     |
+| ResNet50 | Cutmix         | 78.39%     |
+| ResNet50 | Cutout         | 78.01%     |
+| ResNet50 | Gridmask       | 77.85%     |
+| ResNet50 | Random-Augment | 77.70%     |
+| ResNet50 | Random-Erasing | 77.91%     |
+| ResNet50 | Hide-and-Seek  | 77.43%     |
+## Determine the Tuning Strategy by Train_acc and Test_acc
+During training, the training set accuracy and validation set accuracy of each epoch are usually printed. Generally speaking, training is in a good state when the training set accuracy is equal to or slightly higher than the validation set accuracy. If the training set accuracy is much higher than the validation set accuracy, your task is overfitting and needs more regularization, such as a larger L2_decay, more data augmentation, or label smoothing. If the training set accuracy is lower than the validation set accuracy, your task is underfitting, and it is recommended to decrease L2_decay, use less data augmentation, increase the crop area of the images, weaken the stretching transformation of the images, remove label_smoothing, etc.
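The rule of thumb in this section can be written as a small helper. This is only a sketch: the function name `suggest_tuning` is hypothetical, and the 0.05 threshold for "much higher" is an assumed value that the guide does not specify.

```python
def suggest_tuning(train_acc, val_acc, overfit_gap=0.05):
    """Map the train/validation accuracy gap to the advice given in the guide.

    overfit_gap is an assumed threshold for "much higher"; tune it per task.
    """
    gap = train_acc - val_acc
    if gap > overfit_gap:
        return "overfitting: increase L2_decay, add data augmentation or label smoothing"
    if gap < 0:
        return "underfitting: decrease L2_decay, weaken data augmentation, remove label smoothing"
    return "healthy: no change suggested"


print(suggest_tuning(0.95, 0.80))
print(suggest_tuning(0.50, 0.52))
```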
+## Improve the Accuracy of Your Own Data Set with Existing Pre-trained Models
+In computer vision, it has become common to load pre-trained models when training on one's own tasks. Compared with training from random initialization, loading pre-trained models can often improve accuracy on specific tasks. In general, the pre-trained models widely used in industry are trained on the ImageNet-1k dataset. The fc layer weight of such a model is a k\*1000 matrix, where k is the number of neurons feeding the fc layer; because the target task is different, the fc layer weights do not need to be loaded. As for the learning rate, if your training dataset is particularly small (for example, fewer than 1,000 samples), we recommend a smaller initial learning rate, such as 0.001 (batch_size: 256, the same below), so that a large learning rate does not damage the pre-trained weights. If your training dataset is relatively large (more than 100,000 samples), we recommend trying a larger initial learning rate, such as 0.01 or greater.
+> If you find this guide helpful, you are welcome to star our repo: [https://github.com/PaddlePaddle/PaddleClas](https://github.com/PaddlePaddle/PaddleClas)
+## Reference
+[1] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
+[2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
--- a/docs/en/models/index.rst
+++ b/docs/en/models/index.rst
@@ -4,13 +4,13 @@ models
 .. toctree::
   :maxdepth: 1
-   models_intro.md
+   models_intro_en.md
-   Tricks.md
+   Tricks_en.md
-   ResNet_and_vd.md
+   ResNet_and_vd_en.md
-   Mobile.md
+   Mobile_en.md
-   SEResNext_and_Res2Net.md
+   SEResNext_and_Res2Net_en.md
-   Inception.md
+   Inception_en.md
-   HRNet.md
+   HRNet_en.md
-   DPN_DenseNet.md
+   DPN_DenseNet_en.md
-   EfficientNet_and_ResNeXt101_wsl.md
+   EfficientNet_and_ResNeXt101_wsl_en.md
-   Others.md
+   Others_en.md
--- a/docs/en/models/models_intro_en.md
+++ b/docs/en/models/models_intro_en.md
--- a/docs/en/tutorials/config_en.md
+++ b/docs/en/tutorials/config_en.md
+# Configuration
+---
+## Introduction
+This document introduces the configuration of PaddleClas (defined in `configs/*.yaml`).
+### Basic
+| name | detail | default value | optional value |
+|:---:|:---:|:---:|:---:|
+| mode | running mode | "train" | ["train", "valid"] |
+| architecture | model name | "ResNet50_vd" | one of 23 architectures |
+| pretrained_model | pretrained model path | "" | Str |
+| model_save_dir | model stored path | "" | Str |
+| classes_num | class number | 1000 | int |
+| total_images | total images | 1281167 | int |
+| save_interval | save interval | 1 | int |
+| validate | whether to validate when training | TRUE | bool |
+| valid_interval | valid interval | 1 | int |
+| epochs | epoch |  | int |
+| topk | K value | 5 | int |
+| image_shape | image size | [3, 224, 224] | list, shape: (3,) |
+| use_mix | whether to use mixup | False | ['True', 'False'] |
+| ls_epsilon | label_smoothing epsilon value| 0 | float |
+### Optimizer & Learning rate
+learning rate
+| name | detail | default value | optional value |
+|:---:|:---:|:---:|:---:|
+| function | decay type | "Linear" | ["Linear", "Cosine", <br> "Piecewise", "CosineWarmup"] |
+| params.lr | initial learning rate | 0.1 | float |
+| params.decay_epochs | milestone in piecewisedecay |  | list |
+| params.gamma | gamma in piecewisedecay | 0.1 | float |
+| params.warmup_epoch | warmup epoch | 5 | int |
+| params.steps | decay steps in lineardecay | 100 | int |
+| params.end_lr | end lr in lineardecay | 0 | float |
+optimizer
+| name | detail | default value | optional value |
+|:---:|:---:|:---:|:---:|
+| function | optimizer name | "Momentum" | ["Momentum", "RmsProp"] |
+| params.momentum | momentum value | 0.9 | float |
+| regularizer.function | regularizer method name | "L2" | ["L1", "L2"] |
+| regularizer.factor | regularizer factor | 0.0001 | float |
+### reader
+| name | detail |
+|:---:|:---:|
+| batch_size | batch size |
+| num_workers | worker number |
+| file_list | train list path |
+| data_dir | train  dataset path |
+| shuffle_seed | seed |
+processing
+| function name | attribute name | detail |
+|:---:|:---:|:---:|
+| DecodeImage | to_rgb | decode to RGB |
+|  | to_np | to numpy |
+|  | channel_first | Channel first |
+| RandCropImage | size | random crop |
+| RandFlipImage | | random flip |
+| NormalizeImage | scale | normalize image |
+|  | mean | mean |
+|  | std | std |
+|  | order | order |
+| ToCHWImage |  | to CHW |
+| CropImage | size | crop size |
+| ResizeImage | resize_short | resize according to short size |
+mix preprocessing
+| name| detail|
+|:---:|:---:|
+| MixupOperator.alpha | alpha value in mixup|
--- a/docs/en/tutorials/data_en.md
+++ b/docs/en/tutorials/data_en.md
+# Data
+---
+## Introduction
+This document introduces the preparation of the ImageNet1k and flowers102 datasets.
+## Dataset
+Dataset | train dataset size | valid dataset size | category |
+:------:|:---------------:|:---------------------:|:--------:|
+[flowers102](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/)|1k | 6k | 102 |
+[ImageNet1k](http://www.image-net.org/challenges/LSVRC/2012/)|1.2M| 50k | 1000 |
+* Data format
+Please organize the data as described below, and include train_list.txt and val_list.txt, in which each line holds an image path and its label:
+```shell
+# delimiter: "space"
+ILSVRC2012_val_00000001.JPEG 65
+...
+```
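As a rough illustration of this format, the following Python sketch reads such a list file into (path, label) pairs. The helper names `parse_list_line` and `read_list_file` are hypothetical and are not part of PaddleClas; the sketch only assumes the space delimiter shown above.

```python
def parse_list_line(line):
    """Split 'path/to/image.JPEG 65' into (path, integer label)."""
    path, label = line.strip().rsplit(" ", 1)  # the label is the last field
    return path, int(label)


def read_list_file(list_path):
    """Read every non-empty line of a train_list.txt or val_list.txt file."""
    with open(list_path) as f:
        return [parse_list_line(line) for line in f if line.strip()]


print(parse_list_line("ILSVRC2012_val_00000001.JPEG 65"))
```

Splitting from the right keeps the parser working even if an image path itself contains a space.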
+### ImageNet1k
+After downloading data, please organize the data dir as below
+```bash
+PaddleClas/dataset/imagenet/
+|_ train/
+|  |_ n01440764
+|  |  |_ n01440764_10026.JPEG
+|  |  |_ ...
+|  |_ ...
+|  |
+|  |_ n15075141
+|     |_ ...
+|     |_ n15075141_9993.JPEG
+|_ val/
+|  |_ ILSVRC2012_val_00000001.JPEG
+|  |_ ...
+|  |_ ILSVRC2012_val_00050000.JPEG
+|_ train_list.txt
+|_ val_list.txt
+```
+### Flowers102 Dataset
+Download [Data](https://www.robots.ox.ac.uk/~vgg/data/flowers/102/) then decompress:
+```shell
+jpg/
+setid.mat
+imagelabels.mat
+```
+Please put all the files under ```PaddleClas/dataset/flowers102```
+Use generate_flowers102_list.py to generate train_list.txt and val_list.txt:
+```bash
+python generate_flowers102_list.py jpg train > train_list.txt
+python generate_flowers102_list.py jpg valid > val_list.txt
+```
+Please organize data dir as below
+```bash
+PaddleClas/dataset/flowers102/
+|_ jpg/
+|  |_ image_03601.jpg
+|  |_ ...
+|  |_ image_02355.jpg
+|_ train_list.txt
+|_ val_list.txt
+```
--- a/docs/en/tutorials/getting_started_en.md
+++ b/docs/en/tutorials/getting_started_en.md
+# Getting Started
+---
+Please refer to [Installation](install.md) to set up the environment first, and prepare the ImageNet1K data by following the instructions in [data](data.md).
+## Setup
+**Set PYTHONPATH:**
+```bash
+export PYTHONPATH=path_to_PaddleClas:$PYTHONPATH
+```
+## Training and validating
+PaddleClas provides `tools/train.py` and `tools/eval.py` for training and validation.
+### Training
+```bash
+# PaddleClas uses paddle.distributed.launch to start multi-card, multi-process training.
+# Use --selected_gpus to specify the GPU cards.
+python -m paddle.distributed.launch \
+    --selected_gpus="0,1,2,3" \
+    tools/train.py \
+        -c ./configs/ResNet/ResNet50_vd.yaml
+```
+- log:
+```
+epoch:0    train    step:13    loss:7.9561    top1:0.0156    top5:0.1094    lr:0.100000    elapse:0.193
+```
+Add `-o` parameters to override configuration values:
+```bash
+python -m paddle.distributed.launch \
+    --selected_gpus="0,1,2,3" \
+    tools/train.py \
+        -c ./configs/ResNet/ResNet50_vd.yaml \
+        -o use_mix=1 \
+    --vdl_dir=./scalar/
+```
+- log:
+```
+epoch:0    train    step:522    loss:1.6330    lr:0.100000    elapse:0.210
+```
+Alternatively, modify the fields in the configuration file directly; please refer to [config](config.md) for more details.
+Use VisualDL to visualize the training loss in real time:
+```bash
+visualdl --logdir ./scalar --host <host_IP> --port <port_num>
+```
+### Finetune
+* Please refer to [Trial](./quick_start.md) for more details.
+### Validation
+```bash
+python tools/eval.py \
+    -c ./configs/eval.yaml \
+    -o ARCHITECTURE.name="ResNet50_vd" \
+    -o pretrained_model=path_to_pretrained_models
+```
+Modify the `ARCHITECTURE.name` and `pretrained_model` fields in `configs/eval.yaml` to configure the model to validate, or add `-o` parameters to override them directly.
+**NOTE:** when loading the pretrained model, omit the `.pdparams` suffix.
+## Predict
+PaddlePaddle supports three prediction interfaces.
+The following uses the predictor interface to predict.
+First, export the inference model:
+```bash
+python tools/export_model.py \
+    --model=model_name \
+    --pretrained_model=pretrained_model_dir \
+    --output_path=save_inference_dir
+```
+Second, start the predictor engine:
+```bash
+python tools/infer/predict.py \
+    -m model_path \
+    -p params_path \
+    -i image_path \
+    --use_gpu=1 \
+    --use_tensorrt=True
+```
+Please refer to [inference](../extension/paddle_inference.md) for more details.
--- a/docs/en/tutorials/index.rst
+++ b/docs/en/tutorials/index.rst
@@ -4,6 +4,8 @@ tutorials
 .. toctree::
   :maxdepth: 1
-   install.md
+   install_en.md
-   getting_started.md
+   quick_start_en.md
-   config.md
+   data_en.md
+   getting_started_en.md
+   config_en.md
--- a/docs/en/tutorials/install_en.md
+++ b/docs/en/tutorials/install_en.md
+# Installation
+---
+## Introduction
+This document introduces how to install PaddleClas and its requirements.
+## Install PaddlePaddle
+Python 3.5, CUDA 9.0, cuDNN 7.0, and nccl 2.1.2 or later versions are required. For now, PaddleClas only supports training on GPU devices. Please follow the instructions in [Installation](http://www.paddlepaddle.org.cn/install/quick) if the PaddlePaddle version on the device is lower than v1.7.
+Install PaddlePaddle
+```bash
+pip install paddlepaddle-gpu --upgrade
+```
+Alternatively, compile from source code; please refer to [Installation](http://www.paddlepaddle.org.cn/install/quick).
+Verify Installation
+```python
+import paddle.fluid as fluid
+fluid.install_check.run_check()
+```
+Check the PaddlePaddle version:
+```bash
+python -c "import paddle; print(paddle.__version__)"
+```
+Note:
+- Make sure the compiled version is later than v1.7
+- Specify **WITH_DISTRIBUTE=ON** when compiling. Please refer to [Instruction](https://www.paddlepaddle.org.cn/documentation/docs/zh/develop/install/Tables.html#id3) for more details.
+## Install PaddleClas
+**Clone PaddleClas:**
+```
+cd path_to_clone_PaddleClas
+git clone https://github.com/PaddlePaddle/PaddleClas.git
+```
+**Install requirements**
+```
+pip install --upgrade -r requirements.txt
+```
--- a/docs/en/tutorials/quick_start_en.md
+++ b/docs/en/tutorials/quick_start_en.md
+# Trial in 30 Minutes
+Based on the flowers102 dataset, you can experience PaddleClas in only 30 minutes, including training with various backbones and pretrained models, SSLD distillation, and multiple data augmentation methods. Please refer to [Installation](install.md) to install PaddleClas first.
+## Preparation
+* Enter the installation directory.
+```
+cd path_to_PaddleClas
+```
+* Enter `dataset/flowers102`, then download and decompress the flowers102 dataset.
+```shell
+cd dataset/flowers102
+wget https://www.robots.ox.ac.uk/~vgg/data/flowers/102/102flowers.tgz
+wget https://www.robots.ox.ac.uk/~vgg/data/flowers/102/imagelabels.mat
+wget https://www.robots.ox.ac.uk/~vgg/data/flowers/102/setid.mat
+tar -xf 102flowers.tgz
+```
+* Create the train/val/test label files.
+```shell
+python generate_flowers102_list.py jpg train > train_list.txt
+python generate_flowers102_list.py jpg valid > val_list.txt
+python generate_flowers102_list.py jpg test > extra_list.txt
+cat train_list.txt extra_list.txt > train_extra_list.txt
+```
+**Note:** To provide more data for the SSLD training task, train_list.txt and extra_list.txt are merged into train_extra_list.txt.
+* Return to the `PaddleClas` directory.
+```
+cd ../../
+```
+## Environment
+### Set PYTHONPATH
+```bash
+export PYTHONPATH=./:$PYTHONPATH
+```
+### Download pretrained model
+```bash
+python tools/download.py -a ResNet50_vd -p ./pretrained -d True
+python tools/download.py -a ResNet50_vd_ssld -p ./pretrained -d True
+python tools/download.py -a MobileNetV3_large_x1_0 -p ./pretrained -d True
+```
+Parameters:
+ `architecture` (short name: a): model name.
+ `path` (short name: p): download path.
+ `decompress` (short name: d): whether to decompress.
+* All experiments were run on a single NVIDIA® Tesla® V100 card.
+## Training
+### Train from scratch
+* Train ResNet50_vd
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python -m paddle.distributed.launch \
+    --selected_gpus="0" \
+    tools/train.py \
+        -c ./configs/quick_start/ResNet50_vd.yaml
+```
+The validation `Top1 Acc` curve is shown below.
+![](../../images/quick_start/r50_vd_acc.png)
+### Finetune - ResNet50_vd pretrained model (Acc 79.12\%)
+* Finetune the ResNet50_vd model pretrained on the 1000-class ImageNet dataset
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python -m paddle.distributed.launch \
+    --selected_gpus="0" \
+    tools/train.py \
+        -c ./configs/quick_start/ResNet50_vd_finetune.yaml
+```
+The validation `Top1 Acc` curve is shown below.
+![](../../images/quick_start/r50_vd_pretrained_acc.png)
+Compared with training from scratch, accuracy improves by more than 65\% to 94.02\%.
+### SSLD finetune - ResNet50_vd_ssld pretrained model (Acc 82.39\%)
+Note: when finetuning a model that was trained with SSLD, use smaller learning rates for the intermediate layers of the network.
+```yaml
+ARCHITECTURE:
+    name: 'ResNet50_vd'
+    params:
+        lr_mult_list: [0.1, 0.1, 0.2, 0.2, 0.3]
+pretrained_model: "./pretrained/ResNet50_vd_ssld_pretrained"
+```
+Training script
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python -m paddle.distributed.launch \
+    --selected_gpus="0" \
+    tools/train.py \
+        -c ./configs/quick_start/ResNet50_vd_ssld_finetune.yaml
+```
+Compared with finetuning from the 79.12% pretrained model, accuracy improves by nearly 1% to 95%.
+### More architecture - MobileNetV3
+Training script
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python -m paddle.distributed.launch \
+    --selected_gpus="0" \
+    tools/train.py \
+        -c ./configs/quick_start/MobileNetV3_large_x1_0_finetune.yaml
+```
+Compared with the ResNet50_vd pretrained model, accuracy decreases by 5% to 90%. Different architectures yield different performance. Choosing the best-performing model is a task-oriented decision that should also consider inference time, storage, heterogeneous devices, etc.
+### RandomErasing
+Data augmentation is helpful when the training set is small.
+Training script
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python -m paddle.distributed.launch \
+    --selected_gpus="0" \
+    tools/train.py \
+        -c ./configs/quick_start/ResNet50_vd_ssld_random_erasing_finetune.yaml
+```
+Accuracy improves by 1.27\% to 96.27\%.
+* Save the ResNet50_vd pretrained model for use in the next section.
+```shell
+cp -r output/ResNet50_vd/19/  ./pretrained/flowers102_R50_vd_final/
+```
+### Distillation
+* Use extra_list.txt as unlabeled data. Note:
+    * The samples in `extra_list.txt` and `val_list.txt` do not overlap.
+    * Because the source code does not use the label information, this is still unlabeled distillation.
+    * The teacher model uses the pretrained model trained on the flowers102 dataset, and the student model uses the MobileNetV3_large_x1_0 pretrained model (Acc 75.32\%) trained on the ImageNet1K dataset.
+```yaml
+total_images: 7169
+ARCHITECTURE:
+    name: 'ResNet50_vd_distill_MobileNetV3_large_x1_0'
+pretrained_model:
+    - "./pretrained/flowers102_R50_vd_final/ppcls"
+    - "./pretrained/MobileNetV3_large_x1_0_pretrained/"
+TRAIN:
+    file_list: "./dataset/flowers102/train_extra_list.txt"
+```
+Final training script
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python -m paddle.distributed.launch \
+    --selected_gpus="0" \
+    tools/train.py \
+        -c ./configs/quick_start/R50_vd_distill_MV3_large_x1_0.yaml
+```
+With more unlabeled data and a teacher model, accuracy improves significantly, by 6.47% to 96.47%.
+### All accuracy
+|Configuration | Top1 Acc |
+|- |:-: |
+| ResNet50_vd.yaml | 0.2735 |
+| MobileNetV3_large_x1_0_finetune.yaml | 0.9000 |
+| ResNet50_vd_finetune.yaml | 0.9402 |
+| ResNet50_vd_ssld_finetune.yaml | 0.9500 |
+| ResNet50_vd_ssld_random_erasing_finetune.yaml | 0.9627 |
+| R50_vd_distill_MV3_large_x1_0.yaml | 0.9647 |
+The whole accuracy curves are shown below
+![](../../images/quick_start/all_acc.png)
+* **NOTE**: As flowers102 is a small dataset, validation accuracy may fluctuate by about 1%.
+* Please refer to [Getting_started](./getting_started) for more details.
--- a/docs/en/update_history_en.md
+++ b/docs/en/update_history_en.md
+# Release Notes
+* 2020.06.17
+    * Add English documents.
+* 2020.06.12
+    * Add support for training and evaluation on Windows or CPU.
+* 2020.05.17
+    * Add support for mixed precision training.
+* 2020.05.09
+    * Add user guide about Paddle Serving and Paddle-Lite.
+    * Add benchmark about FP16/FP32 on T4 GPU.
+* 2020.04.14
+    * First commit.
--- a/docs/images/distillation/distillation_perform.png
+++ b/docs/images/distillation/distillation_perform.png
--- a/docs/images/distillation/distillation_perform_s.jpg
+++ b/docs/images/distillation/distillation_perform_s.jpg
--- a/docs/images/image_aug/image_aug_samples_s_en.jpg
+++ b/docs/images/image_aug/image_aug_samples_s_en.jpg
--- a/docs/images/main_features_s.png
+++ b/docs/images/main_features_s.png
--- a/docs/images/main_features_s_en.png
+++ b/docs/images/main_features_s_en.png
--- a/docs/images/models/T4_benchmark/t4.fp16.bs4.ResNet.png
+++ b/docs/images/models/T4_benchmark/t4.fp16.bs4.ResNet.png
--- a/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.flops.png
+++ b/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.flops.png
--- a/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.params.png
+++ b/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.params.png
--- a/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.png
+++ b/docs/images/models/T4_benchmark/t4.fp32.bs4.ResNet.png
--- a/docs/images/models/T4_benchmark/t4.fp32.bs4.main_fps_top1.png
+++ b/docs/images/models/T4_benchmark/t4.fp32.bs4.main_fps_top1.png
--- a/docs/images/models/V100_benchmark/v100.fp32.bs1.main_fps_top1_s.jpg
+++ b/docs/images/models/V100_benchmark/v100.fp32.bs1.main_fps_top1_s.jpg
--- a/docs/zh_CN/advanced_tutorials/distillation/distillation.md
+++ b/docs/zh_CN/advanced_tutorials/distillation/distillation.md
@@ -8,7 +8,7 @@
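 Deep neural networks generally contain considerable parameter redundancy, and there are currently several main approaches to compressing a model and reducing its parameter count, such as pruning, quantization, and knowledge distillation. Knowledge distillation uses a teacher model to guide a student model in learning a specific task, so that the small model gains a sizable performance improvement with its parameter count unchanged, and can even reach accuracy similar to that of the large model [1]. PaddleClas combines existing distillation methods [2,3] and provides a simple semi-supervised label distillation scheme (SSLD, Simple Semi-supervised Label Distillation). On the ImageNet1k classification dataset, it brings an absolute accuracy improvement of more than 3% on both the ResNet_vd and MobileNet series; the detailed metrics are shown in the figure below.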
-![](../../../images/distillation/distillation_perform.png)
+![](../../../images/distillation/distillation_perform_s.jpg)
 # 2. SSLD Distillation Strategy
@@ -17,10 +17,8 @@
 The flowchart of SSLD is shown in the figure below.
 ![](../../../images/distillation/ppcls_distillation.png)
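 First, we mined nearly 4 million images from ImageNet22k and merged them with the ImageNet-1k training set, obtaining a new dataset of 5 million images. Next, we combined the student model and the teacher model into a single new network that outputs the predicted distributions of both models; the gradients of the entire teacher network are frozen, while the student model performs normal backpropagation. Finally, we convert the logits of the two models into soft labels through the softmax activation function and use the JS divergence between the two soft labels as the loss function for distillation training. Below, taking knowledge distillation of MobileNetV3 (75.3% accuracy when trained directly) as an example, we introduce the key points of the scheme (the baseline distills MobileNetV3 from the 79.12% ResNet50_vd model, with the ImageNet1k training set, cross entropy loss, and 120 epochs, reaching 75.6% accuracy).
 * Choice of teacher model. In knowledge distillation, if the structures of the teacher and student models differ too much, distillation brings little benefit. With the same structure, a more accurate teacher model also has a large impact on the result. Compared with the 79.12% ResNet50_vd teacher model, using the 82.4% ResNet50_vd teacher model brings an absolute accuracy gain of 0.4% (`75.6%->76.0%`).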
@@ -103,6 +101,14 @@ SSLD的流程图如下图所示。
 | ResNet101_vd | 30 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 83.73% |
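+## 3.4 Data Augmentation and Finetuning with the Fix Strategy
+* Based on the experimental conclusions above, we added AutoAugment [4] during training and further reduced l2_decay (4e-5->2e-5). With the SSLD distillation strategy, ResNet50_vd finally reaches 82.99% accuracy on ImageNet1k, a further 0.6% gain over the previous distillation strategy without data augmentation.
+* For image classification, when the test resolution is about 1.15 times the training resolution, the accuracy of a model can often be improved further without retraining [5]. Testing the 82.99% ResNet50_vd at 320x320 gives 83.7% accuracy. We further apply the Fix strategy: train at 320x320 using the same data preprocessing as at prediction time while freezing all parameters except the FC layer. At a prediction resolution of 320x320, the accuracy finally reaches **84.0%**.
 ## 3.4 Some Issues Encountered in the Experiments
 ### 3.4.1 How bn Is Computed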
@@ -182,7 +188,7 @@ for var in ./*_student; do cp "$var" "../student_model/${var%_student}"; done #
 | Faster RCNN R50_vd FPN | 640/640 | 79.12% | [0.05,0.05,0.1,0.1,0.15] | 34.3% |
 | Faster RCNN R50_vd FPN | 640/640 | 82.18% | [0.05,0.05,0.1,0.1,0.15] | 36.3% |
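-As can be seen here, for models that were not distilled, over-adjusting the learning rates of the intermediate layers actually lowers the performance of the final detection model. Based on this distilled model, we also provide a leading practical server-side object detection solution; the detailed configurations and training code are open source, see [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/rcnn_server_side_det).
+As can be seen here, for models that were not distilled, over-adjusting the learning rates of the intermediate layers actually lowers the performance of the final detection model. Based on this distilled model, we also provide a leading practical server-side object detection solution; the detailed configurations and training code are open source, see [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/rcnn_enhance).
 # 5. SSLD in Practice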
@@ -266,3 +272,7 @@ sh tools/run.sh
 [2] Bagherinezhad H, Horton M, Rastegari M, et al. Label refinery: Improving imagenet classification through label progression[J]. arXiv preprint arXiv:1805.02641, 2018.
 [3] Yalniz I Z, Jégou H, Chen K, et al. Billion-scale semi-supervised learning for image classification[J]. arXiv preprint arXiv:1905.00546, 2019.
+[4] Cubuk E D, Zoph B, Mane D, et al. Autoaugment: Learning augmentation strategies from data[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2019: 113-123.
+[5] Touvron H, Vedaldi A, Douze M, et al. Fixing the train-test resolution discrepancy[C]//Advances in Neural Information Processing Systems. 2019: 8250-8260.
--- a/docs/zh_CN/faq.md
+++ b/docs/zh_CN/faq.md
@@ -17,7 +17,7 @@
 >>
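 * Q: When evaluating the `EfficientNetB0_small` model, why is the final accuracy always about 0.3% lower than the figure on the official website?
-* A: Networks in the `EfficientNet` series use `cubic interpolation` when resizing (the interpolation value of the resize parameters is set to 2), whereas other models default to None, so the interpolation value of resiz must be specified explicitly during training and evaluation. Specifically, see the parameters of ResizeImage in the preprocessing section of the configuration below.
+* A: Networks in the `EfficientNet` series use `cubic interpolation` when resizing (the interpolation value of the resize parameters is set to 2), whereas other models default to None, so the interpolation value of resize must be specified explicitly during training and evaluation. Specifically, see the parameters of ResizeImage in the preprocessing section of the configuration below.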
 ```
 VALID:
    batch_size: 16

--- a/docs/zh_CN/models/ResNet_and_vd.md
+++ b/docs/zh_CN/models/ResNet_and_vd.md
@@ -42,8 +42,11 @@ ResNet系列模型是在2015年提出的，一举在ILSVRC2015比赛中取得冠
 | ResNet152_vd     | 0.806           | 0.953           |                          |                          | 23.530    | 60.210    |
 | ResNet200_vd     | 0.809           | 0.953           |                          |                          | 30.530    | 74.740    |
 | ResNet50_vd_ssld | 0.824           | 0.961           |                          |                          | 8.670     | 25.580    |
+| ResNet50_vd_ssld_v2 | 0.830           | 0.964           |                          |                          | 8.670     | 25.580    |
+| Fix_ResNet50_vd_ssld_v2 | 0.840           | 0.970           |                          |                          | 17.696     | 25.580    |
 | ResNet101_vd_ssld | 0.837           | 0.967           |                          |                          | 16.100    | 44.570     |
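+* Note: `ResNet50_vd_ssld_v2` is trained by adding AutoAugment to the training strategy of `ResNet50_vd_ssld`. `Fix_ResNet50_vd_ssld_v2` is obtained by freezing all network parameters of `ResNet50_vd_ssld_v2` except the FC layer and finetuning on the ImageNet1k dataset at an input resolution of 320x320.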
@@ -86,4 +89,6 @@ ResNet系列模型是在2015年提出的，一举在ILSVRC2015比赛中取得冠
 | ResNet152_vd      | 224       | 256               | 7.29127                      | 10.86137                     | 15.32444                     | 8.54376                      | 19.52157                     | 36.64445                     |
 | ResNet200_vd      | 224       | 256               | 9.36026                      | 13.5474                      | 19.0725                      | 10.80619                     | 25.01731                     | 48.81399                     |
 | ResNet50_vd_ssld  | 224       | 256               | 2.65164                      | 4.84109                      | 7.46225                      | 3.53131                      | 8.09057                      | 14.45965                     |
+| ResNet50_vd_ssld_v2  | 224       | 256               | 2.65164                      | 4.84109                      | 7.46225                      | 3.53131                      | 8.09057                      | 14.45965                     |
+| Fix_ResNet50_vd_ssld_v2  | 320       | 320               | 3.42818                      | 7.51534                      | 13.19370                      | 5.07696                      | 14.64218                      | 27.01453                     |
 | ResNet101_vd_ssld | 224       | 256               | 5.05972                      | 7.83685                      | 11.34235                     | 6.11704                      | 13.76222                     | 25.11071                     |
--- a/docs/zh_CN/models/models_intro.md
+++ b/docs/zh_CN/models/models_intro.md
@@ -51,6 +51,8 @@ python tools/infer/predict.py \
    - [ResNet152_vd](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet152_vd_pretrained.tar)
    - [ResNet200_vd](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet200_vd_pretrained.tar)
    - [ResNet50_vd_ssld](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_pretrained.tar)
+    - [ResNet50_vd_ssld_v2](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_ssld_v2_pretrained.tar)
+    - [Fix_ResNet50_vd_ssld_v2](https://paddle-imagenet-models-name.bj.bcebos.com/Fix_ResNet50_vd_ssld_v2_pretrained.tar)
    - [ResNet101_vd_ssld](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet101_vd_ssld_pretrained.tar)

--- a/docs/zh_CN/tutorials/getting_started.md
+++ b/docs/zh_CN/tutorials/getting_started.md
@@ -41,7 +41,8 @@ python -m paddle.distributed.launch \
    --selected_gpus="0,1,2,3" \
    tools/train.py \
        -c ./configs/ResNet/ResNet50_vd.yaml \
-        -o use_mix=1
+        -o use_mix=1 \
+        --vdl_dir=./scalar/
 ```
@@ -53,6 +54,13 @@ epoch:0    train    step:522    loss:1.6330    lr:0.100000    elapse:0.210
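 You can also update the configuration by editing the model's configuration file directly. See the [configuration documentation](config.md) for the available parameters.
+During training, you can use VisualDL to watch the loss change in real time. The launch command is: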
+```bash
+visualdl --logdir ./scalar --host <host_IP> --port <port_num>
+```
 ### 2.2 Model fine-tuning

--- a/docs/zh_CN/tutorials/install.md
+++ b/docs/zh_CN/tutorials/install.md
@@ -38,7 +38,7 @@ python -c "import paddle; print(paddle.__version__)"
 **Runtime requirements:**
- Python2 (no longer officially maintained) or Python3 (currently only Linux is supported)
+- Python3 (currently only Linux is supported)
 - CUDA >= 9.0
 - cuDNN >= 5.0
 - nccl >= 2.1.2
@@ -60,3 +60,10 @@ Python依赖库在[requirements.txt](https://github.com/PaddlePaddle/PaddleClas/
 ```
 pip install --upgrade -r requirements.txt
 ```
+If installing visualdl fails, try:
+```
+pip3 install --upgrade visualdl==2.0.0b3 -i https://mirror.baidu.com/pypi/simple
+```
--- a/docs/zh_CN/update_history.md
+++ b/docs/zh_CN/update_history.md
 # Update history
+* 2020.06.17
+    * Added English documentation.
+* 2020.06.12
+    * Added training and evaluation support for Windows and CPU environments.
 * 2020.05.17
    * Added mixed-precision training.

--- a/ppcls/data/imaug/operators.py
+++ b/ppcls/data/imaug/operators.py
@@ -25,6 +25,8 @@ import random
 import cv2
 import numpy as np
+from .autoaugment import ImageNetPolicy
 class OperatorParamError(ValueError):
    """ OperatorParamError
@@ -115,7 +117,9 @@ class CropImage(object):
 class RandCropImage(object):
    """ random crop image """
-    def __init__(self, size, scale=None, ratio=None):
+    def __init__(self, size, scale=None, ratio=None, interpolation=-1):
+        self.interpolation = interpolation if interpolation >= 0 else None
        if type(size) is int:
            self.size = (size, size)  # (h, w)
        else:
@@ -149,7 +153,10 @@ class RandCropImage(object):
        j = random.randint(0, img_h - h)
        img = img[j:j + h, i:i + w, :]
-        return cv2.resize(img, size)
+        if self.interpolation is None:
+            return cv2.resize(img, size)
+        else:
+            return cv2.resize(img, size, interpolation=self.interpolation)
 class RandFlipImage(object):
@@ -172,6 +179,18 @@ class RandFlipImage(object):
            return img
+class AutoAugment(object):
+    def __init__(self):
+        self.policy = ImageNetPolicy()
+    def __call__(self, img):
+        from PIL import Image
+        img = np.ascontiguousarray(img)
+        img = Image.fromarray(img)
+        img = self.policy(img)
+        img = np.asarray(img)
+        return img
 class NormalizeImage(object):
    """ normalize image such as substract mean, divide std
    """

--- a/ppcls/data/reader.py
+++ b/ppcls/data/reader.py
@@ -15,15 +15,17 @@
 import numpy as np
 import imghdr
 import os
+import sys
 import signal
+from paddle import fluid
 from paddle.fluid.io import multiprocess_reader
 from . import imaug
 from .imaug import transform
 from ppcls.utils import logger
-trainers_num = int(os.environ.get('PADDLE_TRAINERS_NUM', 1))
+trainers_num = int(os.environ.get('PADDLE_TRAINERS_NUM', 0))
 trainer_id = int(os.environ.get("PADDLE_TRAINER_ID", 0))
@@ -139,8 +141,9 @@ def get_file_list(params):
    # use only partial data for each trainer in distributed training
    if params['mode'] == 'train':
-        img_per_trainer = len(full_lines) // trainers_num
-        full_lines = full_lines[trainer_id::trainers_num][:img_per_trainer]
+        real_trainer_num = max(trainers_num, 1)
+        img_per_trainer = len(full_lines) // real_trainer_num
+        full_lines = full_lines[trainer_id::real_trainer_num][:img_per_trainer]
    return full_lines
@@ -165,7 +168,7 @@ def create_operators(params):
    return ops
-def partial_reader(params, full_lines, part_id=0, part_num=1):
+def partial_reader(params, full_lines, part_id=0, part_num=1, batch_size=1):
    """
    create a reader with partial data
@@ -174,13 +177,13 @@ def partial_reader(params, full_lines, part_id=0, part_num=1):
        full_lines: label list
        part_id(int): part index of the current partial data
        part_num(int): part num of the dataset
+        batch_size(int): batch size for one trainer
    """
    assert part_id < part_num, ("part_num: {} should be larger "
                                "than part_id: {}".format(part_num, part_id))
    full_lines = full_lines[part_id::part_num]
-    batch_size = int(params['batch_size']) // trainers_num
    if params['mode'] != "test" and len(full_lines) < batch_size:
        raise SampleNumException('', len(full_lines), batch_size)
@@ -197,7 +200,7 @@ def partial_reader(params, full_lines, part_id=0, part_num=1):
    return reader
-def mp_reader(params):
+def mp_reader(params, batch_size):
    """
    multiprocess reader
@@ -210,11 +213,16 @@ def mp_reader(params):
    if params["mode"] == "train":
        full_lines = shuffle_lines(full_lines, seed=None)
+    # NOTE: multiprocess reader is not supported on windows
+    if sys.platform == "win32":
+        return partial_reader(params, full_lines, 0, 1, batch_size)
    part_num = 1 if 'num_workers' not in params else params['num_workers']
    readers = []
    for part_id in range(part_num):
-        readers.append(partial_reader(params, full_lines, part_id, part_num))
+        readers.append(
+            partial_reader(params, full_lines, part_id, part_num, batch_size))
    return multiprocess_reader(readers, use_pipe=False)
@@ -248,6 +256,7 @@ class Reader:
        except KeyError:
            raise ModeException(mode=mode)
+        self.use_gpu = config.get("use_gpu", True)
        use_mix = config.get('use_mix')
        self.params['mode'] = mode
        if seed is not None:
@@ -257,10 +266,17 @@ class Reader:
            self.batch_ops = create_operators(self.params['mix'])
    def __call__(self):
-        batch_size = int(self.params['batch_size']) // trainers_num
+        device_num = trainers_num
+        # non-distributed launch
+        if trainers_num <= 0:
+            if self.use_gpu:
+                device_num = fluid.core.get_cuda_device_count()
+            else:
+                device_num = int(os.environ.get('CPU_NUM', 1))
+        batch_size = int(self.params['batch_size']) // device_num
        def wrapper():
-            reader = mp_reader(self.params)
+            reader = mp_reader(self.params, batch_size)
            batch = []
            for idx, sample in enumerate(reader()):
                img, label = sample
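
The sharding and batch-size logic in this file can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `shard_lines` and `per_device_batch_size` are hypothetical, and the local device count is passed in as an argument instead of being read from `fluid.core.get_cuda_device_count()` or `CPU_NUM` as the diff does.

```python
def shard_lines(full_lines, trainer_id, trainers_num):
    """Return the part of the file list owned by one trainer.

    trainers_num may be 0 for a non-distributed launch, so it is
    clamped to 1 before being used as a stride and a divisor.
    """
    real_trainer_num = max(trainers_num, 1)
    img_per_trainer = len(full_lines) // real_trainer_num
    return full_lines[trainer_id::real_trainer_num][:img_per_trainer]


def per_device_batch_size(total_batch_size, trainers_num, local_device_num):
    """Split the global batch size across devices.

    With a distributed launch the trainer count is used as the device
    count; otherwise the number of local devices is used (hypothetical
    argument standing in for the GPU/CPU query in the diff).
    """
    device_num = trainers_num if trainers_num > 0 else local_device_num
    return total_batch_size // device_num


# Hypothetical label lines, only for demonstration.
lines = ["img_{}.jpg 0".format(i) for i in range(10)]
print(shard_lines(lines, trainer_id=1, trainers_num=4))
print(per_device_batch_size(256, trainers_num=0, local_device_num=8))
```

Clamping with `max(trainers_num, 1)` is what keeps the new default of `0` for `PADDLE_TRAINERS_NUM` from causing a division by zero or a zero slice step.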

--- a/ppcls/modeling/architectures/__init__.py
+++ b/ppcls/modeling/architectures/__init__.py
@@ -42,8 +42,9 @@ from .res2net_vd import Res2Net50_vd_48w_2s, Res2Net50_vd_26w_4s, Res2Net50_vd_1
 from .hrnet import HRNet_W18_C, HRNet_W30_C, HRNet_W32_C, HRNet_W40_C, HRNet_W44_C, HRNet_W48_C, HRNet_W60_C, HRNet_W64_C, SE_HRNet_W18_C, SE_HRNet_W30_C, SE_HRNet_W32_C, SE_HRNet_W40_C, SE_HRNet_W44_C, SE_HRNet_W48_C, SE_HRNet_W60_C, SE_HRNet_W64_C
 from .darts_gs import DARTS_GS_6M, DARTS_GS_4M
 from .resnet_acnet import ResNet18_ACNet, ResNet34_ACNet, ResNet50_ACNet, ResNet101_ACNet, ResNet152_ACNet
+from .ghostnet import GhostNet_x0_5, GhostNet_x1_0, GhostNet_x1_3
 # distillation model
 from .distillation_models import ResNet50_vd_distill_MobileNetV3_large_x1_0, ResNeXt101_32x16d_wsl_distill_ResNet50_vd
 from .csp_resnet import CSPResNet50_leaky
\ No newline at end of file
--- a/ppcls/modeling/architectures/efficientnet.py
+++ b/ppcls/modeling/architectures/efficientnet.py
@@ -383,7 +383,9 @@ class EfficientNet():
            use_bias=True,
            padding_type=self.padding_type,
            name=name + '_se_expand')
-        se_out = inputs * fluid.layers.sigmoid(x_squeezed)
+        #se_out = inputs * fluid.layers.sigmoid(x_squeezed)
+        se_out = fluid.layers.elementwise_mul(
+            inputs, fluid.layers.sigmoid(x_squeezed), axis=-1)
        return se_out
    def extract_features(self, inputs, is_test):
@@ -467,8 +469,8 @@ class BlockDecoder(object):
        # Check stride
        cond_1 = ('s' in options and len(options['s']) == 1)
-        cond_2 = ((len(options['s']) == 2)
-                  and (options['s'][0] == options['s'][1]))
+        cond_2 = ((len(options['s']) == 2) and
+                  (options['s'][0] == options['s'][1]))
        assert (cond_1 or cond_2)
        return BlockArgs(

--- a/ppcls/modeling/architectures/ghostnet.py
+++ b/ppcls/modeling/architectures/ghostnet.py
@@ -276,4 +276,4 @@ def GhostNet_x1_0():
 def GhostNet_x1_3():
    model = GhostNet(scale=1.3)
    return model
\ No newline at end of file
--- a/ppcls/modeling/architectures/hrnet.py
+++ b/ppcls/modeling/architectures/hrnet.py
-#copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
+# copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
 #
-#Licensed under the Apache License, Version 2.0 (the "License");
+# Licensed under the Apache License, Version 2.0 (the "License");
-#you may not use this file except in compliance with the License.
+# you may not use this file except in compliance with the License.
-#You may obtain a copy of the License at
+# You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-#Unless required by applicable law or agreed to in writing, software
+# Unless required by applicable law or agreed to in writing, software
-#distributed under the License is distributed on an "AS IS" BASIS,
+# distributed under the License is distributed on an "AS IS" BASIS,
-#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#See the License for the specific language governing permissions and
+# See the License for the specific language governing permissions and
-#limitations under the License.
+# limitations under the License.
 from __future__ import absolute_import
 from __future__ import division
@@ -74,7 +74,7 @@ class HRNet():
        tr3 = self.transition_layer(st3, channels_3, channels_4, name='tr3')
        st4 = self.stage(tr3, num_modules_4, channels_4, name='st4')
-        #classification
+        # classification
        last_cls = self.last_cls_out(x=st4, name='cls_head')
        y = last_cls[0]
        last_num_filters = [256, 512, 1024]
@@ -273,7 +273,7 @@ class HRNet():
                input=conv,
                num_channels=num_filters,
                reduction_ratio=16,
-                name=name + '_fc')
+                name="fc" + name)
        return fluid.layers.elementwise_add(x=residual, y=conv, act='relu')
    def bottleneck_block(self,
@@ -312,7 +312,7 @@ class HRNet():
                input=conv,
                num_channels=num_filters * 4,
                reduction_ratio=16,
-                name=name + '_fc')
+                name="fc" + name)
        return fluid.layers.elementwise_add(x=residual, y=conv, act='relu')
    def squeeze_excitation(self,
@@ -325,7 +325,7 @@ class HRNet():
        stdv = 1.0 / math.sqrt(pool.shape[1] * 1.0)
        squeeze = fluid.layers.fc(
            input=pool,
-            size=num_channels / reduction_ratio,
+            size=int(num_channels / reduction_ratio),
            act='relu',
            param_attr=fluid.param_attr.ParamAttr(
                initializer=fluid.initializer.Uniform(-stdv, stdv),
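
The `int(...)` cast in `squeeze_excitation` matters because `/` is true division in Python 3 (and in this file, which imports `division` from `__future__`), so the quotient is a float even when it divides evenly. A minimal sketch, where `se_hidden_size` is a hypothetical helper and not part of the diff:

```python
def se_hidden_size(num_channels, reduction_ratio=16):
    """Width of the squeeze FC layer, forced to an integer."""
    # True division yields a float such as 16.0; a layer size must be an int.
    return int(num_channels / reduction_ratio)


print(type(256 / 16).__name__)
print(se_hidden_size(256))
```

Floor division (`num_channels // reduction_ratio`) would be an equivalent alternative for non-negative inputs.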

--- a/ppcls/modeling/architectures/resnet_vd.py
+++ b/ppcls/modeling/architectures/resnet_vd.py
-#copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
+# copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
 #
-#Licensed under the Apache License, Version 2.0 (the "License");
+# Licensed under the Apache License, Version 2.0 (the "License");
-#you may not use this file except in compliance with the License.
+# you may not use this file except in compliance with the License.
-#You may obtain a copy of the License at
+# You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-#Unless required by applicable law or agreed to in writing, software
+# Unless required by applicable law or agreed to in writing, software
-#distributed under the License is distributed on an "AS IS" BASIS,
+# distributed under the License is distributed on an "AS IS" BASIS,
-#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#See the License for the specific language governing permissions and
+# See the License for the specific language governing permissions and
-#limitations under the License.
+# limitations under the License.
 from __future__ import absolute_import
 from __future__ import division
@@ -49,7 +49,8 @@ class ResNet():
        layers = self.layers
        supported_layers = [18, 34, 50, 101, 152, 200]
        assert layers in supported_layers, \
-            "supported layers are {} but input layer is {}".format(supported_layers, layers)
+            "supported layers are {} but input layer is {}".format(
+                supported_layers, layers)
        if layers == 18:
            depth = [2, 2, 2, 2]
@@ -159,7 +160,9 @@ class ResNet():
            padding=(filter_size - 1) // 2,
            groups=groups,
            act=None,
-            param_attr=ParamAttr(name=name + "_weights" + self.postfix_name),
+            param_attr=ParamAttr(
+                name=name + "_weights" + self.postfix_name,
+                learning_rate=lr_mult),
            bias_attr=False)
        if name == "conv1":
            bn_name = "bn_" + name
@@ -168,8 +171,12 @@ class ResNet():
        return fluid.layers.batch_norm(
            input=conv,
            act=act,
-            param_attr=ParamAttr(name=bn_name + '_scale' + self.postfix_name),
-            bias_attr=ParamAttr(bn_name + '_offset' + self.postfix_name),
+            param_attr=ParamAttr(
+                name=bn_name + '_scale' + self.postfix_name,
+                learning_rate=lr_mult),
+            bias_attr=ParamAttr(
+                bn_name + '_offset' + self.postfix_name,
+                learning_rate=lr_mult),
            moving_mean_name=bn_name + '_mean' + self.postfix_name,
            moving_variance_name=bn_name + '_variance' + self.postfix_name)

--- a/ppcls/modeling/loss.py
+++ b/ppcls/modeling/loss.py
@@ -41,6 +41,7 @@ class Loss(object):
            label=one_hot_target, epsilon=self._epsilon, dtype="float32")
        soft_target = fluid.layers.reshape(
            soft_target, shape=[-1, self._class_dim])
+        soft_target.stop_gradient = True
        return soft_target
    def _crossentropy(self, input, target):

--- a/ppcls/optimizer/learning_rate.py
+++ b/ppcls/optimizer/learning_rate.py
-#copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
+# copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
 #
-#Licensed under the Apache License, Version 2.0 (the "License");
+# Licensed under the Apache License, Version 2.0 (the "License");
-#you may not use this file except in compliance with the License.
+# you may not use this file except in compliance with the License.
-#You may obtain a copy of the License at
+# You may obtain a copy of the License at
 #
 #    http://www.apache.org/licenses/LICENSE-2.0
 #
-#Unless required by applicable law or agreed to in writing, software
+# Unless required by applicable law or agreed to in writing, software
-#distributed under the License is distributed on an "AS IS" BASIS,
+# distributed under the License is distributed on an "AS IS" BASIS,
-#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-#See the License for the specific language governing permissions and
+# See the License for the specific language governing permissions and
-#limitations under the License.
+# limitations under the License.
 from __future__ import absolute_import
 from __future__ import division
@@ -130,7 +130,7 @@ class CosineWarmup(object):
        with fluid.layers.control_flow.Switch() as switch:
            with switch.case(epoch < self.warmup_epoch):
                decayed_lr = self.lr * \
-                        (global_step / (self.step_each_epoch * self.warmup_epoch))
+                    (global_step / (self.step_each_epoch * self.warmup_epoch))
                fluid.layers.tensor.assign(
                    input=decayed_lr, output=learning_rate)
            with switch.default():
@@ -145,6 +145,65 @@ class CosineWarmup(object):
        return learning_rate
+class ExponentialWarmup(object):
+    """
+    Exponential learning rate decay with warmup
+    [0, warmup_epoch): linear warmup
+    [warmup_epoch, epochs): Exponential decay
+    Args:
+        lr(float): initial learning rate
+        step_each_epoch(int): steps each epoch
+        decay_epochs(float): decay epochs
+        decay_rate(float): decay rate
+        warmup_epoch(int): epoch num of warmup
+    """
+    def __init__(self,
+                 lr,
+                 step_each_epoch,
+                 decay_epochs=2.4,
+                 decay_rate=0.97,
+                 warmup_epoch=5,
+                 **kwargs):
+        super(ExponentialWarmup, self).__init__()
+        self.lr = lr
+        self.step_each_epoch = step_each_epoch
+        self.decay_epochs = decay_epochs * self.step_each_epoch
+        self.decay_rate = decay_rate
+        self.warmup_epoch = fluid.layers.fill_constant(
+            shape=[1],
+            value=float(warmup_epoch),
+            dtype='float32',
+            force_cpu=True)
+    def __call__(self):
+        global_step = _decay_step_counter()
+        learning_rate = fluid.layers.tensor.create_global_var(
+            shape=[1],
+            value=0.0,
+            dtype='float32',
+            persistable=True,
+            name="learning_rate")
+        epoch = ops.floor(global_step / self.step_each_epoch)
+        with fluid.layers.control_flow.Switch() as switch:
+            with switch.case(epoch < self.warmup_epoch):
+                decayed_lr = self.lr * \
+                    (global_step / (self.step_each_epoch * self.warmup_epoch))
+                fluid.layers.tensor.assign(
+                    input=decayed_lr, output=learning_rate)
+            with switch.default():
+                rest_step = global_step - self.warmup_epoch * self.step_each_epoch
+                div_res = ops.floor(rest_step / self.decay_epochs)
+                decayed_lr = self.lr * (self.decay_rate**div_res)
+                fluid.layers.tensor.assign(
+                    input=decayed_lr, output=learning_rate)
+        return learning_rate
 class LearningRateBuilder():
    """
    Build learning rate variable
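
The schedule that `ExponentialWarmup` builds as a Paddle graph can be written as an ordinary function of the step counter. The sketch below mirrors the two branches of the `Switch` in the diff; the function name `exponential_warmup_lr` and the sample values are assumptions made for illustration.

```python
import math


def exponential_warmup_lr(global_step,
                          lr,
                          step_each_epoch,
                          decay_epochs=2.4,
                          decay_rate=0.97,
                          warmup_epoch=5):
    """Learning rate at a given step: linear warmup, then staircase decay."""
    epoch = math.floor(global_step / step_each_epoch)
    if epoch < warmup_epoch:
        # Linear ramp from 0 to lr over the warmup epochs.
        return lr * (global_step / (step_each_epoch * warmup_epoch))
    # After warmup, multiply by decay_rate once every decay_epochs epochs.
    decay_steps = decay_epochs * step_each_epoch
    rest_step = global_step - warmup_epoch * step_each_epoch
    div_res = math.floor(rest_step / decay_steps)
    return lr * (decay_rate ** div_res)


# Hypothetical settings: 100 steps per epoch, decay every 2 epochs.
for step in (0, 250, 500, 700):
    print(step, exponential_warmup_lr(step, lr=0.1, step_each_epoch=100,
                                      decay_epochs=2))
```

Because of the `floor`, the decay is a staircase: the rate stays constant within each block of `decay_epochs` epochs and drops by a factor of `decay_rate` at the block boundary.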

--- a/ppcls/utils/config.py
+++ b/ppcls/utils/config.py
@@ -64,14 +64,18 @@ def print_dict(d, delimiter=0):
    placeholder = "-" * 60
    for k, v in sorted(d.items()):
        if isinstance(v, dict):
-            logger.info("{}{} : ".format(delimiter * " ", logger.coloring(k, "HEADER")))
+            logger.info("{}{} : ".format(delimiter * " ",
+                                         logger.coloring(k, "HEADER")))
            print_dict(v, delimiter + 4)
        elif isinstance(v, list) and len(v) >= 1 and isinstance(v[0], dict):
-            logger.info("{}{} : ".format(delimiter * " ", logger.coloring(str(k),"HEADER")))
+            logger.info("{}{} : ".format(delimiter * " ",
+                                         logger.coloring(str(k), "HEADER")))
            for value in v:
                print_dict(value, delimiter + 4)
        else:
-            logger.info("{}{} : {}".format(delimiter * " ", logger.coloring(k,"HEADER"), logger.coloring(v,"OKGREEN")))
+            logger.info("{}{} : {}".format(delimiter * " ",
+                                           logger.coloring(k, "HEADER"),
+                                           logger.coloring(v, "OKGREEN")))
        if k.isupper():
            logger.info(placeholder)
@@ -95,7 +99,9 @@ def check_config(config):
    check.check_version()
    mode = config.get('mode', 'train')
-    check.check_gpu()
+    use_gpu = config.get("use_gpu", True)
+    if use_gpu:
+        check.check_gpu()
    architecture = config.get('ARCHITECTURE')
    check.check_architecture(architecture)

--- a/ppcls/utils/logger.py
+++ b/ppcls/utils/logger.py
@@ -19,9 +19,10 @@ import datetime
 from imp import reload
 reload(logging)
-logging.basicConfig(level=logging.INFO, 
-                    format="%(asctime)s %(levelname)s: %(message)s",
-                    datefmt = "%Y-%m-%d %H:%M:%S")
+logging.basicConfig(
+    level=logging.INFO,
+    format="%(asctime)s %(levelname)s: %(message)s",
+    datefmt="%Y-%m-%d %H:%M:%S")
 def time_zone(sec, fmt):
@@ -32,22 +33,22 @@ def time_zone(sec, fmt):
 logging.Formatter.converter = time_zone
 _logger = logging.getLogger(__name__)
-Color= {
-        'RED' : '\033[31m' ,
-        'HEADER' : '\033[35m' , # deep purple
-        'PURPLE' : '\033[95m' ,# purple
-        'OKBLUE' : '\033[94m' ,
-        'OKGREEN' : '\033[92m' ,
-        'WARNING' : '\033[93m' ,
-        'FAIL' : '\033[91m' ,
-        'ENDC' : '\033[0m' }
+Color = {
+    'RED': '\033[31m',
+    'HEADER': '\033[35m',  # deep purple
+    'PURPLE': '\033[95m',  # purple
+    'OKBLUE': '\033[94m',
+    'OKGREEN': '\033[92m',
+    'WARNING': '\033[93m',
+    'FAIL': '\033[91m',
+    'ENDC': '\033[0m'
+}
 def coloring(message, color="OKGREEN"):
    assert color in Color.keys()
    if os.environ.get('PADDLECLAS_COLORING', False):
-        return Color[color]+str(message)+Color["ENDC"]
+        return Color[color] + str(message) + Color["ENDC"]
    else:
        return message
@@ -80,6 +81,17 @@ def error(fmt, *args):
    _logger.error(coloring(fmt, "FAIL"), *args)
+def scaler(name, value, step, writer):
+    """
+    Record a scalar value so that VisualDL can draw it as a curve.
+    Usage: Install visualdl: pip3 install visualdl==2.0.0b4
+           and then:
+           visualdl --logdir ./scalar --host 0.0.0.0 --port 8830
+           to preview the loss curve in real time.
+    """
+    writer.add_scalar(name, value, step)
 def advertise():
    """
    Show the advertising message like the following:
@@ -99,12 +111,13 @@ def advertise():
    website = "https://github.com/PaddlePaddle/PaddleClas"
    AD_LEN = 6 + len(max([copyright, ad, website], key=len))
-    info(coloring("\n{0}\n{1}\n{2}\n{3}\n{4}\n{5}\n{6}\n{7}\n".format(
-        "=" * (AD_LEN + 4),
-        "=={}==".format(copyright.center(AD_LEN)),
-        "=" * (AD_LEN + 4),
-        "=={}==".format(' ' * AD_LEN),
-        "=={}==".format(ad.center(AD_LEN)),
-        "=={}==".format(' ' * AD_LEN),
-        "=={}==".format(website.center(AD_LEN)),
-        "=" * (AD_LEN + 4), ),"RED"))
+    info(
+        coloring("\n{0}\n{1}\n{2}\n{3}\n{4}\n{5}\n{6}\n{7}\n".format(
+            "=" * (AD_LEN + 4),
+            "=={}==".format(copyright.center(AD_LEN)),
+            "=" * (AD_LEN + 4),
+            "=={}==".format(' ' * AD_LEN),
+            "=={}==".format(ad.center(AD_LEN)),
+            "=={}==".format(' ' * AD_LEN),
+            "=={}==".format(website.center(AD_LEN)),
+            "=" * (AD_LEN + 4), ), "RED"))
--- a/ppcls/utils/pretrained.list
+++ b/ppcls/utils/pretrained.list
@@ -12,6 +12,8 @@ ResNet101_vd
 ResNet152_vd
 ResNet200_vd
 ResNet50_vd_ssld
+ResNet50_vd_ssld_v2
+Fix_ResNet50_vd_ssld_v2
 ResNet101_vd_ssld
 MobileNetV3_large_x0_35
 MobileNetV3_large_x0_5

--- a/requirements.txt
+++ b/requirements.txt
@@ -3,3 +3,4 @@ opencv-python
 pillow
 tqdm
 PyYAML
+visualdl >= 2.0.0b
--- a/tools/ema.py
+++ b/tools/ema.py
+# copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import numpy as np
+import paddle
+import paddle.fluid as fluid
+from paddle.fluid.wrapped_decorator import signature_safe_contextmanager
+from paddle.fluid.framework import Program, program_guard, name_scope, default_main_program
+from paddle.fluid import unique_name, layers
+class ExponentialMovingAverage(object):
+    def __init__(self,
+                 decay=0.999,
+                 thres_steps=None,
+                 zero_debias=False,
+                 name=None):
+        self._decay = decay
+        self._thres_steps = thres_steps
+        self._name = name if name is not None else ''
+        self._decay_var = self._get_ema_decay()
+        self._params_tmps = []
+        for param in default_main_program().global_block().all_parameters():
+            if param.do_model_average != False:
+                tmp = param.block.create_var(
+                    name=unique_name.generate(".".join(
+                        [self._name + param.name, 'ema_tmp'])),
+                    dtype=param.dtype,
+                    persistable=False,
+                    stop_gradient=True)
+                self._params_tmps.append((param, tmp))
+        self._ema_vars = {}
+        for param, tmp in self._params_tmps:
+            with param.block.program._optimized_guard(
+                [param, tmp]), name_scope('moving_average'):
+                self._ema_vars[param.name] = self._create_ema_vars(param)
+        self.apply_program = Program()
+        block = self.apply_program.global_block()
+        with program_guard(main_program=self.apply_program):
+            decay_pow = self._get_decay_pow(block)
+            for param, tmp in self._params_tmps:
+                param = block._clone_variable(param)
+                tmp = block._clone_variable(tmp)
+                ema = block._clone_variable(self._ema_vars[param.name])
+                layers.assign(input=param, output=tmp)
+                # bias correction
+                if zero_debias:
+                    ema = ema / (1.0 - decay_pow)
+                layers.assign(input=ema, output=param)
+        self.restore_program = Program()
+        block = self.restore_program.global_block()
+        with program_guard(main_program=self.restore_program):
+            for param, tmp in self._params_tmps:
+                tmp = block._clone_variable(tmp)
+                param = block._clone_variable(param)
+                layers.assign(input=tmp, output=param)
+    def _get_ema_decay(self):
+        with default_main_program()._lr_schedule_guard():
+            decay_var = layers.tensor.create_global_var(
+                shape=[1],
+                value=self._decay,
+                dtype='float32',
+                persistable=True,
+                name="scheduled_ema_decay_rate")
+            if self._thres_steps is not None:
+                decay_t = (self._thres_steps + 1.0) / (self._thres_steps + 10.0)
+                with layers.control_flow.Switch() as switch:
+                    with switch.case(decay_t < self._decay):
+                        layers.tensor.assign(decay_t, decay_var)
+                    with switch.default():
+                        layers.tensor.assign(
+                            np.array(
+                                [self._decay], dtype=np.float32),
+                            decay_var)
+        return decay_var
+    def _get_decay_pow(self, block):
+        global_steps = layers.learning_rate_scheduler._decay_step_counter()
+        decay_var = block._clone_variable(self._decay_var)
+        decay_pow_acc = layers.elementwise_pow(decay_var, global_steps + 1)
+        return decay_pow_acc
+    def _create_ema_vars(self, param):
+        param_ema = layers.create_global_var(
+            name=unique_name.generate(self._name + param.name + '_ema'),
+            shape=param.shape,
+            value=0.0,
+            dtype=param.dtype,
+            persistable=True)
+        return param_ema
+    def update(self):
+        """
+        Update Exponential Moving Average. Should only call this method in
+        train program.
+        """
+        param_master_emas = []
+        for param, tmp in self._params_tmps:
+            with param.block.program._optimized_guard(
+                [param, tmp]), name_scope('moving_average'):
+                param_ema = self._ema_vars[param.name]
+                if param.name + '.master' in self._ema_vars:
+                    master_ema = self._ema_vars[param.name + '.master']
+                    param_master_emas.append([param_ema, master_ema])
+                else:
+                    ema_t = param_ema * self._decay_var + param * (
+                        1 - self._decay_var)
+                    layers.assign(input=ema_t, output=param_ema)
+        # for fp16 params
+        for param_ema, master_ema in param_master_emas:
+            default_main_program().global_block().append_op(
+                type="cast",
+                inputs={"X": master_ema},
+                outputs={"Out": param_ema},
+                attrs={
+                    "in_dtype": master_ema.dtype,
+                    "out_dtype": param_ema.dtype
+                })
+    @signature_safe_contextmanager
+    def apply(self, executor, need_restore=True):
+        """
+        Apply moving average to parameters for evaluation.
+        Args:
+            executor (Executor): The Executor to execute applying.
+            need_restore (bool): Whether to restore parameters after applying.
+        """
+        executor.run(self.apply_program)
+        try:
+            yield
+        finally:
+            if need_restore:
+                self.restore(executor)
+    def restore(self, executor):
+        """Restore parameters.
+        Args:
+            executor (Executor): The Executor to execute restoring.
+        """
+        executor.run(self.restore_program)
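
The update rule in `update()` and the optional bias correction in `apply_program` can be illustrated on a single scalar. This is a simplified sketch, not the Paddle implementation: `ScalarEMA` is a hypothetical class, and it assumes the bias-correction exponent equals the number of updates applied so far, which stands in for the global step counter used in the diff.

```python
class ScalarEMA(object):
    """Exponential moving average of one scalar, starting from zero."""

    def __init__(self, decay=0.999):
        self.decay = decay
        self.value = 0.0  # zero-initialized, like the EMA variables in the diff
        self.num_updates = 0

    def update(self, param):
        # ema_t = ema * decay + param * (1 - decay)
        self.value = self.value * self.decay + param * (1 - self.decay)
        self.num_updates += 1

    def debiased(self):
        # Undo the pull toward the zero initial value (zero_debias=True).
        if self.num_updates == 0:
            raise ValueError("call update() at least once before debiased()")
        return self.value / (1.0 - self.decay ** self.num_updates)


ema = ScalarEMA(decay=0.9)
for _ in range(3):
    ema.update(2.0)
print(ema.value, ema.debiased())
```

With a constant input the raw average is still well below the input after a few steps, while the debiased value recovers it, which is the motivation for the `zero_debias` option.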
--- a/tools/ema_clean.py
+++ b/tools/ema_clean.py
+#copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
+#
+#Licensed under the Apache License, Version 2.0 (the "License");
+#you may not use this file except in compliance with the License.
+#You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+import argparse
+import functools
+import shutil
+import sys
+def main():
+    """
+    Usage: when training with the use_ema flag, clean the saved model before
+    evaluating the EMA model.
+    To generate the cleaned model:
+        python ema_clean.py ema_model_dir cleaned_model_dir
+    """
+    # Argument order follows the usage line above: source dir, then target dir.
+    ema_model_dir = sys.argv[1]
+    cleaned_model_dir = sys.argv[2]
+    if not os.path.exists(cleaned_model_dir):
+        os.makedirs(cleaned_model_dir)
+    items = os.listdir(ema_model_dir)
+    for item in items:
+        if 'ema' in item:
+            item_clean = item.replace('_ema_0', '')
+            shutil.copyfile(os.path.join(ema_model_dir, item),
+                            os.path.join(cleaned_model_dir, item_clean))
+        elif 'mean' in item or 'variance' in item:
+            shutil.copyfile(os.path.join(ema_model_dir, item),
+                            os.path.join(cleaned_model_dir, item))
+if __name__ == '__main__':
+    main()
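Review note: the renaming rule in `ema_clean.py` can be checked in isolation. This is a small sketch under assumptions: `clean_name` is a hypothetical helper that does not exist in the diff, and the file names are made up for illustration. It applies the same three cases as the loop above: EMA files lose the `_ema_0` suffix, batch-norm statistics (`mean`, `variance`) are copied under their own name, and every other file is skipped.

```python
def clean_name(item):
    """Return the cleaned file name, or None if the file would be skipped."""
    if 'ema' in item:
        # EMA copy of a parameter: drop the suffix so it loads as the parameter.
        return item.replace('_ema_0', '')
    if 'mean' in item or 'variance' in item:
        # Batch-norm statistics have no EMA copy and are kept as they are.
        return item
    # Raw (non-averaged) parameters are not copied to the cleaned directory.
    return None


# Hypothetical file names, for illustration only.
names = ['conv1_weights_ema_0', 'conv1_weights', 'bn1_mean', 'bn1_variance']
cleaned = [clean_name(n) for n in names]
```

Note that the `'ema' in item` test matches any file name containing `ema`, so a parameter whose own name happens to contain that substring would also take the first branch.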
--- a/tools/eval_multi_platform.py
+++ b/tools/eval_multi_platform.py
+# copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import os
+import argparse
+import paddle.fluid as fluid
+import program
+from ppcls.data import Reader
+from ppcls.utils.config import get_config
+from ppcls.utils.save_load import init_model
+def parse_args():
+    parser = argparse.ArgumentParser("PaddleClas eval script")
+    parser.add_argument(
+        '-c',
+        '--config',
+        type=str,
+        default='./configs/eval.yaml',
+        help='config file path')
+    parser.add_argument(
+        '-o',
+        '--override',
+        action='append',
+        default=[],
+        help='config options to be overridden')
+    args = parser.parse_args()
+    return args
+def main(args):
+    config = get_config(args.config, overrides=args.override, show=True)
+    use_gpu = config.get("use_gpu", True)
+    places = fluid.cuda_places() if use_gpu else fluid.cpu_places()
+    startup_prog = fluid.Program()
+    valid_prog = fluid.Program()
+    valid_dataloader, valid_fetchs = program.build(
+        config, valid_prog, startup_prog, is_train=False, is_distributed=False)
+    valid_prog = valid_prog.clone(for_test=True)
+    exe = fluid.Executor(places[0])
+    exe.run(startup_prog)
+    init_model(config, valid_prog, exe)
+    valid_reader = Reader(config, 'valid')()
+    valid_dataloader.set_sample_list_generator(valid_reader, places)
+    compiled_valid_prog = program.compile(config, valid_prog)
+    program.run(valid_dataloader, exe, compiled_valid_prog, valid_fetchs, -1,
+                'eval')
+if __name__ == '__main__':
+    args = parse_args()
+    main(args)
--- a/tools/program.py
+++ b/tools/program.py
@@ -18,6 +18,7 @@ from __future__ import print_function
 import os
 import time
+import numpy as np
 from collections import OrderedDict
@@ -36,6 +37,8 @@ from ppcls.utils import logger
 from paddle.fluid.incubate.fleet.collective import fleet
 from paddle.fluid.incubate.fleet.collective import DistributedStrategy
+from ema import ExponentialMovingAverage
 def create_feeds(image_shape, use_mix=None):
    """
@@ -86,7 +89,7 @@ def create_dataloader(feeds):
    return dataloader
-def create_model(architecture, image, classes_num):
+def create_model(architecture, image, classes_num, is_train):
    """
    Create a model
@@ -101,6 +104,8 @@ def create_model(architecture, image, classes_num):
    """
    name = architecture["name"]
    params = architecture.get("params", {})
+    if "is_test" in params:
+        params['is_test'] = not is_train
    model = architectures.__dict__[name](**params)
    out = model.net(input=image, class_dim=classes_num)
    return out
@@ -310,7 +315,7 @@ def mixed_precision_optimizer(config, optimizer):
    return optimizer
-def build(config, main_prog, startup_prog, is_train=True):
+def build(config, main_prog, startup_prog, is_train=True, is_distributed=True):
    """
    Build a program using a model and an optimizer
        1. create feeds
@@ -324,6 +329,7 @@ def build(config, main_prog, startup_prog, is_train=True):
        main_prog(): main program
        startup_prog(): startup program
        is_train(bool): train or valid
+        is_distributed(bool): whether to use distributed training method
    Returns:
        dataloader(): a bridge between the model and the data
@@ -336,7 +342,7 @@ def build(config, main_prog, startup_prog, is_train=True):
            feeds = create_feeds(config.image_shape, use_mix=use_mix)
            dataloader = create_dataloader(feeds.values())
            out = create_model(config.ARCHITECTURE, feeds['image'],
-                               config.classes_num)
+                               config.classes_num, is_train)
            fetchs = create_fetchs(
                out,
                feeds,
@@ -352,13 +358,22 @@ def build(config, main_prog, startup_prog, is_train=True):
                fetchs['lr'] = (lr, AverageMeter('lr', 'f', need_avg=False))
                optimizer = mixed_precision_optimizer(config, optimizer)
-                optimizer = dist_optimizer(config, optimizer)
+                if is_distributed:
+                    optimizer = dist_optimizer(config, optimizer)
                optimizer.minimize(fetchs['loss'][0])
+                if config.get('use_ema'):
+                    global_steps = fluid.layers.learning_rate_scheduler._decay_step_counter(
+                    )
+                    ema = ExponentialMovingAverage(
+                        config.get('ema_decay'), thres_steps=global_steps)
+                    ema.update()
+                    return dataloader, fetchs, ema
    return dataloader, fetchs
-def compile(config, program, loss_name=None):
+def compile(config, program, loss_name=None, share_prog=None):
    """
    Compile the program
@@ -366,6 +381,7 @@ def compile(config, program, loss_name=None):
        config(dict): config
        program(): the program which is wrapped by
        loss_name(str): loss name
+        share_prog(): the shared program, used for evaluation during training
    Returns:
        compiled_program(): a compiled program
@@ -377,6 +393,7 @@ def compile(config, program, loss_name=None):
    exec_strategy.num_iteration_per_drop_scope = 10
    compiled_program = fluid.CompiledProgram(program).with_data_parallel(
+        share_vars_from=share_prog,
        loss_name=loss_name,
        build_strategy=build_strategy,
        exec_strategy=exec_strategy)
@@ -384,7 +401,16 @@ def compile(config, program, loss_name=None):
    return compiled_program
-def run(dataloader, exe, program, fetchs, epoch=0, mode='train'):
+total_step = 0
+def run(dataloader,
+        exe,
+        program,
+        fetchs,
+        epoch=0,
+        mode='train',
+        vdl_writer=None):
    """
    Feed data to the model and fetch the measures and loss
@@ -409,9 +435,13 @@ def run(dataloader, exe, program, fetchs, epoch=0, mode='train'):
        batch_time.update(time.time() - tic)
        tic = time.time()
        for i, m in enumerate(metrics):
-            metric_list[i].update(m[0], len(batch[0]))
+            metric_list[i].update(np.mean(m), len(batch[0]))
        fetchs_str = ''.join([str(m.value) + ' '
                              for m in metric_list] + [batch_time.value]) + 's'
+        if vdl_writer:
+            global total_step
+            logger.scaler('loss', metrics[0][0], total_step, vdl_writer)
+            total_step += 1
        if mode == 'eval':
            logger.info("{:s} step:{:<4d} {:s}s".format(mode, idx, fetchs_str))
        else:
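Review note: the change from `m[0]` to `np.mean(m)` matters when the program is compiled with `with_data_parallel`, where a fetched metric is expected to hold one value per device rather than a single scalar. The sketch below uses made-up numbers for a hypothetical 4-device run to show the difference. It does not call Paddle.

```python
import numpy as np

# Hypothetical fetch result: one top-1 accuracy value per device.
m = np.array([0.70, 0.74, 0.72, 0.76], dtype=np.float32)

first_only = m[0]       # old behaviour: only the first device is recorded
averaged = np.mean(m)   # new behaviour: average over all devices
```

On a single device the two expressions give the same value, so the change should not affect single-card runs.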

--- a/tools/train.py
+++ b/tools/train.py
@@ -38,6 +38,11 @@ def parse_args():
        type=str,
        default='configs/ResNet/ResNet50.yaml',
        help='config file path')
+    parser.add_argument(
+        '--vdl_dir',
+        type=str,
+        default=None,
+        help='VisualDL logging directory for image.')
    parser.add_argument(
        '-o',
        '--override',
@@ -64,8 +69,12 @@ def main(args):
    best_top1_acc = 0.0  # best top1 acc record
-    train_dataloader, train_fetchs = program.build(
+    if not config.get('use_ema'):
-        config, train_prog, startup_prog, is_train=True)
+        train_dataloader, train_fetchs = program.build(
+            config, train_prog, startup_prog, is_train=True)
+    else:
+        train_dataloader, train_fetchs, ema = program.build(
+            config, train_prog, startup_prog, is_train=True)
    if config.validate:
        valid_prog = fluid.Program()
@@ -75,11 +84,11 @@ def main(args):
        valid_prog = valid_prog.clone(for_test=True)
    # create the "Executor" with the statement of which place
-    exe = fluid.Executor(place=place)
+    exe = fluid.Executor(place)
-    # only run startup_prog once to init
+    # Parameter initialization
    exe.run(startup_prog)
-    # load model from checkpoint or pretrained model
+    # load model from 1. checkpoint to resume training, 2. pretrained model to finetune
    init_model(config, train_prog, exe)
    train_reader = Reader(config, 'train')()
@@ -91,25 +100,41 @@ def main(args):
        compiled_valid_prog = program.compile(config, valid_prog)
    compiled_train_prog = fleet.main_program
+    if args.vdl_dir:
+        from visualdl import LogWriter
+        vdl_writer = LogWriter(args.vdl_dir)
+    else:
+        vdl_writer = None
    for epoch_id in range(config.epochs):
        # 1. train with train dataset
        program.run(train_dataloader, exe, compiled_train_prog, train_fetchs,
-                    epoch_id, 'train')
+                    epoch_id, 'train', vdl_writer)
        if int(os.getenv("PADDLE_TRAINER_ID", 0)) == 0:
            # 2. validate with validate dataset
            if config.validate and epoch_id % config.valid_interval == 0:
+                if config.get('use_ema'):
+                    logger.info(logger.coloring("EMA validate start..."))
+                    with ema.apply(exe):
+                        top1_acc = program.run(valid_dataloader, exe,
+                                               compiled_valid_prog,
+                                               valid_fetchs, epoch_id, 'valid')
+                    logger.info(logger.coloring("EMA validate over!"))
                top1_acc = program.run(valid_dataloader, exe,
                                       compiled_valid_prog, valid_fetchs,
                                       epoch_id, 'valid')
                if top1_acc > best_top1_acc:
                    best_top1_acc = top1_acc
-                    message = "The best top1 acc {:.5f}, in epoch: {:d}".format(best_top1_acc, epoch_id)
+                    message = "The best top1 acc {:.5f}, in epoch: {:d}".format(
+                        best_top1_acc, epoch_id)
                    logger.info("{:s}".format(logger.coloring(message, "RED")))
-                    if epoch_id % config.save_interval==0:
+                    if epoch_id % config.save_interval == 0:
                        model_path = os.path.join(config.model_save_dir,
-                                              config.ARCHITECTURE["name"])
+                                                  config.ARCHITECTURE["name"])
-                        save_model(train_prog, model_path, "best_model_in_epoch_"+str(epoch_id))
+                        save_model(train_prog, model_path, "best_model")
            # 3. save the persistable model
            if epoch_id % config.save_interval == 0:

--- a/tools/train_multi_platform.py
+++ b/tools/train_multi_platform.py
+# copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+import argparse
+import os
+import paddle.fluid as fluid
+from ppcls.data import Reader
+from ppcls.utils.config import get_config
+from ppcls.utils.save_load import init_model, save_model
+from ppcls.utils import logger
+import program
+def parse_args():
+    parser = argparse.ArgumentParser("PaddleClas train script")
+    parser.add_argument(
+        '-c',
+        '--config',
+        type=str,
+        default='configs/ResNet/ResNet50.yaml',
+        help='config file path')
+    parser.add_argument(
+        '--vdl_dir',
+        type=str,
+        default=None,
+        help='VisualDL logging directory for image.')
+    parser.add_argument(
+        '-o',
+        '--override',
+        action='append',
+        default=[],
+        help='config options to be overridden')
+    args = parser.parse_args()
+    return args
+def main(args):
+    config = get_config(args.config, overrides=args.override, show=True)
+    # assign the place
+    use_gpu = config.get("use_gpu", True)
+    places = fluid.cuda_places() if use_gpu else fluid.cpu_places()
+    # startup_prog is used to do some parameter init work,
+    # and train prog is used to hold the network
+    startup_prog = fluid.Program()
+    train_prog = fluid.Program()
+    best_top1_acc = 0.0  # best top1 acc record
+    if not config.get('use_ema'):
+        train_dataloader, train_fetchs = program.build(
+            config,
+            train_prog,
+            startup_prog,
+            is_train=True,
+            is_distributed=False)
+    else:
+        train_dataloader, train_fetchs, ema = program.build(
+            config,
+            train_prog,
+            startup_prog,
+            is_train=True,
+            is_distributed=False)
+    if config.validate:
+        valid_prog = fluid.Program()
+        valid_dataloader, valid_fetchs = program.build(
+            config,
+            valid_prog,
+            startup_prog,
+            is_train=False,
+            is_distributed=False)
+        # clone to prune some content which is irrelevant in valid_prog
+        valid_prog = valid_prog.clone(for_test=True)
+    # create the Executor on the first available place
+    exe = fluid.Executor(places[0])
+    # Parameter initialization
+    exe.run(startup_prog)
+    # load model from 1. checkpoint to resume training, 2. pretrained model to finetune
+    init_model(config, train_prog, exe)
+    train_reader = Reader(config, 'train')()
+    train_dataloader.set_sample_list_generator(train_reader, places)
+    compiled_train_prog = program.compile(config, train_prog,
+                                          train_fetchs['loss'][0].name)
+    if config.validate:
+        valid_reader = Reader(config, 'valid')()
+        valid_dataloader.set_sample_list_generator(valid_reader, places)
+        compiled_valid_prog = program.compile(
+            config, valid_prog, share_prog=compiled_train_prog)
+    if args.vdl_dir:
+        from visualdl import LogWriter
+        vdl_writer = LogWriter(args.vdl_dir)
+    else:
+        vdl_writer = None
+    for epoch_id in range(config.epochs):
+        # 1. train with train dataset
+        program.run(train_dataloader, exe, compiled_train_prog, train_fetchs,
+                    epoch_id, 'train', vdl_writer)
+        # 2. validate with validate dataset
+        if config.validate and epoch_id % config.valid_interval == 0:
+            if config.get('use_ema'):
+                logger.info(logger.coloring("EMA validate start..."))
+                with ema.apply(exe):
+                    top1_acc = program.run(valid_dataloader, exe,
+                                           compiled_valid_prog, valid_fetchs,
+                                           epoch_id, 'valid')
+                logger.info(logger.coloring("EMA validate over!"))
+            top1_acc = program.run(valid_dataloader, exe, compiled_valid_prog,
+                                   valid_fetchs, epoch_id, 'valid')
+            if top1_acc > best_top1_acc:
+                best_top1_acc = top1_acc
+                message = "The best top1 acc {:.5f}, in epoch: {:d}".format(
+                    best_top1_acc, epoch_id)
+                logger.info("{:s}".format(logger.coloring(message, "RED")))
+                if epoch_id % config.save_interval == 0:
+                    model_path = os.path.join(config.model_save_dir,
+                                              config.ARCHITECTURE["name"])
+                    save_model(train_prog, model_path,
+                               "best_model_in_epoch_" + str(epoch_id))
+        # 3. save the persistable model
+        if epoch_id % config.save_interval == 0:
+            model_path = os.path.join(config.model_save_dir,
+                                      config.ARCHITECTURE["name"])
+            save_model(train_prog, model_path, epoch_id)
+if __name__ == '__main__':
+    args = parse_args()
+    main(args)