Develop (#1379)

* cp dev -> 2.3

dfb26b6f · littletomatodonkey · GitHub · 796aa90a · dfb26b6f · dfb26b6f
40 changed files
--- a/README_ch.md
+++ b/README_ch.md
@@ -7,32 +7,26 @@
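 PaddleClas, PaddlePaddle's image recognition suite, is a toolset for image recognition tasks built for industry and academia, helping users train better vision models and put them into production.
 **Recent updates**
- 2021.10.31 Released the [PP-ShiTu technical report](./docs/PP_ShiTu.pdf), improved the documentation, and added a drink recognition demo.
- 2021.10.23 Released the PP-ShiTu image recognition system, which completes image recognition against a gallery of 100,000+ images in 200 ms on CPU.
+- 2021.11.1 Released the [PP-ShiTu technical report](https://arxiv.org/pdf/2111.00775.pdf) and added a drink recognition demo.
+- 2021.10.23 Released PP-ShiTu, a lightweight image recognition system that completes image recognition against a gallery of 100,000+ images in 0.2 s on CPU.
 [Click here](./docs/zh_CN/quick_start/quick_start_recognition.md) to try it now.
- 2021.09.17 Added the PP-LCNet series of models developed in-house by PaddleClas; these models are highly competitive on Intel CPUs. For an introduction to PP-LCNet, see the [paper](https://arxiv.org/pdf/2109.15099.pdf) or the [PP-LCNet model introduction](docs/zh_CN/models/PP-LCNet.md); metrics and pretrained weights can be downloaded [here](docs/zh_CN/ImageNet_models_cn.md).
+- 2021.09.17 Released the PP-LCNet series of ultra-lightweight backbone models. On Intel CPU, single-image inference takes about 5 ms and Top-1 accuracy on ImageNet-1K reaches 80.82%, surpassing ResNet152. For an introduction to PP-LCNet, see the [paper](https://arxiv.org/pdf/2109.15099.pdf) or the [PP-LCNet model introduction](docs/zh_CN/models/PP-LCNet.md); metrics and pretrained weights can be downloaded [here](docs/zh_CN/algorithm_introduction/ImageNet_models.md).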
 - [more](./docs/zh_CN/others/update_history.md)
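 ## Features
- PP-ShiTu lightweight image recognition system: integrates object detection, feature learning, and image retrieval modules, and is broadly applicable to image recognition tasks.
+- PP-ShiTu lightweight image recognition system: integrates object detection, feature learning, and image retrieval modules, and is broadly applicable to image recognition tasks. It completes image recognition against a gallery of 100,000+ images in 0.2 s on CPU.
-It completes image recognition against a gallery of 100,000+ images in 200 ms on CPU.
-For details, see [PP-ShiTu: A Practical Lightweight Image Recognition System](./docs/PP_ShiTu.pdf)
- PP-LCNet lightweight CPU backbone: a lightweight backbone built specifically for CPU devices, surpassing competitors in both speed and accuracy.
+- PP-LCNet lightweight CPU backbone: a lightweight backbone built specifically for CPU devices, far surpassing competitors in both speed and accuracy.
-For details, see [PP-LCNet: A Lightweight CPU Convolutional Neural Network](https://arxiv.org/pdf/2109.15099.pdf),
-or the [PP-LCNet model introduction](docs/zh_CN/models/PP-LCNet.md).
- Rich pretrained model zoo: provides 164 ImageNet pretrained models across 35 series, of which 6 selected series support quick structural modification.
+- Rich pretrained model zoo: provides 175 ImageNet pretrained models across 36 series, of which 7 selected series support quick structural modification.
 - Comprehensive, easy-to-use feature learning components: integrates 12 metric learning methods such as arcmargin and triplet loss, which can be freely combined and switched through configuration files.
 - SSLD knowledge distillation: 14 classification pretrained models, with accuracy generally improved by more than 3%; the ResNet50_vd model reaches 84.0% Top-1 accuracy on the ImageNet-1k dataset,
 and the Res2Net200_vd pretrained model reaches 85.1% Top-1 accuracy.
- Data augmentation: provides detailed introductions, code reproductions, and evaluations in a unified experimental environment for 8 data augmentation algorithms such as AutoAugment, Cutout, and Cutmix.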
 <div align="center">
 <img src="./docs/images/recognition.gif"  width = "400" />
 </div>
@@ -47,6 +41,7 @@ Res2Net200_vd预训练模型Top-1精度高达85.1%。
 </div>
 ## Quick start
 PP-ShiTu image recognition quick start: [click here](./docs/zh_CN/quick_start/quick_start_recognition.md)
 ## Tutorials
@@ -59,9 +54,7 @@ PP-ShiTu图像识别快速体验：[点击这里](./docs/zh_CN/quick_start/quick
    - [For new users](./docs/zh_CN/quick_start/quick_start_classification_new_user.md)
    - [For advanced users](./docs/zh_CN/quick_start/quick_start_classification_professional.md)
 - [Introduction to the PP-ShiTu image recognition system](#图像识别系统介绍)
-  - [Mainbody detection](./docs/zh_CN/algorithm_introduction/mainbody_detection.md)
+- [Backbone networks and pretrained model zoo](./docs/zh_CN/algorithm_introduction/ImageNet_models.md)
-  - [Feature learning](./docs/zh_CN/algorithm_introduction/metric_learning.md)
-  - [Vector search](./deploy/vector_search/README.md)
 - Data preparation
  - [Introduction to image classification datasets](./docs/zh_CN/data_preparation/classification_dataset.md)
  - [Introduction to image recognition datasets](./docs/zh_CN/data_preparation/recognition_dataset.md)
@@ -83,7 +76,6 @@ PP-ShiTu图像识别快速体验：[点击这里](./docs/zh_CN/quick_start/quick
 - Algorithm introduction
    - [Introduction to image classification](./docs/zh_CN/algorithm_introduction/image_classification.md)
    - [Introduction to metric learning](./docs/zh_CN/algorithm_introduction/metric_learning.md)
-    - [Backbone networks and pretrained model zoo](./docs/zh_CN/algorithm_introduction/ImageNet_models.md)
 - Advanced usage
    - [Data augmentation](./docs/zh_CN/advanced_tutorials/DataAugmentation.md)
    - [Model quantization](./docs/zh_CN/advanced_tutorials/model_prune_quantization.md)
@@ -92,7 +84,7 @@ PP-ShiTu图像识别快速体验：[点击这里](./docs/zh_CN/quick_start/quick
    - [Community contribution guide](./docs/zh_CN/advanced_tutorials/how_to_contribute.md)
 - FAQ
    - [Selected image recognition questions](docs/zh_CN/faq_series/faq_2021_s2.md)
-    - [Selected image classification questions](docs/zh_CN/faq_series/faq.md)
+    - [Selected image classification questions](docs/zh_CN/faq_series/faq_selected_30.md)
    - [Image classification FAQ, season 1](docs/zh_CN/faq_series/faq_2020_s1.md)
    - [Image classification FAQ, season 2](docs/zh_CN/faq_series/faq_2021_s1.md)
 - [License](#许可证书)
@@ -105,9 +97,8 @@ PP-ShiTu图像识别快速体验：[点击这里](./docs/zh_CN/quick_start/quick
 <img src="./docs/images/structure.jpg"  width = "800" />
 </div>
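-The PP-ShiTu image recognition system works in three steps: (1) an object detection model detects candidate object regions in the image; (2) features are extracted from each candidate region; (3) the features are matched against images in the retrieval gallery to obtain the recognition result.
+PP-ShiTu is a practical, lightweight, general-purpose image recognition system composed of three modules: mainbody detection, feature learning, and vector search. The system optimizes the model in each module with a range of strategies across 8 aspects: backbone network selection and adjustment, loss function selection, data augmentation, learning rate scheduling, regularization parameter selection, pretrained model usage, and model pruning and quantization. The result is a system that completes image recognition against a gallery of 100,000+ images in just 0.2 s on CPU. For more details, see the [PP-ShiTu technical report](https://arxiv.org/pdf/2111.00775.pdf).
-For a new, previously unseen category, there is no need to retrain the model; simply add images of that category to the retrieval gallery and rebuild it, and the category can be recognized.
 <a name="识别效果展示"></a>
 ## PP-ShiTu image recognition system demo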
@@ -152,4 +143,3 @@ PP-ShiTu图像识别系统分为三步：（1）通过一个目标检测模型
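 - Many thanks to [nblib](https://github.com/nblib) for fixing the RandErasing data augmentation config file in PaddleClas.
 - Many thanks to [chenpy228](https://github.com/chenpy228) for fixing some typos in the PaddleClas documentation.
 - Many thanks to [jm12138](https://github.com/jm12138) for adding the ViT, DeiT, and RepVGG model series to PaddleClas.
- Many thanks to [FutureSI](https://aistudio.baidu.com/aistudio/personalcenter/thirdview/76563) for analyzing and summarizing the PaddleClas code.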
--- a/README_en.md
+++ b/README_en.md
@@ -8,7 +8,8 @@ PaddleClas is an image recognition toolset for industry and academia, helping us
 **Recent updates**
- 2021.09.17 Add PP-LCNet series models developed by PaddleClas; these models show strong competitiveness on Intel CPUs. The metrics and pretrained models are available [here](docs/en/ImageNet_models_en.md).
+- 2021.09.17 Add PP-LCNet series models developed by PaddleClas; these models show strong competitiveness on Intel CPUs.
+For an introduction to PP-LCNet, please refer to the [paper](https://arxiv.org/pdf/2109.15099.pdf) or the [PP-LCNet model introduction](docs/en/models/PP-LCNet_en.md). The metrics and pretrained models are available [here](docs/en/ImageNet_models_en.md).
 - 2021.06.29 Add Swin-transformer series models; the highest Top-1 accuracy on the ImageNet1k dataset reaches 87.2%, and training, evaluation, and inference are all supported. Pretrained models can be downloaded [here](docs/en/models/models_intro_en.md).
 - 2021.06.16 PaddleClas release/2.2. Add metric learning and vector search modules. Add product recognition, animation character recognition, vehicle recognition and logo recognition. Added 30 pretrained models of LeViT, Twins, TNT, DLA, HarDNet, and RedNet, and the accuracy is roughly the same as that of the paper.

--- a/benchmark/README.md
+++ b/benchmark/README.md
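+# Benchmark usage
+All shell scripts in this directory measure speed metrics for the different models in PaddleClas, such as single-GPU and multi-GPU training speed.
+## Script overview
+There are 3 scripts:
+- `prepare_data.sh`: downloads the test data and sets up the data paths
+- `run_benchmark.sh`: runs a single training benchmark; see the comments in the script for how to invoke it
+- `run_all.sh`: entry-point script that runs all training benchmarks
+## Usage
+**Note**: to stay consistent with the working directory of the other PaddleClas modules, run this module from the root of the `PaddleClas` repository.
+### 1. Prepare the data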
+```shell
+bash benchmark/prepare_data.sh
+```
+### 2. Run the benchmark for all models
+```shell
+bash benchmark/run_all.sh
+```
--- a/benchmark/prepare_data.sh
+++ b/benchmark/prepare_data.sh
+#!/bin/bash
+dataset_url=$1
+cd dataset
+rm -rf ILSVRC2012
+wget -nc ${dataset_url}
+tar xf ILSVRC2012_val.tar
+ln -s ILSVRC2012_val ILSVRC2012
+cd ILSVRC2012
+ln -s val_list.txt train_list.txt
+cd ../../
--- a/benchmark/run_all.sh
+++ b/benchmark/run_all.sh
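+# Script for stable, reproducible performance runs; by default it runs with py37 in the standard docker environment: paddlepaddle/paddle:latest-gpu-cuda10.1-cudnn7  paddle=2.1.2  py=37
+# Working directory: to be specified
+# cd **
+# 1 Install the dependencies the model needs (note any optimization strategies that should be enabled)
+# pip install ...
+# 2 Copy the data and pretrained models the model needs
+# 3 Run in batch (if batching is inconvenient, move steps 1 and 2 into the individual model)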
+model_mode_list=(MobileNetV1 MobileNetV2 MobileNetV3_large_x1_0 EfficientNetB0 ShuffleNetV2_x1_0 DenseNet121 HRNet_W48_C SwinTransformer_tiny_patch4_window7_224 alt_gvt_base)
+fp_item_list=(fp32)
+bs_list=(32 64 96 128)
+for model_mode in ${model_mode_list[@]}; do
+    for fp_item in ${fp_item_list[@]}; do
+        for bs_item in ${bs_list[@]}; do
+            echo "index is speed, 1gpus, begin, ${model_mode}"
+            run_mode=sp
+            CUDA_VISIBLE_DEVICES=0 bash benchmark/run_benchmark.sh ${run_mode} ${bs_item} ${fp_item} 10 ${model_mode}     #  (5min)
+            sleep 10
+            echo "index is speed, 8gpus, run_mode is multi_process, begin, ${model_mode}"
+            run_mode=mp
+            CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash benchmark/run_benchmark.sh ${run_mode} ${bs_item} ${fp_item} 10 ${model_mode}
+            sleep 10
+        done
+    done
+done
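As a rough sanity check on how long `run_all.sh` can take, the sweep it performs can be enumerated in a few lines of Python. This is only an illustrative sketch: the lists are copied from the script above, while the function name `enumerate_runs` is hypothetical and not part of the repository.

```python
from itertools import product

# Values copied from run_all.sh.
MODELS = [
    "MobileNetV1", "MobileNetV2", "MobileNetV3_large_x1_0", "EfficientNetB0",
    "ShuffleNetV2_x1_0", "DenseNet121", "HRNet_W48_C",
    "SwinTransformer_tiny_patch4_window7_224", "alt_gvt_base",
]
PRECISIONS = ["fp32"]
BATCH_SIZES = [32, 64, 96, 128]
RUN_MODES = ["sp", "mp"]  # single-GPU run first, then the 8-GPU run


def enumerate_runs():
    """Return (model, precision, batch_size, run_mode) tuples in launch order."""
    return [
        (model, fp_item, bs_item, run_mode)
        for model, fp_item, bs_item in product(MODELS, PRECISIONS, BATCH_SIZES)
        for run_mode in RUN_MODES
    ]


runs = enumerate_runs()
print(len(runs))  # 9 models x 1 precision x 4 batch sizes x 2 run modes
```

Because `run_benchmark.sh` wraps each training command in `timeout 15m`, the number of runs gives an upper bound on the total wall-clock time of the sweep, plus the 10-second sleeps between runs.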
--- a/benchmark/run_benchmark.sh
+++ b/benchmark/run_benchmark.sh
+#!/usr/bin/env bash
+set -xe
+# Example: CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh ${run_mode} ${bs_item} ${fp_item} 500 ${model_mode}
+# Parameters
+function _set_params(){
+    run_mode=${1:-"sp"}          # single GPU: sp | multi-GPU: mp
+    batch_size=${2:-"64"}
+    fp_item=${3:-"fp32"}        # fp32|fp16
+    epochs=${4:-"10"}       # optional; use it if the code needs to be changed to stop early
+    model_name=${5:-"model_name"}
+    run_log_path="${TRAIN_LOG_DIR:-$(pwd)}/benchmark"  # TRAIN_LOG_DIR will be set by QA later
+#   No changes needed below
+    device=${CUDA_VISIBLE_DEVICES//,/ }
+    arr=(${device})
+    num_gpu_devices=${#arr[*]}
+    log_file=${run_log_path}/clas_${model_name}_${run_mode}_bs${batch_size}_${fp_item}_${num_gpu_devices}
+}
+function _train(){
+    echo "Train on ${num_gpu_devices} GPUs"
+    echo "current CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES, gpus=$num_gpu_devices, batch_size=$batch_size"
+    if [ ${fp_item} = "fp32" ];then
+        model_config=`find ppcls/configs/ImageNet -name ${model_name}.yaml` 
+    else
+        model_config=`find ppcls/configs/ImageNet -name ${model_name}_fp16.yaml` 
+    fi
+    train_cmd="-c ${model_config} -o DataLoader.Train.sampler.batch_size=${batch_size} -o Global.epochs=${epochs}"   
+    case ${run_mode} in
+    sp) train_cmd="python -u tools/train.py ${train_cmd}" ;;
+    mp)
+        train_cmd="python -m paddle.distributed.launch --log_dir=./mylog --gpus=$CUDA_VISIBLE_DEVICES tools/train.py ${train_cmd}"
+        log_parse_file="mylog/workerlog.0" ;;
+    *) echo "choose run_mode(sp or mp)"; exit 1;
+    esac
+    rm -rf mylog
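+# No changes needed below
+    # Test the command directly: under `set -e`, a separate `$?` check would never run after a failure.
+    if timeout 15m ${train_cmd} > ${log_file} 2>&1; then
+        echo -e "${model_name}, SUCCESS"
+        export job_fail_flag=0
+    else
+        echo -e "${model_name}, FAIL"
+        export job_fail_flag=1
+    fi
+    # `|| true` keeps `set -e` from aborting the script when no python process is left to kill.
+    kill -9 `ps -ef|grep 'python'|grep -v grep|awk '{print $2}'` || true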
+    if [ $run_mode = "mp" -a -d mylog ]; then
+        rm ${log_file}
+        cp mylog/workerlog.0 ${log_file}
+    fi
+}
+_set_params $@
+_train
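The log file name built in `_set_params` encodes the run configuration, including a GPU count derived from `CUDA_VISIBLE_DEVICES`. The following Python sketch mirrors that naming logic so the resulting file names are easy to predict; the function names `count_gpus` and `benchmark_log_name` are hypothetical helpers for illustration, not part of the script.

```python
def count_gpus(cuda_visible_devices: str) -> int:
    """Count device ids the way the script does: split the comma-separated list."""
    return len(cuda_visible_devices.replace(",", " ").split())


def benchmark_log_name(model_name: str, run_mode: str, batch_size: int,
                       fp_item: str, cuda_visible_devices: str) -> str:
    """Rebuild the base name of ${log_file} (without the ${run_log_path} prefix)."""
    num_gpu_devices = count_gpus(cuda_visible_devices)
    return f"clas_{model_name}_{run_mode}_bs{batch_size}_{fp_item}_{num_gpu_devices}"


print(benchmark_log_name("MobileNetV1", "sp", 32, "fp32", "0"))
print(benchmark_log_name("MobileNetV1", "mp", 32, "fp32", "0,1,2,3,4,5,6,7"))
```

The two calls correspond to the single-GPU and 8-GPU invocations that `run_all.sh` issues for the first model and batch size.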
--- a/deploy/configs/build_general.yaml
+++ b/deploy/configs/build_general.yaml
+Global:
+  rec_inference_model_dir: "./models/general_PPLCNet_x2_5_lite_v1.0_infer"
+  batch_size: 32
+  use_gpu: True
+  enable_mkldnn: True
+  cpu_num_threads: 10
+  enable_benchmark: True
+  use_fp16: False
+  ir_optim: True
+  use_tensorrt: False
+  gpu_mem: 8000
+  enable_profile: False
+RecPreProcess:
+  transform_ops:
+    - ResizeImage:
+        size: 224
+    - NormalizeImage:
+        scale: 0.00392157
+        mean: [0.485, 0.456, 0.406]
+        std: [0.229, 0.224, 0.225]
+        order: ''
+    - ToCHWImage:
+RecPostProcess: null
+# indexing engine config
+IndexProcess:
+  index_method: "HNSW32" # supported: HNSW32, IVF, Flat
+  image_root: "./drink_dataset_v1.0/gallery/"
+  index_dir: "./drink_dataset_v1.0/index"
+  data_file:  "./drink_dataset_v1.0/gallery/drink_label.txt"
+  index_operation: "new" # supported: "append", "remove", "new"
+  delimiter: "\t"
+  dist_type: "IP"
+  embedding_size: 512
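The `NormalizeImage` parameters in this config are easier to read with the arithmetic spelled out. The sketch below assumes the conventional formula `(pixel * scale - mean) / std` applied per RGB channel; that formula is an assumption about how the operator uses these values, not something stated in the config, and `normalize_pixel` is a hypothetical helper.

```python
# Values copied from the RecPreProcess section of the config.
SCALE = 0.00392157            # approximately 1 / 255
MEAN = (0.485, 0.456, 0.406)  # per-channel mean
STD = (0.229, 0.224, 0.225)   # per-channel standard deviation


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to normalized floats (assumed formula)."""
    return tuple((value * SCALE - mean) / std
                 for value, mean, std in zip(rgb, MEAN, STD))


print(normalize_pixel((0, 0, 0)))        # darkest pixel
print(normalize_pixel((255, 255, 255)))  # brightest pixel
```

Under this reading, `scale` first maps 8-bit values into the range 0 to 1, and `mean` and `std` then center and rescale each channel.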
--- a/deploy/configs/inference_general.yaml
+++ b/deploy/configs/inference_general.yaml
+Global:
+  infer_imgs: "./drink_dataset_v1.0/test_images/nongfu_spring.jpeg"
+  det_inference_model_dir: "./models/picodet_PPLCNet_x2_5_mainbody_lite_v1.0_infer"
+  rec_inference_model_dir: "./models/general_PPLCNet_x2_5_lite_v1.0_infer"
+  rec_nms_thresold: 0.05
+  batch_size: 1
+  image_shape: [3, 640, 640]
+  threshold: 0.2
+  max_det_results: 5
+  labe_list:
+  - foreground
+  # inference engine config
+  use_gpu: True
+  enable_mkldnn: True
+  cpu_num_threads: 10
+  enable_benchmark: True
+  use_fp16: False
+  ir_optim: True
+  use_tensorrt: False
+  gpu_mem: 8000
+  enable_profile: False
+DetPreProcess:
+  transform_ops:
+    - DetResize:
+        interp: 2
+        keep_ratio: false
+        target_size: [640, 640]
+    - DetNormalizeImage:
+        is_scale: true
+        mean: [0.485, 0.456, 0.406]
+        std: [0.229, 0.224, 0.225]
+    - DetPermute: {}
+DetPostProcess: {}
+RecPreProcess:
+  transform_ops:
+    - ResizeImage:
+        size: 224
+    - NormalizeImage:
+        scale: 0.00392157
+        mean: [0.485, 0.456, 0.406]
+        std: [0.229, 0.224, 0.225]
+        order: ''
+    - ToCHWImage:
+RecPostProcess: null
+# indexing engine config
+IndexProcess:
+  index_dir: "./drink_dataset_v1.0/index/"
+  return_k: 5
+  score_thres: 0.5
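To illustrate how `return_k`, `score_thres`, and the `"IP"` (inner product) distance type from the index-building config fit together, here is a brute-force retrieval sketch. It is a simplified stand-in, not the HNSW index the config selects: the `search` function, the toy 2-dimensional gallery, and the order of operations (take the top `return_k`, then drop scores below `score_thres`) are all assumptions made for illustration.

```python
def inner_product(a, b):
    """Inner-product similarity between two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(query, gallery, return_k=5, score_thres=0.5):
    """Return up to return_k (label, score) pairs scoring at least score_thres.

    gallery is a list of (label, vector) pairs; every vector is scored,
    which is only practical for small galleries.
    """
    scored = [(inner_product(query, vector), label) for label, vector in gallery]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(label, score) for score, label in scored[:return_k]
            if score >= score_thres]


# Toy gallery with 2-d unit vectors (the real config uses 512-d embeddings).
gallery = [("cola", (1.0, 0.0)), ("water", (0.6, 0.8)), ("tea", (0.0, 1.0))]
print(search((1.0, 0.0), gallery, return_k=2, score_thres=0.5))
```

With normalized embeddings the inner product equals cosine similarity, which is why a fixed threshold such as 0.5 is meaningful.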
--- a/docs/en/models/PP-LCNet_en.md
+++ b/docs/en/models/PP-LCNet_en.md
-# PPLCNet series
+# PP-LCNet Series
-## Overview
+## Abstract
-The PPLCNet series is a network that has excellent performance on Intel-CPU proposed by the Baidu PaddleCV team. The author summarizes some methods that can improve the accuracy of the model on Intel-CPU but hardly increase the inference time. The author combines these methods into a new network, namely PPLCNet. Compared with other lightweight networks, PPLCNet can achieve higher accuracy with the same inference time. PPLCNet has shown strong competitiveness in image classification, object detection, and semantic segmentation.
+In the field of computer vision, the quality of the backbone network determines the outcome of the whole vision task. In previous studies, researchers generally focused on optimizing FLOPs or Params, but inference speed is in fact an important indicator of model quality in real-world scenarios. Nevertheless, it is difficult to balance inference speed and accuracy. In view of the many CPU-based applications in industry, we work to improve the adaptability of the backbone network to Intel CPUs, so as to obtain a faster and more accurate lightweight backbone network. At the same time, the performance of downstream vision tasks such as object detection and semantic segmentation is also improved.
+## Introduction
+Recent years have witnessed the emergence of many lightweight backbone networks. In the past two years in particular, many networks found by NAS either have an advantage in FLOPs or Params or have an edge in inference speed on ARM devices. However, few of them are specifically optimized for Intel CPUs, so their inference speed on Intel CPUs is less than ideal. Based on this, we designed the backbone network PP-LCNet specifically for Intel CPU devices and their acceleration library MKLDNN. Compared with other lightweight SOTA models, this backbone network further improves model accuracy without increasing inference time, significantly outperforming existing SOTA models. A comparison chart with other models is shown below.
+<div align=center><img src="../../images/PP-LCNet/PP-LCNet-Acc.png" width="500" height="400"/></div>
-## Accuracy, FLOPS and Parameters
+## Method
-| Models           | Top1 | Top5 | FLOPs<br>(M) | Parameters<br>(M) |
+The overall structure of the network is shown in the figure below.
-|:--:|:--:|:--:|:--:|:--:|
+<div align=center><img src="../../images/PP-LCNet/PP-LCNet.png" width="700" height="400"/></div>
-| PPLCNet_x0_25        |0.5186           | 0.7565           | 18    | 1.5  |
-| PPLCNet_x0_35        |0.5809           | 0.8083           | 29    | 1.6  |
-| PPLCNet_x0_5         |0.6314           | 0.8466           | 47    | 1.9  |
-| PPLCNet_x0_75        |0.6818           | 0.8830           | 99    | 2.4  |
-| PPLCNet_x1_0         |0.7132           | 0.9003           | 161   | 3.0  |
-| PPLCNet_x1_5         |0.7371           | 0.9153           | 342   | 4.5  |
-| PPLCNet_x2_0         |0.7518           | 0.9227           | 590   | 6.5  |
-| PPLCNet_x2_5         |0.7660           | 0.9300           | 906   | 9.0  |
-| PPLCNet_x0_5_ssld    |0.6610           | 0.8646           | 47    | 1.9  |
-| PPLCNet_x1_0_ssld    |0.7439           | 0.9209           | 161   | 3.0  |
-| PPLCNet_x2_5_ssld    |0.8082           | 0.9533           | 906   | 9.0  |
+Based on extensive experiments, we found that many seemingly inexpensive operations increase latency on Intel CPU-based devices, especially when the MKLDNN acceleration library is enabled. Therefore, we chose a block with the leanest possible structure and the fastest possible speed to form our BaseNet (similar to MobileNetV1). Based on BaseNet, we summarized four strategies that improve the accuracy of the model without increasing latency, and combined them to form PP-LCNet. Each of the four strategies is introduced below:
+### Better Activation Function
-## Inference speed based on Intel(R)-Xeon(R)-Gold-6148-CPU
+Since convolutional neural networks adopted the ReLU activation function, network performance has improved substantially, and variants of ReLU such as Leaky-ReLU, P-ReLU, and ELU have appeared in recent years. In 2017, Google Brain found the swish activation function through search, and it performs well on lightweight networks. In 2019, the authors of MobileNetV3 further optimized this activation function into H-Swish, which removes the exponential operation, making it faster with almost no effect on network accuracy. After many experiments, we also found that it performs very well on lightweight networks. Therefore, this activation function is adopted in PP-LCNet.
-| Models                 | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
+### SE Modules at Appropriate Positions
-|------------------|-----------|-------------------|--------------------------|
-| PPLCNet_x0_25        | 224       | 256               | 1.74                    |
+The SE module is a channel attention mechanism proposed by SENet, which can effectively improve the accuracy of the model. However, on Intel CPUs the module also adds considerable latency, so accuracy and speed have to be balanced. The positions of SE modules in NAS-searched networks such as MobileNetV3 yield no general conclusion, but our experiments show that the closer the SE module is to the tail of the network, the greater the improvement in model accuracy. The following table shows some of our experimental results:
-| PPLCNet_x0_35        | 224       | 256               | 1.92                    |
-| PPLCNet_x0_5         | 224       | 256               | 2.05                    |
+| SE Location       | Top-1 Acc(\%) | Latency(ms) |
-| PPLCNet_x0_75        | 224       | 256               | 2.29                    |
+|-------------------|---------------|-------------|
-| PPLCNet_x1_0         | 224       | 256               | 2.46                    |
+| 1100000000000     | 61.73           | 2.06         |
-| PPLCNet_x1_5         | 224       | 256               | 3.19                    |
+| 0000001100000     | 62.17           | 2.03         |
-| PPLCNet_x2_0         | 224       | 256               | 4.27                    |
+| <b>0000000000011</b>     | <b>63.14</b>           | <b>2.05</b>         |
-| PPLCNet_x2_5         | 224       | 256               | 5.39                    |
+| 1111111111111     | 64.27           | 3.80         |
-| PPLCNet_x0_5_ssld    | 224       | 256               | 2.05                    |
-| PPLCNet_x1_0_ssld    | 224       | 256               | 2.46                    |
+The option in the third row of the table was chosen for the location of the SE module in PP-LCNet.
-| PPLCNet_x2_5_ssld    | 224       | 256               | 5.39                    |
+### Larger Convolution Kernels
+In the MixNet paper, the authors analyze the effect of convolution kernel size on model performance and conclude that, within a certain range, larger kernels improve performance, while kernels beyond that range hurt it. The authors therefore build MixConv in a split-concat paradigm, which improves model performance but is not conducive to inference. As with the SE module, we experimentally studied the effect of larger kernels at different positions and found that they matter most in the middle and tail of the network. The following table shows the effect of the position of the 5x5 convolution kernels on accuracy:
+| Large-kernel Location | Top-1 Acc(\%) | Latency(ms) |
+|-------------------|---------------|-------------|
+| 1111111111111     | 63.22           | 2.08         |
+| 1111111000000     | 62.70           | 2.07        |
+| <b>0000001111111</b>     | <b>63.14</b>           | <b>2.05</b>         |
+Experiments show that placing larger convolution kernels in the middle and tail of the network achieves nearly the same accuracy as placing them at all positions (63.14% vs. 63.22%), with slightly faster inference. The option in the third row of the table was the final choice for PP-LCNet.
+### Larger Dimensional 1 × 1 Conv Layer after GAP
+Since the introduction of GoogLeNet, GAP (Global-Average-Pooling) is often directly followed by a classification layer, so in lightweight networks the features extracted after GAP are not further integrated and processed. If a larger 1x1 convolutional layer (equivalent to an FC layer) is used after GAP, the extracted features are first integrated and then classified instead of passing directly through the classification layer. This can greatly improve accuracy with only a small effect on the inference speed of the model. The above four improvements were made to BaseNet to obtain PP-LCNet. The following table further illustrates the impact of each scheme on the results:
+| Activation | SE-block | Large-kernel | last-1x1-conv | Top-1 Acc(\%) | Latency(ms) |
+|------------|----------|--------------|---------------|---------------|-------------|
+| 0       | 1       | 1               | 1                | 61.93 | 1.94 |
+| 1       | 0       | 1               | 1                | 62.51 | 1.87 |
+| 1       | 1       | 0               | 1                | 62.44 | 2.01 |
+| 1       | 1       | 1               | 0                | 59.91 | 1.85 |
+| <b>1</b>       | <b>1</b>       | <b>1</b>               | <b>1</b>                | <b>63.14</b> | <b>2.05</b> |
+## Experiments
+### Image Classification
+For image classification, ImageNet dataset is adopted. Compared with the current mainstream lightweight network, PP-LCNet can obtain faster inference speed with the same accuracy. When using Baidu’s self-developed SSLD distillation strategy, the accuracy is further improved, with the Top-1 Acc of ImageNet exceeding 80% at an inference speed of about 5ms on the Intel CPU side.
+| Model | Params(M) | FLOPs(M) | Top-1 Acc(\%) | Top-5 Acc(\%) | Latency(ms) | 
+|-------|-----------|----------|---------------|---------------|-------------|
+| PP-LCNet-0.25x  | 1.5 | 18  | 51.86 | 75.65 | 1.74 |
+| PP-LCNet-0.35x  | 1.6 | 29  | 58.09 | 80.83 | 1.92 |
+| PP-LCNet-0.5x   | 1.9 | 47  | 63.14 | 84.66 | 2.05 |
+| PP-LCNet-0.75x  | 2.4 | 99  | 68.18 | 88.30 | 2.29 |
+| PP-LCNet-1x     | 3.0 | 161 | 71.32 | 90.03 | 2.46 |
+| PP-LCNet-1.5x   | 4.5 | 342 | 73.71 | 91.53 | 3.19 |
+| PP-LCNet-2x     | 6.5 | 590 | 75.18 | 92.27 | 4.27 |
+| PP-LCNet-2.5x   | 9.0 | 906 | 76.60 | 93.00 | 5.39 |
+| PP-LCNet-0.5x\*  | 1.9 | 47  | 66.10 | 86.46 | 2.05 |
+| PP-LCNet-1x\*    | 3.0 | 161 | 74.39 | 92.09 | 2.46 |
+| PP-LCNet-2.5x\*  | 9.0 | 906 | 80.82 | 95.33 | 5.39 |
+\* denotes the model after using SSLD distillation.
+Performance comparison with other lightweight networks:
+| Model | Params(M) | FLOPs(M) | Top-1 Acc(\%) | Top-5 Acc(\%) | Latency(ms) |
+|-------|-----------|----------|---------------|---------------|-------------|
+| MobileNetV2-0.25x  | 1.5 | 34  | 53.21 | 76.52 | 2.47 |
+| MobileNetV3-small-0.35x  | 1.7 | 15  | 53.03 | 76.37 | 3.02 |
+| ShuffleNetV2-0.33x  | 0.6 | 24  | 53.73 | 77.05 | 4.30 |
+| <b>PP-LCNet-0.25x</b>  | <b>1.5</b> | <b>18</b>  | <b>51.86</b> | <b>75.65</b> | <b>1.74</b> |
+| MobileNetV2-0.5x  | 2.0 | 99  | 65.03 | 85.72 | 2.85 |
+| MobileNetV3-large-0.35x  | 2.1 | 41  | 64.32 | 85.46 | 3.68 |
+| ShuffleNetV2-0.5x  | 1.4 | 43  | 60.32 | 82.26 | 4.65 |
+| <b>PP-LCNet-0.5x</b>   | <b>1.9</b> | <b>47</b>  | <b>63.14</b> | <b>84.66</b> | <b>2.05</b> |
+| MobileNetV1-1x  | 4.3 | 578  | 70.99 | 89.68 | 3.38 |
+| MobileNetV2-1x  | 3.5 | 327  | 72.15 | 90.65 | 4.26 |
+| MobileNetV3-small-1.25x  | 3.6 | 100  | 70.67 | 89.51 | 3.95 |
+| <b>PP-LCNet-1x</b>     | <b>3.0</b> | <b>161</b> | <b>71.32</b> | <b>90.03</b> | <b>2.46</b> |
+### Object Detection
+For object detection, we adopt Baidu’s self-developed PicoDet, which focuses on lightweight object detection scenarios. The following table shows the comparison between the results of PP-LCNet and MobileNetV3 on the COCO dataset. PP-LCNet has an obvious advantage in both accuracy and speed.
+| Backbone | mAP(%) | Latency(ms) |
+|-------|-----------|----------|
+| MobileNetV3-large-0.35x | 19.2 | 8.1 |
+| <b>PP-LCNet-0.5x</b> | <b>20.3</b> | <b>6.0</b> |
+| MobileNetV3-large-0.75x | 25.8 | 11.1 |
+| <b>PP-LCNet-1x</b> | <b>26.9</b> | <b>7.9</b> |
+### Semantic Segmentation
+For semantic segmentation, DeeplabV3+ is adopted. The following table presents the comparison between PP-LCNet and MobileNetV3 on the Cityscapes dataset, and PP-LCNet also stands out in terms of accuracy and speed.
+| Backbone | mIoU(%) | Latency(ms) |
+|-------|-----------|----------|
+| MobileNetV3-large-0.5x | 55.42 | 135 |
+| <b>PP-LCNet-0.5x</b> | <b>58.36</b> | <b>82</b> |
+| MobileNetV3-large-0.75x | 64.53 | 151 |
+| <b>PP-LCNet-1x</b> | <b>66.03</b> | <b>96</b> |
+## Conclusion
+Rather than holding on to perfect FLOPs and Params as academics do, PP-LCNet focuses on analyzing how to add Intel CPU-friendly modules to improve the performance of the model, which can better balance accuracy and inference time. The experimental conclusions therein are available to other researchers in network structure design, while providing NAS search researchers with a smaller search space and general conclusions. The finished PP-LCNet can also be better accepted and applied in industry.
+## Reference
+Reference to cite when you use PP-LCNet in a paper:
+```
+@misc{cui2021pplcnet,
+      title={PP-LCNet: A Lightweight CPU Convolutional Neural Network}, 
+      author={Cheng Cui and Tingquan Gao and Shengyu Wei and Yuning Du and Ruoyu Guo and Shuilong Dong and Bin Lu and Ying Zhou and Xueying Lv and Qiwen Liu and Xiaoguang Hu and Dianhai Yu and Yanjun Ma},
+      year={2021},
+      eprint={2109.15099},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV}
+}
+```
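The "Better Activation Function" section of the PP-LCNet document above says that H-Swish removes the exponential operation from swish. A minimal scalar sketch makes the difference concrete. It uses the standard definitions (swish as `x * sigmoid(x)`, H-Swish as `x * ReLU6(x + 3) / 6` from the MobileNetV3 paper) and plain Python floats rather than PaddlePaddle tensors; the function names are illustrative.

```python
import math


def relu6(x: float) -> float:
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def swish(x: float) -> float:
    """swish(x) = x * sigmoid(x); needs an exponential."""
    return x / (1.0 + math.exp(-x))


def hard_swish(x: float) -> float:
    """H-Swish(x) = x * ReLU6(x + 3) / 6; piecewise, no exponential."""
    return x * relu6(x + 3.0) / 6.0


for value in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(value, round(swish(value), 4), round(hard_swish(value), 4))
```

H-Swish is exactly zero for inputs at or below -3 and equals the identity for inputs at or above 3, which is what lets it avoid the exponential while staying close to swish in between.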
--- a/docs/en/models/PPLCNet_en.md
+++ b/docs/en/models/PPLCNet_en.md
-# PPLCNet series
-## Overview
-The PPLCNet series is a network that has excellent performance on Intel-CPU proposed by the Baidu PaddleCV team. The author summarizes some methods that can improve the accuracy of the model on Intel-CPU but hardly increase the inference time. The author combines these methods into a new network, namely PPLCNet. Compared with other lightweight networks, PPLCNet can achieve higher accuracy with the same inference time. PPLCNet has shown strong competitiveness in image classification, object detection, and semantic segmentation.
-## Accuracy, FLOPS and Parameters
-| Models           | Top1 | Top5 | FLOPs<br>(M) | Parameters<br>(M) |
-|:--:|:--:|:--:|:--:|:--:|
-| PPLCNet_x0_25        |0.5186           | 0.7565           | 18    | 1.5  |
-| PPLCNet_x0_35        |0.5809           | 0.8083           | 29    | 1.6  |
-| PPLCNet_x0_5         |0.6314           | 0.8466           | 47    | 1.9  |
-| PPLCNet_x0_75        |0.6818           | 0.8830           | 99    | 2.4  |
-| PPLCNet_x1_0         |0.7132           | 0.9003           | 161   | 3.0  |
-| PPLCNet_x1_5         |0.7371           | 0.9153           | 342   | 4.5  |
-| PPLCNet_x2_0         |0.7518           | 0.9227           | 590   | 6.5  |
-| PPLCNet_x2_5         |0.7660           | 0.9300           | 906   | 9.0  |
-| PPLCNet_x0_5_ssld    |0.6610           | 0.8646           | 47    | 1.9  |
-| PPLCNet_x1_0_ssld    |0.7439           | 0.9209           | 161   | 3.0  |
-| PPLCNet_x2_5_ssld    |0.8082           | 0.9533           | 906   | 9.0  |
-## Inference speed based on Intel(R)-Xeon(R)-Gold-6148-CPU
-| Models                 | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
-|------------------|-----------|-------------------|--------------------------|
-| PPLCNet_x0_25        | 224       | 256               | 1.74                    |
-| PPLCNet_x0_35        | 224       | 256               | 1.92                    |
-| PPLCNet_x0_5         | 224       | 256               | 2.05                    |
-| PPLCNet_x0_75        | 224       | 256               | 2.29                    |
-| PPLCNet_x1_0         | 224       | 256               | 2.46                    |
-| PPLCNet_x1_5         | 224       | 256               | 3.19                    |
-| PPLCNet_x2_0         | 224       | 256               | 4.27                    |
-| PPLCNet_x2_5         | 224       | 256               | 5.39                    |
-| PPLCNet_x0_5_ssld    | 224       | 256               | 2.05                    |
-| PPLCNet_x1_0_ssld    | 224       | 256               | 2.46                    |
-| PPLCNet_x2_5_ssld    | 224       | 256               | 5.39                    |
--- a/docs/images/recognition/drink_data_demo/output/mosilian.jpeg
+++ b/docs/images/recognition/drink_data_demo/output/mosilian.jpeg
--- a/docs/images/recognition/drink_data_demo/output/nongfu_spring.jpeg
+++ b/docs/images/recognition/drink_data_demo/output/nongfu_spring.jpeg
--- a/docs/images/recognition/drink_data_demo/output/youjia.jpeg
+++ b/docs/images/recognition/drink_data_demo/output/youjia.jpeg
--- a/docs/images/recognition/drink_data_demo/test_images/mosilian.jpeg
+++ b/docs/images/recognition/drink_data_demo/test_images/mosilian.jpeg
--- a/docs/images/recognition/drink_data_demo/test_images/nongfu_spring.jpeg
+++ b/docs/images/recognition/drink_data_demo/test_images/nongfu_spring.jpeg
--- a/docs/images/recognition/drink_data_demo/test_images/youjia.jpeg
+++ b/docs/images/recognition/drink_data_demo/test_images/youjia.jpeg
--- a/docs/zh_CN/faq_series/faq_2020_s1.md
+++ b/docs/zh_CN/faq_series/faq_2020_s1.md
@@ -13,11 +13,10 @@
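 ## Issue 1
 ### Q1.1: What can PaddleClas be used for?
-**A**: PaddleClas is a toolset for image classification tasks that PaddlePaddle built for industry and academia, helping users train better vision models and put them into production. PaddleClas provides an end-to-end workflow for image classification covering model training, evaluation, inference, and deployment, so that everyone can learn image classification more efficiently. Specifically, PaddleClas includes the following features .
+**A**: PaddleClas is a toolset for image classification tasks that PaddlePaddle built for industry and academia, helping users train better vision models and put them into production. PaddleClas provides an end-to-end workflow for image classification covering model training, evaluation, inference, and deployment, so that everyone can learn image classification more efficiently. Specifically, PaddleClas includes the following features.
+* PaddleClas provides 36 series of classification network architectures (ResNet, ResNet_vd, MobileNetV3, Res2Net, HRNet, etc.) with training configs, plus 175 pretrained models with performance evaluation and inference, for everyone to choose from and use.
-* PaddleClas provides 24 series of classification network architectures (ResNet, ResNet_vd, MobileNetV3, Res2Net, HRNet, etc.) with training configs, plus 122 pretrained models with performance evaluation and inference, for everyone to choose from and use.
+* PaddleClas provides a variety of inference and deployment options, including TensorRT inference, python inference, c++ inference, Paddle-Lite deployment, PaddleServing, and PaddleHub, making deployment and inference convenient across many environments.
-* PaddleClas provides a variety of inference and deployment options, including TensorRT inference, python inference, c++ inference, and Paddle-Lite deployment, making deployment and inference convenient across many environments.
 * PaddleClas provides a simple SSLD knowledge distillation scheme; models distilled with it generally gain more than 3% in recognition accuracy.
 * PaddleClas provides detailed introductions, code reproductions, and evaluations in a unified experimental environment for 8 data augmentation algorithms such as AutoAugment, Cutout, and Cutmix.
 * PaddleClas can be used on CPU/GPU under Windows/Linux/MacOS.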
@@ -27,7 +26,6 @@
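 ### Q1.3: How do the ResNet_vd, ResNet, and ResNet_vc architectures differ?
 **A**:
 The structures of ResNet_va through vd are shown in the figure below. ResNet as originally proposed is the va structure. In its downsampling residual block, the left feature transformation path (Path A) downsamples in the first 1x1 convolution, which loses information (the kernel size is 1 and the stride is 2, so some features in the input feature map never take part in the convolution). The vb structure moves the downsampling step from the first 1x1 convolution to the 3x3 convolution in the middle, which avoids this information loss; the ResNet model in PaddleClas is ResNet_vb by default. The vc structure replaces the initial 7x7 convolution with three 3x3 convolutions; with the receptive field unchanged, computation and storage size stay almost the same, and experiments show improved accuracy over vb. The vd structure modifies the right feature path (Path B) of the downsampling residual block, replacing the downsampling step with average pooling. This series of improvements (va->vd) adds almost no inference time, and combined with suitable training strategies such as label smoothing and mixup data augmentation, accuracy can improve by as much as 2.7%.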
 <div align="center">
@@ -38,7 +36,7 @@ ResNet_va至vd的结构如下图所示，ResNet最早提出时为va结构，在
 **A**:
 ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度几乎不变的情况下，精度有非常明显的提升，因此推荐大家使用ResNet_vd系列模型。
-下面给出了batch size=4的情况下，在T4 GPU上，不同模型的的预测耗时、flops、params与精度的变化曲线，可以根据自己自己的实际部署场景中的需求，去选择合适的模型，如果希望模型存储大小尽可能小或者预测速度尽可能快，则可以使用ResNet18_vd模型，如果希望获得尽可能高的精度，则建议使用ResNet152_vd或者ResNet200_vd模型。更多关于ResNet系列模型的介绍可以参考文档：[ResNet及其vd系列模型文档](../models/ResNet_and_vd.md)。
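+The [ResNet and vd series model documentation](../models/ResNet_and_vd.md) gives curves of inference time, FLOPs, and Params versus accuracy for different models on a T4 GPU with batch size=4. Choose a model according to the needs of your own deployment scenario: if you want the smallest possible model storage size or the fastest possible inference, use ResNet18_vd; if you want the highest possible accuracy, ResNet152_vd or ResNet200_vd is recommended. The same document has more details on the ResNet series models.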
 * 精度-预测速度变化曲线
@@ -69,7 +67,7 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 ### Q1.7 大卷积核一定可以带来正向收益吗？
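 **A**: Not necessarily. Enlarging every convolution kernel in the network does not always improve performance and can even hurt it. The paper [MixConv: Mixed Depthwise Convolutional Kernels](https://arxiv.org/abs/1907.09595)
-points out that increasing the kernel size within a certain range improves accuracy, but going beyond that range hurts accuracy. Considering model size, computation, and similar factors, large kernels are therefore generally not used when designing networks.
+points out that increasing the kernel size within a certain range improves accuracy, but going beyond that range hurts accuracy. Considering model size, computation, and similar factors, large kernels are therefore generally not used when designing networks. The [PP-LCNet](../models/PP-LCNet.md) article also reports experiments on large convolution kernels.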
 <a name="第2期"></a>
 ## 第2期
@@ -77,9 +75,9 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 ### Q2.1: PaddleClas如何训练自己的backbone？
 **A**：具体流程如下:
-* 首先在ppcls/modeling/architectures/文件夹下新建一个自己的模型结构文件，即你自己的backbone，模型搭建可以参考resnet.py;
+* 首先在ppcls/arch/backbone/model_zoo/文件夹下新建一个自己的模型结构文件，即你自己的backbone，模型搭建可以参考resnet.py;
-* 然后在ppcls/modeling/\_\_init\_\_.py中添加自己设计的backbone的类;
+* 然后在ppcls/arch/backbone/\_\_init\_\_.py中添加自己设计的backbone的类;
-* 其次配置训练的yaml文件，此处可以参考configs/ResNet/ResNet50.yaml;
+* 其次配置训练的yaml文件，此处可以参考ppcls/configs/ImageNet/ResNet/ResNet50.yaml;
 * 最后启动训练即可。
@@ -92,7 +90,7 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 ### Q2.3: PaddleClas中configs下的默认参数适合任何一个数据集吗？
-**A**: PaddleClas中的configs下的默认参数是ImageNet-1k的训练参数，这个参数并不适合所有的数据集，具体数据集需要在此基础上进一步调试，调试方法会在之后出一个单独的faq，敬请期待。
+**A**: PaddleClas中的ppcls/configs/ImageNet/下的配置文件默认参数是ImageNet-1k的训练参数，这个参数并不适合所有的数据集，具体数据集需要在此基础上进一步调试。
 ### Q2.4 PaddleClas中的不同的模型使用了不同的分辨率，标配的应该是多少呢？
@@ -102,7 +100,7 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 ### Q2.5 PaddleClas中提供了很多ssld模型，其应用的价值是？
-**A**: PaddleClas中提供了很多ssld预训练模型，其通过半监督知识蒸馏的方法获得了更好的预训练权重，在迁移任务或者下游视觉任务中，无须替换结构文件、只需要替换精度更高的ssld预训练模型即可提升精度，如在PaddleSeg中，[HRNet](https://github.com/PaddlePaddle/PaddleSeg/blob/release/v0.7.0/docs/model_zoo.md)使用了ssld预训练模型的权重后，精度大幅度超越业界同样的模型的精度，在PaddleDetection中，[PP-YOLO](https://github.com/PaddlePaddle/PaddleDetection/blob/release/0.4/configs/ppyolo/README_cn.md)使用了ssld预训练权重后，在较高的baseline上仍有进一步的提升。使用ssld预训练权重做分类的迁移表现也很抢眼，在[SSLD蒸馏策略](../advanced_tutorials/distillation/distillation.md)部分介绍了知识蒸馏对于分类任务迁移的收益。
+**A**: PaddleClas中提供了很多ssld预训练模型，其通过半监督知识蒸馏的方法获得了更好的预训练权重，在迁移任务或者下游视觉任务中，无须替换结构文件、只需要替换精度更高的ssld预训练模型即可提升精度，如在PaddleSeg中，[HRNet](https://github.com/PaddlePaddle/PaddleSeg/blob/release/v0.7.0/docs/model_zoo.md)使用了ssld预训练模型的权重后，精度大幅度超越业界同样的模型的精度，在PaddleDetection中，[PP-YOLO](https://github.com/PaddlePaddle/PaddleDetection/blob/release/0.4/configs/ppyolo/README_cn.md)使用了ssld预训练权重后，在较高的baseline上仍有进一步的提升。使用ssld预训练权重做分类的迁移表现也很抢眼，在[SSLD蒸馏策略](../advanced_tutorials/knowledge_distillation.md)部分介绍了知识蒸馏对于分类任务迁移的收益。
 <a name="第3期"></a>
@@ -121,7 +119,7 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 ### Q3.3: 怎么使用多个模型进行预测融合呢？
-**A** 使用多个模型进行预测的时候，建议首先将预训练模型导出为inference模型，这样可以摆脱对网络结构定义的依赖，可以参考[模型导出脚本](../../../tools/export_model.py)进行模型导出，之后再参考[inference模型预测脚本](../../../tools/infer/predict.py)进行预测即可，在这里需要根据自己使用模型的数量创建多个predictor。
+**A** 使用多个模型进行预测的时候，建议首先将预训练模型导出为inference模型，这样可以摆脱对网络结构定义的依赖，可以参考[模型导出脚本](../../../tools/export_model.py)进行模型导出，之后再参考[inference模型预测脚本](../../../deploy/python/predict_cls.py)进行预测即可，在这里需要根据自己使用模型的数量创建多个predictor。
 ### Q3.4: PaddleClas中怎么增加自己的数据增广方法呢？
@@ -136,15 +134,17 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 **A**：
-* 可以使用自动混合精度进行训练，这在精度几乎无损的情况下，可以有比较明显的速度收益，以ResNet50为例，PaddleClas中使用自动混合精度训练的配置文件可以参考：[ResNet50_fp16.yml](../../../ppcls/configs/ResNet/ResNet50_fp16.yml)，主要就是需要在标准的配置文件中添加以下几行
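+* You can train with automatic mixed precision, which gives a noticeable speedup with almost no loss of accuracy. Taking ResNet50 as an example, the PaddleClas config file for automatic mixed precision training is [ResNet50_fp16.yaml](../../../ppcls/configs/ImageNet/ResNet/ResNet50_fp16.yaml); the main change is adding the following lines to the standard config file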
 ```
-use_fp16: True
+# mixed precision training
-amp_scale_loss: 128.0
+AMP:
-use_dynamic_loss_scaling: True
+  scale_loss: 128.0
+  use_dynamic_loss_scaling: True
+  use_pure_fp16: &use_pure_fp16 True
 ```
-* 可以开启dali，将数据预处理方法放在GPU上运行，在模型比较小时（reader耗时占比更高一些），开启dali会带来比较明显的精度收益，在训练的时候，添加`-o use_dali=True`即可使用dali进行训练，更多关于dali 安装与介绍可以参考：[dali安装教程](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html#nightly-builds)。
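+* You can enable DALI to run data preprocessing on the GPU. When the model is small (so the reader accounts for a larger share of the time), enabling DALI brings a noticeable training-speed gain. Add `-o Global.use_dali=True` at training time to train with DALI. For more on installing and using DALI, see the [DALI installation guide](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html#nightly-builds).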
 <a name="第4期"></a>
 ## 第4期
@@ -294,7 +294,7 @@ Cosine_decay和piecewise_decay的学习率变化曲线如下图所示，容易
 **A**:一般来说，数据集的规模对性能影响至关重要，但是图片的标注往往比较昂贵，所以有标注的图片数量往往比较稀少，在这种情况下，数据的增广尤为重要。在训练ImageNet-1k的标准数据增广中，主要使用了Random_Crop与Random_Flip两种数据增广方式，然而，近些年，越来越多的数据增广方式被提出，如cutout、mixup、cutmix、AutoAugment等。实验表明，这些数据的增广方式可以有效提升模型的精度。具体到数据集来说：
- ImageNet-1k：下表列出了ResNet50在8种不同的数据增广方式的表现，可以看出，相比baseline，所有的数据增广方式均有收益，其中cutmix是目前最有效的数据增广。更多数据增广的介绍请参考[**数据增广章节**](../advanced_tutorials/image_augmentation/ImageAugment.md)。
+- ImageNet-1k：下表列出了ResNet50在8种不同的数据增广方式的表现，可以看出，相比baseline，所有的数据增广方式均有收益，其中cutmix是目前最有效的数据增广。更多数据增广的介绍请参考[**数据增广章节**](../advanced_tutorials/DataAugmentation.md)。
 | 模型       | 数据增广方式         | Test top-1 |
 |:--:|:--:|:--:|
@@ -332,7 +332,7 @@ Cosine_decay和piecewise_decay的学习率变化曲线如下图所示，容易
 - 挖掘相关数据：用在现有数据集上训练饱和的模型去对相关的数据做预测，将置信度较高的数据打label后加入训练集进一步训练，如此循环操作，可进一步提升模型的精度。
- 知识蒸馏：可以先使用一个较大的模型在该数据集上训练一个精度较高的teacher model，然后使用该teacher model去教导一个Student model，其中，Student model即为目标模型。PaddleClas提供了百度自研的SSLD知识蒸馏方案，即使在ImageNet-1k这么有挑战的分类任务上，其也能稳定提升3%以上。SSLD知识蒸馏的的章节请参考[**SSLD知识蒸馏**](../advanced_tutorials/distillation/distillation.md)。
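+- Knowledge distillation: first train a larger model on the dataset to obtain a high-accuracy teacher model, then use that teacher model to guide a student model, which is the target model. PaddleClas provides SSLD, a knowledge distillation scheme developed in-house at Baidu, which delivers a stable gain of more than 3% even on a task as challenging as ImageNet-1k classification. For details, see [**SSLD knowledge distillation**](../advanced_tutorials/knowledge_distillation.md).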
 <a name="第6期"></a>
@@ -342,13 +342,13 @@ Cosine_decay和piecewise_decay的学习率变化曲线如下图所示，容易
 **A**: PaddleClas目前共有3种分支：
-* 动态图分支：dygraph分支是PaddleClas的默认分支，也是更新最快的分支。所有的新功能、新改动都会先在dygraph分支上进行。如果想追踪PaddleClas的最新进展，可以关注这个分支。这个分支主要支持动态图，会跟着paddlepaddle的版本一起更新。
+* 开发分支：develop分支是PaddleClas的开发分支，也是更新最快的分支。所有的新功能、新改动都会先在develop分支上进行。如果想追踪PaddleClas的最新进展，可以关注这个分支。这个分支主要支持动态图，会跟着paddlepaddle的版本一起更新。
-* 稳定版本分支：快速更新能够让关注者了解最新进展，但也会带来不稳定性。因此在一些关键的时间点，我们会从dygraph分支中拉出分支，提供稳定的版本。这些分支的名字与paddlepaddle的版本对应，如 2.0-beta 为支持paddlepaddle2.0-beta的稳定版本。这些分支一般只会修复bug，而不更新新的特性和模型。
+* 稳定版本分支（如release/2.1.3）：快速更新能够让关注者了解最新进展，但也会带来不稳定性。因此在一些关键的时间点，我们会从develop分支中拉出分支，提供稳定的版本，最新的稳定版分支也是默认分支。需要注意，无特殊情况，我们只会维护最新的release稳定分支，并且一般只会修复bug，而不更新新的特性和模型。
-* 静态图分支：master分支是使用静态图版本的分支，主要用来支持一些老用户的使用，也只进行一些简单维护，不会更新新的特性和模型。不建议新用户使用静态图分支。老用户如果有条件，也建议迁到动态图分支或稳定版本分支。
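+* Static graph branch (static): the static branch uses the static graph version. It mainly exists to support long-time users and receives only light maintenance, with no new features or models. New users are advised not to use the static graph branch. Long-time users who are able to do so are also encouraged to migrate to the develop branch or a stable release branch.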
-总的来说，如果想跟进PaddleClas的最新进展，建议选择dygraph分支，如果需要稳定版本，建议选择最新的稳定版本分支。
+总的来说，如果想跟进PaddleClas的最新进展，建议选择develop分支，如果需要稳定版本，建议选择最新的稳定版本分支。
 ### Q6.2: 什么是静态图模式？
@@ -358,11 +358,7 @@ Cosine_decay和piecewise_decay的学习率变化曲线如下图所示，容易
 **A**: 动态图模式即为命令式编程模式，用户无需预先定义网络结构，每行代码都可以直接运行得到结果。相比静态图模式，动态图模式对用户更加友好，调试也更方便。此外，动态图模式的结构设计也更加灵活，可以在运行过程中随时调整结构。
-PaddleClas目前持续更新的dygraph分支，主要采用动态图模式。如果您是新用户，建议使用动态图模式来进行开发和训练。如果推理预测时有性能需求，可以在训练完成后，将动态图模型转为静态图模型提高效率。
+PaddleClas目前持续更新的develop分支和稳定版本的release分支，主要采用动态图模式。如果您是新用户，建议使用动态图模式来进行开发和训练。如果推理预测时有性能需求，可以在训练完成后，将动态图模型转为静态图模型提高效率。
-### Q6.4: 动态图模型的预测效率有时不如静态图，应该怎么办？
-**A**: 可以使用转换工具，将动态图模型转换为静态图模型，具体可以参考https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-rc1/guides/04_dygraph_to_static/index_cn.html。
 ### Q6.5: 构建分类数据集时，如何构建"背景"类别的数据？
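
Q3.3 in this file recommends creating one predictor per exported inference model when fusing predictions from several models. A minimal sketch of the fusion step follows, assuming each predictor is a callable that returns class probabilities of the same length. `ensemble_predict`, `top1`, and the `Predictor` type are hypothetical names for illustration and are not part of PaddleClas; simple averaging is only one possible fusion rule.

```python
from typing import Callable, List, Sequence

# Hypothetical predictor type: takes an input and returns class probabilities.
Predictor = Callable[[object], Sequence[float]]


def ensemble_predict(predictors: List[Predictor], image: object) -> List[float]:
    """Average the class probabilities returned by several predictors."""
    if not predictors:
        raise ValueError("at least one predictor is required")
    outputs = [list(predict(image)) for predict in predictors]
    num_classes = len(outputs[0])
    if any(len(out) != num_classes for out in outputs):
        raise ValueError("all predictors must return the same number of classes")
    # Element-wise mean over the predictors.
    return [sum(out[i] for out in outputs) / len(outputs) for i in range(num_classes)]


def top1(probs: Sequence[float]) -> int:
    """Return the index of the highest averaged probability."""
    return max(range(len(probs)), key=lambda i: probs[i])
```

In practice each callable would wrap one predictor built from an exported inference model; the averaging itself does not depend on how the predictors are created.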

--- a/docs/zh_CN/faq_series/faq_2021_s1.md
+++ b/docs/zh_CN/faq_series/faq_2021_s1.md
@@ -38,7 +38,9 @@
 ### Q1.4 PaddleClas提供的10W类图像分类预训练模型在哪里下载，应该怎么使用呢？
-**A**：基于ResNet50_vd, 百度开源了自研的大规模分类预训练模型，其中训练数据为10万个类别，4300万张图片。10万类预训练模型的下载地址：[下载地址](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_10w_pretrained.tar)，在这里需要注意的是，该预训练模型没有提供最后的FC层参数，因此无法直接拿来预测；但是可以使用它作为预训练模型，在自己的数据集上进行微调。经过验证，该预训练模型相比于基于ImageNet1k数据集的ResNet50_vd预训练模型，在不同的数据集上均有比较明显的精度收益，最多可达30%，更多的对比实验可以参考：[图像分类迁移学习教程](../application/transfer_learning.md)。
+**A**：基于ResNet50_vd, 百度开源了自研的大规模分类预训练模型，其中训练数据为10万个类别，4300万张图片。10万类预训练模型的下载地址：[下载地址](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_10w_pretrained.tar)，在这里需要注意的是，该预训练模型没有提供最后的FC层参数，因此无法直接拿来预测；但是可以使用它作为预训练模型，在自己的数据集上进行微调。经过验证，该预训练模型相比于基于ImageNet1k数据集的ResNet50_vd预训练模型，在不同的数据集上均有比较明显的精度收益，最多可达30%。
+<!-- TODO(gaotingquan): -->
+<!-- ，更多的对比实验可以参考：[图像分类迁移学习教程](../application/transfer_learning.md)。 -->
 ### Q1.5 使用C++进行预测部署的时候怎么进行加速呢？
@@ -178,7 +180,7 @@ RepVGG网络与ACNet同理，只不过ACNet的`1*d`非对称卷积变成了`1*1`
 **A**:
 1. 图像对CNN的依赖是不必要的，利用Transformer的计算效率和可伸缩性，可以训练很大模型，当模型和数据集增大的情形下，仍然不会存在饱和的情况。受到Transformer在NLP上的启发，在图像分类任务中使用时，将图片分成顺序排列的patches，并将这些patches输入一个线性单元嵌入到embedding作为transformer的输入。
-2. 在中等规模数据集中如ImageNet，ImageNet21k，视觉Transformer模型低于相同规模尺寸的ResNet几个百分点。这是因为transformer缺少CNN平移和局限性，在数据量不够大的时候，不能超越卷积网络。
+2. 在中等规模数据集中如ImageNet1k，ImageNet21k，视觉Transformer模型低于相同规模尺寸的ResNet几个百分点。猜测这是因为transformer缺少CNN所具有的局部性(Locality)和空间不变性(Spatial Invariance)的特点，而在数据量不够大的时候，难以超越卷积网络，不过对于这一问题，[DeiT](https://arxiv.org/abs/2012.12877)使用数据增强的方式在一定程度上解决了Vision Transformer依赖超大规模数据集训练的问题。
 3. 在超大规模数据集14M-300M训练时，这种方式可以越过局部信息，建模更加长距离的依赖关系，而CNN能较好关注局部信息全局信息捕获能力较弱。
@@ -199,7 +201,7 @@ RepVGG网络与ACNet同理，只不过ACNet的`1*d`非对称卷积变成了`1*1`
    <img src="../../images/faq/Transformer_input.png" width="400">
 </div>
-3. 考虑以下问题：怎样将一张图片怎么传给encoder？
+3. 考虑以下问题：怎样将一张图片传给encoder？
 * 如下图所示。假设输入图片是[224,224,3]，按照顺序从左到右，从上到下，切分成很多个patch，patch大小可以为[p,p,3]（p取值可以是16，32），对其使用Linear Projection of Flattened Patches模块转成特征向量，并concat一个位置向量，传入Encoder中。
@@ -218,7 +220,7 @@ RepVGG网络与ACNet同理，只不过ACNet的`1*d`非对称卷积变成了`1*1`
 ### Q4.4: 如何理解归纳偏置Inductive Bias？
 **A**:
-1. 在机器学习中，会对算需要应用的问题做一些假设，这个假设就称为归纳偏好。在现实生活中观察得到的现象中归纳出一定的先验规则，然后对模型做一定的约束，从而起到模型选择的作用。在CNN中，假设特征具有局部性(Locality)和空间不变性(Spatial Invariance)的特点，即把相邻的的特征有联系而远离的没有，将相邻特征融合在一起，更会容易产生“解”；还有attention机制，也是从人的直觉、生活经验归纳的规则。
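+1. In machine learning, certain assumptions are made about the problem an algorithm is applied to; these assumptions are called the inductive bias. Prior rules are induced from phenomena observed in real life and then used to constrain the model, which serves as a form of model selection. In CNNs, features are assumed to have locality (Locality) and spatial invariance (Spatial Invariance): neighboring features are related while distant ones are not, so fusing neighboring features makes it easier to arrive at a "solution". The attention mechanism is likewise a rule induced from human intuition and everyday experience.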
 2. Vision Transformer利用的归纳偏置是有序列能力Sequentiality和时间不变性Time Invariance，即序列顺序上的时间间隔的联系，因此也能得出在更大规模数据集上比CNN类的模型有更好的性能。文章Conclusion里的“Unlike prior works using self-attention in computer vision, we do not introduce any image-specific inductive biases into the architecture”和Introduction里的“We find that large scale training trumps inductive bias”，可以得出直观上inductive bias在大量数据的情况中的产生是衰减性能，应该尽可能丢弃。
@@ -242,11 +244,11 @@ PaddleClas的模型包含6大模块的配置，分别为：全局配置，网络
 学习率和优化器的配置建议优先使用默认配置，这些参数是我们已经调过的。如果任务的改动比较大，也可以做微调。
-训练和预测两个配置包含了batch_size，数据集，数据预处理（transforms），读数据进程数（num_workers）等比较重要的配置，这部分要根据实际环境适度修改。要注意的是，paddleclas中的batch_size是全局的配置，即不随卡数发生变化。而num_workers定义的是单卡的进程数，即如果num_workers是8，并且使用4卡训练，则实际有32个worker。
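+The training and inference configs contain important settings such as batch_size, the dataset, data preprocessing (transforms), and the number of data-loading processes (num_workers); adjust these according to your actual environment. Note that batch_size in PaddleClas is a per-card setting: with multi-card training, the total batch_size is the configured value multiplied by the number of cards. For example, with batch_size set to 64 in the config file and 4 cards, the total batch_size is 4*64=256. num_workers is likewise the number of processes per card, so with num_workers set to 8 and 4 cards there are 32 workers in total.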
 ### Q5.2: 如何在命令行中快速的修改配置？
 **A**:
-在训练中，我们常常需要对个别配置不断进行微调，而不希望频繁的修改配置文件。这时可以使用-o来调整，修改是要先按层级写出要改的配置名称，层级之间用点分割，再写出要修改的值。例如我们想要修改batch_size，可以在训练的命令后加上-o TRAIN.batchsize=512。
+在训练中，我们常常需要对个别配置不断进行微调，而不希望频繁的修改配置文件。这时可以使用-o来调整，修改是要先按层级写出要改的配置名称，层级之间用点分割，再写出要修改的值。例如我们想要修改batch_size，可以在训练的命令后加上-o DataLoader.TRAIN.sampler.batch_size=512。
 ### Q5.3: 如何根据PaddleClas的精度曲线选择合适的模型？
 **A**:
@@ -264,4 +266,4 @@ PaddleClas提供了多个模型的benchmark，并绘制了性能曲线，主要
 ### Q5.5: 使用分类模型做其他任务的预训练模型时，应该选择哪些层作为feature？
 **A**:
 使用分类模型做其他任务的backbone有很多策略，这里介绍一种较为基础的方法。首先，去掉最后的全连接层，这一层主要包含的是原始任务的分类信息。如果任务比较简单，只要将前一层的输出作为featuremap，并在此基础上添加与任务对应的结构即可。如果任务涉及多尺度，需要选取不同尺度的anchor，例如某些检测模型，那么可以选取每次下采样之前一层的输出作为featuremap。
\ No newline at end of file
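
Q5.1 and Q5.2 in this file describe two pieces of arithmetic and bookkeeping that are easy to get wrong: the per-card batch_size and num_workers, and the dotted `-o` override syntax. The sketch below illustrates both under stated assumptions. `apply_override`, `total_batch_size`, and `total_workers` are hypothetical helpers, not the PaddleClas implementation, and the key path is copied verbatim from the Q5.2 answer.

```python
from typing import Any, Dict


def apply_override(config: Dict[str, Any], dotted_key: str, value: Any) -> None:
    """Set config[a][b][c] = value for dotted_key 'a.b.c', creating levels as needed."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value


def total_batch_size(per_card_batch_size: int, num_cards: int) -> int:
    """batch_size is configured per card, so the global size scales with the card count."""
    return per_card_batch_size * num_cards


def total_workers(per_card_workers: int, num_cards: int) -> int:
    """num_workers is also per card."""
    return per_card_workers * num_cards


config = {"DataLoader": {"TRAIN": {"sampler": {"batch_size": 64}}}}
print(total_batch_size(config["DataLoader"]["TRAIN"]["sampler"]["batch_size"], 4))
apply_override(config, "DataLoader.TRAIN.sampler.batch_size", 512)
```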
--- a/docs/zh_CN/faq_series/faq_2021_s2.md
+++ b/docs/zh_CN/faq_series/faq_2021_s2.md
@@ -32,11 +32,9 @@
 #### Q2.1.8: 如何在训练时使用 `Mixup` 和 `Cutmix` ？
 **A**：
-* `Mixup` 的使用方法请参考 [Mixup](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Mixup.yaml#L63-L65)；`Cuxmix` 请参考 [Cuxmix](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L63-L65)。
+* For how to use `Mixup`, see [Mixup](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Mixup.yaml#L63-L65); for `Cutmix`, see [Cutmix](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L63-L65).
-* 在使用 `Mixup` 或 `Cutmix` 时，需要注意：
+* 使用 `Mixup` 或 `Cutmix` 做训练时无法计算训练的精度（Acc）指标，因此需要在配置文件中取消 `Metric.Train.TopkAcc` 字段，可参考 [Metric.Train.TopkAcc](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L125-L128)。
-    * 配置文件中的 `Loss.Tranin.CELoss` 需要修改为 `Loss.Tranin.MixCELoss`，可参考 [MixCELoss](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L23-L26)；
-    * 使用 `Mixup` 或 `Cutmix` 做训练时无法计算训练的精度（Acc）指标，因此需要在配置文件中取消 `Metric.Train.TopkAcc` 字段，可参考 [Metric.Train.TopkAcc](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L125-L128)。
 #### Q2.1.9: 训练配置yaml文件中，字段 `Global.pretrain_model` 和 `Global.checkpoints` 分别用于配置什么呢？
 **A**：
@@ -103,9 +101,6 @@ pip install paddle2onnx
 #### Q1.1.1 PaddleClas和PaddleDetection区别
 **A**：PaddleClas是一个兼主体检测、图像分类、图像检索于一体的图像识别repo，用于解决大部分图像识别问题，用户可以很方便的使用PaddleClas来解决小样本、多类别的图像识别问题。PaddleDetection提供了目标检测、关键点检测、多目标跟踪等能力，方便用户定位图像中的感兴趣的点和区域，被广泛应用于工业质检、遥感图像检测、无人巡检等项目。
-#### Q1.1.2 PaddleClas 2.2和PaddleClas 2.1完全兼容吗？
-**A**：PaddleClas2.2相对PaddleClas2.1新增了metric learning模块，主体检测模块、向量检索模块。另外，也提供了商品识别、车辆识别、logo识别和动漫人物识别等4个场景应用示例。用户可以基于PaddleClas 2.2快速构建图像识别系统。在图像分类模块，二者的使用方法类似，可以参考[图像分类示例](../tutorials/getting_started.md)快速迭代和评估。新增的metric learning模块，可以参考[metric learning示例](../tutorials/getting_started_retrieval.md)。另外，新版本暂时还不支持fp16、dali训练，也暂时不支持多标签训练，这块内容将在不久后支持。
 #### Q1.1.3: Momentum 优化器中的 momentum 参数是什么意思呢？
 **A**: Momentum 优化器是在 SGD 优化器的基础上引入了“动量”的概念。在 SGD 优化器中，在 `t+1` 时刻，参数 `w` 的更新可表示为：
 ```latex
@@ -139,7 +134,7 @@ w_t+1 = w_t - v_t+1
 2. 图像裁剪类： CutOut、RandErasing、HideAndSeek、GridMask；
 3. 图像混叠类：Mixup, Cutmix.
-其中，Randangment提供了多种数据增强方式的随机组合，可以满足亮度、对比度、饱和度、色调等多方面的数据增广需求
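+Among these, RandAugment provides random combinations of multiple augmentation methods and can cover augmentation needs for brightness, contrast, saturation, hue, and more.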
 <a name="1.4通用检测模块"></a>
 ### 1.4 通用检测模块
@@ -148,7 +143,7 @@ w_t+1 = w_t - v_t+1
 **A**：主体检测这块的输出数量是可以通过配置文件配置的。在配置文件中Global.threshold控制检测的阈值，小于该阈值的检测框被舍弃，Global.max_det_results控制最大返回的结果数，这两个参数共同决定了输出检测框的数量。
 #### Q1.4.2 训练主体检测模型的数据是如何选择的？换成更小的模型会有损精度吗？
-**A**：训练数据是在COCO、Object365、RPC、LogoDet等公开数据集中随机抽取的子集，小模型精度可能会有一些损失，后续我们也会尝试下更小的检测模型。关于主体检测模型的更多信息请参考[主体检测](../application/mainbody_detection.md)。
+**A**：训练数据是在COCO、Object365、RPC、LogoDet等公开数据集中随机抽取的子集。目前我们在2.3版本中推出了超轻量的主体检测模型，具体信息可以参考[主体检测](../image_recognition_pipeline/mainbody_detection.md#2-模型选择)。关于主体检测模型的更多信息请参考[主体检测](../image_recognition_pipeline/mainbody_detection.md)。
 #### Q1.4.3: 目前使用的主体检测模型检测在某些场景中会有误检？
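 **A**: The current mainbody detection model was trained on public datasets such as COCO, Object365, RPC, and LogoDet. If the data to be detected, such as industrial quality inspection images, differs greatly from these common categories, the current detection model needs to be fine-tuned on that data.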
@@ -169,7 +164,7 @@ w_t+1 = w_t - v_t+1
 ### 1.6 检索模块
 #### Q1.6.1 PaddleClas目前使用的Möbius向量检索算法支持类似于faiss的那种index.add()的功能吗? 另外，每次构建新的图都要进行train吗？这里的train是为了检索加速还是为了构建相似的图？
-**A**：Mobius提供的检索算法是一种基于图的近似最近邻搜索算法，目前支持两种距离计算方式：inner product和L2 distance. faiss中提供的index.add功能暂时不支持，如果需要增加检索库的内容，需要从头重新构建新的index. 在每次构建index时，检索算法内部执行的操作是一种类似于train的过程，不同于faiss提供的train接口，我们命名为build, 主要的目的是为了加速检索的速度。
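+**A**: The release/2.3 branch now supports the faiss retrieval module and no longer supports Möbius. The retrieval algorithm provided by Möbius is a graph-based approximate nearest neighbor search algorithm that currently supports two distance metrics: inner product and L2 distance. Möbius does not yet support the index.add function that faiss provides, so adding content to the retrieval gallery requires rebuilding the index from scratch. Each time an index is built, the algorithm internally performs a train-like process, which differs from the train interface in faiss. If you need the faiss module, use the release/2.3 branch; if you need Möbius, you currently have to roll back to the release/2.2 branch.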
 #### Q1.6.2: PaddleClas 图像识别用于 Eval 的配置文件中，`Query` 和 `Gallery` 配置具体是用于做什么呢？
 **A**: `Query` 与 `Gallery` 均为数据集配置，其中 `Gallery` 用于配置底库数据，`Query` 用于配置验证集。在进行 Eval 时，首先使用模型对 `Gallery` 底库数据进行前向计算特征向量，特征向量用于构建底库，然后模型对 `Query` 验证集中的数据进行前向计算特征向量，再与底库计算召回率等指标。
@@ -218,11 +213,9 @@ PaddlePaddle is installed successfully! Let's start deep learning with PaddlePad
 #### Q2.1.8: 如何在训练时使用 `Mixup` 和 `Cutmix` ？
 **A**：
-* `Mixup` 的使用方法请参考 [Mixup](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Mixup.yaml#L63-L65)；`Cuxmix` 请参考 [Cuxmix](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L63-L65)。
+* For how to use `Mixup`, see [Mixup](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Mixup.yaml#L63-L65); for `Cutmix`, see [Cutmix](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L63-L65).
-* 在使用 `Mixup` 或 `Cutmix` 时，需要注意：
+* 使用 `Mixup` 或 `Cutmix` 做训练时无法计算训练的精度（Acc）指标，因此需要在配置文件中取消 `Metric.Train.TopkAcc` 字段，可参考 [Metric.Train.TopkAcc](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L125-L128)。
-    * 配置文件中的 `Loss.Tranin.CELoss` 需要修改为 `Loss.Tranin.MixCELoss`，可参考 [MixCELoss](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L23-L26)；
-    * 使用 `Mixup` 或 `Cutmix` 做训练时无法计算训练的精度（Acc）指标，因此需要在配置文件中取消 `Metric.Train.TopkAcc` 字段，可参考 [Metric.Train.TopkAcc](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L125-L128)。
 #### Q2.1.9: 训练配置yaml文件中，字段 `Global.pretrain_model` 和 `Global.checkpoints` 分别用于配置什么呢？
 **A**：
@@ -232,9 +225,9 @@ PaddlePaddle is installed successfully! Let's start deep learning with PaddlePad
 <a name="2.2图像分类"></a>
 ### 2.2 图像分类
-#### Q2.2.1 SSLD中，大模型在500M数据上预训练后蒸馏小模型，然后在1M数据上蒸馏finetune小模型？
+#### Q2.2.1 在SSLD中，大模型在500M数据上预训练后蒸馏小模型，然后在1M数据上蒸馏finetune小模型，具体步骤是怎样做的？
 **A**：步骤如下：
-1. 基于facebook开源的`ResNeXt101-32x16d-wsl`模型 去蒸馏得到了`ResNet50-vd`模型；
+1. 基于facebook开源的`ResNeXt101-32x16d-wsl`模型去蒸馏得到了`ResNet50-vd`模型；
 2. Use this `ResNet50-vd` to distill `MobileNetV3` on the 5-million-image (500W) dataset;
 3. 考虑到500W的数据集的分布和100W的数据分布不完全一致，所以这块，在100W上的数据上又finetune了一下，精度有微弱的提升。
@@ -257,13 +250,13 @@ PaddlePaddle is installed successfully! Let's start deep learning with PaddlePad
 ### 2.4 图像识别模块
 #### Q2.4.1: 识别模块预测时报`Illegal instruction`错？
-**A**：可能是编译生成的库文件与您的环境不兼容，导致程序报错，如果报错，推荐参考[向量检索教程](../../../deploy/vector_search/README.md)重新编译库文件。
+**A**：如果使用的是release/2.2分支，建议更新为release/2.3分支，在release/2.3分支中，我们使用faiss检索模块替换了Möbius检索模型，具体可以参考[向量检索教程](../../../deploy/vector_search/README.md)。如仍存在问题，可以在用户微信群中联系我们，也可以在GitHub提issue。
 #### Q2.4.2: 识别模型怎么在预训练模型的基础上进行微调训练？
-**A**：识别模型的微调训练和分类模型的微调训练类似，识别模型可以加载商品的预训练模型，训练过程可以参考[识别模型训练](../tutorials/getting_started_retrieval.md)，后续我们也会持续细化这块的文档。
+**A**：识别模型的微调训练和分类模型的微调训练类似，识别模型可以加载商品的预训练模型，训练过程可以参考[识别模型训练](../../zh_CN/models_training/recognition.md)，后续我们也会持续细化这块的文档。
 #### Q2.4.3: 训练metric learning时，每个epoch中，无法跑完所有mini-batch，为什么？
-**A**：在训练metric learning时，使用的Sampler是DistributedRandomIdentitySampler，该Sampler不会采样全部的图片，导致会让每一个epoch采样的数据不是所有的数据，所以无法跑完显示的mini-batch是正常现象。后续我们会优化下打印的信息，尽可能减少给大家带来的困惑。
+**A**：在训练metric learning时，使用的Sampler是DistributedRandomIdentitySampler，该Sampler不会采样全部的图片，导致会让每一个epoch采样的数据不是所有的数据，所以无法跑完显示的mini-batch是正常现象。该问题在release/2.3分支已经优化，请更新到release/2.3使用。
 #### Q2.4.4: 有些图片没有识别出结果，为什么？
 **A**：在配置文件（如inference_product.yaml）中，`IndexProcess.score_thres`中会控制被识别的图片与库中的图片的余弦相似度的最小值。当余弦相似度小于该值时，不会打印结果。您可以根据自己的实际数据调整该值。
@@ -275,10 +268,10 @@ PaddlePaddle is installed successfully! Let's start deep learning with PaddlePad
 **A**: Make sure the image path and the image name in data_file.txt are separated by a single tab character, not spaces.
 #### Q2.5.2: 新增底库数据需要重新构建索引吗？
-**A**：这一版需要重新构建索引，未来版本会支持只构建新增图片的索引。
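+**A**: Starting from the release/2.3 branch, the faiss retrieval module replaces the Möbius retrieval module, and gallery data can be added without rebuilding the gallery index. For details, see the [vector search tutorial](../../../deploy/vector_search/README.md).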
 #### Q2.5.3: Mac重新编译index.so时报错如下：clang: error: unsupported option '-fopenmp', 该如何处理？
-**A**：该问题已经解决。可以参照[文档](../../../develop/deploy/vector_search/README.md)重新编译 index.so。
+**A**：如果使用的是release/2.2分支，建议更新为release/2.3分支，在release/2.3分支中，我们使用faiss检索模块替换了Möbius检索模型，具体可以参考[向量检索教程](../../../deploy/vector_search/README.md)。如仍存在问题，可以在用户微信群中联系我们，也可以在GitHub提issue。
 #### Q2.5.4: 在 build 检索底库时，参数 `pq_size` 应该如何设置？
 **A**：`pq_size` 是PQ检索算法的参数。PQ检索算法可以简单理解为“分层”检索算法，`pq_size` 是每层的“容量”，因此该参数的设置会影响检索性能，不过，在底库总数据量不太大（小于10000张）的情况下，这个参数对性能的影响很小，因此对于大多数使用场景而言，在构建底库时无需修改该参数。关于PQ检索算法的更多内容，可以查看相关[论文](https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf)。
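
Q2.4.4 in this file explains that `IndexProcess.score_thres` sets the minimum cosine similarity between the query image and a gallery image, and that matches below it are not printed. A minimal sketch of that filtering rule follows, written in plain Python on feature vectors. `cosine_similarity` and `filter_matches` are hypothetical helpers for illustration; the real pipeline computes features with a model and searches an index instead of scanning a list.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_matches(
    query: Sequence[float],
    gallery: List[Tuple[str, Sequence[float]]],
    score_thres: float,
) -> List[Tuple[str, float]]:
    """Keep gallery entries whose similarity to the query reaches score_thres."""
    scored = [(label, cosine_similarity(query, feat)) for label, feat in gallery]
    kept = [(label, score) for label, score in scored if score >= score_thres]
    # Best match first.
    return sorted(kept, key=lambda item: item[1], reverse=True)
```

Raising the threshold trades recall for precision: fewer images get a result, but the results that remain are closer to the gallery entries.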

--- a/docs/zh_CN/faq_series/faq_selected_30.md
+++ b/docs/zh_CN/faq_series/faq_selected_30.md
--- a/docs/zh_CN/others/update_history.md
+++ b/docs/zh_CN/others/update_history.md
 # 更新日志
 - 2021.08.11 更新7个[FAQ](docs/zh_CN/faq_series/faq_2021_s2.md)。
 - 2021.06.29 添加Swin-transformer系列模型，ImageNet1k数据集上Top1 acc最高精度可达87.2%；支持训练预测评估与whl包部署，预训练模型可以从[这里](docs/zh_CN/models/models_intro.md)下载。
 - 2021.06.22,23,24 PaddleClas官方研发团队带来技术深入解读三日直播课。课程回放：[https://aistudio.baidu.com/aistudio/course/introduce/24519](https://aistudio.baidu.com/aistudio/course/introduce/24519)

--- a/docs/zh_CN/quick_start/quick_start_recognition.md
+++ b/docs/zh_CN/quick_start/quick_start_recognition.md
--- a/ppcls/arch/backbone/base/theseus_layer.py
+++ b/ppcls/arch/backbone/base/theseus_layer.py
@@ -15,6 +15,7 @@ class TheseusLayer(nn.Layer):
    def __init__(self, *args, **kwargs):
        super(TheseusLayer, self).__init__()
        self.res_dict = {}
+        self.res_name = self.full_name()
    # stop doesn't work when stop layer has a parallel branch.
    def stop_after(self, stop_layer_name: str):
@@ -33,29 +34,45 @@ class TheseusLayer(nn.Layer):
        return after_stop
    def update_res(self, return_patterns):
-        if not return_patterns or isinstance(self, WrapLayer):
+        for return_pattern in return_patterns:
-            return
+            pattern_list = return_pattern.split(".")
-        for layer_i in self._sub_layers:
+            if not pattern_list:
-            layer_name = self._sub_layers[layer_i].full_name()
+                continue
-            if isinstance(self._sub_layers[layer_i], (nn.Sequential, nn.LayerList)):
+            sub_layer_parent = self
-                self._sub_layers[layer_i] = wrap_theseus(self._sub_layers[layer_i], self.res_dict)
+            while len(pattern_list) > 1:
-                self._sub_layers[layer_i].update_res(return_patterns)
+                if '[' in pattern_list[0]:
+                    sub_layer_name = pattern_list[0].split('[')[0]
+                    sub_layer_index = pattern_list[0].split('[')[1].split(']')[0]
+                    sub_layer_parent = getattr(sub_layer_parent, sub_layer_name)[sub_layer_index]
+                else:
+                    sub_layer_parent = getattr(sub_layer_parent, pattern_list[0],
+                                               None)
+                    if sub_layer_parent is None:
+                        break
+                if isinstance(sub_layer_parent, WrapLayer):
+                    sub_layer_parent = sub_layer_parent.sub_layer
+                pattern_list = pattern_list[1:]
+            if sub_layer_parent is None:
+                continue
+            if '[' in pattern_list[0]:
+                sub_layer_name = pattern_list[0].split('[')[0]
+                sub_layer_index = pattern_list[0].split('[')[1].split(']')[0]
+                sub_layer = getattr(sub_layer_parent, sub_layer_name)[sub_layer_index]
+                if not isinstance(sub_layer, TheseusLayer):
+                    sub_layer = wrap_theseus(sub_layer)
+                getattr(sub_layer_parent, sub_layer_name)[sub_layer_index] = sub_layer
            else:
-                for return_pattern in return_patterns:
+                sub_layer = getattr(sub_layer_parent, pattern_list[0])
-                    if re.match(return_pattern, layer_name):
+                if not isinstance(sub_layer, TheseusLayer):
-                        if not isinstance(self._sub_layers[layer_i], TheseusLayer):
+                    sub_layer = wrap_theseus(sub_layer)
-                            self._sub_layers[layer_i] = wrap_theseus(self._sub_layers[layer_i], self.res_dict)
+                setattr(sub_layer_parent, pattern_list[0], sub_layer)
-                        else:
-                            self._sub_layers[layer_i].res_dict = self.res_dict
+            sub_layer.res_dict = self.res_dict
+            sub_layer.res_name = return_pattern
-                        self._sub_layers[layer_i].register_forward_post_hook(
+            sub_layer.register_forward_post_hook(sub_layer._save_sub_res_hook)
-                            self._sub_layers[layer_i]._save_sub_res_hook)
-            if isinstance(self._sub_layers[layer_i], TheseusLayer):
-                self._sub_layers[layer_i].res_dict = self.res_dict
-                self._sub_layers[layer_i].update_res(return_patterns)
    def _save_sub_res_hook(self, layer, input, output):
-        self.res_dict[layer.full_name()] = output
+        self.res_dict[self.res_name] = output
    def _return_dict_hook(self, layer, input, output):
        res_dict = {"output": output}
@@ -63,19 +80,23 @@ class TheseusLayer(nn.Layer):
            res_dict[res_key] = self.res_dict.pop(res_key)
        return res_dict
-    def replace_sub(self, layer_name_pattern, replace_function, recursive=True):
+    def replace_sub(self, layer_name_pattern, replace_function,
+                    recursive=True):
        for layer_i in self._sub_layers:
            layer_name = self._sub_layers[layer_i].full_name()
            if re.match(layer_name_pattern, layer_name):
-                self._sub_layers[layer_i] = replace_function(self._sub_layers[layer_i])
+                self._sub_layers[layer_i] = replace_function(self._sub_layers[
+                    layer_i])
            if recursive:
                if isinstance(self._sub_layers[layer_i], TheseusLayer):
                    self._sub_layers[layer_i].replace_sub(
                        layer_name_pattern, replace_function, recursive)
-                elif isinstance(self._sub_layers[layer_i], (nn.Sequential, nn.LayerList)):
+                elif isinstance(self._sub_layers[layer_i],
+                                (nn.Sequential, nn.LayerList)):
                    for layer_j in self._sub_layers[layer_i]._sub_layers:
-                        self._sub_layers[layer_i]._sub_layers[layer_j].replace_sub(
+                        self._sub_layers[layer_i]._sub_layers[
-                            layer_name_pattern, replace_function, recursive)
+                            layer_j].replace_sub(layer_name_pattern,
+                                                 replace_function, recursive)
    '''
    example of replace function:
@@ -92,39 +113,14 @@ class TheseusLayer(nn.Layer):
 class WrapLayer(TheseusLayer):
-    def __init__(self, sub_layer, res_dict=None):
+    def __init__(self, sub_layer):
        super(WrapLayer, self).__init__()
        self.sub_layer = sub_layer
-        self.name = sub_layer.full_name()
-        if res_dict is not None:
-            self.res_dict = res_dict
-    def full_name(self):
-        return self.name
    def forward(self, *inputs, **kwargs):
        return self.sub_layer(*inputs, **kwargs)
-    def update_res(self, return_patterns):
-        if not return_patterns or not isinstance(self.sub_layer, (nn.Sequential, nn.LayerList)):
+def wrap_theseus(sub_layer):
-            return
+    wrapped_layer = WrapLayer(sub_layer)
-        for layer_i in self.sub_layer._sub_layers:
-            if isinstance(self.sub_layer._sub_layers[layer_i], (nn.Sequential, nn.LayerList)):
-                self.sub_layer._sub_layers[layer_i] = wrap_theseus(self.sub_layer._sub_layers[layer_i], self.res_dict)
-                self.sub_layer._sub_layers[layer_i].update_res(return_patterns)
-            elif isinstance(self.sub_layer._sub_layers[layer_i], TheseusLayer):
-                self.sub_layer._sub_layers[layer_i].res_dict = self.res_dict
-            layer_name = self.sub_layer._sub_layers[layer_i].full_name()
-            for return_pattern in return_patterns:
-                if re.match(return_pattern, layer_name):
-                    self.sub_layer._sub_layers[layer_i].register_forward_post_hook(
-                        self._sub_layers[layer_i]._save_sub_res_hook)
-            if isinstance(self.sub_layer._sub_layers[layer_i], TheseusLayer):
-                self.sub_layer._sub_layers[layer_i].update_res(return_patterns)
-def wrap_theseus(sub_layer, res_dict=None):
-    wrapped_layer = WrapLayer(sub_layer, res_dict)
    return wrapped_layer
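
The rewritten `update_res` above no longer matches layer names with regular expressions; it walks each return pattern as a dotted path whose segments may carry a bracketed index. The following sketch shows that traversal on plain Python objects, as a reading aid rather than a copy of the method. `parse_return_pattern`, `resolve`, and the sample pattern `blocks[1].attn` are hypothetical. The sketch also converts the index to an integer so that it works on ordinary lists, whereas the diff keeps the index as a string.

```python
import re
from types import SimpleNamespace
from typing import Any, List, Optional, Tuple

# One path segment: an attribute name, optionally followed by [index].
_STEP = re.compile(r"^(?P<name>[^\[\]]+)(\[(?P<index>[^\[\]]+)\])?$")


def parse_return_pattern(pattern: str) -> List[Tuple[str, Optional[str]]]:
    """Split 'a[0].b' into [('a', '0'), ('b', None)]."""
    steps = []
    for part in pattern.split("."):
        match = _STEP.match(part)
        if match is None:
            raise ValueError(f"malformed pattern segment: {part!r}")
        steps.append((match.group("name"), match.group("index")))
    return steps


def resolve(root: Any, pattern: str) -> Any:
    """Follow the pattern from root via getattr and optional integer indexing."""
    node = root
    for name, index in parse_return_pattern(pattern):
        node = getattr(node, name)
        if index is not None:
            node = node[int(index)]
    return node


# Stand-in object tree; a real model would be a layer hierarchy.
root = SimpleNamespace(
    blocks=[SimpleNamespace(attn="attn0"), SimpleNamespace(attn="attn1")]
)
print(resolve(root, "blocks[1].attn"))
```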
--- a/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
  name: DeiT_base_distilled_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
  class_num: 1000
 # loss function config for traing/eval process

--- a/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_384.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_384.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
  name: DeiT_base_distilled_patch16_384
+  drop_path_rate : 0.1
+  drop_rate : 0.0
  class_num: 1000
 # loss function config for traing/eval process

--- a/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
  name: DeiT_base_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
  class_num: 1000
 # loss function config for traing/eval process

--- a/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_384.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_384.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
  name: DeiT_base_patch16_384
+  drop_path_rate : 0.1
+  drop_rate : 0.0
  class_num: 1000
 # loss function config for traing/eval process

--- a/ppcls/configs/ImageNet/DeiT/DeiT_small_distilled_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_small_distilled_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
  name: DeiT_small_distilled_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
  class_num: 1000
 # loss function config for traing/eval process

--- a/ppcls/configs/ImageNet/DeiT/DeiT_small_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_small_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
  name: DeiT_small_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
  class_num: 1000
 # loss function config for traing/eval process

--- a/ppcls/configs/ImageNet/DeiT/DeiT_tiny_distilled_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_tiny_distilled_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
  name: DeiT_tiny_distilled_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
  class_num: 1000
 # loss function config for traing/eval process

--- a/ppcls/configs/ImageNet/DeiT/DeiT_tiny_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_tiny_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
  name: DeiT_tiny_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
  class_num: 1000
 # loss function config for traing/eval process

--- a/ppcls/engine/engine.py
+++ b/ppcls/engine/engine.py
@@ -61,7 +61,7 @@ class Engine(object):
        # set seed
        seed = self.config["Global"].get("seed", False)
-        if seed:
+        if seed or seed == 0:
            assert isinstance(seed, int), "The 'seed' must be a integer!"
            paddle.seed(seed)
            np.random.seed(seed)
@@ -91,7 +91,7 @@ class Engine(object):
            self.vdl_writer = LogWriter(logdir=vdl_writer_path)
        # set device
-        assert self.config["Global"]["device"] in ["cpu", "gpu", "xpu"]
+        assert self.config["Global"]["device"] in ["cpu", "gpu", "xpu", "npu"]
        self.device = paddle.set_device(self.config["Global"]["device"])
        logger.info('train with paddle {} and device {}'.format(
            paddle.__version__, self.device))
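Note on the seed check: the old `if seed:` skipped a configured seed of `0` because `0` is falsy. The new condition accepts `0`, but in Python `False == 0` is also true, so the default value `False` from `.get("seed", False)` now enters the branch as well, and the `isinstance(seed, int)` assertion passes because `bool` is a subclass of `int`. Whether that is intended is not clear from the diff. The sketch below compares the two conditions with a stricter variant; `strict_check` is a hypothetical alternative, not part of this change.

```python
def old_check(seed):
    # Condition before this change: any falsy value, including 0, is skipped.
    return bool(seed)


def new_check(seed):
    # Condition after this change: 0 is accepted, and so is False (False == 0).
    return bool(seed or seed == 0)


def strict_check(seed):
    # Hypothetical alternative: accept real integers only, never booleans.
    return isinstance(seed, int) and not isinstance(seed, bool)


if __name__ == "__main__":
    for value in (False, 0, 42):
        print(value, old_check(value), new_check(value), strict_check(value))
```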

--- a/ppcls/engine/train/train.py
+++ b/ppcls/engine/train/train.py
@@ -16,6 +16,7 @@ from __future__ import absolute_import, division, print_function
 import time
 import paddle
 from ppcls.engine.train.utils import update_loss, update_metric, log_info
+from ppcls.utils import profiler
 def train_epoch(engine, epoch_id, print_batch_step):
@@ -23,6 +24,7 @@ def train_epoch(engine, epoch_id, print_batch_step):
    for iter_id, batch in enumerate(engine.train_dataloader):
        if iter_id >= engine.max_iter:
            break
+        profiler.add_profiler_step(engine.config["profiler_options"])
        if iter_id == 5:
            for key in engine.time_info:
                engine.time_info[key].reset()

--- a/ppcls/static/program.py
+++ b/ppcls/static/program.py
@@ -433,9 +433,8 @@ def run(dataloader,
    end_str = ' '.join([str(m.mean) for m in metric_dict.values()] +
                       [metric_dict["batch_time"].total])
-    ips_info = "ips: {:.5f} images/sec.".format(
-        batch_size * metric_dict["batch_time"].count /
-        metric_dict["batch_time"].sum)
+    ips_info = "ips: {:.5f} images/sec.".format(batch_size /
+                                                metric_dict["batch_time"].avg)
    if mode == 'eval':
        logger.info("END {:s} {:s} {:s}".format(mode, end_str, ips_info))
    else:
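Note on the `ips` change: `batch_size * count / sum` and `batch_size / avg` give the same number only if the meter's `avg` is plain `sum / count`. The meter class is not shown in this diff, so the sketch below uses a hypothetical `AverageMeter` with exactly that definition to check the equivalence. If the real `avg` excludes warm-up steps or is smoothed, the reported throughput will differ from the old value.

```python
class AverageMeter:
    """Hypothetical stand-in for the batch_time meter (avg = sum / count)."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val):
        self.sum += val
        self.count += 1

    @property
    def avg(self):
        return self.sum / self.count


def ips_old(batch_size, meter):
    # Expression removed by this change.
    return batch_size * meter.count / meter.sum


def ips_new(batch_size, meter):
    # Expression added by this change.
    return batch_size / meter.avg


if __name__ == "__main__":
    meter = AverageMeter()
    for seconds in (0.5, 0.25, 0.25):
        meter.update(seconds)
    print(ips_old(32, meter), ips_new(32, meter))
```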

--- a/ppcls/static/run_dali.sh
+++ b/ppcls/static/run_dali.sh
@@ -5,7 +5,7 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.80
 python3.7 -m paddle.distributed.launch \
    --gpus="0,1,2,3,4,5,6,7" \
-    ppcls/static//train.py \
+    ppcls/static/train.py \
    -c ./ppcls/configs/ImageNet/ResNet/ResNet50_fp16.yaml \
    -o Global.use_dali=True

--- a/ppcls/static/train.py
+++ b/ppcls/static/train.py
@@ -91,14 +91,17 @@ def main(args):
            os.environ[k] = AMP_RELATED_FLAGS_SETTING[k]
    use_xpu = global_config.get("use_xpu", False)
+    use_npu = global_config.get("use_npu", False)
    assert (
-        use_gpu and use_xpu
-    ) is not True, "gpu and xpu can not be true in the same time in static mode!"
+        use_gpu and use_xpu and use_npu
+    ) is not True, "gpu, xpu and npu can not be true in the same time in static mode!"
    if use_gpu:
        device = paddle.set_device('gpu')
    elif use_xpu:
        device = paddle.set_device('xpu')
+    elif use_npu:
+        device = paddle.set_device('npu')
    else:
        device = paddle.set_device('cpu')
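Note on the device assertion: `(use_gpu and use_xpu and use_npu) is not True` fails only when all three flags are true, so a config that enables exactly two of them (for example gpu and xpu, which the old assertion rejected) now passes and falls through to the first matching branch. If the intent is "at most one device flag", a check along the following lines would express it. `at_most_one_device` is a hypothetical helper, and `diff_condition` simply restates the expression from this change for comparison.

```python
def diff_condition(use_gpu, use_xpu, use_npu):
    # The condition as written in this change.
    return (use_gpu and use_xpu and use_npu) is not True


def at_most_one_device(use_gpu, use_xpu, use_npu):
    # Hypothetical stricter check: count the enabled flags.
    return sum(bool(flag) for flag in (use_gpu, use_xpu, use_npu)) <= 1


if __name__ == "__main__":
    flags = (True, True, False)  # gpu and xpu both enabled
    print(diff_condition(*flags), at_most_one_device(*flags))
```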

--- a/ppcls/utils/config.py
+++ b/ppcls/utils/config.py
@@ -199,5 +199,12 @@ def parse_args():
        action='append',
        default=[],
        help='config options to be overridden')
+    parser.add_argument(
+        '-p',
+        '--profiler_options',
+        type=str,
+        default=None,
+        help='The option of profiler, which should be in format \"key1=value1;key2=value2;key3=value3\".'
+    )
    args = parser.parse_args()
    return args
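Note on `--profiler_options`: the help text only specifies the string format `key1=value1;key2=value2;key3=value3`. The parsing itself lives in `ppcls.utils.profiler`, which is not part of this diff. As an illustration of the format only, a hypothetical parser could look like this; `parse_profiler_options` and its error handling are assumptions, not the library's implementation.

```python
def parse_profiler_options(options_str):
    """Hypothetical parser for 'key1=value1;key2=value2' option strings."""
    options = {}
    if not options_str:  # covers the default of None and the empty string
        return options
    for pair in options_str.split(";"):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing ';'
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError("expected key=value, got {!r}".format(pair))
        options[key.strip()] = value.strip()
    return options


if __name__ == "__main__":
    print(parse_profiler_options("key1=value1;key2=value2;key3=value3"))
```

All values stay strings in this sketch; any type conversion would depend on the keys the profiler actually accepts.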

--- a/tests/config/ResNet50_vd.txt
+++ b/tests/config/ResNet50_vd.txt
@@ -33,7 +33,7 @@ fpgm_export:tools/export_model.py -c ppcls/configs/slim/ResNet50_vd_prune.yaml
 distill_export:null
 kl_quant:deploy/slim/quant_post_static.py -c ppcls/configs/ImageNet/ResNet/ResNet50_vd.yaml -o Global.save_inference_dir=./inference
 export2:null
-infer_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/ResNet50_vd_inference.tar
+inference_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/ResNet50_vd_inference.tar
 infer_model:../inference/
 infer_export:null
 infer_quant:Fasle

--- a/tests/prepare.sh
+++ b/tests/prepare.sh
@@ -42,6 +42,9 @@ if [ ${MODE} = "lite_train_infer" ] || [ ${MODE} = "whole_infer" ];then
    cd ILSVRC2012 
    mv train.txt train_list.txt
    mv val.txt val_list.txt
+    if [ ${MODE} = "lite_train_infer" ];then
+	cp -r train/* val/
+    fi
    cd ../../
 elif [ ${MODE} = "infer" ] || [ ${MODE} = "cpp_infer" ];then
    # download data

--- a/tools/train.py
+++ b/tools/train.py
@@ -27,5 +27,6 @@ if __name__ == "__main__":
    args = config.parse_args()
    config = config.get_config(
        args.config, overrides=args.override, show=False)
+    config.profiler_options = args.profiler_options
    engine = Engine(config, mode="train")
    engine.train()
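Note on how the option reaches the training loop: `tools/train.py` assigns `config.profiler_options` as an attribute, while `train_epoch` reads `engine.config["profiler_options"]` as a key. That only works if the object returned by `get_config` maps attribute access onto dictionary keys. The diff does not show that class, so the sketch below assumes a dict subclass of that kind; this `AttrDict` is a minimal illustration, not the one in `ppcls/utils/config.py`.

```python
class AttrDict(dict):
    """Minimal dict subclass whose attributes are its keys (illustrative)."""

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

    def __setattr__(self, key, value):
        self[key] = value


if __name__ == "__main__":
    config = AttrDict({"Global": {"device": "cpu"}})
    config.profiler_options = None          # as in tools/train.py
    print(config["profiler_options"])       # as read in train_epoch
```

With a plain `dict`, the attribute assignment would raise `AttributeError`. The key also exists only when this entry point sets it, so any other script that calls `train_epoch` would need to set `profiler_options` as well.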