diff --git a/README_ch.md b/README_ch.md
index 95ee77e29b2c35da856840ced947d09499937340..a3aacd931b8a54e6ee1f275a8141b2b82602f65e 100644
--- a/README_ch.md
+++ b/README_ch.md
@@ -7,32 +7,26 @@
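 PaddleClas, the PaddlePaddle image recognition suite, is a toolset for image recognition tasks built by PaddlePaddle for industry and academia, helping users train better vision models and deploy them in applications.
 
 **Recent updates**
-- 2021.10.31 Released the [PP-ShiTu technical report](./docs/PP_ShiTu.pdf), improved the documentation, and added a drink recognition demo
-- 2021.10.23 Released the PP-ShiTu image recognition system, which recognizes an image against a gallery of 100,000+ images in 200 ms on CPU.
+
+- 2021.11.1 Released the [PP-ShiTu technical report](https://arxiv.org/pdf/2111.00775.pdf) and added a drink recognition demo
+- 2021.10.23 Released PP-ShiTu, a lightweight image recognition system that recognizes an image against a gallery of 100,000+ images in 0.2 s on CPU.
 [Click here](./docs/zh_CN/quick_start/quick_start_recognition.md) to try it now
-- 2021.09.17 Added the PP-LCNet series of models developed by PaddleClas, which are highly competitive on Intel CPUs. For an introduction to PP-LCNet, see the [paper](https://arxiv.org/pdf/2109.15099.pdf) or the [PP-LCNet model introduction](docs/zh_CN/models/PP-LCNet.md); metrics and pretrained weights can be downloaded [here](docs/zh_CN/ImageNet_models_cn.md).
+- 2021.09.17 Released the PP-LCNet series of ultra-lightweight backbone models. On Intel CPUs, single-image inference takes about 5 ms and Top-1 accuracy on the ImageNet-1K dataset reaches 80.82%, surpassing ResNet152. For an introduction to PP-LCNet, see the [paper](https://arxiv.org/pdf/2109.15099.pdf) or the [PP-LCNet model introduction](docs/zh_CN/models/PP-LCNet.md); metrics and pretrained weights can be downloaded [here](docs/zh_CN/algorithm_introduction/ImageNet_models.md).
 - [more](./docs/zh_CN/others/update_history.md)
 
 ## Features
 
-- PP-ShiTu lightweight image recognition system: integrates object detection, feature learning, image retrieval, and other modules, and applies broadly to all kinds of image recognition tasks.
-It recognizes an image against a gallery of 100,000+ images in 200 ms on CPU.
-For details, see [PP-ShiTu: A Practical Lightweight Image Recognition System](./docs/PP_ShiTu.pdf)
+- PP-ShiTu lightweight image recognition system: integrates object detection, feature learning, image retrieval, and other modules, and applies broadly to all kinds of image recognition tasks. It recognizes an image against a gallery of 100,000+ images in 0.2 s on CPU.
 
-- PP-LCNet lightweight CPU backbone network: a lightweight backbone built specifically for CPU devices, surpassing competing models in both speed and accuracy.
-For details, see [PP-LCNet: A Lightweight CPU Convolutional Neural Network](https://arxiv.org/pdf/2109.15099.pdf),
-or the [PP-LCNet model introduction](docs/zh_CN/models/PP-LCNet.md).
+- PP-LCNet lightweight CPU backbone network: a lightweight backbone built specifically for CPU devices, far surpassing competing models in both speed and accuracy.
 
-- Rich pretrained model zoo: provides 164 ImageNet pretrained models across 35 series, of which 6 selected series support quick structural modification.
+- Rich pretrained model zoo: provides 175 ImageNet pretrained models across 36 series, of which 7 selected series support quick structural modification.
 
 - Comprehensive, easy-to-use feature learning components: integrates 12 metric learning methods such as arcmargin and triplet loss, which can be freely combined and switched through configuration files.
 
 - SSLD knowledge distillation: 14 pretrained classification models, with accuracy generally improved by more than 3%; among them, the ResNet50_vd model reaches 84.0% Top-1 accuracy on the ImageNet-1k dataset,
 and the Res2Net200_vd pretrained model reaches a Top-1 accuracy of 85.1%.
 
-- Data augmentation: supports 8 data augmentation algorithms such as AutoAugment, Cutout, and Cutmix, with detailed introductions, code reproductions, and evaluations under a unified experimental environment.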
-
-
 <div align="center">
 <img src="./docs/images/recognition.gif"  width = "400" />
 </div>
@@ -47,6 +41,7 @@ Res2Net200_vd预训练模型Top-1精度高达85.1%。
 </div>
 
 ## Quick start
+
 PP-ShiTu image recognition quick start: [click here](./docs/zh_CN/quick_start/quick_start_recognition.md)
 
 ## Tutorials
@@ -59,9 +54,7 @@ PP-ShiTu图像识别快速体验：[点击这里](./docs/zh_CN/quick_start/quick
     - [Beginner version](./docs/zh_CN/quick_start/quick_start_classification_new_user.md)
     - [Advanced version](./docs/zh_CN/quick_start/quick_start_classification_professional.md) 
 - [Introduction to the PP-ShiTu image recognition system](#图像识别系统介绍)
-  - [Mainbody detection](./docs/zh_CN/algorithm_introduction/mainbody_detection.md)
-  - [Feature learning](./docs/zh_CN/algorithm_introduction/metric_learning.md)
-  - [Vector search](./deploy/vector_search/README.md)
+- [Backbone networks and pretrained model zoo](./docs/zh_CN/algorithm_introduction/ImageNet_models.md)
 - Data preparation
   - [Image classification datasets](./docs/zh_CN/data_preparation/classification_dataset.md)
   - [Image recognition datasets](./docs/zh_CN/data_preparation/recognition_dataset.md)
@@ -83,7 +76,6 @@ PP-ShiTu图像识别快速体验：[点击这里](./docs/zh_CN/quick_start/quick
 - Algorithm introduction
     - [Image classification task](./docs/zh_CN/algorithm_introduction/image_classification.md)
     - [Metric learning](./docs/zh_CN/algorithm_introduction/metric_learning.md)
-    - [Backbone networks and pretrained model zoo](./docs/zh_CN/algorithm_introduction/ImageNet_models.md)
 - Advanced usage
     - [Data augmentation](./docs/zh_CN/advanced_tutorials/DataAugmentation.md)
     - [Model quantization](./docs/zh_CN/advanced_tutorials/model_prune_quantization.md)
@@ -92,7 +84,7 @@ PP-ShiTu图像识别快速体验：[点击这里](./docs/zh_CN/quick_start/quick
     - [Community contribution guide](./docs/zh_CN/advanced_tutorials/how_to_contribute.md)
 - FAQ
     - [Selected image recognition questions](docs/zh_CN/faq_series/faq_2021_s2.md)
-    - [Selected image classification questions](docs/zh_CN/faq_series/faq.md)
+    - [Selected image classification questions](docs/zh_CN/faq_series/faq_selected_30.md)
     - [Image classification FAQ, season 1](docs/zh_CN/faq_series/faq_2020_s1.md)
     - [Image classification FAQ, season 2](docs/zh_CN/faq_series/faq_2021_s1.md)
 - [License](#许可证书)
@@ -105,9 +97,8 @@ PP-ShiTu图像识别快速体验：[点击这里](./docs/zh_CN/quick_start/quick
 <img src="./docs/images/structure.jpg"  width = "800" />
 </div>
 
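-The PP-ShiTu image recognition system works in three steps: (1) an object detection model detects candidate object regions in the image; (2) features are extracted from each candidate region; (3) the features are matched against the images in the retrieval gallery to produce the recognition result.
+PP-ShiTu is a practical, lightweight, general-purpose image recognition system composed of three main modules: mainbody detection, feature learning, and vector search. The system optimizes the model in each module with a range of strategies across eight aspects: backbone selection and tuning, loss function selection, data augmentation, learning rate scheduling, regularization parameter selection, use of pretrained models, model pruning, and quantization. The result is a system that recognizes an image against a gallery of 100,000+ images in only 0.2 s on CPU. For more details, see the [PP-ShiTu technical report](https://arxiv.org/pdf/2111.00775.pdf).
 
-For a new, unseen category, the model does not need to be retrained; adding images of that category to the retrieval gallery and rebuilding the gallery is enough to recognize it.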
 
 <a name="识别效果展示"></a>
 ## PP-ShiTu image recognition demo results 
@@ -152,4 +143,3 @@ PP-ShiTu图像识别系统分为三步：（1）通过一个目标检测模型
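 - Many thanks to [nblib](https://github.com/nblib) for fixing the RandErasing data augmentation configuration file in PaddleClas.
 - Many thanks to [chenpy228](https://github.com/chenpy228) for fixing several typos in the PaddleClas documentation.
 - Many thanks to [jm12138](https://github.com/jm12138) for adding the ViT, DeiT, and RepVGG series of models to PaddleClas.
-- Many thanks to [FutureSI](https://aistudio.baidu.com/aistudio/personalcenter/thirdview/76563) for the analysis and summary of the PaddleClas code.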
diff --git a/README_en.md b/README_en.md
index b6dbd1281822215a8d1cbef8f6dff658dd019f2c..47abd67e71a318444fd492c8ccee66ed6107f561 100644
--- a/README_en.md
+++ b/README_en.md
@@ -8,7 +8,8 @@ PaddleClas is an image recognition toolset for industry and academia, helping us
 
 **Recent updates**
 
-- 2021.09.17 Add PP-LCNet series model developed by PaddleClas, these models show strong competitiveness on Intel CPUs. The metrics and pretrained model are available [here](docs/en/ImageNet_models_en.md).
+- 2021.09.17 Add the PP-LCNet series of models developed by PaddleClas, which show strong competitiveness on Intel CPUs. 
+For an introduction to PP-LCNet, please refer to the [paper](https://arxiv.org/pdf/2109.15099.pdf) or the [PP-LCNet model introduction](docs/en/models/PP-LCNet_en.md). The metrics and pretrained models are available [here](docs/en/ImageNet_models_en.md).
 
 - 2021.06.29 Add Swin-transformer series model，Highest top1 acc on ImageNet1k dataset reaches 87.2%, training, evaluation and inference are all supported. Pretrained models can be downloaded [here](docs/en/models/models_intro_en.md).
 - 2021.06.16 PaddleClas release/2.2. Add metric learning and vector search modules. Add product recognition, animation character recognition, vehicle recognition and logo recognition. Added 30 pretrained models of LeViT, Twins, TNT, DLA, HarDNet, and RedNet, and the accuracy is roughly the same as that of the paper.
diff --git a/benchmark/README.md b/benchmark/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..2e892d97ed62ec6d02060db2015a8fef40c8a7c2
--- /dev/null
+++ b/benchmark/README.md
@@ -0,0 +1,27 @@
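+# Benchmark Usage Guide
+
+The shell scripts in this directory measure speed metrics for the models in PaddleClas, such as single-GPU and multi-GPU training speed.
+
+## Scripts
+
+There are three scripts:
+
+- `prepare_data.sh`: downloads the test data and sets up the data paths
+- `run_benchmark.sh`: runs a single training benchmark; see the comments in the script for how to invoke it
+- `run_all.sh`: entry point that runs all training benchmarks
+
+## Usage
+
+**Note**: To stay consistent with the other PaddleClas modules, run these scripts from the `PaddleClas` root directory.
+
+### 1. Prepare the data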
+
+```shell
+bash benchmark/prepare_data.sh
+```
+
+### 2. Run the benchmark for all models
+
+```shell
+bash benchmark/run_all.sh
+```
diff --git a/benchmark/prepare_data.sh b/benchmark/prepare_data.sh
new file mode 100644
index 0000000000000000000000000000000000000000..83a6856983920968539d96c608559e65c37ca48a
--- /dev/null
+++ b/benchmark/prepare_data.sh
@@ -0,0 +1,11 @@
+#!/bin/bash
+dataset_url=$1
+
+cd dataset
+rm -rf ILSVRC2012
+wget -nc ${dataset_url}
+tar xf ILSVRC2012_val.tar
+ln -s ILSVRC2012_val ILSVRC2012
+cd ILSVRC2012
+ln -s val_list.txt train_list.txt
+cd ../../
diff --git a/benchmark/run_all.sh b/benchmark/run_all.sh
new file mode 100644
index 0000000000000000000000000000000000000000..7e7b5fe0a26c533e467f876d12929a788d23097c
--- /dev/null
+++ b/benchmark/run_all.sh
@@ -0,0 +1,25 @@
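+# Script for reproducing performance numbers reliably; by default it runs under py37 in the standard docker environment: paddlepaddle/paddle:latest-gpu-cuda10.1-cudnn7  paddle=2.1.2  py=37
+# Working directory: to be specified
+# cd **
+# 1 Install the dependencies the model needs (note it here if an optimization strategy must be enabled)
+# pip install ...
+# 2 Copy the data and pretrained models the model needs
+# 3 Run in batch (if batching is inconvenient, steps 1 and 2 must go into each individual model's script)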
+ 
+model_mode_list=(MobileNetV1 MobileNetV2 MobileNetV3_large_x1_0 EfficientNetB0 ShuffleNetV2_x1_0 DenseNet121 HRNet_W48_C SwinTransformer_tiny_patch4_window7_224 alt_gvt_base)
+fp_item_list=(fp32)
+bs_list=(32 64 96 128)
+for model_mode in "${model_mode_list[@]}"; do
+    for fp_item in "${fp_item_list[@]}"; do
+        for bs_item in "${bs_list[@]}"; do
+            echo "index is speed, 1gpus, begin, ${model_mode}"
+            run_mode=sp
+            CUDA_VISIBLE_DEVICES=0 bash benchmark/run_benchmark.sh ${run_mode} ${bs_item} ${fp_item} 10 ${model_mode}     #  (5min)
+            sleep 10
+            echo "index is speed, 8gpus, run_mode is multi_process, begin, ${model_mode}"
+            run_mode=mp
+            CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash benchmark/run_benchmark.sh ${run_mode} ${bs_item} ${fp_item} 10 ${model_mode}
+            sleep 10
+        done
+    done
+done
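
Note on `benchmark/run_all.sh`: the three nested loops plus the two run modes define the full benchmark matrix. The following Python sketch only enumerates that matrix, using the lists copied from the script; the function name `benchmark_matrix` is hypothetical and nothing here launches training.

```python
from itertools import product

# Values copied from benchmark/run_all.sh.
models = [
    "MobileNetV1", "MobileNetV2", "MobileNetV3_large_x1_0", "EfficientNetB0",
    "ShuffleNetV2_x1_0", "DenseNet121", "HRNet_W48_C",
    "SwinTransformer_tiny_patch4_window7_224", "alt_gvt_base",
]
precisions = ["fp32"]
batch_sizes = [32, 64, 96, 128]
run_modes = ["sp", "mp"]  # the script runs single-GPU first, then 8-GPU


def benchmark_matrix():
    """Return (model, precision, batch_size, run_mode) tuples in loop order."""
    # product() varies the last argument fastest, which matches the script:
    # for each (model, precision, batch size) it runs "sp" and then "mp".
    return list(product(models, precisions, batch_sizes, run_modes))


runs = benchmark_matrix()
print(len(runs), "benchmark invocations")
```

With 9 models, 1 precision, 4 batch sizes, and 2 run modes, the script makes 72 calls to `run_benchmark.sh`. Since each call is capped by `timeout 15m`, the training commands alone are bounded by 72 × 15 min = 18 hours, plus 72 × 10 s = 12 minutes of `sleep`.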
diff --git a/benchmark/run_benchmark.sh b/benchmark/run_benchmark.sh
new file mode 100644
index 0000000000000000000000000000000000000000..85cb4bc8171405386c7448852a6bc295acd3d406
--- /dev/null
+++ b/benchmark/run_benchmark.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+set -xe
+# Example: CUDA_VISIBLE_DEVICES=0 bash run_benchmark.sh ${run_mode} ${bs_item} ${fp_item} 500 ${model_mode}
+# Parameters
+function _set_params(){
+    run_mode=${1:-"sp"}          # single GPU: sp | multi-GPU: mp
+    batch_size=${2:-"64"}
+    fp_item=${3:-"fp32"}        # fp32|fp16
+    epochs=${4:-"10"}       # optional; modify the code if an early stop is needed
+    model_name=${5:-"model_name"}
+    run_log_path="${TRAIN_LOG_DIR:-$(pwd)}/benchmark"  # TRAIN_LOG_DIR will be set by QA later
+ 
+#   No changes needed below
+    device=${CUDA_VISIBLE_DEVICES//,/ }
+    arr=(${device})
+    num_gpu_devices=${#arr[*]}
+    log_file=${run_log_path}/clas_${model_name}_${run_mode}_bs${batch_size}_${fp_item}_${num_gpu_devices}
+}
+function _train(){
+    echo "Train on ${num_gpu_devices} GPUs"
+    echo "current CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES, gpus=$num_gpu_devices, batch_size=$batch_size"
+
+    if [ ${fp_item} = "fp32" ];then
+        model_config=`find ppcls/configs/ImageNet -name ${model_name}.yaml` 
+    else
+        model_config=`find ppcls/configs/ImageNet -name ${model_name}_fp16.yaml` 
+    fi
+
+    train_cmd="-c ${model_config} -o DataLoader.Train.sampler.batch_size=${batch_size} -o Global.epochs=${epochs}"   
+    case ${run_mode} in
+    sp) train_cmd="python -u tools/train.py ${train_cmd}" ;;
+    mp)
+        train_cmd="python -m paddle.distributed.launch --log_dir=./mylog --gpus=$CUDA_VISIBLE_DEVICES tools/train.py ${train_cmd}"
+        log_parse_file="mylog/workerlog.0" ;;
+    *) echo "choose run_mode(sp or mp)"; exit 1;
+    esac
+    rm -rf mylog
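+# No changes needed below
+    # Test the command directly: under `set -e` a separate `$?` check is never reached on failure.
+    if ! timeout 15m ${train_cmd} > ${log_file} 2>&1; then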
+        echo -e "${model_name}, FAIL"
+        export job_fail_flag=1
+    else
+        echo -e "${model_name}, SUCCESS"
+        export job_fail_flag=0
+    fi
+    kill -9 `ps -ef|grep 'python'|awk '{print $2}'`
+ 
+    if [ $run_mode = "mp" -a -d mylog ]; then
+        rm ${log_file}
+        cp mylog/workerlog.0 ${log_file}
+    fi
+}
+ 
+_set_params $@
+_train
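
Note on `benchmark/run_benchmark.sh`: `_set_params` derives the GPU count from `CUDA_VISIBLE_DEVICES` by replacing commas with spaces and counting the resulting words, then uses that count in the log file name. A minimal Python sketch of the same derivation follows; the function names are hypothetical and the directory part of the path is left out.

```python
def gpu_count(cuda_visible_devices: str) -> int:
    """Count device ids the way the script does: commas become spaces, then words are counted."""
    return len(cuda_visible_devices.replace(",", " ").split())


def log_file_name(model_name: str, run_mode: str, batch_size: int,
                  fp_item: str, cuda_visible_devices: str) -> str:
    """Build the base name of the log file (without the run_log_path directory)."""
    n = gpu_count(cuda_visible_devices)
    return f"clas_{model_name}_{run_mode}_bs{batch_size}_{fp_item}_{n}"


print(log_file_name("MobileNetV1", "mp", 64, "fp32", "0,1,2,3,4,5,6,7"))
```

An empty or unset `CUDA_VISIBLE_DEVICES` yields a count of 0 in both the script and this sketch, so the log name then ends in `_0` regardless of how many GPUs are actually present.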
diff --git a/deploy/configs/build_general.yaml b/deploy/configs/build_general.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..258b40a10ac0583c7d6cda4db7a7d694e3e2bac9
--- /dev/null
+++ b/deploy/configs/build_general.yaml
@@ -0,0 +1,36 @@
+Global:
+  rec_inference_model_dir: "./models/general_PPLCNet_x2_5_lite_v1.0_infer"
+  batch_size: 32
+  use_gpu: True
+  enable_mkldnn: True
+  cpu_num_threads: 10
+  enable_benchmark: True
+  use_fp16: False
+  ir_optim: True
+  use_tensorrt: False
+  gpu_mem: 8000
+  enable_profile: False
+
+RecPreProcess:
+  transform_ops:
+    - ResizeImage:
+        size: 224
+    - NormalizeImage:
+        scale: 0.00392157
+        mean: [0.485, 0.456, 0.406]
+        std: [0.229, 0.224, 0.225]
+        order: ''
+    - ToCHWImage:
+
+RecPostProcess: null
+
+# indexing engine config
+IndexProcess:
+  index_method: "HNSW32" # supported: HNSW32, IVF, Flat
+  image_root: "./drink_dataset_v1.0/gallery/"
+  index_dir: "./drink_dataset_v1.0/index"
+  data_file:  "./drink_dataset_v1.0/gallery/drink_label.txt"
+  index_operation: "new" # supported: "append", "remove", "new"
+  delimiter: "\t"
+  dist_type: "IP"
+  embedding_size: 512
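
Note on `RecPreProcess`: the sketch below shows what the `NormalizeImage` and `ToCHWImage` entries plausibly compute, using the `scale`, `mean`, and `std` values from the config. It assumes the ops run in the listed order and that normalization is `(pixel * scale - mean) / std` per channel; the actual PaddleClas operators are not part of this patch, so treat both as assumptions. Resizing is omitted and images are plain nested lists to keep the example dependency-free.

```python
SCALE = 0.00392157                 # from the config; approximately 1/255
MEAN = [0.485, 0.456, 0.406]       # per-channel mean from the config
STD = [0.229, 0.224, 0.225]        # per-channel std from the config


def normalize_pixel(rgb):
    """Scale one 0..255 pixel to roughly 0..1, then standardize each channel (assumed formula)."""
    return [(value * SCALE - mean) / std for value, mean, std in zip(rgb, MEAN, STD)]


def hwc_to_chw(image):
    """Reorder image[h][w][c] into out[c][h][w]."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [[[image[y][x][c] for x in range(width)] for y in range(height)]
            for c in range(channels)]


def preprocess(image):
    """Normalize every pixel of an HWC image, then convert it to CHW layout."""
    normalized = [[normalize_pixel(pixel) for pixel in row] for row in image]
    return hwc_to_chw(normalized)


tiny = [[[255, 255, 255], [0, 0, 0]]]   # a 1 x 2 RGB image: one white and one black pixel
chw = preprocess(tiny)
print(len(chw), len(chw[0]), len(chw[0][0]))   # channels, height, width
```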
diff --git a/deploy/configs/inference_general.yaml b/deploy/configs/inference_general.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..86f26f6401964182dbff2b43a6b65c6f8ca5d33d
--- /dev/null
+++ b/deploy/configs/inference_general.yaml
@@ -0,0 +1,55 @@
+Global:
+  infer_imgs: "./drink_dataset_v1.0/test_images/nongfu_spring.jpeg"
+  det_inference_model_dir: "./models/picodet_PPLCNet_x2_5_mainbody_lite_v1.0_infer"
+  rec_inference_model_dir: "./models/general_PPLCNet_x2_5_lite_v1.0_infer"
+  rec_nms_thresold: 0.05
+  
+  batch_size: 1
+  image_shape: [3, 640, 640]
+  threshold: 0.2
+  max_det_results: 5
+  labe_list:
+  - foreground
+
+  # inference engine config
+  use_gpu: True
+  enable_mkldnn: True
+  cpu_num_threads: 10
+  enable_benchmark: True
+  use_fp16: False
+  ir_optim: True
+  use_tensorrt: False
+  gpu_mem: 8000
+  enable_profile: False
+
+DetPreProcess:
+  transform_ops:
+    - DetResize:
+        interp: 2
+        keep_ratio: false
+        target_size: [640, 640]
+    - DetNormalizeImage:
+        is_scale: true
+        mean: [0.485, 0.456, 0.406]
+        std: [0.229, 0.224, 0.225]
+    - DetPermute: {}
+DetPostProcess: {}
+
+RecPreProcess:
+  transform_ops:
+    - ResizeImage:
+        size: 224
+    - NormalizeImage:
+        scale: 0.00392157
+        mean: [0.485, 0.456, 0.406]
+        std: [0.229, 0.224, 0.225]
+        order: ''
+    - ToCHWImage:
+
+RecPostProcess: null
+
+# indexing engine config
+IndexProcess:
+  index_dir: "./drink_dataset_v1.0/index/"
+  return_k: 5
+  score_thres: 0.5
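
Note on `IndexProcess`: `return_k` and `score_thres` together decide which gallery entries are reported for a query embedding. The sketch below illustrates that filtering with an exact brute-force inner-product search (matching `dist_type: "IP"` in `build_general.yaml`). It is an illustration only: the real index may be approximate (for example `HNSW32`), and whether the threshold comparison is inclusive is an assumption. The function names and the toy gallery are hypothetical.

```python
def inner_product(a, b):
    """Inner-product similarity between two embeddings of equal length."""
    return sum(x * y for x, y in zip(a, b))


def search(query, gallery, return_k=5, score_thres=0.5):
    """Return up to return_k (label, score) pairs whose score reaches score_thres.

    gallery is a list of (label, embedding) pairs; scoring is exact and brute force.
    """
    scored = [(inner_product(query, embedding), label) for label, embedding in gallery]
    scored.sort(key=lambda item: item[0], reverse=True)   # best score first
    top_k = scored[:return_k]
    return [(label, score) for score, label in top_k if score >= score_thres]


# Toy 2-dimensional gallery; real embeddings in this config have 512 dimensions.
gallery = [("a", [0.9, 0.1]), ("b", [0.2, 0.9]), ("c", [0.6, 0.8])]
results = search([1.0, 0.0], gallery, return_k=2, score_thres=0.5)
print([label for label, _ in results])
```

Raising `score_thres` trades recall for precision: a query whose best match scores below the threshold returns nothing, which is how an unknown object can be rejected instead of being assigned the nearest gallery label.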
diff --git a/docs/en/models/PP-LCNet_en.md b/docs/en/models/PP-LCNet_en.md
index 56a90c1266e6ed96ab832097c5851483ef43efa9..57e34151d0e2306bbbdf9f756fffc81889e4a947 100644
--- a/docs/en/models/PP-LCNet_en.md
+++ b/docs/en/models/PP-LCNet_en.md
@@ -1,41 +1,141 @@
-# PPLCNet series
+# PP-LCNet Series
 
-## Overview
+## Abstract
 
-The PPLCNet series is a network that has excellent performance on Intel-CPU proposed by the Baidu PaddleCV team. The author summarizes some methods that can improve the accuracy of the model on Intel-CPU but hardly increase the inference time. The author combines these methods into a new network, namely PPLCNet. Compared with other lightweight networks, PPLCNet can achieve higher accuracy with the same inference time. PPLCNet has shown strong competitiveness in image classification, object detection, and semantic segmentation.
+In computer vision, the quality of the backbone network determines the outcome of the whole vision task. Previous studies generally focused on optimizing FLOPs or Params, but in real-world scenarios inference speed is also an important indicator of model quality, and it is difficult to balance inference speed against accuracy. Because many industrial applications run on CPUs, we focus on adapting the backbone network to Intel CPUs to obtain a faster and more accurate lightweight backbone. This also improves the performance of downstream vision tasks such as object detection and semantic segmentation.
 
+## Introduction
 
+Many lightweight backbone networks have emerged in recent years. In the past two years in particular, numerous networks found by NAS have offered advantages in FLOPs or Params, or an edge in inference speed on ARM devices. However, few of them are specifically optimized for Intel CPUs, so their inference speed on Intel CPUs is not ideal. We therefore designed the backbone network PP-LCNet specifically for Intel CPU devices and the MKLDNN acceleration library. Compared with other lightweight SOTA models, this backbone further improves model performance without increasing inference time, significantly outperforming existing SOTA models. A comparison with other models is shown below.
+<div align=center><img src="../../images/PP-LCNet/PP-LCNet-Acc.png" width="500" height="400"/></div>
 
-## Accuracy, FLOPS and Parameters
+## Method
 
-| Models           | Top1 | Top5 | FLOPs<br>(M) | Parameters<br>(M) |
-|:--:|:--:|:--:|:--:|:--:|
-| PPLCNet_x0_25        |0.5186           | 0.7565           | 18    | 1.5  |
-| PPLCNet_x0_35        |0.5809           | 0.8083           | 29    | 1.6  |
-| PPLCNet_x0_5         |0.6314           | 0.8466           | 47    | 1.9  |
-| PPLCNet_x0_75        |0.6818           | 0.8830           | 99    | 2.4  |
-| PPLCNet_x1_0         |0.7132           | 0.9003           | 161   | 3.0  |
-| PPLCNet_x1_5         |0.7371           | 0.9153           | 342   | 4.5  |
-| PPLCNet_x2_0         |0.7518           | 0.9227           | 590   | 6.5  |
-| PPLCNet_x2_5         |0.7660           | 0.9300           | 906   | 9.0  |
-| PPLCNet_x0_5_ssld    |0.6610           | 0.8646           | 47    | 1.9  |
-| PPLCNet_x1_0_ssld    |0.7439           | 0.9209           | 161   | 3.0  |
-| PPLCNet_x2_5_ssld    |0.8082           | 0.9533           | 906   | 9.0  |
+The overall structure of the network is shown in the figure below.
+<div align=center><img src="../../images/PP-LCNet/PP-LCNet.png" width="700" height="400"/></div>
 
+Through extensive experiments, we found that many operations that appear cheap actually increase latency on Intel CPU-based devices, especially when the MKLDNN acceleration library is enabled. We therefore chose a block with the leanest possible structure and the fastest possible speed to form our BaseNet (similar to MobileNetV1). On top of BaseNet, we identified four strategies that improve accuracy without increasing latency, and combined them to form PP-LCNet. Each strategy is introduced below:
 
+### Better Activation Function
 
-## Inference speed based on Intel(R)-Xeon(R)-Gold-6148-CPU
+Since convolutional neural networks adopted the ReLU activation function, network performance has improved substantially, and variants of ReLU such as Leaky-ReLU, P-ReLU, and ELU have appeared in recent years. In 2017, Google Brain found the Swish activation function through search; it performs well on lightweight networks. In 2019, the authors of MobileNetV3 further optimized it into H-Swish, which removes the exponential operation, making it faster while leaving network accuracy almost unaffected. Our own experiments also confirmed its excellent performance on lightweight networks, so PP-LCNet adopts this activation function.
 
-| Models                 | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
-|------------------|-----------|-------------------|--------------------------|
-| PPLCNet_x0_25        | 224       | 256               | 1.74                    |
-| PPLCNet_x0_35        | 224       | 256               | 1.92                    |
-| PPLCNet_x0_5         | 224       | 256               | 2.05                    |
-| PPLCNet_x0_75        | 224       | 256               | 2.29                    |
-| PPLCNet_x1_0         | 224       | 256               | 2.46                    |
-| PPLCNet_x1_5         | 224       | 256               | 3.19                    |
-| PPLCNet_x2_0         | 224       | 256               | 4.27                    |
-| PPLCNet_x2_5         | 224       | 256               | 5.39                    |
-| PPLCNet_x0_5_ssld    | 224       | 256               | 2.05                    |
-| PPLCNet_x1_0_ssld    | 224       | 256               | 2.46                    |
-| PPLCNet_x2_5_ssld    | 224       | 256               | 5.39                    |
+### SE Modules at Appropriate Positions
+
+The SE module is a channel attention mechanism proposed by SENet that can effectively improve model accuracy. On Intel CPUs, however, the module also adds considerable latency, so accuracy has to be balanced against speed. NAS-based networks such as MobileNetV3 search for the location of the SE module but yield no general conclusions. Our experiments show that the closer the SE module is to the tail of the network, the greater the improvement in accuracy. The following table shows some of our experimental results:
+
+| SE Location       | Top-1 Acc(\%) | Latency(ms) |
+|-------------------|---------------|-------------|
+| 1100000000000     | 61.73           | 2.06         |
+| 0000001100000     | 62.17           | 2.03         |
+| <b>0000000000011</b>     | <b>63.14</b>           | <b>2.05</b>         |
+| 1111111111111     | 64.27           | 3.80         |
+
+The option in the third row of the table was chosen for the location of the SE module in PP-LCNet.
+
+### Larger Convolution Kernels
+
+In the MixNet paper, the authors analyze the effect of convolutional kernel size on model performance and conclude that larger kernels improve performance within a certain range but hurt it beyond that range. They therefore build MixConv on a split-concat paradigm, which improves model performance but is not conducive to inference. As with the SE module, we experimentally studied the effect of larger convolutional kernels at different positions and found that they matter most in the middle and tail of the network. The following table shows the effect of the position of the 5x5 convolutional kernels on accuracy:
+
+| Large-kernel Location | Top-1 Acc(\%) | Latency(ms) |
+|-------------------|---------------|-------------|
+| 1111111111111     | 63.22           | 2.08         |
+| 1111111000000     | 62.70           | 2.07        |
+| <b>0000001111111</b>     | <b>63.14</b>           | <b>2.05</b>         |
+
+
+Experiments show that placing larger convolutional kernels only in the middle and tail of the network achieves nearly the same accuracy as placing them at all positions (63.14% vs. 63.22%), with faster inference. The option in the third row of the table was the final choice for PP-LCNet.
+
+### Larger Dimensional 1 × 1 Conv Layer after GAP
+
+Since the introduction of GoogLeNet, GAP (Global-Average-Pooling) has often been followed directly by a classification layer, so in lightweight networks the features extracted after GAP are not further integrated or processed. If a larger 1x1 convolutional layer (equivalent to an FC layer) is used after GAP, the extracted features are first integrated and then classified, instead of passing directly through the classification layer. This greatly improves accuracy without affecting the inference speed of the model. These four improvements were applied to BaseNet to obtain PP-LCNet. The following table further illustrates the impact of each scheme on the results:
+
+| Activation | SE-block | Large-kernel | last-1x1-conv | Top-1 Acc(\%) | Latency(ms) |
+|------------|----------|--------------|---------------|---------------|-------------|
+| 0       | 1       | 1               | 1                | 61.93 | 1.94 |
+| 1       | 0       | 1               | 1                | 62.51 | 1.87 |
+| 1       | 1       | 0               | 1                | 62.44 | 2.01 |
+| 1       | 1       | 1               | 0                | 59.91 | 1.85 |
+| <b>1</b>       | <b>1</b>       | <b>1</b>               | <b>1</b>                | <b>63.14</b> | <b>2.05</b> |
+
+
+## Experiments
+
+### Image Classification
+
+For image classification, the ImageNet dataset is used. Compared with current mainstream lightweight networks, PP-LCNet achieves faster inference at the same accuracy. With Baidu's self-developed SSLD distillation strategy, accuracy improves further, with ImageNet Top-1 accuracy exceeding 80% at an inference latency of about 5 ms on Intel CPUs.
+
+| Model | Params(M) | FLOPs(M) | Top-1 Acc(\%) | Top-5 Acc(\%) | Latency(ms) | 
+|-------|-----------|----------|---------------|---------------|-------------|
+| PP-LCNet-0.25x  | 1.5 | 18  | 51.86 | 75.65 | 1.74 |
+| PP-LCNet-0.35x  | 1.6 | 29  | 58.09 | 80.83 | 1.92 |
+| PP-LCNet-0.5x   | 1.9 | 47  | 63.14 | 84.66 | 2.05 |
+| PP-LCNet-0.75x  | 2.4 | 99  | 68.18 | 88.30 | 2.29 |
+| PP-LCNet-1x     | 3.0 | 161 | 71.32 | 90.03 | 2.46 |
+| PP-LCNet-1.5x   | 4.5 | 342 | 73.71 | 91.53 | 3.19 |
+| PP-LCNet-2x     | 6.5 | 590 | 75.18 | 92.27 | 4.27 |
+| PP-LCNet-2.5x   | 9.0 | 906 | 76.60 | 93.00 | 5.39 |
+| PP-LCNet-0.5x\* | 1.9 | 47  | 66.10 | 86.46 | 2.05 |
+| PP-LCNet-1x\* | 3.0 | 161 | 74.39 | 92.09 | 2.46 |
+| PP-LCNet-2.5x\* | 9.0 | 906 | 80.82 | 95.33 | 5.39 |
+    
+\* denotes the model after using SSLD distillation.
+
+Performance comparison with other lightweight networks:
+
+| Model | Params(M) | FLOPs(M) | Top-1 Acc(\%) | Top-5 Acc(\%) | Latency(ms) |
+|-------|-----------|----------|---------------|---------------|-------------|
+| MobileNetV2-0.25x  | 1.5 | 34  | 53.21 | 76.52 | 2.47 |
+| MobileNetV3-small-0.35x  | 1.7 | 15  | 53.03 | 76.37 | 3.02 |
+| ShuffleNetV2-0.33x  | 0.6 | 24  | 53.73 | 77.05 | 4.30 |
+| <b>PP-LCNet-0.25x</b>  | <b>1.5</b> | <b>18</b>  | <b>51.86</b> | <b>75.65</b> | <b>1.74</b> |
+| MobileNetV2-0.5x  | 2.0 | 99  | 65.03 | 85.72 | 2.85 |
+| MobileNetV3-large-0.35x  | 2.1 | 41  | 64.32 | 85.46 | 3.68 |
+| ShuffleNetV2-0.5x  | 1.4 | 43  | 60.32 | 82.26 | 4.65 |
+| <b>PP-LCNet-0.5x</b>   | <b>1.9</b> | <b>47</b>  | <b>63.14</b> | <b>84.66</b> | <b>2.05</b> |
+| MobileNetV1-1x  | 4.3 | 578  | 70.99 | 89.68 | 3.38 |
+| MobileNetV2-1x  | 3.5 | 327  | 72.15 | 90.65 | 4.26 |
+| MobileNetV3-small-1.25x  | 3.6 | 100  | 70.67 | 89.51 | 3.95 |
+| <b>PP-LCNet-1x</b>     | <b>3.0</b> | <b>161</b> | <b>71.32</b> | <b>90.03</b> | <b>2.46</b> |
+
+
+### Object Detection
+
+For object detection, we adopt Baidu’s self-developed PicoDet, which focuses on lightweight object detection scenarios. The following table shows the comparison between the results of PP-LCNet and MobileNetV3 on the COCO dataset. PP-LCNet has an obvious advantage in both accuracy and speed.
+
+| Backbone | mAP(%) | Latency(ms) |
+|-------|-----------|----------|
+| MobileNetV3-large-0.35x | 19.2 | 8.1 |
+| <b>PP-LCNet-0.5x</b> | <b>20.3</b> | <b>6.0</b> |
+| MobileNetV3-large-0.75x | 25.8 | 11.1 |
+| <b>PP-LCNet-1x</b> | <b>26.9</b> | <b>7.9</b> |
+
+
+### Semantic Segmentation
+
+For semantic segmentation, DeeplabV3+ is adopted. The following table presents the comparison between PP-LCNet and MobileNetV3 on the Cityscapes dataset, and PP-LCNet also stands out in terms of accuracy and speed.
+
+| Backbone | mIoU(%) | Latency(ms) |
+|-------|-----------|----------|
+| MobileNetV3-large-0.5x | 55.42 | 135 |
+| <b>PP-LCNet-0.5x</b> | <b>58.36</b> | <b>82</b> |
+| MobileNetV3-large-0.75x | 64.53 | 151 |
+| <b>PP-LCNet-1x</b> | <b>66.03</b> | <b>96</b> |
+
+## Conclusion
+
+Rather than holding on to perfect FLOPs and Params as academics do, PP-LCNet focuses on analyzing how to add Intel CPU-friendly modules to improve the performance of the model, which can better balance accuracy and inference time. The experimental conclusions therein are available to other researchers in network structure design, while providing NAS search researchers with a smaller search space and general conclusions. The finished PP-LCNet can also be better accepted and applied in industry.
+
+## Reference
+
+Reference to cite when you use PP-LCNet in a paper:
+```
+@misc{cui2021pplcnet,
+      title={PP-LCNet: A Lightweight CPU Convolutional Neural Network}, 
+      author={Cheng Cui and Tingquan Gao and Shengyu Wei and Yuning Du and Ruoyu Guo and Shuilong Dong and Bin Lu and Ying Zhou and Xueying Lv and Qiwen Liu and Xiaoguang Hu and Dianhai Yu and Yanjun Ma},
+      year={2021},
+      eprint={2109.15099},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV}
+}
+```
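
Note on the "Better Activation Function" section of `PP-LCNet_en.md`: H-Swish as defined in the MobileNetV3 paper is `x * ReLU6(x + 3) / 6`, which is why it needs no exponential. A minimal scalar sketch follows; it illustrates the formula only and is not the PaddleClas layer implementation.

```python
def relu6(x: float) -> float:
    """Clamp x to the range [0, 6]."""
    return min(max(x, 0.0), 6.0)


def hard_swish(x: float) -> float:
    """H-Swish: x * ReLU6(x + 3) / 6, a piecewise approximation of Swish."""
    return x * relu6(x + 3.0) / 6.0


for value in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(value, hard_swish(value))
```

The function is exactly zero for inputs at or below -3 and equals the identity for inputs at or above 3; only the middle segment is curved, and it is computed with one addition, one clamp, one multiplication, and one division.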
diff --git a/docs/en/models/PPLCNet_en.md b/docs/en/models/PPLCNet_en.md
deleted file mode 100644
index 56a90c1266e6ed96ab832097c5851483ef43efa9..0000000000000000000000000000000000000000
--- a/docs/en/models/PPLCNet_en.md
+++ /dev/null
@@ -1,41 +0,0 @@
-# PPLCNet series
-
-## Overview
-
-The PPLCNet series is a network that has excellent performance on Intel-CPU proposed by the Baidu PaddleCV team. The author summarizes some methods that can improve the accuracy of the model on Intel-CPU but hardly increase the inference time. The author combines these methods into a new network, namely PPLCNet. Compared with other lightweight networks, PPLCNet can achieve higher accuracy with the same inference time. PPLCNet has shown strong competitiveness in image classification, object detection, and semantic segmentation.
-
-
-
-## Accuracy, FLOPS and Parameters
-
-| Models           | Top1 | Top5 | FLOPs<br>(M) | Parameters<br>(M) |
-|:--:|:--:|:--:|:--:|:--:|
-| PPLCNet_x0_25        |0.5186           | 0.7565           | 18    | 1.5  |
-| PPLCNet_x0_35        |0.5809           | 0.8083           | 29    | 1.6  |
-| PPLCNet_x0_5         |0.6314           | 0.8466           | 47    | 1.9  |
-| PPLCNet_x0_75        |0.6818           | 0.8830           | 99    | 2.4  |
-| PPLCNet_x1_0         |0.7132           | 0.9003           | 161   | 3.0  |
-| PPLCNet_x1_5         |0.7371           | 0.9153           | 342   | 4.5  |
-| PPLCNet_x2_0         |0.7518           | 0.9227           | 590   | 6.5  |
-| PPLCNet_x2_5         |0.7660           | 0.9300           | 906   | 9.0  |
-| PPLCNet_x0_5_ssld    |0.6610           | 0.8646           | 47    | 1.9  |
-| PPLCNet_x1_0_ssld    |0.7439           | 0.9209           | 161   | 3.0  |
-| PPLCNet_x2_5_ssld    |0.8082           | 0.9533           | 906   | 9.0  |
-
-
-
-## Inference speed based on Intel(R)-Xeon(R)-Gold-6148-CPU
-
-| Models                 | Crop Size | Resize Short Size | FP32<br>Batch Size=1<br>(ms) |
-|------------------|-----------|-------------------|--------------------------|
-| PPLCNet_x0_25        | 224       | 256               | 1.74                    |
-| PPLCNet_x0_35        | 224       | 256               | 1.92                    |
-| PPLCNet_x0_5         | 224       | 256               | 2.05                    |
-| PPLCNet_x0_75        | 224       | 256               | 2.29                    |
-| PPLCNet_x1_0         | 224       | 256               | 2.46                    |
-| PPLCNet_x1_5         | 224       | 256               | 3.19                    |
-| PPLCNet_x2_0         | 224       | 256               | 4.27                    |
-| PPLCNet_x2_5         | 224       | 256               | 5.39                    |
-| PPLCNet_x0_5_ssld    | 224       | 256               | 2.05                    |
-| PPLCNet_x1_0_ssld    | 224       | 256               | 2.46                    |
-| PPLCNet_x2_5_ssld    | 224       | 256               | 5.39                    |
diff --git a/docs/images/recognition/drink_data_demo/output/mosilian.jpeg b/docs/images/recognition/drink_data_demo/output/mosilian.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..fd3b6bb5245a833c3f282bb375627bd9bdfe6ceb
Binary files /dev/null and b/docs/images/recognition/drink_data_demo/output/mosilian.jpeg differ
diff --git a/docs/images/recognition/drink_data_demo/output/nongfu_spring.jpeg b/docs/images/recognition/drink_data_demo/output/nongfu_spring.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..b6669ba724088ccc70cbb5650f6b297c88393fcc
Binary files /dev/null and b/docs/images/recognition/drink_data_demo/output/nongfu_spring.jpeg differ
diff --git a/docs/images/recognition/drink_data_demo/output/youjia.jpeg b/docs/images/recognition/drink_data_demo/output/youjia.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..a437915fb267ffd0dfdc367f82b083ee992dd6b8
Binary files /dev/null and b/docs/images/recognition/drink_data_demo/output/youjia.jpeg differ
diff --git a/docs/images/recognition/drink_data_demo/test_images/mosilian.jpeg b/docs/images/recognition/drink_data_demo/test_images/mosilian.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..e5ed054382f6ad912a507d21107392fa99e1f220
Binary files /dev/null and b/docs/images/recognition/drink_data_demo/test_images/mosilian.jpeg differ
diff --git a/docs/images/recognition/drink_data_demo/test_images/nongfu_spring.jpeg b/docs/images/recognition/drink_data_demo/test_images/nongfu_spring.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..0f4166d7e9b39301b01124616039e8fa0a171b3c
Binary files /dev/null and b/docs/images/recognition/drink_data_demo/test_images/nongfu_spring.jpeg differ
diff --git a/docs/images/recognition/drink_data_demo/test_images/youjia.jpeg b/docs/images/recognition/drink_data_demo/test_images/youjia.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..2875a76ec4ee83b2fa22b45f3963c0897381f0ea
Binary files /dev/null and b/docs/images/recognition/drink_data_demo/test_images/youjia.jpeg differ
diff --git a/docs/zh_CN/faq_series/faq_2020_s1.md b/docs/zh_CN/faq_series/faq_2020_s1.md
index e0f3c98c986947bf45266012d4c648fa2e4b3b08..1cf7642218dde869b148233ca7314514be32b888 100644
--- a/docs/zh_CN/faq_series/faq_2020_s1.md
+++ b/docs/zh_CN/faq_series/faq_2020_s1.md
@@ -13,11 +13,10 @@
 ## 第1期
 
 ### Q1.1: PaddleClas可以用来做什么?
-**A**：PaddleClas是飞桨为工业界和学术界所准备的一个图像分类任务的工具集，助力使用者训练出更好的视觉模型和应用落地。PaddleClas提供了基于图像分类的模型训练、评估、预测、部署全流程的服务，方便大家更加高效地学习图像分类。具体地，PaddleClas中包含如下一些特性 。
+**A**：PaddleClas是飞桨为工业界和学术界所准备的一个图像分类任务的工具集，助力使用者训练出更好的视觉模型和应用落地。PaddleClas提供了基于图像分类的模型训练、评估、预测、部署全流程的服务，方便大家更加高效地学习图像分类。具体地，PaddleClas中包含如下一些特性。
 
-
-* PaddleClas提供了24个系列的分类网络结构(ResNet, ResNet_vd, MobileNetV3, Res2Net, HRNet等)和训练配置，122个预训练模型和性能评估与预测，供大家选择并使用。
-* PaddleClas提供了TensorRT预测、python inference、c++ inference、Paddle-Lite预测部署等多种预测部署推理方案，在方便在多种环境中进行部署推理。
+* PaddleClas提供了36个系列的分类网络结构(ResNet, ResNet_vd, MobileNetV3, Res2Net, HRNet等)和训练配置，175个预训练模型和性能评估与预测，供大家选择并使用。
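+* PaddleClas provides a variety of inference and deployment options, including TensorRT inference, Python inference, C++ inference, Paddle-Lite deployment, PaddleServing, and PaddleHub, making deployment convenient across many environments.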
 * PaddleClas提供了一种简单的SSLD知识蒸馏方案，基于该方案蒸馏模型的识别准确率普遍提升3%以上。
 * PaddleClas支持AutoAugment、Cutout、Cutmix等8种数据增广算法详细介绍、代码复现和在统一实验环境下的效果评估。
 * PaddleClas支持在Windows/Linux/MacOS环境中基于CPU/GPU进行使用。
@@ -27,7 +26,6 @@
 
 ### Q1.3: ResNet_vd和ResNet、ResNet_vc结构有什么区别呢？
 **A**:
-
 ResNet_va至vd的结构如下图所示，ResNet最早提出时为va结构，在降采样残差模块这个部分，在左边的特征变换通路中(Path A)，第一个1x1卷积部分就行了降采样，从而导致信息丢失（卷积的kernel size为1，stride为2，输入特征图中 有部分特征没有参与卷积的计算）；在vb结构中，把降采样的步骤从最开始的第一个1x1卷积调整到中间的3x3卷积中，从而避免了信息丢失的问题，PaddleClas中的ResNet模型默认就是ResNet_vb；vc结构则是将最开始这个7x7的卷积变成3个3x3的卷积，在感受野不变的情况下，计算量和存储大小几乎不变，而且实验证明精度相对于vb结构有所提升；vd结构是修改了降采样残差模块右边的特征通路(Path B)。把降采样的过程由平均池化这个操作去替代了，这一系列的改进(va->vd)，几乎没有带来新增的预测耗时，结合适当的训练策略，比如说标签平滑以及mixup数据增广，精度可以提升高达2.7%。
 
 <div align="center">
@@ -38,7 +36,7 @@ ResNet_va至vd的结构如下图所示，ResNet最早提出时为va结构，在
 **A**:
 
 ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度几乎不变的情况下，精度有非常明显的提升，因此推荐大家使用ResNet_vd系列模型。
-下面给出了batch size=4的情况下，在T4 GPU上，不同模型的的预测耗时、flops、params与精度的变化曲线，可以根据自己自己的实际部署场景中的需求，去选择合适的模型，如果希望模型存储大小尽可能小或者预测速度尽可能快，则可以使用ResNet18_vd模型，如果希望获得尽可能高的精度，则建议使用ResNet152_vd或者ResNet200_vd模型。更多关于ResNet系列模型的介绍可以参考文档：[ResNet及其vd系列模型文档](../models/ResNet_and_vd.md)。
+The [ResNet and vd series documentation](../models/ResNet_and_vd.md) shows how inference latency, FLOPs, and Params vary with accuracy for different models on a T4 GPU with batch size=4. Choose a model based on the needs of your own deployment scenario: if you want the smallest possible model size or the fastest possible inference, use ResNet18_vd; if you want the highest possible accuracy, ResNet152_vd or ResNet200_vd is recommended.
 
 * 精度-预测速度变化曲线
 
@@ -69,7 +67,7 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 ### Q1.7 大卷积核一定可以带来正向收益吗？
 
 **A**: 不一定，将网络中的所有卷积核都增大未必会带来性能的提升，甚至会有有损性能，在论文[MixConv: Mixed Depthwise Convolutional Kernels](https://arxiv.org/abs/1907.09595)
-中指出，在一定范围内提升卷积核大小对精度的提升有正向作用，但是超出后会有损精度。所以考虑到模型的大小、计算量等问题，一般不选用大的卷积核去设计网络。
+中指出，在一定范围内提升卷积核大小对精度的提升有正向作用，但是超出后会有损精度。所以考虑到模型的大小、计算量等问题，一般不选用大的卷积核去设计网络。同时，在[PP-LCNet](../models/PP-LCNet.md)文章中，也有关于大卷积核的实验。
 
 <a name="第2期"></a>
 ## 第2期
@@ -77,9 +75,9 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 ### Q2.1: PaddleClas如何训练自己的backbone？
 
 **A**：具体流程如下:
-* 首先在ppcls/modeling/architectures/文件夹下新建一个自己的模型结构文件，即你自己的backbone，模型搭建可以参考resnet.py;
-* 然后在ppcls/modeling/\_\_init\_\_.py中添加自己设计的backbone的类;
-* 其次配置训练的yaml文件，此处可以参考configs/ResNet/ResNet50.yaml;
+* 首先在ppcls/arch/backbone/model_zoo/文件夹下新建一个自己的模型结构文件，即你自己的backbone，模型搭建可以参考resnet.py;
+* 然后在ppcls/arch/backbone/\_\_init\_\_.py中添加自己设计的backbone的类;
+* 其次配置训练的yaml文件，此处可以参考ppcls/configs/ImageNet/ResNet/ResNet50.yaml;
 * 最后启动训练即可。
 
 
@@ -92,7 +90,7 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 
 ### Q2.3: PaddleClas中configs下的默认参数适合任何一个数据集吗？
 
-**A**: PaddleClas中的configs下的默认参数是ImageNet-1k的训练参数，这个参数并不适合所有的数据集，具体数据集需要在此基础上进一步调试，调试方法会在之后出一个单独的faq，敬请期待。
+**A**: PaddleClas中的ppcls/configs/ImageNet/下的配置文件默认参数是ImageNet-1k的训练参数，这个参数并不适合所有的数据集，具体数据集需要在此基础上进一步调试。
 
 
 ### Q2.4 PaddleClas中的不同的模型使用了不同的分辨率，标配的应该是多少呢？
@@ -102,7 +100,7 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 
 ### Q2.5 PaddleClas中提供了很多ssld模型，其应用的价值是？
 
-**A**: PaddleClas中提供了很多ssld预训练模型，其通过半监督知识蒸馏的方法获得了更好的预训练权重，在迁移任务或者下游视觉任务中，无须替换结构文件、只需要替换精度更高的ssld预训练模型即可提升精度，如在PaddleSeg中，[HRNet](https://github.com/PaddlePaddle/PaddleSeg/blob/release/v0.7.0/docs/model_zoo.md)使用了ssld预训练模型的权重后，精度大幅度超越业界同样的模型的精度，在PaddleDetection中，[PP-YOLO](https://github.com/PaddlePaddle/PaddleDetection/blob/release/0.4/configs/ppyolo/README_cn.md)使用了ssld预训练权重后，在较高的baseline上仍有进一步的提升。使用ssld预训练权重做分类的迁移表现也很抢眼，在[SSLD蒸馏策略](../advanced_tutorials/distillation/distillation.md)部分介绍了知识蒸馏对于分类任务迁移的收益。
+**A**: PaddleClas中提供了很多ssld预训练模型，其通过半监督知识蒸馏的方法获得了更好的预训练权重，在迁移任务或者下游视觉任务中，无须替换结构文件、只需要替换精度更高的ssld预训练模型即可提升精度，如在PaddleSeg中，[HRNet](https://github.com/PaddlePaddle/PaddleSeg/blob/release/v0.7.0/docs/model_zoo.md)使用了ssld预训练模型的权重后，精度大幅度超越业界同样的模型的精度，在PaddleDetection中，[PP-YOLO](https://github.com/PaddlePaddle/PaddleDetection/blob/release/0.4/configs/ppyolo/README_cn.md)使用了ssld预训练权重后，在较高的baseline上仍有进一步的提升。使用ssld预训练权重做分类的迁移表现也很抢眼，在[SSLD蒸馏策略](../advanced_tutorials/knowledge_distillation.md)部分介绍了知识蒸馏对于分类任务迁移的收益。
 
 
 <a name="第3期"></a>
@@ -121,7 +119,7 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 
 ### Q3.3: 怎么使用多个模型进行预测融合呢？
 
-**A** 使用多个模型进行预测的时候，建议首先将预训练模型导出为inference模型，这样可以摆脱对网络结构定义的依赖，可以参考[模型导出脚本](../../../tools/export_model.py)进行模型导出，之后再参考[inference模型预测脚本](../../../tools/infer/predict.py)进行预测即可，在这里需要根据自己使用模型的数量创建多个predictor。
+**A** 使用多个模型进行预测的时候，建议首先将预训练模型导出为inference模型，这样可以摆脱对网络结构定义的依赖，可以参考[模型导出脚本](../../../tools/export_model.py)进行模型导出，之后再参考[inference模型预测脚本](../../../deploy/python/predict_cls.py)进行预测即可，在这里需要根据自己使用模型的数量创建多个predictor。
 
 
 ### Q3.4: PaddleClas中怎么增加自己的数据增广方法呢？
@@ -136,15 +134,17 @@ ResNet系列模型中，相比于其他模型，ResNet_vd模型在预测速度
 
 **A**：
 
-* 可以使用自动混合精度进行训练，这在精度几乎无损的情况下，可以有比较明显的速度收益，以ResNet50为例，PaddleClas中使用自动混合精度训练的配置文件可以参考：[ResNet50_fp16.yml](../../../ppcls/configs/ResNet/ResNet50_fp16.yml)，主要就是需要在标准的配置文件中添加以下几行
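+* You can train with automatic mixed precision, which brings a noticeable speedup with almost no loss in accuracy. Taking ResNet50 as an example, the PaddleClas configuration file for automatic mixed precision training is [ResNet50_fp16.yaml](../../../ppcls/configs/ImageNet/ResNet/ResNet50_fp16.yaml); it mainly requires adding the following lines to the standard configuration file: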
 
 ```
-use_fp16: True
-amp_scale_loss: 128.0
-use_dynamic_loss_scaling: True
+# mixed precision training
+AMP:
+  scale_loss: 128.0
+  use_dynamic_loss_scaling: True
+  use_pure_fp16: &use_pure_fp16 True
 ```
 
-* 可以开启dali，将数据预处理方法放在GPU上运行，在模型比较小时（reader耗时占比更高一些），开启dali会带来比较明显的精度收益，在训练的时候，添加`-o use_dali=True`即可使用dali进行训练，更多关于dali 安装与介绍可以参考：[dali安装教程](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html#nightly-builds)。
+* 可以开启dali，将数据预处理方法放在GPU上运行，在模型比较小时（reader耗时占比更高一些），开启dali会带来比较明显的训练速度收益，在训练的时候，添加`-o Global.use_dali=True`即可使用dali进行训练，更多关于 dali 安装与介绍可以参考：[dali安装教程](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/installation.html#nightly-builds)。
 
 <a name="第4期"></a>
 ## 第4期
@@ -294,7 +294,7 @@ Cosine_decay和piecewise_decay的学习率变化曲线如下图所示，容易
 
 **A**:一般来说，数据集的规模对性能影响至关重要，但是图片的标注往往比较昂贵，所以有标注的图片数量往往比较稀少，在这种情况下，数据的增广尤为重要。在训练ImageNet-1k的标准数据增广中，主要使用了Random_Crop与Random_Flip两种数据增广方式，然而，近些年，越来越多的数据增广方式被提出，如cutout、mixup、cutmix、AutoAugment等。实验表明，这些数据的增广方式可以有效提升模型的精度。具体到数据集来说：
 
-- ImageNet-1k：下表列出了ResNet50在8种不同的数据增广方式的表现，可以看出，相比baseline，所有的数据增广方式均有收益，其中cutmix是目前最有效的数据增广。更多数据增广的介绍请参考[**数据增广章节**](../advanced_tutorials/image_augmentation/ImageAugment.md)。
+- ImageNet-1k：下表列出了ResNet50在8种不同的数据增广方式的表现，可以看出，相比baseline，所有的数据增广方式均有收益，其中cutmix是目前最有效的数据增广。更多数据增广的介绍请参考[**数据增广章节**](../advanced_tutorials/DataAugmentation.md)。
 
 | 模型       | 数据增广方式         | Test top-1 |
 |:--:|:--:|:--:|
@@ -332,7 +332,7 @@ Cosine_decay和piecewise_decay的学习率变化曲线如下图所示，容易
 
 - 挖掘相关数据：用在现有数据集上训练饱和的模型去对相关的数据做预测，将置信度较高的数据打label后加入训练集进一步训练，如此循环操作，可进一步提升模型的精度。
 
-- 知识蒸馏：可以先使用一个较大的模型在该数据集上训练一个精度较高的teacher model，然后使用该teacher model去教导一个Student model，其中，Student model即为目标模型。PaddleClas提供了百度自研的SSLD知识蒸馏方案，即使在ImageNet-1k这么有挑战的分类任务上，其也能稳定提升3%以上。SSLD知识蒸馏的的章节请参考[**SSLD知识蒸馏**](../advanced_tutorials/distillation/distillation.md)。
+- 知识蒸馏：可以先使用一个较大的模型在该数据集上训练一个精度较高的teacher model，然后使用该teacher model去教导一个Student model，其中，Student model即为目标模型。PaddleClas提供了百度自研的SSLD知识蒸馏方案，即使在ImageNet-1k这么有挑战的分类任务上，其也能稳定提升3%以上。SSLD知识蒸馏的的章节请参考[**SSLD知识蒸馏**](../advanced_tutorials/knowledge_distillation.md)。
 
 
 <a name="第6期"></a>
@@ -342,13 +342,13 @@ Cosine_decay和piecewise_decay的学习率变化曲线如下图所示，容易
 
 **A**: PaddleClas目前共有3种分支：
 
-* 动态图分支：dygraph分支是PaddleClas的默认分支，也是更新最快的分支。所有的新功能、新改动都会先在dygraph分支上进行。如果想追踪PaddleClas的最新进展，可以关注这个分支。这个分支主要支持动态图，会跟着paddlepaddle的版本一起更新。
+* 开发分支：develop分支是PaddleClas的开发分支，也是更新最快的分支。所有的新功能、新改动都会先在develop分支上进行。如果想追踪PaddleClas的最新进展，可以关注这个分支。这个分支主要支持动态图，会跟着paddlepaddle的版本一起更新。
 
-* 稳定版本分支：快速更新能够让关注者了解最新进展，但也会带来不稳定性。因此在一些关键的时间点，我们会从dygraph分支中拉出分支，提供稳定的版本。这些分支的名字与paddlepaddle的版本对应，如 2.0-beta 为支持paddlepaddle2.0-beta的稳定版本。这些分支一般只会修复bug，而不更新新的特性和模型。
+* 稳定版本分支（如release/2.1.3）：快速更新能够让关注者了解最新进展，但也会带来不稳定性。因此在一些关键的时间点，我们会从develop分支中拉出分支，提供稳定的版本，最新的稳定版分支也是默认分支。需要注意，无特殊情况，我们只会维护最新的release稳定分支，并且一般只会修复bug，而不更新新的特性和模型。
 
-* 静态图分支：master分支是使用静态图版本的分支，主要用来支持一些老用户的使用，也只进行一些简单维护，不会更新新的特性和模型。不建议新用户使用静态图分支。老用户如果有条件，也建议迁到动态图分支或稳定版本分支。
+* 静态图分支（static）：static分支是使用静态图版本的分支，主要用来支持一些老用户的使用，也只进行一些简单维护，不会更新新的特性和模型。不建议新用户使用静态图分支。老用户如果有条件，也建议迁到动态图分支或稳定版本分支。
 
-总的来说，如果想跟进PaddleClas的最新进展，建议选择dygraph分支，如果需要稳定版本，建议选择最新的稳定版本分支。
+总的来说，如果想跟进PaddleClas的最新进展，建议选择develop分支，如果需要稳定版本，建议选择最新的稳定版本分支。
 
 ### Q6.2: 什么是静态图模式？
 
@@ -358,11 +358,7 @@ Cosine_decay和piecewise_decay的学习率变化曲线如下图所示，容易
 
 **A**: 动态图模式即为命令式编程模式，用户无需预先定义网络结构，每行代码都可以直接运行得到结果。相比静态图模式，动态图模式对用户更加友好，调试也更方便。此外，动态图模式的结构设计也更加灵活，可以在运行过程中随时调整结构。
 
-PaddleClas目前持续更新的dygraph分支，主要采用动态图模式。如果您是新用户，建议使用动态图模式来进行开发和训练。如果推理预测时有性能需求，可以在训练完成后，将动态图模型转为静态图模型提高效率。
-
-### Q6.4: 动态图模型的预测效率有时不如静态图，应该怎么办？
-
-**A**: 可以使用转换工具，将动态图模型转换为静态图模型，具体可以参考https://www.paddlepaddle.org.cn/documentation/docs/zh/2.0-rc1/guides/04_dygraph_to_static/index_cn.html。
+PaddleClas目前持续更新的develop分支和稳定版本的release分支，主要采用动态图模式。如果您是新用户，建议使用动态图模式来进行开发和训练。如果推理预测时有性能需求，可以在训练完成后，将动态图模型转为静态图模型提高效率。
 
 ### Q6.5: 构建分类数据集时，如何构建"背景"类别的数据？
 
diff --git a/docs/zh_CN/faq_series/faq_2021_s1.md b/docs/zh_CN/faq_series/faq_2021_s1.md
index ccf53f64d564e7b61be031c3223b94bd46190523..cff7c98bb57c2c048039ee3a6c8f615038aac298 100644
--- a/docs/zh_CN/faq_series/faq_2021_s1.md
+++ b/docs/zh_CN/faq_series/faq_2021_s1.md
@@ -38,7 +38,9 @@
 
 ### Q1.4 PaddleClas提供的10W类图像分类预训练模型在哪里下载，应该怎么使用呢？
 
-**A**：基于ResNet50_vd, 百度开源了自研的大规模分类预训练模型，其中训练数据为10万个类别，4300万张图片。10万类预训练模型的下载地址：[下载地址](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_10w_pretrained.tar)，在这里需要注意的是，该预训练模型没有提供最后的FC层参数，因此无法直接拿来预测；但是可以使用它作为预训练模型，在自己的数据集上进行微调。经过验证，该预训练模型相比于基于ImageNet1k数据集的ResNet50_vd预训练模型，在不同的数据集上均有比较明显的精度收益，最多可达30%，更多的对比实验可以参考：[图像分类迁移学习教程](../application/transfer_learning.md)。
+**A**：基于ResNet50_vd, 百度开源了自研的大规模分类预训练模型，其中训练数据为10万个类别，4300万张图片。10万类预训练模型的下载地址：[下载地址](https://paddle-imagenet-models-name.bj.bcebos.com/ResNet50_vd_10w_pretrained.tar)，在这里需要注意的是，该预训练模型没有提供最后的FC层参数，因此无法直接拿来预测；但是可以使用它作为预训练模型，在自己的数据集上进行微调。经过验证，该预训练模型相比于基于ImageNet1k数据集的ResNet50_vd预训练模型，在不同的数据集上均有比较明显的精度收益，最多可达30%。
+<!-- TODO(gaotingquan): -->
+<!-- ，更多的对比实验可以参考：[图像分类迁移学习教程](../application/transfer_learning.md)。 -->
 
 
 ### Q1.5 使用C++进行预测部署的时候怎么进行加速呢？
@@ -178,7 +180,7 @@ RepVGG网络与ACNet同理，只不过ACNet的`1*d`非对称卷积变成了`1*1`
 **A**:
 1. 图像对CNN的依赖是不必要的，利用Transformer的计算效率和可伸缩性，可以训练很大模型，当模型和数据集增大的情形下，仍然不会存在饱和的情况。受到Transformer在NLP上的启发，在图像分类任务中使用时，将图片分成顺序排列的patches，并将这些patches输入一个线性单元嵌入到embedding作为transformer的输入。
 
-2. 在中等规模数据集中如ImageNet，ImageNet21k，视觉Transformer模型低于相同规模尺寸的ResNet几个百分点。这是因为transformer缺少CNN平移和局限性，在数据量不够大的时候，不能超越卷积网络。
+2. 在中等规模数据集中如ImageNet1k，ImageNet21k，视觉Transformer模型低于相同规模尺寸的ResNet几个百分点。猜测这是因为transformer缺少CNN所具有的局部性(Locality)和空间不变性(Spatial Invariance)的特点，而在数据量不够大的时候，难以超越卷积网络，不过对于这一问题，[DeiT](https://arxiv.org/abs/2012.12877)使用数据增强的方式在一定程度上解决了Vision Transformer依赖超大规模数据集训练的问题。
 
 3. 在超大规模数据集14M-300M训练时，这种方式可以越过局部信息，建模更加长距离的依赖关系，而CNN能较好关注局部信息全局信息捕获能力较弱。
 
@@ -199,7 +201,7 @@ RepVGG网络与ACNet同理，只不过ACNet的`1*d`非对称卷积变成了`1*1`
     <img src="../../images/faq/Transformer_input.png" width="400">
 </div>
 
-3. 考虑以下问题：怎样将一张图片怎么传给encoder？
+3. 考虑以下问题：怎样将一张图片传给encoder？
 
 * 如下图所示。假设输入图片是[224,224,3]，按照顺序从左到右，从上到下，切分成很多个patch，patch大小可以为[p,p,3]（p取值可以是16，32），对其使用Linear Projection of Flattened Patches模块转成特征向量，并concat一个位置向量，传入Encoder中。
 
@@ -218,7 +220,7 @@ RepVGG网络与ACNet同理，只不过ACNet的`1*d`非对称卷积变成了`1*1`
 ### Q4.4: 如何理解归纳偏置Inductive Bias？
 
 **A**:
-1. 在机器学习中，会对算需要应用的问题做一些假设，这个假设就称为归纳偏好。在现实生活中观察得到的现象中归纳出一定的先验规则，然后对模型做一定的约束，从而起到模型选择的作用。在CNN中，假设特征具有局部性(Locality)和空间不变性(Spatial Invariance)的特点，即把相邻的的特征有联系而远离的没有，将相邻特征融合在一起，更会容易产生“解”；还有attention机制，也是从人的直觉、生活经验归纳的规则。
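+1. In machine learning, an algorithm makes certain assumptions about the problem it is applied to; these assumptions are called the inductive bias. Prior rules are induced from phenomena observed in real life and then imposed as constraints on the model, which acts as a form of model selection. CNNs assume that features have locality and spatial invariance: neighboring features are related while distant ones are not, so fusing neighboring features makes it easier to arrive at a "solution". The attention mechanism is likewise a rule induced from human intuition and everyday experience.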
 
 2. Vision Transformer利用的归纳偏置是有序列能力Sequentiality和时间不变性Time Invariance，即序列顺序上的时间间隔的联系，因此也能得出在更大规模数据集上比CNN类的模型有更好的性能。文章Conclusion里的“Unlike prior works using self-attention in computer vision, we do not introduce any image-specific inductive biases into the architecture”和Introduction里的“We find that large scale training trumps inductive bias”，可以得出直观上inductive bias在大量数据的情况中的产生是衰减性能，应该尽可能丢弃。
 
@@ -242,11 +244,11 @@ PaddleClas的模型包含6大模块的配置，分别为：全局配置，网络
 
 学习率和优化器的配置建议优先使用默认配置，这些参数是我们已经调过的。如果任务的改动比较大，也可以做微调。
 
-训练和预测两个配置包含了batch_size，数据集，数据预处理（transforms），读数据进程数（num_workers）等比较重要的配置，这部分要根据实际环境适度修改。要注意的是，paddleclas中的batch_size是全局的配置，即不随卡数发生变化。而num_workers定义的是单卡的进程数，即如果num_workers是8，并且使用4卡训练，则实际有32个worker。
+训练和预测两个配置包含了batch_size，数据集，数据预处理（transforms），读数据进程数（num_workers）等比较重要的配置，这部分要根据实际环境适度修改。要注意的是，paddleclas中的batch_size是单卡配置，如果是多卡训练，则总的batch_size是配置文件中所设置的倍数，例如配置文件中设置batch_size为64，4卡训练，总batch_size也就是4*64=256。而num_workers定义的是单卡的进程数，即如果num_workers是8，并且使用4卡训练，则实际有32个worker。
 
 ### Q5.2: 如何在命令行中快速的修改配置？
 **A**:
-在训练中，我们常常需要对个别配置不断进行微调，而不希望频繁的修改配置文件。这时可以使用-o来调整，修改是要先按层级写出要改的配置名称，层级之间用点分割，再写出要修改的值。例如我们想要修改batch_size，可以在训练的命令后加上-o TRAIN.batchsize=512。
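+During training we often need to keep adjusting individual settings without editing the configuration file every time. In this case, use -o: write the name of the setting by hierarchy, with levels separated by dots, followed by the new value. For example, to change batch_size, append -o DataLoader.TRAIN.sampler.batch_size=512 to the training command.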
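
To make the dotted `-o` syntax concrete, here is a minimal sketch of how such an override could be applied to a nested configuration dictionary. The helper `apply_override` is hypothetical and written only for illustration; it is not PaddleClas's actual implementation, and the value parsing via `ast.literal_eval` is an assumption.

```python
import ast


def apply_override(config, override):
    """Apply one 'A.B.C=value' override to a nested dict (hypothetical helper)."""
    key_path, raw_value = override.split("=", 1)
    keys = key_path.split(".")

    # Walk down the hierarchy, creating intermediate levels if they are missing.
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})

    # Interpret numbers and booleans as Python literals; keep anything else as a string.
    try:
        value = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError):
        value = raw_value

    node[keys[-1]] = value
    return config


config = {"DataLoader": {"TRAIN": {"sampler": {"batch_size": 64}}}}
apply_override(config, "DataLoader.TRAIN.sampler.batch_size=512")
```

Each dot-separated name selects one level of the YAML hierarchy, so only the final key is replaced and the rest of the configuration is left untouched.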
 
 ### Q5.3: 如何根据PaddleClas的精度曲线选择合适的模型？
 **A**:
@@ -264,4 +266,4 @@ PaddleClas提供了多个模型的benchmark，并绘制了性能曲线，主要
 
 ### Q5.5: 使用分类模型做其他任务的预训练模型时，应该选择哪些层作为feature？
 **A**:
-使用分类模型做其他任务的backbone有很多策略，这里介绍一种较为基础的方法。首先，去掉最后的全连接层，这一层主要包含的是原始任务的分类信息。如果任务比较简单，只要将前一层的输出作为featuremap，并在此基础上添加与任务对应的结构即可。如果任务涉及多尺度，需要选取不同尺度的anchor，例如某些检测模型，那么可以选取每次下采样之前一层的输出作为featuremap。
\ No newline at end of file
+使用分类模型做其他任务的backbone有很多策略，这里介绍一种较为基础的方法。首先，去掉最后的全连接层，这一层主要包含的是原始任务的分类信息。如果任务比较简单，只要将前一层的输出作为featuremap，并在此基础上添加与任务对应的结构即可。如果任务涉及多尺度，需要选取不同尺度的anchor，例如某些检测模型，那么可以选取每次下采样之前一层的输出作为featuremap。
diff --git a/docs/zh_CN/faq_series/faq_2021_s2.md b/docs/zh_CN/faq_series/faq_2021_s2.md
index be102b291d14affea34b6ff6282cc48953033436..3172b38393536bd5342f56bf277ba61f330a5ab8 100644
--- a/docs/zh_CN/faq_series/faq_2021_s2.md
+++ b/docs/zh_CN/faq_series/faq_2021_s2.md
@@ -32,11 +32,9 @@
 
 #### Q2.1.8: 如何在训练时使用 `Mixup` 和 `Cutmix` ？
 **A**：
-* `Mixup` 的使用方法请参考 [Mixup](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Mixup.yaml#L63-L65)；`Cuxmix` 请参考 [Cuxmix](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L63-L65)。
+* For how to use `Mixup`, see [Mixup](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Mixup.yaml#L63-L65); for `Cutmix`, see [Cutmix](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L63-L65).
 
-* 在使用 `Mixup` 或 `Cutmix` 时，需要注意：
-    * 配置文件中的 `Loss.Tranin.CELoss` 需要修改为 `Loss.Tranin.MixCELoss`，可参考 [MixCELoss](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L23-L26)；
-    * 使用 `Mixup` 或 `Cutmix` 做训练时无法计算训练的精度（Acc）指标，因此需要在配置文件中取消 `Metric.Train.TopkAcc` 字段，可参考 [Metric.Train.TopkAcc](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L125-L128)。
+* 使用 `Mixup` 或 `Cutmix` 做训练时无法计算训练的精度（Acc）指标，因此需要在配置文件中取消 `Metric.Train.TopkAcc` 字段，可参考 [Metric.Train.TopkAcc](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L125-L128)。
 
 #### Q2.1.9: 训练配置yaml文件中，字段 `Global.pretrain_model` 和 `Global.checkpoints` 分别用于配置什么呢？
 **A**：
@@ -103,9 +101,6 @@ pip install paddle2onnx
 #### Q1.1.1 PaddleClas和PaddleDetection区别
 **A**：PaddleClas是一个兼主体检测、图像分类、图像检索于一体的图像识别repo，用于解决大部分图像识别问题，用户可以很方便的使用PaddleClas来解决小样本、多类别的图像识别问题。PaddleDetection提供了目标检测、关键点检测、多目标跟踪等能力，方便用户定位图像中的感兴趣的点和区域，被广泛应用于工业质检、遥感图像检测、无人巡检等项目。
 
-#### Q1.1.2 PaddleClas 2.2和PaddleClas 2.1完全兼容吗？
-**A**：PaddleClas2.2相对PaddleClas2.1新增了metric learning模块，主体检测模块、向量检索模块。另外，也提供了商品识别、车辆识别、logo识别和动漫人物识别等4个场景应用示例。用户可以基于PaddleClas 2.2快速构建图像识别系统。在图像分类模块，二者的使用方法类似，可以参考[图像分类示例](../tutorials/getting_started.md)快速迭代和评估。新增的metric learning模块，可以参考[metric learning示例](../tutorials/getting_started_retrieval.md)。另外，新版本暂时还不支持fp16、dali训练，也暂时不支持多标签训练，这块内容将在不久后支持。
-
 #### Q1.1.3: Momentum 优化器中的 momentum 参数是什么意思呢？
 **A**: Momentum 优化器是在 SGD 优化器的基础上引入了“动量”的概念。在 SGD 优化器中，在 `t+1` 时刻，参数 `w` 的更新可表示为：
 ```latex
@@ -139,7 +134,7 @@ w_t+1 = w_t - v_t+1
 2. 图像裁剪类： CutOut、RandErasing、HideAndSeek、GridMask；
 3. 图像混叠类：Mixup, Cutmix.
 
-其中，Randangment提供了多种数据增强方式的随机组合，可以满足亮度、对比度、饱和度、色调等多方面的数据增广需求
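+Among these, RandAugment provides random combinations of multiple data augmentation methods and can cover augmentation needs for brightness, contrast, saturation, hue, and more.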
 
 <a name="1.4通用检测模块"></a>
 ### 1.4 通用检测模块
@@ -148,7 +143,7 @@ w_t+1 = w_t - v_t+1
 **A**：主体检测这块的输出数量是可以通过配置文件配置的。在配置文件中Global.threshold控制检测的阈值，小于该阈值的检测框被舍弃，Global.max_det_results控制最大返回的结果数，这两个参数共同决定了输出检测框的数量。
 
 #### Q1.4.2 训练主体检测模型的数据是如何选择的？换成更小的模型会有损精度吗？
-**A**：训练数据是在COCO、Object365、RPC、LogoDet等公开数据集中随机抽取的子集，小模型精度可能会有一些损失，后续我们也会尝试下更小的检测模型。关于主体检测模型的更多信息请参考[主体检测](../application/mainbody_detection.md)。
+**A**：训练数据是在COCO、Object365、RPC、LogoDet等公开数据集中随机抽取的子集。目前我们在2.3版本中推出了超轻量的主体检测模型，具体信息可以参考[主体检测](../image_recognition_pipeline/mainbody_detection.md#2-模型选择)。关于主体检测模型的更多信息请参考[主体检测](../image_recognition_pipeline/mainbody_detection.md)。
 
 #### Q1.4.3: 目前使用的主体检测模型检测在某些场景中会有误检？
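 **A**: The current mainbody detection model was trained on public datasets such as COCO, Object365, RPC, and LogoDet. If the data to be detected differs substantially from common categories, as in industrial quality inspection, the model needs to be fine-tuned starting from the current detection model.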
@@ -169,7 +164,7 @@ w_t+1 = w_t - v_t+1
 ### 1.6 检索模块
 
 #### Q1.6.1 PaddleClas目前使用的Möbius向量检索算法支持类似于faiss的那种index.add()的功能吗? 另外，每次构建新的图都要进行train吗？这里的train是为了检索加速还是为了构建相似的图？
-**A**：Mobius提供的检索算法是一种基于图的近似最近邻搜索算法，目前支持两种距离计算方式：inner product和L2 distance. faiss中提供的index.add功能暂时不支持，如果需要增加检索库的内容，需要从头重新构建新的index. 在每次构建index时，检索算法内部执行的操作是一种类似于train的过程，不同于faiss提供的train接口，我们命名为build, 主要的目的是为了加速检索的速度。
+**A**：目前在release/2.3分支已经支持faiss检索模块，并且不再支持Möbius。关于Möbius提供的检索算法，是一种基于图的近似最近邻搜索算法，目前支持两种距离计算方式：inner product和L2 distance，但是Möbius暂不支持faiss中提供的index.add功能，如果需要增加检索库的内容，需要从头重新构建新的index. 在每次构建index时，检索算法内部执行的操作是一种类似于train的过程，不同于faiss提供的train接口。因此需要faiss模块的话，可以使用release/2.3分支，需要Möbius的话，目前需要回退到release/2.2分支。
 
 #### Q1.6.2: PaddleClas 图像识别用于 Eval 的配置文件中，`Query` 和 `Gallery` 配置具体是用于做什么呢？
 **A**: `Query` 与 `Gallery` 均为数据集配置，其中 `Gallery` 用于配置底库数据，`Query` 用于配置验证集。在进行 Eval 时，首先使用模型对 `Gallery` 底库数据进行前向计算特征向量，特征向量用于构建底库，然后模型对 `Query` 验证集中的数据进行前向计算特征向量，再与底库计算召回率等指标。
@@ -218,11 +213,9 @@ PaddlePaddle is installed successfully! Let's start deep learning with PaddlePad
 
 #### Q2.1.8: 如何在训练时使用 `Mixup` 和 `Cutmix` ？
 **A**：
-* `Mixup` 的使用方法请参考 [Mixup](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Mixup.yaml#L63-L65)；`Cuxmix` 请参考 [Cuxmix](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L63-L65)。
+* For how to use `Mixup`, see [Mixup](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Mixup.yaml#L63-L65); for `Cutmix`, see [Cutmix](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L63-L65).
 
-* 在使用 `Mixup` 或 `Cutmix` 时，需要注意：
-    * 配置文件中的 `Loss.Tranin.CELoss` 需要修改为 `Loss.Tranin.MixCELoss`，可参考 [MixCELoss](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L23-L26)；
-    * 使用 `Mixup` 或 `Cutmix` 做训练时无法计算训练的精度（Acc）指标，因此需要在配置文件中取消 `Metric.Train.TopkAcc` 字段，可参考 [Metric.Train.TopkAcc](https://github.com/PaddlePaddle/PaddleClas/blob/cf9fc9363877f919996954a63716acfb959619d0/ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L125-L128)。
+* 使用 `Mixup` 或 `Cutmix` 做训练时无法计算训练的精度（Acc）指标，因此需要在配置文件中取消 `Metric.Train.TopkAcc` 字段，可参考 [Metric.Train.TopkAcc](../../../ppcls/configs/ImageNet/DataAugment/ResNet50_Cutmix.yaml#L125-L128)。
 
 #### Q2.1.9: 训练配置yaml文件中，字段 `Global.pretrain_model` 和 `Global.checkpoints` 分别用于配置什么呢？
 **A**：
@@ -232,9 +225,9 @@ PaddlePaddle is installed successfully! Let's start deep learning with PaddlePad
 <a name="2.2图像分类"></a>
 ### 2.2 图像分类
 
-#### Q2.2.1 SSLD中，大模型在500M数据上预训练后蒸馏小模型，然后在1M数据上蒸馏finetune小模型？
+#### Q2.2.1 在SSLD中，大模型在500M数据上预训练后蒸馏小模型，然后在1M数据上蒸馏finetune小模型，具体步骤是怎样做的？
 **A**：步骤如下：
-1. 基于facebook开源的`ResNeXt101-32x16d-wsl`模型 去蒸馏得到了`ResNet50-vd`模型；
+1. 基于facebook开源的`ResNeXt101-32x16d-wsl`模型去蒸馏得到了`ResNet50-vd`模型；
 2. Use this `ResNet50-vd` to distill `MobileNetV3` on the 500W (5 million image) dataset;
 3. 考虑到500W的数据集的分布和100W的数据分布不完全一致，所以这块，在100W上的数据上又finetune了一下，精度有微弱的提升。
 
@@ -257,13 +250,13 @@ PaddlePaddle is installed successfully! Let's start deep learning with PaddlePad
 ### 2.4 图像识别模块
 
 #### Q2.4.1: 识别模块预测时报`Illegal instruction`错？
-**A**：可能是编译生成的库文件与您的环境不兼容，导致程序报错，如果报错，推荐参考[向量检索教程](../../../deploy/vector_search/README.md)重新编译库文件。
+**A**：如果使用的是release/2.2分支，建议更新为release/2.3分支，在release/2.3分支中，我们使用faiss检索模块替换了Möbius检索模型，具体可以参考[向量检索教程](../../../deploy/vector_search/README.md)。如仍存在问题，可以在用户微信群中联系我们，也可以在GitHub提issue。
 
 #### Q2.4.2: 识别模型怎么在预训练模型的基础上进行微调训练？
-**A**：识别模型的微调训练和分类模型的微调训练类似，识别模型可以加载商品的预训练模型，训练过程可以参考[识别模型训练](../tutorials/getting_started_retrieval.md)，后续我们也会持续细化这块的文档。
+**A**：识别模型的微调训练和分类模型的微调训练类似，识别模型可以加载商品的预训练模型，训练过程可以参考[识别模型训练](../../zh_CN/models_training/recognition.md)，后续我们也会持续细化这块的文档。
 
 #### Q2.4.3: 训练metric learning时，每个epoch中，无法跑完所有mini-batch，为什么？
-**A**：在训练metric learning时，使用的Sampler是DistributedRandomIdentitySampler，该Sampler不会采样全部的图片，导致会让每一个epoch采样的数据不是所有的数据，所以无法跑完显示的mini-batch是正常现象。后续我们会优化下打印的信息，尽可能减少给大家带来的困惑。
+**A**：在训练metric learning时，使用的Sampler是DistributedRandomIdentitySampler，该Sampler不会采样全部的图片，导致会让每一个epoch采样的数据不是所有的数据，所以无法跑完显示的mini-batch是正常现象。该问题在release/2.3分支已经优化，请更新到release/2.3使用。
 
 #### Q2.4.4: 有些图片没有识别出结果，为什么？
 **A**：在配置文件（如inference_product.yaml）中，`IndexProcess.score_thres`中会控制被识别的图片与库中的图片的余弦相似度的最小值。当余弦相似度小于该值时，不会打印结果。您可以根据自己的实际数据调整该值。
@@ -275,10 +268,10 @@ PaddlePaddle is installed successfully! Let's start deep learning with PaddlePad
 **A**：请确保data_file.txt中图片路径和图片名称中间的间隔为单个table，而不是空格。
 
 #### Q2.5.2: 新增底库数据需要重新构建索引吗？
-**A**：这一版需要重新构建索引，未来版本会支持只构建新增图片的索引。
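+**A**: Starting from the release/2.3 branch, the Möbius retrieval module has been replaced with the faiss retrieval module, which supports adding new gallery data without rebuilding the index from scratch. See the [vector search tutorial](../../../deploy/vector_search/README.md) for details.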
 
 #### Q2.5.3: Mac重新编译index.so时报错如下：clang: error: unsupported option '-fopenmp', 该如何处理？
-**A**：该问题已经解决。可以参照[文档](../../../develop/deploy/vector_search/README.md)重新编译 index.so。
+**A**：如果使用的是release/2.2分支，建议更新为release/2.3分支，在release/2.3分支中，我们使用faiss检索模块替换了Möbius检索模型，具体可以参考[向量检索教程](../../../deploy/vector_search/README.md)。如仍存在问题，可以在用户微信群中联系我们，也可以在GitHub提issue。
 
 #### Q2.5.4: 在 build 检索底库时，参数 `pq_size` 应该如何设置？
 **A**：`pq_size` 是PQ检索算法的参数。PQ检索算法可以简单理解为“分层”检索算法，`pq_size` 是每层的“容量”，因此该参数的设置会影响检索性能，不过，在底库总数据量不太大（小于10000张）的情况下，这个参数对性能的影响很小，因此对于大多数使用场景而言，在构建底库时无需修改该参数。关于PQ检索算法的更多内容，可以查看相关[论文](https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf)。
diff --git a/docs/zh_CN/faq_series/faq_selected_30.md b/docs/zh_CN/faq_series/faq_selected_30.md
new file mode 100644
index 0000000000000000000000000000000000000000..edcad9a93175f391840705196a010492bd41e02c
--- /dev/null
+++ b/docs/zh_CN/faq_series/faq_selected_30.md
@@ -0,0 +1,242 @@
+# FAQ
+
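+## Preface
+
+* We have collected the common questions raised in issues and user groups since the project was open-sourced and given brief answers to them. The aim is to provide a reference for image classification developers and help them avoid detours.
+
+* The image classification field has many experts, and models and papers are updated quickly. The answers in this document rely mainly on limited project experience, so omissions are inevitable. Additions and corrections are very welcome.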
+
+
+## PaddleClas FAQ Summary
+
+* [30 Questions on Image Classification](#图像分类30个问题)
+    * [Basic Knowledge](#基础知识)
+    * [Model Training](#模型训练相关)
+    * [Data](#数据相关)
+    * [Model Inference and Prediction](#模型推理与预测相关)
+* [PaddleClas Usage Questions](#PaddleClas使用问题)
+
+
+<a name="图像分类30个问题"></a>
+## 30 Questions on Image Classification
+
+<a name="基础知识"></a>
+### Basic Knowledge
+
+>>
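+* Q: What are the common evaluation metrics in image classification?
+* A:
+    * For single-label image classification (only one class plus background), the main metrics are Accuracy, Precision, Recall, and F-score. Let TP (True Positive) be the number of positives predicted as positive, FP (False Positive) the number of negatives predicted as positive, TN (True Negative) the number of negatives predicted as negative, and FN (False Negative) the number of positives predicted as negative. Then Accuracy = (TP + TN) / NUM, where NUM is the total number of samples, Precision = TP / (TP + FP), and Recall = TP / (TP + FN).
+    * For image classification with more than one class, the main metrics are Accuracy and Class-wise Accuracy. Accuracy is the percentage of correctly predicted images across all classes out of the total number of images. Class-wise Accuracy computes the Accuracy for the images of each class separately and then averages over all classes.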
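
The formulas above can be sketched in a few lines of plain Python. The function names `binary_metrics` and `class_wise_accuracy` are illustrative choices, not PaddleClas APIs.

```python
def binary_metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def class_wise_accuracy(labels, preds):
    """Average the per-class accuracies so that every class counts equally."""
    correct, count = {}, {}
    for label, pred in zip(labels, preds):
        count[label] = count.get(label, 0) + 1
        correct[label] = correct.get(label, 0) + (1 if label == pred else 0)
    return sum(correct[c] / count[c] for c in count) / len(count)
```

On an imbalanced dataset the two multi-class metrics can differ a lot: a model that always predicts the majority class can score a high Accuracy but a low Class-wise Accuracy.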
+
+>>
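+* Q: How do I choose a suitable model to train for my own task?
+* A: If you plan to deploy on a server, or want accuracy to be as high as possible and have loose requirements on model size or inference speed, we recommend server-oriented series such as ResNet_vd, Res2Net_vd, DenseNet, and Xception. If you plan to deploy on mobile or edge devices, we recommend mobile-oriented series such as MobileNetV3 and GhostNet. We also recommend consulting the speed-accuracy charts in the [model zoo](../models/models_intro.md) when choosing a model.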
+
+>>
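+* Q: How should parameters be initialized, and what kind of initialization speeds up convergence?
+* A: Parameter initialization can affect a model's final performance. In general, if the target dataset is not very large, we recommend initializing from a model pretrained on ImageNet-1k. If the network is designed by hand, or no ImageNet-1k pretrained weights are available yet, you can use Xavier initialization or MSRA initialization. Xavier initialization was proposed with the Sigmoid function in mind and is less suited to ReLU: the deeper the network, the smaller the variance of each layer's input and the harder the network is to train. When a network uses many ReLU activations, MSRA initialization is therefore recommended.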
+
+>>
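+* Q: What are the good current solutions to parameter redundancy in deep neural networks?
+* A: The main approaches to compressing a model and reducing parameter redundancy are pruning, quantization, and knowledge distillation. Pruning removes relatively unimportant weights from the weight matrices and then fine-tunes the network again. Quantization converts floating-point computation to low-bit fixed-point computation, such as 8-bit or 4-bit, which effectively reduces computational intensity, parameter size, and memory consumption. Knowledge distillation uses a teacher model to guide a student model in learning a specific task, so that the small model improves substantially with its parameter count unchanged, and may even reach accuracy similar to the large model.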
+
+>>
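+* Q: How do I choose a suitable classification model as the backbone for other tasks such as object detection, image segmentation, or keypoint detection?
+* A: When speed is not a concern, we recommend using higher-accuracy pretrained models and backbones for most tasks. PaddleClas has open-sourced a series of SSLD knowledge-distilled pretrained models, such as ResNet50_vd_ssld and Res2Net200_vd_26w_4s_ssld, which are strong in both accuracy and speed. Some tasks, such as image segmentation or keypoint detection, place high demands on image resolution; for these we recommend networks such as HRNet that balance depth and resolution. PaddleClas also provides high-accuracy HRNet SSLD-distilled pretrained models such as HRNet_W18_C_ssld and HRNet_W48_C_ssld. You can use these higher-accuracy pretrained models and backbones to improve accuracy on your own downstream tasks.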
+
+>>
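+* Q: What is the attention mechanism, and which attention methods are commonly used today?
+* A: The attention mechanism originates from research on human vision. Applied to computer vision tasks, it can effectively capture the useful regions of an image and thereby improve overall network performance. Commonly used methods include the [SE block](https://arxiv.org/abs/1709.01507), [SK-block](https://arxiv.org/abs/1903.06586), [Non-local block](https://arxiv.org/abs/1711.07971), [GC block](https://arxiv.org/abs/1904.11492), and [CBAM](https://arxiv.org/abs/1807.06521). The core idea is to learn the importance of the feature map across different regions or channels so that the network pays more attention to salient regions.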
+
+<a name="模型训练相关"></a>
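+### Model Training
+
+>>
+* Q: When using a deep convolutional network for image classification, what problems arise when training a model with 10 million classes?
+* A: Because the FC layer has a very large number of parameters, memory, GPU memory, and model storage usage all grow substantially, and convergence also becomes somewhat slower. In this case we recommend adding a lower-dimensional FC layer before the final FC layer, which greatly reduces the model's storage size.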
+
+>>
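+* Q: If the model converges poorly during training, what are the possible causes?
+* A: The main things to check are: (1) Check the data annotations and make sure the labels of the training and validation sets are correct. (2) Try adjusting the learning rate (initially in steps of 10x); a learning rate that is too large (oscillating training) or too small (slow convergence) can both lead to poor convergence. (3) The dataset may be too large for the chosen model, which is too small to learn the features of all the data. (4) Check whether normalization is used in data preprocessing; without it, convergence may be slow. (5) If the dataset is small, try loading the ImageNet-1k pretrained models provided in PaddleClas, which can greatly speed up convergence. (6) The dataset may have a long-tail problem; see the [long-tail solution](#long_tail).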
+
+>>
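+* Q: How do I choose a suitable optimizer when training an image classification task?
+* A: The purpose of an optimizer is to make the loss function as small as possible and thereby find parameters suitable for the task. Optimizers widely used in industry include SGD, RMSProp, Adam, and AdaDelta. SGD with momentum is widely used in both academia and industry, so most of the models we release use it for gradient descent on the loss function. SGD with momentum has two drawbacks: convergence is slow, and setting the initial learning rate relies on a lot of experience. However, with a well-chosen initial learning rate and enough iterations, it often stands out among optimizers and achieves higher accuracy on the validation set. Adaptive learning rate optimizers such as Adam and RMSProp tend to converge faster, but their final accuracy is slightly worse. If you want faster convergence, we recommend these adaptive optimizers; if you want higher final accuracy, we recommend SGD with momentum.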
+
+>>
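+* Q: What are the mainstream learning rate decay strategies, and how should one choose among them?
+* A: The learning rate is a hyperparameter that controls how fast the network weights are adjusted along the gradient of the loss function. The lower the learning rate, the more slowly the loss changes. A low learning rate ensures that no local minimum is skipped, but it also means convergence takes longer, especially when training is stuck on a plateau. We cannot use the same learning rate to update the weights throughout training, otherwise the optimum cannot be reached, so the learning rate needs to be adjusted during training. Early in training the weights are in a randomly initialized state and the loss descends relatively easily, so a larger learning rate can be used. Late in training the weights are already close to the optimum and a large learning rate cannot refine them further, so a smaller learning rate is needed. Many researchers use piecewise_decay, which lowers the learning rate in steps: in standard ResNet50 training we set the initial learning rate to 0.1, reduce it to 1/10 of its value every 30 epochs, and train for 120 epochs in total. Other decay schemes have also been proposed, such as polynomial_decay, exponential_decay, and cosine_decay. Among them, cosine_decay requires no extra hyperparameter tuning and is fairly robust, so it has become the preferred schedule for improving model accuracy. The learning rate curves of cosine_decay and piecewise_decay are shown in the figure below. Over the course of training, cosine_decay keeps a relatively large learning rate, so it converges more slowly, but its final result is somewhat better than that of piecewise_decay.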
+![](../../images/models/lr_decay.jpeg)
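
The two schedules can be written down as a short sketch. The defaults below mirror the ResNet50 setup described above (initial learning rate 0.1, a 10x drop every 30 epochs, 120 epochs in total); the function names are illustrative and the per-epoch granularity is an assumption, since real training code often updates the rate per iteration.

```python
import math


def piecewise_decay(epoch, base_lr=0.1, step=30, gamma=0.1):
    """Multiply the learning rate by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


def cosine_decay(epoch, base_lr=0.1, total_epochs=120):
    """Anneal the learning rate from base_lr to 0 along half a cosine period."""
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))


# Print both schedules at a few epochs to compare their shapes.
for epoch in (0, 30, 60, 90, 119):
    print(epoch, piecewise_decay(epoch), cosine_decay(epoch))
```

The only value cosine_decay needs besides the initial learning rate is the total number of epochs, which is why it involves less tuning than choosing step boundaries by hand.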
+>>
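+* Q: What is the warmup learning rate strategy, and in which scenarios is it used?
+* A: As the name suggests, warmup lets the learning rate warm up first. Early in training we do not use the maximum learning rate directly; instead we train with a gradually increasing learning rate, and once it reaches its peak we decay it with one of the learning rate decay schemes mentioned above. We recommend warmup when training with a large batch_size. Experiments show that warmup steadily improves accuracy when batch_size is large. In experiments with large batch_size, such as training MobileNetV3, we set the warmup length to 5 epochs by default: the learning rate first rises from 0 to its maximum over 5 epochs and then follows the chosen decay schedule.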
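
As a minimal sketch of the idea, the function below combines a 5-epoch linear warmup with cosine decay. The linear ramp, the choice of cosine decay afterwards, and the name `warmup_cosine_lr` are assumptions made for illustration; the text only fixes the warmup length and the starting value of 0.

```python
import math


def warmup_cosine_lr(epoch, base_lr=0.1, warmup_epochs=5, total_epochs=120):
    """Linear warmup from 0 to base_lr, followed by cosine decay to 0."""
    if epoch < warmup_epochs:
        # Warmup phase: the rate grows proportionally to the epoch index.
        return base_lr * epoch / warmup_epochs
    # Decay phase: progress runs from 0 (end of warmup) to 1 (end of training).
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

With these defaults the rate is 0 at epoch 0, reaches 0.1 at epoch 5, and then decays smoothly for the remaining 115 epochs.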
+
+>>
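+* Q: What is `batch size`, and how do I choose a suitable `batch size` for training?
+* A: `batch size` is an important hyperparameter in neural network training; it determines how much data is fed into the network at a time. The paper [Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour](https://arxiv.org/abs/1706.02677) shows that when the learning rate is scaled linearly with `batch size`, convergence accuracy is almost unaffected. When training on ImageNet, most networks use an initial learning rate of 0.1 with a `batch size` of 256, so depending on the model size and available GPU memory you can set the learning rate to `0.1*k` and the batch_size to `256*k`. In practice you can also take this setting as a starting point and tune the learning rate further for better performance.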
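
The linear scaling rule above amounts to a one-line calculation. A minimal sketch, assuming the reference point of learning rate 0.1 at batch size 256 given in the answer (the name `scaled_lr` is illustrative):

```python
def scaled_lr(batch_size, base_lr=0.1, base_batch_size=256):
    """Scale the learning rate linearly with the total batch size."""
    return base_lr * batch_size / base_batch_size


# k = 4: a total batch size of 1024 pairs with a learning rate of about 0.4.
print(scaled_lr(1024))
```

Here `batch_size` means the total batch size across all cards, so with a per-card setting of 64 on 4 cards the value to pass in is 256.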
+>>
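+* Q: What is weight_decay, and how do I choose a suitable value?
+* A: Overfitting is a common term in machine learning: the model performs well on the training data but poorly on the test data. Convolutional neural networks also overfit, and many regularization methods have been proposed to avoid it; weight_decay is one of the most widely used. With the SGD optimizer, weight_decay is equivalent to adding an L2 regularization term to the final loss function. L2 regularization pushes the network weights toward smaller values, so the parameters of the whole network tend toward 0 and generalization improves accordingly. In the major deep learning frameworks this value is the coefficient of the L2 regularization term; in the paddle framework it is named l2_decay, so it is referred to as l2_decay below. The larger the coefficient, the stronger the regularization and the more the model tends toward underfitting. For ImageNet training most networks set it to 1e-4; for small networks such as the MobileNet series it is set between 1e-5 and 4e-5 to avoid underfitting. The right value also depends on the dataset: when the dataset is large, the network tends to underfit and the value can be reduced; when the dataset is small, the network tends to overfit and the value can be increased. The table below shows the accuracy of MobileNetV1_x0_25 on ImageNet-1k with different l2_decay values. Because MobileNetV1_x0_25 is a fairly small network, an overly large l2_decay pushes it toward underfitting, so for this network 3e-5 is a better choice than 1e-4.
+
+| Model             | L2_decay | Train acc1/acc5 | Test acc1/acc5 |
+|:--:|:--:|:--:|:--:|
+| MobileNetV1_x0_25 | 1e-4     | 43.79%/67.61%   | 50.41%/74.70%  |
+| MobileNetV1_x0_25 | 3e-5     | 47.38%/70.83%   | 51.45%/75.45%  |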
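
To see why an L2 term pulls weights toward 0, consider one plain SGD step. The sketch below assumes the regularizer is written as `0.5 * l2_decay * sum(w**2)`, so its gradient is `l2_decay * w`; this convention and the name `sgd_step` are assumptions for illustration, and frameworks may differ by a constant factor.

```python
def sgd_step(weights, grads, lr, l2_decay):
    """One SGD update where each gradient gets the extra L2 term l2_decay * w."""
    return [w - lr * (g + l2_decay * w) for w, g in zip(weights, grads)]


# Even with a zero loss gradient, every weight shrinks slightly toward 0.
print(sgd_step([1.0, -2.0], [0.0, 0.0], lr=0.1, l2_decay=1e-4))
```

A larger `l2_decay` makes this shrinkage stronger on every step, which is the sense in which a large coefficient pushes a small network toward underfitting.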
+
+
+>>
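+* Q: What is label smoothing (label_smoothing), what effect does it have, and where is it typically used?
+* A: Label_smoothing is a regularization method in deep learning; its full name is Label Smoothing Regularization (LSR). In a conventional classification task the loss is the cross entropy between the true one-hot label and the network output. Label_smoothing smooths the true one-hot label so that the network learns from a soft label with probability values instead of a hard label: the position of the true class has the highest probability and every other position has a very small one. See paper [2] for the exact computation. Label_smoothing has a parameter epsilon that describes how much the label is softened. The larger the value, the smaller the probability assigned to the true class after smoothing and the smoother the label; the smaller the value, the closer the label is to a hard label. In ImageNet-1k experiments it is usually set to 0.1.
+In ImageNet-1k experiments we found that models of ResNet50 size and above gain accuracy consistently with label_smoothing. The table below shows the accuracy of ResNet50_vd before and after using label_smoothing. Because label_smoothing acts as a regularizer, on relatively small models the gain is not obvious and accuracy may even drop. The table also shows the accuracy of ResNet18 on ImageNet-1k before and after using label_smoothing, where accuracy clearly decreases.
+
+| Model       | Use_label_smoothing | Test acc1 |
+|:--:|:--:|:--:|
+| ResNet50_vd | 0    | 77.9%  |
+| ResNet50_vd | 1    | 78.4%  |
+| ResNet18    | 0    | 71.0%  |
+| ResNet18    | 1    | 70.8%  |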
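
A minimal sketch of how a soft label can be built from a hard one. It uses the common formulation in which the one-hot vector is mixed with a uniform distribution over all classes; the source defers to the cited paper for the exact computation, so treat this formula and the name `smooth_label` as assumptions.

```python
def smooth_label(target, num_classes, epsilon=0.1):
    """Mix a one-hot label with the uniform distribution over all classes."""
    off_value = epsilon / num_classes
    return [
        1.0 - epsilon + off_value if index == target else off_value
        for index in range(num_classes)
    ]


# With 4 classes and epsilon = 0.1, the true class gets 0.925 and the others 0.025.
print(smooth_label(2, 4))
```

Increasing `epsilon` lowers the value at the true class and raises the others, which matches the statement that a larger value gives a smoother label.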
+
+
+>>
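+* Q: During training, how can the accuracy or loss on the training and validation sets guide further tuning?
+* A: Training usually prints the training set accuracy and the validation set accuracy for each epoch, which describe the model's performance on the two datasets. In general, a training accuracy slightly higher than or roughly equal to the validation accuracy is a good state. If the training accuracy is much higher than the validation accuracy, the model is overfitting and more regularization is needed, such as increasing l2_decay, adding more data augmentation, or adding label_smoothing. If the training accuracy is somewhat lower than the validation accuracy, the model may be underfitting and regularization should be weakened, such as decreasing l2_decay, using fewer data augmentation methods, enlarging the crop area, weakening the image stretching transform, or removing label_smoothing.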
+
+>>
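+* Q: How can an existing pretrained model be used to improve accuracy on my own dataset?
+* A: In computer vision today, loading a pretrained model to train on your own task is common practice; compared with training from random initialization, it often improves accuracy on the specific task. The pretrained models widely used in industry are generally trained on the ImageNet-1k dataset, with 1.28 million images in 1000 classes. The fc layer weight of such a model is a k\*1000 matrix, where k is the number of neurons before the fc layer, and the fc layer weights do not need to be loaded when loading the pretrained weights. As for the learning rate, if your training dataset is very small (for example fewer than 1,000 images), we recommend a smaller initial learning rate such as 0.001 (batch_size: 256, same below) so that a large learning rate does not destroy the pretrained weights. If your training dataset is relatively large (more than 100,000 images), we recommend trying a larger initial learning rate such as 0.01 or higher.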
+
+<a name="数据相关"></a>
+### Data
+
+>>
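+* Q: What steps does data preprocessing for image classification usually include?
+* A: Taking ResNet50 trained on ImageNet-1k as an example, an image goes through the following steps before it enters the network: image decoding, random cropping, random horizontal flipping, normalization, data reordering, and batching. Image decoding reads the image file into memory. Random cropping randomly stretches the decoded image and crops it to 224 in both height and width. Random horizontal flipping flips the cropped image horizontally with probability 0.5. Normalization centers each channel by subtracting the mean and scales it by the standard deviation, so that the data follows the normal distribution `N(0,1)` as closely as possible. Data reordering changes the layout from `[224,224,3]` to `[3,224,224]`. Batching groups multiple images into one batch that is fed to the network for training.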
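+
+The normalization step above can be written per channel as follows. This is a sketch under assumptions: $x_c$ is a raw pixel value in $[0, 255]$ for channel $c$, and $\mu_c$, $\sigma_c$ are the per-channel mean and standard deviation (the `NormalizeImage` settings in the Eval configuration later in this FAQ use `scale: 1.0/255.0`, `mean: [0.485, 0.456, 0.406]`, and `std: [0.229, 0.224, 0.225]`).
+
+$$
+x'_c = \frac{x_c / 255 - \mu_c}{\sigma_c}
+$$
+
+For example, a value of 255 in the first channel maps to $(1 - 0.485) / 0.229 \approx 2.25$.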
+
+>>
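+* Q: How does random cropping affect the performance of small models during training?
+* A: In the standard preprocessing for ImageNet-1k, the random crop function defines two values, scale and ratio, which determine the crop size and the degree of stretching respectively. The default range of scale is 0.08-1 (lower_scale-upper_scale) and the default range of ratio is 3/4-4/3 (lower_ratio-upper_ratio). When training very small networks, this kind of augmentation makes the network underfit and lowers accuracy. To improve accuracy, the augmentation can be weakened by enlarging the crop area or reducing the degree of stretching, that is, by increasing lower_scale or by narrowing the gap between lower_ratio and upper_ratio. The table below lists the accuracy of MobileNetV2_x0_25 trained with different lower_scale values; enlarging the crop area improves both training and validation accuracy.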
+
+| Model               | Scale range | Train_acc1/acc5 | Test_acc1/acc5 |
+|:--:|:--:|:--:|:--:|
+| MobileNetV2_x0_25 | [0.08,1]  | 50.36%/72.98%   | 52.35%/75.65%  |
+| MobileNetV2_x0_25 | [0.2,1]   | 54.39%/77.08%   | 53.18%/76.14%  |
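+
+The effect of scale and ratio on the crop can be sketched as follows, assuming the widely used random-resized-crop formulation (the exact sampling in PaddleClas may differ). For an image of width $W$ and height $H$, a scale $s$ is drawn from the scale range and an aspect ratio $r$ from the ratio range; the crop width $w$ and height $h$ are then
+
+$$
+w = \sqrt{s\,W H\,r}, \qquad h = \sqrt{\frac{s\,W H}{r}}, \qquad w\,h = s\,W H
+$$
+
+With a lower bound of 0.08 on $s$, a crop may cover as little as 8% of the image; raising the bound to 0.2 guarantees at least 20%, so each training sample keeps more of the object.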
+
+
+>>
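+* Q: When data is insufficient, which common data augmentation methods can increase the diversity of training samples?
+* A: PaddleClas groups the common data augmentation methods into three categories: image transformation, image cropping, and image mixing. Image transformation mainly includes AutoAugment and RandAugment; image cropping mainly includes CutOut, RandErasing, HideAndSeek, and GridMask; image mixing mainly includes Mixup and Cutmix. For a more detailed introduction, see the [data augmentation chapter](../algorithm_introduction/DataAugmentation.md).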
+>>
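+* Q: For image classification scenarios where occlusion is common, which data augmentation methods should be used to improve accuracy?
+* A: During training, try cropping-type augmentations such as CutOut, RandErasing, HideAndSeek, and GridMask on the training set, so that the model learns not only the salient regions but also the non-salient ones and can still recognize objects well under occlusion.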
+
+>>
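+* Q: When color variation is complex, which data augmentation methods should be used to improve accuracy?
+* A: Consider AutoAugment or RandAugment. Both policies include a rich set of color transforms such as sharpening and histogram equalization, which make the model more robust to these changes during training.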
+>>
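+* Q: How do Mixup and Cutmix work? Why are they also very effective data augmentation methods?
+* A: Mixup generates a new image by linearly combining two images, and the corresponding labels are combined linearly in the same way for training. Cutmix randomly crops a region of interest (ROI) from one image and pastes it over the corresponding region of the current image; the labels are combined linearly in proportion to the image areas. Both methods generate samples and labels that differ from those in the training set and let the network learn from them, which increases sample diversity.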
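+
+The two mixing rules can be sketched as follows. The notation is illustrative: $(x_A, y_A)$ and $(x_B, y_B)$ are two training samples with one-hot labels, $\lambda \in [0, 1]$ is the mixing coefficient, and $M$ is a binary mask that is 0 inside the pasted box and 1 elsewhere. For Mixup:
+
+$$
+\tilde{x} = \lambda\,x_A + (1 - \lambda)\,x_B, \qquad \tilde{y} = \lambda\,y_A + (1 - \lambda)\,y_B
+$$
+
+For Cutmix, with a pasted box of width $r_w$ and height $r_h$ in a $W \times H$ image:
+
+$$
+\tilde{x} = M \odot x_A + (1 - M) \odot x_B, \qquad \tilde{y} = \lambda\,y_A + (1 - \lambda)\,y_B, \qquad \lambda = 1 - \frac{r_w\,r_h}{W H}
+$$
+
+In both cases the label weight of each source image matches its share of the mixed image.
+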
+>>
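+* Q: For an image classification task without very high accuracy requirements, roughly how large a training dataset is needed?
+* A: The amount of training data depends on the complexity of the problem. The harder the task and the higher the accuracy requirement, the more data is needed, and in practice more training data generally gives better results. As a rule of thumb, when loading a pretrained model, 10-20 images per class are enough for basic classification performance; without a pretrained model, each class needs at least 100-200 images.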
+
+<a name="long_tail"></a>
+>>
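+* Q: For datasets with a long-tailed distribution, which methods are commonly used?
+* A: (1) Resample the classes with few samples to increase how often they appear; (2) modify the loss to increase the loss weight of images from classes with few images; (3) borrow from transfer learning: learn general knowledge from the common classes and transfer it to the classes with few samples.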
+
+<a name="模型推理与预测相关"></a>
+### Model Inference and Prediction
+
+>>
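+* Q: Sometimes the foreground object of interest occupies only a small part of the image, and classifying the original image directly works poorly. What can be done?
+* A: Add a mainbody detection model before classification and classify the detected foreground object; this can greatly improve the final recognition result. If time cost is not a concern, multi-crop can also be used, fusing all the predictions to decide the final class.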
+>>
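+* Q: Which model prediction methods are currently recommended?
+* A: After training, we recommend exporting the inference model and predicting with the Paddle inference engine; both python inference and cpp inference are currently supported. For service-based deployment, we recommend PaddleServing.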
+>>
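+* Q: After training, which prediction-time methods can further improve model accuracy?
+* A: (1) Use a larger prediction resolution: if training used 224, consider predicting at 288 or 320, which directly brings an accuracy gain of about 0.5%. (2) Use Test Time Augmentation (TTA): create multiple copies of each test image through rotation, flipping, color transforms, and so on, predict on each copy, and fuse all the predictions; this can greatly improve the accuracy and robustness of the predictions. (3) Use multi-model fusion: fuse the predictions of several models for the same image.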
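+
+The fusion step in (2) and (3) is often a plain average of class probabilities. The sketch below assumes that choice; $p_m(x)$ is the probability vector from the $m$-th augmented copy or model, and $M$ is the number of predictions being fused. Weighted averaging or voting are alternatives.
+
+$$
+\bar{p}(x) = \frac{1}{M}\sum_{m=1}^{M} p_m(x), \qquad \hat{y} = \arg\max_k\,\bar{p}_k(x)
+$$
+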
+>>
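+* Q: For multi-model fusion, how should the models be chosen?
+* A: If prediction speed is not a concern, choose models with accuracy as high as possible, and prefer models of different architectures or series. For example, at similar accuracy, fusing ResNet50_vd with Xception65 usually gives somewhat better results than fusing ResNet50_vd with ResNet101_vd.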
+
+>>
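+* Q: Which acceleration methods are commonly used when predicting with a fixed model?
+* A: (1) Use a GPU with better performance; (2) increase the prediction batch size; (3) use TensorRT and FP16 half-precision floating point for prediction.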
+
+
+<a name="PaddleClas使用问题"></a>
+## PaddleClas Usage Questions
+
+>>
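+* Q: During evaluation and prediction, the folder containing the pretrained model has been specified, but the parameters still cannot be loaded. Why?
+* A: When loading a pretrained model, you need to specify the prefix of the pretrained model. For example, if the parameters are in the folder `output/ResNet50_vd/19` and the parameter file is `output/ResNet50_vd/19/ppcls.pdparams`, then `pretrained_model` should be set to `output/ResNet50_vd/19/ppcls`; PaddleClas appends the `.pdparams` suffix automatically.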
+
+
+>>
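+* Q: When evaluating the `EfficientNetB0_small` model, why is the final accuracy always about 0.3% lower than the official one?
+* A: The `EfficientNet` series uses `cubic interpolation` for resize (the resize parameter `interpolation` is set to 2), whereas for other models the default is None, so the interpolation value of resize must be specified explicitly for training and evaluation. Specifically, see the ResizeImage parameters in the preprocessing part of the following configuration.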
+```yaml
+  Eval:
+    dataset:
+      name: ImageNetDataset
+      image_root: ./dataset/ILSVRC2012/
+      cls_label_path: ./dataset/ILSVRC2012/val_list.txt
+      transform_ops:
+        - DecodeImage:
+            to_rgb: True
+            channel_first: False
+        - ResizeImage:
+            resize_short: 256
+            interpolation: 2
+        - CropImage:
+            size: 224
+        - NormalizeImage:
+            scale: 1.0/255.0
+            mean: [0.485, 0.456, 0.406]
+            std: [0.229, 0.224, 0.225]
+            order: ''
+```
+
+>>
+* Q: Under python2, using visualdl raises the following error: `TypeError: __init__() missing 1 required positional argument: 'sync_cycle'`. Why?
+* A: visualdl currently runs only under python3 and must be version 2.0 or later. If the installed visualdl version is wrong, install it with: `pip3 install visualdl -i https://mirror.baidu.com/pypi/simple`
+
+>>
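+* Q: When I measure the single-image prediction speed of ResNet50_vd, it is much slower than the official benchmark, and the CPU is much faster than the GPU. Why?
+* A: Model prediction requires initialization, which is time-consuming. When measuring prediction speed, run a batch of images, discard the prediction time of the first several images, and then compute the average time. The GPU is slower than the CPU when timing a single image because GPU initialization is much slower than CPU initialization.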
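+
+One way to write the averaging described above, assuming $N$ timed predictions of which the first $k$ are treated as warm-up and discarded (both symbols are illustrative, and $t_i$ is the time of the $i$-th prediction):
+
+$$
+\bar{t} = \frac{1}{N - k}\sum_{i=k+1}^{N} t_i
+$$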
+
+>>
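+* Q: Can grayscale images be used for model training?
+* A: Yes, but the model input shape must be changed to `[1, 224, 224]`, and the data augmentation part also needs to be adapted. To make better use of the PaddleClas code, however, we recommend converting grayscale images to 3-channel images for training (with equal pixel values in the R, G, and B channels).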
+
+>>
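+* Q: How can I train a model on Windows or on CPU?
+* A: See the [getting started tutorial](../models_training/classification.md), which describes in detail how to train, evaluate, and predict with models in Linux, Windows, CPU, and other environments.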
+>>
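+* Q: How can label smoothing be used during model training?
+* A: Set it under the `Loss` field in the configuration file, as shown below. `epsilon: 0.1` sets the value to 0.1; if the `epsilon` field is not set, `label smoothing` is not used.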
+```yaml
+Loss:
+  Train:
+    - CELoss:
+        weight: 1.0
+        epsilon: 0.1
+```
+>>
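+* Q: Can the 100,000-class image classification pretrained model provided by PaddleClas be used for model inference?
+* A: This 100,000-class pretrained model does not include the parameters of the fc (fully connected) layer, so it cannot be used for inference; for now it can be used for fine-tuning.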
+>>
+* Q: When using `deploy/python/predict_cls.py` for prediction, this error is reported: `Error: Pass tensorrt_subgraph_pass has not been registered`. Why?
+* A: To use TensorRT for inference, you need to install, or compile yourself, a PaddlePaddle build with TensorRT. Linux, Windows, and macOS users can refer to [download the inference library](https://paddleinference.paddlepaddle.org.cn/user_guides/download_lib.html). If no version there meets your needs, compile and install it locally; for the build steps, see [compile from source](https://paddleinference.paddlepaddle.org.cn/user_guides/source_compile.html).
+>>
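+* Q: How can Automatic Mixed Precision (AMP) training be used during training?
+* A: See the configuration file [ResNet50_fp16.yaml](../../../ppcls/configs/ImageNet/ResNet/ResNet50_fp16.yaml). Specifically, to make your own configuration file support automatic mixed precision during training, add the following settings to it.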
+```yaml
+# mixed precision training
+AMP:
+  scale_loss: 128.0
+  use_dynamic_loss_scaling: True
+  use_pure_fp16: &use_pure_fp16 True
+```
diff --git a/docs/zh_CN/others/update_history.md b/docs/zh_CN/others/update_history.md
index 11dc6ff5d7139f8e9f4e51c5320f7d4817e6c2bd..e8c05caff4519a94f595fc0a8e3793afa9323217 100644
--- a/docs/zh_CN/others/update_history.md
+++ b/docs/zh_CN/others/update_history.md
@@ -1,4 +1,5 @@
 # Update History
+
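 - 2021.08.11 Updated 7 [FAQ](docs/zh_CN/faq_series/faq_2021_s2.md) entries.
 - 2021.06.29 Added the Swin-transformer series models, with Top1 acc on the ImageNet1k dataset reaching up to 87.2%; training, prediction, evaluation, and whl package deployment are supported. The pretrained models can be downloaded from [here](docs/zh_CN/models/models_intro.md).
 - 2021.06.22,23,24 The official PaddleClas R&D team gave a three-day live course with in-depth technical explanations. Course replay: [https://aistudio.baidu.com/aistudio/course/introduce/24519](https://aistudio.baidu.com/aistudio/course/introduce/24519)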
diff --git a/docs/zh_CN/quick_start/quick_start_recognition.md b/docs/zh_CN/quick_start/quick_start_recognition.md
index 6fd5fd7effc766c721f30c2edc9f38ccf7312ef0..076fb6d1767a7e67b0ad37c44116fe8348b2947b 100644
--- a/docs/zh_CN/quick_start/quick_start_recognition.md
+++ b/docs/zh_CN/quick_start/quick_start_recognition.md
@@ -9,19 +9,20 @@
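 * [1. Environment Setup](#环境配置)
 * [2. Image Recognition Demo](#图像识别体验)
   * [2.1 Download and Extract the Inference Models and Demo Data](#下载、解压inference_模型与demo数据)
-  * [2.2 Product Recognition and Retrieval](#商品识别与检索)
+  * [2.2 Bottled Drink Recognition and Retrieval](#瓶装饮料识别与检索)
     * [2.2.1 Recognizing a Single Image](#识别单张图像)
     * [2.2.2 Batch Recognition on a Folder](#基于文件夹的批量识别)
 * [3. Recognizing Images of Unknown Categories](#未知类别的图像识别体验)
   * [3.1 Prepare New Data and Labels](#准备新的数据与标签)
   * [3.2 Build a New Index Library](#建立新的索引库)
   * [3.3 Image Recognition with the New Index Library](#基于新的索引库的图像识别)
+* [4. Server-Side Recognition Model List](#4)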
 
 
 <a name="环境配置"></a>
 ## 1. Environment Setup
 
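-* Installation: first set up the PaddleClas environment by following [Quick Installation](./install.md).
+* Installation: first set up the PaddleClas environment by following the [Paddle installation tutorial](../installation/install_paddle.md) and the [PaddleClas installation tutorial](../installation/install_paddleclas.md).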
 
 * Enter the `deploy` directory. Everything in this section, including all commands, must be run from the `deploy` directory, which you can enter with the command below.
 
@@ -32,33 +33,23 @@
 <a name="图像识别体验"></a>
 ## 2. Image Recognition Demo
 
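-The download links for the detection model, the recognition inference models for 4 directions (Logo, cartoon characters, vehicles, products), the test data, and the corresponding configuration files are as follows.
-
-Server-side general mainbody detection model and recognition models for each direction:
+The lightweight general mainbody detection model, the lightweight general recognition model, and their configuration files can be downloaded as listed in the table below.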
 
 | Model       | Recommended scenario   | inference model  | Prediction config file  | Config file for building the index library |
 | ------------  | ------------- | -------- | ------- | -------- |
-| General mainbody detection model | General scenarios  |[model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/ppyolov2_r50vd_dcn_mainbody_v1.0_infer.tar) | - | - |
-| Logo recognition model | Logo scenarios  | [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/logo_rec_ResNet50_Logo3K_v1.0_infer.tar) | [inference_logo.yaml](../../../deploy/configs/inference_logo.yaml) | [build_logo.yaml](../../../deploy/configs/build_logo.yaml) |
-| Cartoon character recognition model | Cartoon character scenarios  | [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/cartoon_rec_ResNet50_iCartoon_v1.0_infer.tar) | [inference_cartoon.yaml](../../../deploy/configs/inference_cartoon.yaml) | [build_cartoon.yaml](../../../deploy/configs/build_cartoon.yaml) |
-| Fine-grained vehicle classification model | Vehicle scenarios  |  [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/vehicle_cls_ResNet50_CompCars_v1.0_infer.tar) | [inference_vehicle.yaml](../../../deploy/configs/inference_vehicle.yaml) | [build_vehicle.yaml](../../../deploy/configs/build_vehicle.yaml) |
-| Product recognition model | Product scenarios  |  [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/product_ResNet50_vd_aliproduct_v1.0_infer.tar) | [inference_product.yaml](../../../deploy/configs/inference_product.yaml) | [build_product.yaml](../../../deploy/configs/build_product.yaml) |
-| Vehicle ReID model | Vehicle ReID scenarios | [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/vehicle_reid_ResNet50_VERIWild_v1.0_infer.tar) | - | - |
+| Lightweight general mainbody detection model | General scenarios  |[model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/picodet_PPLCNet_x2_5_mainbody_lite_v1.0_infer.tar) | - | - |
+| Lightweight general recognition model | General scenarios  | [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/general_PPLCNet_x2_5_lite_v1.0_infer.tar) | [inference_general.yaml](../../../deploy/configs/inference_general.yaml) | [build_general.yaml](../../../deploy/configs/build_general.yaml) |
 
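-Lightweight general mainbody detection model and lightweight general recognition model:
+The demo data for this section can be downloaded here: [bottled drink data download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/data/drink_dataset_v1.0.tar).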
 
-| Model       | Recommended scenario   | inference model  | Prediction config file  | Config file for building the index library |
-| ------------  | ------------- | -------- | ------- | -------- |
-| Lightweight general mainbody detection model | General scenarios  |[model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/picodet_PPLCNet_x2_5_mainbody_lite_v1.0_infer.tar) | - | - |
-| Lightweight general recognition model | General scenarios  | [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/general_PPLCNet_x2_5_lite_v1.0_infer.tar) | [inference_product.yaml](../../../deploy/configs/inference_product.yaml) | [build_product.yaml](../../../deploy/configs/build_product.yaml) |
 
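-The demo data for this section can be downloaded here: [data download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/data/recognition_demo_data_v1.1.tar).
+To try the server-side mainbody detection model and the recognition models for specific verticals, see [Section 4](#4).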
+
 
 **Note**
 
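 1. On Windows, if wget is not installed, you can install wget and tar by following the steps below, or copy the link into a browser to download the model and then extract it into the corresponding directory. Linux and macOS users can right-click, copy the download link, and download it with the `wget` command.
 2. If the `wget` command is not installed on macOS, run the commands below to install it.
-3. The prediction and index-building configuration files of the lightweight general recognition model currently reuse those of the server-side product recognition model; you can modify the model path yourself to build the index and run recognition.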
 
 ```shell
 # Install homebrew
@@ -87,47 +78,42 @@ wget {数据下载链接地址} && tar -xf {压缩包的名称}
 
 ### 2.1 Download and Extract the Inference Models and Demo Data
 
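-Taking product recognition as an example, download the demo dataset and the general detection and recognition models with the commands below.
+Download the demo dataset and the lightweight mainbody detection and recognition models with the commands below.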
 
 ```shell
 mkdir models
 cd models
 # Download and extract the general detection inference model
-wget https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/ppyolov2_r50vd_dcn_mainbody_v1.0_infer.tar && tar -xf ppyolov2_r50vd_dcn_mainbody_v1.0_infer.tar
+wget https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/picodet_PPLCNet_x2_5_mainbody_lite_v1.0_infer.tar && tar -xf picodet_PPLCNet_x2_5_mainbody_lite_v1.0_infer.tar
 # Download and extract the recognition inference model
-wget https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/product_ResNet50_vd_aliproduct_v1.0_infer.tar && tar -xf product_ResNet50_vd_aliproduct_v1.0_infer.tar
+wget https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/general_PPLCNet_x2_5_lite_v1.0_infer.tar && tar -xf general_PPLCNet_x2_5_lite_v1.0_infer.tar
 
 cd ../
 # Download and extract the demo data
-wget https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/data/recognition_demo_data_v1.1.tar && tar -xf recognition_demo_data_v1.1.tar
+wget https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/data/drink_dataset_v1.0.tar && tar -xf drink_dataset_v1.0.tar
 ```
 
-After extraction, the `recognition_demo_data_v1.1` folder should have the following structure:
+After extraction, the `drink_dataset_v1.0/` folder should have the following structure:
 
 ```
-├── recognition_demo_data_v1.1
-│   ├── gallery_cartoon
-│   ├── gallery_logo
-│   ├── gallery_product
-│   ├── gallery_vehicle
-│   ├── test_cartoon
-│   ├── test_logo
-│   ├── test_product
-│   └── test_vehicle
+├── drink_dataset_v1.0/
+│   ├── gallery/
+│   ├── index/
+│   ├── test_images/
 ├── ...
 ```
 
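-The `gallery_xxx` folders hold the original images used to build the index library, and the `test_xxx` folders hold the images used to test recognition.
+The `gallery` folder holds the original images used to build the index library, `index` holds the index library built from those images, and the `test_images` folder holds the images used to test recognition.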
 
 
 The `models` folder should have the following structure:
 
 ```
-├── product_ResNet50_vd_aliproduct_v1.0_infer
+├── general_PPLCNet_x2_5_lite_v1.0_infer
 │   ├── inference.pdiparams
 │   ├── inference.pdiparams.info
 │   └── inference.pdmodel
-├── ppyolov2_r50vd_dcn_mainbody_v1.0_infer
+├── picodet_PPLCNet_x2_5_mainbody_lite_v1.0_infer
 │   ├── inference.pdiparams
 │   ├── inference.pdiparams.info
 │   └── inference.pdmodel
@@ -135,16 +121,17 @@ wget https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/data/recognit
 
 **Note**
 
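-If you use the lightweight general recognition model, features must be re-extracted and the index rebuilt for the demo data, as follows:
+If you use the server-side general recognition model, features must be re-extracted and the index rebuilt for the demo data, as follows: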
 
 ```shell
-python3.7 python/build_gallery.py -c configs/build_product.yaml -o Global.rec_inference_model_dir=./models/general_PPLCNet_x2_5_lite_v1.0_infer
+# Build the index library with the downloaded server-side product recognition model
+python3.7 python/build_gallery.py -c configs/build_general.yaml -o Global.rec_inference_model_dir=./models/product_ResNet50_vd_aliproduct_v1.0_infer
 ```
 
-<a name="商品识别与检索"></a>
-### 2.2 Product Recognition and Retrieval
+<a name="瓶装饮料识别与检索"></a>
+### 2.2 Bottled Drink Recognition and Retrieval
 
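-Taking the product recognition demo as an example, this section shows the recognition and retrieval process (to try other directions, download and extract the corresponding demo data and models, then swap in the corresponding configuration file to run prediction).
+Taking the bottled drink recognition demo as an example, this section shows the recognition and retrieval process (to try other directions, download and extract the corresponding demo data and models, then swap in the corresponding configuration file to run prediction).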
 
 Note that this part uses `faiss` as the retrieval library, which can be installed as follows:
 
@@ -158,26 +145,26 @@ pip install faiss-cpu==1.7.1post2
 
 #### 2.2.1 Recognizing a Single Image
 
-Run the command below to recognize and retrieve the image `./recognition_demo_data_v1.1/test_product/daoxiangcunjinzhubing_6.jpg`
+Run the command below to recognize and retrieve the image `./drink_dataset_v1.0/test_images/nongfu_spring.jpeg`
 
 ```shell
 # Predict with the GPU using the command below
-python3.7 python/predict_system.py -c configs/inference_product.yaml
+python3.7 python/predict_system.py -c configs/inference_general.yaml
 # Predict with the CPU using the command below
-python3.7 python/predict_system.py -c configs/inference_product.yaml -o Global.use_gpu=False
+python3.7 python/predict_system.py -c configs/inference_general.yaml -o Global.use_gpu=False
 ```
 
 The query image is shown below.
 
 <div align="center">
-<img src="../../images/recognition/product_demo/query/daoxiangcunjinzhubing_6.jpg"  width = "400" />
+<img src="../../images/recognition/drink_data_demo/test_images/nongfu_spring.jpeg"  width = "400" />
 </div>
 
 
 The final output is as follows.
 
-```json
-[{'bbox': [287, 129, 497, 326], 'rec_docs': '稻香村金猪饼', 'rec_scores': 0.8309420943260193}, {'bbox': [99, 242, 313, 426], 'rec_docs': '稻香村金猪饼', 'rec_scores': 0.7245652079582214}]
+```
+[{'bbox': [244, 49, 509, 964], 'rec_docs': '农夫山泉-饮用天然水', 'rec_scores': 0.7585664}]
 ```
 
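 Here bbox is the location of the detected object, rec_docs is the category in the index library most similar to the detection box, and rec_scores is the corresponding confidence.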
@@ -185,7 +172,7 @@ python3.7 python/predict_system.py -c configs/inference_product.yaml -o Global.u
 The visualized detection result is also saved in the `output` folder; for this image, the visualized recognition result is shown below.
 
 <div align="center">
-<img src="../../images/recognition/product_demo/result/daoxiangcunjinzhubing_6.jpg"  width = "400" />
+<img src="../../images/recognition/drink_data_demo/output/nongfu_spring.jpeg"  width = "400" />
 </div>
 
 
@@ -196,20 +183,24 @@ python3.7 python/predict_system.py -c configs/inference_product.yaml -o Global.u
 
 ```shell
 # Predict with the GPU using the command below; to predict with the CPU, append -o Global.use_gpu=False to the command
-python3.7 python/predict_system.py -c configs/inference_product.yaml -o Global.infer_imgs="./recognition_demo_data_v1.1/test_product/"
+python3.7 python/predict_system.py -c configs/inference_general.yaml -o Global.infer_imgs="./drink_dataset_v1.0/test_images/"
 ```
 
 The terminal prints the recognition results of all images in the folder, as shown below.
 
-```json
+```
 ...
-[{'bbox': [37, 29, 123, 89], 'rec_docs': '香奈儿包', 'rec_scores': 0.6163763999938965}, {'bbox': [153, 96, 235, 175], 'rec_docs': '香奈儿包', 'rec_scores': 0.5279821157455444}]
-[{'bbox': [735, 562, 1133, 851], 'rec_docs': '香奈儿包', 'rec_scores': 0.5588355660438538}]
-[{'bbox': [124, 50, 230, 129], 'rec_docs': '香奈儿包', 'rec_scores': 0.6980369687080383}]
-[{'bbox': [0, 0, 275, 183], 'rec_docs': '香奈儿包', 'rec_scores': 0.5818190574645996}]
-[{'bbox': [400, 1179, 905, 1537], 'rec_docs': '香奈儿包', 'rec_scores': 0.9814301133155823}]
-[{'bbox': [544, 4, 1482, 932], 'rec_docs': '香奈儿包', 'rec_scores': 0.5143815279006958}]
-[{'bbox': [29, 42, 194, 183], 'rec_docs': '香奈儿包', 'rec_scores': 0.9543638229370117}]
+[{'bbox': [345, 95, 524, 586], 'rec_docs': '红牛-强化型', 'rec_scores': 0.80164653}]
+Inference: 23.43583106994629 ms per batch image
+[{'bbox': [233, 0, 372, 436], 'rec_docs': '康师傅矿物质水', 'rec_scores': 0.72513914}]
+Inference: 117.95639991760254 ms per batch image
+[{'bbox': [138, 40, 573, 1198], 'rec_docs': '乐虎功能饮料', 'rec_scores': 0.7855944}]
+Inference: 22.172927856445312 ms per batch image
+[{'bbox': [328, 7, 467, 272], 'rec_docs': '脉动', 'rec_scores': 0.5829516}]
+Inference: 118.08514595031738 ms per batch image
+[{'bbox': [242, 82, 498, 726], 'rec_docs': '味全_每日C', 'rec_scores': 0.75581443}]
+Inference: 150.06470680236816 ms per batch image
+[{'bbox': [437, 71, 660, 728], 'rec_docs': '元气森林', 'rec_scores': 0.8478892}, {'bbox': [221, 72, 449, 701], 'rec_docs': '元气森林', 'rec_scores': 0.6790612}, {'bbox': [794, 104, 979, 652], 'rec_docs': '元气森林', 'rec_scores': 0.6292581}]
 ...
 ```
 
@@ -222,17 +213,17 @@ python3.7 python/predict_system.py -c configs/inference_product.yaml -o Global.i
 
 ## 3. Recognizing Images of Unknown Categories
 
-Recognize the image `./recognition_demo_data_v1.1/test_product/anmuxi.jpg` with the command below
+Recognize the image `./drink_dataset_v1.0/test_images/mosilian.jpeg` with the command below
 
 ```shell
 # Predict with the GPU using the command below; to predict with the CPU, append -o Global.use_gpu=False to the command
-python3.7 python/predict_system.py -c configs/inference_product.yaml -o Global.infer_imgs="./recognition_demo_data_v1.1/test_product/anmuxi.jpg"
+python3.7 python/predict_system.py -c configs/inference_general.yaml -o Global.infer_imgs="./drink_dataset_v1.0/test_images/mosilian.jpeg"
 ```
 
 The query image is shown below.
 
 <div align="center">
-<img src="../../images/recognition/product_demo/query/anmuxi.jpg"  width = "400" />
+<img src="../../images/recognition/drink_data_demo/test_images/mosilian.jpeg"  width = "400" />
 </div>
 
 
@@ -245,31 +236,12 @@ python3.7 python/predict_system.py -c configs/inference_product.yaml -o Global.i
 <a name="准备新的数据与标签"></a>
 ### 3.1 Prepare New Data and Labels
 
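-First, copy the images similar to the query image into the folder of original images for the index library ( `./recognition_demo_data_v1.1/gallery_product/gallery` ). Run the command below to copy them.
+First, copy the images similar to the query image into the folder of original images for the index library. PaddleClas has already placed all the image data in the folder `drink_dataset_v1.0/gallery/`.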
 
-```shell
-cp -r  ../docs/images/recognition/product_demo/gallery/anmuxi ./recognition_demo_data_v1.1/gallery_product/gallery/
-```
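+Next, edit the text file that records the image paths and label information. PaddleClas has placed the corrected label file at `drink_dataset_v1.0/gallery/drink_label_all.txt`. Compared with the default label file `drink_dataset_v1.0/gallery/drink_label.txt`, it adds index images for the Guangming (光明) and Sanyuan (三元) milk series.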
 
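-Next, edit the text file that records the image paths and label information ( `./recognition_demo_data_v1.1/gallery_product/data_file_update.txt` ); here a new file is created from the original label file with the command below.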
 
-```shell
-# Copy the file
-cp recognition_demo_data_v1.1/gallery_product/data_file.txt recognition_demo_data_v1.1/gallery_product/data_file_update.txt
-```
-
-Then add the following information to the file `recognition_demo_data_v1.1/gallery_product/data_file_update.txt`:
-
-```
-gallery/anmuxi/001.jpg	安慕希酸奶
-gallery/anmuxi/002.jpg	安慕希酸奶
-gallery/anmuxi/003.jpg	安慕希酸奶
-gallery/anmuxi/004.jpg	安慕希酸奶
-gallery/anmuxi/005.jpg	安慕希酸奶
-gallery/anmuxi/006.jpg	安慕希酸奶
-```
-
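-In each line, the first field is the relative path of the image and the second field is the label of the image, separated by a `tab` (note: some editors automatically convert `tab` to `spaces`, which causes a parsing error).
+In each line, the first field is the relative path of the image and the second field is the label of the image, separated by a `\t` (tab) character (note: some editors automatically convert `tab` to `spaces`, which causes a parsing error).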
 
 
 <a name="建立新的索引库"></a>
@@ -278,10 +250,10 @@ gallery/anmuxi/006.jpg	安慕希酸奶
 Build the `index` with the command below to speed up retrieval after recognition.
 
 ```shell
-python3.7 python/build_gallery.py -c configs/build_product.yaml -o IndexProcess.data_file="./recognition_demo_data_v1.1/gallery_product/data_file_update.txt" -o IndexProcess.index_dir="./recognition_demo_data_v1.1/gallery_product/index_update"
+python3.7 python/build_gallery.py -c configs/build_general.yaml -o IndexProcess.data_file="./drink_dataset_v1.0/gallery/drink_label_all.txt" -o IndexProcess.index_dir="./drink_dataset_v1.0/index_all"
 ```
 
-The new index information is saved in the folder `./recognition_demo_data_v1.1/gallery_product/index_update`.
+The new index information is saved in the folder `./drink_dataset_v1.0/index_all`.
 
 <a name="基于新的索引库的图像识别"></a>
 
@@ -291,17 +263,34 @@ python3.7 python/build_gallery.py -c configs/build_product.yaml -o IndexProcess.
 
 ```shell
 # Predict with the GPU using the command below; to predict with the CPU, append -o Global.use_gpu=False to the command
-python3.7 python/predict_system.py -c configs/inference_product.yaml -o Global.infer_imgs="./recognition_demo_data_v1.1/test_product/anmuxi.jpg" -o IndexProcess.index_dir="./recognition_demo_data_v1.1/gallery_product/index_update"
+python3.7 python/predict_system.py -c configs/inference_general.yaml -o Global.infer_imgs="./drink_dataset_v1.0/test_images/mosilian.jpeg" -o IndexProcess.index_dir="./drink_dataset_v1.0/index_all"
 ```
 
 The output is as follows.
 
-```json
-[{'bbox': [243, 80, 523, 522], 'rec_docs': '安慕希酸奶', 'rec_scores': 0.5570770502090454}]
+```
+[{'bbox': [396, 553, 508, 621], 'rec_docs': '光明_莫斯利安', 'rec_scores': 0.5921005}]
 ```
 
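-The final recognition result is `安慕希酸奶`, which is correct; the visualized result is shown below.
+The final recognition result is `光明_莫斯利安`, which is correct; the visualized result is shown below.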
 
 <div align="center">
-<img src="../../images/recognition/product_demo/result/anmuxi.jpg"  width = "400" />
+<img src="../../images/recognition/drink_data_demo/output/mosilian.jpeg"  width = "400" />
 </div>
+
+
+<a name="4"></a>
+## 4. Server-Side Recognition Model List
+
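+To try the server-side recognition models, the download links for the server-side general mainbody detection model, the recognition models for each direction, the test data, and the corresponding configuration files are listed below.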
+
+| Model       | Recommended scenario   | inference model  | Prediction config file  | Config file for building the index library |
+| ------------  | ------------- | -------- | ------- | -------- |
+| General mainbody detection model | General scenarios  |[model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/ppyolov2_r50vd_dcn_mainbody_v1.0_infer.tar) | - | - |
+| Logo recognition model | Logo scenarios  | [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/logo_rec_ResNet50_Logo3K_v1.0_infer.tar) | [inference_logo.yaml](../../../deploy/configs/inference_logo.yaml) | [build_logo.yaml](../../../deploy/configs/build_logo.yaml) |
+| Cartoon character recognition model | Cartoon character scenarios  | [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/cartoon_rec_ResNet50_iCartoon_v1.0_infer.tar) | [inference_cartoon.yaml](../../../deploy/configs/inference_cartoon.yaml) | [build_cartoon.yaml](../../../deploy/configs/build_cartoon.yaml) |
+| Fine-grained vehicle classification model | Vehicle scenarios  |  [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/vehicle_cls_ResNet50_CompCars_v1.0_infer.tar) | [inference_vehicle.yaml](../../../deploy/configs/inference_vehicle.yaml) | [build_vehicle.yaml](../../../deploy/configs/build_vehicle.yaml) |
+| Product recognition model | Product scenarios  |  [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/product_ResNet50_vd_aliproduct_v1.0_infer.tar) | [inference_product.yaml](../../../deploy/configs/inference_product.yaml) | [build_product.yaml](../../../deploy/configs/build_product.yaml) |
+| Vehicle ReID model | Vehicle ReID scenarios | [model download link](https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/inference/vehicle_reid_ResNet50_VERIWild_v1.0_infer.tar) | - | - |
+
+* For more on mainbody detection, see the [mainbody detection tutorial](../image_recognition_pipeline/mainbody_detection.md); for feature extraction, see the [feature extraction tutorial](../image_recognition_pipeline/feature_extraction.md); for vector search, see the [vector search tutorial](../image_recognition_pipeline/vector_search.md).
diff --git a/ppcls/arch/backbone/base/theseus_layer.py b/ppcls/arch/backbone/base/theseus_layer.py
index 35eac5f083bae1a371119ccf35c390441f0d1f8e..64bfed0e939f614dca2fcac996da35c6dbe5989a 100644
--- a/ppcls/arch/backbone/base/theseus_layer.py
+++ b/ppcls/arch/backbone/base/theseus_layer.py
@@ -15,6 +15,7 @@ class TheseusLayer(nn.Layer):
     def __init__(self, *args, **kwargs):
         super(TheseusLayer, self).__init__()
         self.res_dict = {}
+        self.res_name = self.full_name()
 
     # stop doesn't work when stop layer has a parallel branch.
     def stop_after(self, stop_layer_name: str):
@@ -33,29 +34,45 @@ class TheseusLayer(nn.Layer):
         return after_stop
 
     def update_res(self, return_patterns):
-        if not return_patterns or isinstance(self, WrapLayer):
-            return
-        for layer_i in self._sub_layers:
-            layer_name = self._sub_layers[layer_i].full_name()
-            if isinstance(self._sub_layers[layer_i], (nn.Sequential, nn.LayerList)):
-                self._sub_layers[layer_i] = wrap_theseus(self._sub_layers[layer_i], self.res_dict)
-                self._sub_layers[layer_i].update_res(return_patterns)
+        for return_pattern in return_patterns:
+            pattern_list = return_pattern.split(".")
+            if not pattern_list:
+                continue
+            sub_layer_parent = self
+            while len(pattern_list) > 1:
+                if '[' in pattern_list[0]:
+                    sub_layer_name = pattern_list[0].split('[')[0]
+                    sub_layer_index = pattern_list[0].split('[')[1].split(']')[0]
+                    sub_layer_parent = getattr(sub_layer_parent, sub_layer_name)[sub_layer_index]
+                else:
+                    sub_layer_parent = getattr(sub_layer_parent, pattern_list[0],
+                                               None)
+                    if sub_layer_parent is None:
+                        break
+                if isinstance(sub_layer_parent, WrapLayer):
+                    sub_layer_parent = sub_layer_parent.sub_layer
+                pattern_list = pattern_list[1:]
+            if sub_layer_parent is None:
+                continue
+            if '[' in pattern_list[0]:
+                sub_layer_name = pattern_list[0].split('[')[0]
+                sub_layer_index = pattern_list[0].split('[')[1].split(']')[0]
+                sub_layer = getattr(sub_layer_parent, sub_layer_name)[sub_layer_index]
+                if not isinstance(sub_layer, TheseusLayer):
+                    sub_layer = wrap_theseus(sub_layer)
+                getattr(sub_layer_parent, sub_layer_name)[sub_layer_index] = sub_layer
             else:
-                for return_pattern in return_patterns:
-                    if re.match(return_pattern, layer_name):
-                        if not isinstance(self._sub_layers[layer_i], TheseusLayer):
-                            self._sub_layers[layer_i] = wrap_theseus(self._sub_layers[layer_i], self.res_dict)
-                        else:
-                            self._sub_layers[layer_i].res_dict = self.res_dict
-
-                        self._sub_layers[layer_i].register_forward_post_hook(
-                            self._sub_layers[layer_i]._save_sub_res_hook)
-            if isinstance(self._sub_layers[layer_i], TheseusLayer):
-                self._sub_layers[layer_i].res_dict = self.res_dict
-                self._sub_layers[layer_i].update_res(return_patterns)
+                sub_layer = getattr(sub_layer_parent, pattern_list[0])
+                if not isinstance(sub_layer, TheseusLayer):
+                    sub_layer = wrap_theseus(sub_layer)
+                setattr(sub_layer_parent, pattern_list[0], sub_layer)
+
+            sub_layer.res_dict = self.res_dict
+            sub_layer.res_name = return_pattern
+            sub_layer.register_forward_post_hook(sub_layer._save_sub_res_hook)
 
     def _save_sub_res_hook(self, layer, input, output):
-        self.res_dict[layer.full_name()] = output
+        self.res_dict[self.res_name] = output
 
     def _return_dict_hook(self, layer, input, output):
         res_dict = {"output": output}
@@ -63,19 +80,23 @@ class TheseusLayer(nn.Layer):
             res_dict[res_key] = self.res_dict.pop(res_key)
         return res_dict
 
-    def replace_sub(self, layer_name_pattern, replace_function, recursive=True):
+    def replace_sub(self, layer_name_pattern, replace_function,
+                    recursive=True):
         for layer_i in self._sub_layers:
             layer_name = self._sub_layers[layer_i].full_name()
             if re.match(layer_name_pattern, layer_name):
-                self._sub_layers[layer_i] = replace_function(self._sub_layers[layer_i])
+                self._sub_layers[layer_i] = replace_function(self._sub_layers[
+                    layer_i])
             if recursive:
                 if isinstance(self._sub_layers[layer_i], TheseusLayer):
                     self._sub_layers[layer_i].replace_sub(
                         layer_name_pattern, replace_function, recursive)
-                elif isinstance(self._sub_layers[layer_i], (nn.Sequential, nn.LayerList)):
+                elif isinstance(self._sub_layers[layer_i],
+                                (nn.Sequential, nn.LayerList)):
                     for layer_j in self._sub_layers[layer_i]._sub_layers:
-                        self._sub_layers[layer_i]._sub_layers[layer_j].replace_sub(
-                            layer_name_pattern, replace_function, recursive)
+                        self._sub_layers[layer_i]._sub_layers[
+                            layer_j].replace_sub(layer_name_pattern,
+                                                 replace_function, recursive)
 
     '''
     example of replace function:
@@ -92,39 +113,14 @@ class TheseusLayer(nn.Layer):
 
 
 class WrapLayer(TheseusLayer):
-    def __init__(self, sub_layer, res_dict=None):
+    def __init__(self, sub_layer):
         super(WrapLayer, self).__init__()
         self.sub_layer = sub_layer
-        self.name = sub_layer.full_name()
-        if res_dict is not None:
-            self.res_dict = res_dict
-
-    def full_name(self):
-        return self.name
 
     def forward(self, *inputs, **kwargs):
         return self.sub_layer(*inputs, **kwargs)
 
-    def update_res(self, return_patterns):
-        if not return_patterns or not isinstance(self.sub_layer, (nn.Sequential, nn.LayerList)):
-            return
-        for layer_i in self.sub_layer._sub_layers:
-            if isinstance(self.sub_layer._sub_layers[layer_i], (nn.Sequential, nn.LayerList)):
-                self.sub_layer._sub_layers[layer_i] = wrap_theseus(self.sub_layer._sub_layers[layer_i], self.res_dict)
-                self.sub_layer._sub_layers[layer_i].update_res(return_patterns)
-            elif isinstance(self.sub_layer._sub_layers[layer_i], TheseusLayer):
-                self.sub_layer._sub_layers[layer_i].res_dict = self.res_dict
-
-            layer_name = self.sub_layer._sub_layers[layer_i].full_name()
-            for return_pattern in return_patterns:
-                if re.match(return_pattern, layer_name):
-                    self.sub_layer._sub_layers[layer_i].register_forward_post_hook(
-                        self._sub_layers[layer_i]._save_sub_res_hook)
-
-            if isinstance(self.sub_layer._sub_layers[layer_i], TheseusLayer):
-                self.sub_layer._sub_layers[layer_i].update_res(return_patterns)
-
-
-def wrap_theseus(sub_layer, res_dict=None):
-    wrapped_layer = WrapLayer(sub_layer, res_dict)
+
+def wrap_theseus(sub_layer):
+    wrapped_layer = WrapLayer(sub_layer)
     return wrapped_layer
diff --git a/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_224.yaml b/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_224.yaml
index d3b9344f2be9db24504d2645a469c9ad74973a59..f03543f77f4c831127bcad9c2939fb89ace902ea 100644
--- a/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
   name: DeiT_base_distilled_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
   class_num: 1000
  
 # loss function config for traing/eval process
diff --git a/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_384.yaml b/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_384.yaml
index 77b5a5afaaddc226904a7025ab1eb385b4ab562d..fcf2981beb217a1fd3e55741a70df8294cc01176 100644
--- a/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_384.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_base_distilled_patch16_384.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
   name: DeiT_base_distilled_patch16_384
+  drop_path_rate : 0.1
+  drop_rate : 0.0
   class_num: 1000
  
 # loss function config for traing/eval process
diff --git a/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_224.yaml b/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_224.yaml
index 927c18b93f68e426863b2669f182e3d9142d46ee..7b328905e05d3d9db713710185eb2e1ee5ebca46 100644
--- a/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
   name: DeiT_base_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
   class_num: 1000
  
 # loss function config for traing/eval process
diff --git a/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_384.yaml b/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_384.yaml
index 26374b1d78a2d405308dac18e9346ad44d6370dd..a2990ecdbd6e9e2493fdf177d9d6042d9c682227 100644
--- a/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_384.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_base_patch16_384.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
   name: DeiT_base_patch16_384
+  drop_path_rate : 0.1
+  drop_rate : 0.0
   class_num: 1000
  
 # loss function config for traing/eval process
diff --git a/ppcls/configs/ImageNet/DeiT/DeiT_small_distilled_patch16_224.yaml b/ppcls/configs/ImageNet/DeiT/DeiT_small_distilled_patch16_224.yaml
index 0192205739ac4b59ff483ad736b40380d5eaef38..b565d03ab3d468a090d2ead743b00a73cffc239e 100644
--- a/ppcls/configs/ImageNet/DeiT/DeiT_small_distilled_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_small_distilled_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
   name: DeiT_small_distilled_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
   class_num: 1000
  
 # loss function config for traing/eval process
diff --git a/ppcls/configs/ImageNet/DeiT/DeiT_small_patch16_224.yaml b/ppcls/configs/ImageNet/DeiT/DeiT_small_patch16_224.yaml
index acc07fd1cec296fe44c34e2ebf97514510e3fdc0..9e9c5de120b9577fba179c7893f5b3640a909323 100644
--- a/ppcls/configs/ImageNet/DeiT/DeiT_small_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_small_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
   name: DeiT_small_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
   class_num: 1000
  
 # loss function config for traing/eval process
diff --git a/ppcls/configs/ImageNet/DeiT/DeiT_tiny_distilled_patch16_224.yaml b/ppcls/configs/ImageNet/DeiT/DeiT_tiny_distilled_patch16_224.yaml
index 154826e2ae6b5ff49a995fc1dfe77562dc856eed..53f54b1b2356b6d67270cd778e0a2b84e7906cb8 100644
--- a/ppcls/configs/ImageNet/DeiT/DeiT_tiny_distilled_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_tiny_distilled_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
   name: DeiT_tiny_distilled_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
   class_num: 1000
  
 # loss function config for traing/eval process
diff --git a/ppcls/configs/ImageNet/DeiT/DeiT_tiny_patch16_224.yaml b/ppcls/configs/ImageNet/DeiT/DeiT_tiny_patch16_224.yaml
index 93e4d239238f1d522b730c89c5447c5db630c63f..8fa66856367a6694501ddcfb2782751fc43e556a 100644
--- a/ppcls/configs/ImageNet/DeiT/DeiT_tiny_patch16_224.yaml
+++ b/ppcls/configs/ImageNet/DeiT/DeiT_tiny_patch16_224.yaml
@@ -17,6 +17,8 @@ Global:
 # model architecture
 Arch:
   name: DeiT_tiny_patch16_224
+  drop_path_rate : 0.1
+  drop_rate : 0.0
   class_num: 1000
  
 # loss function config for traing/eval process
diff --git a/ppcls/engine/engine.py b/ppcls/engine/engine.py
index b6f5253db4614f78d63ea3a18f4c56371db5ef80..7568c16c1270a1139e3a792271ed140bacf5c6f0 100644
--- a/ppcls/engine/engine.py
+++ b/ppcls/engine/engine.py
@@ -61,7 +61,7 @@ class Engine(object):
 
         # set seed
         seed = self.config["Global"].get("seed", False)
-        if seed:
+        if seed is not False and seed is not None:
             assert isinstance(seed, int), "The 'seed' must be a integer!"
             paddle.seed(seed)
             np.random.seed(seed)
@@ -91,7 +91,7 @@ class Engine(object):
             self.vdl_writer = LogWriter(logdir=vdl_writer_path)
 
         # set device
-        assert self.config["Global"]["device"] in ["cpu", "gpu", "xpu"]
+        assert self.config["Global"]["device"] in ["cpu", "gpu", "xpu", "npu"]
         self.device = paddle.set_device(self.config["Global"]["device"])
         logger.info('train with paddle {} and device {}'.format(
             paddle.__version__, self.device))
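
Note on seed handling: in Python `bool` is a subclass of `int` and `False == 0` evaluates to true, so a truthiness or equality test cannot tell the "seed not set" default of `False` apart from an explicit seed of 0. A minimal sketch of a check that can, using a hypothetical helper `should_seed` that is not part of ppcls:

```python
def should_seed(seed):
    """Return True only when `seed` is a genuine integer seed (0 included)."""
    # bool is a subclass of int and False == 0, so booleans are excluded explicitly.
    return isinstance(seed, int) and not isinstance(seed, bool)


config = {}  # hypothetical Global config with no "seed" entry
print(should_seed(config.get("seed", False)))  # the default False is not a seed
print(should_seed(0))                          # an explicit 0 is a valid seed
print(should_seed(42))
```
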
diff --git a/ppcls/engine/train/train.py b/ppcls/engine/train/train.py
index f2c79da78c2515435764184016aa27b425fdc3a2..cbf868e4e6d1d118b417568625c493afea6cd23a 100644
--- a/ppcls/engine/train/train.py
+++ b/ppcls/engine/train/train.py
@@ -16,6 +16,7 @@ from __future__ import absolute_import, division, print_function
 import time
 import paddle
 from ppcls.engine.train.utils import update_loss, update_metric, log_info
+from ppcls.utils import profiler
 
 
 def train_epoch(engine, epoch_id, print_batch_step):
@@ -23,6 +24,7 @@ def train_epoch(engine, epoch_id, print_batch_step):
     for iter_id, batch in enumerate(engine.train_dataloader):
         if iter_id >= engine.max_iter:
             break
+        profiler.add_profiler_step(engine.config.get("profiler_options"))
         if iter_id == 5:
             for key in engine.time_info:
                 engine.time_info[key].reset()
diff --git a/ppcls/static/program.py b/ppcls/static/program.py
index 956af174e3c447ffd614e138c3d32a55fbc96f58..9075a359b8ad0d991865d19f413ca250c39368f1 100644
--- a/ppcls/static/program.py
+++ b/ppcls/static/program.py
@@ -433,9 +433,8 @@ def run(dataloader,
 
     end_str = ' '.join([str(m.mean) for m in metric_dict.values()] +
                        [metric_dict["batch_time"].total])
-    ips_info = "ips: {:.5f} images/sec.".format(
-        batch_size * metric_dict["batch_time"].count /
-        metric_dict["batch_time"].sum)
+    ips_info = "ips: {:.5f} images/sec.".format(batch_size /
+                                                metric_dict["batch_time"].avg)
     if mode == 'eval':
         logger.info("END {:s} {:s} {:s}".format(mode, end_str, ips_info))
     else:
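
Note on the `ips` change: the two throughput expressions agree whenever the meter's `avg` equals `sum / count`. The sketch below checks this with a stand-in `AverageMeter`. The class is an assumption written for illustration, not the ppcls implementation.

```python
import math


class AverageMeter:
    """Stand-in meter (assumed behavior): tracks a running sum and count."""

    def __init__(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count


batch_size = 32
batch_time = AverageMeter()
for seconds in (0.5, 0.25, 0.25):  # made-up per-batch timings
    batch_time.update(seconds)

old_ips = batch_size * batch_time.count / batch_time.sum  # removed form
new_ips = batch_size / batch_time.avg                     # added form
print(math.isclose(old_ips, new_ips))
```
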
diff --git a/ppcls/static/run_dali.sh b/ppcls/static/run_dali.sh
index 8b33b28d28d0b83a163244495a6076fb63fd4a02..748ac84c732ddfe2382747118f2abaf3e005a484 100644
--- a/ppcls/static/run_dali.sh
+++ b/ppcls/static/run_dali.sh
@@ -5,7 +5,7 @@ export FLAGS_fraction_of_gpu_memory_to_use=0.80
 
 python3.7 -m paddle.distributed.launch \
     --gpus="0,1,2,3,4,5,6,7" \
-    ppcls/static//train.py \
+    ppcls/static/train.py \
     -c ./ppcls/configs/ImageNet/ResNet/ResNet50_fp16.yaml \
     -o Global.use_dali=True
 
diff --git a/ppcls/static/train.py b/ppcls/static/train.py
index bc160e4a40debee2a50db16a9c47256976d27412..e262f27ffdcc827414df459822f45963f3bb0f92 100644
--- a/ppcls/static/train.py
+++ b/ppcls/static/train.py
@@ -91,14 +91,17 @@ def main(args):
             os.environ[k] = AMP_RELATED_FLAGS_SETTING[k]
 
     use_xpu = global_config.get("use_xpu", False)
+    use_npu = global_config.get("use_npu", False)
     assert (
-        use_gpu and use_xpu
-    ) is not True, "gpu and xpu can not be true in the same time in static mode!"
+        sum(map(bool, (use_gpu, use_xpu, use_npu))) <= 1
+    ), "at most one of gpu, xpu and npu can be true at the same time in static mode!"
 
     if use_gpu:
         device = paddle.set_device('gpu')
     elif use_xpu:
         device = paddle.set_device('xpu')
+    elif use_npu:
+        device = paddle.set_device('npu')
     else:
         device = paddle.set_device('cpu')
 
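
Note on checking mutually exclusive device flags: an expression of the form `a and b and c` is true only when all three flags are true, so it cannot detect the case where exactly two are set. Counting the truthy flags covers every combination. A minimal sketch, with a hypothetical helper `at_most_one` that is not part of ppcls:

```python
def at_most_one(*flags):
    """Return True when zero or one of the given flags is truthy."""
    return sum(bool(flag) for flag in flags) <= 1


use_gpu, use_xpu, use_npu = True, True, False  # made-up conflicting settings

# A chained `and` does not catch two enabled devices...
print((use_gpu and use_xpu and use_npu) is not True)
# ...whereas counting the enabled flags does.
print(at_most_one(use_gpu, use_xpu, use_npu))
```
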
diff --git a/ppcls/utils/config.py b/ppcls/utils/config.py
index b92f0d9456c8e7ced5704c0bfe931a080e5eb5cf..e3277c480943cdfb7ce49f4f3ea7bbd160c34ebb 100644
--- a/ppcls/utils/config.py
+++ b/ppcls/utils/config.py
@@ -199,5 +199,12 @@ def parse_args():
         action='append',
         default=[],
         help='config options to be overridden')
+    parser.add_argument(
+        '-p',
+        '--profiler_options',
+        type=str,
+        default=None,
+        help='Profiler options, given in the format "key1=value1;key2=value2;key3=value3".'
+    )
     args = parser.parse_args()
     return args
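
Note on the `--profiler_options` format: the help text describes a string of `key=value` pairs separated by semicolons. The sketch below shows one way such a string could be split into a dictionary. `parse_profiler_options` is a hypothetical helper, not the parser in `ppcls.utils.profiler`, and the keys are placeholders taken from the help text.

```python
def parse_profiler_options(options_str):
    """Split "key1=value1;key2=value2" into a dict of string values."""
    if not options_str:
        return {}  # option not given on the command line
    options = {}
    for item in options_str.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("expected key=value, got {!r}".format(item))
        options[key.strip()] = value.strip()
    return options


print(parse_profiler_options("key1=value1;key2=value2;key3=value3"))
```
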
diff --git a/tests/config/ResNet50_vd.txt b/tests/config/ResNet50_vd.txt
index 9b8f27cc6485ff61f3b3999a870af544afca485b..ed706085985b3f9d405b478d402551a023b26d22 100644
--- a/tests/config/ResNet50_vd.txt
+++ b/tests/config/ResNet50_vd.txt
@@ -33,7 +33,7 @@ fpgm_export:tools/export_model.py -c ppcls/configs/slim/ResNet50_vd_prune.yaml
 distill_export:null
 kl_quant:deploy/slim/quant_post_static.py -c ppcls/configs/ImageNet/ResNet/ResNet50_vd.yaml -o Global.save_inference_dir=./inference
 export2:null
-infer_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/ResNet50_vd_inference.tar
+inference_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/ResNet50_vd_inference.tar
 infer_model:../inference/
 infer_export:null
 infer_quant:Fasle
diff --git a/tests/prepare.sh b/tests/prepare.sh
index 4c2aa052a57923e3354847c07f2091bb3c0e5666..57e9a949d6bbb2beb8486f42e556e8fd94fe7365 100644
--- a/tests/prepare.sh
+++ b/tests/prepare.sh
@@ -42,6 +42,9 @@ if [ ${MODE} = "lite_train_infer" ] || [ ${MODE} = "whole_infer" ];then
     cd ILSVRC2012 
     mv train.txt train_list.txt
     mv val.txt val_list.txt
+    if [ ${MODE} = "lite_train_infer" ];then
+        cp -r train/* val/
+    fi
     cd ../../
 elif [ ${MODE} = "infer" ] || [ ${MODE} = "cpp_infer" ];then
     # download data
diff --git a/tools/train.py b/tools/train.py
index 1d835903638aacb459f982a7c5f8710241f01be4..e7c9d7bcc8f2e6b1ebd9ad0d7f12f94c2e58ea13 100644
--- a/tools/train.py
+++ b/tools/train.py
@@ -27,5 +27,6 @@ if __name__ == "__main__":
     args = config.parse_args()
     config = config.get_config(
         args.config, overrides=args.override, show=False)
+    config.profiler_options = args.profiler_options
     engine = Engine(config, mode="train")
     engine.train()