Unverified commit ae4d477c authored by Wei Shengyu, committed by GitHub

Merge pull request #1119 from PaddlePaddle/develop

merge Develop into release/2.2
......@@ -8,9 +8,10 @@
**Recent updates**
- 2021.07.08, 07.27 Added 26 [FAQ](docs/zh_CN/faq_series/faq_2021_s2.md) entries.
- 2021.06.29 Added the Swin Transformer series models, reaching up to 87.2% Top-1 accuracy on the ImageNet-1k dataset; training, prediction, and evaluation are supported, as well as whl-package deployment. Pretrained models can be downloaded [here](docs/zh_CN/models/models_intro.md).
- 2021.06.22/23/24 The official PaddleClas R&D team gave a three-day live course with in-depth technical interpretation. Course replay: [https://aistudio.baidu.com/aistudio/course/introduce/24519](https://aistudio.baidu.com/aistudio/course/introduce/24519)
- 2021.06.16 PaddleClas v2.2 upgrade: integrated components such as metric learning and vector search; added 4 image recognition applications (product, animation character, vehicle, and logo recognition); added 24 pretrained models across the LeViT, TNT, DLA, HarDNet, and RedNet series.
- 2021.06.16 PaddleClas v2.2 upgrade: integrated components such as metric learning and vector search; added 4 image recognition applications (product, animation character, vehicle, and logo recognition); added 30 pretrained models across the LeViT, Twins, TNT, DLA, HarDNet, and RedNet series.
- [more](./docs/zh_CN/update_history.md)
## Features
......@@ -18,7 +19,7 @@
- Practical image recognition system: integrates object detection, feature learning, image retrieval, and other modules; broadly applicable to all kinds of image recognition tasks.
Provides 4 scenario-application examples: product recognition, vehicle recognition, logo recognition, and animation character recognition.
- Rich pretrained model library: provides 150 ImageNet pretrained models across 33 series, among which 6 selected series support fast structural modification.
- Rich pretrained model library: provides 164 ImageNet pretrained models across 35 series, among which 6 selected series support fast structural modification.
- Comprehensive and easy-to-use feature learning components: 12 metric learning methods such as ArcMargin and triplet loss are integrated, and can be freely combined and switched via configuration files.
......@@ -78,7 +79,8 @@ The Res2Net200_vd pretrained model reaches up to 85.1% Top-1 accuracy.
- [Knowledge Distillation](./docs/zh_CN/advanced_tutorials/distillation/distillation.md)
- [Model Quantization](./docs/zh_CN/extension/paddle_quantization.md)
- [Data Augmentation](./docs/zh_CN/advanced_tutorials/image_augmentation/ImageAugment.md)
- FAQ (updates paused)
- FAQ
- [Image Recognition FAQ](docs/zh_CN/faq_series/faq_2021_s2.md)
- [Image Classification FAQ](docs/zh_CN/faq.md)
- [License](#许可证书)
- [Contribution](#贡献代码)
......
......@@ -9,7 +9,7 @@ PaddleClas is an image recognition toolset for industry and academia, helping us
**Recent updates**
- 2021.06.29 Added the Swin Transformer series models; the highest Top-1 accuracy on the ImageNet-1k dataset reaches 87.2%. Training, evaluation, and inference are all supported. Pretrained models can be downloaded [here](docs/en/models/models_intro_en.md).
- 2021.06.16 PaddleClas release/2.2. Added metric learning and vector search modules, plus product recognition, animation character recognition, vehicle recognition, and logo recognition. Added 24 pretrained models of the LeViT, TNT, DLA, HarDNet, and RedNet series, with accuracy roughly on par with the papers.
- 2021.06.16 PaddleClas release/2.2. Added metric learning and vector search modules, plus product recognition, animation character recognition, vehicle recognition, and logo recognition. Added 30 pretrained models of the LeViT, Twins, TNT, DLA, HarDNet, and RedNet series, with accuracy roughly on par with the papers.
- [more](./docs/en/update_history_en.md)
## Features
......@@ -17,7 +17,7 @@ PaddleClas is an image recognition toolset for industry and academia, helping us
- A practical image recognition system consisting of detection, feature learning, and retrieval modules, widely applicable to all types of image recognition tasks.
Four sample solutions are provided, including product recognition, vehicle recognition, logo recognition and animation character recognition.
- Rich library of pre-trained models: provides a total of 150 ImageNet pre-trained models in 33 series, among which 6 selected series of models support fast structural modification.
- Rich library of pre-trained models: provides a total of 164 ImageNet pre-trained models in 35 series, among which 6 selected series of models support fast structural modification.
- Comprehensive and easy-to-use feature learning components: 12 metric learning methods are integrated and can be combined and switched at will through configuration files.
......@@ -51,7 +51,7 @@ Quick experience of image recognition: [Link](./docs/en/tutorials/quick_start_r
- [Introduction to Image Recognition Systems](#Introduction_to_Image_Recognition_Systems)
- [Demo images](#Demo_images)
- Algorithms Introduction
- [Backbone Network and Pre-trained Model Library](./docs/en/ImageNet_models.md)
- [Backbone Network and Pre-trained Model Library](./docs/en/ImageNet_models_en.md)
- [Mainbody Detection](./docs/en/application/mainbody_detection_en.md)
- [Image Classification](./docs/en/tutorials/image_classification_en.md)
- [Feature Learning](./docs/en/application/feature_learning_en.md)
......
Global:
rec_inference_model_dir: "./models/cartoon_rec_ResNet50_iCartoon_v1.0_infer/"
batch_size: 1
batch_size: 32
use_gpu: True
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......
Global:
rec_inference_model_dir: "./models/logo_rec_ResNet50_Logo3K_v1.0_infer/"
batch_size: 1
batch_size: 32
use_gpu: True
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......
Global:
rec_inference_model_dir: "./models/product_ResNet50_vd_aliproduct_v1.0_infer"
batch_size: 1
batch_size: 32
use_gpu: True
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......
Global:
rec_inference_model_dir: "./models/vehicle_cls_ResNet50_CompCars_v1.0_infer/"
batch_size: 1
batch_size: 32
use_gpu: True
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......
......@@ -12,8 +12,8 @@ Global:
- foreground
use_gpu: True
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......
......@@ -3,8 +3,8 @@ Global:
inference_model_dir: "./models"
batch_size: 1
use_gpu: True
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......@@ -22,6 +22,7 @@ PreProcess:
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
channel_num: 3
- ToCHWImage:
PostProcess:
main_indicator: Topk
......
Global:
infer_imgs: "./images/ILSVRC2012_val_00000010.jpeg"
inference_model_dir: "./models"
batch_size: 1
use_gpu: True
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
use_tensorrt: False
gpu_mem: 8000
enable_profile: False
PreProcess:
transform_ops:
- ResizeImage:
resize_short: 256
- CropImage:
size: 224
- NormalizeImage:
scale: 0.00392157
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
channel_num: 4
- ToCHWImage:
PostProcess:
main_indicator: Topk
Topk:
topk: 5
class_id_map_file: "../ppcls/utils/imagenet1k_label_list.txt"
SavePreLabel:
save_dir: ./pre_label/
......@@ -10,8 +10,8 @@ Global:
# inference engine config
use_gpu: True
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......
......@@ -13,8 +13,8 @@ Global:
# inference engine config
use_gpu: True
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......
......@@ -13,8 +13,8 @@ Global:
# inference engine config
use_gpu: True
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......
......@@ -10,8 +10,8 @@ Global:
# inference engine config
use_gpu: False
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......
......@@ -13,8 +13,8 @@ Global:
# inference engine config
use_gpu: True
enable_mkldnn: False
cpu_num_threads: 100
enable_mkldnn: True
cpu_num_threads: 10
enable_benchmark: True
use_fp16: False
ir_optim: True
......
......@@ -22,10 +22,9 @@ py_version = sys.version_info[0]
def predict(image_path, server):
if py_version == 2:
image = base64.b64encode(open(image_path).read())
else:
image = base64.b64encode(open(image_path, "rb").read()).decode("utf-8")
with open(image_path, "rb") as f:
image = base64.b64encode(f.read()).decode("utf-8")
req = json.dumps({"feed": [{"image": image}], "fetch": ["prediction"]})
r = requests.post(
server, data=req, headers={"Content-Type": "application/json"})
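For reference, a hedged usage sketch of the client above; the server address and image path below are placeholders, not values from this PR:

```python
# Hypothetical endpoint and image path; start the serving process first.
server_url = "http://127.0.0.1:9292/image/prediction"  # assumed address
predict("./images/demo.jpg", server_url)
```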
......
......@@ -71,14 +71,26 @@ class GalleryBuilder(object):
gallery_features = np.zeros(
[len(gallery_images), config['embedding_size']], dtype=np.float32)
# construct batched images and run inference
batch_size = config.get("batch_size", 32)
batch_img = []
for i, image_file in enumerate(tqdm(gallery_images)):
img = cv2.imread(image_file)
if img is None:
logger.error("img empty, please check {}".format(image_file))
exit()
img = img[:, :, ::-1]
rec_feat = self.rec_predictor.predict(img)
gallery_features[i, :] = rec_feat
batch_img.append(img)
if (i + 1) % batch_size == 0:
rec_feat = self.rec_predictor.predict(batch_img)
gallery_features[i - batch_size + 1:i + 1, :] = rec_feat
batch_img = []
if len(batch_img) > 0:
rec_feat = self.rec_predictor.predict(batch_img)
gallery_features[-len(batch_img):, :] = rec_feat
batch_img = []
# train index
self.Searcher = Graph_Index(dist_type=config['dist_type'])
......
......@@ -41,6 +41,29 @@ class ClsPredictor(Predictor):
if "PostProcess" in config:
self.postprocess = build_postprocess(config["PostProcess"])
# for whole_chain project to test each repo of paddle
self.benchmark = config["Global"].get("benchmark", False)
if self.benchmark:
import auto_log
import os
pid = os.getpid()
self.auto_logger = auto_log.AutoLogger(
model_name=config["Global"].get("model_name", "cls"),
model_precision='fp16'
if config["Global"]["use_fp16"] else 'fp32',
batch_size=config["Global"].get("batch_size", 1),
data_shape=[3, 224, 224],
save_path=config["Global"].get("save_log_path",
"./auto_log.log"),
inference_config=self.config,
pids=pid,
process_name=None,
gpu_ids=None,
time_keys=[
'preprocess_time', 'inference_time', 'postprocess_time'
],
warmup=2)
def predict(self, images):
input_names = self.paddle_predictor.get_input_names()
input_tensor = self.paddle_predictor.get_input_handle(input_names[0])
......@@ -49,16 +72,26 @@ class ClsPredictor(Predictor):
output_tensor = self.paddle_predictor.get_output_handle(output_names[
0])
if self.benchmark:
self.auto_logger.times.start()
if not isinstance(images, (list, )):
images = [images]
for idx in range(len(images)):
for ops in self.preprocess_ops:
images[idx] = ops(images[idx])
image = np.array(images)
if self.benchmark:
self.auto_logger.times.stamp()
input_tensor.copy_from_cpu(image)
self.paddle_predictor.run()
batch_output = output_tensor.copy_to_cpu()
if self.benchmark:
self.auto_logger.times.stamp()
if self.postprocess is not None:
batch_output = self.postprocess(batch_output)
if self.benchmark:
self.auto_logger.times.end(stamp=True)
return batch_output
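As a hedged sketch, the benchmark path above is driven entirely by the Global config; the keys below mirror what the constructor reads, and the model name is a placeholder:

```python
# Assumed minimal toggles for the auto_log benchmark path shown above;
# `config` is the parsed yaml config dict.
config["Global"]["benchmark"] = True           # enables the timing stamps
config["Global"]["model_name"] = "ResNet50"    # hypothetical model name
config["Global"]["save_log_path"] = "./auto_log.log"
predictor = ClsPredictor(config)               # auto_logger is created here
```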
......@@ -66,12 +99,40 @@ def main(config):
cls_predictor = ClsPredictor(config)
image_list = get_image_list(config["Global"]["infer_imgs"])
assert config["Global"]["batch_size"] == 1
for idx, image_file in enumerate(image_list):
img = cv2.imread(image_file)[:, :, ::-1]
output = cls_predictor.predict(img)
output = cls_predictor.postprocess(output, [image_file])
print(output)
batch_imgs = []
batch_names = []
cnt = 0
for idx, img_path in enumerate(image_list):
img = cv2.imread(img_path)
if img is None:
logger.warning(
"Image file failed to read and has been skipped. The path: {}".
format(img_path))
else:
img = img[:, :, ::-1]
batch_imgs.append(img)
img_name = os.path.basename(img_path)
batch_names.append(img_name)
cnt += 1
if cnt % config["Global"]["batch_size"] == 0 or (idx + 1
) == len(image_list):
if len(batch_imgs) == 0:
continue
batch_results = cls_predictor.predict(batch_imgs)
for number, result_dict in enumerate(batch_results):
filename = batch_names[number]
clas_ids = result_dict["class_ids"]
scores_str = "[{}]".format(", ".join("{:.2f}".format(
r) for r in result_dict["scores"]))
label_names = result_dict["label_names"]
print("{}:\tclass id(s): {}, score(s): {}, label_name(s): {}".
format(filename, clas_ids, scores_str, label_names))
batch_imgs = []
batch_names = []
if cls_predictor.benchmark:
cls_predictor.auto_logger.report()
return
......
......@@ -60,6 +60,8 @@ class RecPredictor(Predictor):
np.sum(np.square(batch_output), axis=1, keepdims=True))
batch_output = np.divide(batch_output, feas_norm)
if self.postprocess is not None:
batch_output = self.postprocess(batch_output)
return batch_output
......@@ -67,14 +69,33 @@ def main(config):
rec_predictor = RecPredictor(config)
image_list = get_image_list(config["Global"]["infer_imgs"])
assert config["Global"]["batch_size"] == 1
for idx, image_file in enumerate(image_list):
batch_input = []
img = cv2.imread(image_file)[:, :, ::-1]
output = rec_predictor.predict(img)
if rec_predictor.postprocess is not None:
output = rec_predictor.postprocess(output)
print(output)
batch_imgs = []
batch_names = []
cnt = 0
for idx, img_path in enumerate(image_list):
img = cv2.imread(img_path)
if img is None:
logger.warning(
"Image file failed to read and has been skipped. The path: {}".
format(img_path))
else:
img = img[:, :, ::-1]
batch_imgs.append(img)
img_name = os.path.basename(img_path)
batch_names.append(img_name)
cnt += 1
if cnt % config["Global"]["batch_size"] == 0 or (idx + 1) == len(image_list):
if len(batch_imgs) == 0:
continue
batch_results = rec_predictor.predict(batch_imgs)
for number, result_dict in enumerate(batch_results):
filename = batch_names[number]
print("{}:\t{}".format(filename, result_dict))
batch_imgs = []
batch_names = []
return
......
......@@ -28,7 +28,7 @@ class Predictor(object):
if args.use_fp16 is True:
assert args.use_tensorrt is True
self.args = args
self.paddle_predictor = self.create_paddle_predictor(
self.paddle_predictor, self.config = self.create_paddle_predictor(
args, inference_model_dir)
def predict(self, image):
......@@ -59,11 +59,12 @@ class Predictor(object):
config.enable_tensorrt_engine(
precision_mode=Config.Precision.Half
if args.use_fp16 else Config.Precision.Float32,
max_batch_size=args.batch_size)
max_batch_size=args.batch_size,
min_subgraph_size=30)
config.enable_memory_optim()
# use zero copy
config.switch_use_feed_fetch_ops(False)
predictor = create_predictor(config)
return predictor
return predictor, config
......@@ -35,7 +35,6 @@ sudo apt-get install build-essential gcc g++
Enter this folder and simply run `make`. To regenerate the `index.so` file, first run `make clean` to clear the existing build cache, then run `make` to build the updated library.
### 2.3 Building the library on Windows
On Windows, first install the gcc toolchain; [TDM-GCC](https://jmeubank.github.io/tdm-gcc/articles/2020-03/9.2.0-release) is recommended. Pick a suitable version on the official site; [tdm64-gcc-10.3.0-2.exe](https://github.com/jmeubank/tdm-gcc/releases/download/v10.3.0-tdm64-2/tdm64-gcc-10.3.0-2.exe) is the recommended download.
......@@ -50,6 +49,25 @@ On Windows, first install the gcc toolchain; [TDM-GCC](https://jmeu
In this folder (deploy/vector_search), run `mingw32-make` to generate the `index.dll` library. To regenerate `index.dll`, first run `mingw32-make clean` to clear the build cache, then run `mingw32-make` to build the updated library.
### 2.4 Building the library on macOS
Run the following command to install gcc and g++:
```shell
brew install gcc
```
#### Note:
1. If you see `Error: Running Homebrew as root is extremely dangerous and no longer supported...`, see this [link](https://jingyan.baidu.com/article/e52e3615057a2840c60c519c.html) for a fix.
2. If you see `Error: Failure while executing; `tar --extract --no-same-owner --file...``, see this [link](https://blog.csdn.net/Dawn510/article/details/117787358) for a fix.
After installation, the compiled executables are copied to /usr/local/bin; check the gcc binaries in this folder:
```
ls /usr/local/bin/gcc*
```
Here the local gcc version is gcc-11, so the build command is as follows (if your local gcc is gcc-9, change the command to `CXX=g++-9 make` accordingly):
```
CXX=g++-11 make
```
## 3. Quick start
......
......@@ -154,10 +154,12 @@ cutout_op = Cutout(n_holes=1, length=112)
ops = [decode_op, resize_op, cutout_op]
imgs_dir = image_path
fnames = os.listdir(imgs_dir)
for f in fnames:
data = open(os.path.join(imgs_dir, f)).read()
imgs_dir = "imgs_dir"
file_names = os.listdir(imgs_dir)
for file_name in file_names:
file_path = os.join(imgs_dir, file_name)
with open(file_path) as f:
data = f.read()
img = transform(data, ops)
```
......
# LeViT series
## Overview
LeViT is a hybrid neural network for fast-inference image classification. Its design accounts for the performance of the network model on different hardware platforms, so it better reflects the real scenarios of common applications. Through extensive experiments, the authors found a better way to combine a convolutional neural network with the Transformer architecture, and proposed an attention-based method to integrate positional information encoding into the Transformer. [Paper](https://arxiv.org/abs/2104.01136)
## Accuracy, FLOPS and Parameters
| Models | Top1 | Top5 | Reference<br>top1 | Reference<br>top5 | FLOPS<br>(M) | Params<br>(M) |
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| LeViT-128S | 0.7598 | 0.9269 | 0.766 | 0.929 | 305 | 7.8 |
| LeViT-128 | 0.7810 | 0.9371 | 0.786 | 0.940 | 406 | 9.2 |
| LeViT-192 | 0.7934 | 0.9446 | 0.800 | 0.947 | 658 | 11 |
| LeViT-256 | 0.8085 | 0.9497 | 0.816 | 0.954 | 1120 | 19 |
| LeViT-384 | 0.8191 | 0.9551 | 0.826 | 0.960 | 2353 | 39 |
**Note**: The difference in accuracy from the Reference is due to differences in data preprocessing and to not using the distilled head as the output.
......@@ -74,10 +74,12 @@ autoaugment_op = ImageNetPolicy()
ops = [decode_op, resize_op, autoaugment_op]
imgs_dir = image_path
fnames = os.listdir(imgs_dir)
for f in fnames:
data = open(os.path.join(imgs_dir, f)).read()
imgs_dir = "imgs_dir"
file_names = os.listdir(imgs_dir)
for file_name in file_names:
file_path = os.join(imgs_dir, file_name)
with open(file_path) as f:
data = f.read()
img = transform(data, ops)
```
......
# Image Recognition FAQ Summary - 2021 Season 2
## Contents
* [Issue 1](#第1期) (2021.07.08)
* [Issue 2](#第2期) (2021.07.27)
<a name="第1期"></a>
## Issue 1
### Q1.1: The current mainbody detection model produces false detections in some scenarios?
**A**: The current mainbody detection model is trained with public datasets such as COCO, Object365, RPC, and LogoDet. If the data to be detected differs greatly from common categories, e.g., industrial quality inspection, you need to fine-tune the current detection model on your own data.
### Q1.2: Building the index after adding images raises an `assert text_num >= 2` error?
**A**: Make sure the separator between the image path and the image label in data_file.txt is a single tab, not spaces.
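A hedged sketch of the expected line format; the path and label values are placeholders:

```python
# Each line: image path and label separated by a single tab, not spaces.
with open("data_file.txt", "w") as f:
    f.write("gallery/demo_0.jpg\tdemo_label\n")  # hypothetical entry
```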
### Q1.3: The recognition module raises an `Illegal instruction` error during prediction?
**A**: The compiled library file may be incompatible with your environment. If this error occurs, we recommend recompiling the library file following the [vector search tutorial](../../../deploy/vector_search/README.md).
### Q1.4: Does mainbody detection output only one detection box each time?
**A**: The number of outputs of mainbody detection can be set in the config file. `Global.threshold` controls the detection threshold, and boxes scoring below it are discarded; `Global.max_det_results` controls the maximum number of returned results. Together, these two parameters determine the number of output boxes.
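A minimal sketch of how the two parameters interact; the array layout ([score, x1, y1, x2, y2] per row) is an assumption for illustration:

```python
import numpy as np

def filter_boxes(boxes, threshold=0.2, max_det_results=5):
    boxes = boxes[boxes[:, 0] >= threshold]  # drop boxes below the threshold
    order = np.argsort(-boxes[:, 0])         # sort by score, descending
    return boxes[order[:max_det_results]]    # keep at most max_det_results boxes
```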
### Q1.5: How was the training data of the mainbody detection model selected? Does switching to a smaller model lose accuracy?
**A**: The training data is a randomly sampled subset of public datasets such as COCO, Object365, RPC, and LogoDet. A smaller model may lose some accuracy; we will also try smaller detection models later. For more information about the mainbody detection model, see [Mainbody Detection](../application/mainbody_detection.md).
### Q1.6: How do I fine-tune a recognition model on top of a pretrained model?
**A**: Fine-tuning a recognition model is similar to fine-tuning a classification model; the recognition model can load the product pretrained model. For the training process, see [recognition model training](../tutorials/getting_started_retrieval.md); we will keep refining this part of the documentation.
### Q1.7: What is the difference between PaddleClas and PaddleDetection?
**A**: PaddleClas is an image recognition repo combining mainbody detection, image classification, and image retrieval to solve most image recognition problems; users can easily use PaddleClas to solve few-shot, multi-class image recognition problems. PaddleDetection provides object detection, keypoint detection, multi-object tracking, and other capabilities to conveniently locate points and regions of interest in images; it is widely used in industrial quality inspection, remote-sensing image detection, unmanned inspection, and other projects.
### Q1.8: Is PaddleClas 2.2 fully compatible with PaddleClas 2.1?
**A**: Compared with PaddleClas 2.1, PaddleClas 2.2 adds the metric learning, mainbody detection, and vector search modules, plus 4 scenario-application examples: product recognition, vehicle recognition, logo recognition, and animation character recognition. Users can quickly build image recognition systems based on PaddleClas 2.2. In the image classification module the two versions are used similarly; see the [image classification example](../tutorials/getting_started.md) for quick iteration and evaluation. For the new metric learning module, see the [metric learning example](../tutorials/getting_started_retrieval.md). Also, the new version does not yet support fp16 or DALI training, nor multi-label training; these will be supported soon.
### Q1.9: When training metric learning, each epoch cannot run through all mini-batches. Why?
**A**: When training metric learning, the sampler used is DistributedRandomIdentitySampler, which does not sample all the images, so the data sampled in each epoch is not the complete dataset. It is therefore normal that not all the displayed mini-batches are run. We will improve the printed information later to reduce the confusion this causes.
### Q1.10: Some images get no recognition result. Why?
**A**: In the config file (e.g., inference_product.yaml), `IndexProcess.score_thres` controls the minimum cosine similarity between the image to be recognized and the images in the gallery. When the cosine similarity is below this value, the result is not printed. You can adjust this value according to your own data.
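A minimal sketch of the thresholding described above; `score_thres` stands in for the config value, and `q`/`g` are assumed to be L2-normalized feature vectors:

```python
import numpy as np

def keep_match(q, g, score_thres=0.5):
    cos_sim = float(np.dot(q, g))  # cosine similarity of normalized vectors
    return cos_sim >= score_thres  # results below the threshold are dropped
```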
### Q1.11: Why is the detection result for some images just the original image?
**A**: The mainbody detection model returns detection boxes, but in fact, to make the subsequent recognition model more accurate, the original image is also returned along with the boxes. Afterwards, results are ranked by the similarity between the original image or the detection boxes and the gallery images; the label of the most similar gallery image is the label assigned to the recognized image.
### Q1.12: When using `circle loss`, do I still need to add `triplet loss`?
**A**: `circle loss` unifies the two formulations of pair-wise learning and classification learning. If the classification-learning formulation is used, `triplet loss` can be added.
### Q1.13: When starting a module with hub serving, how do I add parameters for that module?
**A**: See [hub serving parameters](../../../deploy/hubserving/clas/params.py) for details.
### Q1.14: The training loss becomes NaN. Why?
**A**:
1. Make sure the pretrained model is loaded correctly; the simplest way is to add the parameter `-o Arch.pretrained=True`;
2. When fine-tuning, do not use a learning rate that is too large; e.g., 0.001 works well.
### Q1.15: In SSLD, is the large model pretrained on 5 million images used to distill a small model, which is then distill-finetuned on 1 million images?
**A**: The steps are as follows:
1. Based on the `ResNeXt101-32x16d-wsl` model open-sourced by Facebook, distill a `ResNet50-vd` model;
2. Use this `ResNet50-vd` to distill `MobileNetV3` on the 5-million-image dataset;
3. Since the distribution of the 5-million-image dataset is not exactly the same as that of the 1-million-image data, finetune on the 1-million-image data, which gives a slight accuracy improvement.
### Q1.16: If my images are not from the four open-sourced domains, which recognition model should I use?
**A**: We recommend the product recognition model: first, products cover a wide range, so the image to be recognized is more likely to be a product; second, the product recognition model was trained on data with 50,000 classes, so it generalizes better and its features are more robust.
### Q1.17: Why use 512-dimensional vectors at the end, rather than 1024 or some other dimension?
**A**: Vectors of smaller dimension speed up computation; in practice, 128 or even smaller dimensions may be used. Generally, 512 dimensions are large enough to adequately represent the features.
### Q1.18: When training SwinTransformer, the loss becomes NaN
**A**: To train SwinTransformer, you need to use paddle-dev; for installation, see the [PaddlePaddle installation guide](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/develop/install/pip/linux-pip.html). paddlepaddle-2.1 will also be supported soon.
### Q1.19: Do I need to rebuild the index after adding new gallery data?
**A**: In this version the index must be rebuilt; a future version will support building the index only for the newly added images.
### Q1.20: Where is the PaddleClas `train_log` file?
**A**: `train.log` is stored in the directory where the weights are saved.
<a name="第2期"></a>
## Issue 2
### Q2.1: Does the Möbius vector search algorithm currently used by PaddleClas support something like faiss's index.add()? Also, must every newly built graph be trained? Is this train step for search acceleration or for building a similarity graph?
**A**: The search algorithm provided by Möbius is a graph-based approximate nearest neighbor search algorithm that currently supports two distance metrics: inner product and L2 distance. The index.add functionality provided by faiss is not supported yet; to add content to the search gallery, a new index must be rebuilt from scratch. Each time the index is built, the search algorithm internally performs an operation similar to training; unlike the train interface provided by faiss, we name it build, and its main purpose is to speed up search.
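A hedged sketch of the build-then-search flow described above; the import path and method signatures are assumptions based on the vector_search README in this release, so treat them as illustrative rather than authoritative:

```python
import numpy as np
from vector_search import Graph_Index  # assumed import path (deploy/vector_search)

indexer = Graph_Index(dist_type="IP")  # "IP" (inner product) or "L2"
gallery_vectors = np.random.rand(100, 512).astype("float32")
gallery_docs = ["doc_{}".format(i) for i in range(100)]
# The "train-like" step is build(); it must be rerun to add gallery content.
indexer.build(gallery_vectors=gallery_vectors, gallery_docs=gallery_docs,
              pq_size=100, index_path="./index")
scores, docs = indexer.search(query=gallery_vectors[0], return_k=5,
                              search_budget=100)
```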
### Q2.2: Can every frame of a video be predicted frame by frame?
**A**: Yes, but PaddleClas does not currently support video input. You can try modifying the PaddleClas code, or convert the video to frame images in advance and then run PaddleClas prediction on them.
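A minimal sketch of the suggested workaround, dumping frames to image files first; the file names are placeholders:

```python
import os
import cv2

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("input.mp4")  # hypothetical video path
idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    cv2.imwrite("frames/frame_{:06d}.jpg".format(idx), frame)
    idx += 1
cap.release()
```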
### Q2.3: In a live-streaming scenario, we need a real-time recognition view that finds and boxes the target object within a few seconds of delay. Is this feasible?
**A**: To achieve real-time detection, the detection speed must meet real-time requirements. PP-YOLO is a lightweight object detection model provided by the Paddle team that strikes a good balance between detection speed and accuracy; you can try PP-YOLO for detection. For its usage, see: https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/configs/ppyolo/README_cn.md
### Q2.4: For unknown labels, adding them to the gallery dataset allows subsequent recognition without training, but if the upstream detection model cannot locate such unknown labels, does the detection model still need to be trained?
**A**: If the detection model performs poorly on your own dataset, you need to finetune it on your own detection dataset.
### Q2.5: On Mac, recompiling index.so fails with: clang: error: unsupported option '-fopenmp'. How do I handle this?
**A**: This issue has been resolved. To compile index.so on Mac, see the documentation: https://github.com/PaddlePaddle/PaddleClas/blob/develop/deploy/vector_search/README.md
### Q2.6: Does PaddleClas provide data augmentation for adjusting image brightness, contrast, saturation, hue, and so on?
**A**: PaddleClas provides a variety of data augmentation methods in 3 categories: 1. image transformation: AutoAugment, RandAugment; 2. image cropping: CutOut, RandErasing, HideAndSeek, GridMask; 3. image mixing: Mixup, Cutmix. Among them, RandAugment applies random combinations of multiple augmentation methods and can cover brightness, contrast, saturation, hue, and other augmentation needs.
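A hedged sketch following the AutoAugment example earlier in this diff; the RandAugment import path is an assumption, and `decode_op`/`resize_op`/`transform` are as defined in that example:

```python
from ppcls.data.preprocess import RandAugment  # assumed import path

randaugment_op = RandAugment()  # randomly combines color/geometry distortions
ops = [decode_op, resize_op, randaugment_op]
img = transform(data, ops)
```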
# Configuration
---
## Introduction
This document describes the parameters in the PaddleClas configuration files (`configs/*.yaml`) so that you can customize or modify the hyperparameter configuration more quickly.
* Note: some parameters do not appear in the config file; during training or evaluation they can be added or updated directly with `-o`, e.g., `-o checkpoints=./ckp_path/ppcls` adds the `checkpoints` field (if it did not exist before) or updates it (if it did) to the value `./ckp_path/ppcls`.
## Details
### Basic configuration
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| mode | running mode | "train" | ["train", "valid"] |
| checkpoints | checkpoint model path, for resuming training | "" | str |
| last_epoch | number of epochs already trained in the last run, used together with checkpoints | -1 | int |
| pretrained_model | pretrained model path | "" | str |
| load_static_weights | whether the loaded model is a static-graph pretrained model | False | bool |
| model_save_dir | model save path | "" | str |
| classes_num | number of classes | 1000 | int |
| total_images | total number of images | 1281167 | int |
| save_interval | save the model every so many epochs | 1 | int |
| validate | whether to evaluate during training | TRUE | bool |
| valid_interval | evaluate the model every so many epochs | 1 | int |
| epochs | total number of training epochs | | int |
| topk | K value of the evaluation metric | 5 | int |
| image_shape | image size | [3,224,224] | list, shape: (3,) |
| use_mix | whether to use mixup | False | ['True', 'False'] |
| ls_epsilon | label_smoothing epsilon value | 0 | float |
| use_distillation | whether to use model distillation | False | bool |
## Architecture (ARCHITECTURE)
### Classification model architecture
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| name | model architecture name | "ResNet50_vd" | model architectures provided by PaddleClas |
| params | extra model arguments | {} | extra dict required by the architecture, e.g., EfficientNet configs need a `padding_type` parameter, which can be passed this way |
### Recognition model architecture
| Parameter | Description | Default | Options |
| :---------------: | :-----------------------: | :--------: | :----------------------------------------------------------: |
| name | model architecture | "RecModel" | ["RecModel"] |
| infer_output_key | output used at inference time | "feature" | ["feature", "logits"] |
| infer_add_softmax | whether to add softmax at inference time | True | [True, False] |
| Backbone | name of the backbone to use | | a dict containing keys such as `name` and `pretrained`, where `name` is the classification model name and `pretrained` is a bool |
| BackboneStopLayer | feature output layer in the backbone | | a dict containing the key `name`, whose value is the `full_name` of the feature output layer in the backbone |
| Neck | the added Neck part of the network | | a dict with the input parameters of the Neck layers |
| Head | the added Head part of the network | | a dict with the input parameters of the Head layers |
### Learning rate (LEARNING_RATE)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| function | decay method name | "Linear" | ["Linear", "Cosine", <br> "Piecewise", "CosineWarmup"] |
| params.lr | initial learning rate | 0.1 | float |
| params.decay_epochs | milestones at which PiecewiseDecay<br>decays the learning rate | | list |
| params.gamma | gamma value of PiecewiseDecay | 0.1 | float |
| params.warmup_epoch | number of warmup epochs | 5 | int |
| params.steps | number of decay steps of LinearDecay | 100 | int |
| params.end_lr | end_lr value of LinearDecay | 0 | float |
### Optimizer (OPTIMIZER)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| function | optimizer method name | "Momentum" | ["Momentum", "RmsProp"] |
| params.momentum | momentum value | 0.9 | float |
| regularizer.function | regularization method name | "L2" | ["L1", "L2"] |
| regularizer.factor | regularization coefficient | 0.0001 | float |
### Data reader and data processing
| Parameter | Description |
|:---:|:---:|
| batch_size | batch size |
| num_workers | number of data reader workers |
| file_list | training file list |
| data_dir | training file path |
| shuffle_seed | seed used for shuffling |
Data processing:
| Function | Parameter | Description |
|:---:|:---:|:---:|
| DecodeImage | to_rgb | convert data to RGB |
| | to_np | convert data to numpy |
| | channel_first | image data arranged in CHW layout |
| RandCropImage | size | random crop |
| RandFlipImage | | random flip |
| NormalizeImage | scale | normalization scale value |
| | mean | normalization mean |
| | std | normalization standard deviation |
| | order | normalization order |
| ToCHWImage | | convert to CHW |
| CropImage | size | crop size |
| ResizeImage | resize_short | resize by the short side |
Mix processing:
| Parameter | Description |
|:---:|:---:|
| MixupOperator.alpha | alpha value used in mixup processing |
# Configuration
---
## Introduction
This document describes the parameters in the PaddleClas configuration files (`ppcls/configs/*.yaml`) so that you can customize or modify the hyperparameter configuration more quickly.
## Details
### 1. Classification models
Here we take the training configuration of `ResNet50_vd` on `ImageNet-1k` as an example and explain each parameter in detail. [Config path](../../../ppcls/configs/ImageNet/ResNet/ResNet50_vd.yaml)
#### 1.1 Global configuration (Global)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| checkpoints | checkpoint model path, for resuming training | null | str |
| pretrained_model | pretrained model path | null | str |
| output_dir | model save path | "./output/" | str |
| save_interval | save the model every so many epochs | 1 | int |
| eval_during_train | whether to evaluate during training | True | bool |
| eval_interval | evaluate the model every so many epochs | 1 | int |
| epochs | total number of training epochs | | int |
| print_batch_step | print a log every so many mini-batches | 10 | int |
| use_visualdl | whether to use VisualDL to visualize the training process | False | bool |
| image_shape | image size | [3,224,224] | list, shape: (3,) |
| save_inference_dir | save path of the inference model | "./inference" | str |
| eval_mode | eval mode | "classification" | "retrieval" |
**Note**: `pretrained_model` can also be an http address where the pretrained model is stored.
#### 1.2 Architecture (Arch)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| name | model architecture name | ResNet50 | model architectures provided by PaddleClas |
| class_num | number of classes | 1000 | int |
| pretrained | pretrained model | False | bool, str |
**Note**: `pretrained` here can be set to `True` or `False`, or to a weight path. Also, when `Global.pretrained_model` is set to a path, `pretrained` here is ignored.
#### 1.3 Loss function (Loss)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| CELoss | cross-entropy loss function | —— | —— |
| CELoss.weight | weight of CELoss within the whole Loss | 1.0 | float |
| CELoss.epsilon | label_smooth epsilon value in CELoss | 0.1 | float, between 0 and 1 |
#### 1.4 Optimizer (Optimizer)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| name | optimizer method name | "Momentum" | other optimizers such as "RmsProp" |
| momentum | momentum value | 0.9 | float |
| lr.name | learning rate decay method | "Cosine" | other decay methods such as "Linear" and "Piecewise" |
| lr.learning_rate | initial learning rate | 0.1 | float |
| lr.warmup_epoch | number of warmup epochs | 0 | int, e.g., 5 |
| regularizer.name | regularization method name | "L2" | ["L1", "L2"] |
| regularizer.coeff | regularization coefficient | 0.00007 | float |
**Note**: when `lr.name` differs, the additional parameters may differ as well. For example, when `lr.name=Piecewise`, the following parameters need to be added:
```
lr:
name: Piecewise
learning_rate: 0.1
decay_epochs: [30, 60, 90]
values: [0.1, 0.01, 0.001, 0.0001]
```
For the methods and parameters that can be added, see [learning_rate.py](../../../ppcls/optimizer/learning_rate.py).
#### 1.5 Data loading module (DataLoader)
##### 1.5.1 dataset
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| name | name of the class that reads the data | ImageNetDataset | other dataset class names such as VeriWild |
| image_root | root path where the dataset is stored | ./dataset/ILSVRC2012/ | str |
| cls_label_path | dataset label list | ./dataset/ILSVRC2012/train_list.txt | str |
| transform_ops | preprocessing for single images | —— | —— |
| batch_transform_ops | preprocessing for batches of images | —— | —— |
Meaning of the parameters in transform_ops:
| Function | Parameter | Description |
|:---:|:---:|:---:|
| DecodeImage | to_rgb | convert data to RGB |
| | channel_first | image data arranged in CHW layout |
| RandCropImage | size | random crop |
| RandFlipImage | | random flip |
| NormalizeImage | scale | normalization scale value |
| | mean | normalization mean |
| | std | normalization standard deviation |
| | order | normalization order |
| CropImage | size | crop size |
| ResizeImage | resize_short | resize by the short side |
Meaning of the parameters in batch_transform_ops:
| Function | Parameter | Description |
|:---:|:---:|:---:|
| MixupOperator | alpha | mixup parameter value; the larger the value, the stronger the augmentation |
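A minimal sketch of what a mixup batch operator computes with this alpha; the function shape is illustrative, not the MixupOperator implementation:

```python
import numpy as np

def mixup(images, labels, alpha=0.2):
    # images/labels are numpy arrays for one batch
    lam = np.random.beta(alpha, alpha)        # mixing ratio ~ Beta(alpha, alpha)
    idx = np.random.permutation(len(images))  # pair each sample with another
    mixed = lam * images + (1 - lam) * images[idx]
    # loss is computed against both labels, weighted by lam and (1 - lam)
    return mixed, labels, labels[idx], lam
```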
##### 1.5.2 sampler
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| name | sampler type | DistributedBatchSampler | other samplers such as DistributedRandomIdentitySampler |
| batch_size | batch size | 64 | int |
| drop_last | whether to drop the last data that does not fill a batch | False | bool |
| shuffle | whether to shuffle the data | True | bool |
##### 1.5.3 loader
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| num_workers | number of data-loading threads | 4 | int |
| use_shared_memory | whether to use shared memory | True | bool |
#### 1.6 Evaluation metric (Metric)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| TopkAcc | TopkAcc | [1, 5] | list, int |
#### 1.7 Inference (Infer)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| infer_imgs | path of the image(s) to infer | docs/images/whl/demo.jpg | str |
| batch_size | batch size | 10 | int |
| PostProcess.name | post-processing name | Topk | str |
| PostProcess.topk | topk value | 5 | int |
| PostProcess.class_id_map_file | mapping file between class ids and names | ppcls/utils/imagenet1k_label_list.txt | str |
**Note**: for the explanation of the Infer module's `transforms`, see the explanation of `transform_ops` in the dataset part of the data loading module.
### 2. Distillation models
**Note**: here we take the training configuration of distilling `MobileNetV3_small_x1_0` with `MobileNetV3_large_x1_0` on `ImageNet-1k` as an example and explain each parameter in detail. [Config path](../../../ppcls/configs/ImageNet/Distillation/mv3_large_x1_0_distill_mv3_small_x1_0.yaml). Only the parameters that differ from the classification models are described here.
#### 2.1 Architecture (Arch)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| name | model architecture name | DistillationModel | —— |
| class_num | number of classes | 1000 | int |
| freeze_params_list | list of frozen parameters | [True, False] | list |
| models | model list | [Teacher, Student] | list |
| Teacher.name | name of the teacher model | MobileNetV3_large_x1_0 | models in PaddleClas |
| Teacher.pretrained | pretrained weights of the teacher model | True | bool or pretrained weight path |
| Teacher.use_ssld | whether the teacher's pretrained weights are SSLD weights | True | bool |
| infer_model_name | type of the model used in inference | Student | Teacher |
**Note**:
1. A list is written in yaml as follows:
```
freeze_params_list:
- True
- False
```
2. The Student parameters are similar and are not repeated here.
#### 2.2 Loss function (Loss)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| DistillationCELoss | distillation cross-entropy loss function | —— | —— |
| DistillationCELoss.weight | loss weight | 1.0 | float |
| DistillationCELoss.model_name_pairs | pairs of models whose outputs are compared | ["Student", "Teacher"] | —— |
| DistillationGTCELoss | cross-entropy loss function between the distilled models and the ground-truth labels | —— | —— |
| DistillationGTCELoss.weight | loss weight | 1.0 | float |
| DistillationGTCELoss.model_names | names of the models whose outputs are compared against the ground-truth labels | ["Student"] | —— |
#### 2.3 Evaluation metric (Metric)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| DistillationTopkAcc | DistillationTopkAcc | contains the two parameters model_key and topk | —— |
| DistillationTopkAcc.model_key | the model being evaluated | "Student" | "Teacher" |
| DistillationTopkAcc.topk | Topk value | [1, 5] | list, int |
**Note**: `DistillationTopkAcc` has the same meaning as the ordinary `TopkAcc`, except that it is only used in distillation tasks.
### 3. Recognition models
**Note**: here we take the training configuration of `ResNet50` on `LogoDet-3k` as an example and explain each parameter in detail. [Config path](../../../ppcls/configs/Logo/ResNet50_ReID.yaml). Only the parameters that differ from the classification models are described here.
#### 3.1 Architecture (Arch)
| Parameter | Description | Default | Options |
| :---------------: | :-----------------------: | :--------: | :----------------------------------------------------------: |
| name | model architecture | "RecModel" | ["RecModel"] |
| infer_output_key | output used at inference time | "feature" | ["feature", "logits"] |
| infer_add_softmax | whether to add softmax at inference time | False | [True, False] |
| Backbone.name | name of the backbone | ResNet50_last_stage_stride1 | other backbones provided by PaddleClas |
| Backbone.pretrained | pretrained backbone model | True | bool or pretrained model path |
| BackboneStopLayer.name | name of the output layer in the backbone | True | `full_name` of the feature output layer in the backbone |
| Neck.name | name of the Neck part of the network | VehicleNeck | a dict with the input parameters of the Neck layers |
| Neck.in_channels | input dimension of the Neck part | 2048 | the same as the size of the BackboneStopLayer.name layer |
| Neck.out_channels | output dimension of the Neck part, i.e., the feature dimension | 512 | int |
| Head.name | name of the Head part of the network | CircleMargin | ArcMargin, etc. |
| Head.embedding_size | feature dimension | 512 | consistent with Neck.out_channels |
| Head.class_num | number of classes | 3000 | int |
| Head.margin | margin value in CircleMargin | 0.35 | float |
| Head.scale | scale value in CircleMargin | 64 | int |
**Note**:
1. In PaddleClas, the `Neck` part connects the backbone to the embedding layer, and the `Head` part connects the embedding layer to the classification layer.
2. `BackboneStopLayer.name` can be obtained by visualizing the model; for visualization, see [Netron](https://github.com/lutzroeder/netron) or [visualdl](https://github.com/PaddlePaddle/VisualDL).
3. Calling `tools/export_model.py` converts the model weights into an inference model; the `infer_add_softmax` parameter controls whether a `Softmax` activation is appended to the output. The default in the code is `True` (in classification tasks, the final output layer is followed by a `Softmax` activation), but in recognition tasks the feature layer needs no activation, so it must be set to `False` here.
#### 3.2 Evaluation metric (Metric)
| Parameter | Description | Default | Options |
|:---:|:---:|:---:|:---:|
| Recallk | recall rate | [1, 5] | list, int |
| mAP | mean average precision | None | None |
......@@ -104,7 +104,8 @@ class ConvBNLayer(TheseusLayer):
groups=1,
is_vd_mode=False,
act=None,
lr_mult=1.0):
lr_mult=1.0,
data_format="NCHW"):
super().__init__()
self.is_vd_mode = is_vd_mode
self.act = act
......@@ -118,11 +119,13 @@ class ConvBNLayer(TheseusLayer):
padding=(filter_size - 1) // 2,
groups=groups,
weight_attr=ParamAttr(learning_rate=lr_mult),
bias_attr=False)
bias_attr=False,
data_format=data_format)
self.bn = BatchNorm(
num_filters,
param_attr=ParamAttr(learning_rate=lr_mult),
bias_attr=ParamAttr(learning_rate=lr_mult))
bias_attr=ParamAttr(learning_rate=lr_mult),
data_layout=data_format)
self.relu = nn.ReLU()
def forward(self, x):
......@@ -136,14 +139,14 @@ class ConvBNLayer(TheseusLayer):
class BottleneckBlock(TheseusLayer):
def __init__(
self,
def __init__(self,
num_channels,
num_filters,
stride,
shortcut=True,
if_first=False,
lr_mult=1.0, ):
lr_mult=1.0,
data_format="NCHW"):
super().__init__()
self.conv0 = ConvBNLayer(
......@@ -151,20 +154,23 @@ class BottleneckBlock(TheseusLayer):
num_filters=num_filters,
filter_size=1,
act="relu",
lr_mult=lr_mult)
lr_mult=lr_mult,
data_format=data_format)
self.conv1 = ConvBNLayer(
num_channels=num_filters,
num_filters=num_filters,
filter_size=3,
stride=stride,
act="relu",
lr_mult=lr_mult)
lr_mult=lr_mult,
data_format=data_format)
self.conv2 = ConvBNLayer(
num_channels=num_filters,
num_filters=num_filters * 4,
filter_size=1,
act=None,
lr_mult=lr_mult)
lr_mult=lr_mult,
data_format=data_format)
if not shortcut:
self.short = ConvBNLayer(
......@@ -173,7 +179,8 @@ class BottleneckBlock(TheseusLayer):
filter_size=1,
stride=stride if if_first else 1,
is_vd_mode=False if if_first else True,
lr_mult=lr_mult)
lr_mult=lr_mult,
data_format=data_format)
self.relu = nn.ReLU()
self.shortcut = shortcut
......@@ -199,7 +206,8 @@ class BasicBlock(TheseusLayer):
stride,
shortcut=True,
if_first=False,
lr_mult=1.0):
lr_mult=1.0,
data_format="NCHW"):
super().__init__()
self.stride = stride
......@@ -209,13 +217,15 @@ class BasicBlock(TheseusLayer):
filter_size=3,
stride=stride,
act="relu",
lr_mult=lr_mult)
lr_mult=lr_mult,
data_format=data_format)
self.conv1 = ConvBNLayer(
num_channels=num_filters,
num_filters=num_filters,
filter_size=3,
act=None,
lr_mult=lr_mult)
lr_mult=lr_mult,
data_format=data_format)
if not shortcut:
self.short = ConvBNLayer(
num_channels=num_channels,
......@@ -223,7 +233,8 @@ class BasicBlock(TheseusLayer):
filter_size=1,
stride=stride if if_first else 1,
is_vd_mode=False if if_first else True,
lr_mult=lr_mult)
lr_mult=lr_mult,
data_format=data_format)
self.shortcut = shortcut
self.relu = nn.ReLU()
......@@ -256,7 +267,9 @@ class ResNet(TheseusLayer):
config,
version="vb",
class_num=1000,
lr_mult_list=[1.0, 1.0, 1.0, 1.0, 1.0]):
lr_mult_list=[1.0, 1.0, 1.0, 1.0, 1.0],
data_format="NCHW",
input_image_channel=3):
super().__init__()
self.cfg = config
......@@ -279,22 +292,25 @@ class ResNet(TheseusLayer):
self.stem_cfg = {
#num_channels, num_filters, filter_size, stride
"vb": [[3, 64, 7, 2]],
"vd": [[3, 32, 3, 2], [32, 32, 3, 1], [32, 64, 3, 1]]
"vb": [[input_image_channel, 64, 7, 2]],
"vd":
[[input_image_channel, 32, 3, 2], [32, 32, 3, 1], [32, 64, 3, 1]]
}
self.stem = nn.Sequential(*[
self.stem = nn.Sequential(* [
ConvBNLayer(
num_channels=in_c,
num_filters=out_c,
filter_size=k,
stride=s,
act="relu",
lr_mult=self.lr_mult_list[0])
lr_mult=self.lr_mult_list[0],
data_format=data_format)
for in_c, out_c, k, s in self.stem_cfg[version]
])
self.max_pool = MaxPool2D(kernel_size=3, stride=2, padding=1)
self.max_pool = MaxPool2D(
kernel_size=3, stride=2, padding=1, data_format=data_format)
block_list = []
for block_idx in range(len(self.block_depth)):
shortcut = False
......@@ -306,11 +322,12 @@ class ResNet(TheseusLayer):
stride=2 if i == 0 and block_idx != 0 else 1,
shortcut=shortcut,
if_first=block_idx == i == 0 if version == "vd" else True,
lr_mult=self.lr_mult_list[block_idx + 1]))
lr_mult=self.lr_mult_list[block_idx + 1],
data_format=data_format))
shortcut = True
self.blocks = nn.Sequential(*block_list)
self.avg_pool = AdaptiveAvgPool2D(1)
self.avg_pool = AdaptiveAvgPool2D(1, data_format=data_format)
self.flatten = nn.Flatten()
self.avg_pool_channels = self.num_channels[-1] * 2
stdv = 1.0 / math.sqrt(self.avg_pool_channels * 1.0)
......@@ -319,7 +336,13 @@ class ResNet(TheseusLayer):
self.class_num,
weight_attr=ParamAttr(initializer=Uniform(-stdv, stdv)))
self.data_format = data_format
def forward(self, x):
with paddle.static.amp.fp16_guard():
if self.data_format == "NHWC":
x = paddle.transpose(x, [0, 2, 3, 1])
x.stop_gradient = True
x = self.stem(x)
x = self.max_pool(x)
x = self.blocks(x)
......
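The NHWC ("channels-last") path above hinges on a single layout transpose at the entry of forward(); a minimal standalone sketch of that conversion:

```python
import paddle

x = paddle.rand([8, 3, 224, 224])           # NCHW input batch
x_nhwc = paddle.transpose(x, [0, 2, 3, 1])  # NHWC, as in forward() above
print(x_nhwc.shape)                         # [8, 224, 224, 3]
```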
......@@ -38,7 +38,7 @@ class CosMargin(paddle.nn.Layer):
input_norm = paddle.sqrt(
paddle.sum(paddle.square(input), axis=1, keepdim=True))
input = paddle.divide(input, x_norm)
input = paddle.divide(input, input_norm)
weight = self.fc.weight
weight_norm = paddle.sqrt(
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
Eval:
- CELoss:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
Eval:
- CELoss:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
# global configs
Global:
checkpoints: null
pretrained_model: null
output_dir: ./output/
device: gpu
save_interval: 1
eval_during_train: True
eval_interval: 1
epochs: 120
print_batch_step: 10
use_visualdl: False
# used for static mode and model export
image_channel: &image_channel 4
image_shape: [*image_channel, 224, 224]
save_inference_dir: ./inference
# training model under @to_static
to_static: False
# mixed precision training
AMP:
scale_loss: 128.0
use_dynamic_loss_scaling: True
use_pure_fp16: &use_pure_fp16 True
# model architecture
Arch:
name: ResNet50
class_num: 1000
input_image_channel: *image_channel
data_format: "NHWC"
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
weight: 1.0
Eval:
- CELoss:
weight: 1.0
Optimizer:
name: Momentum
momentum: 0.9
multi_precision: *use_pure_fp16
lr:
name: Piecewise
learning_rate: 0.1
decay_epochs: [30, 60, 90]
values: [0.1, 0.01, 0.001, 0.0001]
regularizer:
name: 'L2'
coeff: 0.0001
# data loader for train and eval
DataLoader:
Train:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/train_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
- RandCropImage:
size: 224
- RandFlipImage:
flip_code: 1
- NormalizeImage:
scale: 1.0/255.0
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
output_fp16: *use_pure_fp16
channel_num: *image_channel
sampler:
name: DistributedBatchSampler
batch_size: 32
drop_last: False
shuffle: True
loader:
num_workers: 4
use_shared_memory: True
Eval:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/val_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
- ResizeImage:
resize_short: 256
- CropImage:
size: 224
- NormalizeImage:
scale: 1.0/255.0
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
output_fp16: *use_pure_fp16
channel_num: *image_channel
sampler:
name: DistributedBatchSampler
batch_size: 64
drop_last: False
shuffle: False
loader:
num_workers: 4
use_shared_memory: True
Infer:
infer_imgs: docs/images/whl/demo.jpg
batch_size: 10
transforms:
- DecodeImage:
to_rgb: True
channel_first: False
- ResizeImage:
resize_short: 256
- CropImage:
size: 224
- NormalizeImage:
scale: 1.0/255.0
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
output_fp16: *use_pure_fp16
channel_num: *image_channel
- ToCHWImage:
PostProcess:
name: Topk
topk: 5
class_id_map_file: ppcls/utils/imagenet1k_label_list.txt
Metric:
Train:
- TopkAcc:
topk: [1, 5]
Eval:
- TopkAcc:
topk: [1, 5]
# global configs
Global:
checkpoints: null
pretrained_model: null
output_dir: ./output/
device: gpu
save_interval: 1
eval_during_train: True
eval_interval: 1
epochs: 120
print_batch_step: 10
use_visualdl: False
image_channel: &image_channel 4
# used for static mode and model export
image_shape: [*image_channel, 224, 224]
save_inference_dir: ./inference
# training model under @to_static
to_static: False
use_dali: True
# mixed precision training
AMP:
scale_loss: 128.0
use_dynamic_loss_scaling: True
use_pure_fp16: &use_pure_fp16 False
# model architecture
Arch:
name: ResNet50
class_num: 1000
input_image_channel: *image_channel
data_format: "NHWC"
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
weight: 1.0
Eval:
- CELoss:
weight: 1.0
Optimizer:
name: Momentum
momentum: 0.9
lr:
name: Piecewise
learning_rate: 0.1
decay_epochs: [30, 60, 90]
values: [0.1, 0.01, 0.001, 0.0001]
regularizer:
name: 'L2'
coeff: 0.0001
# data loader for train and eval
DataLoader:
Train:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/train_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
- RandCropImage:
size: 224
- RandFlipImage:
flip_code: 1
- NormalizeImage:
scale: 1.0/255.0
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
output_fp16: *use_pure_fp16
channel_num: *image_channel
sampler:
name: DistributedBatchSampler
batch_size: 256
drop_last: False
shuffle: True
loader:
num_workers: 4
use_shared_memory: True
Eval:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/val_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
- ResizeImage:
resize_short: 256
- CropImage:
size: 224
- NormalizeImage:
scale: 1.0/255.0
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
output_fp16: *use_pure_fp16
channel_num: *image_channel
sampler:
name: DistributedBatchSampler
batch_size: 64
drop_last: False
shuffle: False
loader:
num_workers: 4
use_shared_memory: True
Infer:
infer_imgs: docs/images/whl/demo.jpg
batch_size: 10
transforms:
- DecodeImage:
to_rgb: True
channel_first: False
- ResizeImage:
resize_short: 256
- CropImage:
size: 224
- NormalizeImage:
scale: 1.0/255.0
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
output_fp16: *use_pure_fp16
channel_num: *image_channel
- ToCHWImage:
PostProcess:
name: Topk
topk: 5
class_id_map_file: ppcls/utils/imagenet1k_label_list.txt
Metric:
Train:
- TopkAcc:
topk: [1, 5]
Eval:
- TopkAcc:
topk: [1, 5]
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
# global configs
Global:
checkpoints: null
pretrained_model: null
output_dir: ./output/
device: gpu
save_interval: 1
eval_during_train: True
eval_interval: 1
epochs: 200
print_batch_step: 10
use_visualdl: False
# used for static mode and model export
image_channel: &image_channel 4
image_shape: [*image_channel, 224, 224]
save_inference_dir: ./inference
# model architecture
Arch:
name: SE_ResNeXt101_32x4d
class_num: 1000
input_image_channel: *image_channel
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
weight: 1.0
epsilon: 0.1
Eval:
- CELoss:
weight: 1.0
# mixed precision training
AMP:
scale_loss: 128.0
use_dynamic_loss_scaling: True
use_pure_fp16: &use_pure_fp16 True
Optimizer:
name: Momentum
momentum: 0.9
lr:
name: Cosine
learning_rate: 0.1
regularizer:
name: 'L2'
coeff: 0.00007
# data loader for train and eval
DataLoader:
Train:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/train_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
- RandCropImage:
size: 224
- RandFlipImage:
flip_code: 1
- NormalizeImage:
scale: 1.0/255.0
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
output_fp16: *use_pure_fp16
channel_num: *image_channel
sampler:
name: DistributedBatchSampler
batch_size: 64
drop_last: False
shuffle: True
loader:
num_workers: 4
use_shared_memory: True
Eval:
dataset:
name: ImageNetDataset
image_root: ./dataset/ILSVRC2012/
cls_label_path: ./dataset/ILSVRC2012/val_list.txt
transform_ops:
- DecodeImage:
to_rgb: True
channel_first: False
- ResizeImage:
resize_short: 256
- CropImage:
size: 224
- NormalizeImage:
scale: 1.0/255.0
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
output_fp16: *use_pure_fp16
channel_num: *image_channel
sampler:
name: BatchSampler
batch_size: 64
drop_last: False
shuffle: False
loader:
num_workers: 4
use_shared_memory: True
Infer:
infer_imgs: docs/images/whl/demo.jpg
batch_size: 10
transforms:
- DecodeImage:
to_rgb: True
channel_first: False
- ResizeImage:
resize_short: 256
- CropImage:
size: 224
- NormalizeImage:
scale: 1.0/255.0
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: ''
output_fp16: *use_pure_fp16
channel_num: *image_channel
- ToCHWImage:
PostProcess:
name: Topk
topk: 5
class_id_map_file: ppcls/utils/imagenet1k_label_list.txt
Metric:
Train:
- TopkAcc:
topk: [1, 5]
Eval:
- TopkAcc:
topk: [1, 5]
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
......@@ -22,7 +22,7 @@ Arch:
# loss function config for traing/eval process
Loss:
Train:
- CELoss:
- MixCELoss:
weight: 1.0
epsilon: 0.1
Eval:
......
# global configs
Global:
checkpoints: null
# please download pretrained model via this link:
# https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/pretrain/product_ResNet50_vd_Aliproduct_v1.0_pretrained.pdparams
pretrained_model: product_ResNet50_vd_Aliproduct_v1.0_pretrained
pretrained_model: "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/pretrain/product_ResNet50_vd_Aliproduct_v1.0_pretrained.pdparams"
output_dir: ./output/
device: gpu
save_interval: 10
......
# global configs
Global:
checkpoints: null
# please download pretrained model via this link:
# https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/pretrain/product_ResNet50_vd_Aliproduct_v1.0_pretrained.pdparams
pretrained_model: product_ResNet50_vd_Aliproduct_v1.0_pretrained
pretrained_model: "https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/rec/models/pretrain/product_ResNet50_vd_Aliproduct_v1.0_pretrained.pdparams"
output_dir: ./output/
device: gpu
save_interval: 10
......
......@@ -53,10 +53,14 @@ def create_operators(params):
return ops
def build_dataloader(config, mode, device, seed=None):
def build_dataloader(config, mode, device, use_dali=False, seed=None):
assert mode in ['Train', 'Eval', 'Test', 'Gallery', 'Query'
], "Mode should be Train, Eval, Test, Gallery, Query"
# build dataset
if use_dali:
from ppcls.data.dataloader.dali import dali_dataloader
return dali_dataloader(config, mode, paddle.device.get_device(), seed)
config_dataset = config[mode]['dataset']
config_dataset = copy.deepcopy(config_dataset)
dataset_name = config_dataset.pop('name')
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -14,16 +14,17 @@
from __future__ import division
import copy
import os
import numpy as np
from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types
from nvidia.dali.plugin.paddle import DALIGenericIterator
import paddle
from paddle import fluid
from nvidia.dali import fn
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.base_iterator import LastBatchPolicy
from nvidia.dali.plugin.paddle import DALIGenericIterator
class HybridTrainPipe(Pipeline):
......@@ -46,10 +47,11 @@ class HybridTrainPipe(Pipeline):
num_threads=4,
seed=42,
pad_output=False,
output_dtype=types.FLOAT):
output_dtype=types.FLOAT,
dataset='Train'):
super(HybridTrainPipe, self).__init__(
batch_size, num_threads, device_id, seed=seed)
self.input = ops.FileReader(
self.input = ops.readers.File(
file_root=file_root,
file_list=file_list,
shard_id=shard_id,
......@@ -59,9 +61,9 @@ class HybridTrainPipe(Pipeline):
# without additional reallocations
device_memory_padding = 211025920
host_memory_padding = 140544512
self.decode = ops.ImageDecoderRandomCrop(
self.decode = ops.decoders.ImageRandomCrop(
device='mixed',
output_type=types.RGB,
output_type=types.DALIImageType.RGB,
device_memory_padding=device_memory_padding,
host_memory_padding=host_memory_padding,
random_aspect_ratio=[lower, upper],
......@@ -71,15 +73,14 @@ class HybridTrainPipe(Pipeline):
device='gpu', resize_x=crop, resize_y=crop, interp_type=interp)
self.cmnp = ops.CropMirrorNormalize(
device="gpu",
output_dtype=output_dtype,
output_layout=types.NCHW,
dtype=output_dtype,
output_layout='CHW',
crop=(crop, crop),
image_type=types.RGB,
mean=mean,
std=std,
pad_output=pad_output)
self.coin = ops.CoinFlip(probability=0.5)
self.to_int64 = ops.Cast(dtype=types.INT64, device="gpu")
self.coin = ops.random.CoinFlip(probability=0.5)
self.to_int64 = ops.Cast(dtype=types.DALIDataType.INT64, device="gpu")
def define_graph(self):
rng = self.coin()
......@@ -113,25 +114,24 @@ class HybridValPipe(Pipeline):
output_dtype=types.FLOAT):
super(HybridValPipe, self).__init__(
batch_size, num_threads, device_id, seed=seed)
self.input = ops.FileReader(
self.input = ops.readers.File(
file_root=file_root,
file_list=file_list,
shard_id=shard_id,
num_shards=num_shards,
random_shuffle=random_shuffle)
self.decode = ops.ImageDecoder(device="mixed", output_type=types.RGB)
self.decode = ops.decoders.Image(device="mixed")
self.res = ops.Resize(
device="gpu", resize_shorter=resize_shorter, interp_type=interp)
self.cmnp = ops.CropMirrorNormalize(
device="gpu",
output_dtype=output_dtype,
output_layout=types.NCHW,
dtype=output_dtype,
output_layout='CHW',
crop=(crop, crop),
image_type=types.RGB,
mean=mean,
std=std,
pad_output=pad_output)
self.to_int64 = ops.Cast(dtype=types.INT64, device="gpu")
self.to_int64 = ops.Cast(dtype=types.DALIDataType.INT64, device="gpu")
def define_graph(self):
jpegs, labels = self.input(name="Reader")
......@@ -144,64 +144,84 @@ class HybridValPipe(Pipeline):
return self.epoch_size("Reader")
def build(config, mode='train'):
env = os.environ
assert config.get('use_gpu',
True) == True, "gpu training is required for DALI"
assert not config.get(
'use_aa'), "auto augment is not supported by DALI reader"
assert float(env.get('FLAGS_fraction_of_gpu_memory_to_use', 0.92)) < 0.9, \
"Please leave enough GPU memory for DALI workspace, e.g., by setting" \
" `export FLAGS_fraction_of_gpu_memory_to_use=0.8`"
def dali_dataloader(config, mode, device, seed=None):
assert "gpu" in device, "gpu training is required for DALI"
device_id = int(device.split(':')[1])
config_dataloader = config[mode]
seed = 42 if seed is None else seed
ops = [
list(x.keys())[0]
for x in config_dataloader["dataset"]["transform_ops"]
]
support_ops_train = [
"DecodeImage", "NormalizeImage", "RandFlipImage", "RandCropImage"
]
support_ops_eval = [
"DecodeImage", "ResizeImage", "CropImage", "NormalizeImage"
]
if mode.lower() == 'train':
assert set(ops) == set(
support_ops_train
), "The supported trasform_ops for train_dataset in dali is : {}".format(
",".join(support_ops_train))
else:
assert set(ops) == set(
support_ops_eval
), "The supported trasform_ops for eval_dataset in dali is : {}".format(
",".join(support_ops_eval))
normalize_ops = [
op for op in config_dataloader["dataset"]["transform_ops"]
if "NormalizeImage" in op
][0]["NormalizeImage"]
channel_num = normalize_ops.get("channel_num", 3)
output_dtype = types.FLOAT16 if normalize_ops.get("output_fp16",
False) else types.FLOAT
dataset_config = config[mode.upper()]
env = os.environ
# assert float(env.get('FLAGS_fraction_of_gpu_memory_to_use', 0.92)) < 0.9, \
# "Please leave enough GPU memory for DALI workspace, e.g., by setting" \
# " `export FLAGS_fraction_of_gpu_memory_to_use=0.8`"
gpu_num = paddle.fluid.core.get_cuda_device_count() if (
'PADDLE_TRAINERS_NUM') and (
'PADDLE_TRAINER_ID'
) not in env else int(env.get('PADDLE_TRAINERS_NUM', 0))
gpu_num = paddle.distributed.get_world_size()
batch_size = dataset_config.batch_size
assert batch_size % gpu_num == 0, \
"batch size must be multiple of number of devices"
batch_size = batch_size // gpu_num
batch_size = config_dataloader["sampler"]["batch_size"]
file_root = dataset_config.data_dir
file_list = dataset_config.file_list
file_root = config_dataloader["dataset"]["image_root"]
file_list = config_dataloader["dataset"]["cls_label_path"]
interp = 1 # settings.interpolation or 1 # default to linear
interp_map = {
0: types.INTERP_NN, # cv2.INTER_NEAREST
1: types.INTERP_LINEAR, # cv2.INTER_LINEAR
2: types.INTERP_CUBIC, # cv2.INTER_CUBIC
4: types.INTERP_LANCZOS3, # XXX use LANCZOS3 for cv2.INTER_LANCZOS4
0: types.DALIInterpType.INTERP_NN, # cv2.INTER_NEAREST
1: types.DALIInterpType.INTERP_LINEAR, # cv2.INTER_LINEAR
2: types.DALIInterpType.INTERP_CUBIC, # cv2.INTER_CUBIC
3: types.DALIInterpType.INTERP_LANCZOS3, # XXX use LANCZOS3 for cv2.INTER_LANCZOS4
}
output_dtype = (types.FLOAT16 if 'AMP' in config and
config.AMP.get("use_pure_fp16", False)
else types.FLOAT)
assert interp in interp_map, "interpolation method not supported by DALI"
interp = interp_map[interp]
pad_output = False
image_shape = config.get("image_shape", None)
if image_shape and image_shape[0] == 4:
pad_output = True
pad_output = channel_num == 4
transforms = {
k: v
for d in dataset_config["transforms"] for k, v in d.items()
for d in config_dataloader["dataset"]["transform_ops"]
for k, v in d.items()
}
scale = transforms["NormalizeImage"].get("scale", 1.0 / 255)
if isinstance(scale, str):
scale = eval(scale)
scale = eval(scale) if isinstance(scale, str) else scale
mean = transforms["NormalizeImage"].get("mean", [0.485, 0.456, 0.406])
std = transforms["NormalizeImage"].get("std", [0.229, 0.224, 0.225])
mean = [v / scale for v in mean]
std = [v / scale for v in std]
if mode == "train":
sampler_name = config_dataloader["sampler"].get("name",
"DistributedBatchSampler")
assert sampler_name in ["DistributedBatchSampler", "BatchSampler"]
if mode.lower() == "train":
resize_shorter = 256
crop = transforms["RandCropImage"]["size"]
scale = transforms["RandCropImage"].get("scale", [0.08, 1.])
......@@ -229,20 +249,13 @@ def build(config, mode='train'):
device_id,
shard_id,
num_shards,
seed=42 + shard_id,
seed=seed + shard_id,
pad_output=pad_output,
output_dtype=output_dtype)
pipe.build()
pipelines = [pipe]
sample_per_shard = len(pipe) // num_shards
# sample_per_shard = len(pipe) // num_shards
else:
pipelines = []
places = fluid.framework.cuda_places()
num_shards = len(places)
for idx, p in enumerate(places):
place = fluid.core.Place()
place.set_place(p)
device_id = place.gpu_device_id()
pipe = HybridTrainPipe(
file_root,
file_list,
......@@ -255,25 +268,40 @@ def build(config, mode='train'):
interp,
mean,
std,
device_id,
idx,
num_shards,
seed=42 + idx,
device_id=device_id,
shard_id=0,
num_shards=1,
seed=seed,
pad_output=pad_output,
output_dtype=output_dtype)
pipe.build()
pipelines.append(pipe)
sample_per_shard = len(pipelines[0])
pipelines = [pipe]
# sample_per_shard = len(pipelines[0])
return DALIGenericIterator(
pipelines, ['feed_image', 'feed_label'], size=sample_per_shard)
pipelines, ['data', 'label'], reader_name='Reader')
else:
resize_shorter = transforms["ResizeImage"].get("resize_short", 256)
crop = transforms["CropImage"]["size"]
if 'PADDLE_TRAINER_ID' in env and 'PADDLE_TRAINERS_NUM' in env and sampler_name == "DistributedBatchSampler":
shard_id = int(env['PADDLE_TRAINER_ID'])
num_shards = int(env['PADDLE_TRAINERS_NUM'])
device_id = int(env['FLAGS_selected_gpus'])
p = fluid.framework.cuda_places()[0]
place = fluid.core.Place()
place.set_place(p)
device_id = place.gpu_device_id()
pipe = HybridValPipe(
file_root,
file_list,
batch_size,
resize_shorter,
crop,
interp,
mean,
std,
device_id=device_id,
shard_id=shard_id,
num_shards=num_shards,
pad_output=pad_output,
output_dtype=output_dtype)
else:
pipe = HybridValPipe(
file_root,
file_list,
......@@ -288,74 +316,4 @@ def build(config, mode='train'):
output_dtype=output_dtype)
pipe.build()
return DALIGenericIterator(
pipe, ['feed_image', 'feed_label'],
size=len(pipe),
dynamic_shape=True,
fill_last_batch=True,
last_batch_padded=True)
def train(config):
return build(config, 'train')
def val(config):
return build(config, 'valid')
def _to_Tensor(lod_tensor, dtype):
data_tensor = fluid.layers.create_tensor(dtype=dtype)
data = np.array(lod_tensor).astype(dtype)
fluid.layers.assign(data, data_tensor)
return data_tensor
def normalize(feeds, config):
image, label = feeds['image'], feeds['label']
img_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
img_std = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
image = fluid.layers.cast(image, 'float32')
costant = fluid.layers.fill_constant(
shape=[1], value=255.0, dtype='float32')
image = fluid.layers.elementwise_div(image, costant)
mean = fluid.layers.create_tensor(dtype="float32")
fluid.layers.assign(input=img_mean.astype("float32"), output=mean)
std = fluid.layers.create_tensor(dtype="float32")
fluid.layers.assign(input=img_std.astype("float32"), output=std)
image = fluid.layers.elementwise_sub(image, mean)
image = fluid.layers.elementwise_div(image, std)
image.stop_gradient = True
feeds['image'] = image
return feeds
def mix(feeds, config, is_train=True):
env = os.environ
gpu_num = paddle.fluid.core.get_cuda_device_count() if (
'PADDLE_TRAINERS_NUM') and (
'PADDLE_TRAINER_ID'
) not in env else int(env.get('PADDLE_TRAINERS_NUM', 0))
batch_size = config.TRAIN.batch_size // gpu_num
images = feeds['image']
label = feeds['label']
# TODO: hard code here, should be fixed!
alpha = 0.2
idx = _to_Tensor(np.random.permutation(batch_size), 'int32')
lam = np.random.beta(alpha, alpha)
images = lam * images + (1 - lam) * paddle.fluid.layers.gather(images, idx)
feed = {
'image': images,
'feed_y_a': label,
'feed_y_b': paddle.fluid.layers.gather(label, idx),
'feed_lam': _to_Tensor([lam] * batch_size, 'float32')
}
return feed if is_train else feeds
[pipe], ['data', 'label'], reader_name="Reader")
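The rewritten `dali_dataloader` above is now the single entry point that replaces the old `build`/`train`/`val` helpers; it returns a `DALIGenericIterator` whose per-pipeline outputs are keyed `'data'`/`'label'` and whose length is derived from `reader_name='Reader'`. A minimal consumption sketch, assuming a parsed PaddleClas config dict and a single visible GPU:

```python
import paddle

# Build the DALI loader for the "Train" section of the DataLoader config.
loader = dali_dataloader(config["DataLoader"], mode="Train", device="gpu:0", seed=42)

for batch in loader:
    # DALI yields a list with one dict per pipeline; convert to Paddle tensors.
    images = paddle.to_tensor(batch[0]["data"])
    labels = paddle.to_tensor(batch[0]["label"]).reshape([-1, 1]).astype("int64")
    # ... forward / backward as in the dynamic-graph trainer below ...

loader.reset()  # DALI iterators must be reset before starting the next epoch
```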
......@@ -61,7 +61,8 @@ class CompCars(Dataset):
label_path = os.path.join(self._label_root,
l[0].split('.')[0] + '.txt')
assert os.path.exists(label_path)
bbox = open(label_path).readlines()[-1].strip().split()
with open(label_path) as f:
bbox = f.readlines()[-1].strip().split()
bbox = [int(x) for x in bbox]
self.images.append(os.path.join(self._img_root, l[0]))
self.labels.append(int(l[1]))
......
......@@ -197,14 +197,26 @@ class NormalizeImage(object):
""" normalize image such as substract mean, divide std
"""
def __init__(self, scale=None, mean=None, std=None, order='chw'):
def __init__(self,
scale=None,
mean=None,
std=None,
order='chw',
output_fp16=False,
channel_num=3):
if isinstance(scale, str):
scale = eval(scale)
assert channel_num in [
3, 4
], "channel number of input image should be set to 3 or 4."
self.channel_num = channel_num
self.output_dtype = 'float16' if output_fp16 else 'float32'
self.scale = np.float32(scale if scale is not None else 1.0 / 255.0)
self.order = order
mean = mean if mean is not None else [0.485, 0.456, 0.406]
std = std if std is not None else [0.229, 0.224, 0.225]
shape = (3, 1, 1) if order == 'chw' else (1, 1, 3)
shape = (3, 1, 1) if self.order == 'chw' else (1, 1, 3)
self.mean = np.array(mean).reshape(shape).astype('float32')
self.std = np.array(std).reshape(shape).astype('float32')
......@@ -215,7 +227,20 @@ class NormalizeImage(object):
assert isinstance(img,
np.ndarray), "invalid input 'img' in NormalizeImage"
return (img.astype('float32') * self.scale - self.mean) / self.std
img = (img.astype('float32') * self.scale - self.mean) / self.std
if self.channel_num == 4:
img_h = img.shape[1] if self.order == 'chw' else img.shape[0]
img_w = img.shape[2] if self.order == 'chw' else img.shape[1]
pad_zeros = np.zeros(
(1, img_h, img_w)) if self.order == 'chw' else np.zeros(
(img_h, img_w, 1))
img = (np.concatenate(
(img, pad_zeros), axis=0)
if self.order == 'chw' else np.concatenate(
(img, pad_zeros), axis=2))
return img.astype(self.output_dtype)
class ToCHWImage(object):
......
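The new `channel_num == 4` branch above zero-pads a fourth channel onto the normalized image; together with `pad_output=True` in the DALI pipelines, this presumably aligns fp16 inputs with 4-channel-friendly GPU layouts. A standalone numpy sketch of the same padding:

```python
import numpy as np

img = np.random.rand(3, 224, 224).astype("float32")   # normalized CHW image
pad_zeros = np.zeros((1, img.shape[1], img.shape[2]), dtype="float32")
img4 = np.concatenate((img, pad_zeros), axis=0)       # shape: (4, 224, 224)
assert img4.shape == (4, 224, 224)
```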
......@@ -17,10 +17,12 @@ from __future__ import print_function
import os
import sys
import numpy as np
__dir__ = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.abspath(os.path.join(__dir__, '../../')))
import time
import platform
import datetime
import argparse
import paddle
......@@ -39,7 +41,7 @@ from ppcls.arch import apply_to_static
from ppcls.loss import build_loss
from ppcls.metric import build_metrics
from ppcls.optimizer import build_optimizer
from ppcls.utils.save_load import load_dygraph_pretrain
from ppcls.utils.save_load import load_dygraph_pretrain, load_dygraph_pretrain_from_url
from ppcls.utils.save_load import init_model
from ppcls.utils import save_load
......@@ -77,8 +79,12 @@ class Trainer(object):
apply_to_static(self.config, self.model)
if self.config["Global"]["pretrained_model"] is not None:
load_dygraph_pretrain(self.model,
self.config["Global"]["pretrained_model"])
if self.config["Global"]["pretrained_model"].startswith("http"):
load_dygraph_pretrain_from_url(
self.model, self.config["Global"]["pretrained_model"])
else:
load_dygraph_pretrain(
self.model, self.config["Global"]["pretrained_model"])
if self.config["Global"]["distributed"]:
self.model = paddle.DataParallel(self.model)
......@@ -98,10 +104,25 @@ class Trainer(object):
self.query_dataloader = None
self.eval_mode = self.config["Global"].get("eval_mode",
"classification")
self.amp = True if "AMP" in self.config else False
if self.amp and self.config["AMP"] is not None:
self.scale_loss = self.config["AMP"].get("scale_loss", 1.0)
self.use_dynamic_loss_scaling = self.config["AMP"].get(
"use_dynamic_loss_scaling", False)
else:
self.scale_loss = 1.0
self.use_dynamic_loss_scaling = False
if self.amp:
AMP_RELATED_FLAGS_SETTING = {
'FLAGS_cudnn_batchnorm_spatial_persistent': 1,
'FLAGS_max_inplace_grad_add': 8,
}
paddle.fluid.set_flags(AMP_RELATED_FLAGS_SETTING)
self.train_loss_func = None
self.eval_loss_func = None
self.train_metric_func = None
self.eval_metric_func = None
self.use_dali = self.config['Global'].get("use_dali", False)
def train(self):
# build train loss and metric info
......@@ -116,8 +137,8 @@ class Trainer(object):
self.train_metric_func = build_metrics(metric_config)
if self.train_dataloader is None:
self.train_dataloader = build_dataloader(self.config["DataLoader"],
"Train", self.device)
self.train_dataloader = build_dataloader(
self.config["DataLoader"], "Train", self.device, self.use_dali)
step_each_epoch = len(self.train_dataloader)
......@@ -151,26 +172,51 @@ class Trainer(object):
if metric_info is not None:
best_metric.update(metric_info)
# for amp training
if self.amp:
scaler = paddle.amp.GradScaler(
init_loss_scaling=self.scale_loss,
use_dynamic_loss_scaling=self.use_dynamic_loss_scaling)
tic = time.time()
max_iter = len(self.train_dataloader) - 1 if platform.system(
) == "Windows" else len(self.train_dataloader)
for epoch_id in range(best_metric["epoch"] + 1,
self.config["Global"]["epochs"] + 1):
acc = 0.0
for iter_id, batch in enumerate(self.train_dataloader()):
train_dataloader = self.train_dataloader if self.use_dali else self.train_dataloader(
)
for iter_id, batch in enumerate(train_dataloader):
if iter_id >= max_iter:
break
if iter_id == 5:
for key in time_info:
time_info[key].reset()
time_info["reader_cost"].update(time.time() - tic)
if self.use_dali:
batch = [
paddle.to_tensor(batch[0]['data']),
paddle.to_tensor(batch[0]['label'])
]
batch_size = batch[0].shape[0]
batch[1] = batch[1].reshape([-1, 1]).astype("int64")
global_step += 1
# image input
if not self.is_rec:
out = self.model(batch[0])
if self.amp:
with paddle.amp.auto_cast(custom_black_list={
"flatten_contiguous_range", "greater_than"
}):
out = self.forward(batch)
loss_dict = self.train_loss_func(out, batch[1])
else:
out = self.model(batch[0], batch[1])
out = self.forward(batch)
# calc loss
if self.config["DataLoader"]["Train"]["dataset"].get(
"batch_transform_ops", None):
loss_dict = self.train_loss_func(out, batch[1:])
else:
loss_dict = self.train_loss_func(out, batch[1])
for key in loss_dict:
......@@ -188,6 +234,11 @@ class Trainer(object):
batch_size)
# step opt and lr
if self.amp:
scaled = scaler.scale(loss_dict["loss"])
scaled.backward()
scaler.minimize(optimizer, scaled)
else:
loss_dict["loss"].backward()
optimizer.step()
optimizer.clear_grad()
......@@ -232,7 +283,8 @@ class Trainer(object):
step=global_step,
writer=self.vdl_writer)
tic = time.time()
if self.use_dali:
self.train_dataloader.reset()
metric_msg = ", ".join([
"{}: {:.5f}".format(key, output_info[key].avg)
for key in output_info
......@@ -302,7 +354,8 @@ class Trainer(object):
if self.eval_mode == "classification":
if self.eval_dataloader is None:
self.eval_dataloader = build_dataloader(
self.config["DataLoader"], "Eval", self.device)
self.config["DataLoader"], "Eval", self.device,
self.use_dali)
if self.eval_metric_func is None:
metric_config = self.config.get("Metric")
......@@ -316,11 +369,13 @@ class Trainer(object):
elif self.eval_mode == "retrieval":
if self.gallery_dataloader is None:
self.gallery_dataloader = build_dataloader(
self.config["DataLoader"]["Eval"], "Gallery", self.device)
self.config["DataLoader"]["Eval"], "Gallery", self.device,
self.use_dali)
if self.query_dataloader is None:
self.query_dataloader = build_dataloader(
self.config["DataLoader"]["Eval"], "Query", self.device)
self.config["DataLoader"]["Eval"], "Query", self.device,
self.use_dali)
# build metric info
if self.eval_metric_func is None:
metric_config = self.config.get("Metric", None)
......@@ -336,6 +391,13 @@ class Trainer(object):
self.model.train()
return eval_result
def forward(self, batch):
if not self.is_rec:
out = self.model(batch[0])
else:
out = self.model(batch[0], batch[1])
return out
@paddle.no_grad()
def eval_cls(self, epoch_id=0):
output_info = dict()
......@@ -349,20 +411,27 @@ class Trainer(object):
metric_key = None
tic = time.time()
for iter_id, batch in enumerate(self.eval_dataloader()):
eval_dataloader = self.eval_dataloader if self.use_dali else self.eval_dataloader(
)
max_iter = len(self.eval_dataloader) - 1 if platform.system(
) == "Windows" else len(self.eval_dataloader)
for iter_id, batch in enumerate(eval_dataloader):
if iter_id >= max_iter:
break
if iter_id == 5:
for key in time_info:
time_info[key].reset()
if self.use_dali:
batch = [
paddle.to_tensor(batch[0]['data']),
paddle.to_tensor(batch[0]['label'])
]
time_info["reader_cost"].update(time.time() - tic)
batch_size = batch[0].shape[0]
batch[0] = paddle.to_tensor(batch[0]).astype("float32")
batch[1] = batch[1].reshape([-1, 1]).astype("int64")
# image input
if self.is_rec:
out = self.model(batch[0], batch[1])
else:
out = self.model(batch[0])
out = self.forward(batch)
# calc loss
if self.eval_loss_func is not None:
loss_dict = self.eval_loss_func(out, batch[-1])
......@@ -410,7 +479,8 @@ class Trainer(object):
len(self.eval_dataloader), metric_msg, time_msg, ips_msg))
tic = time.time()
if self.use_dali:
self.eval_dataloader.reset()
metric_msg = ", ".join([
"{}: {:.5f}".format(key, output_info[key].avg)
for key in output_info
......@@ -425,7 +495,6 @@ class Trainer(object):
def eval_retrieval(self, epoch_id=0):
self.model.eval()
cum_similarity_matrix = None
# step1. build gallery
gallery_feas, gallery_img_id, gallery_unique_id = self._cal_feature(
name='gallery')
......@@ -498,12 +567,22 @@ class Trainer(object):
raise RuntimeError("Only support gallery or query dataset")
has_unique_id = False
for idx, batch in enumerate(dataloader(
)): # load is very time-consuming
max_iter = len(dataloader) - 1 if platform.system(
) == "Windows" else len(dataloader)
dataloader_tmp = dataloader if self.use_dali else dataloader()
for idx, batch in enumerate(
dataloader_tmp): # load is very time-consuming
if idx >= max_iter:
break
if idx % self.config["Global"]["print_batch_step"] == 0:
logger.info(
f"{name} feature calculation process: [{idx}/{len(dataloader)}]"
)
if self.use_dali:
batch = [
paddle.to_tensor(batch[0]['data']),
paddle.to_tensor(batch[0]['label'])
]
batch = [paddle.to_tensor(x) for x in batch]
batch[1] = batch[1].reshape([-1, 1]).astype("int64")
if len(batch) == 3:
......@@ -529,7 +608,8 @@ class Trainer(object):
all_image_id = paddle.concat([all_image_id, batch[1]])
if has_unique_id:
all_unique_id = paddle.concat([all_unique_id, batch[2]])
if self.use_dali:
dataloader_tmp.reset()
if paddle.distributed.get_world_size() > 1:
feat_list = []
img_id_list = []
......
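The trainer changes above wire three AMP pieces together: `paddle.amp.GradScaler` for loss scaling, `paddle.amp.auto_cast` around the forward pass, and `scaler.minimize` in place of a plain `optimizer.step()`. A minimal self-contained sketch of that pattern (model, data, and hyper-parameters are placeholders, not taken from this diff):

```python
import paddle
import paddle.nn.functional as F

model = paddle.nn.Linear(10, 2)
opt = paddle.optimizer.Momentum(learning_rate=0.1, parameters=model.parameters())
scaler = paddle.amp.GradScaler(init_loss_scaling=1024,
                               use_dynamic_loss_scaling=True)

x = paddle.randn([8, 10])
label = paddle.randint(0, 2, [8, 1])

with paddle.amp.auto_cast(custom_black_list={"flatten_contiguous_range",
                                             "greater_than"}):
    loss = F.cross_entropy(model(x), label)

scaled = scaler.scale(loss)   # scale the loss to avoid fp16 gradient underflow
scaled.backward()
scaler.minimize(opt, scaled)  # unscale gradients, then step the optimizer
opt.clear_grad()
```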
......@@ -4,7 +4,7 @@ import paddle
import paddle.nn as nn
from ppcls.utils import logger
from .celoss import CELoss
from .celoss import CELoss, MixCELoss
from .googlenetloss import GoogLeNetLoss
from .centerloss import CenterLoss
from .emlloss import EmlLoss
......@@ -30,7 +30,6 @@ class CombinedLoss(nn.Layer):
assert isinstance(config_list, list), (
'operator config should be a list')
for config in config_list:
print(config)
assert isinstance(config,
dict) and len(config) == 1, "yaml format error"
name = list(config)[0]
......
......@@ -18,6 +18,10 @@ import paddle.nn.functional as F
class CELoss(nn.Layer):
"""
Cross entropy loss
"""
def __init__(self, epsilon=None):
super().__init__()
if epsilon is not None and (epsilon <= 0 or epsilon >= 1):
......@@ -50,3 +54,21 @@ class CELoss(nn.Layer):
loss = F.cross_entropy(x, label=label, soft_label=soft_label)
loss = loss.mean()
return {"CELoss": loss}
class MixCELoss(CELoss):
"""
Cross entropy loss with mix (mixup, cutmix, fmix)
"""
def __init__(self, epsilon=None):
super().__init__()
self.epsilon = epsilon
def __call__(self, input, batch):
target0, target1, lam = batch
loss0 = super().forward(input, target0)["CELoss"]
loss1 = super().forward(input, target1)["CELoss"]
loss = lam * loss0 + (1.0 - lam) * loss1
loss = paddle.mean(loss)
return {"MixCELoss": loss}
......@@ -41,7 +41,7 @@ def build_lr_scheduler(lr_config, epochs, step_each_epoch):
return lr
def build_optimizer(config, epochs, step_each_epoch, parameters):
def build_optimizer(config, epochs, step_each_epoch, parameters=None):
config = copy.deepcopy(config)
# step1 build lr
lr = build_lr_scheduler(config.pop('lr'), epochs, step_each_epoch)
......
......@@ -33,12 +33,14 @@ class Momentum(object):
learning_rate,
momentum,
weight_decay=None,
grad_clip=None):
grad_clip=None,
multi_precision=False):
super(Momentum, self).__init__()
self.learning_rate = learning_rate
self.momentum = momentum
self.weight_decay = weight_decay
self.grad_clip = grad_clip
self.multi_precision = multi_precision
def __call__(self, parameters):
opt = optim.Momentum(
......@@ -46,6 +48,7 @@ class Momentum(object):
momentum=self.momentum,
weight_decay=self.weight_decay,
grad_clip=self.grad_clip,
multi_precision=self.multi_precision,
parameters=parameters)
return opt
......@@ -60,7 +63,8 @@ class Adam(object):
weight_decay=None,
grad_clip=None,
name=None,
lazy_mode=False):
lazy_mode=False,
multi_precision=False):
self.learning_rate = learning_rate
self.beta1 = beta1
self.beta2 = beta2
......@@ -71,6 +75,7 @@ class Adam(object):
self.grad_clip = grad_clip
self.name = name
self.lazy_mode = lazy_mode
self.multi_precision = multi_precision
def __call__(self, parameters):
opt = optim.Adam(
......@@ -82,6 +87,7 @@ class Adam(object):
grad_clip=self.grad_clip,
name=self.name,
lazy_mode=self.lazy_mode,
multi_precision=self.multi_precision,
parameters=parameters)
return opt
......@@ -104,7 +110,8 @@ class RMSProp(object):
rho=0.95,
epsilon=1e-6,
weight_decay=None,
grad_clip=None):
grad_clip=None,
multi_precision=False):
super(RMSProp, self).__init__()
self.learning_rate = learning_rate
self.momentum = momentum
......
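Each wrapper above now forwards a `multi_precision` flag to the underlying `paddle.optimizer` class; when enabled, the optimizer keeps fp32 master copies of fp16 parameters so updates do not lose precision. A short sketch with illustrative values:

```python
import paddle

model = paddle.nn.Linear(4, 4)
opt = paddle.optimizer.Momentum(
    learning_rate=0.1,
    momentum=0.9,
    multi_precision=True,  # keep fp32 master weights for fp16 parameters
    parameters=model.parameters())
```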
#!/usr/bin/env bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
export FLAGS_fraction_of_gpu_memory_to_use=0.80
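# capping Paddle at 0.80 of GPU memory leaves headroom for the DALI workspace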
python3.7 -m paddle.distributed.launch \
--gpus="0,1,2,3" \
tools/static/train.py \
-c ./configs/ResNet/ResNet50.yaml \
-o print_interval=10 \
-o use_dali=True
--gpus="0,1,2,3,4,5,6,7" \
ppcls/static/train.py \
-c ./ppcls/configs/ImageNet/ResNet/ResNet50_fp16.yaml \
-o Global.use_dali=True
......@@ -74,9 +74,7 @@ def load_params(exe, prog, path, ignore_params=None):
raise ValueError("Model pretrain path {} does not "
"exists.".format(path))
logger.info(
logger.coloring('Loading parameters from {}...'.format(path),
'HEADER'))
logger.info("Loading parameters from {}...".format(path))
ignore_set = set()
state = _load_state(path)
......@@ -116,9 +114,7 @@ def init_model(config, program, exe):
checkpoints = config.get('checkpoints')
if checkpoints:
paddle.static.load(program, checkpoints, exe)
logger.info(
logger.coloring("Finish initing model from {}".format(checkpoints),
"HEADER"))
logger.info("Finish initing model from {}".format(checkpoints))
return
pretrained_model = config.get('pretrained_model')
......@@ -127,19 +123,17 @@ def init_model(config, program, exe):
pretrained_model = [pretrained_model]
for pretrain in pretrained_model:
load_params(exe, program, pretrain)
logger.info(
logger.coloring("Finish initing model from {}".format(
pretrained_model), "HEADER"))
logger.info("Finish initing model from {}".format(pretrained_model))
def save_model(program, model_path, epoch_id, prefix='ppcls'):
"""
save model to the target path
"""
if paddle.distributed.get_rank() != 0:
return
model_path = os.path.join(model_path, str(epoch_id))
_mkdir_if_not_exist(model_path)
model_prefix = os.path.join(model_path, prefix)
paddle.static.save(program, model_prefix)
logger.info(
logger.coloring("Already save model in {}".format(model_path),
"HEADER"))
logger.info("Already save model in {}".format(model_path))
......@@ -23,16 +23,16 @@ __dir__ = os.path.dirname(os.path.abspath(__file__))
sys.path.append(__dir__)
sys.path.append(os.path.abspath(os.path.join(__dir__, '../../')))
from sys import version_info
import paddle
from paddle.distributed import fleet
from visualdl import LogWriter
from ppcls.data import Reader
from ppcls.utils.config import get_config
from ppcls.data import build_dataloader
from ppcls.utils.config import get_config, print_config
from ppcls.utils import logger
from tools.static import program
from save_load import init_model, save_model
from ppcls.utils.logger import init_logger
from ppcls.static.save_load import init_model, save_model
from ppcls.static import program
def parse_args():
......@@ -43,11 +43,6 @@ def parse_args():
type=str,
default='configs/ResNet/ResNet50.yaml',
help='config file path')
parser.add_argument(
'--vdl_dir',
type=str,
default=None,
help='VisualDL logging directory for image.')
parser.add_argument(
'-p',
'--profiler_options',
......@@ -66,32 +61,64 @@ def parse_args():
def main(args):
config = get_config(args.config, overrides=args.override, show=True)
if config.get("is_distributed", True):
"""
all configuration of the training paradigm should be in config["Global"]
"""
config = get_config(args.config, overrides=args.override, show=False)
global_config = config["Global"]
mode = "train"
log_file = os.path.join(global_config['output_dir'],
config["Arch"]["name"], f"{mode}.log")
init_logger(name='root', log_file=log_file)
print_config(config)
if global_config.get("is_distributed", True):
fleet.init(is_collective=True)
# assign the place
use_gpu = config.get("use_gpu", True)
# assign the device
use_gpu = global_config.get("use_gpu", True)
# amp related config
if 'AMP' in config:
AMP_RELATED_FLAGS_SETTING = {
'FLAGS_cudnn_exhaustive_search': 1,
'FLAGS_conv_workspace_size_limit': 1500,
'FLAGS_cudnn_batchnorm_spatial_persistent': 1,
'FLAGS_max_inplace_grad_add': 8,
'FLAGS_cudnn_exhaustive_search': "1",
'FLAGS_conv_workspace_size_limit': "1500",
'FLAGS_cudnn_batchnorm_spatial_persistent': "1",
'FLAGS_max_inplace_grad_add': "8",
}
os.environ['FLAGS_cudnn_batchnorm_spatial_persistent'] = '1'
paddle.fluid.set_flags(AMP_RELATED_FLAGS_SETTING)
use_xpu = config.get("use_xpu", False)
for k in AMP_RELATED_FLAGS_SETTING:
os.environ[k] = AMP_RELATED_FLAGS_SETTING[k]
use_xpu = global_config.get("use_xpu", False)
assert (
use_gpu and use_xpu
) is not True, "gpu and xpu can not be true in the same time in static mode!"
if use_gpu:
place = paddle.set_device('gpu')
device = paddle.set_device('gpu')
elif use_xpu:
place = paddle.set_device('xpu')
device = paddle.set_device('xpu')
else:
place = paddle.set_device('cpu')
device = paddle.set_device('cpu')
# visualDL
vdl_writer = None
if global_config["use_visualdl"]:
vdl_dir = os.path.join(global_config["output_dir"], "vdl")
vdl_writer = LogWriter(vdl_dir)
# build dataloader
eval_dataloader = None
use_dali = global_config.get('use_dali', False)
train_dataloader = build_dataloader(
config["DataLoader"], "Train", device=device, use_dali=use_dali)
if global_config["eval_during_train"]:
eval_dataloader = build_dataloader(
config["DataLoader"], "Eval", device=device, use_dali=use_dali)
step_each_epoch = len(train_dataloader)
# startup_prog is used to do some parameter init work,
# and train prog is used to hold the network
......@@ -104,88 +131,70 @@ def main(args):
config,
train_prog,
startup_prog,
step_each_epoch=step_each_epoch,
is_train=True,
is_distributed=config.get("is_distributed", True))
is_distributed=global_config.get("is_distributed", True))
if config.validate:
valid_prog = paddle.static.Program()
valid_fetchs, _, valid_feeds, _ = program.build(
if global_config["eval_during_train"]:
eval_prog = paddle.static.Program()
eval_fetchs, _, eval_feeds, _ = program.build(
config,
valid_prog,
eval_prog,
startup_prog,
is_train=False,
is_distributed=config.get("is_distributed", True))
# clone to prune some content which is irrelevant in valid_prog
valid_prog = valid_prog.clone(for_test=True)
is_distributed=global_config.get("is_distributed", True))
# clone to prune some content which is irrelevant in eval_prog
eval_prog = eval_prog.clone(for_test=True)
# create the "Executor" with the statement of which place
exe = paddle.static.Executor(place)
# create the "Executor" with the statement of which device
exe = paddle.static.Executor(device)
# Parameter initialization
exe.run(startup_prog)
# load pretrained models or checkpoints
init_model(config, train_prog, exe)
init_model(global_config, train_prog, exe)
if 'AMP' in config and config.AMP.get("use_pure_fp16", False):
optimizer.amp_init(
place,
device,
scope=paddle.static.global_scope(),
test_program=valid_prog if config.validate else None)
test_program=eval_prog
if global_config["eval_during_train"] else None)
if not config.get("is_distributed", True):
if not global_config.get("is_distributed", True):
compiled_train_prog = program.compile(
config, train_prog, loss_name=train_fetchs["loss"][0].name)
else:
compiled_train_prog = train_prog
if not config.get('use_dali', False):
train_dataloader = Reader(config, 'train', places=place)()
if config.validate and paddle.distributed.get_rank() == 0:
valid_dataloader = Reader(config, 'valid', places=place)()
compiled_valid_prog = program.compile(config, valid_prog)
else:
assert use_gpu is True, "DALI only support gpu, please set use_gpu to True!"
import dali
train_dataloader = dali.train(config)
if config.validate and paddle.distributed.get_rank() == 0:
valid_dataloader = dali.val(config)
compiled_valid_prog = program.compile(config, valid_prog)
vdl_writer = None
if args.vdl_dir:
if version_info.major == 2:
logger.info(
"visualdl is just supported for python3, so it is disabled in python2..."
)
else:
from visualdl import LogWriter
vdl_writer = LogWriter(args.vdl_dir)
if eval_dataloader is not None:
compiled_eval_prog = program.compile(config, eval_prog)
for epoch_id in range(config.epochs):
for epoch_id in range(global_config["epochs"]):
# 1. train with train dataset
program.run(train_dataloader, exe, compiled_train_prog, train_feeds,
train_fetchs, epoch_id, 'train', config, vdl_writer,
lr_scheduler, args.profiler_options)
if paddle.distributed.get_rank() == 0:
# 2. validate with validate dataset
if config.validate and epoch_id % config.valid_interval == 0:
top1_acc = program.run(valid_dataloader, exe,
compiled_valid_prog, valid_feeds,
valid_fetchs, epoch_id, 'valid', config)
# 2. evaluate with eval dataset
if global_config["eval_during_train"] and epoch_id % global_config[
"eval_interval"] == 0:
top1_acc = program.run(eval_dataloader, exe, compiled_eval_prog,
eval_feeds, eval_fetchs, epoch_id, "eval",
config)
if top1_acc > best_top1_acc:
best_top1_acc = top1_acc
message = "The best top1 acc {:.5f}, in epoch: {:d}".format(
best_top1_acc, epoch_id)
logger.info("{:s}".format(logger.coloring(message, "RED")))
if epoch_id % config.save_interval == 0:
logger.info(message)
if epoch_id % global_config["save_interval"] == 0:
model_path = os.path.join(config.model_save_dir,
config.ARCHITECTURE["name"])
model_path = os.path.join(global_config["output_dir"],
config["Arch"]["name"])
save_model(train_prog, model_path, "best_model")
# 3. save the persistable model
if epoch_id % config.save_interval == 0:
model_path = os.path.join(config.model_save_dir,
config.ARCHITECTURE["name"])
if epoch_id % global_config["save_interval"] == 0:
model_path = os.path.join(global_config["output_dir"],
config["Arch"]["name"])
save_model(train_prog, model_path, epoch_id)
......
......@@ -54,7 +54,7 @@ def load_dygraph_pretrain(model, path=None):
return
def load_dygraph_pretrain_from_url(model, pretrained_url, use_ssld):
def load_dygraph_pretrain_from_url(model, pretrained_url, use_ssld=False):
if use_ssld:
pretrained_url = pretrained_url.replace("_pretrained",
"_ssld_pretrained")
......
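Combined with the trainer change above, `Global.pretrained_model` can now be either a local path or an HTTP(S) URL; the dispatch reduces to the following sketch (the URL is illustrative, not from this diff):

```python
pretrained = "https://example.com/ResNet50_vd_pretrained.pdparams"  # hypothetical

if pretrained.startswith("http"):
    load_dygraph_pretrain_from_url(model, pretrained, use_ssld=False)
else:
    load_dygraph_pretrain(model, pretrained)
```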
===========================train_params===========================
model_name:DarkNet53
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_infer=2|whole_train_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/DarkNet/DarkNet53.yaml
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/DarkNet/DarkNet53.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/DarkNet/DarkNet53.yaml
quant_export:null
fpgm_export:null
distill_export:null
export1:null
export2:null
infer_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/DarkNet53_inference.tar
infer_model:../inference/
infer_export:null
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:True|False
-o Global.cpu_num_threads:1|6
-o Global.batch_size:1
-o Global.use_tensorrt:True|False
-o Global.use_fp16:True|False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val
-o Global.save_log_path:null
-o Global.benchmark:True
null:null
===========================train_params===========================
model_name:HRNet_W18_C
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_infer=2|whole_train_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/HRNet/HRNet_W18_C.yaml
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/HRNet/HRNet_W18_C.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/HRNet/HRNet_W18_C.yaml
quant_export:null
fpgm_export:null
distill_export:null
export1:null
export2:null
inference_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/HRNet_W18_C_inference.tar
infer_model:../inference/
infer_export:null
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:True|False
-o Global.cpu_num_threads:1|6
-o Global.batch_size:1
-o Global.use_tensorrt:True|False
-o Global.use_fp16:True|False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val
-o Global.save_log_path:null
-o Global.benchmark:True
null:null
===========================train_params===========================
model_name:LeViT_128S
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_infer=2|whole_train_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/LeViT/LeViT_128S.yaml
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/LeViT/LeViT_128S.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/LeViT/LeViT_128S.yaml
quant_export:null
fpgm_export:null
distill_export:null
export1:null
export2:null
infer_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/LeViT_128S_inference.tar
infer_model:../inference/
infer_export:null
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:True|False
-o Global.cpu_num_threads:1|6
-o Global.batch_size:1
-o Global.use_tensorrt:True|False
-o Global.use_fp16:True|False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val
-o Global.save_log_path:null
-o Global.benchmark:True
null:null
===========================train_params===========================
model_name:MobileNetV1
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_infer=2|whole_train_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/MobileNetV1/MobileNetV1.yaml
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/MobileNetV1/MobileNetV1.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/MobileNetV1/MobileNetV1.yaml
quant_export:null
fpgm_export:null
distill_export:null
export1:null
export2:null
inference_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/MobileNetV1_inference.tar
infer_model:../inference/
infer_export:null
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:True|False
-o Global.cpu_num_threads:1|6
-o Global.batch_size:1
-o Global.use_tensorrt:True|False
-o Global.use_fp16:True|False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val
-o Global.save_log_path:null
-o Global.benchmark:True
null:null
===========================train_params===========================
model_name:MobileNetV2
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_infer=2|whole_train_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/MobileNetV2/MobileNetV2.yaml
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/MobileNetV2/MobileNetV2.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/MobileNetV2/MobileNetV2.yaml
quant_export:null
fpgm_export:null
distill_export:null
export1:null
export2:null
infer_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/MobileNetV2_inference.tar
infer_model:../inference/
infer_export:null
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:True|False
-o Global.cpu_num_threads:1|6
-o Global.batch_size:1
-o Global.use_tensorrt:True|False
-o Global.use_fp16:True|False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val
-o Global.save_log_path:null
-o Global.benchmark:True
null:null
===========================train_params===========================
model_name:MobileNetV3_large_x1_0
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_infer=2|whole_train_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/MobileNetV3/MobileNetV3_large_x1_0.yaml
pact_train:deploy/slim/slim.py -c ppcls/configs/slim/MobileNetV3_large_x1_0_quantization.yaml
fpgm_train:deploy/slim/slim.py -c ppcls/configs/slim/MobileNetV3_large_x1_0_prune.yaml
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/MobileNetV3/MobileNetV3_large_x1_0.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/MobileNetV3/MobileNetV3_large_x1_0.yaml
quant_export:deploy/slim/slim.py -m export -c ppcls/configs/slim/MobileNetV3_large_x1_0_quantization.yaml
fpgm_export:deploy/slim/slim.py -m export -c ppcls/configs/slim/MobileNetV3_large_x1_0_prune.yaml
distill_export:null
export1:null
export2:null
inference_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/MobileNetV3_large_x1_0_inference.tar
infer_model:../inference/
infer_export:null
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:True|False
-o Global.cpu_num_threads:1|6
-o Global.batch_size:1
-o Global.use_tensorrt:True|False
-o Global.use_fp16:True|False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val
-o Global.save_log_path:null
-o Global.benchmark:True
null:null
===========================train_params===========================
model_name:ResNeXt101_vd_64x4d
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_infer=2|whole_train_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/ResNeXt/ResNeXt101_vd_64x4d.yaml
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/ResNeXt/ResNeXt101_vd_64x4d.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/ResNeXt/ResNeXt101_vd_64x4d.yaml
quant_export:null
fpgm_export:null
distill_export:null
export1:null
export2:null
inference_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/ResNeXt101_64x4d_inference.tar
infer_model:../inference/
infer_export:null
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:True|False
-o Global.cpu_num_threads:1|6
-o Global.batch_size:1
-o Global.use_tensorrt:True|False
-o Global.use_fp16:True|False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val
-o Global.save_log_path:null
-o Global.benchmark:True
null:null
===========================train_params===========================
model_name:ResNet50_vd
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_infer=2|whole_train_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/ResNet/ResNet50_vd.yaml
pact_train:deploy/slim/slim.py -c ppcls/configs/slim/ResNet50_vd_quantization.yaml
fpgm_train:deploy/slim/slim.py -c ppcls/configs/slim/ResNet50_vd_prune.yaml
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/ResNet/ResNet50_vd.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/ResNet/ResNet50_vd.yaml
quant_export:deploy/slim/slim.py -m export -c ppcls/configs/slim/ResNet50_vd_quantization.yaml
fpgm_export:deploy/slim/slim.py -m export -c ppcls/configs/slim/ResNet50_vd_prune.yaml
distill_export:null
export1:null
export2:null
infer_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/ResNet50_vd_inference.tar
infer_model:../inference/
infer_export:null
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:True|False
-o Global.cpu_num_threads:1|6
-o Global.batch_size:1
-o Global.use_tensorrt:True|False
-o Global.use_fp16:True|False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val
-o Global.save_log_path:null
-o Global.benchmark:True
null:null
===========================train_params===========================
model_name:ShuffleNetV2_x1_0
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_infer=2|whole_train_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/ShuffleNet/ShuffleNetV2_x1_0.yaml
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/ShuffleNet/ShuffleNetV2_x1_0.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/ShuffleNet/ShuffleNetV2_x1_0.yaml
quant_export:null
fpgm_export:null
distill_export:null
export1:null
export2:null
inference_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/ShuffleNetV2_x1_0_inference.tar
infer_model:../inference/
infer_export:null
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:True|False
-o Global.cpu_num_threads:1|6
-o Global.batch_size:1
-o Global.use_tensorrt:True|False
-o Global.use_fp16:True|False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val
-o Global.save_log_path:null
-o Global.benchmark:True
null:null
===========================train_params===========================
model_name:SwinTransformer_tiny_patch4_window7_224
python:python3.7
gpu_list:0|0,1
-o Global.device:gpu
-o Global.auto_cast:null
-o Global.epochs:lite_train_infer=2|whole_train_infer=120
-o Global.output_dir:./output/
-o DataLoader.Train.sampler.batch_size:8
-o Global.pretrained_model:null
train_model_name:latest
train_infer_img_dir:./dataset/ILSVRC2012/val
null:null
##
trainer:norm_train
norm_train:tools/train.py -c ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_tiny_patch4_window7_224.yaml
pact_train:null
fpgm_train:null
distill_train:null
null:null
null:null
##
===========================eval_params===========================
eval:tools/eval.py -c ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_tiny_patch4_window7_224.yaml
null:null
##
===========================infer_params==========================
-o Global.save_inference_dir:./inference
-o Global.pretrained_model:
norm_export:tools/export_model.py -c ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_tiny_patch4_window7_224.yaml
quant_export:null
fpgm_export:null
distill_export:null
export1:null
export2:null
inference_model_url:https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/whole_chain/SwinTransformer_tiny_patch4_window7_224_inference.tar
infer_model:../inference/
infer_export:null
infer_quant:False
inference:python/predict_cls.py -c configs/inference_cls.yaml
-o Global.use_gpu:True|False
-o Global.enable_mkldnn:True|False
-o Global.cpu_num_threads:1|6
-o Global.batch_size:1
-o Global.use_tensorrt:True|False
-o Global.use_fp16:True|False
-o Global.inference_model_dir:../inference
-o Global.infer_imgs:../dataset/ILSVRC2012/val
-o Global.save_log_path:null
-o Global.benchmark:True
null:null
#!/bin/bash
FILENAME=$1
# MODE must be one of ['lite_train_infer', 'whole_infer', 'whole_train_infer', 'infer']
MODE=$2
dataline=$(cat ${FILENAME})
# parser params
IFS=$'\n'
lines=(${dataline})
function func_parser_value(){
strs=$1
IFS=":"
array=(${strs})
if [ ${#array[*]} = 2 ]; then
echo ${array[1]}
else
IFS="|"
tmp="${array[1]}:${array[2]}"
echo ${tmp}
fi
}
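# examples, for lines of the params file:
#   "model_name:DarkNet53"               -> "DarkNet53"
#   "infer_model_url:https://host/x.tar" -> "https://host/x.tar" (illustrative URL;
#    its extra colon splits the line into three fields, hence the re-join branch)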
model_name=$(func_parser_value "${lines[1]}")
inference_model_url=$(func_parser_value "${lines[35]}")
if [ ${MODE} = "lite_train_infer" ] || [ ${MODE} = "whole_infer" ];then
# pretrain lite train data
cd dataset
rm -rf ILSVRC2012
wget -nc https://paddle-imagenet-models-name.bj.bcebos.com/data/whole_chain/whole_chain_little_train.tar
tar xf whole_chain_little_train.tar
ln -s whole_chain_little_train ILSVRC2012
cd ILSVRC2012
mv train.txt train_list.txt
mv val.txt val_list.txt
cd ../../
elif [ ${MODE} = "infer" ];then
# download data
cd dataset
rm -rf ILSVRC2012
wget -nc https://paddle-imagenet-models-name.bj.bcebos.com/data/whole_chain/whole_chain_infer.tar
tar xf whole_chain_infer.tar
ln -s whole_chain_infer ILSVRC2012
cd ILSVRC2012
mv val.txt val_list.txt
cd ../../
# download inference model
eval "wget -nc $inference_model_url"
tar xf "${model_name}_inference.tar"
elif [ ${MODE} = "whole_train_infer" ];then
cd dataset
rm -rf ILSVRC2012
wget -nc https://paddle-imagenet-models-name.bj.bcebos.com/data/whole_chain/whole_chain_CIFAR100.tar
tar xf whole_chain_CIFAR100.tar
ln -s whole_chain_CIFAR100 ILSVRC2012
cd ILSVRC2012
mv train.txt train_list.txt
mv val.txt val_list.txt
cd ../../
fi
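# usage sketch: bash prepare.sh <train_params_file> lite_train_infer|whole_infer|whole_train_infer|infer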
#!/bin/bash
FILENAME=$1
# MODE must be one of ['lite_train_infer', 'whole_infer', 'whole_train_infer', 'infer']
MODE=$2
dataline=$(cat ${FILENAME})
# parser params
IFS=$'\n'
lines=(${dataline})
function func_parser_key(){
strs=$1
IFS=":"
array=(${strs})
tmp=${array[0]}
echo ${tmp}
}
function func_parser_value(){
strs=$1
IFS=":"
array=(${strs})
tmp=${array[1]}
echo ${tmp}
}
function func_set_params(){
key=$1
value=$2
if [ ${key} = "null" ];then
echo " "
elif [[ ${value} = "null" ]] || [[ ${value} = " " ]] || [ ${#value} -le 0 ];then
echo " "
else
echo "${key}=${value}"
fi
}
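# examples:
#   func_set_params "-o Global.epochs" "2" -> "-o Global.epochs=2"
#   func_set_params "null" "whatever"      -> " " (the flag is dropped from the command)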
function func_parser_params(){
strs=$1
IFS=":"
array=(${strs})
key=${array[0]}
tmp=${array[1]}
IFS="|"
res=""
for _params in ${tmp[*]}; do
IFS="="
array=(${_params})
mode=${array[0]}
value=${array[1]}
if [[ ${mode} = ${MODE} ]]; then
IFS="|"
#echo $(func_set_params "${mode}" "${value}")
echo $value
break
fi
IFS="|"
done
echo ${res}
}
function status_check(){
last_status=$1 # the exit code
run_command=$2
run_log=$3
if [ $last_status -eq 0 ]; then
echo -e "\033[33m Run successfully with command - ${run_command}! \033[0m" | tee -a ${run_log}
else
echo -e "\033[33m Run failed with command - ${run_command}! \033[0m" | tee -a ${run_log}
fi
}
IFS=$'\n'
# The training params
model_name=$(func_parser_value "${lines[1]}")
python=$(func_parser_value "${lines[2]}")
gpu_list=$(func_parser_value "${lines[3]}")
train_use_gpu_key=$(func_parser_key "${lines[4]}")
train_use_gpu_value=$(func_parser_value "${lines[4]}")
autocast_list=$(func_parser_value "${lines[5]}")
autocast_key=$(func_parser_key "${lines[5]}")
epoch_key=$(func_parser_key "${lines[6]}")
epoch_num=$(func_parser_params "${lines[6]}")
save_model_key=$(func_parser_key "${lines[7]}")
train_batch_key=$(func_parser_key "${lines[8]}")
train_batch_value=$(func_parser_params "${lines[8]}")
pretrain_model_key=$(func_parser_key "${lines[9]}")
pretrain_model_value=$(func_parser_value "${lines[9]}")
train_model_name=$(func_parser_value "${lines[10]}")
train_infer_img_dir=$(func_parser_value "${lines[11]}")
train_param_key1=$(func_parser_key "${lines[12]}")
train_param_value1=$(func_parser_value "${lines[12]}")
trainer_list=$(func_parser_value "${lines[14]}")
trainer_norm=$(func_parser_key "${lines[15]}")
norm_trainer=$(func_parser_value "${lines[15]}")
pact_key=$(func_parser_key "${lines[16]}")
pact_trainer=$(func_parser_value "${lines[16]}")
fpgm_key=$(func_parser_key "${lines[17]}")
fpgm_trainer=$(func_parser_value "${lines[17]}")
distill_key=$(func_parser_key "${lines[18]}")
distill_trainer=$(func_parser_value "${lines[18]}")
trainer_key1=$(func_parser_key "${lines[19]}")
trainer_value1=$(func_parser_value "${lines[19]}")
trainer_key2=$(func_parser_key "${lines[20]}")
trainer_value2=$(func_parser_value "${lines[20]}")
eval_py=$(func_parser_value "${lines[23]}")
eval_key1=$(func_parser_key "${lines[24]}")
eval_value1=$(func_parser_value "${lines[24]}")
save_infer_key=$(func_parser_key "${lines[27]}")
export_weight=$(func_parser_key "${lines[28]}")
norm_export=$(func_parser_value "${lines[29]}")
pact_export=$(func_parser_value "${lines[30]}")
fpgm_export=$(func_parser_value "${lines[31]}")
distill_export=$(func_parser_value "${lines[32]}")
export_key1=$(func_parser_key "${lines[33]}")
export_value1=$(func_parser_value "${lines[33]}")
export_key2=$(func_parser_key "${lines[34]}")
export_value2=$(func_parser_value "${lines[34]}")
# parser inference model
infer_model_dir_list=$(func_parser_value "${lines[36]}")
infer_export_list=$(func_parser_value "${lines[37]}")
infer_is_quant=$(func_parser_value "${lines[38]}")
# parser inference
inference_py=$(func_parser_value "${lines[39]}")
use_gpu_key=$(func_parser_key "${lines[40]}")
use_gpu_list=$(func_parser_value "${lines[40]}")
use_mkldnn_key=$(func_parser_key "${lines[41]}")
use_mkldnn_list=$(func_parser_value "${lines[41]}")
cpu_threads_key=$(func_parser_key "${lines[42]}")
cpu_threads_list=$(func_parser_value "${lines[42]}")
batch_size_key=$(func_parser_key "${lines[43]}")
batch_size_list=$(func_parser_value "${lines[43]}")
use_trt_key=$(func_parser_key "${lines[44]}")
use_trt_list=$(func_parser_value "${lines[44]}")
precision_key=$(func_parser_key "${lines[45]}")
precision_list=$(func_parser_value "${lines[45]}")
infer_model_key=$(func_parser_key "${lines[46]}")
image_dir_key=$(func_parser_key "${lines[47]}")
infer_img_dir=$(func_parser_value "${lines[47]}")
save_log_key=$(func_parser_key "${lines[48]}")
benchmark_key=$(func_parser_key "${lines[49]}")
benchmark_value=$(func_parser_value "${lines[49]}")
infer_key1=$(func_parser_key "${lines[50]}")
infer_value1=$(func_parser_value "${lines[50]}")
LOG_PATH="./tests/output"
mkdir -p ${LOG_PATH}
status_log="${LOG_PATH}/results.log"
function func_inference(){
IFS='|'
_python=$1
_script=$2
_model_dir=$3
_log_path=$4
_img_dir=$5
_flag_quant=$6
# inference
for use_gpu in ${use_gpu_list[*]}; do
if [ ${use_gpu} = "False" ] || [ ${use_gpu} = "cpu" ]; then
for use_mkldnn in ${use_mkldnn_list[*]}; do
if [ ${use_mkldnn} = "False" ] && [ ${_flag_quant} = "True" ]; then
continue
fi
for threads in ${cpu_threads_list[*]}; do
for batch_size in ${batch_size_list[*]}; do
_save_log_path="${_log_path}/infer_cpu_usemkldnn_${use_mkldnn}_threads_${threads}_batchsize_${batch_size}.log"
set_infer_data=$(func_set_params "${image_dir_key}" "${_img_dir}")
set_benchmark=$(func_set_params "${benchmark_key}" "${benchmark_value}")
set_batchsize=$(func_set_params "${batch_size_key}" "${batch_size}")
set_cpu_threads=$(func_set_params "${cpu_threads_key}" "${threads}")
set_model_dir=$(func_set_params "${infer_model_key}" "${_model_dir}")
set_infer_params1=$(func_set_params "${infer_key1}" "${infer_value1}")
command="${_python} ${_script} ${use_gpu_key}=${use_gpu} ${use_mkldnn_key}=${use_mkldnn} ${set_cpu_threads} ${set_model_dir} ${set_batchsize} ${set_infer_data} ${set_benchmark} ${set_infer_params1} > ${_save_log_path} 2>&1 "
eval $command
last_status=${PIPESTATUS[0]}
eval "cat ${_save_log_path}"
status_check $last_status "${command}" "../${status_log}"
done
done
done
elif [ ${use_gpu} = "True" ] || [ ${use_gpu} = "gpu" ]; then
for use_trt in ${use_trt_list[*]}; do
for precision in ${precision_list[*]}; do
if [ ${precision} = "False" ] && [ ${use_trt} = "False" ]; then
continue
fi
if [[ ${use_trt} = "False" || ${precision} =~ "int8" ]] && [ ${_flag_quant} = "True" ]; then
continue
fi
for batch_size in ${batch_size_list[*]}; do
_save_log_path="${_log_path}/infer_gpu_usetrt_${use_trt}_precision_${precision}_batchsize_${batch_size}.log"
set_infer_data=$(func_set_params "${image_dir_key}" "${_img_dir}")
set_benchmark=$(func_set_params "${benchmark_key}" "${benchmark_value}")
set_batchsize=$(func_set_params "${batch_size_key}" "${batch_size}")
set_tensorrt=$(func_set_params "${use_trt_key}" "${use_trt}")
set_precision=$(func_set_params "${precision_key}" "${precision}")
set_model_dir=$(func_set_params "${infer_model_key}" "${_model_dir}")
command="${_python} ${_script} ${use_gpu_key}=${use_gpu} ${set_tensorrt} ${set_precision} ${set_model_dir} ${set_batchsize} ${set_infer_data} ${set_benchmark} > ${_save_log_path} 2>&1 "
eval $command
last_status=${PIPESTATUS[0]}
eval "cat ${_save_log_path}"
status_check $last_status "${command}" "../${status_log}"
done
done
done
else
echo "Does not support hardware other than CPU and GPU Currently!"
fi
done
}
if [ ${MODE} = "infer" ]; then
GPUID=$3
if [ ${#GPUID} -le 0 ];then
env=" "
else
env="export CUDA_VISIBLE_DEVICES=${GPUID}"
fi
# set CUDA_VISIBLE_DEVICES
eval $env
export Count=0
IFS="|"
infer_run_exports=(${infer_export_list})
infer_quant_flag=(${infer_is_quant})
cd deploy
for infer_model in ${infer_model_dir_list[*]}; do
# run export
if [ ${infer_run_exports[Count]} != "null" ];then
set_export_weight=$(func_set_params "${export_weight}" "${infer_model}")
set_save_infer_key=$(func_set_params "${save_infer_key}" "${infer_model}")
export_cmd="${python} ${norm_export} ${set_export_weight} ${set_save_infer_key}"
eval $export_cmd
status_export=$?
# log the export result even when it fails, so failures show up in results.log
status_check $status_export "${export_cmd}" "../${status_log}"
fi
#run inference
is_quant=${infer_quant_flag[Count]}
echo "is_quant: ${is_quant}"
func_inference "${python}" "${inference_py}" "${infer_model}" "../${LOG_PATH}" "${infer_img_dir}" ${is_quant}
Count=$(($Count + 1))
done
cd ..
else
IFS="|"
export Count=0
USE_GPU_KEY=(${train_use_gpu_value})
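# Each gpu_list entry selects a launch mode:
#   "-1"          -> CPU training
#   "0"           -> single GPU
#   "0,1"         -> multi-GPU on one machine (id string up to 15 chars)
#   "ip1,ip2;0,1" -> multi-machine: IPs before the ';', GPU ids after it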
for gpu in ${gpu_list[*]}; do
use_gpu=${USE_GPU_KEY[Count]}
Count=$(($Count + 1))
if [ ${gpu} = "-1" ];then
env=""
elif [ ${#gpu} -le 1 ];then
env="export CUDA_VISIBLE_DEVICES=${gpu}"
eval ${env}
elif [ ${#gpu} -le 15 ];then
IFS=","
array=(${gpu})
env="export CUDA_VISIBLE_DEVICES=${array[0]}"
IFS="|"
else
IFS=";"
array=(${gpu})
ips=${array[0]}
gpu=${array[1]}
IFS="|"
env=" "
fi
for autocast in ${autocast_list[*]}; do
for trainer in ${trainer_list[*]}; do
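# map the trainer key to its train/export commands: PACT quantization, FPGM
# pruning, distillation, two custom slots, or normal training as the default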
flag_quant=False
if [ ${trainer} = ${pact_key} ]; then
run_train=${pact_trainer}
run_export=${pact_export}
flag_quant=True
elif [ ${trainer} = "${fpgm_key}" ]; then
run_train=${fpgm_trainer}
run_export=${fpgm_export}
elif [ ${trainer} = "${distill_key}" ]; then
run_train=${distill_trainer}
run_export=${distill_export}
elif [ ${trainer} = ${trainer_key1} ]; then
run_train=${trainer_value1}
run_export=${export_value1}
elif [[ ${trainer} = ${trainer_key2} ]]; then
run_train=${trainer_value2}
run_export=${export_value2}
else
run_train=${norm_trainer}
run_export=${norm_export}
fi
if [ ${run_train} = "null" ]; then
continue
fi
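# assemble the per-run CLI overrides (autocast, epochs, pretrained weights,
# batch size, custom params, device flag) from the config key/value pairs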
set_autocast=$(func_set_params "${autocast_key}" "${autocast}")
set_epoch=$(func_set_params "${epoch_key}" "${epoch_num}")
set_pretrain=$(func_set_params "${pretrain_model_key}" "${pretrain_model_value}")
set_batchsize=$(func_set_params "${train_batch_key}" "${train_batch_value}")
set_train_params1=$(func_set_params "${train_param_key1}" "${train_param_value1}")
set_use_gpu=$(func_set_params "${train_use_gpu_key}" "${use_gpu}")
save_log="${LOG_PATH}/${trainer}_gpus_${gpu}_autocast_${autocast}"
# load pretrain from norm training if current trainer is pact or fpgm trainer
if [ ${trainer} = ${pact_key} ] || [ ${trainer} = ${fpgm_key} ]; then
set_pretrain="${load_norm_train_model}"
fi
set_save_model=$(func_set_params "${save_model_key}" "${save_log}")
if [ ${#gpu} -le 2 ];then # train with cpu or single gpu
cmd="${python} ${run_train} ${set_use_gpu} ${set_save_model} ${set_epoch} ${set_pretrain} ${set_autocast} ${set_batchsize} ${set_train_params1} "
elif [ ${#gpu} -le 15 ];then # train with multi-gpu
cmd="${python} -m paddle.distributed.launch --gpus=${gpu} ${run_train} ${set_save_model} ${set_epoch} ${set_pretrain} ${set_autocast} ${set_batchsize} ${set_train_params1}"
else # train with multi-machine
cmd="${python} -m paddle.distributed.launch --ips=${ips} --gpus=${gpu} ${run_train} ${set_save_model} ${set_pretrain} ${set_epoch} ${set_autocast} ${set_batchsize} ${set_train_params1}"
fi
# run train
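# clear any stale device mask first; the device choice is passed explicitly
# via ${set_use_gpu} or the --gpus/--ips flags of paddle.distributed.launch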
eval "unset CUDA_VISIBLE_DEVICES"
eval $cmd
status_check $? "${cmd}" "${status_log}"
set_eval_pretrain=$(func_set_params "${pretrain_model_key}" "${save_log}/${model_name}/${train_model_name}")
# save norm trained models to set pretrain for pact training and fpgm training
if [ ${trainer} = ${trainer_norm} ]; then
load_norm_train_model=${set_eval_pretrain}
fi
# run eval
if [ ${eval_py} != "null" ]; then
set_eval_params1=$(func_set_params "${eval_key1}" "${eval_value1}")
eval_cmd="${python} ${eval_py} ${set_eval_pretrain} ${set_use_gpu} ${set_eval_params1}"
eval $eval_cmd
status_check $? "${eval_cmd}" "${status_log}"
fi
# run export model
if [ ${run_export} != "null" ]; then
# run export model
save_infer_path="${save_log}"
set_export_weight=$(func_set_params "${export_weight}" "${save_log}/${model_name}/${train_model_name}")
set_save_infer_key=$(func_set_params "${save_infer_key}" "${save_infer_path}")
export_cmd="${python} ${run_export} ${set_export_weight} ${set_save_infer_key}"
eval $export_cmd
status_check $? "${export_cmd}" "${status_log}"
#run inference
eval $env
save_infer_path="${save_log}"
cd deploy
func_inference "${python}" "${inference_py}" "../${save_infer_path}" "../${LOG_PATH}" "${infer_img_dir}" "${flag_quant}"
cd ..
fi
eval "unset CUDA_VISIBLE_DEVICES"
done # done with: for trainer in ${trainer_list[*]}; do
done # done with: for autocast in ${autocast_list[*]}; do
done # done with: for gpu in ${gpu_list[*]}; do
fi # end if [ ${MODE} = "infer" ]; then
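# Usage sketch (hedged: the script path and the config-file/MODE argument
# positions are assumptions inferred from the parsing above; only $3 = GPUID
# for MODE="infer" is visible in this part of the script):
#   bash tests/test.sh <params_file> <mode> [GPUID]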