Experiments about data augmentation are introduced in detail in this section. If you want to quickly experience these methods, please refer to [**Quick start PaddleClas in 30 minutes**](../../tutorials/quick_start_en.md), which is based on the CIFAR100 dataset. If you want to learn about the related algorithms, please refer to [Data Augmentation Algorithm Introduction](../algorithm_introduction/DataAugmentation_en.md).
## Catalogue
- [1. Configurations](#1)
  - [1.1 AutoAugment](#1.1)
  - [1.2 RandAugment](#1.2)
  - [1.3 TimmAutoAugment](#1.3)
  - [1.4 Cutout](#1.4)
  - [1.5 RandomErasing](#1.5)
  - [1.6 HideAndSeek](#1.6)
  - [1.7 GridMask](#1.7)
  - [1.8 Mixup](#1.8)
  - [1.9 Cutmix](#1.9)
  - [1.10 Use Mixup and Cutmix at the same time](#1.10)
- [2. Start training](#2)
- [3. Matters needing attention](#3)
- [4. Experiments](#4)
<aname="1"></a>
## Configurations
Since hyperparameters differ from different augmentation methods. For better understanding, we list 8 augmentation configuration files in `configs/DataAugment` based on ResNet50. Users can train the model with `tools/run.sh`. The following are 3 of them.
<aname="1.1"></a>
### 1.1 AutoAugment
The configuration of the data augmentation method of `AotoAugment` is as follows. `AutoAugment` is converted on the uint8 data format, so its processing should be placed before the normalization operation (`NormalizeImage`).
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - AutoAugment:
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
```
<aname="1.2"></a>
### 1.2 RandAugment
The configuration of the data augmentation method of `RandAugment` is as follows, where the user needs to specify the parameters `num_layers` and `magnitude`, and the default values are `2` and `5` respectively. `RandAugment` is converted on the uint8 data format, so its processing should be placed before the normalization operation (`NormalizeImage`).
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - RandAugment:
      num_layers: 2
      magnitude: 5
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
```
<aname="1.3"></a>
### 1.3 TimmAutoAugment
The configuration of the data augmentation method of `TimmAutoAugment` is as follows, in which the user needs to specify the parameters `config_str`, `interpolation`, and `img_size`. The default values are `rand-m9-mstd0.5-inc1` and `bicubic. `, `224`. `TimmAutoAugment` is converted on the uint8 data format, so its processing should be placed before the normalization operation (`NormalizeImage`).
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - TimmAutoAugment:
      config_str: rand-m9-mstd0.5-inc1
      interpolation: bicubic
      img_size: 224
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
```
<aname="1.4"></a>
### 1.4 Cutout
The configuration of the data augmentation method of `Cutout` is as follows, where the user needs to specify the parameters `n_holes` and `length`, and the default values are `1` and `112` respectively. Similar to other image cropping data augmentation methods, `Cutout` can operate on data in uint8 format, or on data after normalization (`NormalizeImage`).The demo here is operated after normalization.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - Cutout:
      n_holes: 1
      length: 112
```
<aname="1.5"></a>
### 1.5 RandomErasing
The configuration of the image augmentation method of `RandomErasing` is as follows, where the user needs to specify the parameters `EPSILON`, `sl`, `sh`, `r1`, `attempt`, `use_log_aspect`, `mode`, and the default values They are `0.25`, `0.02`, `1.0/3.0`, `0.3`, `10`, `True`, and `pixel`. Similar to other image cropping data augmentation methods, `RandomErasing` can operate on data in uint8 format, or on data after normalization (`NormalizeImage`).The demo here is operated after normalization.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - RandomErasing:
      EPSILON: 0.25
      sl: 0.02
      sh: 1.0/3.0
      r1: 0.3
      attempt: 10
      use_log_aspect: True
      mode: pixel
```
<aname="1.6"></a>
### 1.6 HideAndSeek
The configuration of the image augmentation method of `HideAndSeek` is as follows. Similar to other image cropping data augmentation methods, `HideAndSeek` can operate on data in uint8 format, or on data after normalization (`NormalizeImage`).The demo here is operated after normalization.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - HideAndSeek:
```
<aname="1.7"></a>
### 1.7 GridMask
The configuration of the image augmentation method of `GridMask` is as follows, where the user needs to specify the parameters `d1`, `d2`, `rotate`, `ratio`, `mode`, and the default values are `96`, `224 respectively `, `1`, `0.5`, `0`. Similar to other image cropping data augmentation methods, `HideAndSeek` can operate on data in uint8 format, or on data after normalization (`GridMask`).The demo here is operated after normalization.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - GridMask:
      d1: 96
      d2: 224
      rotate: 1
      ratio: 0.5
      mode: 0
```
<aname="1.8"></a>
### 1.8 Mixup
The configuration of the data augmentation method of `Mixup` is as follows, where the user needs to specify the parameter `alpha`, and the default value is `0.2`. Similar to other image mixing data augmentation methods, `Mixup` is to perform image mix on the data in each batch after the image is processed, and the mixed images and labels are put into the network for training,
so it operates after image data processing (image transformation, image cropping).
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
batch_transform_ops:
  - MixupOperator:
      alpha: 0.2
```
<aname="1.9"></a>
### 1.9 Cutmix
The configuration of the image augmentation method of `Cutmix` is as follows, where the user needs to specify the parameter `alpha`, and the default value is `0.2`. Similar to other image mixing data augmentation methods, `Mixup` is to perform image mix on the data in each batch after the image is processed, and the mixed images and labels are put into the network for training,
so it operates after image data processing (image transformation, image cropping).
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
batch_transform_ops:
  - CutmixOperator:
      alpha: 0.2
```
<aname="1.10"></a>
### 1.10 Use Mixup and Cutmix at the same time
The configuration for both `Mixup` and `Cutmix` is as follows, in which the user needs to specify an additional parameter `prob`, which controls the probability of different data enhancements, and the default is `0.5`.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
batch_transform_ops:
  - OpSampler:
      MixupOperator:
        alpha: 0.8
        prob: 0.5
      CutmixOperator:
        alpha: 1.0
        prob: 0.5
```
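To illustrate the `prob` semantics, here is a minimal sketch of what a batch-operator sampler does (an illustration only, not the actual PaddleClas `OpSampler` implementation): at most one of the registered batch operators is applied to each batch, chosen according to its probability, and if the probabilities sum to less than 1 the batch is sometimes left unchanged.
```python
import random

def op_sampler(batch, ops_with_prob):
    """Apply at most one batch operator, e.g. [(mixup_op, 0.5), (cutmix_op, 0.5)]."""
    r, cumulative = random.random(), 0.0
    for op, prob in ops_with_prob:
        cumulative += prob
        if r < cumulative:
            return op(batch)
    return batch  # probabilities may sum to < 1: keep the batch unchanged
```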
<aname="2"></a>
## 2. Start training
After you configure the training environment, similar to training other classification tasks, you only need to replace the configuration file in `tools/train.sh` with the configuration file of the corresponding data augmentation method.
<a name="3"></a>
## 3. Matters needing attention
* Because the labels need to be mixed when the images are mixed, the accuracy on the training data cannot be computed, so no training accuracy is printed during training.
* The training data is more difficult with data augmentation, so the training loss may be larger and the training-set accuracy relatively low, but the model gains better generalization ability, so the validation-set accuracy is relatively higher.
* After applying data augmentation, the model may tend to underfit. It is recommended to reduce `l2_decay` for better performance on the validation set.
* Hyperparameters exist in almost all augmentation methods. Here we provide hyperparameters for the ImageNet-1k dataset; users may need to fine-tune them on their own datasets. More training tricks can be found in [**Tricks**](../../../zh_CN/models/Tricks.md).
> If this document is helpful to you, welcome to star our project: [https://github.com/PaddlePaddle/PaddleClas](https://github.com/PaddlePaddle/PaddleClas)
<aname="4"></a>
## 4. Experiments
Based on PaddleClas, Metrics of different augmentation methods on ImageNet1k dataset are as follows.
* In the experiment here, for better comparison, we fixed the l2 decay to 1e-4. To achieve higher accuracy, we recommend trying to use a smaller l2 decay. Combined with data augmentaton, we found that reducing l2 decay from 1e-4 to 7e-5 can bring at least 0.3~0.5% accuracy improvement.
* We have not yet combined different strategies or verified, whch is our future work.
# Data Augmentation
------
## Catalogue
- [1. Introduction to data augmentation](#1)
- [2. Common data augmentation methods](#2)
  - [2.1 Image Transformation](#2.1)
    - [2.1.1 AutoAugment](#2.1.1)
    - [2.1.2 RandAugment](#2.1.2)
    - [2.1.3 TimmAutoAugment](#2.1.3)
  - [2.2 Image Cropping](#2.2)
    - [2.2.1 Cutout](#2.2.1)
    - [2.2.2 RandomErasing](#2.2.2)
    - [2.2.3 HideAndSeek](#2.2.3)
    - [2.2.4 GridMask](#2.2.4)
  - [2.3 Image Mixing](#2.3)
    - [2.3.1 Mixup](#2.3.1)
    - [2.3.2 Cutmix](#2.3.2)
<aname="1"></a>
## 1. Introduction to data augmentation
Data augmentation is a commonly used regularization method in image classification tasks, often used in scenarios with insufficient data or large models. In this chapter, we mainly introduce 8 image augmentation methods beyond the standard ones. Users can apply these methods in their own tasks for better model performance. Under the same conditions, these augmentation methods' performance on the ImageNet-1k dataset is shown as follows.
![](../../../images/image_aug/main_image_aug.png)
<aname="2"></a>
## 2. Common data augmentation methods
Unless otherwise noted, all the examples and experiments in this chapter are based on the ImageNet-1k dataset with the network input image size set to 224.
...
...
PaddleClas integrates all the above data augmentation strategies.
![](../../../images/image_aug/test_baseline.jpeg)
<aname="2.1"></a>
### 2.1 Image Transformation
Transformation means performing some transformations on the image after `RandCrop`. It mainly contains AutoAugment and RandAugment.
<a name="2.1.1"></a>
#### 2.1.1 AutoAugment
Unlike conventional, manually designed image augmentation methods, AutoAugment is an image augmentation scheme suited to a specific dataset, found by a search algorithm in a search space of image augmentation sub-policies. For the ImageNet dataset, the final solution contains 25 sub-policy combinations. Each sub-policy contains two transformations: for each image, a sub-policy combination is randomly selected, and each transformation in the sub-policy is then applied with a certain probability.
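The following is a minimal sketch of this sampling logic. The policy entries and the `apply_transform` helper are hypothetical placeholders, not the real 25-pair ImageNet policy:
```python
import random

# each sub-policy is two (name, probability, magnitude) steps; the real
# ImageNet policy has 25 such pairs -- a toy two-entry policy is shown here
POLICY = [
    [("posterize", 0.4, 8), ("rotate", 0.6, 9)],
    [("solarize", 0.6, 5), ("autocontrast", 0.6, 0)],
]

def auto_augment(img, apply_transform):
    sub_policy = random.choice(POLICY)       # one sub-policy per image
    for name, prob, magnitude in sub_policy:
        if random.random() < prob:           # each step fires with its own prob
            img = apply_transform(img, name, magnitude)
    return img
```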
<a name="2.1.2"></a>
#### 2.1.2 RandAugment
The search method of `AutoAugment` is relatively brute-force, and searching for the optimal strategy for a given dataset directly is very resource-consuming.
In `RandAugment`, the author proposes a random augmentation method. Instead of using a specific probability to determine whether to use a certain sub-strategy, all sub-strategies are selected with the same probability. The experiments in the paper also show that this method performs well even for large models.
In PaddleClas, `RandAugment` is used as follows.
```python
import os

# DecodeImage, ResizeImage, RandAugment and transform are PaddleClas
# preprocessing utilities; image_path is the directory of test images.
size = 224
decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
randaugment_op = RandAugment()

ops = [decode_op, resize_op, randaugment_op]

imgs_dir = image_path
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
```
The images after `RandAugment` are as follows.
![][test_randaugment]
<aname="2.1.3"></a>
#### 2.1.3 TimmAutoAugment
Github open source code address: [https://github.com/rwightman/pytorch-image-models/blob/master/timm/data/auto_augment.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/data/auto_augment.py)
`TimmAutoAugment` is the open-source community's improvement of AutoAugment and RandAugment. In practice it performs better on many vision tasks, and at present most Vision Transformer models are trained with TimmAutoAugment.
<aname="2.2"></a>
### 2.2 Image Cropping
Cropping means performing some transformations on the image after `Transpose`, setting the pixels of the cropped area to a certain constant value. It mainly contains Cutout, RandomErasing, HideAndSeek and GridMask.
...
...
Image cropping methods can be operated before or after normalization; the difference lies in the pixel values that fill the cropped area.
The ideas behind these cropping transformations are similar: they all aim to address the poor generalization of trained models on occluded images, and they differ only in the cropping details.
Cutout is a kind of dropout, but it occludes the input image rather than the feature maps, which makes it more robust to noise. Cutout has two advantages: (1) it can simulate situations where the subject is partially occluded; (2) it pushes the model to make full use of more of the image content for classification, preventing the network from focusing only on the salient regions and thus overfitting.
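A minimal NumPy sketch of the idea, assuming an HWC image array (illustrative only; the real operator lives in PaddleClas's preprocessing ops):
```python
import numpy as np

def cutout(img, n_holes=1, length=112):
    h, w = img.shape[:2]
    for _ in range(n_holes):
        y, x = np.random.randint(h), np.random.randint(w)
        y1, y2 = np.clip([y - length // 2, y + length // 2], 0, h)
        x1, x2 = np.clip([x - length // 2, x + length // 2], 0, w)
        img[y1:y2, x1:x2] = 0  # zero out a square centered at (y, x)
    return img
```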
RandomErasing is similar to Cutout and likewise addresses the poor generalization of trained models on occluded images. The authors also point out in the paper that random erasing is complementary to random horizontal flipping, and they verified its effectiveness on pedestrian re-identification (ReID). Unlike `Cutout`, `RandomErasing` is applied to the image only with a certain probability, and the size and aspect ratio of the generated mask are randomly drawn according to pre-defined hyperparameters.
In PaddleClas, `RandomErasing` is used in the same way as the operators shown above.
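As a rough sketch of what the operator itself does (a simplified version covering only the `pixel` fill mode, assuming a float HWC image; not the actual PaddleClas implementation):
```python
import math
import random

import numpy as np

def random_erasing(img, EPSILON=0.25, sl=0.02, sh=1.0 / 3, r1=0.3, attempt=10):
    if random.random() > EPSILON:      # applied only with probability EPSILON
        return img
    h, w = img.shape[:2]
    for _ in range(attempt):           # retry until the rectangle fits
        area = random.uniform(sl, sh) * h * w
        ratio = random.uniform(r1, 1.0 / r1)
        eh, ew = int(math.sqrt(area * ratio)), int(math.sqrt(area / ratio))
        if 0 < eh < h and 0 < ew < w:
            y, x = random.randint(0, h - eh), random.randint(0, w - ew)
            img[y:y + eh, x:x + ew] = np.random.rand(eh, ew, *img.shape[2:])
            return img
    return img
```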
Mixup is the first solution for image mixing; it is easy to implement and performs well not only on image classification but also on object detection. Mixup is usually carried out within a batch for simplicity, as is `Cutmix`.
`Mixup` is used in PaddleClas at the batch level, in the same way as `Cutmix` below.
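A minimal NumPy sketch of the batch-level mixing, assuming `images` and `labels` are NumPy arrays (illustrative only, not the PaddleClas `MixupOperator`):
```python
import numpy as np

def mixup_batch(images, labels, alpha=0.2):
    lam = np.random.beta(alpha, alpha)          # mixing coefficient
    idx = np.random.permutation(len(images))    # pair each image with another
    mixed = lam * images + (1 - lam) * images[idx]
    return mixed, labels, labels[idx], lam      # loss = lam*CE(y1) + (1-lam)*CE(y2)
```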
Unlike `Mixup`, which directly adds two images together, `Cutmix` randomly cuts an `ROI` out of one image and pastes it onto the corresponding area of another image. The usage of `Cutmix` in PaddleClas is shown below.
```python
import os

# DecodeImage, ResizeImage, ToCHWImage, CutmixOperator and transform are
# PaddleClas preprocessing utilities; image_path is the directory of test images.
size = 224
decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
cutmix_op = CutmixOperator()

ops = [decode_op, resize_op, tochw_op]

imgs_dir = image_path
batch = []
fnames = os.listdir(imgs_dir)
for idx, f in enumerate(fnames):
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    batch.append((img, idx))  # fake label
new_batch = cutmix_op(batch)
```
The images after `Cutmix` are as follows.
![][test_cutmix]
For the practical part of data augmentation, please refer to [Data Augmentation Practice](../advanced_tutorials/DataAugmentation_en.md).
## Data augmentation practice
Experiments about data augmentation are introduced in detail in this section. If you want to quickly experience these methods, please refer to [**Quick start PaddleClas in 30 minutes**](../../tutorials/quick_start_en.md).
## Configurations
Since hyperparameters differ among augmentation methods, we list 8 augmentation configuration files in `configs/DataAugment` based on ResNet50 for better understanding. Users can train the model with `tools/run.sh`. The following are 3 of them.
### RandAugment
Configuration of `RandAugment` is shown as follows. `num_layers` (default: 2) and `magnitude` (default: 5) are its two hyperparameters.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - RandAugment:
      num_layers: 2
      magnitude: 5
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
```
### Cutout
Configuration of `Cutout` is shown as follows. `n_holes` (default: 1) and `length` (default: 112) are its two hyperparameters.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - Cutout:
      n_holes: 1
      length: 112
```
### Mixup
Configuration of `Mixup` is shown as follows. `alpha` (default: 0.2) is the hyperparameter users need to care about. In addition, `use_mix` needs to be set to `True` at the root of the configuration.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
batch_transform_ops:
  - MixupOperator:
      alpha: 0.2
```
## Start training
Users can use the following command to start training; see also `tools/train.sh`.
* Because the labels need to be mixed when the images are mixed, the accuracy on the training data cannot be computed, so no training accuracy is printed during training.
* The training data is more difficult with data augmentation, so the training loss may be larger and the training-set accuracy relatively low, but the model gains better generalization ability, so the validation-set accuracy is relatively higher.
* After applying data augmentation, the model may tend to underfit. It is recommended to reduce `l2_decay` for better performance on the validation set.
* Hyperparameters exist in almost all augmentation methods. Here we provide hyperparameters for the ImageNet-1k dataset; users may need to fine-tune them on their own datasets. More training tricks can be found in [**Tricks**](../../../zh_CN/models/Tricks.md).
## Reference
[1] Cubuk E D, Zoph B, Mane D, et al. Autoaugment: Learning augmentation strategies from data[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2019: 113-123.
[2] Cubuk E D, Zoph B, Shlens J, et al. Randaugment: Practical automated data augmentation with a reduced search space[J]. arXiv preprint arXiv:1909.13719, 2019.
[3] DeVries T, Taylor G W. Improved regularization of convolutional neural networks with cutout[J]. arXiv preprint arXiv:1708.04552, 2017.
[8] Yun S, Han D, Oh S J, et al. Cutmix: Regularization strategy to train strong classifiers with localizable features[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6023-6032.
Image Classification is a fundamental task that classifies the image by semantic information and assigns it to a specific label. Image Classification is the foundation of Computer Vision tasks, such as object detection, image segmentation, object tracking, and behavior analysis. Image Classification enjoys comprehensive applications, including face recognition and smart video analysis in the security and protection field, traffic scenario recognition in the traffic field, image retrieval and electronic photo album classification in the internet industry, and image recognition in the medical industry.
Generally speaking, Image Classification attempts to fully describe the whole image through feature engineering and assign labels with a classifier, so how to extract image features is the essential part. Before deep learning, the most widely adopted classification method was the Bag of Words model. Image classification based on deep learning, however, can learn hierarchical feature descriptions through supervised and unsupervised learning, replacing manual feature selection. In recent years, Convolutional Neural Networks (CNNs) have achieved remarkable performance in the image field: they use pixel information as input to retain as much information as possible, extract features through convolution, and output the classification result directly. This end-to-end method performs well and is widely used.
Image Classification is a basic but important field in computer vision, whose research results have had a lasting impact on the development of computer vision and even deep learning. Image classification has many sub-fields, such as multi-label image classification and fine-grained image classification; here we only give a brief introduction to single-label image classification.
<aname="1"></a>
## 1. Dataset Introduction
<aname="1.1"></a>
### 1.1 ImageNet-1k
The ImageNet project is a large-scale visual database for the research of visual object recognition software. More than 14 million images have been manually annotated to point out the objects in the picture, and at least 1 million images are provided with bounding boxes. ImageNet-1k is a subset of the ImageNet dataset containing 1000 categories; its training set contains 1,281,167 images and its validation set contains 50,000 images. Since 2010, ImageNet has held an annual image classification competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), with ImageNet-1k as its specified dataset. To date, ImageNet-1k has become one of the most significant contributors to the development of computer vision, and numerous initial models of downstream computer vision tasks are trained on it.
<aname="1.2"></a>
### 1.2 CIFAR-10/CIFAR-100
The CIFAR-10 data set consists of 60,000 color images of 10 categories with an image resolution of 32x32, and each category has 6000 images, including 5000 in the training set and 1000 in the validation set. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The CIFAR-100 dataset is an extension of CIFAR-10 and consists of 60,000 color images of 100 classes with an image resolution of 32x32, and each class has 600 images, including 500 in the training set and 100 in the validation set. Researchers can try different algorithms quickly due to their small scale. These two data sets are also commonly used for testing the quality of models in image classification.
<aname="2"></a>
## 2. Image Classification Process
The prepared training data is preprocessed and then fed into the image classification model. The output of the model and the ground-truth labels are used to compute a cross-entropy loss function, which describes the convergence direction of the model. A trained image classification model is obtained by repeatedly traversing all the image data, using an optimizer to perform gradient descent on the final loss function, propagating the gradient information back to the model, and updating the model weights.
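As a minimal sketch of this loop in PaddlePaddle (the `loader` of preprocessed batches and the model choice are illustrative assumptions, not PaddleClas's actual training code):
```python
import paddle

model = paddle.vision.models.resnet50(num_classes=1000)
loss_fn = paddle.nn.CrossEntropyLoss()
optimizer = paddle.optimizer.Momentum(learning_rate=0.1, momentum=0.9,
                                      parameters=model.parameters())

for images, labels in loader:        # preprocessed batches from a DataLoader
    logits = model(images)           # forward pass
    loss = loss_fn(logits, labels)   # cross-entropy against ground truth
    loss.backward()                  # gradients of the loss
    optimizer.step()                 # update the weights
    optimizer.clear_grad()
```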
<aname="2.1"></a>
### 2.1 Data and its Preprocessing
The quality and quantity of data often determine the performance of a model. In image classification, data consists of images and labels. In most cases, labeled data is too scarce to saturate the model. To let the model learn more image features, plenty of image transformation or data augmentation is required before the images enter the model, ensuring the diversity of the input data and hence better generalization. PaddleClas provides standard image transformations for training ImageNet-1k, as well as 8 data augmentation methods. For the related code, please refer to [Data Preprocess](https://github.com/PaddlePaddle/PaddleClas/blob/develop/ppcls/data/preprocess), and for the configuration files, [Data Augmentation Configuration File](https://github.com/PaddlePaddle/PaddleClas/blob/develop/ppcls/configs/ImageNet/DataAugment).
<aname="2.2"></a>
### 2.2 Prepare the Model
After the data is settled, the model often determines the upper limit of the final accuracy. In image classification, classic models emerge endlessly. PaddleClas provides 36 series, a total of 164 ImageNet pre-trained models. For specific accuracy, speed and other indicators, please refer to [Backbone Network Introduction](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/en/models).
<aname="2.3"></a>
### 2.3 Train the Model
After preparing the data and model, you can start training the model and updating the parameters of the model. After many iterations, a trained model can finally be obtained for image classification tasks. The training process of image classification requires a lot of experience and involves the setting of many hyperparameters. PaddleClas provides a series of [training tuning methods](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/en/models/Tricks_en.md), which can help you quickly obtain a high-precision model.
<aname="2.4"></a>
### 2.4 Evaluate the Model
After a model is trained, the evaluation results of the model on the validation set can determine the performance of the model. The evaluation index is generally Top1-Acc or Top5-Acc, and the higher the index, the better the model performance.
<aname="3"></a>
## 3. Main Algorithms Introduction
- LeNet: Yann LeCun et al. first applied convolutional neural networks to image classification tasks in the 1990s and creatively proposed LeNet, which achieved great success in handwritten digit recognition.
- AlexNet: Alex Krizhevsky et al. proposed AlexNet in 2012 and applied it to ImageNet, winning the 2012 ImageNet classification competition and setting off the deep learning boom.
- VGG: Simonyan and Zisserman put forward the VGG network structure in 2014. It stacks smaller convolution kernels throughout the network, achieving better performance on ImageNet classification and providing new ideas for subsequent network designs.
- GoogLeNet: Christian Szegedy et al. presented GoogLeNet in 2014. It uses a multi-branch structure and a global average pooling layer (GAP); while maintaining accuracy, it drastically reduces the model's storage and computation cost. The network won the 2014 ImageNet classification competition.
- ResNet: Kaiming He et al. delivered ResNet in 2015, which increased network depth by introducing a residual module. It reduced the ImageNet classification error rate to 3.6%, surpassing normal human-eye recognition accuracy for the first time.
- DenseNet: Gao Huang et al. proposed DenseNet in 2017. The network designs a more densely connected block and achieves higher performance with fewer parameters.
- EfficientNet: Mingxing Tan et al. introduced EfficientNet in 2019. The network balances its width, its depth, and the resolution of the input image, reaching state-of-the-art results with the same FLOPS and parameter counts.
For more algorithm introduction, please refer to [Algorithm Introduction](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/en/models).
- [1. Choice of Optimizers](#1)
- [2. Choice of Learning Rate and Learning Rate Declining Strategy](#2)
  - [2.1 Concept of Learning Rate](#2.1)
  - [2.2 Learning Rate Decline Strategy](#2.2)
  - [2.3 Warmup Strategy](#2.3)
- [3. Choice of Batch_size](#3)
- [4. Choice of Weight_decay](#4)
- [5. Choice of Label_smoothing](#5)
- [6. Change the Crop Area and Stretch Transformation Degree of the Images for Small Models](#6)
- [7. Use Data Augmentation to Improve Accuracy](#7)
- [8. Determine the Tuning Strategy by Train_acc and Test_acc](#8)
- [9. Improve the Accuracy of Your Own Data Set with Existing Pre-trained Models](#9)
<aname="1"></a>
## 1. Choice of Optimizers
Since the development of deep learning, there have been many researchers working on the optimizer. The purpose of the optimizer is to make the loss function as small as possible, so as to find suitable parameters to complete a certain task. At present, the main optimizers used in model training are SGD, RMSProp, Adam, AdaDelt and so on. The SGD optimizers with momentum is widely used in academia and industry, so most of models we release are trained by SGD optimizer with momentum. But the SGD optimizer with momentum has two disadvantages, one is that the convergence speed is slow, the other is that the initial learning rate is difficult to set, however, if the initial learning rate is set properly and the models are trained in sufficient iterations, the models trained by SGD with momentum can reach higher accuracy compared with the models trained by other optimizers. Some other optimizers with adaptive learning rate such as Adam, RMSProp and so on tent to converge faster, but the final convergence accuracy will be slightly worse. If you want to train a model in faster convergence speed, we recommend you use the optimizers with adaptive learning rate, but if you want to train a model with higher accuracy, we recommend you to use SGD optimizer with momentum.
<a name="2"></a>
## 2. Choice of Learning Rate and Learning Rate Declining Strategy
The choice of learning rate is related to the optimizer, data set and tasks. Here we mainly introduce the learning rate of training ImageNet-1K with momentum + SGD as the optimizer and the choice of learning rate decline.
<a name="2.1"></a>
### 2.1 Concept of Learning Rate
The learning rate is the hyperparameter that controls the learning speed: the lower the learning rate, the slower the loss value changes. Using a low learning rate ensures that no local minimum is missed, but it also means slow convergence, especially when the gradient gets trapped in a plateau region.
<a name="2.2"></a>
### 2.2 Learning Rate Decline Strategy
During training, using a single fixed learning rate does not yield the model with the highest accuracy, so the learning rate should be adjusted as training proceeds. In the early stage, the weights are randomly initialized and the gradients tend to be large, so a relatively large learning rate can be used for faster convergence. In the late stage, the weights are close to the optimal values, which a large learning rate cannot reach, so a relatively small learning rate should be used. Many researchers use the piecewise_decay strategy, a stepwise learning rate decline: for example, when training ResNet50, the initial learning rate is 0.1 and drops to 1/10 every 30 epochs, with 120 epochs in total. Besides piecewise_decay, other schedules such as polynomial_decay, exponential_decay and cosine_decay have been proposed; among them, cosine_decay has become the preferred schedule for improving accuracy, because it needs no extra hyperparameters and is relatively robust. The learning rate curves of cosine_decay and piecewise_decay are shown in the following figures; it is easy to see that cosine_decay keeps a relatively large learning rate throughout training, so its convergence is slower, but its final accuracy is better than that of piecewise_decay.
![](../../images/models/lr_decay.jpeg)
In addition, we can also see from the figures that cosine_decay spends fewer epochs at a small learning rate, which affects the final accuracy. To let cosine_decay work well, it is recommended to use it with a large number of epochs, such as 200.
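A small sketch of the two schedules discussed above, returning the per-epoch learning rate (defaults follow the ResNet50 example):
```python
import math

def piecewise_decay(epoch, base_lr=0.1, step=30):
    return base_lr * 0.1 ** (epoch // step)     # drop to 1/10 every 30 epochs

def cosine_decay(epoch, base_lr=0.1, total_epochs=120):
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))
```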
<a name="2.3"></a>
### 2.3 Warmup Strategy
If a large batch_size is used to train a neural network, we recommend the warmup strategy. As the name suggests, warmup lets the model warm up first: instead of using the initial learning rate directly at the beginning of training, the learning rate is gradually increased until it reaches the initial learning rate, after which the chosen decay schedule takes over. Experiments show that warmup can improve accuracy when the batch size is large. For models trained with a large batch_size, such as MobileNetV3, we set warmup to 5 epochs by default: the learning rate increases from 0 to the initial learning rate during the first 5 epochs, and then the learning rate decay begins.
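A sketch of linear warmup handing over to cosine decay (5 warmup epochs, as in the MobileNetV3 example; an illustration, not the exact PaddleClas scheduler):
```python
import math

def warmup_cosine(epoch, base_lr=0.1, warmup_epochs=5, total_epochs=120):
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs            # ramp up from 0
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```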
<a name="3"></a>
## 3. Choice of Batch_size
Batch_size is an important hyperparameter in training neural networks: it determines how many samples are fed to the network for training at a time. In paper [1], the authors found experimentally that when batch_size and learning rate scale linearly together, the convergence accuracy is hardly affected. When training on ImageNet, an initial learning rate of 0.1 with batch_size 256 is commonly chosen, so depending on the actual model size and memory, you can set the learning rate to 0.1\*k and batch_size to 256\*k.
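In code, this linear scaling rule is simply:
```python
def scale_hyperparams(k, base_lr=0.1, base_batch_size=256):
    # batch_size and learning rate scale together by the same factor k
    return base_lr * k, base_batch_size * k

lr, batch_size = scale_hyperparams(2)  # 0.2 and 512
```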
<a name="4"></a>
## 4. Choice of Weight_decay
Overfitting is a common term in machine learning: the model performs well on the training data but poorly on the test data. Convolutional neural networks also suffer from overfitting, and many regularization methods have been proposed to avoid it; weight_decay is one of the most widely used. Weight_decay adds an L2 penalty on the weights to the loss function; with L2 regularization, the network tends to choose smaller weights, the parameters of the entire network tend toward 0, and the generalization performance of the model improves accordingly. In deep learning frameworks, the coefficient of L2 regularization is called L2_decay in Paddle, so that name is used below. The larger the coefficient, the more the model tends to underfit. When training on ImageNet, this parameter is set to 1e-4 in most networks. In some small networks such as the MobileNet series, the value is set to 1e-5 ~ 4e-5 to avoid underfitting. The setting is also related to the dataset: when the dataset is large, the network tends to underfit and the value can be reduced appropriately; when the dataset is small, the network tends to overfit and the value can be increased appropriately. The following table shows the accuracy of MobileNetV1_x0_25 with different l2_decay values on ImageNet-1k. Since MobileNetV1_x0_25 is a relatively small network, a large l2_decay makes it underfit, so 3e-5 is a better choice than 1e-4 for this network.
| Model | L2_decay | Train acc1/acc5 | Test acc1/acc5 |
...
...
In addition, the setting of L2_decay is also related to whether other regularization methods are used.
In summary, l2_decay can be adjusted according to the specific task and model: a larger l2_decay is usually recommended for simple tasks or larger models, and a smaller l2_decay for complex tasks or smaller models.
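Conceptually, L2_decay adds a quadratic penalty on the weights to the task loss; a small NumPy sketch:
```python
import numpy as np

def loss_with_l2(task_loss, weights, l2_decay=1e-4):
    # larger l2_decay pushes weights toward 0 and the model toward underfitting
    penalty = l2_decay * sum((np.asarray(w) ** 2).sum() for w in weights)
    return task_loss + penalty
```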
<a name="5"></a>
## 5. Choice of Label_smoothing
Label_smoothing is a regularization method in deep learning, whose full name is Label Smoothing Regularization (LSR). In a traditional classification task, the loss is the cross-entropy between the one-hot ground-truth label and the network output. Label smoothing turns the hard one-hot label into a soft label: the network no longer learns from hard labels but from soft labels with probability values, where the position corresponding to the true category has the largest probability and the other positions have very small values (see paper [2] for the exact formulation). Label smoothing has a parameter epsilon describing the degree of softening: the larger epsilon, the smaller the peak probability and the smoother the label; conversely, the label tends toward a hard label. When training on ImageNet-1k, the parameter is usually set to 0.1. In experiments training ResNet50, the accuracy with label_smoothing is higher than without it; the following table shows the performance of ResNet50_vd with and without label smoothing.
| Model | Use_label_smoothing | Test acc1 |
...
...
However, because label smoothing can be regarded as a regularization method, its gains are limited or even negative on relatively small models.
In summary, using label_smoothing on larger models can effectively improve accuracy, while using it on smaller models may reduce accuracy. Before deciding whether to use label_smoothing, evaluate the size of the model and the difficulty of the task.
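A sketch of how LSR softens a one-hot label with epsilon:
```python
import numpy as np

def smooth_label(label, num_classes, epsilon=0.1):
    soft = np.full(num_classes, epsilon / num_classes)
    soft[label] += 1.0 - epsilon   # true class keeps most of the mass
    return soft

print(smooth_label(2, 5))  # [0.02 0.02 0.92 0.02 0.02]
```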
<a name="6"></a>
## 6. Change the Crop Area and Stretch Transformation Degree of the Images for Small Models
In the standard preprocessing of ImageNet-1k data, the random_crop function defines two values, scale and ratio, which respectively determine the size of the crop and the degree of image stretching. The default range of scale is 0.08-1 (lower_scale-upper_scale); the default range of ratio is 3/4-4/3 (lower_ratio-upper_ratio). For small networks, such strong data augmentation makes the network underfit, lowering the accuracy. To improve accuracy, the data augmentation can be weakened, i.e. the crop area can be enlarged or the stretching weakened, by increasing lower_scale or narrowing the gap between lower_ratio and upper_ratio. The following table lists the accuracy of MobileNetV2_x0_25 trained with different lower_scale values; both training and validation accuracy improve after the crop area is enlarged.
| Model | Scale Range | Train_acc1/acc5 | Test_acc1/acc5 |
...
...
<a name="7"></a>
## 7. Use Data Augmentation to Improve Accuracy
In general, the size of the dataset is critical to the performance, but image annotation is often expensive, so the number of annotated images is often scarce. In this case, data augmentation is particularly important. In the standard data augmentation for training on ImageNet-1k, two methods, random_crop and random_flip, are mainly used. However, in recent years, more and more data augmentation methods have been proposed, such as Cutout, Mixup, Cutmix, AutoAugment, etc. Experiments show that these methods can effectively improve model accuracy. The following table lists the performance of ResNet50 with 8 different data augmentation methods. Compared with the baseline, all of them improve the accuracy of ResNet50, and Cutmix is currently the most effective. More details can be seen in [**Data Augmentation**](https://paddleclas.readthedocs.io/zh_CN/latest/advanced_tutorials/image_augmentation/ImageAugment.html).
| Model | Data Augmentation | Test top-1 |
...
...
| ResNet50 | Random-Erasing | 77.91% |
| ResNet50 | Hide-and-Seek | 77.43% |
<a name="8"></a>
## 8. Determine the Tuning Strategy by Train_acc and Test_acc
During training, the training-set and validation-set accuracies of each epoch are usually printed. Generally, training accuracy slightly higher than or equal to validation accuracy indicates a good state. If the training accuracy is much higher than the validation accuracy, the task is overfitting and more regularization is needed: increase L2_decay, use more data augmentation, use label smoothing, and so on. If the training accuracy is lower than the validation accuracy, the task is underfitting and the opposite applies: decrease L2_decay, use less data augmentation, increase the crop area, weaken image stretching, remove label_smoothing, etc.
<a name="9"></a>
## 9. Improve the Accuracy of Your Own Data Set with Existing Pre-trained Models
In the field of computer vision, loading pre-trained models to train one's own task has become common. Compared with training from random initialization, loading a pre-trained model usually improves accuracy on the specific task. In general, the widely used pre-trained models are obtained from the ImageNet-1k dataset; their fc-layer weight is a k\*1000 matrix, where k is the number of input neurons, and the fc-layer weights need not be loaded because the tasks differ. Regarding the learning rate: if your training dataset is particularly small (e.g. fewer than 1,000 samples), we recommend a small initial learning rate such as 0.001 (batch_size 256, same below), to avoid destroying the pre-trained weights with a large learning rate; if your dataset is relatively large (more than 100,000 samples), we suggest trying a larger initial learning rate, such as 0.01 or greater.
This tutorial is mainly for new users, that is, users who are in the introductory stage of deep learning theory, know some Python syntax, and can read simple code. This content mainly covers using PaddleClas for image classification network training and model prediction.
---
## Catalogue
- [1. Basic knowledge](#1)
- [2. Environmental installation and configuration](#2)
- [3. Data preparation and processing](#3)
- [4. Model training](#4)
  - [4.1 Use CPU for model training](#4.1)
    - [4.1.1 Training without using pre-trained models](#4.1.1)
    - [4.1.2 Use pre-trained models for training](#4.1.2)
  - [4.2 Use GPU for model training](#4.2)
    - [4.2.1 Training without using pre-trained models](#4.2.1)
    - [4.2.2 Use pre-trained models for training](#4.2.2)
- [5. Model prediction](#5)
<aname="1"></a>
## 1. Basic knowledge
Image classification is a pattern classification problem and the most basic task in computer vision; its goal is to classify different images into different categories. Below we briefly explain some concepts you need to understand during model training; we hope they are helpful to first-time users of PaddleClas:
- train/val/test dataset represents training set, validation set and test set respectively:
- Training dataset: used to train the model so that the model can recognize different types of features;
- Validation set (val dataset): the test set during the training process, which is convenient for checking the status of the model during the training process;
- Test dataset: After training the model, the test dataset is used to evaluate the results of the model.
- Pre-trained model
Using a pre-trained model trained on a larger dataset, that is, the weights of the parameters are preset, can help the model converge faster on the new dataset. Especially for some tasks with scarce training data, when the neural network parameters are very large, we may not be able to fully train the model with a small amount of training data. The method of loading the pre-trained model can be thought of as allowing the model to learn based on a better initial weight, so as to achieve better performance.
- epoch
The total number of training epochs of the model. The model passes through all the samples in the training set once, which is an epoch. When the difference between the error rate of the validation set and the error rate of the training set is small, the current number of epochs can be considered appropriate; when the error rate of the validation set first decreases and then becomes larger, it means that the number of epochs is too large and the number of epochs needs to be reduced. Otherwise, the model may overfit the training set.
- Loss Function
During the training process, measure the difference between the model output (predicted value) and the ground truth.
- Accuracy (Acc): the proportion of correctly predicted samples in all samples
  - Top1 Acc: the prediction is judged correct if the class with the highest predicted probability is correct;
  - Top5 Acc: the prediction is judged correct if the ground-truth class is among the top 5 predicted probabilities (see the sketch below);
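A NumPy sketch of these two metrics:
```python
import numpy as np

def topk_acc(probs, labels, k=1):
    # probs: (N, num_classes) scores; labels: (N,) ground-truth class ids
    topk = np.argsort(probs, axis=1)[:, -k:]        # k highest-scoring classes
    return (topk == labels[:, None]).any(axis=1).mean()

# top1 = topk_acc(probs, labels, k=1); top5 = topk_acc(probs, labels, k=5)
```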
<aname="2"></a>
## 2. Environmental installation and configuration
For specific installation steps, please refer to [Paddle Installation Document](../installation/install_paddle_en.md), [PaddleClas Installation Document](../installation/install_paddleclas_en.md).
<aname="3"></a>
## 3. Data preparation and processing
Enter the PaddleClas directory:
```shell
# linux or mac, $path_to_PaddleClas represents the root directory of PaddleClas, and users need to modify it according to their real directory.
cd $path_to_PaddleClas
```
Enter the `dataset/` directory, download and unzip the flowers102 dataset:
```shell
# linux or mac
cd dataset/
# If you want to download directly from the browser, you can copy the link, visit it, then download and unzip
# windows users can open the PaddleClas root directory and do the same from there
```
If there is no `wget` command or if you are on the Windows operating system, copy the address into a browser to download the file, and unzip it to the directory `PaddleClas/dataset/`.
After the unzip operation is completed, there are three `.txt` files for training and testing under the directory `PaddleClas/dataset/flowers102`: `train_list.txt` (training set, 1020 images), `val_list.txt` (validation set, 1020 images), and `train_extra_list.txt` (larger training set, 7169 images). Each line in these files has the format: **image relative path** **image label_id** (note: separated by a space). There is also a mapping file between label ids and category names: `flowers102_label_list.txt`.
The image files of the flowers102 dataset are stored in the `dataset/flowers102/jpg` directory. Image examples are as follows:
<aname="4"></a>
## 4. Model training
<aname="4.1"></a>
### 4.1 Use CPU for model training
Since the CPU is used for model training, the computation is slow, so ShuffleNetV2_x0_25 is used here as the example: this model has a small computation cost and runs faster on a CPU. But because the model is small, the accuracy of the trained model will also be limited.
<aname="4.1.1"></a>
#### 4.1.1 Training without using pre-trained models
```shell
# If you are using the windows operating system, please enter the root directory of PaddleClas in cmd and execute this command:
```
- The `-c` parameter is to specify the path of the configuration file for training, and the specific hyperparameters for training can be viewed in the `yaml` file
- The `Global.device` parameter in the `yaml` file is set to `cpu`, that is, the CPU is used for training (if not set, this parameter defaults to `gpu`)
- The `epochs` parameter in the `yaml` file is set to 20, indicating that 20 epoch iterations are performed on the entire data set. It is estimated that the training can be completed in about 20 minutes (different CPUs have slightly different training times). At this time, the training model is not sufficient. To improve the accuracy of the training model, please set this parameter to a larger value, such as **40**, the training time will be extended accordingly
- The `-o` parameter can be set to `True` or `False`, or it can be the storage path of the pre-training model. When `True` is selected, the pre-training weights will be automatically downloaded to the local. Note: If it is a pre-training model path, do not add: `.pdparams`
You can compare whether to use the pre-trained model and observe the drop in loss.
<aname="4.2"></a>
### 4.2 Use GPU for model training
Since GPU training is faster and more complex models can be used, take ResNet50_vd as an example. Compared with ShuffleNetV2_x0_25, this model is more computationally intensive, and the accuracy of the trained model will be higher.
First, you must set the environment variables and use the 0th GPU for training:
- For Linux users:
```shell
export CUDA_VISIBLE_DEVICES=0
```
- For Windows users
```shell
set CUDA_VISIBLE_DEVICES=0
```
<aname="4.2.1"></a>
#### 4.2.1 Training without using pre-trained models
**Note**: This training script uses the GPU. If you use the CPU, you can modify it as shown in [4.1 Use CPU for model training](#4.1) above.
The `Top1 Acc` curve of the validation set is shown below. The highest accuracy rate is `0.9402`. After loading the pre-trained model, the accuracy of the flowers102 data set has been greatly improved, and the absolute accuracy has increased by more than 65%.
After the training is completed, the trained model can be used to predict the image category. Take the trained ResNet50_vd model as an example, the prediction code is as follows:
The `-i` parameter can also be the directory of the image file to be tested (`dataset/flowers102/jpg/`). After running successfully, some sample results are as follows:
Here is a quick start tutorial for professional users to use PaddleClas on the Linux operating system. The main content is based on the CIFAR-100 data set. You can quickly experience the training of different models, experience loading different pre-trained models, experience the SSLD knowledge distillation solution, and experience data augmentation. Please refer to [Installation Guide](../installation/install_paddleclas_en.md) to configure the operating environment and clone PaddleClas code.
------
## Catalogue
- [1. Data and model preparation](#1)
  - [1.1 Data preparation](#1.1)
    - [1.1.1 Prepare CIFAR100](#1.1.1)
- [2. Model training](#2)
  - [2.1 Single label training](#2.1)
    - [2.1.1 Training without loading the pre-trained model](#2.1.1)
    - [2.1.2 Transfer learning](#2.1.2)
- [3. Data Augmentation](#3)
  - [3.1 Data augmentation-Mixup](#3.1)
- [4. Knowledge distillation](#4)
- [5. Model evaluation and inference](#5)
  - [5.1 Single-label classification model evaluation and inference](#5.1)
    - [5.1.1 Single-label classification model evaluation](#5.1.1)
    - [5.1.2 Single-label classification model prediction](#5.1.2)
    - [5.1.3 Single-label classification uses inference model for model inference](#5.1.3)
<aname="1"></a>
## 1. Data and model preparation
<aname="1.1"></a>
### 1.1 Data preparation
* Enter the PaddleClas directory.
```
cd path_to_PaddleClas
```
<aname="1.1.1"></a>
#### 1.1.1 Prepare CIFAR100
* Enter the `dataset/` directory, download and unzip the CIFAR100 dataset.
The highest accuracy of the validation set is around 0.415.
<aname="2.1.2"></a>
#### 2.1.2 Transfer learning
* Based on ImageNet1k classification pre-training model ResNet50_vd_pretrained (accuracy rate 79.12%) for fine-tuning, the training script is shown below.
The highest accuracy of the validation set is about 0.718. After loading the pre-trained model, the accuracy of the CIFAR100 data set has been greatly improved, with an absolute accuracy increase of 30%.
* Based on ImageNet1k classification pre-training model ResNet50_vd_ssld_pretrained (accuracy rate of 82.39%) for fine-tuning, the training script is shown below.
On the final CIFAR100 validation set, the top-1 accuracy is 0.73; compared with fine-tuning from the pre-trained model with 79.12% top-1 accuracy, the top-1 accuracy on the new dataset increases by another 1.2%.
* Replace the backbone with MobileNetV3_large_x1_0 for fine-tuning, the training script is shown below.
The highest accuracy of the validation set is about 0.601, which is nearly 12% lower than ResNet50_vd.
<aname="3"></a>
## 3. Data Augmentation
PaddleClas contains many data augmentation methods, such as Mixup, Cutout, RandomErasing, etc. For specific methods, please refer to the [Data augmentation chapter](../algorithm_introduction/DataAugmentation_en.md).
<aname="3.1"></a>
### 3.1 Data augmentation-Mixup
Based on the training method in [Data Augmentation Chapter](../algorithm_introduction/DataAugmentation_en.md) in Section 3.3, combined with Mixup's data augmentation method for training, the specific training script is shown below.
The final accuracy on the CIFAR100 validation set is 0.73; using data augmentation increases the model accuracy by about 1.2% again.
**Note:**
* For other data augmentation configuration files, please refer to the configuration files in `ppcls/configs/ImageNet/DataAugment/`.
* The number of epochs for training CIFAR100 is small, so the accuracy of the validation set may fluctuate by about 1%.
<aname="4"></a>
## 4. Knowledge distillation
PaddleClas includes a self-developed SSLD knowledge distillation scheme. For specific content, please refer to [Knowledge Distillation Chapter](../algorithm_introduction/knowledge_distillation_en.md). This section will try to use knowledge distillation technology to train the MobileNetV3_large_x1_0 model. Here we use the ResNet50_vd model trained in section 2.1.2 as the teacher model for distillation. First, save the ResNet50_vd model trained in section 2.1.2 to the specified directory. The script is as follows.
The model name, teacher model and student model configuration, pre-training address configuration, and freeze_params configuration in the configuration file are as follows, where the two values in `freeze_params_list` represent whether the teacher model and the student model freeze parameter training respectively.
```yaml
Arch:
  name: "DistillationModel"
  # if not null, its lengths should be same as models
  pretrained_list:
  # if not null, its lengths should be same as models
  freeze_params_list:
    - True
    - False
  models:
    - Teacher:
        name: ResNet50_vd
        pretrained: "./pretrained/best_model"
    - Student:
        name: MobileNetV3_large_x1_0
        pretrained: True
```
The loss configuration is as follows, where the training loss is the cross entropy of the output of the student model and the teacher model, and the validation loss is the cross entropy of the output of the student model and the true label.
In the end, the accuracy on the CIFAR100 validation set was 64.4%. Using the teacher model for knowledge distillation, the accuracy of MobileNetV3 increased by 4.3%.
**Note:**
* In the distillation process, the pre-trained model used by the teacher model is the training result on the CIFAR100 dataset, and the student model uses the MobileNetV3_large_x1_0 pre-trained model with an accuracy of 75.32% on the ImageNet1k dataset.
* The distillation process does not need real labels, so more unlabeled data can be used. In practice, you can generate a fake `train_list.txt` from unlabeled data and merge it with the real `train_list.txt`; you can experience this yourself based on your own data.
<aname="5"></a>
## 5. Model evaluation and inference
<aname="5.1"></a>
### 5.1 Single-label classification model evaluation and inference
<aname="5.1.1"></a>
#### 5.1.1 Single-label classification model evaluation
After training the model, you can use the following commands to evaluate the accuracy of the model.
<a name="5.1.2"></a>
#### 5.1.2 Single-label classification model prediction
After the model training is completed, the pre-trained model obtained by the training can be loaded for model prediction. A complete example is provided in `tools/infer.py`, the model prediction can be completed by executing the following command:
<a name="5.1.3"></a>
#### 5.1.3 Single-label classification uses inference model for model inference
We need to export the inference model, PaddlePaddle supports the use of prediction engines for inference. Here, we will introduce how to use the prediction engine for inference:
First, export the trained model to inference model:
* By default, `inference.pdiparams`, `inference.pdmodel` and `inference.pdiparams.info` files will be generated in the `inference` folder.
Use prediction engines for inference:
Enter the deploy directory:
```bash
cd deploy
```
Change the `inference_cls.yaml` file. Since the resolution used for training CIFAR100 is 32x32, the relevant resolution needs to be changed. The image preprocessing in the final configuration file is as follows:
```yaml
PreProcess:
  transform_ops:
    - ResizeImage:
        resize_short: 36
    - CropImage:
        size: 32
    - NormalizeImage:
        scale: 0.00392157
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]
        order: ''
    - ToCHWImage:
```
Execute the command to make predictions. Since the default `class_id_map_file` is the mapping file for the ImageNet dataset, it needs to be set to `None` here.
Experience the training, evaluation, and prediction of multi-label classification based on the [NUS-WIDE-SCENE](https://lms.comp.nus.edu.sg/wp-content/uploads/2019/research/nuswide/NUS-WIDE.html) dataset, a subset of the NUS-WIDE dataset. Please first install PaddlePaddle and PaddleClas; see [Paddle Installation](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/zh_CN/installation) and [PaddleClas installation](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/zh_CN/installation/install_paddleclas.md) for more details.
## Contents
- [1. Data and Model Preparation](#1)
- [2. Model Training](#2)
- [3. Model Evaluation](#3)
- [4. Model Prediction](#4)
- [5. Predictive engine-based Prediction](#5)
  - [5.1 Export inference model](#5.1)
  - [5.2 Predictive engine-based Prediction](#5.2)
<aname="1"></a>
## 1. Data and Model Preparation
- Go to `PaddleClas`.
```
cd path_to_PaddleClas
```
- Create and go to `dataset/NUS-WIDE-SCENE`, download and unzip the NUS-WIDE-SCENE dataset.
Based on the flowers102 dataset, it takes only 30 minutes to experience PaddleClas, including training various backbones and pre-trained models, SSLD distillation, and multiple data augmentations. Please refer to [Installation](install_en.md) to install first.
## Preparation
* Enter the PaddleClas directory.
```
cd path_to_PaddleClas
```
* Enter `dataset/flowers102`, download and decompress flowers102 dataset.
```shell
cd dataset/flowers102
# If you want to download from the browser, you can copy the link and visit it
```
**Note**: If you want to download the pretrained models on Windows environment, you can copy the links to the browser and download.
## Training
* All experiments are running on the NVIDIA® Tesla® V100 single card.
* First of all, use the following command to set visible device.
If you use mac or linux, you can use the following command:
```shell
export CUDA_VISIBLE_DEVICES=0
```
* If you use windows, you can use the following command.
```shell
set CUDA_VISIBLE_DEVICES=0
```
* If you want to train on cpu device, you can modify the field `use_gpu: True` in the config file to `use_gpu: False`, or you can append `-o use_gpu=False` in the training command, which means override the value of `use_gpu` as False.
Compared with the ResNet50_vd pre-trained model, the accuracy decreases by 5%, to 90%. Different architectures deliver different performance; choosing the best-performing model is a task-oriented decision that should consider inference time, storage, heterogeneous devices, etc.
### RandomErasing
Data augmentation works when training data is small.
* The ResNet50_vd model pretrained on previous chapter will be used as the teacher model to train student model. Save the model to specified directory, command as follows:
* Samples in the `extra_list.txt` and `val_list.txt` don't have intersection
* Because the label information is unused in the source code, this is still label-free distillation
* The teacher model uses the model pre-trained on the flowers102 dataset, and the student model uses the MobileNetV3_large_x1_0 model pre-trained on the ImageNet1K dataset (Acc 75.32\%)