From 30080e8071a928e8e19b78669150b4876bdec50f Mon Sep 17 00:00:00 2001
From: littletomatodonkey
Date: Tue, 9 Jun 2020 10:10:03 +0000
Subject: [PATCH] add distillation and image-aug doc

---
 .../distillation/distillation_en.md       | 227 +++++++
 .../image_augmentation/ImageAugment_en.md | 572 ++++++++++++++++++
 2 files changed, 799 insertions(+)
 create mode 100644 docs/en/advanced_tutorials/distillation/distillation_en.md
 create mode 100644 docs/en/advanced_tutorials/image_augmentation/ImageAugment_en.md

diff --git a/docs/en/advanced_tutorials/distillation/distillation_en.md b/docs/en/advanced_tutorials/distillation/distillation_en.md
new file mode 100644
index 00000000..bbd042fb
--- /dev/null
+++ b/docs/en/advanced_tutorials/distillation/distillation_en.md
@@ -0,0 +1,227 @@

# 1. Introduction of model compression methods

In recent years, deep neural networks have been proven to be an extremely effective way to solve problems in computer vision and natural language processing. With a suitable network structure and training process, deep learning methods perform better than traditional methods.

With enough training data, increasing the number of parameters of a reasonably designed neural network can significantly improve model performance, but it also increases model complexity, which makes the model too expensive to use in real scenarios.

Deep neural networks contain a large amount of parameter redundancy. Several methods can be used to compress a model, such as pruning, quantization and knowledge distillation. Knowledge distillation refers to using a teacher model to guide a student model to learn a specific task, so that the small model obtains a relatively large accuracy improvement while its computation cost remains unchanged, and may even reach an accuracy similar to that of the large model [1]. Combining some of the existing distillation methods [2,3], PaddleClas provides a simple semi-supervised label knowledge distillation solution (SSLD). Top-1 accuracy on the ImageNet1k dataset is improved by more than 3% for the ResNet_vd and MobileNet series, as shown below.

![](../../../images/distillation/distillation_perform.png)

# 2. SSLD

## 2.1 Introduction

The following figure shows the framework of SSLD.

![](../../../images/distillation/ppcls_distillation.png)

First, we select nearly 4 million images from the ImageNet22k dataset and merge them with the ImageNet-1k training set to obtain a new dataset containing 5 million images. Then, we combine the student model and the teacher model into a new network, which outputs the predictions of the student model and the teacher model, respectively. The gradients of the teacher model's part of the network are fixed. Finally, we use JS divergence loss as the loss function for the training process. Here we take the MobileNetV3 distillation task as an example and introduce the key points of SSLD.

* Choice of the teacher model. During knowledge distillation, it may not be optimal if the structures of the teacher model and the student model differ too much. For the same structure, a teacher model with higher accuracy leads to better performance of the student model during distillation. Compared with the 79.12% ResNet50_vd teacher model, using the 82.4% teacher model brings a 0.4% Top-1 accuracy improvement (`75.6% -> 76.0%`).

* Improvement of the loss function. The most commonly used loss function for classification is cross entropy loss. We find that when training with soft labels, KL divergence loss is almost useless for improving model performance compared with cross entropy loss, while JS divergence loss brings a 0.2% accuracy improvement (`76.0% -> 76.2%`). The loss function used in SSLD is therefore JS divergence loss (a minimal sketch of this loss is given after this list).

* More iterations. The baseline experiment uses only 120 epochs. Increasing the number of epochs to 360 brings a 0.9% improvement (`76.2% -> 77.1%`).

* SSLD does not need labeled data, which makes it convenient to expand the training data. The label is not used when computing the loss function, so unlabeled data can also be used to train the network. This label-free distillation strategy greatly raises the upper performance limit of student models (`77.1% -> 78.5%`).

* ImageNet1k finetuning. The ImageNet1k training set is used for finetuning, which brings a 0.4% accuracy improvement (`75.8% -> 78.9%`).
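
The following is a minimal NumPy sketch of the JS divergence loss mentioned above, written only to make the formula concrete; it is not the PaddleClas implementation, and the function names are illustrative.

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the class dimension
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

def kl_div(p, q, eps=1e-10):
    # KL(p || q), averaged over the batch
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def js_div_loss(student_logits, teacher_logits):
    # JS(p_s, p_t) = 0.5 * KL(p_s || m) + 0.5 * KL(p_t || m), with m = (p_s + p_t) / 2
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    m = 0.5 * (p_s + p_t)
    return 0.5 * kl_div(p_s, m) + 0.5 * kl_div(p_t, m)

# toy usage: a batch of 4 samples with 1000 classes
student = np.random.randn(4, 1000)
teacher = np.random.randn(4, 1000)
print(js_div_loss(student, teacher))
```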

## 2.2 Data selection

* An important feature of the SSLD distillation scheme is that it does not need labeled images, so the dataset can be expanded arbitrarily. Considering the limitation of computing resources, we only expand the training set of the distillation task based on the ImageNet22k dataset. For SSLD, we use the `Top-k per class` data sampling scheme [3]. The specific steps are as follows (a minimal sketch of the Top-k selection step is given after this list).
    * Deduplication of the training set. We first deduplicate the ImageNet22k dataset against the ImageNet1k validation set based on SIFT feature similarity matching, to prevent the added ImageNet22k training images from containing ImageNet1k validation images. In the end, 4511 similar images were removed. Some of the filtered similar images are shown below.

    ![](../../../images/distillation/22k_1k_val_compare_w_sift.png)

    * Obtaining the soft labels of the ImageNet22k dataset. For the deduplicated ImageNet22k dataset, we use the `ResNeXt101_32x16d_wsl` model to predict the soft label of each image.
    * Top-k data selection. The ImageNet1k dataset contains 1000 categories. For each category, we find the images with the Top-k highest scores for that category, and finally obtain a dataset whose number of images does not exceed `1000 * k` (some categories may contain fewer than k images).
    * The selected images are merged with the ImageNet1k training set to form the new dataset used for the final distillation training, which contains 5 million images in total.
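
Below is a minimal sketch of the `Top-k per class` selection step, assuming the soft labels have already been computed and stored as an `N x 1000` score matrix; the variable names are illustrative and this is not the code actually used to build the dataset.

```python
import numpy as np

def topk_per_class(scores, k):
    """Select image indices from an (N, num_classes) soft-label score matrix.

    For every class, take the k images with the highest predicted score for
    that class; an image may be selected by several classes.
    """
    num_classes = scores.shape[1]
    selected = set()
    for c in range(num_classes):
        # indices of the k highest scores for class c
        top_idx = np.argsort(-scores[:, c])[:k]
        selected.update(top_idx.tolist())
    return sorted(selected)

# toy usage: 10000 candidate images, 1000 classes, keep the top 4 per class
scores = np.random.rand(10000, 1000)
chosen = topk_per_class(scores, k=4)
print(len(chosen))  # at most 1000 * 4 images
```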

# 3. Experiments

The distillation solution provided by PaddleClas combines common training with finetuning. Given a suitable teacher model, the large dataset (5 million images) is used for common training, and the ImageNet1k dataset is used for finetuning.

## 3.1 Choice of teacher model

In order to verify how the size difference between the teacher model and the student model, as well as the teacher model accuracy, influence the distillation results, we conducted several experiments. The training strategy is unified as `cosine_decay_warmup, lr = 1.3, epoch = 120, bs = 2048`, and the student models are all trained from scratch.

|Teacher Model | Teacher Top1 | Student Model | Student Top1|
|- |:-: |:-: | :-: |
| ResNeXt101_32x16d_wsl | 84.2% | MobileNetV3_large_x1_0 | 75.78% |
| ResNet50_vd | 79.12% | MobileNetV3_large_x1_0 | 75.60% |
| ResNet50_vd | 82.35% | MobileNetV3_large_x1_0 | 76.00% |

The table shows that:

> When the teacher model structure is the same, the higher the teacher model accuracy, the better the final student model will be.
>
> The size difference between the teacher model and the student model should not be too large, otherwise the accuracy of the distillation result will drop.

Therefore, during distillation, we use `ResNeXt101_32x16d_wsl` as the teacher model for the ResNet series student models, and `ResNet50_vd_ssld` as the teacher model for the MobileNet series student models.

## 3.2 Distillation using a large-scale dataset

Training is carried out on the large-scale dataset with 5 million images. The following table shows the details for different models.

|Student Model | num_epoch | l2_decay | batch size/gpu cards | base lr | learning rate decay | top1 acc |
| - |:-: |:-: | :-: |:-: |:-: |:-: |
| MobileNetV1 | 360 | 3e-5 | 4096/8 | 1.6 | cosine_decay_warmup | 77.65% |
| MobileNetV2 | 360 | 1e-5 | 3072/8 | 0.54 | cosine_decay_warmup | 76.34% |
| MobileNetV3_large_x1_0 | 360 | 1e-5 | 5760/24 | 3.65625 | cosine_decay_warmup | 78.54% |
| MobileNetV3_small_x1_0 | 360 | 1e-5 | 5760/24 | 3.65625 | cosine_decay_warmup | 70.11% |
| ResNet50_vd | 360 | 7e-5 | 1024/32 | 0.4 | cosine_decay_warmup | 82.07% |
| ResNet101_vd | 360 | 7e-5 | 1024/32 | 0.4 | cosine_decay_warmup | 83.41% |

## 3.3 Finetuning using ImageNet1k

Finetuning is carried out on the ImageNet1k dataset to close the distribution gap between the training set and the test set. The following table shows the details of finetuning.

|Student Model | num_epoch | l2_decay | batch size/gpu cards | base lr | learning rate decay | top1 acc |
| - |:-: |:-: | :-: |:-: |:-: |:-: |
| MobileNetV1 | 30 | 3e-5 | 4096/8 | 0.016 | cosine_decay_warmup | 77.89% |
| MobileNetV2 | 30 | 1e-5 | 3072/8 | 0.0054 | cosine_decay_warmup | 76.73% |
| MobileNetV3_large_x1_0 | 30 | 1e-5 | 2048/8 | 0.008 | cosine_decay_warmup | 78.96% |
| MobileNetV3_small_x1_0 | 30 | 1e-5 | 6400/32 | 0.025 | cosine_decay_warmup | 71.28% |
| ResNet50_vd | 60 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 82.39% |
| ResNet101_vd | 30 | 7e-5 | 1024/32 | 0.004 | cosine_decay_warmup | 83.73% |

# 4. Application of the distillation model

## 4.1 Instructions

* Adjust the learning rate of the middle layers. The middle-layer feature maps of the model obtained by distillation are more refined. Therefore, when the distillation model is used as a pretrained model in other tasks, keeping the original learning rate can easily destroy these features, while reducing the learning rate of the whole model leads to slow convergence. We therefore only adjust the learning rate of the middle layers. Specifically (a minimal sketch of applying such a multiplier list is given after this list):
    * For ResNet50_vd, we set up a learning rate multiplier list. The three conv2d layers before the residual blocks share one learning rate multiplier, and each of the four residual stages has its own multiplier, so 5 values need to be set in the list. We find through experiments that when finetuning classification models in transfer learning, the list `[0.1, 0.1, 0.2, 0.2, 0.3]` performs better for most tasks, while for object detection tasks, `[0.05, 0.05, 0.05, 0.1, 0.15]` brings larger accuracy gains.
    * For MobileNetV3_large_x1_0, which contains 15 blocks, every 3 blocks share one learning rate multiplier, so 5 values are also required. We find that in both classification and detection tasks, the list `[0.25, 0.25, 0.5, 0.5, 0.75]` performs better for most tasks.
* Appropriate L2 decay. Different L2 decay values are used for different models during training. To prevent overfitting, L2 decay is usually set larger for large models: `1e-4` for ResNet50 and `1e-5 ~ 4e-5` for the MobileNet series. L2 decay also needs to be adjusted when the model is applied to other tasks. Taking Faster_RCNN_MobileNetV3_FPN as an example, we found that simply modifying L2 decay can bring up to 0.5% mAP improvement on the COCO2017 dataset.
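
As a purely illustrative sketch (not the PaddleClas implementation), the following shows one way such a 5-value multiplier list could be mapped onto parameters grouped by stage; the parameter naming convention used here is hypothetical.

```python
def middle_layer_lr(param_name, base_lr, lr_mult_list):
    """Return the learning rate for one parameter, given a 5-value multiplier list.

    Hypothetical naming convention: stem parameters contain "conv1", and the four
    residual stages contain "res2" ... "res5".
    """
    stage_keys = ["conv1", "res2", "res3", "res4", "res5"]
    for idx, key in enumerate(stage_keys):
        if key in param_name:
            return base_lr * lr_mult_list[idx]
    return base_lr  # e.g. the final fc layer keeps the base learning rate

# toy usage with the list recommended above for classification finetuning
lr_mult_list = [0.1, 0.1, 0.2, 0.2, 0.3]
for name in ["conv1_weights", "res2a_branch2a_weights", "res5c_branch2b_weights", "fc_0.w_0"]:
    print(name, middle_layer_lr(name, base_lr=0.01, lr_mult_list=lr_mult_list))
```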

## 4.2 Transfer learning

* To verify the effect of the SSLD pretrained models in transfer learning, we carried out experiments on 10 small datasets. To keep the experiments comparable, we use the standard preprocessing pipeline of ImageNet1k training. For the distillation model, we also add a simple search over the learning rates of its middle layers.
* For ResNet50_vd, the Top-1 accuracy of the baseline pretrained model is 79.12%, and the other hyperparameters are obtained by grid search. For the distillation pretrained model, we add the middle-layer learning rates to the search space. The following table shows the results.

| Dataset | Model | Baseline Top1 Acc | Distillation Model Finetune |
|- |:-: |:-: | :-: |
| Oxford102 flowers | ResNet50_vd | 97.18% | 97.41% |
| caltech-101 | ResNet50_vd | 92.57% | 93.21% |
| Oxford-IIIT-Pets | ResNet50_vd | 94.30% | 94.76% |
| DTD | ResNet50_vd | 76.48% | 77.71% |
| fgvc-aircraft-2013b | ResNet50_vd | 88.98% | 90.00% |
| Stanford-Cars | ResNet50_vd | 92.65% | 92.76% |
| SUN397 | ResNet50_vd | 64.02% | 68.36% |
| cifar100 | ResNet50_vd | 86.50% | 87.58% |
| cifar10 | ResNet50_vd | 97.72% | 97.94% |
| Food-101 | ResNet50_vd | 89.58% | 89.99% |

* On the above 10 datasets, combined with an appropriate middle-layer learning rate, the distillation pretrained model brings an average accuracy improvement of more than 1%.

## 4.3 Object detection

Based on the two-stage Faster/Cascade RCNN models, we verify the effect of the pretrained model obtained by distillation.

* ResNet50_vd

The training scale and test scale are set to 640x640, and some of the ablation studies are as follows.

| Model | train/test scale | pretrain top1 acc | feature map lr | coco mAP |
|- |:-: |:-: | :-: | :-: |
| Faster RCNN R50_vd FPN | 640/640 | 79.12% | [1.0,1.0,1.0,1.0,1.0] | 34.8% |
| Faster RCNN R50_vd FPN | 640/640 | 79.12% | [0.05,0.05,0.1,0.1,0.15] | 34.3% |
| Faster RCNN R50_vd FPN | 640/640 | 82.18% | [0.05,0.05,0.1,0.1,0.15] | 36.3% |

It can be seen that for the baseline pretrained model, excessively reducing the middle-layer learning rate actually hurts the detection model, while the distillation pretrained model combined with the same learning-rate list improves COCO mAP from 34.8% to 36.3%. Based on this distillation model, we also provide a practical server-side detection solution.
The detailed configuration and training code are open source; for more details, please refer to [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/master/configs/rcnn_server_side_det).

# 5. Practice

This section introduces the SSLD distillation experiments in detail based on the ImageNet-1k dataset. If you want to experience this method quickly, you can refer to [**Quick start PaddleClas in 30 minutes**](../../tutorials/quick_start.md), which uses the Flowers102 dataset.

## 5.1 Configuration

### Distill ResNet50_vd using ResNeXt101_32x16d_wsl

The configuration for distilling `ResNet50_vd` with `ResNeXt101_32x16d_wsl` is as follows.

```yaml
ARCHITECTURE:
    name: 'ResNeXt101_32x16d_wsl_distill_ResNet50_vd'
pretrained_model: "./pretrained/ResNeXt101_32x16d_wsl_pretrained/"
# pretrained_model:
#     - "./pretrained/ResNeXt101_32x16d_wsl_pretrained/"
#     - "./pretrained/ResNet50_vd_pretrained/"
use_distillation: True
```

### Distill MobileNetV3_large_x1_0 using ResNet50_vd_ssld

The detailed configuration is as follows.

```yaml
ARCHITECTURE:
    name: 'ResNet50_vd_distill_MobileNetV3_large_x1_0'
pretrained_model: "./pretrained/ResNet50_vd_ssld_pretrained/"
# pretrained_model:
#     - "./pretrained/ResNet50_vd_ssld_pretrained/"
#     - "./pretrained/ResNet50_vd_pretrained/"
use_distillation: True
```

## 5.2 Begin to train the network

If everything is ready, you can begin to train the network with the following command.

```bash
export PYTHONPATH=path_to_PaddleClas:$PYTHONPATH

python -m paddle.distributed.launch \
    --selected_gpus="0,1,2,3" \
    --log_dir=R50_vd_distill_MV3_large_x1_0 \
    tools/train.py \
        -c ./configs/Distillation/R50_vd_distill_MV3_large_x1_0.yaml
```

## 5.3 Note

* Before using SSLD, you need to train a teacher model on the target dataset first. The teacher model is then used to guide the training of the student model.

* When using SSLD, set `use_distillation` in the configuration file to `True`. In addition, because the student model learns soft labels that carry knowledge information, you need to turn off the `label_smoothing` option.

* If the student model is not initialized with a pretrained model, the other training hyperparameters can follow those used to train the student model on ImageNet-1k. If the student model is initialized with a pretrained model, the learning rate can be reduced to `1/100 ~ 1/10` of the standard learning rate.

* During SSLD distillation, the student model only learns the soft labels, which makes the training task more difficult. It is recommended to decrease `l2_decay` appropriately to obtain higher validation accuracy.

* If you want to add unlabeled training data, you only need to add the extra images to the training list textfile.

> If this document is helpful to you, welcome to star our project: [https://github.com/PaddlePaddle/PaddleClas](https://github.com/PaddlePaddle/PaddleClas)

# Reference

[1] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015.

[2] Bagherinezhad H, Horton M, Rastegari M, et al. Label refinery: Improving imagenet classification through label progression[J]. arXiv preprint arXiv:1805.02641, 2018.

[3] Yalniz I Z, Jégou H, Chen K, et al. Billion-scale semi-supervised learning for image classification[J]. arXiv preprint arXiv:1905.00546, 2019.
diff --git a/docs/en/advanced_tutorials/image_augmentation/ImageAugment_en.md b/docs/en/advanced_tutorials/image_augmentation/ImageAugment_en.md
new file mode 100644
index 00000000..167f6ccc
--- /dev/null
+++ b/docs/en/advanced_tutorials/image_augmentation/ImageAugment_en.md
@@ -0,0 +1,572 @@

# 1. Image Augmentation

Image augmentation is a commonly used regularization method in image classification tasks, and it is especially useful in scenarios with insufficient data or a large model. In this chapter, we introduce 8 image augmentation methods in addition to the standard augmentation pipeline. Users can apply these methods in their own tasks for better model performance. Under the same conditions, the performance of these augmentation methods on the ImageNet1k dataset is shown below.

![](../../../images/image_aug/main_image_aug.png)

# 2. Common image augmentation methods

Unless otherwise specified, all the examples and experiments in this chapter are based on the ImageNet1k dataset, with the network input image size set to 224.

The standard data augmentation pipeline in ImageNet classification tasks contains the following steps (a minimal sketch of this pipeline is given after the list).

1. Decode the image, abbreviated as `ImageDecode`.
2. Randomly crop the image to 224x224, abbreviated as `RandCrop`.
3. Randomly flip the image horizontally, abbreviated as `RandFlip`.
4. Normalize the image pixel values, abbreviated as `Normalize`.
5. Transpose the image from `[224, 224, 3]` (HWC) to `[3, 224, 224]` (CHW), abbreviated as `Transpose`.
6. Group the image data (`[3, 224, 224]`) into a batch (`[N, 3, 224, 224]`), where `N` is the batch size; abbreviated as `Batch`.
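
The following is a minimal sketch of this baseline pipeline, written with the operator names that appear elsewhere in this document (`DecodeImage`, `RandCropImage`, `RandFlipImage`, `NormalizeImage`, `ToCHWImage`); the Python-level constructor arguments are inferred from the yaml configurations shown later and are assumptions rather than the exact PaddleClas API.

```python
import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import RandCropImage
from ppcls.data.imaug import RandFlipImage
from ppcls.data.imaug import NormalizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import transform

size = 224

# the standard ImageNet training pipeline described above
decode_op = DecodeImage()
randcrop_op = RandCropImage(size=size)
randflip_op = RandFlipImage(flip_code=1)
normalize_op = NormalizeImage(scale=1.0 / 255.0,
                              mean=[0.485, 0.456, 0.406],
                              std=[0.229, 0.224, 0.225],
                              order='')
tochw_op = ToCHWImage()

ops = [decode_op, randcrop_op, randflip_op, normalize_op, tochw_op]

imgs_dir = "./images"  # directory with sample images, as in the other examples
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)  # img has shape (3, 224, 224)
```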

In addition to the standard augmentation pipeline above, researchers have proposed many improved image augmentation strategies. These strategies insert certain operations at different stages of the standard pipeline. Based on the stage at which they operate, we divide them into the following three categories.

1. Transformation. Perform some transformations on the image after `RandCrop`, such as AutoAugment and RandAugment.
2. Cropping. Crop out part of the image after `Transpose`, such as CutOut, RandErasing, HideAndSeek and GridMask.
3. Aliasing. Mix images after `Batch`, such as Mixup and Cutmix.

The following table shows more detailed information about these transformations.

| Method | Input | Output | AutoAugment\[1\] | RandAugment\[2\] | CutOut\[3\] | RandErasing\[4\] | HideAndSeek\[5\] | GridMask\[6\] | Mixup\[7\] | Cutmix\[8\] |
|-------------|---------------------------|---------------------------|------------------|------------------|-------------|------------------|------------------|---------------|------------|------------|
| ImageDecode | Binary | (224, 224, 3) uint8 | Y | Y | Y | Y | Y | Y | Y | Y |
| RandCrop | (:, :, 3) uint8 | (224, 224, 3) uint8 | Y | Y | Y | Y | Y | Y | Y | Y |
| **Process** | (224, 224, 3) uint8 | (224, 224, 3) uint8 | Y | Y | \- | \- | \- | \- | \- | \- |
| RandFlip | (224, 224, 3) uint8 | (224, 224, 3) uint8 | Y | Y | Y | Y | Y | Y | Y | Y |
| Normalize | (224, 224, 3) uint8 | (224, 224, 3) float32 | Y | Y | Y | Y | Y | Y | Y | Y |
| Transpose | (224, 224, 3) float32 | (3, 224, 224) float32 | Y | Y | Y | Y | Y | Y | Y | Y |
| **Process** | (3, 224, 224) float32 | (3, 224, 224) float32 | \- | \- | Y | Y | Y | Y | \- | \- |
| Batch | (3, 224, 224) float32 | (N, 3, 224, 224) float32 | Y | Y | Y | Y | Y | Y | Y | Y |
| **Process** | (N, 3, 224, 224) float32 | (N, 3, 224, 224) float32 | \- | \- | \- | \- | \- | \- | Y | Y |

PaddleClas integrates all the above data augmentation strategies. More details, including the principles and usage of each strategy, are introduced in the following chapters. For better visualization, we use the following figure to show the changes after the transformations, and `RandCrop` is replaced with `Resize` for simplification.

![](../../../images/image_aug/test_baseline.jpeg)

# 3. Image Transformation

Transformation means performing some transformations on the image after `RandCrop`. It mainly contains AutoAugment and RandAugment.

## 3.1 AutoAugment

Address: [https://arxiv.org/abs/1805.09501v1](https://arxiv.org/abs/1805.09501v1)

Github repo: [https://github.com/DeepVoltaire/AutoAugment](https://github.com/DeepVoltaire/AutoAugment)

Unlike conventional, manually designed image augmentation methods, AutoAugment is an augmentation scheme, tailored to a specific dataset, that is found by a search algorithm over a search space of augmentation sub-strategies. For the ImageNet dataset, the final scheme contains 25 sub-strategy combinations, and each sub-strategy contains two transformations. For each image, a sub-strategy combination is randomly selected, and each transformation in the sub-strategy is then applied with a certain probability.

In PaddleClas, `AutoAugment` is used as follows.

```python
import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ImageNetPolicy
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
autoaugment_op = ImageNetPolicy()

ops = [decode_op, resize_op, autoaugment_op]

imgs_dir = image_path
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
```

The images after `AutoAugment` are as follows.

![][test_autoaugment]

## 3.2 RandAugment

Address: [https://arxiv.org/pdf/1909.13719.pdf](https://arxiv.org/pdf/1909.13719.pdf)

Github repo: [https://github.com/heartInsert/randaugment](https://github.com/heartInsert/randaugment)

The search method of `AutoAugment` is brute-force: searching for the optimal strategy directly on the target dataset requires a lot of computation. The author of `RandAugment` found that, on the one hand, for larger models and larger datasets the gains brought by the strategy searched with `AutoAugment` become smaller; on the other hand, the searched strategy is tied to a specific dataset, so it generalizes poorly and is not suitable for other datasets.

In `RandAugment`, the author proposes a random augmentation method. Instead of using a specific probability to decide whether to apply a certain sub-strategy, all sub-strategies are selected with the same probability. The experiments in the paper also show that this method performs well even for large models.

In PaddleClas, `RandAugment` is used as follows.

```python
import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import RandAugment
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
randaugment_op = RandAugment()

ops = [decode_op, resize_op, randaugment_op]

imgs_dir = image_path
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
```

The images after `RandAugment` are as follows.

![][test_randaugment]

# 4. Image Cropping

Cropping means performing some transformations on the image after `Transpose`, setting the pixels of the cropped region to a constant value. It mainly contains CutOut, RandErasing, HideAndSeek and GridMask.

Image cropping methods can be applied before or after normalization. The difference is that if we crop the image before normalization and fill the cropped region with 0, the pixel values of that region will no longer be 0 after normalization, which changes the grayscale distribution of the data.

The ideas behind the cropping transformations above are similar: they all aim to solve the problem of poor generalization of the trained model on occluded images; the difference lies in their cropping details.

## 4.1 Cutout

Address: [https://arxiv.org/abs/1708.04552](https://arxiv.org/abs/1708.04552)

Github repo: [https://github.com/uoguelph-mlrg/Cutout](https://github.com/uoguelph-mlrg/Cutout)

Cutout can be seen as an extension of dropout, but it occludes the input image rather than the feature map, which makes it more robust to noise than dropout. Cutout has two advantages: (1) it can simulate situations in which the subject is partially occluded; (2) it encourages the model to make full use of more content in the image for classification and prevents the network from focusing only on the salient region, thereby reducing overfitting.

In PaddleClas, `Cutout` is used as follows.

```python
import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import Cutout
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
cutout_op = Cutout(n_holes=1, length=112)

ops = [decode_op, resize_op, cutout_op]

imgs_dir = image_path
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
```

The images after `Cutout` are as follows.

![][test_cutout]

## 4.2 RandomErasing

Address: [https://arxiv.org/pdf/1708.04896.pdf](https://arxiv.org/pdf/1708.04896.pdf)

Github repo: [https://github.com/zhunzhong07/Random-Erasing](https://github.com/zhunzhong07/Random-Erasing)

RandomErasing is similar to Cutout and also aims to solve the poor generalization of the trained model on occluded images. The author points out in the paper that random erasing is complementary to random cropping and random horizontal flipping, and also verifies the effectiveness of the method on pedestrian re-identification (REID). Unlike `Cutout`, `RandomErasing` is applied to the image with a certain probability, and the size and aspect ratio of the erased region are also generated randomly according to pre-defined hyperparameters.

In PaddleClas, `RandomErasing` is used as follows.

```python
import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import RandomErasing
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
randomerasing_op = RandomErasing()

ops = [decode_op, resize_op, tochw_op, randomerasing_op]

imgs_dir = image_path
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    img = img.transpose((1, 2, 0))
```

The images after `RandomErasing` are as follows.

![][test_randomerassing]

## 4.3 HideAndSeek

Address: [https://arxiv.org/pdf/1811.02545.pdf](https://arxiv.org/pdf/1811.02545.pdf)

Github repo: [https://github.com/kkanshul/Hide-and-Seek](https://github.com/kkanshul/Hide-and-Seek)

In `HideAndSeek`, the image is divided into several patches, and a mask is generated for each patch with a certain probability. The meaning of the mask in different regions is shown in the figure below.

![][hide_and_seek_mask_expanation]

In PaddleClas, `HideAndSeek` is used as follows.

```python
import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import HideAndSeek
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
hide_and_seek_op = HideAndSeek()

ops = [decode_op, resize_op, tochw_op, hide_and_seek_op]

imgs_dir = image_path
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    img = img.transpose((1, 2, 0))
```

The images after `HideAndSeek` are as follows.

![][test_hideandseek]

## 4.4 GridMask

Address: [https://arxiv.org/abs/2001.04086](https://arxiv.org/abs/2001.04086)

Github repo: [https://github.com/akuxcw/GridMask](https://github.com/akuxcw/GridMask)

The author points out that previous cropping-based methods have two problems, as shown in the following figure:

1. Deleting too large an area may remove most or all of the target object, or lose the context information, so that the augmented images become noisy data.
2. Keeping too large an area has little effect on either the object or the context.

![][gridmask-0]

Therefore, how to avoid excessive deletion or excessive retention becomes the core problem to be solved.

`GridMask` generates a mask with the same resolution as the original image and multiplies it with the original image. The grid and size of the mask are controlled by hyperparameters.

During training, there are two ways to use it:
1. Set a probability p and, from the beginning of training, apply GridMask to each image with probability p.
2. Initially set the augmentation probability to 0 and increase it with the number of iterations until it reaches p.

Experiments show that the second method is better.

The usage of `GridMask` in PaddleClas is shown below.

```python
import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import GridMask
from ppcls.data.imaug import transform

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
gridmask_op = GridMask(d1=96, d2=224, rotate=1, ratio=0.6, mode=1, prob=0.8)

ops = [decode_op, resize_op, tochw_op, gridmask_op]

imgs_dir = image_path
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    img = img.transpose((1, 2, 0))
```

The images after `GridMask` are as follows.

![][test_gridmask]

# 5. Image aliasing

Aliasing means performing some transformations on the images after `Batch`; it contains Mixup and Cutmix.

The augmentation methods introduced above operate on a single image, while aliasing operates on a batch to generate a new batch.

## 5.1 Mixup

Address: [https://arxiv.org/pdf/1710.09412.pdf](https://arxiv.org/pdf/1710.09412.pdf)

Github repo: [https://github.com/facebookresearch/mixup-cifar10](https://github.com/facebookresearch/mixup-cifar10)

Mixup is the first aliasing-based solution. It is easy to implement and performs well not only on image classification but also on object detection (a minimal sketch of the mixing arithmetic is given at the end of this chapter). For simplicity, Mixup is usually carried out within a batch, and the same holds for `Cutmix`.

The usage of `Mixup` in PaddleClas is shown below.

```python
import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import transform
from ppcls.data.imaug import MixupOperator

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
mixup_op = MixupOperator()

ops = [decode_op, resize_op, tochw_op]

imgs_dir = image_path

batch = []
fnames = os.listdir(imgs_dir)
for idx, f in enumerate(fnames):
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    batch.append((img, idx))  # fake label

new_batch = mixup_op(batch)
```

The images after `Mixup` are as follows.

![][test_mixup]

## 5.2 Cutmix

Address: [https://arxiv.org/pdf/1905.04899v2.pdf](https://arxiv.org/pdf/1905.04899v2.pdf)

Github repo: [https://github.com/clovaai/CutMix-PyTorch](https://github.com/clovaai/CutMix-PyTorch)

Unlike `Mixup`, which directly adds two images together, Cutmix randomly cuts an `ROI` out of one image and pastes it onto the corresponding region of another image. The usage of `Cutmix` in PaddleClas is shown below.

```python
import os

from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import transform
from ppcls.data.imaug import CutmixOperator

size = 224

decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
cutmix_op = CutmixOperator()

ops = [decode_op, resize_op, tochw_op]

imgs_dir = image_path

batch = []
fnames = os.listdir(imgs_dir)
for idx, f in enumerate(fnames):
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    batch.append((img, idx))  # fake label

new_batch = cutmix_op(batch)
```

The images after `Cutmix` are as follows.

![][test_cutmix]
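
For intuition, the following is a minimal NumPy sketch of the mixing arithmetic behind Mixup together with the corresponding soft-label combination; it only illustrates the idea and is not the `MixupOperator` implementation in PaddleClas.

```python
import numpy as np

def mixup_batch(images, one_hot_labels, alpha=0.2):
    """Mix a batch with a randomly permuted copy of itself.

    images: float array of shape (N, 3, H, W)
    one_hot_labels: float array of shape (N, num_classes)
    """
    lam = np.random.beta(alpha, alpha)          # mixing ratio sampled from Beta(alpha, alpha)
    perm = np.random.permutation(len(images))   # partner images within the same batch
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * one_hot_labels + (1.0 - lam) * one_hot_labels[perm]
    return mixed_images, mixed_labels

# toy usage: a batch of 8 images of size 224x224 with 1000 classes
images = np.random.rand(8, 3, 224, 224).astype("float32")
labels = np.eye(1000, dtype="float32")[np.random.randint(0, 1000, size=8)]
mixed_images, mixed_labels = mixup_batch(images, labels)
```

Cutmix follows the same label arithmetic, but instead of interpolating whole images it pastes a rectangular region from the partner image and sets the mixing ratio to the area ratio of the two parts.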

# 6. Experiments

Based on PaddleClas, the metrics of different augmentation methods on the ImageNet1k dataset are as follows.

| Model | Learning strategy | l2 decay | batch size | epoch | Augmentation method | Top1 Acc | Reference |
|-------------|------------------|--------------|------------|-------|----------------|------------|----|
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | Standard transform | 0.7731 | - |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | AutoAugment | 0.7795 | 0.7763 |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | mixup | 0.7828 | 0.7790 |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | cutmix | 0.7839 | 0.7860 |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | cutout | 0.7801 | - |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | gridmask | 0.7785 | 0.7790 |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | random-augment | 0.7770 | 0.7760 |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | random erasing | 0.7791 | - |
| ResNet50 | 0.1/cosine_decay | 0.0001 | 256 | 300 | hide and seek | 0.7743 | 0.7720 |

**Note**:
* In these experiments, l2 decay is fixed to 1e-4 for a fair comparison. To achieve higher accuracy, we recommend trying a smaller l2 decay; combined with data augmentation, reducing l2 decay from 1e-4 to 7e-5 can bring at least a 0.3%~0.5% accuracy improvement.
* We have not yet combined or verified different strategies together, which is left as future work.

# 7. Data augmentation practice

This section introduces the data augmentation experiments in detail. If you want to quickly experience these methods, please refer to [**Quick start PaddleClas in 30 minutes**](../../tutorials/quick_start_en.md).

## 7.1 Configurations

Since the hyperparameters differ between augmentation methods, we provide 8 augmentation configuration files in `configs/DataAugment` based on ResNet50 for better understanding. Users can train the models with `tools/run.sh`. Three of the configurations are shown below.

### 7.1.1 RandAugment

The configuration of `RandAugment` is shown below. `num_layers` (default: 2) and `magnitude` (default: 5) are its two hyperparameters.

```yaml
    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - RandCropImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - RandAugment:
            num_layers: 2
            magnitude: 5
        - NormalizeImage:
            scale: 1./255.
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - ToCHWImage:
```

### 7.1.2 Cutout

The configuration of `Cutout` is shown below. `n_holes` (default: 1) and `length` (default: 112) are its two hyperparameters.

```yaml
    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - RandCropImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - NormalizeImage:
            scale: 1./255.
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - Cutout:
            n_holes: 1
            length: 112
        - ToCHWImage:
```

### 7.1.3 Mixup

The configuration of `Mixup` is shown below. `alpha` (default: 0.2) is the hyperparameter users need to care about. In addition, `use_mix` needs to be set to `True` at the root of the configuration file.

```yaml
    transforms:
        - DecodeImage:
            to_rgb: True
            to_np: False
            channel_first: False
        - RandCropImage:
            size: 224
        - RandFlipImage:
            flip_code: 1
        - NormalizeImage:
            scale: 1./255.
            mean: [0.485, 0.456, 0.406]
            std: [0.229, 0.224, 0.225]
            order: ''
        - ToCHWImage:
    mix:
        - MixupOperator:
            alpha: 0.2
```

## 7.2 Training command

Users can start the training process with the following command, which can also be found in `tools/run.sh`.

```bash
export PYTHONPATH=path_to_PaddleClas:$PYTHONPATH

python -m paddle.distributed.launch \
    --selected_gpus="0,1,2,3" \
    tools/train.py \
        -c ./configs/DataAugment/ResNet50_Cutout.yaml
```

## 7.3 Note

* When using augmentation methods based on image aliasing, `use_mix` needs to be set to `True` in the configuration file. In addition, because the labels are mixed when the images are mixed, the accuracy on the training data cannot be computed, so no training accuracy is printed during training.

* The training data is harder to fit with data augmentation, so the training loss may be larger and the training accuracy relatively lower, but the model generalizes better, so the validation accuracy is relatively higher.

* After applying data augmentation, the model may tend to underfit. It is recommended to reduce `l2_decay` for better performance on the validation set.

* Almost all augmentation methods have hyperparameters. We provide hyperparameters tuned for the ImageNet1k dataset; users may need to finetune them on their own datasets. More training tricks can be found in [**Tricks**](../../../zh_CN/models/Tricks.md).

> If this document is helpful to you, welcome to star our project: [https://github.com/PaddlePaddle/PaddleClas](https://github.com/PaddlePaddle/PaddleClas)

# Reference

[1] Cubuk E D, Zoph B, Mane D, et al. Autoaugment: Learning augmentation strategies from data[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2019: 113-123.

[2] Cubuk E D, Zoph B, Shlens J, et al. Randaugment: Practical automated data augmentation with a reduced search space[J]. arXiv preprint arXiv:1909.13719, 2019.

[3] DeVries T, Taylor G W. Improved regularization of convolutional neural networks with cutout[J]. arXiv preprint arXiv:1708.04552, 2017.

[4] Zhong Z, Zheng L, Kang G, et al. Random erasing data augmentation[J]. arXiv preprint arXiv:1708.04896, 2017.

[5] Singh K K, Lee Y J. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization[C]//2017 IEEE international conference on computer vision (ICCV). IEEE, 2017: 3544-3553.

[6] Chen P. GridMask Data Augmentation[J]. arXiv preprint arXiv:2001.04086, 2020.

[7] Zhang H, Cisse M, Dauphin Y N, et al. mixup: Beyond empirical risk minimization[J]. arXiv preprint arXiv:1710.09412, 2017.

[8] Yun S, Han D, Oh S J, et al. Cutmix: Regularization strategy to train strong classifiers with localizable features[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6023-6032.

[test_baseline]: ../../../images/image_aug/test_baseline.jpeg
[test_autoaugment]: ../../../images/image_aug/test_autoaugment.jpeg
[test_cutout]: ../../../images/image_aug/test_cutout.jpeg
[test_gridmask]: ../../../images/image_aug/test_gridmask.jpeg
[gridmask-0]: ../../../images/image_aug/gridmask-0.png
[test_hideandseek]: ../../../images/image_aug/test_hideandseek.jpeg
[test_randaugment]: ../../../images/image_aug/test_randaugment.jpeg
[test_randomerassing]: ../../../images/image_aug/test_randomerassing.jpeg
[hide_and_seek_mask_expanation]: ../../../images/image_aug/hide-and-seek-visual.png
[test_mixup]: ../../../images/image_aug/test_mixup.png
[test_cutmix]: ../../../images/image_aug/test_cutmix.png
--
GitLab