Experiments about data augmentation are introduced in detail in this section. If you want to quickly experience these methods, please refer to [**Quick start PaddleClas in 30 minutes**](../../tutorials/quick_start_en.md), which is based on the CIFAR100 dataset. If you want to learn about the related algorithms, please refer to [Data Augmentation Algorithm Introduction](../algorithm_introduction/DataAugmentation_en.md).
## Catalogue
- [1. Configurations](#1)
  - [1.1 AutoAugment](#1.1)
  - [1.2 RandAugment](#1.2)
  - [1.3 TimmAutoAugment](#1.3)
  - [1.4 Cutout](#1.4)
  - [1.5 RandomErasing](#1.5)
  - [1.6 HideAndSeek](#1.6)
  - [1.7 GridMask](#1.7)
  - [1.8 Mixup](#1.8)
  - [1.9 Cutmix](#1.9)
  - [1.10 Use Mixup and Cutmix at the same time](#1.10)
- [2. Start training](#2)
- [3. Matters needing attention](#3)
- [4. Experiments](#4)
<aname="1"></a>
## Configurations
Since hyperparameters differ from different augmentation methods. For better understanding, we list 8 augmentation configuration files in `configs/DataAugment` based on ResNet50. Users can train the model with `tools/run.sh`. The following are 3 of them.
<aname="1.1"></a>
### 1.1 AutoAugment
The configuration of the data augmentation method of `AotoAugment` is as follows. `AutoAugment` is converted on the uint8 data format, so its processing should be placed before the normalization operation (`NormalizeImage`).
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - AutoAugment:
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
```
<aname="1.2"></a>
### 1.2 RandAugment
The configuration of the data augmentation method of `RandAugment` is as follows, where the user needs to specify the parameters `num_layers` and `magnitude`, and the default values are `2` and `5` respectively. `RandAugment` is converted on the uint8 data format, so its processing should be placed before the normalization operation (`NormalizeImage`).
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - RandAugment:
      num_layers: 2
      magnitude: 5
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
```
<aname="1.3"></a>
### 1.3 TimmAutoAugment
The configuration of the data augmentation method of `TimmAutoAugment` is as follows, in which the user needs to specify the parameters `config_str`, `interpolation`, and `img_size`. The default values are `rand-m9-mstd0.5-inc1` and `bicubic. `, `224`. `TimmAutoAugment` is converted on the uint8 data format, so its processing should be placed before the normalization operation (`NormalizeImage`).
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - TimmAutoAugment:
      config_str: rand-m9-mstd0.5-inc1
      interpolation: bicubic
      img_size: 224
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
```
<aname="1.4"></a>
### 1.4 Cutout
The configuration of the data augmentation method of `Cutout` is as follows, where the user needs to specify the parameters `n_holes` and `length`, and the default values are `1` and `112` respectively. Similar to other image cropping data augmentation methods, `Cutout` can operate on data in uint8 format, or on data after normalization (`NormalizeImage`).The demo here is operated after normalization.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - Cutout:
      n_holes: 1
      length: 112
```
<aname="1.5"></a>
### 1.5 RandomErasing
The configuration of the image augmentation method of `RandomErasing` is as follows, where the user needs to specify the parameters `EPSILON`, `sl`, `sh`, `r1`, `attempt`, `use_log_aspect`, `mode`, and the default values They are `0.25`, `0.02`, `1.0/3.0`, `0.3`, `10`, `True`, and `pixel`. Similar to other image cropping data augmentation methods, `RandomErasing` can operate on data in uint8 format, or on data after normalization (`NormalizeImage`).The demo here is operated after normalization.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - RandomErasing:
      EPSILON: 0.25
      sl: 0.02
      sh: 1.0/3.0
      r1: 0.3
      attempt: 10
      use_log_aspect: True
      mode: pixel
```
<aname="1.6"></a>
### 1.6 HideAndSeek
The configuration of the image augmentation method of `HideAndSeek` is as follows. Similar to other image cropping data augmentation methods, `HideAndSeek` can operate on data in uint8 format, or on data after normalization (`NormalizeImage`).The demo here is operated after normalization.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - HideAndSeek:
```
<aname="1.7"></a>
### 1.7 GridMask
The configuration of the image augmentation method of `GridMask` is as follows, where the user needs to specify the parameters `d1`, `d2`, `rotate`, `ratio`, `mode`, and the default values are `96`, `224 respectively `, `1`, `0.5`, `0`. Similar to other image cropping data augmentation methods, `HideAndSeek` can operate on data in uint8 format, or on data after normalization (`GridMask`).The demo here is operated after normalization.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - GridMask:
      d1: 96
      d2: 224
      rotate: 1
      ratio: 0.5
      mode: 0
```
<aname="1.8"></a>
### 1.8 Mixup
The configuration of the data augmentation method of `Mixup` is as follows, where the user needs to specify the parameter `alpha`, and the default value is `0.2`. Similar to other image mixing data augmentation methods, `Mixup` is to perform image mix on the data in each batch after the image is processed, and the mixed images and labels are put into the network for training,
so it operates after image data processing (image transformation, image cropping).
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
batch_transform_ops:
  - MixupOperator:
      alpha: 0.2
```
<aname="1.9"></a>
### 1.9 Cutmix
The configuration of the image augmentation method of `Cutmix` is as follows, where the user needs to specify the parameter `alpha`, and the default value is `0.2`. Similar to other image mixing data augmentation methods, `Mixup` is to perform image mix on the data in each batch after the image is processed, and the mixed images and labels are put into the network for training,
so it operates after image data processing (image transformation, image cropping).
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
batch_transform_ops:
  - CutmixOperator:
      alpha: 0.2
```
<aname="1.10"></a>
### 1.10 Use Mixup and Cutmix at the same time
The configuration for both `Mixup` and `Cutmix` is as follows, in which the user needs to specify an additional parameter `prob`, which controls the probability of different data enhancements, and the default is `0.5`.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
batch_transform_ops:
  - OpSampler:
      MixupOperator:
        alpha: 0.8
        prob: 0.5
      CutmixOperator:
        alpha: 1.0
        prob: 0.5
```
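To illustrate the `prob` semantics, here is a minimal sketch of what a batch-operator sampler does (an illustration only, not the actual PaddleClas `OpSampler` implementation): at most one of the registered batch operators is applied to each batch, chosen according to its probability, and if the probabilities sum to less than 1 the batch is sometimes left unchanged.
```python
import random

def op_sampler(batch, ops_with_prob):
    """Apply at most one batch operator, e.g. [(mixup_op, 0.5), (cutmix_op, 0.5)]."""
    r, cumulative = random.random(), 0.0
    for op, prob in ops_with_prob:
        cumulative += prob
        if r < cumulative:
            return op(batch)
    return batch  # probabilities may sum to < 1: keep the batch unchanged
```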
<aname="2"></a>
## 2. Start training
After you configure the training environment, similar to training other classification tasks, you only need to replace the configuration file in `tools/train.sh` with the configuration file of the corresponding data augmentation method.
<a name="3"></a>
## 3. Matters needing attention
* Because the labels need to be mixed when the images are mixed, the accuracy on the training data cannot be computed, so no training accuracy is printed during training.
* The training data is more difficult with data augmentation, so the training loss may be larger and the training-set accuracy relatively low, but the model gains better generalization ability, so the validation-set accuracy is relatively higher.
* After applying data augmentation, the model may tend to underfit. It is recommended to reduce `l2_decay` for better performance on the validation set.
* Hyperparameters exist in almost all augmentation methods. Here we provide hyperparameters for the ImageNet-1k dataset; users may need to fine-tune them on their own datasets. More training tricks can be found in [**Tricks**](../../../zh_CN/models/Tricks.md).
> If this document is helpful to you, welcome to star our project: [https://github.com/PaddlePaddle/PaddleClas](https://github.com/PaddlePaddle/PaddleClas)
<aname="4"></a>
## 4. Experiments
Based on PaddleClas, Metrics of different augmentation methods on ImageNet1k dataset are as follows.
* In the experiment here, for better comparison, we fixed the l2 decay to 1e-4. To achieve higher accuracy, we recommend trying to use a smaller l2 decay. Combined with data augmentaton, we found that reducing l2 decay from 1e-4 to 7e-5 can bring at least 0.3~0.5% accuracy improvement.
* We have not yet combined different strategies or verified, whch is our future work.
# Data Augmentation
------
## Catalogue
- [1. Introduction to data augmentation](#1)
- [2. Common data augmentation methods](#2)
  - [2.1 Image Transformation](#2.1)
    - [2.1.1 AutoAugment](#2.1.1)
    - [2.1.2 RandAugment](#2.1.2)
    - [2.1.3 TimmAutoAugment](#2.1.3)
  - [2.2 Image Cropping](#2.2)
    - [2.2.1 Cutout](#2.2.1)
    - [2.2.2 RandomErasing](#2.2.2)
    - [2.2.3 HideAndSeek](#2.2.3)
    - [2.2.4 GridMask](#2.2.4)
  - [2.3 Image Mixing](#2.3)
    - [2.3.1 Mixup](#2.3.1)
    - [2.3.2 Cutmix](#2.3.2)
<aname="1"></a>
## 1. Introduction to data augmentation
Data augmentation is a commonly used regularization method in image classification tasks, often used in scenarios with insufficient data or large models. In this chapter, we mainly introduce 8 image augmentation methods beyond the standard ones. Users can apply these methods in their own tasks for better model performance. Under the same conditions, these augmentation methods' performance on the ImageNet-1k dataset is shown as follows.
![](../../../images/image_aug/main_image_aug.png)
<aname="2"></a>
## 2. Common data augmentation methods
Unless otherwise noted, all the examples and experiments in this chapter are based on the ImageNet-1k dataset with the network input image size set to 224.
...
...
PaddleClas integrates all the above data augmentation strategies.
![](../../../images/image_aug/test_baseline.jpeg)
<aname="2.1"></a>
### 2.1 Image Transformation
Transformation means performing some transformations on the image after `RandCrop`. It mainly contains AutoAugment and RandAugment.
<a name="2.1.1"></a>
#### 2.1.1 AutoAugment
Unlike conventional, manually designed image augmentation methods, AutoAugment is an image augmentation scheme suited to a specific dataset, found by a search algorithm in a search space of image augmentation sub-policies. For the ImageNet dataset, the final solution contains 25 sub-policy combinations. Each sub-policy contains two transformations: for each image, a sub-policy combination is randomly selected, and each transformation in the sub-policy is then applied with a certain probability.
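The following is a minimal sketch of this sampling logic. The policy entries and the `apply_transform` helper are hypothetical placeholders, not the real 25-pair ImageNet policy:
```python
import random

# each sub-policy is two (name, probability, magnitude) steps; the real
# ImageNet policy has 25 such pairs -- a toy two-entry policy is shown here
POLICY = [
    [("posterize", 0.4, 8), ("rotate", 0.6, 9)],
    [("solarize", 0.6, 5), ("autocontrast", 0.6, 0)],
]

def auto_augment(img, apply_transform):
    sub_policy = random.choice(POLICY)       # one sub-policy per image
    for name, prob, magnitude in sub_policy:
        if random.random() < prob:           # each step fires with its own prob
            img = apply_transform(img, name, magnitude)
    return img
```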
<a name="2.1.2"></a>
#### 2.1.2 RandAugment
The search method of `AutoAugment` is relatively brute-force, and searching for the optimal strategy for a given dataset directly is very resource-consuming.
In `RandAugment`, the author proposes a random augmentation method. Instead of using a specific probability to determine whether to use a certain sub-strategy, all sub-strategies are selected with the same probability. The experiments in the paper also show that this method performs well even for large models.
In PaddleClas, `RandAugment` is used as follows.
```python
import os

# DecodeImage, ResizeImage, RandAugment and transform are PaddleClas
# preprocessing utilities; image_path is the directory of test images.
size = 224
decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
randaugment_op = RandAugment()

ops = [decode_op, resize_op, randaugment_op]

imgs_dir = image_path
fnames = os.listdir(imgs_dir)
for f in fnames:
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
```
The images after `RandAugment` are as follows.
![][test_randaugment]
<aname="2.1.3"></a>
#### 2.1.3 TimmAutoAugment
Github open source code address: [https://github.com/rwightman/pytorch-image-models/blob/master/timm/data/auto_augment.py](https://github.com/rwightman/pytorch-image-models/blob/master/timm/data/auto_augment.py)
`TimmAutoAugment` is the open-source community's improvement of AutoAugment and RandAugment. In practice it performs better on many vision tasks, and at present most Vision Transformer models are trained with TimmAutoAugment.
<aname="2.2"></a>
### 2.2 Image Cropping
Cropping means performing some transformations on the image after `Transpose`, setting the pixels of the cropped area to a certain constant value. It mainly contains Cutout, RandomErasing, HideAndSeek and GridMask.
...
...
Image cropping methods can be operated before or after normalization; the difference lies in the pixel values that fill the cropped area.
The ideas behind these cropping transformations are similar: they all aim to address the poor generalization of trained models on occluded images, and they differ only in the cropping details.
Cutout is a kind of dropout, but it occludes the input image rather than the feature maps, which makes it more robust to noise. Cutout has two advantages: (1) it can simulate situations where the subject is partially occluded; (2) it pushes the model to make full use of more of the image content for classification, preventing the network from focusing only on the salient regions and thus overfitting.
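A minimal NumPy sketch of the idea, assuming an HWC image array (illustrative only; the real operator lives in PaddleClas's preprocessing ops):
```python
import numpy as np

def cutout(img, n_holes=1, length=112):
    h, w = img.shape[:2]
    for _ in range(n_holes):
        y, x = np.random.randint(h), np.random.randint(w)
        y1, y2 = np.clip([y - length // 2, y + length // 2], 0, h)
        x1, x2 = np.clip([x - length // 2, x + length // 2], 0, w)
        img[y1:y2, x1:x2] = 0  # zero out a square centered at (y, x)
    return img
```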
RandomErasing is similar to Cutout and likewise addresses the poor generalization of trained models on occluded images. The authors also point out in the paper that random erasing is complementary to random horizontal flipping, and they verified its effectiveness on pedestrian re-identification (ReID). Unlike `Cutout`, `RandomErasing` is applied to the image only with a certain probability, and the size and aspect ratio of the generated mask are randomly drawn according to pre-defined hyperparameters.
In PaddleClas, `RandomErasing` is used in the same way as the operators shown above.
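As a rough sketch of what the operator itself does (a simplified version covering only the `pixel` fill mode, assuming a float HWC image; not the actual PaddleClas implementation):
```python
import math
import random

import numpy as np

def random_erasing(img, EPSILON=0.25, sl=0.02, sh=1.0 / 3, r1=0.3, attempt=10):
    if random.random() > EPSILON:      # applied only with probability EPSILON
        return img
    h, w = img.shape[:2]
    for _ in range(attempt):           # retry until the rectangle fits
        area = random.uniform(sl, sh) * h * w
        ratio = random.uniform(r1, 1.0 / r1)
        eh, ew = int(math.sqrt(area * ratio)), int(math.sqrt(area / ratio))
        if 0 < eh < h and 0 < ew < w:
            y, x = random.randint(0, h - eh), random.randint(0, w - ew)
            img[y:y + eh, x:x + ew] = np.random.rand(eh, ew, *img.shape[2:])
            return img
    return img
```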
Mixup is the first solution for image mixing; it is easy to implement and performs well not only on image classification but also on object detection. Mixup is usually carried out within a batch for simplicity, as is `Cutmix`.
`Mixup` is used in PaddleClas at the batch level, in the same way as `Cutmix` below.
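A minimal NumPy sketch of the batch-level mixing, assuming `images` and `labels` are NumPy arrays (illustrative only, not the PaddleClas `MixupOperator`):
```python
import numpy as np

def mixup_batch(images, labels, alpha=0.2):
    lam = np.random.beta(alpha, alpha)          # mixing coefficient
    idx = np.random.permutation(len(images))    # pair each image with another
    mixed = lam * images + (1 - lam) * images[idx]
    return mixed, labels, labels[idx], lam      # loss = lam*CE(y1) + (1-lam)*CE(y2)
```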
Unlike `Mixup`, which directly adds two images together, `Cutmix` randomly cuts an `ROI` out of one image and pastes it onto the corresponding area of another image. The usage of `Cutmix` in PaddleClas is shown below.
```python
import os

# DecodeImage, ResizeImage, ToCHWImage, CutmixOperator and transform are
# PaddleClas preprocessing utilities; image_path is the directory of test images.
size = 224
decode_op = DecodeImage()
resize_op = ResizeImage(size=(size, size))
tochw_op = ToCHWImage()
cutmix_op = CutmixOperator()

ops = [decode_op, resize_op, tochw_op]

imgs_dir = image_path
batch = []
fnames = os.listdir(imgs_dir)
for idx, f in enumerate(fnames):
    data = open(os.path.join(imgs_dir, f), 'rb').read()
    img = transform(data, ops)
    batch.append((img, idx))  # fake label
new_batch = cutmix_op(batch)
```
The images after `Cutmix` are as follows.
![][test_cutmix]
For the practical part of data augmentation, please refer to [Data Augmentation Practice](../advanced_tutorials/DataAugmentation_en.md).
## Data augmentation practice
Experiments about data augmentation are introduced in detail in this section. If you want to quickly experience these methods, please refer to [**Quick start PaddleClas in 30 minutes**](../../tutorials/quick_start_en.md).
## Configurations
Since hyperparameters differ among augmentation methods, we list 8 augmentation configuration files in `configs/DataAugment` based on ResNet50 for better understanding. Users can train the model with `tools/run.sh`. The following are 3 of them.
### RandAugment
Configuration of `RandAugment` is shown as follows. `num_layers` (default: 2) and `magnitude` (default: 5) are its two hyperparameters.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - RandAugment:
      num_layers: 2
      magnitude: 5
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
```
### Cutout
Configuration of `Cutout` is shown as follows. `n_holes` (default: 1) and `length` (default: 112) are its two hyperparameters.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - Cutout:
      n_holes: 1
      length: 112
```
### Mixup
Configuration of `Mixup` is shown as follows. `alpha` (default: 0.2) is the hyperparameter users need to care about. In addition, `use_mix` needs to be set to `True` at the root of the configuration.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
batch_transform_ops:
  - MixupOperator:
      alpha: 0.2
```
## Start training
Users can use the following command to start training; see also `tools/train.sh`.
* Because the labels need to be mixed when the images are mixed, the accuracy on the training data cannot be computed, so no training accuracy is printed during training.
* The training data is more difficult with data augmentation, so the training loss may be larger and the training-set accuracy relatively low, but the model gains better generalization ability, so the validation-set accuracy is relatively higher.
* After applying data augmentation, the model may tend to underfit. It is recommended to reduce `l2_decay` for better performance on the validation set.
* Hyperparameters exist in almost all augmentation methods. Here we provide hyperparameters for the ImageNet-1k dataset; users may need to fine-tune them on their own datasets. More training tricks can be found in [**Tricks**](../../../zh_CN/models/Tricks.md).
## Reference
[1] Cubuk E D, Zoph B, Mane D, et al. Autoaugment: Learning augmentation strategies from data[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2019: 113-123.
[2] Cubuk E D, Zoph B, Shlens J, et al. Randaugment: Practical automated data augmentation with a reduced search space[J]. arXiv preprint arXiv:1909.13719, 2019.
[3] DeVries T, Taylor G W. Improved regularization of convolutional neural networks with cutout[J]. arXiv preprint arXiv:1708.04552, 2017.
[8] Yun S, Han D, Oh S J, et al. Cutmix: Regularization strategy to train strong classifiers with localizable features[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6023-6032.
Image Classification is a fundamental task that classifies the image by semantic information and assigns it to a specific label. Image Classification is the foundation of Computer Vision tasks, such as object detection, image segmentation, object tracking, and behavior analysis. Image Classification enjoys comprehensive applications, including face recognition and smart video analysis in the security and protection field, traffic scenario recognition in the traffic field, image retrieval and electronic photo album classification in the internet industry, and image recognition in the medical industry.
Generally speaking, Image Classification attempts to fully describe the whole image through feature engineering and assign labels with a classifier, so how to extract image features is the essential part. Before deep learning, the most widely adopted classification method was the Bag of Words model. Image classification based on deep learning, however, can learn hierarchical feature descriptions through supervised and unsupervised learning, replacing manual feature selection. In recent years, Convolutional Neural Networks (CNNs) have achieved remarkable performance in the image field: they use pixel information as input to retain as much information as possible, extract features through convolution, and output the classification result directly. This end-to-end method performs well and is widely used.
Image Classification is a basic but important field in computer vision, whose research results have had a lasting impact on the development of computer vision and even deep learning. Image classification has many sub-fields, such as multi-label image classification and fine-grained image classification; here we only give a brief introduction to single-label image classification.
<aname="1"></a>
## 1. Dataset Introduction
<aname="1.1"></a>
### 1.1 ImageNet-1k
The ImageNet project is a large-scale visual database for the research of visual object recognition software. More than 14 million images have been manually annotated to point out the objects in the picture, and at least 1 million images are provided with bounding boxes. ImageNet-1k is a subset of the ImageNet dataset containing 1000 categories; its training set contains 1,281,167 images and its validation set contains 50,000 images. Since 2010, ImageNet has held an annual image classification competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), with ImageNet-1k as its specified dataset. To date, ImageNet-1k has become one of the most significant contributors to the development of computer vision, and numerous initial models of downstream computer vision tasks are trained on it.
<aname="1.2"></a>
### 1.2 CIFAR-10/CIFAR-100
The CIFAR-10 data set consists of 60,000 color images of 10 categories with an image resolution of 32x32, and each category has 6000 images, including 5000 in the training set and 1000 in the validation set. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The CIFAR-100 dataset is an extension of CIFAR-10 and consists of 60,000 color images of 100 classes with an image resolution of 32x32, and each class has 600 images, including 500 in the training set and 100 in the validation set. Researchers can try different algorithms quickly due to their small scale. These two data sets are also commonly used for testing the quality of models in image classification.
<aname="2"></a>
## 2. Image Classification Process
The prepared training data is preprocessed and then fed into the image classification model. The output of the model and the ground-truth labels are used to compute a cross-entropy loss function, which describes the convergence direction of the model. A trained image classification model is obtained by repeatedly traversing all the image data, using an optimizer to perform gradient descent on the final loss function, propagating the gradient information back to the model, and updating the model weights.
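As a minimal sketch of this loop in PaddlePaddle (the `loader` of preprocessed batches and the model choice are illustrative assumptions, not PaddleClas's actual training code):
```python
import paddle

model = paddle.vision.models.resnet50(num_classes=1000)
loss_fn = paddle.nn.CrossEntropyLoss()
optimizer = paddle.optimizer.Momentum(learning_rate=0.1, momentum=0.9,
                                      parameters=model.parameters())

for images, labels in loader:        # preprocessed batches from a DataLoader
    logits = model(images)           # forward pass
    loss = loss_fn(logits, labels)   # cross-entropy against ground truth
    loss.backward()                  # gradients of the loss
    optimizer.step()                 # update the weights
    optimizer.clear_grad()
```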
<aname="2.1"></a>
### 2.1 Data and its Preprocessing
The quality and quantity of data often determine the performance of a model. In image classification, data consists of images and labels. In most cases, labeled data is too scarce to saturate the model. To let the model learn more image features, plenty of image transformation or data augmentation is required before the images enter the model, ensuring the diversity of the input data and hence better generalization. PaddleClas provides standard image transformations for training ImageNet-1k, as well as 8 data augmentation methods. For the related code, please refer to [Data Preprocess](https://github.com/PaddlePaddle/PaddleClas/blob/develop/ppcls/data/preprocess), and for the configuration files, [Data Augmentation Configuration File](https://github.com/PaddlePaddle/PaddleClas/blob/develop/ppcls/configs/ImageNet/DataAugment).
<aname="2.2"></a>
### 2.2 Prepare the Model
After the data is settled, the model often determines the upper limit of the final accuracy. In image classification, classic models emerge endlessly. PaddleClas provides 36 series, a total of 164 ImageNet pre-trained models. For specific accuracy, speed and other indicators, please refer to [Backbone Network Introduction](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/en/models).
<aname="2.3"></a>
### 2.3 Train the Model
After preparing the data and model, you can start training the model and updating the parameters of the model. After many iterations, a trained model can finally be obtained for image classification tasks. The training process of image classification requires a lot of experience and involves the setting of many hyperparameters. PaddleClas provides a series of [training tuning methods](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/en/models/Tricks_en.md), which can help you quickly obtain a high-precision model.
<aname="2.4"></a>
### 2.4 Evaluate the Model
After a model is trained, the evaluation results of the model on the validation set can determine the performance of the model. The evaluation index is generally Top1-Acc or Top5-Acc, and the higher the index, the better the model performance.
<aname="3"></a>
## 3. Main Algorithms Introduction
- LeNet: Yann LeCun et al. first applied convolutional neural networks to image classification tasks in the 1990s and creatively proposed LeNet, which achieved great success in handwritten digit recognition.
- AlexNet: Alex Krizhevsky et al. proposed AlexNet in 2012 and applied it to ImageNet, winning the 2012 ImageNet classification competition and setting off the deep learning boom.
- VGG: Simonyan and Zisserman put forward the VGG network structure in 2014. It stacks smaller convolution kernels throughout the network, achieving better performance on ImageNet classification and providing new ideas for subsequent network designs.
- GoogLeNet: Christian Szegedy et al. presented GoogLeNet in 2014. It uses a multi-branch structure and a global average pooling layer (GAP); while maintaining accuracy, it drastically reduces the model's storage and computation cost. The network won the 2014 ImageNet classification competition.
- ResNet: Kaiming He et al. delivered ResNet in 2015, which increased network depth by introducing a residual module. It reduced the ImageNet classification error rate to 3.6%, surpassing normal human-eye recognition accuracy for the first time.
- DenseNet: Gao Huang et al. proposed DenseNet in 2017. The network designs a more densely connected block and achieves higher performance with fewer parameters.
- EfficientNet: Mingxing Tan et al. introduced EfficientNet in 2019. The network balances its width, its depth, and the resolution of the input image, reaching state-of-the-art results with the same FLOPS and parameter counts.
For more algorithm introduction, please refer to [Algorithm Introduction](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/en/models).
- [1. Choice of Optimizers](#1)
- [2. Choice of Learning Rate and Learning Rate Declining Strategy](#2)
  - [2.1 Concept of Learning Rate](#2.1)
  - [2.2 Learning Rate Decline Strategy](#2.2)
  - [2.3 Warmup Strategy](#2.3)
- [3. Choice of Batch_size](#3)
- [4. Choice of Weight_decay](#4)
- [5. Choice of Label_smoothing](#5)
- [6. Change the Crop Area and Stretch Transformation Degree of the Images for Small Models](#6)
- [7. Use Data Augmentation to Improve Accuracy](#7)
- [8. Determine the Tuning Strategy by Train_acc and Test_acc](#8)
- [9. Improve the Accuracy of Your Own Data Set with Existing Pre-trained Models](#9)
<aname="1"></a>
## 1. Choice of Optimizers
Since the development of deep learning, there have been many researchers working on the optimizer. The purpose of the optimizer is to make the loss function as small as possible, so as to find suitable parameters to complete a certain task. At present, the main optimizers used in model training are SGD, RMSProp, Adam, AdaDelt and so on. The SGD optimizers with momentum is widely used in academia and industry, so most of models we release are trained by SGD optimizer with momentum. But the SGD optimizer with momentum has two disadvantages, one is that the convergence speed is slow, the other is that the initial learning rate is difficult to set, however, if the initial learning rate is set properly and the models are trained in sufficient iterations, the models trained by SGD with momentum can reach higher accuracy compared with the models trained by other optimizers. Some other optimizers with adaptive learning rate such as Adam, RMSProp and so on tent to converge faster, but the final convergence accuracy will be slightly worse. If you want to train a model in faster convergence speed, we recommend you use the optimizers with adaptive learning rate, but if you want to train a model with higher accuracy, we recommend you to use SGD optimizer with momentum.
<a name="2"></a>
## 2. Choice of Learning Rate and Learning Rate Declining Strategy
The choice of learning rate is related to the optimizer, data set and tasks. Here we mainly introduce the learning rate of training ImageNet-1K with momentum + SGD as the optimizer and the choice of learning rate decline.
<a name="2.1"></a>
### 2.1 Concept of Learning Rate
The learning rate is the hyperparameter that controls the learning speed: the lower the learning rate, the slower the loss value changes. Using a low learning rate ensures that no local minimum is missed, but it also means slow convergence, especially when the gradient gets trapped in a plateau region.
<a name="2.2"></a>
### 2.2 Learning Rate Decline Strategy
During training, using a single fixed learning rate does not yield the model with the highest accuracy, so the learning rate should be adjusted as training proceeds. In the early stage, the weights are randomly initialized and the gradients tend to be large, so a relatively large learning rate can be used for faster convergence. In the late stage, the weights are close to the optimal values, which a large learning rate cannot reach, so a relatively small learning rate should be used. Many researchers use the piecewise_decay strategy, a stepwise learning rate decline: for example, when training ResNet50, the initial learning rate is 0.1 and drops to 1/10 every 30 epochs, with 120 epochs in total. Besides piecewise_decay, other schedules such as polynomial_decay, exponential_decay and cosine_decay have been proposed; among them, cosine_decay has become the preferred schedule for improving accuracy, because it needs no extra hyperparameters and is relatively robust. The learning rate curves of cosine_decay and piecewise_decay are shown in the following figures; it is easy to see that cosine_decay keeps a relatively large learning rate throughout training, so its convergence is slower, but its final accuracy is better than that of piecewise_decay.
![](../../images/models/lr_decay.jpeg)
In addition, we can also see from the figures that cosine_decay spends fewer epochs at a small learning rate, which affects the final accuracy. To let cosine_decay work well, it is recommended to use it with a large number of epochs, such as 200.
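A small sketch of the two schedules discussed above, returning the per-epoch learning rate (defaults follow the ResNet50 example):
```python
import math

def piecewise_decay(epoch, base_lr=0.1, step=30):
    return base_lr * 0.1 ** (epoch // step)     # drop to 1/10 every 30 epochs

def cosine_decay(epoch, base_lr=0.1, total_epochs=120):
    return 0.5 * base_lr * (1 + math.cos(math.pi * epoch / total_epochs))
```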
<a name="2.3"></a>
### 2.3 Warmup Strategy
If a large batch_size is used to train a neural network, we recommend the warmup strategy. As the name suggests, warmup lets the model warm up first: instead of using the initial learning rate directly at the beginning of training, the learning rate is gradually increased until it reaches the initial learning rate, after which the chosen decay schedule takes over. Experiments show that warmup can improve accuracy when the batch size is large. For models trained with a large batch_size, such as MobileNetV3, we set warmup to 5 epochs by default: the learning rate increases from 0 to the initial learning rate during the first 5 epochs, and then the learning rate decay begins.
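A sketch of linear warmup handing over to cosine decay (5 warmup epochs, as in the MobileNetV3 example; an illustration, not the exact PaddleClas scheduler):
```python
import math

def warmup_cosine(epoch, base_lr=0.1, warmup_epochs=5, total_epochs=120):
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs            # ramp up from 0
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```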
<a name="3"></a>
## 3. Choice of Batch_size
Batch_size is an important hyperparameter in training neural networks: it determines how many samples are fed to the network for training at a time. In paper [1], the authors found experimentally that when batch_size and learning rate scale linearly together, the convergence accuracy is hardly affected. When training on ImageNet, an initial learning rate of 0.1 with batch_size 256 is commonly chosen, so depending on the actual model size and memory, you can set the learning rate to 0.1\*k and batch_size to 256\*k.
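In code, this linear scaling rule is simply:
```python
def scale_hyperparams(k, base_lr=0.1, base_batch_size=256):
    # batch_size and learning rate scale together by the same factor k
    return base_lr * k, base_batch_size * k

lr, batch_size = scale_hyperparams(2)  # 0.2 and 512
```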
<a name="4"></a>
## 4. Choice of Weight_decay
Overfitting is a common term in machine learning: the model performs well on the training data but poorly on the test data. Convolutional neural networks also suffer from overfitting, and many regularization methods have been proposed to avoid it; weight_decay is one of the most widely used. Weight_decay adds an L2 penalty on the weights to the loss function; with L2 regularization, the network tends to choose smaller weights, the parameters of the entire network tend toward 0, and the generalization performance of the model improves accordingly. In deep learning frameworks, the coefficient of L2 regularization is called L2_decay in Paddle, so that name is used below. The larger the coefficient, the more the model tends to underfit. When training on ImageNet, this parameter is set to 1e-4 in most networks. In some small networks such as the MobileNet series, the value is set to 1e-5 ~ 4e-5 to avoid underfitting. The setting is also related to the dataset: when the dataset is large, the network tends to underfit and the value can be reduced appropriately; when the dataset is small, the network tends to overfit and the value can be increased appropriately. The following table shows the accuracy of MobileNetV1_x0_25 with different l2_decay values on ImageNet-1k. Since MobileNetV1_x0_25 is a relatively small network, a large l2_decay makes it underfit, so 3e-5 is a better choice than 1e-4 for this network.
| Model | L2_decay | Train acc1/acc5 | Test acc1/acc5 |
...
...
In addition, the setting of L2_decay is also related to whether other regularization methods are used.
In summary, l2_decay can be adjusted according to the specific task and model: a larger l2_decay is usually recommended for simple tasks or larger models, and a smaller l2_decay for complex tasks or smaller models.
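Conceptually, L2_decay adds a quadratic penalty on the weights to the task loss; a small NumPy sketch:
```python
import numpy as np

def loss_with_l2(task_loss, weights, l2_decay=1e-4):
    # larger l2_decay pushes weights toward 0 and the model toward underfitting
    penalty = l2_decay * sum((np.asarray(w) ** 2).sum() for w in weights)
    return task_loss + penalty
```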
<a name="5"></a>
## 5. Choice of Label_smoothing
Label_smoothing is a regularization method in deep learning, whose full name is Label Smoothing Regularization (LSR). In a traditional classification task, the loss is the cross-entropy between the one-hot ground-truth label and the network output. Label smoothing turns the hard one-hot label into a soft label: the network no longer learns from hard labels but from soft labels with probability values, where the position corresponding to the true category has the largest probability and the other positions have very small values (see paper [2] for the exact formulation). Label smoothing has a parameter epsilon describing the degree of softening: the larger epsilon, the smaller the peak probability and the smoother the label; conversely, the label tends toward a hard label. When training on ImageNet-1k, the parameter is usually set to 0.1. In experiments training ResNet50, the accuracy with label_smoothing is higher than without it; the following table shows the performance of ResNet50_vd with and without label smoothing.
| Model | Use_label_smoothing | Test acc1 |
...
...
However, because label smoothing can be regarded as a regularization method, its gains are limited or even negative on relatively small models.
In summary, using label_smoothing on larger models can effectively improve accuracy, while using it on smaller models may reduce accuracy. Before deciding whether to use label_smoothing, evaluate the size of the model and the difficulty of the task.
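A sketch of how LSR softens a one-hot label with epsilon:
```python
import numpy as np

def smooth_label(label, num_classes, epsilon=0.1):
    soft = np.full(num_classes, epsilon / num_classes)
    soft[label] += 1.0 - epsilon   # true class keeps most of the mass
    return soft

print(smooth_label(2, 5))  # [0.02 0.02 0.92 0.02 0.02]
```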
<a name="6"></a>
## 6. Change the Crop Area and Stretch Transformation Degree of the Images for Small Models
In the standard preprocessing of ImageNet-1k data, the random_crop function defines two values, scale and ratio, which respectively determine the size of the crop and the degree of image stretching. The default range of scale is 0.08-1 (lower_scale-upper_scale); the default range of ratio is 3/4-4/3 (lower_ratio-upper_ratio). For small networks, such strong data augmentation makes the network underfit, lowering the accuracy. To improve accuracy, the data augmentation can be weakened, i.e. the crop area can be enlarged or the stretching weakened, by increasing lower_scale or narrowing the gap between lower_ratio and upper_ratio. The following table lists the accuracy of MobileNetV2_x0_25 trained with different lower_scale values; both training and validation accuracy improve after the crop area is enlarged.
| Model | Scale Range | Train_acc1/acc5 | Test_acc1/acc5 |
...
...
<a name="7"></a>
## 7. Use Data Augmentation to Improve Accuracy
In general, the size of the dataset is critical to the performance, but image annotation is often expensive, so the number of annotated images is often scarce. In this case, data augmentation is particularly important. In the standard data augmentation for training on ImageNet-1k, two methods, random_crop and random_flip, are mainly used. However, in recent years, more and more data augmentation methods have been proposed, such as Cutout, Mixup, Cutmix, AutoAugment, etc. Experiments show that these methods can effectively improve model accuracy. The following table lists the performance of ResNet50 with 8 different data augmentation methods. Compared with the baseline, all of them improve the accuracy of ResNet50, and Cutmix is currently the most effective. More details can be seen in [**Data Augmentation**](https://paddleclas.readthedocs.io/zh_CN/latest/advanced_tutorials/image_augmentation/ImageAugment.html).
| Model | Data Augmentation | Test top-1 |
...
...
| ResNet50 | Random-Erasing | 77.91% |
| ResNet50 | Hide-and-Seek | 77.43% |
<a name="8"></a>
## 8. Determine the Tuning Strategy by Train_acc and Test_acc
During training, the training-set and validation-set accuracies of each epoch are usually printed. Generally, training accuracy slightly higher than or equal to validation accuracy indicates a good state. If the training accuracy is much higher than the validation accuracy, the task is overfitting and more regularization is needed: increase L2_decay, use more data augmentation, use label smoothing, and so on. If the training accuracy is lower than the validation accuracy, the task is underfitting and the opposite applies: decrease L2_decay, use less data augmentation, increase the crop area, weaken image stretching, remove label_smoothing, etc.
<a name="9"></a>
## 9. Improve the Accuracy of Your Own Data Set with Existing Pre-trained Models
In the field of computer vision, loading pre-trained models to train one's own task has become common. Compared with training from random initialization, loading a pre-trained model usually improves accuracy on the specific task. In general, the widely used pre-trained models are obtained from the ImageNet-1k dataset; their fc-layer weight is a k\*1000 matrix, where k is the number of input neurons, and the fc-layer weights need not be loaded because the tasks differ. Regarding the learning rate: if your training dataset is particularly small (e.g. fewer than 1,000 samples), we recommend a small initial learning rate such as 0.001 (batch_size 256, same below), to avoid destroying the pre-trained weights with a large learning rate; if your dataset is relatively large (more than 100,000 samples), we suggest trying a larger initial learning rate, such as 0.01 or greater.
This tutorial is mainly for new users, that is, users who are in the introductory stage of deep learning theory, know some Python syntax, and can read simple code. This content mainly covers using PaddleClas for image classification network training and model prediction.
---
## Catalogue
- [1. Basic knowledge](#1)
- [2. Environmental installation and configuration](#2)
- [3. Data preparation and processing](#3)
- [4. Model training](#4)
  - [4.1 Use CPU for model training](#4.1)
    - [4.1.1 Training without using pre-trained models](#4.1.1)
    - [4.1.2 Use pre-trained models for training](#4.1.2)
  - [4.2 Use GPU for model training](#4.2)
    - [4.2.1 Training without using pre-trained models](#4.2.1)
    - [4.2.2 Use pre-trained models for training](#4.2.2)
- [5. Model prediction](#5)
<aname="1"></a>
## 1. Basic knowledge
Image classification is a pattern classification problem and the most basic task in computer vision; its goal is to classify different images into different categories. Below we briefly explain some concepts you need to understand during model training; we hope they are helpful to first-time users of PaddleClas:
- train/val/test dataset represents training set, validation set and test set respectively:
- Training dataset: used to train the model so that the model can recognize different types of features;
- Validation set (val dataset): the test set during the training process, which is convenient for checking the status of the model during the training process;
- Test dataset: After training the model, the test dataset is used to evaluate the results of the model.
- Pre-trained model
Using a pre-trained model trained on a larger dataset, that is, the weights of the parameters are preset, can help the model converge faster on the new dataset. Especially for some tasks with scarce training data, when the neural network parameters are very large, we may not be able to fully train the model with a small amount of training data. The method of loading the pre-trained model can be thought of as allowing the model to learn based on a better initial weight, so as to achieve better performance.
- epoch
The total number of training epochs of the model. The model passes through all the samples in the training set once, which is an epoch. When the difference between the error rate of the validation set and the error rate of the training set is small, the current number of epochs can be considered appropriate; when the error rate of the validation set first decreases and then becomes larger, it means that the number of epochs is too large and the number of epochs needs to be reduced. Otherwise, the model may overfit the training set.
- Loss Function
During the training process, measure the difference between the model output (predicted value) and the ground truth.
- Accuracy (Acc): the proportion of correctly predicted samples in all samples
  - Top1 Acc: the prediction is judged correct if the class with the highest predicted probability is correct;
  - Top5 Acc: the prediction is judged correct if the ground-truth class is among the top 5 predicted probabilities (see the sketch below);
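A NumPy sketch of these two metrics:
```python
import numpy as np

def topk_acc(probs, labels, k=1):
    # probs: (N, num_classes) scores; labels: (N,) ground-truth class ids
    topk = np.argsort(probs, axis=1)[:, -k:]        # k highest-scoring classes
    return (topk == labels[:, None]).any(axis=1).mean()

# top1 = topk_acc(probs, labels, k=1); top5 = topk_acc(probs, labels, k=5)
```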
<aname="2"></a>
## 2. Environmental installation and configuration
For specific installation steps, please refer to [Paddle Installation Document](../installation/install_paddle_en.md), [PaddleClas Installation Document](../installation/install_paddleclas_en.md).
<aname="3"></a>
## 3. Data preparation and processing
Enter the PaddleClas directory:
```shell
# linux or mac, $path_to_PaddleClas represents the root directory of PaddleClas, and users need to modify it according to their real directory.
cd $path_to_PaddleClas
```
Enter the `dataset/` directory, download and unzip the flowers102 dataset:
```shell
# linux or mac
cd dataset/
# If you want to download directly from the browser, you can copy the link, visit it, then download and unzip
# windows users can open the PaddleClas root directory and do the same from there
```
If there is no `wget` command or if you are on the Windows operating system, copy the address into a browser to download the file, and unzip it to the directory `PaddleClas/dataset/`.
After the unzip operation is completed, there are three `.txt` files for training and testing under the directory `PaddleClas/dataset/flowers102`: `train_list.txt` (training set, 1020 images), `val_list.txt` (validation set, 1020 images), and `train_extra_list.txt` (larger training set, 7169 images). Each line in these files has the format: **image relative path** **image label_id** (note: separated by a space). There is also a mapping file between label ids and category names: `flowers102_label_list.txt`.
The image files of the flowers102 dataset are stored in the `dataset/flowers102/jpg` directory. Image examples are as follows:
<aname="4"></a>
## 4. Model training
<aname="4.1"></a>
### 4.1 Use CPU for model training
Since the CPU is used for model training, the computation is slow, so ShuffleNetV2_x0_25 is used here as the example: this model has a small computation cost and runs faster on a CPU. But because the model is small, the accuracy of the trained model will also be limited.
<aname="4.1.1"></a>
#### 4.1.1 Training without using pre-trained models
```shell
# If you are using the windows operating system, please enter the root directory of PaddleClas in cmd and execute this command:
```
- The `-c` parameter is to specify the path of the configuration file for training, and the specific hyperparameters for training can be viewed in the `yaml` file
- The `Global.device` parameter in the `yaml` file is set to `cpu`, that is, the CPU is used for training (if not set, this parameter defaults to `gpu`)
- The `epochs` parameter in the `yaml` file is set to 20, indicating that 20 epoch iterations are performed on the entire data set. It is estimated that the training can be completed in about 20 minutes (different CPUs have slightly different training times). At this time, the training model is not sufficient. To improve the accuracy of the training model, please set this parameter to a larger value, such as **40**, the training time will be extended accordingly
- The `-o` parameter can be set to `True` or `False`, or it can be the storage path of the pre-training model. When `True` is selected, the pre-training weights will be automatically downloaded to the local. Note: If it is a pre-training model path, do not add: `.pdparams`
You can compare whether to use the pre-trained model and observe the drop in loss.
<aname="4.2"></a>
### 4.2 Use GPU for model training
Since GPU training is faster and more complex models can be used, take ResNet50_vd as an example. Compared with ShuffleNetV2_x0_25, this model is more computationally intensive, and the accuracy of the trained model will be higher.
First, you must set the environment variables and use the 0th GPU for training:
- For Linux users:
```shell
export CUDA_VISIBLE_DEVICES=0
```
- For Windows users
```shell
set CUDA_VISIBLE_DEVICES=0
```
<aname="4.2.1"></a>
#### 4.2.1 Training without using pre-trained models
**Note**: This training script uses the GPU. If you use the CPU, you can modify it as shown in [4.1 Use CPU for model training](#4.1) above.
The `Top1 Acc` curve of the validation set is shown below. The highest accuracy rate is `0.9402`. After loading the pre-trained model, the accuracy of the flowers102 data set has been greatly improved, and the absolute accuracy has increased by more than 65%.
After the training is completed, the trained model can be used to predict the image category. Take the trained ResNet50_vd model as an example, the prediction code is as follows:
The `-i` parameter can also be the directory of the image file to be tested (`dataset/flowers102/jpg/`). After running successfully, some sample results are as follows:
Here is a quick start tutorial for professional users to use PaddleClas on the Linux operating system. The main content is based on the CIFAR-100 data set. You can quickly experience the training of different models, experience loading different pre-trained models, experience the SSLD knowledge distillation solution, and experience data augmentation. Please refer to [Installation Guide](../installation/install_paddleclas_en.md) to configure the operating environment and clone PaddleClas code.
------
## Catalogue
- [1. Data and model preparation](#1)
  - [1.1 Data preparation](#1.1)
    - [1.1.1 Prepare CIFAR100](#1.1.1)
- [2. Model training](#2)
  - [2.1 Single label training](#2.1)
    - [2.1.1 Training without loading the pre-trained model](#2.1.1)
    - [2.1.2 Transfer learning](#2.1.2)
- [3. Data Augmentation](#3)
  - [3.1 Data augmentation-Mixup](#3.1)
- [4. Knowledge distillation](#4)
- [5. Model evaluation and inference](#5)
  - [5.1 Single-label classification model evaluation and inference](#5.1)
    - [5.1.1 Single-label classification model evaluation](#5.1.1)
    - [5.1.2 Single-label classification model prediction](#5.1.2)
    - [5.1.3 Single-label classification uses inference model for model inference](#5.1.3)
<aname="1"></a>
## 1. Data and model preparation
<aname="1.1"></a>
### 1.1 Data preparation
* Enter the PaddleClas directory.
```
cd path_to_PaddleClas
```
<aname="1.1.1"></a>
#### 1.1.1 Prepare CIFAR100
* Enter the `dataset/` directory, download and unzip the CIFAR100 dataset.
The highest accuracy of the validation set is around 0.415.
<aname="2.1.2"></a>
#### 2.1.2 Transfer learning
* Based on ImageNet1k classification pre-training model ResNet50_vd_pretrained (accuracy rate 79.12%) for fine-tuning, the training script is shown below.
The highest accuracy of the validation set is about 0.718. After loading the pre-trained model, the accuracy of the CIFAR100 data set has been greatly improved, with an absolute accuracy increase of 30%.
* Based on ImageNet1k classification pre-training model ResNet50_vd_ssld_pretrained (accuracy rate of 82.39%) for fine-tuning, the training script is shown below.
On the final CIFAR100 validation set, the top-1 accuracy is 0.73; compared with fine-tuning from the pre-trained model with 79.12% top-1 accuracy, the top-1 accuracy on the new dataset increases by another 1.2%.
* Replace the backbone with MobileNetV3_large_x1_0 for fine-tuning, the training script is shown below.
The highest accuracy of the validation set is about 0.601, which is nearly 12% lower than ResNet50_vd.
<aname="3"></a>
## 3. Data Augmentation
PaddleClas contains many data augmentation methods, such as Mixup, Cutout, RandomErasing, etc. For specific methods, please refer to the [Data augmentation chapter](../algorithm_introduction/DataAugmentation_en.md).
<aname="3.1"></a>
### 3.1 Data augmentation-Mixup
Based on the training method in [Data Augmentation Chapter](../algorithm_introduction/DataAugmentation_en.md) in Section 3.3, combined with Mixup's data augmentation method for training, the specific training script is shown below.
The final accuracy on the CIFAR100 validation set is 0.73; using data augmentation increases the model accuracy by about 1.2% again.
**Note:**
* For other data augmentation configuration files, please refer to the configuration files in `ppcls/configs/ImageNet/DataAugment/`.
* The number of epochs for training CIFAR100 is small, so the accuracy of the validation set may fluctuate by about 1%.
<aname="4"></a>
## 4. Knowledge distillation
PaddleClas includes a self-developed SSLD knowledge distillation scheme. For specific content, please refer to [Knowledge Distillation Chapter](../algorithm_introduction/knowledge_distillation_en.md). This section will try to use knowledge distillation technology to train the MobileNetV3_large_x1_0 model. Here we use the ResNet50_vd model trained in section 2.1.2 as the teacher model for distillation. First, save the ResNet50_vd model trained in section 2.1.2 to the specified directory. The script is as follows.
The model name, teacher model and student model configuration, pre-training address configuration, and freeze_params configuration in the configuration file are as follows, where the two values in `freeze_params_list` represent whether the teacher model and the student model freeze parameter training respectively.
```yaml
Arch:
  name: "DistillationModel"
  # if not null, its lengths should be same as models
  pretrained_list:
  # if not null, its lengths should be same as models
  freeze_params_list:
    - True
    - False
  models:
    - Teacher:
        name: ResNet50_vd
        pretrained: "./pretrained/best_model"
    - Student:
        name: MobileNetV3_large_x1_0
        pretrained: True
```
The loss configuration is as follows, where the training loss is the cross entropy of the output of the student model and the teacher model, and the validation loss is the cross entropy of the output of the student model and the true label.
In the end, the accuracy on the CIFAR100 validation set was 64.4%. Using the teacher model for knowledge distillation, the accuracy of MobileNetV3 increased by 4.3%.
**Note:**
* In the distillation process, the pre-trained model used by the teacher model is the training result on the CIFAR100 dataset, and the student model uses the MobileNetV3_large_x1_0 pre-trained model with an accuracy of 75.32% on the ImageNet1k dataset.
* The distillation process does not need real labels, so more unlabeled data can be used. In practice, you can generate a fake `train_list.txt` from unlabeled data and merge it with the real `train_list.txt`; you can experience this yourself based on your own data.
<aname="5"></a>
## 5. Model evaluation and inference
<aname="5.1"></a>
### 5.1 Single-label classification model evaluation and inference
<aname="5.1.1"></a>
#### 5.1.1 Single-label classification model evaluation
After training the model, you can use the following commands to evaluate the accuracy of the model.
<a name="5.1.2"></a>
#### 5.1.2 Single-label classification model prediction
After the model training is completed, the pre-trained model obtained by the training can be loaded for model prediction. A complete example is provided in `tools/infer.py`, the model prediction can be completed by executing the following command:
<a name="5.1.3"></a>
#### 5.1.3 Single-label classification uses inference model for model inference
We need to export the inference model, PaddlePaddle supports the use of prediction engines for inference. Here, we will introduce how to use the prediction engine for inference:
First, export the trained model to inference model:
* By default, `inference.pdiparams`, `inference.pdmodel` and `inference.pdiparams.info` files will be generated in the `inference` folder.
Use prediction engines for inference:
Enter the deploy directory:
```bash
cd deploy
```
Change the `inference_cls.yaml` file. Since the resolution used for training CIFAR100 is 32x32, the relevant resolution needs to be changed. The image preprocessing in the final configuration file is as follows:
```yaml
PreProcess:
  transform_ops:
    - ResizeImage:
        resize_short: 36
    - CropImage:
        size: 32
    - NormalizeImage:
        scale: 0.00392157
        mean: [0.485, 0.456, 0.406]
        std: [0.229, 0.224, 0.225]
        order: ''
    - ToCHWImage:
```
Execute the command to make predictions. Since the default `class_id_map_file` is the mapping file for the ImageNet dataset, it needs to be set to `None` here.
Experience the training, evaluation, and prediction of multi-label classification based on the [NUS-WIDE-SCENE](https://lms.comp.nus.edu.sg/wp-content/uploads/2019/research/nuswide/NUS-WIDE.html) dataset, a subset of the NUS-WIDE dataset. Please first install PaddlePaddle and PaddleClas; see [Paddle Installation](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/zh_CN/installation) and [PaddleClas installation](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/zh_CN/installation/install_paddleclas.md) for more details.
## Contents
- [1. Data and Model Preparation](#1)
- [2. Model Training](#2)
- [3. Model Evaluation](#3)
- [4. Model Prediction](#4)
- [5. Predictive engine-based Prediction](#5)
  - [5.1 Export inference model](#5.1)
  - [5.2 Predictive engine-based Prediction](#5.2)
<aname="1"></a>
## 1. Data and Model Preparation
- Go to `PaddleClas`.
```
cd path_to_PaddleClas
```
- Create and go to `dataset/NUS-WIDE-SCENE`, download and unzip the NUS-WIDE-SCENE dataset.
Based on the flowers102 dataset, it takes only 30 minutes to experience PaddleClas, including training various backbones and pre-trained models, SSLD distillation, and multiple data augmentations. Please refer to [Installation](install_en.md) to install first.
## Preparation
* Enter the PaddleClas directory.
```
cd path_to_PaddleClas
```
* Enter `dataset/flowers102`, download and decompress flowers102 dataset.
```shell
cd dataset/flowers102
# If you want to download from the browser, you can copy the link and visit it
```
**Note**: If you want to download the pretrained models on Windows environment, you can copy the links to the browser and download.
## Training
* All experiments are running on the NVIDIA® Tesla® V100 single card.
* First of all, use the following command to set visible device.
If you use mac or linux, you can use the following command:
```shell
export CUDA_VISIBLE_DEVICES=0
```
* If you use windows, you can use the following command.
```shell
set CUDA_VISIBLE_DEVICES=0
```
* If you want to train on cpu device, you can modify the field `use_gpu: True` in the config file to `use_gpu: False`, or you can append `-o use_gpu=False` in the training command, which means override the value of `use_gpu` as False.
Compared with the ResNet50_vd pre-trained model, the accuracy decreases by 5%, to 90%. Different architectures deliver different performance; choosing the best-performing model is a task-oriented decision that should consider inference time, storage, heterogeneous devices, etc.
### RandomErasing
Data augmentation works when training data is small.
* The ResNet50_vd model pretrained on previous chapter will be used as the teacher model to train student model. Save the model to specified directory, command as follows:
* Samples in the `extra_list.txt` and `val_list.txt` don't have intersection
* Because the label information is unused in the source code, this is still label-free distillation
* The teacher model uses the model pre-trained on the flowers102 dataset, and the student model uses the MobileNetV3_large_x1_0 model pre-trained on the ImageNet1K dataset (Acc 75.32\%)