@@ -106,6 +106,14 @@ Finetuning is carried out on ImageNet1k dataset to restore distribution between
* For image classification tasks, model accuracy can be further improved when the test scale is 1.15 times the training scale [5]. For the 82.99% ResNet50_vd pretrained model, accuracy reaches 83.7% when evaluating at 320x320. We use the Fix strategy to finetune the model with the training scale set to 320x320; during this process, the preprocessing pipeline is the same for training and test, and all weights except those of the fully connected layer are frozen. Finally, the top-1 accuracy reaches **84.0%**.
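A rough sketch of the freezing step (not the actual PaddleClas code): `resnet50` from `paddle.vision` stands in for ResNet50_vd, and the check for `fc` in parameter names is an assumption about how the head is named.

```python
import paddle
from paddle.vision.models import resnet50

# Sketch: freeze all weights except the fully connected layer before
# finetuning at the larger 320x320 scale.
model = resnet50()
for name, param in model.named_parameters():
    if "fc" not in name:
        param.stop_gradient = True  # excluded from gradient updates
```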
### Some phenomena during the experiment
In the prediction process, the mean and variance used by batch norm are loaded from the pretrained model (its mode is set to test mode). In the training process, batch norm statistics are computed from the current batch (its mode is set to train mode), and a moving average is maintained with the historically saved statistics. In the distillation task, we found that letting the teacher model run in train mode, so that its real-time batch norm statistics guide the student model, works better than distilling with the teacher in test mode. The following is a set of experimental results. Therefore, in this distillation scheme, we use train mode to get the soft label of the teacher model (a code sketch follows the table below).
| Teacher Model | Teacher Top1 | Student Model | Student Top1 |
|---|---|---|---|
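Accordingly, a minimal sketch of producing the teacher's soft label in train mode; `mobilenet_v2` from `paddle.vision` and the random batch are stand-ins:

```python
import paddle
import paddle.nn.functional as F
from paddle.vision.models import mobilenet_v2

teacher = mobilenet_v2()                 # stand-in for the teacher model
images = paddle.randn([8, 3, 224, 224])  # dummy input batch

# Train mode: batch norm uses the current batch's statistics rather than
# the saved moving averages when producing the soft label.
teacher.train()
with paddle.no_grad():                   # the teacher itself is not updated
    soft_label = F.softmax(teacher(images), axis=-1)
```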
@@ -113,7 +121,7 @@ Finetuning is carried out on ImageNet1k dataset to restore distribution between
* Adjust the learning rate of the middle layers. The middle-layer feature maps of the model obtained by distillation are more refined, so when the distilled model is used as the pretrained model in other tasks, adopting the same learning rate as before can easily destroy those features, while reducing the learning rate of the whole model slows down convergence. Therefore, we adjust the learning rate of the middle layers. Specifically:
    * For ResNet50_vd, we set up a learning rate list: the three conv2d layers before the first residual block share one learning rate multiplier, and each of the four residual stages has its own multiplier, so 5 values need to be set in the list (see the sketch after this list). Through experiments, we find that when finetuning classification models in transfer learning, the learning rate list `[0.1, 0.1, 0.2, 0.2, 0.3]` performs better in most tasks, while in object detection tasks `[0.05, 0.05, 0.05, 0.1, 0.15]` brings greater accuracy gains.
    * For MobileNetV3_large_x1_0, because it contains 15 blocks, we let every 3 blocks share a learning rate multiplier, so 5 values are required. We find that in both classification and detection tasks, the learning rate list `[0.25, 0.25, 0.5, 0.5, 0.75]` performs better in most tasks.
* Appropriate l2 decay. Different l2 decay values are set for different models during training. To prevent overfitting, l2 decay is often set larger for large models: it is `1e-4` for ResNet50 and `1e-5 ~ 4e-5` for MobileNet series models. L2 decay also needs to be adjusted when applied to other tasks. Taking Faster_RCNN_MobileNetV3_FPN as an example, we found that modifying only the l2 decay can bring up to 0.5% mAP improvement on the COCO2017 dataset.
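A rough sketch of both tricks, assuming Paddle's optimizer parameter groups in which a group's `learning_rate` scales the base rate; PaddleClas itself wires the multipliers into the model definition, and `resnet50` from `paddle.vision` stands in for ResNet50_vd here:

```python
import paddle
from paddle.vision.models import resnet50

model = resnet50()
lr_mult_list = [0.1, 0.1, 0.2, 0.2, 0.3]  # stem + four residual stages

# One group for the stem, one per residual stage, each scaling the base lr.
stem_params = list(model.conv1.parameters()) + list(model.bn1.parameters())
param_groups = [{"params": stem_params, "learning_rate": lr_mult_list[0]}]
for stage, mult in zip(
    [model.layer1, model.layer2, model.layer3, model.layer4], lr_mult_list[1:]
):
    param_groups.append({"params": stage.parameters(), "learning_rate": mult})
param_groups.append({"params": model.fc.parameters()})  # head keeps the base lr

optimizer = paddle.optimizer.Momentum(
    learning_rate=0.1,      # base learning rate
    momentum=0.9,
    weight_decay=1e-4,      # l2 decay, as discussed above
    parameters=param_groups,
)
```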
...
...
@@ -167,54 +175,52 @@ This section will introduce the SSLD distillation experiments in detail based on
#### Distill MobileNetV3_small_x1_0 using MobileNetV3_large_x1_0
An example of SSLD distillation is provided here. The configuration file for distilling `MobileNetV3_small_x1_0` with `MobileNetV3_large_x1_0` is provided in `ppcls/configs/ImageNet/Distillation/mv3_large_x1_0_distill_mv3_small_x1_0.yaml`; users can directly replace the configuration file path in `tools/train.sh` to use it. Its key fields are as follows.
```yaml
# if not null, its lengths should be same as models
pretrained_list:
# if not null, its lengths should be same as models
freeze_params_list:
- True
- False
models:
  - Teacher:
      name: MobileNetV3_large_x1_0
      pretrained: True
      use_ssld: True
  - Student:
      name: MobileNetV3_small_x1_0
      pretrained: False

infer_model_name: "Student"
```
In the configuration file, `freeze_params_list` specifies whether each model's parameters should be frozen, and `models` specifies the teacher and student models; the teacher model needs to load a pretrained model. Users can directly change the models here.
### Begin to train the network
If everything is ready, users can begin to train the network using the following command.
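The command itself is elided in this excerpt; a typical multi-GPU PaddleClas launch with the configuration file above looks roughly like this (GPU ids are illustrative, and flag names vary slightly across Paddle versions):

```bash
python3 -m paddle.distributed.launch \
    --gpus="0,1,2,3" \
    tools/train.py \
    -c ppcls/configs/ImageNet/Distillation/mv3_large_x1_0_distill_mv3_small_x1_0.yaml
```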
* Before using SSLD, users first need to train a teacher model on the target dataset; this teacher model is then used to guide the training of the student model.
* When using SSLD, users need to set `use_distillation` in the configuration file to `True`. In addition, because the student model learns soft labels that carry knowledge information, the `label_smoothing` option needs to be turned off (an illustrative snippet follows these notes).
* If the student model is not initialized with a pretrained model, the other training hyperparameters can follow those used to train the student model on ImageNet-1k. If the student model is initialized with a pretrained model, the learning rate can be reduced to `1/100~1/10` of the standard learning rate.
* In the process of SSLD distillation, the student model learns only the soft label, which makes the training more difficult. It is recommended to decrease the value of `l2_decay` appropriately to obtain higher accuracy on the validation set.
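For illustration only, the two options named above would appear in the configuration roughly as follows; the exact key names and placement depend on the PaddleClas config version:

```yaml
use_distillation: True    # the student learns the teacher's soft label
label_smoothing: False    # soft labels already carry knowledge information
```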
@@ -107,10 +103,6 @@ In `RandAugment`, the author proposes a random augmentation method. Instead of u
In PaddleClas, `RandAugment` is used as follows.
```python
from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import RandAugment
from ppcls.data.imaug import transform

size = 224
...
...
@@ -153,10 +145,6 @@ Cutout is a kind of dropout, but occludes input image rather than feature map. I
In PaddleClas, `Cutout` is used as follows.
```python
from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import Cutout
from ppcls.data.imaug import transform

size = 224
...
...
@@ -188,11 +176,6 @@ RandomErasing is similar to the Cutout. It is also to solve the problem of poor
In PaddleClas, `RandomErasing` is used as follows.
```python
from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import RandomErasing
from ppcls.data.imaug import transform

size = 224
...
...
@@ -229,11 +212,6 @@ Images are divided into some patches for `HideAndSeek` and masks are generated w
In PaddleClas, `HideAndSeek` is used as follows.
```python
from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import HideAndSeek
from ppcls.data.imaug import transform

size = 224
...
...
@@ -283,11 +261,6 @@ It shows that the second method is better.
The usage of `GridMask` in PaddleClas is shown below.
```python
from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import GridMask
from ppcls.data.imaug import transform

size = 224
...
...
@@ -329,11 +302,6 @@ Mixup is the first solution for image aliasing, it is easy to realize and perfor
The usage of `Mixup` in PaddleClas is shown below.
```python
from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import transform
from ppcls.data.imaug import MixupOperator

size = 224
...
...
@@ -373,11 +341,6 @@ Cutmix randomly cuts out an `ROI` from one image, and then covered onto the corr
```python
from ppcls.data.imaug import DecodeImage
from ppcls.data.imaug import ResizeImage
from ppcls.data.imaug import ToCHWImage
from ppcls.data.imaug import transform
from ppcls.data.imaug import CutmixOperator

size = 224
...
...
@@ -444,10 +407,9 @@ Configuration of `RandAugment` is shown as follows. `Num_layers`(default as 2) a
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      to_np: False
      channel_first: False
  - RandCropImage:
      size: 224
...
...
@@ -457,11 +419,10 @@ Configuration of `RandAugment` is shown as follows. `Num_layers`(default as 2) a
      num_layers: 2
      magnitude: 5
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - ToCHWImage:
```
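For intuition, the sketch below shows the idea behind these two hyperparameters; `transforms` is a hypothetical list of callables taking an image and a strength, not PaddleClas's internal op set.

```python
import random

def rand_augment(img, transforms, num_layers=2, magnitude=5):
    # Randomly pick `num_layers` transforms (with replacement) and apply
    # each one at the shared strength `magnitude`.
    for op in random.choices(transforms, k=num_layers):
        img = op(img, magnitude)
    return img
```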
### Cutout
...
...
@@ -469,24 +430,22 @@ Configuration of `RandAugment` is shown as follows. `Num_layers`(default as 2) a
Configuration of `Cutout` is shown as follows, where `n_holes` (default: 1) and `length` (default: 112) are its two hyperparameters.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      to_np: False
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - Cutout:
      n_holes: 1
      length: 112
  - ToCHWImage:
```
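For intuition, a NumPy sketch of the occlusion these two hyperparameters control (not PaddleClas's implementation):

```python
import numpy as np

def cutout(img: np.ndarray, n_holes: int = 1, length: int = 112) -> np.ndarray:
    # Zero out `n_holes` square regions of side `length` at random centers,
    # clipping each square at the image borders.
    h, w = img.shape[:2]
    for _ in range(n_holes):
        y, x = np.random.randint(h), np.random.randint(w)
        img[max(0, y - length // 2):min(h, y + length // 2),
            max(0, x - length // 2):min(w, x + length // 2)] = 0
    return img

img = cutout(np.ones((224, 224, 3), dtype=np.float32))  # example call
```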
### Mixup
...
...
@@ -495,42 +454,39 @@ Configuration of `Cutout` is shown as follows. `n_holes`(default as 1) and `n_ho
Configuration of `Mixup` is shown as follows, where `alpha` (default: 0.2) is the hyperparameter users need to care about. In addition, `use_mix` needs to be set to `True` at the root of the configuration.
```yaml
transform_ops:
  - DecodeImage:
      to_rgb: True
      to_np: False
      channel_first: False
  - RandCropImage:
      size: 224
  - RandFlipImage:
      flip_code: 1
  - NormalizeImage:
      scale: 1.0/255.0
      mean: [0.485, 0.456, 0.406]
      std: [0.229, 0.224, 0.225]
      order: ''
  - ToCHWImage:
batch_transform_ops:
  - MixupOperator:
      alpha: 0.2
```
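For intuition about `alpha`, a NumPy sketch of the blending that `MixupOperator` performs (not its actual implementation): the mixing ratio is drawn from Beta(alpha, alpha), so a small `alpha` keeps most mixed samples close to one of the originals.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = np.random.beta(alpha, alpha)   # mixing ratio in [0, 1]
    x = lam * x1 + (1.0 - lam) * x2      # blended image
    y = lam * y1 + (1.0 - lam) * y2      # blended (soft) label
    return x, y
```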
## Start training
Users can use the following command to start the training process, which can also be referred to `tools/train.sh`.
* When using augmentation methods based on image aliasing, users need to set `use_mix` in the configuration file to `True`.
* In addition, because the label needs to be aliased along with the image, the accuracy on the training data cannot be calculated, so the training accuracy is not printed during the training process.
* The training data becomes harder to fit with data augmentation, so the training loss may be larger and the training set accuracy relatively lower; however, the model gains better generalization ability, so the validation set accuracy tends to be higher.