@@ -74,6 +74,7 @@ The configuration file is in [ch_PP-OCRv2_rec_distillation.yml](../../configs/re
#### 2.1.1 Model Structure
In the knowledge distillation task, the model structure configuration is as follows.
```yaml
Architecture:
model_type:&model_type"rec"# Model category, recognition, detection, etc.
...
...
@@ -85,37 +86,55 @@ Architecture:
freeze_params:false# Do you need fixed parameters
return_all_feats:true# Do you need to return all features, if it is False, only the final output is returned
model_type:*model_type# Model category
algorithm:CRNN# The algorithm name of the sub-network. The remaining parameters of the sub-network are consistent with the general model training configuration
algorithm:SVTR# The algorithm name of the sub-network. The remaining parameters of the sub-network are consistent with the general model training configuration
Transform:
Backbone:
name:MobileNetV1Enhance
scale:0.5
Neck:
name:SequenceEncoder
encoder_type:rnn
hidden_size:64
last_conv_stride:[1,2]
last_pool_type:avg
Head:
name:CTCHead
mid_channels:96
fc_decay:0.00002
name:MultiHead
head_list:
-CTCHead:
Neck:
name:svtr
dims:64
depth:2
hidden_dims:120
use_guide:True
Head:
fc_decay:0.00001
-SARHead:
enc_dim:512
max_text_length:*max_text_length
Student:# Another sub-network, here is a distillation example of DML, the two sub-networks have the same structure, and both need to learn parameters
pretrained:# The following parameters are the same as above
freeze_params:false
return_all_feats:true
model_type:*model_type
algorithm:CRNN
algorithm:SVTR
Transform:
Backbone:
name:MobileNetV1Enhance
scale:0.5
Neck:
name:SequenceEncoder
encoder_type:rnn
hidden_size:64
last_conv_stride:[1,2]
last_pool_type:avg
Head:
name:CTCHead
mid_channels:96
fc_decay:0.00002
name:MultiHead
head_list:
-CTCHead:
Neck:
name:svtr
dims:64
depth:2
hidden_dims:120
use_guide:True
Head:
fc_decay:0.00001
-SARHead:
enc_dim:512
max_text_length:*max_text_length
```
If you want to add more sub-networks for training, you can also add the corresponding fields in the configuration file according to the way of adding `Student` and `Teacher`.
...
...
@@ -132,55 +151,83 @@ Architecture:
freeze_params:false
return_all_feats:true
model_type:*model_type
algorithm:CRNN
algorithm:SVTR
Transform:
Backbone:
name:MobileNetV1Enhance
scale:0.5
Neck:
name:SequenceEncoder
encoder_type:rnn
hidden_size:64
last_conv_stride:[1,2]
last_pool_type:avg
Head:
name:CTCHead
mid_channels:96
fc_decay:0.00002
name:MultiHead
head_list:
-CTCHead:
Neck:
name:svtr
dims:64
depth:2
hidden_dims:120
use_guide:True
Head:
fc_decay:0.00001
-SARHead:
enc_dim:512
max_text_length:*max_text_length
Student:
pretrained:
freeze_params:false
return_all_feats:true
model_type:*model_type
algorithm:CRNN
algorithm:SVTR
Transform:
Backbone:
name:MobileNetV1Enhance
scale:0.5
Neck:
name:SequenceEncoder
encoder_type:rnn
hidden_size:64
last_conv_stride:[1,2]
last_pool_type:avg
Head:
name:CTCHead
mid_channels:96
fc_decay:0.00002
Student2:# The new sub-network introduced in the knowledge distillation task, the configuration is the same as above
name:MultiHead
head_list:
-CTCHead:
Neck:
name:svtr
dims:64
depth:2
hidden_dims:120
use_guide:True
Head:
fc_decay:0.00001
-SARHead:
enc_dim:512
max_text_length:*max_text_length
Student2:
pretrained:
freeze_params:false
return_all_feats:true
model_type:*model_type
algorithm:CRNN
algorithm:SVTR
Transform:
Backbone:
name:MobileNetV1Enhance
scale:0.5
Neck:
name:SequenceEncoder
encoder_type:rnn
hidden_size:64
last_conv_stride:[1,2]
last_pool_type:avg
Head:
name:CTCHead
mid_channels:96
fc_decay:0.00002
name:MultiHead
head_list:
-CTCHead:
Neck:
name:svtr
dims:64
depth:2
hidden_dims:120
use_guide:True
Head:
fc_decay:0.00001
-SARHead:
enc_dim:512
max_text_length:*max_text_length
```
```
When the model is finally trained, it contains 3 sub-networks: `Teacher`, `Student`, `Student2`.
...
...
@@ -224,23 +271,42 @@ Loss:
act: "softmax" # Activation function, use it to process the input, can be softmax, sigmoid or None, the default is None
model_name_pairs: # The subnet name pair used to calculate DML loss. If you want to calculate the DML loss of other subnets, you can continue to add it below the list
- ["Student", "Teacher"]
key:head_out
key: head_out
multi_head: True # whether to use mult_head
dis_head: ctc # assign the head name to calculate loss
name: dml_ctc # prefix name of the loss
- DistillationDMLLoss: # DML loss function, inherited from the standard DMLLoss
weight: 0.5
act: "softmax" # Activation function, use it to process the input, can be softmax, sigmoid or None, the default is None
model_name_pairs: # The subnet name pair used to calculate DML loss. If you want to calculate the DML loss of other subnets, you can continue to add it below the list
- ["Student", "Teacher"]
key: head_out
multi_head: True # whether to use mult_head
dis_head: sar # assign the head name to calculate loss
name: dml_sar # prefix name of the loss
- DistillationDistanceLoss: # Distilled distance loss function
weight: 1.0
mode: "l2" # Support l1, l2 or smooth_l1
model_name_pairs: # Calculate the distance loss of the subnet name pair
- ["Student", "Teacher"]
key: backbone_out
- DistillationSARLoss: # SAR loss function based on distillation, inherited from standard SAR loss
weight: 1.0 # The weight of the loss function. In loss_config_list, each loss function must include this field
model_name_list: ["Student", "Teacher"] # For the prediction results of the distillation model, extract the output of these two sub-networks and calculate the SAR loss with gt
key: head_out # In the sub-network output dict, take the corresponding tensor
multi_head: True # whether it is multi-head or not, if true, SAR branch is used to calculate the loss
```
Among the above loss functions, all distillation loss functions are inherited from the standard loss function class.
The main functions are: Analyze the output of the distillation model, find the intermediate node (tensor) used to calculate the loss,
and then use the standard loss function class to calculate.
Taking the above configuration as an example, the final distillation training loss function contains the following three parts.
Taking the above configuration as an example, the final distillation training loss function contains the following five parts.
- The final output `head_out` of `Student` and `Teacher` calculates the CTC loss with gt (loss weight equals 1.0). Here, because both sub-networks need to update the parameters, both of them need to calculate the loss with gt.
- DML loss between `Student` and `Teacher`'s final output `head_out` (loss weight equals 1.0).
- CTC branch of the final output `head_out` for `Student` and `Teacher` calculates the CTC loss with gt (loss weight equals 1.0). Here, because both sub-networks need to update the parameters, both of them need to calculate the loss with gt.
- SAR branch of the final output `head_out` for `Student` and `Teacher` calculates the SAR loss with gt (loss weight equals 1.0). Here, because both sub-networks need to update the parameters, both of them need to calculate the loss with gt.
- DML loss between CTC branch of `Student` and `Teacher`'s final output `head_out` (loss weight equals 1.0).
- DML loss between SAR branch of `Student` and `Teacher`'s final output `head_out` (loss weight equals 0.5).
- L2 loss between `Student` and `Teacher`'s backbone network output `backbone_out` (loss weight equals 1.0).
For more specific implementation of `CombinedLoss`, please refer to: [combined_loss.py](../../ppocr/losses/combined_loss.py#L23).
...
...
@@ -257,6 +323,7 @@ PostProcess:
name: DistillationCTCLabelDecode # CTC decoding post-processing of distillation tasks, inherited from the standard CTCLabelDecode class
model_name: ["Student", "Teacher"] # For the prediction results of the distillation model, extract the outputs of these two sub-networks and decode them
key: head_out # Take the corresponding tensor in the subnet output dict
multi_head: True # whether it is multi-head or not, if true, CTC branch is used to calculate the loss
```
Taking the above configuration as an example, the CTC decoding output of the two sub-networks `Student` and `Teahcer` will be calculated at the same time.
...
...
@@ -276,6 +343,7 @@ Metric:
base_metric_name: RecMetric # The base class of indicator calculation. For the output of the model, the indicator will be calculated based on this class
main_indicator: acc # The name of the indicator
key: "Student" # Select the main_indicator of this subnet as the criterion for saving the best model
ignore_space: False # whether to ignore space during evaulation
```
Taking the above configuration as an example, the accuracy metric of the `Student` subnet will be used as the judgment metric for saving the best model.
...
...
@@ -289,13 +357,13 @@ For more specific implementation of `DistillationMetric`, please refer to: [dist
There are two ways to fine-tune the recognition distillation task.
1. Fine-tuning based on knowledge distillation: this situation is relatively simple, download the pre-trained model. Then configure the pre-training model path and your own data path in [ch_PP-OCRv2_rec_distillation.yml](../../configs/rec/ch_PP-OCRv2/ch_PP-OCRv2_rec_distillation.yml) to perform fine-tuning training of the model.
1. Fine-tuning based on knowledge distillation: this situation is relatively simple, download the pre-trained model. Then configure the pre-training model path and your own data path in [ch_PP-OCRv2_rec_distillation.yml](../../configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml) to perform fine-tuning training of the model.
2. Do not use knowledge distillation in fine-tuning: In this case, you need to first extract the student model parameters from the pre-training model. The specific steps are as follows.
- First download the pre-trained model and unzip it.
After the extraction is complete, use [ch_PP-OCRv2_rec.yml](../../configs/rec/ch_PP-OCRv2/ch_PP-OCRv2_rec.yml) to modify the path of the pre-trained model (the path of the exported `student.pdparams` model) and your own data path to fine-tune the model.
After the extraction is complete, use [ch_PP-OCRv3_rec.yml](../../configs/rec/PP-OCRv3/ch_PP-OCRv3_rec.yml) to modify the path of the pre-trained model (the path of the exported `student.pdparams` model) and your own data path to fine-tune the model.
<a name="22"></a>
### 2.2 Detection Model Configuration File Analysis