## 1. PP-OCRv3 Introduction

PP-OCRv3 is a further upgrade of PP-OCRv2. The overall pipeline is the same as in PP-OCRv2, with optimizations to both the detection model and the recognition model. The detection module is still based on the DB algorithm, while the recognition module replaces CRNN with SVTR and adapts it for industrial use. The pipeline of PP-OCRv3 is as follows (the new strategies for PP-OCRv3 are in the pink boxes):

<div align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocrv3_framework.png"  width = "80%"  />
</div>


PP-OCRv3 upgrades the text detection model and text recognition model in 9 aspects based on PP-OCRv2. 

- Text detection:
    - LK-PAN: a PAN module with large receptive field;
    - DML: deep mutual learning for teacher model;
    - RSE-FPN: a FPN module with residual attention mechanism;


- Text recognition:
    - SVTR-LCNet: lightweight text recognition network;
    - GTC: Guided training of CTC by attention;
    - TextConAug: data augmentation for mining text context information;
    - TextRotNet: self-supervised pre-trained model;
    - U-DML: unified-deep mutual learning;
    - UIM: unlabeled images mining;

At comparable speed, accuracy improves substantially across scenarios:
- In the Chinese scene, accuracy improves by more than 5% over the PP-OCRv2 Chinese model;
- In the English and digit scene, accuracy improves by 11% over the PP-OCRv2 English model;
- In multi-language scenarios, recognition is optimized for 80+ languages, and the average accuracy increases by more than 5%.



For more details, please refer to the technical report: https://arxiv.org/abs/2206.03001 .

For more information about PaddleOCR, you can click https://github.com/PaddlePaddle/PaddleOCR to learn more.





## 2. Model Effects

The results of PP-OCRv3 are as follows:

<div align="center">
<img src="https://user-images.githubusercontent.com/12406017/200261622-1b928d93-93ab-4575-8c60-214bcc03eda1.png"  width = "80%"  />
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/12406017/200261711-9f18bb04-3736-4f51-892c-de801db9ab9e.png"  width = "80%"  />
</div>





## 3. How to Use the Model

### 3.1 Inference
* Install PaddleOCR whl package

```shell
pip install paddleocr
```

* Quick experience

```shell
# command line usage
paddleocr --image_dir PaddleOCR/doc/imgs/11.jpg --use_angle_cls true
```

When the command finishes, results like the following are printed to the terminal:
```log
[[[28.0, 37.0], [302.0, 39.0], [302.0, 72.0], [27.0, 70.0]], ('纯臻营养护发素', 0.96588134765625)]
[[[26.0, 81.0], [172.0, 83.0], [172.0, 104.0], [25.0, 101.0]], ('产品信息/参数', 0.9113278985023499)]
[[[28.0, 115.0], [330.0, 115.0], [330.0, 132.0], [28.0, 132.0]], ('（45元/每公斤，100公斤起订）', 0.8843421936035156)]
......
```
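
Each printed line pairs a four-point bounding box with a `(text, confidence)` tuple. The following is a minimal sketch of post-processing such entries, assuming they are already available in Python in the shape shown in the log above. The helper name `filter_lines` and the 0.9 threshold are illustrative and not part of PaddleOCR.

```python
# Sample entries copied from the log output above: [box, (text, confidence)].
results = [
    [[[28.0, 37.0], [302.0, 39.0], [302.0, 72.0], [27.0, 70.0]],
     ("纯臻营养护发素", 0.96588134765625)],
    [[[26.0, 81.0], [172.0, 83.0], [172.0, 104.0], [25.0, 101.0]],
     ("产品信息/参数", 0.9113278985023499)],
    [[[28.0, 115.0], [330.0, 115.0], [330.0, 132.0], [28.0, 132.0]],
     ("（45元/每公斤，100公斤起订）", 0.8843421936035156)],
]


def filter_lines(entries, min_confidence=0.9):
    """Hypothetical helper: keep (text, confidence) pairs at or above a threshold."""
    kept = []
    for box, (text, confidence) in entries:
        if confidence >= min_confidence:
            kept.append((text, confidence))
    return kept


for text, confidence in filter_lines(results):
    print(f"{confidence:.3f}  {text}")
```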




## 4. Model Principles

The optimization ideas are as follows:

1. Text detection enhancement strategies
- LK-PAN: a PAN module with large receptive field
  
LK-PAN (Large Kernel PAN) is a lightweight PAN structure with a larger receptive field. Its core change is to enlarge the convolution kernel in the path augmentation of the PAN structure from 3×3 to 9×9. The larger kernel widens the receptive field covered by each position of the feature map, which makes it easier to detect text in large fonts and text with extreme aspect ratios. With the LK-PAN structure, the hmean of the teacher model improves from 83.2% to 85.0%.

   <div align="center">
   <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocr_v3/LKPAN.png"  width = "60%"  />
   </div>
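
As a rough illustration of the kernel-size change, the sketch below compares the receptive field and weight count of a single 3×3 and a single 9×9 convolution. It assumes stride-1, undilated convolutions, and the channel count of 64 is an arbitrary value chosen for the example, not a figure from PP-OCRv3.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1, undilated convolutions."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf


def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in a standard k x k convolution (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


channels = 64  # illustrative channel count, not taken from the source
for k in (3, 9):
    print(k, receptive_field([k]), conv_weights(k, channels, channels))
```

Under these assumptions, one 9×9 layer covers the same extent as four stacked 3×3 layers, at nine times the weight count of a single 3×3 layer.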

- DML: deep mutual learning for teacher model

[DML](https://arxiv.org/abs/1706.00384) (Deep Mutual Learning) is a mutual learning distillation method. As shown in the figure below, two models with the same structure learn from each other, which effectively improves the accuracy of the text detection model. With the DML strategy, the hmean of the teacher model increases from 85% to 86%. Replacing the teacher model of CML in PP-OCRv2 with this higher-accuracy teacher further improves the hmean of the student model from 83.2% to 84.3%.
   <div align="center">
   <img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocr_v3/teacher_dml.png"  width = "60%"  />
   </div>
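
The mutual learning idea can be sketched with the classification form of the loss from the DML paper: each network minimizes its own supervised loss plus a KL term that pulls it toward its peer. This is an illustration only. The detection models here predict probability maps, and the loss terms actually used in PP-OCRv3 are not given in this document.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Supervised loss for one sample with an integer class label."""
    return -math.log(probs[label])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def dml_losses(logits_a, logits_b, label):
    """Per-network losses: own supervised loss plus KL toward the peer."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss_a = cross_entropy(p_a, label) + kl_divergence(p_b, p_a)
    loss_b = cross_entropy(p_b, label) + kl_divergence(p_a, p_b)
    return loss_a, loss_b


print(dml_losses([2.0, 0.5, 0.1], [1.0, 1.5, 0.2], label=0))
```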

- RSE-FPN: a FPN module with residual attention mechanism

RSE-FPN (Residual Squeeze-and-Excitation FPN), as shown in the figure below, introduces a residual structure and a channel attention structure. It replaces the convolutional layers in the FPN with RSEConv layers that include channel attention, which further improves the representation ability of the feature map. The FPN in the PP-OCRv2 detection model has very few channels, only 96, so directly replacing the FPN convolutions with an SE block would suppress the features of some channels and reduce accuracy. The residual structure in RSEConv alleviates this problem and improves text detection. Updating the FPN of the CML student model in PP-OCRv2 to RSE-FPN further improves the hmean of the student model from 84.3% to 85.4%.

<div align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocr_v3/RSEFPN.png"  width = "60%"  />
</div>
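
The effect of the residual connection can be sketched on a toy feature map. The code below is a simplified stand-in, not the PaddleOCR layer: the convolution is omitted, a plain sigmoid is assumed for the gate, and the weight matrices are hypothetical values chosen so that the gate for the second channel is almost zero.

```python
import math


def se_block(channels, w_reduce, w_expand):
    """Squeeze-and-excitation over channels, each given as a flat list of values."""
    # Squeeze: global average per channel.
    squeezed = [sum(c) / len(c) for c in channels]
    # Excitation, step 1: fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    # Excitation, step 2: fully connected layer followed by a sigmoid gate.
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    return [[g * v for v in c] for g, c in zip(gates, channels)]


def rse_output(conv_features, w_reduce, w_expand):
    """Residual SE: add the SE-reweighted features back onto the input features."""
    reweighted = se_block(conv_features, w_reduce, w_expand)
    return [[v + r for v, r in zip(c, rc)]
            for c, rc in zip(conv_features, reweighted)]


features = [[1.0, 2.0], [3.0, 4.0]]   # two channels, two positions each
w_reduce = [[1.0, 1.0]]               # hypothetical 2 -> 1 reduction weights
w_expand = [[1.0], [-4.0]]            # hypothetical 1 -> 2 expansion weights

print(se_block(features, w_reduce, w_expand)[1])    # second channel nearly zeroed
print(rse_output(features, w_reduce, w_expand)[1])  # second channel preserved
```

With the SE block alone, the second channel is scaled to almost nothing. With the residual path, its original values survive, which is the behavior the paragraph above attributes to RSEConv.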

2. Text recognition enhancement strategies
- SVTR_LCNet: lightweight text recognition network

SVTR_LCNet is a lightweight text recognition network that combines the Transformer-based SVTR network with the lightweight CNN PP-LCNet. With this network, prediction is 20% faster than with the PP-OCRv2 recognition model, but accuracy is slightly lower because no distillation strategy is used. In addition, the normalized height of the input image is increased from 32 to 48. Prediction becomes slightly slower, but accuracy improves considerably: the recognition accuracy reaches 73.98% (+2.08%), which is close to that of the PP-OCRv2 recognition model trained with distillation.
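
To see why the taller input costs some speed, the sketch below computes the width of a text line after it is scaled to a target height with its aspect ratio kept. The function name and the `max_width=320` cap are assumptions made for this example. Only the heights 32 and 48 come from the text above.

```python
import math


def resized_width(height, width, target_height=48, max_width=320):
    """Width after scaling to target_height with aspect ratio kept, capped at max_width."""
    ratio = width / height
    return min(max_width, math.ceil(target_height * ratio))


# A 32 x 100 crop keeps its width at height 32 and grows to width 150 at height 48.
print(resized_width(32, 100, target_height=32))
print(resized_width(32, 100, target_height=48))
```

When the cap is not reached, going from height 32 to 48 scales both dimensions by 1.5, so the network processes about 2.25 times as many pixels per text line.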

- GTC: Guided training of CTC by attention

[GTC](https://arxiv.org/pdf/2002.01276.pdf) (Guided Training of CTC) uses an Attention module to guide CTC training and fuses multiple text feature representations, which is an effective strategy for improving text recognition. The Attention module is removed entirely at prediction time, so it adds no inference cost, and the accuracy of the recognition model further improves to 75.8% (+1.82%). The training process is as follows:

<div align="center">
<img src="https://user-images.githubusercontent.com/12406017/200265540-1bbb730f-35d4-4d72-8e00-70856bb932ee.png"  width = "60%"  />
</div>
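
Because only the CTC branch remains at prediction time, decoding can be as simple as greedy CTC decoding: take the best class per frame, collapse repeats, and drop blanks. The sketch below assumes greedy decoding and a blank at index 0. Both are assumptions for illustration, and the character set is made up.

```python
def ctc_greedy_decode(frame_probs, charset, blank=0):
    """Collapse repeated per-frame argmax indices, then drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    previous = blank
    for index in best:
        # Emit a character only when it is not blank and differs from the previous frame.
        if index != blank and index != previous:
            decoded.append(charset[index])
        previous = index
    return "".join(decoded)


charset = ["<blank>", "a", "b"]  # hypothetical character set, blank at index 0
frame_probs = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.30, 0.60],
]
print(ctc_greedy_decode(frame_probs, charset))
```

The blank frame between the two runs of `a` is what allows a repeated character to be emitted twice.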

- TextConAug: data augmentation for mining text context information

TextConAug is a data augmentation strategy for mining textual context information. The main idea comes from the paper [ConCLR](https://www.cse.cuhk.edu.hk/~byu/papers/C139-AAAI2022-ConCLR.pdf), in which the authors propose ConAug, a data augmentation that joins 2 different images in a batch to form new images for self-supervised contrastive learning. PP-OCRv3 applies this method to a supervised learning task and designs the TextConAug data augmentation method, which enriches the context information of the training data and improves its diversity. Using this strategy, the accuracy of the recognition model is further improved to 76.3% (+0.5%). The schematic diagram of TextConAug is as follows:

<div align="center">
<img src="https://user-images.githubusercontent.com/12406017/200265540-1bbb730f-35d4-4d72-8e00-70856bb932ee.png"  width = "60%"  />
</div>
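
The core operation can be sketched as joining two text-line samples side by side and joining their labels. This is a minimal illustration on images stored as nested lists. The function name is hypothetical, and resizing, padding, and sampling details of the real augmentation are left out.

```python
def concat_samples(image_a, label_a, image_b, label_b):
    """Join two text-line images of equal height side by side and join their labels."""
    if len(image_a) != len(image_b):
        raise ValueError("images must share the same height")
    image = [row_a + row_b for row_a, row_b in zip(image_a, image_b)]
    return image, label_a + label_b


# Two tiny grayscale "images" (rows of pixel values) with their transcriptions.
image_a = [[0, 1], [1, 0]]
image_b = [[5, 5, 5], [6, 6, 6]]
image, label = concat_samples(image_a, "ab", image_b, "cde")
print(image, label)
```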

- TextRotNet: self-supervised pre-trained model

TextRotNet is a pre-trained model that is trained in a self-supervised manner on a large amount of unlabeled text line data, following the paper [STR-Fewer-Labels](https://github.com/ku21fan/STR-Fewer-Labels). Its weights are used to initialize SVTR_LCNet, which helps the text recognition model converge to a better solution. Using this strategy, the accuracy of the recognition model is further improved to 76.9% (+0.6%). The TextRotNet training process is shown in the following figure:

<div align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocr_v3/SSL.png"  width = "60%"  />
</div>
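
One common way to build such a self-supervised task is rotation prediction: rotate each unlabeled image by a known angle and train the network to predict which angle was applied. The sketch below assumes this RotNet-style setup with four angles, which this document does not spell out. The function names are illustrative.

```python
def rotate90(image):
    """Rotate a 2D list of pixels by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_samples(image):
    """Return (rotated_image, class_index) pairs for 0, 90, 180 and 270 degrees."""
    samples = []
    current = [list(row) for row in image]
    for class_index in range(4):
        samples.append((current, class_index))
        current = rotate90(current)
    return samples


for rotated, class_index in rotation_samples([[1, 2], [3, 4]]):
    print(class_index, rotated)
```

The class index serves as a free label, so no transcription is needed for pre-training.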

- U-DML: unified-deep mutual learning

U-DML (Unified-Deep Mutual Learning) is a strategy adopted in PP-OCRv2 that is very effective at improving text recognition accuracy. In PP-OCRv3, for the two different structures, SVTR_LCNet and Attention, the PP-LCNet feature map, the SVTR module output, and the Attention module output are supervised between the two networks simultaneously during training. Using this strategy, the accuracy of the recognition model is further improved to 78.4% (+1.5%).

- UIM: unlabeled images mining

UIM (Unlabeled Images Mining) is a very simple scheme for mining unlabeled data. The core idea is to run a high-accuracy text recognition model on unlabeled data to obtain pseudo-labels, then select the samples with high prediction confidence as training data for small models. Using this strategy, the accuracy of the recognition model is further improved to 79.4% (+1%).

<div align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/doc/ppocr_v3/UIM.png"  width = "60%"  />
</div>
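
The selection step can be sketched as a confidence filter over the predictions of the large model. The file names, the 0.95 threshold, and the function name below are hypothetical. The document does not state the threshold actually used.

```python
def select_pseudo_labels(predictions, min_confidence=0.95):
    """Keep (image_path, text) pairs whose prediction confidence clears the threshold."""
    return [(path, text) for path, text, confidence in predictions
            if confidence >= min_confidence]


# (image_path, predicted_text, confidence) triples from a high-accuracy model.
predictions = [
    ("unlabeled/0001.jpg", "hello", 0.99),
    ("unlabeled/0002.jpg", "w0rld", 0.61),
    ("unlabeled/0003.jpg", "paddle", 0.97),
]
print(select_pseudo_labels(predictions))
```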

## 5. Notes


The PP-OCR series models are trained on general-purpose data. If performance is unsatisfactory in a specific real-world scenario, a small amount of data can be labeled for fine-tuning.

## 6. Related papers and citations
```
@article{du2021pp,
  title={PP-OCRv2: bag of tricks for ultra lightweight OCR system},
  author={Du, Yuning and Li, Chenxia and Guo, Ruoyu and Cui, Cheng and Liu, Weiwei and Zhou, Jun and Lu, Bin and Yang, Yehua and Liu, Qiwen and Hu, Xiaoguang and others},
  journal={arXiv preprint arXiv:2109.03144},
  year={2021}
}

@inproceedings{zhang2018deep,
  title={Deep mutual learning},
  author={Zhang, Ying and Xiang, Tao and Hospedales, Timothy M and Lu, Huchuan},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={4320--4328},
  year={2018}
}

@inproceedings{hu2020gtc,
  title={Gtc: Guided training of ctc towards efficient and accurate scene text recognition},
  author={Hu, Wenyang and Cai, Xiaocong and Hou, Jun and Yi, Shuai and Lin, Zhiping},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={34},
  number={07},
  pages={11005--11012},
  year={2020}
}

@inproceedings{zhang2022context,
  title={Context-based Contrastive Learning for Scene Text Recognition},
  author={Zhang, Xinyun and Zhu, Binwu and Yao, Xufeng and Sun, Qi and Li, Ruiyu and Yu, Bei},
  year={2022},
  organization={AAAI}
}
```
