## 1. PP-StructureV2 Introduction

PP-StructureV2 builds on PP-StructureV1 and improves it in three main areas:

 * **System function upgrade**: Added image correction and layout restoration modules, image conversion to Word/PDF, and key information extraction capabilities!
 * **System performance optimization**:
	 * Layout analysis: released a lightweight layout analysis model that is **11 times** faster, with an average CPU inference time of only **41ms**!
	 * Table recognition: designed three optimization strategies that improve model accuracy by **6%** with no change in prediction time.
	 * Key information extraction: designed a visual-feature independent model structure; the accuracy of semantic entity recognition is improved by **2.8%**, and the accuracy of relation extraction is improved by **9.1%**.
 * **Chinese scene adaptation**: Completed the Chinese scene adaptation for layout analysis and table recognition, and open-sourced an **out-of-the-box** Chinese scene layout structure model!

The PP-StructureV2 framework is shown in the figure below. First, the Image Direction Correction module corrects the direction of the input document image.

In the Layout Information Extraction subsystem (the upper branch), the layout analysis module divides the corrected image into areas such as text, table and image, and each area is then recognized separately. For example, a table area is sent to the table recognition module for structural recognition, and a text area is sent to the OCR engine for text recognition. Finally, the layout recovery module restores the image to an editable Word file whose layout is consistent with the original image.

In the Key Information Extraction subsystem (the lower branch), the OCR engine extracts the text content, and the Semantic Entity Recognition and Relation Extraction modules then obtain the entities in the image and the relations between them, so that the required key information can be extracted.

<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185939247-57e53254-399c-46c4-a610-da4fa79232f5.png"  width = "80%"  />
</div>


We made 8 improvements to 3 key sub-modules in the system.

* Layout analysis
	* PP-PicoDet: A better real-time object detector on mobile devices
	* FGD: Focal and Global Knowledge Distillation

* Table Recognition
	* PP-LCNet: CPU-friendly Lightweight Backbone
	* CSP-PAN: Lightweight Multi-level Feature Fusion Module
	* SLAHead: Structure and Location Alignment Module

* Key Information Extraction
	* VI-LayoutXLM: Visual-feature Independent LayoutXLM
	* TB-YX: Threshold-Based YX sorting algorithm
	* U-DML: Unified-Deep Mutual Learning

Finally, compared to PP-StructureV1:

- The number of parameters of the layout analysis model is reduced by 95.6%, the inference speed is increased by 11 times, and the accuracy is increased by 0.4%.
- The table recognition model improves the model accuracy by 6% and the end-to-end TEDS by 2% without changing the prediction time.
- The speed of the key information extraction model is increased by 2.8 times, the accuracy of the semantic entity recognition model is increased by 2.8%, and the accuracy of the relation extraction model is increased by 9.1%.


For more details, please refer to the technical report: https://arxiv.org/abs/2210.05391v2 .

For more information about PaddleOCR, see https://github.com/PaddlePaddle/PaddleOCR .


## 2. Model Effects

The results of PP-StructureV2 are as follows:

- Layout analysis
  
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185940654-956ef614-888a-4779-bf63-a6c2b61b97fa.png"  width = "60%"  />
</div>

- Table recognition
  
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185941221-c94e3d45-524c-4073-9644-21ba6a9fd93e.png"  width = "60%"  />
</div>

- Layout recovery
  
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185941816-4dabb3e8-a0db-4094-98ea-52e0a0fda8e8.png"  width = "60%"  />
</div>





## 3. How to Use the Model

### 3.1 Inference
* Install PaddleOCR whl package

In [None]:
! pip install "paddleocr>=2.6.1.0"

* Quick experience
  
image orientation + layout analysis + table recognition

In [None]:
! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/1.png
! pip install paddleclas
! paddleocr --image_dir=1.png --type=structure --image_orientation=true

layout analysis + table recognition

In [None]:
! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/1.png
! paddleocr --image_dir=1.png --type=structure

layout analysis

In [None]:
! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/1.png
! paddleocr --image_dir=1.png --type=structure --table=false --ocr=false

table recognition

In [None]:
! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/table.jpg
! paddleocr --image_dir=table.jpg --type=structure --layout=false

### 3.2 Train the model
The PP-StructureV2 system consists of a layout analysis model, a text detection model, a text recognition model and a table recognition model. For the training tutorial of each model, please refer to the following documents:
1. Layout analysis model: [Layout analysis model training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/ppstructure/layout/README_ch.md)
2. Text detection model: [Text detection training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/detection.md)
3. Text recognition model: [Text recognition training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/recognition.md)
4. Table recognition model: [Table recognition training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/table_recognition.md)

After training, the models can be used together by specifying the path of each one. A reference command is as follows:
```shell
# --type=structure selects the PP-Structure pipeline, as in the commands of section 3.1
paddleocr --image_dir=11.jpg --type=structure \
    --layout_model_dir=/path/to/layout_inference_model \
    --det_model_dir=/path/to/det_inference_model \
    --rec_model_dir=/path/to/rec_inference_model \
    --table_model_dir=/path/to/table_inference_model
```

## 4. Model Principles

The enhancement strategies of each module are as follows:

1. Image Direction Correction Module
   
Because the training set is generally dominated by 0-degree images, information extraction from rotated images often degrades. In PP-StructureV2, the direction of the input image is first corrected by the PULC text image direction model provided by PaddleClas. Some demo images from the dataset are shown below. Unlike the text line direction classifier, the text image direction classifier classifies the direction of the entire image. The text image direction classification model achieves 99% accuracy on the validation set at 463 FPS on a CPU device.

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185939683-f6465473-3303-4a0c-95be-51f04fb9f387.png" width="600">
</div>
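The correction step amounts to classifying the angle of the whole image and then applying the inverse rotation. The snippet below is a minimal sketch of that idea under stated assumptions, not the PaddleClas or PaddleOCR implementation: the image is a plain nested list, the labels `"0"`, `"90"`, `"180"` and `"270"` are assumed to mean that the content has been rotated clockwise by that angle, and `correct_direction` is a hypothetical helper name.

```python
def rotate_ccw(img):
    """Rotate a 2-D grid 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def rotate_cw(img):
    """Rotate a 2-D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def rotate_180(img):
    """Rotate a 2-D grid by 180 degrees."""
    return [row[::-1] for row in img[::-1]]


def correct_direction(img, label):
    """Undo the rotation named by a whole-image direction label.

    ASSUMPTION: the label is the clockwise rotation of the content,
    so each label maps to the opposite rotation.
    """
    undo = {"90": rotate_ccw, "180": rotate_180, "270": rotate_cw}
    fix = undo.get(label)
    # Label "0" (or an unknown label) leaves the image unchanged.
    return fix(img) if fix else [list(row) for row in img]


upright = [[1, 2], [3, 4]]
rotated = rotate_cw(upright)                 # simulate a scan turned 90 degrees
restored = correct_direction(rotated, "90")  # the classifier output is assumed here
```

In the real system the label comes from the classification model; here it is passed in by hand so that the rotation logic can be checked in isolation.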

1. Layout Analysis

Layout Analysis refers to dividing document images into predefined areas such as text, title, table, and figure. In PP-Structure, we adopted the object detection algorithm PP-YOLOv2 as the layout detector.

**(1) PP-PicoDet: A better real-time object detector on mobile devices**

PaddleDetection proposed a new family of real-time object detectors, named PP-PicoDet, which achieves superior performance on mobile devices. PP-PicoDet uses the CSP structure to construct CSP-PAN as the neck, SimOTA as the label assignment strategy, and PP-LCNet as the backbone, and it applies an improved one-shot Neural Architecture Search (NAS) to find the optimal detection architecture automatically. We replace the PP-YOLOv2 detector used in PP-Structure with PP-PicoDet and change the input scale from 640×640 to 800×608, which is more suitable for document images. With the 1.0x configuration, the accuracy is comparable to PP-YOLOv2, and CPU inference is 11 times faster.

**(2) FGD: Focal and Global Knowledge Distillation**

FGD, a knowledge distillation algorithm for object detection, takes both local and global feature maps into account by combining focal distillation and global distillation. Focal distillation separates the foreground and background of the image, forcing the student to focus on the teacher’s critical pixels and channels. Global distillation rebuilds the relation between different pixels and transfers it from the teacher to the student, compensating for the global information missing in focal distillation. With the FGD distillation strategy, the student model (LCNet1.0x based PP-PicoDet) gains 0.5% mAP from the knowledge of the teacher model (LCNet2.5x based PP-PicoDet). In the end, the student model is only 0.2% lower than the teacher model in mAP, but 100% faster.

1. Table Recognition

In PP-StructureV2, we propose an efficient Table Recognition algorithm named SLANet (Structure Location Alignment Network). Compared with TableRec-RARE, SLANet has been upgraded in terms of model structure and loss. The enhancement strategies are as follows:

**(1) PP-LCNet: CPU-friendly Lightweight Backbone**

PP-LCNet is a lightweight CPU network based on the MKLDNN acceleration strategy, which achieves better performance on multiple tasks than lightweight models such as ShuffleNetV2, MobileNetV3, and GhostNet. Additionally, weights pre-trained with SSLD on ImageNet are used when training the table recognition model, for higher accuracy.

**(2) CSP-PAN: Lightweight Multi-level Feature Fusion Module**

Fusion of the features extracted by the backbone network can effectively alleviate problems brought by scale changes in complex scenes. In the early days, the FPN module was proposed and used for feature fusion, but its feature fusion process was one-way (from high-level to low-level), which was not sufficient. CSP-PAN is improved based on PAN. While ensuring more sufficient feature fusion, strategies such as CSP block and depthwise separable convolution are used to reduce the computational cost. In SLANet, we reduce the output channels of CSP-PAN from 128 to 96 in order to reduce the model size.


**(3) SLAHead: Structure and Location Alignment Module**

In the TableRec-RARE head, the output of each step is concatenated and fed into the SDM (Structure Decode Module) and CLDM (Cell Location Decode Module) to generate all cell tokens and coordinates, which ignores the one-to-one correspondence between a cell token and its coordinates. We therefore propose SLAHead to align cell tokens and coordinates. In SLAHead, the output of each step is fed into the SDM and CLDM to obtain the token and coordinates of the current step; the tokens and coordinates of all steps are then concatenated to form the HTML table representation and the coordinates of all cells.

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185940968-e3a2fbac-78d7-4b74-af54-a1dab860f470.png" width="1200">
</div>
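The change in SLAHead is largely one of control flow, which the following sketch tries to make concrete. It is an illustration under assumptions rather than the SLANet code: `sdm` and `cldm` are hypothetical stand-ins for the Structure Decode Module and the Cell Location Decode Module, the per-step outputs are faked with small dictionaries, and non-cell tokens are given a placeholder box.

```python
def sla_head_decode(step_outputs, sdm, cldm):
    """Decode one token and one box per step, then collect all steps."""
    tokens, boxes = [], []
    for hidden in step_outputs:
        # Both decoders see the output of the same step, so tokens[i] and
        # boxes[i] always describe the same position in the sequence.
        tokens.append(sdm(hidden))
        boxes.append(cldm(hidden))
    return tokens, boxes


# Toy stand-ins (hypothetical): a real head would run learned layers here.
def toy_sdm(hidden):
    return hidden["token"]


def toy_cldm(hidden):
    return hidden["box"]


steps = [
    {"token": "<tr>", "box": (0, 0, 0, 0)},
    {"token": "<td></td>", "box": (10, 10, 60, 30)},
    {"token": "<td></td>", "box": (70, 10, 120, 30)},
    {"token": "</tr>", "box": (0, 0, 0, 0)},
]

tokens, boxes = sla_head_decode(steps, toy_sdm, toy_cldm)
html = "".join(tokens)
# Cell coordinates are read off at exactly the steps that emitted a cell token.
cells = [box for token, box in zip(tokens, boxes) if token == "<td></td>"]
```

Because every step yields a (token, box) pair, recovering the box of a given cell is a simple index lookup instead of a separate matching problem.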


**(4) Merge Token**

In TableRec-RARE, we use two separate tokens `<td>` and `</td>` to represent a non-cross-row-column cell, which limits the network’s ability to handle tables with a large number of cells. Inspired by TableMaster, we regard `<td>` and `</td>` as one token `<td></td>` in SLANet.
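To see why merging helps, the sketch below collapses each adjacent `<td>`, `</td>` pair into a single `<td></td>` token and compares sequence lengths. It is only an illustration: `merge_td_tokens` is a hypothetical helper, and the way the spanning cell is tokenized here (`<td`, attribute, `>`, `</td>`) is an assumption made for the example, not taken from the SLANet code.

```python
def merge_td_tokens(tokens):
    """Replace each adjacent '<td>', '</td>' pair with one '<td></td>' token."""
    merged = []
    i = 0
    while i < len(tokens):
        is_plain_cell = (
            tokens[i] == "<td>"
            and i + 1 < len(tokens)
            and tokens[i + 1] == "</td>"
        )
        if is_plain_cell:
            merged.append("<td></td>")
            i += 2  # consume both halves of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# One table row: two plain cells and one cell spanning two columns.
row = [
    "<tr>",
    "<td>", "</td>",
    "<td>", "</td>",
    "<td", ' colspan="2"', ">", "</td>",
    "</tr>",
]
merged_row = merge_td_tokens(row)
print(len(row), "->", len(merged_row))
```

The HTML string is unchanged by the merge; only the number of decoding steps drops, by one per plain cell.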


1. Layout Recovery

Layout Recovery is a newly added module that restores the image to an editable Word file according to the analysis results. The following figure shows a layout recovery result:

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185941816-4dabb3e8-a0db-4094-98ea-52e0a0fda8e8.png" width="1200">
</div>

1. Key Information Extraction

Key Information Extraction (KIE) is usually used to extract specific information, such as the name, address and other fields, from ID cards or forms. Semantic Entity Recognition (SER) and Relationship Extraction (RE) are two subtasks of KIE, both of which are already supported in PP-Structure. In PP-StructureV2, we design a visual-feature independent LayoutXLM structure to reduce inference time, and we use the TB-YX sorting algorithm and U-DML knowledge distillation for higher accuracy. The following figure shows the KIE framework.


<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185941978-abec7d4a-5e3a-4141-83f8-088d04ef898e.png" width="1000">
</div>


The enhancement strategies are as follows:

**(1) VI-LayoutXLM (Visual-feature Independent LayoutXLM)**

LayoutLMv2 and LayoutXLM introduce a visual backbone network to extract visual features, which are combined with the subsequent text embedding as the multi-modal input embedding. Because the visual backbone is based on ResNet x101 64x4d, which makes visual feature extraction time-consuming, we remove this submodule from LayoutXLM. Surprisingly, we found that the Hmean of the SER and RE tasks based on LayoutXLM does not decrease, and the Hmean of the SER task based on LayoutLMv2 is reduced by only 2.1%, while the model size is reduced by about 340MB. In addition, on the XFUND dataset, the accuracy of VI-LayoutXLM on the RE task is improved by `1.06%`.

**(2) TB-YX: Threshold-Based YX sorting algorithm**

Text reading order is important for KIE tasks. Traditional multi-modal KIE methods do not consider that different OCR engines may produce an incorrect reading order, which directly affects the position embedding and the final inference result. Generally, we sort the OCR results from top to bottom and then left to right according to the absolute coordinates of the detected text boxes (YX). The resulting order is usually unstable and inconsistent with the reading order. We introduce a position offset threshold `th` to address this problem (TB-YX). The text boxes are still sorted from top to bottom first, but when the distance between two text boxes in the Y direction is less than the threshold `th`, their order is determined by their order in the X direction.

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185942080-9d4bafc9-fa7f-4da4-b139-b2bd703dc76d.png" width="800">
</div>


Using this strategy on the XFUND dataset, the F1 score of the SER task increased by `2.06%`, and the F1 score of the RE task increased by `7.04%`.
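The TB-YX rule described above can be sketched in a few lines. This is a minimal illustration, not the PaddleOCR implementation: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, the function names are hypothetical, and the threshold value `20` is an arbitrary choice for the example.

```python
def sort_yx(boxes):
    """Plain YX sort: top to bottom, then left to right."""
    return sorted(boxes, key=lambda b: (b[1], b[0]))


def sort_tb_yx(boxes, th):
    """Threshold-based YX sort.

    Start from the plain YX order, then move a box in front of its
    predecessor whenever the two are closer than `th` in Y (treated as
    the same text line) and the box lies further to the left.
    """
    ordered = sort_yx(boxes)
    for i in range(1, len(ordered)):
        j = i
        while (
            j > 0
            and abs(ordered[j][1] - ordered[j - 1][1]) < th
            and ordered[j][0] < ordered[j - 1][0]
        ):
            ordered[j - 1], ordered[j] = ordered[j], ordered[j - 1]
            j -= 1
    return ordered


boxes = [
    (200, 10, 300, 30),  # right-hand box of line 1, detected slightly higher
    (10, 14, 100, 34),   # left-hand box of line 1
    (10, 60, 100, 80),   # line 2
]

plain = sort_yx(boxes)           # the right-hand box comes first (y=10 < y=14)
fixed = sort_tb_yx(boxes, th=20) # the left-hand box comes first, as a reader expects
```

In this example a 4-pixel jitter in the detected Y coordinate is enough to flip the plain YX order, while the thresholded version keeps the two boxes on the same line in left-to-right order.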

**(3) U-DML: Unified-Deep Mutual Learning**

U-DML is a distillation method proposed in PP-OCRv2 that can effectively improve accuracy without increasing model size. In PP-StructureV2, we apply U-DML to the training process of the SER and RE tasks, and Hmean is increased by 0.6% and 5.1%, respectively.


The visualization results of VI-LayoutXLM based on the SER task are shown below.

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185942213-0909135b-3bcd-4d79-9e69-847dfb1c3b82.png" width="800">
</div>

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185942237-72923b42-8590-42eb-b687-fa819b1c3afd.png" width="800">
</div>


The visualization results of VI-LayoutXLM based on the RE task are shown below.


<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185942400-8920dc3c-de7f-46d0-b0bc-baca9536e0e1.png" width="800">
</div>

<div align="center">
    <img src="https://user-images.githubusercontent.com/14270174/185942416-ca4fd8b0-9227-4c65-b969-0afbda525b85.png" width="800">
</div>


## 5. Attention

1. The PP-StructureV2 series of models were trained on public datasets. If the performance is not satisfactory in your actual scenario, you can annotate a small amount of data and fine-tune the models.
2. The online experience currently supports only table recognition. For layout analysis and layout recovery, please refer to `3.1 Inference`.

## 6. Related papers and citations
```
@article{li2022pp,
  title={PP-StructureV2: A Stronger Document Analysis System},
  author={Li, Chenxia and Guo, Ruoyu and Zhou, Jun and An, Mengtao and Du, Yuning and Zhu, Lingfeng and Liu, Yi and Hu, Xiaoguang and Yu, Dianhai},
  journal={arXiv preprint arXiv:2210.05391},
  year={2022}
}
```
