Unverified commit 52970f04, authored by 文幕地方, committed by GitHub

add PP-StructureV2 (#5558)

* add PP-OCRv2

* add PP-OCRv2 benchmark

* update app

* update introduction

* update PP-OCRv2 introduction

* add pipeline of PP-OCRv2

* add PP-OCRv3

* update benchmark

* add paddleocr to requirements

* update

* support paddleocr 2.6.0.1

* rm print code

* add image download

* add model train

* update PP-OCRv3 ref

* update doc

* update doc

* update PP-OCR info

* add PP-StructureV2

* update doc

* merge upstream

* update info.yaml
Parent 499de374
---
Model_Info:
name: "PP-OCRv2"
description: ""
description_en: ""
description: "PP-OCRv2文字检测识别系统"
description_en: "PP-OCRv2 text detection and recognition system"
icon: "@后续UE统一设计之后,会存到bos上某个位置"
from_repo: "PaddleOCR"
Task:
- tag_en: "CV"
tag: "计算机视觉"
sub_tag_en: "Character Recognition"
sub_tag: "文字识别"
sub_tag_en: "Text Detection, Character Recognition, Optical Character Recognition"
sub_tag: "文字检测,文字识别,OCR"
Example:
- title: "《动手学OCR》系列课程之:PP-OCRv2预测部署实战"
url: "https://aistudio.baidu.com/aistudio/projectdetail/3552922?channelType=0&channel=0"
title_en: "Dive into OCR series of courses: PP-OCRv2 prediction and deployment"
url_en: "https://aistudio.baidu.com/aistudio/projectdetail/3552922?channelType=0&channel=0"
- title: "《动手学OCR》系列课程之:OCR文本识别实战"
url: "https://aistudio.baidu.com/aistudio/projectdetail/3552051?channelType=0&channel=0"
title_en: "Dive into OCR series of courses: text recognition in practice"
url_en: "https://aistudio.baidu.com/aistudio/projectdetail/3552051?channelType=0&channel=0"
- title: "《动手学OCR》系列课程之:OCR文本检测实践"
url: "https://aistudio.baidu.com/aistudio/projectdetail/3551779?channelType=0&channel=0"
title_en: "Dive into OCR series of courses: text detection in practice"
url_en: "https://aistudio.baidu.com/aistudio/projectdetail/3551779?channelType=0&channel=0"
Datasets: "ICDAR 2015, ICDAR2019-LSVT,ICDAR2017-RCTW-17,Total-Text,ICDAR2019-ArT"
Pulisher: "Baidu"
License: "apache.2.0"
Paper:
- title: "PP-OCRv2: Bag of Tricks for Ultra Lightweight OCR System"
url: "https://arxiv.org/pdf/2109.03144v2.pdf"
url: "https://arxiv.org/abs/2109.03144"
IfTraining: 0
IfOnlineDemo: 1
......@@ -196,7 +196,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3.8.13 ('py38')",
"language": "python",
"name": "python3"
},
......@@ -210,7 +210,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "58fd1890da6594cebec461cf98c6cb9764024814357f166387d10d267624ecd6"
}
}
},
"nbformat": 4,
......
......@@ -112,7 +112,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 Train the model.\n",
"### 3.2 Train the model\n",
"The PP-OCR system consists of a text detection model, an angle classifier and a text recognition model. For the three model training tutorials, please refer to the following documents:\n",
"1. text detection model: [text detection training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/detection.md)\n",
"1. angle classifier: [angle classifier training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/angle_class.md)\n",
......@@ -130,6 +130,7 @@
"source": [
"## 4. Model Principles\n",
"\n",
"The enhancement strategies are as follows\n",
"\n",
"1. Text detection enhancement strategies\n",
"- Adopt CML (Collaborative Mutual Learning) collaborative mutual learning knowledge distillation strategy.\n",
......@@ -193,7 +194,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3.8.13 ('py38')",
"language": "python",
"name": "python3"
},
......@@ -207,7 +208,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "58fd1890da6594cebec461cf98c6cb9764024814357f166387d10d267624ecd6"
}
}
},
"nbformat": 4,
......
---
Model_Info:
name: "PP-OCRv3"
description: ""
description_en: ""
description: "PP-OCRv3文字检测识别系统"
description_en: "PP-OCRv3 text detection and recognition system"
icon: "@后续UE统一设计之后,会存到bos上某个位置"
from_repo: "PaddleOCR"
Task:
- tag_en: "CV"
tag: "计算机视觉"
sub_tag_en: "Character Recognition"
sub_tag: "文字识别"
sub_tag_en: "Text Detection, Character Recognition, Optical Character Recognition"
sub_tag: "文字检测,文字识别,OCR"
Example:
- title: "【官方】十分钟完成 PP-OCRv3 识别全流程实战"
- title: "【官方】十分钟完成 PP-OCRv3 识别全流程实战"
url: "https://aistudio.baidu.com/aistudio/projectdetail/3916206?channelType=0&channel=0"
title_en: "[Official] Complete the whole process of PP-OCRv3 identification in ten minutes"
url_en: "https://aistudio.baidu.com/aistudio/projectdetail/3916206?channelType=0&channel=0"
- title: "鸟枪换炮!基于PP-OCRv3的电表检测识别"
url: "https://aistudio.baidu.com/aistudio/projectdetail/511591?channelType=0&channel=0"
title_en: "Swap the shotgun! Detection and recognition electricity meters based on PP-OCRv3"
url_en: "https://aistudio.baidu.com/aistudio/projectdetail/511591?channelType=0&channel=0"
- title: "基于PP-OCRv3实现PCB字符识别"
url: "https://aistudio.baidu.com/aistudio/projectdetail/4008973?channelType=0&channel=0"
title_en: "PCB character recognition based on PP-OCRv3"
url_en: "https://aistudio.baidu.com/aistudio/projectdetail/4008973?channelType=0&channel=0"
Datasets: "ICDAR 2015, ICDAR2019-LSVT,ICDAR2017-RCTW-17,Total-Text,ICDAR2019-ArT"
Pulisher: "Baidu"
License: "apache.2.0"
......
......@@ -67,7 +67,7 @@
"source": [
"## 3. 模型如何使用\n",
"\n",
"### 3.1 模型推理\n",
"### 3.1 模型推理\n",
"* 安装PaddleOCR whl包"
]
},
......@@ -96,7 +96,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 1,
"metadata": {
"scrolled": true,
"tags": []
......@@ -136,7 +136,7 @@
"模型训练完成后,可以通过指定模型路径的方式串联使用\n",
"命令参考如下:\n",
"```python\n",
"paddleocr --image_dir 11.jpg --use_angle_cls true --ocr_version PP-OCRv2 --det_model_dir=/path/to/det_inference_model --cls_model_dir=/path/to/cls_inference_model --rec_model_dir=/path/to/rec_inference_model\n",
"paddleocr --image_dir 11.jpg --use_angle_cls true --det_model_dir=/path/to/det_inference_model --cls_model_dir=/path/to/cls_inference_model --rec_model_dir=/path/to/rec_inference_model\n",
"```"
]
},
......@@ -228,36 +228,11 @@
"source": [
"## 6. 相关论文以及引用信息\n",
"```\n",
"@article{du2021pp,\n",
" title={PP-OCRv2: bag of tricks for ultra lightweight OCR system},\n",
" author={Du, Yuning and Li, Chenxia and Guo, Ruoyu and Cui, Cheng and Liu, Weiwei and Zhou, Jun and Lu, Bin and Yang, Yehua and Liu, Qiwen and Hu, Xiaoguang and others},\n",
" journal={arXiv preprint arXiv:2109.03144},\n",
" year={2021}\n",
"}\n",
"\n",
"@inproceedings{zhang2018deep,\n",
" title={Deep mutual learning},\n",
" author={Zhang, Ying and Xiang, Tao and Hospedales, Timothy M and Lu, Huchuan},\n",
" booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},\n",
" pages={4320--4328},\n",
" year={2018}\n",
"}\n",
"\n",
"@inproceedings{hu2020gtc,\n",
" title={Gtc: Guided training of ctc towards efficient and accurate scene text recognition},\n",
" author={Hu, Wenyang and Cai, Xiaocong and Hou, Jun and Yi, Shuai and Lin, Zhiping},\n",
" booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},\n",
" volume={34},\n",
" number={07},\n",
" pages={11005--11012},\n",
" year={2020}\n",
"}\n",
"\n",
"@inproceedings{zhang2022context,\n",
" title={Context-based Contrastive Learning for Scene Text Recognition},\n",
" author={Zhang, Xinyun and Zhu, Binwu and Yao, Xufeng and Sun, Qi and Li, Ruiyu and Yu, Bei},\n",
" year={2022},\n",
" organization={AAAI}\n",
"@article{li2022pp,\n",
" title={PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System},\n",
" author={Li, Chenxia and Liu, Weiwei and Guo, Ruoyu and Yin, Xiaoting and Jiang, Kaitao and Du, Yongkun and Du, Yuning and Zhu, Lingfeng and Lai, Baohua and Hu, Xiaoguang and others},\n",
" journal={arXiv preprint arXiv:2206.03001},\n",
" year={2022}\n",
"}\n",
"```\n"
]
......@@ -265,7 +240,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3.8.13 ('py38')",
"language": "python",
"name": "python3"
},
......@@ -279,7 +254,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "58fd1890da6594cebec461cf98c6cb9764024814357f166387d10d267624ecd6"
}
}
},
"nbformat": 4,
......
......@@ -129,7 +129,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 Train the model.\n",
"### 3.2 Train the model\n",
"The PP-OCR system consists of a text detection model, an angle classifier and a text recognition model. For the three model training tutorials, please refer to the following documents:\n",
"1. text detection model: [text detection training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/detection.md)\n",
"1. angle classifier: [angle classifier training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/angle_class.md)\n",
......@@ -137,7 +137,7 @@
"\n",
"After the model training is completed, it can be used in series by specifying the model path. The command reference is as follows:\n",
"```python\n",
"paddleocr --image_dir 11.jpg --use_angle_cls true --ocr_version PP-OCRv2 --det_model_dir=/path/to/det_inference_model --cls_model_dir=/path/to/cls_inference_model --rec_model_dir=/path/to/rec_inference_model\n",
"paddleocr --image_dir 11.jpg --use_angle_cls true --det_model_dir=/path/to/det_inference_model --cls_model_dir=/path/to/cls_inference_model --rec_model_dir=/path/to/rec_inference_model\n",
"```"
]
},
......@@ -147,7 +147,7 @@
"source": [
"## 4. Model Principles\n",
"\n",
"The optimization ideas are as follows\n",
"The enhancement strategies are as follows\n",
"\n",
"1. Text detection enhancement strategies\n",
"- LK-PAN: a PAN module with large receptive field\n",
......@@ -231,36 +231,11 @@
"source": [
"## 6. Related papers and citations\n",
"```\n",
"@article{du2021pp,\n",
" title={PP-OCRv2: bag of tricks for ultra lightweight OCR system},\n",
" author={Du, Yuning and Li, Chenxia and Guo, Ruoyu and Cui, Cheng and Liu, Weiwei and Zhou, Jun and Lu, Bin and Yang, Yehua and Liu, Qiwen and Hu, Xiaoguang and others},\n",
" journal={arXiv preprint arXiv:2109.03144},\n",
" year={2021}\n",
"}\n",
"\n",
"@inproceedings{zhang2018deep,\n",
" title={Deep mutual learning},\n",
" author={Zhang, Ying and Xiang, Tao and Hospedales, Timothy M and Lu, Huchuan},\n",
" booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},\n",
" pages={4320--4328},\n",
" year={2018}\n",
"}\n",
"\n",
"@inproceedings{hu2020gtc,\n",
" title={Gtc: Guided training of ctc towards efficient and accurate scene text recognition},\n",
" author={Hu, Wenyang and Cai, Xiaocong and Hou, Jun and Yi, Shuai and Lin, Zhiping},\n",
" booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},\n",
" volume={34},\n",
" number={07},\n",
" pages={11005--11012},\n",
" year={2020}\n",
"}\n",
"\n",
"@inproceedings{zhang2022context,\n",
" title={Context-based Contrastive Learning for Scene Text Recognition},\n",
" author={Zhang, Xinyun and Zhu, Binwu and Yao, Xufeng and Sun, Qi and Li, Ruiyu and Yu, Bei},\n",
" year={2022},\n",
" organization={AAAI}\n",
"@article{li2022pp,\n",
" title={PP-OCRv3: More Attempts for the Improvement of Ultra Lightweight OCR System},\n",
" author={Li, Chenxia and Liu, Weiwei and Guo, Ruoyu and Yin, Xiaoting and Jiang, Kaitao and Du, Yongkun and Du, Yuning and Zhu, Lingfeng and Lai, Baohua and Hu, Xiaoguang and others},\n",
" journal={arXiv preprint arXiv:2206.03001},\n",
" year={2022}\n",
"}\n",
"```\n"
]
......@@ -268,7 +243,7 @@
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3.8.13 ('py38')",
"language": "python",
"name": "python3"
},
......@@ -282,7 +257,12 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.8"
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "58fd1890da6594cebec461cf98c6cb9764024814357f166387d10d267624ecd6"
}
}
},
"nbformat": 4,
......
import gradio as gr
import base64
from io import BytesIO
from PIL import Image
from paddleocr import PPStructure

table_engine = PPStructure(layout=False, show_log=True)


def image_to_base64(image):
    # Input: an image opened with PIL; output: a base64-encoded string
    byte_data = BytesIO()  # create an in-memory byte buffer
    image.save(byte_data, format="JPEG")  # write the image data into the buffer
    byte_data = byte_data.getvalue()  # get the raw bytes from the buffer
    base64_str = base64.b64encode(byte_data).decode("ascii")  # bytes -> base64 string
    return base64_str


# UGC: Define the inference fn() for your models
def model_inference(image):
    result = table_engine(image)
    res = result[0]['res']['html']
    json_out = {"result": res}
    return res, json_out


def clear_all():
    return None, None, None


with gr.Blocks() as demo:
    gr.Markdown("PP-StructureV2")

    with gr.Column(scale=1, min_width=100):
        img_in = gr.Image(
            value="https://user-images.githubusercontent.com/12406017/200574299-32537341-c329-42a5-ae41-35ee4bd43f2f.png",
            label="Input")

        with gr.Row():
            btn1 = gr.Button("Clear")
            btn2 = gr.Button("Submit")

        html_out = gr.HTML(label="Output")
        json_out = gr.JSON(label="jsonOutput")

    btn2.click(fn=model_inference, inputs=img_in, outputs=[html_out, json_out])
    btn1.click(fn=clear_all, inputs=None, outputs=[img_in, html_out, json_out])

demo.launch()
【PP-StructureV2-App-YAML】
APP_Info:
title: PP-StructureV2-App
colorFrom: blue
colorTo: yellow
sdk: gradio
sdk_version: 3.4.1
app_file: app.py
license: apache-2.0
device: cpu
gradio
paddlepaddle
paddleocr>=2.6.1.0
# 模型列表
## 1. 版面分析模型
|模型名称|模型简介|推理模型大小|下载地址|dict path|
| --- | --- | --- | --- | --- |
| picodet_lcnet_x1_0_fgd_layout | 基于PicoDet LCNet_x1_0和FGD蒸馏在PubLayNet 数据集训练的英文版面分析模型,可以划分**文字、标题、表格、图片以及列表**5类区域 | 9.7M | [推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout.pdparams) | [PubLayNet dict](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt) |
| ppyolov2_r50vd_dcn_365e_publaynet | 基于PP-YOLOv2在PubLayNet数据集上训练的英文版面分析模型 | 221.0M | [推理模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet.tar) / [训练模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet_pretrained.pdparams) | 同上 |
| picodet_lcnet_x1_0_fgd_layout_cdla | CDLA数据集训练的中文版面分析模型,可以划分**文字、标题、图片、图片标题、表格、表格标题、页眉、页脚、引用、公式**10类区域 | 9.7M | [推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla.pdparams) | [CDLA dict](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppocr/utils/dict/layout_dict/layout_cdla_dict.txt) |
| picodet_lcnet_x1_0_fgd_layout_table | 表格数据集训练的版面分析模型,支持中英文文档表格区域的检测 | 9.7M | [推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_table_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_table.pdparams) | [Table dict](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppocr/utils/dict/layout_dict/layout_table_dict.txt) |
| ppyolov2_r50vd_dcn_365e_tableBank_word | 基于PP-YOLOv2在TableBank Word 数据集训练的版面分析模型,支持英文文档表格区域的检测 | 221.0M | [推理模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_word.tar) | 同上 |
| ppyolov2_r50vd_dcn_365e_tableBank_latex | 基于PP-YOLOv2在TableBank Latex数据集训练的版面分析模型,支持英文文档表格区域的检测 | 221.0M | [推理模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_latex.tar) | 同上 |
## 2. OCR和表格识别模型
### 2.1 OCR
|模型名称|模型简介|推理模型大小|下载地址|
| --- | --- | --- | --- |
|en_ppocr_mobile_v2.0_table_det|PubTabNet数据集训练的英文表格场景的文字检测|4.7M|[推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_det_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_det_train.tar) |
|en_ppocr_mobile_v2.0_table_rec|PubTabNet数据集训练的英文表格场景的文字识别|6.9M|[推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_rec_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_rec_train.tar) |
如需要使用其他OCR模型,可以在 [PP-OCR model_list](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/doc/doc_ch/models_list.md) 下载模型或者使用自己训练好的模型配置到 `det_model_dir`, `rec_model_dir`两个字段即可。
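例如,可以通过Python API传入自定义模型路径,以下为示意代码(模型路径为占位符,需已安装paddleocr>=2.6.1):

```python
import cv2
from paddleocr import PPStructure

# 示意:将占位路径替换为实际的检测/识别推理模型目录
table_engine = PPStructure(
    det_model_dir='/path/to/det_inference_model',  # 自定义文本检测模型
    rec_model_dir='/path/to/rec_inference_model'   # 自定义文本识别模型
)
img = cv2.imread('table.jpg')
result = table_engine(img)
```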
### 2.2 表格识别模型
|模型名称|模型简介|推理模型大小|下载地址|
| --- | --- | --- | --- |
|en_ppocr_mobile_v2.0_table_structure|基于TableRec-RARE在PubTabNet数据集上训练的英文表格识别模型|6.8M|[推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_structure_train.tar) |
|en_ppstructure_mobile_v2.0_SLANet|基于SLANet在PubTabNet数据集上训练的英文表格识别模型|9.2M|[推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_train.tar) |
|ch_ppstructure_mobile_v2.0_SLANet|基于SLANet的中文表格识别模型|9.3M|[推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_train.tar) |
<a name="3"></a>
# Model list
## 1. Layout Analysis
|model name|description | inference model size |download|dict path|
| --- |---| --- | --- | --- |
| picodet_lcnet_x1_0_fgd_layout | English layout analysis model trained on the PubLayNet dataset based on PicoDet LCNet_x1_0 and FGD distillation. The model can recognize 5 types of areas: **Text, Title, Table, Picture and List** | 9.7M | [inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout.pdparams) | [PubLayNet dict](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt) |
| ppyolov2_r50vd_dcn_365e_publaynet | English layout analysis model trained on the PubLayNet dataset based on PP-YOLOv2 | 221.0M | [inference model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet.tar) / [trained model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet_pretrained.pdparams) | same as above |
| picodet_lcnet_x1_0_fgd_layout_cdla | Chinese layout analysis model trained on the CDLA dataset. The model can recognize 10 types of areas: **Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation** | 9.7M | [inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla.pdparams) | [CDLA dict](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppocr/utils/dict/layout_dict/layout_cdla_dict.txt) |
| picodet_lcnet_x1_0_fgd_layout_table | Layout analysis model trained on a table dataset. The model can detect tables in Chinese and English documents | 9.7M | [inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_table_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_table.pdparams) | [Table dict](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppocr/utils/dict/layout_dict/layout_table_dict.txt) |
| ppyolov2_r50vd_dcn_365e_tableBank_word | Layout analysis model trained on the TableBank Word dataset based on PP-YOLOv2. The model can detect tables in English documents | 221.0M | [inference model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_word.tar) | same as above |
| ppyolov2_r50vd_dcn_365e_tableBank_latex | Layout analysis model trained on the TableBank Latex dataset based on PP-YOLOv2. The model can detect tables in English documents | 221.0M | [inference model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_latex.tar) | same as above |
## 2. OCR and Table Recognition
### 2.1 OCR
|model name| description | inference model size |download|
| --- |---|---| --- |
|en_ppocr_mobile_v2.0_table_det| Text detection model of English table scenes trained on PubTabNet dataset | 4.7M |[inference model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_det_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_det_train.tar) |
|en_ppocr_mobile_v2.0_table_rec| Text recognition model of English table scenes trained on PubTabNet dataset | 6.9M |[inference model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_rec_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_rec_train.tar) |
If you need to use other OCR models, you can download them from the [PP-OCR model_list](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/doc/doc_en/models_list_en.md) or use a model you trained yourself, and set the `det_model_dir` and `rec_model_dir` fields accordingly.
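For example, custom models can be passed in through the Python API. The following is a sketch with placeholder paths (assuming paddleocr>=2.6.1 is installed):

```python
import cv2
from paddleocr import PPStructure

# Sketch: the model paths are placeholders, replace them with your own inference model dirs
table_engine = PPStructure(
    det_model_dir='/path/to/det_inference_model',  # custom text detection model
    rec_model_dir='/path/to/rec_inference_model'   # custom text recognition model
)
img = cv2.imread('table.jpg')
result = table_engine(img)
```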
<a name="22"></a>
### 2.2 Table Recognition
|model| description |inference model size|download|
| --- |-----------------------------------------------------------------------------| --- | --- |
|en_ppocr_mobile_v2.0_table_structure| English table recognition model trained on PubTabNet dataset based on TableRec-RARE |6.8M|[inference model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_structure_train.tar) |
|en_ppstructure_mobile_v2.0_SLANet|English table recognition model trained on PubTabNet dataset based on SLANet|9.2M|[inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_train.tar) |
|ch_ppstructure_mobile_v2.0_SLANet|Chinese table recognition model based on SLANet|9.3M|[inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_train.tar) |
---
Model_Info:
name: "PP-StructureV2"
description: "PP-StructureV2文档分析系统,包含版面分析,表格识别,版面恢复和关键信息抽取"
description_en: "PP-StructureV2 document analysis system, including layout analysis, table recognition, layout recovery and key information extraction"
icon: "@后续UE统一设计之后,会存到bos上某个位置"
from_repo: "PaddleOCR"
Task:
- tag_en: "CV"
tag: "计算机视觉"
sub_tag_en: "Layout Analysis, Table Recognition, Layout Recovery, Key Information Extraction"
sub_tag: "版面分析,表格识别,版面恢复,关键信息提取"
Example:
- title: "表格识别实战"
url: "https://aistudio.baidu.com/aistudio/projectdetail/4770296?channelType=0&channel=0"
title_en: "table recognition"
url_en: "https://aistudio.baidu.com/aistudio/projectdetail/4770296?channelType=0&channel=0"
- title: "OCR发票关键信息抽取"
url: "https://aistudio.baidu.com/aistudio/projectdetail/4823162?channelType=0&channel=0"
title_en: "Invoice key information extraction"
url_en: "https://aistudio.baidu.com/aistudio/projectdetail/4823162?channelType=0&channel=0"
Datasets: "ICDAR 2015, ICDAR2019-LSVT,ICDAR2017-RCTW-17,Total-Text,ICDAR2019-ArT"
Pulisher: "Baidu"
License: "apache.2.0"
Paper:
- title: "PP-StructureV2: A Stronger Document Analysis System"
url: "https://arxiv.org/abs/2210.05391v2"
IfTraining: 0
IfOnlineDemo: 1
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. PP-StructureV2模型简介\n",
"\n",
"PP-StructureV2在PP-StructureV1的基础上进一步改进,主要有以下3个方面升级:\n",
"\n",
" * **系统功能升级** :新增图像矫正和版面恢复模块,图像转word/pdf、关键信息抽取能力全覆盖!\n",
" * **系统性能优化** :\n",
"\t * 版面分析:发布轻量级版面分析模型,速度提升**11倍**,平均CPU耗时仅需**41ms**!\n",
"\t * 表格识别:设计3大优化策略,预测耗时不变情况下,模型精度提升**6%**。\n",
"\t * 关键信息抽取:设计视觉无关模型结构,语义实体识别精度提升**2.8%**,关系抽取精度提升**9.1%**。\n",
" * **中文场景适配** :完成对版面分析与表格识别的中文场景适配,开源**开箱即用**的中文场景版面结构化模型!\n",
"\n",
"PP-StructureV2系统流程图如下所示,文档图像首先经过图像矫正模块,判断整图方向并完成转正,随后可以完成版面信息分析与关键信息抽取2类任务。版面分析任务中,图像首先经过版面分析模型,将图像划分为文本、表格、图像等不同区域,随后对这些区域分别进行识别,如,将表格区域送入表格识别模块进行结构化识别,将文本区域送入OCR引擎进行文字识别,最后使用版面恢复模块将其恢复为与原始图像布局一致的word或者pdf格式的文件;关键信息抽取任务中,首先使用OCR引擎提取文本内容,然后由语义实体识别模块获取图像中的语义实体,最后经关系抽取模块获取语义实体之间的对应关系,从而提取需要的关键信息。\n",
"\n",
"<div align=\"center\">\n",
"<img src=\"https://user-images.githubusercontent.com/14270174/185939247-57e53254-399c-46c4-a610-da4fa79232f5.png\" width = \"80%\" />\n",
"</div>\n",
"\n",
"\n",
"从算法改进思路来看,对系统中的3个关键子模块,共进行了8个方面的改进。\n",
"\n",
"* 版面分析\n",
"\t* PP-PicoDet: 轻量级版面分析模型\n",
"\t* FGD: 兼顾全局与局部特征的模型蒸馏算法\n",
"\n",
"* 表格识别\n",
"\t* PP-LCNet: CPU友好型轻量级骨干网络\n",
"\t* CSP-PAN: 轻量级高低层特征融合模块\n",
"\t* SLAHead: 结构与位置信息对齐的特征解码模块\n",
"\n",
"* 关键信息抽取\n",
"\t* VI-LayoutXLM: 视觉特征无关的多模态预训练模型结构\n",
"\t* TB-YX: 考虑阅读顺序的文本行排序逻辑\n",
"\t* UDML: 联合互学习知识蒸馏策略\n",
"\n",
"最终,与PP-StructureV1相比:\n",
"\n",
"- 版面分析模型参数量减少95.6%,推理速度提升11倍,精度提升0.4%;\n",
"- 表格识别预测耗时不变,模型精度提升6%,端到端TEDS提升2%;\n",
"- 关键信息抽取模型速度提升2.8倍,语义实体识别模型精度提升2.8%;关系抽取模型精度提升9.1%。\n",
"\n",
"\n",
"更详细的优化细节可参考技术报告:https://arxiv.org/abs/2210.05391v2 。\n",
"\n",
"更多关于PaddleOCR的内容,可以点击 https://github.com/PaddlePaddle/PaddleOCR 进行了解。\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. 模型效果\n",
"\n",
"PP-StructureV2的效果如下:\n",
"\n",
"- 版面分析\n",
" \n",
"<div align=\"center\">\n",
"<img src=\"https://user-images.githubusercontent.com/14270174/185940654-956ef614-888a-4779-bf63-a6c2b61b97fa.png\" width = \"60%\" />\n",
"</div>\n",
"\n",
"- 表格识别\n",
" \n",
"<div align=\"center\">\n",
"<img src=\"https://user-images.githubusercontent.com/14270174/185941221-c94e3d45-524c-4073-9644-21ba6a9fd93e.png\" width = \"60%\" />\n",
"</div>\n",
"\n",
"- 版面恢复\n",
" \n",
"<div align=\"center\">\n",
"<img src=\"https://user-images.githubusercontent.com/14270174/185941816-4dabb3e8-a0db-4094-98ea-52e0a0fda8e8.png\" width = \"60%\" />\n",
"</div>\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. 模型如何使用\n",
"\n",
"### 3.1 模型推理:\n",
"* 安装PaddleOCR whl包"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"! pip install paddleocr>=2.6.1.0 -i http://mirrors.cloud.tencent.com/pypi/simple"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* 快速体验\n",
" \n",
"图像方向分类+版面分析+表格识别"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/1.png\n",
"! paddleocr --image_dir=1.png --type=structure --image_orientation=true"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"版面分析+表格识别"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/1.png\n",
"! paddleocr --image_dir=1.png --type=structure"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"版面分析"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/1.png\n",
"! paddleocr --image_dir=1.png --type=structure --table=false --ocr=false"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"表格识别"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/table.jpg\n",
"! paddleocr --image_dir=table.jpg --type=structure --layout=false"
]
},
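{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "除命令行方式外,也可以通过Python API串联版面分析与表格识别。以下为基于paddleocr whl包快速使用方式的一个简单示例(图片路径与输出目录仅为示意):"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "import cv2\n",
  "from paddleocr import PPStructure, save_structure_res\n",
  "\n",
  "table_engine = PPStructure(show_log=True)  # 版面分析 + 表格识别\n",
  "img = cv2.imread('1.png')  # 使用上文下载的示例图片\n",
  "result = table_engine(img)\n",
  "save_structure_res(result, './output', '1')  # 保存各区域结果到output目录\n",
  "for region in result:\n",
  "    region.pop('img')  # 去掉图像数据,便于打印\n",
  "    print(region['type'], region['bbox'])"
 ]
},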
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 模型训练\n",
"PP-StructureV2系统由版面分析模型、文本检测模型、文本识别模型和表格识别模型构成,四个模型训练教程可参考如下文档:\n",
"1. 版面分析模型: [版面分析模型训练教程](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/ppstructure/layout/README_ch.md)\n",
"2. 文本检测模型: [文本检测训练教程](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/detection.md)\n",
"3. 文本识别模型: [文本识别训练教程](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/recognition.md)\n",
"3. 表格识别模型: [表格识别训练教程](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/table_recognition.md)\n",
"\n",
"模型训练完成后,可以通过指定模型路径的方式串联使用\n",
"命令参考如下:\n",
"```python\n",
"paddleocr --image_dir 11.jpg --layout_model_dir=/path/to/layout_inference_model --det_model_dir=/path/to/det_inference_model --rec_model_dir=/path/to/rec_inference_model --table_model_dir=/path/to/table_inference_model\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. 原理\n",
"\n",
"各模块优化思路具体如下\n",
"\n",
"1. 整图方向矫正\n",
" \n",
"由于训练集一般以正方向图像为主,旋转过的文档图像直接输入模型会增加识别难度,影响识别效果。PP-StructureV2引入了整图方向矫正模块来判断含文字图像的方向,并将其进行方向调整。\n",
"\n",
"我们直接调用PaddleClas中提供的文字图像方向分类模型-[PULC_text_image_orientation](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/zh_CN/PULC/PULC_text_image_orientation.md),该模型部分数据集图像如下所示。不同于文本行方向分类器,文字图像方向分类模型针对整图进行方向判别。文字图像方向分类模型在验证集上精度高达99%,单张图像CPU预测耗时仅为`2.16ms`。\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185939683-f6465473-3303-4a0c-95be-51f04fb9f387.png\" width=\"600\">\n",
"</div>\n",
"\n",
"2. 版面分析\n",
"\n",
"版面分析指的是对图片形式的文档进行区域划分,定位其中的关键区域,如文字、标题、表格、图片等,PP-StructureV1使用了PaddleDetection中开源的高效检测算法PP-YOLOv2完成版面分析的任务。\n",
"\n",
"**(1)轻量级版面分析模型PP-PicoDet**\n",
"\n",
"`PP-PicoDet`是PaddleDetection中提出的轻量级目标检测模型,通过使用PP-LCNet骨干网络、CSP-PAN特征融合模块、SimOTA标签分配方法等优化策略,最终在CPU与移动端具有卓越的性能。我们将PP-StructureV1中采用的PP-YOLOv2模型替换为`PP-PicoDet`,同时针对版面分析场景优化预测尺度,从针对目标检测设计的`640*640`调整为更适配文档图像的`800*608`,在`1.0x`配置下,模型精度与PP-YOLOv2相当,CPU平均预测速度可提升11倍。\n",
"\n",
"**(2)FGD知识蒸馏**\n",
"\n",
"FGD(Focal and Global Knowledge Distillation for Detectors),是一种兼顾局部全局特征信息的模型蒸馏方法,分为Focal蒸馏和Global蒸馏2个部分。Focal蒸馏分离图像的前景和背景,让学生模型分别关注教师模型的前景和背景部分特征的关键像素;Global蒸馏部分重建不同像素之间的关系并将其从教师转移到学生,以补偿Focal蒸馏中丢失的全局信息。我们基于FGD蒸馏策略,使用教师模型PP-PicoDet-LCNet2.5x(mAP=94.2%)蒸馏学生模型PP-PicoDet-LCNet1.0x(mAP=93.5%),可将学生模型精度提升0.5%,和教师模型仅差0.2%,而预测速度比教师模型快1倍。\n",
"\n",
"\n",
"3. 表格识别\n",
"\n",
"PP-StructureV2中,我们对模型结构和损失函数等5个方面进行升级,提出了 SLANet (Structure Location Alignment Network) ,优化点如下:\n",
"\n",
"**(1) CPU友好型轻量级骨干网络PP-LCNet**\n",
"\n",
"PP-LCNet是结合Intel-CPU端侧推理特性而设计的轻量高性能骨干网络,该方案在图像分类任务上取得了比ShuffleNetV2、MobileNetV3、GhostNet等轻量级模型更优的“精度-速度”均衡。PP-StructureV2中,我们采用PP-LCNet作为骨干网络,表格识别模型精度从71.73%提升至72.98%;同时加载通过SSLD知识蒸馏方案训练得到的图像分类模型权重作为表格识别的预训练模型,最终精度进一步提升2.95%至74.71%。\n",
"\n",
"**(2)轻量级高低层特征融合模块CSP-PAN**\n",
"\n",
"对骨干网络提取的特征进行融合,可以有效解决尺度变化较大等复杂场景中的模型预测问题。早期,FPN模块被提出并用于特征融合,但是它的特征融合过程仅包含单向(高->低),融合不够充分。CSP-PAN基于PAN进行改进,在保证特征融合更为充分的同时,使用CSP block、深度可分离卷积等策略减小了计算量。在表格识别场景中,我们进一步将CSP-PAN的通道数从128降低至96以降低模型大小。最终表格识别模型精度提升0.97%至75.68%,预测速度提升10%。\n",
"\n",
"**(3)结构与位置信息对齐的特征解码模块SLAHead**\n",
"\n",
"TableRec-RARE的TableAttentionHead如下图a所示,TableAttentionHead在执行完全部step的计算后拿到最终隐藏层状态表征(hiddens),随后hiddens经由SDM(Structure Decode Module)和CLDM(Cell Location Decode Module)模块生成全部的表格结构token和单元格坐标。但是这种设计忽略了单元格token和坐标之间一一对应的关系。\n",
"\n",
"PP-StructureV2中,我们设计SLAHead模块,对单元格token和坐标之间做了对齐操作,如下图b所示。在SLAHead中,每一个step的隐藏层状态表征会分别送入SDM和CLDM来得到当前step的token和坐标,每个step的token和坐标输出分别进行concat得到表格的html表达和全部单元格的坐标。此外,考虑到表格识别模型的单元格准确率依赖于表格结构的识别准确,我们将损失函数中表格结构分支与单元格定位分支的权重比从1:1提升到8:1,并使用收敛更稳定的Smoothl1 Loss替换定位分支中的MSE Loss。最终模型精度从75.68%提高至77.7%。\n",
"\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185940968-e3a2fbac-78d7-4b74-af54-a1dab860f470.png\" width=\"1200\">\n",
"</div>\n",
"\n",
"\n",
"**(4)其他**\n",
"\n",
"TableRec-RARE算法中,我们使用`<td>`和`</td>`两个单独的token来表示一个非跨行列单元格,这种表示方式限制了网络对于单元格数量较多表格的处理能力。\n",
"\n",
"PP-StructureV2中,我们参考TableMaster中的token处理方法,将`<td>`和`</td>`合并为一个token-`<td></td>`。合并token后,验证集中token长度大于500的图片也参与模型评估,最终模型精度降低为76.31%,但是端到端TEDS提升1.04%。\n",
"\n",
"\n",
"4. 版面恢复\n",
"\n",
"版面恢复指的是文档图像经过OCR识别、版面分析、表格识别等方法处理后的内容可以与原始文档保持相同的排版方式,并输出到word等文档中。PP-StructureV2中,我们版面恢复系统,包含版面分析、表格识别、OCR文本检测与识别等子模块。\n",
"下图展示了版面恢复的结果:\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185941816-4dabb3e8-a0db-4094-98ea-52e0a0fda8e8.png\" width=\"1200\">\n",
"</div>\n",
"\n",
"5. 关键信息抽取\n",
"\n",
"关键信息抽取指的是针对文档图像的文字内容,提取出用户关注的关键信息,如身份证中的姓名、住址等字段。PP-Structure中支持了基于多模态LayoutLM系列模型的语义实体识别 (Semantic Entity Recognition, SER) 以及关系抽取 (Relation Extraction, RE) 任务。PP-StructureV2中,我们对模型结构以及下游任务训练方法进行升级,提出了VI-LayoutXLM(Visual-feature Independent LayoutXLM),具体流程图如下所示。\n",
"\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185941978-abec7d4a-5e3a-4141-83f8-088d04ef898e.png\" width=\"1000\">\n",
"</div>\n",
"\n",
"\n",
"具体优化策略如下:\n",
"\n",
"**(1) VI-LayoutXLM(Visual-feature Independent LayoutXLM)**\n",
"\n",
"LayoutLMv2以及LayoutXLM中引入视觉骨干网络,用于提取视觉特征,并与后续的text embedding进行联合,作为多模态的输入embedding。但是该模块为基于`ResNet_x101_64x4d`的特征提取网络,特征抽取阶段耗时严重,因此我们将其去除,同时仍然保留文本、位置以及布局等信息,最终发现针对LayoutXLM进行改进,下游SER任务精度无损,针对LayoutLMv2进行改进,下游SER任务精度仅降低`2.1%`,而模型大小减小了约`340M`。同时,基于XFUND数据集,VI-LayoutXLM在RE任务上的精度也进一步提升了`1.06%`。\n",
"\n",
"**(2) TB-YX排序方法(Threshold-Based YX sorting algorithm)**\n",
"\n",
"文本阅读顺序对于信息抽取与文本理解等任务至关重要,传统多模态模型中,没有考虑不同OCR工具可能产生的不正确阅读顺序,而模型输入中包含位置编码,阅读顺序会直接影响预测结果,在预处理中,我们对文本行按照从上到下,从左到右(YX)的顺序进行排序,为防止文本行位置轻微干扰带来的排序结果不稳定问题,在排序的过程中,引入位置偏移阈值Th,对于Y方向距离小于Th的2个文本内容,使用x方向的位置从左到右进行排序。\n",
"\n",
"不同排序方法的结果对比如下所示,可以看出引入偏离阈值之后,排序结果更加符合人类的阅读顺序。\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185942080-9d4bafc9-fa7f-4da4-b139-b2bd703dc76d.png\" width=\"800\">\n",
"</div>\n",
"\n",
"\n",
"使用该策略,最终XFUND数据集上,SER任务F1指标提升`2.06%`,RE任务F1指标提升`7.04%`。\n",
"\n",
"**(3) 互学习蒸馏策略**\n",
"\n",
"UDML(Unified-Deep Mutual Learning)联合互学习是PP-OCRv2与PP-OCRv3中采用的对于文本识别非常有效的提升模型效果的策略。在训练时,引入2个完全相同的模型进行互学习,计算2个模型之间的互蒸馏损失函数(DML loss),同时对transformer中间层的输出结果计算距离损失函数(L2 loss)。使用该策略,最终XFUND数据集上,SER任务F1指标提升`0.6%`,RE任务F1指标提升`5.01%`。\n",
"\n",
"最终优化后模型基于SER任务的可视化结果如下所示。\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185942213-0909135b-3bcd-4d79-9e69-847dfb1c3b82.png\" width=\"800\">\n",
"</div>\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185942237-72923b42-8590-42eb-b687-fa819b1c3afd.png\" width=\"800\">\n",
"</div>\n",
"\n",
"\n",
"RE任务的可视化结果如下所示。\n",
"\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185942400-8920dc3c-de7f-46d0-b0bc-baca9536e0e1.png\" width=\"800\">\n",
"</div>\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185942416-ca4fd8b0-9227-4c65-b969-0afbda525b85.png\" width=\"800\">\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. 注意事项\n",
"\n",
"1. PP-StructureV2系列模型训练过程中均公开数据集,如在实际场景中表现不满意,可标注少量数据进行finetune。\n",
"2. 在线体验目前仅开放表格识别的体验,如需版面分析和版面恢复,请参考`3.1 模型推理`。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. 相关论文以及引用信息\n",
"```\n",
"@article{li2022pp,\n",
" title={PP-StructureV2: A Stronger Document Analysis System},\n",
" author={Li, Chenxia and Guo, Ruoyu and Zhou, Jun and An, Mengtao and Du, Yuning and Zhu, Lingfeng and Liu, Yi and Hu, Xiaoguang and Yu, Dianhai},\n",
" journal={arXiv preprint arXiv:2210.05391},\n",
" year={2022}\n",
"}\n",
"```\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.13 ('py38')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "58fd1890da6594cebec461cf98c6cb9764024814357f166387d10d267624ecd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1. PP-StructureV2 Introduction\n",
"\n",
"PP-StructureV2 is further improved on the basis of PP-StructureV1, mainly in the following three aspects:\n",
"\n",
" * **System function upgrade**: Added image correction and layout restoration modules, image conversion to word/pdf, and key information extraction capabilities!\n",
" * **System performance optimization** :\n",
"\t * Layout analysis: released a lightweight layout analysis model, the speed is increased by **11 times**, and the average CPU time is only **41ms**!\n",
"\t * Table recognition: three optimization strategies are designed, and the model accuracy is improved by **6%** when the prediction time is constant.\n",
"\t * Key information extraction: designing a visually irrelevant model structure, the accuracy of semantic entity recognition is improved by **2.8%**, and the accuracy of relation extraction is improved by **9.1%**.\n",
" * **Chinese scene adaptation**: Complete the Chinese scene adaptation for layout analysis and table recognition, open source **out-of-the-box** Chinese scene layout structure model!\n",
"\n",
"The PP-StructureV2 framework is shown in the figure below. Firstly, the input document image direction is corrected by the Image Direction Correction module. For the Layout Information Extraction subsystem, as shown in the upper branch, the corrected image is firstly divided into different areas such as text, table and image through the layout analysis module, and then these areas are recognized respectively. For example, the table area is sent to the table recognition module for structural recognition, and the text area is sent to the OCR engine for text recognition. Finally, the layout recovery module is used to restore the image to an editable Word file consistent with the original image layout. For the Key Information Extraction subsystem, as shown in the lower branch, OCR engine is used to extract the text content, then the Semantic Entity Recognition module and Relation Extraction module are used to obtain the entities and their relationship in the image, respectively, so as to extract the required key information.\n",
"\n",
"<div align=\"center\">\n",
"<img src=\"https://user-images.githubusercontent.com/14270174/185939247-57e53254-399c-46c4-a610-da4fa79232f5.png\" width = \"80%\" />\n",
"</div>\n",
"\n",
"\n",
"We made 8 improvements to 3 key sub-modules in the system.\n",
"\n",
"* Layout analysis\n",
"\t* PP-PicoDet: A better real-time object detector on mobile devices\n",
"\t* FGD: Focal and Global Knowledge Distillation\n",
"\n",
"* Table Recognition\n",
"\t* PP-LCNet: CPU-friendly Lightweight Backbone\n",
"\t* CSP-PAN: Lightweight Multi-level Feature Fusion Module\n",
"\t* SLAHead: Structure and Location Alignment Module\n",
"\n",
"* Key Information Extraction\n",
"\t* VI-LayoutXLM: Visual-feature Independent LayoutXLM\n",
"\t* TB-YX: Threshold-Based YX sorting algorithm\n",
"\t* UDML: Unified-Deep Mutual Learning\n",
"\n",
"Finally, compared to PP-StructureV1:\n",
"\n",
"- The number of parameters of the layout analysis model is reduced by 95.6%, the inference speed is increased by 11 times, and the accuracy is increased by 0.4%;\n",
"- The table recognition model improves the model accuracy by 6% and the end-to-end TEDS by 2% without changing the prediction time.\n",
"- The speed of the key information extraction model is increased by 2.8 times, the accuracy of the semantic entity recognition model is increased by 2.8%, and the accuracy of the relationship extraction model is increased by 9.1%.\n",
"\n",
"\n",
"For more details, please refer to the technical report: https://arxiv.org/abs/2210.05391v2 .\n",
"\n",
"For more information about PaddleOCR, you can click https://github.com/PaddlePaddle/PaddleOCR to learn more.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Model Effects\n",
"\n",
"The results of PP-StructureV2 are as follows:\n",
"\n",
"- Layout analysis\n",
" \n",
"<div align=\"center\">\n",
"<img src=\"https://user-images.githubusercontent.com/14270174/185940654-956ef614-888a-4779-bf63-a6c2b61b97fa.png\" width = \"60%\" />\n",
"</div>\n",
"\n",
"- Table recognition\n",
" \n",
"<div align=\"center\">\n",
"<img src=\"https://user-images.githubusercontent.com/14270174/185941221-c94e3d45-524c-4073-9644-21ba6a9fd93e.png\" width = \"60%\" />\n",
"</div>\n",
"\n",
"- Layout recovery\n",
" \n",
"<div align=\"center\">\n",
"<img src=\"https://user-images.githubusercontent.com/14270174/185941816-4dabb3e8-a0db-4094-98ea-52e0a0fda8e8.png\" width = \"60%\" />\n",
"</div>\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3. How to Use the Model\n",
"\n",
"### 3.1 Inference\n",
"* Install PaddleOCR whl package"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"! pip install paddleocr>=2.6.1.0 -i http://mirrors.cloud.tencent.com/pypi/simple"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Quick experience\n",
" \n",
"image orientation + layout analysis + table recognition"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true,
"tags": []
},
"outputs": [],
"source": [
"! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/1.png\n",
"! paddleocr --image_dir=1.png --type=structure --image_orientation=true"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"layout analysis + table recognition"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/1.png\n",
"! paddleocr --image_dir=1.png --type=structure"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"layout analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/1.png\n",
"! paddleocr --image_dir=1.png --type=structure --table=false --ocr=false"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"table recognition"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"! wget https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.6/ppstructure/docs/table/table.jpg\n",
"! paddleocr --image_dir=table.jpg --type=structure --layout=false"
]
},
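{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "Besides the command line, the pipeline can also be driven from the Python API. The following is a minimal sketch based on the paddleocr whl quick-start usage; the image path and output folder are placeholders:"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "import cv2\n",
  "from paddleocr import PPStructure, save_structure_res\n",
  "\n",
  "table_engine = PPStructure(show_log=True)  # layout analysis + table recognition\n",
  "img = cv2.imread('1.png')  # the demo image downloaded above\n",
  "result = table_engine(img)\n",
  "save_structure_res(result, './output', '1')  # save per-region results\n",
  "for region in result:\n",
  "    region.pop('img')  # drop the raw image data before printing\n",
  "    print(region['type'], region['bbox'])"
 ]
},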
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3.2 Train the model\n",
"The PP-StructureV2 system consists of a layout analysis model, a text detection model, a text recognition model and a table recognition model. For the four model training tutorials, please refer to the following documents:\n",
"1. Layout analysis model: [Layout analysis model training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/ppstructure/layout/README_ch.md)\n",
"2. text detection model: [text detection training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/detection.md)\n",
"3. text recognition model: [text recognition training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/recognition.md)\n",
"3. table recognition model: [table recognition training tutorial](https://github.com/PaddlePaddle/PaddleOCR/blob/release%2F2.6/doc/doc_ch/table_recognition.md)\n",
"\n",
"After the model training is completed, it can be used in series by specifying the model path. The command reference is as follows:\n",
"```python\n",
"paddleocr --image_dir 11.jpg --layout_model_dir=/path/to/layout_inference_model --det_model_dir=/path/to/det_inference_model --rec_model_dir=/path/to/rec_inference_model --table_model_dir=/path/to/table_inference_model\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4. Model Principles\n",
"\n",
"The enhancement strategies of each module are as follows\n",
"\n",
"1. Image Direction Correction Module\n",
" \n",
"Since the training set is generally dominated by 0-degree images, the information extraction effect of rotated images is often compromised. In PP-StructureV2, the input image direction is firstly corrected by the PULC text image direction model provided by PaddleClas. Some demo images in the dataset are shown below. Different from the text line direction classifier, the text image direction classifier performs direction classification for the entire image. The text image direction classification model achieves 99% accuracy on the validation set with 463 FPS on CPU device.\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185939683-f6465473-3303-4a0c-95be-51f04fb9f387.png\" width=\"600\">\n",
"</div>\n",
"\n",
"1. Layout Analysis\n",
"\n",
"Layout Analysis refers to dividing document images into predefined areas such as text, title, table, and figure. In PP-Structure, we adopted the object detection algorithm PP-YOLOv2 as the layout detector.\n",
"\n",
"**(1)PP-PicoDet: A better real-time object detector on mobile devices**\n",
"\n",
"PaddleDetection proposed a new family of real-time object detectors, named PP-PicoDet, which achieves superior performance on mobile devices. PP-PicoDet adopts the CSP structure to constructure CSP-PAN as the neck, SimOTA as label assignment strategy, PP-LCNet as the backbone, and an improved detection One-shot Neural Architecture Search(NAS) is proposed to find the optimal architecture automatically for object detection. We replace PP-YOLOv2 adopted by PP-Structure with PP-PicoDet, and adjust the input scale from 640*640 to 800*608, which is more suitable for document images. With 1.0x configuration, the accuracy is comparable to PP-YOLOv2, and the CPU inference speed is 11 times faster.\n",
"\n",
"**(2) FGD: Focal and Global Knowledge Distillation**\n",
"\n",
"FGD, a knowledge distillation algorithm for object detection, takes into account local and global feature maps, combining focal distillation and global distillation. Focal distillation separates the foreground and background of the image, forcing the student to focus on the teacher’s critical pixels and channels. Global distillation rebuilds the relation between different pixels and transfers it from teachers to students, compensating for missing global information in focal distillation. Based on the FGD distillation strategy, the student model (LCNet1.0x based PP-PicoDet) gets 0.5% mAP improvement with the knowledge from the teacher model (LCNet2.5x based PP-PicoDet). Finally the student model is only 0.2% lower than the teacher model on mAP, but 100% faster.\n",
"\n",
"1. Table Recognition\n",
"\n",
"In PP-StructureV2, we propose an efficient Table Recognition algorithm named SLANet (Structure Location Alignment Network). Compared with TableRec-RARE, SLANet has been upgraded in terms of model structure and loss. The enhancement strategies are as follows:\n",
"\n",
"**(1) PP-LCNet: CPU-friendly Lightweight Backbone**\n",
"\n",
"PP-LCNet is a lightweight CPU network based on the MKLDNN acceleration strategy, which achieves better performance on multiple tasks than lightweight models such as ShuffleNetV2, MobileNetV3, and GhostNet. Additionally, pre-trained weights trained by SSLD on ImageNet are used for Table Recognition model training process for higher accuracy.\n",
"\n",
"**(2) CSP-PAN: Lightweight Multi-level Feature Fusion Module**\n",
"\n",
"Fusion of the features extracted by the backbone network can effectively alleviate problems brought by scale changes in complex scenes. In the early days, the FPN module was proposed and used for feature fusion, but its feature fusion process was one-way (from high-level to low-level), which was not sufficient. CSP-PAN is improved based on PAN. While ensuring more sufficient feature fusion, strategies such as CSP block and depthwise separable convolution are used to reduce the computational cost. In SLANet, we reduce the output channels of CSP-PAN from 128 to 96 in order to reduce the model size.\n",
"\n",
"\n",
"**(3) SLAHead: Structure and Location Alignment Module**\n",
"\n",
"In the TableRec-RARE head, output of each step is concatenated and fed into SDM (Structure Decode Module) and CLDM (Cell Location Decode Module) to generate all cell tokens and coordinates, which ignores the one-to-one correspondence between cell token and coordinates. Therefore, we propose the SLAHead to align cell token and coordinates. In SLAHead, output of each step is fed into SDM and CLDM to get the token and coordinates of the current step, the token and coordinates of all steps are concatenated to get the HTML table representation and coordinates of all cells.\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185940968-e3a2fbac-78d7-4b74-af54-a1dab860f470.png\" width=\"1200\">\n",
"</div>\n",
"\n",
"\n",
"**(4) Merge Token**\n",
"\n",
"In TableRec-RARE, we use two separate tokens `<td>` and `</td>` to represent a non-cross-row-column cell, which limits the network’s ability to handle tables with a large number of cells. Inspired by TableMaster, we regard `<td>` and `</td>` as one token `<td></td>` in SLANet.\n",
"\n",
"\n",
"1. Layout Recovery\n",
"\n",
"Layout Recovery a newly added module which is responsible for restoring the image to an editable Word file according to the analysis results. The following figure shows the result of layout restoration:\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185941816-4dabb3e8-a0db-4094-98ea-52e0a0fda8e8.png\" width=\"1200\">\n",
"</div>\n",
"\n",
"1. Key Information Extraction\n",
"\n",
"Key Information Extraction (KIE) is usually used to extract the specific information such as name, address and other fields in the ID card or forms. Semantic Entity Recognition (SER) and Relationship Extraction (RE) are two subtasks in KIE, which have been supported in PP-Structure. In PP-StructureV2, we design a visual-feature independent LayoutXLM structure for less inference time cost. TB-YX sorting algorithm and U-DML knowledge distillation are utilized for higher accuracy. The following figure shows the KIE framework.\n",
"\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185941978-abec7d4a-5e3a-4141-83f8-088d04ef898e.png\" width=\"1000\">\n",
"</div>\n",
"\n",
"\n",
"The enhancement strategies are as follows:\n",
"\n",
"**(1) VI-LayoutXLM(Visual-feature Independent LayoutXLM)**\n",
"\n",
"Visual backbone network is introduced in LayoutLMv2 and LayoutXLM to extract visual features and combine with subsequent text embedding as multi-modal input embedding. Considering that the visual backbone is base on ResNet x101 64x4d, which takes much time during the visual feature extraction process, we remove this submodule from LayoutXLM. Surprisingly, we found that Hmean of SER and RE tasks based on LayoutXLM is not decreased, and Hmean of SER task based on LayoutLMv2 is just reduced by 2.1%, while the model size is reduced by about 340MB. At the same time, based on the XFUND dataset, the accuracy of VI-LayoutXLM on the RE task is improved by `1.06%`.\n",
"\n",
"**(2) TB-YX: Threshold-Based YX sorting algorithm**\n",
"\n",
"Text reading order is important for KIE tasks. In traditional multi-modal KIE methods, incorrect reading order that may be generated by different OCR engines is not considered, which will directly affect the position embedding and final inference result. Generally, we sort the OCR results from top to bottom and then left to right according to the absolute coordinates of the detected text boxes (YX). The obtained order is usually unstable and not consistent with the reading order. We introduce a position offset threshold th to address this problem (TB-YX). The text boxes are still sorted from top to bottom first, but when the distance between the two text boxes in the Y direction is less than the threshold th, their order is determined by the order in the X direction.\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185942080-9d4bafc9-fa7f-4da4-b139-b2bd703dc76d.png\" width=\"800\">\n",
"</div>\n",
"\n",
"\n",
"Using this strategy, on the XFUND dataset, the F1 index of the SER task increased by `2.06%`, and the F1 index of the RE task increased by `7.04%`.\n",
"\n",
"**(3) U-DML: Unified-Deep Mutual Learning**\n",
"\n",
"U-DML is a distillation method proposed in PP-OCRv2 which can effectively improve the accuracy without increasing model size. In PP-StructureV2, we apply U-DML to the training process of SER and RE tasks, and Hmean is increased by 0.6% and 5.1%, repectively.\n",
"\n",
"\n",
"The visualization results of VI-LayoutXLM based on the SER task are shown below.\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185942213-0909135b-3bcd-4d79-9e69-847dfb1c3b82.png\" width=\"800\">\n",
"</div>\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185942237-72923b42-8590-42eb-b687-fa819b1c3afd.png\" width=\"800\">\n",
"</div>\n",
"\n",
"\n",
"The visualization results of VI-LayoutXLM based on the RE task are shown below.\n",
"\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185942400-8920dc3c-de7f-46d0-b0bc-baca9536e0e1.png\" width=\"800\">\n",
"</div>\n",
"\n",
"<div align=\"center\">\n",
" <img src=\"https://user-images.githubusercontent.com/14270174/185942416-ca4fd8b0-9227-4c65-b969-0afbda525b85.png\" width=\"800\">\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5. Attention\n",
"\n",
"1. The PP-StructureV2 series of models have public data sets during the training process. If the performance is not satisfactory in the actual scene, a small amount of data can be marked for finetune.\n",
"2. The online experience currently only supports table recognition. For layout analysis and layout recovery, please refer to `3.1 Model Inference`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6. Related papers and citations\n",
"```\n",
"@article{li2022pp,\n",
" title={PP-StructureV2: A Stronger Document Analysis System},\n",
" author={Li, Chenxia and Guo, Ruoyu and Zhou, Jun and An, Mengtao and Du, Yuning and Zhu, Lingfeng and Liu, Yi and Hu, Xiaoguang and Yu, Dianhai},\n",
" journal={arXiv preprint arXiv:2210.05391},\n",
" year={2022}\n",
"}\n",
"```\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.8.13 ('py38')",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.13"
},
"vscode": {
"interpreter": {
"hash": "58fd1890da6594cebec461cf98c6cb9764024814357f166387d10d267624ecd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}