Commit 2976dab9, authored by: A an1018

Merge branch 'dygraph' of https://github.com/PaddlePaddle/PaddleOCR into add_layout_hub

......@@ -26,17 +26,16 @@ PaddleOCR aims to create multilingual, awesome, leading, and practical OCR tools
</div>
## Recent updates
- **🔥2022.8.24 Release PaddleOCR [release/2.6](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6)**
- Release [PP-Structurev2](./ppstructure/), with functions and performance fully upgraded, adapted to Chinese scenes, and new support for [Layout Recovery](./ppstructure/recovery) and [PDF to Word]();
- [Layout Analysis](./ppstructure/layout) optimization: model storage reduced by 95%, while speed increased by 11 times, and the average CPU time-cost is only 41ms;
- [Table Recognition](./ppstructure/table) optimization: 3 optimization strategies are designed, and the model accuracy is improved by 6% under comparable time consumption;
- [Key Information Extraction](./ppstructure/kie) optimization: a visual-independent model structure is designed, the accuracy of semantic entity recognition is increased by 2.8%, and the accuracy of relation extraction is increased by 9.1%.
- **🔥2022.5.9 Release PaddleOCR [release/2.5](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.5)**
- Release [PP-OCRv3](./doc/doc_en/ppocr_introduction_en.md#pp-ocrv3): With comparable speed, the effect of Chinese scene is further improved by 5% compared with PP-OCRv2, the effect of English scene is improved by 11%, and the average recognition accuracy of 80 language multilingual models is improved by more than 5%.
- Release [PPOCRLabelv2](./PPOCRLabel): Add the annotation function for table recognition task, key information extraction task and irregular text image.
- Release the interactive e-book [*"Dive into OCR"*](./doc/doc_en/ocr_book_en.md), covering the cutting-edge theory and code practice of full-stack OCR technology.
- 2021.12.21 Release PaddleOCR [release/2.4](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.4)
- Release 1 text detection algorithm (PSENet), 3 text recognition algorithms (NRTR, SEED, SAR).
- Release 1 key information extraction algorithm (SDMGR, [tutorial](./ppstructure/docs/kie_en.md)) and 3 [DocVQA](./ppstructure/vqa) algorithms (LayoutLM, LayoutLMv2, LayoutXLM).
- 2021.9.7 Release PaddleOCR [release/2.3](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.3)
- Release [PP-OCRv2](./doc/doc_en/ppocr_introduction_en.md#pp-ocrv2). The inference speed of PP-OCRv2 is 220% higher than that of PP-OCR server on CPU devices. The F-score of PP-OCRv2 is 7% higher than that of PP-OCR mobile.
- 2021.8.3 Release PaddleOCR [release/2.2](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.2)
- Release a new structured document analysis toolkit, [PP-Structure](./ppstructure/README.md), which supports layout analysis and table recognition (one-click export of table images to Excel files).
- [more](./doc/doc_en/update_en.md)
......@@ -45,7 +44,9 @@ PaddleOCR aims to create multilingual, awesome, leading, and practical OCR tools
PaddleOCR supports a variety of cutting-edge OCR algorithms, and on this basis has developed the industrial models/solutions [PP-OCR](./doc/doc_en/ppocr_introduction_en.md) and [PP-Structure](./ppstructure/README.md), covering the whole process of data production, model training, compression, inference and deployment.
![](./doc/features_en.png)
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186171245-40abc4d7-904f-4949-ade1-250f86ed3a90.png">
</div>
> It is recommended to start with the “quick experience” in the document tutorial
......@@ -113,18 +114,19 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
- [Quick Start](./ppstructure/docs/quickstart_en.md)
- [Model Zoo](./ppstructure/docs/models_list_en.md)
- [Model training](./doc/doc_en/training_en.md)
- [Layout Parser](./ppstructure/layout/README.md)
- [Layout Analysis](./ppstructure/layout/README.md)
- [Table Recognition](./ppstructure/table/README.md)
- [DocVQA](./ppstructure/vqa/README.md)
- [Key Information Extraction](./ppstructure/docs/kie_en.md)
- [Key Information Extraction](./ppstructure/kie/README.md)
- [Inference and Deployment](./deploy/README.md)
- [Python Inference](./ppstructure/docs/inference_en.md)
- [C++ Inference]()
- [C++ Inference](./deploy/cpp_infer/readme.md)
- [Serving](./deploy/pdserving/README.md)
- [Academic algorithms](./doc/doc_en/algorithms_en.md)
- [Academic Algorithms](./doc/doc_en/algorithm_overview_en.md)
- [Text detection](./doc/doc_en/algorithm_overview_en.md)
- [Text recognition](./doc/doc_en/algorithm_overview_en.md)
- [End-to-end](./doc/doc_en/algorithm_overview_en.md)
- [End-to-end OCR](./doc/doc_en/algorithm_overview_en.md)
- [Table Recognition](./doc/doc_en/algorithm_overview_en.md)
- [Key Information Extraction](./doc/doc_en/algorithm_overview_en.md)
- [Add New Algorithms to PaddleOCR](./doc/doc_en/add_new_algorithm_en.md)
- Data Annotation and Synthesis
- [Semi-automatic Annotation Tool: PPOCRLabel](./PPOCRLabel/README.md)
......@@ -135,9 +137,9 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
- [General OCR Datasets(Chinese/English)](doc/doc_en/dataset/datasets_en.md)
- [HandWritten_OCR_Datasets(Chinese)](doc/doc_en/dataset/handwritten_datasets_en.md)
- [Various OCR Datasets(multilingual)](doc/doc_en/dataset/vertical_and_multilingual_datasets_en.md)
- [layout analysis](doc/doc_en/dataset/layout_datasets_en.md)
- [table recognition](doc/doc_en/dataset/table_datasets_en.md)
- [DocVQA](doc/doc_en/dataset/docvqa_datasets_en.md)
- [Layout Analysis](doc/doc_en/dataset/layout_datasets_en.md)
- [Table Recognition](doc/doc_en/dataset/table_datasets_en.md)
- [Key Information Extraction](doc/doc_en/dataset/kie_datasets_en.md)
- [Code Structure](./doc/doc_en/tree_en.md)
- [Visualization](#Visualization)
- [Community](#Community)
......@@ -176,7 +178,7 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
</details>
<details open>
<summary>PP-Structure</summary>
<summary>PP-Structurev2</summary>
- layout analysis + table recognition
<div align="center">
......@@ -185,12 +187,28 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
- SER (Semantic entity recognition)
<div align="center">
<img src="./ppstructure/docs/vqa/result_ser/zh_val_0_ser.jpg" width="800">
<img src="https://user-images.githubusercontent.com/25809855/186094456-01a1dd11-1433-4437-9ab2-6480ac94ec0a.png" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185310636-6ce02f7c-790d-479f-b163-ea97a5a04808.jpg" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185539517-ccf2372a-f026-4a7c-ad28-c741c770f60a.png" width="600">
</div>
- RE (Relation Extraction)
<div align="center">
<img src="./ppstructure/docs/vqa/result_re/zh_val_21_re.jpg" width="800">
<img src="https://user-images.githubusercontent.com/25809855/186094813-3a8e16cc-42e5-4982-b9f4-0134dfb5688d.png" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185393805-c67ff571-cf7e-4217-a4b0-8b396c4f22bb.jpg" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185540080-0431e006-9235-4b6d-b63d-0b3c6e1de48f.jpg" width="600">
</div>
</details>
......
......@@ -27,21 +27,20 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力
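## Recent updates
- **🔥2022.7 Release [OCR scene application collection](./applications)**
- Release the OCR scene application collection, which includes **7 vertical models** such as digital tube, LCD screen, license plate, and high-precision SVTR models, covering the main OCR vertical applications in the general, manufacturing, finance, and transportation industries.
- **🔥2022.8.24 Release PaddleOCR [release/2.6](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.6)**
- Release [PP-Structurev2](./ppstructure/), with system functions and performance fully upgraded, adapted to Chinese scenes, and new support for [Layout Recovery](./ppstructure/recovery) and [PDF to Word]();
- [Layout Analysis](./ppstructure/layout) model optimization: model storage reduced by 95%, speed increased by 11 times, and an average CPU time cost of only 41ms;
- [Table Recognition](./ppstructure/table) model optimization: 3 optimization strategies are designed, and model accuracy improves by 6% with prediction time unchanged;
- [Key Information Extraction](./ppstructure/kie) model optimization: a visual-independent model structure is designed, semantic entity recognition accuracy improves by 2.8%, and relation extraction accuracy improves by 9.1%.
- **🔥2022.5.9 Release PaddleOCR [release/2.5](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.5)**
- **🔥2022.8 Release [OCR scene application collection](./applications)**
- Includes **9 vertical models** such as digital tube, LCD screen, license plate, high-precision SVTR, and handwriting recognition models, covering the main OCR vertical applications in the general, manufacturing, finance, and transportation industries.
- **2022.5.9 Release PaddleOCR [release/2.5](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.5)**
- Release [PP-OCRv3](./doc/doc_ch/ppocr_introduction.md#pp-ocrv3): with comparable speed, the effect on Chinese scenes is further improved by 5% compared with PP-OCRv2, the effect on English scenes is improved by 11%, and the average recognition accuracy of the 80-language multilingual models is improved by more than 5%;
- Release the semi-automatic annotation tool [PPOCRLabelv2](./PPOCRLabel): adds annotation functions for table text images, image key information extraction tasks, and irregular text images;
- Release the OCR industrial deployment toolset: covers 22 training and deployment software/hardware environments and methods, meeting 90% of enterprises' training and deployment environment needs;
- Release the interactive open-source OCR e-book [*"Dive into OCR"*](./doc/doc_ch/ocr_book.md), covering the cutting-edge theory and code practice of full-stack OCR technology, with accompanying teaching videos.
- 2021.12.21 Release PaddleOCR [release/2.4](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.4)
- New OCR algorithms: 1 text detection algorithm ([PSENet](./doc/doc_ch/algorithm_det_psenet.md)) and 3 text recognition algorithms ([NRTR](./doc/doc_ch/algorithm_rec_nrtr.md), [SEED](./doc/doc_ch/algorithm_rec_seed.md), [SAR](./doc/doc_ch/algorithm_rec_sar.md));
- New document structuring algorithms: 1 key information extraction algorithm ([SDMGR](./ppstructure/docs/kie.md)) and 3 [DocVQA](./ppstructure/vqa) algorithms (LayoutLM, LayoutLMv2, LayoutXLM).
- 2021.9.7 Release PaddleOCR [release/2.3](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.3)
- Release [PP-OCRv2](./doc/doc_ch/ppocr_introduction.md#pp-ocrv2): CPU inference speed is 220% higher than that of PP-OCR server, and the effect is 7% better than that of PP-OCR mobile.
- 2021.8.3 Release PaddleOCR [release/2.2](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.2)
- Release the document structure analysis toolkit [PP-Structure](./ppstructure/README_ch.md), supporting layout analysis and table recognition (including Excel export).
> [more](./doc/doc_ch/update.md)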
......@@ -49,7 +48,9 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力
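Supports a variety of cutting-edge OCR algorithms, and on this basis builds the industrial-grade featured models [PP-OCR](./doc/doc_ch/ppocr_introduction.md) and [PP-Structure](./ppstructure/README_ch.md), covering the whole process of data production, model training, compression, and inference deployment.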
![](./doc/features.png)
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186170862-b8f80f6c-fee7-4b26-badc-de9c327c76ce.png">
</div>
> To learn how to use the above, it is recommended to start with the quick start in the documentation tutorials
......@@ -213,14 +214,30 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力
- SER (Semantic Entity Recognition)
<div align="center">
<img src="./ppstructure/docs/vqa/result_ser/zh_val_0_ser.jpg" width="800">
<img src="https://user-images.githubusercontent.com/14270174/185310636-6ce02f7c-790d-479f-b163-ea97a5a04808.jpg" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185539517-ccf2372a-f026-4a7c-ad28-c741c770f60a.png" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186094456-01a1dd11-1433-4437-9ab2-6480ac94ec0a.png" width="600">
</div>
- RE (Relation Extraction)
<div align="center">
<img src="./ppstructure/docs/vqa/result_re/zh_val_21_re.jpg" width="800">
<img src="https://user-images.githubusercontent.com/14270174/185393805-c67ff571-cf7e-4217-a4b0-8b396c4f22bb.jpg" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185540080-0431e006-9235-4b6d-b63d-0b3c6e1de48f.jpg" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186094813-3a8e16cc-42e5-4982-b9f4-0134dfb5688d.png" width="600">
</div>
</details>
<a name="许可证书"></a>
......
......@@ -78,7 +78,7 @@ python3 tools/export_model.py -c configs/rec/rec_r50_fpn_srn.yml -o Global.pretr
For SRN text recognition model inference, run the following command:
```
python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/en/word_1.png" --rec_model_dir="./inference/rec_srn/" --rec_image_shape="1,64,256" --rec_char_type="ch" --rec_algorithm="SRN" --rec_char_dict_path=./ppocr/utils/ic15_dict.txt --use_space_char=False
python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/en/word_1.png" --rec_model_dir="./inference/rec_srn/" --rec_image_shape="1,64,256" --rec_algorithm="SRN" --rec_char_dict_path=./ppocr/utils/ic15_dict.txt --use_space_char=False
```
<a name="4-2"></a>
......
# Academic Algorithms and Models
PaddleOCR will add cutting-edge OCR algorithms and models continuously. Check out the supported models and tutorials by clicking the following list:
- [text detection algorithms](./algorithm_overview_en.md#11)
- [text recognition algorithms](./algorithm_overview_en.md#12)
- [end-to-end algorithms](./algorithm_overview_en.md#2)
- [table recognition algorithms](./algorithm_overview_en.md#3)
Developers are welcome to contribute more algorithms! Please refer to [add new algorithm](./add_new_algorithm_en.md) guideline.
......@@ -7,7 +7,11 @@
- [3. Table Recognition Algorithms](#3)
- [4. Key Information Extraction Algorithms](#4)
This tutorial lists the OCR algorithms supported by PaddleOCR, as well as the models and metrics of each algorithm on **English public datasets**. It is mainly used for algorithm introduction and algorithm performance comparison. For more models on other datasets including Chinese, please refer to [PP-OCR v2.0 models list](./models_list_en.md).
This tutorial lists the OCR algorithms supported by PaddleOCR, as well as the models and metrics of each algorithm on **English public datasets**. It is mainly used for algorithm introduction and algorithm performance comparison. For more models on other datasets including Chinese, please refer to [PP-OCRv3 models list](./models_list_en.md).
>>
Developers are welcome to contribute more algorithms! Please refer to [add new algorithm](./add_new_algorithm_en.md) guideline.
<a name="1"></a>
......
......@@ -652,8 +652,9 @@ def main():
for index, pdf_img in enumerate(img):
os.makedirs(
os.path.join(args.output, img_name), exist_ok=True)
pdf_img_path = os.path.join(args.output, img_name, img_name
+ '_' + str(index) + '.jpg')
pdf_img_path = os.path.join(
args.output, img_name,
img_name + '_' + str(index) + '.jpg')
cv2.imwrite(pdf_img_path, pdf_img)
img_paths.append([pdf_img_path, pdf_img])
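The reformatted call in this hunk builds one output path per rendered PDF page. A minimal standalone sketch of that naming scheme follows; the helper name `page_image_path` and the sample values are hypothetical, and `output_dir` stands in for `args.output`:

```python
import os


def page_image_path(output_dir, img_name, index):
    """Return <output_dir>/<img_name>/<img_name>_<index>.jpg for one PDF page."""
    # Hypothetical helper: mirrors the os.path.join call shown in the hunk.
    return os.path.join(output_dir, img_name,
                        img_name + '_' + str(index) + '.jpg')


# Three pages of a document named "report" written under "output".
paths = [page_image_path('output', 'report', i) for i in range(3)]
```

Each page gets its own file inside a per-document directory, so pages from different PDFs with the same index cannot overwrite each other.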
......
......@@ -104,8 +104,9 @@ def load_model(config, model, optimizer=None, model_type='det'):
continue
pre_value = params[key]
if pre_value.dtype == paddle.float16:
pre_value = pre_value.astype(paddle.float32)
is_float16 = True
if pre_value.dtype != value.dtype:
pre_value = pre_value.astype(value.dtype)
if list(value.shape) == list(pre_value.shape):
new_state_dict[key] = pre_value
else:
......@@ -162,8 +163,9 @@ def load_pretrained_params(model, path):
logger.warning("The pretrained params {} not in model".format(k1))
else:
if params[k1].dtype == paddle.float16:
params[k1] = params[k1].astype(paddle.float32)
is_float16 = True
if params[k1].dtype != state_dict[k1].dtype:
params[k1] = params[k1].astype(state_dict[k1].dtype)
if list(state_dict[k1].shape) == list(params[k1].shape):
new_state_dict[k1] = params[k1]
else:
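Both hunks in this file add a check that casts a loaded parameter to the dtype of the matching model parameter when the two differ, before the shape comparison. The sketch below illustrates that logic without Paddle; `FakeTensor` and `align_params` are hypothetical stand-ins, not part of PaddleOCR or Paddle:

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a framework tensor (not a Paddle class)."""
    dtype: str
    shape: tuple

    def astype(self, dtype):
        # Return a copy with the requested dtype, as a tensor cast would.
        return FakeTensor(dtype=dtype, shape=self.shape)


def align_params(model_state, loaded_params):
    """Cast loaded params to the model's dtype; keep those whose shapes match."""
    new_state, skipped = {}, []
    for key, value in model_state.items():
        if key not in loaded_params:
            skipped.append(key)
            continue
        pre_value = loaded_params[key]
        if pre_value.dtype != value.dtype:
            pre_value = pre_value.astype(value.dtype)
        if list(value.shape) == list(pre_value.shape):
            new_state[key] = pre_value
        else:
            skipped.append(key)
    return new_state, skipped


model_state = {"w": FakeTensor("float32", (2, 3)), "b": FakeTensor("float32", (3,))}
loaded = {"w": FakeTensor("float16", (2, 3)), "b": FakeTensor("float16", (4,))}
new_state, skipped = align_params(model_state, loaded)
```

Casting to the target dtype rather than to a fixed `float32` means the same path works in either direction, for example when a float32 checkpoint is loaded into a model whose parameters use another dtype.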
......
English | [简体中文](README_ch.md)
- [1. Introduction](#1-introduction)
- [2. Update log](#2-update-log)
- [3. Features](#3-features)
- [4. Results](#4-results)
- [4.1 Layout analysis and table recognition](#41-layout-analysis-and-table-recognition)
- [4.2 KIE](#42-kie)
- [5. Quick start](#5-quick-start)
- [6. PP-Structure System](#6-pp-structure-system)
- [6.1 Layout analysis and table recognition](#61-layout-analysis-and-table-recognition)
- [6.1.1 Layout analysis](#611-layout-analysis)
- [6.1.2 Table recognition](#612-table-recognition)
- [6.2 KIE](#62-kie)
- [7. Model List](#7-model-list)
- [7.1 Layout analysis model](#71-layout-analysis-model)
- [7.2 OCR and table recognition model](#72-ocr-and-table-recognition-model)
- [7.3 KIE model](#73-kie-model)
- [2. Features](#2-features)
- [3. Results](#3-results)
- [3.1 Layout analysis and table recognition](#31-layout-analysis-and-table-recognition)
- [3.2 Layout Recovery](#32-layout-recovery)
- [3.3 KIE](#33-kie)
- [4. Quick start](#4-quick-start)
- [5. Model List](#5-model-list)
## 1. Introduction
PP-Structure is an OCR toolkit for analyzing and processing documents with complex structures, designed to help developers better complete document understanding tasks.
PP-Structure is an intelligent document analysis system developed by the PaddleOCR team, which aims to help developers better complete tasks related to document understanding such as layout analysis and table recognition.
## 2. Update log
* 2022.02.12 Add the LayoutLMv2 model to KIE.
* 2021.12.07 Add [KIE SER and RE tasks](kie/README.md).
The pipeline of PP-Structurev2 system is shown below. The document image first passes through the image direction correction module to identify the direction of the entire image and complete the direction correction. Then, two tasks of layout information analysis and key information extraction can be completed.
## 3. Features
- In the layout analysis task, the image first goes through the layout analysis model, which divides it into different areas such as text, table, and figure; these areas are then analyzed separately. For example, the table area is sent to the table recognition module for structured recognition, and the text area is sent to the OCR engine for text recognition. Finally, the layout recovery module restores the result to a Word or PDF file with the same layout as the original image;
- In the key information extraction task, the OCR engine is first used to extract the text content, then the SER (semantic entity recognition) module obtains the semantic entities in the image, and finally the RE (relation extraction) module obtains the correspondence between the semantic entities, thereby extracting the required key information.
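The region routing described for the layout analysis task can be sketched as a small dispatcher. This is only an illustration under assumptions: the region dictionaries, the `route_regions` function, and the two stub recognizers are hypothetical and do not reflect the PP-Structure API, and leaving figure regions unrecognized is an assumption of the sketch.

```python
def route_regions(regions, recognize_table, recognize_text):
    """Send each layout region to the recognizer that matches its type."""
    results = []
    for region in regions:
        if region["type"] == "table":
            content = recognize_table(region["image"])
        elif region["type"] == "figure":
            content = None  # assumption: figures are kept as images only
        else:
            content = recognize_text(region["image"])
        results.append({"type": region["type"], "content": content})
    return results


# Stub recognizers standing in for the table recognition module and OCR engine.
def fake_table(image):
    return "<table>" + image + "</table>"


def fake_ocr(image):
    return "text:" + image


regions = [
    {"type": "title", "image": "img0"},
    {"type": "table", "image": "img1"},
    {"type": "figure", "image": "img2"},
]
results = route_regions(regions, fake_table, fake_ocr)
```

Passing the recognizers in as arguments keeps the routing independent of any particular model, which matches the stated goal that each module can be used alone or combined.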
<img src="./docs/ppstructurev2_pipeline.png" width="100%"/>
The main features of PP-Structure are as follows:
More technical details: 👉 [PP-Structurev2 Technical Report](docs/PP-Structurev2_introduction.md)
- Support the layout analysis of documents, dividing them into 5 types of areas: **text, title, table, image and list** (used in conjunction with Layout-Parser)
- Support to extract the texts from the text, title, picture and list areas (used in conjunction with PP-OCR)
- Support to extract excel files from the table areas
- Support python whl package and command line usage, easy to use
- Support custom training for layout analysis and table structure tasks
- Support Document Key Information Extraction (KIE) tasks: Semantic Entity Recognition (SER) and Relation Extraction (RE)
PP-Structurev2 supports independent use or flexible collocation of each module. For example, you can use layout analysis alone or table recognition alone. Click the corresponding link below to get the tutorial for each independent module:
## 4. Results
- [Layout Analysis](layout/README.md)
- [Table Recognition](table/README.md)
- [Key Information Extraction](kie/README.md)
- [Layout Recovery](recovery/README.md)
### 4.1 Layout analysis and table recognition
## 2. Features
<img src="docs/table/ppstructure.GIF" width="100%"/>
The figure shows the pipeline of layout analysis + table recognition. Layout analysis first divides the image into four areas: image, text, title and table. OCR detection and recognition is then performed on the image, text and title areas, table recognition is performed on the table area, and the image area is also stored for later use.
### 4.2 KIE
* SER
*
![](docs/kie/result_ser/zh_val_0_ser.jpg) | ![](docs/kie/result_ser/zh_val_42_ser.jpg)
---|---
Different colored boxes in the figure represent different categories. For xfun dataset, there are three categories: query, answer and header:
The main features of PP-Structurev2 are as follows:
- Support layout analysis of documents in the form of images/pdfs, which can be divided into areas such as **text, titles, tables, figures, formulas, etc.**;
- Support common Chinese and English **table detection** tasks;
- Support structured table recognition, and output the final result to **Excel file**;
- Support multimodal-based Key Information Extraction (KIE) tasks - **Semantic Entity Recognition** (SER) and **Relation Extraction** (RE);
- Support **layout recovery**, that is, restore the document in word or pdf format with the same layout as the original image;
- Support customized training and multiple inference deployment methods such as python whl package quick start;
- Connect with the semi-automatic data labeling tool PPOCRLabel, which supports the labeling of layout analysis, table recognition, and SER.
* Dark purple: header
* Light purple: query
* Army green: answer
## 3. Results
The corresponding category and OCR recognition results are also marked at the top left of the OCR detection box.
PP-Structurev2 supports the independent use or flexible collocation of each module. For example, layout analysis can be used alone, or table recognition can be used alone. Only the visualization effects of several representative usage methods are shown here.
### 3.1 Layout analysis and table recognition
* RE
![](docs/kie/result_re/zh_val_21_re.jpg) | ![](docs/kie/result_re/zh_val_40_re.jpg)
---|---
The figure shows the pipeline of layout analysis + table recognition. Layout analysis first divides the image into four areas: image, text, title and table. OCR detection and recognition is then performed on the image, text and title areas, table recognition is performed on the table area, and the image area is also stored for later use.
<img src="docs/table/ppstructure.GIF" width="100%"/>
### 3.2 Layout recovery
In the figure, the red box represents the question, the blue box represents the answer, and the question and answer are connected by green lines. The corresponding category and OCR recognition results are also marked at the top left of the OCR detection box.
The following figure shows the effect of layout recovery based on the results of layout analysis and table recognition in the previous section.
<img src="./docs/recovery/recovery.jpg" width="100%"/>
## 5. Quick start
### 3.3 KIE
Start from [Quick Installation](./docs/quickstart.md)
* SER
## 6. PP-Structure System
Different colored boxes in the figure represent different categories.
### 6.1 Layout analysis and table recognition
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186094456-01a1dd11-1433-4437-9ab2-6480ac94ec0a.png" width="600">
</div>
![pipeline](docs/table/pipeline.jpg)
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186095702-9acef674-12af-4d09-97fc-abf4ab32600e.png" width="600">
</div>
In PP-Structure, the image is divided into 5 types of areas: **text, title, image, list and table**. For the first 4 types of areas, the PP-OCR system is used directly to complete text detection and recognition. For the table area, after the table structuring process, the table in the image is converted into an Excel file with the same table style.
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185539141-68e71c75-5cf7-4529-b2ca-219d29fa5f68.jpg" width="600">
</div>
#### 6.1.1 Layout analysis
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185310636-6ce02f7c-790d-479f-b163-ea97a5a04808.jpg" width="600">
</div>
Layout analysis classifies image by region, including the use of Python scripts of layout analysis tools, extraction of designated category detection boxes, performance indicators, and custom training layout analysis models. For details, please refer to [document](layout/README.md).
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185539517-ccf2372a-f026-4a7c-ad28-c741c770f60a.png" width="600">
</div>
#### 6.1.2 Table recognition
* RE
Table recognition converts table images into excel documents, which include the detection and recognition of table text and the prediction of table structure and cell coordinates. For detailed instructions, please refer to [document](table/README.md)
In the figure, the red box represents `Question`, the blue box represents `Answer`, and `Question` and `Answer` are connected by green lines.
### 6.2 KIE
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186094813-3a8e16cc-42e5-4982-b9f4-0134dfb5688d.png" width="600">
</div>
Multimodal Key Information Extraction (KIE) methods include the Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks. The SER task recognizes and classifies text in images. The RE task extracts relations between pieces of text content in the image, such as identifying question-answer pairs. For details, please refer to [document](kie/README.md).
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186095641-5843b4da-34d7-4c1c-943a-b1036a859fe3.png" width="600">
</div>
## 7. Model List
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185393805-c67ff571-cf7e-4217-a4b0-8b396c4f22bb.jpg" width="600">
</div>
PP-Structure Series Model List (Updating)
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185540080-0431e006-9235-4b6d-b63d-0b3c6e1de48f.jpg" width="600">
</div>
### 7.1 Layout analysis model
## 4. Quick start
|model name|description|download|label_map|
| --- | --- | --- |--- |
| ppyolov2_r50vd_dcn_365e_publaynet | The layout analysis model trained on the PubLayNet dataset can divide image into 5 types of areas **text, title, table, picture, and list** | [PubLayNet](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet.tar) | {0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"}|
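The `label_map` column gives the mapping from class ids to region names. A minimal sketch of applying it to detector output follows; the `(class_id, score)` tuple format, the `name_detections` helper, and the score threshold are assumptions for illustration, not the layout model's actual output format.

```python
# Mapping copied from the label_map column of the table.
label_map = {0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}


def name_detections(detections, label_map, min_score=0.5):
    """Turn (class_id, score) pairs into (label, score), dropping low scores."""
    named = []
    for class_id, score in detections:
        if score < min_score:
            continue
        named.append((label_map[class_id], score))
    return named


# Hypothetical detector output: (class_id, score) pairs.
detections = [(3, 0.98), (0, 0.91), (4, 0.30)]
named = name_detections(detections, label_map)
```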
Start from [Quick Start](./docs/quickstart_en.md).
### 7.2 OCR and table recognition model
## 5. Model List
|model name|description|model size|download|
| --- | --- | --- | --- |
|ch_PP-OCRv3_det| [New] Lightweight model, supporting Chinese, English, multilingual text detection | 3.8M |[inference model](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_distill_train.tar)|
|ch_PP-OCRv3_rec| [New] Lightweight model, supporting Chinese, English, multilingual text recognition | 12.4M |[inference model](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_train.tar) |
|ch_ppstructure_mobile_v2.0_SLANet|Chinese table recognition model based on SLANet|9.3M|[inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_train.tar) |
Some tasks need to use both the structured analysis models and the OCR models. For example, the table recognition task needs to use the table recognition model for structured analysis, and the OCR model to recognize the text in the table. Please select the appropriate models according to your specific needs.
### 7.3 KIE model
For structural analysis related model downloads, please refer to:
- [PP-Structure Model Zoo](./docs/models_list_en.md)
|model name|description|model size|download|
| --- | --- | --- | --- |
|ser_LayoutXLM_xfun_zhd|SER model trained on xfun Chinese dataset based on LayoutXLM|1.4G|[inference model coming soon]() / [trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar) |
|re_LayoutXLM_xfun_zh|RE model trained on xfun Chinese dataset based on LayoutXLM|1.4G|[inference model coming soon]() / [trained model](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar) |
For OCR related model downloads, please refer to:
- [PP-OCR Model Zoo](../doc/doc_en/models_list_en.md)
If you need to use other models, you can download the model in [PPOCR model_list](../doc/doc_en/models_list_en.md) and [PPStructure model_list](./docs/models_list.md)
......@@ -21,7 +21,7 @@ PP-Structurev2系统流程图如下所示,文档图像首先经过图像矫正
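- In the key information extraction task, the OCR engine is first used to extract the text content, then the semantic entity recognition module obtains the semantic entities in the image, and finally the relation extraction module obtains the correspondence between the semantic entities, thereby extracting the required key information.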
<img src="./docs/ppstructurev2_pipeline.png" width="100%"/>
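More technical details: 👉 [PP-Structurev2 Technical Report]()
More technical details: 👉 [PP-Structurev2 Technical Report](docs/PP-Structurev2_introduction.md)
PP-Structurev2 supports independent use or flexible combination of each module. For example, layout analysis or table recognition can be used alone. Click the corresponding links below for the tutorial of each independent module: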
......@@ -76,6 +76,14 @@ PP-Structurev2支持各个模块独立使用或灵活搭配,如,可以单独
<img src="https://user-images.githubusercontent.com/14270174/185539517-ccf2372a-f026-4a7c-ad28-c741c770f60a.png" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186094456-01a1dd11-1433-4437-9ab2-6480ac94ec0a.png" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186095702-9acef674-12af-4d09-97fc-abf4ab32600e.png" width="600">
</div>
* RE
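In the figure, the red box represents `Question`, the blue box represents `Answer`, and `Question` and `Answer` are connected by green lines.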
......@@ -88,6 +96,14 @@ PP-Structurev2支持各个模块独立使用或灵活搭配,如,可以单独
<img src="https://user-images.githubusercontent.com/14270174/185540080-0431e006-9235-4b6d-b63d-0b3c6e1de48f.jpg" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186094813-3a8e16cc-42e5-4982-b9f4-0134dfb5688d.png" width="600">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186095641-5843b4da-34d7-4c1c-943a-b1036a859fe3.png" width="600">
</div>
<a name="4"></a>
## 4. Quick start
......
# PP-Structurev2
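## Contents
- [1. Background](#1-背景)
- [2. Introduction](#2-简介)
- [3. Whole-image orientation correction](#3-整图方向矫正)
- [4. Layout information structuring](#4-版面信息结构化)
- [4.1 Layout analysis](#41-版面分析)
- [4.2 Table recognition](#42-表格识别)
- [4.3 Layout recovery](#43-版面恢复)
- [5. Key information extraction](#5-关键信息抽取)
- [6. Reference](#6-Reference)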
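## 1. Background
Real-world scenarios contain a large number of document images, which are stored in unstructured forms such as pictures. Structured analysis and information extraction based on document images are essential for the digital storage of data and the digital transformation of industry. With this in mind, PaddleOCR developed and released PP-Structure, an intelligent document analysis system that aims to help developers better complete document understanding tasks such as layout analysis, table recognition, and key information extraction.
Recently, the PaddleOCR team upgraded the layout analysis, table recognition, and key information extraction modules of PP-Structurev1 in 8 aspects in total, and added new features such as whole-image orientation correction and document recovery, creating a new and more effective document analysis system: PP-Structurev2.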
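## 2. Introduction
PP-Structurev2 further improves on PP-Structurev1, with upgrades in the following 3 aspects:
* **System function upgrade**: image correction and layout recovery modules are added, fully covering image-to-Word/PDF conversion and key information extraction.
* **System performance optimization**
* Layout analysis: a lightweight layout analysis model is released, with speed increased by **11 times** and an average CPU time cost of only **41ms**
* Table recognition: 3 optimization strategies are designed, and model accuracy improves by **6%** with prediction time unchanged
* Key information extraction: a visual-independent model structure is designed, semantic entity recognition accuracy improves by **2.8%**, and relation extraction accuracy improves by **9.1%**
* **Chinese scene adaptation**: layout analysis and table recognition are adapted to Chinese scenes, and **out-of-the-box** layout structuring models for Chinese scenes are open-sourced.
The pipeline of the PP-Structurev2 system is shown below. The document image first passes through the image correction module, which determines the orientation of the whole image and corrects it. Two kinds of tasks can then be completed: layout information analysis and key information extraction. In the layout analysis task, the image first goes through the layout analysis model, which divides it into different areas such as text, table, and image; these areas are then recognized separately. For example, the table area is sent to the table recognition module for structured recognition, and the text area is sent to the OCR engine for text recognition. Finally, the layout recovery module restores the result to a Word or PDF file with the same layout as the original image. In the key information extraction task, the OCR engine is first used to extract the text content, then the semantic entity recognition module obtains the semantic entities in the image, and finally the relation extraction module obtains the correspondence between the semantic entities, thereby extracting the required key information.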
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185939247-57e53254-399c-46c4-a610-da4fa79232f5.png" width="1200">
</div>
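From the perspective of algorithm improvement, 8 improvements were made across the 3 key sub-modules of the system.
* Layout analysis
* PP-PicoDet: lightweight layout analysis model
* FGD: model distillation algorithm that takes both global and local features into account
* Table recognition
* PP-LCNet: CPU-friendly lightweight backbone network
* CSP-PAN: lightweight module for fusing high-level and low-level features
* SLAHead: feature decoding module that aligns structure and location information
* Key information extraction
* VI-LayoutXLM: multimodal pre-trained model structure independent of visual features
* TB-YX: text line sorting logic that takes reading order into account
* UDML: joint mutual learning knowledge distillation strategy
Finally, compared with PP-Structurev1:
- The layout analysis model has 95.6% fewer parameters, 11 times faster inference, and 0.4% higher accuracy;
- Table recognition prediction time is unchanged, model accuracy improves by 6%, and end-to-end TEDS improves by 2%;
- The key information extraction model is 2.8 times faster, semantic entity recognition accuracy improves by 2.8%, and relation extraction accuracy improves by 9.1%.
Each module is described in detail below.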
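## 3. Whole-image orientation correction
Since training sets generally consist mainly of upright images, feeding a rotated document image directly into the model makes recognition harder and degrades the results. PP-Structurev2 introduces a whole-image orientation correction module to determine the orientation of images containing text and adjust it accordingly.
We directly use the text image orientation classification model provided in PaddleClas, [PULC_text_image_orientation](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/zh_CN/PULC/PULC_text_image_orientation.md). Some images from the model's dataset are shown below. Unlike the text line orientation classifier, the text image orientation classification model determines the orientation of the whole image. It reaches an accuracy as high as 99% on the validation set, and CPU prediction for a single image takes only `2.16ms`.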
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185939683-f6465473-3303-4a0c-95be-51f04fb9f387.png" width="600">
</div>
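Once an orientation class is predicted, correction amounts to rotating the image back by that angle. The sketch below works on a nested list instead of a real image and assumes the predicted angle is one of 0, 90, 180, 270 and denotes the clockwise rotation applied to the upright image; both the label convention and the function names are assumptions of this sketch, not the PaddleClas API.

```python
def rotate_ccw_90(image):
    """Rotate a 2D list of pixels 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def correct_orientation(image, predicted_angle):
    """Undo a clockwise rotation of predicted_angle degrees (0/90/180/270)."""
    if predicted_angle not in (0, 90, 180, 270):
        raise ValueError("predicted_angle must be 0, 90, 180 or 270")
    # Each 90 degrees of clockwise rotation is undone by one CCW quarter turn.
    for _ in range(predicted_angle // 90):
        image = rotate_ccw_90(image)
    return image


upright = [[1, 2, 3],
           [4, 5, 6]]
# The upright image after a 90-degree clockwise rotation.
rotated_90 = [[4, 1],
              [5, 2],
              [6, 3]]
restored = correct_orientation(rotated_90, 90)
```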
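## 4. Layout information structuring
### 4.1 Layout analysis
Layout analysis refers to dividing a document in image form into regions and locating key regions such as text, titles, tables, and figures. PP-Structurev1 used PP-YOLOv2, an efficient detection algorithm open-sourced in PaddleDetection, for the layout analysis task.
In PP-Structurev2, we release a lightweight layout analysis model based on PP-PicoDet, customize the image scale for layout analysis scenes, and use the FGD knowledge distillation algorithm to further improve model accuracy. In the end, layout analysis takes only `41ms` on CPU (model inference time only; data preprocessing takes about 50ms). The ablation experiments on the public dataset PubLayNet are as follows:
| Experiment No. | Strategy | Model storage (M) | mAP | CPU prediction time (ms) |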
|:------:|:------:|:------:|:------:|:------:|
| 1 | PP-YOLOv2(640*640) | 221 | 93.6% | 512 |
| 2 | PP-PicoDet-LCNet2.5x(640*640) | 29.7 | 92.5% |53.2|
| 3 | PP-PicoDet-LCNet2.5x(800*608) | 29.7 | 94.2% |83.1 |
| 4 | PP-PicoDet-LCNet1.0x(800*608) | 9.7 | 93.5% | 41.2|
| 5 | PP-PicoDet-LCNet1.0x(800*608) + FGD | 9.7 | 94% |41.2|
* 测试条件
* paddle版本:2.3.0
* CPU:Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz,开启mkldnn,线程数为10
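On the PubLayNet dataset, the comparison with other methods is shown in the table below. Compared with layoutparser, a layout analysis tool based on Detectron2, our model's mAP is about 5% higher and prediction is about 70 times as fast (2.9s vs. 41.2ms).
| Model | mAP | CPU prediction time |
|-------------------|-----------|------------|
| layoutparser (Detectron2) | 88.98% | 2.9s |
| PP-Structurev2 (PP-PicoDet) | **94%** | 41.2ms |

[PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet) is a large document image dataset with 5 categories: Text, Title, Table, Figure, and List. It contains 335,703 training images, 11,245 validation images, and 11,405 test images. Training data and annotation examples are shown below: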
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185940305-2cd3633b-4c43-4f84-8a6f-5ce6a24e88ce.png" width="600">
</div>
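#### 4.1.1 Optimization Strategies
**(1) PP-PicoDet: a lightweight layout analysis model**
`PP-PicoDet` is a lightweight object detection model proposed in PaddleDetection. With optimizations such as the PP-LCNet backbone, the CSP-PAN feature fusion module, and the SimOTA label assignment method, it performs very well on CPU and mobile devices. We replace the PP-YOLOv2 model used in PP-Structurev1 with `PP-PicoDet` and adjust the prediction scale from `640*640`, which was designed for object detection, to `800*608`, which better fits document images. With the `1.0x` configuration, accuracy is on par with PP-YOLOv2 and average CPU prediction is 11 times faster.
**(2) FGD knowledge distillation**
FGD (Focal and Global Knowledge Distillation for Detectors) is a model distillation method that accounts for both local and global feature information. It consists of two parts, focal distillation and global distillation. Focal distillation separates the foreground and background of the image so that the student model attends to the key pixels of the teacher's foreground and background features respectively. Global distillation rebuilds the relations between different pixels and transfers them from teacher to student, compensating for the global information lost in focal distillation. Using FGD, we distill the student model PP-PicoDet-LCNet1.0x (mAP=93.5%) from the teacher model PP-PicoDet-LCNet2.5x (mAP=94.2%). This raises the student's mAP by 0.5%, to within 0.2% of the teacher, while prediction is about twice as fast as the teacher.
#### 4.1.2 Scenario Adaptation
**(1) Chinese layout analysis**
Besides the English public dataset PubLayNet, we also adapted and validated the method on Chinese data. [CDLA](https://github.com/buptlihang/CDLA) is a Chinese document layout analysis dataset for Chinese literature (papers), with 10 labels such as text and title. It contains 5,000 training images and 1,000 validation images. Training data and annotation examples are shown below: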
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185940445-92b7613f-e431-43c2-9033-3b3618ddae02.png" width="600">
</div>
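The ablation study on the CDLA dataset is as follows:
| No. | Strategy | mAP |
|:------:|:------:|:------:|
| 1 | PP-YOLOv2 | 84.7% |
| 2 | PP-PicoDet-LCNet2.5x(800*608) | 87.8% |
| 3 | PP-PicoDet-LCNet1.0x(800*608) | 84.5% |
| 4 | PP-PicoDet-LCNet1.0x(800*608) + FGD | 86.8% |

**(2) Table layout analysis**
In practice, many scenarios do not care about figure or text regions and only need the tables in a document image. Layout analysis then reduces to table detection, which is often also the step that precedes table recognition. For Chinese and English documents, we collected open-source layout analysis datasets that contain tables, including TableBank and DocBank. The merged dataset contains 496,405 training images and 9,495 validation images.
The ablation study on the table dataset is as follows:
| No. | Strategy | mAP |
|:------:|:------:|:------:|
| 1 | PP-YOLOv2 | 91.3% |
| 2 | PP-PicoDet-LCNet2.5x(800*608) | 95.9% |
| 3 | PP-PicoDet-LCNet1.0x(800*608) | 95.2% |
| 4 | PP-PicoDet-LCNet1.0x(800*608) + FGD | 95.7% |

Example table detection results are shown below: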
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185940654-956ef614-888a-4779-bf63-a6c2b61b97fa.png" width="600">
</div>
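### 4.2 Table Recognition
Many deep-learning-based table recognition algorithms exist. In PP-Structurev1, we developed the end-to-end table recognition algorithm TableRec-RARE on top of the text recognition algorithm RARE; the model outputs an HTML representation of the table structure, which can easily be converted to an Excel file. In PP-Structurev2, we upgrade five aspects, including the model structure and the loss function, and propose SLANet (Structure Location Alignment Network). The model structure is shown below: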
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185940811-089c9265-4be9-4776-b365-6d1125606b4b.png" width="1200">
</div>
The ablation study on the PubTabNet English table recognition dataset is as follows:
|Strategy|Acc|TEDS|Inference speed (CPU+MKLDNN)|Model size|
|---|---|---|---|---|
|TableRec-RARE| 71.73% | 93.88% | 779ms | 6.8M |
|+PP-LCNet| 74.71% | 94.37% | 778ms | 8.7M |
|+CSP-PAN| 75.68% | 94.72% | 708ms | 9.3M |
|+SLAHead| 77.7% | 94.85% | 766ms | 9.2M |
|+MergeToken| 76.31% | 95.89% | 766ms | 9.2M |
* Test environment
    * paddle version: 2.3.1
    * CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, mkldnn enabled, 10 threads

On the PubTabNet English table recognition dataset, the comparison with other methods is as follows:
|Strategy|Acc|TEDS|Inference speed (CPU+MKLDNN)|Model size|
|---|---|---|---|---|
|TableMaster| 77.9% | 96.12% | 2144ms | 253M |
|TableRec-RARE| 71.73% | 93.88% | 779ms | 6.8M |
|SLANet| 76.31% | 95.89% | 766ms | 9.2M |
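#### 4.2.1 Optimization Strategies
**(1) PP-LCNet: a CPU-friendly lightweight backbone**
PP-LCNet is a lightweight, high-performance backbone designed around the inference characteristics of Intel CPUs. On image classification it achieves a better accuracy-speed trade-off than lightweight models such as ShuffleNetV2, MobileNetV3, and GhostNet. In PP-Structurev2, using PP-LCNet as the backbone raises table recognition accuracy from 71.73% to 72.98%; additionally loading image classification weights trained with the SSLD knowledge distillation scheme as the pretrained model further raises final accuracy by 2.95% to 74.71%.
**(2) CSP-PAN: a lightweight module for fusing high-level and low-level features**
Fusing the features extracted by the backbone helps with prediction in complex scenes, such as those with large scale variation. FPN was proposed early on for feature fusion, but its fusion runs in only one direction (high to low) and is therefore incomplete. CSP-PAN improves on PAN: it makes fusion more thorough while reducing computation through CSP blocks and depthwise separable convolutions. For table recognition, we further reduce the CSP-PAN channel count from 128 to 96 to shrink the model. Table recognition accuracy rises by 0.97% to 75.68%, and prediction is 10% faster.
**(3) SLAHead: a feature decoding module that aligns structure and location information**
The TableAttentionHead of TableRec-RARE is shown in figure (a) below. After running all steps, it obtains the final hidden-state representation (hiddens), which is then passed through the SDM (Structure Decode Module) and the CLDM (Cell Location Decode Module) to generate all table-structure tokens and cell coordinates. This design ignores the one-to-one correspondence between cell tokens and coordinates.
In PP-Structurev2, we design the SLAHead module, which aligns cell tokens with coordinates, as shown in figure (b) below. In SLAHead, the hidden state of each step is fed to the SDM and the CLDM separately to obtain that step's token and coordinates; the per-step token and coordinate outputs are then concatenated to form the table's HTML representation and the coordinates of all cells. In addition, because cell accuracy depends on the table structure being recognized correctly, we raise the loss weight ratio between the structure branch and the cell-location branch from 1:1 to 8:1, and replace the MSE loss in the location branch with Smooth L1 loss, which converges more stably. Model accuracy rises from 75.68% to 77.7%.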
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185940968-e3a2fbac-78d7-4b74-af54-a1dab860f470.png" width="1200">
</div>
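The loss re-weighting described above can be sketched as follows. This is an illustrative pure-Python version, not the PaddleOCR implementation: `smooth_l1`, `mse`, `slanet_style_loss`, and the `beta` parameter are assumed names, and the structure-branch loss is taken as an already-computed scalar.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def mse(pred, target):
    """Mean squared error, shown only for comparison."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def slanet_style_loss(structure_loss, pred_boxes, gt_boxes,
                      structure_weight=8.0, loc_weight=1.0):
    """Weighted sum of the structure branch and the cell-location branch (8:1)."""
    loc_loss = smooth_l1(pred_boxes, gt_boxes)
    return structure_weight * structure_loss + loc_weight * loc_loss


# One accurate coordinate and one outlier that is off by 3.
pred, gt = [0.0, 3.0], [0.0, 0.0]
loc_smooth = smooth_l1(pred, gt)  # the outlier contributes 2.5
loc_mse = mse(pred, gt)           # the outlier contributes 9.0
total = slanet_style_loss(0.5, pred, gt)
```

Smooth L1 grows linearly for large errors, so a single badly placed box dominates the location loss far less than it would under squared error, which is one plausible reading of why it converges more stably here.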
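**(4) Others**
In TableRec-RARE, we used two separate tokens, `<td>` and `</td>`, to represent a cell that spans neither rows nor columns. This representation limits the network's ability to handle tables with many cells.
In PP-Structurev2, following the token handling in TableMaster, we merge `<td>` and `</td>` into a single token, `<td></td>`. After merging, validation images whose token length exceeds 500 also take part in evaluation; final model accuracy drops to 76.31%, but end-to-end TEDS improves by 1.04%.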
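A minimal sketch of the token merge is shown below. It assumes PubTabNet-style structure tokens, where a spanning cell is written as separate pieces such as `<td`, ` colspan="2"`, `>`; the function name is borrowed from the `merge_no_span_structure` config option, but the body is an illustration rather than the PaddleOCR code.

```python
def merge_no_span_structure(tokens):
    """Merge each adjacent '<td>', '</td>' pair into a single '<td></td>' token."""
    merged = []
    i = 0
    while i < len(tokens):
        if tokens[i] == "<td>" and i + 1 < len(tokens) and tokens[i + 1] == "</td>":
            merged.append("<td></td>")
            i += 2  # consume both halves of the pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


structure = ["<tr>", "<td>", "</td>", "<td", ' colspan="2"', ">", "</td>", "</tr>"]
merged = merge_no_span_structure(structure)
```

Only the non-spanning cell is merged; the spanning cell keeps its separate tokens. For a table made mostly of plain cells, the sequence length is roughly halved, which is what lets longer tables fit under a fixed maximum token length.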
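#### 4.2.2 Chinese Scenario Adaptation
Besides the strategy upgrades above, this release also open-sources a Chinese table recognition model. In real applications, table images come with all kinds of tilt angles (the PubTabNet dataset does not have this issue), so in the Chinese model we increase the number of regressed cell-coordinate points from 2 (top-left, bottom-right) to 4 (top-left, top-right, bottom-right, bottom-left). On an internal test set, the metrics before and after the upgrade are:
|Model|Acc|
|---|---|
|TableRec-RARE|44.3%|
|SLANet|59.35%|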
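To illustrate why four points are needed, the sketch below converts between the two box formats. The helper names are hypothetical; the corner order (top-left, top-right, bottom-right, bottom-left) follows the description above.

```python
def xyxy_to_four_points(box):
    """[x1, y1, x2, y2] -> corners in order: top-left, top-right, bottom-right, bottom-left."""
    x1, y1, x2, y2 = box
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


def four_points_to_xyxy(points):
    """Axis-aligned bounding box that encloses a four-point cell."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return [min(xs), min(ys), max(xs), max(ys)]


# A slightly tilted cell: its four corners do not form an axis-aligned rectangle.
tilted_cell = [(10, 12), (110, 20), (108, 50), (8, 42)]
enclosing = four_points_to_xyxy(tilted_cell)
round_trip = xyxy_to_four_points(enclosing)
```

The round trip does not return the original corners: collapsing a tilted cell to two points loses its shape, whereas the four-point form keeps it.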
Visualization results are shown below; the input image is on the left and the recognized HTML table is on the right.
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185941221-c94e3d45-524c-4073-9644-21ba6a9fd93e.png" width="800">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185941254-31f2b1fa-d594-4037-b1c7-0f24543e5d19.png" width="800">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185941273-2f7131df-3fe7-43b8-9c64-77ad2cf3b947.png" width="800">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185941295-d0672aa8-548d-4e6a-812c-ac5d5fd8a269.png" width="800">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185941324-8036e959-abdc-4dc5-a9f2-730b07b4e3d3.png" width="800">
</div>
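### 4.3 Layout Recovery
Layout recovery means that the content obtained by processing a document image with OCR, layout analysis, table recognition, and similar methods is laid out the same way as the original document and written to a file such as a Word document. In PP-Structurev2, we build a layout recovery system that comprises sub-modules for layout analysis, table recognition, and OCR text detection and recognition.
The figure below shows a layout recovery result: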
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185941816-4dabb3e8-a0db-4094-98ea-52e0a0fda8e8.png" width="1200">
</div>
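## 5. Key Information Extraction
Key information extraction pulls out the information users care about from the text of a document image, such as the name and address fields on an ID card. PP-Structure supports semantic entity recognition (SER) and relation extraction (RE) based on the multimodal LayoutLM family of models. In PP-Structurev2, we upgrade the model structure and the downstream training method and propose VI-LayoutXLM (Visual-feature Independent LayoutXLM). The pipeline is shown below.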
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185941978-abec7d4a-5e3a-4141-83f8-088d04ef898e.png" width="1000">
</div>
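The optimization strategies are:
* VI-LayoutXLM: a multimodal pretrained model structure that is independent of visual features
* TB-YX: text-line sorting logic that takes human reading order into account
* UDML: a unified mutual-learning knowledge distillation strategy

The ablation study for the SER task on the XFUND-zh dataset is as follows.
| No. | Strategy | Model size (G) | Accuracy | GPU prediction time (ms) | CPU prediction time (ms) |
|:------:|:------:|:------:|:------:|:------:|:------:|
| 1 | LayoutXLM | 1.4 | 89.50% | 59.35 | 766.16 |
| 2 | VI-LayoutXLM | 1.1 | 90.46% | 23.71 | 675.58 |
| 3 | Experiment 2 + TB-YX text-line sorting | 1.1 | 92.50% | 23.71 | 675.58 |
| 4 | Experiment 3 + UDML distillation | 1.1 | 93.19% | 23.71 | 675.58 |
| 5 | Experiment 3 + UDML distillation (GPU time measured with `trt+fp16`) | 1.1 | **93.19%** | **15.49** | **675.58** |
* Test conditions
    * paddle version: 2.3.0
    * GPU: V100. The GPU prediction time of experiment 5 was measured with `trt+fp16` in an environment with cuda10.2 + cudnn8.1.1 + trt7.2.3.4; the other experiments were timed without TRT.
    * CPU: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, mkldnn enabled, 10 threads

On the XFUND dataset, the comparison with other methods is as follows.
| Model | SER Hmean | RE Hmean |
|-------------------|-----------|------------|
| LayoutLMv2-base | 85.44% | 67.77% |
| LayoutXLM-base | 89.24% | 70.73% |
| StrucTexT-large | 92.29% | **86.81%** |
| VI-LayoutXLM-base (ours) | **93.19%** | 83.92% |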
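### 5.1 Optimization Strategies
**(1) VI-LayoutXLM (Visual-feature Independent LayoutXLM)**
LayoutLMv2 and LayoutXLM introduce a visual backbone to extract visual features, which are combined with the subsequent text embedding to form the multimodal input embedding. This module is a feature extraction network based on `ResNet_x101_64x4d`, and feature extraction is very time-consuming, so we remove it while keeping the text, position, and layout information. With this change, downstream SER accuracy is not reduced for LayoutXLM and drops by only `2.1%` for LayoutLMv2, while the model size shrinks by about `340M`. The ablation results are as follows.
| Model | Model size (G) | F-score | Accuracy gain |
|-----------------|----------|---------|--------|
| LayoutLMv2 | 0.76 | 84.20% | - |
| VI-LayoutLMv2 | 0.42 | 82.10% | -2.10% |
| LayoutXLM | 1.4 | 89.50% | - |
| VI-LayoutXLM | 1.1 | 90.46% | +0.96% |

On the XFUND dataset, VI-LayoutXLM also improves RE accuracy by a further `1.06%`.
**(2) TB-YX sorting (Threshold-Based YX sorting algorithm)**
Text reading order is critical for tasks such as information extraction and text understanding. Traditional multimodal models do not account for the incorrect reading order that different OCR tools may produce, yet the model input includes position encoding, so reading order directly affects the prediction. During preprocessing, we sort text lines top to bottom and then left to right (YX). To keep the result stable when text-line positions are slightly perturbed, we introduce a position offset threshold Th: two text lines whose distance in the Y direction is smaller than Th are ordered left to right by their x position. Pseudocode for TB-YX sorting is shown below.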
```py
from copy import deepcopy


def order_by_tbyx(ocr_info, th=20):
    """
    ocr_info: a list of dicts, each with a "bbox" entry ([x1, y1, x2, y2])
    th: vertical offset threshold; boxes whose y1 values differ by less
        than th are treated as lying on the same line
    """
    # sort by y1 first and then by x1
    res = sorted(ocr_info, key=lambda r: (r["bbox"][1], r["bbox"][0]))
    for i in range(len(res) - 1):
        # walk back to j == 0 so the first element can be swapped as well
        for j in range(i, -1, -1):
            # restore left-to-right order for boxes on the same line
            if abs(res[j + 1]["bbox"][1] - res[j]["bbox"][1]) < th and \
                    (res[j + 1]["bbox"][0] < res[j]["bbox"][0]):
                tmp = deepcopy(res[j])
                res[j] = deepcopy(res[j + 1])
                res[j + 1] = deepcopy(tmp)
            else:
                break
    return res
```
A comparison of the sorting methods is shown below; with the offset threshold, the sorted order matches human reading order more closely.
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185942080-9d4bafc9-fa7f-4da4-b139-b2bd703dc76d.png" width="800">
</div>
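With this strategy, on the XFUND dataset the F1 score of the SER task improves by `2.06%` and the F1 score of the RE task improves by `7.04%`.
**(3) Mutual-learning distillation strategy**
UDML (Unified-Deep Mutual Learning) is a strategy used in PP-OCRv2 and PP-OCRv3 that is very effective at improving text recognition models. During training, two identical models learn from each other: a mutual distillation loss (DML loss) is computed between the two models, and a distance loss (L2 loss) is computed on the outputs of the intermediate transformer layers. With this strategy, on the XFUND dataset the F1 score of the SER task improves by `0.6%` and the F1 score of the RE task improves by `5.01%`.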
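The two loss terms can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions, not the PaddleOCR training code: the DML term is taken to be a symmetric KL divergence between the two models' softmax outputs, all terms are weighted equally, and every function name is hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl_div(p, q, eps=1e-10):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def dml_loss(logits_a, logits_b):
    """Symmetric KL divergence between the two models' predictions."""
    p, q = softmax(logits_a), softmax(logits_b)
    return 0.5 * (kl_div(p, q) + kl_div(q, p))


def l2_feature_loss(feat_a, feat_b):
    """Mean squared distance between intermediate-layer outputs."""
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b)) / len(feat_a)


def udml_style_loss(task_loss_a, task_loss_b, logits_a, logits_b, feat_a, feat_b):
    """Both task losses plus the mutual (DML) term and the feature distance term."""
    return (task_loss_a + task_loss_b
            + dml_loss(logits_a, logits_b)
            + l2_feature_loss(feat_a, feat_b))


agree = dml_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
disagree = dml_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

When the two models agree, the mutual term vanishes and only the task losses remain; the more their predictions or intermediate features diverge, the larger the extra penalty.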
Visualization results of the final optimized model on the SER task are shown below.
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185942213-0909135b-3bcd-4d79-9e69-847dfb1c3b82.png" width="800">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185942237-72923b42-8590-42eb-b687-fa819b1c3afd.png" width="800">
</div>
Visualization results for the RE task are shown below.
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185942400-8920dc3c-de7f-46d0-b0bc-baca9536e0e1.png" width="800">
</div>
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185942416-ca4fd8b0-9227-4c65-b969-0afbda525b85.png" width="800">
</div>
### 5.2 Ablation Experiments in More Scenarios
We also validate the upgrade strategies on the FUNSD dataset and on the RE task. The results below show that the approach brings clear accuracy gains across tasks and datasets.
#### 5.2.1 XFUND_zh Dataset
**RE task results**
| No. | Strategy | Model size (G) | F1-score |
|:------:|:------------:|:---------:|:----------:|
| 1 | LayoutXLM | 1.4 | 70.81% |
| 2 | VI-LayoutXLM | 1.1 | 71.87% |
| 3 | Experiment 2 + TB-YX text-line sorting | 1.1 | 78.91% |
| 4 | Experiment 3 + UDML distillation | 1.1 | **83.92%** |
#### 5.2.2 FUNSD Dataset
**SER task results**
| No. | Strategy | F1-score |
|:------:|:------:|:------:|
| 1 | LayoutXLM | 82.28% |
| 2 | PP-Structurev2 SER | **87.79%** |

**RE task results**
| No. | Strategy | F1-score |
|:------:|:------:|:------:|
| 1 | LayoutXLM | 53.13% |
| 2 | PP-Structurev2 RE | **74.87%** |
## 6. Reference
* [1] Zhong X, ShafieiBavani E, Jimeno Yepes A. Image-based table recognition: data, model, and evaluation[C]//European Conference on Computer Vision. Springer, Cham, 2020: 564-580.
* [2] Cui C, Gao T, Wei S, et al. PP-LCNet: A Lightweight CPU Convolutional Neural Network[J]. arXiv preprint arXiv:2109.15099, 2021.
* [3] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 2117-2125.
* [4] Yu G, Chang Q, Lv W, et al. PP-PicoDet: A Better Real-Time Object Detector on Mobile Devices[J]. arXiv preprint arXiv:2111.00902, 2021.
* [5] Bochkovskiy A, Wang C Y, Liao H Y M. Yolov4: Optimal speed and accuracy of object detection[J]. arXiv preprint arXiv:2004.10934, 2020.
* [6] Ye J, Qi X, He Y, et al. PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML[J]. arXiv preprint arXiv:2105.01848, 2021.
* [7] Zhong X, Tang J, Yepes A J. Publaynet: largest dataset ever for document layout analysis[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1015-1022.
* [8] CDLA: https://github.com/buptlihang/CDLA
* [9] Gao L, Huang Y, Déjean H, et al. ICDAR 2019 competition on table detection and recognition (cTDaR)[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1510-1515.
* [10] Mondal A, Lipps P, Jawahar C V. IIIT-AR-13K: a new dataset for graphical object detection in documents[C]//International Workshop on Document Analysis Systems. Springer, Cham, 2020: 216-230.
* [11] Tal ocr_tabel: https://ai.100tal.com/dataset
* [12] Li M, Cui L, Huang S, et al. Tablebank: A benchmark dataset for table detection and recognition[J]. arXiv preprint arXiv:1903.01949, 2019.
* [13] Li M, Xu Y, Cui L, et al. DocBank: A benchmark dataset for document layout analysis[J]. arXiv preprint arXiv:2006.01038, 2020.
* [14] Xu Y, Li M, Cui L, et al. Layoutlm: Pre-training of text and layout for document image understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1192-1200.
* [15] Xu Y, Xu Y, Lv T, et al. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding[J]. arXiv preprint arXiv:2012.14740, 2020.
* [16] Xu Y, Lv T, Cui L, et al. Layoutxlm: Multimodal pre-training for multilingual visually-rich document understanding[J]. arXiv preprint arXiv:2104.08836, 2021.
* [17] Xu Y, Lv T, Cui L, et al. XFUND: A Benchmark Dataset for Multilingual Visually Rich Form Understanding[C]//Findings of the Association for Computational Linguistics: ACL 2022. 2022: 3214-3224.
* [18] Jaume G, Ekenel H K, Thiran J P. Funsd: A dataset for form understanding in noisy scanned documents[C]//2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). IEEE, 2019, 2: 1-6.
# Python Inference
- [1. Structure](#1)
- [1. Layout Structured Analysis](#1)
- [1.1 layout analysis + table recognition](#1.1)
- [1.2 layout analysis](#1.2)
- [1.3 table recognition](#1.3)
- [2. KIE](#2)
- [2. Key Information Extraction](#2)
<a name="1"></a>
## 1. Structure
## 1. Layout Structured Analysis
Go to the `ppstructure` directory
```bash
......@@ -70,7 +70,7 @@ python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv3_det_infer \
After the operation is completed, each image will have a directory with the same name in the `structure` directory under the directory specified by the `output` field. Each table in the image will be stored as an excel. The filename of excel is their coordinates in the image.
<a name="2"></a>
## 2. KIE
## 2. Key Information Extraction
```bash
cd ppstructure
......
# PP-Structure Quick Start
- [1. Install package](#1-install-package)
- [2. Use](#2-use)
- [1. Environment Preparation](#1-environment-preparation)
- [2. Quick Use](#2-quick-use)
- [2.1 Use by command line](#21-use-by-command-line)
- [2.1.1 image orientation + layout analysis + table recognition](#211-image-orientation--layout-analysis--table-recognition)
- [2.1.2 layout analysis + table recognition](#212-layout-analysis--table-recognition)
......@@ -9,22 +9,41 @@
- [2.1.4 table recognition](#214-table-recognition)
- [2.1.5 Key Information Extraction](#215-Key-Information-Extraction)
- [2.1.6 layout recovery](#216-layout-recovery)
- [2.2 Use by code](#22-use-by-code)
- [2.2 Use by python script](#22-use-by-python-script)
- [2.2.1 image orientation + layout analysis + table recognition](#221-image-orientation--layout-analysis--table-recognition)
- [2.2.2 layout analysis + table recognition](#222-layout-analysis--table-recognition)
- [2.2.3 layout analysis](#223-layout-analysis)
- [2.2.4 table recognition](#224-table-recognition)
- [2.2.5 DocVQA](#225-dockie)
- [2.2.5 Key Information Extraction](#225-Key-Information-Extraction)
- [2.2.6 layout recovery](#226-layout-recovery)
- [2.3 Result description](#23-result-description)
- [2.3.1 layout analysis + table recognition](#231-layout-analysis--table-recognition)
- [2.3.2 Key Information Extraction](#232-Key-Information-Extraction)
- [2.4 Parameter Description](#24-parameter-description)
- [3. Summary](#3-summary)
<a name="1"></a>
## 1. Install package
## 1. Environment Preparation
### 1.1 Install PaddlePaddle
> If you do not have a Python environment, please refer to [Environment Preparation](./environment_en.md).
- If you have CUDA 9 or CUDA 10 installed on your machine, please run the following command to install
```bash
python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple
```
- If you have no available GPU on your machine, please run the following command to install the CPU version
```bash
python3 -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
```
For more software version requirements, please refer to the instructions in [Installation Document](https://www.paddlepaddle.org.cn/install/quick) for operation.
### 1.2 Install PaddleOCR Whl Package
```bash
# Install paddleocr, version 2.6 is recommended
......@@ -34,17 +53,15 @@ pip3 install "paddleocr>=2.6"
pip3 install paddleclas
# Install the KIE dependency packages (if you do not use the KIE, you can skip it)
pip3 install -r ppstructure/kie/requirements.txt
pip3 install -r kie/requirements.txt
# Install the layout recovery dependency packages (if you do not use the layout recovery, you can skip it)
pip3 install -r ppstructure/recovery/requirements.txt
pip3 install -r recovery/requirements.txt
```
<a name="2"></a>
## 2. Use
## 2. Quick Use
<a name="21"></a>
### 2.1 Use by command line
......@@ -52,45 +69,41 @@ pip3 install -r ppstructure/recovery/requirements.txt
<a name="211"></a>
#### 2.1.1 image orientation + layout analysis + table recognition
```bash
paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure --image_orientation=true
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --image_orientation=true
```
<a name="212"></a>
#### 2.1.2 layout analysis + table recognition
```bash
paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure
```
<a name="213"></a>
#### 2.1.3 layout analysis
```bash
paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure --table=false --ocr=false
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --table=false --ocr=false
```
<a name="214"></a>
#### 2.1.4 table recognition
```bash
paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/table.jpg --type=structure --layout=false
paddleocr --image_dir=ppstructure/docs/table/table.jpg --type=structure --layout=false
```
<a name="215"></a>
#### 2.1.5 Key Information Extraction
Please refer to: [Key Information Extraction](../kie/README.md) .
Key information extraction does not currently support use by the whl package. For detailed usage tutorials, please refer to: [Key Information Extraction](../kie/README.md).
<a name="216"></a>
#### 2.1.6 layout recovery
```
# Chinese pic
paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure --recovery=true
# English pic
paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
# pdf file
paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --lang='en'
paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true --lang='en'
```
<a name="22"></a>
### 2.2 Use by code
### 2.2 Use by python script
<a name="221"></a>
#### 2.2.1 image orientation + layout analysis + table recognition
......@@ -103,7 +116,7 @@ from paddleocr import PPStructure,draw_structure_result,save_structure_res
table_engine = PPStructure(show_log=True, image_orientation=True)
save_folder = './output'
img_path = 'PaddleOCR/ppstructure/docs/table/1.png'
img_path = 'ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0])
......@@ -114,7 +127,7 @@ for line in result:
from PIL import Image
font_path = 'PaddleOCR/doc/fonts/simfang.ttf' # font provided in PaddleOCR
font_path = 'doc/fonts/simfang.ttf' # font provided in PaddleOCR
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result,font_path=font_path)
im_show = Image.fromarray(im_show)
......@@ -132,7 +145,7 @@ from paddleocr import PPStructure,draw_structure_result,save_structure_res
table_engine = PPStructure(show_log=True)
save_folder = './output'
img_path = 'PaddleOCR/ppstructure/docs/table/1.png'
img_path = 'ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0])
......@@ -143,7 +156,7 @@ for line in result:
from PIL import Image
font_path = 'PaddleOCR/doc/fonts/simfang.ttf' # font provided in PaddleOCR
font_path = 'doc/fonts/simfang.ttf' # font provided in PaddleOCR
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result,font_path=font_path)
im_show = Image.fromarray(im_show)
......@@ -161,7 +174,7 @@ from paddleocr import PPStructure,save_structure_res
table_engine = PPStructure(table=False, ocr=False, show_log=True)
save_folder = './output'
img_path = 'PaddleOCR/ppstructure/docs/table/1.png'
img_path = 'ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
......@@ -182,7 +195,7 @@ from paddleocr import PPStructure,save_structure_res
table_engine = PPStructure(layout=False, show_log=True)
save_folder = './output'
img_path = 'PaddleOCR/ppstructure/docs/table/table.jpg'
img_path = 'ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
......@@ -195,7 +208,7 @@ for line in result:
<a name="225"></a>
#### 2.2.5 Key Information Extraction
Please refer to: [Key Information Extraction](../kie/README.md) .
Key information extraction does not currently support use by the whl package. For detailed usage tutorials, please refer to: [Key Information Extraction](../kie/README.md).
<a name="226"></a>
#### 2.2.6 layout recovery
......@@ -246,8 +259,8 @@ Each field in dict is described as follows:
| field | description |
| --- |---|
|type| Type of image area. |
|bbox| The coordinates of the image area in the original image, respectively [upper left corner x, upper left corner y, lower right corner x, lower right corner y]. |
|type| Type of image area. |
|bbox| The coordinates of the image area in the original image, respectively [upper left corner x, upper left corner y, lower right corner x, lower right corner y]. |
|res| OCR or table recognition result of the image area. <br> table: a dict with field descriptions as follows: <br>&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; `html`: html str of table.<br>&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; When calling from code, set `return_ocr_result_in_table=True` to also get the detection and recognition results of each text in the table area, in the following fields: <br>&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; `boxes`: text detection boxes.<br>&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp; `rec_res`: text recognition results.<br> OCR: A tuple containing the detection boxes and recognition results of each single text. |
After the recognition is completed, each image will have a directory with the same name under the directory specified by the `output` field. Each table in the image will be stored as an excel, and the picture area will be cropped and saved. The filename of excel and picture is their coordinates in the image.
......@@ -291,3 +304,8 @@ Please refer to: [Key Information Extraction](../kie/README.md) .
| structure_version | Structure version, optional PP-structure and PP-structurev2 | PP-structure |
Most of the parameters are consistent with the PaddleOCR whl package, see [whl package documentation](../../doc/doc_en/whl.md)
<a name="3"></a>
## 3. Summary
Through the content in this section, you can master the use of PP-Structure related functions through PaddleOCR whl package. Please refer to [documentation tutorial](../../README.md) for more detailed usage tutorials including model training, inference and deployment, etc.
\ No newline at end of file
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_image_shape="3,48,320"
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -4,7 +4,7 @@ Global:
log_smooth_window: 20
print_batch_step: 5
save_model_dir: ./output/table_mv3/
save_epoch_step: 3
save_epoch_step: 400
# evaluation is run every 400 iterations after the 0th iteration
eval_batch_step: [0, 40000]
cal_metric_during_train: True
......@@ -17,7 +17,8 @@ Global:
# for data or label process
character_dict_path: ppocr/utils/dict/table_structure_dict.txt
character_type: en
max_text_length: 800
max_text_length: &max_text_length 500
box_format: &box_format 'xyxy' # 'xywh', 'xyxy', 'xyxyxyxy'
infer_mode: False
Optimizer:
......@@ -37,12 +38,14 @@ Architecture:
Backbone:
name: MobileNetV3
scale: 1.0
model_name: large
model_name: small
disable_se: true
Head:
name: TableAttentionHead
hidden_size: 256
loc_type: 2
max_text_length: 800
max_text_length: *max_text_length
loc_reg_num: &loc_reg_num 4
Loss:
name: TableAttentionLoss
......@@ -70,6 +73,8 @@ Train:
learn_empty_box: False
merge_no_span_structure: False
replace_empty_cell_token: False
loc_reg_num: *loc_reg_num
max_text_length: *max_text_length
- TableBoxEncode:
- ResizeTableImage:
max_len: 488
......@@ -102,6 +107,8 @@ Eval:
learn_empty_box: False
merge_no_span_structure: False
replace_empty_cell_token: False
loc_reg_num: *loc_reg_num
max_text_length: *max_text_length
- TableBoxEncode:
- ResizeTableImage:
max_len: 488
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/EN_symbo
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
===========================train_params===========================
model_name:rec_r31_robustscanner
python:python
python:python3.7
gpu_list:0|0,1
Global.use_gpu:True|True
Global.auto_cast:null
......@@ -39,11 +39,11 @@ infer_export:tools/export_model.py -c test_tipc/configs/rec_r31_robustscanner/re
infer_quant:False
inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/dict90.txt --rec_image_shape="3,48,48,160" --use_space_char=False --rec_algorithm="RobustScanner"
--use_gpu:True|False
--enable_mkldnn:True|False
--cpu_threads:1|6
--rec_batch_num:1|6
--use_tensorrt:False|False
--precision:fp32|int8
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
--image_dir:./inference/rec_inference
--save_log_path:./test/output/
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/dict90.t
--use_gpu:True
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/dict/spi
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/EN_symbo
--use_gpu:True|False
--enable_mkldnn:False
--cpu_threads:6
--rec_batch_num:1|6
--rec_batch_num:1
--use_tensorrt:False
--precision:fp32
--rec_model_dir:
......
......@@ -160,6 +160,8 @@ if [ ${MODE} = "lite_train_lite_infer" ];then
ln -s ./icdar2015_lite ./icdar2015
wget -nc -P ./ic15_data/ https://paddleocr.bj.bcebos.com/dataset/rec_gt_train_lite.txt --no-check-certificate
wget -nc -P ./ic15_data/ https://paddleocr.bj.bcebos.com/dataset/rec_gt_test_lite.txt --no-check-certificate
mv ic15_data/rec_gt_train_lite.txt ic15_data/rec_gt_train.txt
mv ic15_data/rec_gt_test_lite.txt ic15_data/rec_gt_test.txt
cd ../
cd ./inference && tar xf rec_inference.tar && cd ../
if [ ${model_name} == "ch_PP-OCRv2_det" ] || [ ${model_name} == "ch_PP-OCRv2_det_PACT" ]; then
......@@ -221,7 +223,6 @@ if [ ${MODE} = "lite_train_lite_infer" ];then
fi
if [ ${model_name} == "layoutxlm_ser" ] || [ ${model_name} == "vi_layoutxlm_ser" ]; then
pip install -r ppstructure/kie/requirements.txt
pip install paddlenlp\>=2.3.5 --force-reinstall -i https://mirrors.aliyun.com/pypi/simple/
wget -nc -P ./train_data/ https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar --no-check-certificate
cd ./train_data/ && tar xf XFUND.tar
cd ../
......
......@@ -23,6 +23,7 @@ __dir__ = os.path.dirname(os.path.abspath(__file__))
sys.path.insert(0, __dir__)
sys.path.insert(0, os.path.abspath(os.path.join(__dir__, '..')))
import paddle
from ppocr.data import build_dataloader
from ppocr.modeling.architectures import build_model
from ppocr.postprocess import build_post_process
......@@ -86,6 +87,30 @@ def main():
else:
model_type = None
# build metric
eval_class = build_metric(config['Metric'])
# amp
use_amp = config["Global"].get("use_amp", False)
amp_level = config["Global"].get("amp_level", 'O2')
amp_custom_black_list = config['Global'].get('amp_custom_black_list',[])
if use_amp:
AMP_RELATED_FLAGS_SETTING = {
'FLAGS_cudnn_batchnorm_spatial_persistent': 1,
'FLAGS_max_inplace_grad_add': 8,
}
paddle.fluid.set_flags(AMP_RELATED_FLAGS_SETTING)
scale_loss = config["Global"].get("scale_loss", 1.0)
use_dynamic_loss_scaling = config["Global"].get(
"use_dynamic_loss_scaling", False)
scaler = paddle.amp.GradScaler(
init_loss_scaling=scale_loss,
use_dynamic_loss_scaling=use_dynamic_loss_scaling)
if amp_level == "O2":
model = paddle.amp.decorate(
models=model, level=amp_level, master_weight=True)
else:
scaler = None
best_model_dict = load_model(
config, model, model_type=config['Architecture']["model_type"])
if len(best_model_dict):
......@@ -93,11 +118,9 @@ def main():
for k, v in best_model_dict.items():
logger.info('{}:{}'.format(k, v))
# build metric
eval_class = build_metric(config['Metric'])
# start eval
metric = program.eval(model, valid_dataloader, post_process_class,
eval_class, model_type, extra_input)
eval_class, model_type, extra_input, scaler, amp_level, amp_custom_black_list)
logger.info('metric eval ***************')
for k, v in metric.items():
logger.info('{}:{}'.format(k, v))
......
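Review note: the hunks above read three AMP options from `config["Global"]` (`use_amp`, `amp_level`, `amp_custom_black_list`) plus two scaler options, and then forward them to `program.eval`. A minimal sketch of that lookup in plain Python follows. The helper name `read_amp_settings` is hypothetical and not part of the repository; only the keys and default values are taken from the diff.

```python
def read_amp_settings(config):
    """Collect AMP options from config['Global'] with the defaults shown in the diff.

    Hypothetical helper for illustration only; the actual change reads these
    keys inline inside main().
    """
    global_cfg = config.get("Global", {})
    return {
        "use_amp": global_cfg.get("use_amp", False),
        "amp_level": global_cfg.get("amp_level", "O2"),
        # Copy the list so callers cannot mutate the config by accident.
        "amp_custom_black_list": list(
            global_cfg.get("amp_custom_black_list", [])),
        "scale_loss": global_cfg.get("scale_loss", 1.0),
        "use_dynamic_loss_scaling": global_cfg.get(
            "use_dynamic_loss_scaling", False),
    }
```

With an empty `Global` section this falls back to AMP disabled, level `O2`, and an empty black list, which matches the defaults passed to `.get(...)` in the change.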
......@@ -349,6 +349,13 @@ class TextRecognizer(object):
for beg_img_no in range(0, img_num, batch_num):
end_img_no = min(img_num, beg_img_no + batch_num)
norm_img_batch = []
if self.rec_algorithm == "SRN":
encoder_word_pos_list = []
gsrm_word_pos_list = []
gsrm_slf_attn_bias1_list = []
gsrm_slf_attn_bias2_list = []
if self.rec_algorithm == "SAR":
valid_ratios = []
imgC, imgH, imgW = self.rec_image_shape[:3]
max_wh_ratio = imgW / imgH
# max_wh_ratio = 0
......@@ -357,22 +364,16 @@ class TextRecognizer(object):
wh_ratio = w * 1.0 / h
max_wh_ratio = max(max_wh_ratio, wh_ratio)
for ino in range(beg_img_no, end_img_no):
if self.rec_algorithm == "SAR":
norm_img, _, _, valid_ratio = self.resize_norm_img_sar(
img_list[indices[ino]], self.rec_image_shape)
norm_img = norm_img[np.newaxis, :]
valid_ratio = np.expand_dims(valid_ratio, axis=0)
valid_ratios = []
valid_ratios.append(valid_ratio)
norm_img_batch.append(norm_img)
elif self.rec_algorithm == "SRN":
norm_img = self.process_image_srn(
img_list[indices[ino]], self.rec_image_shape, 8, 25)
encoder_word_pos_list = []
gsrm_word_pos_list = []
gsrm_slf_attn_bias1_list = []
gsrm_slf_attn_bias2_list = []
encoder_word_pos_list.append(norm_img[1])
gsrm_word_pos_list.append(norm_img[2])
gsrm_slf_attn_bias1_list.append(norm_img[3])
......
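Review note: judging from the two `TextRecognizer` hunks above, the SAR and SRN accumulator lists (`valid_ratios`, `encoder_word_pos_list`, and so on) used to be re-created inside the per-image loop and are now created once per batch. The sketch below reproduces that pattern in plain Python to show why the placement matters when `rec_batch_num` is greater than 1. The function names and the float inputs are illustrative assumptions, not code from the repository.

```python
def batch_ratios_reset_per_image(ratios, batch_num):
    """Old pattern: the list is re-created for every image in the batch."""
    batches = []
    for beg in range(0, len(ratios), batch_num):
        end = min(len(ratios), beg + batch_num)
        for i in range(beg, end):
            valid_ratios = []  # reset on every image: earlier entries are lost
            valid_ratios.append(ratios[i])
        batches.append(valid_ratios)
    return batches


def batch_ratios_reset_per_batch(ratios, batch_num):
    """New pattern: the list is created once per batch, before the image loop."""
    batches = []
    for beg in range(0, len(ratios), batch_num):
        end = min(len(ratios), beg + batch_num)
        valid_ratios = []  # one list per batch
        for i in range(beg, end):
            valid_ratios.append(ratios[i])
        batches.append(valid_ratios)
    return batches
```

For three inputs and a batch size of 2, the first function keeps only the last image of each batch, while the second keeps every image. With a batch size of 1 the two behave the same.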
......@@ -191,7 +191,8 @@ def train(config,
logger,
log_writer=None,
scaler=None,
amp_level='O2'):
amp_level='O2',
amp_custom_black_list=[]):
cal_metric_during_train = config['Global'].get('cal_metric_during_train',
False)
calc_epoch_interval = config['Global'].get('calc_epoch_interval', 1)
......@@ -278,10 +279,7 @@ def train(config,
model_average = True
# use amp
if scaler:
custom_black_list = config['Global'].get(
'amp_custom_black_list', [])
with paddle.amp.auto_cast(
level=amp_level, custom_black_list=custom_black_list):
with paddle.amp.auto_cast(level=amp_level, custom_black_list=amp_custom_black_list):
if model_type == 'table' or extra_input:
preds = model(images, data=batch[1:])
elif model_type in ["kie"]:
......@@ -386,7 +384,9 @@ def train(config,
eval_class,
model_type,
extra_input=extra_input,
scaler=scaler)
scaler=scaler,
amp_level=amp_level,
amp_custom_black_list=amp_custom_black_list)
cur_metric_str = 'cur metric, {}'.format(', '.join(
['{}: {}'.format(k, v) for k, v in cur_metric.items()]))
logger.info(cur_metric_str)
......@@ -477,7 +477,9 @@ def eval(model,
eval_class,
model_type=None,
extra_input=False,
scaler=None):
scaler=None,
amp_level='O2',
amp_custom_black_list = []):
model.eval()
with paddle.no_grad():
total_frame = 0.0
......@@ -498,7 +500,7 @@ def eval(model,
# use amp
if scaler:
with paddle.amp.auto_cast(level='O2'):
with paddle.amp.auto_cast(level=amp_level, custom_black_list=amp_custom_black_list):
if model_type == 'table' or extra_input:
preds = model(images, data=batch[1:])
elif model_type in ["kie"]:
......
......@@ -138,9 +138,7 @@ def main(config, device, logger, vdl_writer):
# build metric
eval_class = build_metric(config['Metric'])
# load pretrain model
pre_best_model_dict = load_model(config, model, optimizer,
config['Architecture']["model_type"])
logger.info('train dataloader has {} iters'.format(len(train_dataloader)))
if valid_dataloader is not None:
logger.info('valid dataloader has {} iters'.format(
......@@ -148,6 +146,7 @@ def main(config, device, logger, vdl_writer):
use_amp = config["Global"].get("use_amp", False)
amp_level = config["Global"].get("amp_level", 'O2')
amp_custom_black_list = config['Global'].get('amp_custom_black_list',[])
if use_amp:
AMP_RELATED_FLAGS_SETTING = {
'FLAGS_cudnn_batchnorm_spatial_persistent': 1,
......@@ -166,12 +165,16 @@ def main(config, device, logger, vdl_writer):
else:
scaler = None
# load pretrain model
pre_best_model_dict = load_model(config, model, optimizer,
config['Architecture']["model_type"])
if config['Global']['distributed']:
model = paddle.DataParallel(model)
# start train
program.train(config, train_dataloader, valid_dataloader, device, model,
loss_class, optimizer, lr_scheduler, post_process_class,
eval_class, pre_best_model_dict, logger, vdl_writer, scaler,amp_level)
eval_class, pre_best_model_dict, logger, vdl_writer, scaler,amp_level, amp_custom_black_list)
def test_reader(config, device, logger):
......