diff --git a/README.md b/README.md index f57672e5055df042ede9ae03bbed590889c5941c..58e436c99220d6b4ec5e4e4934bbeeed66408503 100644 --- a/README.md +++ b/README.md @@ -113,18 +113,19 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel - [Quick Start](./ppstructure/docs/quickstart_en.md) - [Model Zoo](./ppstructure/docs/models_list_en.md) - [Model training](./doc/doc_en/training_en.md) - - [Layout Parser](./ppstructure/layout/README.md) + - [Layout Analysis](./ppstructure/layout/README.md) - [Table Recognition](./ppstructure/table/README.md) - - [DocVQA](./ppstructure/vqa/README.md) - - [Key Information Extraction](./ppstructure/docs/kie_en.md) + - [Key Information Extraction](./ppstructure/kie/README.md) - [Inference and Deployment](./deploy/README.md) - [Python Inference](./ppstructure/docs/inference_en.md) - - [C++ Inference]() + - [C++ Inference](./deploy/cpp_infer/readme.md) - [Serving](./deploy/pdserving/README.md) -- [Academic algorithms](./doc/doc_en/algorithms_en.md) +- [Academic Algorithms](./doc/doc_en/algorithm_overview_en.md) - [Text detection](./doc/doc_en/algorithm_overview_en.md) - [Text recognition](./doc/doc_en/algorithm_overview_en.md) - - [End-to-end](./doc/doc_en/algorithm_overview_en.md) + - [End-to-end OCR](./doc/doc_en/algorithm_overview_en.md) + - [Table Recognition](./doc/doc_en/algorithm_overview_en.md) + - [Key Information Extraction](./doc/doc_en/algorithm_overview_en.md) - [Add New Algorithms to PaddleOCR](./doc/doc_en/add_new_algorithm_en.md) - Data Annotation and Synthesis - [Semi-automatic Annotation Tool: PPOCRLabel](./PPOCRLabel/README.md) @@ -135,9 +136,9 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel - [General OCR Datasets(Chinese/English)](doc/doc_en/dataset/datasets_en.md) - [HandWritten_OCR_Datasets(Chinese)](doc/doc_en/dataset/handwritten_datasets_en.md) - [Various OCR Datasets(multilingual)](doc/doc_en/dataset/vertical_and_multilingual_datasets_en.md) - - [layout analysis](doc/doc_en/dataset/layout_datasets_en.md) - - [table recognition](doc/doc_en/dataset/table_datasets_en.md) - - [DocVQA](doc/doc_en/dataset/docvqa_datasets_en.md) + - [Layout Analysis](doc/doc_en/dataset/layout_datasets_en.md) + - [Table Recognition](doc/doc_en/dataset/table_datasets_en.md) + - [Key Information Extraction](doc/doc_en/dataset/kie_datasets_en.md) - [Code Structure](./doc/doc_en/tree_en.md) - [Visualization](#Visualization) - [Community](#Community) @@ -176,7 +177,7 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
-PP-Structure +PP-Structurev2 - layout analysis + table recognition
@@ -185,12 +186,28 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel - SER (Semantic entity recognition)
- + +
+ +
+ +
+ +
+
- RE (Relation Extraction)
- + +
+ +
+ +
+ +
+
diff --git a/README_ch.md b/README_ch.md index e801ce561cb41aafb376f81a3016f0a6b838320d..49e84e15fd429ecb26c6c579857920882a1145d6 100755 --- a/README_ch.md +++ b/README_ch.md @@ -27,16 +27,9 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 ## 近期更新 -- **🔥2022.5.11~13 每晚8:30【超强OCR技术详解与产业应用实战】三日直播课** - - 11日:开源最强OCR系统PP-OCRv3揭秘 - - 12日:云边端全覆盖的PP-OCRv3训练部署实战 - - 13日:OCR产业应用全流程拆解与实战 +- **🔥2022.7 发布[OCR场景应用集合](./applications)** + - 发布OCR场景应用集合,包含数码管、液晶屏、车牌、高精度SVTR模型等**7个垂类模型**,覆盖通用,制造、金融、交通行业的主要OCR垂类应用。 - 赶紧扫码报名吧! -
- -
- - **🔥2022.5.9 发布PaddleOCR [release/2.5](https://github.com/PaddlePaddle/PaddleOCR/tree/release/2.5)** - 发布[PP-OCRv3](./doc/doc_ch/ppocr_introduction.md#pp-ocrv3),速度可比情况下,中文场景效果相比于PP-OCRv2再提升5%,英文场景提升11%,80语种多语言模型平均识别准确率提升5%以上; - 发布半自动标注工具[PPOCRLabelv2](./PPOCRLabel):新增表格文字图像、图像关键信息抽取任务和不规则文字图像的标注功能; @@ -71,24 +64,22 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 ## 《动手学OCR》电子书 - [《动手学OCR》电子书📚](./doc/doc_ch/ocr_book.md) -## 场景应用 -- PaddleOCR场景应用覆盖通用,制造、金融、交通行业的主要OCR垂类应用,在PP-OCR、PP-Structure的通用能力基础之上,以notebook的形式展示利用场景数据微调、模型优化方法、数据增广等内容,为开发者快速落地OCR应用提供示范与启发。详情可查看[README](./applications)。 ## 开源社区 - +- **项目合作📑:** 如果您是企业开发者且有明确的OCR垂类应用需求,填写[问卷](https://paddle.wjx.cn/vj/QwF7GKw.aspx)后可免费与官方团队展开不同层次的合作。 - **加入社区👬:** 微信扫描二维码并填写问卷之后,加入交流群领取福利 - - **获取5月11-13日每晚20:30《OCR超强技术详解与产业应用实战》的直播课链接** + - **获取PaddleOCR最新发版解说《OCR超强技术详解与产业应用实战》系列直播课回放链接** - **10G重磅OCR学习大礼包:**《动手学OCR》电子书,配套讲解视频和notebook项目;66篇OCR相关顶会前沿论文打包放送,包括CVPR、AAAI、IJCAI、ICCV等;PaddleOCR历次发版直播课视频;OCR社区优秀开发者项目分享视频。 - -- **社区贡献**🏅️:[社区贡献](./doc/doc_ch/thirdparty.md)文档中包含了社区用户**使用PaddleOCR开发的各种工具、应用**以及**为PaddleOCR贡献的功能、优化的文档与代码**等,是官方为社区开发者打造的荣誉墙,也是帮助优质项目宣传的广播站。 +- **社区项目**🏅️:[社区项目](./doc/doc_ch/thirdparty.md)文档中包含了社区用户**使用PaddleOCR开发的各种工具、应用**以及**为PaddleOCR贡献的功能、优化的文档与代码**等,是官方为社区开发者打造的荣誉墙,也是帮助优质项目宣传的广播站。 - **社区常规赛**🎁:社区常规赛是面向OCR开发者的积分赛事,覆盖文档、代码、模型和应用四大类型,以季度为单位评选并发放奖励,赛题详情与报名方法可参考[链接](https://github.com/PaddlePaddle/PaddleOCR/issues/4982)。
- +
+ ## PP-OCR系列模型列表(更新中) @@ -96,14 +87,21 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 | ------------------------------------- | ----------------------- | --------------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ | | 中英文超轻量PP-OCRv3模型(16.2M) | ch_PP-OCRv3_xx | 移动端&服务器端 | [推理模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_distill_train.tar) | [推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_train.tar) | [推理模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_train.tar) | | 英文超轻量PP-OCRv3模型(13.4M) | en_PP-OCRv3_xx | 移动端&服务器端 | [推理模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_det_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_det_distill_train.tar) | [推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_train.tar) | [推理模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_rec_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_rec_train.tar) | -| 中英文超轻量PP-OCRv2模型(13.0M) | ch_PP-OCRv2_xx | 移动端&服务器端 | [推理模型](https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_distill_train.tar) | [推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar) / [预训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_train.tar) | [推理模型](https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_train.tar) | -| 中英文超轻量PP-OCR mobile模型(9.4M) | ch_ppocr_mobile_v2.0_xx | 移动端&服务器端 | [推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_infer.tar) / [预训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_det_train.tar) | [推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar) / [预训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_train.tar) | [推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_infer.tar) / [预训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_rec_pre.tar) | -| 中英文通用PP-OCR server模型(143.4M) | ch_ppocr_server_v2.0_xx | 服务器端 | [推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_det_infer.tar) / [预训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_det_train.tar) | [推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_infer.tar) / [预训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_mobile_v2.0_cls_train.tar) | [推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_rec_infer.tar) / [预训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/ch/ch_ppocr_server_v2.0_rec_pre.tar) | -更多模型下载(包括多语言),可以参考[PP-OCR 系列模型下载](./doc/doc_ch/models_list.md),文档分析相关模型参考[PP-Structure 系列模型下载](./ppstructure/docs/models_list.md) +- 
超轻量OCR系列更多模型下载(包括多语言),可以参考[PP-OCR系列模型下载](./doc/doc_ch/models_list.md),文档分析相关模型参考[PP-Structure系列模型下载](./ppstructure/docs/models_list.md) + +### PaddleOCR场景应用模型 + +| 行业 | 类别 | 亮点 | 文档说明 | 模型下载 | +| ---- | ------------ | ---------------------------------- | ------------------------------------------------------------ | --------------------------------------------- | +| 制造 | 数码管识别 | 数码管数据合成、漏识别调优 | [光功率计数码管字符识别](./applications/光功率计数码管字符识别/光功率计数码管字符识别.md) | [下载链接](./applications/README.md#模型下载) | +| 金融 | 通用表单识别 | 多模态通用表单结构化提取 | [多模态表单识别](./applications/多模态表单识别.md) | [下载链接](./applications/README.md#模型下载) | +| 交通 | 车牌识别 | 多角度图像处理、轻量模型、端侧部署 | [轻量级车牌识别](./applications/轻量级车牌识别.md) | [下载链接](./applications/README.md#模型下载) | +- 更多制造、金融、交通行业的主要OCR垂类应用模型(如电表、液晶屏、高精度SVTR模型等),可参考[场景应用模型下载](./applications) + ## 文档教程 - [运行环境准备](./doc/doc_ch/environment.md) @@ -120,7 +118,7 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 - [知识蒸馏](./doc/doc_ch/knowledge_distillation.md) - [推理部署](./deploy/README_ch.md) - [基于Python预测引擎推理](./doc/doc_ch/inference_ppocr.md) - - [基于C++预测引擎推理](./deploy/cpp_infer/readme.md) + - [基于C++预测引擎推理](./deploy/cpp_infer/readme_ch.md) - [服务化部署](./deploy/pdserving/README_CN.md) - [端侧部署](./deploy/lite/readme.md) - [Paddle2ONNX模型转化与预测](./deploy/paddle2onnx/readme.md) @@ -132,16 +130,17 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 - [模型训练](./doc/doc_ch/training.md) - [版面分析](./ppstructure/layout/README_ch.md) - [表格识别](./ppstructure/table/README_ch.md) - - [关键信息提取](./ppstructure/docs/kie.md) - - [DocVQA](./ppstructure/vqa/README_ch.md) + - [关键信息提取](./ppstructure/kie/README_ch.md) - [推理部署](./deploy/README_ch.md) - [基于Python预测引擎推理](./ppstructure/docs/inference.md) - - [基于C++预测引擎推理]() + - [基于C++预测引擎推理](./deploy/cpp_infer/readme_ch.md) - [服务化部署](./deploy/pdserving/README_CN.md) -- [前沿算法与模型🚀](./doc/doc_ch/algorithm.md) - - [文本检测算法](./doc/doc_ch/algorithm_overview.md#11-%E6%96%87%E6%9C%AC%E6%A3%80%E6%B5%8B%E7%AE%97%E6%B3%95) - - [文本识别算法](./doc/doc_ch/algorithm_overview.md#12-%E6%96%87%E6%9C%AC%E8%AF%86%E5%88%AB%E7%AE%97%E6%B3%95) - - [端到端算法](./doc/doc_ch/algorithm_overview.md#2-%E6%96%87%E6%9C%AC%E8%AF%86%E5%88%AB%E7%AE%97%E6%B3%95) +- [前沿算法与模型🚀](./doc/doc_ch/algorithm_overview.md) + - [文本检测算法](./doc/doc_ch/algorithm_overview.md) + - [文本识别算法](./doc/doc_ch/algorithm_overview.md) + - [端到端OCR算法](./doc/doc_ch/algorithm_overview.md) + - [表格识别算法](./doc/doc_ch/algorithm_overview.md) + - [关键信息抽取算法](./doc/doc_ch/algorithm_overview.md) - [使用PaddleOCR架构添加新算法](./doc/doc_ch/add_new_algorithm.md) - [场景应用](./applications) - 数据标注与合成 @@ -155,7 +154,7 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 - [垂类多语言OCR数据集](doc/doc_ch/dataset/vertical_and_multilingual_datasets.md) - [版面分析数据集](doc/doc_ch/dataset/layout_datasets.md) - [表格识别数据集](doc/doc_ch/dataset/table_datasets.md) - - [DocVQA数据集](doc/doc_ch/dataset/docvqa_datasets.md) + - [关键信息提取数据集](doc/doc_ch/dataset/kie_datasets.md) - [代码组织结构](./doc/doc_ch/tree.md) - [效果展示](#效果展示) - [《动手学OCR》电子书📚](./doc/doc_ch/ocr_book.md) @@ -214,14 +213,30 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 - SER(语义实体识别)
- +
+
+ +
+ +
+ +
+ - RE(关系提取)
- +
+
+ +
+ +
+ +
+ diff --git "a/applications/\345\217\221\347\245\250\345\205\263\351\224\256\344\277\241\346\201\257\346\212\275\345\217\226.md" "b/applications/\345\217\221\347\245\250\345\205\263\351\224\256\344\277\241\346\201\257\346\212\275\345\217\226.md" new file mode 100644 index 0000000000000000000000000000000000000000..cd7fa1a0b3c988b21b33fe8f123e7d7c3e851ca5 --- /dev/null +++ "b/applications/\345\217\221\347\245\250\345\205\263\351\224\256\344\277\241\346\201\257\346\212\275\345\217\226.md" @@ -0,0 +1,337 @@ + +# 基于VI-LayoutXLM的发票关键信息抽取 + +- [1. 项目背景及意义](#1-项目背景及意义) +- [2. 项目内容](#2-项目内容) +- [3. 安装环境](#3-安装环境) +- [4. 关键信息抽取](#4-关键信息抽取) + - [4.1 文本检测](#41-文本检测) + - [4.2 文本识别](#42-文本识别) + - [4.3 语义实体识别](#43-语义实体识别) + - [4.4 关系抽取](#44-关系抽取) + + + +## 1. 项目背景及意义 + +关键信息抽取在文档场景中被广泛使用,如身份证中的姓名、住址信息抽取,快递单中的姓名、联系方式等关键字段内容的抽取。传统基于模板匹配的方案需要针对不同的场景制定模板并进行适配,较为繁琐,不够鲁棒。基于该问题,我们借助飞桨提供的PaddleOCR套件中的关键信息抽取方案,实现对增值税发票场景的关键信息抽取。 + +## 2. 项目内容 + +本项目基于PaddleOCR开源套件,以VI-LayoutXLM多模态关键信息抽取模型为基础,针对增值税发票场景进行适配,提取该场景的关键信息。 + +## 3. 安装环境 + +```bash +# 首先git官方的PaddleOCR项目,安装需要的依赖 +# 第一次运行打开该注释 +git clone https://gitee.com/PaddlePaddle/PaddleOCR.git +cd PaddleOCR +# 安装PaddleOCR的依赖 +pip install -r requirements.txt +# 安装关键信息抽取任务的依赖 +pip install -r ./ppstructure/vqa/requirements.txt +``` + +## 4. 关键信息抽取 + +基于文档图像的关键信息抽取包含3个部分:(1)文本检测(2)文本识别(3)关键信息抽取方法,包括语义实体识别或者关系抽取,下面分别进行介绍。 + +### 4.1 文本检测 + + +本文重点关注发票的关键信息抽取模型训练与预测过程,因此在关键信息抽取过程中,直接使用标注的文本检测与识别标注信息进行测试,如果你希望自定义该场景的文本检测模型,完成端到端的关键信息抽取部分,请参考[文本检测模型训练教程](../doc/doc_ch/detection.md),按照训练数据格式准备数据,并完成该场景下垂类文本检测模型的微调过程。 + + +### 4.2 文本识别 + +本文重点关注发票的关键信息抽取模型训练与预测过程,因此在关键信息抽取过程中,直接使用提供的文本检测与识别标注信息进行测试,如果你希望自定义该场景的文本检测模型,完成端到端的关键信息抽取部分,请参考[文本识别模型训练教程](../doc/doc_ch/recognition.md),按照训练数据格式准备数据,并完成该场景下垂类文本识别模型的微调过程。 + +### 4.3 语义实体识别 (Semantic Entity Recognition) + +语义实体识别指的是给定一段文本行,确定其类别(如`姓名`、`住址`等类别)。PaddleOCR中提供了基于VI-LayoutXLM的多模态语义实体识别方法,融合文本、位置与版面信息,相比LayoutXLM多模态模型,去除了其中的视觉骨干网络特征提取部分,引入符合阅读顺序的文本行排序方法,同时使用UDML联合互蒸馏方法进行训练,最终在精度与速度方面均超越LayoutXLM。更多关于VI-LayoutXLM的算法介绍与精度指标,请参考:[VI-LayoutXLM算法介绍](../doc/doc_ch/algorithm_kie_vi_layoutxlm.md)。 + +#### 4.3.1 准备数据 + +发票场景为例,我们首先需要标注出其中的关键字段,我们将其标注为`问题-答案`的key-value pair,如下,编号No为12270830,则`No`字段标注为question,`12270830`字段标注为answer。如下图所示。 + +
+ +
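+
+在实际项目中,往往先使用PPOCRLabel等工具完成文本检测与识别标注(标注中包含`transcription`与`points`字段),再为每个文本行补充`label`字段(具体标注格式可参考下文给出的标注示例)。下面给出一个将OCR标注批量转换为SER标注格式的示例脚本,仅作为示意:其中`ocr_label.txt`与`ser_label.txt`为假设的文件名,脚本默认将所有文本行标注为`other`,关键字段需要再人工改为`question`或`answer`。
+
+```py
+# 将OCR检测/识别标注转换为SER标注格式的示例脚本(仅为示意)
+# ocr_label.txt 与 ser_label.txt 为假设的文件名,每行格式为:图片名\tJSON列表
+import json
+
+with open("ocr_label.txt", "r", encoding="utf-8") as fin, \
+        open("ser_label.txt", "w", encoding="utf-8") as fout:
+    for line in fin:
+        img_name, info = line.rstrip("\n").split("\t", 1)
+        boxes = json.loads(info)
+        for box in boxes:
+            # 默认全部标注为other,之后再人工将关键字段改为question/answer
+            box.setdefault("label", "other")
+        fout.write(img_name + "\t" + json.dumps(boxes, ensure_ascii=False) + "\n")
+```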
+ +**注意:** + +* 如果文本检测模型数据标注过程中,没有标注 **非关键信息内容** 的检测框,那么在标注关键信息抽取任务的时候,也不需要标注该部分,如上图所示;如果标注的过程中同时标注了**非关键信息内容** 的检测框,那么我们需要将该部分的label记为other。 +* 标注过程中,需要以文本行为单位进行标注,无需标注单个字符的位置信息。 + + +已经处理好的增值税发票数据集从这里下载:[增值税发票数据集下载链接](https://aistudio.baidu.com/aistudio/datasetdetail/165561)。 + +下载好发票数据集,并解压在train_data目录下,目录结构如下所示。 + +``` +train_data + |--zzsfp + |---class_list.txt + |---imgs/ + |---train.json + |---val.json +``` + +其中`class_list.txt`是包含`other`, `question`, `answer` 3个类别的类别列表(不区分大小写),图片存放在`imgs`目录下,`train.json`与`val.json`分别表示训练与评估集合的标注文件。训练集中包含30张图片,验证集中包含8张图片。部分标注如下所示。 + +```py +b33.jpg [{"transcription": "No", "label": "question", "points": [[2882, 472], [3026, 472], [3026, 588], [2882, 588]]}, {"transcription": "12269563", "label": "answer", "points": [[3066, 448], [3598, 448], [3598, 576], [3066, 576]]}] +``` + +相比于OCR检测的标注,仅多了`label`字段。 + + +#### 4.3.2 开始训练 + + +VI-LayoutXLM的配置为[ser_vi_layoutxlm_xfund_zh_udml.yml](../configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml),需要修改配置文件中的数据路径、类别数目与类别文件。 + +```yml +Architecture: + model_type: &model_type "vqa" + name: DistillationModel + algorithm: Distillation + Models: + Teacher: + pretrained: + freeze_params: false + return_all_feats: true + model_type: *model_type + algorithm: &algorithm "LayoutXLM" + Transform: + Backbone: + name: LayoutXLMForSer + pretrained: True + # one of base or vi + mode: vi + checkpoints: + # 定义类别数目 + num_classes: &num_classes 5 + ... + +PostProcess: + name: DistillationSerPostProcess + model_name: ["Student", "Teacher"] + key: backbone_out + # 定义类别文件 + class_path: &class_path train_data/zzsfp/class_list.txt + +Train: + dataset: + name: SimpleDataSet + # 定义训练数据目录与标注文件 + data_dir: train_data/zzsfp/imgs + label_file_list: + - train_data/zzsfp/train.json + ... + +Eval: + dataset: + # 定义评估数据目录与标注文件 + name: SimpleDataSet + data_dir: train_data/zzsfp/imgs + label_file_list: + - train_data/zzsfp/val.json + ... +``` + +LayoutXLM与VI-LayoutXLM针对该场景的训练结果如下所示。 + +| 模型 | 迭代轮数 | Hmean | +| :---: | :---: | :---: | +| LayoutXLM | 50 | 100% | +| VI-LayoutXLM | 50 | 100% | + +可以看出,由于当前数据量较少,场景比较简单,因此2个模型的Hmean均达到了100%。 + + +#### 4.3.3 模型评估 + +模型训练过程中,使用的是知识蒸馏的策略,最终保留了学生模型的参数,在评估时,我们需要针对学生模型的配置文件进行修改: [ser_vi_layoutxlm_xfund_zh.yml](../configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml),修改内容与训练配置相同,包括**类别数、类别映射文件、数据目录**。 + +修改完成后,执行下面的命令完成评估过程。 + +```bash +# 注意:需要根据你的配置文件地址与保存的模型地址,对评估命令进行修改 +python3 tools/eval.py -c ./fapiao/ser_vi_layoutxlm.yml -o Architecture.Backbone.checkpoints=fapiao/models/ser_vi_layoutxlm_fapiao_udml/best_accuracy +``` + +输出结果如下所示。 + +``` +[2022/08/18 08:49:58] ppocr INFO: metric eval *************** +[2022/08/18 08:49:58] ppocr INFO: precision:1.0 +[2022/08/18 08:49:58] ppocr INFO: recall:1.0 +[2022/08/18 08:49:58] ppocr INFO: hmean:1.0 +[2022/08/18 08:49:58] ppocr INFO: fps:1.9740402401574881 +``` + +#### 4.3.4 模型预测 + +使用下面的命令进行预测。 + +```bash +python3 tools/infer_vqa_token_ser.py -c fapiao/ser_vi_layoutxlm.yml -o Architecture.Backbone.checkpoints=fapiao/models/ser_vi_layoutxlm_fapiao_udml/best_accuracy Global.infer_img=./train_data/XFUND/zh_val/val.json Global.infer_mode=False +``` + +预测结果会保存在配置文件中的`Global.save_res_path`目录中。 + +部分预测结果如下所示。 + +
+ +
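+
+上面的预测中,文本检测与识别结果默认直接取自标注文件。在训练或预测之前,也可以先检查一下标注文件的格式与类别分布是否符合预期。下面是一个统计各类别文本行数量的示例脚本,仅作为示意,读取的是上文数据集中的`train_data/zzsfp/train.json`。
+
+```py
+# 统计SER标注文件中各类别文本行数量的示例脚本(仅为示意)
+import json
+from collections import Counter
+
+counter = Counter()
+with open("train_data/zzsfp/train.json", "r", encoding="utf-8") as fin:
+    for line in fin:
+        _, info = line.rstrip("\n").split("\t", 1)
+        for box in json.loads(info):
+            counter[box.get("label", "other")] += 1
+print(counter)  # 例如 Counter({'answer': ..., 'question': ..., 'other': ...})
+```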
+ + +* 注意:在预测时,使用的文本检测与识别结果为标注的结果,直接从json文件里面进行读取。 + +如果希望使用OCR引擎预测得到的结果进行推理,则可以使用下面的命令。 + + +```bash +python3 tools/infer_vqa_token_ser.py -c fapiao/ser_vi_layoutxlm.yml -o Architecture.Backbone.checkpoints=fapiao/models/ser_vi_layoutxlm_fapiao_udml/best_accuracy Global.infer_img=./train_data/zzsfp/imgs/b25.jpg Global.infer_mode=True +``` + +结果如下所示。 + +
+ +
+ +它会使用PP-OCRv3的文本检测与识别模型获取文本位置与内容信息。 + +可以看出,由于训练过程中没有将额外的字段标注为other类别,所以大多数检测出来的字段被预测为question或者answer。 + +如果希望使用你在垂类场景中训练得到的OCR检测与识别模型,可以使用下面的方法传入检测与识别的inference模型路径,即可完成OCR文本检测与识别以及SER的串联过程。 + +```bash +python3 tools/infer_vqa_token_ser.py -c fapiao/ser_vi_layoutxlm.yml -o Architecture.Backbone.checkpoints=fapiao/models/ser_vi_layoutxlm_fapiao_udml/best_accuracy Global.infer_img=./train_data/zzsfp/imgs/b25.jpg Global.infer_mode=True Global.kie_rec_model_dir="your_rec_model" Global.kie_det_model_dir="your_det_model" +``` + +### 4.4 关系抽取(Relation Extraction) + +使用SER模型,可以获取图像中所有question与answer字段及其类别,在此基础上,我们还需要进一步获取question与answer之间的连接关系,因此需要进一步训练关系抽取模型来解决该问题。本文也基于VI-LayoutXLM多模态预训练模型,进行下游RE任务的模型训练。 + +#### 4.4.1 准备数据 + +以发票场景为例,相比于SER任务,RE中还需要标记每个文本行的id信息以及链接关系linking,如下所示。 + +
+ +
+ + +标注文件的部分内容如下所示。 + +```py +b33.jpg [{"transcription": "No", "label": "question", "points": [[2882, 472], [3026, 472], [3026, 588], [2882, 588]], "id": 0, "linking": [[0, 1]]}, {"transcription": "12269563", "label": "answer", "points": [[3066, 448], [3598, 448], [3598, 576], [3066, 576]], "id": 1, "linking": [[0, 1]]}] +``` + +相比于SER的标注,多了`id`与`linking`的信息,分别表示唯一标识以及连接关系。 + +已经处理好的增值税发票数据集从这里下载:[增值税发票数据集下载链接](https://aistudio.baidu.com/aistudio/datasetdetail/165561)。 + +#### 4.4.2 开始训练 + +基于VI-LayoutXLM的RE任务配置为[re_vi_layoutxlm_xfund_zh_udml.yml](../configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh_udml.yml),需要修改**数据路径、类别列表文件**。 + +```yml +Train: + dataset: + name: SimpleDataSet + # 定义训练数据目录与标注文件 + data_dir: train_data/zzsfp/imgs + label_file_list: + - train_data/zzsfp/train.json + transforms: + - DecodeImage: # load image + img_mode: RGB + channel_first: False + - VQATokenLabelEncode: # Class handling label + contains_re: True + algorithm: *algorithm + class_path: &class_path train_data/zzsfp/class_list.txt + ... + +Eval: + dataset: + # 定义评估数据目录与标注文件 + name: SimpleDataSet + data_dir: train_data/zzsfp/imgs + label_file_list: + - train_data/zzsfp/val.json + ... + +``` + +LayoutXLM与VI-LayoutXLM针对该场景的训练结果如下所示。 + +| 模型 | 迭代轮数 | Hmean | +| :---: | :---: | :---: | +| LayoutXLM | 50 | 98.0% | +| VI-LayoutXLM | 50 | 99.3% | + +可以看出,VI-LayoutXLM的Hmean相比LayoutXLM高了1.3%。 + + +#### 4.4.3 模型评估 + +模型训练过程中,使用的是知识蒸馏的策略,最终保留了学生模型的参数,在评估时,我们需要针对学生模型的配置文件进行修改: [re_vi_layoutxlm_xfund_zh.yml](../configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml),修改内容与训练配置相同,包括**类别映射文件、数据目录**。 + +修改完成后,执行下面的命令完成评估过程。 + +```bash +# 注意:需要根据你的配置文件地址与保存的模型地址,对评估命令进行修改 +python3 tools/eval.py -c ./fapiao/re_vi_layoutxlm.yml -o Architecture.Backbone.checkpoints=fapiao/models/re_vi_layoutxlm_fapiao_udml/best_accuracy +``` + +输出结果如下所示。 + +```py +[2022/08/18 12:17:14] ppocr INFO: metric eval *************** +[2022/08/18 12:17:14] ppocr INFO: precision:1.0 +[2022/08/18 12:17:14] ppocr INFO: recall:0.9873417721518988 +[2022/08/18 12:17:14] ppocr INFO: hmean:0.9936305732484078 +[2022/08/18 12:17:14] ppocr INFO: fps:2.765963539771157 +``` + +#### 4.4.4 模型预测 + +使用下面的命令进行预测。 + +```bash +# -c 后面的是RE任务的配置文件 +# -o 后面的字段是RE任务的配置 +# -c_ser 后面的是SER任务的配置文件 +# -o_ser 后面的字段是SER任务的配置 +python3 tools/infer_vqa_token_ser_re.py -c fapiao/re_vi_layoutxlm.yml -o Architecture.Backbone.checkpoints=fapiao/models/re_vi_layoutxlm_fapiao_udml/best_accuracy Global.infer_img=./train_data/zzsfp/val.json Global.infer_mode=False -c_ser fapiao/ser_vi_layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=fapiao/models/ser_vi_layoutxlm_fapiao_udml/best_accuracy +``` + +预测结果会保存在配置文件中的`Global.save_res_path`目录中。 + +部分预测结果如下所示。 + +
+ +
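+
+为了便于理解RE任务希望恢复的配对关系,下面给出一个基于4.4.1节标注格式、利用`id`与`linking`字段将question与answer还原为键值对的示例脚本,仅作为示意,读取的标注文件为上文数据集中的`val.json`。
+
+```py
+# 利用标注中的id与linking字段还原question-answer键值对(仅为示意)
+import json
+
+with open("train_data/zzsfp/val.json", "r", encoding="utf-8") as fin:
+    for line in fin:
+        img_name, info = line.rstrip("\n").split("\t", 1)
+        boxes = json.loads(info)
+        id2text = {box["id"]: box["transcription"] for box in boxes if "id" in box}
+        pairs = []
+        for box in boxes:
+            if box.get("label") != "question":
+                continue
+            for qid, aid in box.get("linking", []):
+                if qid == box.get("id") and aid in id2text:
+                    pairs.append((box["transcription"], id2text[aid]))
+        print(img_name, pairs)
+```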
+ + +* 注意:在预测时,使用的文本检测与识别结果为标注的结果,直接从json文件里面进行读取。 + +如果希望使用OCR引擎结果得到的结果进行推理,则可以使用下面的命令进行推理。 + +```bash +python3 tools/infer_vqa_token_ser_re.py -c fapiao/re_vi_layoutxlm.yml -o Architecture.Backbone.checkpoints=fapiao/models/re_vi_layoutxlm_fapiao_udml/best_accuracy Global.infer_img=./train_data/zzsfp/val.json Global.infer_mode=True -c_ser fapiao/ser_vi_layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=fapiao/models/ser_vi_layoutxlm_fapiao_udml/best_accuracy +``` + +如果希望构建基于你在垂类场景训练得到的OCR检测与识别模型,可以使用下面的方法传入,即可完成SER + RE的串联过程。 + +```bash +python3 tools/infer_vqa_token_ser_re.py -c fapiao/re_vi_layoutxlm.yml -o Architecture.Backbone.checkpoints=fapiao/models/re_vi_layoutxlm_fapiao_udml/best_accuracy Global.infer_img=./train_data/zzsfp/val.json Global.infer_mode=True -c_ser fapiao/ser_vi_layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=fapiao/models/ser_vi_layoutxlm_fapiao_udml/best_accuracy Global.kie_rec_model_dir="your_rec_model" Global.kie_det_model_dir="your_det_model" +``` diff --git a/configs/det/ch_PP-OCRv2/ch_PP-OCRv2_det_cml.yml b/configs/det/ch_PP-OCRv2/ch_PP-OCRv2_det_cml.yml index df429314cd0ec058aa6779a0ff55656f1b211bbf..acf438950a43af3356c7ab0aadf956fdf226814e 100644 --- a/configs/det/ch_PP-OCRv2/ch_PP-OCRv2_det_cml.yml +++ b/configs/det/ch_PP-OCRv2/ch_PP-OCRv2_det_cml.yml @@ -14,6 +14,9 @@ Global: use_visualdl: False infer_img: doc/imgs_en/img_10.jpg save_res_path: ./output/det_db/predicts_db.txt + use_amp: False + amp_level: O2 + amp_custom_black_list: ['exp'] Architecture: name: DistillationModel diff --git a/configs/kie/layoutlm_series/re_layoutlmv2_xfund_zh.yml b/configs/kie/layoutlm_series/re_layoutlmv2_xfund_zh.yml index 4b330d8d58bef2d549ec7e0fea5986746a23fbe4..3e3578d8cac1aadd484f583dbe0955f7c47fca73 100644 --- a/configs/kie/layoutlm_series/re_layoutlmv2_xfund_zh.yml +++ b/configs/kie/layoutlm_series/re_layoutlmv2_xfund_zh.yml @@ -11,11 +11,11 @@ Global: save_inference_dir: use_visualdl: False seed: 2022 - infer_img: ppstructure/docs/vqa/input/zh_val_21.jpg + infer_img: ppstructure/docs/kie/input/zh_val_21.jpg save_res_path: ./output/re_layoutlmv2_xfund_zh/res/ Architecture: - model_type: vqa + model_type: kie algorithm: &algorithm "LayoutLMv2" Transform: Backbone: diff --git a/configs/kie/layoutlm_series/re_layoutxlm_xfund_zh.yml b/configs/kie/layoutlm_series/re_layoutxlm_xfund_zh.yml index a092106eea10e0457419e5551dd75819adeddf1b..2401cf317987c5614a476065191e750587bc09b5 100644 --- a/configs/kie/layoutlm_series/re_layoutxlm_xfund_zh.yml +++ b/configs/kie/layoutlm_series/re_layoutxlm_xfund_zh.yml @@ -11,11 +11,11 @@ Global: save_inference_dir: use_visualdl: False seed: 2022 - infer_img: ppstructure/docs/vqa/input/zh_val_21.jpg + infer_img: ppstructure/docs/kie/input/zh_val_21.jpg save_res_path: ./output/re_layoutxlm_xfund_zh/res/ Architecture: - model_type: vqa + model_type: kie algorithm: &algorithm "LayoutXLM" Transform: Backbone: diff --git a/configs/kie/layoutlm_series/ser_layoutlm_xfund_zh.yml b/configs/kie/layoutlm_series/ser_layoutlm_xfund_zh.yml index 8c754dd8c542b12de4ee493052407bb0da687fd0..34c7d4114062e9227d48ad5684024e2776e68447 100644 --- a/configs/kie/layoutlm_series/ser_layoutlm_xfund_zh.yml +++ b/configs/kie/layoutlm_series/ser_layoutlm_xfund_zh.yml @@ -11,11 +11,11 @@ Global: save_inference_dir: use_visualdl: False seed: 2022 - infer_img: ppstructure/docs/vqa/input/zh_val_42.jpg + infer_img: ppstructure/docs/kie/input/zh_val_42.jpg save_res_path: ./output/re_layoutlm_xfund_zh/res Architecture: - model_type: vqa + model_type: kie algorithm: 
&algorithm "LayoutLM" Transform: Backbone: diff --git a/configs/kie/layoutlm_series/ser_layoutlmv2_xfund_zh.yml b/configs/kie/layoutlm_series/ser_layoutlmv2_xfund_zh.yml index 3c0ffabe4465e36e5699a135a9ed0b6254cbf20b..c5e833524011b03110db3bd6f4bf845db8473922 100644 --- a/configs/kie/layoutlm_series/ser_layoutlmv2_xfund_zh.yml +++ b/configs/kie/layoutlm_series/ser_layoutlmv2_xfund_zh.yml @@ -11,11 +11,11 @@ Global: save_inference_dir: use_visualdl: False seed: 2022 - infer_img: ppstructure/docs/vqa/input/zh_val_42.jpg + infer_img: ppstructure/docs/kie/input/zh_val_42.jpg save_res_path: ./output/ser_layoutlmv2_xfund_zh/res/ Architecture: - model_type: vqa + model_type: kie algorithm: &algorithm "LayoutLMv2" Transform: Backbone: diff --git a/configs/kie/layoutlm_series/ser_layoutxlm_xfund_zh.yml b/configs/kie/layoutlm_series/ser_layoutxlm_xfund_zh.yml index 18f87bdebc249940ef3ec1897af3ad1a240f3705..abcfec2d16f13d4b4266633dbb509e0fba6d931f 100644 --- a/configs/kie/layoutlm_series/ser_layoutxlm_xfund_zh.yml +++ b/configs/kie/layoutlm_series/ser_layoutxlm_xfund_zh.yml @@ -11,11 +11,11 @@ Global: save_inference_dir: use_visualdl: False seed: 2022 - infer_img: ppstructure/docs/vqa/input/zh_val_42.jpg + infer_img: ppstructure/docs/kie/input/zh_val_42.jpg save_res_path: ./output/ser_layoutxlm_xfund_zh/res Architecture: - model_type: vqa + model_type: kie algorithm: &algorithm "LayoutXLM" Transform: Backbone: diff --git a/configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml b/configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml index 89f7d5c3cb74854bb9fe7e28fdc8365ed37655be..ea9f50ef56ec8b169333263c1d5e96586f9472b3 100644 --- a/configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml +++ b/configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml @@ -11,11 +11,13 @@ Global: save_inference_dir: use_visualdl: False seed: 2022 - infer_img: ppstructure/docs/vqa/input/zh_val_21.jpg + infer_img: ppstructure/docs/kie/input/zh_val_21.jpg save_res_path: ./output/re/xfund_zh/with_gt + kie_rec_model_dir: + kie_det_model_dir: Architecture: - model_type: vqa + model_type: kie algorithm: &algorithm "LayoutXLM" Transform: Backbone: diff --git a/configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh_udml.yml b/configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh_udml.yml index c1bfdb6c6cee1c9618602016fec6cc1ec0a7b3bf..b96528d2738e7cfb2575feca4146af1eed0c5d2f 100644 --- a/configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh_udml.yml +++ b/configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh_udml.yml @@ -11,11 +11,11 @@ Global: save_inference_dir: use_visualdl: False seed: 2022 - infer_img: ppstructure/docs/vqa/input/zh_val_21.jpg + infer_img: ppstructure/docs/kie/input/zh_val_21.jpg save_res_path: ./output/re/xfund_zh/with_gt Architecture: - model_type: &model_type "vqa" + model_type: &model_type "kie" name: DistillationModel algorithm: Distillation Models: diff --git a/configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml b/configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml index d54125db64cef289457c4b855fe9bded3fa4149f..b8aa44dde8fd3fdc4ff14bbca20513b95178cdb0 100644 --- a/configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml +++ b/configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml @@ -11,16 +11,18 @@ Global: save_inference_dir: use_visualdl: False seed: 2022 - infer_img: ppstructure/docs/vqa/input/zh_val_42.jpg + infer_img: ppstructure/docs/kie/input/zh_val_42.jpg # if you want to predict using the groundtruth ocr info, # you can use the following config # infer_img: train_data/XFUND/zh_val/val.json # infer_mode: False 
save_res_path: ./output/ser/xfund_zh/res + kie_rec_model_dir: + kie_det_model_dir: Architecture: - model_type: vqa + model_type: kie algorithm: &algorithm "LayoutXLM" Transform: Backbone: diff --git a/configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml b/configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml index 6f0961c8e80312ab26a8d1649bf2bb10f8792efb..238bbd2b2c7083b5534062afd3e6c11a87494a56 100644 --- a/configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml +++ b/configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml @@ -11,12 +11,12 @@ Global: save_inference_dir: use_visualdl: False seed: 2022 - infer_img: ppstructure/docs/vqa/input/zh_val_42.jpg + infer_img: ppstructure/docs/kie/input/zh_val_42.jpg save_res_path: ./output/ser_layoutxlm_xfund_zh/res Architecture: - model_type: &model_type "vqa" + model_type: &model_type "kie" name: DistillationModel algorithm: Distillation Models: diff --git a/configs/table/SLANet_ch.yml b/configs/table/SLANet_ch.yml new file mode 100644 index 0000000000000000000000000000000000000000..997ff0a77b5ea824957abc1d32a7ba7f70abc12c --- /dev/null +++ b/configs/table/SLANet_ch.yml @@ -0,0 +1,141 @@ +Global: + use_gpu: True + epoch_num: 400 + log_smooth_window: 20 + print_batch_step: 20 + save_model_dir: ./output/SLANet_ch + save_epoch_step: 400 + # evaluation is run every 331 iterations after the 0th iteration + eval_batch_step: [0, 331] + cal_metric_during_train: True + pretrained_model: + checkpoints: + save_inference_dir: ./output/SLANet_ch/infer + use_visualdl: False + infer_img: doc/table/table.jpg + # for data or label process + character_dict_path: ppocr/utils/dict/table_structure_dict_ch.txt + character_type: en + max_text_length: &max_text_length 500 + box_format: &box_format xyxyxyxy # 'xywh', 'xyxy', 'xyxyxyxy' + infer_mode: False + use_sync_bn: True + save_res_path: output/infer + +Optimizer: + name: Adam + beta1: 0.9 + beta2: 0.999 + clip_norm: 5.0 + lr: + learning_rate: 0.001 + regularizer: + name: 'L2' + factor: 0.00000 + +Architecture: + model_type: table + algorithm: SLANet + Backbone: + name: PPLCNet + scale: 1.0 + pretrained: True + use_ssld: True + Neck: + name: CSPPAN + out_channels: 96 + Head: + name: SLAHead + hidden_size: 256 + max_text_length: *max_text_length + loc_reg_num: &loc_reg_num 8 + +Loss: + name: SLALoss + structure_weight: 1.0 + loc_weight: 2.0 + loc_loss: smooth_l1 + +PostProcess: + name: TableLabelDecode + merge_no_span_structure: &merge_no_span_structure True + +Metric: + name: TableMetric + main_indicator: acc + compute_bbox_metric: False + loc_reg_num: *loc_reg_num + box_format: *box_format + del_thead_tbody: True + +Train: + dataset: + name: PubTabDataSet + data_dir: train_data/table/train/ + label_file_list: [train_data/table/train.txt] + transforms: + - DecodeImage: + img_mode: BGR + channel_first: False + - TableLabelEncode: + learn_empty_box: False + merge_no_span_structure: *merge_no_span_structure + replace_empty_cell_token: False + loc_reg_num: *loc_reg_num + max_text_length: *max_text_length + - TableBoxEncode: + in_box_format: *box_format + out_box_format: *box_format + - ResizeTableImage: + max_len: 488 + - NormalizeImage: + scale: 1./255. 
+ mean: [0.485, 0.456, 0.406] + std: [0.229, 0.224, 0.225] + order: 'hwc' + - PaddingTableImage: + size: [488, 488] + - ToCHWImage: + - KeepKeys: + keep_keys: [ 'image', 'structure', 'bboxes', 'bbox_masks', 'shape' ] + loader: + shuffle: True + batch_size_per_card: 48 + drop_last: True + num_workers: 1 + +Eval: + dataset: + name: PubTabDataSet + data_dir: train_data/table/val/ + label_file_list: [train_data/table/val.txt] + transforms: + - DecodeImage: + img_mode: BGR + channel_first: False + - TableLabelEncode: + learn_empty_box: False + merge_no_span_structure: *merge_no_span_structure + replace_empty_cell_token: False + loc_reg_num: *loc_reg_num + max_text_length: *max_text_length + - TableBoxEncode: + in_box_format: *box_format + out_box_format: *box_format + - ResizeTableImage: + max_len: 488 + - NormalizeImage: + scale: 1./255. + mean: [0.485, 0.456, 0.406] + std: [0.229, 0.224, 0.225] + order: 'hwc' + - PaddingTableImage: + size: [488, 488] + - ToCHWImage: + - KeepKeys: + keep_keys: [ 'image', 'structure', 'bboxes', 'bbox_masks', 'shape' ] + loader: + shuffle: False + drop_last: False + batch_size_per_card: 48 + num_workers: 1 diff --git a/configs/table/table_mv3.yml b/configs/table/table_mv3.yml index 16c1457442237fc9711b9c3f6dc47625f242956c..9355a236e15b60db18e8715c2702701fd5d36c71 100755 --- a/configs/table/table_mv3.yml +++ b/configs/table/table_mv3.yml @@ -17,7 +17,7 @@ Global: # for data or label process character_dict_path: ppocr/utils/dict/table_structure_dict.txt character_type: en - max_text_length: &max_text_length 800 + max_text_length: &max_text_length 500 box_format: &box_format 'xyxy' # 'xywh', 'xyxy', 'xyxyxyxy' infer_mode: False @@ -38,7 +38,8 @@ Architecture: Backbone: name: MobileNetV3 scale: 1.0 - model_name: large + model_name: small + disable_se: true Head: name: TableAttentionHead hidden_size: 256 @@ -89,7 +90,7 @@ Train: keep_keys: [ 'image', 'structure', 'bboxes', 'bbox_masks', 'shape' ] loader: shuffle: True - batch_size_per_card: 32 + batch_size_per_card: 48 drop_last: True num_workers: 1 @@ -124,5 +125,5 @@ Eval: loader: shuffle: False drop_last: False - batch_size_per_card: 16 + batch_size_per_card: 48 num_workers: 1 diff --git a/deploy/android_demo/app/src/main/cpp/native.cpp b/deploy/android_demo/app/src/main/cpp/native.cpp index ced932556f09244d1e9e962e7b75461203a7cc3a..4961e5ecf141bb50701ecf9c3654a54f062937ce 100644 --- a/deploy/android_demo/app/src/main/cpp/native.cpp +++ b/deploy/android_demo/app/src/main/cpp/native.cpp @@ -47,7 +47,7 @@ str_to_cpu_mode(const std::string &cpu_mode) { std::string upper_key; std::transform(cpu_mode.cbegin(), cpu_mode.cend(), upper_key.begin(), ::toupper); - auto index = cpu_mode_map.find(upper_key); + auto index = cpu_mode_map.find(upper_key.c_str()); if (index == cpu_mode_map.end()) { LOGE("cpu_mode not found %s", upper_key.c_str()); return paddle::lite_api::LITE_POWER_HIGH; @@ -116,4 +116,4 @@ Java_com_baidu_paddle_lite_demo_ocr_OCRPredictorNative_release( ppredictor::OCR_PPredictor *ppredictor = (ppredictor::OCR_PPredictor *)java_pointer; delete ppredictor; -} \ No newline at end of file +} diff --git a/deploy/android_demo/app/src/main/java/com/baidu/paddle/lite/demo/ocr/OCRPredictorNative.java b/deploy/android_demo/app/src/main/java/com/baidu/paddle/lite/demo/ocr/OCRPredictorNative.java index 622da2a3f9a1233167e777e62b687c1f246df01f..41fa183dea1d968582dbedf4e831c55b043ae00f 100644 --- a/deploy/android_demo/app/src/main/java/com/baidu/paddle/lite/demo/ocr/OCRPredictorNative.java +++ 
b/deploy/android_demo/app/src/main/java/com/baidu/paddle/lite/demo/ocr/OCRPredictorNative.java @@ -54,7 +54,7 @@ public class OCRPredictorNative { } public void destory() { - if (nativePointer > 0) { + if (nativePointer != 0) { release(nativePointer); nativePointer = 0; } diff --git a/deploy/cpp_infer/docs/windows_vs2019_build.md b/deploy/cpp_infer/docs/windows_vs2019_build.md index 4f391d925008b4bffcbd123e937eb608f502c646..bcaefa46f83a30a4c232add78dc2e9f521b9f84f 100644 --- a/deploy/cpp_infer/docs/windows_vs2019_build.md +++ b/deploy/cpp_infer/docs/windows_vs2019_build.md @@ -109,8 +109,10 @@ CUDA_LIB、CUDNN_LIB、TENSORRT_DIR、WITH_GPU、WITH_TENSORRT 运行之前,将下面文件拷贝到`build/Release/`文件夹下 1. `paddle_inference/paddle/lib/paddle_inference.dll` -2. `opencv/build/x64/vc15/bin/opencv_world455.dll` -3. 如果使用openblas版本的预测库还需要拷贝 `paddle_inference/third_party/install/openblas/lib/openblas.dll` +2. `paddle_inference/third_party/install/onnxruntime/lib/onnxruntime.dll` +3. `paddle_inference/third_party/install/paddle2onnx/lib/paddle2onnx.dll` +4. `opencv/build/x64/vc15/bin/opencv_world455.dll` +5. 如果使用openblas版本的预测库还需要拷贝 `paddle_inference/third_party/install/openblas/lib/openblas.dll` ### Step4: 预测 diff --git a/deploy/slim/quantization/README_en.md b/deploy/slim/quantization/README_en.md index 33b2c4784afa4be68c8b9db1a02d83013c886655..c6796ae9dc256496308e432023c45ef1026c3d92 100644 --- a/deploy/slim/quantization/README_en.md +++ b/deploy/slim/quantization/README_en.md @@ -73,4 +73,4 @@ python deploy/slim/quantization/export_model.py -c configs/det/ch_ppocr_v2.0/ch_ The numerical range of the quantized model parameters derived from the above steps is still FP32, but the numerical range of the parameters is int8. The derived model can be converted through the `opt tool` of PaddleLite. 
-For quantitative model deployment, please refer to [Mobile terminal model deployment](../../lite/readme_en.md) +For quantitative model deployment, please refer to [Mobile terminal model deployment](../../lite/readme.md) diff --git a/doc/doc_ch/algorithm.md b/doc/doc_ch/algorithm.md deleted file mode 100644 index c91fc783f943f7692c0203253a2cf585f0c1e5b1..0000000000000000000000000000000000000000 --- a/doc/doc_ch/algorithm.md +++ /dev/null @@ -1,14 +0,0 @@ -# 前沿算法与模型 - -PaddleOCR将**持续新增**支持OCR领域前沿算法与模型,已支持的模型与使用教程可点击下方列表查看: - -- [文本检测算法](./algorithm_overview.md#11-%E6%96%87%E6%9C%AC%E6%A3%80%E6%B5%8B%E7%AE%97%E6%B3%95) -- [文本识别算法](./algorithm_overview.md#12-%E6%96%87%E6%9C%AC%E8%AF%86%E5%88%AB%E7%AE%97%E6%B3%95) -- [端到端算法](./algorithm_overview.md#2-%E6%96%87%E6%9C%AC%E8%AF%86%E5%88%AB%E7%AE%97%E6%B3%95) -- [表格识别](./algorithm_overview.md#3-%E8%A1%A8%E6%A0%BC%E8%AF%86%E5%88%AB%E7%AE%97%E6%B3%95) - -**欢迎广大开发者合作共建,贡献更多算法,合入有奖🎁!具体可查看[社区常规赛](https://github.com/PaddlePaddle/PaddleOCR/issues/4982)。** - -新增算法可参考如下教程: - -- [使用PaddleOCR架构添加新算法](./add_new_algorithm.md) diff --git a/doc/doc_ch/algorithm_kie_layoutxlm.md b/doc/doc_ch/algorithm_kie_layoutxlm.md index 8b50e98c1c4680809287472baca4f1c88d115704..e693be49b7bc89e04b169fe74cf76525b2494948 100644 --- a/doc/doc_ch/algorithm_kie_layoutxlm.md +++ b/doc/doc_ch/algorithm_kie_layoutxlm.md @@ -66,10 +66,10 @@ LayoutXLM模型基于SER任务进行推理,可以执行如下命令: ```bash cd ppstructure -python3 vqa/predict_vqa_token_ser.py \ - --vqa_algorithm=LayoutXLM \ +python3 kie/predict_kie_token_ser.py \ + --kie_algorithm=LayoutXLM \ --ser_model_dir=../inference/ser_layoutxlm_infer \ - --image_dir=./docs/vqa/input/zh_val_42.jpg \ + --image_dir=./docs/kie/input/zh_val_42.jpg \ --ser_dict_path=../train_data/XFUND/class_list_xfun.txt \ --vis_font_path=../doc/fonts/simfang.ttf ``` @@ -77,7 +77,7 @@ python3 vqa/predict_vqa_token_ser.py \ SER可视化结果默认保存到`./output`文件夹里面,结果示例如下:
- +
diff --git a/doc/doc_ch/algorithm_kie_vi_layoutxlm.md b/doc/doc_ch/algorithm_kie_vi_layoutxlm.md index 155849a6c91bbd94be89a5f59e1a77bc68609d98..f1bb4b1e62736e88594196819dcc41980f1716bf 100644 --- a/doc/doc_ch/algorithm_kie_vi_layoutxlm.md +++ b/doc/doc_ch/algorithm_kie_vi_layoutxlm.md @@ -59,10 +59,10 @@ VI-LayoutXLM模型基于SER任务进行推理,可以执行如下命令: ```bash cd ppstructure -python3 vqa/predict_vqa_token_ser.py \ - --vqa_algorithm=LayoutXLM \ +python3 kie/predict_kie_token_ser.py \ + --kie_algorithm=LayoutXLM \ --ser_model_dir=../inference/ser_vi_layoutxlm_infer \ - --image_dir=./docs/vqa/input/zh_val_42.jpg \ + --image_dir=./docs/kie/input/zh_val_42.jpg \ --ser_dict_path=../train_data/XFUND/class_list_xfun.txt \ --vis_font_path=../doc/fonts/simfang.ttf \ --ocr_order_method="tb-yx" @@ -71,7 +71,7 @@ python3 vqa/predict_vqa_token_ser.py \ SER可视化结果默认保存到`./output`文件夹里面,结果示例如下:
- +
diff --git a/doc/doc_ch/algorithm_overview.md b/doc/doc_ch/algorithm_overview.md index b889d0b8ffbc190664b278a50ac867f1e14cbb7d..858dc02b9d21981ce3b465f33ce494b290db51fb 100755 --- a/doc/doc_ch/algorithm_overview.md +++ b/doc/doc_ch/algorithm_overview.md @@ -1,4 +1,4 @@ -# 算法汇总 +# 前沿算法与模型 - [1. 两阶段OCR算法](#1) - [1.1 文本检测算法](#11) @@ -7,8 +7,13 @@ - [3. 表格识别算法](#3) - [4. 关键信息抽取算法](#4) +本文给出了PaddleOCR已支持的OCR算法列表,以及每个算法在**英文公开数据集**上的模型和指标,主要用于算法简介和算法性能对比,更多包括中文在内的其他数据集上的模型请参考[PP-OCRv3 系列模型下载](./models_list.md)。 + +>> +PaddleOCR将**持续新增**支持OCR领域前沿算法与模型,**欢迎广大开发者合作共建,贡献更多算法,合入有奖🎁!具体可查看[社区常规赛](https://github.com/PaddlePaddle/PaddleOCR/issues/4982)。** +>> +新增算法可参考教程:[使用PaddleOCR架构添加新算法](./add_new_algorithm.md) -本文给出了PaddleOCR已支持的OCR算法列表,以及每个算法在**英文公开数据集**上的模型和指标,主要用于算法简介和算法性能对比,更多包括中文在内的其他数据集上的模型请参考[PP-OCR v2.0 系列模型下载](./models_list.md)。 diff --git a/doc/doc_ch/algorithm_rec_srn.md b/doc/doc_ch/algorithm_rec_srn.md index ca7961359eb902fafee959b26d02f324aece233a..dd61a388c7024fabdadec1c120bd3341ed0197cc 100644 --- a/doc/doc_ch/algorithm_rec_srn.md +++ b/doc/doc_ch/algorithm_rec_srn.md @@ -78,7 +78,7 @@ python3 tools/export_model.py -c configs/rec/rec_r50_fpn_srn.yml -o Global.pretr SRN文本识别模型推理,可以执行如下命令: ``` -python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/en/word_1.png" --rec_model_dir="./inference/rec_srn/" --rec_image_shape="1,64,256" --rec_char_type="ch" --rec_algorithm="SRN" --rec_char_dict_path=./ppocr/utils/ic15_dict.txt --use_space_char=False +python3 tools/infer/predict_rec.py --image_dir="./doc/imgs_words/en/word_1.png" --rec_model_dir="./inference/rec_srn/" --rec_image_shape="1,64,256" --rec_algorithm="SRN" --rec_char_dict_path=./ppocr/utils/ic15_dict.txt --use_space_char=False ``` diff --git a/doc/doc_ch/dataset/kie_datasets.md b/doc/doc_ch/dataset/kie_datasets.md index 4535ae5f8a1ac6d2dc3d4585f33a3ec290e2373e..7f8d14cbc4ad724621f28c7d6ca1f8c2ac79f097 100644 --- a/doc/doc_ch/dataset/kie_datasets.md +++ b/doc/doc_ch/dataset/kie_datasets.md @@ -1,6 +1,7 @@ -# 信息抽取数据集 +# 关键信息抽取数据集 这里整理了常见的DocVQA数据集,持续更新中,欢迎各位小伙伴贡献数据集~ + - [FUNSD数据集](#funsd) - [XFUND数据集](#xfund) - [wildreceipt数据集](#wildreceipt) diff --git a/doc/doc_ch/kie.md b/doc/doc_ch/kie.md index da86797a21648d9b987a55493b714f6b21f21c01..b6f38a662fd98597011c5a51ff29c417d880ca17 100644 --- a/doc/doc_ch/kie.md +++ b/doc/doc_ch/kie.md @@ -64,7 +64,7 @@ zh_train_1.jpg [{"transcription": "中国人体器官捐献", "label": "other" 验证集构建方式与训练集相同。 -* 字典文件 +**(3)字典文件** 训练集与验证集中的文本行包含标签信息,所有标签的列表存在字典文件中(如`class_list.txt`),字典文件中的每一行表示为一个类别名称。 @@ -103,7 +103,7 @@ HEADER ## 1.3. 
数据下载 -如果你没有本地数据集,可以从[XFUND](https://github.com/doc-analysis/XFUND)或者[FUNSD](https://guillaumejaume.github.io/FUNSD/)官网下载数据,然后使用XFUND与FUNSD的处理脚本([XFUND](../../ppstructure/vqa/tools/trans_xfun_data.py), [FUNSD](../../ppstructure/vqa/tools/trans_funsd_label.py)),生成用于PaddleOCR训练的数据格式,并使用公开数据集快速体验关键信息抽取的流程。 +如果你没有本地数据集,可以从[XFUND](https://github.com/doc-analysis/XFUND)或者[FUNSD](https://guillaumejaume.github.io/FUNSD/)官网下载数据,然后使用XFUND与FUNSD的处理脚本([XFUND](../../ppstructure/kie/tools/trans_xfun_data.py), [FUNSD](../../ppstructure/kie/tools/trans_funsd_label.py)),生成用于PaddleOCR训练的数据格式,并使用公开数据集快速体验关键信息抽取的流程。 更多关于公开数据集的介绍,请参考[关键信息抽取数据集说明文档](./dataset/kie_datasets.md)。 @@ -209,7 +209,7 @@ Architecture: num_classes: &num_classes 7 PostProcess: - name: VQASerTokenLayoutLMPostProcess + name: kieSerTokenLayoutLMPostProcess # 修改字典文件的路径为你自定义的数据集的字典路径 class_path: &class_path train_data/XFUND/class_list_xfun.txt @@ -347,25 +347,25 @@ output/ser_vi_layoutxlm_xfund_zh/ ```bash -python3 tools/infer_vqa_token_ser.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./output/ser_vi_layoutxlm_xfund_zh/best_accuracy Global.infer_img=./ppstructure/docs/vqa/input/zh_val_42.jpg +python3 tools/infer_kie_token_ser.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./output/ser_vi_layoutxlm_xfund_zh/best_accuracy Global.infer_img=./ppstructure/docs/kie/input/zh_val_42.jpg ``` 预测图片如下所示,图片会存储在`Global.save_res_path`路径中。
- +
预测过程中,默认会加载PP-OCRv3的检测识别模型,用于OCR的信息抽取,如果希望加载预先获取的OCR结果,可以使用下面的方式进行预测,指定`Global.infer_img`为标注文件,其中包含图片路径以及OCR信息,同时指定`Global.infer_mode`为False,表示此时不使用OCR预测引擎。 ```bash -python3 tools/infer_vqa_token_ser.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./output/ser_vi_layoutxlm_xfund_zh/best_accuracy Global.infer_img=./train_data/XFUND/zh_val/val.json Global.infer_mode=False +python3 tools/infer_kie_token_ser.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./output/ser_vi_layoutxlm_xfund_zh/best_accuracy Global.infer_img=./train_data/XFUND/zh_val/val.json Global.infer_mode=False ``` 对于上述图片,如果使用标注的OCR结果进行信息抽取,预测结果如下。
- +
可以看出,部分检测框信息更加准确,但是整体信息抽取识别结果基本一致。 @@ -375,20 +375,26 @@ python3 tools/infer_vqa_token_ser.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxl ```bash -python3 ./tools/infer_vqa_token_ser_re.py -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./pretrain_models/re_vi_layoutxlm_udml_xfund_zh/re_layoutxlm_xfund_zh_v4_udml/best_accuracy/ Global.infer_img=./train_data/XFUND/zh_val/image/ -c_ser configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o_ser Architecture.Backbone.checkpoints=pretrain_models/ser_vi_layoutxlm_udml_xfund_zh/best_accuracy/ +python3 ./tools/infer_kie_token_ser_re.py \ + -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/re_vi_layoutxlm_udml_xfund_zh/best_accuracy/ \ + Global.infer_img=./train_data/XFUND/zh_val/image/ \ + -c_ser configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o_ser Architecture.Backbone.checkpoints=pretrain_models/ \ + ser_vi_layoutxlm_udml_xfund_zh/best_accuracy/ ``` 预测结果如下所示。
- +
如果希望使用标注或者预先获取的OCR信息进行关键信息抽取,同上,可以指定`Global.infer_mode`为False,指定`Global.infer_img`为标注文件。 ```bash -python3 ./tools/infer_vqa_token_ser_re.py -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./pretrain_models/re_vi_layoutxlm_udml_xfund_zh/re_layoutxlm_xfund_zh_v4_udml/best_accuracy/ Global.infer_img=./train_data/XFUND/zh_val/val.json Global.infer_mode=False -c_ser configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o_ser Architecture.Backbone.checkpoints=pretrain_models/ser_vi_layoutxlm_udml_xfund_zh/best_accuracy/ +python3 ./tools/infer_kie_token_ser_re.py -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./pretrain_models/re_vi_layoutxlm_udml_xfund_zh/re_layoutxlm_xfund_zh_v4_udml/best_accuracy/ Global.infer_img=./train_data/XFUND/zh_val/val.json Global.infer_mode=False -c_ser configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o_ser Architecture.Backbone.checkpoints=pretrain_models/ser_vi_layoutxlm_udml_xfund_zh/best_accuracy/ ``` 其中`c_ser`表示SER的配置文件,`o_ser` 后面需要加上待修改的SER模型与配置文件,如预训练权重等。 @@ -397,7 +403,7 @@ python3 ./tools/infer_vqa_token_ser_re.py -c configs/kie/vi_layoutxlm/re_vi_layo 预测结果如下所示。
- +
可以看出,直接使用标注的OCR结果的RE预测结果要更加准确一些。 @@ -417,8 +423,8 @@ inference 模型(`paddle.jit.save`保存的模型) ```bash # -c 后面设置训练算法的yml配置文件 # -o 配置可选参数 -# Global.pretrained_model 参数设置待转换的训练模型地址。 -# Global.save_inference_dir参数设置转换的模型将保存的地址。 +# Architecture.Backbone.checkpoints 参数设置待转换的训练模型地址 +# Global.save_inference_dir 参数设置转换的模型将保存的地址 python3 tools/export_model.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./output/ser_vi_layoutxlm_xfund_zh/best_accuracy Global.save_inference_dir=./inference/ser_vi_layoutxlm ``` @@ -440,10 +446,10 @@ VI-LayoutXLM模型基于SER任务进行推理,可以执行如下命令: ```bash cd ppstructure -python3 vqa/predict_vqa_token_ser.py \ - --vqa_algorithm=LayoutXLM \ +python3 kie/predict_kie_token_ser.py \ + --kie_algorithm=LayoutXLM \ --ser_model_dir=../inference/ser_vi_layoutxlm \ - --image_dir=./docs/vqa/input/zh_val_42.jpg \ + --image_dir=./docs/kie/input/zh_val_42.jpg \ --ser_dict_path=../train_data/XFUND/class_list_xfun.txt \ --vis_font_path=../doc/fonts/simfang.ttf \ --ocr_order_method="tb-yx" @@ -452,7 +458,7 @@ python3 vqa/predict_vqa_token_ser.py \ 可视化SER结果结果默认保存到`./output`文件夹里面。结果示例如下:
- +
diff --git a/doc/doc_en/algorithm_en.md b/doc/doc_en/algorithm_en.md deleted file mode 100644 index c880336b4ad528eab2cce479edf11fce0b43f435..0000000000000000000000000000000000000000 --- a/doc/doc_en/algorithm_en.md +++ /dev/null @@ -1,11 +0,0 @@ -# Academic Algorithms and Models - -PaddleOCR will add cutting-edge OCR algorithms and models continuously. Check out the supported models and tutorials by clicking the following list: - - -- [text detection algorithms](./algorithm_overview_en.md#11) -- [text recognition algorithms](./algorithm_overview_en.md#12) -- [end-to-end algorithms](./algorithm_overview_en.md#2) -- [table recognition algorithms](./algorithm_overview_en.md#3) - -Developers are welcome to contribute more algorithms! Please refer to [add new algorithm](./add_new_algorithm_en.md) guideline. diff --git a/doc/doc_en/algorithm_kie_layoutxlm_en.md b/doc/doc_en/algorithm_kie_layoutxlm_en.md new file mode 100644 index 0000000000000000000000000000000000000000..910c1f4d497a6e503f0a7a5ec26dbeceb2d321a1 --- /dev/null +++ b/doc/doc_en/algorithm_kie_layoutxlm_en.md @@ -0,0 +1,162 @@ +# KIE Algorithm - LayoutXLM + + +- [1. Introduction](#1-introduction) +- [2. Environment](#2-environment) +- [3. Model Training / Evaluation / Prediction](#3-model-training--evaluation--prediction) +- [4. Inference and Deployment](#4-inference-and-deployment) + - [4.1 Python Inference](#41-python-inference) + - [4.2 C++ Inference](#42-c-inference) + - [4.3 Serving](#43-serving) + - [4.4 More](#44-more) +- [5. FAQ](#5-faq) +- [Citation](#Citation) + + +## 1. Introduction + +Paper: + +> [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836) +> +> Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei +> +> 2021 + +On XFUND_zh dataset, the algorithm reproduction Hmean is as follows. + +|Model|Backbone|Task |Cnnfig|Hmean|Download link| +| --- | --- |--|--- | --- | --- | +|LayoutXLM|LayoutXLM-base|SER |[ser_layoutxlm_xfund_zh.yml](../../configs/kie/layoutlm_series/ser_layoutxlm_xfund_zh.yml)|90.38%|[trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar)/[inference model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh_infer.tar)| +|LayoutXLM|LayoutXLM-base|RE | [re_layoutxlm_xfund_zh.yml](../../configs/kie/layoutlm_series/re_layoutxlm_xfund_zh.yml)|74.83%|[trained model](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar)/[inference model(coming soon)]()| + + +## 2. Environment + +Please refer to ["Environment Preparation"](./environment_en.md) to configure the PaddleOCR environment, and refer to ["Project Clone"](./clone_en.md) to clone the project code. + + +## 3. Model Training / Evaluation / Prediction + +Please refer to [KIE tutorial](./kie_en.md)。PaddleOCR has modularized the code structure, so that you only need to **replace the configuration file** to train different models. + + + +## 4. Inference and Deployment + +### 4.1 Python Inference + +**Note:** Currently, the RE model inference process is still in the process of adaptation. We take SER model as an example to introduce the KIE process based on LayoutXLM model. + +First, we need to export the trained model into inference model. Take LayoutXLM model trained on XFUND_zh as an example ([trained model download link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar)). Use the following command to export. 
+ + +``` bash +wget https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar +tar -xf ser_LayoutXLM_xfun_zh.tar +python3 tools/export_model.py -c configs/kie/layoutlm_series/ser_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./ser_LayoutXLM_xfun_zh/best_accuracy Global.save_inference_dir=./inference/ser_layoutxlm +``` + +Use the following command to infer using LayoutXLM SER model. + +```bash +cd ppstructure +python3 kie/predict_kie_token_ser.py \ + --kie_algorithm=LayoutXLM \ + --ser_model_dir=../inference/ser_layoutxlm_infer \ + --image_dir=./docs/kie/input/zh_val_42.jpg \ + --ser_dict_path=../train_data/XFUND/class_list_xfun.txt \ + --vis_font_path=../doc/fonts/simfang.ttf +``` + +The SER visualization results are saved in the `./output` directory by default. The results are as follows. + + +
+ +
+ + +### 4.2 C++ Inference + +Not supported + +### 4.3 Serving + +Not supported + +### 4.4 More + +Not supported + +## 5. FAQ + +## Citation + +```bibtex +@article{DBLP:journals/corr/abs-2104-08836, + author = {Yiheng Xu and + Tengchao Lv and + Lei Cui and + Guoxin Wang and + Yijuan Lu and + Dinei Flor{\^{e}}ncio and + Cha Zhang and + Furu Wei}, + title = {LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich + Document Understanding}, + journal = {CoRR}, + volume = {abs/2104.08836}, + year = {2021}, + url = {https://arxiv.org/abs/2104.08836}, + eprinttype = {arXiv}, + eprint = {2104.08836}, + timestamp = {Thu, 14 Oct 2021 09:17:23 +0200}, + biburl = {https://dblp.org/rec/journals/corr/abs-2104-08836.bib}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{DBLP:journals/corr/abs-1912-13318, + author = {Yiheng Xu and + Minghao Li and + Lei Cui and + Shaohan Huang and + Furu Wei and + Ming Zhou}, + title = {LayoutLM: Pre-training of Text and Layout for Document Image Understanding}, + journal = {CoRR}, + volume = {abs/1912.13318}, + year = {2019}, + url = {http://arxiv.org/abs/1912.13318}, + eprinttype = {arXiv}, + eprint = {1912.13318}, + timestamp = {Mon, 01 Jun 2020 16:20:46 +0200}, + biburl = {https://dblp.org/rec/journals/corr/abs-1912-13318.bib}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{DBLP:journals/corr/abs-2012-14740, + author = {Yang Xu and + Yiheng Xu and + Tengchao Lv and + Lei Cui and + Furu Wei and + Guoxin Wang and + Yijuan Lu and + Dinei A. F. Flor{\^{e}}ncio and + Cha Zhang and + Wanxiang Che and + Min Zhang and + Lidong Zhou}, + title = {LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding}, + journal = {CoRR}, + volume = {abs/2012.14740}, + year = {2020}, + url = {https://arxiv.org/abs/2012.14740}, + eprinttype = {arXiv}, + eprint = {2012.14740}, + timestamp = {Tue, 27 Jul 2021 09:53:52 +0200}, + biburl = {https://dblp.org/rec/journals/corr/abs-2012-14740.bib}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} +``` diff --git a/doc/doc_en/algorithm_kie_sdmgr_en.md b/doc/doc_en/algorithm_kie_sdmgr_en.md new file mode 100644 index 0000000000000000000000000000000000000000..5b12b8c959e830015ffb173626ac5752ee9ecee0 --- /dev/null +++ b/doc/doc_en/algorithm_kie_sdmgr_en.md @@ -0,0 +1,130 @@ + +# KIE Algorithm - SDMGR + +- [1. Introduction](#1-introduction) +- [2. Environment](#2-environment) +- [3. Model Training / Evaluation / Prediction](#3-model-training--evaluation--prediction) +- [4. Inference and Deployment](#4-inference-and-deployment) + - [4.1 Python Inference](#41-python-inference) + - [4.2 C++ Inference](#42-c-inference) + - [4.3 Serving](#43-serving) + - [4.4 More](#44-more) +- [5. FAQ](#5-faq) +- [Citation](#Citation) + +## 1. Introduction + +Paper: + +> [Spatial Dual-Modality Graph Reasoning for Key Information Extraction](https://arxiv.org/abs/2103.14470) +> +> Hongbin Sun and Zhanghui Kuang and Xiaoyu Yue and Chenhao Lin and Wayne Zhang +> +> 2021 + +On wildreceipt dataset, the algorithm reproduction Hmean is as follows. + +|Model|Backbone |Cnnfig|Hmean|Download link| +| --- | --- | --- | --- | --- | +|SDMGR|VGG6|[configs/kie/sdmgr/kie_unet_sdmgr.yml](../../configs/kie/sdmgr/kie_unet_sdmgr.yml)|86.7%|[trained model]( https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/kie_vgg16.tar)/[inference model(coming soon)]()| + + + +## 2. 
Environment + +Please refer to ["Environment Preparation"](./environment_en.md) to configure the PaddleOCR environment, and refer to ["Project Clone"](./clone_en.md) to clone the project code. + + + +## 3. Model Training / Evaluation / Prediction + +SDMGR is a key information extraction algorithm that classifies each detected textline into predefined categories, such as order ID, invoice number, amount, etc. + +The training and test data are collected in the wildreceipt dataset. Use the following command to download the dataset. + + +```bash +wget https://paddleocr.bj.bcebos.com/ppstructure/dataset/wildreceipt.tar && tar xf wildreceipt.tar +``` + +Create a soft link to the dataset in the `PaddleOCR/train_data` directory. + +```bash +cd PaddleOCR/ && mkdir train_data && cd train_data +ln -s ../../wildreceipt ./ +``` + + +### 3.1 Model training + +The config file is `configs/kie/sdmgr/kie_unet_sdmgr.yml`, and the default dataset path is `train_data/wildreceipt`. + +Use the following command to train the model. + +```bash +python3 tools/train.py -c configs/kie/sdmgr/kie_unet_sdmgr.yml -o Global.save_model_dir=./output/kie/ +``` + +### 3.2 Model evaluation + +Use the following command to evaluate the model. + +```bash +python3 tools/eval.py -c configs/kie/sdmgr/kie_unet_sdmgr.yml -o Global.checkpoints=./output/kie/best_accuracy +``` + +An example of output information is shown below. + +```py +[2022/08/10 05:22:23] ppocr INFO: metric eval *************** +[2022/08/10 05:22:23] ppocr INFO: hmean:0.8670120239257812 +[2022/08/10 05:22:23] ppocr INFO: fps:10.18816520530961 +``` + +### 3.3 Model prediction + +Use the following command to load the model and predict. During the prediction, the text file storing the image path and OCR information needs to be loaded in advance. Use `Global.infer_img` to assign. + +```bash +python3 tools/infer_kie.py -c configs/kie/sdmgr/kie_unet_sdmgr.yml -o Global.checkpoints=kie_vgg16/best_accuracy Global.infer_img=./train_data/wildreceipt/1.txt +``` + +The visualization results and texts are saved in the `./output/sdmgr_kie/` directory by default. The results are as follows. + +
+ +
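+SDMGR assigns one of the wildreceipt categories (26 in total) to every textline. If you want to check the full list of category names while reading the results above, you can print the category list that comes with the converted dataset. The snippet below is a minimal sketch, not part of the SDMGR tools, and the file name `class_list.txt` is an assumption about the converted wildreceipt package, so adjust the path to your local copy.
+
+```python
+# Hedged sketch: list the wildreceipt categories so the predicted labels in the
+# visualization above can be interpreted. The file name is an assumption.
+class_path = "./train_data/wildreceipt/class_list.txt"
+
+with open(class_path, encoding="utf-8") as f:
+    # one category per line; some copies prefix each line with its numeric id
+    categories = [line.strip() for line in f if line.strip()]
+
+print(f"{len(categories)} categories")
+for name in categories:
+    print(name)
+```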
+
+## 4. Inference and Deployment
+
+### 4.1 Python Inference
+
+Not supported
+
+### 4.2 C++ Inference
+
+Not supported
+
+### 4.3 Serving
+
+Not supported
+
+### 4.4 More
+
+Not supported
+
+## 5. FAQ
+
+## Citation
+
+```bibtex
+@misc{sun2021spatial,
+      title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction},
+      author={Hongbin Sun and Zhanghui Kuang and Xiaoyu Yue and Chenhao Lin and Wayne Zhang},
+      year={2021},
+      eprint={2103.14470},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV}
+}
+```
diff --git a/doc/doc_en/algorithm_kie_vi_layoutxlm_en.md b/doc/doc_en/algorithm_kie_vi_layoutxlm_en.md
new file mode 100644
index 0000000000000000000000000000000000000000..12b6e1bddbd03b820ce33ba86de3d430a44f8987
--- /dev/null
+++ b/doc/doc_en/algorithm_kie_vi_layoutxlm_en.md
@@ -0,0 +1,156 @@
+# KIE Algorithm - VI-LayoutXLM
+
+
+- [1. Introduction](#1-introduction)
+- [2. Environment](#2-environment)
+- [3. Model Training / Evaluation / Prediction](#3-model-training--evaluation--prediction)
+- [4. Inference and Deployment](#4-inference-and-deployment)
+  - [4.1 Python Inference](#41-python-inference)
+  - [4.2 C++ Inference](#42-c-inference)
+  - [4.3 Serving](#43-serving)
+  - [4.4 More](#44-more)
+- [5. FAQ](#5-faq)
+- [Citation](#Citation)
+
+
+## 1. Introduction
+
+VI-LayoutXLM is improved based on LayoutXLM. In the process of downstream finetuning, the visual backbone network module is removed, and the model inference speed is further improved with almost no loss of accuracy.
+
+On the XFUND_zh dataset, the reproduced Hmean of the algorithm is as follows.
+
+|Model|Backbone|Task |Config|Hmean|Download link|
+| --- | --- |---| --- | --- | --- |
+|VI-LayoutXLM |VI-LayoutXLM-base | SER |[ser_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml)|93.19%|[trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_pretrained.tar)/[inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_infer.tar)|
+|VI-LayoutXLM |VI-LayoutXLM-base |RE | [re_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh_udml.yml)|83.92%|[trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/re_vi_layoutxlm_xfund_pretrained.tar)/[inference model(coming soon)]()|
+
+## 2. Environment
+
+Please refer to ["Environment Preparation"](./environment_en.md) to configure the PaddleOCR environment, and refer to ["Project Clone"](./clone_en.md) to clone the project code.
+
+
+## 3. Model Training / Evaluation / Prediction
+
+Please refer to the [KIE tutorial](./kie_en.md). PaddleOCR has modularized the code structure, so that you only need to **replace the configuration file** to train different models.
+
+
+## 4. Inference and Deployment
+
+### 4.1 Python Inference
+
+**Note:** Currently, the RE model inference process is still being adapted. We take the SER model as an example to introduce the KIE process based on the VI-LayoutXLM model.
+
+First, we need to export the trained model into an inference model. Take the VI-LayoutXLM model trained on XFUND_zh as an example ([trained model download link](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_pretrained.tar)). Use the following command to export it.
+ + +``` bash +wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_pretrained.tar +tar -xf ser_vi_layoutxlm_xfund_pretrained.tar +python3 tools/export_model.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./ser_vi_layoutxlm_xfund_pretrained/best_accuracy Global.save_inference_dir=./inference/ser_vi_layoutxlm_infer +``` + +Use the following command to infer using VI-LayoutXLM SER model. + + +```bash +cd ppstructure +python3 kie/predict_kie_token_ser.py \ + --kie_algorithm=LayoutXLM \ + --ser_model_dir=../inference/ser_vi_layoutxlm_infer \ + --image_dir=./docs/kie/input/zh_val_42.jpg \ + --ser_dict_path=../train_data/XFUND/class_list_xfun.txt \ + --vis_font_path=../doc/fonts/simfang.ttf \ + --ocr_order_method="tb-yx" +``` + +The SER visualization results are saved in the `./output` folder by default. The results are as follows. + + +
+ +
+ + +### 4.2 C++ Inference + +Not supported + +### 4.3 Serving + +Not supported + +### 4.4 More + +Not supported + +## 5. FAQ + +## Citation + + +```bibtex +@article{DBLP:journals/corr/abs-2104-08836, + author = {Yiheng Xu and + Tengchao Lv and + Lei Cui and + Guoxin Wang and + Yijuan Lu and + Dinei Flor{\^{e}}ncio and + Cha Zhang and + Furu Wei}, + title = {LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich + Document Understanding}, + journal = {CoRR}, + volume = {abs/2104.08836}, + year = {2021}, + url = {https://arxiv.org/abs/2104.08836}, + eprinttype = {arXiv}, + eprint = {2104.08836}, + timestamp = {Thu, 14 Oct 2021 09:17:23 +0200}, + biburl = {https://dblp.org/rec/journals/corr/abs-2104-08836.bib}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{DBLP:journals/corr/abs-1912-13318, + author = {Yiheng Xu and + Minghao Li and + Lei Cui and + Shaohan Huang and + Furu Wei and + Ming Zhou}, + title = {LayoutLM: Pre-training of Text and Layout for Document Image Understanding}, + journal = {CoRR}, + volume = {abs/1912.13318}, + year = {2019}, + url = {http://arxiv.org/abs/1912.13318}, + eprinttype = {arXiv}, + eprint = {1912.13318}, + timestamp = {Mon, 01 Jun 2020 16:20:46 +0200}, + biburl = {https://dblp.org/rec/journals/corr/abs-1912-13318.bib}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} + +@article{DBLP:journals/corr/abs-2012-14740, + author = {Yang Xu and + Yiheng Xu and + Tengchao Lv and + Lei Cui and + Furu Wei and + Guoxin Wang and + Yijuan Lu and + Dinei A. F. Flor{\^{e}}ncio and + Cha Zhang and + Wanxiang Che and + Min Zhang and + Lidong Zhou}, + title = {LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding}, + journal = {CoRR}, + volume = {abs/2012.14740}, + year = {2020}, + url = {https://arxiv.org/abs/2012.14740}, + eprinttype = {arXiv}, + eprint = {2012.14740}, + timestamp = {Tue, 27 Jul 2021 09:53:52 +0200}, + biburl = {https://dblp.org/rec/journals/corr/abs-2012-14740.bib}, + bibsource = {dblp computer science bibliography, https://dblp.org} +} +``` diff --git a/doc/doc_en/algorithm_overview_en.md b/doc/doc_en/algorithm_overview_en.md index 3412ccbf76f6c04b61420a6abd91a55efb383db6..5bf569e3e1649cfabbe196be7e1a55d1caa3bf61 100755 --- a/doc/doc_en/algorithm_overview_en.md +++ b/doc/doc_en/algorithm_overview_en.md @@ -1,17 +1,21 @@ -# OCR Algorithms +# Algorithms -- [1. Two-stage Algorithms](#1) +- [1. Two-stage OCR Algorithms](#1) - [1.1 Text Detection Algorithms](#11) - [1.2 Text Recognition Algorithms](#12) -- [2. End-to-end Algorithms](#2) +- [2. End-to-end OCR Algorithms](#2) - [3. Table Recognition Algorithms](#3) +- [4. Key Information Extraction Algorithms](#4) +This tutorial lists the OCR algorithms supported by PaddleOCR, as well as the models and metrics of each algorithm on **English public datasets**. It is mainly used for algorithm introduction and algorithm performance comparison. For more models on other datasets including Chinese, please refer to [PP-OCRv3 models list](./models_list_en.md). + +>> +Developers are welcome to contribute more algorithms! Please refer to [add new algorithm](./add_new_algorithm_en.md) guideline. -This tutorial lists the OCR algorithms supported by PaddleOCR, as well as the models and metrics of each algorithm on **English public datasets**. It is mainly used for algorithm introduction and algorithm performance comparison. 
For more models on other datasets including Chinese, please refer to [PP-OCR v2.0 models list](./models_list_en.md). -## 1. Two-stage Algorithms +## 1. Two-stage OCR Algorithms @@ -98,11 +102,12 @@ Refer to [DTRB](https://arxiv.org/abs/1904.01906), the training and evaluation r -## 2. End-to-end Algorithms +## 2. End-to-end OCR Algorithms Supported end-to-end algorithms (Click the link to get the tutorial): - [x] [PGNet](./algorithm_e2e_pgnet_en.md) + ## 3. Table Recognition Algorithms @@ -114,3 +119,34 @@ On the PubTabNet dataset, the algorithm result is as follows: |Model|Backbone|Config|Acc|Download link| |---|---|---|---|---| |TableMaster|TableResNetExtra|[configs/table/table_master.yml](../../configs/table/table_master.yml)|77.47%|[trained](https://paddleocr.bj.bcebos.com/ppstructure/models/tablemaster/table_structure_tablemaster_train.tar) / [inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/tablemaster/table_structure_tablemaster_infer.tar)| + + + + +## 4. Key Information Extraction Algorithms + +Supported KIE algorithms (Click the link to get the tutorial): + +- [x] [VI-LayoutXLM](./algorithm_kie_vi_laoutxlm_en.md) +- [x] [LayoutLM](./algorithm_kie_laoutxlm_en.md) +- [x] [LayoutLMv2](./algorithm_kie_laoutxlm_en.md) +- [x] [LayoutXLM](./algorithm_kie_laoutxlm_en.md) +- [x] [SDMGR](./algorithm_kie_sdmgr_en.md) + +On wildreceipt dataset, the algorithm result is as follows: + +|Model|Backbone|Config|Hmean|Download link| +| --- | --- | --- | --- | --- | +|SDMGR|VGG6|[configs/kie/sdmgr/kie_unet_sdmgr.yml](../../configs/kie/sdmgr/kie_unet_sdmgr.yml)|86.7%|[trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/kie_vgg16.tar)| + +On XFUND_zh dataset, the algorithm result is as follows: + +|Model|Backbone|Task|Config|Hmean|Download link| +| --- | --- | --- | --- | --- | --- | +|VI-LayoutXLM| VI-LayoutXLM-base | SER | [ser_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml)|**93.19%**|[trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_pretrained.tar)| +|LayoutXLM| LayoutXLM-base | SER | [ser_layoutxlm_xfund_zh.yml](../../configs/kie/layoutlm_series/ser_layoutxlm_xfund_zh.yml)|90.38%|[trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar)| +|LayoutLM| LayoutLM-base | SER | [ser_layoutlm_xfund_zh.yml](../../configs/kie/layoutlm_series/ser_layoutlm_xfund_zh.yml)|77.31%|[trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLM_xfun_zh.tar)| +|LayoutLMv2| LayoutLMv2-base | SER | [ser_layoutlmv2_xfund_zh.yml](../../configs/kie/layoutlm_series/ser_layoutlmv2_xfund_zh.yml)|85.44%|[trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLMv2_xfun_zh.tar)| +|VI-LayoutXLM| VI-LayoutXLM-base | RE | [re_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh_udml.yml)|**83.92%**|[trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/re_vi_layoutxlm_xfund_pretrained.tar)| +|LayoutXLM| LayoutXLM-base | RE | [re_layoutxlm_xfund_zh.yml](../../configs/kie/layoutlm_series/re_layoutxlm_xfund_zh.yml)|74.83%|[trained model](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar)| +|LayoutLMv2| LayoutLMv2-base | RE | [re_layoutlmv2_xfund_zh.yml](../../configs/kie/layoutlm_series/re_layoutlmv2_xfund_zh.yml)|67.77%|[trained model](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutLMv2_xfun_zh.tar)| diff --git a/doc/doc_en/dataset/docvqa_datasets_en.md 
b/doc/doc_en/dataset/kie_datasets_en.md
similarity index 63%
rename from doc/doc_en/dataset/docvqa_datasets_en.md
rename to doc/doc_en/dataset/kie_datasets_en.md
index 820462c324318a391abe409412e8996f11b36279..3a8b744fc0b2653aab5c1435996a2ef73dd336e4 100644
--- a/doc/doc_en/dataset/docvqa_datasets_en.md
+++ b/doc/doc_en/dataset/kie_datasets_en.md
@@ -1,7 +1,9 @@
-## DocVQA dataset
-Here are the common DocVQA datasets, which are being updated continuously. Welcome to contribute datasets~
+## Key Information Extraction dataset
+
+Here are the common datasets for key information extraction (KIE), which are being updated continuously. Welcome to contribute datasets.
 - [FUNSD dataset](#funsd)
 - [XFUND dataset](#xfund)
+- [wildreceipt dataset](#3-wildreceipt-dataset)
 
 #### 1. FUNSD dataset
@@ -25,3 +27,21 @@ Here are the common DocVQA datasets, which are being updated continuously. Welco
 
 - **Download address**: https://github.com/doc-analysis/XFUND/releases/tag/v1.0
+
+
+
+#### 3. wildreceipt dataset
+
+- **Data source**: https://arxiv.org/abs/2103.14470
+- **Data introduction**: wildreceipt is an English receipt dataset, which contains 26 different categories. There are 1267 training images and 472 evaluation images, in which 50,000 textlines and boxes are annotated. Part of the images and annotation boxes are visualized below.
+
+ + +
+ +**Note:** Boxes with category `Ignore` or `Others` are not visualized here. + +- **Download address**: + - Offical dataset: [link](https://download.openmmlab.com/mmocr/data/wildreceipt.tar) + - Dataset converted for PaddleOCR training process: [link](https://paddleocr.bj.bcebos.com/ppstructure/dataset/wildreceipt.tar) diff --git a/doc/doc_en/kie_en.md b/doc/doc_en/kie_en.md new file mode 100644 index 0000000000000000000000000000000000000000..0c335a5ceb8991b80bc0cab6facdf402878abb50 --- /dev/null +++ b/doc/doc_en/kie_en.md @@ -0,0 +1,491 @@ +# Key Information Extraction + +This tutorial provides a guide to the whole process of key information extraction using PaddleOCR, including data preparation, model training, optimization, evaluation, prediction of semantic entity recognition (SER) and relationship extraction (RE) tasks. + + +- [1. Data Preparation](#Data-Preparation) + - [1.1. Prepare for dataset](#11-Prepare-for-dataset) + - [1.2. Custom Dataset](#12-Custom-Dataset) + - [1.3. Download data](#13-Download-data) +- [2. Training](#2-Training) + - [2.1. Start Training](#21-start-training) + - [2.2. Resume Training](#22-Resume-Training) + - [2.3. Mixed Precision Training](#23-Mixed-Precision-Training) + - [2.4. Distributed Training](#24-Distributed-Training) + - [2.5. Train using knowledge distillation](#25-Train-using-knowledge-distillation) + - [2.6. Training on other platform](#26-Training-on-other-platform) +- [3. Evaluation and Test](#3-Evaluation-and-Test) + - [3.1. Evaluation](#31-指标评估) + - [3.2. Test](#32-Test) +- [4. Model inference](#4-Model-inference) +- [5. FAQ](#5-faq) + + +# 1. Data Preparation + +## 1.1. Prepare for dataset + +PaddleOCR supports the following data format when training KIE models. + +- `general data` is used to train a dataset whose annotation is stored in a text file (SimpleDataset). + + +The default storage path of training data is `PaddleOCR/train_data`. If you already have datasets on your disk, you only need to create a soft link to the dataset directory. + +``` +# linux and mac os +ln -sf /train_data/dataset +# windows +mklink /d /train_data/dataset +``` + +## 1.2. Custom Dataset + +The training process generally includes the training set and the evaluation set. The data formats of the two sets are same. + +**(1) Training set** + +It is recommended to put the training images into the same folder, record the path and annotation of images in a text file. The contents of the text file are as follows: + + +```py +" image path annotation information " +zh_train_0.jpg [{"transcription": "汇丰晋信", "label": "other", "points": [[104, 114], [530, 114], [530, 175], [104, 175]], "id": 1, "linking": []}, {"transcription": "受理时间:", "label": "question", "points": [[126, 267], [266, 267], [266, 305], [126, 305]], "id": 7, "linking": [[7, 13]]}, {"transcription": "2020.6.15", "label": "answer", "points": [[321, 239], [537, 239], [537, 285], [321, 285]], "id": 13, "linking": [[7, 13]]}] +zh_train_1.jpg [{"transcription": "中国人体器官捐献", "label": "other", "points": [[544, 459], [954, 459], [954, 517], [544, 517]], "id": 1, "linking": []}, {"transcription": ">编号:MC545715483585", "label": "other", "points": [[1462, 470], [2054, 470], [2054, 543], [1462, 543]], "id": 10, "linking": []}, {"transcription": "CHINAORGANDONATION", "label": "other", "points": [[543, 516], [958, 516], [958, 551], [543, 551]], "id": 14, "linking": []}, {"transcription": "中国人体器官捐献志愿登记表", "label": "header", "points": [[635, 793], [1892, 793], [1892, 904], [635, 904]], "id": 18, "linking": []}] +... 
+```
+
+**Note:** In the text file, please split the image path and annotation with `\t`. Otherwise, errors will occur during training.
+
+The annotation can be parsed by `json` into a list of sub-annotations. Each element in the list is a dict, which stores the required information of each text line. The required fields are as follows.
+
+- transcription: stores the text content of the text line
+- label: the category of the text line content
+- points: stores the positions of the four corner points of the text line
+- id: stores the ID of the text line, used for RE model training
+- linking: stores the links between text lines, used for RE model training
+
+**(2) Evaluation set**
+
+The evaluation set is constructed in the same way as the training set.
+
+**(3) Dictionary file**
+
+The textlines in the training set and the evaluation set contain label information. The list of all labels is stored in the dictionary file (such as `class_list.txt`). Each line in the dictionary file is a label name.
+
+For example, the XFUND_zh dataset contains four categories, and the contents of its dictionary file are as follows.
+
+```
+OTHER
+QUESTION
+ANSWER
+HEADER
+```
+
+In the annotation file, the value of the `label` field of each text line must belong to the dictionary contents.
+
+
+The final dataset shall have the following file structure.
+
+```
+|-train_data
+  |-data_name
+    |- train.json
+    |- train
+        |- zh_train_0.png
+        |- zh_train_1.jpg
+        | ...
+    |- val.json
+    |- val
+        |- zh_val_0.png
+        |- zh_val_1.jpg
+        | ...
+```
+
+**Note:**
+
+- The category information in the annotation file is not case sensitive. For example, 'HEADER' and 'header' will be seen as the same category ID.
+- In the dictionary file, it is recommended to put the `other` category (for textlines that do not need to be paid attention to) on the first line. When parsing, the category ID of the `other` category will be resolved to 0, and the textlines predicted as `other` will not be visualized later.
+
+## 1.3. Download data
+
+If you do not have a local dataset, you can download the source files of [XFUND](https://github.com/doc-analysis/XFUND) or [FUNSD](https://guillaumejaume.github.io/FUNSD) and use the scripts of [XFUND](../../ppstructure/kie/tools/trans_xfun_data.py) or [FUNSD](../../ppstructure/kie/tools/trans_funsd_label.py) to transform them into the PaddleOCR format. Then you can use these public datasets to quickly experience KIE.
+
+For more information about public KIE datasets, please refer to the [KIE dataset tutorial](./dataset/kie_datasets_en.md).
+
+PaddleOCR also supports data annotation for KIE models. Please refer to the [PPOCRLabel tutorial](../../PPOCRLabel/README.md).
+
+# 2. Training
+
+PaddleOCR provides training scripts, evaluation scripts and inference scripts.
+This section takes the VI-LayoutXLM multimodal pretraining model as an example to explain the whole process.
+
+> If you want to use the SDMGR-based KIE algorithm, please refer to the [SDMGR tutorial](./algorithm_kie_sdmgr_en.md).
+
+
+## 2.1. Start Training
+
+If you do not use a custom dataset, you can download the processed XFUND_zh dataset provided by PaddleOCR for a quick experience.
+
+
+```bash
+mkdir train_data
+cd train_data
+wget https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar && tar -xf XFUND.tar
+cd ..
+```
+
+If you don't want to train and just want to experience model evaluation, prediction, and inference directly, you can download the trained models provided by PaddleOCR and skip Section 2.1.
+
+
+Use the following command to download the trained models.
+
+```bash
+mkdir pretrained_model
+cd pretrained_model
+# download and uncompress SER model
+wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_pretrained.tar && tar -xf ser_vi_layoutxlm_xfund_pretrained.tar
+
+# download and uncompress RE model
+wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/re_vi_layoutxlm_xfund_pretrained.tar && tar -xf re_vi_layoutxlm_xfund_pretrained.tar
+```
+
+Start training:
+
+- If you installed the CPU version of PaddlePaddle, you need to set `Global.use_gpu=False` in your config file.
+- During training, PaddleOCR will download the VI-LayoutXLM pretraining model by default. There is no need to download it in advance.
+
+```bash
+# GPU training, supporting both single-card and multi-card modes
+# The training log will be saved in "{Global.save_model_dir}/train.log"
+
+# train SER model using a single card
+python3 tools/train.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml
+
+# train SER model using multiple cards; use --gpus to assign the GPU ids
+python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml
+
+# train RE model using a single card
+python3 tools/train.py -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml
+```
+
+Take the SER model training as an example. After the training is started, you will see the following log output.
+
+```
+[2022/08/08 16:28:28] ppocr INFO: epoch: [1/200], global_step: 10, lr: 0.000006, loss: 1.871535, avg_reader_cost: 0.28200 s, avg_batch_cost: 0.82318 s, avg_samples: 8.0, ips: 9.71838 samples/s, eta: 0:51:59
+[2022/08/08 16:28:33] ppocr INFO: epoch: [1/200], global_step: 19, lr: 0.000018, loss: 1.461939, avg_reader_cost: 0.00042 s, avg_batch_cost: 0.32037 s, avg_samples: 6.9, ips: 21.53773 samples/s, eta: 0:37:55
+[2022/08/08 16:28:39] ppocr INFO: cur metric, precision: 0.11526348939743859, recall: 0.19776657060518732, hmean: 0.14564265817747712, fps: 34.008392345050055
+[2022/08/08 16:28:45] ppocr INFO: save best model is to ./output/ser_vi_layoutxlm_xfund_zh/best_accuracy
+[2022/08/08 16:28:45] ppocr INFO: best metric, hmean: 0.14564265817747712, precision: 0.11526348939743859, recall: 0.19776657060518732, fps: 34.008392345050055, best_epoch: 1
+[2022/08/08 16:28:51] ppocr INFO: save model in ./output/ser_vi_layoutxlm_xfund_zh/latest
+```
+
+The fields printed in the log have the following meanings.
+
+
+|Field | Meaning|
+| :----: | :------: |
+|epoch | current training epoch|
+|global_step | current iteration number|
+|lr | current learning rate|
+|loss | current loss value|
+| avg_reader_cost | average data loading time of the current batch|
+| avg_batch_cost | average total time of the current batch|
+| avg_samples | average number of samples in the current batch|
+|ips | number of samples processed per second|
+
+
+PaddleOCR supports evaluation during training. You can modify `eval_batch_step` in the config file `configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml` (every 19 iterations by default). The trained model with the best hmean will be saved under `output/ser_vi_layoutxlm_xfund_zh/best_accuracy/`.
+
+If the evaluation dataset is very large, it's recommended to enlarge the evaluation interval or evaluate the model after training.
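+The same evaluation results can also be pulled out of the log file afterwards. Below is a minimal sketch (not part of PaddleOCR) that scans the training log for the evaluation lines shown above and reports the best hmean; the log path follows the "{Global.save_model_dir}/train.log" convention from the training command.
+
+```python
+# Hedged sketch: summarize the periodic evaluation results recorded in train.log.
+import re
+
+log_path = "./output/ser_vi_layoutxlm_xfund_zh/train.log"  # adjust to your save_model_dir
+
+hmeans = []
+with open(log_path) as f:
+    for line in f:
+        # evaluation lines look like: "... cur metric, precision: ..., recall: ..., hmean: 0.1456..., fps: ..."
+        match = re.search(r"cur metric.*?hmean: ([0-9.]+)", line)
+        if match:
+            hmeans.append(float(match.group(1)))
+
+if hmeans:
+    print(f"{len(hmeans)} evaluations found, best hmean: {max(hmeans):.4f}")
+else:
+    print("no evaluation lines found yet")
+```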
+ +**Note:** for more KIE models training and configuration files, you can go into `configs/kie/` or refer to [Frontier KIE algorithms](./algorithm_overview_en.md). + + +If you want to train model on your own dataset, you need to modify the data path, dictionary file and category number in the configuration file. + + +Take `configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml` as an example, contents we need to fix is as follows. + +```yaml +Architecture: + # ... + Backbone: + name: LayoutXLMForSer + pretrained: True + mode: vi + # Assuming that n categroies are included in the dictionary file (other is included), the the num_classes is set as 2n-1 + num_classes: &num_classes 7 + +PostProcess: + name: kieSerTokenLayoutLMPostProcess + # Modify the dictionary file path for your custom dataset + class_path: &class_path train_data/XFUND/class_list_xfun.txt + +Train: + dataset: + name: SimpleDataSet + # Modify the data path for your training dataset + data_dir: train_data/XFUND/zh_train/image + # Modify the data annotation path for your training dataset + label_file_list: + - train_data/XFUND/zh_train/train.json + ... + loader: + # batch size for single card when training + batch_size_per_card: 8 + ... + +Eval: + dataset: + name: SimpleDataSet + # Modify the data path for your evaluation dataset + data_dir: train_data/XFUND/zh_val/image + # Modify the data annotation path for your evaluation dataset + label_file_list: + - train_data/XFUND/zh_val/val.json + ... + loader: + # batch size for single card when evaluation + batch_size_per_card: 8 +``` + +**Note that the configuration file for prediction/evaluation must be consistent with the training file.** + + +## 2.2. Resume Training + +If the training process is interrupted and you want to load the saved model to resume training, you can specify the path of the model to be loaded by specifying `Architecture.Backbone.checkpoints`. + + +```bash +python3 tools/train.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./output/ser_vi_layoutxlm_xfund_zh/best_accuracy +``` + +**Note:** + +- Priority of `Architecture.Backbone.checkpoints` is higher than` Architecture.Backbone.pretrained`. You need to set `Architecture.Backbone.checkpoints` for model finetuning, resume and evalution. If you want to train with the NLP pretrained model, you need to set `Architecture.Backbone.pretrained` as `True` and set `Architecture.Backbone.checkpoints` as null (`null`). +- PaddleNLP pretrained models are used here for LayoutXLM series models, the model loading and saving logic is same as those in PaddleNLP. Therefore we do not need to set `Global.pretrained_model` or `Global.checkpoints` here. +- If you use knowledge distillation to train the LayoutXLM series models, resuming training is not supported now. + +## 2.3. Mixed Precision Training + +coming soon! + +## 2.4. Distributed Training + +During multi-machine multi-gpu training, use the `--ips` parameter to set the used machine IP address, and the `--gpus` parameter to set the used GPU ID: + +```bash +python3 -m paddle.distributed.launch --ips="xx.xx.xx.xx,xx.xx.xx.xx" --gpus '0,1,2,3' tools/train.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml +``` + +**Note:** (1) When using multi-machine and multi-gpu training, you need to replace the ips value in the above command with the address of your machine, and the machines need to be able to ping each other. (2) Training needs to be launched separately on multiple machines. 
The command to view the ip address of the machine is `ifconfig`. (3) For more details about the distributed training speedup ratio, please refer to [Distributed Training Tutorial](./distributed_training_en.md). + + +## 2.5. Train with Knowledge Distillation + +Knowledge distillation is supported in PaddleOCR for KIE model training process. The configuration file is [ser_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml). For more information, please refer to [doc](./knowledge_distillation_en.md). + +**Note:** The saving and loading logic of the LayoutXLM series KIE models in PaddleOCR is consistent with PaddleNLP, so only the parameters of the student model are saved in the distillation process. If you want to use the saved model for evaluation, you need to use the configuration of the student model (the student model corresponding to the distillation file above is [ser_vi_layoutxlm_xfund_zh.yml](../../configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml). + + + +## 2.6. Training on other platform + +- Windows GPU/CPU +The Windows platform is slightly different from the Linux platform: +Windows platform only supports `single gpu` training and inference, specify GPU for training `set CUDA_VISIBLE_DEVICES=0` +On the Windows platform, DataLoader only supports single-process mode, so you need to set `num_workers` to 0; + +- macOS +GPU mode is not supported, you need to set `use_gpu` to False in the configuration file, and the rest of the training evaluation prediction commands are exactly the same as Linux GPU. + +- Linux DCU +Running on a DCU device requires setting the environment variable `export HIP_VISIBLE_DEVICES=0,1,2,3`, and the rest of the training and evaluation prediction commands are exactly the same as the Linux GPU. + +# 3. Evaluation and Test + +## 3.1. Evaluation + +The trained model will be saved in `Global.save_model_dir`. When evaluation, you need to set `Architecture.Backbone.checkpoints` as your model directroy. The evaluation dataset can be set by modifying the `Eval.dataset.label_file_list` field in the `configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml` file. + + +```bash +# GPU evaluation, Global.checkpoints is the weight to be tested +python3 tools/eval.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./output/ser_vi_layoutxlm_xfund_zh/best_accuracy +``` + +The following information will be printed such as precision, recall, hmean and so on. + +```py +[2022/08/09 07:59:28] ppocr INFO: metric eval *************** +[2022/08/09 07:59:28] ppocr INFO: precision:0.697476609016161 +[2022/08/09 07:59:28] ppocr INFO: recall:0.8861671469740634 +[2022/08/09 07:59:28] ppocr INFO: hmean:0.7805806758686339 +[2022/08/09 07:59:28] ppocr INFO: fps:17.367364606899105 +``` + + +## 3.2. Test + +Using the model trained by PaddleOCR, we can quickly get prediction through the following script. + +The default prediction image is stored in `Global.infer_img`, and the trained model weight is specified via `-o Global.checkpoints`. + +According to the `Global.save_model_dir` and `save_epoch_step` fields set in the configuration file, the following parameters will be saved. 
+ + +``` +output/ser_vi_layoutxlm_xfund_zh/ +├── best_accuracy + ├── metric.states + ├── model_config.json + ├── model_state.pdparams +├── best_accuracy.pdopt +├── config.yml +├── train.log +├── latest + ├── metric.states + ├── model_config.json + ├── model_state.pdparams +├── latest.pdopt +``` + +Among them, best_accuracy.* is the best model on the evaluation set; latest.* is the model of the last epoch. + +The configuration file for prediction must be consistent with the training file. If you finish the training process using `python3 tools/train.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml`. You can use the following command for prediction. + + +```bash +python3 tools/infer_kie_token_ser.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./output/ser_vi_layoutxlm_xfund_zh/best_accuracy Global.infer_img=./ppstructure/docs/kie/input/zh_val_42.jpg +``` + +The output image is as follows, which is also saved in `Global.save_res_path`. + + +
+ +
+ +During the prediction process, the detection and recognition model of PP-OCRv3 will be loaded by default for information extraction of OCR. If you want to load the OCR results obtained in advance, you can use the following method to predict, and specify `Global.infer_img` as the annotation file, which contains the image path and OCR information, and specifies `Global.infer_mode` as False, indicating that the OCR inference engine is not used at this time. + +```bash +python3 tools/infer_kie_token_ser.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./output/ser_vi_layoutxlm_xfund_zh/best_accuracy Global.infer_img=./train_data/XFUND/zh_val/val.json Global.infer_mode=False +``` + +For the above image, if information extraction is performed using the labeled OCR results, the prediction results are as follows. + +
+ +
+ +It can be seen that part of the detection information is more accurate, but the overall information extraction results are basically the same. + +In RE model prediction, the SER model result needs to be given first, so the configuration file and model weight of SER need to be loaded at the same time, as shown in the following example. + +```bash +python3 ./tools/infer_kie_token_ser_re.py \ + -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/re_vi_layoutxlm_udml_xfund_zh/best_accuracy/ \ + Global.infer_img=./train_data/XFUND/zh_val/image/ \ + -c_ser configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o_ser Architecture.Backbone.checkpoints=pretrain_models/ \ + ser_vi_layoutxlm_udml_xfund_zh/best_accuracy/ +``` + +The result is as follows. + +
+ +
+ + +If you want to load the OCR results obtained in advance, you can use the following method to predict, and specify `Global.infer_img` as the annotation file, which contains the image path and OCR information, and specifies `Global.infer_mode` as False, indicating that the OCR inference engine is not used at this time. + +```bash +python3 ./tools/infer_kie_token_ser_re.py \ + -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/re_vi_layoutxlm_udml_xfund_zh/best_accuracy/ \ + Global.infer_img=./train_data/XFUND/zh_val/val.json \ + Global.infer_mode=False \ + -c_ser configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o_ser Architecture.Backbone.checkpoints=pretrain_models/ser_vi_layoutxlm_udml_xfund_zh/best_accuracy/ +``` + +`c_ser` denotes SER configurations file, `o_ser` denotes the SER model configurations that will override corresponding content in the file. + + +The result is as follows. + +
+ +
+ + +It can be seen that the re prediction results directly using the annotated OCR results are more accurate. + + +# 4. Model inference + + +## 4.1 Export the model + +The inference model (the model saved by `paddle.jit.save`) is generally a solidified model saved after the model training is completed, and is mostly used to give prediction in deployment. + +The model saved during the training process is the checkpoints model, which saves the parameters of the model and is mostly used to resume training. + +Compared with the checkpoints model, the inference model will additionally save the structural information of the model. Therefore, it is easier to deploy because the model structure and model parameters are already solidified in the inference model file, and is suitable for integration with actual systems. + +The SER model can be converted to the inference model using the following command. + + +```bash +# -c Set the training algorithm yml configuration file. +# -o Set optional parameters. +# Architecture.Backbone.checkpoints Set the training model address. +# Global.save_inference_dir Set the address where the converted model will be saved. +python3 tools/export_model.py -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml -o Architecture.Backbone.checkpoints=./output/ser_vi_layoutxlm_xfund_zh/best_accuracy Global.save_inference_dir=./inference/ser_vi_layoutxlm +``` + +After the conversion is successful, there are three files in the model save directory: + +``` +inference/ser_vi_layoutxlm/ + ├── inference.pdiparams # The parameter file of recognition inference model + ├── inference.pdiparams.info # The parameter information of recognition inference model, which can be ignored + └── inference.pdmodel # The program file of recognition +``` + +Export of RE model is also in adaptation. + +## 4.2 Model inference + +The VI layoutxlm model performs reasoning based on the ser task, and can execute the following commands: + + +Using the following command to infer the VI-LayoutXLM model. + +```bash +cd ppstructure +python3 kie/predict_kie_token_ser.py \ + --kie_algorithm=LayoutXLM \ + --ser_model_dir=../inference/ser_vi_layoutxlm \ + --image_dir=./docs/kie/input/zh_val_42.jpg \ + --ser_dict_path=../train_data/XFUND/class_list_xfun.txt \ + --vis_font_path=../doc/fonts/simfang.ttf \ + --ocr_order_method="tb-yx" +``` + +The visualized result will be saved in `./output`, which is shown as follows. + +
+ +
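+Beyond the script above, the exported `inference.pdmodel` / `inference.pdiparams` pair can also be loaded directly with the Paddle Inference API when integrating the model into your own system. The snippet below is a minimal sketch that only loads the model and lists its input and output names; OCR preprocessing, tokenization and post-processing are omitted here, and `ppstructure/kie/predict_kie_token_ser.py` remains the reference for the complete pipeline.
+
+```python
+# Hedged sketch: load the exported SER inference model with the Paddle Inference
+# API. This only shows how the solidified model files are consumed; it does not
+# run the full KIE pipeline.
+from paddle.inference import Config, create_predictor
+
+model_dir = "./inference/ser_vi_layoutxlm"  # the Global.save_inference_dir used above
+config = Config(f"{model_dir}/inference.pdmodel", f"{model_dir}/inference.pdiparams")
+config.disable_gpu()  # or config.enable_use_gpu(200, 0) if a GPU is available
+
+predictor = create_predictor(config)
+print("input names:", predictor.get_input_names())
+print("output names:", predictor.get_output_names())
+```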
+ + +# 5. FAQ + +Q1: After the training model is transferred to the inference model, the prediction effect is inconsistent? + +**A**:The problems are mostly caused by inconsistent preprocessing and postprocessing parameters when the trained model predicts and the preprocessing and postprocessing parameters when the inference model predicts. You can compare whether there are differences in preprocessing, postprocessing, and prediction in the configuration files used for training. diff --git a/paddleocr.py b/paddleocr.py index 8e34c4fbc331f798618fc5f33bc00963a577e25a..f6fb095af34a58cc91b9fd0f22b2e95bf833e010 100644 --- a/paddleocr.py +++ b/paddleocr.py @@ -35,7 +35,7 @@ from tools.infer import predict_system from ppocr.utils.logging import get_logger logger = get_logger() -from ppocr.utils.utility import check_and_read_gif, get_image_file_list +from ppocr.utils.utility import check_and_read, get_image_file_list from ppocr.utils.network import maybe_download, download_with_progressbar, is_link, confirm_model_dir_url from tools.infer.utility import draw_ocr, str2bool, check_gpu from ppstructure.utility import init_args, draw_structure_result @@ -289,7 +289,8 @@ MODEL_URLS = { 'ch': { 'url': 'https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_layout_infer.tar', - 'dict_path': 'ppocr/utils/dict/layout_publaynet_dict.txt' + 'dict_path': + 'ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt' } } } @@ -490,7 +491,7 @@ class PaddleOCR(predict_system.TextSystem): download_with_progressbar(img, 'tmp.jpg') img = 'tmp.jpg' image_file = img - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: with open(image_file, 'rb') as f: np_arr = np.frombuffer(f.read(), dtype=np.uint8) @@ -584,7 +585,7 @@ class PPStructure(StructureSystem): download_with_progressbar(img, 'tmp.jpg') img = 'tmp.jpg' image_file = img - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: with open(image_file, 'rb') as f: np_arr = np.frombuffer(f.read(), dtype=np.uint8) @@ -635,4 +636,6 @@ def main(): for item in result: item.pop('img') + item.pop('res') logger.info(item) + logger.info('result save to {}'.format(args.output)) diff --git a/ppocr/data/imaug/copy_paste.py b/ppocr/data/imaug/copy_paste.py index 0b3386c896792bd670cd2bfc757eb3b80f22bac4..79343da60fd40f8dc0ffe8927398b70cb751b532 100644 --- a/ppocr/data/imaug/copy_paste.py +++ b/ppocr/data/imaug/copy_paste.py @@ -35,10 +35,12 @@ class CopyPaste(object): point_num = data['polys'].shape[1] src_img = data['image'] src_polys = data['polys'].tolist() + src_texts = data['texts'] src_ignores = data['ignore_tags'].tolist() ext_data = data['ext_data'][0] ext_image = ext_data['image'] ext_polys = ext_data['polys'] + ext_texts = ext_data['texts'] ext_ignores = ext_data['ignore_tags'] indexs = [i for i in range(len(ext_ignores)) if not ext_ignores[i]] @@ -53,7 +55,7 @@ class CopyPaste(object): src_img = cv2.cvtColor(src_img, cv2.COLOR_BGR2RGB) ext_image = cv2.cvtColor(ext_image, cv2.COLOR_BGR2RGB) src_img = Image.fromarray(src_img).convert('RGBA') - for poly, tag in zip(select_polys, select_ignores): + for idx, poly, tag in zip(select_idxs, select_polys, select_ignores): box_img = get_rotate_crop_image(ext_image, poly) src_img, box = self.paste_img(src_img, box_img, src_polys) @@ -62,6 +64,7 @@ class CopyPaste(object): for _ in range(len(box), point_num): box.append(box[-1]) src_polys.append(box) + src_texts.append(ext_texts[idx]) src_ignores.append(tag) src_img = 
cv2.cvtColor(np.array(src_img), cv2.COLOR_RGB2BGR) h, w = src_img.shape[:2] @@ -70,6 +73,7 @@ class CopyPaste(object): src_polys[:, :, 1] = np.clip(src_polys[:, :, 1], 0, h) data['image'] = src_img data['polys'] = src_polys + data['texts'] = src_texts data['ignore_tags'] = np.array(src_ignores) return data diff --git a/ppocr/metrics/rec_metric.py b/ppocr/metrics/rec_metric.py index d858ae28e999d546847727243b35f2ac902e1026..9863978116b1340fa809e8919a6a37d598d6bbdf 100644 --- a/ppocr/metrics/rec_metric.py +++ b/ppocr/metrics/rec_metric.py @@ -12,7 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. -import Levenshtein +from rapidfuzz.distance import Levenshtein import string @@ -46,8 +46,7 @@ class RecMetric(object): if self.is_filter: pred = self._normalize_text(pred) target = self._normalize_text(target) - norm_edit_dis += Levenshtein.distance(pred, target) / max( - len(pred), len(target), 1) + norm_edit_dis += Levenshtein.normalized_distance(pred, target) if pred == target: correct_num += 1 all_num += 1 diff --git a/ppocr/modeling/backbones/__init__.py b/ppocr/modeling/backbones/__init__.py index f5d54150bc325521698c43662895287e5640fb3d..6fdcc4a759e59027b1457d1e46757c64c4dcad9e 100755 --- a/ppocr/modeling/backbones/__init__.py +++ b/ppocr/modeling/backbones/__init__.py @@ -52,17 +52,15 @@ def build_backbone(config, model_type): support_dict = ['ResNet'] elif model_type == 'kie': from .kie_unet_sdmgr import Kie_backbone - support_dict = ['Kie_backbone'] + from .vqa_layoutlm import LayoutLMForSer, LayoutLMv2ForSer, LayoutLMv2ForRe, LayoutXLMForSer, LayoutXLMForRe + support_dict = [ + 'Kie_backbone', 'LayoutLMForSer', 'LayoutLMv2ForSer', + 'LayoutLMv2ForRe', 'LayoutXLMForSer', 'LayoutXLMForRe' + ] elif model_type == 'table': from .table_resnet_vd import ResNet from .table_mobilenet_v3 import MobileNetV3 support_dict = ['ResNet', 'MobileNetV3'] - elif model_type == 'vqa': - from .vqa_layoutlm import LayoutLMForSer, LayoutLMv2ForSer, LayoutLMv2ForRe, LayoutXLMForSer, LayoutXLMForRe - support_dict = [ - 'LayoutLMForSer', 'LayoutLMv2ForSer', 'LayoutLMv2ForRe', - 'LayoutXLMForSer', 'LayoutXLMForRe' - ] else: raise NotImplementedError diff --git a/ppocr/utils/dict/kie_dict/xfund_class_list.txt b/ppocr/utils/dict/kie_dict/xfund_class_list.txt new file mode 100644 index 0000000000000000000000000000000000000000..faded9f9b8f56bd258909bec9b8f1755aa688367 --- /dev/null +++ b/ppocr/utils/dict/kie_dict/xfund_class_list.txt @@ -0,0 +1,4 @@ +OTHER +QUESTION +ANSWER +HEADER diff --git a/ppocr/utils/save_load.py b/ppocr/utils/save_load.py index 7ccadb005a8ad591d9927c0e028887caacb3e37b..aa65f290c0a5f4f13b3103fb4404815e2ae74a88 100644 --- a/ppocr/utils/save_load.py +++ b/ppocr/utils/save_load.py @@ -54,13 +54,15 @@ def load_model(config, model, optimizer=None, model_type='det'): pretrained_model = global_config.get('pretrained_model') best_model_dict = {} is_float16 = False + is_nlp_model = model_type == 'kie' and config["Architecture"][ + "algorithm"] not in ["SDMGR"] - if model_type == 'vqa': - # NOTE: for vqa model dsitillation, resume training is not supported now + if is_nlp_model is True: + # NOTE: for kie model dsitillation, resume training is not supported now if config["Architecture"]["algorithm"] in ["Distillation"]: return best_model_dict checkpoints = config['Architecture']['Backbone']['checkpoints'] - # load vqa method metric + # load kie method metric if checkpoints: if os.path.exists(os.path.join(checkpoints, 'metric.states')): with 
open(os.path.join(checkpoints, 'metric.states'), @@ -102,8 +104,9 @@ def load_model(config, model, optimizer=None, model_type='det'): continue pre_value = params[key] if pre_value.dtype == paddle.float16: - pre_value = pre_value.astype(paddle.float32) is_float16 = True + if pre_value.dtype != value.dtype: + pre_value = pre_value.astype(value.dtype) if list(value.shape) == list(pre_value.shape): new_state_dict[key] = pre_value else: @@ -153,15 +156,16 @@ def load_pretrained_params(model, path): new_state_dict = {} is_float16 = False - + for k1 in params.keys(): if k1 not in state_dict.keys(): logger.warning("The pretrained params {} not in model".format(k1)) else: if params[k1].dtype == paddle.float16: - params[k1] = params[k1].astype(paddle.float32) is_float16 = True + if params[k1].dtype != state_dict[k1].dtype: + params[k1] = params[k1].astype(state_dict[k1].dtype) if list(state_dict[k1].shape) == list(params[k1].shape): new_state_dict[k1] = params[k1] else: @@ -192,10 +196,13 @@ def save_model(model, _mkdir_if_not_exist(model_path, logger) model_prefix = os.path.join(model_path, prefix) paddle.save(optimizer.state_dict(), model_prefix + '.pdopt') - if config['Architecture']["model_type"] != 'vqa': + + is_nlp_model = config['Architecture']["model_type"] == 'kie' and config[ + "Architecture"]["algorithm"] not in ["SDMGR"] + if is_nlp_model is not True: paddle.save(model.state_dict(), model_prefix + '.pdparams') metric_prefix = model_prefix - else: # for vqa system, we follow the save/load rules in NLP + else: # for kie system, we follow the save/load rules in NLP if config['Global']['distributed']: arch = model._layers else: diff --git a/ppocr/utils/utility.py b/ppocr/utils/utility.py index b881fcab20bc5ca076a0002bd72349768c7d881a..18357c8e97bcea8ee321856a87146a4a7b901469 100755 --- a/ppocr/utils/utility.py +++ b/ppocr/utils/utility.py @@ -50,7 +50,7 @@ def get_check_global_params(mode): def _check_image_file(path): - img_end = {'jpg', 'bmp', 'png', 'jpeg', 'rgb', 'tif', 'tiff', 'gif'} + img_end = {'jpg', 'bmp', 'png', 'jpeg', 'rgb', 'tif', 'tiff', 'gif', 'pdf'} return any([path.lower().endswith(e) for e in img_end]) @@ -59,7 +59,7 @@ def get_image_file_list(img_file): if img_file is None or not os.path.exists(img_file): raise Exception("not found any img file in {}".format(img_file)) - img_end = {'jpg', 'bmp', 'png', 'jpeg', 'rgb', 'tif', 'tiff', 'gif'} + img_end = {'jpg', 'bmp', 'png', 'jpeg', 'rgb', 'tif', 'tiff', 'gif', 'pdf'} if os.path.isfile(img_file) and _check_image_file(img_file): imgs_lists.append(img_file) elif os.path.isdir(img_file): @@ -73,7 +73,7 @@ def get_image_file_list(img_file): return imgs_lists -def check_and_read_gif(img_path): +def check_and_read(img_path): if os.path.basename(img_path)[-3:] in ['gif', 'GIF']: gif = cv2.VideoCapture(img_path) ret, frame = gif.read() @@ -84,8 +84,26 @@ def check_and_read_gif(img_path): if len(frame.shape) == 2 or frame.shape[-1] == 1: frame = cv2.cvtColor(frame, cv2.COLOR_GRAY2RGB) imgvalue = frame[:, :, ::-1] - return imgvalue, True - return None, False + return imgvalue, True, False + elif os.path.basename(img_path)[-3:] in ['pdf']: + import fitz + from PIL import Image + imgs = [] + with fitz.open(img_path) as pdf: + for pg in range(0, pdf.pageCount): + page = pdf[pg] + mat = fitz.Matrix(2, 2) + pm = page.getPixmap(matrix=mat, alpha=False) + + # if width or height > 2000 pixels, don't enlarge the image + if pm.width > 2000 or pm.height > 2000: + pm = page.getPixmap(matrix=fitz.Matrix(1, 1), alpha=False) + + img = 
Image.frombytes("RGB", [pm.width, pm.height], pm.samples) + img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR) + imgs.append(img) + return imgs, False, True + return None, False, False def load_vqa_bio_label_maps(label_map_path): diff --git a/ppstructure/README.md b/ppstructure/README.md index 856de5a4306de987378dafc65e582f280be4bef3..fb3697bc1066262833ee20bcbb8f79833f264f14 100644 --- a/ppstructure/README.md +++ b/ppstructure/README.md @@ -1,120 +1,115 @@ English | [简体中文](README_ch.md) - [1. Introduction](#1-introduction) -- [2. Update log](#2-update-log) -- [3. Features](#3-features) -- [4. Results](#4-results) - - [4.1 Layout analysis and table recognition](#41-layout-analysis-and-table-recognition) - - [4.2 DOC-VQA](#42-doc-vqa) -- [5. Quick start](#5-quick-start) -- [6. PP-Structure System](#6-pp-structure-system) - - [6.1 Layout analysis and table recognition](#61-layout-analysis-and-table-recognition) - - [6.1.1 Layout analysis](#611-layout-analysis) - - [6.1.2 Table recognition](#612-table-recognition) - - [6.2 DOC-VQA](#62-doc-vqa) -- [7. Model List](#7-model-list) - - [7.1 Layout analysis model](#71-layout-analysis-model) - - [7.2 OCR and table recognition model](#72-ocr-and-table-recognition-model) - - [7.3 DOC-VQA model](#73-doc-vqa-model) +- [2. Features](#2-features) +- [3. Results](#3-results) + - [3.1 Layout analysis and table recognition](#31-layout-analysis-and-table-recognition) + - [3.2 Layout Recovery](#32-layout-recovery) + - [3.3 KIE](#33-kie) +- [4. Quick start](#4-quick-start) +- [5. Model List](#5-model-list) ## 1. Introduction -PP-Structure is an OCR toolkit that can be used for document analysis and processing with complex structures, designed to help developers better complete document understanding tasks +PP-Structure is an intelligent document analysis system developed by the PaddleOCR team, which aims to help developers better complete tasks related to document understanding such as layout analysis and table recognition. -## 2. Update log -* 2022.02.12 DOC-VQA add LayoutLMv2 model。 -* 2021.12.07 add [DOC-VQA SER and RE tasks](vqa/README.md)。 +The pipeline of PP-Structurev2 system is shown below. The document image first passes through the image direction correction module to identify the direction of the entire image and complete the direction correction. Then, two tasks of layout information analysis and key information extraction can be completed. -## 3. Features +- In the layout analysis task, the image first goes through the layout analysis model to divide the image into different areas such as text, table, and figure, and then analyze these areas separately. For example, the table area is sent to the form recognition module for structured recognition, and the text area is sent to the OCR engine for text recognition. Finally, the layout recovery module restores it to a word or pdf file with the same layout as the original image; +- In the key information extraction task, the OCR engine is first used to extract the text content, and then the SER(semantic entity recognition) module obtains the semantic entities in the image, and finally the RE(relationship extraction) module obtains the correspondence between the semantic entities, thereby extracting the required key information. 
+ -The main features of PP-Structure are as follows: +More technical details: 👉 [PP-Structurev2 Technical Report](docs/PP-Structurev2_introduction.md) -- Support the layout analysis of documents, divide the documents into 5 types of areas **text, title, table, image and list** (conjunction with Layout-Parser) -- Support to extract the texts from the text, title, picture and list areas (used in conjunction with PP-OCR) -- Support to extract excel files from the table areas -- Support python whl package and command line usage, easy to use -- Support custom training for layout analysis and table structure tasks -- Support Document Visual Question Answering (DOC-VQA) tasks: Semantic Entity Recognition (SER) and Relation Extraction (RE) +PP-Structurev2 supports independent use or flexible collocation of each module. For example, you can use layout analysis alone or table recognition alone. Click the corresponding link below to get the tutorial for each independent module: -## 4. Results +- [Layout Analysis](layout/README.md) +- [Table Recognition](table/README.md) +- [Key Information Extraction](kie/README.md) +- [Layout Recovery](recovery/README.md) -### 4.1 Layout analysis and table recognition +## 2. Features - - -The figure shows the pipeline of layout analysis + table recognition. The image is first divided into four areas of image, text, title and table by layout analysis, and then OCR detection and recognition is performed on the three areas of image, text and title, and the table is performed table recognition, where the image will also be stored for use. - -### 4.2 DOC-VQA - -* SER -* -![](docs/vqa/result_ser/zh_val_0_ser.jpg) | ![](docs/vqa/result_ser/zh_val_42_ser.jpg) ----|--- - -Different colored boxes in the figure represent different categories. For xfun dataset, there are three categories: query, answer and header: +The main features of PP-Structurev2 are as follows: +- Support layout analysis of documents in the form of images/pdfs, which can be divided into areas such as **text, titles, tables, figures, formulas, etc.**; +- Support common Chinese and English **table detection** tasks; +- Support structured table recognition, and output the final result to **Excel file**; +- Support multimodal-based Key Information Extraction (KIE) tasks - **Semantic Entity Recognition** (SER) and **Relation Extraction (RE); +- Support **layout recovery**, that is, restore the document in word or pdf format with the same layout as the original image; +- Support customized training and multiple inference deployment methods such as python whl package quick start; +- Connect with the semi-automatic data labeling tool PPOCRLabel, which supports the labeling of layout analysis, table recognition, and SER. -* Dark purple: header -* Light purple: query -* Army green: answer +## 3. Results -The corresponding category and OCR recognition results are also marked at the top left of the OCR detection box. +PP-Structurev2 supports the independent use or flexible collocation of each module. For example, layout analysis can be used alone, or table recognition can be used alone. Only the visualization effects of several representative usage methods are shown here. +### 3.1 Layout analysis and table recognition -* RE - -![](docs/vqa/result_re/zh_val_21_re.jpg) | ![](docs/vqa/result_re/zh_val_40_re.jpg) ----|--- +The figure shows the pipeline of layout analysis + table recognition. 
The image is first divided into four areas of image, text, title and table by layout analysis, and then OCR detection and recognition is performed on the three areas of image, text and title, and the table is performed table recognition, where the image will also be stored for use. + +### 3.2 Layout recovery -In the figure, the red box represents the question, the blue box represents the answer, and the question and answer are connected by green lines. The corresponding category and OCR recognition results are also marked at the top left of the OCR detection box. +The following figure shows the effect of layout recovery based on the results of layout analysis and table recognition in the previous section. + -## 5. Quick start +### 3.3 KIE -Start from [Quick Installation](./docs/quickstart.md) +* SER -## 6. PP-Structure System +Different colored boxes in the figure represent different categories. -### 6.1 Layout analysis and table recognition +
+ +
-![pipeline](docs/table/pipeline.jpg) +
+ +
-In PP-Structure, the image will be divided into 5 types of areas **text, title, image list and table**. For the first 4 types of areas, directly use PP-OCR system to complete the text detection and recognition. For the table area, after the table structuring process, the table in image is converted into an Excel file with the same table style. +
+ +
-#### 6.1.1 Layout analysis +
+ +
-Layout analysis classifies image by region, including the use of Python scripts of layout analysis tools, extraction of designated category detection boxes, performance indicators, and custom training layout analysis models. For details, please refer to [document](layout/README.md). +
+ +
-#### 6.1.2 Table recognition +* RE -Table recognition converts table images into excel documents, which include the detection and recognition of table text and the prediction of table structure and cell coordinates. For detailed instructions, please refer to [document](table/README.md) +In the figure, the red box represents `Question`, the blue box represents `Answer`, and `Question` and `Answer` are connected by green lines. -### 6.2 DOC-VQA +
+ +
-Document Visual Question Answering (DOC-VQA) if a type of Visual Question Answering (VQA), which includes Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks. Based on SER task, text recognition and classification in images can be completed. Based on THE RE task, we can extract the relation of the text content in the image, such as judge the problem pair. For details, please refer to [document](vqa/README.md) +
+ +
-## 7. Model List +
+ +
-PP-Structure Series Model List (Updating) +
+ +
-### 7.1 Layout analysis model +## 4. Quick start -|model name|description|download|label_map| -| --- | --- | --- |--- | -| ppyolov2_r50vd_dcn_365e_publaynet | The layout analysis model trained on the PubLayNet dataset can divide image into 5 types of areas **text, title, table, picture, and list** | [PubLayNet](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet.tar) | {0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"}| +Start from [Quick Start](./docs/quickstart_en.md). -### 7.2 OCR and table recognition model +## 5. Model List -|model name|description|model size|download| -| --- | --- | --- | --- | -|ch_PP-OCRv3_det_slim|[New] slim quantization with distillation lightweight model, supporting Chinese, English, multilingual text detection| 1.1M |[inference model](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_slim_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_slim_distill_train.tar)| -|ch_PP-OCRv3_rec_slim |[New] Slim qunatization with distillation lightweight model, supporting Chinese, English text recognition| 4.9M |[inference model](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_slim_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_slim_train.tar) | -|ch_ppstructure_mobile_v2.0_SLANet|Chinese table recognition model trained on PubTabNet dataset based on SLANet|9.3M|[inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_train.tar) | +Some tasks need to use both the structured analysis models and the OCR models. For example, the table recognition task needs to use the table recognition model for structured analysis, and the OCR model to recognize the text in the table. Please select the appropriate models according to your specific needs. -### 7.3 DOC-VQA model +For structural analysis related model downloads, please refer to: +- [PP-Structure Model Zoo](./docs/models_list_en.md) -|model name|description|model size|download| -| --- | --- | --- | --- | -|ser_LayoutXLM_xfun_zhd|SER model trained on xfun Chinese dataset based on LayoutXLM|1.4G|[inference model coming soon]() / [trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar) | -|re_LayoutXLM_xfun_zh|RE model trained on xfun Chinese dataset based on LayoutXLM|1.4G|[inference model coming soon]() / [trained model](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar) | +For OCR related model downloads, please refer to: +- [PP-OCR Model Zoo](../doc/doc_en/models_list_en.md) -If you need to use other models, you can download the model in [PPOCR model_list](../doc/doc_en/models_list_en.md) and [PPStructure model_list](./docs/models_list.md) diff --git a/ppstructure/README_ch.md b/ppstructure/README_ch.md index 64af0cbe53265c85fd9027fe48e82102f4b5ea57..87a9c625b32c32e9c7fffb8ebc9b9fdf3b2130db 100644 --- a/ppstructure/README_ch.md +++ b/ppstructure/README_ch.md @@ -3,135 +3,120 @@ # PP-Structure 文档分析 - [1. 简介](#1) -- [2. 近期更新](#2) -- [3. 特性](#3) -- [4. 效果展示](#4) - - [4.1 版面分析和表格识别](#41) - - [4.2 DocVQA](#42) -- [5. 快速体验](#5) -- [6. PP-Structure 介绍](#6) - - [6.1 版面分析+表格识别](#61) - - [6.1.1 版面分析](#611) - - [6.1.2 表格识别](#612) - - [6.2 DocVQA](#62) -- [7. 模型库](#7) - - [7.1 版面分析模型](#71) - - [7.2 OCR和表格识别模型](#72) - - [7.3 DocVQA 模型](#73) +- [2. 
特性](#2) +- [3. 效果展示](#3) + - [3.1 版面分析和表格识别](#31) + - [3.2 版面恢复](#32) + - [3.3 关键信息抽取](#33) +- [4. 快速体验](#4) +- [5. 模型库](#5) ## 1. 简介 -PP-Structure是一个可用于复杂文档结构分析和处理的OCR工具包,旨在帮助开发者更好的完成文档理解相关任务。 - -## 2. 近期更新 -* 2022.02.12 DocVQA增加LayoutLMv2模型。 -* 2021.12.07 新增[DOC-VQA任务SER和RE](vqa/README.md)。 - - -## 3. 特性 - -PP-Structure的主要特性如下: -- 支持对图片形式的文档进行版面分析,可以划分**文字、标题、表格、图片以及列表**5类区域(与Layout-Parser联合使用) -- 支持文字、标题、图片以及列表区域提取为文字字段(与PP-OCR联合使用) -- 支持表格区域进行结构化分析,最终结果输出Excel文件 -- 支持python whl包和命令行两种方式,简单易用 -- 支持版面分析和表格结构化两类任务自定义训练 -- 支持文档视觉问答(Document Visual Question Answering,DocVQA)任务-语义实体识别(Semantic Entity Recognition,SER)和关系抽取(Relation Extraction,RE) - - -## 4. 效果展示 - - -### 4.1 版面分析和表格识别 - - - -图中展示了版面分析+表格识别的整体流程,图片先有版面分析划分为图像、文本、标题和表格四种区域,然后对图像、文本和标题三种区域进行OCR的检测识别,对表格进行表格识别,其中图像还会被存储下来以便使用。 - - -### 4.2 DOC-VQA - -* SER +PP-Structure是PaddleOCR团队自研的智能文档分析系统,旨在帮助开发者更好的完成版面分析、表格识别等文档理解相关任务。 -![](./docs/vqa/result_ser/zh_val_0_ser.jpg) | ![](./docs/vqa/result_ser/zh_val_42_ser.jpg) ----|--- +PP-Structurev2系统流程图如下所示,文档图像首先经过图像矫正模块,判断整图方向并完成转正,随后可以完成版面信息分析与关键信息抽取2类任务。 +- 版面分析任务中,图像首先经过版面分析模型,将图像划分为文本、表格、图像等不同区域,随后对这些区域分别进行识别,如,将表格区域送入表格识别模块进行结构化识别,将文本区域送入OCR引擎进行文字识别,最后使用版面恢复模块将其恢复为与原始图像布局一致的word或者pdf格式的文件; +- 关键信息抽取任务中,首先使用OCR引擎提取文本内容,然后由语义实体识别模块获取图像中的语义实体,最后经关系抽取模块获取语义实体之间的对应关系,从而提取需要的关键信息。 + -图中不同颜色的框表示不同的类别,对于XFUN数据集,有`QUESTION`, `ANSWER`, `HEADER` 3种类别 +更多技术细节:👉 [PP-Structurev2技术报告](docs/PP-Structurev2_introduction.md) -* 深紫色:HEADER -* 浅紫色:QUESTION -* 军绿色:ANSWER +PP-Structurev2支持各个模块独立使用或灵活搭配,如,可以单独使用版面分析,或单独使用表格识别,点击下面相应链接获取各个独立模块的使用教程: -在OCR检测框的左上方也标出了对应的类别和OCR识别结果。 +- [版面分析](layout/README_ch.md) +- [表格识别](table/README_ch.md) +- [关键信息抽取](kie/README_ch.md) +- [版面复原](recovery/README_ch.md) -* RE + +## 2. 特性 -![](./docs/vqa/result_re/zh_val_21_re.jpg) | ![](./docs/vqa/result_re/zh_val_40_re.jpg) ----|--- +PP-Structurev2的主要特性如下: +- 支持对图片/pdf形式的文档进行版面分析,可以划分**文字、标题、表格、图片、公式等**区域; +- 支持通用的中英文**表格检测**任务; +- 支持表格区域进行结构化识别,最终结果输出**Excel文件**; +- 支持基于多模态的关键信息抽取(Key Information Extraction,KIE)任务-**语义实体识别**(Semantic Entity Recognition,SER)和**关系抽取**(Relation Extraction,RE); +- 支持**版面复原**,即恢复为与原始图像布局一致的word或者pdf格式的文件; +- 支持自定义训练及python whl包调用等多种推理部署方式,简单易用; +- 与半自动数据标注工具PPOCRLabel打通,支持版面分析、表格识别、SER三种任务的标注。 + +## 3. 效果展示 +PP-Structurev2支持各个模块独立使用或灵活搭配,如,可以单独使用版面分析,或单独使用表格识别,这里仅展示几种代表性使用方式的可视化效果。 -图中红色框表示问题,蓝色框表示答案,问题和答案之间使用绿色线连接。在OCR检测框的左上方也标出了对应的类别和OCR识别结果。 + +### 3.1 版面分析和表格识别 +下图展示了版面分析+表格识别的整体流程,图片先有版面分析划分为图像、文本、标题和表格四种区域,然后对图像、文本和标题三种区域进行OCR的检测识别,对表格进行表格识别,其中图像还会被存储下来以便使用。 + - -## 5. 快速体验 + +### 3.2 版面恢复 +下图展示了基于上一节版面分析和表格识别的结果进行版面恢复的效果。 + -请参考[快速使用](./docs/quickstart.md)教程。 - -## 6. PP-Structure 介绍 + +### 3.3 关键信息抽取 - -### 6.1 版面分析+表格识别 +* SER -![pipeline](./docs/table/pipeline.jpg) +图中不同颜色的框表示不同的类别。 -在PP-Structure中,图片会先经由Layout-Parser进行版面分析,在版面分析中,会对图片里的区域进行分类,包括**文字、标题、图片、列表和表格**5类。对于前4类区域,直接使用PP-OCR完成对应区域文字检测与识别。对于表格类区域,经过表格结构化处理后,表格图片转换为相同表格样式的Excel文件。 +
+ +
- -#### 6.1.1 版面分析 +
+ +
-版面分析对文档数据进行区域分类,其中包括版面分析工具的Python脚本使用、提取指定类别检测框、性能指标以及自定义训练版面分析模型,详细内容可以参考[文档](layout/README_ch.md)。 +
+ +
- -#### 6.1.2 表格识别 +
+ +
-表格识别将表格图片转换为excel文档,其中包含对于表格文本的检测和识别以及对于表格结构和单元格坐标的预测,详细说明参考[文档](table/README_ch.md)。 +
+ +
- -### 6.2 DocVQA +* RE -DocVQA指文档视觉问答,其中包括语义实体识别 (Semantic Entity Recognition, SER) 和关系抽取 (Relation Extraction, RE) 任务。基于 SER 任务,可以完成对图像中的文本识别与分类;基于 RE 任务,可以完成对图象中的文本内容的关系提取,如判断问题对(pair),详细说明参考[文档](vqa/README.md)。 +图中红色框表示`问题`,蓝色框表示`答案`,`问题`和`答案`之间使用绿色线连接。 - -## 7. 模型库 +
+ +
-PP-Structure系列模型列表(更新中) +
+ +
- -### 7.1 版面分析模型 +
+ +
-|模型名称|模型简介|下载地址| label_map| -| --- | --- | --- | --- | -| ppyolov2_r50vd_dcn_365e_publaynet | PubLayNet 数据集训练的版面分析模型,可以划分**文字、标题、表格、图片以及列表**5类区域 | [PubLayNet](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet.tar) | {0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"}| +
+ +
- -### 7.2 OCR和表格识别模型 + +## 4. 快速体验 -|模型名称|模型简介|模型大小|下载地址| -| --- | --- | --- | --- | -|ch_PP-OCRv3_det_slim|【最新】slim量化+蒸馏版超轻量模型,支持中英文、多语种文本检测| 1.1M |[推理模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_slim_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_slim_distill_train.tar)| -|ch_PP-OCRv3_rec_slim |【最新】slim量化版超轻量模型,支持中英文、数字识别| 4.9M |[推理模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_slim_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_slim_train.tar) | -|ch_ppstructure_mobile_v2.0_SLANet|基于SLANet在PubTabNet数据集上训练的中文表格识别模型|9.3M|[推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_train.tar) | +请参考[快速使用](./docs/quickstart.md)教程。 + +## 5. 模型库 - -### 7.3 DocVQA 模型 +部分任务需要同时用到结构化分析模型和OCR模型,如表格识别需要使用表格识别模型进行结构化解析,同时也要用到OCR模型对表格内的文字进行识别,请根据具体需求选择合适的模型。 -|模型名称|模型简介|模型大小|下载地址| -| --- | --- | --- | --- | -|ser_LayoutXLM_xfun_zhd|基于LayoutXLM在xfun中文数据集上训练的SER模型|1.4G|[推理模型 coming soon]() / [训练模型](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar) | -|re_LayoutXLM_xfun_zh|基于LayoutXLM在xfun中文数据集上训练的RE模型|1.4G|[推理模型 coming soon]() / [训练模型](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar) | +结构化分析相关模型下载可以参考: +- [PP-Structure 模型库](./docs/models_list.md) +OCR相关模型下载可以参考: +- [PP-OCR 模型库](../doc/doc_ch/models_list.md) -更多模型下载,可以参考 [PP-OCR model_list](../doc/doc_ch/models_list.md) and [PP-Structure model_list](./docs/models_list.md) diff --git a/ppstructure/docs/PP-Structurev2_introduction.md b/ppstructure/docs/PP-Structurev2_introduction.md new file mode 100644 index 0000000000000000000000000000000000000000..360acdb25e46ea1afbe195350b28a71b1411797a --- /dev/null +++ b/ppstructure/docs/PP-Structurev2_introduction.md @@ -0,0 +1,426 @@ +# PP-Structurev2 + +## 目录 + +- [1. 背景](#1-背景) +- [2. 简介](#3-简介) +- [3. 整图方向矫正](#3-整图方向矫正) +- [4. 版面信息结构化](#4-版面信息结构化) + - [4.1 版面分析](#41-版面分析) + - [4.2 表格识别](#42-表格识别) + - [4.3 版面恢复](#43-版面恢复) +- [5. 关键信息抽取](#5-关键信息抽取) +- [6. Reference](#6-Reference) + +## 1. 背景 + +现实场景中包含大量的文档图像,它们以图片等非结构化形式存储。基于文档图像的结构化分析与信息抽取对于数据的数字化存储以及产业的数字化转型至关重要。基于该考虑,PaddleOCR自研并发布了PP-Structure智能文档分析系统,旨在帮助开发者更好的完成版面分析、表格识别、关键信息抽取等文档理解相关任务。 + +近期,PaddleOCR团队针对PP-Structurev1的版面分析、表格识别、关键信息抽取模块,进行了共计8个方面的升级,同时新增整图方向矫正、文档复原等功能,打造出一个全新的、效果更优的文档分析系统:PP-Structurev2。 + +## 2. 简介 + +PP-Structurev2在PP-Structurev1的基础上进一步改进,主要有以下3个方面升级: + + * **系统功能升级** :新增图像矫正和版面复原模块,图像转word/pdf、关键信息抽取能力全覆盖! + * **系统性能优化** : + * 版面分析:发布轻量级版面分析模型,速度提升**12倍**,平均CPU耗时仅需**70ms**! + * 表格识别:设计3大优化策略,预测耗时不变情况下,模型精度提升**6%**。 + * 关键信息抽取:设计视觉无关模型结构,语义实体识别精度提升**2.8%**,关系抽取精度提升**9.1%**。 + * **中文场景适配** :完成对版面分析与表格识别的中文场景适配,开源**开箱即用**的中文场景版面结构化模型! + +PP-Structurev2系统流程图如下所示,文档图像首先经过图像矫正模块,判断整图方向并完成转正,随后可以完成版面信息分析与关键信息抽取2类任务。版面分析任务中,图像首先经过版面分析模型,将图像划分为文本、表格、图像等不同区域,随后对这些区域分别进行识别,如,将表格区域送入表格识别模块进行结构化识别,将文本区域送入OCR引擎进行文字识别,最后使用版面恢复模块将其恢复为与原始图像布局一致的word或者pdf格式的文件;关键信息抽取任务中,首先使用OCR引擎提取文本内容,然后由语义实体识别模块获取图像中的语义实体,最后经关系抽取模块获取语义实体之间的对应关系,从而提取需要的关键信息。 + +
+ +
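As a concrete illustration of the pipeline described above, the following minimal sketch drives the full flow (orientation correction, layout analysis, table recognition and OCR) through the PaddleOCR whl package. It mirrors the PP-Structure whl quick start; `paddleocr>=2.6` plus `paddleclas` (for the orientation classifier) and the quick-start demo image path are assumptions, not part of this report.

```python
import os
import cv2
from paddleocr import PPStructure, save_structure_res

# Full PP-Structurev2 pipeline via the whl package: image_orientation=True enables the
# whole-image orientation classifier (requires paddleclas); layout analysis, table
# recognition and OCR then run on the corrected image.
engine = PPStructure(show_log=True, image_orientation=True)

img_path = 'ppstructure/docs/table/1.png'  # demo image from the quick start
img = cv2.imread(img_path)
result = engine(img)

# cropped regions and recognized tables are written under ./output/<image name>/
save_structure_res(result, './output', os.path.basename(img_path).split('.')[0])

for region in result:
    region.pop('img')  # drop the raw image array before printing
    print(region['type'], region['bbox'])
```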
+ + +从算法改进思路来看,对系统中的3个关键子模块,共进行了8个方面的改进。 + +* 版面分析 + * PP-PicoDet:轻量级版面分析模型 + * FGD:兼顾全局与局部特征的模型蒸馏算法 + +* 表格识别 + * PP-LCNet: CPU友好型轻量级骨干网络 + * CSP-PAN:轻量级高低层特征融合模块 + * SLAHead:结构与位置信息对齐的特征解码模块 + +* 关键信息抽取 + * VI-LayoutXLM:视觉特征无关的多模态预训练模型结构 + * TB-YX:考虑阅读顺序的文本行排序逻辑 + * UDML:联合互学习知识蒸馏策略 + +最终,与PP-Structurev1相比: + +- 版面分析模型参数量减少95.6%,推理速度提升12倍,精度提升0.4%; +- 表格识别预测耗时不变,模型精度提升6%,端到端TEDS提升2%; +- 关键信息抽取模型速度提升2.8倍,语义实体识别模型精度提升2.8%;关系抽取模型精度提升9.1%。 + +下面对各个模块进行详细介绍。 + +## 3. 整图方向矫正 + +由于训练集一般以正方向图像为主,旋转过的文档图像直接输入模型会增加识别难度,影响识别效果。PP-Structurev2引入了整图方向矫正模块来判断含文字图像的方向,并将其进行方向调整。 + +我们直接调用PaddleClas中提供的文字图像方向分类模型-[PULC_text_image_orientation](https://github.com/PaddlePaddle/PaddleClas/blob/develop/docs/zh_CN/PULC/PULC_text_image_orientation.md),该模型部分数据集图像如下所示。不同于文本行方向分类器,文字图像方向分类模型针对整图进行方向判别。文字图像方向分类模型在验证集上精度高达99%,单张图像CPU预测耗时仅为`2.16ms`。 + +
+ +
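For reference, a minimal sketch of calling this classifier directly through the PaddleClas whl package; the `PaddleClas(model_name=...)`/`predict()` interface follows the PULC documentation and is an assumption about the installed `paddleclas` version rather than part of this report.

```python
import paddleclas

# Whole-image text orientation classification (0/90/180/270 degrees) with the
# PULC model referenced above; predict() returns a generator over the inputs.
model = paddleclas.PaddleClas(model_name="text_image_orientation")
result = model.predict(input_data="ppstructure/docs/table/1.png")
print(next(result))  # e.g. [{'class_ids': [...], 'scores': [...], 'label_names': ['0']}]
```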
+ +## 4. 版面信息结构化 + +### 4.1 版面分析 + +版面分析指的是对图片形式的文档进行区域划分,定位其中的关键区域,如文字、标题、表格、图片等,PP-Structurev1使用了PaddleDetection中开源的高效检测算法PP-YOLOv2完成版面分析的任务。 + +在PP-Structurev2中,我们发布基于PP-PicoDet的轻量级版面分析模型,并针对版面分析场景定制图像尺度,同时使用FGD知识蒸馏算法,进一步提升模型精度。最终CPU上`41ms`即可完成版面分析过程(仅包含模型推理时间,数据预处理耗时大约50ms左右)。在公开数据集PubLayNet 上,消融实验如下: + +| 实验序号 | 策略 | 模型存储(M) | mAP | CPU预测耗时(ms) | +|:------:|:------:|:------:|:------:|:------:| +| 1 | PP-YOLOv2(640*640) | 221 | 93.6% | 512 | +| 2 | PP-PicoDet-LCNet2.5x(640*640) | 29.7 | 92.5% |53.2| +| 3 | PP-PicoDet-LCNet2.5x(800*608) | 29.7 | 94.2% |83.1 | +| 4 | PP-PicoDet-LCNet1.0x(800*608) | 9.7 | 93.5% | 41.2| +| 5 | PP-PicoDet-LCNet1.0x(800*608) + FGD | 9.7 | 94% |41.2| + +* 测试条件 + * paddle版本:2.3.0 + * CPU:Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz,开启mkldnn,线程数为10 + +在PubLayNet数据集上,与其他方法的性能对比如下表所示。可以看到,和基于Detectron2的版面分析工具layoutparser相比,我们的模型精度高出大约5%,预测速度快约69倍。 + +| 模型 | mAP | CPU预测耗时 | +|-------------------|-----------|------------| +| layoutparser (Detectron2) | 88.98% | 2.9s | +| PP-Structurev2 (PP-PicoDet) | **94%** | 41.2ms | + +[PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet)数据集是一个大型的文档图像数据集,包含Text、Title、Tale、Figure、List,共5个类别。数据集中包含335,703张训练集、11,245张验证集和11,405张测试集。训练数据与标注示例图如下所示: + +
+ +
+ + +#### 4.1.1 优化策略 + +**(1)轻量级版面分析模型PP-PicoDet** + +`PP-PicoDet`是PaddleDetection中提出的轻量级目标检测模型,通过使用PP-LCNet骨干网络、CSP-PAN特征融合模块、SimOTA标签分配方法等优化策略,最终在CPU与移动端具有卓越的性能。我们将PP-Structurev1中采用的PP-YOLOv2模型替换为`PP-PicoDet`,同时针对版面分析场景优化预测尺度,从针对目标检测设计的`640*640`调整为更适配文档图像的`800*608`,在`1.0x`配置下,模型精度与PP-YOLOv2相当,CPU平均预测速度可提升11倍。 + +**(1)FGD知识蒸馏** + +FGD(Focal and Global Knowledge Distillation for Detectors),是一种兼顾局部全局特征信息的模型蒸馏方法,分为Focal蒸馏和Global蒸馏2个部分。Focal蒸馏分离图像的前景和背景,让学生模型分别关注教师模型的前景和背景部分特征的关键像素;Global蒸馏部分重建不同像素之间的关系并将其从教师转移到学生,以补偿Focal蒸馏中丢失的全局信息。我们基于FGD蒸馏策略,使用教师模型PP-PicoDet-LCNet2.5x(mAP=94.2%)蒸馏学生模型PP-PicoDet-LCNet1.0x(mAP=93.5%),可将学生模型精度提升0.5%,和教师模型仅差0.2%,而预测速度比教师模型快1倍。 + +#### 4.1.2 场景适配 + +**(1)中文版面分析** + +除了英文公开数据集PubLayNet,我们也在中文场景进行了场景适配与方法验证。[CDLA](https://github.com/buptlihang/CDLA)是一个中文文档版面分析数据集,面向中文文献类(论文)场景,包含正文、标题等10个label。数据集中包含5,000张训练集和1,000张验证集。训练数据与标注示例图如下所示: + + +
+ +
+ + +在CDLA 数据集上,消融实验如下: + +| 实验序号 | 策略 | mAP | +|:------:|:------:|:------:| +| 1 | PP-YOLOv2 | 84.7% | +| 2 | PP-PicoDet-LCNet2.5x(800*608) | 87.8% | +| 3 | PP-PicoDet-LCNet1.0x(800*608) | 84.5% | +| 4 | PP-PicoDet-LCNet1.0x(800*608) + FGD | 86.8% | + + +**(2)表格版面分析** + +在实际应用中,很多场景并不关注图像中的图片、文本等版面区域,而仅需要提取文档图像中的表格,此时版面分析任务退化为一个表格检测任务,表格检测往往也是表格识别的前序任务。面向中英文文档场景,我们整理了开源领域含表格的版面分析数据集,包括TableBank、DocBank等。融合后的数据集中包含496,405张训练集与9,495张验证集图像。 + +在表格数据集上,消融实验如下: + +| 实验序号 | 策略 | mAP | +|:------:|:------:|:------:| +| 1 | PP-YOLOv2 |91.3% | +| 2 | PP-PicoDet-LCNet2.5x(800*608) | 95.9% | +| 3 | PP-PicoDet-LCNet1.0x(800*608) | 95.2% | +| 4 | PP-PicoDet-LCNet1.0x(800*608) + FGD | 95.7% | + +表格检测效果示意图如下: + +
+ +
+ +### 4.2 表格识别 + +基于深度学习的表格识别算法种类丰富,PP-Structurev1中,我们基于文本识别算法RARE研发了端到端表格识别算法TableRec-RARE,模型输出为表格结构的HTML表示,进而可以方便地转化为Excel文件。PP-Structurev2中,我们对模型结构和损失函数等5个方面进行升级,提出了 SLANet (Structure Location Alignment Network) ,模型结构如下图所示: + +
+ +
+ +在PubTabNet英文表格识别数据集上的消融实验如下: + +|策略|Acc|TEDS|推理速度(CPU+MKLDNN)|模型大小| +|---|---|---|---|---| +|TableRec-RARE| 71.73% | 93.88% |779ms |6.8M| +|+PP-LCNet| 74.71% |94.37% |778ms| 8.7M| +|+CSP-PAN| 75.68%| 94.72% |708ms| 9.3M| +|+SLAHead| 77.7%|94.85%| 766ms| 9.2M| +|+MergeToken| 76.31%| 95.89%|766ms| 9.2M| + +* 测试环境 + * paddle版本:2.3.1 + * CPU:Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz,开启mkldnn,线程数为10 + +在PubtabNet英文表格识别数据集上,和其他方法对比如下: + +|策略|Acc|TEDS|推理速度(CPU+MKLDNN)|模型大小| +|---|---|---|---|---| +|TableMaster|77.9%|96.12%|2144ms|253M| +|TableRec-RARE| 71.73% | 93.88% |779ms |6.8M| +|SLANet|76.31%| 95.89%|766ms|9.2M| + +#### 4.2.1 优化策略 + +**(1) CPU友好型轻量级骨干网络PP-LCNet** + +PP-LCNet是结合Intel-CPU端侧推理特性而设计的轻量高性能骨干网络,该方案在图像分类任务上取得了比ShuffleNetV2、MobileNetV3、GhostNet等轻量级模型更优的“精度-速度”均衡。PP-Structurev2中,我们采用PP-LCNet作为骨干网络,表格识别模型精度从71.73%提升至72.98%;同时加载通过SSLD知识蒸馏方案训练得到的图像分类模型权重作为表格识别的预训练模型,最终精度进一步提升2.95%至74.71%。 + +**(2)轻量级高低层特征融合模块CSP-PAN** + +对骨干网络提取的特征进行融合,可以有效解决尺度变化较大等复杂场景中的模型预测问题。早期,FPN模块被提出并用于特征融合,但是它的特征融合过程仅包含单向(高->低),融合不够充分。CSP-PAN基于PAN进行改进,在保证特征融合更为充分的同时,使用CSP block、深度可分离卷积等策略减小了计算量。在表格识别场景中,我们进一步将CSP-PAN的通道数从128降低至96以降低模型大小。最终表格识别模型精度提升0.97%至75.68%,预测速度提升10%。 + +**(3)结构与位置信息对齐的特征解码模块SLAHead** + +TableRec-RARE的TableAttentionHead如下图a所示,TableAttentionHead在执行完全部step的计算后拿到最终隐藏层状态表征(hiddens),随后hiddens经由SDM(Structure Decode Module)和CLDM(Cell Location Decode Module)模块生成全部的表格结构token和单元格坐标。但是这种设计忽略了单元格token和坐标之间一一对应的关系。 + +PP-Structurev2中,我们设计SLAHead模块,对单元格token和坐标之间做了对齐操作,如下图b所示。在SLAHead中,每一个step的隐藏层状态表征会分别送入SDM和CLDM来得到当前step的token和坐标,每个step的token和坐标输出分别进行concat得到表格的html表达和全部单元格的坐标。此外,考虑到表格识别模型的单元格准确率依赖于表格结构的识别准确,我们将损失函数中表格结构分支与单元格定位分支的权重比从1:1提升到8:1,并使用收敛更稳定的Smoothl1 Loss替换定位分支中的MSE Loss。最终模型精度从75.68%提高至77.7%。 + + +
+ +
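The 8:1 branch weighting and the switch from MSE to Smooth L1 loss described above can be summarized in a few lines. The snippet below is only an illustrative Paddle sketch with toy tensor shapes, not the official SLANet loss implementation.

```python
import paddle
import paddle.nn.functional as F

# Illustrative combined loss: the table-structure branch (token classification) is
# weighted 8x against the cell-location branch, and Smooth L1 replaces MSE for the
# cell-coordinate regression.
def slahead_style_loss(structure_logits, structure_labels, loc_preds, loc_targets,
                       structure_weight=8.0, loc_weight=1.0):
    structure_loss = F.cross_entropy(structure_logits, structure_labels)
    loc_loss = F.smooth_l1_loss(loc_preds, loc_targets)
    return structure_weight * structure_loss + loc_weight * loc_loss

# toy shapes: 4 structure tokens over a 30-token vocabulary, 4 cell boxes (x1, y1, x2, y2)
logits = paddle.randn([4, 30])
labels = paddle.randint(0, 30, [4])
print(slahead_style_loss(logits, labels, paddle.rand([4, 4]), paddle.rand([4, 4])))
```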
+
+
+**(4)其他**
+
+TableRec-RARE算法中,我们使用`<td>`和`</td>`两个单独的token来表示一个非跨行列单元格,这种表示方式限制了网络对于单元格数量较多表格的处理能力。
+
+PP-Structurev2中,我们参考TableMaster中的token处理方法,将`<td>`和`</td>`合并为一个token-`<td></td>`。合并token后,验证集中token长度大于500的图片也参与模型评估,最终模型精度降低为76.31%,但是端到端TEDS提升1.04%。
+
+#### 4.2.2 中文场景适配
+
+除了上述模型策略的升级外,本次升级还开源了中文表格识别模型。在实际应用场景中,表格图像存在着各种各样的倾斜角度(PubTabNet数据集不存在该问题),因此在中文模型中,我们将单元格坐标回归的点数从2个(左上,右下)增加到4个(左上,右上,右下,左下)。在内部测试集上,模型升级前后指标如下:
+|模型|acc|
+|---|---|
+|TableRec-RARE|44.3%|
+|SLANet|59.35%|
+
+可视化结果如下,左为输入图像,右为识别的html表格:
+
+
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
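As a usage illustration for the table recognition described in this section, the sketch below runs table recognition on its own through the PaddleOCR whl package (layout analysis disabled), following the PP-Structure whl quick start; the predicted structure is written out as an Excel file. `paddleocr>=2.6` and the quick-start demo image are assumed.

```python
import os
import cv2
from paddleocr import PPStructure, save_structure_res

# Table recognition only: the whole input image is treated as a single table region
# and the predicted structure is saved as an Excel file under ./output/<image name>/.
table_engine = PPStructure(layout=False, show_log=True)

img_path = 'ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, './output', os.path.basename(img_path).split('.')[0])

for region in result:
    region.pop('img')
    print(region['res'])  # HTML representation of the table structure and cell contents
```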
+
+
+
+
+### 4.3 版面恢复
+
+版面恢复指的是文档图像经过OCR识别、版面分析、表格识别等方法处理后的内容可以与原始文档保持相同的排版方式,并输出到word等文档中。PP-Structurev2中,我们新增了版面恢复系统,包含版面分析、表格识别、OCR文本检测与识别等子模块。
+下图展示了版面恢复的结果:
+
+
+ +
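A minimal end-to-end sketch of the layout recovery flow described above, adapted from the PP-Structure whl quick start; the `recovery_to_doc` import path and the `sorted_layout_boxes`/`convert_info_docx` signatures are taken from that quick start and assumed valid for `paddleocr>=2.6`.

```python
import os
import cv2
from paddleocr import PPStructure, save_structure_res
from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx

# Run the structure pipeline, then re-order the detected regions and export a .docx
# file whose layout follows the original page.
engine = PPStructure(show_log=True)

save_folder = './output'
img_path = 'ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

h, w, _ = img.shape
result = sorted_layout_boxes(result, w)
convert_info_docx(img, result, save_folder, os.path.basename(img_path).split('.')[0])
```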
+ +## 5. 关键信息抽取 + +关键信息抽取指的是针对文档图像的文字内容,提取出用户关注的关键信息,如身份证中的姓名、住址等字段。PP-Structure中支持了基于多模态LayoutLM系列模型的语义实体识别 (Semantic Entity Recognition, SER) 以及关系抽取 (Relation Extraction, RE) 任务。PP-Structurev2中,我们对模型结构以及下游任务训练方法进行升级,提出了VI-LayoutXLM(Visual-feature Independent LayoutXLM),具体流程图如下所示。 + + +
+ +
+ + +具体优化策略包括: + +* VI-LayoutXLM:视觉特征无关的多模态预训练模型结构 +* TB-YX:考虑人类阅读顺序的文本行排序逻辑 +* UDML:联合互学习知识蒸馏策略 + +XFUND-zh数据集上,SER任务的消融实验如下所示。 + +| 实验序号 | 策略 | 模型大小(G) | 精度 | GPU预测耗时(ms) | CPU预测耗时(ms) | +|:------:|:------:|:------:|:------:|:------:|:------:| +| 1 | LayoutXLM | 1.4 | 89.50% | 59.35 | 766.16 | +| 2 | VI-LayoutXLM | 1.1 | 90.46% | 23.71 | 675.58 | +| 3 | 实验2 + TB-YX文本行排序 | 1.1 | 92.50% | 23.71 | 675.58 | +| 4 | 实验3 + UDML蒸馏 | 1.1 | 93.19% | 23.71 | 675.58 | +| 5 | 实验3 + UDML蒸馏 | 1.1 | **93.19%** | **15.49** | **675.58** | + +* 测试条件 + * paddle版本:2.3.0 + * GPU:V100,实验5的GPU预测耗时使用`trt+fp16`测试得到,环境为cuda10.2+ cudnn8.1.1 + trt7.2.3.4,其他实验的预测耗时统计中没有使用TRT。 + * CPU:Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz,开启mkldnn,线程数为10 + +在XFUND数据集上,与其他方法的效果对比如下所示。 + +| 模型 | SER Hmean | RE Hmean | +|-------------------|-----------|------------| +| LayoutLMv2-base | 85.44% | 67.77% | +| LayoutXLM-base | 89.24% | 70.73% | +| StrucTexT-large | 92.29% | **86.81%** | +| VI-LayoutXLM-base (ours) | **93.19%** | 83.92% | + + +### 5.1 优化策略 + +**(1) VI-LayoutXLM(Visual-feature Independent LayoutXLM)** + +LayoutLMv2以及LayoutXLM中引入视觉骨干网络,用于提取视觉特征,并与后续的text embedding进行联合,作为多模态的输入embedding。但是该模块为基于`ResNet_x101_64x4d`的特征提取网络,特征抽取阶段耗时严重,因此我们将其去除,同时仍然保留文本、位置以及布局等信息,最终发现针对LayoutXLM进行改进,下游SER任务精度无损,针对LayoutLMv2进行改进,下游SER任务精度仅降低`2.1%`,而模型大小减小了约`340M`。具体消融实验如下所示。 + +| 模型 | 模型大小 (G) | F-score | 精度收益 | +|-----------------|----------|---------|--------| +| LayoutLMv2 | 0.76 | 84.20% | - | +| VI-LayoutLMv2 | 0.42 | 82.10% | -2.10% | +| LayoutXLM | 1.4 | 89.50% | - | +| VI-LayouXLM | 1.1 | 90.46% | +0.96% | + +同时,基于XFUND数据集,VI-LayoutXLM在RE任务上的精度也进一步提升了`1.06%`。 + +**(2) TB-YX排序方法(Threshold-Based YX sorting algorithm)** + +文本阅读顺序对于信息抽取与文本理解等任务至关重要,传统多模态模型中,没有考虑不同OCR工具可能产生的不正确阅读顺序,而模型输入中包含位置编码,阅读顺序会直接影响预测结果,在预处理中,我们对文本行按照从上到下,从左到右(YX)的顺序进行排序,为防止文本行位置轻微干扰带来的排序结果不稳定问题,在排序的过程中,引入位置偏移阈值Th,对于Y方向距离小于Th的2个文本内容,使用x方向的位置从左到右进行排序。TB-YX排序方法伪代码如下所示。 + +```py +def order_by_tbyx(ocr_info, th=20): + """ + ocr_info: a list of dict, which contains bbox information([x1, y1, x2, y2]) + th: threshold of the position threshold + """ + res = sorted(ocr_info, key=lambda r: (r["bbox"][1], r["bbox"][0])) # sort using y1 first and then x1 + for i in range(len(res) - 1): + for j in range(i, 0, -1): + # restore the order using the + if abs(res[j + 1]["bbox"][1] - res[j]["bbox"][1]) < th and \ + (res[j + 1]["bbox"][0] < res[j]["bbox"][0]): + tmp = deepcopy(res[j]) + res[j] = deepcopy(res[j + 1]) + res[j + 1] = deepcopy(tmp) + else: + break + return res +``` + +不同排序方法的结果对比如下所示,可以看出引入偏离阈值之后,排序结果更加符合人类的阅读顺序。 + +
+ +
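For completeness, here is a self-contained version of the TB-YX ordering pseudocode above (adding the `deepcopy` import it relies on), together with a toy example in which a plain (y1, x1) sort would visit the right-hand box of the second line first, while the thresholded sort restores the human reading order.

```python
from copy import deepcopy

def order_by_tbyx(ocr_info, th=20):
    """Sort boxes by y1 first, then fix the x order of boxes whose y1 values
    differ by less than `th` (i.e. boxes that sit on the same visual line)."""
    res = sorted(ocr_info, key=lambda r: (r["bbox"][1], r["bbox"][0]))
    for i in range(len(res) - 1):
        for j in range(i, 0, -1):
            # swap neighbours that are on (almost) the same line but out of x order
            if abs(res[j + 1]["bbox"][1] - res[j]["bbox"][1]) < th and \
                    res[j + 1]["bbox"][0] < res[j]["bbox"][0]:
                res[j], res[j + 1] = deepcopy(res[j + 1]), deepcopy(res[j])
            else:
                break
    return res

boxes = [
    {"text": "line1",       "bbox": [10, 10, 100, 40]},
    {"text": "line2-right", "bbox": [200, 92, 300, 120]},  # y1 differs by 3px from the left box
    {"text": "line2-left",  "bbox": [10, 95, 100, 125]},
]
print([b["text"] for b in order_by_tbyx(boxes)])
# -> ['line1', 'line2-left', 'line2-right']
```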
+
+
+使用该策略,最终XFUND数据集上,SER任务F1指标提升`2.06%`,RE任务F1指标提升`7.04%`。
+
+**(3) 互学习蒸馏策略**
+
+UDML(Unified-Deep Mutual Learning)联合互学习是PP-OCRv2与PP-OCRv3中采用的对于文本识别非常有效的提升模型效果的策略。在训练时,引入2个完全相同的模型进行互学习,计算2个模型之间的互蒸馏损失函数(DML loss),同时对transformer中间层的输出结果计算距离损失函数(L2 loss)。使用该策略,最终XFUND数据集上,SER任务F1指标提升`0.6%`,RE任务F1指标提升`5.01%`。
+
+最终优化后模型基于SER任务的可视化结果如下所示。
+
+
+ +
+ +
+ +
+ + +RE任务的可视化结果如下所示。 + + +
+ +
+ +
+ +
+ +### 5.2 更多场景消融实验 + +我们在FUNSD数据集上,同时基于RE任务进行对本次升级策略进行验证,具体实验结果如下所示,可以看出该方案针对不同任务,在不同数据集上均有非常明显的精度收益。 + +#### 5.2.1 XFUND_zh数据集 + +**RE任务结果** + +| 实验序号 | 策略 | 模型大小(G) | F1-score | +|:------:|:------------:|:---------:|:----------:| +| 1 | LayoutXLM | 1.4 | 70.81% | +| 2 | VI-LayoutXLM | 1.1 | 71.87% | +| 3 | 实验2 + PP-OCR排序 | 1.1 | 78.91% | +| 4 | 实验3 + UDML蒸馏 | 1.1 | **83.92%** | + + +#### 5.2.2 FUNSD数据集 + +**SER任务结果** + +| 实验序号 | 策略 | F1-score | +|:------:|:------:|:------:| +| 1 | LayoutXLM | 82.28% | +| 2 | PP-Structurev2 SER | **87.79%** | + + +**RE任务结果** + +| 实验序号 | 策略 | F1-score | +|:------:|:------:|:------:| +| 1 | LayoutXLM | 53.13% | +| 2 | PP-Structurev2 SER | **74.87%** | + + +## 6. Reference +* [1] Zhong X, ShafieiBavani E, Jimeno Yepes A. Image-based table recognition: data, model, and evaluation[C]//European Conference on Computer Vision. Springer, Cham, 2020: 564-580. +* [2] Cui C, Gao T, Wei S. Yuning Du, Ruoyu Guo, Shuilong Dong, Bin Lu, Ying Zhou, Xueying Lv, Qiwen Liu, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma* [J]. Pplcnet: A lightweight cpu convolutional neural network, 2021, 3. +* [3] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 2117-2125. +* [4] Yu G, Chang Q, Lv W, et al. PP-PicoDet: A Better Real-Time Object Detector on Mobile Devices[J]. arXiv preprint arXiv:2111.00902, 2021. +* [5] Bochkovskiy A, Wang C Y, Liao H Y M. Yolov4: Optimal speed and accuracy of object detection[J]. arXiv preprint arXiv:2004.10934, 2020. +* [6] Ye J, Qi X, He Y, et al. PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML[J]. arXiv preprint arXiv:2105.01848, 2021. +* [7] Zhong X, Tang J, Yepes A J. Publaynet: largest dataset ever for document layout analysis[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1015-1022. +* [8] CDLA:https://github.com/buptlihang/CDLA +* [9]Gao L, Huang Y, Déjean H, et al. ICDAR 2019 competition on table detection and recognition (cTDaR)[C]//2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019: 1510-1515. +* [10] Mondal A, Lipps P, Jawahar C V. IIIT-AR-13K: a new dataset for graphical object detection in documents[C]//International Workshop on Document Analysis Systems. Springer, Cham, 2020: 216-230. +* [11] Tal ocr_tabel:https://ai.100tal.com/dataset +* [12] Li M, Cui L, Huang S, et al. Tablebank: A benchmark dataset for table detection and recognition[J]. arXiv preprint arXiv:1903.01949, 2019. +* [13]Li M, Xu Y, Cui L, et al. DocBank: A benchmark dataset for document layout analysis[J]. arXiv preprint arXiv:2006.01038, 2020. +* [14] Xu Y, Li M, Cui L, et al. Layoutlm: Pre-training of text and layout for document image understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1192-1200. +* [15] Xu Y, Xu Y, Lv T, et al. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding[J]. arXiv preprint arXiv:2012.14740, 2020. +* [16] Xu Y, Lv T, Cui L, et al. Layoutxlm: Multimodal pre-training for multilingual visually-rich document understanding[J]. arXiv preprint arXiv:2104.08836, 2021. +* [17] Xu Y, Lv T, Cui L, et al. XFUND: A Benchmark Dataset for Multilingual Visually Rich Form Understanding[C]//Findings of the Association for Computational Linguistics: ACL 2022. 2022: 3214-3224. 
+* [18] Jaume G, Ekenel H K, Thiran J P. Funsd: A dataset for form understanding in noisy scanned documents[C]//2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). IEEE, 2019, 2: 1-6. diff --git a/ppstructure/docs/inference.md b/ppstructure/docs/inference.md index 7604246da5a79b0ee2939c9fb4c91602531ec7de..516db82784ce98abba6db14c795fe7323be508e0 100644 --- a/ppstructure/docs/inference.md +++ b/ppstructure/docs/inference.md @@ -1,38 +1,41 @@ # 基于Python预测引擎推理 -- [1. Structure](#1) +- [1. 版面信息抽取](#1) - [1.1 版面分析+表格识别](#1.1) - [1.2 版面分析](#1.2) - [1.3 表格识别](#1.3) -- [2. DocVQA](#2) +- [2. 关键信息抽取](#2) -## 1. Structure +## 1. 版面信息抽取 进入`ppstructure`目录 ```bash cd ppstructure -```` +``` 下载模型 ```bash mkdir inference && cd inference -# 下载PP-OCRv2文本检测模型并解压 -wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_slim_quant_infer.tar && tar xf ch_PP-OCRv2_det_slim_quant_infer.tar -# 下载PP-OCRv2文本识别模型并解压 -wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_slim_quant_infer.tar && tar xf ch_PP-OCRv2_rec_slim_quant_infer.tar -# 下载超轻量级英文表格预测模型并解压 -wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar && tar xf en_ppocr_mobile_v2.0_table_structure_infer.tar +# 下载PP-Structurev2版面分析模型并解压 +wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_layout_infer.tar && tar xf picodet_lcnet_x1_0_layout_infer.tar +# 下载PP-OCRv3文本检测模型并解压 +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_infer.tar && tar xf ch_PP-OCRv3_det_infer.tar +# 下载PP-OCRv3文本识别模型并解压 +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar && tar xf ch_PP-OCRv3_rec_infer.tar +# 下载PP-Structurev2表格识别模型并解压 +wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar && tar xf ch_ppstructure_mobile_v2.0_SLANet_infer.tar cd .. 
``` ### 1.1 版面分析+表格识别 ```bash -python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv2_det_slim_quant_infer \ - --rec_model_dir=inference/ch_PP-OCRv2_rec_slim_quant_infer \ - --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer \ +python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv3_det_infer \ + --rec_model_dir=inference/ch_PP-OCRv3_rec_infer \ + --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer \ + --layout_model_dir=inference/picodet_lcnet_x1_0_layout_infer \ --image_dir=./docs/table/1.png \ --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt \ - --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \ + --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt \ --output=../output \ --vis_font_path=../doc/fonts/simfang.ttf ``` @@ -41,19 +44,23 @@ python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv2_det_slim_quant_i ### 1.2 版面分析 ```bash -python3 predict_system.py --image_dir=./docs/table/1.png --table=false --ocr=false --output=../output/ +python3 predict_system.py --layout_model_dir=inference/picodet_lcnet_x1_0_layout_infer \ + --image_dir=./docs/table/1.png \ + --output=../output \ + --table=false \ + --ocr=false ``` 运行完成后,每张图片会在`output`字段指定的目录下的`structure`目录下有一个同名目录,图片区域会被裁剪之后保存下来,图片名为表格在图片里的坐标。版面分析结果会存储在`res.txt`文件中。 ### 1.3 表格识别 ```bash -python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv2_det_slim_quant_infer \ - --rec_model_dir=inference/ch_PP-OCRv2_rec_slim_quant_infer \ - --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer \ +python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv3_det_infer \ + --rec_model_dir=inference/ch_PP-OCRv3_rec_infer \ + --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer \ --image_dir=./docs/table/table.jpg \ --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt \ - --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \ + --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt \ --output=../output \ --vis_font_path=../doc/fonts/simfang.ttf \ --layout=false @@ -61,20 +68,22 @@ python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv2_det_slim_quant_i 运行完成后,每张图片会在`output`字段指定的目录下的`structure`目录下有一个同名目录,表格会存储为一个excel,excel文件名为`[0,0,img_h,img_w]`。 -## 2. DocVQA +## 2. 关键信息抽取 ```bash cd ppstructure -# 下载模型 mkdir inference && cd inference -# 下载SER xfun 模型并解压 -wget https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar && tar xf PP-Layout_v1.0_ser_pretrained.tar +# 下载SER XFUND 模型并解压 +wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_infer.tar && tar -xf ser_vi_layoutxlm_xfund_infer.tar cd .. 
- -python3 predict_system.py --model_name_or_path=vqa/PP-Layout_v1.0_ser_pretrained/ \ - --mode=vqa \ - --image_dir=vqa/images/input/zh_val_0.jpg \ - --vis_font_path=../doc/fonts/simfang.ttf +python3 kie/predict_kie_token_ser.py \ + --kie_algorithm=LayoutXLM \ + --ser_model_dir=../inference/ser_vi_layoutxlm_xfund_infer \ + --image_dir=./docs/kie/input/zh_val_42.jpg \ + --ser_dict_path=../ppocr/utils/dict/kie_dict/xfund_class_list.txt \ + --vis_font_path=../doc/fonts/simfang.ttf \ + --ocr_order_method="tb-yx" ``` -运行完成后,每张图片会在`output`字段指定的目录下的`vqa`目录下存放可视化之后的图片,图片名和输入图片名一致。 + +运行完成后,每张图片会在`output`字段指定的目录下的`kie`目录下存放可视化之后的图片,图片名和输入图片名一致。 diff --git a/ppstructure/docs/inference_en.md b/ppstructure/docs/inference_en.md index 2a0fb30543eaa06c4ede5f82a443135c959db37d..71019ec70f80e44bc16d2b0d07b0bb93b475b7e7 100644 --- a/ppstructure/docs/inference_en.md +++ b/ppstructure/docs/inference_en.md @@ -1,13 +1,13 @@ # Python Inference -- [1. Structure](#1) +- [1. Layout Structured Analysis](#1) - [1.1 layout analysis + table recognition](#1.1) - [1.2 layout analysis](#1.2) - [1.3 table recognition](#1.3) -- [2. DocVQA](#2) +- [2. Key Information Extraction](#2) -## 1. Structure +## 1. Layout Structured Analysis Go to the `ppstructure` directory ```bash @@ -18,23 +18,26 @@ download model ```bash mkdir inference && cd inference -# Download the PP-OCRv2 text detection model and unzip it -wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_det_slim_quant_infer.tar && tar xf ch_PP-OCRv2_det_slim_quant_infer.tar -# Download the PP-OCRv2 text recognition model and unzip it -wget https://paddleocr.bj.bcebos.com/PP-OCRv2/chinese/ch_PP-OCRv2_rec_slim_quant_infer.tar && tar xf ch_PP-OCRv2_rec_slim_quant_infer.tar -# Download the ultra-lightweight English table structure model and unzip it -wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar && tar xf en_ppocr_mobile_v2.0_table_structure_infer.tar +# Download the PP-Structurev2 layout analysis model and unzip it +wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_layout_infer.tar && tar xf picodet_lcnet_x1_0_layout_infer.tar +# Download the PP-OCRv3 text detection model and unzip it +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_infer.tar && tar xf ch_PP-OCRv3_det_infer.tar +# Download the PP-OCRv3 text recognition model and unzip it +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar && tar xf ch_PP-OCRv3_rec_infer.tar +# Download the PP-Structurev2 form recognition model and unzip it +wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar && tar xf ch_ppstructure_mobile_v2.0_SLANet_infer.tar cd .. 
``` ### 1.1 layout analysis + table recognition ```bash -python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv2_det_slim_quant_infer \ - --rec_model_dir=inference/ch_PP-OCRv2_rec_slim_quant_infer \ - --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer \ +python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv3_det_infer \ + --rec_model_dir=inference/ch_PP-OCRv3_rec_infer \ + --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer \ + --layout_model_dir=inference/picodet_lcnet_x1_0_layout_infer \ --image_dir=./docs/table/1.png \ --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt \ - --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \ + --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt \ --output=../output \ --vis_font_path=../doc/fonts/simfang.ttf ``` @@ -43,19 +46,23 @@ After the operation is completed, each image will have a directory with the same ### 1.2 layout analysis ```bash -python3 predict_system.py --image_dir=./docs/table/1.png --table=false --ocr=false --output=../output/ +python3 predict_system.py --layout_model_dir=inference/picodet_lcnet_x1_0_layout_infer \ + --image_dir=./docs/table/1.png \ + --output=../output \ + --table=false \ + --ocr=false ``` After the operation is completed, each image will have a directory with the same name in the `structure` directory under the directory specified by the `output` field. Each picture in image will be cropped and saved. The filename of picture area is their coordinates in the image. Layout analysis results will be stored in the `res.txt` file ### 1.3 table recognition ```bash -python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv2_det_slim_quant_infer \ - --rec_model_dir=inference/ch_PP-OCRv2_rec_slim_quant_infer \ - --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer \ +python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv3_det_infer \ + --rec_model_dir=inference/ch_PP-OCRv3_rec_infer \ + --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer \ --image_dir=./docs/table/table.jpg \ --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt \ - --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \ + --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt \ --output=../output \ --vis_font_path=../doc/fonts/simfang.ttf \ --layout=false @@ -63,19 +70,22 @@ python3 predict_system.py --det_model_dir=inference/ch_PP-OCRv2_det_slim_quant_i After the operation is completed, each image will have a directory with the same name in the `structure` directory under the directory specified by the `output` field. Each table in the image will be stored as an excel. The filename of excel is their coordinates in the image. -## 2. DocVQA +## 2. Key Information Extraction ```bash cd ppstructure -# download model mkdir inference && cd inference -wget https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar && tar xf PP-Layout_v1.0_ser_pretrained.tar +# download model +wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_infer.tar && tar -xf ser_vi_layoutxlm_xfund_infer.tar cd .. 
- -python3 predict_system.py --model_name_or_path=vqa/PP-Layout_v1.0_ser_pretrained/ \ - --mode=vqa \ - --image_dir=vqa/images/input/zh_val_0.jpg \ - --vis_font_path=../doc/fonts/simfang.ttf +python3 kie/predict_kie_token_ser.py \ + --kie_algorithm=LayoutXLM \ + --ser_model_dir=../inference/ser_vi_layoutxlm_xfund_infer \ + --image_dir=./docs/kie/input/zh_val_42.jpg \ + --ser_dict_path=../ppocr/utils/dict/kie_dict/xfund_class_list.txt \ + --vis_font_path=../doc/fonts/simfang.ttf \ + --ocr_order_method="tb-yx" ``` -After the operation is completed, each image will store the visualized image in the `vqa` directory under the directory specified by the `output` field, and the image name is the same as the input image name. + +After the operation is completed, each image will store the visualized image in the `kie` directory under the directory specified by the `output` field, and the image name is the same as the input image name. diff --git a/ppstructure/docs/installation.md b/ppstructure/docs/installation.md deleted file mode 100644 index 3f564cb2ddfe546642e6f92e2c024bbe3a1f7ffc..0000000000000000000000000000000000000000 --- a/ppstructure/docs/installation.md +++ /dev/null @@ -1,26 +0,0 @@ -- [快速安装](#快速安装) - - [1. PaddlePaddle 和 PaddleOCR](#1-paddlepaddle-和-paddleocr) - - [2. 安装其他依赖](#2-安装其他依赖) - - [2.1 VQA所需依赖](#21--vqa所需依赖) - -# 快速安装 - -## 1. PaddlePaddle 和 PaddleOCR - -可参考[PaddleOCR安装文档](../../doc/doc_ch/installation.md) - -## 2. 安装其他依赖 - -### 2.1 VQA所需依赖 -* paddleocr - -```bash -pip3 install paddleocr -``` - -* PaddleNLP -```bash -git clone https://github.com/PaddlePaddle/PaddleNLP -b develop -cd PaddleNLP -pip3 install -e . -``` diff --git a/ppstructure/docs/installation_en.md b/ppstructure/docs/installation_en.md deleted file mode 100644 index 02b02db0c58f60a5296734b93563510732a7286d..0000000000000000000000000000000000000000 --- a/ppstructure/docs/installation_en.md +++ /dev/null @@ -1,30 +0,0 @@ -# Quick installation - -- [1. PaddlePaddle 和 PaddleOCR](#1) -- [2. Install other dependencies](#2) - - [2.1 VQA](#21) - - - -## 1. PaddlePaddle and PaddleOCR - -Please refer to [PaddleOCR installation documentation](../../doc/doc_en/installation_en.md) - - -## 2. Install other dependencies - - -### 2.1 VQA - -* paddleocr - -```bash -pip3 install paddleocr -``` - -* PaddleNLP -```bash -git clone https://github.com/PaddlePaddle/PaddleNLP -b develop -cd PaddleNLP -pip3 install -e . 
-``` diff --git a/ppstructure/docs/vqa/input/zh_val_0.jpg b/ppstructure/docs/kie/input/zh_val_0.jpg similarity index 100% rename from ppstructure/docs/vqa/input/zh_val_0.jpg rename to ppstructure/docs/kie/input/zh_val_0.jpg diff --git a/ppstructure/docs/vqa/input/zh_val_21.jpg b/ppstructure/docs/kie/input/zh_val_21.jpg similarity index 100% rename from ppstructure/docs/vqa/input/zh_val_21.jpg rename to ppstructure/docs/kie/input/zh_val_21.jpg diff --git a/ppstructure/docs/vqa/input/zh_val_40.jpg b/ppstructure/docs/kie/input/zh_val_40.jpg similarity index 100% rename from ppstructure/docs/vqa/input/zh_val_40.jpg rename to ppstructure/docs/kie/input/zh_val_40.jpg diff --git a/ppstructure/docs/vqa/input/zh_val_42.jpg b/ppstructure/docs/kie/input/zh_val_42.jpg similarity index 100% rename from ppstructure/docs/vqa/input/zh_val_42.jpg rename to ppstructure/docs/kie/input/zh_val_42.jpg diff --git a/ppstructure/docs/vqa/result_re/zh_val_21_re.jpg b/ppstructure/docs/kie/result_re/zh_val_21_re.jpg similarity index 100% rename from ppstructure/docs/vqa/result_re/zh_val_21_re.jpg rename to ppstructure/docs/kie/result_re/zh_val_21_re.jpg diff --git a/ppstructure/docs/vqa/result_re/zh_val_40_re.jpg b/ppstructure/docs/kie/result_re/zh_val_40_re.jpg similarity index 100% rename from ppstructure/docs/vqa/result_re/zh_val_40_re.jpg rename to ppstructure/docs/kie/result_re/zh_val_40_re.jpg diff --git a/ppstructure/docs/vqa/result_re/zh_val_42_re.jpg b/ppstructure/docs/kie/result_re/zh_val_42_re.jpg similarity index 100% rename from ppstructure/docs/vqa/result_re/zh_val_42_re.jpg rename to ppstructure/docs/kie/result_re/zh_val_42_re.jpg diff --git a/ppstructure/docs/vqa/result_re_with_gt_ocr/zh_val_42_re.jpg b/ppstructure/docs/kie/result_re_with_gt_ocr/zh_val_42_re.jpg similarity index 100% rename from ppstructure/docs/vqa/result_re_with_gt_ocr/zh_val_42_re.jpg rename to ppstructure/docs/kie/result_re_with_gt_ocr/zh_val_42_re.jpg diff --git a/ppstructure/docs/vqa/result_ser/zh_val_0_ser.jpg b/ppstructure/docs/kie/result_ser/zh_val_0_ser.jpg similarity index 100% rename from ppstructure/docs/vqa/result_ser/zh_val_0_ser.jpg rename to ppstructure/docs/kie/result_ser/zh_val_0_ser.jpg diff --git a/ppstructure/docs/vqa/result_ser/zh_val_42_ser.jpg b/ppstructure/docs/kie/result_ser/zh_val_42_ser.jpg similarity index 100% rename from ppstructure/docs/vqa/result_ser/zh_val_42_ser.jpg rename to ppstructure/docs/kie/result_ser/zh_val_42_ser.jpg diff --git a/ppstructure/docs/vqa/result_ser_with_gt_ocr/zh_val_42_ser.jpg b/ppstructure/docs/kie/result_ser_with_gt_ocr/zh_val_42_ser.jpg similarity index 100% rename from ppstructure/docs/vqa/result_ser_with_gt_ocr/zh_val_42_ser.jpg rename to ppstructure/docs/kie/result_ser_with_gt_ocr/zh_val_42_ser.jpg diff --git a/ppstructure/docs/models_list.md b/ppstructure/docs/models_list.md index ef2994cabea38709464780d25b5f32c3b9801b4c..935d12d756eec467574f9ae32d48c70a3ea054c3 100644 --- a/ppstructure/docs/models_list.md +++ b/ppstructure/docs/models_list.md @@ -10,13 +10,17 @@ ## 1. 
版面分析模型 -|模型名称|模型简介|下载地址|label_map| -| --- | --- | --- | --- | -| ppyolov2_r50vd_dcn_365e_publaynet | PubLayNet 数据集训练的版面分析模型,可以划分**文字、标题、表格、图片以及列表**5类区域 | [推理模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet.tar) / [训练模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet_pretrained.pdparams) |{0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"}| -| ppyolov2_r50vd_dcn_365e_tableBank_word | TableBank Word 数据集训练的版面分析模型,只能检测表格 | [推理模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_word.tar) | {0:"Table"}| -| ppyolov2_r50vd_dcn_365e_tableBank_latex | TableBank Latex 数据集训练的版面分析模型,只能检测表格 | [推理模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_latex.tar) | {0:"Table"}| +|模型名称|模型简介|推理模型大小|下载地址|dict path| +| --- | --- | --- | --- | --- | +| picodet_lcnet_x1_0_fgd_layout | 基于PicoDet LCNet_x1_0和FGD蒸馏在PubLayNet 数据集训练的英文版面分析模型,可以划分**文字、标题、表格、图片以及列表**5类区域 | 9.7M | [推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout.pdparams) | [PubLayNet dict](../../ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt) | +| ppyolov2_r50vd_dcn_365e_publaynet | 基于PP-YOLOv2在PubLayNet数据集上训练的英文版面分析模型 | 221M | [推理模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet.tar) / [训练模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet_pretrained.pdparams) | 同上 | +| picodet_lcnet_x1_0_fgd_layout_cdla | CDLA数据集训练的中文版面分析模型,可以划分为**表格、图片、图片标题、表格、表格标题、页眉、脚本、引用、公式**10类区域 | 9.7M | [推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla.pdparams) | [CDLA dict](../../ppocr/utils/dict/layout_dict/layout_cdla_dict.txt) | +| picodet_lcnet_x1_0_fgd_layout_table | 表格数据集训练的版面分析模型,支持中英文文档表格区域的检测 | 9.7M | [推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_table_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_table.pdparams) | [Table dict](../../ppocr/utils/dict/layout_dict/layout_table_dict.txt) | +| ppyolov2_r50vd_dcn_365e_tableBank_word | 基于PP-YOLOv2在TableBank Word 数据集训练的版面分析模型,支持英文文档表格区域的检测 | 221M | [推理模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_word.tar) | 同上 | +| ppyolov2_r50vd_dcn_365e_tableBank_latex | 基于PP-YOLOv2在TableBank Latex数据集训练的版面分析模型,支持英文文档表格区域的检测 | 221M | [推理模型](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_latex.tar) | 同上 | + ## 2. 
OCR和表格识别模型 @@ -24,8 +28,8 @@ |模型名称|模型简介|推理模型大小|下载地址| | --- | --- | --- | --- | -|en_ppocr_mobile_v2.0_table_det|PubLayNet数据集训练的英文表格场景的文字检测|4.7M|[推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_det_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_det_train.tar) | -|en_ppocr_mobile_v2.0_table_rec|PubLayNet数据集训练的英文表格场景的文字识别|6.9M|[推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_rec_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_rec_train.tar) | +|en_ppocr_mobile_v2.0_table_det|PubTabNet数据集训练的英文表格场景的文字检测|4.7M|[推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_det_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_det_train.tar) | +|en_ppocr_mobile_v2.0_table_rec|PubTabNet数据集训练的英文表格场景的文字识别|6.9M|[推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_rec_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_rec_train.tar) | 如需要使用其他OCR模型,可以在 [PP-OCR model_list](../../doc/doc_ch/models_list.md) 下载模型或者使用自己训练好的模型配置到 `det_model_dir`, `rec_model_dir`两个字段即可。 @@ -34,9 +38,9 @@ |模型名称|模型简介|推理模型大小|下载地址| | --- | --- | --- | --- | -|en_ppocr_mobile_v2.0_table_structure|基于TableRec-RARE在PubTabNet数据集上训练的英文表格识别模型|18.6M|[推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_structure_train.tar) | -|en_ppstructure_mobile_v2.0_SLANet|基于SLANet在PubTabNet数据集上训练的英文表格识别模型|9M|[推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_train.tar) | -|ch_ppstructure_mobile_v2.0_SLANet|基于SLANet在PubTabNet数据集上训练的中文表格识别模型|9.3M|[推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_train.tar) | +|en_ppocr_mobile_v2.0_table_structure|基于TableRec-RARE在PubTabNet数据集上训练的英文表格识别模型|6.8M|[推理模型](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_structure_train.tar) | +|en_ppstructure_mobile_v2.0_SLANet|基于SLANet在PubTabNet数据集上训练的英文表格识别模型|9.2M|[推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_train.tar) | +|ch_ppstructure_mobile_v2.0_SLANet|基于SLANet的中文表格识别模型|9.3M|[推理模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar) / [训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_train.tar) | diff --git a/ppstructure/docs/models_list_en.md b/ppstructure/docs/models_list_en.md index 64a7cdebc3e3c7ac18ae2f61013aa4e8a7c3ead8..85531fb753c4e32f0cdc9296ab97a9faebbb0ebd 100644 --- a/ppstructure/docs/models_list_en.md +++ b/ppstructure/docs/models_list_en.md @@ -4,18 +4,20 @@ - [2. OCR and Table Recognition](#2-ocr-and-table-recognition) - [2.1 OCR](#21-ocr) - [2.2 Table Recognition](#22-table-recognition) -- [3. 
VQA](#3-vqa) -- [4. KIE](#4-kie) - +- [3. KIE](#3-kie) + ## 1. Layout Analysis -|model name| description |download|label_map| -| --- |---------------------------------------------------------------------------------------------------------------------------------------------------------| --- | --- | -| ppyolov2_r50vd_dcn_365e_publaynet | The layout analysis model trained on the PubLayNet dataset, the model can recognition 5 types of areas such as **text, title, table, picture and list** | [inference model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet.tar) / [trained model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet_pretrained.pdparams) |{0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"}| -| ppyolov2_r50vd_dcn_365e_tableBank_word | The layout analysis model trained on the TableBank Word dataset, the model can only detect tables | [inference model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_word.tar) | {0:"Table"}| -| ppyolov2_r50vd_dcn_365e_tableBank_latex | The layout analysis model trained on the TableBank Latex dataset, the model can only detect tables | [inference model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_latex.tar) | {0:"Table"}| +|model name| description | inference model size |download|dict path| +| --- |---------------------------------------------------------------------------------------------------------------------------------------------------------| --- | --- | --- | +| picodet_lcnet_x1_0_fgd_layout | The layout analysis English model trained on the PubLayNet dataset based on PicoDet LCNet_x1_0 and FGD . the model can recognition 5 types of areas such as **Text, Title, Table, Picture and List** | 9.7M | [inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout.pdparams) | [PubLayNet dict](../../ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt) | +| ppyolov2_r50vd_dcn_365e_publaynet | The layout analysis English model trained on the PubLayNet dataset based on PP-YOLOv2 | 221M | [inference_moel]](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet.tar) / [trained model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_publaynet_pretrained.pdparams) | sme as above | +| picodet_lcnet_x1_0_fgd_layout_cdla | The layout analysis Chinese model trained on the CDLA dataset, the model can recognition 10 types of areas such as **Table、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation** | 9.7M | [inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_cdla.pdparams) | [CDLA dict](../../ppocr/utils/dict/layout_dict/layout_cdla_dict.txt) | +| picodet_lcnet_x1_0_fgd_layout_table | The layout analysis model trained on the table dataset, the model can detect tables in Chinese and English documents | 9.7M | [inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_table_infer.tar) / [trained 
model](https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_table.pdparams) | [Table dict](../../ppocr/utils/dict/layout_dict/layout_table_dict.txt) | +| ppyolov2_r50vd_dcn_365e_tableBank_word | The layout analysis model trained on the TableBank Word dataset based on PP-YOLOv2, the model can detect tables in English documents | 221M | [inference model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_word.tar) | same as above | +| ppyolov2_r50vd_dcn_365e_tableBank_latex | The layout analysis model trained on the TableBank Latex dataset based on PP-YOLOv2, the model can detect tables in English documents | 221M | [inference model](https://paddle-model-ecology.bj.bcebos.com/model/layout-parser/ppyolov2_r50vd_dcn_365e_tableBank_latex.tar) | same as above | ## 2. OCR and Table Recognition @@ -35,24 +37,30 @@ If you need to use other OCR models, you can download the model in [PP-OCR model |model| description |inference model size|download| | --- |-----------------------------------------------------------------------------| --- | --- | -|en_ppocr_mobile_v2.0_table_structure| English table recognition model trained on PubTabNet dataset based on TableRec-RARE |18.6M|[inference model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_structure_train.tar) | -|en_ppstructure_mobile_v2.0_SLANet|English table recognition model trained on PubTabNet dataset based on SLANet|9M|[inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_train.tar) | -|ch_ppstructure_mobile_v2.0_SLANet|Chinese table recognition model trained on PubTabNet dataset based on SLANet|9.3M|[inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_train.tar) | +|en_ppocr_mobile_v2.0_table_structure| English table recognition model trained on PubTabNet dataset based on TableRec-RARE |6.8M|[inference model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.1/table/en_ppocr_mobile_v2.0_table_structure_train.tar) | +|en_ppstructure_mobile_v2.0_SLANet|English table recognition model trained on PubTabNet dataset based on SLANet|9.2M|[inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_train.tar) | +|ch_ppstructure_mobile_v2.0_SLANet|Chinese table recognition model based on SLANet|9.3M|[inference model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_train.tar) | -## 3. 
VQA - -|model| description |inference model size|download| -| --- |----------------------------------------------------------------| --- | --- | -|ser_LayoutXLM_xfun_zh| SER model trained on xfun Chinese dataset based on LayoutXLM |1.4G|[inference model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar) | -|re_LayoutXLM_xfun_zh| Re model trained on xfun Chinese dataset based on LayoutXLM |1.4G|[inference model coming soon]() / [trained model](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar) | -|ser_LayoutLMv2_xfun_zh| SER model trained on xfun Chinese dataset based on LayoutXLMv2 |778M|[inference model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLMv2_xfun_zh_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLMv2_xfun_zh.tar) | -|re_LayoutLMv2_xfun_zh| Re model trained on xfun Chinese dataset based on LayoutXLMv2 |765M|[inference model coming soon]() / [trained model](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutLMv2_xfun_zh.tar) | -|ser_LayoutLM_xfun_zh| SER model trained on xfun Chinese dataset based on LayoutLM |430M|[inference model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLM_xfun_zh_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLM_xfun_zh.tar) | - - -## 4. KIE - -|model|description|model size|download| -| --- | --- | --- | --- | -|SDMGR|Key Information Extraction Model|78M|[inference model coming soon]() / [trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/kie_vgg16.tar)| +## 3. KIE + +On XFUND_zh dataset, Accuracy and time cost of different models on V100 GPU are as follows. + +|Model|Backbone|Task|Config|Hmean|Time cost(ms)|Download link| +| --- | --- | --- | --- | --- | --- |--- | +|VI-LayoutXLM| VI-LayoutXLM-base | SER | [ser_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml)|**93.19%**| 15.49| [trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_pretrained.tar)| +|LayoutXLM| LayoutXLM-base | SER | [ser_layoutxlm_xfund_zh.yml](../../configs/kie/layoutlm_series/ser_layoutxlm_xfund_zh.yml)|90.38%| 19.49 |[trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar)| +|LayoutLM| LayoutLM-base | SER | [ser_layoutlm_xfund_zh.yml](../../configs/kie/layoutlm_series/ser_layoutlm_xfund_zh.yml)|77.31%|-|[trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLM_xfun_zh.tar)| +|LayoutLMv2| LayoutLMv2-base | SER | [ser_layoutlmv2_xfund_zh.yml](../../configs/kie/layoutlm_series/ser_layoutlmv2_xfund_zh.yml)|85.44%|31.46|[trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLMv2_xfun_zh.tar)| +|VI-LayoutXLM| VI-LayoutXLM-base | RE | [re_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh_udml.yml)|**83.92%**|15.49|[trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/re_vi_layoutxlm_xfund_pretrained.tar)| +|LayoutXLM| LayoutXLM-base | RE | [re_layoutxlm_xfund_zh.yml](../../configs/kie/layoutlm_series/re_layoutxlm_xfund_zh.yml)|74.83%|19.49|[trained model](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar)| +|LayoutLMv2| LayoutLMv2-base | RE | [re_layoutlmv2_xfund_zh.yml](../../configs/kie/layoutlm_series/re_layoutlmv2_xfund_zh.yml)|67.77%|31.46|[trained model](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutLMv2_xfun_zh.tar)| + +* Note: The above time 
cost information just considers inference time without preprocess or postprocess, test environment: `V100 GPU + CUDA 10.2 + CUDNN 8.1.1 + TRT 7.2.3.4` + + +On wildreceipt dataset, the algorithm result is as follows: + +|Model|Backbone|Config|Hmean|Download link| +| --- | --- | --- | --- | --- | +|SDMGR|VGG6|[configs/kie/sdmgr/kie_unet_sdmgr.yml](../../configs/kie/sdmgr/kie_unet_sdmgr.yml)|86.7%|[trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/kie_vgg16.tar)| diff --git a/ppstructure/docs/ppstructurev2_pipeline.png b/ppstructure/docs/ppstructurev2_pipeline.png new file mode 100644 index 0000000000000000000000000000000000000000..b53a290a6dbc396449374cc694dd01c304325739 Binary files /dev/null and b/ppstructure/docs/ppstructurev2_pipeline.png differ diff --git a/ppstructure/docs/quickstart.md b/ppstructure/docs/quickstart.md index f4645bdfe011a12370bedc7bd7a125b28ded41ff..f19ee2591aba955ff09b2404d3ca85c80b75d781 100644 --- a/ppstructure/docs/quickstart.md +++ b/ppstructure/docs/quickstart.md @@ -1,34 +1,57 @@ # PP-Structure 快速开始 -- [1. 安装依赖包](#1-安装依赖包) +- [1. 准备环境](#1-准备环境) - [2. 便捷使用](#2-便捷使用) - [2.1 命令行使用](#21-命令行使用) - [2.1.1 图像方向分类+版面分析+表格识别](#211-图像方向分类版面分析表格识别) - [2.1.2 版面分析+表格识别](#212-版面分析表格识别) - [2.1.3 版面分析](#213-版面分析) - [2.1.4 表格识别](#214-表格识别) - - [2.1.5 DocVQA](#215-docvqa) - - [2.2 代码使用](#22-代码使用) - - [2.2.1 图像方向分类版面分析表格识别](#221-图像方向分类版面分析表格识别) + - [2.1.5 关键信息抽取](#215-关键信息抽取) + - [2.1.6 版面恢复](#216-版面恢复) + - [2.2 Python脚本使用](#22-Python脚本使用) + - [2.2.1 图像方向分类+版面分析+表格识别](#221-图像方向分类版面分析表格识别) - [2.2.2 版面分析+表格识别](#222-版面分析表格识别) - [2.2.3 版面分析](#223-版面分析) - [2.2.4 表格识别](#224-表格识别) - - [2.2.5 DocVQA](#225-docvqa) + - [2.2.5 关键信息抽取](#225-关键信息抽取) + - [2.2.6 版面恢复](#226-版面恢复) - [2.3 返回结果说明](#23-返回结果说明) - [2.3.1 版面分析+表格识别](#231-版面分析表格识别) - - [2.3.2 DocVQA](#232-docvqa) + - [2.3.2 关键信息抽取](#232-关键信息抽取) - [2.4 参数说明](#24-参数说明) - +- [3. 小结](#3-小结) -## 1. 安装依赖包 +## 1. 
准备环境 +### 1.1 安装PaddlePaddle +> 如果您没有基础的Python运行环境,请参考[运行环境准备](../../doc/doc_ch/environment.md)。 + +- 您的机器安装的是CUDA9或CUDA10,请运行以下命令安装 + + ```bash + python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple + ``` + +- 您的机器是CPU,请运行以下命令安装 + + ```bash + python3 -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple + ``` + +更多的版本需求,请参照[飞桨官网安装文档](https://www.paddlepaddle.org.cn/install/quick)中的说明进行操作。 + +### 1.2 安装PaddleOCR whl包 ```bash -# 安装 paddleocr,推荐使用2.5+版本 -pip3 install "paddleocr>=2.5" -# 安装 DocVQA依赖包paddlenlp(如不需要DocVQA功能,可跳过) -pip install paddlenlp +# 安装 paddleocr,推荐使用2.6版本 +pip3 install "paddleocr>=2.6" + +# 安装 图像方向分类依赖包paddleclas(如不需要图像方向分类功能,可跳过) +pip3 install paddleclas +# 安装 关键信息抽取 依赖包(如不需要KIE功能,可跳过) +pip3 install -r kie/requirements.txt ``` @@ -40,37 +63,46 @@ pip install paddlenlp #### 2.1.1 图像方向分类+版面分析+表格识别 ```bash -paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure --image_orientation=true +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --image_orientation=true ``` #### 2.1.2 版面分析+表格识别 ```bash -paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure ``` #### 2.1.3 版面分析 ```bash -paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure --table=false --ocr=false +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --table=false --ocr=false ``` #### 2.1.4 表格识别 ```bash -paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/table.jpg --type=structure --layout=false +paddleocr --image_dir=ppstructure/docs/table/table.jpg --type=structure --layout=false ``` -#### 2.1.5 DocVQA -请参考:[文档视觉问答](../vqa/README.md)。 +#### 2.1.5 关键信息抽取 +关键信息抽取暂不支持通过whl包调用,详细使用教程请参考:[关键信息抽取教程](../kie/README_ch.md)。 + + + +#### 2.1.6 版面恢复 + +```bash +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true +``` -### 2.2 代码使用 + +### 2.2 Python脚本使用 -#### 2.2.1 图像方向分类版面分析表格识别 +#### 2.2.1 图像方向分类+版面分析+表格识别 ```python import os @@ -80,7 +112,7 @@ from paddleocr import PPStructure,draw_structure_result,save_structure_res table_engine = PPStructure(show_log=True, image_orientation=True) save_folder = './output' -img_path = 'PaddleOCR/ppstructure/docs/table/1.png' +img_path = 'ppstructure/docs/table/1.png' img = cv2.imread(img_path) result = table_engine(img) save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0]) @@ -91,7 +123,7 @@ for line in result: from PIL import Image -font_path = 'PaddleOCR/doc/fonts/simfang.ttf' # PaddleOCR下提供字体包 +font_path = 'doc/fonts/simfang.ttf' # PaddleOCR下提供字体包 image = Image.open(img_path).convert('RGB') im_show = draw_structure_result(image, result,font_path=font_path) im_show = Image.fromarray(im_show) @@ -109,7 +141,7 @@ from paddleocr import PPStructure,draw_structure_result,save_structure_res table_engine = PPStructure(show_log=True) save_folder = './output' -img_path = 'PaddleOCR/ppstructure/docs/table/1.png' +img_path = 'ppstructure/docs/table/1.png' img = cv2.imread(img_path) result = table_engine(img) save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0]) @@ -120,7 +152,7 @@ for line in result: from PIL import Image -font_path = 'PaddleOCR/doc/fonts/simfang.ttf' # PaddleOCR下提供字体包 +font_path = 'doc/fonts/simfang.ttf' # PaddleOCR下提供字体包 image = Image.open(img_path).convert('RGB') im_show = draw_structure_result(image, result,font_path=font_path) im_show = Image.fromarray(im_show) @@ -138,7 +170,7 @@ 
from paddleocr import PPStructure,save_structure_res table_engine = PPStructure(table=False, ocr=False, show_log=True) save_folder = './output' -img_path = 'PaddleOCR/ppstructure/docs/table/1.png' +img_path = 'ppstructure/docs/table/1.png' img = cv2.imread(img_path) result = table_engine(img) save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0]) @@ -149,6 +181,7 @@ for line in result: ``` + #### 2.2.4 表格识别 ```python @@ -159,7 +192,7 @@ from paddleocr import PPStructure,save_structure_res table_engine = PPStructure(layout=False, show_log=True) save_folder = './output' -img_path = 'PaddleOCR/ppstructure/docs/table/table.jpg' +img_path = 'ppstructure/docs/table/table.jpg' img = cv2.imread(img_path) result = table_engine(img) save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0]) @@ -170,13 +203,40 @@ for line in result: ``` -#### 2.2.5 DocVQA +#### 2.2.5 关键信息抽取 + +关键信息抽取暂不支持通过whl包调用,详细使用教程请参考:[关键信息抽取教程](../kie/README_ch.md)。 -请参考:[文档视觉问答](../vqa/README.md)。 + + +#### 2.2.6 版面恢复 + +```python +import os +import cv2 +from paddleocr import PPStructure,save_structure_res +from paddelocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx + +table_engine = PPStructure(layout=False, show_log=True) + +save_folder = './output' +img_path = 'ppstructure/docs/table/1.png' +img = cv2.imread(img_path) +result = table_engine(img) +save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0]) + +for line in result: + line.pop('img') + print(line) + +h, w, _ = img.shape +res = sorted_layout_boxes(res, w) +convert_info_docx(img, result, save_folder, os.path.basename(img_path).split('.')[0]) +``` ### 2.3 返回结果说明 -PP-Structure的返回结果为一个dict组成的list,示例如下 +PP-Structure的返回结果为一个dict组成的list,示例如下: #### 2.3.1 版面分析+表格识别 @@ -189,7 +249,7 @@ PP-Structure的返回结果为一个dict组成的list,示例如下 } ] ``` -dict 里各个字段说明如下 +dict 里各个字段说明如下: | 字段 | 说明| | --- |---| @@ -208,9 +268,9 @@ dict 里各个字段说明如下 ``` -#### 2.3.2 DocVQA +#### 2.3.2 关键信息抽取 -请参考:[文档视觉问答](../vqa/README.md)。 +请参考:[关键信息抽取教程](../kie/README_ch.md)。 ### 2.4 参数说明 @@ -226,15 +286,21 @@ dict 里各个字段说明如下 | layout_dict_path | 版面分析模型字典| ../ppocr/utils/dict/layout_publaynet_dict.txt | | layout_score_threshold | 版面分析模型检测框阈值| 0.5| | layout_nms_threshold | 版面分析模型nms阈值| 0.5| -| vqa_algorithm | vqa模型算法| LayoutXLM| +| kie_algorithm | kie模型算法| LayoutXLM| | ser_model_dir | ser模型 inference 模型地址| None| | ser_dict_path | ser模型字典| ../train_data/XFUND/class_list_xfun.txt| -| mode | structure or vqa | structure | +| mode | structure or kie | structure | | image_orientation | 前向中是否执行图像方向分类 | False | | layout | 前向中是否执行版面分析 | True | | table | 前向中是否执行表格识别 | True | | ocr | 对于版面分析中的非表格区域,是否执行ocr。当layout为False时会被自动设置为False| True | | recovery | 前向中是否执行版面恢复| False | +| save_pdf | 版面恢复导出docx文件的同时,是否导出pdf文件 | False | | structure_version | 模型版本,可选 PP-structure和PP-structurev2 | PP-structure | 大部分参数和PaddleOCR whl包保持一致,见 [whl包文档](../../doc/doc_ch/whl.md) + + +## 3. 小结 + +通过本节内容,相信您已经熟练掌握通过PaddleOCR whl包调用PP-Structure相关功能的使用方法,您可以参考[文档教程](../../README_ch.md#文档教程),获取包括模型训练、推理部署等更详细的使用教程。 \ No newline at end of file diff --git a/ppstructure/docs/quickstart_en.md b/ppstructure/docs/quickstart_en.md index b4dee3f02d3c2762ef71720995f4da697ae43622..f0fbc86394dab00f1715f8f8fda30f3116c4fd07 100644 --- a/ppstructure/docs/quickstart_en.md +++ b/ppstructure/docs/quickstart_en.md @@ -1,38 +1,64 @@ # PP-Structure Quick Start -- [1. Install package](#1-install-package) -- [2. Use](#2-use) +- [1. 
Environment Preparation](#1-environment-preparation) +- [2. Quick Use](#2-quick-use) - [2.1 Use by command line](#21-use-by-command-line) - [2.1.1 image orientation + layout analysis + table recognition](#211-image-orientation--layout-analysis--table-recognition) - [2.1.2 layout analysis + table recognition](#212-layout-analysis--table-recognition) - [2.1.3 layout analysis](#213-layout-analysis) - [2.1.4 table recognition](#214-table-recognition) - - [2.1.5 DocVQA](#215-docvqa) - - [2.2 Use by code](#22-use-by-code) + - [2.1.5 Key Information Extraction](#215-Key-Information-Extraction) + - [2.1.6 layout recovery](#216-layout-recovery) + - [2.2 Use by python script](#22-use-by-python-script) - [2.2.1 image orientation + layout analysis + table recognition](#221-image-orientation--layout-analysis--table-recognition) - [2.2.2 layout analysis + table recognition](#222-layout-analysis--table-recognition) - [2.2.3 layout analysis](#223-layout-analysis) - [2.2.4 table recognition](#224-table-recognition) - - [2.2.5 DocVQA](#225-docvqa) + - [2.2.5 Key Information Extraction](#225-Key-Information-Extraction) + - [2.2.6 layout recovery](#226-layout-recovery) - [2.3 Result description](#23-result-description) - [2.3.1 layout analysis + table recognition](#231-layout-analysis--table-recognition) - - [2.3.2 DocVQA](#232-docvqa) + - [2.3.2 Key Information Extraction](#232-Key-Information-Extraction) - [2.4 Parameter Description](#24-parameter-description) +- [3. Summary](#3-summary) -## 1. Install package +## 1. Environment Preparation +### 1.1 Install PaddlePaddle + +> If you do not have a Python environment, please refer to [Environment Preparation](./environment_en.md). + +- If you have CUDA 9 or CUDA 10 installed on your machine, please run the following command to install + + ```bash + python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple + ``` + +- If you have no available GPU on your machine, please run the following command to install the CPU version + + ```bash + python3 -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple + ``` + +For more software version requirements, please refer to the instructions in [Installation Document](https://www.paddlepaddle.org.cn/install/quick) for operation. + +### 1.2 Install PaddleOCR Whl Package ```bash -# Install paddleocr, version 2.5+ is recommended -pip3 install "paddleocr>=2.5" -# Install the DocVQA dependency package paddlenlp (if you do not use the DocVQA, you can skip it) -pip install paddlenlp +# Install paddleocr, version 2.6 is recommended +pip3 install "paddleocr>=2.6" +# Install the image direction classification dependency package paddleclas (if you do not use the image direction classification, you can skip it) +pip3 install paddleclas + +# Install the KIE dependency packages (if you do not use the KIE, you can skip it) +pip3 install -r kie/requirements.txt ``` -## 2. Use + +## 2. 
Quick Use ### 2.1 Use by command line @@ -40,34 +66,40 @@ pip install paddlenlp #### 2.1.1 image orientation + layout analysis + table recognition ```bash -paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure --image_orientation=true +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --image_orientation=true ``` #### 2.1.2 layout analysis + table recognition ```bash -paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure ``` #### 2.1.3 layout analysis ```bash -paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/1.png --type=structure --table=false --ocr=false +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --table=false --ocr=false ``` #### 2.1.4 table recognition ```bash -paddleocr --image_dir=PaddleOCR/ppstructure/docs/table/table.jpg --type=structure --layout=false +paddleocr --image_dir=ppstructure/docs/table/table.jpg --type=structure --layout=false ``` -#### 2.1.5 DocVQA +#### 2.1.5 Key Information Extraction + +Key information extraction does not currently support use by the whl package. For detailed usage tutorials, please refer to: [Key Information Extraction](../kie/README.md). -Please refer to: [Documentation Visual Q&A](../vqa/README.md) . + +#### 2.1.6 layout recovery +```bash +paddleocr --image_dir=ppstructure/docs/table/1.png --type=structure --recovery=true +``` -### 2.2 Use by code +### 2.2 Use by python script #### 2.2.1 image orientation + layout analysis + table recognition @@ -80,7 +112,7 @@ from paddleocr import PPStructure,draw_structure_result,save_structure_res table_engine = PPStructure(show_log=True, image_orientation=True) save_folder = './output' -img_path = 'PaddleOCR/ppstructure/docs/table/1.png' +img_path = 'ppstructure/docs/table/1.png' img = cv2.imread(img_path) result = table_engine(img) save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0]) @@ -91,7 +123,7 @@ for line in result: from PIL import Image -font_path = 'PaddleOCR/doc/fonts/simfang.ttf' # PaddleOCR下提供字体包 +font_path = 'doc/fonts/simfang.ttf' # PaddleOCR下提供字体包 image = Image.open(img_path).convert('RGB') im_show = draw_structure_result(image, result,font_path=font_path) im_show = Image.fromarray(im_show) @@ -109,7 +141,7 @@ from paddleocr import PPStructure,draw_structure_result,save_structure_res table_engine = PPStructure(show_log=True) save_folder = './output' -img_path = 'PaddleOCR/ppstructure/docs/table/1.png' +img_path = 'ppstructure/docs/table/1.png' img = cv2.imread(img_path) result = table_engine(img) save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0]) @@ -120,7 +152,7 @@ for line in result: from PIL import Image -font_path = 'PaddleOCR/doc/fonts/simfang.ttf' # PaddleOCR下提供字体包 +font_path = 'doc/fonts/simfang.ttf' # font provieded in PaddleOCR image = Image.open(img_path).convert('RGB') im_show = draw_structure_result(image, result,font_path=font_path) im_show = Image.fromarray(im_show) @@ -138,7 +170,7 @@ from paddleocr import PPStructure,save_structure_res table_engine = PPStructure(table=False, ocr=False, show_log=True) save_folder = './output' -img_path = 'PaddleOCR/ppstructure/docs/table/1.png' +img_path = 'ppstructure/docs/table/1.png' img = cv2.imread(img_path) result = table_engine(img) save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0]) @@ -159,7 +191,7 @@ from paddleocr import PPStructure,save_structure_res table_engine = 
PPStructure(layout=False, show_log=True) save_folder = './output' -img_path = 'PaddleOCR/ppstructure/docs/table/table.jpg' +img_path = 'ppstructure/docs/table/table.jpg' img = cv2.imread(img_path) result = table_engine(img) save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0]) @@ -170,9 +202,35 @@ for line in result: ``` -#### 2.2.5 DocVQA +#### 2.2.5 Key Information Extraction -Please refer to: [Documentation Visual Q&A](../vqa/README.md) . +Key information extraction does not currently support use by the whl package. For detailed usage tutorials, please refer to: [Key Information Extraction](../kie/README.md). + + +#### 2.2.6 layout recovery + +```python +import os +import cv2 +from paddleocr import PPStructure,save_structure_res +from paddelocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx + +table_engine = PPStructure(layout=False, show_log=True) + +save_folder = './output' +img_path = 'ppstructure/docs/table/1.png' +img = cv2.imread(img_path) +result = table_engine(img) +save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0]) + +for line in result: + line.pop('img') + print(line) + +h, w, _ = img.shape +res = sorted_layout_boxes(res, w) +convert_info_docx(img, result, save_folder, os.path.basename(img_path).split('.')[0]) +``` ### 2.3 Result description @@ -194,8 +252,8 @@ Each field in dict is described as follows: | field | description | | --- |---| -|type| Type of image area. | -|bbox| The coordinates of the image area in the original image, respectively [upper left corner x, upper left corner y, lower right corner x, lower right corner y]. | +|type| Type of image area. | +|bbox| The coordinates of the image area in the original image, respectively [upper left corner x, upper left corner y, lower right corner x, lower right corner y]. | |res| OCR or table recognition result of the image area.
table: a dict with field descriptions as follows:
        `html`: HTML string of the table.
        In the code usage mode, set return_ocr_result_in_table=True when calling, so that the detection and recognition results of each text inside the table area are also returned, corresponding to the following fields:
        `boxes`: text detection boxes.
        `rec_res`: text recognition results.
OCR: A tuple containing the detection boxes and recognition results of each single text. | After the recognition is completed, each image will have a directory with the same name under the directory specified by the `output` field. Each table in the image will be stored as an excel, and the picture area will be cropped and saved. The filename of excel and picture is their coordinates in the image. @@ -208,9 +266,9 @@ After the recognition is completed, each image will have a directory with the sa ``` -#### 2.3.2 DocVQA +#### 2.3.2 Key Information Extraction -Please refer to: [Documentation Visual Q&A](../vqa/README.md) . +Please refer to: [Key Information Extraction](../kie/README.md) . ### 2.4 Parameter Description @@ -226,15 +284,21 @@ Please refer to: [Documentation Visual Q&A](../vqa/README.md) . | layout_dict_path | The dictionary path of layout analysis model| ../ppocr/utils/dict/layout_publaynet_dict.txt | | layout_score_threshold | The box threshold path of layout analysis model| 0.5| | layout_nms_threshold | The nms threshold path of layout analysis model| 0.5| -| vqa_algorithm | vqa model algorithm| LayoutXLM| +| kie_algorithm | kie model algorithm| LayoutXLM| | ser_model_dir | Ser model inference model path| None| | ser_dict_path | The dictionary path of Ser model| ../train_data/XFUND/class_list_xfun.txt| -| mode | structure or vqa | structure | +| mode | structure or kie | structure | | image_orientation | Whether to perform image orientation classification in forward | False | | layout | Whether to perform layout analysis in forward | True | | table | Whether to perform table recognition in forward | True | | ocr | Whether to perform ocr for non-table areas in layout analysis. When layout is False, it will be automatically set to False| True | | recovery | Whether to perform layout recovery in forward| False | +| save_pdf | Whether to convert docx to pdf when recovery| False | | structure_version | Structure version, optional PP-structure and PP-structurev2 | PP-structure | Most of the parameters are consistent with the PaddleOCR whl package, see [whl package documentation](../../doc/doc_en/whl.md) + + +## 3. Summary + +Through the content in this section, you can master the use of PP-Structure related functions through PaddleOCR whl package. Please refer to [documentation tutorial](../../README.md) for more detailed usage tutorials including model training, inference and deployment, etc. 
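For convenience, the layout-recovery flow from section 2.2.6 is also collected below as a single self-contained sketch. It assumes the recovery helpers ship inside the installed `paddleocr` whl package under `paddleocr.ppstructure.recovery.recovery_to_doc`, that `sorted_layout_boxes` takes the PP-Structure result list plus the image width and returns the regions in reading order, and that this sorted list is what `convert_info_docx` expects:

```python
import os
import cv2
from paddleocr import PPStructure, save_structure_res
# assumption: the recovery helpers are importable from the installed paddleocr whl package
from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx

# layout analysis (enabled by default) provides the regions that will be re-ordered
table_engine = PPStructure(show_log=True)

save_folder = './output'
img_path = 'ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

# sort the detected regions into reading order, then export them to a docx file
h, w, _ = img.shape
sorted_result = sorted_layout_boxes(result, w)
convert_info_docx(img, sorted_result, save_folder, os.path.basename(img_path).split('.')[0])
```

The docx output is written under `save_folder`, next to the per-region results produced by `save_structure_res`.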
\ No newline at end of file diff --git a/ppstructure/docs/recovery/recovery.jpg b/ppstructure/docs/recovery/recovery.jpg new file mode 100644 index 0000000000000000000000000000000000000000..a3817ab70eff5b380072701b70ab227ae6c8184c Binary files /dev/null and b/ppstructure/docs/recovery/recovery.jpg differ diff --git a/ppstructure/docs/table/layout.jpg b/ppstructure/docs/table/layout.jpg index db7246b314556d73cd49d049b9b480887b6ef994..c5c39dac7267d8c76121ee686a5931a551903d6f 100644 Binary files a/ppstructure/docs/table/layout.jpg and b/ppstructure/docs/table/layout.jpg differ diff --git a/ppstructure/docs/table/paper-image.jpg b/ppstructure/docs/table/paper-image.jpg index db7246b314556d73cd49d049b9b480887b6ef994..c5c39dac7267d8c76121ee686a5931a551903d6f 100644 Binary files a/ppstructure/docs/table/paper-image.jpg and b/ppstructure/docs/table/paper-image.jpg differ diff --git a/ppstructure/docs/table/recovery.jpg b/ppstructure/docs/table/recovery.jpg deleted file mode 100644 index bee2e2fb3499ec4b348e2b2f1475a87c9c562190..0000000000000000000000000000000000000000 Binary files a/ppstructure/docs/table/recovery.jpg and /dev/null differ diff --git a/ppstructure/kie/README.md b/ppstructure/kie/README.md new file mode 100644 index 0000000000000000000000000000000000000000..adb19a3ca729821ab16bf8f0f8ec14c2376de1de --- /dev/null +++ b/ppstructure/kie/README.md @@ -0,0 +1,260 @@ +English | [简体中文](README_ch.md) + +- [1. Introduction](#1-introduction) + +- [2. Accuracy and performance](#2-Accuracy-and-performance) +- [3. Visualization](#3-Visualization) + - [3.1 SER](#31-ser) + - [3.2 RE](#32-re) +- [4. Usage](#4-usage) + - [4.1 Prepare for the environment](#41-Prepare-for-the-environment) + - [4.2 Quick start](#42-Quick-start) + - [4.3 More](#43-More) +- [5. Reference](#5-Reference) +- [6. License](#6-License) + + +## 1. Introduction + +Key information extraction (KIE) refers to extracting key information from text or images. As downstream task of OCR, the key information extraction task of document image has many practical application scenarios, such as form recognition, ticket information extraction, ID card information extraction, etc. + +PP-Structure conducts research based on the LayoutXLM multi-modal, and proposes the VI-LayoutXLM, which gets rid of visual features when finetuning the downstream tasks. An textline sorting method is also utilized to fit in reading order. What's more, UDML knowledge distillation is used for higher accuracy. Finally, the accuracy and inference speed of VI-LayoutXLM surpass those of LayoutXLM. + +The main features of the key information extraction module in PP-Structure are as follows. + + +- Integrate multi-modal methods such as [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf), VI-LayoutXLM, and PP-OCR inference engine. +- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. Based on the SER task, the text recognition and classification in the image can be completed; based on the RE task, the relationship extraction of the text content in the image can be completed, such as judging the problem pair (pair). +- Supports custom training for SER tasks and RE tasks. +- Supports end-to-end system prediction and evaluation of OCR+SER. +- Supports end-to-end system prediction of OCR+SER+RE. +- Support SER model export and inference using PaddleInference. + + +## 2. 
Accuracy and performance + +We evaluate the methods on the Chinese dataset of [XFUND](https://github.com/doc-analysis/XFUND), and the performance is as follows + +|Model | Backbone | Task | Config file | Hmean | Inference time (ms) | Download link| +| --- | --- | --- | --- | --- | --- | --- | +|VI-LayoutXLM| VI-LayoutXLM-base | SER | [ser_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml)|**93.19%**| 15.49|[trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_pretrained.tar)| +|LayoutXLM| LayoutXLM-base | SER | [ser_layoutxlm_xfund_zh.yml](../../configs/kie/layoutlm_series/ser_layoutxlm_xfund_zh.yml)|90.38%| 19.49 | [trained model](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar)| +|VI-LayoutXLM| VI-LayoutXLM-base | RE | [re_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh_udml.yml)|**83.92%**| 15.49|[trained model](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/re_vi_layoutxlm_xfund_pretrained.tar)| +|LayoutXLM| LayoutXLM-base | RE | [re_layoutxlm_xfund_zh.yml](../../configs/kie/layoutlm_series/re_layoutxlm_xfund_zh.yml)|74.83%| 19.49|[trained model](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar)| + + +* Note:Inference environment:V100 GPU + cuda10.2 + cudnn8.1.1 + TensorRT 7.2.3.4,tested using fp16. + +For more KIE models in PaddleOCR, please refer to [KIE model zoo](../../doc/doc_en/algorithm_overview_en.md). + + +## 3. Visualization + +There are two main solutions to the key information extraction task based on VI-LayoutXLM series model. + +(1) Text detection + text recognition + semantic entity recognition (SER) + +(2) Text detection + text recognition + semantic entity recognition (SER) + relationship extraction (RE) + + +The following images are demo results of the SER and RE models. For more detailed introduction to the above solutions, please refer to [KIE Guide](./how_to_do_kie.md). + +### 3.1 SER + +Demo results for SER task are as follows. + +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+ + + +**Note:** test pictures are from [xfund dataset](https://github.com/doc-analysis/XFUND), [invoice dataset](https://aistudio.baidu.com/aistudio/datasetdetail/165561) and a composite ID card dataset. + + +Boxes of different colors in the image represent different categories. + +The invoice and application form images have three categories: `request`, `answer` and `header`. The `question` and 'answer' can be used to extract the relationship. + +For the ID card image, the mdoel can be directly identify the key information such as `name`, `gender`, `nationality`, so that the subsequent relationship extraction process is not required, and the key information extraction task can be completed using only on model. + +### 3.2 RE + +Demo results for RE task are as follows. + + +
+ +
+ +
+ +
+ +
+ +
+ +Red boxes are questions, blue boxes are answers. The green lines means the two conected objects are a pair. + + +## 4. Usage + +### 4.1 Prepare for the environment + + +Use the following command to install KIE dependencies. + + +```bash +git clone https://github.com/PaddlePaddle/PaddleOCR.git +cd PaddleOCR +pip install -r requirements.txt +pip install -r ppstructure/kie/requirements.txt +# 安装PaddleOCR引擎用于预测 +pip install paddleocr -U +``` + +The visualized results of SER are saved in the `./output` folder by default. Examples of results are as follows. + + +
+ +
+ + +### 4.2 Quick start + +Here we use XFUND dataset to quickly experience the SER model and RE model. + + +#### 4.2.1 Prepare for the dataset + +```bash +mkdir train_data +cd train_data +# download and uncompress the dataset +wget https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar && tar -xf XFUND.tar +cd .. +``` + +#### 4.2.2 Predict images using the trained model + +Use the following command to download the models. + +```bash +mkdir pretrained_model +cd pretrained_model +# download and uncompress the SER trained model +wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_pretrained.tar && tar -xf ser_vi_layoutxlm_xfund_pretrained.tar + +# download and uncompress the RE trained model +wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/re_vi_layoutxlm_xfund_pretrained.tar && tar -xf re_vi_layoutxlm_xfund_pretrained.tar +``` + + +If you want to use OCR engine to obtain end-to-end prediction results, you can use the following command to predict. + +```bash +# just predict using SER trained model +python3 tools/infer_kie_token_ser.py \ + -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/ser_vi_layoutxlm_xfund_pretrained/best_accuracy \ + Global.infer_img=./ppstructure/docs/kie/input/zh_val_42.jpg + +# predict using SER and RE trained model at the same time +python3 ./tools/infer_kie_token_ser_re.py \ + -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/re_vi_layoutxlm_xfund_pretrained/best_accuracy \ + Global.infer_img=./train_data/XFUND/zh_val/image/zh_val_42.jpg \ + -c_ser configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o_ser Architecture.Backbone.checkpoints=./pretrain_models/ser_vi_layoutxlm_xfund_pretrained/best_accuracy +``` + +The visual result images and the predicted text file will be saved in the `Global.save_res_path` directory. + + +If you want to load the text detection and recognition results collected before, you can use the following command to predict. + +```bash +# just predict using SER trained model +python3 tools/infer_kie_token_ser.py \ + -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/ser_vi_layoutxlm_xfund_pretrained/best_accuracy \ + Global.infer_img=./train_data/XFUND/zh_val/val.json \ + Global.infer_mode=False + +# predict using SER and RE trained model at the same time +python3 ./tools/infer_kie_token_ser_re.py \ + -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/re_vi_layoutxlm_xfund_pretrained/best_accuracy \ + Global.infer_img=./train_data/XFUND/zh_val/val.json \ + Global.infer_mode=False \ + -c_ser configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o_ser Architecture.Backbone.checkpoints=./pretrain_models/ser_vi_layoutxlm_xfund_pretrained/best_accuracy +``` + +#### 4.2.3 Inference using PaddleInference + +At present, only SER model supports inference using PaddleInference. + +Firstly, download the inference SER inference model. + + +```bash +mkdir inference +cd inference +wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_infer.tar && tar -xf ser_vi_layoutxlm_xfund_infer.tar +``` + +Use the following command for inference. 
+ + +```bash +cd ppstructure +python3 kie/predict_kie_token_ser.py \ + --kie_algorithm=LayoutXLM \ + --ser_model_dir=../inference/ser_vi_layoutxlm_xfund_infer \ + --image_dir=./docs/kie/input/zh_val_42.jpg \ + --ser_dict_path=../train_data/XFUND/class_list_xfun.txt \ + --vis_font_path=../doc/fonts/simfang.ttf \ + --ocr_order_method="tb-yx" +``` + +The visual results and text file will be saved in directory `output`. + + +### 4.3 More + +For training, evaluation and inference tutorial for KIE models, please refer to [KIE doc](../../doc/doc_en/kie_en.md). + +For training, evaluation and inference tutorial for text detection models, please refer to [text detection doc](../../doc/doc_en/detection_en.md). + +For training, evaluation and inference tutorial for text recognition models, please refer to [text recognition doc](../../doc/doc_en/recognition.md). + +If you want to finish the KIE tasks in your scene, and don't know what to prepare, please refer to [End cdoc](../../doc/doc_en/recognition.md). + +To complete the key information extraction task in your own scenario from data preparation to model selection, please refer to: [Guide to End-to-end KIE](./how_to_do_kie_en.md)。 + + +## 5. Reference + +- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf +- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm +- XFUND dataset, https://github.com/doc-analysis/XFUND + +## 6. License + +The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) diff --git a/ppstructure/kie/README_ch.md b/ppstructure/kie/README_ch.md new file mode 100644 index 0000000000000000000000000000000000000000..56c99ab73abe2b33ccfa18d4181312cd5f4d3622 --- /dev/null +++ b/ppstructure/kie/README_ch.md @@ -0,0 +1,241 @@ +[English](README.md) | 简体中文 + +# 关键信息抽取 + +- [1. 简介](#1-简介) +- [2. 精度与性能](#2-精度与性能) +- [3. 效果演示](#3-效果演示) + - [3.1 SER](#31-ser) + - [3.2 RE](#32-re) +- [4. 使用](#4-使用) + - [4.1 准备环境](#41-准备环境) + - [4.2 快速开始](#42-快速开始) + - [4.3 更多](#43-更多) +- [5. 参考链接](#5-参考链接) +- [6. License](#6-License) + + +## 1. 简介 + +关键信息抽取 (Key Information Extraction, KIE)指的是是从文本或者图像中,抽取出关键的信息。针对文档图像的关键信息抽取任务作为OCR的下游任务,存在非常多的实际应用场景,如表单识别、车票信息抽取、身份证信息抽取等。 + +PP-Structure 基于 LayoutXLM 文档多模态系列方法进行研究与优化,设计了视觉特征无关的多模态模型结构VI-LayoutXLM,同时引入符合阅读顺序的文本行排序方法以及UDML联合互学习蒸馏方法,最终在精度与速度均超越LayoutXLM。 + +PP-Structure中关键信息抽取模块的主要特性如下: + +- 集成[LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf)、VI-LayoutXLM等多模态模型以及PP-OCR预测引擎。 +- 支持基于多模态方法的语义实体识别 (Semantic Entity Recognition, SER) 以及关系抽取 (Relation Extraction, RE) 任务。基于 SER 任务,可以完成对图像中的文本识别与分类;基于 RE 任务,可以完成对图象中的文本内容的关系提取,如判断问题对(pair)。 +- 支持SER任务和RE任务的自定义训练。 +- 支持OCR+SER的端到端系统预测与评估。 +- 支持OCR+SER+RE的端到端系统预测。 +- 支持SER模型的动转静导出与基于PaddleInfernece的模型推理。 + + +## 2. 
精度与性能 + + +我们在 [XFUND](https://github.com/doc-analysis/XFUND) 的中文数据集上对算法进行了评估,SER与RE上的任务性能如下 + +|模型|骨干网络|任务|配置文件|hmean|预测耗时(ms)|下载链接| +| --- | --- | --- | --- | --- | --- | --- | +|VI-LayoutXLM| VI-LayoutXLM-base | SER | [ser_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh_udml.yml)|**93.19%**| 15.49|[训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_pretrained.tar)| +|LayoutXLM| LayoutXLM-base | SER | [ser_layoutxlm_xfund_zh.yml](../../configs/kie/layoutlm_series/ser_layoutxlm_xfund_zh.yml)|90.38%| 19.49 | [训练模型](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar)| +|VI-LayoutXLM| VI-LayoutXLM-base | RE | [re_vi_layoutxlm_xfund_zh_udml.yml](../../configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh_udml.yml)|**83.92%**| 15.49|[训练模型](https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/re_vi_layoutxlm_xfund_pretrained.tar)| +|LayoutXLM| LayoutXLM-base | RE | [re_layoutxlm_xfund_zh.yml](../../configs/kie/layoutlm_series/re_layoutxlm_xfund_zh.yml)|74.83%| 19.49|[训练模型](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar)| + + +* 注:预测耗时测试条件:V100 GPU + cuda10.2 + cudnn8.1.1 + TensorRT 7.2.3.4,使用FP16进行测试。 + +更多关于PaddleOCR中关键信息抽取模型的介绍,请参考[关键信息抽取模型库](../../doc/doc_ch/algorithm_overview.md)。 + + +## 3. 效果演示 + +基于多模态模型的关键信息抽取任务有2种主要的解决方案。 + +(1)文本检测 + 文本识别 + 语义实体识别(SER) +(2)文本检测 + 文本识别 + 语义实体识别(SER) + 关系抽取(RE) + +下面给出SER与RE任务的示例效果,关于上述解决方案的详细介绍,请参考[关键信息抽取全流程指南](./how_to_do_kie.md)。 + +### 3.1 SER + +对于SER任务,效果如下所示。 + +
+ +
+ +
+ +
+ +
+ +
+ +
+ +
+ +**注意:** 测试图片来源于[XFUND数据集](https://github.com/doc-analysis/XFUND)、[发票数据集](https://aistudio.baidu.com/aistudio/datasetdetail/165561)以及合成的身份证数据集。 + + +图中不同颜色的框表示不同的类别。 + +图中的发票以及申请表图像,有`QUESTION`, `ANSWER`, `HEADER` 3种类别,识别的`QUESTION`, `ANSWER`可以用于后续的问题与答案的关系抽取。 + +图中的身份证图像,则直接识别出其中的`姓名`、`性别`、`民族`等关键信息,这样就无需后续的关系抽取过程,一个模型即可完成关键信息抽取。 + + +### 3.2 RE + +对于RE任务,效果如下所示。 + +
+ +
+ +
+ +
+ +
+ +
+ + +红色框是问题,蓝色框是答案。绿色线条表示连接的两端为一个key-value的pair。 + +## 4. 使用 + +### 4.1 准备环境 + +使用下面的命令安装运行SER与RE关键信息抽取的依赖。 + +```bash +git clone https://github.com/PaddlePaddle/PaddleOCR.git +cd PaddleOCR +pip install -r requirements.txt +pip install -r ppstructure/kie/requirements.txt +# 安装PaddleOCR引擎用于预测 +pip install paddleocr -U +``` + +### 4.2 快速开始 + +下面XFUND数据集,快速体验SER模型与RE模型。 + +#### 4.2.1 准备数据 + +```bash +mkdir train_data +cd train_data +# 下载与解压数据 +wget https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar && tar -xf XFUND.tar +cd .. +``` + +#### 4.2.2 基于动态图的预测 + +首先下载模型。 + +```bash +mkdir pretrained_model +cd pretrained_model +# 下载并解压SER预训练模型 +wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_pretrained.tar && tar -xf ser_vi_layoutxlm_xfund_pretrained.tar + +# 下载并解压RE预训练模型 +wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/re_vi_layoutxlm_xfund_pretrained.tar && tar -xf re_vi_layoutxlm_xfund_pretrained.tar +``` + +如果希望使用OCR引擎,获取端到端的预测结果,可以使用下面的命令进行预测。 + +```bash +# 仅预测SER模型 +python3 tools/infer_kie_token_ser.py \ + -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/ser_vi_layoutxlm_xfund_pretrained/best_accuracy \ + Global.infer_img=./ppstructure/docs/kie/input/zh_val_42.jpg + +# SER + RE模型串联 +python3 ./tools/infer_kie_token_ser_re.py \ + -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/re_vi_layoutxlm_xfund_pretrained/best_accuracy \ + Global.infer_img=./train_data/XFUND/zh_val/image/zh_val_42.jpg \ + -c_ser configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o_ser Architecture.Backbone.checkpoints=./pretrain_models/ser_vi_layoutxlm_xfund_pretrained/best_accuracy +``` + +`Global.save_res_path`目录中会保存可视化的结果图像以及预测的文本文件。 + + +如果希望加载标注好的文本检测与识别结果,仅预测可以使用下面的命令进行预测。 + +```bash +# 仅预测SER模型 +python3 tools/infer_kie_token_ser.py \ + -c configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/ser_vi_layoutxlm_xfund_pretrained/best_accuracy \ + Global.infer_img=./train_data/XFUND/zh_val/val.json \ + Global.infer_mode=False + +# SER + RE模型串联 +python3 ./tools/infer_kie_token_ser_re.py \ + -c configs/kie/vi_layoutxlm/re_vi_layoutxlm_xfund_zh.yml \ + -o Architecture.Backbone.checkpoints=./pretrain_models/re_vi_layoutxlm_xfund_pretrained/best_accuracy \ + Global.infer_img=./train_data/XFUND/zh_val/val.json \ + Global.infer_mode=False \ + -c_ser configs/kie/vi_layoutxlm/ser_vi_layoutxlm_xfund_zh.yml \ + -o_ser Architecture.Backbone.checkpoints=./pretrain_models/ser_vi_layoutxlm_xfund_pretrained/best_accuracy +``` + +#### 4.2.3 基于PaddleInference的预测 + +目前仅SER模型支持PaddleInference推理。 + +首先下载SER的推理模型。 + + +```bash +mkdir inference +cd inference +wget https://paddleocr.bj.bcebos.com/ppstructure/models/vi_layoutxlm/ser_vi_layoutxlm_xfund_infer.tar && tar -xf ser_vi_layoutxlm_xfund_infer.tar +``` + +执行下面的命令进行预测。 + +```bash +cd ppstructure +python3 kie/predict_kie_token_ser.py \ + --kie_algorithm=LayoutXLM \ + --ser_model_dir=../inference/ser_vi_layoutxlm_xfund_infer \ + --image_dir=./docs/kie/input/zh_val_42.jpg \ + --ser_dict_path=../train_data/XFUND/class_list_xfun.txt \ + --vis_font_path=../doc/fonts/simfang.ttf \ + --ocr_order_method="tb-yx" +``` + +可视化结果保存在`output`目录下。 + +### 4.3 更多 + +关于KIE模型的训练评估与推理,请参考:[关键信息抽取教程](../../doc/doc_ch/kie.md)。 + +关于文本检测模型的训练评估与推理,请参考:[文本检测教程](../../doc/doc_ch/detection.md)。 + 
+关于文本识别模型的训练评估与推理,请参考:[文本识别教程](../../doc/doc_ch/recognition.md)。 + +关于怎样在自己的场景中完成关键信息抽取任务,请参考:[关键信息抽取全流程指南](./how_to_do_kie.md)。 + + +## 5. 参考链接 + +- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf +- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm +- XFUND dataset, https://github.com/doc-analysis/XFUND + +## 6. License + +The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) diff --git a/ppstructure/vqa/how_to_do_kie.md b/ppstructure/kie/how_to_do_kie.md similarity index 100% rename from ppstructure/vqa/how_to_do_kie.md rename to ppstructure/kie/how_to_do_kie.md diff --git a/ppstructure/kie/how_to_do_kie_en.md b/ppstructure/kie/how_to_do_kie_en.md new file mode 100644 index 0000000000000000000000000000000000000000..23b2394f5aa3911a1311d3bc3be8f362861d34af --- /dev/null +++ b/ppstructure/kie/how_to_do_kie_en.md @@ -0,0 +1,179 @@ + +# Key Information Extraction Pipeline + +- [1. Introduction](#1-Introduction) + - [1.1 Background](#11-Background) + - [1.2 Mainstream Deep-learning Solutions](#12-Mainstream-Deep-learning-Solutions) +- [2. KIE Pipeline](#2-KIE-Pipeline) + - [2.1 Train OCR Models](#21-Train-OCR-Models) + - [2.2 Train KIE Models](#22-Train-KIE-Models) +- [3. Reference](#3-Reference) + + +## 1. Introduction + +### 1.1 Background + +Key information extraction (KIE) refers to extracting key information from text or images. As the downstream task of OCR, KIE of document image has many practical application scenarios, such as form recognition, ticket information extraction, ID card information extraction, etc. However, it is time-consuming and laborious to extract key information from these document images by manpower. It's challengable but also valuable to combine multi-modal features (visual, layout, text, etc) together and complete KIE tasks. + +For the document images in a specific scene, the position and layout of the key information are relatively fixed. Therefore, in the early stage of the research, there are many methods based on template matching to extract the key information. This method is still widely used in many simple scenarios at present. However, it takes long time to adjut the template for different scenarios. + + +The KIE in the document image generally contains 2 subtasks, which is as shown follows. + +* (1) SER: semantic entity recognition, which classifies each detected textline, such as dividing it into name and ID card. As shown in the red boxes in the following figure. + +* (2) RE: relationship extraction, which matches the question and answer based on SER results. As shown in the figure below, the yellow arrows match the question and answer. + +
+ +
+ + + +### 1.2 Mainstream Deep-learning Solutions + +General KIE methods are based on Named Entity Recognition (NER), but such methods only use text information and ignore location and visual feature information, which leads to limited accuracy. In recent years, most scholars have started to combine mutil-modal features to improve the accuracy of KIE model. The main methods are as follows: + +* (1) Grid based methods. These methods mainly focus on the fusion of multi-modal information at the image level. Most texts are of character granularity. The text and structure information embedding method is simple, such as the algorithm of chargrid [1]. + +* (2) Token based methods. These methods refer to the NLP methods such as Bert, which encode the position, vision and other feature information into the multi-modal model, and conduct pre-training on large-scale datasets, so that in downstream tasks, only a small amount of annotation data is required to obtain excellent results. The representative algorithms are layoutlm [2], layoutlmv2 [3], layoutxlm [4], structext [5], etc. + +* (3) GCN based methods. These methods try to learn the structural information between images and characters, so as to solve the problem of extracting open set information (templates not seen in the training set), such as GCN [6], SDMGR [7] and other algorithms. + +* (4) End to end based methods: these methods put the existing OCR character recognition and KIE information extraction tasks into a unified network for common learning, and strengthen each other in the learning process. Such as TRIE [8]. + + +For more detailed introduction of the algorithms, please refer to Chapter 6 of [Diving into OCR](https://aistudio.baidu.com/aistudio/education/group/info/25207). + +## 2. KIE Pipeline + +Token based methods such as LayoutXLM are implemented in PaddleOCR. What's more, in PP-Structurev2, we simplify the LayoutXLM model and proposed VI-LayoutXLM, in which the visual feature extraction module is removed for speed-up. The textline sorting strategy conforming to the human reading order and UDML knowledge distillation strategy are utilized for higher model accuracy. + + +In the non end-to-end KIE method, KIE needs at least ** 2 steps**. Firstly, the OCR model is used to extract the text and its position. Secondly, the KIE model is used to extract the key information according to the image, text position and text content. + + +### 2.1 Train OCR Models + +#### 2.1.1 Text Detection + +**(1) Data** + +Most of the models provided in PaddleOCR are general models. In the process of text detection, the detection of adjacent text lines is generally based on the distance of the position. As shown in the figure above, when using PP-OCRv3 general English detection model for text detection, it is easy to detect the two fields representing different propoerties as one. Therefore, it is suggested to finetune a detection model according to your scenario firstly during the KIE task. + + +During data annotation, the different key information needs to be separated. Otherwise, it will increase the difficulty of subsequent KIE tasks. + +For downstream tasks, generally speaking, `200~300` training images can guarantee the basic training effect. If there is not too much prior knowledge, **`200~300`** images can be labeled firstly for subsequent text detection model training. + +**(2) Model** + +In terms of model selection, PP-OCRv3 detection model is recommended. 
For more information about the training methods of the detection model, please refer to: [Text detection tutorial](../../doc/doc_en/detection_en.md) and [PP-OCRv3 detection model tutorial](../../doc/doc_ch/PPOCRv3_det_train.md). + +#### 2.1.2 Text recognition + + +Compared with the natural scene, the text recognition in the document image is generally relatively easier (the background is not too complex), so **it is suggested to** try the PP-OCRv3 general text recognition model provided in PaddleOCR ([PP-OCRv3 model list](../../doc/doc_en/models_list_en.md)) + + +**(1) Data** + +However, there are also some challenges in some document scenarios, such as rare words in ID card scenarios and special fonts in invoice and other scenarios. These problems will increase the difficulty of text recognition. At this time, if you want to ensure or further improve the model accuracy, it is recommended to load PP-OCRv3 model based on the text recognition dataset of specific document scenarios for finetuning. + +In the process of model finetuning, it is recommended to prepare at least `5000` vertical scene text recognition images to ensure the basic model fine-tuning effect. If you want to improve the accuracy and generalization ability of the model, you can synthesize more text recognition images similar to the scene, collect general real text recognition data from the public data set, and add them to the text recognition training process. In the training process, it is suggested that the ratio of real data, synthetic data and general data of each epoch should be around `1:1:1`, which can be controlled by setting the sampling ratio of different data sources. If there are 3 training text files, including 10k, 20k and 50k pieces of data respectively, the data can be set in the configuration file as follows: + +```yml +Train: + dataset: + name: SimpleDataSet + data_dir: ./train_data/ + label_file_list: + - ./train_data/train_list_10k.txt + - ./train_data/train_list_10k.txt + - ./train_data/train_list_50k.txt + ratio_list: [1.0, 0.5, 0.2] + ... +``` + +**(2) Model** + +In terms of model selection, PP-OCRv3 recognition model is recommended. For more information about the training methods of the recognition model, please refer to: [Text recognition tutorial](../../doc/doc_en/recognition_en.md) and [PP-OCRv3 model list](../../doc/doc_en/models_list_en.md). + + +### 2.2 Train KIE Models + +There are two main methods to extract the key information from the recognized texts. + +(1) Directly use SER model to obtain the key information category. For example, in the ID card scenario, we mark "name" and "Geoff Sample" as "name_key" and "name_value", respectively. The **text field** corresponding to the category "name_value" finally identified is the key information we need. + +(2) Joint use SER and RE models. For this case, we firstly use SER model to obtain all questions (keys) and questions (values) for the image text, and then use RE model to match all keys and values to find the relationship, so as to complete the extraction of key information. + +#### 2.2.1 SER + +Take the ID card scenario as an example. The key information generally includes `name`, `DOB`, etc. We can directly mark the corresponding fields as specific categories, as shown in the following figure. + +
+ +
+ +**Note:** + +- In the labeling process, text content without key information about KIE shall be labeled as`other`, which is equivalent to background information. For example, in the ID card scenario, if we do not pay attention to `DOB` information, we can mark the categories of `DOB` and `Area manager` as `other`. +- In the annotation process of, it is required to annotate the **textline** position rather than the character. + + +In terms of data, generally speaking, for relatively fixed scenes, **50** training images can achieve acceptable effects. You can refer to [PPOCRLabel](../../PPOCRLabel/README.md) for finish the labeling process. + +In terms of model, it is recommended to use the VI-layoutXLM model proposed in PP-Structurev2. It is improved based on the LayoutXLM model, removing the visual feature extraction module, and further improving the model inference speed without the significant reduction on model accuracy. For more tutorials, please refer to [VI-LayoutXLM introduction](../../doc/doc_en/algorithm_kie_vi_layoutxlm_en.md) and [KIE tutorial](../../doc/doc_en/kie_en.md). + + +#### 2.2.2 SER + RE + +The SER model is mainly used to identify all keys and values in the document image, and the RE model is mainly used to match all keys and values. + +Taking the ID card scenario as an example, the key information generally includes key information such as `name`, `DOB`, etc. in the SER stage, we need to identify all questions (keys) and answers (values). The demo annotation is as follows. All keys can be annotated as `question`, and all keys can be annotated as `answer`. + + +
+ +
+ + +In the RE stage, the ID and connection information of each field need to be marked, as shown in the following figure. + +
+ +
+ +For each textline, you need to add 'ID' and 'linking' field information. The 'ID' records the unique identifier of the textline. Different text contents in the same images cannot be repeated. The 'linking' is a list that records the connection information between different texts. If the ID of the field "name" is 0 and the ID of the field "Geoff Sample" is 1, then they all have [[0, 1]] 'linking' marks, indicating that the fields with `id=0` and `id=1` form a key value relationship (the fields such as DOB and Expires are similar, and will not be repeated here). + + +**Note:** + +-During annotation, if value is multiple textines, a key value pair can be added in linking, such as `[[0, 1], [0, 2]]`. + +In terms of data, generally speaking, for relatively fixed scenes, about **50** training images can achieve acceptable effects. + +In terms of model, it is recommended to use the VI-layoutXLM model proposed in PP-Structurev2. It is improved based on the LayoutXLM model, removing the visual feature extraction module, and further improving the model inference speed without the significant reduction on model accuracy. For more tutorials, please refer to [VI-LayoutXLM introduction](../../doc/doc_en/algorithm_kie_vi_layoutxlm_en.md) and [KIE tutorial](../../doc/doc_en/kie_en.md). + + + +## 3. Reference + + +[1] Katti A R, Reisswig C, Guder C, et al. Chargrid: Towards understanding 2d documents[J]. arXiv preprint arXiv:1809.08799, 2018. + +[2] Xu Y, Li M, Cui L, et al. Layoutlm: Pre-training of text and layout for document image understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020: 1192-1200. + +[3] Xu Y, Xu Y, Lv T, et al. LayoutLMv2: Multi-modal pre-training for visually-rich document understanding[J]. arXiv preprint arXiv:2012.14740, 2020. + +[4]: Xu Y, Lv T, Cui L, et al. Layoutxlm: Multimodal pre-training for multilingual visually-rich document understanding[J]. arXiv preprint arXiv:2104.08836, 2021. + +[5] Li Y, Qian Y, Yu Y, et al. StrucTexT: Structured Text Understanding with Multi-Modal Transformers[C]//Proceedings of the 29th ACM International Conference on Multimedia. 2021: 1912-1920. + +[6] Liu X, Gao F, Zhang Q, et al. Graph convolution for multimodal information extraction from visually rich documents[J]. arXiv preprint arXiv:1903.11279, 2019. + +[7] Sun H, Kuang Z, Yue X, et al. Spatial Dual-Modality Graph Reasoning for Key Information Extraction[J]. arXiv preprint arXiv:2103.14470, 2021. + +[8] Zhang P, Xu Y, Cheng Z, et al. Trie: End-to-end text reading and information extraction for document understanding[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 1413-1422. 
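As a concrete illustration of the `id` and `linking` fields described in section 2.2.2, the sketch below writes one entirely hypothetical label line in the XFUND-style textline format consumed by the SER/RE configs; the image name, output path, transcriptions and coordinates are invented for illustration only:

```python
import json
import os

# Hypothetical annotation for one ID-card image: the key "name" (question, id=1)
# is linked to the value "Geoff Sample" (answer, id=2) through the shared pair [1, 2].
annotations = [
    {"transcription": "name", "points": [[10, 10], [90, 10], [90, 40], [10, 40]],
     "label": "question", "id": 1, "linking": [[1, 2]]},
    {"transcription": "Geoff Sample", "points": [[100, 10], [300, 10], [300, 40], [100, 40]],
     "label": "answer", "id": 2, "linking": [[1, 2]]},
]

# Each line of the label file is "<image name>\t<json list of textline annotations>".
os.makedirs("train_data/idcard_demo", exist_ok=True)
with open("train_data/idcard_demo/train.txt", "a", encoding="utf-8") as f:
    f.write("idcard_demo_001.jpg\t" + json.dumps(annotations, ensure_ascii=False) + "\n")
```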
diff --git a/ppstructure/vqa/predict_vqa_token_ser.py b/ppstructure/kie/predict_kie_token_ser.py similarity index 98% rename from ppstructure/vqa/predict_vqa_token_ser.py rename to ppstructure/kie/predict_kie_token_ser.py index 7647af9d10684bc6621b32e95d55e05948cb59b7..48cfc528a28e0a2bdfb51d3a537f26e891ae3286 100644 --- a/ppstructure/vqa/predict_vqa_token_ser.py +++ b/ppstructure/kie/predict_kie_token_ser.py @@ -30,7 +30,7 @@ from ppocr.data import create_operators, transform from ppocr.postprocess import build_post_process from ppocr.utils.logging import get_logger from ppocr.utils.visual import draw_ser_results -from ppocr.utils.utility import get_image_file_list, check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read from ppstructure.utility import parse_args from paddleocr import PaddleOCR @@ -49,7 +49,7 @@ class SerPredictor(object): pre_process_list = [{ 'VQATokenLabelEncode': { - 'algorithm': args.vqa_algorithm, + 'algorithm': args.kie_algorithm, 'class_path': args.ser_dict_path, 'contains_re': False, 'ocr_engine': self.ocr_engine, @@ -138,7 +138,7 @@ def main(args): os.path.join(args.output, 'infer.txt'), mode='w', encoding='utf-8') as f_w: for image_file in image_file_list: - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: img = cv2.imread(image_file) img = img[:, :, ::-1] diff --git a/ppstructure/kie/requirements.txt b/ppstructure/kie/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..53a7315d051704640b9a692ffaa52ce05fd16274 --- /dev/null +++ b/ppstructure/kie/requirements.txt @@ -0,0 +1,7 @@ +sentencepiece +yacs +seqeval +git+https://github.com/PaddlePaddle/PaddleNLP +pypandoc +attrdict +python_docx diff --git a/ppstructure/vqa/tools/eval_with_label_end2end.py b/ppstructure/kie/tools/eval_with_label_end2end.py similarity index 99% rename from ppstructure/vqa/tools/eval_with_label_end2end.py rename to ppstructure/kie/tools/eval_with_label_end2end.py index b13ffb568fd9610fee5d5a246c501ed5b90de91a..b0fd84363f450dfb7e4ef18e53adc17ef088cf18 100644 --- a/ppstructure/vqa/tools/eval_with_label_end2end.py +++ b/ppstructure/kie/tools/eval_with_label_end2end.py @@ -20,7 +20,7 @@ from shapely.geometry import Polygon import numpy as np from collections import defaultdict import operator -import Levenshtein +from rapidfuzz.distance import Levenshtein import argparse import json import copy diff --git a/ppstructure/vqa/tools/trans_funsd_label.py b/ppstructure/kie/tools/trans_funsd_label.py similarity index 100% rename from ppstructure/vqa/tools/trans_funsd_label.py rename to ppstructure/kie/tools/trans_funsd_label.py diff --git a/ppstructure/vqa/tools/trans_xfun_data.py b/ppstructure/kie/tools/trans_xfun_data.py similarity index 100% rename from ppstructure/vqa/tools/trans_xfun_data.py rename to ppstructure/kie/tools/trans_xfun_data.py diff --git a/ppstructure/layout/README.md b/ppstructure/layout/README_ch.md similarity index 87% rename from ppstructure/layout/README.md rename to ppstructure/layout/README_ch.md index 3762544b834d752a705216ca3f93d326aa1391ad..d5598fc1a896ea4cfcc94619e1744b9b7ec288b3 100644 --- a/ppstructure/layout/README.md +++ b/ppstructure/layout/README_ch.md @@ -1,29 +1,18 @@ - [1. 简介](#1-简介) - - [2. 安装](#2-安装) - - [2.1 安装PaddlePaddle](#21-安装paddlepaddle) - [2.2 安装PaddleDetection](#22-安装paddledetection) - - [3. 数据准备](#3-数据准备) - - [3.1 英文数据集](#31-英文数据集) - [3.2 更多数据集](#32-更多数据集) - - [4. 
开始训练](#4-开始训练) - - [4.1 启动训练](#41-启动训练) - [4.2 FGD蒸馏训练](#42-FGD蒸馏训练) - - [5. 模型评估与预测](#5-模型评估与预测) - - [5.1 指标评估](#51-指标评估) - [5.2 测试版面分析结果](#52-测试版面分析结果) - - [6 模型导出与预测](#6-模型导出与预测) - - [6.1 模型导出](#61-模型导出) - - [6.2 模型推理](#62-模型推理) # 版面分析 @@ -63,7 +52,7 @@ python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simp git clone https://github.com/PaddlePaddle/PaddleDetection.git ``` -- **(2)安装其他依赖 ** +- **(2)安装其他依赖** ```bash cd PaddleDetection @@ -138,7 +127,7 @@ json文件包含所有图像的标注,数据以字典嵌套的方式存放, ``` { - + 'segmentation': # 物体的分割标注 'area': 60518.099043117836, # 物体的区域面积 'iscrowd': 0, # iscrowd @@ -166,15 +155,17 @@ json文件包含所有图像的标注,数据以字典嵌套的方式存放, 提供了训练脚本、评估脚本和预测脚本,本节将以PubLayNet预训练模型为例进行讲解。 -如果不希望训练,直接体验后面的模型评估、预测、动转静、推理的流程,可以下载提供的预训练模型,并跳过本部分。 +如果不希望训练,直接体验后面的模型评估、预测、动转静、推理的流程,可以下载提供的预训练模型(PubLayNet数据集),并跳过本部分。 ``` mkdir pretrained_model cd pretrained_model -# 下载并解压PubLayNet预训练模型 +# 下载PubLayNet预训练模型 wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_layout.pdparams ``` +下载更多[版面分析模型](../docs/models_list.md)(中文CDLA数据集预训练模型、表格预训练模型) + ### 4.1. 启动训练 开始训练: @@ -184,7 +175,7 @@ wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_ 如果你希望训练自己的数据集,需要修改配置文件中的数据配置、类别数。 -以`configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml` 为例,修改的内容如下所示。 +以`configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml` 为例,修改的内容如下所示。 ```yaml metric: COCO @@ -223,16 +214,20 @@ TestDataset: # 训练日志会自动保存到 log 目录中 # 单卡训练 +export CUDA_VISIBLE_DEVICES=0 python3 tools/train.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ + -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml \ --eval # 多卡训练,通过--gpus参数指定卡号 +export CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ + -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml \ --eval ``` +**注意:**如果训练时显存out memory,将TrainReader中batch_size调小,同时LearningRate中base_lr等比例减小。发布的config均由8卡训练得到,如果改变GPU卡数为1,那么base_lr需要减小8倍。 + 正常启动训练后,会看到以下log输出: ``` @@ -254,9 +249,11 @@ PaddleDetection支持了基于FGD([Focal and Global Knowledge Distillation for D 更换数据集,修改【TODO】配置中的数据配置、类别数,具体可以参考4.1。启动训练: ```bash -python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - --slim_config configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x2_5_layout.yml \ +# 单卡训练 +export CUDA_VISIBLE_DEVICES=0 +python3 tools/train.py \ + -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml \ + --slim_config configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x2_5_layout.yml \ --eval ``` @@ -267,13 +264,13 @@ python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py \ ### 5.1. 
指标评估 -训练中模型参数默认保存在`output/picodet_lcnet_x1_0_layout`目录下。在评估指标时,需要设置`weights`指向保存的参数文件。评估数据集可以通过 `configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml` 修改`EvalDataset`中的 `image_dir`、`anno_path`和`dataset_dir` 设置。 +训练中模型参数默认保存在`output/picodet_lcnet_x1_0_layout`目录下。在评估指标时,需要设置`weights`指向保存的参数文件。评估数据集可以通过 `configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml` 修改`EvalDataset`中的 `image_dir`、`anno_path`和`dataset_dir` 设置。 ```bash # GPU 评估, weights 为待测权重 python3 tools/eval.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - -o weigths=./output/picodet_lcnet_x1_0_layout/best_model + -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml \ + -o weights=./output/picodet_lcnet_x1_0_layout/best_model ``` 会输出以下信息,打印出mAP、AP0.5等信息。 @@ -299,8 +296,8 @@ python3 tools/eval.py \ ``` python3 tools/eval.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - --slim_config configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x2_5_layout.yml \ + -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml \ + --slim_config configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x2_5_layout.yml \ -o weights=output/picodet_lcnet_x2_5_layout/best_model ``` @@ -311,18 +308,17 @@ python3 tools/eval.py \ ### 5.2. 测试版面分析结果 -预测使用的配置文件必须与训练一致,如您通过 `python3 tools/train.py -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml` 完成了模型的训练过程。 - -使用 PaddleDetection 训练好的模型,您可以使用如下命令进行中文模型预测。 +预测使用的配置文件必须与训练一致,如您通过 `python3 tools/train.py -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml` 完成了模型的训练过程。 +使用 PaddleDetection 训练好的模型,您可以使用如下命令进行模型预测。 ```bash python3 tools/infer.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ + -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml \ -o weights='output/picodet_lcnet_x1_0_layout/best_model.pdparams' \ --infer_img='docs/images/layout.jpg' \ --output_dir=output_dir/ \ - --draw_threshold=0.4 + --draw_threshold=0.5 ``` - `--infer_img`: 推理单张图片,也可以通过`--infer_dir`推理文件中的所有图片。 @@ -335,16 +331,15 @@ python3 tools/infer.py \ ``` python3 tools/infer.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - --slim_config configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x2_5_layout.yml \ + -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml \ + --slim_config configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x2_5_layout.yml \ -o weights='output/picodet_lcnet_x2_5_layout/best_model.pdparams' \ --infer_img='docs/images/layout.jpg' \ --output_dir=output_dir/ \ - --draw_threshold=0.4 + --draw_threshold=0.5 ``` - ## 6. 
模型导出与预测 @@ -356,7 +351,7 @@ inference 模型(`paddle.jit.save`保存的模型) 一般是模型训练, ```bash python3 tools/export_model.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ + -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml \ -o weights=output/picodet_lcnet_x1_0_layout/best_model \ --output_dir=output_inference/ ``` @@ -377,8 +372,8 @@ FGD蒸馏模型转inference模型步骤如下: ```bash python3 tools/export_model.py \ - -c configs/picodet/legacy_model/application/publayernet_lcnet_x1_5/picodet_student.yml \ - --slim_config configs/picodet/legacy_model/application/publayernet_lcnet_x1_5/picodet_teacher.yml \ + -c configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x1_0_layout.yml \ + --slim_config configs/picodet/legacy_model/application/layout_analysis/picodet_lcnet_x2_5_layout.yml \ -o weights=./output/picodet_lcnet_x2_5_layout/best_model \ --output_dir=output_inference/ ``` @@ -404,7 +399,7 @@ python3 deploy/python/infer.py \ ------------------------------------------ ----------- Model Configuration ----------- Model Arch: PicoDet -Transform Order: +Transform Order: --transform op: Resize --transform op: NormalizeImage --transform op: Permute @@ -466,4 +461,3 @@ preprocess_time(ms): 2172.50, inference_time(ms): 11.90, postprocess_time(ms): 1 year={2022} } ``` - diff --git a/ppstructure/layout/__init__.py b/ppstructure/layout/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..1d11e265597c7c8e39098a228108da3bb954b892 --- /dev/null +++ b/ppstructure/layout/__init__.py @@ -0,0 +1,13 @@ +# copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/ppstructure/layout/layout_in_ocr.md b/ppstructure/layout/layout_in_ocr.md deleted file mode 100644 index 3762544b834d752a705216ca3f93d326aa1391ad..0000000000000000000000000000000000000000 --- a/ppstructure/layout/layout_in_ocr.md +++ /dev/null @@ -1,469 +0,0 @@ -- [1. 简介](#1-简介) - -- [2. 安装](#2-安装) - - - [2.1 安装PaddlePaddle](#21-安装paddlepaddle) - - [2.2 安装PaddleDetection](#22-安装paddledetection) - -- [3. 数据准备](#3-数据准备) - - - [3.1 英文数据集](#31-英文数据集) - - [3.2 更多数据集](#32-更多数据集) - -- [4. 开始训练](#4-开始训练) - - - [4.1 启动训练](#41-启动训练) - - [4.2 FGD蒸馏训练](#42-FGD蒸馏训练) - -- [5. 模型评估与预测](#5-模型评估与预测) - - - [5.1 指标评估](#51-指标评估) - - [5.2 测试版面分析结果](#52-测试版面分析结果) - -- [6 模型导出与预测](#6-模型导出与预测) - - - [6.1 模型导出](#61-模型导出) - - - [6.2 模型推理](#62-模型推理) - -# 版面分析 - -## 1. 简介 - -版面分析指的是对图片形式的文档进行区域划分,定位其中的关键区域,如文字、标题、表格、图片等。版面分析算法基于[PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection)的轻量模型PP-PicoDet进行开发。 - -
- - - -## 2. 安装依赖 - -### 2.1. 安装PaddlePaddle - -- **(1) 安装PaddlePaddle** - -```bash -python3 -m pip install --upgrade pip - -# GPU安装 -python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple - -# CPU安装 -python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple -``` -更多需求,请参照[安装文档](https://www.paddlepaddle.org.cn/install/quick)中的说明进行操作。 - -### 2.2. 安装PaddleDetection - -- **(1)下载PaddleDetection源码** - -```bash -git clone https://github.com/PaddlePaddle/PaddleDetection.git -``` - -- **(2)安装其他依赖 ** - -```bash -cd PaddleDetection -python3 -m pip install -r requirements.txt -``` - -## 3. 数据准备 - -如果希望直接体验预测过程,可以跳过数据准备,下载我们提供的预训练模型。 - -### 3.1. 英文数据集 - -下载文档分析数据集[PubLayNet](https://developer.ibm.com/exchanges/data/all/publaynet/)(数据集96G),包含5个类:`{0: "Text", 1: "Title", 2: "List", 3:"Table", 4:"Figure"}` - -``` -# 下载数据 -wget https://dax-cdn.cdn.appdomain.cloud/dax-publaynet/1.0.0/publaynet.tar.gz -# 解压数据 -tar -xvf publaynet.tar.gz -``` - -解压之后的**目录结构:** - -``` -|-publaynet - |- test - |- PMC1277013_00004.jpg - |- PMC1291385_00002.jpg - | ... - |- train.json - |- train - |- PMC1291385_00002.jpg - |- PMC1277013_00004.jpg - | ... - |- val.json - |- val - |- PMC538274_00004.jpg - |- PMC539300_00004.jpg - | ... -``` - -**数据分布:** - -| File or Folder | Description | num | -| :------------- | :------------- | ------- | -| `train/` | 训练集图片 | 335,703 | -| `val/` | 验证集图片 | 11,245 | -| `test/` | 测试集图片 | 11,405 | -| `train.json` | 训练集标注文件 | - | -| `val.json` | 验证集标注文件 | - | - -**标注格式:** - -json文件包含所有图像的标注,数据以字典嵌套的方式存放,包含以下key: - -- info,表示标注文件info。 - -- licenses,表示标注文件licenses。 - -- images,表示标注文件中图像信息列表,每个元素是一张图像的信息。如下为其中一张图像的信息: - - ``` - { - 'file_name': 'PMC4055390_00006.jpg', # file_name - 'height': 601, # image height - 'width': 792, # image width - 'id': 341427 # image id - } - ``` - -- annotations,表示标注文件中目标物体的标注信息列表,每个元素是一个目标物体的标注信息。如下为其中一个目标物体的标注信息: - - ``` - { - - 'segmentation': # 物体的分割标注 - 'area': 60518.099043117836, # 物体的区域面积 - 'iscrowd': 0, # iscrowd - 'image_id': 341427, # image id - 'bbox': [50.58, 490.86, 240.15, 252.16], # bbox [x1,y1,w,h] - 'category_id': 1, # category_id - 'id': 3322348 # image id - } - ``` - -### 3.2. 更多数据集 - -我们提供了CDLA(中文版面分析)、TableBank(表格版面分析)等数据集的下连接,处理为上述标注文件json格式,即可以按相同方式进行训练。 - -| dataset | 简介 | -| ------------------------------------------------------------ | ------------------------------------------------------------ | -| [cTDaR2019_cTDaR](https://cndplab-founder.github.io/cTDaR2019/) | 用于表格检测(TRACKA)和表格识别(TRACKB)。图片类型包含历史数据集(以cTDaR_t0开头,如cTDaR_t00872.jpg)和现代数据集(以cTDaR_t1开头,cTDaR_t10482.jpg)。 | -| [IIIT-AR-13K](http://cvit.iiit.ac.in/usodi/iiitar13k.php) | 手动注释公开的年度报告中的图形或页面而构建的数据集,包含5类:table, figure, natural image, logo, and signature | -| [CDLA](https://github.com/buptlihang/CDLA) | 中文文档版面分析数据集,面向中文文献类(论文)场景,包含10类:Table、Figure、Figure caption、Table、Table caption、Header、Footer、Reference、Equation | -| [TableBank](https://github.com/doc-analysis/TableBank) | 用于表格检测和识别大型数据集,包含Word和Latex2种文档格式 | -| [DocBank](https://github.com/doc-analysis/DocBank) | 使用弱监督方法构建的大规模数据集(500K文档页面),用于文档布局分析,包含12类:Author、Caption、Date、Equation、Figure、Footer、List、Paragraph、Reference、Section、Table、Title | - - -## 4. 开始训练 - -提供了训练脚本、评估脚本和预测脚本,本节将以PubLayNet预训练模型为例进行讲解。 - -如果不希望训练,直接体验后面的模型评估、预测、动转静、推理的流程,可以下载提供的预训练模型,并跳过本部分。 - -``` -mkdir pretrained_model -cd pretrained_model -# 下载并解压PubLayNet预训练模型 -wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_layout.pdparams -``` - -### 4.1. 
启动训练 - -开始训练: - -* 修改配置文件 - -如果你希望训练自己的数据集,需要修改配置文件中的数据配置、类别数。 - - -以`configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml` 为例,修改的内容如下所示。 - -```yaml -metric: COCO -# 类别数 -num_classes: 5 - -TrainDataset: - !COCODataSet - # 修改为你自己的训练数据目录 - image_dir: train - # 修改为你自己的训练数据标签文件 - anno_path: train.json - # 修改为你自己的训练数据根目录 - dataset_dir: /root/publaynet/ - data_fields: ['image', 'gt_bbox', 'gt_class', 'is_crowd'] - -EvalDataset: - !COCODataSet - # 修改为你自己的验证数据目录 - image_dir: val - # 修改为你自己的验证数据标签文件 - anno_path: val.json - # 修改为你自己的验证数据根目录 - dataset_dir: /root/publaynet/ - -TestDataset: - !ImageFolder - # 修改为你自己的测试数据标签文件 - anno_path: /root/publaynet/val.json -``` - -* 开始训练,在训练时,会默认下载PP-PicoDet预训练模型,这里无需预先下载。 - -```bash -# GPU训练 支持单卡,多卡训练 -# 训练日志会自动保存到 log 目录中 - -# 单卡训练 -python3 tools/train.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - --eval - -# 多卡训练,通过--gpus参数指定卡号 -python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - --eval -``` - -正常启动训练后,会看到以下log输出: - -``` -[08/15 04:02:30] ppdet.utils.checkpoint INFO: Finish loading model weights: /root/.cache/paddle/weights/LCNet_x1_0_pretrained.pdparams -[08/15 04:02:46] ppdet.engine INFO: Epoch: [0] [ 0/1929] learning_rate: 0.040000 loss_vfl: 1.216707 loss_bbox: 1.142163 loss_dfl: 0.544196 loss: 2.903065 eta: 17 days, 13:50:26 batch_cost: 15.7452 data_cost: 2.9112 ips: 1.5243 images/s -[08/15 04:03:19] ppdet.engine INFO: Epoch: [0] [ 20/1929] learning_rate: 0.064000 loss_vfl: 1.180627 loss_bbox: 0.939552 loss_dfl: 0.442436 loss: 2.628206 eta: 2 days, 12:18:53 batch_cost: 1.5770 data_cost: 0.0008 ips: 15.2184 images/s -[08/15 04:03:47] ppdet.engine INFO: Epoch: [0] [ 40/1929] learning_rate: 0.088000 loss_vfl: 0.543321 loss_bbox: 1.071401 loss_dfl: 0.457817 loss: 2.057003 eta: 2 days, 0:07:03 batch_cost: 1.3190 data_cost: 0.0007 ips: 18.1954 images/s -[08/15 04:04:12] ppdet.engine INFO: Epoch: [0] [ 60/1929] learning_rate: 0.112000 loss_vfl: 0.630989 loss_bbox: 0.859183 loss_dfl: 0.384702 loss: 1.883143 eta: 1 day, 19:01:29 batch_cost: 1.2177 data_cost: 0.0006 ips: 19.7087 images/s -``` - -- `--eval`表示训练的同时,进行评估, 评估过程中默认将最佳模型,保存为 `output/picodet_lcnet_x1_0_layout/best_accuracy` 。 - -**注意,预测/评估时的配置文件请务必与训练一致。** - -### 4.2. FGD蒸馏训练 - -PaddleDetection支持了基于FGD([Focal and Global Knowledge Distillation for Detectors](https://arxiv.org/abs/2111.11837v1))蒸馏的目标检测模型训练过程,FGD蒸馏分为两个部分`Focal`和`Global`。`Focal`蒸馏分离图像的前景和背景,让学生模型分别关注教师模型的前景和背景部分特征的关键像素;`Global`蒸馏部分重建不同像素之间的关系并将其从教师转移到学生,以补偿`Focal`蒸馏中丢失的全局信息。 - -更换数据集,修改【TODO】配置中的数据配置、类别数,具体可以参考4.1。启动训练: - -```bash -python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - --slim_config configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x2_5_layout.yml \ - --eval -``` - -- `-c`: 指定模型配置文件。 -- `--slim_config`: 指定压缩策略配置文件。 - -## 5. 模型评估与预测 - -### 5.1. 
指标评估 - -训练中模型参数默认保存在`output/picodet_lcnet_x1_0_layout`目录下。在评估指标时,需要设置`weights`指向保存的参数文件。评估数据集可以通过 `configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml` 修改`EvalDataset`中的 `image_dir`、`anno_path`和`dataset_dir` 设置。 - -```bash -# GPU 评估, weights 为待测权重 -python3 tools/eval.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - -o weigths=./output/picodet_lcnet_x1_0_layout/best_model -``` - -会输出以下信息,打印出mAP、AP0.5等信息。 - -```py - Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.935 - Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.979 - Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.956 - Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.404 - Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.782 - Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.969 - Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.539 - Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.938 - Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.949 - Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.495 - Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.818 - Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.978 -[08/15 07:07:09] ppdet.engine INFO: Total sample number: 11245, averge FPS: 24.405059207157436 -[08/15 07:07:09] ppdet.engine INFO: Best test bbox ap is 0.935. -``` - -使用FGD蒸馏模型进行评估: - -``` -python3 tools/eval.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - --slim_config configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x2_5_layout.yml \ - -o weights=output/picodet_lcnet_x2_5_layout/best_model -``` - -- `-c`: 指定模型配置文件。 -- `--slim_config`: 指定蒸馏策略配置文件。 -- `-o weights`: 指定蒸馏算法训好的模型路径。 - -### 5.2. 测试版面分析结果 - - -预测使用的配置文件必须与训练一致,如您通过 `python3 tools/train.py -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml` 完成了模型的训练过程。 - -使用 PaddleDetection 训练好的模型,您可以使用如下命令进行中文模型预测。 - - -```bash -python3 tools/infer.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - -o weights='output/picodet_lcnet_x1_0_layout/best_model.pdparams' \ - --infer_img='docs/images/layout.jpg' \ - --output_dir=output_dir/ \ - --draw_threshold=0.4 -``` - -- `--infer_img`: 推理单张图片,也可以通过`--infer_dir`推理文件中的所有图片。 -- `--output_dir`: 指定可视化结果保存路径。 -- `--draw_threshold`:指定绘制结果框的NMS阈值。 - -预测图片如下所示,图片会存储在`output_dir`路径中。 - -使用FGD蒸馏模型进行测试: - -``` -python3 tools/infer.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - --slim_config configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x2_5_layout.yml \ - -o weights='output/picodet_lcnet_x2_5_layout/best_model.pdparams' \ - --infer_img='docs/images/layout.jpg' \ - --output_dir=output_dir/ \ - --draw_threshold=0.4 -``` - - - -## 6. 
模型导出与预测 - - -### 6.1 模型导出 - -inference 模型(`paddle.jit.save`保存的模型) 一般是模型训练,把模型结构和模型参数保存在文件中的固化模型,多用于预测部署场景。 训练过程中保存的模型是checkpoints模型,保存的只有模型的参数,多用于恢复训练等。 与checkpoints模型相比,inference 模型会额外保存模型的结构信息,在预测部署、加速推理上性能优越,灵活方便,适合于实际系统集成。 - -版面分析模型转inference模型步骤如下: - -```bash -python3 tools/export_model.py \ - -c configs/picodet/legacy_model/application/layout_detection/picodet_lcnet_x1_0_layout.yml \ - -o weights=output/picodet_lcnet_x1_0_layout/best_model \ - --output_dir=output_inference/ -``` - -* 如无需导出后处理,请指定:`-o export.benchmark=True`(如果-o已出现过,此处删掉-o) -* 如无需导出NMS,请指定:`-o export.nms=False` - -转换成功后,在目录下有三个文件: - -``` -output_inference/picodet_lcnet_x1_0_layout/ - ├── model.pdiparams # inference模型的参数文件 - ├── model.pdiparams.info # inference模型的参数信息,可忽略 - └── model.pdmodel # inference模型的模型结构文件 -``` - -FGD蒸馏模型转inference模型步骤如下: - -```bash -python3 tools/export_model.py \ - -c configs/picodet/legacy_model/application/publayernet_lcnet_x1_5/picodet_student.yml \ - --slim_config configs/picodet/legacy_model/application/publayernet_lcnet_x1_5/picodet_teacher.yml \ - -o weights=./output/picodet_lcnet_x2_5_layout/best_model \ - --output_dir=output_inference/ -``` - - - -### 6.2 模型推理 - -版面恢复任务进行推理,可以执行如下命令: - -```bash -python3 deploy/python/infer.py \ - --model_dir=output_inference/picodet_lcnet_x1_0_layout/ \ - --image_file=docs/images/layout.jpg \ - --device=CPU -``` - -- --device:指定GPU、CPU设备 - -模型推理完成,会看到以下log输出 - -``` ------------------------------------------- ------------ Model Configuration ----------- -Model Arch: PicoDet -Transform Order: ---transform op: Resize ---transform op: NormalizeImage ---transform op: Permute ---transform op: PadStride --------------------------------------------- -class_id:0, confidence:0.9921, left_top:[20.18,35.66],right_bottom:[341.58,600.99] -class_id:0, confidence:0.9914, left_top:[19.77,611.42],right_bottom:[341.48,901.82] -class_id:0, confidence:0.9904, left_top:[369.36,375.10],right_bottom:[691.29,600.59] -class_id:0, confidence:0.9835, left_top:[369.60,608.60],right_bottom:[691.38,736.72] -class_id:0, confidence:0.9830, left_top:[369.58,805.38],right_bottom:[690.97,901.80] -class_id:0, confidence:0.9716, left_top:[383.68,271.44],right_bottom:[688.93,335.39] -class_id:0, confidence:0.9452, left_top:[370.82,34.48],right_bottom:[688.10,63.54] -class_id:1, confidence:0.8712, left_top:[370.84,771.03],right_bottom:[519.30,789.13] -class_id:3, confidence:0.9856, left_top:[371.28,67.85],right_bottom:[685.73,267.72] -save result to: output/layout.jpg -Test iter 0 ------------------- Inference Time Info ---------------------- -total_time(ms): 2196.0, img_num: 1 -average latency time(ms): 2196.00, QPS: 0.455373 -preprocess_time(ms): 2172.50, inference_time(ms): 11.90, postprocess_time(ms): 11.60 -``` - -- Model:模型结构 -- Transform Order:预处理操作 -- class_id、confidence、left_top、right_bottom:分别表示类别id、置信度、左上角坐标、右下角坐标 -- save result to:可视化版面分析结果保存路径,默认保存到`./output`文件夹 -- Inference Time Info:推理时间,其中preprocess_time表示预处理耗时,inference_time表示模型预测耗时,postprocess_time表示后处理耗时 - -可视化版面结果如下图所示 - -
- - - -## Citations - -``` -@inproceedings{zhong2019publaynet, - title={PubLayNet: largest dataset ever for document layout analysis}, - author={Zhong, Xu and Tang, Jianbin and Yepes, Antonio Jimeno}, - booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)}, - year={2019}, - volume={}, - number={}, - pages={1015-1022}, - doi={10.1109/ICDAR.2019.00166}, - ISSN={1520-5363}, - month={Sep.}, - organization={IEEE} -} - -@inproceedings{yang2022focal, - title={Focal and global knowledge distillation for detectors}, - author={Yang, Zhendong and Li, Zhe and Jiang, Xiaohu and Gong, Yuan and Yuan, Zehuan and Zhao, Danpei and Yuan, Chun}, - booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, - pages={4643--4652}, - year={2022} -} -``` - diff --git a/ppstructure/layout/predict_layout.py b/ppstructure/layout/predict_layout.py index a58a63f4931336686cf7e7b4841819b17c31fdbf..9f8c884e144654901737191141622abfaa872d24 100755 --- a/ppstructure/layout/predict_layout.py +++ b/ppstructure/layout/predict_layout.py @@ -28,12 +28,13 @@ import tools.infer.utility as utility from ppocr.data import create_operators, transform from ppocr.postprocess import build_post_process from ppocr.utils.logging import get_logger -from ppocr.utils.utility import get_image_file_list, check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read from ppstructure.utility import parse_args from picodet_postprocess import PicoDetPostProcess logger = get_logger() + class LayoutPredictor(object): def __init__(self, args): pre_process_list = [{ @@ -109,7 +110,7 @@ def main(args): repeats = 50 for image_file in image_file_list: - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: img = cv2.imread(image_file) if img is None: diff --git a/ppstructure/predict_system.py b/ppstructure/predict_system.py index 053a8aac00ffe762dd05d7f8030db9aaa32c0f8a..d63ab3b3daf018af7d0872e42bd14b8823d193ae 100644 --- a/ppstructure/predict_system.py +++ b/ppstructure/predict_system.py @@ -28,13 +28,12 @@ import time import logging from copy import deepcopy -from ppocr.utils.utility import get_image_file_list, check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read from ppocr.utils.logging import get_logger from tools.infer.predict_system import TextSystem from ppstructure.layout.predict_layout import LayoutPredictor from ppstructure.table.predict_table import TableSystem, to_excel from ppstructure.utility import parse_args, draw_structure_result -from ppstructure.recovery.recovery_to_doc import convert_info_docx logger = get_logger() @@ -75,10 +74,10 @@ class StructureSystem(object): else: self.table_system = TableSystem(args) - elif self.mode == 'vqa': + elif self.mode == 'kie': raise NotImplementedError - def __call__(self, img, return_ocr_result_in_table=False): + def __call__(self, img, img_idx=0, return_ocr_result_in_table=False): time_dict = { 'image_orientation': 0, 'layout': 0, @@ -86,7 +85,7 @@ class StructureSystem(object): 'table_match': 0, 'det': 0, 'rec': 0, - 'vqa': 0, + 'kie': 0, 'all': 0 } start = time.time() @@ -143,8 +142,8 @@ class StructureSystem(object): time_dict['det'] += ocr_time_dict['det'] time_dict['rec'] += ocr_time_dict['rec'] - # remove style char, - # when using the recognition model trained on the PubtabNet dataset, + # remove style char, + # when using the recognition model trained on the PubtabNet dataset, # it will recognize the text format in the 
table, such as style_token = [ '', '', '', '', '', @@ -169,36 +168,40 @@ class StructureSystem(object): 'type': region['label'].lower(), 'bbox': [x1, y1, x2, y2], 'img': roi_img, - 'res': res + 'res': res, + 'img_idx': img_idx }) end = time.time() time_dict['all'] = end - start return res_list, time_dict - elif self.mode == 'vqa': + elif self.mode == 'kie': raise NotImplementedError return None, None -def save_structure_res(res, save_folder, img_name): +def save_structure_res(res, save_folder, img_name, img_idx=0): excel_save_folder = os.path.join(save_folder, img_name) os.makedirs(excel_save_folder, exist_ok=True) res_cp = deepcopy(res) # save res with open( - os.path.join(excel_save_folder, 'res.txt'), 'w', + os.path.join(excel_save_folder, 'res_{}.txt'.format(img_idx)), + 'w', encoding='utf8') as f: for region in res_cp: roi_img = region.pop('img') f.write('{}\n'.format(json.dumps(region))) - if region['type'] == 'table' and len(region[ + if region['type'].lower() == 'table' and len(region[ 'res']) > 0 and 'html' in region['res']: - excel_path = os.path.join(excel_save_folder, - '{}.xlsx'.format(region['bbox'])) + excel_path = os.path.join( + excel_save_folder, + '{}_{}.xlsx'.format(region['bbox'], img_idx)) to_excel(region['res']['html'], excel_path) - elif region['type'] == 'figure': - img_path = os.path.join(excel_save_folder, - '{}.jpg'.format(region['bbox'])) + elif region['type'].lower() == 'figure': + img_path = os.path.join( + excel_save_folder, + '{}_{}.jpg'.format(region['bbox'], img_idx)) cv2.imwrite(img_path, roi_img) @@ -214,28 +217,75 @@ def main(args): for i, image_file in enumerate(image_file_list): logger.info("[{}/{}] {}".format(i, img_num, image_file)) - img, flag = check_and_read_gif(image_file) + img, flag_gif, flag_pdf = check_and_read(image_file) img_name = os.path.basename(image_file).split('.')[0] - if not flag: + if not flag_gif and not flag_pdf: img = cv2.imread(image_file) - if img is None: - logger.error("error in loading image:{}".format(image_file)) - continue - res, time_dict = structure_sys(img) - if structure_sys.mode == 'structure': - save_structure_res(res, save_folder, img_name) - draw_img = draw_structure_result(img, res, args.vis_font_path) - img_save_path = os.path.join(save_folder, img_name, 'show.jpg') - elif structure_sys.mode == 'vqa': - raise NotImplementedError - # draw_img = draw_ser_results(img, res, args.vis_font_path) - # img_save_path = os.path.join(save_folder, img_name + '.jpg') - cv2.imwrite(img_save_path, draw_img) - logger.info('result save to {}'.format(img_save_path)) - if args.recovery: - convert_info_docx(img, res, save_folder, img_name) + if not flag_pdf: + if img is None: + logger.error("error in loading image:{}".format(image_file)) + continue + res, time_dict = structure_sys(img) + + if structure_sys.mode == 'structure': + save_structure_res(res, save_folder, img_name) + draw_img = draw_structure_result(img, res, args.vis_font_path) + img_save_path = os.path.join(save_folder, img_name, 'show.jpg') + elif structure_sys.mode == 'kie': + raise NotImplementedError + # draw_img = draw_ser_results(img, res, args.vis_font_path) + # img_save_path = os.path.join(save_folder, img_name + '.jpg') + cv2.imwrite(img_save_path, draw_img) + logger.info('result save to {}'.format(img_save_path)) + if args.recovery: + try: + from ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx + h, w, _ = img.shape + res = sorted_layout_boxes(res, w) + convert_info_docx(img, res, save_folder, img_name, + args.save_pdf) + 
except Exception as ex: + logger.error( + "error in layout recovery image:{}, err msg: {}".format( + image_file, ex)) + continue + else: + pdf_imgs = img + all_res = [] + for index, img in enumerate(pdf_imgs): + + res, time_dict = structure_sys(img, index) + if structure_sys.mode == 'structure' and res != []: + save_structure_res(res, save_folder, img_name, index) + draw_img = draw_structure_result(img, res, + args.vis_font_path) + img_save_path = os.path.join(save_folder, img_name, + 'show_{}.jpg'.format(index)) + elif structure_sys.mode == 'kie': + raise NotImplementedError + # draw_img = draw_ser_results(img, res, args.vis_font_path) + # img_save_path = os.path.join(save_folder, img_name + '.jpg') + if res != []: + cv2.imwrite(img_save_path, draw_img) + logger.info('result save to {}'.format(img_save_path)) + if args.recovery and res != []: + from ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx + h, w, _ = img.shape + res = sorted_layout_boxes(res, w) + all_res += res + + if args.recovery and all_res != []: + try: + convert_info_docx(img, all_res, save_folder, img_name, + args.save_pdf) + except Exception as ex: + logger.error( + "error in layout recovery image:{}, err msg: {}".format( + image_file, ex)) + continue + logger.info("Predict time : {:.3f}s".format(time_dict['all'])) diff --git a/ppstructure/recovery/README.md b/ppstructure/recovery/README.md index 883dbef3e829dfa213644b610af1ca279dac8641..90a6a2c3c4189dc885d698e4cac2d1a24a49d1df 100644 --- a/ppstructure/recovery/README.md +++ b/ppstructure/recovery/README.md @@ -6,10 +6,12 @@ English | [简体中文](README_ch.md) - [2.1 Installation dependencies](#2.1) - [2.2 Install PaddleOCR](#2.2) - [3. Quick Start](#3) + - [3.1 Download models](#3.1) + - [3.2 Layout recovery](#3.2) -## 1. Introduction +## 1. Introduction Layout recovery means that after OCR recognition, the content is still arranged like the original document pictures, and the paragraphs are output to word document in the same order. @@ -17,8 +19,9 @@ Layout recovery combines [layout analysis](../layout/README.md)、[table recogni The following figure shows the result:
+ ## 2. Install @@ -33,14 +36,14 @@ The following figure shows the result: python3 -m pip install --upgrade pip # GPU installation -python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple +python3 -m pip install "paddlepaddle-gpu" -i https://mirror.baidu.com/pypi/simple # CPU installation -python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple +python3 -m pip install "paddlepaddle" -i https://mirror.baidu.com/pypi/simple ```` -For more requirements, please refer to the instructions in [Installation Documentation](https://www.paddlepaddle.org.cn/install/quick). +For more requirements, please refer to the instructions in [Installation Documentation](https://www.paddlepaddle.org.cn/en/install/quick?docurl=/documentation/docs/en/install/pip/macos-pip_en.html). @@ -67,20 +70,61 @@ python3 -m pip install -r ppstructure/recovery/requirements.txt ## 3. Quick Start -```python + +### 3.1 Download models + +If input is English document, download English models: + +```bash cd PaddleOCR/ppstructure # download model mkdir inference && cd inference # Download the detection model of the ultra-lightweight English PP-OCRv3 model and unzip it -wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_infer.tar && tar xf ch_PP-OCRv3_det_infer.tar +https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_det_infer.tar && tar xf en_PP-OCRv3_det_infer.tar # Download the recognition model of the ultra-lightweight English PP-OCRv3 model and unzip it -wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar && tar xf ch_PP-OCRv3_rec_infer.tar +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_rec_infer.tar && tar xf en_PP-OCRv3_rec_infer.tar # Download the ultra-lightweight English table inch model and unzip it -wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar && tar xf en_ppocr_mobile_v2.0_table_structure_infer.tar +wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_infer.tar && tar xf en_ppstructure_mobile_v2.0_SLANet_infer.tar +# Download the layout model of publaynet dataset and unzip it +wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_infer.tar && tar xf picodet_lcnet_x1_0_fgd_layout_infer.tar cd .. 
-# run -python3 predict_system.py --det_model_dir=inference/en_PP-OCRv3_det_infer --rec_model_dir=inference/en_PP-OCRv3_rec_infer --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer --rec_char_dict_path=../ppocr/utils/en_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --output ./output/table --rec_image_shape=3,48,320 --vis_font_path=../doc/fonts/simfang.ttf --recovery=True --image_dir=./docs/table/1.png +``` +If input is Chinese document,download Chinese models: +[Chinese and English ultra-lightweight PP-OCRv3 model](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/README.md#pp-ocr-series-model-listupdate-on-september-8th)、[表格识别模型](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppstructure/docs/models_list.md#22-表格识别模型)、[版面分析模型](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppstructure/docs/models_list.md#1-版面分析模型) + + +### 3.2 Layout recovery + + +```bash +python3 predict_system.py \ + --image_dir=./docs/table/1.png \ + --det_model_dir=inference/en_PP-OCRv3_det_infer \ + --rec_model_dir=inference/en_PP-OCRv3_rec_infer \ + --rec_char_dict_path=../ppocr/utils/en_dict.txt \ + --table_model_dir=inference/en_ppstructure_mobile_v2.0_SLANet_infer \ + --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \ + --layout_model_dir=inference/picodet_lcnet_x1_0_fgd_layout_infer \ + --layout_dict_path=../ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt \ + --vis_font_path=../doc/fonts/simfang.ttf \ + --recovery=True \ + --save_pdf=False \ + --output=../output/ ``` -After running, the docx of each picture will be saved in the directory specified by the output field \ No newline at end of file +After running, the docx of each picture will be saved in the directory specified by the output field + +Field: + +- image_dir:test file测试文件, can be picture, picture directory, pdf file, pdf file directory +- det_model_dir:OCR detection model path +- rec_model_dir:OCR recognition model path +- rec_char_dict_path:OCR recognition dict path. If the Chinese model is used, change to "../ppocr/utils/ppocr_keys_v1.txt". And if you trained the model on your own dataset, change to the trained dictionary +- table_model_dir:tabel recognition model path +- table_char_dict_path:tabel recognition dict path. If the Chinese model is used, no need to change +- layout_model_dir:layout analysis model path +- layout_dict_path:layout analysis dict path. If the Chinese model is used, change to "../ppocr/utils/dict/layout_dict/layout_cdla_dict.txt" +- recovery:whether to enable layout of recovery, default False +- save_pdf:when recovery file, whether to save pdf file, default False +- output:save the recovery result path diff --git a/ppstructure/recovery/README_ch.md b/ppstructure/recovery/README_ch.md index 5a05abffd0399387bc0d22d878e64d03d8894a79..9215976d37e89c7f02a61a5dfcf2127ff98c998e 100644 --- a/ppstructure/recovery/README_ch.md +++ b/ppstructure/recovery/README_ch.md @@ -8,19 +8,22 @@ - [2.2 安装PaddleOCR](#2.2) - [3. 使用](#3) + - [3.1 下载模型](#3.1) + - [3.2 版面恢复](#3.2) -## 1. 简介 +## 1. 简介 版面恢复就是在OCR识别后,内容仍然像原文档图片那样排列着,段落不变、顺序不变的输出到word文档中等。 -版面恢复结合了[版面分析](../layout/README_ch.md)、[表格识别](../table/README_ch.md)技术,从而更好地恢复图片、表格、标题等内容,下图展示了版面恢复的结果: +版面恢复结合了[版面分析](../layout/README_ch.md)、[表格识别](../table/README_ch.md)技术,从而更好地恢复图片、表格、标题等内容,支持pdf文档、文档图片格式的输入文件,下图展示了版面恢复的结果:
+ ## 2. 安装 @@ -35,21 +38,15 @@ python3 -m pip install --upgrade pip # GPU安装 -python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple +python3 -m pip install "paddlepaddle-gpu" -i https://mirror.baidu.com/pypi/simple # CPU安装 -python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple +python3 -m pip install "paddlepaddle" -i https://mirror.baidu.com/pypi/simple ``` 更多需求,请参照[安装文档](https://www.paddlepaddle.org.cn/install/quick)中的说明进行操作。 -* **(2)安装依赖** - -```bash -python3 -m pip install -r ppstructure/recovery/requirements.txt -``` - ### 2.2 安装PaddleOCR @@ -75,23 +72,66 @@ python3 -m pip install -r ppstructure/recovery/requirements.txt ## 3. 使用 -恢复给定文档的版面: + + +### 3.1 下载模型 + +如果输入为英文文档类型,下载英文模型 -```python +```bash cd PaddleOCR/ppstructure # 下载模型 mkdir inference && cd inference -# 下载超英文轻量级PP-OCRv3模型的检测模型并解压 -wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_infer.tar && tar xf ch_PP-OCRv3_det_infer.tar -# 下载英文轻量级PP-OCRv3模型的识别模型并解压 -wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar && tar xf ch_PP-OCRv3_rec_infer.tar -# 下载超轻量级英文表格英寸模型并解压 -wget https://paddleocr.bj.bcebos.com/dygraph_v2.0/table/en_ppocr_mobile_v2.0_table_structure_infer.tar && tar xf en_ppocr_mobile_v2.0_table_structure_infer.tar +# 下载英文超轻量PP-OCRv3检测模型并解压 +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_det_infer.tar && tar xf en_PP-OCRv3_det_infer.tar +# 下载英文超轻量PP-OCRv3识别模型并解压 +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/english/en_PP-OCRv3_rec_infer.tar && tar xf en_PP-OCRv3_rec_infer.tar +# 下载英文表格识别模型并解压 +wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/en_ppstructure_mobile_v2.0_SLANet_infer.tar && tar xf en_ppstructure_mobile_v2.0_SLANet_infer.tar +# 下载英文版面分析模型 +wget https://paddleocr.bj.bcebos.com/ppstructure/models/layout/picodet_lcnet_x1_0_fgd_layout_infer.tar && tar xf picodet_lcnet_x1_0_fgd_layout_infer.tar cd .. 
-# 执行预测 -python3 predict_system.py --det_model_dir=inference/en_PP-OCRv3_det_infer --rec_model_dir=inference/en_PP-OCRv3_rec_infer --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer --rec_char_dict_path=../ppocr/utils/en_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --output ./output/table --rec_image_shape=3,48,320 --vis_font_path=../doc/fonts/simfang.ttf --recovery=True --image_dir=./docs/table/1.png ``` -运行完成后,每张图片的docx文档会保存到output字段指定的目录下 +如果输入为中文文档类型,在下述链接中下载中文模型即可: + +[PP-OCRv3中英文超轻量文本检测和识别模型](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/README_ch.md#pp-ocr%E7%B3%BB%E5%88%97%E6%A8%A1%E5%9E%8B%E5%88%97%E8%A1%A8%E6%9B%B4%E6%96%B0%E4%B8%AD)、[表格识别模型](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppstructure/docs/models_list.md#22-表格识别模型)、[版面分析模型](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppstructure/docs/models_list.md#1-版面分析模型) + + + +### 3.2 版面恢复 + +使用下载的模型恢复给定文档的版面,以英文模型为例,执行如下命令: + +```bash +python3 predict_system.py \ + --image_dir=./docs/table/1.png \ + --det_model_dir=inference/en_PP-OCRv3_det_infer \ + --rec_model_dir=inference/en_PP-OCRv3_rec_infer \ + --rec_char_dict_path=../ppocr/utils/en_dict.txt \ + --table_model_dir=inference/en_ppstructure_mobile_v2.0_SLANet_infer \ + --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \ + --layout_model_dir=inference/picodet_lcnet_x1_0_fgd_layout_infer \ + --layout_dict_path=../ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt \ + --vis_font_path=../doc/fonts/simfang.ttf \ + --recovery=True \ + --save_pdf=False \ + --output=../output/ +``` +运行完成后,恢复版面的docx文档会保存到`output`字段指定的目录下 + +字段含义: + +- image_dir:测试文件,可以是图片、图片目录、pdf文件、pdf文件目录 +- det_model_dir:OCR检测模型路径 +- rec_model_dir:OCR识别模型路径 +- rec_char_dict_path:OCR识别字典,如果更换为中文模型,需要更改为"../ppocr/utils/ppocr_keys_v1.txt",如果您在自己的数据集上训练的模型,则更改为训练的字典的文件 +- table_model_dir:表格识别模型路径 +- table_char_dict_path:表格识别字典,如果更换为中文模型,不需要更换字典 +- layout_model_dir:版面分析模型路径 +- layout_dict_path:版面分析字典,如果更换为中文模型,需要更改为"../ppocr/utils/dict/layout_dict/layout_cdla_dict.txt" +- recovery:是否进行版面恢复,默认False +- save_pdf:进行版面恢复导出docx文档的同时,是否保存为pdf文件,默认为False +- output:版面恢复结果保存路径 diff --git a/ppstructure/recovery/__init__.py b/ppstructure/recovery/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..1d11e265597c7c8e39098a228108da3bb954b892 --- /dev/null +++ b/ppstructure/recovery/__init__.py @@ -0,0 +1,13 @@ +# copyright (c) 2020 PaddlePaddle Authors. All Rights Reserve. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
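To make the recovery wiring easier to follow before reading the changes to `recovery_to_doc.py` below, here is a minimal Python sketch of the step that `predict_system.py` performs once structure analysis has finished. It assumes `img` (the page image) and `res` (the structure results for that page) already exist, and that `save_structure_res` has been called so the figure crops read back by `convert_info_docx` are on disk; `save_folder` and `img_name` are placeholder values, not fixed names from this change.

```python
# Minimal sketch of the layout recovery step, mirroring predict_system.py.
# Assumes `img` (the page image) and `res` (structure results) already exist
# and that save_structure_res() has written the figure/table crops to disk.
from ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx

save_folder = '../output/'   # same role as the --output flag above (placeholder)
img_name = 'demo'            # output will be <save_folder>/demo.docx (placeholder)
save_pdf = False             # set True to also export a PDF via docx2pdf

h, w, _ = img.shape                 # the page width drives single vs. double column layout
res = sorted_layout_boxes(res, w)   # reorder regions: top to bottom, left column before right
convert_info_docx(img, res, save_folder, img_name, save_pdf)
```

`sorted_layout_boxes` compares each region's x-coordinates against w/4 and 3w/4 to tag it as `single` or `double`, and `convert_info_docx` then switches the docx section to one or two columns accordingly.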
diff --git a/ppstructure/recovery/recovery_to_doc.py b/ppstructure/recovery/recovery_to_doc.py index 5278217d5b983008d357b6b1be3ab1b883a4939d..0a556093d17050f65440f8e962015a86de696107 100644 --- a/ppstructure/recovery/recovery_to_doc.py +++ b/ppstructure/recovery/recovery_to_doc.py @@ -22,21 +22,23 @@ from docx import shared from docx.enum.text import WD_ALIGN_PARAGRAPH from docx.enum.section import WD_SECTION from docx.oxml.ns import qn +from docx.enum.table import WD_TABLE_ALIGNMENT + +from ppstructure.recovery.table_process import HtmlToDocx from ppocr.utils.logging import get_logger logger = get_logger() -def convert_info_docx(img, res, save_folder, img_name): +def convert_info_docx(img, res, save_folder, img_name, save_pdf): doc = Document() doc.styles['Normal'].font.name = 'Times New Roman' doc.styles['Normal']._element.rPr.rFonts.set(qn('w:eastAsia'), u'宋体') doc.styles['Normal'].font.size = shared.Pt(6.5) - h, w, _ = img.shape - res = sorted_layout_boxes(res, w) flag = 1 for i, region in enumerate(res): + img_idx = region['img_idx'] if flag == 2 and region['layout'] == 'single': section = doc.add_section(WD_SECTION.CONTINUOUS) section._sectPr.xpath('./w:cols')[0].set(qn('w:num'), '1') @@ -46,10 +48,10 @@ def convert_info_docx(img, res, save_folder, img_name): section._sectPr.xpath('./w:cols')[0].set(qn('w:num'), '2') flag = 2 - if region['type'] == 'Figure': + if region['type'].lower() == 'figure': excel_save_folder = os.path.join(save_folder, img_name) img_path = os.path.join(excel_save_folder, - '{}.jpg'.format(region['bbox'])) + '{}_{}.jpg'.format(region['bbox'], img_idx)) paragraph_pic = doc.add_paragraph() paragraph_pic.alignment = WD_ALIGN_PARAGRAPH.CENTER run = paragraph_pic.add_run("") @@ -57,40 +59,38 @@ def convert_info_docx(img, res, save_folder, img_name): run.add_picture(img_path, width=shared.Inches(5)) elif flag == 2: run.add_picture(img_path, width=shared.Inches(2)) - elif region['type'] == 'Title': + elif region['type'].lower() == 'title': doc.add_heading(region['res'][0]['text']) - elif region['type'] == 'Text': + elif region['type'].lower() == 'table': + paragraph = doc.add_paragraph() + new_parser = HtmlToDocx() + new_parser.table_style = 'TableGrid' + table = new_parser.handle_table(html=region['res']['html']) + new_table = deepcopy(table) + new_table.alignment = WD_TABLE_ALIGNMENT.CENTER + paragraph.add_run().element.addnext(new_table._tbl) + + else: paragraph = doc.add_paragraph() paragraph_format = paragraph.paragraph_format for i, line in enumerate(region['res']): if i == 0: paragraph_format.first_line_indent = shared.Inches(0.25) text_run = paragraph.add_run(line['text'] + ' ') - text_run.font.size = shared.Pt(9) - elif region['type'] == 'Table': - pypandoc.convert( - source=region['res']['html'], - format='html', - to='docx', - outputfile='tmp.docx') - tmp_doc = Document('tmp.docx') - paragraph = doc.add_paragraph() - - table = tmp_doc.tables[0] - new_table = deepcopy(table) - new_table.style = doc.styles['Table Grid'] - from docx.enum.table import WD_TABLE_ALIGNMENT - new_table.alignment = WD_TABLE_ALIGNMENT.CENTER - paragraph.add_run().element.addnext(new_table._tbl) - os.remove('tmp.docx') - else: - continue + text_run.font.size = shared.Pt(10) # save to docx docx_path = os.path.join(save_folder, '{}.docx'.format(img_name)) doc.save(docx_path) logger.info('docx save to {}'.format(docx_path)) + # save to pdf + if save_pdf: + pdf_path = os.path.join(save_folder, '{}.pdf'.format(img_name)) + from docx2pdf import convert + convert(docx_path, pdf_path) + 
logger.info('pdf save to {}'.format(pdf_path)) + def sorted_layout_boxes(res, w): """ @@ -112,7 +112,7 @@ def sorted_layout_boxes(res, w): res_left = [] res_right = [] i = 0 - + while True: if i >= num_boxes: break @@ -137,7 +137,7 @@ def sorted_layout_boxes(res, w): res_left = [] res_right = [] break - elif _boxes[i]['bbox'][0] < w / 4 and _boxes[i]['bbox'][2] < 3*w / 4: + elif _boxes[i]['bbox'][0] < w / 4 and _boxes[i]['bbox'][2] < 3 * w / 4: _boxes[i]['layout'] = 'double' res_left.append(_boxes[i]) i += 1 @@ -157,4 +157,4 @@ def sorted_layout_boxes(res, w): new_res += res_left if res_right: new_res += res_right - return new_res \ No newline at end of file + return new_res diff --git a/ppstructure/recovery/requirements.txt b/ppstructure/recovery/requirements.txt index 04187baa2a72d2ac60f0a4e5ce643f882b7255fb..5ba3099d64574954c65ac8169798759dd7c053ac 100644 --- a/ppstructure/recovery/requirements.txt +++ b/ppstructure/recovery/requirements.txt @@ -1,3 +1,5 @@ -opencv-contrib-python==4.4.0.46 pypandoc -python-docx \ No newline at end of file +python-docx +docx2pdf +fitz +PyMuPDF \ No newline at end of file diff --git a/ppstructure/recovery/table_process.py b/ppstructure/recovery/table_process.py new file mode 100644 index 0000000000000000000000000000000000000000..243aaf8933791bf4704964d9665173fe70982f95 --- /dev/null +++ b/ppstructure/recovery/table_process.py @@ -0,0 +1,632 @@ + +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +This code is refer from:https://github.com/pqzx/html2docx/blob/8f6695a778c68befb302e48ac0ed5201ddbd4524/htmldocx/h2d.py + +""" +import re, argparse +import io, os +import urllib.request +from urllib.parse import urlparse +from html.parser import HTMLParser + +import docx, docx.table +from docx import Document +from docx.shared import RGBColor, Pt, Inches +from docx.enum.text import WD_COLOR, WD_ALIGN_PARAGRAPH +from docx.oxml import OxmlElement +from docx.oxml.ns import qn + +from bs4 import BeautifulSoup + +# values in inches +INDENT = 0.25 +LIST_INDENT = 0.5 +MAX_INDENT = 5.5 # To stop indents going off the page + +# Style to use with tables. By default no style is used. +DEFAULT_TABLE_STYLE = None + +# Style to use with paragraphs. By default no style is used. +DEFAULT_PARAGRAPH_STYLE = None + + +def get_filename_from_url(url): + return os.path.basename(urlparse(url).path) + +def is_url(url): + """ + Not to be used for actually validating a url, but in our use case we only + care if it's a url or a file path, and they're pretty distinguishable + """ + parts = urlparse(url) + return all([parts.scheme, parts.netloc, parts.path]) + +def fetch_image(url): + """ + Attempts to fetch an image from a url. + If successful returns a bytes object, else returns None + :return: + """ + try: + with urllib.request.urlopen(url) as response: + # security flaw? 
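+            # (the URL comes straight from the HTML img src and is opened without any validation, hence the note above)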
+ return io.BytesIO(response.read()) + except urllib.error.URLError: + return None + +def remove_last_occurence(ls, x): + ls.pop(len(ls) - ls[::-1].index(x) - 1) + +def remove_whitespace(string, leading=False, trailing=False): + """Remove white space from a string. + Args: + string(str): The string to remove white space from. + leading(bool, optional): Remove leading new lines when True. + trailing(bool, optional): Remove trailing new lines when False. + Returns: + str: The input string with new line characters removed and white space squashed. + Examples: + Single or multiple new line characters are replaced with space. + >>> remove_whitespace("abc\\ndef") + 'abc def' + >>> remove_whitespace("abc\\n\\n\\ndef") + 'abc def' + New line characters surrounded by white space are replaced with a single space. + >>> remove_whitespace("abc \\n \\n \\n def") + 'abc def' + >>> remove_whitespace("abc \\n \\n \\n def") + 'abc def' + Leading and trailing new lines are replaced with a single space. + >>> remove_whitespace("\\nabc") + ' abc' + >>> remove_whitespace(" \\n abc") + ' abc' + >>> remove_whitespace("abc\\n") + 'abc ' + >>> remove_whitespace("abc \\n ") + 'abc ' + Use ``leading=True`` to remove leading new line characters, including any surrounding + white space: + >>> remove_whitespace("\\nabc", leading=True) + 'abc' + >>> remove_whitespace(" \\n abc", leading=True) + 'abc' + Use ``trailing=True`` to remove trailing new line characters, including any surrounding + white space: + >>> remove_whitespace("abc \\n ", trailing=True) + 'abc' + """ + # Remove any leading new line characters along with any surrounding white space + if leading: + string = re.sub(r'^\s*\n+\s*', '', string) + + # Remove any trailing new line characters along with any surrounding white space + if trailing: + string = re.sub(r'\s*\n+\s*$', '', string) + + # Replace new line characters and absorb any surrounding space. + string = re.sub(r'\s*\n\s*', ' ', string) + # TODO need some way to get rid of extra spaces in e.g. text text + return re.sub(r'\s+', ' ', string) + +def delete_paragraph(paragraph): + # https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907 + p = paragraph._element + p.getparent().remove(p) + p._p = p._element = None + +font_styles = { + 'b': 'bold', + 'strong': 'bold', + 'em': 'italic', + 'i': 'italic', + 'u': 'underline', + 's': 'strike', + 'sup': 'superscript', + 'sub': 'subscript', + 'th': 'bold', +} + +font_names = { + 'code': 'Courier', + 'pre': 'Courier', +} + +styles = { + 'LIST_BULLET': 'List Bullet', + 'LIST_NUMBER': 'List Number', +} + +class HtmlToDocx(HTMLParser): + + def __init__(self): + super().__init__() + self.options = { + 'fix-html': True, + 'images': True, + 'tables': True, + 'styles': True, + } + self.table_row_selectors = [ + 'table > tr', + 'table > thead > tr', + 'table > tbody > tr', + 'table > tfoot > tr' + ] + self.table_style = DEFAULT_TABLE_STYLE + self.paragraph_style = DEFAULT_PARAGRAPH_STYLE + + def set_initial_attrs(self, document=None): + self.tags = { + 'span': [], + 'list': [], + } + if document: + self.doc = document + else: + self.doc = Document() + self.bs = self.options['fix-html'] # whether or not to clean with BeautifulSoup + self.document = self.doc + self.include_tables = True #TODO add this option back in? 
+ self.include_images = self.options['images'] + self.include_styles = self.options['styles'] + self.paragraph = None + self.skip = False + self.skip_tag = None + self.instances_to_skip = 0 + + def copy_settings_from(self, other): + """Copy settings from another instance of HtmlToDocx""" + self.table_style = other.table_style + self.paragraph_style = other.paragraph_style + + def get_cell_html(self, soup): + # Returns string of td element with opening and closing tags removed + # Cannot use find_all as it only finds element tags and does not find text which + # is not inside an element + return ' '.join([str(i) for i in soup.contents]) + + def add_styles_to_paragraph(self, style): + if 'text-align' in style: + align = style['text-align'] + if align == 'center': + self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.CENTER + elif align == 'right': + self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.RIGHT + elif align == 'justify': + self.paragraph.paragraph_format.alignment = WD_ALIGN_PARAGRAPH.JUSTIFY + if 'margin-left' in style: + margin = style['margin-left'] + units = re.sub(r'[0-9]+', '', margin) + margin = int(float(re.sub(r'[a-z]+', '', margin))) + if units == 'px': + self.paragraph.paragraph_format.left_indent = Inches(min(margin // 10 * INDENT, MAX_INDENT)) + # TODO handle non px units + + def add_styles_to_run(self, style): + if 'color' in style: + if 'rgb' in style['color']: + color = re.sub(r'[a-z()]+', '', style['color']) + colors = [int(x) for x in color.split(',')] + elif '#' in style['color']: + color = style['color'].lstrip('#') + colors = tuple(int(color[i:i+2], 16) for i in (0, 2, 4)) + else: + colors = [0, 0, 0] + # TODO map colors to named colors (and extended colors...) + # For now set color to black to prevent crashing + self.run.font.color.rgb = RGBColor(*colors) + + if 'background-color' in style: + if 'rgb' in style['background-color']: + color = color = re.sub(r'[a-z()]+', '', style['background-color']) + colors = [int(x) for x in color.split(',')] + elif '#' in style['background-color']: + color = style['background-color'].lstrip('#') + colors = tuple(int(color[i:i+2], 16) for i in (0, 2, 4)) + else: + colors = [0, 0, 0] + # TODO map colors to named colors (and extended colors...) + # For now set color to black to prevent crashing + self.run.font.highlight_color = WD_COLOR.GRAY_25 #TODO: map colors + + def apply_paragraph_style(self, style=None): + try: + if style: + self.paragraph.style = style + elif self.paragraph_style: + self.paragraph.style = self.paragraph_style + except KeyError as e: + raise ValueError(f"Unable to apply style {self.paragraph_style}.") from e + + def parse_dict_string(self, string, separator=';'): + new_string = string.replace(" ", '').split(separator) + string_dict = dict([x.split(':') for x in new_string if ':' in x]) + return string_dict + + def handle_li(self): + # check list stack to determine style and depth + list_depth = len(self.tags['list']) + if list_depth: + list_type = self.tags['list'][-1] + else: + list_type = 'ul' # assign unordered if no tag + + if list_type == 'ol': + list_style = styles['LIST_NUMBER'] + else: + list_style = styles['LIST_BULLET'] + + self.paragraph = self.doc.add_paragraph(style=list_style) + self.paragraph.paragraph_format.left_indent = Inches(min(list_depth * LIST_INDENT, MAX_INDENT)) + self.paragraph.paragraph_format.line_spacing = 1 + + def add_image_to_cell(self, cell, image): + # python-docx doesn't have method yet for adding images to table cells. 
For now we use this + paragraph = cell.add_paragraph() + run = paragraph.add_run() + run.add_picture(image) + + def handle_img(self, current_attrs): + if not self.include_images: + self.skip = True + self.skip_tag = 'img' + return + src = current_attrs['src'] + # fetch image + src_is_url = is_url(src) + if src_is_url: + try: + image = fetch_image(src) + except urllib.error.URLError: + image = None + else: + image = src + # add image to doc + if image: + try: + if isinstance(self.doc, docx.document.Document): + self.doc.add_picture(image) + else: + self.add_image_to_cell(self.doc, image) + except FileNotFoundError: + image = None + if not image: + if src_is_url: + self.doc.add_paragraph("" % src) + else: + # avoid exposing filepaths in document + self.doc.add_paragraph("" % get_filename_from_url(src)) + + + def handle_table(self, html): + """ + To handle nested tables, we will parse tables manually as follows: + Get table soup + Create docx table + Iterate over soup and fill docx table with new instances of this parser + Tell HTMLParser to ignore any tags until the corresponding closing table tag + """ + doc = Document() + table_soup = BeautifulSoup(html, 'html.parser') + rows, cols_len = self.get_table_dimensions(table_soup) + table = doc.add_table(len(rows), cols_len) + table.style = doc.styles['Table Grid'] + cell_row = 0 + for index, row in enumerate(rows): + cols = self.get_table_columns(row) + cell_col = 0 + for col in cols: + colspan = int(col.attrs.get('colspan', 1)) + rowspan = int(col.attrs.get('rowspan', 1)) + + cell_html = self.get_cell_html(col) + + if col.name == 'th': + cell_html = "%s" % cell_html + docx_cell = table.cell(cell_row, cell_col) + while docx_cell.text != '': # Skip the merged cell + cell_col += 1 + docx_cell = table.cell(cell_row, cell_col) + + cell_to_merge = table.cell(cell_row + rowspan - 1, cell_col + colspan - 1) + if docx_cell != cell_to_merge: + docx_cell.merge(cell_to_merge) + + child_parser = HtmlToDocx() + child_parser.copy_settings_from(self) + + child_parser.add_html_to_cell(cell_html or ' ', docx_cell) # occupy the position + + cell_col += colspan + cell_row += 1 + + # skip all tags until corresponding closing tag + self.instances_to_skip = len(table_soup.find_all('table')) + self.skip_tag = 'table' + self.skip = True + self.table = None + return table + + def handle_link(self, href, text): + # Link requires a relationship + is_external = href.startswith('http') + rel_id = self.paragraph.part.relate_to( + href, + docx.opc.constants.RELATIONSHIP_TYPE.HYPERLINK, + is_external=True # don't support anchor links for this library yet + ) + + # Create the w:hyperlink tag and add needed values + hyperlink = docx.oxml.shared.OxmlElement('w:hyperlink') + hyperlink.set(docx.oxml.shared.qn('r:id'), rel_id) + + + # Create sub-run + subrun = self.paragraph.add_run() + rPr = docx.oxml.shared.OxmlElement('w:rPr') + + # add default color + c = docx.oxml.shared.OxmlElement('w:color') + c.set(docx.oxml.shared.qn('w:val'), "0000EE") + rPr.append(c) + + # add underline + u = docx.oxml.shared.OxmlElement('w:u') + u.set(docx.oxml.shared.qn('w:val'), 'single') + rPr.append(u) + + subrun._r.append(rPr) + subrun._r.text = text + + # Add subrun to hyperlink + hyperlink.append(subrun._r) + + # Add hyperlink to run + self.paragraph._p.append(hyperlink) + + def handle_starttag(self, tag, attrs): + if self.skip: + return + if tag == 'head': + self.skip = True + self.skip_tag = tag + self.instances_to_skip = 0 + return + elif tag == 'body': + return + + current_attrs = dict(attrs) 
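+        # span/list/br only update parser state or the current run and return early; block-level tags (p, pre, li, hr, headings, img, table) are handled further down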
+ + if tag == 'span': + self.tags['span'].append(current_attrs) + return + elif tag == 'ol' or tag == 'ul': + self.tags['list'].append(tag) + return # don't apply styles for now + elif tag == 'br': + self.run.add_break() + return + + self.tags[tag] = current_attrs + if tag in ['p', 'pre']: + self.paragraph = self.doc.add_paragraph() + self.apply_paragraph_style() + + elif tag == 'li': + self.handle_li() + + elif tag == "hr": + + # This implementation was taken from: + # https://github.com/python-openxml/python-docx/issues/105#issuecomment-62806373 + + self.paragraph = self.doc.add_paragraph() + pPr = self.paragraph._p.get_or_add_pPr() + pBdr = OxmlElement('w:pBdr') + pPr.insert_element_before(pBdr, + 'w:shd', 'w:tabs', 'w:suppressAutoHyphens', 'w:kinsoku', 'w:wordWrap', + 'w:overflowPunct', 'w:topLinePunct', 'w:autoSpaceDE', 'w:autoSpaceDN', + 'w:bidi', 'w:adjustRightInd', 'w:snapToGrid', 'w:spacing', 'w:ind', + 'w:contextualSpacing', 'w:mirrorIndents', 'w:suppressOverlap', 'w:jc', + 'w:textDirection', 'w:textAlignment', 'w:textboxTightWrap', + 'w:outlineLvl', 'w:divId', 'w:cnfStyle', 'w:rPr', 'w:sectPr', + 'w:pPrChange' + ) + bottom = OxmlElement('w:bottom') + bottom.set(qn('w:val'), 'single') + bottom.set(qn('w:sz'), '6') + bottom.set(qn('w:space'), '1') + bottom.set(qn('w:color'), 'auto') + pBdr.append(bottom) + + elif re.match('h[1-9]', tag): + if isinstance(self.doc, docx.document.Document): + h_size = int(tag[1]) + self.paragraph = self.doc.add_heading(level=min(h_size, 9)) + else: + self.paragraph = self.doc.add_paragraph() + + elif tag == 'img': + self.handle_img(current_attrs) + return + + elif tag == 'table': + self.handle_table() + return + + # set new run reference point in case of leading line breaks + if tag in ['p', 'li', 'pre']: + self.run = self.paragraph.add_run() + + # add style + if not self.include_styles: + return + if 'style' in current_attrs and self.paragraph: + style = self.parse_dict_string(current_attrs['style']) + self.add_styles_to_paragraph(style) + + def handle_endtag(self, tag): + if self.skip: + if not tag == self.skip_tag: + return + + if self.instances_to_skip > 0: + self.instances_to_skip -= 1 + return + + self.skip = False + self.skip_tag = None + self.paragraph = None + + if tag == 'span': + if self.tags['span']: + self.tags['span'].pop() + return + elif tag == 'ol' or tag == 'ul': + remove_last_occurence(self.tags['list'], tag) + return + elif tag == 'table': + self.table_no += 1 + self.table = None + self.doc = self.document + self.paragraph = None + + if tag in self.tags: + self.tags.pop(tag) + # maybe set relevant reference to None? + + def handle_data(self, data): + if self.skip: + return + + # Only remove white space if we're not in a pre block. 
+ if 'pre' not in self.tags: + # remove leading and trailing whitespace in all instances + data = remove_whitespace(data, True, True) + + if not self.paragraph: + self.paragraph = self.doc.add_paragraph() + self.apply_paragraph_style() + + # There can only be one nested link in a valid html document + # You cannot have interactive content in an A tag, this includes links + # https://html.spec.whatwg.org/#interactive-content + link = self.tags.get('a') + if link: + self.handle_link(link['href'], data) + else: + # If there's a link, dont put the data directly in the run + self.run = self.paragraph.add_run(data) + spans = self.tags['span'] + for span in spans: + if 'style' in span: + style = self.parse_dict_string(span['style']) + self.add_styles_to_run(style) + + # add font style and name + for tag in self.tags: + if tag in font_styles: + font_style = font_styles[tag] + setattr(self.run.font, font_style, True) + + if tag in font_names: + font_name = font_names[tag] + self.run.font.name = font_name + + def ignore_nested_tables(self, tables_soup): + """ + Returns array containing only the highest level tables + Operates on the assumption that bs4 returns child elements immediately after + the parent element in `find_all`. If this changes in the future, this method will need to be updated + :return: + """ + new_tables = [] + nest = 0 + for table in tables_soup: + if nest: + nest -= 1 + continue + new_tables.append(table) + nest = len(table.find_all('table')) + return new_tables + + def get_table_rows(self, table_soup): + # If there's a header, body, footer or direct child tr tags, add row dimensions from there + return table_soup.select(', '.join(self.table_row_selectors), recursive=False) + + def get_table_columns(self, row): + # Get all columns for the specified row tag. + return row.find_all(['th', 'td'], recursive=False) if row else [] + + def get_table_dimensions(self, table_soup): + # Get rows for the table + rows = self.get_table_rows(table_soup) + # Table is either empty or has non-direct children between table and tr tags + # Thus the row dimensions and column dimensions are assumed to be 0 + + cols = self.get_table_columns(rows[0]) if rows else [] + # Add colspan calculation column number + col_count = 0 + for col in cols: + colspan = col.attrs.get('colspan', 1) + col_count += int(colspan) + + # return len(rows), col_count + return rows, col_count + + def get_tables(self): + if not hasattr(self, 'soup'): + self.include_tables = False + return + # find other way to do it, or require this dependency? 
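+        # pre-scan with BeautifulSoup and keep only top-level tables, so a table nested inside a cell is not counted as a separate top-level table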
+ self.tables = self.ignore_nested_tables(self.soup.find_all('table')) + self.table_no = 0 + + def run_process(self, html): + if self.bs and BeautifulSoup: + self.soup = BeautifulSoup(html, 'html.parser') + html = str(self.soup) + if self.include_tables: + self.get_tables() + self.feed(html) + + def add_html_to_document(self, html, document): + if not isinstance(html, str): + raise ValueError('First argument needs to be a %s' % str) + elif not isinstance(document, docx.document.Document) and not isinstance(document, docx.table._Cell): + raise ValueError('Second argument needs to be a %s' % docx.document.Document) + self.set_initial_attrs(document) + self.run_process(html) + + def add_html_to_cell(self, html, cell): + self.set_initial_attrs(cell) + self.run_process(html) + + def parse_html_file(self, filename_html, filename_docx=None): + with open(filename_html, 'r') as infile: + html = infile.read() + self.set_initial_attrs() + self.run_process(html) + if not filename_docx: + path, filename = os.path.split(filename_html) + filename_docx = '%s/new_docx_file_%s' % (path, filename) + self.doc.save('%s.docx' % filename_docx) + + def parse_html_string(self, html): + self.set_initial_attrs() + self.run_process(html) + return self.doc \ No newline at end of file diff --git a/ppstructure/table/README.md b/ppstructure/table/README.md index 3732a89c54b3686a6d8cf390d3b9043826c4f459..e5c85eb9619ea92cd8b31041907d518eeceaf6a5 100644 --- a/ppstructure/table/README.md +++ b/ppstructure/table/README.md @@ -33,8 +33,8 @@ We evaluated the algorithm on the PubTabNet[1] eval dataset, and the |Method|Acc|[TEDS(Tree-Edit-Distance-based Similarity)](https://github.com/ibm-aur-nlp/PubTabNet/tree/master/src)|Speed| | --- | --- | --- | ---| | EDD[2] |x| 88.3 |x| -| TableRec-RARE(ours) |73.8%| 95.3% |1550ms| -| SLANet(ours) | 76.2%| 95.85% |766ms| +| TableRec-RARE(ours) | 71.73%| 93.88% |779ms| +| SLANet(ours) | 76.31%| 95.89%|766ms| The performance indicators are explained as follows: - Acc: The accuracy of the table structure in each image, a wrong token is considered an error. @@ -59,16 +59,16 @@ cd PaddleOCR/ppstructure # download model mkdir inference && cd inference # Download the PP-OCRv3 text detection model and unzip it -wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_slim_infer.tar && tar xf ch_PP-OCRv3_det_slim_infer.tar +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_infer.tar && tar xf ch_PP-OCRv3_det_infer.tar # Download the PP-OCRv3 text recognition model and unzip it -wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_slim_infer.tar && tar xf ch_PP-OCRv3_rec_slim_infer.tar +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar && tar xf ch_PP-OCRv3_rec_infer.tar # Download the PP-Structurev2 form recognition model and unzip it wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar && tar xf ch_ppstructure_mobile_v2.0_SLANet_infer.tar cd .. 
# run python3.7 table/predict_table.py \ - --det_model_dir=inference/ch_PP-OCRv3_det_slim_infer \ - --rec_model_dir=inference/ch_PP-OCRv3_rec_slim_infer \ + --det_model_dir=inference/ch_PP-OCRv3_det_infer \ + --rec_model_dir=inference/ch_PP-OCRv3_rec_infer \ --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer \ --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt \ --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt \ diff --git a/ppstructure/table/README_ch.md b/ppstructure/table/README_ch.md index cc73f8bcec727f6eff1bf412fb877373d405e489..086e39348e96abe4320debef1cc11487694ccd49 100644 --- a/ppstructure/table/README_ch.md +++ b/ppstructure/table/README_ch.md @@ -39,8 +39,8 @@ |算法|Acc|[TEDS(Tree-Edit-Distance-based Similarity)](https://github.com/ibm-aur-nlp/PubTabNet/tree/master/src)|Speed| | --- | --- | --- | ---| | EDD[2] |x| 88.3% |x| -| TableRec-RARE(ours) |73.8%| 95.3% |1550ms| -| SLANet(ours) | 76.2%| 95.85% |766ms| +| TableRec-RARE(ours) | 71.73%| 93.88% |779ms| +| SLANet(ours) |76.31%| 95.89%|766ms| 性能指标解释如下: - Acc: 模型对每张图像里表格结构的识别准确率,错一个token就算错误。 @@ -64,16 +64,16 @@ cd PaddleOCR/ppstructure # 下载模型 mkdir inference && cd inference # 下载PP-OCRv3文本检测模型并解压 -wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_slim_infer.tar && tar xf ch_PP-OCRv3_det_slim_infer.tar +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_det_infer.tar && tar xf ch_PP-OCRv3_det_infer.tar # 下载PP-OCRv3文本识别模型并解压 -wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_slim_infer.tar && tar xf ch_PP-OCRv3_rec_slim_infer.tar +wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_infer.tar && tar xf ch_PP-OCRv3_rec_infer.tar # 下载PP-Structurev2表格识别模型并解压 wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar && tar xf ch_ppstructure_mobile_v2.0_SLANet_infer.tar cd .. 
# 执行表格识别 python table/predict_table.py \ - --det_model_dir=inference/ch_PP-OCRv3_det_slim_infer \ - --rec_model_dir=inference/ch_PP-OCRv3_rec_slim_infer \ + --det_model_dir=inference/ch_PP-OCRv3_det_infer \ + --rec_model_dir=inference/ch_PP-OCRv3_rec_infer \ --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer \ --rec_char_dict_path=../ppocr/utils/ppocr_keys_v1.txt \ --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt \ diff --git a/ppstructure/table/predict_structure.py b/ppstructure/table/predict_structure.py index a580947aad428a0744e3da4b8302f047c6b11bee..45cbba3e298004d3711b05e6fb7cffecae637601 100755 --- a/ppstructure/table/predict_structure.py +++ b/ppstructure/table/predict_structure.py @@ -29,7 +29,7 @@ import tools.infer.utility as utility from ppocr.data import create_operators, transform from ppocr.postprocess import build_post_process from ppocr.utils.logging import get_logger -from ppocr.utils.utility import get_image_file_list, check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read from ppocr.utils.visual import draw_rectangle from ppstructure.utility import parse_args @@ -133,7 +133,7 @@ def main(args): os.path.join(args.output, 'infer.txt'), mode='w', encoding='utf-8') as f_w: for image_file in image_file_list: - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: img = cv2.imread(image_file) if img is None: diff --git a/ppstructure/table/predict_table.py b/ppstructure/table/predict_table.py index e94347d86144cd66474546e99a2c9dffee4978d9..aeec66deca62f648df249a5833dbfa678d2da612 100644 --- a/ppstructure/table/predict_table.py +++ b/ppstructure/table/predict_table.py @@ -31,7 +31,7 @@ import tools.infer.predict_rec as predict_rec import tools.infer.predict_det as predict_det import tools.infer.utility as utility from tools.infer.predict_system import sorted_boxes -from ppocr.utils.utility import get_image_file_list, check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read from ppocr.utils.logging import get_logger from ppstructure.table.matcher import TableMatch from ppstructure.table.table_master_match import TableMasterMatcher @@ -194,7 +194,7 @@ def main(args): for i, image_file in enumerate(image_file_list): logger.info("[{}/{}] {}".format(i, img_num, image_file)) - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) excel_path = os.path.join( args.output, os.path.basename(image_file).split('.')[0] + '.xlsx') if not flag: diff --git a/ppstructure/table/table_metric/table_metric.py b/ppstructure/table/table_metric/table_metric.py index 9aca98ad785d4614a803fa5a277a6e4a27b3b078..923a9c0071d083de72a2a896d6f62037373d4e73 100755 --- a/ppstructure/table/table_metric/table_metric.py +++ b/ppstructure/table/table_metric/table_metric.py @@ -9,7 +9,7 @@ # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # Apache 2.0 License for more details. 
-import distance +from rapidfuzz.distance import Levenshtein from apted import APTED, Config from apted.helpers import Tree from lxml import etree, html @@ -39,17 +39,6 @@ class TableTree(Tree): class CustomConfig(Config): - @staticmethod - def maximum(*sequences): - """Get maximum possible value - """ - return max(map(len, sequences)) - - def normalized_distance(self, *sequences): - """Get distance from 0 to 1 - """ - return float(distance.levenshtein(*sequences)) / self.maximum(*sequences) - def rename(self, node1, node2): """Compares attributes of trees""" #print(node1.tag) @@ -58,23 +47,12 @@ class CustomConfig(Config): if node1.tag == 'td': if node1.content or node2.content: #print(node1.content, ) - return self.normalized_distance(node1.content, node2.content) + return Levenshtein.normalized_distance(node1.content, node2.content) return 0. class CustomConfig_del_short(Config): - @staticmethod - def maximum(*sequences): - """Get maximum possible value - """ - return max(map(len, sequences)) - - def normalized_distance(self, *sequences): - """Get distance from 0 to 1 - """ - return float(distance.levenshtein(*sequences)) / self.maximum(*sequences) - def rename(self, node1, node2): """Compares attributes of trees""" if (node1.tag != node2.tag) or (node1.colspan != node2.colspan) or (node1.rowspan != node2.rowspan): @@ -90,21 +68,10 @@ class CustomConfig_del_short(Config): node1_content = ['####'] if len(node2_content) < 3: node2_content = ['####'] - return self.normalized_distance(node1_content, node2_content) + return Levenshtein.normalized_distance(node1_content, node2_content) return 0. class CustomConfig_del_block(Config): - @staticmethod - def maximum(*sequences): - """Get maximum possible value - """ - return max(map(len, sequences)) - - def normalized_distance(self, *sequences): - """Get distance from 0 to 1 - """ - return float(distance.levenshtein(*sequences)) / self.maximum(*sequences) - def rename(self, node1, node2): """Compares attributes of trees""" if (node1.tag != node2.tag) or (node1.colspan != node2.colspan) or (node1.rowspan != node2.rowspan): @@ -120,7 +87,7 @@ class CustomConfig_del_block(Config): while ' ' in node2_content: print(node2_content.index(' ')) node2_content.pop(node2_content.index(' ')) - return self.normalized_distance(node1_content, node2_content) + return Levenshtein.normalized_distance(node1_content, node2_content) return 0. 
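
For reference on the dependency swap above: `Levenshtein.normalized_distance` from `rapidfuzz` already divides the edit distance by the longer of the two sequences, which is exactly what the removed `maximum`/`normalized_distance` helpers computed with the old `distance` package, so the custom helpers can be dropped without changing the TEDS scores. A minimal sketch of the equivalence, using made-up token lists (requires `rapidfuzz`, which replaces `python-Levenshtein` in `requirements.txt` later in this diff):

```python
from rapidfuzz.distance import Levenshtein

# Hypothetical token sequences for two table cells.
pred_tokens = ['<td>', 'Total', '1,234', '</td>']
gt_tokens = ['<td>', 'Total', '1,234.0', '</td>']

# Old behaviour: float(distance.levenshtein(a, b)) / max(len(a), len(b)).
# rapidfuzz applies the same max-length normalization internally.
print(Levenshtein.distance(pred_tokens, gt_tokens))             # 1 (one substitution)
print(Levenshtein.normalized_distance(pred_tokens, gt_tokens))  # 0.25 (= 1 / 4)
```
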
class TEDS(object): diff --git a/ppstructure/utility.py b/ppstructure/utility.py index 625185e6f5b090641befc35b3b4980c331687cff..bdea0af69e37e15d1f191b2a86c036ae1c2b1e45 100644 --- a/ppstructure/utility.py +++ b/ppstructure/utility.py @@ -38,7 +38,7 @@ def init_args(): parser.add_argument( "--layout_dict_path", type=str, - default="../ppocr/utils/dict/layout_dict/layout_pubalynet_dict.txt") + default="../ppocr/utils/dict/layout_dict/layout_publaynet_dict.txt") parser.add_argument( "--layout_score_threshold", type=float, @@ -49,8 +49,8 @@ def init_args(): type=float, default=0.5, help="Threshold of nms.") - # params for vqa - parser.add_argument("--vqa_algorithm", type=str, default='LayoutXLM') + # params for kie + parser.add_argument("--kie_algorithm", type=str, default='LayoutXLM') parser.add_argument("--ser_model_dir", type=str) parser.add_argument( "--ser_dict_path", @@ -63,7 +63,7 @@ def init_args(): "--mode", type=str, default='structure', - help='structure and vqa is supported') + help='structure and kie is supported') parser.add_argument( "--image_orientation", type=bool, @@ -84,11 +84,18 @@ def init_args(): type=str2bool, default=True, help='In the forward, whether the non-table area is recognition by ocr') + # param for recovery parser.add_argument( "--recovery", - type=bool, + type=str2bool, default=False, help='Whether to enable layout of recovery') + parser.add_argument( + "--save_pdf", + type=str2bool, + default=False, + help='Whether to save pdf file') + return parser diff --git a/ppstructure/vqa/README.md b/ppstructure/vqa/README.md deleted file mode 100644 index 28b794383bceccf655bdf00df5ee0c98841e2e95..0000000000000000000000000000000000000000 --- a/ppstructure/vqa/README.md +++ /dev/null @@ -1,285 +0,0 @@ -English | [简体中文](README_ch.md) - -- [1 Introduction](#1-introduction) -- [2. Performance](#2-performance) -- [3. Effect demo](#3-effect-demo) - - [3.1 SER](#31-ser) - - [3.2 RE](#32-re) -- [4. Install](#4-install) - - [4.1 Install dependencies](#41-install-dependencies) - - [5.3 RE](#53-re) -- [6. Reference Links](#6-reference-links) -- [License](#license) - -# Document Visual Question Answering - -## 1 Introduction - -VQA refers to visual question answering, which mainly asks and answers image content. DOC-VQA is one of the VQA tasks. DOC-VQA mainly asks questions about the text content of text images. - -The DOC-VQA algorithm in PP-Structure is developed based on the PaddleNLP natural language processing algorithm library. - -The main features are as follows: - -- Integrate [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) model and PP-OCR prediction engine. -- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. Based on the SER task, the text recognition and classification in the image can be completed; based on the RE task, the relationship extraction of the text content in the image can be completed, such as judging the problem pair (pair). -- Supports custom training for SER tasks and RE tasks. -- Supports end-to-end system prediction and evaluation of OCR+SER. -- Supports end-to-end system prediction of OCR+SER+RE. - - -This project is an open source implementation of [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2, -Included fine-tuning code on [XFUND dataset](https://github.com/doc-analysis/XFUND). - -## 2. 
Performance - -We evaluate the algorithm on the Chinese dataset of [XFUND](https://github.com/doc-analysis/XFUND), and the performance is as follows - -| Model | Task | hmean | Model download address | -|:---:|:---:|:---:| :---:| -| LayoutXLM | SER | 0.9038 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar) | -| LayoutXLM | RE | 0.7483 | [link](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar) | -| LayoutLMv2 | SER | 0.8544 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLMv2_xfun_zh.tar) -| LayoutLMv2 | RE | 0.6777 | [link](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutLMv2_xfun_zh.tar) | -| LayoutLM | SER | 0.7731 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLM_xfun_zh.tar) | - -## 3. Effect demo - -**Note:** The test images are from the XFUND dataset. - - -### 3.1 SER - -![](../docs/vqa/result_ser/zh_val_0_ser.jpg) | ![](../docs/vqa/result_ser/zh_val_42_ser.jpg) ----|--- - -Boxes with different colors in the figure represent different categories. For the XFUND dataset, there are 3 categories: `QUESTION`, `ANSWER`, `HEADER` - -* Dark purple: HEADER -* Light purple: QUESTION -* Army Green: ANSWER - -The corresponding categories and OCR recognition results are also marked on the upper left of the OCR detection frame. - - -### 3.2 RE - -![](../docs/vqa/result_re/zh_val_21_re.jpg) | ![](../docs/vqa/result_re/zh_val_40_re.jpg) ----|--- - - -The red box in the figure represents the question, the blue box represents the answer, and the question and the answer are connected by a green line. The corresponding categories and OCR recognition results are also marked on the upper left of the OCR detection frame. - -## 4. Install - -### 4.1 Install dependencies - -- **(1) Install PaddlePaddle** - -```bash -python3 -m pip install --upgrade pip - -# GPU installation -python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple - -# CPU installation -python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple - -```` -For more requirements, please refer to the instructions in [Installation Documentation](https://www.paddlepaddle.org.cn/install/quick). - -### 4.2 Install PaddleOCR - -- **(1) pip install PaddleOCR whl package quickly (prediction only)** - -```bash -python3 -m pip install paddleocr -```` - -- **(2) Download VQA source code (prediction + training)** - -```bash -[Recommended] git clone https://github.com/PaddlePaddle/PaddleOCR - -# If the pull cannot be successful due to network problems, you can also choose to use the hosting on the code cloud: -git clone https://gitee.com/paddlepaddle/PaddleOCR - -# Note: Code cloud hosting code may not be able to synchronize the update of this github project in real time, there is a delay of 3 to 5 days, please use the recommended method first. -```` - -- **(3) Install VQA's `requirements`** - -```bash -python3 -m pip install -r ppstructure/vqa/requirements.txt -```` - -## 5. Usage - -### 5.1 Data and Model Preparation - -If you want to experience the prediction process directly, you can download the pre-training model provided by us, skip the training process, and just predict directly. - -* Download the processed dataset - -The download address of the processed XFUND Chinese dataset: [link](https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar). - - -Download and unzip the dataset, and place the dataset in the current directory after unzipping. 
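
Complementing the note above (the shell snippet that follows only shows the `wget` step), here is a small optional Python equivalent that also unpacks the archive; it assumes the tar expands into an `XFUND/` directory in the current folder, which is what the evaluation command further down (`XFUND/zh_val/xfun_normalize_val.json`) expects:

```python
import tarfile
import urllib.request

# Download and unpack the processed XFUND dataset (assumption: it expands to ./XFUND/).
URL = "https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar"
urllib.request.urlretrieve(URL, "XFUND.tar")
with tarfile.open("XFUND.tar") as tf:
    tf.extractall(".")
```
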
- -```shell -wget https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar -```` - -* Convert the dataset - -If you need to train other XFUND datasets, you can use the following commands to convert the datasets - -```bash -python3 ppstructure/vqa/tools/trans_xfun_data.py --ori_gt_path=path/to/json_path --output_path=path/to/save_path -```` - -* Download the pretrained models -```bash -mkdir pretrain && cd pretrain -#download the SER model -wget https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar && tar -xvf ser_LayoutXLM_xfun_zh.tar -#download the RE model -wget https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar && tar -xvf re_LayoutXLM_xfun_zh.tar -cd ../ -```` - - -### 5.2 SER - -Before starting training, you need to modify the following four fields - -1. `Train.dataset.data_dir`: point to the directory where the training set images are stored -2. `Train.dataset.label_file_list`: point to the training set label file -3. `Eval.dataset.data_dir`: refers to the directory where the validation set images are stored -4. `Eval.dataset.label_file_list`: point to the validation set label file - -* start training -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml -```` - -Finally, `precision`, `recall`, `hmean` and other indicators will be printed. -In the `./output/ser_layoutxlm/` folder will save the training log, the optimal model and the model for the latest epoch. - -* resume training - -To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field. - -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir -```` - -* evaluate - -Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field. - -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir -```` -Finally, `precision`, `recall`, `hmean` and other indicators will be printed - -* `OCR + SER` tandem prediction based on training engine - -Use the following command to complete the series prediction of `OCR engine + SER`, taking the SER model based on LayoutXLM as an example:: - -```shell -python3.7 tools/export_model.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.save_inference_dir=output/ser/infer -```` - -Finally, the prediction result visualization image and the prediction result text file will be saved in the directory configured by the `config.Global.save_res_path` field. The prediction result text file is named `infer_results.txt`. - -* End-to-end evaluation of `OCR + SER` prediction system - -First use the `tools/infer_vqa_token_ser.py` script to complete the prediction of the dataset, then use the following command to evaluate. 
- -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 tools/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt -```` -* export model - -Use the following command to complete the model export of the SER model, taking the SER model based on LayoutXLM as an example: - -```shell -python3.7 tools/export_model.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.save_inference_dir=output/ser/infer -``` -The converted model will be stored in the directory specified by the `Global.save_inference_dir` field. - -* `OCR + SER` tandem prediction based on prediction engine - -Use the following command to complete the tandem prediction of `OCR + SER` based on the prediction engine, taking the SER model based on LayoutXLM as an example: - -```shell -cd ppstructure -CUDA_VISIBLE_DEVICES=0 python3.7 vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM --ser_model_dir=../output/ser/infer --ser_dict_path=../train_data/XFUND/class_list_xfun.txt --vis_font_path=../doc/fonts/simfang.ttf --image_dir=docs/vqa/input/zh_val_42.jpg --output=output -``` -After the prediction is successful, the visualization images and results will be saved in the directory specified by the `output` field - - -### 5.3 RE - -* start training - -Before starting training, you need to modify the following four fields - -1. `Train.dataset.data_dir`: point to the directory where the training set images are stored -2. `Train.dataset.label_file_list`: point to the training set label file -3. `Eval.dataset.data_dir`: refers to the directory where the validation set images are stored -4. `Eval.dataset.label_file_list`: point to the validation set label file - -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml -```` - -Finally, `precision`, `recall`, `hmean` and other indicators will be printed. -In the `./output/re_layoutxlm/` folder will save the training log, the optimal model and the model for the latest epoch. - -* resume training - -To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field. - -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir -```` - -* evaluate - -Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field. - -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir -```` -Finally, `precision`, `recall`, `hmean` and other indicators will be printed - -* Use `OCR engine + SER + RE` tandem prediction - -Use the following command to complete the series prediction of `OCR engine + SER + RE`, taking the pretrained SER and RE models as an example: -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 tools/infer_vqa_token_ser_re.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/re_LayoutXLM_xfun_zh/Global.infer_img=ppstructure/docs/vqa/input/zh_val_21.jpg -c_ser configs/vqa/ser/layoutxlm. yml -o_ser Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ -```` - -Finally, the prediction result visualization image and the prediction result text file will be saved in the directory configured by the `config.Global.save_res_path` field. The prediction result text file is named `infer_results.txt`. 
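
Because several of the prediction commands above finish by writing `infer_results.txt` under `config.Global.save_res_path`, a tiny reader sketch can be handy when spot-checking the output. It assumes the usual PaddleOCR convention of one image per line, with the image path and a JSON payload separated by a tab; the keys inside the JSON differ between SER and RE, so treat this as a starting point rather than a documented format:

```python
import json

# Example location taken from the evaluation command above; adjust to your save_res_path.
result_file = "output_res/infer_results.txt"

with open(result_file, "r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: "<image path>\t<JSON payload>".
        img_path, payload = line.split("\t", 1)
        result = json.loads(payload)
        keys = list(result) if isinstance(result, dict) else f"{len(result)} items"
        print(img_path, keys)
```
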
- -* export model - -cooming soon - -* `OCR + SER + RE` tandem prediction based on prediction engine - -cooming soon - -## 6. Reference Links - -- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf -- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm -- XFUND dataset, https://github.com/doc-analysis/XFUND - -## License - -The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) diff --git a/ppstructure/vqa/README_ch.md b/ppstructure/vqa/README_ch.md deleted file mode 100644 index f168110ed9b2e750b3b2ee6f5ab0116daebc3e77..0000000000000000000000000000000000000000 --- a/ppstructure/vqa/README_ch.md +++ /dev/null @@ -1,283 +0,0 @@ -[English](README.md) | 简体中文 - -- [1. 简介](#1-简介) -- [2. 性能](#2-性能) -- [3. 效果演示](#3-效果演示) - - [3.1 SER](#31-ser) - - [3.2 RE](#32-re) -- [4. 安装](#4-安装) - - [4.1 安装依赖](#41-安装依赖) - - [4.2 安装PaddleOCR(包含 PP-OCR 和 VQA)](#42-安装paddleocr包含-pp-ocr-和-vqa) -- [5. 使用](#5-使用) - - [5.1 数据和预训练模型准备](#51-数据和预训练模型准备) - - [5.2 SER](#52-ser) - - [5.3 RE](#53-re) -- [6. 参考链接](#6-参考链接) -- [License](#license) - -# 文档视觉问答(DOC-VQA) - -## 1. 简介 - -VQA指视觉问答,主要针对图像内容进行提问和回答,DOC-VQA是VQA任务中的一种,DOC-VQA主要针对文本图像的文字内容提出问题。 - -PP-Structure 里的 DOC-VQA算法基于PaddleNLP自然语言处理算法库进行开发。 - -主要特性如下: - -- 集成[LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf)模型以及PP-OCR预测引擎。 -- 支持基于多模态方法的语义实体识别 (Semantic Entity Recognition, SER) 以及关系抽取 (Relation Extraction, RE) 任务。基于 SER 任务,可以完成对图像中的文本识别与分类;基于 RE 任务,可以完成对图象中的文本内容的关系提取,如判断问题对(pair)。 -- 支持SER任务和RE任务的自定义训练。 -- 支持OCR+SER的端到端系统预测与评估。 -- 支持OCR+SER+RE的端到端系统预测。 - -本项目是 [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) 在 Paddle 2.2上的开源实现, -包含了在 [XFUND数据集](https://github.com/doc-analysis/XFUND) 上的微调代码。 - -## 2. 性能 - -我们在 [XFUND](https://github.com/doc-analysis/XFUND) 的中文数据集上对算法进行了评估,性能如下 - -| 模型 | 任务 | hmean | 模型下载地址 | -|:---:|:---:|:---:| :---:| -| LayoutXLM | SER | 0.9038 | [链接](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar) | -| LayoutXLM | RE | 0.7483 | [链接](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar) | -| LayoutLMv2 | SER | 0.8544 | [链接](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLMv2_xfun_zh.tar) -| LayoutLMv2 | RE | 0.6777 | [链接](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutLMv2_xfun_zh.tar) | -| LayoutLM | SER | 0.7731 | [链接](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLM_xfun_zh.tar) | - -## 3. 效果演示 - -**注意:** 测试图片来源于XFUND数据集。 - -### 3.1 SER - -![](../docs/vqa/result_ser/zh_val_0_ser.jpg) | ![](../docs/vqa/result_ser/zh_val_42_ser.jpg) ----|--- - -图中不同颜色的框表示不同的类别,对于XFUND数据集,有`QUESTION`, `ANSWER`, `HEADER` 3种类别 - -* 深紫色:HEADER -* 浅紫色:QUESTION -* 军绿色:ANSWER - -在OCR检测框的左上方也标出了对应的类别和OCR识别结果。 - -### 3.2 RE - -![](../docs/vqa/result_re/zh_val_21_re.jpg) | ![](../docs/vqa/result_re/zh_val_40_re.jpg) ----|--- - - -图中红色框表示问题,蓝色框表示答案,问题和答案之间使用绿色线连接。在OCR检测框的左上方也标出了对应的类别和OCR识别结果。 - -## 4. 
安装 - -### 4.1 安装依赖 - -- **(1) 安装PaddlePaddle** - -```bash -python3 -m pip install --upgrade pip - -# GPU安装 -python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple - -# CPU安装 -python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple - -``` -更多需求,请参照[安装文档](https://www.paddlepaddle.org.cn/install/quick)中的说明进行操作。 - -### 4.2 安装PaddleOCR(包含 PP-OCR 和 VQA) - -- **(1)pip快速安装PaddleOCR whl包(仅预测)** - -```bash -python3 -m pip install paddleocr -``` - -- **(2)下载VQA源码(预测+训练)** - -```bash -【推荐】git clone https://github.com/PaddlePaddle/PaddleOCR - -# 如果因为网络问题无法pull成功,也可选择使用码云上的托管: -git clone https://gitee.com/paddlepaddle/PaddleOCR - -# 注:码云托管代码可能无法实时同步本github项目更新,存在3~5天延时,请优先使用推荐方式。 -``` - -- **(3)安装VQA的`requirements`** - -```bash -python3 -m pip install -r ppstructure/vqa/requirements.txt -``` - -## 5. 使用 - -### 5.1 数据和预训练模型准备 - -如果希望直接体验预测过程,可以下载我们提供的预训练模型,跳过训练过程,直接预测即可。 - -* 下载处理好的数据集 - -处理好的XFUND中文数据集下载地址:[链接](https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar)。 - - -下载并解压该数据集,解压后将数据集放置在当前目录下。 - -```shell -wget https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar -``` - -* 转换数据集 - -若需进行其他XFUND数据集的训练,可使用下面的命令进行数据集的转换 - -```bash -python3 ppstructure/vqa/tools/trans_xfun_data.py --ori_gt_path=path/to/json_path --output_path=path/to/save_path -``` - -* 下载预训练模型 -```bash -mkdir pretrain && cd pretrain -#下载SER模型 -wget https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar && tar -xvf ser_LayoutXLM_xfun_zh.tar -#下载RE模型 -wget https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar && tar -xvf re_LayoutXLM_xfun_zh.tar -cd ../ -``` - -### 5.2 SER - -启动训练之前,需要修改下面的四个字段 - -1. `Train.dataset.data_dir`:指向训练集图片存放目录 -2. `Train.dataset.label_file_list`:指向训练集标注文件 -3. `Eval.dataset.data_dir`:指指向验证集图片存放目录 -4. 
`Eval.dataset.label_file_list`:指向验证集标注文件 - -* 启动训练 -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml -``` - -最终会打印出`precision`, `recall`, `hmean`等指标。 -在`./output/ser_layoutxlm/`文件夹中会保存训练日志,最优的模型和最新epoch的模型。 - -* 恢复训练 - -恢复训练需要将之前训练好的模型所在文件夹路径赋值给 `Architecture.Backbone.checkpoints` 字段。 - -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir -``` - -* 评估 - -评估需要将待评估的模型所在文件夹路径赋值给 `Architecture.Backbone.checkpoints` 字段。 - -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir -``` -最终会打印出`precision`, `recall`, `hmean`等指标 - -* 基于训练引擎的`OCR + SER`串联预测 - -使用如下命令即可完成基于训练引擎的`OCR + SER`的串联预测, 以基于LayoutXLM的SER模型为例: -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/infer_vqa_token_ser.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.infer_img=doc/vqa/input/zh_val_42.jpg -``` - -最终会在`config.Global.save_res_path`字段所配置的目录下保存预测结果可视化图像以及预测结果文本文件,预测结果文本文件名为`infer_results.txt`。 - -* 对`OCR + SER`预测系统进行端到端评估 - -首先使用 `tools/infer_vqa_token_ser.py` 脚本完成数据集的预测,然后使用下面的命令进行评估。 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 tools/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt -``` -* 模型导出 - -使用如下命令即可完成SER模型的模型导出, 以基于LayoutXLM的SER模型为例: - -```shell -python3.7 tools/export_model.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.save_inference_dir=output/ser/infer -``` -转换后的模型会存放在`Global.save_inference_dir`字段指定的目录下。 - -* 基于预测引擎的`OCR + SER`串联预测 - -使用如下命令即可完成基于预测引擎的`OCR + SER`的串联预测, 以基于LayoutXLM的SER模型为例: - -```shell -cd ppstructure -CUDA_VISIBLE_DEVICES=0 python3.7 vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM --ser_model_dir=../output/ser/infer --ser_dict_path=../train_data/XFUND/class_list_xfun.txt --vis_font_path=../doc/fonts/simfang.ttf --image_dir=docs/vqa/input/zh_val_42.jpg --output=output -``` -预测成功后,可视化图片和结果会保存在`output`字段指定的目录下 - -### 5.3 RE - -* 启动训练 - -启动训练之前,需要修改下面的四个字段 - -1. `Train.dataset.data_dir`:指向训练集图片存放目录 -2. `Train.dataset.label_file_list`:指向训练集标注文件 -3. `Eval.dataset.data_dir`:指指向验证集图片存放目录 -4. 
`Eval.dataset.label_file_list`:指向验证集标注文件 - -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml -``` - -最终会打印出`precision`, `recall`, `hmean`等指标。 -在`./output/re_layoutxlm/`文件夹中会保存训练日志,最优的模型和最新epoch的模型。 - -* 恢复训练 - -恢复训练需要将之前训练好的模型所在文件夹路径赋值给 `Architecture.Backbone.checkpoints` 字段。 - -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir -``` - -* 评估 - -评估需要将待评估的模型所在文件夹路径赋值给 `Architecture.Backbone.checkpoints` 字段。 - -```shell -CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir -``` -最终会打印出`precision`, `recall`, `hmean`等指标 - -* 基于训练引擎的`OCR + SER + RE`串联预测 - -使用如下命令即可完成基于训练引擎的`OCR + SER + RE`串联预测, 以基于LayoutXLMSER和RE模型为例: -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 tools/infer_vqa_token_ser_re.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/re_LayoutXLM_xfun_zh/ Global.infer_img=ppstructure/docs/vqa/input/zh_val_21.jpg -c_ser configs/vqa/ser/layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ -``` - -最终会在`config.Global.save_res_path`字段所配置的目录下保存预测结果可视化图像以及预测结果文本文件,预测结果文本文件名为`infer_results.txt`。 - -* 模型导出 - -cooming soon - -* 基于预测引擎的`OCR + SER + RE`串联预测 - -cooming soon - -## 6. 参考链接 - -- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf -- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm -- XFUND dataset, https://github.com/doc-analysis/XFUND - -## License - -The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) diff --git a/ppstructure/vqa/requirements.txt b/ppstructure/vqa/requirements.txt deleted file mode 100644 index fcd882274c4402ba2a1d34f20ee6e2befa157121..0000000000000000000000000000000000000000 --- a/ppstructure/vqa/requirements.txt +++ /dev/null @@ -1,7 +0,0 @@ -sentencepiece -yacs -seqeval -paddlenlp>=2.2.1 -pypandoc -attrdict -python_docx \ No newline at end of file diff --git a/requirements.txt b/requirements.txt index b15176db3eb42c381c1612f404fd15c6b020b3dc..976d29192abbbf89b8ee6064c0b4ec48d43ad268 100644 --- a/requirements.txt +++ b/requirements.txt @@ -6,7 +6,7 @@ lmdb tqdm numpy visualdl -python-Levenshtein +rapidfuzz opencv-contrib-python==4.4.0.46 cython lxml diff --git a/test_tipc/configs/ch_PP-OCRv2_rec/train_infer_python.txt b/test_tipc/configs/ch_PP-OCRv2_rec/train_infer_python.txt index a96b87dede1e1b4c7b3ed59c4bd9c0470402e7e2..6d20b2df7420371ce964cf8fd5cb29726c000d1d 100644 --- a/test_tipc/configs/ch_PP-OCRv2_rec/train_infer_python.txt +++ b/test_tipc/configs/ch_PP-OCRv2_rec/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/ch_PP-OCRv3_rec/train_infer_python.txt b/test_tipc/configs/ch_PP-OCRv3_rec/train_infer_python.txt index 59fc1bd4160ec77edb0b781c8ffa9845c6a3d5c7..fee08b08ede0f61ae4f57fd42dba303301798a3e 100644 --- a/test_tipc/configs/ch_PP-OCRv3_rec/train_infer_python.txt +++ b/test_tipc/configs/ch_PP-OCRv3_rec/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_image_shape="3,48,320" --use_gpu:True|False 
--enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/ch_ppocr_mobile_v2_0_rec/train_infer_python.txt b/test_tipc/configs/ch_ppocr_mobile_v2_0_rec/train_infer_python.txt index 40f397948936beba0a3a4bdce9aa4a9953ec9d0f..dc490cdc60c2c012549e6fd00c13ec18676ede20 100644 --- a/test_tipc/configs/ch_ppocr_mobile_v2_0_rec/train_infer_python.txt +++ b/test_tipc/configs/ch_ppocr_mobile_v2_0_rec/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/ch_ppocr_server_v2_0_rec/train_infer_python.txt b/test_tipc/configs/ch_ppocr_server_v2_0_rec/train_infer_python.txt index b9a1ae4984c30a08d75b73b884ceb97658eb11c7..85741f98c3fd645a64d8820a046030f1bb7e03c7 100644 --- a/test_tipc/configs/ch_ppocr_server_v2_0_rec/train_infer_python.txt +++ b/test_tipc/configs/ch_ppocr_server_v2_0_rec/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/en_table_structure/table_mv3.yml b/test_tipc/configs/en_table_structure/table_mv3.yml index 6ff31fc262b4380b4cc5258a7b2e098ada39dba0..edcbe2c3b00e8d8a56ad8dd9f208e283b511b86e 100755 --- a/test_tipc/configs/en_table_structure/table_mv3.yml +++ b/test_tipc/configs/en_table_structure/table_mv3.yml @@ -4,7 +4,7 @@ Global: log_smooth_window: 20 print_batch_step: 5 save_model_dir: ./output/table_mv3/ - save_epoch_step: 3 + save_epoch_step: 400 # evaluation is run every 400 iterations after the 0th iteration eval_batch_step: [0, 40000] cal_metric_during_train: True @@ -17,7 +17,8 @@ Global: # for data or label process character_dict_path: ppocr/utils/dict/table_structure_dict.txt character_type: en - max_text_length: 800 + max_text_length: &max_text_length 500 + box_format: &box_format 'xyxy' # 'xywh', 'xyxy', 'xyxyxyxy' infer_mode: False Optimizer: @@ -37,12 +38,14 @@ Architecture: Backbone: name: MobileNetV3 scale: 1.0 - model_name: large + model_name: small + disable_se: true Head: name: TableAttentionHead hidden_size: 256 loc_type: 2 - max_text_length: 800 + max_text_length: *max_text_length + loc_reg_num: &loc_reg_num 4 Loss: name: TableAttentionLoss @@ -70,6 +73,8 @@ Train: learn_empty_box: False merge_no_span_structure: False replace_empty_cell_token: False + loc_reg_num: *loc_reg_num + max_text_length: *max_text_length - TableBoxEncode: - ResizeTableImage: max_len: 488 @@ -102,6 +107,8 @@ Eval: learn_empty_box: False merge_no_span_structure: False replace_empty_cell_token: False + loc_reg_num: *loc_reg_num + max_text_length: *max_text_length - TableBoxEncode: - ResizeTableImage: max_len: 488 diff --git a/test_tipc/configs/en_table_structure/train_infer_python.txt b/test_tipc/configs/en_table_structure/train_infer_python.txt index 633b6185d976ac61408283025bd4ba305187317d..3fd5dc9f60a9621026d488e5654cd7e1421e8b65 100644 --- a/test_tipc/configs/en_table_structure/train_infer_python.txt +++ b/test_tipc/configs/en_table_structure/train_infer_python.txt @@ -54,6 +54,6 @@ random_infer_input:[{float32,[3,488,488]}] ===========================train_benchmark_params========================== batch_size:32 fp_items:fp32|fp16 -epoch:1 +epoch:2 
--profiler_options:batch_range=[10,20];state=GPU;tracer_option=Default;profile_path=model.profile flags:FLAGS_eager_delete_tensor_gb=0.0;FLAGS_fraction_of_gpu_memory_to_use=0.98;FLAGS_conv_workspace_size_limit=4096 diff --git a/test_tipc/configs/layoutxlm_ser/train_infer_python.txt b/test_tipc/configs/layoutxlm_ser/train_infer_python.txt index 5284ffabe2de4eb8bb000e7fb745ef2846ed6b64..549a31e69e367237ec0396778162a5f91c8b7412 100644 --- a/test_tipc/configs/layoutxlm_ser/train_infer_python.txt +++ b/test_tipc/configs/layoutxlm_ser/train_infer_python.txt @@ -9,7 +9,7 @@ Global.save_model_dir:./output/ Train.loader.batch_size_per_card:lite_train_lite_infer=4|whole_train_whole_infer=8 Architecture.Backbone.checkpoints:null train_model_name:latest -train_infer_img_dir:ppstructure/docs/vqa/input/zh_val_42.jpg +train_infer_img_dir:ppstructure/docs/kie/input/zh_val_42.jpg null:null ## trainer:norm_train @@ -37,7 +37,7 @@ export2:null infer_model:null infer_export:null infer_quant:False -inference:ppstructure/vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM --ser_dict_path=train_data/XFUND/class_list_xfun.txt --output=output +inference:ppstructure/kie/predict_kie_token_ser.py --kie_algorithm=LayoutXLM --ser_dict_path=train_data/XFUND/class_list_xfun.txt --output=output --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 @@ -45,7 +45,7 @@ inference:ppstructure/vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM - --use_tensorrt:False --precision:fp32 --ser_model_dir: ---image_dir:./ppstructure/docs/vqa/input/zh_val_42.jpg +--image_dir:./ppstructure/docs/kie/input/zh_val_42.jpg null:null --benchmark:False null:null diff --git a/test_tipc/configs/rec_mtb_nrtr/train_infer_python.txt b/test_tipc/configs/rec_mtb_nrtr/train_infer_python.txt index fed8ba26753bb770e062f751a9ba1e8e35fc6843..4a8fda0fea76da41a0a13b61f35d96a4d230d488 100644 --- a/test_tipc/configs/rec_mtb_nrtr/train_infer_python.txt +++ b/test_tipc/configs/rec_mtb_nrtr/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/EN_symbo --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_mv3_none_bilstm_ctc_v2_0/train_infer_python.txt b/test_tipc/configs/rec_mv3_none_bilstm_ctc_v2_0/train_infer_python.txt index db89b4c78d72d1853096d6b44b73a7ca61792dfe..22c29c9b233ac908741accd7eb85fb3832fb0c0f 100644 --- a/test_tipc/configs/rec_mv3_none_bilstm_ctc_v2_0/train_infer_python.txt +++ b/test_tipc/configs/rec_mv3_none_bilstm_ctc_v2_0/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_mv3_none_none_ctc_v2_0/train_infer_python.txt b/test_tipc/configs/rec_mv3_none_none_ctc_v2_0/train_infer_python.txt index 003e91ff3d95e62d4353d7c4545e780ecd2f9708..d91c55e8852eee2cc7913235308f6d1f31e1f2e9 100644 --- a/test_tipc/configs/rec_mv3_none_none_ctc_v2_0/train_infer_python.txt +++ b/test_tipc/configs/rec_mv3_none_none_ctc_v2_0/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git 
a/test_tipc/configs/rec_mv3_tps_bilstm_att_v2_0/train_infer_python.txt b/test_tipc/configs/rec_mv3_tps_bilstm_att_v2_0/train_infer_python.txt index c7b416c83323863a905929a2effcb1d3ad856422..77dc79cdae8bf4843ad17282885b46a33e64ce53 100644 --- a/test_tipc/configs/rec_mv3_tps_bilstm_att_v2_0/train_infer_python.txt +++ b/test_tipc/configs/rec_mv3_tps_bilstm_att_v2_0/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_mv3_tps_bilstm_ctc_v2_0/train_infer_python.txt b/test_tipc/configs/rec_mv3_tps_bilstm_ctc_v2_0/train_infer_python.txt index 0c6e2d1da7f163521e8859bd8c96436b2a6bac64..f38c8d8d67bae84232749e60952a5c73871f9a88 100644 --- a/test_tipc/configs/rec_mv3_tps_bilstm_ctc_v2_0/train_infer_python.txt +++ b/test_tipc/configs/rec_mv3_tps_bilstm_ctc_v2_0/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_r31_robustscanner/train_infer_python.txt b/test_tipc/configs/rec_r31_robustscanner/train_infer_python.txt index 07498c9e81ada9652343b8d8fff0f102d4684380..1bf8dc0b6c5ba707d572bc0ad44818d5a51c8800 100644 --- a/test_tipc/configs/rec_r31_robustscanner/train_infer_python.txt +++ b/test_tipc/configs/rec_r31_robustscanner/train_infer_python.txt @@ -1,6 +1,6 @@ ===========================train_params=========================== model_name:rec_r31_robustscanner -python:python +python:python3.7 gpu_list:0|0,1 Global.use_gpu:True|True Global.auto_cast:null @@ -39,11 +39,11 @@ infer_export:tools/export_model.py -c test_tipc/configs/rec_r31_robustscanner/re infer_quant:False inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/dict90.txt --rec_image_shape="3,48,48,160" --use_space_char=False --rec_algorithm="RobustScanner" --use_gpu:True|False ---enable_mkldnn:True|False ---cpu_threads:1|6 ---rec_batch_num:1|6 ---use_tensorrt:False|False ---precision:fp32|int8 +--enable_mkldnn:False +--cpu_threads:6 +--rec_batch_num:1 +--use_tensorrt:False +--precision:fp32 --rec_model_dir: --image_dir:./inference/rec_inference --save_log_path:./test/output/ diff --git a/test_tipc/configs/rec_r31_sar/train_infer_python.txt b/test_tipc/configs/rec_r31_sar/train_infer_python.txt index 03ec54abb65ac41d3b5ad4f6e2fdcf7abb34c344..4acc6223e3b65211d62f2f128150e1c76f286674 100644 --- a/test_tipc/configs/rec_r31_sar/train_infer_python.txt +++ b/test_tipc/configs/rec_r31_sar/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/dict90.t --use_gpu:True --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_r32_gaspin_bilstm_att/train_infer_python.txt b/test_tipc/configs/rec_r32_gaspin_bilstm_att/train_infer_python.txt index 115dfd661abc64db9e14c629f79099be7b6ff0e0..ac378b36046d532a887056183de9c7788f628b76 100644 --- a/test_tipc/configs/rec_r32_gaspin_bilstm_att/train_infer_python.txt +++ b/test_tipc/configs/rec_r32_gaspin_bilstm_att/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py 
--rec_char_dict_path=./ppocr/utils/dict/spi --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_r34_vd_none_bilstm_ctc_v2_0/train_infer_python.txt b/test_tipc/configs/rec_r34_vd_none_bilstm_ctc_v2_0/train_infer_python.txt index 07a6190b0ef09da5cd20b9dd8ea922544c578710..b53efbd6ba5db36813733f6682bde1cfd614c6ee 100644 --- a/test_tipc/configs/rec_r34_vd_none_bilstm_ctc_v2_0/train_infer_python.txt +++ b/test_tipc/configs/rec_r34_vd_none_bilstm_ctc_v2_0/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_r34_vd_none_none_ctc_v2_0/train_infer_python.txt b/test_tipc/configs/rec_r34_vd_none_none_ctc_v2_0/train_infer_python.txt index 145793aa472d8330daf9321f44692a03e7ef6354..7d953968b8a9d3f62f7c6fb48ed65bd9743d5ba3 100644 --- a/test_tipc/configs/rec_r34_vd_none_none_ctc_v2_0/train_infer_python.txt +++ b/test_tipc/configs/rec_r34_vd_none_none_ctc_v2_0/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_r34_vd_tps_bilstm_att_v2_0/train_infer_python.txt b/test_tipc/configs/rec_r34_vd_tps_bilstm_att_v2_0/train_infer_python.txt index 759518a4a11a17e076401bb8dd193617c9f10530..0910ff840e350333a26de9b959229b6f8d39c19e 100644 --- a/test_tipc/configs/rec_r34_vd_tps_bilstm_att_v2_0/train_infer_python.txt +++ b/test_tipc/configs/rec_r34_vd_tps_bilstm_att_v2_0/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_r34_vd_tps_bilstm_ctc_v2_0/train_infer_python.txt b/test_tipc/configs/rec_r34_vd_tps_bilstm_ctc_v2_0/train_infer_python.txt index ecc898341ce14dfed0de4290b798dd70078ae2da..33144e622e5fbb399e6dd274196812e2d44dc0fd 100644 --- a/test_tipc/configs/rec_r34_vd_tps_bilstm_ctc_v2_0/train_infer_python.txt +++ b/test_tipc/configs/rec_r34_vd_tps_bilstm_ctc_v2_0/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_r45_abinet/train_infer_python.txt b/test_tipc/configs/rec_r45_abinet/train_infer_python.txt index ecab1bcbbde11fc6d14357b6715033704c2c3316..04fc188649c77c62b43307cb2fff2249f28bddae 100644 --- a/test_tipc/configs/rec_r45_abinet/train_infer_python.txt +++ b/test_tipc/configs/rec_r45_abinet/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_r45_visionlan/rec_r45_visionlan.yml 
b/test_tipc/configs/rec_r45_visionlan/rec_r45_visionlan.yml new file mode 100644 index 0000000000000000000000000000000000000000..860e4f53043138e7434d71a816fdf051048be6f7 --- /dev/null +++ b/test_tipc/configs/rec_r45_visionlan/rec_r45_visionlan.yml @@ -0,0 +1,108 @@ +Global: + use_gpu: true + epoch_num: 8 + log_smooth_window: 200 + print_batch_step: 200 + save_model_dir: ./output/rec/r45_visionlan + save_epoch_step: 1 + # evaluation is run every 2000 iterations + eval_batch_step: [0, 2000] + cal_metric_during_train: True + pretrained_model: + checkpoints: + save_inference_dir: + use_visualdl: False + infer_img: doc/imgs_words/en/word_2.png + # for data or label process + character_dict_path: + max_text_length: &max_text_length 25 + training_step: &training_step LA + infer_mode: False + use_space_char: False + save_res_path: ./output/rec/predicts_visionlan.txt + +Optimizer: + name: Adam + beta1: 0.9 + beta2: 0.999 + clip_norm: 20.0 + group_lr: true + training_step: *training_step + lr: + name: Piecewise + decay_epochs: [6] + values: [0.0001, 0.00001] + regularizer: + name: 'L2' + factor: 0 + +Architecture: + model_type: rec + algorithm: VisionLAN + Transform: + Backbone: + name: ResNet45 + strides: [2, 2, 2, 1, 1] + Head: + name: VLHead + n_layers: 3 + n_position: 256 + n_dim: 512 + max_text_length: *max_text_length + training_step: *training_step + +Loss: + name: VLLoss + mode: *training_step + weight_res: 0.5 + weight_mas: 0.5 + +PostProcess: + name: VLLabelDecode + +Metric: + name: RecMetric + is_filter: true + + +Train: + dataset: + name: SimpleDataSet + data_dir: ./train_data/ic15_data/ + label_file_list: ["./train_data/ic15_data/rec_gt_train.txt"] + transforms: + - DecodeImage: # load image + img_mode: RGB + channel_first: False + - ABINetRecAug: + - VLLabelEncode: # Class handling label + - VLRecResizeImg: + image_shape: [3, 64, 256] + - KeepKeys: + keep_keys: ['image', 'label', 'label_res', 'label_sub', 'label_id', 'length'] # dataloader will return list in this order + loader: + shuffle: True + batch_size_per_card: 220 + drop_last: True + num_workers: 4 + +Eval: + dataset: + name: SimpleDataSet + data_dir: ./train_data/ic15_data + label_file_list: ["./train_data/ic15_data/rec_gt_test.txt"] + transforms: + - DecodeImage: # load image + img_mode: RGB + channel_first: False + - VLLabelEncode: # Class handling label + - VLRecResizeImg: + image_shape: [3, 64, 256] + - KeepKeys: + keep_keys: ['image', 'label', 'label_res', 'label_sub', 'label_id', 'length'] # dataloader will return list in this order + loader: + shuffle: False + drop_last: False + batch_size_per_card: 64 + num_workers: 4 + diff --git a/test_tipc/configs/rec_r45_visionlan/train_infer_python.txt b/test_tipc/configs/rec_r45_visionlan/train_infer_python.txt new file mode 100644 index 0000000000000000000000000000000000000000..79618edafa794a683e085fb1b8050358342e1f77 --- /dev/null +++ b/test_tipc/configs/rec_r45_visionlan/train_infer_python.txt @@ -0,0 +1,53 @@ +===========================train_params=========================== +model_name:rec_r45_visionlan +python:python3.7 +gpu_list:0|0,1 +Global.use_gpu:True|True +Global.auto_cast:null +Global.epoch_num:lite_train_lite_infer=2|whole_train_whole_infer=300 +Global.save_model_dir:./output/ +Train.loader.batch_size_per_card:lite_train_lite_infer=32|whole_train_whole_infer=64 +Global.pretrained_model:null +train_model_name:latest +train_infer_img_dir:./inference/rec_inference +null:null +## +trainer:norm_train +norm_train:tools/train.py -c 
test_tipc/configs/rec_r45_visionlan/rec_r45_visionlan.yml -o +pact_train:null +fpgm_train:null +distill_train:null +null:null +null:null +## +===========================eval_params=========================== +eval:tools/eval.py -c test_tipc/configs/rec_r45_visionlan/rec_r45_visionlan.yml -o +null:null +## +===========================infer_params=========================== +Global.save_inference_dir:./output/ +Global.checkpoints: +norm_export:tools/export_model.py -c test_tipc/configs/rec_r45_visionlan/rec_r45_visionlan.yml -o +quant_export:null +fpgm_export:null +distill_export:null +export1:null +export2:null +## +train_model:./inference/rec_r45_visionlan_train/best_accuracy +infer_export:tools/export_model.py -c test_tipc/configs/rec_r45_visionlan/rec_r45_visionlan.yml -o +infer_quant:False +inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dict.txt --rec_image_shape="3,64,256" --rec_algorithm="VisionLAN" --use_space_char=False +--use_gpu:True|False +--enable_mkldnn:False +--cpu_threads:6 +--rec_batch_num:1 +--use_tensorrt:False +--precision:fp32 +--rec_model_dir: +--image_dir:./inference/rec_inference +--save_log_path:./test/output/ +--benchmark:True +null:null +===========================infer_benchmark_params========================== +random_infer_input:[{float32,[3,64,256]}] diff --git a/test_tipc/configs/rec_r50_fpn_vd_none_srn/train_infer_python.txt b/test_tipc/configs/rec_r50_fpn_vd_none_srn/train_infer_python.txt index b5a5286010a5830dc23031b3e0885247fb6ae53f..c1cfd1fcd930c6992982feeb3c118dbc5a56f226 100644 --- a/test_tipc/configs/rec_r50_fpn_vd_none_srn/train_infer_python.txt +++ b/test_tipc/configs/rec_r50_fpn_vd_none_srn/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_svtrnet/train_infer_python.txt b/test_tipc/configs/rec_svtrnet/train_infer_python.txt index a7e4a24063b2e248f2ab92d5efd257a2837c0a34..5508c0411cfdc7102ccec7a00c59c2a5e1a54998 100644 --- a/test_tipc/configs/rec_svtrnet/train_infer_python.txt +++ b/test_tipc/configs/rec_svtrnet/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/ic15_dic --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/rec_vitstr_none_ce/train_infer_python.txt b/test_tipc/configs/rec_vitstr_none_ce/train_infer_python.txt index 04c5742ea2ddaf01e782d8b39c21bcbcfa0a7ce7..187c11544998626af556e3eeef5f958fbe42fea0 100644 --- a/test_tipc/configs/rec_vitstr_none_ce/train_infer_python.txt +++ b/test_tipc/configs/rec_vitstr_none_ce/train_infer_python.txt @@ -41,7 +41,7 @@ inference:tools/infer/predict_rec.py --rec_char_dict_path=./ppocr/utils/EN_symbo --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 ---rec_batch_num:1|6 +--rec_batch_num:1 --use_tensorrt:False --precision:fp32 --rec_model_dir: diff --git a/test_tipc/configs/vi_layoutxlm_ser/train_infer_python.txt b/test_tipc/configs/vi_layoutxlm_ser/train_infer_python.txt index 59d347461171487c186c052e290f6b13236aa5c9..adad78bb76e34635a632ef7c1b55e212bc4b636a 100644 --- a/test_tipc/configs/vi_layoutxlm_ser/train_infer_python.txt +++ b/test_tipc/configs/vi_layoutxlm_ser/train_infer_python.txt @@ -9,7 +9,7 @@ 
Global.save_model_dir:./output/ Train.loader.batch_size_per_card:lite_train_lite_infer=4|whole_train_whole_infer=8 Architecture.Backbone.checkpoints:null train_model_name:latest -train_infer_img_dir:ppstructure/docs/vqa/input/zh_val_42.jpg +train_infer_img_dir:ppstructure/docs/kie/input/zh_val_42.jpg null:null ## trainer:norm_train @@ -37,7 +37,7 @@ export2:null infer_model:null infer_export:null infer_quant:False -inference:ppstructure/vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM --ser_dict_path=train_data/XFUND/class_list_xfun.txt --output=output --ocr_order_method=tb-yx +inference:ppstructure/kie/predict_kie_token_ser.py --kie_algorithm=LayoutXLM --ser_dict_path=train_data/XFUND/class_list_xfun.txt --output=output --ocr_order_method=tb-yx --use_gpu:True|False --enable_mkldnn:False --cpu_threads:6 @@ -45,7 +45,7 @@ inference:ppstructure/vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM - --use_tensorrt:False --precision:fp32 --ser_model_dir: ---image_dir:./ppstructure/docs/vqa/input/zh_val_42.jpg +--image_dir:./ppstructure/docs/kie/input/zh_val_42.jpg null:null --benchmark:False null:null diff --git a/test_tipc/prepare.sh b/test_tipc/prepare.sh index 259a1159cb326760384645b2aff313b75da6084a..5b5740113fc319accc8150f71c865a3f0465876d 100644 --- a/test_tipc/prepare.sh +++ b/test_tipc/prepare.sh @@ -107,8 +107,7 @@ if [ ${MODE} = "benchmark_train" ];then cd ../ fi if [ ${model_name} == "layoutxlm_ser" ] || [ ${model_name} == "vi_layoutxlm_ser" ]; then - pip install -r ppstructure/vqa/requirements.txt - pip install paddlenlp\>=2.3.5 --force-reinstall -i https://mirrors.aliyun.com/pypi/simple/ + pip install -r ppstructure/kie/requirements.txt wget -nc -P ./train_data/ https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar --no-check-certificate cd ./train_data/ && tar xf XFUND.tar # expand gt.txt 10 times @@ -161,6 +160,8 @@ if [ ${MODE} = "lite_train_lite_infer" ];then ln -s ./icdar2015_lite ./icdar2015 wget -nc -P ./ic15_data/ https://paddleocr.bj.bcebos.com/dataset/rec_gt_train_lite.txt --no-check-certificate wget -nc -P ./ic15_data/ https://paddleocr.bj.bcebos.com/dataset/rec_gt_test_lite.txt --no-check-certificate + mv ic15_data/rec_gt_train_lite.txt ic15_data/rec_gt_train.txt + mv ic15_data/rec_gt_test_lite.txt ic15_data/rec_gt_test.txt cd ../ cd ./inference && tar xf rec_inference.tar && cd ../ if [ ${model_name} == "ch_PP-OCRv2_det" ] || [ ${model_name} == "ch_PP-OCRv2_det_PACT" ]; then @@ -221,8 +222,7 @@ if [ ${MODE} = "lite_train_lite_infer" ];then cd ./pretrain_models/ && tar xf rec_r32_gaspin_bilstm_att_train.tar && cd ../ fi if [ ${model_name} == "layoutxlm_ser" ] || [ ${model_name} == "vi_layoutxlm_ser" ]; then - pip install -r ppstructure/vqa/requirements.txt - pip install paddlenlp\>=2.3.5 --force-reinstall -i https://mirrors.aliyun.com/pypi/simple/ + pip install -r ppstructure/kie/requirements.txt wget -nc -P ./train_data/ https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar --no-check-certificate cd ./train_data/ && tar xf XFUND.tar cd ../ diff --git a/tools/eval.py b/tools/eval.py index 38d72d178db45a4787ddc09c865afba9222f385a..3d1d3813d33e251ec83a9729383fe772bc4cc225 100755 --- a/tools/eval.py +++ b/tools/eval.py @@ -23,6 +23,7 @@ __dir__ = os.path.dirname(os.path.abspath(__file__)) sys.path.insert(0, __dir__) sys.path.insert(0, os.path.abspath(os.path.join(__dir__, '..'))) +import paddle from ppocr.data import build_dataloader from ppocr.modeling.architectures import build_model from ppocr.postprocess import build_post_process @@ -86,6 
+87,30 @@ def main(): else: model_type = None + # build metric + eval_class = build_metric(config['Metric']) + # amp + use_amp = config["Global"].get("use_amp", False) + amp_level = config["Global"].get("amp_level", 'O2') + amp_custom_black_list = config['Global'].get('amp_custom_black_list',[]) + if use_amp: + AMP_RELATED_FLAGS_SETTING = { + 'FLAGS_cudnn_batchnorm_spatial_persistent': 1, + 'FLAGS_max_inplace_grad_add': 8, + } + paddle.fluid.set_flags(AMP_RELATED_FLAGS_SETTING) + scale_loss = config["Global"].get("scale_loss", 1.0) + use_dynamic_loss_scaling = config["Global"].get( + "use_dynamic_loss_scaling", False) + scaler = paddle.amp.GradScaler( + init_loss_scaling=scale_loss, + use_dynamic_loss_scaling=use_dynamic_loss_scaling) + if amp_level == "O2": + model = paddle.amp.decorate( + models=model, level=amp_level, master_weight=True) + else: + scaler = None + best_model_dict = load_model( config, model, model_type=config['Architecture']["model_type"]) if len(best_model_dict): @@ -93,11 +118,9 @@ def main(): for k, v in best_model_dict.items(): logger.info('{}:{}'.format(k, v)) - # build metric - eval_class = build_metric(config['Metric']) # start eval metric = program.eval(model, valid_dataloader, post_process_class, - eval_class, model_type, extra_input) + eval_class, model_type, extra_input, scaler, amp_level, amp_custom_black_list) logger.info('metric eval ***************') for k, v in metric.items(): logger.info('{}:{}'.format(k, v)) diff --git a/tools/infer/predict_cls.py b/tools/infer/predict_cls.py index ed2f47c04de6f4ab6a874db052e953a1ce4e0b76..d2b7108ca35666acfa53e785686fd7b9dfc21ed5 100755 --- a/tools/infer/predict_cls.py +++ b/tools/infer/predict_cls.py @@ -30,7 +30,7 @@ import traceback import tools.infer.utility as utility from ppocr.postprocess import build_post_process from ppocr.utils.logging import get_logger -from ppocr.utils.utility import get_image_file_list, check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read logger = get_logger() @@ -128,7 +128,7 @@ def main(args): valid_image_file_list = [] img_list = [] for image_file in image_file_list: - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: img = cv2.imread(image_file) if img is None: diff --git a/tools/infer/predict_det.py b/tools/infer/predict_det.py index 394a48948b1f284bd405532769b76eeb298668bd..9f5c480d3c55367a02eacb48bed6ae3d38282f05 100755 --- a/tools/infer/predict_det.py +++ b/tools/infer/predict_det.py @@ -27,7 +27,7 @@ import sys import tools.infer.utility as utility from ppocr.utils.logging import get_logger -from ppocr.utils.utility import get_image_file_list, check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read from ppocr.data import create_operators, transform from ppocr.postprocess import build_post_process import json @@ -289,7 +289,7 @@ if __name__ == "__main__": os.makedirs(draw_img_save) save_results = [] for image_file in image_file_list: - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: img = cv2.imread(image_file) if img is None: diff --git a/tools/infer/predict_e2e.py b/tools/infer/predict_e2e.py index fb2859f0c7e0d3aa0b87dbe11123dfc88f4b4e8e..de315d701c7172ded4d30e48e79abee367f42239 100755 --- a/tools/infer/predict_e2e.py +++ b/tools/infer/predict_e2e.py @@ -27,7 +27,7 @@ import sys import tools.infer.utility as utility from ppocr.utils.logging import get_logger -from ppocr.utils.utility import get_image_file_list, 
check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read from ppocr.data import create_operators, transform from ppocr.postprocess import build_post_process @@ -148,7 +148,7 @@ if __name__ == "__main__": if not os.path.exists(draw_img_save): os.makedirs(draw_img_save) for image_file in image_file_list: - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: img = cv2.imread(image_file) if img is None: diff --git a/tools/infer/predict_rec.py b/tools/infer/predict_rec.py index 53dab6f26d8b84a224360f2fa6fe5f411eea751f..176e2c68e2c9b2e08f9b56378c45a57733faf8cd 100755 --- a/tools/infer/predict_rec.py +++ b/tools/infer/predict_rec.py @@ -30,7 +30,7 @@ import paddle import tools.infer.utility as utility from ppocr.postprocess import build_post_process from ppocr.utils.logging import get_logger -from ppocr.utils.utility import get_image_file_list, check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read logger = get_logger() @@ -68,7 +68,7 @@ class TextRecognizer(object): 'name': 'SARLabelDecode', "character_dict_path": args.rec_char_dict_path, "use_space_char": args.use_space_char - } + } elif self.rec_algorithm == "VisionLAN": postprocess_params = { 'name': 'VLLabelDecode', @@ -349,6 +349,13 @@ class TextRecognizer(object): for beg_img_no in range(0, img_num, batch_num): end_img_no = min(img_num, beg_img_no + batch_num) norm_img_batch = [] + if self.rec_algorithm == "SRN": + encoder_word_pos_list = [] + gsrm_word_pos_list = [] + gsrm_slf_attn_bias1_list = [] + gsrm_slf_attn_bias2_list = [] + if self.rec_algorithm == "SAR": + valid_ratios = [] imgC, imgH, imgW = self.rec_image_shape[:3] max_wh_ratio = imgW / imgH # max_wh_ratio = 0 @@ -357,22 +364,16 @@ class TextRecognizer(object): wh_ratio = w * 1.0 / h max_wh_ratio = max(max_wh_ratio, wh_ratio) for ino in range(beg_img_no, end_img_no): - if self.rec_algorithm == "SAR": norm_img, _, _, valid_ratio = self.resize_norm_img_sar( img_list[indices[ino]], self.rec_image_shape) norm_img = norm_img[np.newaxis, :] valid_ratio = np.expand_dims(valid_ratio, axis=0) - valid_ratios = [] valid_ratios.append(valid_ratio) norm_img_batch.append(norm_img) elif self.rec_algorithm == "SRN": norm_img = self.process_image_srn( img_list[indices[ino]], self.rec_image_shape, 8, 25) - encoder_word_pos_list = [] - gsrm_word_pos_list = [] - gsrm_slf_attn_bias1_list = [] - gsrm_slf_attn_bias2_list = [] encoder_word_pos_list.append(norm_img[1]) gsrm_word_pos_list.append(norm_img[2]) gsrm_slf_attn_bias1_list.append(norm_img[3]) @@ -399,7 +400,9 @@ class TextRecognizer(object): norm_img_batch.append(norm_img) elif self.rec_algorithm == "RobustScanner": norm_img, _, _, valid_ratio = self.resize_norm_img_sar( - img_list[indices[ino]], self.rec_image_shape, width_downsample_ratio=0.25) + img_list[indices[ino]], + self.rec_image_shape, + width_downsample_ratio=0.25) norm_img = norm_img[np.newaxis, :] valid_ratio = np.expand_dims(valid_ratio, axis=0) valid_ratios = [] @@ -484,12 +487,8 @@ class TextRecognizer(object): elif self.rec_algorithm == "RobustScanner": valid_ratios = np.concatenate(valid_ratios) word_positions_list = np.concatenate(word_positions_list) - inputs = [ - norm_img_batch, - valid_ratios, - word_positions_list - ] - + inputs = [norm_img_batch, valid_ratios, word_positions_list] + if self.use_onnx: input_dict = {} input_dict[self.input_tensor.name] = norm_img_batch @@ -555,7 +554,7 @@ def main(args): res = text_recognizer([img] * int(args.rec_batch_num)) for 
image_file in image_file_list: - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: img = cv2.imread(image_file) if img is None: diff --git a/tools/infer/predict_sr.py b/tools/infer/predict_sr.py index b10d90bf1d6ce3de6d2947e9cc1f73443736518d..ca99f6819f4b207ecc0f0d1383fe1d26d07fbf50 100755 --- a/tools/infer/predict_sr.py +++ b/tools/infer/predict_sr.py @@ -30,7 +30,7 @@ import paddle import tools.infer.utility as utility from ppocr.postprocess import build_post_process from ppocr.utils.logging import get_logger -from ppocr.utils.utility import get_image_file_list, check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read logger = get_logger() @@ -120,7 +120,7 @@ def main(args): res = text_recognizer([img] * int(args.sr_batch_num)) for image_file in image_file_list: - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: img = Image.open(image_file).convert("RGB") if img is None: diff --git a/tools/infer/predict_system.py b/tools/infer/predict_system.py index 73b7155baa9f869da928b5be03692c08115489ee..e0f2c41fa2aba23491efee920afbd76db1ec84e0 100755 --- a/tools/infer/predict_system.py +++ b/tools/infer/predict_system.py @@ -32,7 +32,7 @@ import tools.infer.utility as utility import tools.infer.predict_rec as predict_rec import tools.infer.predict_det as predict_det import tools.infer.predict_cls as predict_cls -from ppocr.utils.utility import get_image_file_list, check_and_read_gif +from ppocr.utils.utility import get_image_file_list, check_and_read from ppocr.utils.logging import get_logger from tools.infer.utility import draw_ocr_box_txt, get_rotate_crop_image logger = get_logger() @@ -120,11 +120,14 @@ def sorted_boxes(dt_boxes): _boxes = list(sorted_boxes) for i in range(num_boxes - 1): - if abs(_boxes[i + 1][0][1] - _boxes[i][0][1]) < 10 and \ - (_boxes[i + 1][0][0] < _boxes[i][0][0]): - tmp = _boxes[i] - _boxes[i] = _boxes[i + 1] - _boxes[i + 1] = tmp + for j in range(i, 0, -1): + if abs(_boxes[j + 1][0][1] - _boxes[j][0][1]) < 10 and \ + (_boxes[j + 1][0][0] < _boxes[j][0][0]): + tmp = _boxes[j] + _boxes[j] = _boxes[j + 1] + _boxes[j + 1] = tmp + else: + break return _boxes @@ -156,7 +159,7 @@ def main(args): count = 0 for idx, image_file in enumerate(image_file_list): - img, flag = check_and_read_gif(image_file) + img, flag, _ = check_and_read(image_file) if not flag: img = cv2.imread(image_file) if img is None: diff --git a/tools/infer/utility.py b/tools/infer/utility.py index 3a5e8cc131f8dd710595641be6d8a2b3ab2b3c62..9baf66d7f469a3bf6c9a140e034aee3a635a5c8e 100644 --- a/tools/infer/utility.py +++ b/tools/infer/utility.py @@ -181,14 +181,21 @@ def create_predictor(args, mode, logger): return sess, sess.get_inputs()[0], None, None else: - model_file_path = model_dir + "/inference.pdmodel" - params_file_path = model_dir + "/inference.pdiparams" + file_names = ['model', 'inference'] + for file_name in file_names: + model_file_path = '{}/{}.pdmodel'.format(model_dir, file_name) + params_file_path = '{}/{}.pdiparams'.format(model_dir, file_name) + if os.path.exists(model_file_path) and os.path.exists( + params_file_path): + break if not os.path.exists(model_file_path): - raise ValueError("not find model file path {}".format( - model_file_path)) + raise ValueError( + "not find model.pdmodel or inference.pdmodel in {}".format( + model_dir)) if not os.path.exists(params_file_path): - raise ValueError("not find params file path {}".format( - params_file_path)) + raise 
ValueError( + "not find model.pdiparams or inference.pdiparams in {}".format( + model_dir)) config = inference.Config(model_file_path, params_file_path) @@ -218,23 +225,24 @@ def create_predictor(args, mode, logger): min_subgraph_size, # skip the minmum trt subgraph use_calib_mode=False) - # collect shape - if args.shape_info_filename is not None: - if not os.path.exists(args.shape_info_filename): - config.collect_shape_range_info(args.shape_info_filename) - logger.info( - f"collect dynamic shape info into : {args.shape_info_filename}" - ) + # collect shape + if args.shape_info_filename is not None: + if not os.path.exists(args.shape_info_filename): + config.collect_shape_range_info( + args.shape_info_filename) + logger.info( + f"collect dynamic shape info into : {args.shape_info_filename}" + ) + else: + logger.info( + f"dynamic shape info file( {args.shape_info_filename} ) already exists, not need to generate again." + ) + config.enable_tuned_tensorrt_dynamic_shape( + args.shape_info_filename, True) else: logger.info( - f"dynamic shape info file( {args.shape_info_filename} ) already exists, not need to generate again." + f"when using tensorrt, dynamic shape is a suggested option, you can use '--shape_info_filename=shape.txt' for offline dygnamic shape tuning" ) - config.enable_tuned_tensorrt_dynamic_shape( - args.shape_info_filename, True) - else: - logger.info( - f"when using tensorrt, dynamic shape is a suggested option, you can use '--shape_info_filename=shape.txt' for offline dygnamic shape tuning" - ) elif args.use_xpu: config.enable_xpu(10 * 1024 * 1024) @@ -542,7 +550,7 @@ def text_visual(texts, def base64_to_cv2(b64str): import base64 data = base64.b64decode(b64str.encode('utf8')) - data = np.fromstring(data, np.uint8) + data = np.frombuffer(data, np.uint8) data = cv2.imdecode(data, cv2.IMREAD_COLOR) return data diff --git a/tools/infer_kie.py b/tools/infer_kie.py index 346e2e0aeeee695ab49577b6b13dcc058150df1a..9375434cc887b08dfa746420a6c73c58c6e04797 100755 --- a/tools/infer_kie.py +++ b/tools/infer_kie.py @@ -88,6 +88,29 @@ def draw_kie_result(batch, node, idx_to_cls, count): cv2.imwrite(save_path, vis_img) logger.info("The Kie Image saved in {}".format(save_path)) +def write_kie_result(fout, node, data): + """ + Write infer result to output file, sorted by the predict label of each line. + The format keeps the same as the input with additional score attribute. 
+ """ + import json + label = data['label'] + annotations = json.loads(label) + max_value, max_idx = paddle.max(node, -1), paddle.argmax(node, -1) + node_pred_label = max_idx.numpy().tolist() + node_pred_score = max_value.numpy().tolist() + res = [] + for i, label in enumerate(node_pred_label): + pred_score = '{:.2f}'.format(node_pred_score[i]) + pred_res = { + 'label': label, + 'transcription': annotations[i]['transcription'], + 'score': pred_score, + 'points': annotations[i]['points'], + } + res.append(pred_res) + res.sort(key=lambda x: x['label']) + fout.writelines([json.dumps(res, ensure_ascii=False) + '\n']) def main(): global_config = config['Global'] @@ -114,7 +137,7 @@ def main(): warmup_times = 0 count_t = [] - with open(save_res_path, "wb") as fout: + with open(save_res_path, "w") as fout: with open(config['Global']['infer_img'], "rb") as f: lines = f.readlines() for index, data_line in enumerate(lines): @@ -139,6 +162,8 @@ def main(): node = F.softmax(node, -1) count_t.append(time.time() - st) draw_kie_result(batch, node, idx_to_cls, index) + write_kie_result(fout, node, data) + fout.close() logger.info("success!") logger.info("It took {} s for predict {} images.".format( np.sum(count_t), len(count_t))) diff --git a/tools/infer_vqa_token_ser.py b/tools/infer_kie_token_ser.py similarity index 97% rename from tools/infer_vqa_token_ser.py rename to tools/infer_kie_token_ser.py index a15d83b17cc738a5c3349d461c3bce119c2355e7..2fc5749b9c10b9c89bc16e561fbe9c5ce58eb13c 100755 --- a/tools/infer_vqa_token_ser.py +++ b/tools/infer_kie_token_ser.py @@ -75,6 +75,8 @@ class SerPredictor(object): self.ocr_engine = PaddleOCR( use_angle_cls=False, show_log=False, + rec_model_dir=global_config.get("kie_rec_model_dir", None), + det_model_dir=global_config.get("kie_det_model_dir", None), use_gpu=global_config['use_gpu']) # create data ops diff --git a/tools/infer_vqa_token_ser_re.py b/tools/infer_kie_token_ser_re.py similarity index 97% rename from tools/infer_vqa_token_ser_re.py rename to tools/infer_kie_token_ser_re.py index 51378bdaeb03d4ec6d7684de80625c5029963745..3ee696f28470a16205be628b3aeb586ef7a9c6a6 100755 --- a/tools/infer_vqa_token_ser_re.py +++ b/tools/infer_kie_token_ser_re.py @@ -39,7 +39,7 @@ from ppocr.utils.visual import draw_re_results from ppocr.utils.logging import get_logger from ppocr.utils.utility import get_image_file_list, load_vqa_bio_label_maps, print_dict from tools.program import ArgsParser, load_config, merge_config -from tools.infer_vqa_token_ser import SerPredictor +from tools.infer_kie_token_ser import SerPredictor class ReArgsParser(ArgsParser): @@ -205,9 +205,7 @@ if __name__ == '__main__': result = ser_re_engine(data) result = result[0] fout.write(img_path + "\t" + json.dumps( - { - "ser_result": result, - }, ensure_ascii=False) + "\n") + result, ensure_ascii=False) + "\n") img_res = draw_re_results(img_path, result) cv2.imwrite(save_img_path, img_res) diff --git a/tools/infer_sr.py b/tools/infer_sr.py index 0bc2f6aaa7c4400676268ec64d37e721af0f99c2..df4334f3427e57b9062dd819aa16c110fd771e8c 100755 --- a/tools/infer_sr.py +++ b/tools/infer_sr.py @@ -63,14 +63,14 @@ def main(): elif op_name in ['SRResize']: op[op_name]['infer_mode'] = True elif op_name == 'KeepKeys': - op[op_name]['keep_keys'] = ['imge_lr'] + op[op_name]['keep_keys'] = ['img_lr'] transforms.append(op) global_config['infer_mode'] = True ops = create_operators(transforms, global_config) - save_res_path = config['Global'].get('save_res_path', "./infer_result") - if not 
os.path.exists(os.path.dirname(save_res_path)): - os.makedirs(os.path.dirname(save_res_path)) + save_visual_path = config['Global'].get('save_visual', "infer_result/") + if not os.path.exists(os.path.dirname(save_visual_path)): + os.makedirs(os.path.dirname(save_visual_path)) model.eval() for file in get_image_file_list(config['Global']['infer_img']): @@ -87,7 +87,7 @@ def main(): fm_sr = (sr_img.numpy() * 255).transpose(1, 2, 0).astype(np.uint8) fm_lr = (lr_img.numpy() * 255).transpose(1, 2, 0).astype(np.uint8) img_name_pure = os.path.split(file)[-1] - cv2.imwrite("infer_result/sr_{}".format(img_name_pure), + cv2.imwrite("{}/sr_{}".format(save_visual_path, img_name_pure), fm_sr[:, :, ::-1]) logger.info("The visualized image saved in infer_result/sr_{}".format( img_name_pure)) diff --git a/tools/program.py b/tools/program.py index 195b09b43da93d8c9285c064ef267e01623a733c..16d3d4035af933cda01b422ea56e9e2895ec2b88 100755 --- a/tools/program.py +++ b/tools/program.py @@ -162,18 +162,18 @@ def to_float32(preds): for k in preds: if isinstance(preds[k], dict) or isinstance(preds[k], list): preds[k] = to_float32(preds[k]) - else: - preds[k] = paddle.to_tensor(preds[k], dtype='float32') + elif isinstance(preds[k], paddle.Tensor): + preds[k] = preds[k].astype(paddle.float32) elif isinstance(preds, list): for k in range(len(preds)): if isinstance(preds[k], dict): preds[k] = to_float32(preds[k]) elif isinstance(preds[k], list): preds[k] = to_float32(preds[k]) - else: - preds[k] = paddle.to_tensor(preds[k], dtype='float32') - else: - preds = paddle.to_tensor(preds, dtype='float32') + elif isinstance(preds[k], paddle.Tensor): + preds[k] = preds[k].astype(paddle.float32) + elif isinstance(preds, paddle.Tensor): + preds = preds.astype(paddle.float32) return preds @@ -190,7 +190,9 @@ def train(config, pre_best_model_dict, logger, log_writer=None, - scaler=None): + scaler=None, + amp_level='O2', + amp_custom_black_list=[]): cal_metric_during_train = config['Global'].get('cal_metric_during_train', False) calc_epoch_interval = config['Global'].get('calc_epoch_interval', 1) @@ -230,7 +232,8 @@ def train(config, use_srn = config['Architecture']['algorithm'] == "SRN" extra_input_models = [ - "SRN", "NRTR", "SAR", "SEED", "SVTR", "SPIN", "VisionLAN", "RobustScanner" + "SRN", "NRTR", "SAR", "SEED", "SVTR", "SPIN", "VisionLAN", + "RobustScanner" ] extra_input = False if config['Architecture']['algorithm'] == 'Distillation': @@ -276,10 +279,10 @@ def train(config, model_average = True # use amp if scaler: - with paddle.amp.auto_cast(level='O2'): + with paddle.amp.auto_cast(level=amp_level, custom_black_list=amp_custom_black_list): if model_type == 'table' or extra_input: preds = model(images, data=batch[1:]) - elif model_type in ["kie", 'vqa']: + elif model_type in ["kie"]: preds = model(batch) else: preds = model(images) @@ -292,7 +295,7 @@ def train(config, else: if model_type == 'table' or extra_input: preds = model(images, data=batch[1:]) - elif model_type in ["kie", 'vqa', 'sr']: + elif model_type in ["kie", 'sr']: preds = model(batch) else: preds = model(images) @@ -381,7 +384,9 @@ def train(config, eval_class, model_type, extra_input=extra_input, - scaler=scaler) + scaler=scaler, + amp_level=amp_level, + amp_custom_black_list=amp_custom_black_list) cur_metric_str = 'cur metric, {}'.format(', '.join( ['{}: {}'.format(k, v) for k, v in cur_metric.items()])) logger.info(cur_metric_str) @@ -472,7 +477,9 @@ def eval(model, eval_class, model_type=None, extra_input=False, - scaler=None): + scaler=None, + 
amp_level='O2', + amp_custom_black_list = []): model.eval() with paddle.no_grad(): total_frame = 0.0 @@ -493,46 +500,27 @@ def eval(model, # use amp if scaler: - with paddle.amp.auto_cast(level='O2'): + with paddle.amp.auto_cast(level=amp_level, custom_black_list=amp_custom_black_list): if model_type == 'table' or extra_input: preds = model(images, data=batch[1:]) - elif model_type in ["kie", 'vqa']: + elif model_type in ["kie"]: preds = model(batch) elif model_type in ['sr']: preds = model(batch) sr_img = preds["sr_img"] lr_img = preds["lr_img"] - - for i in (range(sr_img.shape[0])): - fm_sr = (sr_img[i].numpy() * 255).transpose( - 1, 2, 0).astype(np.uint8) - fm_lr = (lr_img[i].numpy() * 255).transpose( - 1, 2, 0).astype(np.uint8) - cv2.imwrite("output/images/{}_{}_sr.jpg".format( - sum_images, i), fm_sr) - cv2.imwrite("output/images/{}_{}_lr.jpg".format( - sum_images, i), fm_lr) else: preds = model(images) + preds = to_float32(preds) else: if model_type == 'table' or extra_input: preds = model(images, data=batch[1:]) - elif model_type in ["kie", 'vqa']: + elif model_type in ["kie"]: preds = model(batch) elif model_type in ['sr']: preds = model(batch) sr_img = preds["sr_img"] lr_img = preds["lr_img"] - - for i in (range(sr_img.shape[0])): - fm_sr = (sr_img[i].numpy() * 255).transpose( - 1, 2, 0).astype(np.uint8) - fm_lr = (lr_img[i].numpy() * 255).transpose( - 1, 2, 0).astype(np.uint8) - cv2.imwrite("output/images/{}_{}_sr.jpg".format( - sum_images, i), fm_sr) - cv2.imwrite("output/images/{}_{}_lr.jpg".format( - sum_images, i), fm_lr) else: preds = model(images) @@ -545,11 +533,12 @@ def eval(model, # Obtain usable results from post-processing methods total_time += time.time() - start # Evaluate the results of the current batch - if model_type in ['kie']: - eval_class(preds, batch_numpy) - elif model_type in ['table', 'vqa']: - post_result = post_process_class(preds, batch_numpy) - eval_class(post_result, batch_numpy) + if model_type in ['table', 'kie']: + if post_process_class is None: + eval_class(preds, batch_numpy) + else: + post_result = post_process_class(preds, batch_numpy) + eval_class(post_result, batch_numpy) elif model_type in ['sr']: eval_class(preds, batch_numpy) else: diff --git a/tools/train.py b/tools/train.py index 0c881ecae8daf78860829b1419178358c2209f25..d0f200189e34265b3c080ac9e25eb80d29c705b7 100755 --- a/tools/train.py +++ b/tools/train.py @@ -138,15 +138,15 @@ def main(config, device, logger, vdl_writer): # build metric eval_class = build_metric(config['Metric']) - # load pretrain model - pre_best_model_dict = load_model(config, model, optimizer, - config['Architecture']["model_type"]) + logger.info('train dataloader has {} iters'.format(len(train_dataloader))) if valid_dataloader is not None: logger.info('valid dataloader has {} iters'.format( len(valid_dataloader))) use_amp = config["Global"].get("use_amp", False) + amp_level = config["Global"].get("amp_level", 'O2') + amp_custom_black_list = config['Global'].get('amp_custom_black_list',[]) if use_amp: AMP_RELATED_FLAGS_SETTING = { 'FLAGS_cudnn_batchnorm_spatial_persistent': 1, @@ -159,17 +159,22 @@ def main(config, device, logger, vdl_writer): scaler = paddle.amp.GradScaler( init_loss_scaling=scale_loss, use_dynamic_loss_scaling=use_dynamic_loss_scaling) - model, optimizer = paddle.amp.decorate( - models=model, optimizers=optimizer, level='O2', master_weight=True) + if amp_level == "O2": + model, optimizer = paddle.amp.decorate( + models=model, optimizers=optimizer, level=amp_level, master_weight=True) else: 
scaler = None + # load pretrain model + pre_best_model_dict = load_model(config, model, optimizer, + config['Architecture']["model_type"]) + if config['Global']['distributed']: model = paddle.DataParallel(model) # start train program.train(config, train_dataloader, valid_dataloader, device, model, loss_class, optimizer, lr_scheduler, post_process_class, - eval_class, pre_best_model_dict, logger, vdl_writer, scaler) + eval_class, pre_best_model_dict, logger, vdl_writer, scaler,amp_level, amp_custom_black_list) def test_reader(config, device, logger):
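The `tools/infer/predict_system.py` hunk above replaces the single adjacent-pair swap in `sorted_boxes` with an inner loop that bubbles a box backwards while it still sits on the same text line as its predecessor. The sketch below mirrors that revised logic as a self-contained script so the reading-order behaviour can be checked outside the patch; the box coordinates in the demo are invented for illustration and are not taken from the repository.

```python
import numpy as np


def sorted_boxes(dt_boxes):
    """Sort detected text boxes into reading order: top to bottom, then
    left to right.

    Mirrors the patched logic: after the initial sort by the (y, x) of the
    top-left corner, a box is swapped backwards past earlier boxes while it
    lies on the same line (vertical gap < 10 px) but starts further left,
    instead of the previous single adjacent-pair comparison.
    """
    num_boxes = dt_boxes.shape[0]
    pre_sorted = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0]))
    _boxes = list(pre_sorted)

    for i in range(num_boxes - 1):
        # Walk the newly considered box backwards, exactly as in the patch.
        for j in range(i, 0, -1):
            if abs(_boxes[j + 1][0][1] - _boxes[j][0][1]) < 10 and \
                    (_boxes[j + 1][0][0] < _boxes[j][0][0]):
                tmp = _boxes[j]
                _boxes[j] = _boxes[j + 1]
                _boxes[j + 1] = tmp
            else:
                break
    return _boxes


if __name__ == "__main__":
    # Three illustrative quadrilateral boxes (4 corner points each):
    # a headline on its own line, then two boxes on a second line that the
    # detector happened to emit right-box-first.
    boxes = np.array([
        [[30, 5], [300, 5], [300, 30], [30, 30]],       # top line
        [[200, 48], [300, 48], [300, 75], [200, 75]],   # second line, right
        [[10, 52], [120, 52], [120, 78], [10, 78]],     # second line, left
    ], dtype=np.float32)

    for box in sorted_boxes(boxes):
        print(box[0])
    # Prints the top-left corners in reading order:
    # top line first, then the second line from left to right.
```

Compared with the old single comparison of `_boxes[i]` and `_boxes[i + 1]`, the backward inner loop lets a box that belongs earlier on its line move past several predecessors in one pass, which matters when more than two boxes on the same line are detected out of order.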