diff --git a/PPOCRLabel/README.md b/PPOCRLabel/README.md index 9b882812f33a781a448a4f0a89fe15c349f587ae..c28ad72aeb68459c85058f634236e71d00767ca4 100644 --- a/PPOCRLabel/README.md +++ b/PPOCRLabel/README.md @@ -164,10 +164,10 @@ python PPOCRLabel.py - Default model: PPOCRLabel uses the Chinese and English ultra-lightweight OCR model in PaddleOCR by default, supports Chinese, English and number recognition, and multiple language detection. -- Model language switching: Changing the built-in model language is supportable by clicking "PaddleOCR"-"Choose OCR Model" in the menu bar. Currently supported languages​include French, German, Korean, and Japanese. +- Model language switching: Changing the built-in model language is supportable by clicking "PaddleOCR"-"Choose OCR Model" in the menu bar. Currently supported languages include French, German, Korean, and Japanese. For specific model download links, please refer to [PaddleOCR Model List](https://github.com/PaddlePaddle/PaddleOCR/blob/develop/doc/doc_en/models_list_en.md#multilingual-recognition-modelupdating) -- **Custom Model**: If users want to replace the built-in model with their own inference model, they can follow the [Custom Model Code Usage](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/doc_en/whl_en.md#31-use-by-code) by modifying PPOCRLabel.py for [Instantiation of PaddleOCR class](https://github.com/PaddlePaddle/PaddleOCR/blob/release/ 2.3/PPOCRLabel/PPOCRLabel.py#L116) : +- **Custom Model**: If users want to replace the built-in model with their own inference model, they can follow the [Custom Model Code Usage](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/doc/doc_en/whl_en.md#31-use-by-code) by modifying PPOCRLabel.py for [Instantiation of PaddleOCR class](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/PPOCRLabel/PPOCRLabel.py#L116) : add parameter `det_model_dir` in `self.ocr = PaddleOCR(use_pdserving=False, use_angle_cls=True, det=True, cls=True, use_gpu=gpu, 
lang=lang) ` diff --git a/PPOCRLabel/README_ch.md b/PPOCRLabel/README_ch.md index 9cd11a7c5c4cbe9a02d9921e32017bb23be77c88..73899d8c4b4abb7f9a8028788a1639675165f676 100644 --- a/PPOCRLabel/README_ch.md +++ b/PPOCRLabel/README_ch.md @@ -73,25 +73,26 @@ PPOCRLabel --lang ch # 启动 > 如果上述安装出现问题,可以参考3.6节 错误提示 -#### 1.2.2 本地构建whl包并安装 +#### 1.2.2 通过Python脚本运行PPOCRLabel + +如果您对PPOCRLabel文件有所更改(例如指定新的内置模型),通过Python脚本运行会更加方面的看到更改的结果。如果仍然需要通过whl包启动,则需要参考下节重新编译whl包。 ```bash -cd PaddleOCR/PPOCRLabel -python3 setup.py bdist_wheel -pip3 install dist/PPOCRLabel-1.0.2-py2.py3-none-any.whl -i https://mirror.baidu.com/pypi/simple +cd ./PPOCRLabel # 切换到PPOCRLabel目录 +python PPOCRLabel.py --lang ch ``` -#### 1.2.3 通过Python脚本运行PPOCRLabel +#### 1.2.3 本地构建whl包并安装 -如果您对PPOCRLabel文件有所更改,通过Python脚本运行会更加方面的看到更改的结果 +编译与安装新的whl包,其中1.0.2为版本号,可在 `setup.py` 中指定新版本。 ```bash -cd ./PPOCRLabel # 切换到PPOCRLabel目录 -python PPOCRLabel.py --lang ch +cd PaddleOCR/PPOCRLabel +python3 setup.py bdist_wheel +pip3 install dist/PPOCRLabel-1.0.2-py2.py3-none-any.whl -i https://mirror.baidu.com/pypi/simple ``` - ## 2. 
使用 ### 2.1 操作步骤 diff --git a/README.md b/README.md index 251d51c0080fe5f2cd9fec76479526d142de368f..80b1f92b2419d2c67096cbc64a6fba05648272cc 100644 --- a/README.md +++ b/README.md @@ -104,7 +104,7 @@ For a new language request, please refer to [Guideline for new language_requests - [Quick Start](./doc/doc_en/quickstart_en.md) - [PaddleOCR Overview and Installation](./doc/doc_en/paddleOCR_overview_en.md) - PP-OCR Industry Landing: from Training to Deployment - - [PP-OCR Model and Configuration](./doc/doc_en/models_and_config_en.md) + - [PP-OCR Model Zoo](./doc/doc_en/models_en.md) - [PP-OCR Model Download](./doc/doc_en/models_list_en.md) - [Python Inference for PP-OCR Model Library](./doc/doc_en/inference_ppocr_en.md) - [PP-OCR Training](./doc/doc_en/training_en.md) @@ -112,6 +112,10 @@ For a new language request, please refer to [Guideline for new language_requests - [Text Recognition](./doc/doc_en/recognition_en.md) - [Text Direction Classification](./doc/doc_en/angle_class_en.md) - [Yml Configuration](./doc/doc_en/config_en.md) + - PP-OCR Models Compression + - [Knowledge Distillation](./doc/doc_en/knowledge_distillation_en.md) + - [Model Quantization](./deploy/slim/quantization/README_en.md) + - [Model Pruning](./deploy/slim/prune/README_en.md) - Inference and Deployment - [C++ Inference](./deploy/cpp_infer/readme_en.md) - [Serving](./deploy/pdserving/README.md) diff --git a/README_ch.md b/README_ch.md index 33d548c44b5ccd21e55882acbde8637551ebd668..df713816faef83ffede1c6bb2a718afcb1c2bb3a 100755 --- a/README_ch.md +++ b/README_ch.md @@ -32,7 +32,7 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 - PP-OCR系列高质量预训练模型,准确的识别效果 - 超轻量PP-OCRv2系列:检测(3.1M)+ 方向分类器(1.4M)+ 识别(8.5M)= 13.0M - 超轻量PP-OCR mobile移动端系列:检测(3.0M)+方向分类器(1.4M)+ 识别(5.0M)= 9.4M - - 通用PPOCR server系列:检测(47.1M)+方向分类器(1.4M)+ 识别(94.9M)= 143.4M + - 通用PP-OCR server系列:检测(47.1M)+方向分类器(1.4M)+ 识别(94.9M)= 143.4M - 支持中英文数字组合识别、竖排文本识别、长文本识别 - 支持多语言识别:韩语、日语、德语、法语等约80种语言 - PP-Structure文档结构化系统 @@ -53,8 +53,8 @@ 
PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 ## 社区、社区贡献与社区常规赛 - 加入社区:微信扫描下方二维码加入官方交流群,与各行各业开发者充分交流,期待您的加入。 -- 社区贡献:[社区贡献](./doc/doc_ch/thirdparty.md)文档中包含了社区用户**使用PaddleOCR开发的各种工具、应用**以及**为PaddleOCR贡献的功能、优化的文档与代码**等,是官方为社区开发者打造的荣誉墙、也是帮助优质项目宣传的广播站。如果您的OCR项目未被收集在文档中,可根据文档说明与我们联系。最新社区贡献可查看[此处](#社区贡献)。 -- 社区常规赛:作为社区贡献的具体承载形式,社区常规赛是面向OCR开发者的积分赛事。首届社区常规赛与[《动手学OCR · 十讲》课程](https://aistudio.baidu.com/aistudio/course/introduce/25207)联合推广。社区常规赛的赛题详情与报名方法可参考[链接](https://github.com/PaddlePaddle/PaddleOCR/issues/4982)。 +- 社区贡献:[社区贡献](./doc/doc_ch/thirdparty.md)文档中包含了社区用户**使用PaddleOCR开发的各种工具、应用**以及**为PaddleOCR贡献的功能、优化的文档与代码**等,是官方为社区开发者打造的荣誉墙、也是帮助优质项目宣传的广播站。如果您的OCR项目未被收集在文档中,可根据文档说明与我们联系。 +- 社区常规赛:社区常规赛是面向OCR开发者的积分赛事,覆盖文档、代码、模型和应用四大类型,以季度为单位评选并发放奖励,赛题详情与报名方法可参考[链接](https://github.com/PaddlePaddle/PaddleOCR/issues/4982)。
@@ -79,30 +79,35 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 ## 文档教程 - [运行环境准备](./doc/doc_ch/environment.md) -- [快速开始(中英文/多语言/文档分析)](./doc/doc_ch/quickstart.md) +- [快速开始(中英文/多语言/版面分析)](./doc/doc_ch/quickstart.md) - [PaddleOCR全景图与项目克隆](./doc/doc_ch/paddleOCR_overview.md) - PP-OCR产业落地:从训练到部署 - - [PP-OCR模型与配置文件](./doc/doc_ch/models_and_config.md) + - [PP-OCR模型库](./doc/doc_ch/models.md) - [PP-OCR模型下载](./doc/doc_ch/models_list.md) - - [PP-OCR模型库快速推理](./doc/doc_ch/inference_ppocr.md) + - [Python引擎的PP-OCR模型库推理](./doc/doc_ch/inference_ppocr.md) - [PP-OCR模型训练](./doc/doc_ch/training.md) - [文本检测](./doc/doc_ch/detection.md) - [文本识别](./doc/doc_ch/recognition.md) - [文本方向分类器](./doc/doc_ch/angle_class.md) - - [知识蒸馏](./doc/doc_ch/knowledge_distillation.md) - [配置文件内容与生成](./doc/doc_ch/config.md) + - PP-OCR模型压缩 + - [知识蒸馏](./doc/doc_ch/knowledge_distillation.md) + - [模型量化](./deploy/slim/quantization/README.md) + - [模型裁剪](./deploy/slim/prune/README.md) - PP-OCR模型推理部署 - [基于C++预测引擎推理](./deploy/cpp_infer/readme.md) - [服务化部署](./deploy/pdserving/README_CN.md) - [端侧部署](./deploy/lite/readme.md) + - [Paddle2ONNX模型转化与预测](./deploy/paddle2onnx/readme.md) - [Benchmark](./doc/doc_ch/benchmark.md) - [PP-Structure信息提取](./ppstructure/README_ch.md) - [版面分析](./ppstructure/layout/README_ch.md) - [表格识别](./ppstructure/table/README_ch.md) - [DocVQA](./ppstructure/vqa/README_ch.md) - - [关键信息提取](./ppstructure/docs/kie.md) -- OCR学术圈 - - [两阶段模型介绍与下载](./doc/doc_ch/algorithm_overview.md) + - [关键信息提取](./ppstructure/docs/kie_ch.md) +- OCR学术前沿模型介绍与下载 + - [文本检测算法](./doc/doc_ch/algorithm_overview.md#11-%E6%96%87%E6%9C%AC%E6%A3%80%E6%B5%8B%E7%AE%97%E6%B3%95) + - [文本识别算法](./doc/doc_ch/algorithm_overview.md#12-%E6%96%87%E6%9C%AC%E8%AF%86%E5%88%AB%E7%AE%97%E6%B3%95) - [端到端PGNet算法](./doc/doc_ch/pgnet.md) - [基于Python脚本预测引擎推理](./doc/doc_ch/inference.md) - [使用PaddleOCR架构添加新算法](./doc/doc_ch/add_new_algorithm.md) @@ -157,17 +162,6 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力
- - -## 最新社区贡献 - -- 基于PaddleOCR的社区项目: [FastOCRLabel](https://gitee.com/BaoJianQiang/FastOCRLabel):完整的C#版本标注工具 (@ [包建强](https://gitee.com/BaoJianQiang) ) -- 为PaddleOCR新增功能:非常感谢 [Evezerest](https://github.com/Evezerest), [ninetailskim](https://github.com/ninetailskim), [edencfc](https://github.com/edencfc), [BeyondYourself](https://github.com/BeyondYourself), [1084667371](https://github.com/1084667371) 贡献了[PPOCRLabel](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.3/PPOCRLabel/README_ch.md) 的完整代码。 -- 代码与文档优化:非常感谢 [BeyondYourself](https://github.com/BeyondYourself) 给PaddleOCR提了很多非常棒的建议,并简化了PaddleOCR的部分代码风格。 -- 多语言语料:非常感谢 [Mejans](https://github.com/Mejans) 给PaddleOCR增加新语言奥克西坦语Occitan的字典和语料([#954](https://github.com/PaddlePaddle/PaddleOCR/pull/954))。 - -完整社区贡献列表可查看[社区贡献文档](./doc/doc_ch/thirdparty.md) - ## 许可证书 diff --git a/deploy/paddle2onnx/readme.md b/deploy/paddle2onnx/readme.md index e08f2adee5d315cecba703ecdf515c09cd1569d2..8e821892142d65caddd6fa3bd8ff24a372fe9a5d 100644 --- a/deploy/paddle2onnx/readme.md +++ b/deploy/paddle2onnx/readme.md @@ -1,4 +1,4 @@ -# paddle2onnx 模型转化与预测 +# Paddle2ONNX模型转化与预测 本章节介绍 PaddleOCR 模型如何转化为 ONNX 模型,并基于 ONNXRuntime 引擎预测。 diff --git a/deploy/pdserving/README.md b/deploy/pdserving/README.md index edbbefae588e9c49860b7fc58b2a1bbd572e9260..7ed52af90df653251e2501a032b26a00d9b96984 100644 --- a/deploy/pdserving/README.md +++ b/deploy/pdserving/README.md @@ -30,7 +30,8 @@ The introduction and tutorial of Paddle Serving service deployment framework ref PaddleOCR operating environment and Paddle Serving operating environment are needed. 1. Please prepare PaddleOCR operating environment reference [link](../../doc/doc_ch/installation.md). - Download the corresponding paddle whl package according to the environment, it is recommended to install version 2.2.1. + + Download the corresponding paddle whl package according to the environment, it is recommended to install version 2.2.2 2. 
The steps of PaddleServing operating environment prepare are as follows: @@ -52,6 +53,7 @@ wget https://paddle-serving.bj.bcebos.com/test-dev/whl/paddle_serving_app-0.8.3- pip3 install paddle_serving_app-0.8.3-py3-none-any.whl ``` + **note:** If you want to install the latest version of PaddleServing, refer to [link](https://github.com/PaddlePaddle/Serving/blob/v0.7.0/doc/Latest_Packages_CN.md). diff --git a/deploy/pdserving/README_CN.md b/deploy/pdserving/README_CN.md index 3227403f2061475970bcef935507bd834c5f6cd1..98e5fff2c56b03e85d293d526bc55166f3ee5f4e 100644 --- a/deploy/pdserving/README_CN.md +++ b/deploy/pdserving/README_CN.md @@ -8,8 +8,7 @@ PaddleOCR提供2种服务部署方式: # 基于PaddleServing的服务部署 -本文档将介绍如何使用[PaddleServing](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具部署PPOCR -动态图模型的pipeline在线服务。 +本文档将介绍如何使用[PaddleServing](https://github.com/PaddlePaddle/Serving/blob/develop/README_CN.md)工具部署PP-OCR动态图模型的pipeline在线服务。 相比较于hubserving部署,PaddleServing具备以下优点: - 支持客户端和服务端之间高并发和高效通信 @@ -32,7 +31,8 @@ PaddleOCR提供2种服务部署方式: 需要准备PaddleOCR的运行环境和Paddle Serving的运行环境。 - 准备PaddleOCR的运行环境[链接](../../doc/doc_ch/installation.md) - 根据环境下载对应的paddle whl包,推荐安装2.2.1版本 + + 根据环境下载对应的paddlepaddle whl包,推荐安装2.2.2版本 - 准备PaddleServing的运行环境,步骤如下 @@ -61,7 +61,7 @@ pip3 install paddle_serving_app-0.8.3-py3-none-any.whl 使用PaddleServing做服务化部署时,需要将保存的inference模型转换为serving易于部署的模型。 -首先,下载PPOCR的[inference模型](https://github.com/PaddlePaddle/PaddleOCR#pp-ocr-series-model-listupdate-on-september-8th) +首先,下载PP-OCR的[inference模型](https://github.com/PaddlePaddle/PaddleOCR#pp-ocr-series-model-listupdate-on-september-8th) ```bash # 下载并解压 OCR 文本检测模型 @@ -109,7 +109,7 @@ python3 -m paddle_serving_client.convert --dirname ./ch_PP-OCRv2_rec_infer/ \ 1. 
下载PaddleOCR代码,若已下载可跳过此步骤 ``` git clone https://github.com/PaddlePaddle/PaddleOCR - + # 进入到工作目录 cd PaddleOCR/deploy/pdserving/ ``` diff --git a/deploy/slim/prune/README.md b/deploy/slim/prune/README.md index 7b8dd169c5fa9d01421070f1ccc2bd4e8ed543a2..6d04f1648705071d70c1e9f17cd30d6825f92467 100644 --- a/deploy/slim/prune/README.md +++ b/deploy/slim/prune/README.md @@ -1,5 +1,5 @@ -## 介绍 +# PP-OCR模型裁剪 复杂的模型有利于提高模型的性能,但也导致模型中存在一定冗余,模型裁剪通过移出网络模型中的子模型来减少这种冗余,达到减少模型计算复杂度,提高模型推理性能的目的。 本教程将介绍如何使用飞桨模型压缩库PaddleSlim做PaddleOCR模型的压缩。 @@ -7,13 +7,13 @@ 在开始本教程之前,建议先了解: -1. [PaddleOCR模型的训练方法](../../../doc/doc_ch/quickstart.md) +1. [PaddleOCR模型的训练方法](../../../doc/doc_ch/training.md) 2. [模型裁剪教程](https://github.com/PaddlePaddle/PaddleSlim/blob/release%2F2.0.0/docs/zh_cn/tutorials/pruning/dygraph/filter_pruning.md) - ## 快速开始 模型裁剪主要包括四个步骤: + 1. 安装 PaddleSlim 2. 准备训练好的模型 3. 敏感度分析、裁剪训练 @@ -35,16 +35,19 @@ python3 setup.py install 加载预训练模型后,通过对现有模型的每个网络层进行敏感度分析,得到敏感度文件:sen.pickle,可以通过PaddleSlim提供的[接口](https://github.com/PaddlePaddle/PaddleSlim/blob/9b01b195f0c4bc34a1ab434751cb260e13d64d9e/paddleslim/dygraph/prune/filter_pruner.py#L75)加载文件,获得各网络层在不同裁剪比例下的精度损失。从而了解各网络层冗余度,决定每个网络层的裁剪比例。 敏感度文件内容格式: - sen.pickle(Dict){ +``` +sen.pickle(Dict){ 'layer_weight_name_0': sens_of_each_ratio(Dict){'pruning_ratio_0': acc_loss, 'pruning_ratio_1': acc_loss} 'layer_weight_name_1': sens_of_each_ratio(Dict){'pruning_ratio_0': acc_loss, 'pruning_ratio_1': acc_loss} } - 例子: +例子: { 'conv10_expand_weights': {0.1: 0.006509952684312718, 0.2: 0.01827734339798862, 0.3: 0.014528405644659832, 0.6: 0.06536008804270439, 0.8: 0.11798612250664964, 0.7: 0.12391408417493704, 0.4: 0.030615754498018757, 0.5: 0.047105205602406594} 'conv10_linear_weights': {0.1: 0.05113190831455035, 0.2: 0.07705573833558801, 0.3: 0.12096721757739311, 0.6: 0.5135061352930738, 0.8: 0.7908166677143281, 0.7: 0.7272187676899062, 0.4: 0.1819252083008504, 0.5: 0.3728054727792405} } +``` + 
加载敏感度文件后会返回一个字典,字典中的keys为网络模型参数模型的名字,values为一个字典,里面保存了相应网络层的裁剪敏感度信息。例如在例子中,conv10_expand_weights所对应的网络层在裁掉10%的卷积核后模型性能相较原模型会下降0.65%,详细信息可见[PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/algo/algo.md#2-%E5%8D%B7%E7%A7%AF%E6%A0%B8%E5%89%AA%E8%A3%81%E5%8E%9F%E7%90%86) 进入PaddleOCR根目录,通过以下命令对模型进行敏感度分析训练: diff --git a/deploy/slim/prune/README_en.md b/deploy/slim/prune/README_en.md index f0d652f249686c1d462cd2aa71f4766cf39e763e..aca8d79290016d4602a86ef04fd4e8fa24a39ad7 100644 --- a/deploy/slim/prune/README_en.md +++ b/deploy/slim/prune/README_en.md @@ -1,9 +1,9 @@ -## Introduction +# PP-OCR Models Pruning Generally, a more complex model would achive better performance in the task, but it also leads to some redundancy in the model. Model Pruning is a technique that reduces this redundancy by removing the sub-models in the neural network model, so as to reduce model calculation complexity and improve model inference performance. -This example uses PaddleSlim provided[APIs of Pruning](https://paddlepaddle.github.io/PaddleSlim/api/prune_api/) to compress the OCR model. +This example uses PaddleSlim provided [APIs of Pruning](https://paddlepaddle.github.io/PaddleSlim/api/prune_api/) to compress the OCR model. [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim), an open source library which integrates model pruning, quantization (including quantization training and offline quantization), distillation, neural network architecture search, and many other commonly used and leading model compression technique in the industry. It is recommended that you could understand following pages before reading this example: @@ -37,25 +37,26 @@ PaddleOCR also provides a series of [models](../../../doc/doc_en/models_list_en. After the pre-trained model is loaded, sensitivity analysis is performed on each network layer of the model to understand the redundancy of each network layer, and save a sensitivity file which named: sen.pickle. 
After that, user could load the sensitivity file via the [methods provided by PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/paddleslim/prune/sensitive.py#L221) and determining the pruning ratio of each network layer automatically. For specific details of sensitivity analysis, see:[Sensitivity analysis](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/tutorials/image_classification_sensitivity_analysis_tutorial.md) The data format of sensitivity file: - sen.pickle(Dict){ + +``` +sen.pickle(Dict){ 'layer_weight_name_0': sens_of_each_ratio(Dict){'pruning_ratio_0': acc_loss, 'pruning_ratio_1': acc_loss} 'layer_weight_name_1': sens_of_each_ratio(Dict){'pruning_ratio_0': acc_loss, 'pruning_ratio_1': acc_loss} } - - example: +example: { 'conv10_expand_weights': {0.1: 0.006509952684312718, 0.2: 0.01827734339798862, 0.3: 0.014528405644659832, 0.6: 0.06536008804270439, 0.8: 0.11798612250664964, 0.7: 0.12391408417493704, 0.4: 0.030615754498018757, 0.5: 0.047105205602406594} 'conv10_linear_weights': {0.1: 0.05113190831455035, 0.2: 0.07705573833558801, 0.3: 0.12096721757739311, 0.6: 0.5135061352930738, 0.8: 0.7908166677143281, 0.7: 0.7272187676899062, 0.4: 0.1819252083008504, 0.5: 0.3728054727792405} } +``` + The function would return a dict after loading the sensitivity file. The keys of the dict are name of parameters in each layer. And the value of key is the information about pruning sensitivity of corresponding layer. In example, pruning 10% filter of the layer corresponding to conv10_expand_weights would lead to 0.65% degradation of model performance. 
The details could be seen at: [Sensitivity analysis](https://github.com/PaddlePaddle/PaddleSlim/blob/develop/docs/zh_cn/algo/algo.md#2-%E5%8D%B7%E7%A7%AF%E6%A0%B8%E5%89%AA%E8%A3%81%E5%8E%9F%E7%90%86) Enter the PaddleOCR root directory,perform sensitivity analysis on the model with the following command: ```bash - python3.7 deploy/slim/prune/sensitivity_anal.py -c configs/det/ch_ppocr_v2.0/ch_det_mv3_db_v2.0.yml -o Global.pretrained_model="your trained model" Global.save_model_dir=./output/prune_model/ - ``` diff --git a/deploy/slim/quantization/README.md b/deploy/slim/quantization/README.md index 62bc408f5eeda6d8366834200e8d8a20d1dc82cd..8d3f779e0028a62d8396601166283f0ee54d43a7 100644 --- a/deploy/slim/quantization/README.md +++ b/deploy/slim/quantization/README.md @@ -1,12 +1,12 @@ -## 介绍 +# PP-OCR模型量化 复杂的模型有利于提高模型的性能,但也导致模型中存在一定冗余,模型量化将全精度缩减到定点数减少这种冗余,达到减少模型计算复杂度,提高模型推理性能的目的。 模型量化可以在基本不损失模型的精度的情况下,将FP32精度的模型参数转换为Int8精度,减小模型参数大小并加速计算,使用量化后的模型在移动端等部署时更具备速度优势。 本教程将介绍如何使用飞桨模型压缩库PaddleSlim做PaddleOCR模型的压缩。 [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim) 集成了模型剪枝、量化(包括量化训练和离线量化)、蒸馏和神经网络搜索等多种业界常用且领先的模型压缩功能,如果您感兴趣,可以关注并了解。 -在开始本教程之前,建议先了解[PaddleOCR模型的训练方法](../../../doc/doc_ch/quickstart.md)以及[PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/index.html) +在开始本教程之前,建议先了解[PaddleOCR模型的训练方法](../../../doc/doc_ch/training.md)以及[PaddleSlim](https://paddleslim.readthedocs.io/zh_CN/latest/index.html) ## 快速开始 diff --git a/deploy/slim/quantization/README_en.md b/deploy/slim/quantization/README_en.md index 4cafe5f44e48a479cf5b0e4209b8e335a7e4917d..e9e0933d353afca13619aff61b19a0c4242b5653 100644 --- a/deploy/slim/quantization/README_en.md +++ b/deploy/slim/quantization/README_en.md @@ -1,5 +1,5 @@ -## Introduction +# PP-OCR Models Quantization Generally, a more complex model would achieve better performance in the task, but it also leads to some redundancy in the model. 
Quantization is a technique that reduces this redundancy by reducing the full precision data to a fixed number, diff --git a/doc/doc_ch/environment.md b/doc/doc_ch/environment.md index 3a266c4bb8fe5516f844bea9f0aa21359d51660e..23bec4b978ab34f144a2ec7256e09412f5440646 100644 --- a/doc/doc_ch/environment.md +++ b/doc/doc_ch/environment.md @@ -1,20 +1,19 @@ # 运行环境准备 -Windows和Mac用户推荐使用Anaconda搭建Python环境,Linux用户建议使用docker搭建PyThon环境。 +Windows和Mac用户推荐使用Anaconda搭建Python环境,Linux用户建议使用docker搭建Python环境。 推荐环境: -- PaddlePaddle >= 2.0.0 (2.1.2) -- python3.7 +- PaddlePaddle >= 2.1.2 +- Python 3.7 - CUDA10.1 / CUDA10.2 - CUDNN 7.6 -如果对于Python环境熟悉的用户可以直接跳到第2步安装PaddlePaddle。 +> 如果您已经安装Python环境,可以直接参考[PaddleOCR快速开始](./quickstart.md) * [1. Python环境搭建](#1) + [1.1 Windows](#1.1) + [1.2 Mac](#1.2) + [1.3 Linux](#1.3) -* [2. 安装PaddlePaddle](#2) @@ -212,7 +211,7 @@ Linux用户可选择Anaconda或Docker两种方式运行。如果你熟悉Docker wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2021.05-Linux-x86_64.sh # 若您要下载其他版本,需要将最后1个/后的文件名改成您希望下载的版本 - ``` + ``` - 安装Anaconda: @@ -311,21 +310,3 @@ sudo nvidia-docker run --name ppocr -v $PWD:/paddle --shm-size=64G --network=hos # ctrl+P+Q可退出docker 容器,重新进入docker 容器使用如下命令 sudo docker container exec -it ppocr /bin/bash ``` - - - -## 2. 
安装PaddlePaddle - -- 如果您的机器安装的是CUDA9或CUDA10,请运行以下命令安装 - -```bash -python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple -``` - -- 如果您的机器是CPU,请运行以下命令安装 - -```bash -python3 -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple -``` - -更多的版本需求,请参照[飞桨官网安装文档](https://www.paddlepaddle.org.cn/install/quick)中的说明进行操作。 diff --git a/doc/doc_ch/models_and_config.md b/doc/doc_ch/models_and_config.md deleted file mode 100644 index 89afc89a99bed364fd2abe247946dfe9e552ae86..0000000000000000000000000000000000000000 --- a/doc/doc_ch/models_and_config.md +++ /dev/null @@ -1,47 +0,0 @@ - -# PP-OCR模型与配置文件 -PP-OCR模型与配置文件一章主要补充一些OCR模型的基本概念、配置文件的内容与作用以便对模型后续的参数调整和训练中拥有更好的体验。 - -本章包含三个部分,首先在[PP-OCR模型下载](./models_list.md)中解释PP-OCR模型的类型概念,并提供所有模型的下载链接。然后在[配置文件内容与生成](./config.md)中详细说明调整PP-OCR模型所需的参数。最后的[模型库快速使用](./inference_ppocr.md)是对第一节PP-OCR模型库使用方法的介绍,可以通过Python推理引擎快速利用丰富的模型库模型获得测试结果。 - ------- - -下面我们首先了解一些OCR相关的基本概念: - -- [1. OCR 简要介绍](#1-ocr-----) - * [1.1 OCR 检测模型基本概念](#11-ocr---------) - * [1.2 OCR 识别模型基本概念](#12-ocr---------) - * [1.3 PP-OCR模型](#13-pp-ocr--) - - -## 1. OCR 简要介绍 -本节简要介绍OCR检测模型、识别模型的基本概念,并介绍PaddleOCR的PP-OCR模型。 - -OCR(Optical Character Recognition,光学字符识别)目前是文字识别的统称,已不限于文档或书本文字识别,更包括识别自然场景下的文字,又可以称为STR(Scene Text Recognition)。 - -OCR文字识别一般包括两个部分,文本检测和文本识别;文本检测首先利用检测算法检测到图像中的文本行;然后检测到的文本行用识别算法去识别到具体文字。 - - -### 1.1 OCR 检测模型基本概念 - -文本检测就是要定位图像中的文字区域,然后通常以边界框的形式将单词或文本行标记出来。传统的文字检测算法多是通过手工提取特征的方式,特点是速度快,简单场景效果好,但是面对自然场景,效果会大打折扣。当前多是采用深度学习方法来做。 - -基于深度学习的文本检测算法可以大致分为以下几类: -1. 基于目标检测的方法;一般是预测得到文本框后,通过NMS筛选得到最终文本框,多是四点文本框,对弯曲文本场景效果不理想。典型算法为EAST、Text Box等方法。 -2. 基于分割的方法;将文本行当成分割目标,然后通过分割结果构建外接文本框,可以处理弯曲文本,对于文本交叉场景问题效果不理想。典型算法为DB、PSENet等方法。 -3. 混合目标检测和分割的方法; - - -### 1.2 OCR 识别模型基本概念 - -OCR识别算法的输入数据一般是文本行,背景信息不多,文字占据主要部分,识别算法目前可以分为两类算法: -1. 基于CTC的方法;即识别算法的文字预测模块是基于CTC的,常用的算法组合为CNN+RNN+CTC。目前也有一些算法尝试在网络中加入transformer模块等等。 -2. 
基于Attention的方法;即识别算法的文字预测模块是基于Attention的,常用算法组合是CNN+RNN+Attention。 - - -### 1.3 PP-OCR模型 - -PaddleOCR 中集成了很多OCR算法,文本检测算法有DB、EAST、SAST等等,文本识别算法有CRNN、RARE、StarNet、Rosetta、SRN等算法。 - -其中PaddleOCR针对中英文自然场景通用OCR,推出了PP-OCR系列模型,PP-OCR模型由DB+CRNN算法组成,利用海量中文数据训练加上模型调优方法,在中文场景上具备较高的文本检测识别能力。并且PaddleOCR推出了高精度超轻量PP-OCRv2模型,检测模型仅3M,识别模型仅8.5M,利用[PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)的模型量化方法,可以在保持精度不降低的情况下,将检测模型压缩到0.8M,识别压缩到3M,更加适用于移动端部署场景。 - diff --git a/doc/doc_ch/multi_languages.md b/doc/doc_ch/multi_languages.md index af9ff82e357e5945bfddf10337d0af3cd04390a0..aaf79b8d7ff1c4b98cbb0f365ff48143577e8fec 100644 --- a/doc/doc_ch/multi_languages.md +++ b/doc/doc_ch/multi_languages.md @@ -225,7 +225,7 @@ ppocr 支持使用自己的数据进行自定义训练或finetune, 其中识别 |波兰文|Polish |pl| | 比尔哈文|Bihari |bh| | 罗马尼亚文|Romanian |ro| | 迈蒂利文|Maithili |mai| | 斯洛伐克文|Slovak |sk| | 昂加文|Angika |ang| -| 斯洛文尼亚文|Slovenian |sl| | 孟加拉文|Bhojpuri |bho| +| 斯洛文尼亚文|Slovenian |sl| | 博杰普爾文|Bhojpuri |bho| | 阿尔巴尼亚文|Albanian |sq| | 摩揭陀文 |Magahi |mah| | 瑞典文|Swedish |sv| | 那格浦尔文|Nagpur |sck| | 西瓦希里文|Swahili |sw| | 尼瓦尔文|Newari |new| diff --git a/doc/doc_ch/pgnet.md b/doc/doc_ch/pgnet.md index 0aee58ec1aca24d06305c47569fdf156df6ee874..1234502f7840a9d39e7f0c85b240d3a4e106ccc0 100644 --- a/doc/doc_ch/pgnet.md +++ b/doc/doc_ch/pgnet.md @@ -97,7 +97,7 @@ train.txt标注文件格式如下,文件名和标注信息中间用"\t"分隔 " 图像文件名 json.dumps编码的图像标注信息" rgb/img11.jpg [{"transcription": "ASRAMA", "points": [[214.0, 325.0], [235.0, 308.0], [259.0, 296.0], [286.0, 291.0], [313.0, 295.0], [338.0, 305.0], [362.0, 320.0], [349.0, 347.0], [330.0, 337.0], [310.0, 329.0], [290.0, 324.0], [269.0, 328.0], [249.0, 336.0], [231.0, 346.0]]}, {...}] ``` -json.dumps编码前的图像标注信息是包含多个字典的list,字典中的 `points` 表示文本框的四个点的坐标(x, y),从左上角的点开始顺时针排列。 +json.dumps编码前的图像标注信息是包含多个字典的list,字典中的 `points` 表示文本框的多点坐标(如:4点、8点以及14点等),从左上角的点开始顺时针排列。 `transcription` 表示当前文本框的文字,**当其内容为“###”时,表示该文本框无效,在训练时会跳过。** 如果您想在其他数据集上训练,可以按照上述形式构建标注文件。 diff --git a/doc/doc_ch/quickstart.md b/doc/doc_ch/quickstart.md index 
1e0d914140072416710a1b37d72ea88a038793ba..d2126192764fa32c7c7a3651b463b8b23240ea6c 100644 --- a/doc/doc_ch/quickstart.md +++ b/doc/doc_ch/quickstart.md @@ -1,6 +1,9 @@ # PaddleOCR快速开始 -- [1. 安装PaddleOCR whl包](#1) +- [1. 安装](#1) + - [1.1 安装PaddlePaddle](#11) + - [1.2 安装PaddleOCR whl包](#12) + - [2. 便捷使用](#2) - [2.1 命令行使用](#21) - [2.1.1 中英文模型](#211) @@ -9,10 +12,35 @@ - [2.2 Python脚本使用](#22) - [2.2.1 中英文与多语言使用](#221) - [2.2.2 版面分析](#222) +- [3.小结](#3) -## 1. 安装PaddleOCR whl包 +## 1. 安装 + + + +### 1.1 安装PaddlePaddle + +> 如果您没有基础的Python运行环境,请参考[运行环境准备](./environment.md)。 + +- 您的机器安装的是CUDA9或CUDA10,请运行以下命令安装 + + ```bash + python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple + ``` + +- 您的机器是CPU,请运行以下命令安装 + + ```bash + python3 -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple + ``` + +更多的版本需求,请参照[飞桨官网安装文档](https://www.paddlepaddle.org.cn/install/quick)中的说明进行操作。 + + + +### 1.2 安装PaddleOCR whl包 ```bash pip install "paddleocr>=2.0.1" # 推荐使用2.0.1+版本 @@ -257,3 +285,11 @@ im_show = draw_structure_result(image, result,font_path=font_path) im_show = Image.fromarray(im_show) im_show.save('result.jpg') ``` + + + +## 3. 小结 + +通过本节内容,相信您已经熟练掌握PaddleOCR whl包的使用方法并获得了初步效果。 + +PaddleOCR是一套丰富领先实用的OCR工具库,打通数据、模型训练、压缩和推理部署全流程,因此在[下一节](./paddleOCR_overview.md)中我们将首先为您介绍PaddleOCR的全景图,然后克隆PaddleOCR项目,正式开启PaddleOCR的应用之旅。 diff --git a/doc/doc_ch/recognition.md b/doc/doc_ch/recognition.md index 4019aa123f6bf1ab0e890b68fbbcd008eb86e440..930fe30a974b6c5073c969165dfad6bb008b0967 100644 --- a/doc/doc_ch/recognition.md +++ b/doc/doc_ch/recognition.md @@ -63,9 +63,9 @@ train_data/rec/train/word_002.jpg 用科技让复杂的世界更简单 | ... 
``` -- 测试集 +- 验证集 -同训练集类似,测试集也需要提供一个包含所有图片的文件夹(test)和一个rec_gt_test.txt,测试集的结构如下所示: +同训练集类似,验证集也需要提供一个包含所有图片的文件夹(test)和一个rec_gt_test.txt,测试集的结构如下所示: ``` |-train_data @@ -93,7 +93,7 @@ train_data/rec/train/word_002.jpg 用科技让复杂的世界更简单 ``` # 训练集标签 wget -P ./train_data/ic15_data https://paddleocr.bj.bcebos.com/dataset/rec_gt_train.txt -# 测试集标签 +# 验证集标签 wget -P ./train_data/ic15_data https://paddleocr.bj.bcebos.com/dataset/rec_gt_test.txt ``` diff --git a/doc/doc_en/environment_en.md b/doc/doc_en/environment_en.md index fc87f10c104628df0268bc6f8910c5914aeba225..6521d3c4144aa579be2075d14826e9dcb9ad9dd6 100644 --- a/doc/doc_en/environment_en.md +++ b/doc/doc_en/environment_en.md @@ -1,18 +1,19 @@ # Environment Preparation -Windows and Mac users are recommended to use Anaconda to build a Python environment, and Linux users are recommended to use docker to build a Python environment. If you are familiar with the Python environment, you can skip to step 2 to install PaddlePaddle. +Windows and Mac users are recommended to use Anaconda to build a Python environment, and Linux users are recommended to use docker to build a Python environment. Recommended working environment: -- PaddlePaddle >= 2.0.0 (2.1.2) +- PaddlePaddle >= 2.1.2 - Python 3.7 - CUDA 10.1 / CUDA 10.2 - cuDNN 7.6 +> If you already have a Python environment installed, you can skip to [PaddleOCR Quick Start](./quickstart_en.md). + * [1. Python Environment Setup](#1) + [1.1 Windows](#1.1) + [1.2 Mac](#1.2) + [1.3 Linux](#1.3) -* [2. Install PaddlePaddle 2.0](#2) @@ -330,21 +331,3 @@ You can also visit [DockerHub](https://hub.docker.com/r/paddlepaddle/paddle/tags # ctrl+P+Q to exit docker, to re-enter docker using the following command: sudo docker container exec -it ppocr /bin/bash ``` - - - -## 2. 
Install PaddlePaddle 2.0 - -- If you have CUDA 9 or CUDA 10 installed on your machine, please run the following command to install - -```bash -python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple -``` - -- If you have no available GPU on your machine, please run the following command to install the CPU version - -```bash -python3 -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple -``` - -For more software version requirements, please refer to the instructions in [Installation Document](https://www.paddlepaddle.org.cn/install/quick) for operation. diff --git a/doc/doc_en/models_and_config_en.md b/doc/doc_en/models_and_config_en.md deleted file mode 100644 index c47fb5597eb56c823dff4c6d52cf3b114f3d9c0e..0000000000000000000000000000000000000000 --- a/doc/doc_en/models_and_config_en.md +++ /dev/null @@ -1,48 +0,0 @@ -# PP-OCR Model and Configuration -The chapter on PP-OCR model and configuration file mainly adds some basic concepts of OCR model and the content and role of configuration file to have a better experience in the subsequent parameter adjustment and training of the model. - -This chapter contains three parts. Firstly, [PP-OCR Model Download](./models_list_en.md) explains the concept of PP-OCR model types and provides links to download all models. Then in [Yml Configuration](./config_en.md) details the parameters needed to fine-tune the PP-OCR models. The final [Python Inference for PP-OCR Model Library](./inference_ppocr_en.md) is an introduction to the use of the PP-OCR model library in the first section, which can quickly utilize the rich model library models to obtain test results through the Python inference engine. - ------- - -Let's first understand some basic concepts. 
- -- [INTRODUCTION ABOUT OCR](#introduction-about-ocr) - * [BASIC CONCEPTS OF OCR DETECTION MODEL](#basic-concepts-of-ocr-detection-model) - * [Basic concepts of OCR recognition model](#basic-concepts-of-ocr-recognition-model) - * [PP-OCR model](#pp-ocr-model) - * [And a table of contents](#and-a-table-of-contents) - * [On the right](#on-the-right) - - -## 1. INTRODUCTION ABOUT OCR - -This section briefly introduces the basic concepts of OCR detection model and recognition model, and introduces PaddleOCR's PP-OCR model. - -OCR (Optical Character Recognition, Optical Character Recognition) is currently the general term for text recognition. It is not limited to document or book text recognition, but also includes recognizing text in natural scenes. It can also be called STR (Scene Text Recognition). - -OCR text recognition generally includes two parts, text detection and text recognition. The text detection module first uses detection algorithms to detect text lines in the image. And then the recognition algorithm to identify the specific text in the text line. - - -### 1.1 BASIC CONCEPTS OF OCR DETECTION MODEL - -Text detection can locate the text area in the image, and then usually mark the word or text line in the form of a bounding box. Traditional text detection algorithms mostly extract features manually, which are characterized by fast speed and good effect in simple scenes, but the effect will be greatly reduced when faced with natural scenes. Currently, deep learning methods are mostly used. - -Text detection algorithms based on deep learning can be roughly divided into the following categories: -1. Method based on target detection. Generally, after the text box is predicted, the final text box is filtered through NMS, which is mostly four-point text box, which is not ideal for curved text scenes. Typical algorithms are methods such as EAST and Text Box. -2. Method based on text segmentation. 
The text line is regarded as the segmentation target, and then the external text box is constructed through the segmentation result, which can handle curved text, and the effect is not ideal for the text cross scene problem. Typical algorithms are DB, PSENet and other methods. -3. Hybrid target detection and segmentation method. - - -### 1.2 Basic concepts of OCR recognition model - -The input of the OCR recognition algorithm is generally text lines images which has less background information, and the text information occupies the main part. The recognition algorithm can be divided into two types of algorithms: -1. CTC-based method. The text prediction module of the recognition algorithm is based on CTC, and the commonly used algorithm combination is CNN+RNN+CTC. There are also some algorithms that try to add transformer modules to the network and so on. -2. Attention-based method. The text prediction module of the recognition algorithm is based on Attention, and the commonly used algorithm combination is CNN+RNN+Attention. - - -### 1.3 PP-OCR model - -PaddleOCR integrates many OCR algorithms, text detection algorithms include DB, EAST, SAST, etc., text recognition algorithms include CRNN, RARE, StarNet, Rosetta, SRN and other algorithms. - -Among them, PaddleOCR has released the PP-OCR series model for the general OCR in Chinese and English natural scenes. The PP-OCR model is composed of the DB+CRNN algorithm. It uses massive Chinese data training and model tuning methods to have high text detection and recognition capabilities in Chinese scenes. And PaddleOCR has launched a high-precision and ultra-lightweight PP-OCRv2 model. The detection model is only 3M, and the recognition model is only 8.5M. Using [PaddleSlim](https://github.com/PaddlePaddle/PaddleSlim)'s model quantification method, the detection model can be compressed to 0.8M without reducing the accuracy. The recognition is compressed to 3M, which is more suitable for mobile deployment scenarios. 
diff --git a/doc/doc_en/pgnet_en.md b/doc/doc_en/pgnet_en.md index c7cb3221ccfd897e2fd9062a828c2fe0ceb42024..69df605c3848d23868d7be21610dc8f8c12487e8 100644 --- a/doc/doc_en/pgnet_en.md +++ b/doc/doc_en/pgnet_en.md @@ -92,7 +92,7 @@ rgb/img11.jpg [{"transcription": "ASRAMA", "points": [[214.0, 325.0], [235.0, ``` The image annotation after **json.dumps()** encoding is a list containing multiple dictionaries. -The `points` in the dictionary represent the coordinates (x, y) of the four points of the text box, arranged clockwise from the point at the upper left corner. +The `points` in the dictionary represent the multi-point coordinates (such as: 4 points, 8 points and 14 points, etc.) of the text box, arranged clockwise from the point at the upper left corner. `transcription` represents the text of the current text box. **When its content is "###" it means that the text box is invalid and will be skipped during training.** diff --git a/doc/doc_en/quickstart_en.md b/doc/doc_en/quickstart_en.md index 240a4ba11f3b7df0c518c841d9acee0ae88fcfa8..e44345a8e65f6efc94f83604590d980e052f2abd 100644 --- a/doc/doc_en/quickstart_en.md +++ b/doc/doc_en/quickstart_en.md @@ -1,7 +1,9 @@ # PaddleOCR Quick Start -+ [1. Install PaddleOCR Whl Package](#1-install-paddleocr-whl-package) ++ [1. Installation](#1installation) + + [1.1 Install PaddlePaddle](#11-install-paddlepaddle) + + [1.2 Install PaddleOCR Whl Package](#12-install-paddleocr-whl-package) * [2. Easy-to-Use](#2-easy-to-use) + [2.1 Use by Command Line](#21-use-by-command-line) - [2.1.1 English and Chinese Model](#211-english-and-chinese-model) @@ -10,12 +12,35 @@ + [2.2 Use by Code](#22-use-by-code) - [2.2.1 Chinese & English Model and Multilingual Model](#221-chinese---english-model-and-multilingual-model) - [2.2.2 Layout Analysis](#222-layoutAnalysis) +* [3. Summary](#3) + +## 1. Installation - + -## 1. 
Install PaddleOCR Whl Package +### 1.1 Install PaddlePaddle + +> If you do not have a Python environment, please refer to [Environment Preparation](./environment_en.md). + +- If you have CUDA 9 or CUDA 10 installed on your machine, please run the following command to install + + ```bash + python3 -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple + ``` + +- If you have no available GPU on your machine, please run the following command to install the CPU version + + ```bash + python3 -m pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple + ``` + +For more software version requirements, please refer to the instructions in [Installation Document](https://www.paddlepaddle.org.cn/install/quick) for operation. + + + +### 1.2 Install PaddleOCR Whl Package ```bash pip install "paddleocr>=2.0.1" # Recommend to use version 2.0.1+ @@ -248,3 +273,11 @@ im_show = draw_structure_result(image, result,font_path=font_path) im_show = Image.fromarray(im_show) im_show.save('result.jpg') ``` + + + +## 3. Summary + +In this section, you have mastered the use of PaddleOCR whl packages and obtained results. + +PaddleOCR is a rich and practical OCR tool library that opens up the whole process of data, model training, compression and inference deployment, so in the [next section](./paddleOCR_overview_en.md) we will first introduce you to the overview of PaddleOCR, and then clone the PaddleOCR project to start the application journey of PaddleOCR. 
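The annotation format described in the `pgnet_en.md` hunk above (an image path followed by a JSON list of boxes, each with `transcription` and `points`) can be read with a few lines of Python. This is a minimal sketch, not PaddleOCR's own loader. It assumes the path and the JSON are separated by a tab; the helper name `parse_label_line` and the sample line are made up for illustration. Boxes whose transcription is `###` are dropped, mirroring the rule that they are skipped during training.

```python
import json
from typing import Dict, List, Tuple


def parse_label_line(line: str) -> Tuple[str, List[Dict]]:
    """Split one label line into (image_path, valid_boxes)."""
    # Assumption: the image path and the JSON annotation are tab-separated.
    image_path, annotation = line.rstrip("\n").split("\t", 1)
    boxes = json.loads(annotation)
    # "###" marks an invalid text box that training should skip.
    valid = [box for box in boxes if box["transcription"] != "###"]
    return image_path, valid


if __name__ == "__main__":
    # Hypothetical sample line; the coordinates are not from any real dataset.
    sample = (
        "img/example.jpg\t"
        '[{"transcription": "HELLO", "points": [[0, 0], [40, 0], [40, 10], [0, 10]]}, '
        '{"transcription": "###", "points": [[0, 20], [40, 20], [40, 30], [0, 30]]}]'
    )
    path, boxes = parse_label_line(sample)
    print(path, [box["transcription"] for box in boxes])
```

Because `points` is kept as a plain list, the same function works whether a box has 4, 8, or 14 points.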
diff --git a/ppocr/losses/det_pse_loss.py b/ppocr/losses/det_pse_loss.py index 9b8ac4b5a5dfac176c398dd0a9e490e5ca67ad5f..6b31343ed4d1687ee8ca44592fba0331b0b287dc 100644 --- a/ppocr/losses/det_pse_loss.py +++ b/ppocr/losses/det_pse_loss.py @@ -121,9 +121,9 @@ class PSELoss(nn.Layer): if neg_num == 0: selected_mask = training_mask - selected_mask = selected_mask.view( - 1, selected_mask.shape[0], - selected_mask.shape[1]).astype('float32') + selected_mask = selected_mask.reshape( + [1, selected_mask.shape[0], selected_mask.shape[1]]).astype( + 'float32') return selected_mask neg_score = paddle.masked_select(score, gt_text <= 0.5) diff --git a/ppstructure/docs/kie.md b/ppstructure/docs/kie.md index 21854b0d24b0b2bbe6a4612b1112b201c5df255d..a424968a9b5a33132afe52a4850cfe541919ae1c 100644 --- a/ppstructure/docs/kie.md +++ b/ppstructure/docs/kie.md @@ -1,64 +1,67 @@ -# 关键信息提取(Key Information Extraction) +# Key Information Extraction(KIE) -本节介绍PaddleOCR中关键信息提取SDMGR方法的快速使用和训练方法。 +This section provides a tutorial example on how to quickly use, train, and evaluate a key information extraction(KIE) model, [SDMGR](https://arxiv.org/abs/2103.14470), in PaddleOCR. -SDMGR是一个关键信息提取算法,将每个检测到的文本区域分类为预定义的类别,如订单ID、发票号码,金额等。 +[SDMGR(Spatial Dual-Modality Graph Reasoning)](https://arxiv.org/abs/2103.14470) is a KIE algorithm that classifies each detected text region into predefined categories, such as order ID, invoice number, amount, and etc. -* [1. 快速使用](#1-----) -* [2. 执行训练](#2-----) -* [3. 执行评估](#3-----) +* [1. Quick Use](#1-----) +* [2. Model Training](#2-----) +* [3. Model Evaluation](#3-----) -## 1. 快速使用 -训练和测试的数据采用wildreceipt数据集,通过如下指令下载数据集: +## 1. Quick Use -``` +[Wildreceipt dataset](https://paperswithcode.com/dataset/wildreceipt) is used for this tutorial. 
It contains 1765 photos, with 25 classes, and 50000 text boxes, which can be downloaded by wget: + +```shell wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/wildreceipt.tar && tar xf wildreceipt.tar ``` -执行预测: +Download the pretrained model and predict the result: -``` +```shell cd PaddleOCR/ wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/kie_vgg16.tar && tar xf kie_vgg16.tar python3.7 tools/infer_kie.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=kie_vgg16/best_accuracy Global.infer_img=../wildreceipt/1.txt ``` -执行预测后的结果保存在`./output/sdmgr_kie/predicts_kie.txt`文件中,可视化结果保存在`/output/sdmgr_kie/kie_results/`目录下。 +The prediction result is saved as `./output/sdmgr_kie/predicts_kie.txt`, and the visualization results are saved in the folder`/output/sdmgr_kie/kie_results/`. -可视化结果如下图所示: +The visualization results are shown in the figure below:
-## 2. 执行训练 +## 2. Model Training -创建数据集软链到PaddleOCR/train_data目录下: -``` +Create a softlink to the folder, `PaddleOCR/train_data`: +```shell cd PaddleOCR/ && mkdir train_data && cd train_data ln -s ../../wildreceipt ./ ``` -训练采用的配置文件是configs/kie/kie_unet_sdmgr.yml,配置文件中默认训练数据路径是`train_data/wildreceipt`,准备好数据后,可以通过如下指令执行训练: -``` +The configuration file used for training is `configs/kie/kie_unet_sdmgr.yml`. The default training data path in the configuration file is `train_data/wildreceipt`. After preparing the data, you can execute the model training with the following command: +```shell python3.7 tools/train.py -c configs/kie/kie_unet_sdmgr.yml -o Global.save_model_dir=./output/kie/ ``` -## 3. 执行评估 -``` +## 3. Model Evaluation + +After training, you can execute the model evaluation with the following command: + +```shell python3.7 tools/eval.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=./output/kie/best_accuracy ``` - -**参考文献:** +**Reference:** diff --git a/ppstructure/docs/kie_ch.md b/ppstructure/docs/kie_ch.md new file mode 100644 index 0000000000000000000000000000000000000000..21854b0d24b0b2bbe6a4612b1112b201c5df255d --- /dev/null +++ b/ppstructure/docs/kie_ch.md @@ -0,0 +1,74 @@ + + +# 关键信息提取(Key Information Extraction) + +本节介绍PaddleOCR中关键信息提取SDMGR方法的快速使用和训练方法。 + +SDMGR是一个关键信息提取算法,将每个检测到的文本区域分类为预定义的类别,如订单ID、发票号码,金额等。 + + +* [1. 快速使用](#1-----) +* [2. 执行训练](#2-----) +* [3. 执行评估](#3-----) + + +## 1. 快速使用 + +训练和测试的数据采用wildreceipt数据集,通过如下指令下载数据集: + +``` +wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/wildreceipt.tar && tar xf wildreceipt.tar +``` + +执行预测: + +``` +cd PaddleOCR/ +wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/kie_vgg16.tar && tar xf kie_vgg16.tar +python3.7 tools/infer_kie.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=kie_vgg16/best_accuracy Global.infer_img=../wildreceipt/1.txt +``` + +执行预测后的结果保存在`./output/sdmgr_kie/predicts_kie.txt`文件中,可视化结果保存在`/output/sdmgr_kie/kie_results/`目录下。 + +可视化结果如下图所示: + +
+ +
+ + +## 2. 执行训练 + +创建数据集软链到PaddleOCR/train_data目录下: +``` +cd PaddleOCR/ && mkdir train_data && cd train_data + +ln -s ../../wildreceipt ./ +``` + +训练采用的配置文件是configs/kie/kie_unet_sdmgr.yml,配置文件中默认训练数据路径是`train_data/wildreceipt`,准备好数据后,可以通过如下指令执行训练: +``` +python3.7 tools/train.py -c configs/kie/kie_unet_sdmgr.yml -o Global.save_model_dir=./output/kie/ +``` + +## 3. 执行评估 + +``` +python3.7 tools/eval.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=./output/kie/best_accuracy +``` + + +**参考文献:** + + + +```bibtex +@misc{sun2021spatial, + title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction}, + author={Hongbin Sun and Zhanghui Kuang and Xiaoyu Yue and Chenhao Lin and Wayne Zhang}, + year={2021}, + eprint={2103.14470}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/ppstructure/docs/kie_en.md b/ppstructure/docs/kie_en.md deleted file mode 100644 index a424968a9b5a33132afe52a4850cfe541919ae1c..0000000000000000000000000000000000000000 --- a/ppstructure/docs/kie_en.md +++ /dev/null @@ -1,77 +0,0 @@ - - -# Key Information Extraction(KIE) - -This section provides a tutorial example on how to quickly use, train, and evaluate a key information extraction(KIE) model, [SDMGR](https://arxiv.org/abs/2103.14470), in PaddleOCR. - -[SDMGR(Spatial Dual-Modality Graph Reasoning)](https://arxiv.org/abs/2103.14470) is a KIE algorithm that classifies each detected text region into predefined categories, such as order ID, invoice number, amount, and etc. - - -* [1. Quick Use](#1-----) -* [2. Model Training](#2-----) -* [3. Model Evaluation](#3-----) - - - -## 1. Quick Use - -[Wildreceipt dataset](https://paperswithcode.com/dataset/wildreceipt) is used for this tutorial. 
It contains 1765 photos, with 25 classes, and 50000 text boxes, which can be downloaded by wget: - -```shell -wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/wildreceipt.tar && tar xf wildreceipt.tar -``` - -Download the pretrained model and predict the result: - -```shell -cd PaddleOCR/ -wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/kie_vgg16.tar && tar xf kie_vgg16.tar -python3.7 tools/infer_kie.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=kie_vgg16/best_accuracy Global.infer_img=../wildreceipt/1.txt -``` - -The prediction result is saved as `./output/sdmgr_kie/predicts_kie.txt`, and the visualization results are saved in the folder`/output/sdmgr_kie/kie_results/`. - -The visualization results are shown in the figure below: - -
- -
- - -## 2. Model Training - -Create a softlink to the folder, `PaddleOCR/train_data`: -```shell -cd PaddleOCR/ && mkdir train_data && cd train_data - -ln -s ../../wildreceipt ./ -``` - -The configuration file used for training is `configs/kie/kie_unet_sdmgr.yml`. The default training data path in the configuration file is `train_data/wildreceipt`. After preparing the data, you can execute the model training with the following command: -```shell -python3.7 tools/train.py -c configs/kie/kie_unet_sdmgr.yml -o Global.save_model_dir=./output/kie/ -``` - - -## 3. Model Evaluation - -After training, you can execute the model evaluation with the following command: - -```shell -python3.7 tools/eval.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=./output/kie/best_accuracy -``` - -**Reference:** - - - -```bibtex -@misc{sun2021spatial, - title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction}, - author={Hongbin Sun and Zhanghui Kuang and Xiaoyu Yue and Chenhao Lin and Wayne Zhang}, - year={2021}, - eprint={2103.14470}, - archivePrefix={arXiv}, - primaryClass={cs.CV} -} -``` diff --git a/ppstructure/vqa/README-en.md b/ppstructure/vqa/README-en.md deleted file mode 100644 index 168640874aa5e2339e81d7dc467e515d5aa9101e..0000000000000000000000000000000000000000 --- a/ppstructure/vqa/README-en.md +++ /dev/null @@ -1,331 +0,0 @@ -# Document Visual Q&A(DOC-VQA) - -Document Visual Q&A, mainly for the image content of the question and answer, DOC-VQA is a type of VQA task, DOC-VQA mainly asks questions about the textual content of text images. - -The DOC-VQA algorithm in PP-Structure is developed based on PaddleNLP natural language processing algorithm library. - -The main features are as follows: - -- Integrated LayoutXLM model and PP-OCR prediction engine. -- Support Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multi-modal methods. Based on SER task, text recognition and classification in images can be completed. 
Based on THE RE task, we can extract the relation of the text content in the image, such as judge the problem pair. - -- Support custom training for SER and RE tasks. - -- Support OCR+SER end-to-end system prediction and evaluation. - -- Support OCR+SER+RE end-to-end system prediction. - -**Note**: This project is based on the open source implementation of [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2, and at the same time, after in-depth polishing by the flying Paddle team and the Industrial and **Commercial Bank of China** in the scene of real estate certificate, jointly open source. - - -## 1.Performance - -We evaluated the algorithm on [XFUN](https://github.com/doc-analysis/XFUND) 's Chinese data set, and the performance is as follows - -| Model | Task | F1 | Model Download Link | -|:---:|:---:|:---:| :---:| -| LayoutXLM | RE | 0.7113 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_re_pretrained.tar) | -| LayoutXLM | SER | 0.9056 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar) | -| LayoutLM | SER | 0.78 | [Link](https://paddleocr.bj.bcebos.com/pplayout/LayoutLM_ser_pretrained.tar) | - - - -## 2.Demonstration - -**Note**: the test images are from the xfun dataset. - -### 2.1 SER - -![](./images/result_ser/zh_val_0_ser.jpg) | ![](./images/result_ser/zh_val_42_ser.jpg) ----|--- - -Different colored boxes in the figure represent different categories. For xfun dataset, there are three categories: query, answer and header: - -* Dark purple: header -* Light purple: query -* Army green: answer - -The corresponding category and OCR recognition results are also marked at the top left of the OCR detection box. - - -### 2.2 RE - -![](./images/result_re/zh_val_21_re.jpg) | ![](./images/result_re/zh_val_40_re.jpg) ----|--- - - -In the figure, the red box represents the question, the blue box represents the answer, and the question and answer are connected by green lines. 
The corresponding category and OCR recognition results are also marked at the top left of the OCR detection box. - - -## 3. Setup - -### 3.1 Installation dependency - -- **(1) Install PaddlePaddle** - -```bash -pip3 install --upgrade pip - -# GPU PaddlePaddle Install -python3 -m pip install paddlepaddle-gpu==2.2 -i https://mirror.baidu.com/pypi/simple - -# CPU PaddlePaddle Install -python3 -m pip install paddlepaddle==2.2 -i https://mirror.baidu.com/pypi/simple - -``` -For more requirements, please refer to the [instructions](https://www.paddlepaddle.org.cn/install/quick) in the installation document. - - -### 3.2 Install PaddleOCR (including pp-ocr and VQA) - -- **(1) PIP quick install paddleocr WHL package (forecast only)** - -```bash -pip install paddleocr -``` - -- **(2) Download VQA source code (prediction + training)** - -```bash -[recommended] git clone https://github.com/PaddlePaddle/PaddleOCR - -# If you cannot pull successfully because of network problems, you can also choose to use the hosting on the code cloud: -git clone https://gitee.com/paddlepaddle/PaddleOCR - -# Note: the code cloud hosting code may not be able to synchronize the update of this GitHub project in real time, with a delay of 3 ~ 5 days. Please give priority to the recommended method. -``` - -- **(3) Install PaddleNLP** - -```bash -# You need to use the latest code version of paddlenlp for installation -git clone https://github.com/PaddlePaddle/PaddleNLP -b develop -cd PaddleNLP -pip3 install -e . -``` - - -- **(4) Install requirements for VQA** - -```bash -cd ppstructure/vqa -pip install -r requirements.txt -``` - -## 4.Usage - - -### 4.1 Data and pre training model preparation - -Download address of processed xfun Chinese dataset: [https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar)。 - - -Download and unzip the dataset, and then place the dataset in the current directory. 
- -```shell -wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar -``` - -If you want to convert data sets in other languages in xfun, you can refer to [xfun data conversion script.](helper/trans_xfun_data.py)) - -If you want to experience the prediction process directly, you can download the pre training model provided by us, skip the training process and predict directly. - - -### 4.2 SER Task - -* Start training - -```shell -python3.7 train_ser.py \ - --model_name_or_path "layoutxlm-base-uncased" \ - --ser_model_type "LayoutXLM" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "./output/ser/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --evaluate_during_training \ - --seed 2048 -``` - -Finally, Precision, Recall, F1 and other indicators will be printed, and the model and training log will be saved in/ In the output/Ser/ folder. 
- -* Recovery training - -```shell -python3.7 train_ser.py \ - --model_name_or_path "model_path" \ - --ser_model_type "LayoutXLM" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "./output/ser/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --evaluate_during_training \ - --num_workers 8 \ - --seed 2048 \ - --resume -``` - -* Evaluation -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 eval_ser.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --output_dir "output/ser/" \ - --seed 2048 -``` -Finally, Precision, Recall, F1 and other indicators will be printed - -* The OCR recognition results provided in the evaluation set are used for prediction - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3.7 infer_ser.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --output_dir "output/ser/" \ - --infer_imgs "XFUND/zh_val/image/" \ - --ocr_json_path "XFUND/zh_val/xfun_normalize_val.json" -``` - -It will end up in output_res The visual image of the prediction result and the text file of the prediction result are saved in the res directory. The file name is infer_ results.txt. 
- -* Using OCR engine + SER concatenation results - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3.7 infer_ser_e2e.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --max_seq_length 512 \ - --output_dir "output/ser_e2e/" \ - --infer_imgs "images/input/zh_val_0.jpg" -``` - -* End-to-end evaluation of OCR engine + SER prediction system - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3.7 helper/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt -``` - - -### 4.3 RE Task - -* Start training - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 train_re.py \ - --model_name_or_path "layoutxlm-base-uncased" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "output/re/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --per_gpu_train_batch_size 8 \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --evaluate_during_training \ - --seed 2048 - -``` - -* Resume training - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 train_re.py \ - --model_name_or_path "model_path" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --num_train_epochs 2 \ - --eval_steps 10 \ - --output_dir "output/re/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --per_gpu_train_batch_size 8 \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --evaluate_during_training \ - --seed 2048 \ - --resume - -``` - -Finally, Precision, Recall, F1 and other indicators will be printed, and the model and 
training log will be saved in the output/RE file folder. - -* Evaluation -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 eval_re.py \ - --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --max_seq_length 512 \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --output_dir "output/re/" \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --seed 2048 -``` -Finally, Precision, Recall, F1 and other indicators will be printed - - -* The OCR recognition results provided in the evaluation set are used for prediction - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 infer_re.py \ - --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --max_seq_length 512 \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --output_dir "output/re/" \ - --per_gpu_eval_batch_size 1 \ - --seed 2048 -``` - -The visual image of the prediction result and the text file of the prediction result are saved in the output_res file folder, the file name is`infer_results.txt`。 - -* Concatenation results using OCR engine + SER+ RE - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3.7 infer_ser_re_e2e.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --re_model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --max_seq_length 512 \ - --output_dir "output/ser_re_e2e/" \ - --infer_imgs "images/input/zh_val_21.jpg" -``` - -## Reference - -- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf -- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm -- XFUND dataset, https://github.com/doc-analysis/XFUND diff --git a/ppstructure/vqa/README.md b/ppstructure/vqa/README.md index 
619ada71a82eacd88abd39199d0b220dc6c64c9b..168640874aa5e2339e81d7dc467e515d5aa9101e 100644 --- a/ppstructure/vqa/README.md +++ b/ppstructure/vqa/README.md @@ -1,318 +1,331 @@ -# 文档视觉问答(DOC-VQA) - -VQA指视觉问答,主要针对图像内容进行提问和回答,DOC-VQA是VQA任务中的一种,DOC-VQA主要针对文本图像的文字内容提出问题。 - -PP-Structure 里的 DOC-VQA算法基于PaddleNLP自然语言处理算法库进行开发。 - -主要特性如下: - -- 集成[LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf)模型以及PP-OCR预测引擎。 -- 支持基于多模态方法的语义实体识别 (Semantic Entity Recognition, SER) 以及关系抽取 (Relation Extraction, RE) 任务。基于 SER 任务,可以完成对图像中的文本识别与分类;基于 RE 任务,可以完成对图象中的文本内容的关系提取,如判断问题对(pair)。 -- 支持SER任务和RE任务的自定义训练。 -- 支持OCR+SER的端到端系统预测与评估。 -- 支持OCR+SER+RE的端到端系统预测。 - -**Note**:本项目基于 [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) 在Paddle 2.2上的开源实现,同时经过飞桨团队与**中国工商银行**在不动产证场景深入打磨,联合开源。 - - -## 1.性能 - -我们在 [XFUN](https://github.com/doc-analysis/XFUND) 的中文数据集上对算法进行了评估,性能如下 - -| 模型 | 任务 | f1 | 模型下载地址 | -|:---:|:---:|:---:| :---:| -| LayoutXLM | RE | 0.7113 | [链接](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_re_pretrained.tar) | -| LayoutXLM | SER | 0.9056 | [链接](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar) | -| LayoutLM | SER | 0.78 | [链接](https://paddleocr.bj.bcebos.com/pplayout/LayoutLM_ser_pretrained.tar) | - - - -## 2.效果演示 - -**注意:** 测试图片来源于XFUN数据集。 - -### 2.1 SER - -![](./images/result_ser/zh_val_0_ser.jpg) | ![](./images/result_ser/zh_val_42_ser.jpg) ----|--- - -图中不同颜色的框表示不同的类别,对于XFUN数据集,有`QUESTION`, `ANSWER`, `HEADER` 3种类别 - -* 深紫色:HEADER -* 浅紫色:QUESTION -* 军绿色:ANSWER - -在OCR检测框的左上方也标出了对应的类别和OCR识别结果。 - - -### 2.2 RE - -![](./images/result_re/zh_val_21_re.jpg) | ![](./images/result_re/zh_val_40_re.jpg) ----|--- - - -图中红色框表示问题,蓝色框表示答案,问题和答案之间使用绿色线连接。在OCR检测框的左上方也标出了对应的类别和OCR识别结果。 - - -## 3.安装 - -### 3.1 安装依赖 - -- **(1) 安装PaddlePaddle** - -```bash -python3 -m pip install --upgrade pip - -# GPU安装 -python3 -m pip install paddlepaddle-gpu==2.2 -i https://mirror.baidu.com/pypi/simple - -# CPU安装 -python3 -m pip install paddlepaddle==2.2 -i 
https://mirror.baidu.com/pypi/simple - -``` -更多需求,请参照[安装文档](https://www.paddlepaddle.org.cn/install/quick)中的说明进行操作。 - - -### 3.2 安装PaddleOCR(包含 PP-OCR 和 VQA ) - -- **(1)pip快速安装PaddleOCR whl包(仅预测)** - -```bash -python3 -m pip install paddleocr -``` - -- **(2)下载VQA源码(预测+训练)** - -```bash -【推荐】git clone https://github.com/PaddlePaddle/PaddleOCR - -# 如果因为网络问题无法pull成功,也可选择使用码云上的托管: -git clone https://gitee.com/paddlepaddle/PaddleOCR - -# 注:码云托管代码可能无法实时同步本github项目更新,存在3~5天延时,请优先使用推荐方式。 -``` - -- **(3)安装VQA的`requirements`** - -```bash -cd ppstructure/vqa -python3 -m pip install -r requirements.txt -``` - -## 4. 使用 - - -### 4.1 数据和预训练模型准备 - -处理好的XFUN中文数据集下载地址:[https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar)。 - - -下载并解压该数据集,解压后将数据集放置在当前目录下。 - -```shell -wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar -``` - -如果希望转换XFUN中其他语言的数据集,可以参考[XFUN数据转换脚本](helper/trans_xfun_data.py)。 - -如果希望直接体验预测过程,可以下载我们提供的预训练模型,跳过训练过程,直接预测即可。 - - -### 4.2 SER任务 - -* 启动训练 - -```shell -python3 train_ser.py \ - --model_name_or_path "layoutxlm-base-uncased" \ - --ser_model_type "LayoutXLM" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "./output/ser/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --evaluate_during_training \ - --seed 2048 -``` - -最终会打印出`precision`, `recall`, `f1`等指标,模型和训练日志会保存在`./output/ser/`文件夹中。 - -* 恢复训练 - -```shell -python3 train_ser.py \ - --model_name_or_path "model_path" \ - --ser_model_type "LayoutXLM" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "./output/ser/" \ 
- --learning_rate 5e-5 \ - --warmup_steps 50 \ - --evaluate_during_training \ - --num_workers 8 \ - --seed 2048 \ - --resume -``` - -* 评估 -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 eval_ser.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --output_dir "output/ser/" \ - --seed 2048 -``` -最终会打印出`precision`, `recall`, `f1`等指标 - -* 使用评估集合中提供的OCR识别结果进行预测 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 infer_ser.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --output_dir "output/ser/" \ - --infer_imgs "XFUND/zh_val/image/" \ - --ocr_json_path "XFUND/zh_val/xfun_normalize_val.json" -``` - -最终会在`output_res`目录下保存预测结果可视化图像以及预测结果文本文件,文件名为`infer_results.txt`。 - -* 使用`OCR引擎 + SER`串联结果 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 infer_ser_e2e.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --max_seq_length 512 \ - --output_dir "output/ser_e2e/" \ - --infer_imgs "images/input/zh_val_0.jpg" -``` - -* 对`OCR引擎 + SER`预测系统进行端到端评估 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 helper/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt -``` - - -### 4.3 RE任务 - -* 启动训练 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 train_re.py \ - --model_name_or_path "layoutxlm-base-uncased" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "output/re/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --per_gpu_train_batch_size 8 \ - 
--per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --evaluate_during_training \ - --seed 2048 - -``` - -* 恢复训练 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 train_re.py \ - --model_name_or_path "model_path" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --num_train_epochs 2 \ - --eval_steps 10 \ - --output_dir "output/re/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --per_gpu_train_batch_size 8 \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --evaluate_during_training \ - --seed 2048 \ - --resume - -``` - -最终会打印出`precision`, `recall`, `f1`等指标,模型和训练日志会保存在`./output/re/`文件夹中。 - -* 评估 -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 eval_re.py \ - --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --max_seq_length 512 \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --output_dir "output/re/" \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --seed 2048 -``` -最终会打印出`precision`, `recall`, `f1`等指标 - - -* 使用评估集合中提供的OCR识别结果进行预测 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 infer_re.py \ - --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --max_seq_length 512 \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --output_dir "output/re/" \ - --per_gpu_eval_batch_size 1 \ - --seed 2048 -``` - -最终会在`output_res`目录下保存预测结果可视化图像以及预测结果文本文件,文件名为`infer_results.txt`。 - -* 使用`OCR引擎 + SER + RE`串联结果 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 infer_ser_re_e2e.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --re_model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --ser_model_type "LayoutXLM" \ - 
--max_seq_length 512 \ - --output_dir "output/ser_re_e2e/" \ - --infer_imgs "images/input/zh_val_21.jpg" -``` - -## 参考链接 - -- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf -- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm -- XFUND dataset, https://github.com/doc-analysis/XFUND +# Document Visual Q&A(DOC-VQA) + +Document Visual Q&A, mainly for the image content of the question and answer, DOC-VQA is a type of VQA task, DOC-VQA mainly asks questions about the textual content of text images. + +The DOC-VQA algorithm in PP-Structure is developed based on PaddleNLP natural language processing algorithm library. + +The main features are as follows: + +- Integrated LayoutXLM model and PP-OCR prediction engine. +- Support Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multi-modal methods. Based on SER task, text recognition and classification in images can be completed. Based on THE RE task, we can extract the relation of the text content in the image, such as judge the problem pair. + +- Support custom training for SER and RE tasks. + +- Support OCR+SER end-to-end system prediction and evaluation. + +- Support OCR+SER+RE end-to-end system prediction. + +**Note**: This project is based on the open source implementation of [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2, and at the same time, after in-depth polishing by the flying Paddle team and the Industrial and **Commercial Bank of China** in the scene of real estate certificate, jointly open source. 
+
+
+## 1. Performance
+
+We evaluated the algorithm on the Chinese dataset of [XFUND](https://github.com/doc-analysis/XFUND). The results are as follows:
+
+| Model | Task | F1 | Model Download Link |
+|:---:|:---:|:---:| :---:|
+| LayoutXLM | RE | 0.7113 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_re_pretrained.tar) |
+| LayoutXLM | SER | 0.9056 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar) |
+| LayoutLM | SER | 0.78 | [Link](https://paddleocr.bj.bcebos.com/pplayout/LayoutLM_ser_pretrained.tar) |
+
+
+
+## 2. Demonstration
+
+**Note**: The test images are from the XFUND dataset.
+
+### 2.1 SER
+
+![](./images/result_ser/zh_val_0_ser.jpg) | ![](./images/result_ser/zh_val_42_ser.jpg)
+---|---
+
+Boxes of different colors in the figure represent different categories. The XFUND dataset has three categories, `QUESTION`, `ANSWER`, and `HEADER`:
+
+* Dark purple: HEADER
+* Light purple: QUESTION
+* Army green: ANSWER
+
+The corresponding category and OCR recognition result are also marked at the top left of each OCR detection box.
+
+
+### 2.2 RE
+
+![](./images/result_re/zh_val_21_re.jpg) | ![](./images/result_re/zh_val_40_re.jpg)
+---|---
+
+
+In the figure, red boxes mark questions, blue boxes mark answers, and each question is connected to its answer by a green line. The corresponding category and OCR recognition result are also marked at the top left of each OCR detection box.
+
+
+## 3. Setup
+
+### 3.1 Install dependencies
+
+- **(1) Install PaddlePaddle**
+
+```bash
+pip3 install --upgrade pip
+
+# Install the GPU version of PaddlePaddle
+python3 -m pip install paddlepaddle-gpu==2.2 -i https://mirror.baidu.com/pypi/simple
+
+# Install the CPU version of PaddlePaddle
+python3 -m pip install paddlepaddle==2.2 -i https://mirror.baidu.com/pypi/simple
+
+```
+For other requirements, refer to the [instructions](https://www.paddlepaddle.org.cn/install/quick) in the installation document.
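Before moving on, it can help to confirm that Python can actually import the freshly installed package. The following is only a minimal sketch and not part of the original instructions: `check_python_module` and `report_module` are hypothetical helper names, and the sketch assumes `python3` is on the `PATH`.

```shell
# Hypothetical helper (not part of PaddleOCR): succeeds when python3 can
# import the module named in $1, and fails otherwise.
check_python_module() {
    python3 -c "import $1" >/dev/null 2>&1
}

# Print a one-line report for the module named in $1.
report_module() {
    if check_python_module "$1"; then
        echo "$1: importable"
    else
        echo "$1: not importable"
    fi
}
```

For example, `report_module paddle` should print `paddle: importable` once PaddlePaddle is installed. For a fuller check, PaddlePaddle also ships `paddle.utils.run_check()`.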
+
+
+### 3.2 Install PaddleOCR (including PP-OCR and VQA)
+
+- **(1) Quick install of the PaddleOCR whl package with pip (prediction only)**
+
+```bash
+pip install paddleocr
+```
+
+- **(2) Download the VQA source code (prediction + training)**
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleOCR  # recommended
+
+# If the clone fails because of network problems, you can use the mirror hosted on Gitee instead:
+git clone https://gitee.com/paddlepaddle/PaddleOCR
+
+# Note: the Gitee mirror may lag behind this GitHub project by 3 to 5 days, so prefer the recommended method.
+```
+
+- **(3) Install PaddleNLP**
+
+```bash
+# Install PaddleNLP from the latest code on the develop branch
+git clone https://github.com/PaddlePaddle/PaddleNLP -b develop
+cd PaddleNLP
+pip3 install -e .
+```
+
+
+- **(4) Install requirements for VQA**
+
+```bash
+cd ppstructure/vqa
+pip install -r requirements.txt
+```
+
+## 4. Usage
+
+
+### 4.1 Data and pre-trained model preparation
+
+The processed XFUND Chinese dataset can be downloaded from [https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar).
+
+
+Download and extract the dataset, then place it in the current directory.
+
+```shell
+wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar
+```
+
+To convert the XFUND datasets for other languages, refer to the [XFUND data conversion script](helper/trans_xfun_data.py).
+
+To try prediction directly, download the pre-trained models we provide and skip the training step.
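The `wget` command above only downloads the archive, so it still has to be extracted (for instance with `tar -xf XFUND.tar`). As a rough sketch, the helper below checks that the paths used by the training commands in this README exist after extraction. `check_xfund_layout` is a hypothetical name, and the expected layout is inferred from those command-line arguments, not from an official specification of the archive.

```shell
# Hypothetical helper (not part of PaddleOCR): verify that the extracted
# dataset contains the paths referenced by the training commands.
# Usage: check_xfund_layout [dataset_root]   (defaults to ./XFUND)
check_xfund_layout() {
    root="${1:-XFUND}"
    status=0
    for path in \
        zh_train/image \
        zh_train/xfun_normalize_train.json \
        zh_val/image \
        zh_val/xfun_normalize_val.json
    do
        # Report every missing entry on stderr and remember the failure.
        if [ ! -e "$root/$path" ]; then
            echo "missing: $root/$path" >&2
            status=1
        fi
    done
    return "$status"
}
```

A typical sequence would be `tar -xf XFUND.tar` followed by `check_xfund_layout XFUND`, which returns a non-zero status and lists any missing path.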
+
+
+### 4.2 SER Task
+
+* Start training
+
+```shell
+python3.7 train_ser.py \
+    --model_name_or_path "layoutxlm-base-uncased" \
+    --ser_model_type "LayoutXLM" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "./output/ser/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --evaluate_during_training \
+    --seed 2048
+```
+
+Metrics such as `precision`, `recall`, and `f1` are printed at the end, and the model and training log are saved in the `./output/ser/` folder.
+
+* Resume training
+
+```shell
+python3.7 train_ser.py \
+    --model_name_or_path "model_path" \
+    --ser_model_type "LayoutXLM" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "./output/ser/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --evaluate_during_training \
+    --num_workers 8 \
+    --seed 2048 \
+    --resume
+```
+
+* Evaluation
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 eval_ser.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --output_dir "output/ser/" \
+    --seed 2048
+```
+Metrics such as `precision`, `recall`, and `f1` are printed at the end.
+
+* Predict using the OCR recognition results provided in the evaluation set
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3.7 infer_ser.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --output_dir "output/ser/" \
+    --infer_imgs "XFUND/zh_val/image/" \
+    --ocr_json_path "XFUND/zh_val/xfun_normalize_val.json"
+```
+
+The visualized prediction images and the prediction result text file, named `infer_results.txt`, are saved in the `output_res` directory.
+
+* Predict with the chained `OCR engine + SER` pipeline
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3.7 infer_ser_e2e.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --max_seq_length 512 \
+    --output_dir "output/ser_e2e/" \
+    --infer_imgs "images/input/zh_val_0.jpg"
+```
+
+* Evaluate the `OCR engine + SER` prediction system end to end
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3.7 helper/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt
+```
+
+
+### 4.3 RE Task
+
+* Start training
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 train_re.py \
+    --model_name_or_path "layoutxlm-base-uncased" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "output/re/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --per_gpu_train_batch_size 8 \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --evaluate_during_training \
+    --seed 2048
+
+```
+
+* Resume training
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 train_re.py \
+    --model_name_or_path "model_path" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --num_train_epochs 2 \
+    --eval_steps 10 \
+    --output_dir "output/re/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --per_gpu_train_batch_size 8 \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --evaluate_during_training \
+    --seed 2048 \
+    --resume
+
+```
+
+Metrics such as `precision`, `recall`, and `f1` are printed at the end, and the model and training log are saved in the `./output/re/` folder.
+
+* Evaluation
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 eval_re.py \
+    --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --max_seq_length 512 \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --output_dir "output/re/" \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --seed 2048
+```
+Metrics such as `precision`, `recall`, and `f1` are printed at the end.
+
+
+* Predict using the OCR recognition results provided in the evaluation set
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 infer_re.py \
+    --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --max_seq_length 512 \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --output_dir "output/re/" \
+    --per_gpu_eval_batch_size 1 \
+    --seed 2048
+```
+
+The visualized prediction images and the prediction result text file, named `infer_results.txt`, are saved in the `output_res` directory.
+
+* Predict with the chained `OCR engine + SER + RE` pipeline
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3.7 infer_ser_re_e2e.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --re_model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --max_seq_length 512 \
+    --output_dir "output/ser_re_e2e/" \
+    --infer_imgs "images/input/zh_val_21.jpg"
+```
+
+## References
+
+- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
+- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
+- XFUND dataset, https://github.com/doc-analysis/XFUND
diff --git a/ppstructure/vqa/README_ch.md b/ppstructure/vqa/README_ch.md
new file mode 100644
index 0000000000000000000000000000000000000000..619ada71a82eacd88abd39199d0b220dc6c64c9b
--- /dev/null
+++ b/ppstructure/vqa/README_ch.md
@@ -0,0 +1,318 @@
+# Document Visual Question Answering (DOC-VQA)
+
+VQA stands for Visual Question Answering, the task of asking and answering questions about image content. DOC-VQA is one type of VQA task that asks questions about the textual content of document images.
+
+The DOC-VQA algorithm in PP-Structure is built on the PaddleNLP natural language processing library.
+
+The main features are as follows:
+
+- Integrates the [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) model and the PP-OCR prediction engine.
+- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multi-modal methods. The SER task recognizes and classifies the text in an image. The RE task extracts relations between pieces of text in the image, such as identifying question-answer pairs.
+- Supports custom training for SER and RE tasks.
+- Supports OCR+SER end-to-end system prediction and evaluation.
+- Supports OCR+SER+RE end-to-end system prediction.
+
+**Note**: This project is based on the open source implementation of [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2. It was refined by the PaddlePaddle team and the **Industrial and Commercial Bank of China** on a real estate certificate scenario, and is jointly open sourced by both.
+
+
+## 1. Performance
+
+We evaluated the algorithm on the Chinese dataset of [XFUND](https://github.com/doc-analysis/XFUND). The results are as follows:
+
+| Model | Task | f1 | Model download link |
+|:---:|:---:|:---:| :---:|
+| LayoutXLM | RE | 0.7113 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_re_pretrained.tar) |
+| LayoutXLM | SER | 0.9056 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar) |
+| LayoutLM | SER | 0.78 | [Link](https://paddleocr.bj.bcebos.com/pplayout/LayoutLM_ser_pretrained.tar) |
+
+
+
+## 2. Demonstration
+
+**Note:** The test images are from the XFUND dataset.
+
+### 2.1 SER
+
+![](./images/result_ser/zh_val_0_ser.jpg) | ![](./images/result_ser/zh_val_42_ser.jpg)
+---|---
+
+Boxes of different colors in the figure represent different categories. The XFUND dataset has three categories: `QUESTION`, `ANSWER`, and `HEADER`.
+
+* Dark purple: HEADER
+* Light purple: QUESTION
+* Army green: ANSWER
+
+The corresponding category and OCR recognition result are also marked at the top left of each OCR detection box.
+
+
+### 2.2 RE
+
+![](./images/result_re/zh_val_21_re.jpg) | ![](./images/result_re/zh_val_40_re.jpg)
+---|---
+
+
+In the figure, red boxes mark questions, blue boxes mark answers, and each question is connected to its answer by a green line. The corresponding category and OCR recognition result are also marked at the top left of each OCR detection box.
+
+
+## 3. Setup
+
+### 3.1 Install dependencies
+
+- **(1) Install PaddlePaddle**
+
+```bash
+python3 -m pip install --upgrade pip
+
+# Install the GPU version
+python3 -m pip install paddlepaddle-gpu==2.2 -i https://mirror.baidu.com/pypi/simple
+
+# Install the CPU version
+python3 -m pip install paddlepaddle==2.2 -i https://mirror.baidu.com/pypi/simple
+
+```
+For other requirements, refer to the instructions in the [installation document](https://www.paddlepaddle.org.cn/install/quick).
+
+
+### 3.2 Install PaddleOCR (including PP-OCR and VQA)
+
+- **(1) Quick install of the PaddleOCR whl package with pip (prediction only)**
+
+```bash
+python3 -m pip install paddleocr
+```
+
+- **(2) Download the VQA source code (prediction + training)**
+
+```bash
+git clone https://github.com/PaddlePaddle/PaddleOCR  # recommended
+
+# If the clone fails because of network problems, you can use the mirror hosted on Gitee instead:
+git clone https://gitee.com/paddlepaddle/PaddleOCR
+
+# Note: the Gitee mirror may lag behind this GitHub project by 3 to 5 days, so prefer the recommended method.
+```
+
+- **(3) Install the `requirements` for VQA**
+
+```bash
+cd ppstructure/vqa
+python3 -m pip install -r requirements.txt
+```
+
+## 4. Usage
+
+
+### 4.1 Data and pre-trained model preparation
+
+The processed XFUND Chinese dataset can be downloaded from [https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar).
+
+
+Download and extract the dataset, then place it in the current directory.
+
+```shell
+wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar
+```
+
+To convert the XFUND datasets for other languages, refer to the [XFUND data conversion script](helper/trans_xfun_data.py).
+
+To try prediction directly, download the pre-trained models we provide and skip the training step.
+
+
+### 4.2 SER Task
+
+* Start training
+
+```shell
+python3 train_ser.py \
+    --model_name_or_path "layoutxlm-base-uncased" \
+    --ser_model_type "LayoutXLM" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "./output/ser/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --evaluate_during_training \
+    --seed 2048
+```
+
+Metrics such as `precision`, `recall`, and `f1` are printed at the end, and the model and training log are saved in the `./output/ser/` folder.
+
+* Resume training
+
+```shell
+python3 train_ser.py \
+    --model_name_or_path "model_path" \
+    --ser_model_type "LayoutXLM" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "./output/ser/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --evaluate_during_training \
+    --num_workers 8 \
+    --seed 2048 \
+    --resume
+```
+
+* Evaluation
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 eval_ser.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --output_dir "output/ser/" \
+    --seed 2048
+```
+Metrics such as `precision`, `recall`, and `f1` are printed at the end.
+
+* Predict using the OCR recognition results provided in the evaluation set
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 infer_ser.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --output_dir "output/ser/" \
+    --infer_imgs "XFUND/zh_val/image/" \
+    --ocr_json_path "XFUND/zh_val/xfun_normalize_val.json"
+```
+
+The visualized prediction images and the prediction result text file, named `infer_results.txt`, are saved in the `output_res` directory.
+
+* Predict with the chained `OCR engine + SER` pipeline
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 infer_ser_e2e.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --max_seq_length 512 \
+    --output_dir "output/ser_e2e/" \
+    --infer_imgs "images/input/zh_val_0.jpg"
+```
+
+* Evaluate the `OCR engine + SER` prediction system end to end
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 helper/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt
+```
+
+
+### 4.3 RE Task
+
+* Start training
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 train_re.py \
+    --model_name_or_path "layoutxlm-base-uncased" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "output/re/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --per_gpu_train_batch_size 8 \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --evaluate_during_training \
+    --seed 2048
+
+```
+
+* Resume training
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 train_re.py \
+    --model_name_or_path "model_path" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --num_train_epochs 2 \
+    --eval_steps 10 \
+    --output_dir "output/re/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --per_gpu_train_batch_size 8 \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --evaluate_during_training \
+    --seed 2048 \
+    --resume
+
+```
+
+Metrics such as `precision`, `recall`, and `f1` are printed at the end, and the model and training log are saved in the `./output/re/` folder.
+
+* Evaluation
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 eval_re.py \
+    --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --max_seq_length 512 \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --output_dir "output/re/" \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --seed 2048
+```
+Metrics such as `precision`, `recall`, and `f1` are printed at the end.
+
+
+* Predict using the OCR recognition results provided in the evaluation set
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 infer_re.py \
+    --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --max_seq_length 512 \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --output_dir "output/re/" \
+    --per_gpu_eval_batch_size 1 \
+    --seed 2048
+```
+
+The visualized prediction images and the prediction result text file, named `infer_results.txt`, are saved in the `output_res` directory.
+
+* Predict with the chained `OCR engine + SER + RE` pipeline
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 infer_ser_re_e2e.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --re_model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --max_seq_length 512 \
+    --output_dir "output/ser_re_e2e/" \
+    --infer_imgs "images/input/zh_val_21.jpg"
+```
+
+## References
+
+- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
+- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
+- XFUND dataset, https://github.com/doc-analysis/XFUND