diff --git a/README_ch.md b/README_ch.md index 339c24e6b3530184a831d3d288ab2573998cc9ee..0eaf5cb2f8d73143fc3793fedca7a6cf762f40c6 100755 --- a/README_ch.md +++ b/README_ch.md @@ -100,7 +100,7 @@ PaddleOCR旨在打造一套丰富、领先、且实用的OCR工具库,助力 - [版面分析](./ppstructure/layout/README_ch.md) - [表格识别](./ppstructure/table/README_ch.md) - [DocVQA](./ppstructure/vqa/README_ch.md) - - [关键信息提取](./ppstructure/docs/kie.md) + - [关键信息提取](./ppstructure/docs/kie_ch.md) - OCR学术圈 - [两阶段模型介绍与下载](./doc/doc_ch/algorithm_overview.md) - [端到端PGNet算法](./doc/doc_ch/pgnet.md) diff --git a/ppstructure/docs/kie.md b/ppstructure/docs/kie.md index 21854b0d24b0b2bbe6a4612b1112b201c5df255d..a424968a9b5a33132afe52a4850cfe541919ae1c 100644 --- a/ppstructure/docs/kie.md +++ b/ppstructure/docs/kie.md @@ -1,64 +1,67 @@ -# 关键信息提取(Key Information Extraction) +# Key Information Extraction (KIE) -本节介绍PaddleOCR中关键信息提取SDMGR方法的快速使用和训练方法。 +This section provides a tutorial example on how to quickly use, train, and evaluate a key information extraction (KIE) model, [SDMGR](https://arxiv.org/abs/2103.14470), in PaddleOCR. -SDMGR是一个关键信息提取算法,将每个检测到的文本区域分类为预定义的类别,如订单ID、发票号码,金额等。 +[SDMGR (Spatial Dual-Modality Graph Reasoning)](https://arxiv.org/abs/2103.14470) is a KIE algorithm that classifies each detected text region into predefined categories, such as order ID, invoice number, amount, etc. -* [1. 快速使用](#1-----) -* [2. 执行训练](#2-----) -* [3. 执行评估](#3-----) +* [1. Quick Use](#1-----) +* [2. Model Training](#2-----) +* [3. Model Evaluation](#3-----) -## 1. 快速使用 -训练和测试的数据采用wildreceipt数据集,通过如下指令下载数据集: +## 1. Quick Use -``` +The [Wildreceipt dataset](https://paperswithcode.com/dataset/wildreceipt) is used for this tutorial. 
It contains 1765 photos with 25 classes and 50000 text boxes, and can be downloaded with wget: + +```shell wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/wildreceipt.tar && tar xf wildreceipt.tar ``` -执行预测: +Download the pretrained model and predict the result: -``` +```shell cd PaddleOCR/ wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/kie_vgg16.tar && tar xf kie_vgg16.tar python3.7 tools/infer_kie.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=kie_vgg16/best_accuracy Global.infer_img=../wildreceipt/1.txt ``` -执行预测后的结果保存在`./output/sdmgr_kie/predicts_kie.txt`文件中,可视化结果保存在`/output/sdmgr_kie/kie_results/`目录下。 +The prediction result is saved as `./output/sdmgr_kie/predicts_kie.txt`, and the visualization results are saved in the folder `/output/sdmgr_kie/kie_results/`. -可视化结果如下图所示: +The visualization results are shown in the figure below:
-## 2. 执行训练 +## 2. Model Training -创建数据集软链到PaddleOCR/train_data目录下: -``` +Create a softlink to the folder, `PaddleOCR/train_data`: +```shell cd PaddleOCR/ && mkdir train_data && cd train_data ln -s ../../wildreceipt ./ ``` -训练采用的配置文件是configs/kie/kie_unet_sdmgr.yml,配置文件中默认训练数据路径是`train_data/wildreceipt`,准备好数据后,可以通过如下指令执行训练: -``` +The configuration file used for training is `configs/kie/kie_unet_sdmgr.yml`. The default training data path in the configuration file is `train_data/wildreceipt`. After preparing the data, you can execute the model training with the following command: +```shell python3.7 tools/train.py -c configs/kie/kie_unet_sdmgr.yml -o Global.save_model_dir=./output/kie/ ``` -## 3. 执行评估 -``` +## 3. Model Evaluation + +After training, you can execute the model evaluation with the following command: + +```shell python3.7 tools/eval.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=./output/kie/best_accuracy ``` - -**参考文献:** +**Reference:** diff --git a/ppstructure/docs/kie_ch.md b/ppstructure/docs/kie_ch.md new file mode 100644 index 0000000000000000000000000000000000000000..21854b0d24b0b2bbe6a4612b1112b201c5df255d --- /dev/null +++ b/ppstructure/docs/kie_ch.md @@ -0,0 +1,74 @@ + + +# 关键信息提取(Key Information Extraction) + +本节介绍PaddleOCR中关键信息提取SDMGR方法的快速使用和训练方法。 + +SDMGR是一个关键信息提取算法,将每个检测到的文本区域分类为预定义的类别,如订单ID、发票号码,金额等。 + + +* [1. 快速使用](#1-----) +* [2. 执行训练](#2-----) +* [3. 执行评估](#3-----) + + +## 1. 快速使用 + +训练和测试的数据采用wildreceipt数据集,通过如下指令下载数据集: + +``` +wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/wildreceipt.tar && tar xf wildreceipt.tar +``` + +执行预测: + +``` +cd PaddleOCR/ +wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/kie_vgg16.tar && tar xf kie_vgg16.tar +python3.7 tools/infer_kie.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=kie_vgg16/best_accuracy Global.infer_img=../wildreceipt/1.txt +``` + +执行预测后的结果保存在`./output/sdmgr_kie/predicts_kie.txt`文件中,可视化结果保存在`/output/sdmgr_kie/kie_results/`目录下。 + +可视化结果如下图所示: + +
+ +
+ + +## 2. 执行训练 + +创建数据集软链到PaddleOCR/train_data目录下: +``` +cd PaddleOCR/ && mkdir train_data && cd train_data + +ln -s ../../wildreceipt ./ +``` + +训练采用的配置文件是configs/kie/kie_unet_sdmgr.yml,配置文件中默认训练数据路径是`train_data/wildreceipt`,准备好数据后,可以通过如下指令执行训练: +``` +python3.7 tools/train.py -c configs/kie/kie_unet_sdmgr.yml -o Global.save_model_dir=./output/kie/ +``` + +## 3. 执行评估 + +``` +python3.7 tools/eval.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=./output/kie/best_accuracy +``` + + +**参考文献:** + + + +```bibtex +@misc{sun2021spatial, + title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction}, + author={Hongbin Sun and Zhanghui Kuang and Xiaoyu Yue and Chenhao Lin and Wayne Zhang}, + year={2021}, + eprint={2103.14470}, + archivePrefix={arXiv}, + primaryClass={cs.CV} +} +``` diff --git a/ppstructure/docs/kie_en.md b/ppstructure/docs/kie_en.md deleted file mode 100644 index a424968a9b5a33132afe52a4850cfe541919ae1c..0000000000000000000000000000000000000000 --- a/ppstructure/docs/kie_en.md +++ /dev/null @@ -1,77 +0,0 @@ - - -# Key Information Extraction(KIE) - -This section provides a tutorial example on how to quickly use, train, and evaluate a key information extraction(KIE) model, [SDMGR](https://arxiv.org/abs/2103.14470), in PaddleOCR. - -[SDMGR(Spatial Dual-Modality Graph Reasoning)](https://arxiv.org/abs/2103.14470) is a KIE algorithm that classifies each detected text region into predefined categories, such as order ID, invoice number, amount, and etc. - - -* [1. Quick Use](#1-----) -* [2. Model Training](#2-----) -* [3. Model Evaluation](#3-----) - - - -## 1. Quick Use - -[Wildreceipt dataset](https://paperswithcode.com/dataset/wildreceipt) is used for this tutorial. 
It contains 1765 photos, with 25 classes, and 50000 text boxes, which can be downloaded by wget: - -```shell -wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/wildreceipt.tar && tar xf wildreceipt.tar -``` - -Download the pretrained model and predict the result: - -```shell -cd PaddleOCR/ -wget https://paddleocr.bj.bcebos.com/dygraph_v2.1/kie/kie_vgg16.tar && tar xf kie_vgg16.tar -python3.7 tools/infer_kie.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=kie_vgg16/best_accuracy Global.infer_img=../wildreceipt/1.txt -``` - -The prediction result is saved as `./output/sdmgr_kie/predicts_kie.txt`, and the visualization results are saved in the folder`/output/sdmgr_kie/kie_results/`. - -The visualization results are shown in the figure below: - -
- -
- - -## 2. Model Training - -Create a softlink to the folder, `PaddleOCR/train_data`: -```shell -cd PaddleOCR/ && mkdir train_data && cd train_data - -ln -s ../../wildreceipt ./ -``` - -The configuration file used for training is `configs/kie/kie_unet_sdmgr.yml`. The default training data path in the configuration file is `train_data/wildreceipt`. After preparing the data, you can execute the model training with the following command: -```shell -python3.7 tools/train.py -c configs/kie/kie_unet_sdmgr.yml -o Global.save_model_dir=./output/kie/ -``` - - -## 3. Model Evaluation - -After training, you can execute the model evaluation with the following command: - -```shell -python3.7 tools/eval.py -c configs/kie/kie_unet_sdmgr.yml -o Global.checkpoints=./output/kie/best_accuracy -``` - -**Reference:** - - - -```bibtex -@misc{sun2021spatial, - title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction}, - author={Hongbin Sun and Zhanghui Kuang and Xiaoyu Yue and Chenhao Lin and Wayne Zhang}, - year={2021}, - eprint={2103.14470}, - archivePrefix={arXiv}, - primaryClass={cs.CV} -} -``` diff --git a/ppstructure/vqa/README-en.md b/ppstructure/vqa/README-en.md deleted file mode 100644 index 168640874aa5e2339e81d7dc467e515d5aa9101e..0000000000000000000000000000000000000000 --- a/ppstructure/vqa/README-en.md +++ /dev/null @@ -1,331 +0,0 @@ -# Document Visual Q&A(DOC-VQA) - -Document Visual Q&A, mainly for the image content of the question and answer, DOC-VQA is a type of VQA task, DOC-VQA mainly asks questions about the textual content of text images. - -The DOC-VQA algorithm in PP-Structure is developed based on PaddleNLP natural language processing algorithm library. - -The main features are as follows: - -- Integrated LayoutXLM model and PP-OCR prediction engine. -- Support Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multi-modal methods. Based on SER task, text recognition and classification in images can be completed. 
Based on THE RE task, we can extract the relation of the text content in the image, such as judge the problem pair. - -- Support custom training for SER and RE tasks. - -- Support OCR+SER end-to-end system prediction and evaluation. - -- Support OCR+SER+RE end-to-end system prediction. - -**Note**: This project is based on the open source implementation of [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2, and at the same time, after in-depth polishing by the flying Paddle team and the Industrial and **Commercial Bank of China** in the scene of real estate certificate, jointly open source. - - -## 1.Performance - -We evaluated the algorithm on [XFUN](https://github.com/doc-analysis/XFUND) 's Chinese data set, and the performance is as follows - -| Model | Task | F1 | Model Download Link | -|:---:|:---:|:---:| :---:| -| LayoutXLM | RE | 0.7113 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_re_pretrained.tar) | -| LayoutXLM | SER | 0.9056 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar) | -| LayoutLM | SER | 0.78 | [Link](https://paddleocr.bj.bcebos.com/pplayout/LayoutLM_ser_pretrained.tar) | - - - -## 2.Demonstration - -**Note**: the test images are from the xfun dataset. - -### 2.1 SER - -![](./images/result_ser/zh_val_0_ser.jpg) | ![](./images/result_ser/zh_val_42_ser.jpg) ----|--- - -Different colored boxes in the figure represent different categories. For xfun dataset, there are three categories: query, answer and header: - -* Dark purple: header -* Light purple: query -* Army green: answer - -The corresponding category and OCR recognition results are also marked at the top left of the OCR detection box. - - -### 2.2 RE - -![](./images/result_re/zh_val_21_re.jpg) | ![](./images/result_re/zh_val_40_re.jpg) ----|--- - - -In the figure, the red box represents the question, the blue box represents the answer, and the question and answer are connected by green lines. 
The corresponding category and OCR recognition results are also marked at the top left of the OCR detection box. - - -## 3. Setup - -### 3.1 Installation dependency - -- **(1) Install PaddlePaddle** - -```bash -pip3 install --upgrade pip - -# GPU PaddlePaddle Install -python3 -m pip install paddlepaddle-gpu==2.2 -i https://mirror.baidu.com/pypi/simple - -# CPU PaddlePaddle Install -python3 -m pip install paddlepaddle==2.2 -i https://mirror.baidu.com/pypi/simple - -``` -For more requirements, please refer to the [instructions](https://www.paddlepaddle.org.cn/install/quick) in the installation document. - - -### 3.2 Install PaddleOCR (including pp-ocr and VQA) - -- **(1) PIP quick install paddleocr WHL package (forecast only)** - -```bash -pip install paddleocr -``` - -- **(2) Download VQA source code (prediction + training)** - -```bash -[recommended] git clone https://github.com/PaddlePaddle/PaddleOCR - -# If you cannot pull successfully because of network problems, you can also choose to use the hosting on the code cloud: -git clone https://gitee.com/paddlepaddle/PaddleOCR - -# Note: the code cloud hosting code may not be able to synchronize the update of this GitHub project in real time, with a delay of 3 ~ 5 days. Please give priority to the recommended method. -``` - -- **(3) Install PaddleNLP** - -```bash -# You need to use the latest code version of paddlenlp for installation -git clone https://github.com/PaddlePaddle/PaddleNLP -b develop -cd PaddleNLP -pip3 install -e . -``` - - -- **(4) Install requirements for VQA** - -```bash -cd ppstructure/vqa -pip install -r requirements.txt -``` - -## 4.Usage - - -### 4.1 Data and pre training model preparation - -Download address of processed xfun Chinese dataset: [https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar)。 - - -Download and unzip the dataset, and then place the dataset in the current directory. 
- -```shell -wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar -``` - -If you want to convert data sets in other languages in xfun, you can refer to [xfun data conversion script.](helper/trans_xfun_data.py)) - -If you want to experience the prediction process directly, you can download the pre training model provided by us, skip the training process and predict directly. - - -### 4.2 SER Task - -* Start training - -```shell -python3.7 train_ser.py \ - --model_name_or_path "layoutxlm-base-uncased" \ - --ser_model_type "LayoutXLM" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "./output/ser/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --evaluate_during_training \ - --seed 2048 -``` - -Finally, Precision, Recall, F1 and other indicators will be printed, and the model and training log will be saved in/ In the output/Ser/ folder. 
- -* Recovery training - -```shell -python3.7 train_ser.py \ - --model_name_or_path "model_path" \ - --ser_model_type "LayoutXLM" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "./output/ser/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --evaluate_during_training \ - --num_workers 8 \ - --seed 2048 \ - --resume -``` - -* Evaluation -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 eval_ser.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --output_dir "output/ser/" \ - --seed 2048 -``` -Finally, Precision, Recall, F1 and other indicators will be printed - -* The OCR recognition results provided in the evaluation set are used for prediction - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3.7 infer_ser.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --output_dir "output/ser/" \ - --infer_imgs "XFUND/zh_val/image/" \ - --ocr_json_path "XFUND/zh_val/xfun_normalize_val.json" -``` - -It will end up in output_res The visual image of the prediction result and the text file of the prediction result are saved in the res directory. The file name is infer_ results.txt. 
- -* Using OCR engine + SER concatenation results - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3.7 infer_ser_e2e.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --max_seq_length 512 \ - --output_dir "output/ser_e2e/" \ - --infer_imgs "images/input/zh_val_0.jpg" -``` - -* End-to-end evaluation of OCR engine + SER prediction system - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3.7 helper/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt -``` - - -### 4.3 RE Task - -* Start training - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 train_re.py \ - --model_name_or_path "layoutxlm-base-uncased" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "output/re/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --per_gpu_train_batch_size 8 \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --evaluate_during_training \ - --seed 2048 - -``` - -* Resume training - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 train_re.py \ - --model_name_or_path "model_path" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --num_train_epochs 2 \ - --eval_steps 10 \ - --output_dir "output/re/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --per_gpu_train_batch_size 8 \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --evaluate_during_training \ - --seed 2048 \ - --resume - -``` - -Finally, Precision, Recall, F1 and other indicators will be printed, and the model and 
training log will be saved in the output/RE file folder. - -* Evaluation -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 eval_re.py \ - --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --max_seq_length 512 \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --output_dir "output/re/" \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --seed 2048 -``` -Finally, Precision, Recall, F1 and other indicators will be printed - - -* The OCR recognition results provided in the evaluation set are used for prediction - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 infer_re.py \ - --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --max_seq_length 512 \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --output_dir "output/re/" \ - --per_gpu_eval_batch_size 1 \ - --seed 2048 -``` - -The visual image of the prediction result and the text file of the prediction result are saved in the output_res file folder, the file name is`infer_results.txt`。 - -* Concatenation results using OCR engine + SER+ RE - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3.7 infer_ser_re_e2e.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --re_model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --max_seq_length 512 \ - --output_dir "output/ser_re_e2e/" \ - --infer_imgs "images/input/zh_val_21.jpg" -``` - -## Reference - -- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf -- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm -- XFUND dataset, https://github.com/doc-analysis/XFUND diff --git a/ppstructure/vqa/README.md b/ppstructure/vqa/README.md index 
619ada71a82eacd88abd39199d0b220dc6c64c9b..168640874aa5e2339e81d7dc467e515d5aa9101e 100644 --- a/ppstructure/vqa/README.md +++ b/ppstructure/vqa/README.md @@ -1,318 +1,331 @@ -# 文档视觉问答(DOC-VQA) - -VQA指视觉问答,主要针对图像内容进行提问和回答,DOC-VQA是VQA任务中的一种,DOC-VQA主要针对文本图像的文字内容提出问题。 - -PP-Structure 里的 DOC-VQA算法基于PaddleNLP自然语言处理算法库进行开发。 - -主要特性如下: - -- 集成[LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf)模型以及PP-OCR预测引擎。 -- 支持基于多模态方法的语义实体识别 (Semantic Entity Recognition, SER) 以及关系抽取 (Relation Extraction, RE) 任务。基于 SER 任务,可以完成对图像中的文本识别与分类;基于 RE 任务,可以完成对图象中的文本内容的关系提取,如判断问题对(pair)。 -- 支持SER任务和RE任务的自定义训练。 -- 支持OCR+SER的端到端系统预测与评估。 -- 支持OCR+SER+RE的端到端系统预测。 - -**Note**:本项目基于 [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) 在Paddle 2.2上的开源实现,同时经过飞桨团队与**中国工商银行**在不动产证场景深入打磨,联合开源。 - - -## 1.性能 - -我们在 [XFUN](https://github.com/doc-analysis/XFUND) 的中文数据集上对算法进行了评估,性能如下 - -| 模型 | 任务 | f1 | 模型下载地址 | -|:---:|:---:|:---:| :---:| -| LayoutXLM | RE | 0.7113 | [链接](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_re_pretrained.tar) | -| LayoutXLM | SER | 0.9056 | [链接](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar) | -| LayoutLM | SER | 0.78 | [链接](https://paddleocr.bj.bcebos.com/pplayout/LayoutLM_ser_pretrained.tar) | - - - -## 2.效果演示 - -**注意:** 测试图片来源于XFUN数据集。 - -### 2.1 SER - -![](./images/result_ser/zh_val_0_ser.jpg) | ![](./images/result_ser/zh_val_42_ser.jpg) ----|--- - -图中不同颜色的框表示不同的类别,对于XFUN数据集,有`QUESTION`, `ANSWER`, `HEADER` 3种类别 - -* 深紫色:HEADER -* 浅紫色:QUESTION -* 军绿色:ANSWER - -在OCR检测框的左上方也标出了对应的类别和OCR识别结果。 - - -### 2.2 RE - -![](./images/result_re/zh_val_21_re.jpg) | ![](./images/result_re/zh_val_40_re.jpg) ----|--- - - -图中红色框表示问题,蓝色框表示答案,问题和答案之间使用绿色线连接。在OCR检测框的左上方也标出了对应的类别和OCR识别结果。 - - -## 3.安装 - -### 3.1 安装依赖 - -- **(1) 安装PaddlePaddle** - -```bash -python3 -m pip install --upgrade pip - -# GPU安装 -python3 -m pip install paddlepaddle-gpu==2.2 -i https://mirror.baidu.com/pypi/simple - -# CPU安装 -python3 -m pip install paddlepaddle==2.2 -i 
https://mirror.baidu.com/pypi/simple - -``` -更多需求,请参照[安装文档](https://www.paddlepaddle.org.cn/install/quick)中的说明进行操作。 - - -### 3.2 安装PaddleOCR(包含 PP-OCR 和 VQA ) - -- **(1)pip快速安装PaddleOCR whl包(仅预测)** - -```bash -python3 -m pip install paddleocr -``` - -- **(2)下载VQA源码(预测+训练)** - -```bash -【推荐】git clone https://github.com/PaddlePaddle/PaddleOCR - -# 如果因为网络问题无法pull成功,也可选择使用码云上的托管: -git clone https://gitee.com/paddlepaddle/PaddleOCR - -# 注:码云托管代码可能无法实时同步本github项目更新,存在3~5天延时,请优先使用推荐方式。 -``` - -- **(3)安装VQA的`requirements`** - -```bash -cd ppstructure/vqa -python3 -m pip install -r requirements.txt -``` - -## 4. 使用 - - -### 4.1 数据和预训练模型准备 - -处理好的XFUN中文数据集下载地址:[https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar)。 - - -下载并解压该数据集,解压后将数据集放置在当前目录下。 - -```shell -wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar -``` - -如果希望转换XFUN中其他语言的数据集,可以参考[XFUN数据转换脚本](helper/trans_xfun_data.py)。 - -如果希望直接体验预测过程,可以下载我们提供的预训练模型,跳过训练过程,直接预测即可。 - - -### 4.2 SER任务 - -* 启动训练 - -```shell -python3 train_ser.py \ - --model_name_or_path "layoutxlm-base-uncased" \ - --ser_model_type "LayoutXLM" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "./output/ser/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --evaluate_during_training \ - --seed 2048 -``` - -最终会打印出`precision`, `recall`, `f1`等指标,模型和训练日志会保存在`./output/ser/`文件夹中。 - -* 恢复训练 - -```shell -python3 train_ser.py \ - --model_name_or_path "model_path" \ - --ser_model_type "LayoutXLM" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "./output/ser/" \ 
- --learning_rate 5e-5 \ - --warmup_steps 50 \ - --evaluate_during_training \ - --num_workers 8 \ - --seed 2048 \ - --resume -``` - -* 评估 -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 eval_ser.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --output_dir "output/ser/" \ - --seed 2048 -``` -最终会打印出`precision`, `recall`, `f1`等指标 - -* 使用评估集合中提供的OCR识别结果进行预测 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 infer_ser.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --output_dir "output/ser/" \ - --infer_imgs "XFUND/zh_val/image/" \ - --ocr_json_path "XFUND/zh_val/xfun_normalize_val.json" -``` - -最终会在`output_res`目录下保存预测结果可视化图像以及预测结果文本文件,文件名为`infer_results.txt`。 - -* 使用`OCR引擎 + SER`串联结果 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 infer_ser_e2e.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --ser_model_type "LayoutXLM" \ - --max_seq_length 512 \ - --output_dir "output/ser_e2e/" \ - --infer_imgs "images/input/zh_val_0.jpg" -``` - -* 对`OCR引擎 + SER`预测系统进行端到端评估 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 helper/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt -``` - - -### 4.3 RE任务 - -* 启动训练 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 train_re.py \ - --model_name_or_path "layoutxlm-base-uncased" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --num_train_epochs 200 \ - --eval_steps 10 \ - --output_dir "output/re/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --per_gpu_train_batch_size 8 \ - 
--per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --evaluate_during_training \ - --seed 2048 - -``` - -* 恢复训练 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 train_re.py \ - --model_name_or_path "model_path" \ - --train_data_dir "XFUND/zh_train/image" \ - --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --num_train_epochs 2 \ - --eval_steps 10 \ - --output_dir "output/re/" \ - --learning_rate 5e-5 \ - --warmup_steps 50 \ - --per_gpu_train_batch_size 8 \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --evaluate_during_training \ - --seed 2048 \ - --resume - -``` - -最终会打印出`precision`, `recall`, `f1`等指标,模型和训练日志会保存在`./output/re/`文件夹中。 - -* 评估 -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 eval_re.py \ - --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --max_seq_length 512 \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --output_dir "output/re/" \ - --per_gpu_eval_batch_size 8 \ - --num_workers 8 \ - --seed 2048 -``` -最终会打印出`precision`, `recall`, `f1`等指标 - - -* 使用评估集合中提供的OCR识别结果进行预测 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 infer_re.py \ - --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --max_seq_length 512 \ - --eval_data_dir "XFUND/zh_val/image" \ - --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \ - --label_map_path "labels/labels_ser.txt" \ - --output_dir "output/re/" \ - --per_gpu_eval_batch_size 1 \ - --seed 2048 -``` - -最终会在`output_res`目录下保存预测结果可视化图像以及预测结果文本文件,文件名为`infer_results.txt`。 - -* 使用`OCR引擎 + SER + RE`串联结果 - -```shell -export CUDA_VISIBLE_DEVICES=0 -python3 infer_ser_re_e2e.py \ - --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \ - --re_model_name_or_path "PP-Layout_v1.0_re_pretrained/" \ - --ser_model_type "LayoutXLM" \ - 
--max_seq_length 512 \ - --output_dir "output/ser_re_e2e/" \ - --infer_imgs "images/input/zh_val_21.jpg" -``` - -## 参考链接 - -- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf -- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm -- XFUND dataset, https://github.com/doc-analysis/XFUND +# Document Visual Q&A (DOC-VQA) + +VQA (Visual Question Answering) refers to asking and answering questions about image content. DOC-VQA is one type of VQA task that mainly asks questions about the textual content of text images. + +The DOC-VQA algorithm in PP-Structure is developed on top of the PaddleNLP natural language processing library. + +The main features are as follows: + +- Integrates the LayoutXLM model and the PP-OCR prediction engine. +- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multi-modal methods. With the SER task, text in images can be recognized and classified. With the RE task, relations between text contents in the image can be extracted, such as identifying question-answer pairs. + +- Supports custom training for SER and RE tasks. + +- Supports OCR+SER end-to-end system prediction and evaluation. + +- Supports OCR+SER+RE end-to-end system prediction. + +**Note**: This project is based on the open source implementation of [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2. It was polished in depth by the PaddlePaddle team and the **Industrial and Commercial Bank of China** in the real estate certificate scenario, and is jointly open-sourced. 
+ + +## 1. Performance + +We evaluated the algorithm on the Chinese dataset of [XFUN](https://github.com/doc-analysis/XFUND), and the performance is as follows: + +| Model | Task | F1 | Model Download Link | +|:---:|:---:|:---:| :---:| +| LayoutXLM | RE | 0.7113 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_re_pretrained.tar) | +| LayoutXLM | SER | 0.9056 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar) | +| LayoutLM | SER | 0.78 | [Link](https://paddleocr.bj.bcebos.com/pplayout/LayoutLM_ser_pretrained.tar) | + + + +## 2. Demonstration + +**Note**: the test images are from the XFUN dataset. + +### 2.1 SER + +![](./images/result_ser/zh_val_0_ser.jpg) | ![](./images/result_ser/zh_val_42_ser.jpg) +---|--- + +Boxes of different colors in the figure represent different categories. The XFUN dataset has three categories: question, answer and header: + +* Dark purple: header +* Light purple: question +* Army green: answer + +The corresponding category and OCR recognition results are also marked at the top left of the OCR detection box. + + +### 2.2 RE + +![](./images/result_re/zh_val_21_re.jpg) | ![](./images/result_re/zh_val_40_re.jpg) +---|--- + + +In the figure, the red box represents the question, the blue box represents the answer, and the question and answer are connected by green lines. The corresponding category and OCR recognition results are also marked at the top left of the OCR detection box. + + +## 3. Setup + +### 3.1 Install dependencies + +- **(1) Install PaddlePaddle** + +```bash +pip3 install --upgrade pip + +# GPU PaddlePaddle Install +python3 -m pip install paddlepaddle-gpu==2.2 -i https://mirror.baidu.com/pypi/simple + +# CPU PaddlePaddle Install +python3 -m pip install paddlepaddle==2.2 -i https://mirror.baidu.com/pypi/simple + +``` +For more requirements, please refer to the [instructions](https://www.paddlepaddle.org.cn/install/quick) in the installation document. 
+ + +### 3.2 Install PaddleOCR (including PP-OCR and VQA) + +- **(1) Quick install of the PaddleOCR whl package with pip (prediction only)** + +```bash +pip install paddleocr +``` + +- **(2) Download VQA source code (prediction + training)** + +```bash +git clone https://github.com/PaddlePaddle/PaddleOCR # recommended + +# If the clone fails because of network problems, you can also use the mirror hosted on Gitee: +git clone https://gitee.com/paddlepaddle/PaddleOCR + +# Note: the Gitee mirror may not be synchronized with this GitHub project in real time; there is a delay of 3 ~ 5 days. Please give priority to the recommended method. +``` + +- **(3) Install PaddleNLP** + +```bash +# You need to use the latest code version of paddlenlp for installation +git clone https://github.com/PaddlePaddle/PaddleNLP -b develop +cd PaddleNLP +pip3 install -e . +``` + + +- **(4) Install requirements for VQA** + +```bash +cd ppstructure/vqa +pip install -r requirements.txt +``` + +## 4. Usage + + +### 4.1 Data and pre-trained model preparation + +Download address of the processed XFUN Chinese dataset: [https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar). + + +Download and unzip the dataset, and then place the dataset in the current directory. + +```shell +wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar +``` + +If you want to convert the XFUN datasets in other languages, you can refer to the [XFUN data conversion script](helper/trans_xfun_data.py). + +If you want to try the prediction process directly, you can download the pre-trained models we provide, skip the training process and predict directly. 
+
+
+### 4.2 SER Task
+
+* Start training
+
+```shell
+python3.7 train_ser.py \
+    --model_name_or_path "layoutxlm-base-uncased" \
+    --ser_model_type "LayoutXLM" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "./output/ser/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --evaluate_during_training \
+    --seed 2048
+```
+
+Finally, metrics such as precision, recall, and F1 will be printed, and the model and training log will be saved in the `./output/ser/` folder.
+
+* Resume training
+
+```shell
+python3.7 train_ser.py \
+    --model_name_or_path "model_path" \
+    --ser_model_type "LayoutXLM" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "./output/ser/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --evaluate_during_training \
+    --num_workers 8 \
+    --seed 2048 \
+    --resume
+```
+
+* Evaluation
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 eval_ser.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --output_dir "output/ser/" \
+    --seed 2048
+```
+Finally, metrics such as precision, recall, and F1 will be printed.
+
+* Predict using the OCR recognition results provided in the evaluation set
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3.7 infer_ser.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --output_dir "output/ser/" \
+    --infer_imgs "XFUND/zh_val/image/" \
+    --ocr_json_path "XFUND/zh_val/xfun_normalize_val.json"
+```
+
+The visualization images and a text file of the prediction results are saved in the `output_res` directory. The text file is named `infer_results.txt`.
+
+* Predict with the OCR engine and SER in cascade
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3.7 infer_ser_e2e.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --max_seq_length 512 \
+    --output_dir "output/ser_e2e/" \
+    --infer_imgs "images/input/zh_val_0.jpg"
+```
+
+* End-to-end evaluation of the OCR engine + SER prediction system
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3.7 helper/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt
+```
+
+
+### 4.3 RE Task
+
+* Start training
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 train_re.py \
+    --model_name_or_path "layoutxlm-base-uncased" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "output/re/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --per_gpu_train_batch_size 8 \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --evaluate_during_training \
+    --seed 2048
+
+```
+
+* Resume training
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 train_re.py \
+    --model_name_or_path "model_path" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --num_train_epochs 2 \
+    --eval_steps 10 \
+    --output_dir "output/re/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --per_gpu_train_batch_size 8 \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --evaluate_during_training \
+    --seed 2048 \
+    --resume
+
+```
+
+Finally, metrics such as precision, recall, and F1 will be printed, and the model and training log will be saved in the `./output/re/` folder.
+
+* Evaluation
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 eval_re.py \
+    --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --max_seq_length 512 \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --output_dir "output/re/" \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --seed 2048
+```
+Finally, metrics such as precision, recall, and F1 will be printed.
+
+
+* Predict using the OCR recognition results provided in the evaluation set
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 infer_re.py \
+    --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --max_seq_length 512 \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --output_dir "output/re/" \
+    --per_gpu_eval_batch_size 1 \
+    --seed 2048
+```
+
+The visualization images and a text file of the prediction results are saved in the `output_res` folder. The text file is named `infer_results.txt`.
+
+* Predict with the OCR engine, SER, and RE in cascade
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3.7 infer_ser_re_e2e.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --re_model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --max_seq_length 512 \
+    --output_dir "output/ser_re_e2e/" \
+    --infer_imgs "images/input/zh_val_21.jpg"
+```
+
+## Reference
+
+- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
+- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
+- XFUND dataset, https://github.com/doc-analysis/XFUND
diff --git a/ppstructure/vqa/README_ch.md b/ppstructure/vqa/README_ch.md
new file mode 100644
index 0000000000000000000000000000000000000000..619ada71a82eacd88abd39199d0b220dc6c64c9b
--- /dev/null
+++ b/ppstructure/vqa/README_ch.md
@@ -0,0 +1,318 @@
+# Document Visual Question Answering (DOC-VQA)
+
+VQA stands for visual question answering, which asks and answers questions about the content of an image. DOC-VQA is one kind of VQA task; it asks questions about the text content of document images.
+
+The DOC-VQA algorithm in PP-Structure is developed on top of the PaddleNLP natural language processing library.
+
+The main features are as follows:
+
+- Integrates the [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) model and the PP-OCR prediction engine.
+- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. With the SER task, the text in an image can be recognized and classified; with the RE task, relations between text contents in an image can be extracted, such as identifying question-answer pairs.
+- Supports custom training for the SER and RE tasks.
+- Supports end-to-end system prediction and evaluation for OCR+SER.
+- Supports end-to-end system prediction for OCR+SER+RE.
+
+**Note**: this project is based on an open-source implementation of [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2. It was refined in depth by the PaddlePaddle team and the **Industrial and Commercial Bank of China** in a real estate certificate scenario, and is jointly open-sourced.
+
+
+## 1.Performance
+
+We evaluated the algorithm on the Chinese subset of the [XFUND](https://github.com/doc-analysis/XFUND) dataset. The results are as follows:
+
+| Model | Task | F1 | Model Download Link |
+|:---:|:---:|:---:| :---:|
+| LayoutXLM | RE | 0.7113 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_re_pretrained.tar) |
+| LayoutXLM | SER | 0.9056 | [Link](https://paddleocr.bj.bcebos.com/pplayout/PP-Layout_v1.0_ser_pretrained.tar) |
+| LayoutLM | SER | 0.78 | [Link](https://paddleocr.bj.bcebos.com/pplayout/LayoutLM_ser_pretrained.tar) |
+
+
+
+## 2.Demonstration
+
+**Note:** the test images are from the XFUND dataset.
+
+### 2.1 SER
+
+![](./images/result_ser/zh_val_0_ser.jpg) | ![](./images/result_ser/zh_val_42_ser.jpg)
+---|---
+
+Boxes of different colors in the figure represent different categories. The XFUND dataset has three categories: `QUESTION`, `ANSWER`, and `HEADER`.
+
+* Dark purple: HEADER
+* Light purple: QUESTION
+* Army green: ANSWER
+
+The corresponding category and OCR recognition result are also marked at the top left of each OCR detection box.
+
+
+### 2.2 RE
+
+![](./images/result_re/zh_val_21_re.jpg) | ![](./images/result_re/zh_val_40_re.jpg)
+---|---
+
+
+In the figure, red boxes represent questions, blue boxes represent answers, and each question is connected to its answer by a green line. The corresponding category and OCR recognition result are also marked at the top left of each OCR detection box.
+
+
+## 3.Setup
+
+### 3.1 Install dependencies
+
+- **(1) Install PaddlePaddle**
+
+```bash
+python3 -m pip install --upgrade pip
+
+# GPU installation
+python3 -m pip install paddlepaddle-gpu==2.2 -i https://mirror.baidu.com/pypi/simple
+
+# CPU installation
+python3 -m pip install paddlepaddle==2.2 -i https://mirror.baidu.com/pypi/simple
+
+```
+For other requirements, follow the instructions in the [installation guide](https://www.paddlepaddle.org.cn/install/quick).
+
+
+### 3.2 Install PaddleOCR (including PP-OCR and VQA)
+
+- **(1) Quick-install the PaddleOCR whl package with pip (prediction only)**
+
+```bash
+python3 -m pip install paddleocr
+```
+
+- **(2) Download VQA source code (prediction + training)**
+
+```bash
+# Recommended
+git clone https://github.com/PaddlePaddle/PaddleOCR
+
+# If the clone fails because of network problems, you can use the mirror hosted on Gitee instead:
+git clone https://gitee.com/paddlepaddle/PaddleOCR
+
+# Note: the Gitee mirror may not sync with this GitHub project in real time and can lag by 3 ~ 5 days. Please prefer the recommended method.
+```
+
+- **(3) Install the `requirements` for VQA**
+
+```bash
+cd ppstructure/vqa
+python3 -m pip install -r requirements.txt
+```
+
+## 4. Usage
+
+
+### 4.1 Data and pre-trained model preparation
+
+Download link for the processed XFUND Chinese dataset: [https://paddleocr.bj.bcebos.com/dataset/XFUND.tar](https://paddleocr.bj.bcebos.com/dataset/XFUND.tar).
+
+
+Download and unpack the dataset, then place it in the current directory.
+
+```shell
+wget https://paddleocr.bj.bcebos.com/dataset/XFUND.tar
+```
+
+To convert the XFUND datasets for other languages, refer to the [XFUND data conversion script](helper/trans_xfun_data.py).
+
+If you want to try the prediction process directly, you can download the pre-trained models we provide, skip training, and run prediction right away.
+
+
+### 4.2 SER Task
+
+* Start training
+
+```shell
+python3 train_ser.py \
+    --model_name_or_path "layoutxlm-base-uncased" \
+    --ser_model_type "LayoutXLM" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "./output/ser/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --evaluate_during_training \
+    --seed 2048
+```
+
+Finally, metrics such as `precision`, `recall`, and `f1` will be printed, and the model and training log will be saved in the `./output/ser/` folder.
+
+* Resume training
+
+```shell
+python3 train_ser.py \
+    --model_name_or_path "model_path" \
+    --ser_model_type "LayoutXLM" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "./output/ser/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --evaluate_during_training \
+    --num_workers 8 \
+    --seed 2048 \
+    --resume
+```
+
+* Evaluation
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 eval_ser.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --output_dir "output/ser/" \
+    --seed 2048
+```
+Finally, metrics such as `precision`, `recall`, and `f1` will be printed.
+
+* Predict using the OCR recognition results provided in the evaluation set
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 infer_ser.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --output_dir "output/ser/" \
+    --infer_imgs "XFUND/zh_val/image/" \
+    --ocr_json_path "XFUND/zh_val/xfun_normalize_val.json"
+```
+
+The visualization images and a text file of the prediction results are saved in the `output_res` directory. The text file is named `infer_results.txt`.
+
+* Predict with the `OCR engine + SER` cascade
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 infer_ser_e2e.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --max_seq_length 512 \
+    --output_dir "output/ser_e2e/" \
+    --infer_imgs "images/input/zh_val_0.jpg"
+```
+
+* End-to-end evaluation of the `OCR engine + SER` prediction system
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 helper/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt
+```
+
+
+### 4.3 RE Task
+
+* Start training
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 train_re.py \
+    --model_name_or_path "layoutxlm-base-uncased" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --num_train_epochs 200 \
+    --eval_steps 10 \
+    --output_dir "output/re/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --per_gpu_train_batch_size 8 \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --evaluate_during_training \
+    --seed 2048
+
+```
+
+* Resume training
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 train_re.py \
+    --model_name_or_path "model_path" \
+    --train_data_dir "XFUND/zh_train/image" \
+    --train_label_path "XFUND/zh_train/xfun_normalize_train.json" \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --num_train_epochs 2 \
+    --eval_steps 10 \
+    --output_dir "output/re/" \
+    --learning_rate 5e-5 \
+    --warmup_steps 50 \
+    --per_gpu_train_batch_size 8 \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --evaluate_during_training \
+    --seed 2048 \
+    --resume
+
+```
+
+Finally, metrics such as `precision`, `recall`, and `f1` will be printed, and the model and training log will be saved in the `./output/re/` folder.
+
+* Evaluation
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 eval_re.py \
+    --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --max_seq_length 512 \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --output_dir "output/re/" \
+    --per_gpu_eval_batch_size 8 \
+    --num_workers 8 \
+    --seed 2048
+```
+Finally, metrics such as `precision`, `recall`, and `f1` will be printed.
+
+
+* Predict using the OCR recognition results provided in the evaluation set
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 infer_re.py \
+    --model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --max_seq_length 512 \
+    --eval_data_dir "XFUND/zh_val/image" \
+    --eval_label_path "XFUND/zh_val/xfun_normalize_val.json" \
+    --label_map_path "labels/labels_ser.txt" \
+    --output_dir "output/re/" \
+    --per_gpu_eval_batch_size 1 \
+    --seed 2048
+```
+
+The visualization images and a text file of the prediction results are saved in the `output_res` directory. The text file is named `infer_results.txt`.
+
+* Predict with the `OCR engine + SER + RE` cascade
+
+```shell
+export CUDA_VISIBLE_DEVICES=0
+python3 infer_ser_re_e2e.py \
+    --model_name_or_path "PP-Layout_v1.0_ser_pretrained/" \
+    --re_model_name_or_path "PP-Layout_v1.0_re_pretrained/" \
+    --ser_model_type "LayoutXLM" \
+    --max_seq_length 512 \
+    --output_dir "output/ser_re_e2e/" \
+    --infer_imgs "images/input/zh_val_21.jpg"
+```
+
+## Reference
+
+- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
+- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
+- XFUND dataset, https://github.com/doc-analysis/XFUND