English | [简体中文](README_ch.md)

- [Document Visual Question Answering](#document-visual-question-answering)
  - [1. Introduction](#1-introduction)
  - [2. Performance](#2-performance)
  - [3. Effect demo](#3-effect-demo)
    - [3.1 SER](#31-ser)
    - [3.2 RE](#32-re)
  - [4. Install](#4-install)
    - [4.1 Install dependencies](#41-install-dependencies)
    - [4.2 Install PaddleOCR](#42-install-paddleocr)
  - [5. Usage](#5-usage)
    - [5.1 Data and Model Preparation](#51-data-and-model-preparation)
    - [5.2 SER](#52-ser)
    - [5.3 RE](#53-re)
  - [6. Reference Links](#6-reference-links)
  - [License](#license)

# Document Visual Question Answering

## 1. Introduction

VQA (Visual Question Answering) is the task of asking and answering questions about the content of an image. DOC-VQA is a sub-task of VQA in which the questions concern the textual content of document images.

The DOC-VQA algorithm in PP-Structure is developed on top of the PaddleNLP natural language processing library.

The main features are as follows:

- Integrates the [LayoutXLM](https://arxiv.org/pdf/2104.08836.pdf) model and the PP-OCR prediction engine.
- Supports Semantic Entity Recognition (SER) and Relation Extraction (RE) tasks based on multimodal methods. The SER task recognizes and classifies the text in an image; the RE task extracts relationships between text regions, such as identifying question-answer pairs.
- Supports custom training for SER tasks and RE tasks.
- Supports end-to-end system prediction and evaluation of OCR+SER.
- Supports end-to-end system prediction of OCR+SER+RE.


This project is an open-source implementation of [LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/pdf/2104.08836.pdf) on Paddle 2.2, including fine-tuning code on the [XFUND dataset](https://github.com/doc-analysis/XFUND).

## 2. Performance

We evaluate the algorithms on the Chinese subset of [XFUND](https://github.com/doc-analysis/XFUND); the performance is as follows:

| Model | Task | hmean | Model download address |
|:---:|:---:|:---:| :---:|
| LayoutXLM | SER | 0.9038 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar) |
| LayoutXLM | RE | 0.7483 | [link](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar) |
| LayoutLMv2 | SER | 0.8544 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLMv2_xfun_zh.tar) |
| LayoutLMv2 | RE | 0.6777 | [link](https://paddleocr.bj.bcebos.com/pplayout/re_LayoutLMv2_xfun_zh.tar) |
| LayoutLM | SER | 0.7731 | [link](https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutLM_xfun_zh.tar) |
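
`hmean` in the table is the harmonic mean of precision and recall (the F1-score).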

## 3. Effect demo

**Note:** The test images are from the XFUND dataset.

<a name="31"></a>
### 3.1 SER

![](../docs/vqa/result_ser/zh_val_0_ser.jpg) | ![](../docs/vqa/result_ser/zh_val_42_ser.jpg)
---|---

Boxes of different colors in the figure represent different categories. For the XFUND dataset, there are three categories: `QUESTION`, `ANSWER`, and `HEADER`:

* Dark purple: HEADER
* Light purple: QUESTION
* Army Green: ANSWER

The corresponding category and OCR recognition result are also marked at the upper-left corner of each OCR detection box.

<a name="32"></a>
### 3.2 RE

![](../docs/vqa/result_re/zh_val_21_re.jpg) | ![](../docs/vqa/result_re/zh_val_40_re.jpg)
---|---

In the figure, red boxes mark questions, blue boxes mark answers, and each question is connected to its answer by a green line. The corresponding category and OCR recognition result are also marked at the upper-left corner of each OCR detection box.

## 4. Install

### 4.1 Install dependencies

- **(1) Install PaddlePaddle**

```bash
python3 -m pip install --upgrade pip

# GPU installation
python3 -m pip install "paddlepaddle-gpu>=2.2" -i https://mirror.baidu.com/pypi/simple

# CPU installation
python3 -m pip install "paddlepaddle>=2.2" -i https://mirror.baidu.com/pypi/simple
```
For more requirements, please refer to the instructions in [Installation Documentation](https://www.paddlepaddle.org.cn/install/quick).
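
To quickly verify the installation, PaddlePaddle provides a built-in self-check, `paddle.utils.run_check()`; a minimal sanity check could look like this:

```bash
# Optional: confirm that PaddlePaddle imports correctly and, for GPU installs, that CUDA is visible.
python3 -c "import paddle; print(paddle.__version__); paddle.utils.run_check()"
```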

### 4.2 Install PaddleOCR

- **(1) Install the PaddleOCR whl package with pip (prediction only)**

```bash
python3 -m pip install paddleocr
```

- **(2) Download VQA source code (prediction + training)**

```bash
# [Recommended]
git clone https://github.com/PaddlePaddle/PaddleOCR

# If cloning from GitHub fails because of network problems, you can use the Gitee mirror instead:
git clone https://gitee.com/paddlepaddle/PaddleOCR

B
Bin Lu 已提交
108 109
# Note: Code cloud hosting code may not be able to synchronize the update of this github project in real time, there is a delay of 3 to 5 days, please use the recommended method first.
````

- **(3) Install VQA's `requirements`**

```bash
python3 -m pip install -r ppstructure/vqa/requirements.txt
```

## 5. Usage

### 5.1 Data and Model Preparation

If you just want to try the prediction process, you can download the pre-trained models provided below, skip the training process, and run prediction directly.

* Download the processed dataset

The processed XFUND Chinese dataset can be downloaded from this [link](https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar).


Download and extract the dataset, then place it in the current directory:

```shell
wget https://paddleocr.bj.bcebos.com/ppstructure/dataset/XFUND.tar
tar -xvf XFUND.tar
```

* Convert the dataset

If you need to train on other XFUND data, you can convert it with the following command:

```bash
python3 ppstructure/vqa/tools/trans_xfun_data.py --ori_gt_path=path/to/json_path --output_path=path/to/save_path
```

* Download the pretrained models
```bash
mkdir pretrain && cd pretrain
# download the SER model
wget https://paddleocr.bj.bcebos.com/pplayout/ser_LayoutXLM_xfun_zh.tar && tar -xvf ser_LayoutXLM_xfun_zh.tar
# download the RE model
wget https://paddleocr.bj.bcebos.com/pplayout/re_LayoutXLM_xfun_zh.tar && tar -xvf re_LayoutXLM_xfun_zh.tar
cd ../
```

<a name="52"></a>
### 5.2 SER

Before starting training, you need to modify the following four fields in `configs/vqa/ser/layoutxlm.yml` (they can also be overridden on the command line, as sketched after the list):

1. `Train.dataset.data_dir`: the directory where the training-set images are stored
2. `Train.dataset.label_file_list`: the path of the training-set label file
3. `Eval.dataset.data_dir`: the directory where the validation-set images are stored
4. `Eval.dataset.label_file_list`: the path of the validation-set label file
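
If you prefer not to edit the YAML file, the same four fields can usually be set through the `-o` override mechanism used by the other commands in this document. This is only a sketch: the dataset paths below are placeholders, and the list syntax for `label_file_list` is an assumption to verify against your PaddleOCR version.

```bash
# Sketch: pass the four dataset fields as command-line overrides instead of editing the YAML.
# The XFUND/... paths are example locations; point them at your extracted dataset.
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml \
    -o Train.dataset.data_dir=XFUND/zh_train/image \
       Train.dataset.label_file_list='[XFUND/zh_train/xfun_normalize_train.json]' \
       Eval.dataset.data_dir=XFUND/zh_val/image \
       Eval.dataset.label_file_list='[XFUND/zh_val/xfun_normalize_val.json]'
```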

* start training
```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml
```

Finally, metrics such as `precision`, `recall`, and `hmean` will be printed.
The training log, the best model, and the model of the latest epoch will be saved in the `./output/ser_layoutxlm/` folder.

* resume training

To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```

* evaluate

Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```
Finally, metrics such as `precision`, `recall`, and `hmean` will be printed.

* `OCR + SER` tandem prediction based on the training engine

Use the following command to run the tandem prediction of `OCR engine + SER`, taking the LayoutXLM-based SER model as an example:

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/infer_vqa_token_ser.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.infer_img=ppstructure/docs/vqa/input/zh_val_42.jpg
```

Finally, the visualized prediction image and the prediction result text file will be saved in the directory configured by the `Global.save_res_path` field. The prediction result text file is named `infer_results.txt`.

* End-to-end evaluation of `OCR + SER` prediction system

First use the `tools/infer_vqa_token_ser.py` script to run prediction over the whole dataset (a sketch is given below), then evaluate with the command that follows.
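
A possible invocation for that first step is sketched here; treat it as an assumption to verify — in particular that `Global.infer_img` accepts a directory of images and that the validation images live at the path shown.

```bash
# Sketch: run SER prediction over the validation images so that infer_results.txt is produced
# in the directory configured by Global.save_res_path. The image path is an example.
CUDA_VISIBLE_DEVICES=0 python3 tools/infer_vqa_token_ser.py -c configs/vqa/ser/layoutxlm.yml \
    -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ \
       Global.infer_img=XFUND/zh_val/image/
```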

```shell
export CUDA_VISIBLE_DEVICES=0
python3 tools/eval_with_label_end2end.py --gt_json_path XFUND/zh_val/xfun_normalize_val.json --pred_json_path output_res/infer_results.txt
```
* export model

Use the following command to export the SER model to the inference format, taking the LayoutXLM-based SER model as an example:

```shell
python3.7 tools/export_model.py -c configs/vqa/ser/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/ Global.save_inference_dir=output/ser/infer
```
The converted model will be stored in the directory specified by the `Global.save_inference_dir` field.

* `OCR + SER` tandem prediction based on the prediction engine

Use the following command to run the tandem prediction of `OCR + SER` with the prediction engine, taking the LayoutXLM-based SER model as an example:

```shell
cd ppstructure
CUDA_VISIBLE_DEVICES=0 python3.7 vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM --ser_model_dir=../output/ser/infer --ser_dict_path=../train_data/XFUND/class_list_xfun.txt --image_dir=docs/vqa/input/zh_val_42.jpg --output=output
```
After the prediction succeeds, the visualization images and results will be saved in the directory specified by the `output` parameter.

<a name="53"></a>
### 5.3 RE

* start training

Before starting training, you need to modify the following four fields in `configs/vqa/re/layoutxlm.yml` (the same fields as in the SER config):

1. `Train.dataset.data_dir`: the directory where the training-set images are stored
2. `Train.dataset.label_file_list`: the path of the training-set label file
3. `Eval.dataset.data_dir`: the directory where the validation-set images are stored
4. `Eval.dataset.label_file_list`: the path of the validation-set label file

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml
```

Finally, metrics such as `precision`, `recall`, and `hmean` will be printed.
The training log, the best model, and the model of the latest epoch will be saved in the `./output/re_layoutxlm/` folder.

* resume training

To resume training, assign the folder path of the previously trained model to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/train.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```

* evaluate

Evaluation requires assigning the folder path of the model to be evaluated to the `Architecture.Backbone.checkpoints` field.

```shell
CUDA_VISIBLE_DEVICES=0 python3 tools/eval.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=path/to/model_dir
```
Finally, metrics such as `precision`, `recall`, and `hmean` will be printed.

* `OCR engine + SER + RE` tandem prediction

Use the following command to run the tandem prediction of `OCR engine + SER + RE`, taking the pretrained SER and RE models as an example:
```shell
export CUDA_VISIBLE_DEVICES=0
python3 tools/infer_vqa_token_ser_re.py -c configs/vqa/re/layoutxlm.yml -o Architecture.Backbone.checkpoints=pretrain/re_LayoutXLM_xfun_zh/ Global.infer_img=ppstructure/docs/vqa/input/zh_val_21.jpg -c_ser configs/vqa/ser/layoutxlm.yml -o_ser Architecture.Backbone.checkpoints=pretrain/ser_LayoutXLM_xfun_zh/
```

Finally, the visualized prediction image and the prediction result text file will be saved in the directory configured by the `Global.save_res_path` field. The prediction result text file is named `infer_results.txt`.

* export model

coming soon

* `OCR + SER + RE` tandem prediction based on the prediction engine

coming soon

## 6. Reference Links

- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding, https://arxiv.org/pdf/2104.08836.pdf
- microsoft/unilm/layoutxlm, https://github.com/microsoft/unilm/tree/master/layoutxlm
- XFUND dataset, https://github.com/doc-analysis/XFUND

## License

The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.