PPStructure is an OCR toolkit for complex layout analysis. It can divide document data in the form of pictures into **text, table, title, picture and list** 5 types of areas, and extract the table area as excel
## 1. Quick start
# PP-Structure
### install
**install PaddlePaddle2.0**
PP-Structure is an OCR toolkit that can be used for complex documents analysis. The main features are as follows:
- Support the layout analysis of documents, divide the documents into 5 types of areas **text, title, table, image and list** (conjunction with Layout-Parser)
- Support to extract the texts from the text, title, picture and list areas (used in conjunction with PP-OCR)
- Support to extract excel files from the table areas
- Support python whl package and command line usage, easy to use
- Support custom training for layout analysis and table structure tasks
- The total model size is only about 18.6M (continuous optimization)
```bash
pip3 install--upgrade pip
## 1. Visualization
# If you have cuda9 or cuda10 installed on your machine, please run the following command to install
# Note: The cloud-hosting code may not be able to synchronize the update with this GitHub project in real time. There might be a delay of 3-5 days. Please give priority to the recommended method.
# For more,refer[Installation](https://www.paddlepaddle.org.cn/install/quick)。
```
**install paddleocr**
-**(2) Install Layout-Parser**
install by pypi
```bash
cd PaddleOCR
pip install"paddleocr>=2.2"# # Recommend to use version 2.2
The return result of PPStructure is a list composed of a dict, an example is as follows
```shell
...
...
@@ -106,7 +111,7 @@ The description of each field in dict is as follows
|res|OCR or table recognition result of image area。<br> Table: HTML string of the table; <br> OCR: A tuple containing the detection coordinates and recognition results of each single line of text|
@@ -119,22 +124,22 @@ Most of the parameters are consistent with the paddleocr whl package, see [doc o
After running, each image will have a directory with the same name under the directory specified in the output field. Each table in the picture will be stored as an excel and figure area will be cropped and saved, the excel and image file name will be the coordinates of the table in the image.
## 2. PPStructure Pipeline
## 4. PPStructure Pipeline
the process is as follows
![pipeline](../doc/table/pipeline_en.jpg)
In PPStructure, the image will be analyzed by layoutparser first. In the layout analysis, the area in the image will be classified, including **text, title, image, list and table** 5 categories. For the first 4 types of areas, directly use the PP-OCR to complete the text detection and recognition. The table area will be converted to an excel file of the same table style via Table OCR.
### 2.1 LayoutParser
### 4.1 LayoutParser
Layout analysis divides the document data into regions, including the use of Python scripts for layout analysis tools, extraction of special category detection boxes, performance indicators, and custom training layout analysis models. For details, please refer to [document](layout/README_en.md).
### 2.2 Table Structure
### 4.2 Table Structure
Table OCR converts table image into excel documents, which include the detection and recognition of table text and the prediction of table structure and cell coordinates. For detailed, please refer to [document](table/README.md)
## 3. Predictive by inference engine
## 5. Predictive by inference engine
Use the following commands to complete the inference.
The following figure shows the result, with different colored detection boxes representing different categories and displaying specific categories in the upper left corner of the box with `show_element_type`
| config_path | model config path | None | Specify config_ path will automatically download the model (only for the first time,the model will exist and will not be downloaded again) |
| model_path | model path | None | local model path, config_ path and model_ path must be set to one, cannot be none at the same time |
| label_map | category mapping table | None | Setting config_ path, it can be none, and the label is automatically obtained according to the dataset name_ map |
| enforce_cpu | whether to use CPU | False | False to use GPU, and True to force the use of CPU |
| enforce_mkldnn | whether mkldnn acceleration is enabled in CPU prediction | True | \ |
| thread_num | the number of CPU threads | 10 | \ |
The following model configurations and label maps are currently supported, which you can use by modifying '--config_path' and '--label_map' to detect different types of content:
* TableBank word and TableBank latex are trained on datasets of word documents and latex documents respectively;
*Download TableBank dataset contains both word and latex。
<aname="后处理"></a>
<aname="PostProcess"></a>
## 3. 后处理
## 3. PostProcess
版面分析检测包含多个类别,如果只想获取指定类别(如"Text"类别)的检测框、可以使用下述代码:
Layout parser contains multiple categories, if you only want to get the detection box for a specific category (such as the "Text" category), you can use the following code:
The above model is based on PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection) ,if you want to train your own layout parser model,please refer to:[train_layoutparser_model](train_layoutparser_model.md)
The following figure shows the result, with different colored detection boxes representing different categories and displaying specific categories in the upper left corner of the box with `show_element_type`
| config_path | model config path | None | Specify config_ path will automatically download the model (only for the first time,the model will exist and will not be downloaded again) |
| model_path | model path | None | local model path, config_ path and model_ path must be set to one, cannot be none at the same time |
| label_map | category mapping table | None | Setting config_ path, it can be none, and the label is automatically obtained according to the dataset name_ map |
| enforce_cpu | whether to use CPU | False | False to use GPU, and True to force the use of CPU |
| enforce_mkldnn | whether mkldnn acceleration is enabled in CPU prediction | True | \ |
| thread_num | the number of CPU threads | 10 | \ |
The following model configurations and label maps are currently supported, which you can use by modifying '--config_path' and '--label_map' to detect different types of content:
Layout parser contains multiple categories, if you only want to get the detection box for a specific category (such as the "Text" category), you can use the following code:
@@ -131,9 +134,8 @@ Displays results with only the "Text" category:
**GPU:** a single NVIDIA Tesla P40
<aname="Training"></a>
## 5. Training
<aname="训练版面分析模型"></a>
The above model is based on PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection) ,if you want to train your own layout parser model,please refer to:[train_layoutparser_model](train_layoutparser_model_en.md)
For more installation tutorials, please refer to:[Install doc](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/docs/tutorials/INSTALL_cn.md)
For other datasets,please refer to [the PrepareDataSet]((https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/docs/tutorials/PrepareDataSet.md))
- ppyolov2_reader.yml:mainly describes the configuration of data readers, such as batch size and number of concurrent loading child processes, and also includes post preprocessing, such as resize and data augmention etc.
根据实际情况,修改上述文件,比如数据集路径、batch size等。
Modify the preceding files, such as the dataset path and batch size etc.
<aname="训练"></a>
<aname="Training"></a>
## 4. PaddleDetection训练
## 4. Training
PaddleDetection提供了单卡/多卡训练模式,满足用户多种训练需求
PaddleDetection provides single-card/multi-card training mode to meet various training needs of users:
`--draw_threshold`is an optional parameter. According to the calculation of [NMS](https://ieeexplore.ieee.org/document/1699659), different threshold will produce different results, ` keep_top_k ` represent the maximum amount of output target, the default value is 10. You can set different value according to your own actual situation。
n the process of model training, the model file saved contains the process of forward prediction and back propagation. In the actual industrial deployment, there is no need for back propagation. Therefore, the model should be translated into the model format required by the deployment. The `tools/export_model.py` script is provided in PaddleDetection to export the model.
The exported model name defaults to `model.*`, Layout Parser's code model is `inference.*`, So change [PaddleDetection/ppdet/engine/trainer. Py ](https://github.com/PaddlePaddle/PaddleDetection/blob/b87a1ea86fa18ce69e44a17ad1b49c1326f19ff9/ppdet/engine/trainer.py# L512) (click on the link to see the detailed line of code), change 'model' to 'inference'.
The prediction model is exported to `inference/ppyolov2_r50vd_dcn_365e_coco` ,including:`infer_cfg.yml`(prediction not required), `inference.pdiparams`, `inference.pdiparams.info`,`inference.pdmodel`
More PaddleDetection training tutorials,please reference:[PaddleDetection Training](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/docs/tutorials/GETTING_STARTED_cn.md)
For more installation tutorials, please refer to:[Install doc](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/docs/tutorials/INSTALL_cn.md)
@@ -78,13 +78,13 @@ PubLayNet directory structure after decompressing :
| `LICENSE.txt` | Plaintext version of the CDLA-Permissive license | 1 |
| `README.txt` | Text file with the file names and description | 1 |
For other datasets,please refer to [the PrepareDataSet]((https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/docs/tutorials/PrepareDataSet.md))
- optimizer_365e.yml:mainly explains the learning rate and optimizer configuration
- optimizer_365e.yml:主要说明了学习率和优化器的配置
- ppyolov2_r50vd_dcn.yml:mainly describes the model and the network
- ppyolov2_r50vd_dcn.yml:主要说明模型和主干网络的情况
- ppyolov2_reader.yml:mainly describes the configuration of data readers, such as batch size and number of concurrent loading child processes, and also includes post preprocessing, such as resize and data augmention etc.
`--draw_threshold`is an optional parameter. According to the calculation of [NMS](https://ieeexplore.ieee.org/document/1699659), different threshold will produce different results, ` keep_top_k ` represent the maximum amount of output target, the default value is 10. You can set different value according to your own actual situation。
n the process of model training, the model file saved contains the process of forward prediction and back propagation. In the actual industrial deployment, there is no need for back propagation. Therefore, the model should be translated into the model format required by the deployment. The `tools/export_model.py` script is provided in PaddleDetection to export the model.
The exported model name defaults to `model.*`, Layout Parser's code model is `inference.*`, So change [PaddleDetection/ppdet/engine/trainer. Py ](https://github.com/PaddlePaddle/PaddleDetection/blob/b87a1ea86fa18ce69e44a17ad1b49c1326f19ff9/ppdet/engine/trainer.py# L512) (click on the link to see the detailed line of code), change 'model' to 'inference'.
The prediction model is exported to `inference/ppyolov2_r50vd_dcn_365e_coco` ,including:`infer_cfg.yml`(prediction not required), `inference.pdiparams`, `inference.pdiparams.info`,`inference.pdmodel`
`model_path` represent the trained model path, and layoutparser is used to predict:
`model_path`指定训练好的模型路径,使用layout parser进行预测:
```bash
import layoutparser as lp
...
...
@@ -198,7 +198,6 @@ model = lp.PaddleDetectionLayoutModel(model_path="inference/ppyolov2_r50vd_dcn_3
***
More PaddleDetection training tutorials,please reference:[PaddleDetection Training](https://github.com/PaddlePaddle/PaddleDetection/blob/release/2.1/docs/tutorials/GETTING_STARTED_cn.md)