diff --git a/doc/doc_ch/table_recognition.md b/doc/doc_ch/table_recognition.md new file mode 100644 index 0000000000000000000000000000000000000000..fea95222cf68f9436b43edf63f8e1a549cb65491 --- /dev/null +++ b/doc/doc_ch/table_recognition.md @@ -0,0 +1,310 @@ +# 表格识别 + +本文提供了PaddleOCR表格识别模型的全流程指南,包括数据准备、模型训练、调优、评估、预测,各个阶段的详细说明: + +- [1. 数据准备](#1-数据准备) + - [1.1. 准备数据集](#11-准备数据集) + - [1.2. 数据下载](#12-数据下载) + - [1.3. 数据集生成](#13-数据集生成) +- [2. 开始训练](#2-开始训练) + - [2.1. 启动训练](#21-启动训练) + - [2.2. 断点训练](#22-断点训练) + - [2.3. 更换Backbone 训练](#23-更换backbone-训练) + - [2.4. 混合精度训练](#24-混合精度训练) + - [2.5. 分布式训练](#25-分布式训练) + - [2.6. 知识蒸馏训练](#26-知识蒸馏训练) + - [2.7. 其他训练环境](#27-其他训练环境) + - [2.8 模型微调](#28-模型微调) +- [3. 模型评估与预测](#3-模型评估与预测) + - [3.1. 指标评估](#31-指标评估) + - [3.2. 测试表格结构识别效果](#32-测试表格结构识别效果) +- [4. 模型导出与预测](#4-模型导出与预测) +- [5. FAQ](#5-faq) + +# 1. 数据准备 + +## 1.1. 准备数据集 + +PaddleOCR 表格识别模型数据集格式如下: +```txt +img_label # 每张图片标注经过json.dumps()之后的字符串 +... +img_label +``` + +每一行的json格式为: +```json +{ + 'filename': PMC5755158_010_01.png, # 图像名 + 'split': ’train‘, # 图像属于训练集还是验证集 + 'imgid': 0, # 图像的index + 'html': { + 'structure': {'tokens': ['', '', '', ...]}, # 表格的HTML字符串 + 'cell': [ + { + 'tokens': ['P', 'a', 'd', 'd', 'l', 'e', 'P', 'a', 'd', 'd', 'l', 'e'], # 表格中的单个文本 + 'bbox': [x0, y0, x1, y1] # 表格中的单个文本的坐标 + } + ] + } +} +``` + +训练数据的默认存储路径是 `PaddleOCR/train_data`,如果您的磁盘上已有数据集,只需创建软链接至数据集目录: + +``` +# linux and mac os +ln -sf /train_data/dataset +# windows +mklink /d /train_data/dataset +``` + +## 1.2. 数据下载 + +公开数据集下载可参考 [table_datasets](dataset/table_datasets.md)。 + +## 1.3. 数据集生成 + +使用[TableGeneration](https://github.com/WenmuZhou/TableGeneration)可进行扫描表格图像的生成。 + +TableGeneration是一个开源表格数据集生成工具,其通过浏览器渲染的方式对html字符串进行渲染后获得表格图像。部分样张如下: + +|类型|样例| +|---|---| +|简单表格|![](https://github.com/WenmuZhou/TableGeneration/blob/main/imgs/simple.jpg)| +|彩色表格|![](https://github.com/WenmuZhou/TableGeneration/blob/main/imgs/color.jpg)| + +# 2. 
开始训练 + +PaddleOCR提供了训练脚本、评估脚本和预测脚本,本节将以 [SLANet](../../configs/table/SLANet.yml) 模型训练PubTabNet英文数据集为例: + +## 2.1. 启动训练 + +*如果您安装的是cpu版本,请将配置文件中的 `use_gpu` 字段修改为false* + +``` +# GPU训练 支持单卡,多卡训练 +# 训练日志会自动保存为 "{save_model_dir}" 下的train.log + +#单卡训练(训练周期长,不建议) +python3 tools/train.py -c configs/table/SLANet.yml + +#多卡训练,通过--gpus参数指定卡号 +python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/table/SLANet.yml +``` + +正常启动训练后,会看到以下log输出: + +``` +[2022/08/16 03:07:33] ppocr INFO: epoch: [1/400], global_step: 20, lr: 0.000100, acc: 0.000000, loss: 3.915012, structure_loss: 3.229450, loc_loss: 0.670590, avg_reader_cost: 2.63382 s, avg_batch_cost: 6.32390 s, avg_samples: 48.0, ips: 7.59025 samples/s, eta: 9 days, 2:29:27 +[2022/08/16 03:08:41] ppocr INFO: epoch: [1/400], global_step: 40, lr: 0.000100, acc: 0.000000, loss: 1.750859, structure_loss: 1.082116, loc_loss: 0.652822, avg_reader_cost: 0.02533 s, avg_batch_cost: 3.37251 s, avg_samples: 48.0, ips: 14.23271 samples/s, eta: 6 days, 23:28:43 +[2022/08/16 03:09:46] ppocr INFO: epoch: [1/400], global_step: 60, lr: 0.000100, acc: 0.000000, loss: 1.395154, structure_loss: 0.776803, loc_loss: 0.625030, avg_reader_cost: 0.02550 s, avg_batch_cost: 3.26261 s, avg_samples: 48.0, ips: 14.71214 samples/s, eta: 6 days, 5:11:48 +``` + +log 中自动打印如下信息: + +| 字段 | 含义 | +| :----: | :------: | +| epoch | 当前迭代轮次 | +| global_step | 当前迭代次数 | +| lr | 当前学习率 | +| acc | 当前batch的准确率 | +| loss | 当前损失函数 | +| structure_loss | 表格结构损失值 | +| loc_loss | 单元格坐标损失值 | +| avg_reader_cost | 当前 batch 数据处理耗时 | +| avg_batch_cost | 当前 batch 总耗时 | +| avg_samples | 当前 batch 内的样本数 | +| ips | 每秒处理图片的数量 | + + +PaddleOCR支持训练和评估交替进行, 可以在 `configs/table/SLANet.yml` 中修改 `eval_batch_step` 设置评估频率,默认每1000个iter评估一次。评估过程中默认将最佳acc模型,保存为 `output/SLANet/best_accuracy` 。 + +如果验证集很大,测试将会比较耗时,建议减少评估次数,或训练完再进行评估。 + +**提示:** 可通过 -c 参数选择 `configs/table/` 
路径下的多种模型配置进行训练,PaddleOCR支持的表格识别算法可以参考[前沿算法列表](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_ch/algorithm_overview.md#3-%E8%A1%A8%E6%A0%BC%E8%AF%86%E5%88%AB%E7%AE%97%E6%B3%95): + +**注意,预测/评估时的配置文件请务必与训练一致。** + +## 2.2. 断点训练 + +如果训练程序中断,如果希望加载训练中断的模型从而恢复训练,可以通过指定Global.checkpoints指定要加载的模型路径: +```shell +python3 tools/train.py -c configs/table/SLANet.yml -o Global.checkpoints=./your/trained/model +``` + +**注意**:`Global.checkpoints`的优先级高于`Global.pretrained_model`的优先级,即同时指定两个参数时,优先加载`Global.checkpoints`指定的模型,如果`Global.checkpoints`指定的模型路径有误,会加载`Global.pretrained_model`指定的模型。 + +## 2.3. 更换Backbone 训练 + +PaddleOCR将网络划分为四部分,分别在[ppocr/modeling](../../ppocr/modeling)下。 进入网络的数据将按照顺序(transforms->backbones->necks->heads)依次通过这四个部分。 + +```bash +├── architectures # 网络的组网代码 +├── transforms # 网络的图像变换模块 +├── backbones # 网络的特征提取模块 +├── necks # 网络的特征增强模块 +└── heads # 网络的输出模块 +``` +如果要更换的Backbone 在PaddleOCR中有对应实现,直接修改配置yml文件中`Backbone`部分的参数即可。 + +如果要使用新的Backbone,更换backbones的例子如下: + +1. 在 [ppocr/modeling/backbones](../../ppocr/modeling/backbones) 文件夹下新建文件,如my_backbone.py。 +2. 在 my_backbone.py 文件内添加相关代码,示例代码如下: + +```python +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class MyBackbone(nn.Layer): + def __init__(self, *args, **kwargs): + super(MyBackbone, self).__init__() + # your init code + self.conv = nn.xxxx + + def forward(self, inputs): + # your network forward + y = self.conv(inputs) + return y +``` + +3. 在 [ppocr/modeling/backbones/\__init\__.py](../../ppocr/modeling/backbones/__init__.py)文件内导入添加的`MyBackbone`模块,然后修改配置文件中Backbone进行配置即可使用,格式如下: + +```yaml +Backbone: +name: MyBackbone +args1: args1 +``` + +**注意**:如果要更换网络的其他模块,可以参考[文档](./add_new_algorithm.md)。 + +## 2.4. 
混合精度训练 + +如果您想进一步加快训练速度,可以使用[自动混合精度训练](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/01_paddle2.0_introduction/basic_concept/amp_cn.html), 以单机单卡为例,命令如下: + +```shell +python3 tools/train.py -c configs/table/SLANet.yml \ + -o Global.pretrained_model=./pretrain_models/SLANet/best_accuracy \ + Global.use_amp=True Global.scale_loss=1024.0 Global.use_dynamic_loss_scaling=True + ``` + +## 2.5. 分布式训练 + +多机多卡训练时,通过 `--ips` 参数设置使用的机器IP地址,通过 `--gpus` 参数设置使用的GPU ID: + +```bash +python3 -m paddle.distributed.launch --ips="xx.xx.xx.xx,xx.xx.xx.xx" --gpus '0,1,2,3' tools/train.py -c configs/table/SLANet.yml \ + -o Global.pretrained_model=./pretrain_models/SLANet/best_accuracy +``` + +**注意:** (1)采用多机多卡训练时,需要替换上面命令中的ips值为您机器的地址,机器之间需要能够相互ping通;(2)训练时需要在多个机器上分别启动命令。查看机器ip地址的命令为`ifconfig`;(3)更多关于分布式训练的性能优势等信息,请参考:[分布式训练教程](./distributed_training.md)。 + +## 2.6. 知识蒸馏训练 + +coming soon! + +## 2.7. 其他训练环境 + +- Windows GPU/CPU +在Windows平台上与Linux平台略有不同: +Windows平台只支持`单卡`的训练与预测,指定GPU进行训练`set CUDA_VISIBLE_DEVICES=0` +在Windows平台,DataLoader只支持单进程模式,因此需要设置 `num_workers` 为0; + +- macOS +不支持GPU模式,需要在配置文件中设置`use_gpu`为False,其余训练评估预测命令与Linux GPU完全相同。 + +- Linux DCU +DCU设备上运行需要设置环境变量 `export HIP_VISIBLE_DEVICES=0,1,2,3`,其余训练评估预测命令与Linux GPU完全相同。 + +## 2.8 模型微调 + +实际使用过程中,建议加载官方提供的预训练模型,在自己的数据集中进行微调,关于模型的微调方法,请参考:[模型微调教程](./finetune.md)。 + + +# 3. 模型评估与预测 + +## 3.1. 指标评估 + +训练中模型参数默认保存在`Global.save_model_dir`目录下。在评估指标时,需要设置`Global.checkpoints`指向保存的参数文件。评估数据集可以通过 `configs/table/SLANet.yml` 修改Eval中的 `label_file_list` 设置。 + + +``` +# GPU 评估, Global.checkpoints 为待测权重 +python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/table/SLANet.yml -o Global.checkpoints={path/to/weights}/best_accuracy +``` + +## 3.2. 
测试表格结构识别效果 + +使用 PaddleOCR 训练好的模型,可以通过以下脚本进行快速预测。 + +默认预测图片存储在 `infer_img` 里,通过 `-o Global.checkpoints` 加载训练好的参数文件: + +根据配置文件中设置的 `save_model_dir` 和 `save_epoch_step` 字段,会有以下几种参数被保存下来: + +``` +output/SLANet/ +├── best_accuracy.pdopt +├── best_accuracy.pdparams +├── best_accuracy.states +├── config.yml +├── latest.pdopt +├── latest.pdparams +├── latest.states +└── train.log +``` +其中 best_accuracy.* 是评估集上的最优模型;latest.* 是最后一个epoch的模型。 + +``` +# 预测表格图像 +python3 tools/infer_table.py -c configs/table/SLANet.yml -o Global.pretrained_model={path/to/weights}/best_accuracy Global.infer_img=ppstructure/docs/table/table.jpg +``` + +预测图片: + +![](../../ppstructure/docs/table/table.jpg) + +得到输入图像的预测结果: + +``` +['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '
', '', ''],[[320.0562438964844, 197.83375549316406, 350.0928955078125, 214.4309539794922], ... , [318.959228515625, 271.0166931152344, 353.7394104003906, 286.4538269042969]] +``` + +单元格坐标可视化结果为 + +![](../../ppstructure/docs/imgs/slanet_result.jpg) + +# 4. 模型导出与预测 + +inference 模型(`paddle.jit.save`保存的模型) +一般是模型训练,把模型结构和模型参数保存在文件中的固化模型,多用于预测部署场景。 +训练过程中保存的模型是checkpoints模型,保存的只有模型的参数,多用于恢复训练等。 +与checkpoints模型相比,inference 模型会额外保存模型的结构信息,在预测部署、加速推理上性能优越,灵活方便,适合于实际系统集成。 + +表格识别模型转inference模型与文字检测识别的方式相同,如下: + +``` +# -c 后面设置训练算法的yml配置文件 +# -o 配置可选参数 +# Global.pretrained_model 参数设置待转换的训练模型地址,不用添加文件后缀 .pdmodel,.pdopt或.pdparams。 +# Global.save_inference_dir参数设置转换的模型将保存的地址。 + +python3 tools/export_model.py -c configs/table/SLANet.yml -o Global.pretrained_model=./pretrain_models/SLANet/best_accuracy Global.save_inference_dir=./inference/SLANet/ +``` + +转换成功后,在目录下有三个文件: + +``` +inference/SLANet/ + ├── inference.pdiparams # inference模型的参数文件 + ├── inference.pdiparams.info # inference模型的参数信息,可忽略 + └── inference.pdmodel # inference模型的program文件 +``` + +# 5. FAQ + +Q1: 训练模型转inference 模型之后预测效果不一致? + +**A**:此类问题出现较多,问题多是trained model预测时候的预处理、后处理参数和inference model预测的时候的预处理、后处理参数不一致导致的。可以对比训练使用的配置文件中的预处理、后处理和预测时是否存在差异。 diff --git a/doc/doc_en/table_recognition_en.md b/doc/doc_en/table_recognition_en.md new file mode 100644 index 0000000000000000000000000000000000000000..28f8c6fa98246d584532ebcea1d77394a03921f8 --- /dev/null +++ b/doc/doc_en/table_recognition_en.md @@ -0,0 +1,320 @@ +# Table Recognition + +This article provides a full-process guide for the PaddleOCR table recognition model, including data preparation, model training, tuning, evaluation, prediction, and detailed descriptions of each stage: + +- [1. Data Preparation](#1-data-preparation) + - [1.1. DataSet Preparation](#11-dataset-preparation) + - [1.2. Data Download](#12-data-download) + - [1.3. Dataset Generation](#13-dataset-generation) +- [2. Training](#2-training) + - [2.1. Start Training](#21-start-training) + - [2.2. 
Resume Training](#22-resume-training) + - [2.3. Training with New Backbone](#23-training-with-new-backbone) + - [2.4. Mixed Precision Training](#24-mixed-precision-training) + - [2.5. Distributed Training](#25-distributed-training) + - [2.6. Training with Knowledge Distillation](#26-training-with-knowledge-distillation) + - [2.7. Training on other platform(Windows/macOS/Linux DCU)](#27-training-on-other-platformwindowsmacoslinux-dcu) + - [2.8 Fine-tuning](#28-fine-tuning) +- [3. Evaluation and Test](#3-evaluation-and-test) + - [3.1. Evaluation](#31-evaluation) + - [3.2. Test table structure recognition effect](#32-test-table-structure-recognition-effect) +- [4. Model export and prediction](#4-model-export-and-prediction) +- [5. FAQ](#5-faq) + +# 1. Data Preparation + +## 1.1. DataSet Preparation + +The format of the PaddleOCR table recognition model dataset is as follows: +```txt +img_label # the annotation of each image, serialized to a string with json.dumps() +... +img_label +``` + +The json format of each line is: +```json +{ + 'filename': 'PMC5755158_010_01.png', # image name + 'split': 'train', # whether the image belongs to the training set or the validation set + 'imgid': 0, # index of the image + 'html': { + 'structure': {'tokens': ['', '', '', ...]}, # HTML string of the table + 'cell': [ + { + 'tokens': ['P', 'a', 'd', 'd', 'l', 'e', 'P', 'a', 'd', 'd', 'l', 'e'], # text in the cell + 'bbox': [x0, y0, x1, y1] # bbox of the cell + } + ] + } +} +``` + +The default storage path for training data is `PaddleOCR/train_data`. If you already have a dataset on disk, just create a soft link to the dataset directory: + +``` +# linux and mac os +ln -sf /train_data/dataset +# windows +mklink /d /train_data/dataset +``` + +## 1.2. Data Download + +To download public datasets, refer to [table_datasets](dataset/table_datasets_en.md). + +## 1.3. Dataset Generation + +Use [TableGeneration](https://github.com/WenmuZhou/TableGeneration) to generate scanned table images. 
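Before training on generated or downloaded data, it can help to sanity-check each label line against the format described in section 1.1. The following is a minimal sketch, not part of PaddleOCR: the helper name `validate_label_line` and the specific checks are assumptions made for illustration. It assumes each line is the output of `json.dumps()`, and it treats a cell without a `bbox` as acceptable, which is also an assumption.

```python
import json

# Top-level keys listed in the label format of section 1.1.
REQUIRED_KEYS = ("filename", "split", "imgid", "html")


def validate_label_line(line):
    """Return a list of problems found in one label line (empty if none).

    Hypothetical helper for illustration; it is not part of PaddleOCR.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]

    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in record]

    html = record.get("html")
    if not isinstance(html, dict):
        return problems  # nothing more to check without the html part

    structure = html.get("structure")
    if not isinstance(structure, dict) or "tokens" not in structure:
        problems.append("missing html.structure.tokens")

    cells = html.get("cell", [])
    if not isinstance(cells, list):
        problems.append("html.cell is not a list")
        cells = []

    for index, cell in enumerate(cells):
        if not isinstance(cell, dict):
            problems.append(f"cell {index}: not a JSON object")
            continue
        bbox = cell.get("bbox")
        if bbox is None:
            continue  # assumption: a cell may carry no bbox
        is_four_numbers = (
            isinstance(bbox, list)
            and len(bbox) == 4
            and all(isinstance(v, (int, float)) for v in bbox)
        )
        if not is_four_numbers:
            problems.append(f"cell {index}: bbox is not four numbers")
            continue
        x0, y0, x1, y1 = bbox
        if x1 < x0 or y1 < y0:
            problems.append(f"cell {index}: bbox corners out of order")
    return problems
```

Running the helper over every line of a label file and printing the non-empty results gives a quick overview of malformed annotations before a long training run is started.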
+ +TableGeneration is an open source table dataset generation tool, which renders html strings through browser rendering to obtain table images. + +Some samples are as follows: + +|Type|Sample| +|---|---| +|Simple Table|![](https://github.com/WenmuZhou/TableGeneration/blob/main/imgs/simple.jpg)| +|Simple Color Table|![](https://github.com/WenmuZhou/TableGeneration/blob/main/imgs/color.jpg)| + +# 2. Training + +PaddleOCR provides training scripts, evaluation scripts, and prediction scripts. In this section, the [SLANet](../../configs/table/SLANet.yml) model will be used as an example: + +## 2.1. Start Training + +*If you are installing the cpu version, please modify the `use_gpu` field in the configuration file to false* + +``` +# GPU training Support single card and multi-card training +# The training log will be automatically saved as train.log under "{save_model_dir}" + +# specify the single card training(Long training time, not recommended) +python3 tools/train.py -c configs/table/SLANet.yml + +# specify the card number through --gpus +python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/table/SLANet.yml +``` + +After starting training normally, you will see the following log output: + +``` +[2022/08/16 03:07:33] ppocr INFO: epoch: [1/400], global_step: 20, lr: 0.000100, acc: 0.000000, loss: 3.915012, structure_loss: 3.229450, loc_loss: 0.670590, avg_reader_cost: 2.63382 s, avg_batch_cost: 6.32390 s, avg_samples: 48.0, ips: 7.59025 samples/s, eta: 9 days, 2:29:27 +[2022/08/16 03:08:41] ppocr INFO: epoch: [1/400], global_step: 40, lr: 0.000100, acc: 0.000000, loss: 1.750859, structure_loss: 1.082116, loc_loss: 0.652822, avg_reader_cost: 0.02533 s, avg_batch_cost: 3.37251 s, avg_samples: 48.0, ips: 14.23271 samples/s, eta: 6 days, 23:28:43 +[2022/08/16 03:09:46] ppocr INFO: epoch: [1/400], global_step: 60, lr: 0.000100, acc: 0.000000, loss: 1.395154, structure_loss: 0.776803, loc_loss: 0.625030, avg_reader_cost: 0.02550 s, 
avg_batch_cost: 3.26261 s, avg_samples: 48.0, ips: 14.71214 samples/s, eta: 6 days, 5:11:48 +``` + +The following information is automatically printed in the log: + +| Field | Meaning | +| :----: | :------: | +| epoch | current epoch | +| global_step | current iteration count | +| lr | current learning rate | +| acc | accuracy of the current batch | +| loss | current loss value | +| structure_loss | table structure loss value | +| loc_loss | cell coordinate loss value | +| avg_reader_cost | data processing time of the current batch | +| avg_batch_cost | total time spent on the current batch | +| avg_samples | number of samples in the current batch | +| ips | number of images processed per second | + + +PaddleOCR supports alternating training and evaluation. You can modify `eval_batch_step` in `configs/table/SLANet.yml` to set the evaluation frequency. By default, the model is evaluated once every 1000 iterations. During evaluation, the model with the best acc is saved as `output/SLANet/best_accuracy` by default. + +If the validation set is large, evaluation will be time-consuming. It is recommended to reduce the number of evaluations, or to evaluate after training. + +**Tips:** You can use the -c parameter to select various model configurations under the `configs/table/` path for training. For the table recognition algorithms supported by PaddleOCR, please refer to [Table Algorithms List](https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/doc/doc_en/algorithm_overview_en.md#3). + +**Note that the configuration file for prediction/evaluation must be the same as the one used for training.** + +## 2.2. 
Resume Training + +If training is interrupted and you want to resume from the interrupted model, set `Global.checkpoints` to the path of the model to load: + +```shell +python3 tools/train.py -c configs/table/SLANet.yml -o Global.checkpoints=./your/trained/model +``` +**Note**: The priority of `Global.checkpoints` is higher than that of `Global.pretrained_model`, that is, when both parameters are specified at the same time, the model specified by `Global.checkpoints` will be loaded first. If the model path specified by `Global.checkpoints` is incorrect, the model specified by `Global.pretrained_model` will be loaded instead. + +## 2.3. Training with New Backbone + +PaddleOCR divides the network into four parts, which are located under [ppocr/modeling](../../ppocr/modeling). Data entering the network passes through these four parts in sequence (transforms->backbones-> +necks->heads). + +```bash +├── architectures # Code for building network +├── transforms # Image Transformation Module +├── backbones # Feature extraction module +├── necks # Feature enhancement module +└── heads # Output module +``` + +If the Backbone you want to switch to already has an implementation in PaddleOCR, you can directly modify the parameters in the `Backbone` part of the configuration yml file. + +To use a new Backbone instead, follow these steps: + +1. Create a new file under the [ppocr/modeling/backbones](../../ppocr/modeling/backbones) folder, such as my_backbone.py. +2. 
Add code in the my_backbone.py file. The sample code is as follows: + +```python +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + + +class MyBackbone(nn.Layer): + def __init__(self, *args, **kwargs): + super(MyBackbone, self).__init__() + # your init code + self.conv = nn.Conv2D(in_channels=3, out_channels=32, kernel_size=3) # placeholder layer, replace with your own + + def forward(self, inputs): + # your network forward + y = self.conv(inputs) + return y +``` + +3. Import the added module in the [ppocr/modeling/backbones/\__init\__.py](../../ppocr/modeling/backbones/__init__.py) file. + +After adding the new module, you only need to configure it in the configuration file to use it, for example: + +```yaml + Backbone: + name: MyBackbone + args1: args1 +``` + +**NOTE**: More details about replacing the Backbone and other modules can be found in [doc](add_new_algorithm_en.md). + +## 2.4. Mixed Precision Training + +If you want to speed up your training further, you can use [Auto Mixed Precision Training](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/01_paddle2.0_introduction/basic_concept/amp_cn.html). Taking single-machine, single-GPU training as an example, the command is as follows: + +```shell +python3 tools/train.py -c configs/table/SLANet.yml \ + -o Global.pretrained_model=./pretrain_models/SLANet/best_accuracy \ + Global.use_amp=True Global.scale_loss=1024.0 Global.use_dynamic_loss_scaling=True +``` + +## 2.5. 
Distributed Training + +During multi-machine multi-gpu training, use the `--ips` parameter to set the used machine IP address, and the `--gpus` parameter to set the used GPU ID: + +```bash +python3 -m paddle.distributed.launch --ips="xx.xx.xx.xx,xx.xx.xx.xx" --gpus '0,1,2,3' tools/train.py -c configs/table/SLANet.yml \ + -o Global.pretrained_model=./pretrain_models/SLANet/best_accuracy +``` + + +**Note:** (1) When using multi-machine and multi-gpu training, you need to replace the ips value in the above command with the address of your machine, and the machines need to be able to ping each other. (2) Training needs to be launched separately on multiple machines. The command to view the ip address of the machine is `ifconfig`. (3) For more details about the distributed training speedup ratio, please refer to [Distributed Training Tutorial](./distributed_training_en.md). + +## 2.6. Training with Knowledge Distillation + +coming soon! + +## 2.7. Training on other platform(Windows/macOS/Linux DCU) + +- Windows GPU/CPU +The Windows platform is slightly different from the Linux platform: +Windows platform only supports `single gpu` training and inference, specify GPU for training `set CUDA_VISIBLE_DEVICES=0` +On the Windows platform, DataLoader only supports single-process mode, so you need to set `num_workers` to 0; + +- macOS +GPU mode is not supported, you need to set `use_gpu` to False in the configuration file, and the rest of the training evaluation prediction commands are exactly the same as Linux GPU. + +- Linux DCU +Running on a DCU device requires setting the environment variable `export HIP_VISIBLE_DEVICES=0,1,2,3`, and the rest of the training and evaluation prediction commands are exactly the same as the Linux GPU. + + +## 2.8 Fine-tuning + +In the actual use process, it is recommended to load the officially provided pre-training model and fine-tune it in your own data set. 
For the fine-tuning method of the table recognition model, please refer to: [Model fine-tuning tutorial](./finetune.md). + + +# 3. Evaluation and Test + +## 3.1. Evaluation + +During training, the model parameters are saved in the `Global.save_model_dir` directory by default. When evaluating metrics, you need to set `Global.checkpoints` to point to the saved parameter file. The evaluation dataset can be changed by modifying `label_file_list` under `Eval` in `configs/table/SLANet.yml`. + +``` +# GPU evaluation, Global.checkpoints is the weight to be tested +python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/table/SLANet.yml -o Global.checkpoints={path/to/weights}/best_accuracy +``` + +## 3.2. Test table structure recognition effect + +With a model trained by PaddleOCR, you can quickly run prediction with the following script. + +By default, the image to predict is set by `infer_img`, and the trained weights are specified via `-o Global.checkpoints`: + + +According to the `save_model_dir` and `save_epoch_step` fields set in the configuration file, the following files will be saved: + + +``` +output/SLANet/ +├── best_accuracy.pdopt +├── best_accuracy.pdparams +├── best_accuracy.states +├── config.yml +├── latest.pdopt +├── latest.pdparams +├── latest.states +└── train.log +``` +Among them, best_accuracy.* is the best model on the evaluation set; latest.* is the model of the last epoch. 
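The file layout above can also be inspected programmatically. As a minimal sketch (the helper `find_checkpoint_prefixes` is hypothetical and not part of PaddleOCR), the following lists the checkpoint prefixes in a save directory, that is, paths without a file suffix, which is the form the commands in this document pass to `Global.checkpoints`. It assumes every checkpoint has a `.pdparams` file, as in the listing above.

```python
from pathlib import Path


def find_checkpoint_prefixes(save_dir):
    """Return sorted checkpoint path prefixes (suffix removed) found in save_dir.

    Hypothetical helper for illustration; it is not part of PaddleOCR.
    """
    # Each checkpoint is identified by its .pdparams file; the matching
    # .pdopt and .states files share the same prefix.
    return sorted(
        str(path.with_suffix("")) for path in Path(save_dir).glob("*.pdparams")
    )
```

For the directory shown above, calling `find_checkpoint_prefixes("output/SLANet")` would be expected to yield the `best_accuracy` and `latest` prefixes, either of which can then be supplied on the command line.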
+ +``` +# Predict table image +python3 tools/infer_table.py -c configs/table/SLANet.yml -o Global.pretrained_model={path/to/weights}/best_accuracy Global.infer_img=ppstructure/docs/table/table.jpg +``` + +Input image: + +![](../../ppstructure/docs/table/table.jpg) + +Get the prediction result of the input image: + +``` +['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '
', '', ''],[[320.0562438964844, 197.83375549316406, 350.0928955078125, 214.4309539794922], ... , [318.959228515625, 271.0166931152344, 353.7394104003906, 286.4538269042969]] +``` + +The cell coordinates are visualized as + +![](../../ppstructure/docs/imgs/slanet_result.jpg) + +# 4. Model export and prediction + +An inference model (a model saved by `paddle.jit.save`) +is generally a frozen model produced after training, with both the model structure and the model parameters saved to file; it is mostly used for prediction and deployment. +The model saved during training is a checkpoints model; it stores only the model parameters and is mostly used to resume training. +Compared with the checkpoints model, the inference model additionally saves the structural information of the model. It performs well in deployment and accelerated inference, is flexible and convenient, and is suitable for actual system integration. + +The table recognition model is converted to an inference model in the same way as the text detection and recognition models, as follows: + +``` +# -c sets the yml configuration file of the training algorithm +# -o sets optional parameters +# Global.pretrained_model sets the path of the trained model to convert; do not add the file suffix .pdmodel, .pdopt or .pdparams. +# Global.save_inference_dir sets the directory where the converted model will be saved. + +python3 tools/export_model.py -c configs/table/SLANet.yml -o Global.pretrained_model=./pretrain_models/SLANet/best_accuracy Global.save_inference_dir=./inference/SLANet/ +``` + +After a successful conversion, there are three files in the model save directory: + + +``` +inference/SLANet/ + ├── inference.pdiparams # The parameter file of inference model + ├── inference.pdiparams.info # The parameter information of inference model, which can be ignored + └── inference.pdmodel # The program file of model +``` + +# 5. 
FAQ + +Q1: Why are the prediction results inconsistent after the trained model is converted to an inference model? + +**A**: This problem is common. It is mostly caused by preprocessing and postprocessing parameters that differ between prediction with the trained model and prediction with the inference model. Compare the preprocessing and postprocessing settings in the configuration file used for training with those used at prediction time to see whether they differ. diff --git a/ppstructure/docs/imgs/slanet_result.jpg b/ppstructure/docs/imgs/slanet_result.jpg new file mode 100644 index 0000000000000000000000000000000000000000..011857fbc2295b91a96d938f861d38b8e07421bc Binary files /dev/null and b/ppstructure/docs/imgs/slanet_result.jpg differ diff --git a/ppstructure/table/README.md b/ppstructure/table/README.md index 7ecbe0ad84207795fefcfa49775cbf4c8de69bf3..4204f1f233e65b53b8d444a9d48b5467271e43df 100644 --- a/ppstructure/table/README.md +++ b/ppstructure/table/README.md @@ -1,27 +1,28 @@ -- [Table Recognition](#table-recognition) - - [1. pipeline](#1-pipeline) - - [2. Performance](#2-performance) - - [3. How to use](#3-how-to-use) - - [3.1 quick start](#31-quick-start) - - [3.2 Train](#32-train) - - [3.3 Eval](#33-eval) - - [3.4 Inference](#34-inference) - +English | [简体中文](README_ch.md) # Table Recognition +- [1. pipeline](#1-pipeline) +- [2. Performance](#2-performance) +- [3. How to use](#3-how-to-use) + - [3.1 quick start](#31-quick-start) + - [3.2 Train](#32-train) + - [3.3 Calculate TEDS](#33-calculate-teds) +- [4. Reference](#4-reference) + + ## 1. pipeline The table recognition mainly contains three models 1. Single line text detection-DB 2. Single line text recognition-CRNN -3. Table structure and cell coordinate prediction-RARE +3. Table structure and cell coordinate prediction-SLANet The table recognition flow chart is as follows ![tableocr_pipeline](../docs/table/tableocr_pipeline_en.jpg) 1. 
The coordinates of single-line text is detected by DB model, and then sends it to the recognition model to get the recognition result. -2. The table structure and cell coordinates is predicted by RARE model. +2. The table structure and cell coordinates are predicted by the SLANet model. 3. The recognition result of the cell is combined by the coordinates, recognition result of the single line and the coordinates of the cell. 4. The cell recognition result and the table structure together construct the html string of the table. @@ -29,17 +30,17 @@ The table recognition flow chart is as follows We evaluated the algorithm on the PubTabNet[1] eval dataset, and the performance is as follows: -|Method|[TEDS(Tree-Edit-Distance-based Similarity)](https://github.com/ibm-aur-nlp/PubTabNet/tree/master/src)| -| --- | --- | -| EDD[2] | 88.3 | -| TableRec-RARE(ours) | 93.32 | -| SLANet(ours) | 94.98 | +|Method|acc|[TEDS(Tree-Edit-Distance-based Similarity)](https://github.com/ibm-aur-nlp/PubTabNet/tree/master/src)| +| --- | --- | --- | +| EDD[2] |x| 88.3 | +| TableRec-RARE(ours) |73.8%| 93.32 | +| SLANet(ours) | 76.2%| 94.98 | ## 3. How to use ### 3.1 quick start -- table recognition +Use the following commands to quickly recognize a table. ```python cd PaddleOCR/ppstructure @@ -67,64 +68,15 @@ python3.7 table/predict_table.py \ After the operation is completed, the excel table of each image will be saved to the directory specified by the output field, and an html file will be produced in the directory to visually view the cell coordinates and the recognized table. -- table structure recognition -```python -cd PaddleOCR/ppstructure - -# download model -mkdir inference && cd inference -# Download the PP-Structurev2 form recognition model and unzip it -wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar && tar xf ch_ppstructure_mobile_v2.0_SLANet_infer.tar -cd .. 
-# run -python3.7 table/predict_structure.py \ - --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer \ - --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt \ - --image_dir=docs/table/table.jpg \ - --output=../output/table -``` -After the run is complete, the visualization of the detection frame of the cell will be saved to the directory specified by the output field. - ### 3.2 Train -In this chapter, we only introduce the training of the table structure model, For model training of [text detection](../../doc/doc_en/detection_en.md) and [text recognition](../../doc/doc_en/recognition_en.md), please refer to the corresponding documents +The training, evaluation and inference process of the text detection model can be referred to [detection](../../doc/doc_en/detection_en.md) -* data preparation +The training, evaluation and inference process of the text recognition model can be referred to [recognition](../../doc/doc_en/recognition_en.md) -For the Chinese model and the English model, the data sources are different, as follows: +The training, evaluation and inference process of the table recognition model can be referred to [table_recognition](../../doc/doc_en/table_recognition_en.md) -English dataset: The training data uses public data set [PubTabNet](https://arxiv.org/abs/1911.10683 ), Can be downloaded from the official [website](https://github.com/ibm-aur-nlp/PubTabNet) 。The PubTabNet data set contains about 500,000 images, as well as annotations in html format。 - -Chinese dataset: The Chinese dataset consists of the following two parts, which are trained with a 1:1 sampling ratio. -> 1. Generate dataset: Use [Table Generation Tool](https://github.com/WenmuZhou/TableGeneration) to generate 40,000 images. -> 2. Crop 10,000 images from [WTW](https://github.com/wangwen-whu/WTW-Dataset). - -For a detailed introduction to public datasets, please refer to [table_datasets](../../doc/doc_en/dataset/table_datasets_en.md). 
The following training and evaluation procedures are based on the English dataset as an example. - -* Start training -*If you are installing the cpu version of paddle, please modify the `use_gpu` field in the configuration file to false* -```shell -# single GPU training -python3 tools/train.py -c configs/table/table_mv3.yml -# multi-GPU training -# Set the GPU ID used by the '--gpus' parameter. -python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/table/table_mv3.yml -``` - -In the above instruction, use `-c` to select the training to use the `configs/table/table_mv3.yml` configuration file. -For a detailed explanation of the configuration file, please refer to [config](../../doc/doc_en/config_en.md). - -* load trained model and continue training - -If you expect to load trained model and continue the training again, you can specify the parameter `Global.checkpoints` as the model path to be loaded. - -```shell -python3 tools/train.py -c configs/table/table_mv3.yml -o Global.checkpoints=./your/trained/model -``` - -**Note**: The priority of `Global.checkpoints` is higher than that of `Global.pretrain_weights`, that is, when two parameters are specified at the same time, the model specified by `Global.checkpoints` will be loaded first. If the model path specified by `Global.checkpoints` is wrong, the one specified by `Global.pretrain_weights` will be loaded. - -### 3.3 Eval +### 3.3 Calculate TEDS The table uses [TEDS(Tree-Edit-Distance-based Similarity)](https://github.com/ibm-aur-nlp/PubTabNet/tree/master/src) as the evaluation metric of the model. Before the model evaluation, the three models in the pipeline need to be exported as inference models (we have provided them), and the gt for evaluation needs to be prepared. Examples of gt are as follows: ```txt @@ -139,8 +91,16 @@ python3 ppstructure/table/convert_label2html.py --ori_gt_path /path/to/your_labe Use the following command to evaluate. 
After the evaluation is completed, the TEDS metric will be output. ```python -cd PaddleOCR/ppstructure -python3 table/eval_table.py --det_model_dir=path/to/det_model_dir --rec_model_dir=path/to/rec_model_dir --table_model_dir=path/to/table_model_dir --image_dir=../doc/table/1.png --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --det_limit_side_len=736 --det_limit_type=min --gt_path=path/to/gt.txt +python3 table/eval_table.py \ + --det_model_dir=path/to/det_model_dir \ + --rec_model_dir=path/to/rec_model_dir \ + --table_model_dir=path/to/table_model_dir \ + --image_dir=../doc/table/1.png \ + --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt \ + --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \ + --det_limit_side_len=736 \ + --det_limit_type=min \ + --gt_path=path/to/gt.txt ``` If the PubTabNet evaluation dataset is used, the output will be @@ -148,14 +108,6 @@ If the PubTabNet evaluation dataset is used, the output will be teds: 94.98 ``` -### 3.4 Inference - -```python -cd PaddleOCR/ppstructure -python3 table/predict_table.py --det_model_dir=path/to/det_model_dir --rec_model_dir=path/to/rec_model_dir --table_model_dir=path/to/table_model_dir --image_dir=../doc/table/1.png --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --det_limit_side_len=736 --det_limit_type=min --output ../output/table -``` -After running, the Excel sheet for each image will be saved in the directory specified by the output field. - -Reference +## 4. Reference 1. https://github.com/ibm-aur-nlp/PubTabNet 2. https://arxiv.org/pdf/1911.10683 diff --git a/ppstructure/table/README_ch.md b/ppstructure/table/README_ch.md index ac5029e7ff42d52bac51af1ccf0060511fbdb47c..d7e82658338bc0ec129d0996638ae0f07c39f1e4 100644 --- a/ppstructure/table/README_ch.md +++ b/ppstructure/table/README_ch.md @@ -2,22 +2,21 @@ # Table Recognition -- [1.
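Table recognition pipeline](#1) -- [2. Performance](#2) -- [3. Usage](#3) - - [3.1 Quick start](#31) - - [3.2 Training](#32) - - [3.3 Evaluation](#33) - - [3.4 Inference](#34) +- [1. Table recognition pipeline](#1-表格识别-pipeline) +- [2. Performance](#2-性能) +- [3. Usage](#3-使用) + - [3.1 Quick start](#31-快速开始) + - [3.2 Training](#32-训练) + - [3.3 Calculate TEDS](#33-计算teds) +- [4. Reference](#4-reference) - ## 1. Table recognition pipeline Table recognition mainly involves three models: 1. Single-line text detection-DB 2. Single-line text recognition-CRNN -3. Table structure and cell coordinate prediction-RARE +3. Table structure and cell coordinate prediction-SLANet The detailed flowchart is as follows @@ -26,30 +25,27 @@ Process description: 1. The single-line text detection model detects the coordinates of each line of text in the image, and the lines are then passed to the recognition model to obtain the recognition results. -2. The table structure and cell coordinate prediction model extracts the structure of the table and the coordinates of the cells from the image. +2. The SLANet model extracts the structure of the table and the coordinates of the cells from the image. 3. The coordinates of the text lines, the recognition results, and the cell coordinates are combined to produce the recognition result of each cell. 4. The cell recognition results and the table structure are combined to build the HTML string of the table. - ## 2. Performance We evaluated the algorithms on the PubTabNet[1] evaluation dataset; the performance is as follows -|Algorithm|[TEDS(Tree-Edit-Distance-based Similarity)](https://github.com/ibm-aur-nlp/PubTabNet/tree/master/src)| -| --- | --- | -| EDD[2] | 88.3 | -| TableRec-RARE(ours) | 93.32 | -| SLANet(ours) | 94.98 | +|Algorithm|acc|[TEDS(Tree-Edit-Distance-based Similarity)](https://github.com/ibm-aur-nlp/PubTabNet/tree/master/src)| +| --- | --- | --- | +| EDD[2] |x| 88.3 | +| TableRec-RARE(ours) |73.8%| 93.32 | +| SLANet(ours) | 76.2%| 94.98 | - ## 3. Usage - ### 3.1 Quick start -- Table recognition +Use the following commands to quickly recognize a single table. ```python cd PaddleOCR/ppstructure @@ -63,7 +59,7 @@ wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_slim_infer wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar && tar xf ch_ppstructure_mobile_v2.0_SLANet_infer.tar cd ..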
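# Run table recognition -python3.7 table/predict_table.py \ +python table/predict_table.py \ --det_model_dir=inference/ch_PP-OCRv3_det_slim_infer \ --rec_model_dir=inference/ch_PP-OCRv3_rec_slim_infer \ --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer \ @@ -74,68 +70,19 @@ python3.7 table/predict_table.py \ ``` After the run is complete, the Excel sheet for each image is saved to the directory specified by the output field, and an HTML file is also generated in that directory for visually inspecting the cell coordinates and the recognized table. -- Table structure recognition -```python -cd PaddleOCR/ppstructure - -# Download the model -mkdir inference && cd inference -# Download and extract the PP-Structurev2 table recognition model -wget https://paddleocr.bj.bcebos.com/ppstructure/models/slanet/ch_ppstructure_mobile_v2.0_SLANet_infer.tar && tar xf ch_ppstructure_mobile_v2.0_SLANet_infer.tar -cd .. -# Run table structure recognition -python3.7 table/predict_structure.py \ - --table_model_dir=inference/ch_ppstructure_mobile_v2.0_SLANet_infer \ - --table_char_dict_path=../ppocr/utils/dict/table_structure_dict_ch.txt \ - --image_dir=docs/table/table.jpg \ - --output=../output/table -``` -After the run is complete, a visualization of the cell detection boxes is saved to the directory specified by the output field. - - ### 3.2 Training -In this section, we only cover training of the table structure model. For training the [text detection](../../doc/doc_ch/detection.md) and [text recognition](../../doc/doc_ch/recognition.md) models, please refer to the corresponding documents. - -* Data preparation +For the training, evaluation and inference process of the text detection model, refer to [detection](../../doc/doc_ch/detection.md) -The Chinese model and the English model use different data sources, described below +For the training, evaluation and inference process of the text recognition model, refer to [recognition](../../doc/doc_ch/recognition.md) -English dataset: The training data uses the public dataset PubTabNet ([paper](https://arxiv.org/abs/1911.10683), [download](https://github.com/ibm-aur-nlp/PubTabNet)). The PubTabNet dataset contains about 500,000 table images, together with the corresponding annotations in HTML format. +For the training, evaluation and inference process of the table recognition model, refer to [table_recognition](../../doc/doc_ch/table_recognition.md) -Chinese dataset: The Chinese dataset consists of the following two parts, which are sampled at a 1:1 ratio during training. -> 1. Generated dataset: Use the [table generation tool](https://github.com/WenmuZhou/TableGeneration) to generate 40,000 images. -> 2.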
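Take 10,000 images from [WTW](https://github.com/wangwen-whu/WTW-Dataset). - -For a detailed introduction to the public datasets, refer to [table_datasets](../../doc/doc_ch/dataset/table_datasets.md). The training and evaluation procedures below use the English dataset as an example. - -* Start training - -*If you installed the CPU version, please set the `use_gpu` field in the configuration file to false* -```shell -# Single-machine, single-GPU training -python3 tools/train.py -c configs/table/table_mv3.yml -# Single-machine, multi-GPU training; set the GPU IDs to use with the --gpus parameter -python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/table/table_mv3.yml -``` - -In the commands above, -c selects the configs/table/table_mv3.yml configuration file for training. For a detailed explanation of the configuration file, please refer to this [link](../../doc/doc_ch/config.md). - -* Resume training from a checkpoint - -If training is interrupted and you want to load the interrupted model to resume training, set Global.checkpoints to the path of the model to load: -```shell -python3 tools/train.py -c configs/table/table_mv3.yml -o Global.checkpoints=./your/trained/model -``` - -**Note**: `Global.checkpoints` has higher priority than `Global.pretrain_weights`. When both parameters are specified, the model specified by `Global.checkpoints` is loaded first; if the model path specified by `Global.checkpoints` is wrong, the model specified by `Global.pretrain_weights` is loaded. - - -### 3.3 Evaluation +### 3.3 Calculate TEDS Table recognition uses [TEDS(Tree-Edit-Distance-based Similarity)](https://github.com/ibm-aur-nlp/PubTabNet/tree/master/src) as the evaluation metric of the model. Before evaluating, the three models in the pipeline need to be exported as inference models (we have already provided them), and the ground truth (gt) for evaluation needs to be prepared. An example of the gt is as follows: ```txt -PMC5755158_010_01.png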
WeaningWeek 15Off-test
Weaning
Week 150.17 ± 0.080.16 ± 0.03
Off-test0.80 ± 0.240.19 ± 0.09
+PMC5755158_010_01.png
WeaningWeek 15Off-test
Weaning
Week 150.17 ± 0.080.16 ± 0.03
Off-test0.80 ± 0.240.19 ± 0.09
``` Each line of the gt consists of a file name and the HTML string of the table, separated by `\t`. @@ -147,21 +94,22 @@ python3 ppstructure/table/convert_label2html.py --ori_gt_path /path/to/your_labe Once preparation is complete, use the following command to evaluate; the TEDS metric is output when the evaluation finishes. ```python cd PaddleOCR/ppstructure -python3 table/eval_table.py --det_model_dir=path/to/det_model_dir --rec_model_dir=path/to/rec_model_dir --table_model_dir=path/to/table_model_dir --image_dir=../doc/table/1.png --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --det_limit_side_len=736 --det_limit_type=min --gt_path=path/to/gt.txt +python3 table/eval_table.py \ + --det_model_dir=path/to/det_model_dir \ + --rec_model_dir=path/to/rec_model_dir \ + --table_model_dir=path/to/table_model_dir \ + --image_dir=../doc/table/1.png \ + --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt \ + --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt \ + --det_limit_side_len=736 \ + --det_limit_type=min \ + --gt_path=path/to/gt.txt ``` If the PubTabNet evaluation dataset is used, the output will be ```bash teds: 94.98 ``` - -### 3.4 Inference - -```python -cd PaddleOCR/ppstructure -python3 table/predict_table.py --det_model_dir=path/to/det_model_dir --rec_model_dir=path/to/rec_model_dir --table_model_dir=path/to/table_model_dir --image_dir=../doc/table/1.png --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --det_limit_side_len=736 --det_limit_type=min --output ../output/table -``` - -# Reference +## 4. Reference 1. https://github.com/ibm-aur-nlp/PubTabNet 2. https://arxiv.org/pdf/1911.10683
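Note on the gt format used in section 3.3: each gt line holds a file name and the table's HTML string separated by a tab. The snippet below is a minimal sketch of reading such lines into a dictionary before running an evaluation. The helper name `load_gt` and the sample file name `sample.png` are hypothetical and are not part of PaddleOCR; `eval_table.py` may parse the file differently.

```python
from typing import Dict, Iterable


def load_gt(lines: Iterable[str]) -> Dict[str, str]:
    """Map each image file name to its ground-truth HTML string.

    Hypothetical helper (not a PaddleOCR API): assumes one record per line
    in the form "<file name>\t<html string>".
    """
    gt = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so any tab inside the HTML is kept.
        name, sep, html = line.partition("\t")
        if not sep:
            raise ValueError(f"missing tab separator: {line!r}")
        gt[name] = html
    return gt


# Usage with an in-memory sample; with a real file, pass the open file object.
sample = [
    "sample.png\t<html><body><table><tr><td>1</td></tr></table></body></html>\n",
    "\n",
]
gt = load_gt(sample)
print(sorted(gt))
```

Splitting on the first tab only is a deliberate choice here: the file name cannot contain the separator, while the HTML payload is left untouched.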