提交 17b1312e 编写于 作者:xuyang2233

fixed conflict vl 0811

@@ -131,7 +131,7 @@ pip3 install dist/PPOCRLabel-1.0.2-py2.py3-none-any.whl -i https://mirror.baidu.
 > 注意:如果表格中存在空白单元格,同样需要使用一个标注框将其标出,使得单元格总数与图像中保持一致。
-3. **调整单元格顺序:**点击软件`视图-显示框编号` 打开标注框序号,在软件界面右侧拖动 `识别结果` 一栏下的所有结果,使得标注框编号按照从左到右,从上到下的顺序排列
+3. **调整单元格顺序**:点击软件`视图-显示框编号` 打开标注框序号,在软件界面右侧拖动 `识别结果` 一栏下的所有结果,使得标注框编号按照从左到右,从上到下的顺序排列,按行依次标注。
 4. 标注表格结构:**在外部Excel软件中,将存在文字的单元格标记为任意标识符(如 `1` )**,保证Excel中的单元格合并情况与原图相同即可(即不需要Excel中的单元格文字与图片中的文字完全相同)
......
[English](README_en.md) | 简体中文

# 场景应用

PaddleOCR场景应用覆盖通用、制造、金融、交通行业的主要OCR垂类应用,在PP-OCR、PP-Structure的通用能力基础之上,以notebook的形式展示利用场景数据微调、模型优化方法、数据增广等内容,为开发者快速落地OCR应用提供示范与启发。

- [教程文档](#1)
  - [通用](#11)
  - [制造](#12)
  - [金融](#13)
  - [交通](#14)
- [模型下载](#2)

<a name="1"></a>

## 教程文档

<a name="11"></a>

### 通用

| 类别 | 亮点 | 模型下载 | 教程 |
| ---------------------- | ------------ | -------------- | --------------------------------------- |
| 高精度中文识别模型SVTR | 比PP-OCRv3识别模型精度高3%,可用于数据挖掘或对预测效率要求不高的场景。 | [模型下载](#2) | [中文](./高精度中文识别模型.md)/English |
| 手写体识别 | 新增字形支持 | | |

<a name="12"></a>

### 制造

| 类别 | 亮点 | 模型下载 | 教程 | 示例图 |
| -------------- | ------------------------------ | -------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
| 数码管识别 | 数码管数据合成、漏识别调优 | [模型下载](#2) | [中文](./光功率计数码管字符识别/光功率计数码管字符识别.md)/English | <img src="https://ai-studio-static-online.cdn.bcebos.com/7d5774a273f84efba5b9ce7fd3f86e9ef24b6473e046444db69fa3ca20ac0986" width = "200" height = "100" /> |
| 液晶屏读数识别 | 检测模型蒸馏、Serving部署 | [模型下载](#2) | [中文](./液晶屏读数识别.md)/English | <img src="https://ai-studio-static-online.cdn.bcebos.com/901ab741cb46441ebec510b37e63b9d8d1b7c95f63cc4e5e8757f35179ae6373" width = "200" height = "100" /> |
| 包装生产日期 | 点阵字符合成、过曝过暗文字识别 | [模型下载](#2) | [中文](./包装生产日期识别.md)/English | <img src="https://ai-studio-static-online.cdn.bcebos.com/d9e0533cc1df47ffa3bbe99de9e42639a3ebfa5bce834bafb1ca4574bf9db684" width = "200" height = "100" /> |
| PCB文字识别 | 小尺寸文本检测与识别 | [模型下载](#2) | [中文](./PCB字符识别/PCB字符识别.md)/English | <img src="https://ai-studio-static-online.cdn.bcebos.com/95d8e95bf1ab476987f2519c0f8f0c60a0cdc2c444804ed6ab08f2f7ab054880" width = "200" height = "100" /> |
| 电表识别 | 大分辨率图像检测调优 | [模型下载](#2) | | |
| 液晶屏缺陷检测 | 非文字字符识别 | | | |

<a name="13"></a>

### 金融

| 类别 | 亮点 | 模型下载 | 教程 | 示例图 |
| -------------- | ------------------------ | -------------- | ----------------------------------- | ------------------------------------------------------------ |
| 表单VQA | 多模态通用表单结构化提取 | [模型下载](#2) | [中文](./多模态表单识别.md)/English | <img src="https://ai-studio-static-online.cdn.bcebos.com/a3b25766f3074d2facdf88d4a60fc76612f51992fd124cf5bd846b213130665b" width = "200" height = "200" /> |
| 增值税发票 | 敬请期待 | | | |
| 印章检测与识别 | 端到端弯曲文本识别 | | | |
| 通用卡证识别 | 通用结构化提取 | | | |
| 身份证识别 | 结构化提取、图像阴影 | | | |
| 合同比对 | 密集文本检测、NLP串联 | | | |

<a name="14"></a>

### 交通

| 类别 | 亮点 | 模型下载 | 教程 | 示例图 |
| ----------------- | ------------------------------ | -------------- | ----------------------------------- | ------------------------------------------------------------ |
| 车牌识别 | 多角度图像、轻量模型、端侧部署 | [模型下载](#2) | [中文](./轻量级车牌识别.md)/English | <img src="https://ai-studio-static-online.cdn.bcebos.com/76b6a0939c2c4cf49039b6563c4b28e241e11285d7464e799e81c58c0f7707a7" width = "200" height = "100" /> |
| 驾驶证/行驶证识别 | 敬请期待 | | | |
| 快递单识别 | 敬请期待 | | | |

<a name="2"></a>

## 模型下载

如需下载上述场景中已经训练好的垂类模型,可以扫描下方二维码,关注公众号填写问卷后,加入PaddleOCR官方交流群获取20G OCR学习大礼包(内含《动手学OCR》电子书、课程回放视频、前沿论文等重磅资料)

<div align="center">
<img src="https://ai-studio-static-online.cdn.bcebos.com/dd721099bd50478f9d5fb13d8dd00fad69c22d6848244fd3a1d3980d7fefc63e" width = "150" height = "150" />
</div>

如果您是企业开发者且未在上述场景中找到合适的方案,可以填写[OCR应用合作调研问卷](https://paddle.wjx.cn/vj/QwF7GKw.aspx),免费与官方团队展开不同层次的合作,包括但不限于问题抽象、确定技术方案、项目答疑、共同研发等。如果您已经使用PaddleOCR落地项目,也可以填写此问卷,与飞桨平台共同宣传推广,提升企业技术品宣。期待您的提交!

<a href="https://trackgit.com">
<img src="https://us-central1-trackgit-analytics.cloudfunctions.net/token/ping/l63cvzo0w09yxypc7ygl" alt="traffic" />
</a>
# 基于PP-OCRv3的手写文字识别
- [1. 项目背景及意义](#1-项目背景及意义)
- [2. 项目内容](#2-项目内容)
- [3. PP-OCRv3识别算法介绍](#3-PP-OCRv3识别算法介绍)
- [4. 安装环境](#4-安装环境)
- [5. 数据准备](#5-数据准备)
- [6. 模型训练](#6-模型训练)
- [6.1 下载预训练模型](#61-下载预训练模型)
- [6.2 修改配置文件](#62-修改配置文件)
- [6.3 开始训练](#63-开始训练)
- [7. 模型评估](#7-模型评估)
- [8. 模型导出推理](#8-模型导出推理)
- [8.1 模型导出](#81-模型导出)
- [8.2 模型推理](#82-模型推理)
## 1. 项目背景及意义
目前光学字符识别(OCR)技术在我们的生活当中被广泛使用,但是大多数模型在特定垂类场景下的准确率仍有待提高。针对于此,我们借助飞桨提供的PaddleOCR套件,可以较容易地实现垂类场景下的定制化应用。手写体在日常生活中较为常见,但由于每个人的书写风格各不相同,手写体识别对视觉模型而言仍是一个很大的挑战,因此训练一个手写体识别模型具有很好的现实意义。下面给出一些手写体的示例图:
![example](https://ai-studio-static-online.cdn.bcebos.com/7a8865b2836f42d382e7c3fdaedc4d307d797fa2bcd0466e9f8b7705efff5a7b)
## 2. 项目内容
本项目基于PaddleOCR套件,以PP-OCRv3识别模型为基础,针对手写文字识别场景进行优化。
Aistudio项目链接:[OCR手写文字识别](https://aistudio.baidu.com/aistudio/projectdetail/4330587)
## 3. PP-OCRv3识别算法介绍
PP-OCRv3的识别模块是在文本识别算法[SVTR](https://arxiv.org/abs/2205.00159)的基础上优化得到的。SVTR不再采用RNN结构,而是通过引入Transformer结构更加有效地挖掘文本行图像的上下文信息,从而提升文本识别能力。如下图所示,PP-OCRv3共采用了6个优化策略。
![v3_rec](https://ai-studio-static-online.cdn.bcebos.com/d4f5344b5b854d50be738671598a89a45689c6704c4d481fb904dd7cf72f2a1a)
优化策略汇总如下:
* SVTR_LCNet:轻量级文本识别网络
* GTC:Attention指导CTC训练策略
* TextConAug:挖掘文字上下文信息的数据增广策略
* TextRotNet:自监督的预训练模型
* UDML:联合互学习策略
* UIM:无标注数据挖掘方案
详细优化策略描述请参考[PP-OCRv3优化策略](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.5/doc/doc_ch/PP-OCRv3_introduction.md#3-%E8%AF%86%E5%88%AB%E4%BC%98%E5%8C%96)
## 4. 安装环境
```python
# 克隆官方的PaddleOCR项目,并安装所需依赖
git clone https://github.com/PaddlePaddle/PaddleOCR.git
cd PaddleOCR
pip install -r requirements.txt
```
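需要注意的是,`requirements.txt`中不包含PaddlePaddle框架本身(AI Studio环境通常已预装,本地环境需按官方文档单独安装)。可以用下面的小脚本确认环境是否可用(示意脚本,仅做自检):

```python
# 环境自检(示意):确认PaddlePaddle安装正确、GPU是否可用
import paddle

print("paddle 版本:", paddle.__version__)
print("是否编译了CUDA:", paddle.device.is_compiled_with_cuda())

# 官方提供的安装自检接口,会运行一个小的计算任务验证环境
paddle.utils.run_check()
```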
## 5. 数据准备
本项目使用公开的手写文本识别数据集,包括Chinese OCR、中科院自动化研究所手写中文数据集[CASIA-HWDB2.x](http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html),以及由中科院手写数据和网上开源数据合并组成的[数据集](https://aistudio.baidu.com/aistudio/datasetdetail/102884/0)等。该项目已经挂载处理好的数据集,下载解压后即可用于训练。
```python
# 下载并解压数据(数据集已挂载在AI Studio项目中)
tar -xf hw_data.tar
```
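解压完成后,建议先检查标签文件的格式与图片路径是否有效。PaddleOCR识别任务的标签文件每行为`图片路径\t文本标签`,下面给出一个简单的检查脚本(示意,标签文件以后文配置中用到的`train_hw.txt`为例):

```python
# 检查标签文件格式与图片是否存在(示意脚本)
import os

data_dir = "./train_data"  # 与配置文件中的data_dir保持一致
label_file = "./train_data/handwrite/HW_Chinese/train_hw.txt"  # 示例标签文件

total, missing = 0, 0
with open(label_file, "r", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if "\t" not in line:      # 跳过格式不正确的行
            continue
        img_rel, _label = line.split("\t", 1)
        total += 1
        if not os.path.exists(os.path.join(data_dir, img_rel)):
            missing += 1

print(f"有效标注 {total} 条,其中找不到图片的有 {missing} 条")
```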
## 6. 模型训练
### 6.1 下载预训练模型
首先下载所需的PP-OCRv3中文识别预训练模型,也可以根据需要从[文字识别模型列表](https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.5/doc/doc_ch/models_list.md#2-%E6%96%87%E6%9C%AC%E8%AF%86%E5%88%AB%E6%A8%A1%E5%9E%8B)中选择其他预训练模型。
```python
# 使用该指令下载需要的预训练模型
wget -P ./pretrained_models/ https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/ch_PP-OCRv3_rec_train.tar
# 解压预训练模型文件
tar -xf ./pretrained_models/ch_PP-OCRv3_rec_train.tar -C pretrained_models
```
### 6.2 修改配置文件
我们使用`configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml`配置文件,主要修改训练轮数和学习率等相关参数,并设置预训练模型路径和数据集路径。另外,batch_size可根据自己机器的显存大小进行调整。具体修改如下几个地方:
```
epoch_num: 100 # 训练epoch数
save_model_dir: ./output/ch_PP-OCR_v3_rec
save_epoch_step: 10
eval_batch_step: [0, 100] # 评估间隔,每隔100step评估一次
pretrained_model: ./pretrained_models/ch_PP-OCRv3_rec_train/best_accuracy # 预训练模型路径
lr:
name: Cosine # 修改学习率衰减策略为Cosine
learning_rate: 0.0001 # 修改fine-tune的学习率
warmup_epoch: 2 # 修改warmup轮数
Train:
dataset:
name: SimpleDataSet
data_dir: ./train_data # 训练集图片路径
ext_op_transform_idx: 1
label_file_list:
- ./train_data/chineseocr-data/rec_hand_line_all_label_train.txt # 训练集标签
- ./train_data/handwrite/HWDB2.0Train_label.txt
- ./train_data/handwrite/HWDB2.1Train_label.txt
- ./train_data/handwrite/HWDB2.2Train_label.txt
- ./train_data/handwrite/hwdb_ic13/handwriting_hwdb_train_labels.txt
- ./train_data/handwrite/HW_Chinese/train_hw.txt
ratio_list:
- 0.1
- 1.0
- 1.0
- 1.0
- 0.02
- 1.0
loader:
shuffle: true
batch_size_per_card: 64
drop_last: true
num_workers: 4
Eval:
dataset:
name: SimpleDataSet
data_dir: ./train_data # 测试集图片路径
label_file_list:
- ./train_data/chineseocr-data/rec_hand_line_all_label_val.txt # 测试集标签
- ./train_data/handwrite/HWDB2.0Test_label.txt
- ./train_data/handwrite/HWDB2.1Test_label.txt
- ./train_data/handwrite/HWDB2.2Test_label.txt
- ./train_data/handwrite/hwdb_ic13/handwriting_hwdb_val_labels.txt
- ./train_data/handwrite/HW_Chinese/test_hw.txt
loader:
shuffle: false
drop_last: false
batch_size_per_card: 64
num_workers: 4
```
由于数据集大多是长文本,因此需要**注释**掉下面的数据增广策略,以便训练出更好的模型。
```
- RecConAug:
prob: 0.5
ext_data_num: 2
image_shape: [48, 320, 3]
```
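在决定是否注释掉该增广策略前,也可以先统计一下标签文本的长度分布,确认数据集确实以长文本为主。下面是一个简单的统计脚本(示意,标签文件路径请按实际情况修改):

```python
# 统计标签文本长度分布,辅助判断是否需要关闭RecConAug(示意脚本)
label_file = "./train_data/handwrite/HW_Chinese/train_hw.txt"  # 示例标签文件

lengths = []
with open(label_file, "r", encoding="utf-8") as f:
    for line in f:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue
        _img, text = line.split("\t", 1)
        lengths.append(len(text))

if lengths:
    print("样本数:", len(lengths))
    print("平均文本长度:", sum(lengths) / len(lengths))
    print("长度超过25(识别配置默认max_text_length)的样本占比:",
          sum(l > 25 for l in lengths) / len(lengths))
```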
### 6.3 开始训练
我们使用上面修改好的配置文件`configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml`。在预训练模型、数据集路径、学习率、训练轮数等都设置完毕后,可以使用下面的命令开始训练。
```python
# 开始训练识别模型
python tools/train.py -c configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml
```
## 7. 模型评估
在训练之前,我们可以直接使用下面命令来评估预训练模型的效果:
```python
# 评估预训练模型
python tools/eval.py -c configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml -o Global.pretrained_model="./pretrained_models/ch_PP-OCRv3_rec_train/best_accuracy"
```
```
[2022/07/14 10:46:22] ppocr INFO: load pretrain successful from ./pretrained_models/ch_PP-OCRv3_rec_train/best_accuracy
eval model:: 100%|████████████████████████████| 687/687 [03:29<00:00, 3.27it/s]
[2022/07/14 10:49:52] ppocr INFO: metric eval ***************
[2022/07/14 10:49:52] ppocr INFO: acc:0.03724954461811258
[2022/07/14 10:49:52] ppocr INFO: norm_edit_dis:0.4859541065843199
[2022/07/14 10:49:52] ppocr INFO: Teacher_acc:0.0371584699368947
[2022/07/14 10:49:52] ppocr INFO: Teacher_norm_edit_dis:0.48718814890536477
[2022/07/14 10:49:52] ppocr INFO: fps:947.8562684823883
```
可以看出,直接加载预训练模型进行评估,效果较差,因为预训练模型并不是基于手写文字进行单独训练的,所以我们需要基于预训练模型进行finetune。
训练完成后,可以进行测试评估,评估命令如下:
```python
# 评估finetune效果
python tools/eval.py -c configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml -o Global.pretrained_model="./output/ch_PP-OCR_v3_rec/best_accuracy"
```
评估结果如下,可以看出识别准确率为54.3%。
```
[2022/07/14 10:54:06] ppocr INFO: metric eval ***************
[2022/07/14 10:54:06] ppocr INFO: acc:0.5430100180913
[2022/07/14 10:54:06] ppocr INFO: norm_edit_dis:0.9203322593158589
[2022/07/14 10:54:06] ppocr INFO: Teacher_acc:0.5401183969626324
[2022/07/14 10:54:06] ppocr INFO: Teacher_norm_edit_dis:0.919827504507755
[2022/07/14 10:54:06] ppocr INFO: fps:928.948733797251
```
如需获取已训练模型,请扫码填写问卷,加入PaddleOCR官方交流群获取全部OCR垂类模型下载链接、《动手学OCR》电子书等全套OCR学习资料🎁
<div align="left">
<img src="https://ai-studio-static-online.cdn.bcebos.com/dd721099bd50478f9d5fb13d8dd00fad69c22d6848244fd3a1d3980d7fefc63e" width = "150" height = "150" />
</div>
将下载或训练完成的模型放置在对应目录下即可完成模型推理。
## 8. 模型导出推理
训练完成后,可以将训练模型转换成inference模型。inference 模型会额外保存模型的结构信息,在预测部署、加速推理上性能优越,灵活方便,适合于实际系统集成。
### 8.1 模型导出
导出命令如下:
```python
# 转化为推理模型
python tools/export_model.py -c configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml -o Global.pretrained_model="./output/ch_PP-OCR_v3_rec/best_accuracy" Global.save_inference_dir="./inference/rec_ppocrv3/"
```
### 8.2 模型推理
导出模型后,可以使用如下命令进行推理预测:
```python
# 推理预测
python tools/infer/predict_rec.py --image_dir="train_data/handwrite/HWDB2.0Test_images/104-P16_4.jpg" --rec_model_dir="./inference/rec_ppocrv3/Student"
```
```
[2022/07/14 10:55:56] ppocr INFO: In PP-OCRv3, rec_image_shape parameter defaults to '3, 48, 320', if you are using recognition model with PP-OCRv2 or an older version, please set --rec_image_shape='3,32,320
[2022/07/14 10:55:58] ppocr INFO: Predicts of train_data/handwrite/HWDB2.0Test_images/104-P16_4.jpg:('品结构,差异化的多品牌渗透使欧莱雅确立了其在中国化妆', 0.9904912114143372)
```
```python
# 可视化文字识别图片
from PIL import Image
import matplotlib.pyplot as plt

img_path = 'train_data/handwrite/HWDB2.0Test_images/104-P16_4.jpg'

def vis(img_path):
    # 读取并显示待识别的图片
    plt.figure()
    image = Image.open(img_path)
    plt.imshow(image)
    plt.show()

vis(img_path)
```
![res](https://ai-studio-static-online.cdn.bcebos.com/ad7c02745491498d82e0ce95f4a274f9b3920b2f467646858709359b7af9d869)
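如果希望将识别结果与图片对照查看,也可以在可视化时一并输出预测文本与置信度。下面是一个简单的示意(其中的预测文本和得分直接取自上文`predict_rec.py`的输出):

```python
# 将识别结果与图片一同展示(示意脚本)
from PIL import Image
import matplotlib.pyplot as plt

img_path = 'train_data/handwrite/HWDB2.0Test_images/104-P16_4.jpg'
# 取自上文predict_rec.py的预测输出
pred_text = '品结构,差异化的多品牌渗透使欧莱雅确立了其在中国化妆'
pred_score = 0.9904912114143372

plt.figure()
plt.imshow(Image.open(img_path))
plt.title(f"score: {pred_score:.4f}")  # matplotlib默认字体可能无法渲染中文,标题中只显示得分
plt.axis('off')
plt.show()
print("识别结果:", pred_text)
```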
@@ -2,7 +2,7 @@
 ## 1. 简介
-PP-OCRv3是百度开源的超轻量级场景文本检测识别模型库,其中超轻量的场景中文识别模型SVTR_LCNet使用了SVTR算法结构。为了保证速度,SVTR_LCNet将SVTR模型的Local Blocks替换为LCNet,使用两层Global Blocks。在中文场景中,PP-OCRv3识别主要使用如下优化策略:
+PP-OCRv3是百度开源的超轻量级场景文本检测识别模型库,其中超轻量的场景中文识别模型SVTR_LCNet使用了SVTR算法结构。为了保证速度,SVTR_LCNet将SVTR模型的Local Blocks替换为LCNet,使用两层Global Blocks。在中文场景中,PP-OCRv3识别主要使用如下优化策略([详细技术报告](../doc/doc_ch/PP-OCRv3_introduction.md)):
 - GTC:Attention指导CTC训练策略;
 - TextConAug:挖掘文字上下文信息的数据增广策略;
 - TextRotNet:自监督的预训练模型;
......
@@ -6,11 +6,11 @@ Global:
   save_model_dir: ./output/re_layoutlmv2_xfund_zh
   save_epoch_step: 2000
   # evaluation is run every 10 iterations after the 0th iteration
-  eval_batch_step: [ 0, 57 ]
+  eval_batch_step: [ 0, 19 ]
   cal_metric_during_train: False
   save_inference_dir:
   use_visualdl: False
-  seed: 2048
+  seed: 2022
   infer_img: ppstructure/docs/vqa/input/zh_val_21.jpg
   save_res_path: ./output/re_layoutlmv2_xfund_zh/res/
......
 Global:
   use_gpu: True
-  epoch_num: &epoch_num 200
+  epoch_num: &epoch_num 130
   log_smooth_window: 10
   print_batch_step: 10
-  save_model_dir: ./output/re_layoutxlm/
+  save_model_dir: ./output/re_layoutxlm_xfund_zh
   save_epoch_step: 2000
   # evaluation is run every 10 iterations after the 0th iteration
   eval_batch_step: [ 0, 19 ]
@@ -12,7 +12,7 @@ Global:
   use_visualdl: False
   seed: 2022
   infer_img: ppstructure/docs/vqa/input/zh_val_21.jpg
-  save_res_path: ./output/re/
+  save_res_path: ./output/re_layoutxlm_xfund_zh/res/

 Architecture:
   model_type: vqa
@@ -81,7 +81,7 @@ Train:
   loader:
     shuffle: True
     drop_last: False
-    batch_size_per_card: 8
+    batch_size_per_card: 2
     num_workers: 8
     collate_fn: ListCollator
......
@@ -6,13 +6,13 @@ Global:
   save_model_dir: ./output/ser_layoutlm_xfund_zh
   save_epoch_step: 2000
   # evaluation is run every 10 iterations after the 0th iteration
-  eval_batch_step: [ 0, 57 ]
+  eval_batch_step: [ 0, 19 ]
   cal_metric_during_train: False
   save_inference_dir:
   use_visualdl: False
   seed: 2022
   infer_img: ppstructure/docs/vqa/input/zh_val_42.jpg
-  save_res_path: ./output/ser_layoutlm_xfund_zh/res/
+  save_res_path: ./output/re_layoutlm_xfund_zh/res

 Architecture:
   model_type: vqa
@@ -55,6 +55,7 @@ Train:
     data_dir: train_data/XFUND/zh_train/image
     label_file_list:
       - train_data/XFUND/zh_train/train.json
+    ratio_list: [ 1.0 ]
     transforms:
       - DecodeImage: # load image
           img_mode: RGB
......
@@ -27,6 +27,7 @@ Architecture:
 Loss:
   name: VQASerTokenLayoutLMLoss
   num_classes: *num_classes
+  key: "backbone_out"

 Optimizer:
   name: AdamW
......
@@ -27,6 +27,7 @@ Architecture:
 Loss:
   name: VQASerTokenLayoutLMLoss
   num_classes: *num_classes
+  key: "backbone_out"

 Optimizer:
   name: AdamW
......
 Global:
   use_gpu: True
-  epoch_num: &epoch_num 200
+  epoch_num: &epoch_num 130
   log_smooth_window: 10
   print_batch_step: 10
-  save_model_dir: ./output/re_layoutxlm_funsd
+  save_model_dir: ./output/re_vi_layoutxlm_xfund_zh
   save_epoch_step: 2000
   # evaluation is run every 10 iterations after the 0th iteration
-  eval_batch_step: [ 0, 57 ]
+  eval_batch_step: [ 0, 19 ]
   cal_metric_during_train: False
   save_inference_dir:
   use_visualdl: False
   seed: 2022
-  infer_img: train_data/FUNSD/testing_data/images/83624198.png
-  save_res_path: ./output/re_layoutxlm_funsd/res/
+  infer_img: ppstructure/docs/vqa/input/zh_val_21.jpg
+  save_res_path: ./output/re/xfund_zh/with_gt

 Architecture:
   model_type: vqa
@@ -21,6 +21,7 @@ Architecture:
   Backbone:
     name: LayoutXLMForRe
     pretrained: True
+    mode: vi
     checkpoints:

 Loss:
@@ -50,10 +51,9 @@ Metric:
 Train:
   dataset:
     name: SimpleDataSet
-    data_dir: ./train_data/FUNSD/training_data/images/
+    data_dir: train_data/XFUND/zh_train/image
     label_file_list:
-      - ./train_data/FUNSD/train_v4.json
-      # - ./train_data/FUNSD/train.json
+      - train_data/XFUND/zh_train/train.json
     ratio_list: [ 1.0 ]
     transforms:
       - DecodeImage: # load image
@@ -62,8 +62,9 @@ Train:
       - VQATokenLabelEncode: # Class handling label
           contains_re: True
           algorithm: *algorithm
-          class_path: &class_path ./train_data/FUNSD/class_list.txt
+          class_path: &class_path train_data/XFUND/class_list_xfun.txt
           use_textline_bbox_info: &use_textline_bbox_info True
+          order_method: &order_method "tb-yx"
       - VQATokenPad:
           max_seq_len: &max_seq_len 512
           return_attention_mask: True
@@ -79,22 +80,20 @@ Train:
           order: 'hwc'
       - ToCHWImage:
       - KeepKeys:
-          # dataloader will return list in this order
-          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'entities', 'relations']
+          keep_keys: [ 'input_ids', 'bbox','attention_mask', 'token_type_ids', 'image', 'entities', 'relations'] # dataloader will return list in this order
   loader:
-    shuffle: False
+    shuffle: True
     drop_last: False
-    batch_size_per_card: 8
-    num_workers: 16
+    batch_size_per_card: 2
+    num_workers: 4
     collate_fn: ListCollator

 Eval:
   dataset:
     name: SimpleDataSet
-    data_dir: ./train_data/FUNSD/testing_data/images/
+    data_dir: train_data/XFUND/zh_val/image
     label_file_list:
-      - ./train_data/FUNSD/test_v4.json
-      # - ./train_data/FUNSD/test.json
+      - train_data/XFUND/zh_val/val.json
     transforms:
       - DecodeImage: # load image
           img_mode: RGB
@@ -104,6 +103,7 @@ Eval:
           algorithm: *algorithm
           class_path: *class_path
           use_textline_bbox_info: *use_textline_bbox_info
+          order_method: *order_method
       - VQATokenPad:
           max_seq_len: *max_seq_len
           return_attention_mask: True
@@ -119,11 +119,11 @@ Eval:
           order: 'hwc'
       - ToCHWImage:
       - KeepKeys:
-          # dataloader will return list in this order
-          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'entities', 'relations']
+          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'entities', 'relations'] # dataloader will return list in this order
   loader:
     shuffle: False
     drop_last: False
     batch_size_per_card: 8
     num_workers: 8
     collate_fn: ListCollator
Global:
use_gpu: True
epoch_num: &epoch_num 130
log_smooth_window: 10
print_batch_step: 10
save_model_dir: ./output/re_vi_layoutxlm_xfund_zh_udml
save_epoch_step: 2000
# evaluation is run every 10 iterations after the 0th iteration
eval_batch_step: [ 0, 19 ]
cal_metric_during_train: False
save_inference_dir:
use_visualdl: False
seed: 2022
infer_img: ppstructure/docs/vqa/input/zh_val_21.jpg
save_res_path: ./output/re/xfund_zh/with_gt
Architecture:
model_type: &model_type "vqa"
name: DistillationModel
algorithm: Distillation
Models:
Teacher:
pretrained:
freeze_params: false
return_all_feats: true
model_type: *model_type
algorithm: &algorithm "LayoutXLM"
Transform:
Backbone:
name: LayoutXLMForRe
pretrained: True
mode: vi
checkpoints:
Student:
pretrained:
freeze_params: false
return_all_feats: true
model_type: *model_type
algorithm: *algorithm
Transform:
Backbone:
name: LayoutXLMForRe
pretrained: True
mode: vi
checkpoints:
Loss:
name: CombinedLoss
loss_config_list:
- DistillationLossFromOutput:
weight: 1.0
model_name_list: ["Student", "Teacher"]
key: loss
reduction: mean
- DistillationVQADistanceLoss:
weight: 0.5
mode: "l2"
model_name_pairs:
- ["Student", "Teacher"]
key: hidden_states_5
name: "loss_5"
- DistillationVQADistanceLoss:
weight: 0.5
mode: "l2"
model_name_pairs:
- ["Student", "Teacher"]
key: hidden_states_8
name: "loss_8"
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
clip_norm: 10
lr:
learning_rate: 0.00005
warmup_epoch: 10
regularizer:
name: L2
factor: 0.00000
PostProcess:
name: DistillationRePostProcess
model_name: ["Student", "Teacher"]
key: null
Metric:
name: DistillationMetric
base_metric_name: VQAReTokenMetric
main_indicator: hmean
key: "Student"
Train:
dataset:
name: SimpleDataSet
data_dir: train_data/XFUND/zh_train/image
label_file_list:
- train_data/XFUND/zh_train/train.json
ratio_list: [ 1.0 ]
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: True
algorithm: *algorithm
class_path: &class_path train_data/XFUND/class_list_xfun.txt
use_textline_bbox_info: &use_textline_bbox_info True
# [None, "tb-yx"]
order_method: &order_method "tb-yx"
- VQATokenPad:
max_seq_len: &max_seq_len 512
return_attention_mask: True
- VQAReTokenRelation:
- VQAReTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
keep_keys: [ 'input_ids', 'bbox','attention_mask', 'token_type_ids', 'image', 'entities', 'relations'] # dataloader will return list in this order
loader:
shuffle: True
drop_last: False
batch_size_per_card: 2
num_workers: 4
collate_fn: ListCollator
Eval:
dataset:
name: SimpleDataSet
data_dir: train_data/XFUND/zh_val/image
label_file_list:
- train_data/XFUND/zh_val/val.json
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: True
algorithm: *algorithm
class_path: *class_path
use_textline_bbox_info: *use_textline_bbox_info
order_method: *order_method
- VQATokenPad:
max_seq_len: *max_seq_len
return_attention_mask: True
- VQAReTokenRelation:
- VQAReTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'entities', 'relations'] # dataloader will return list in this order
loader:
shuffle: False
drop_last: False
batch_size_per_card: 8
num_workers: 8
collate_fn: ListCollator
@@ -3,30 +3,38 @@ Global:
   epoch_num: &epoch_num 200
   log_smooth_window: 10
   print_batch_step: 10
-  save_model_dir: ./output/ser_layoutlm_funsd
+  save_model_dir: ./output/ser_vi_layoutxlm_xfund_zh
   save_epoch_step: 2000
   # evaluation is run every 10 iterations after the 0th iteration
-  eval_batch_step: [ 0, 57 ]
+  eval_batch_step: [ 0, 19 ]
   cal_metric_during_train: False
   save_inference_dir:
   use_visualdl: False
   seed: 2022
-  infer_img: train_data/FUNSD/testing_data/images/83624198.png
-  save_res_path: ./output/ser_layoutlm_funsd/res/
+  infer_img: ppstructure/docs/vqa/input/zh_val_42.jpg
+  # if you want to predict using the groundtruth ocr info,
+  # you can use the following config
+  # infer_img: train_data/XFUND/zh_val/val.json
+  # infer_mode: False
+  save_res_path: ./output/ser/xfund_zh/res

 Architecture:
   model_type: vqa
-  algorithm: &algorithm "LayoutLM"
+  algorithm: &algorithm "LayoutXLM"
   Transform:
   Backbone:
-    name: LayoutLMForSer
+    name: LayoutXLMForSer
     pretrained: True
     checkpoints:
+    # one of base or vi
+    mode: vi
     num_classes: &num_classes 7

 Loss:
   name: VQASerTokenLayoutLMLoss
   num_classes: *num_classes
+  key: "backbone_out"

 Optimizer:
   name: AdamW
@@ -43,7 +51,7 @@ Optimizer:
 PostProcess:
   name: VQASerTokenLayoutLMPostProcess
-  class_path: &class_path ./train_data/FUNSD/class_list.txt
+  class_path: &class_path train_data/XFUND/class_list_xfun.txt

 Metric:
   name: VQASerTokenMetric
@@ -52,9 +60,10 @@ Metric:
 Train:
   dataset:
     name: SimpleDataSet
-    data_dir: ./train_data/FUNSD/training_data/images/
+    data_dir: train_data/XFUND/zh_train/image
     label_file_list:
-      - ./train_data/FUNSD/train.json
+      - train_data/XFUND/zh_train/train.json
+    ratio_list: [ 1.0 ]
     transforms:
       - DecodeImage: # load image
           img_mode: RGB
@@ -64,6 +73,8 @@ Train:
           algorithm: *algorithm
           class_path: *class_path
           use_textline_bbox_info: &use_textline_bbox_info True
+          # one of [None, "tb-yx"]
+          order_method: &order_method "tb-yx"
       - VQATokenPad:
           max_seq_len: &max_seq_len 512
           return_attention_mask: True
@@ -78,8 +89,7 @@ Train:
           order: 'hwc'
       - ToCHWImage:
       - KeepKeys:
-          # dataloader will return list in this order
-          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
+          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels'] # dataloader will return list in this order
   loader:
     shuffle: True
     drop_last: False
@@ -89,9 +99,9 @@ Train:
 Eval:
   dataset:
     name: SimpleDataSet
-    data_dir: train_data/FUNSD/testing_data/images/
+    data_dir: train_data/XFUND/zh_val/image
     label_file_list:
-      - ./train_data/FUNSD/test.json
+      - train_data/XFUND/zh_val/val.json
     transforms:
       - DecodeImage: # load image
           img_mode: RGB
@@ -101,6 +111,7 @@ Eval:
           algorithm: *algorithm
           class_path: *class_path
           use_textline_bbox_info: *use_textline_bbox_info
+          order_method: *order_method
       - VQATokenPad:
           max_seq_len: *max_seq_len
           return_attention_mask: True
@@ -115,8 +126,7 @@ Eval:
           order: 'hwc'
       - ToCHWImage:
       - KeepKeys:
-          # dataloader will return list in this order
-          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
+          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels'] # dataloader will return list in this order
   loader:
     shuffle: False
     drop_last: False
......
@@ -3,30 +3,84 @@ Global:
   epoch_num: &epoch_num 200
   log_smooth_window: 10
   print_batch_step: 10
-  save_model_dir: ./output/ser_layoutxlm_funsd
+  save_model_dir: ./output/ser_vi_layoutxlm_xfund_zh_udml
   save_epoch_step: 2000
   # evaluation is run every 10 iterations after the 0th iteration
-  eval_batch_step: [ 0, 57 ]
+  eval_batch_step: [ 0, 19 ]
   cal_metric_during_train: False
   save_inference_dir:
   use_visualdl: False
   seed: 2022
-  infer_img: train_data/FUNSD/testing_data/images/83624198.png
-  save_res_path: output/ser_layoutxlm_funsd/res/
+  infer_img: ppstructure/docs/vqa/input/zh_val_42.jpg
+  save_res_path: ./output/ser_layoutxlm_xfund_zh/res

 Architecture:
-  model_type: vqa
-  algorithm: &algorithm "LayoutXLM"
-  Transform:
-  Backbone:
-    name: LayoutXLMForSer
-    pretrained: True
-    checkpoints:
-    num_classes: &num_classes 7
+  model_type: &model_type "vqa"
+  name: DistillationModel
+  algorithm: Distillation
+  Models:
+    Teacher:
+      pretrained:
+      freeze_params: false
+      return_all_feats: true
+      model_type: *model_type
+      algorithm: &algorithm "LayoutXLM"
+      Transform:
+      Backbone:
+        name: LayoutXLMForSer
+        pretrained: True
+        # one of base or vi
+        mode: vi
+        checkpoints:
+        num_classes: &num_classes 7
+    Student:
+      pretrained:
+      freeze_params: false
+      return_all_feats: true
+      model_type: *model_type
+      algorithm: *algorithm
+      Transform:
+      Backbone:
+        name: LayoutXLMForSer
+        pretrained: True
+        # one of base or vi
+        mode: vi
+        checkpoints:
+        num_classes: *num_classes

 Loss:
-  name: VQASerTokenLayoutLMLoss
-  num_classes: *num_classes
+  name: CombinedLoss
+  loss_config_list:
+  - DistillationVQASerTokenLayoutLMLoss:
+      weight: 1.0
+      model_name_list: ["Student", "Teacher"]
+      key: backbone_out
+      num_classes: *num_classes
+  - DistillationSERDMLLoss:
+      weight: 1.0
+      act: "softmax"
+      use_log: true
+      model_name_pairs:
+      - ["Student", "Teacher"]
+      key: backbone_out
+  - DistillationVQADistanceLoss:
+      weight: 0.5
+      mode: "l2"
+      model_name_pairs:
+      - ["Student", "Teacher"]
+      key: hidden_states_5
+      name: "loss_5"
+  - DistillationVQADistanceLoss:
+      weight: 0.5
+      mode: "l2"
+      model_name_pairs:
+      - ["Student", "Teacher"]
+      key: hidden_states_8
+      name: "loss_8"

 Optimizer:
   name: AdamW
@@ -36,25 +90,29 @@ Optimizer:
     name: Linear
     learning_rate: 0.00005
     epochs: *epoch_num
-    warmup_epoch: 2
+    warmup_epoch: 10
   regularizer:
     name: L2
     factor: 0.00000

 PostProcess:
-  name: VQASerTokenLayoutLMPostProcess
-  class_path: &class_path ./train_data/FUNSD/class_list.txt
+  name: DistillationSerPostProcess
+  model_name: ["Student", "Teacher"]
+  key: backbone_out
+  class_path: &class_path train_data/XFUND/class_list_xfun.txt

 Metric:
-  name: VQASerTokenMetric
+  name: DistillationMetric
+  base_metric_name: VQASerTokenMetric
   main_indicator: hmean
+  key: "Student"

 Train:
   dataset:
     name: SimpleDataSet
-    data_dir: ./train_data/FUNSD/training_data/images/
+    data_dir: train_data/XFUND/zh_train/image
     label_file_list:
-      - ./train_data/FUNSD/train.json
+      - train_data/XFUND/zh_train/train.json
     ratio_list: [ 1.0 ]
     transforms:
       - DecodeImage: # load image
@@ -64,6 +122,8 @@ Train:
           contains_re: False
           algorithm: *algorithm
           class_path: *class_path
+          # one of [None, "tb-yx"]
+          order_method: &order_method "tb-yx"
       - VQATokenPad:
           max_seq_len: &max_seq_len 512
           return_attention_mask: True
@@ -78,20 +138,19 @@ Train:
           order: 'hwc'
       - ToCHWImage:
       - KeepKeys:
-          # dataloader will return list in this order
-          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
+          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels'] # dataloader will return list in this order
   loader:
     shuffle: True
     drop_last: False
-    batch_size_per_card: 8
+    batch_size_per_card: 4
     num_workers: 4

 Eval:
   dataset:
     name: SimpleDataSet
-    data_dir: train_data/FUNSD/testing_data/images/
+    data_dir: train_data/XFUND/zh_val/image
     label_file_list:
-      - ./train_data/FUNSD/test.json
+      - train_data/XFUND/zh_val/val.json
     transforms:
       - DecodeImage: # load image
           img_mode: RGB
@@ -100,6 +159,7 @@ Eval:
          contains_re: False
           algorithm: *algorithm
           class_path: *class_path
+          order_method: *order_method
       - VQATokenPad:
           max_seq_len: *max_seq_len
           return_attention_mask: True
@@ -114,10 +174,10 @@ Eval:
           order: 'hwc'
       - ToCHWImage:
       - KeepKeys:
-          # dataloader will return list in this order
-          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
+          keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels'] # dataloader will return list in this order
   loader:
     shuffle: False
     drop_last: False
     batch_size_per_card: 8
     num_workers: 4
Global:
use_gpu: true
epoch_num: 8
log_smooth_window: 200
print_batch_step: 200
save_model_dir: ./output/rec/r45_visionlan
save_epoch_step: 1
# evaluation is run every 2000 iterations
eval_batch_step: [0, 2000]
cal_metric_during_train: True
pretrained_model:
checkpoints:
save_inference_dir:
use_visualdl: True
infer_img: doc/imgs_words/en/word_2.png
# for data or label process
character_dict_path:
max_text_length: &max_text_length 25
training_step: &training_step LA
infer_mode: False
use_space_char: False
save_res_path: ./output/rec/predicts_visionlan.txt
Optimizer:
name: Adam
beta1: 0.9
beta2: 0.999
clip_norm: 20.0
group_lr: true
training_step: *training_step
lr:
name: Piecewise
decay_epochs: [6]
values: [0.0001, 0.00001]
regularizer:
name: 'L2'
factor: 0
Architecture:
model_type: rec
algorithm: VisionLAN
Transform:
Backbone:
name: ResNet45
strides: [2, 2, 2, 1, 1]
Head:
name: VLHead
n_layers: 3
n_position: 256
n_dim: 512
max_text_length: *max_text_length
training_step: *training_step
Loss:
name: VLLoss
mode: *training_step
weight_res: 0.5
weight_mas: 0.5
PostProcess:
name: VLLabelDecode
Metric:
name: RecMetric
is_filter: true
Train:
dataset:
name: LMDBDataSet
data_dir: ./train_data/data_lmdb_release/training/
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- ABINetRecAug:
- VLLabelEncode: # Class handling label
- VLRecResizeImg:
image_shape: [3, 64, 256]
- KeepKeys:
keep_keys: ['image', 'label', 'label_res', 'label_sub', 'label_id', 'length'] # dataloader will return list in this order
loader:
shuffle: True
batch_size_per_card: 220
drop_last: True
num_workers: 4
Eval:
dataset:
name: LMDBDataSet
data_dir: ./train_data/data_lmdb_release/validation/
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VLLabelEncode: # Class handling label
- VLRecResizeImg:
image_shape: [3, 64, 256]
- KeepKeys:
keep_keys: ['image', 'label', 'label_res', 'label_sub', 'label_id', 'length'] # dataloader will return list in this order
loader:
shuffle: False
drop_last: False
batch_size_per_card: 64
num_workers: 4
Global:
use_gpu: True
epoch_num: &epoch_num 200
log_smooth_window: 10
print_batch_step: 10
save_model_dir: ./output/re_layoutlmv2_funsd
save_epoch_step: 2000
# evaluation is run every 10 iterations after the 0th iteration
eval_batch_step: [ 0, 57 ]
cal_metric_during_train: False
save_inference_dir:
use_visualdl: False
seed: 2022
infer_img: train_data/FUNSD/testing_data/images/83624198.png
save_res_path: ./output/re_layoutlmv2_funsd/res/
Architecture:
model_type: vqa
algorithm: &algorithm "LayoutLMv2"
Transform:
Backbone:
name: LayoutLMv2ForRe
pretrained: True
checkpoints:
Loss:
name: LossFromOutput
key: loss
reduction: mean
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
clip_norm: 10
lr:
learning_rate: 0.00005
warmup_epoch: 10
regularizer:
name: L2
factor: 0.00000
PostProcess:
name: VQAReTokenLayoutLMPostProcess
Metric:
name: VQAReTokenMetric
main_indicator: hmean
Train:
dataset:
name: SimpleDataSet
data_dir: ./train_data/FUNSD/training_data/images/
label_file_list:
- ./train_data/FUNSD/train.json
ratio_list: [ 1.0 ]
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: True
algorithm: *algorithm
class_path: &class_path train_data/FUNSD/class_list.txt
- VQATokenPad:
max_seq_len: &max_seq_len 512
return_attention_mask: True
- VQAReTokenRelation:
- VQAReTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1./255.
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'entities', 'relations']
loader:
shuffle: True
drop_last: False
batch_size_per_card: 8
num_workers: 8
collate_fn: ListCollator
Eval:
dataset:
name: SimpleDataSet
data_dir: ./train_data/FUNSD/testing_data/images/
label_file_list:
- ./train_data/FUNSD/test.json
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: True
algorithm: *algorithm
class_path: *class_path
- VQATokenPad:
max_seq_len: *max_seq_len
return_attention_mask: True
- VQAReTokenRelation:
- VQAReTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1./255.
mean: [0.485, 0.456, 0.406]
std: [0.229, 0.224, 0.225]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'entities', 'relations']
loader:
shuffle: False
drop_last: False
batch_size_per_card: 8
num_workers: 8
collate_fn: ListCollator
Global:
use_gpu: True
epoch_num: &epoch_num 200
log_smooth_window: 10
print_batch_step: 10
save_model_dir: ./output/ser_layoutlm_sroie
save_epoch_step: 2000
# evaluation is run every 10 iterations after the 0th iteration
eval_batch_step: [ 0, 200 ]
cal_metric_during_train: False
save_inference_dir:
use_visualdl: False
seed: 2022
infer_img: train_data/SROIE/test/X00016469670.jpg
save_res_path: ./output/ser_layoutlm_sroie/res/
Architecture:
model_type: vqa
algorithm: &algorithm "LayoutLM"
Transform:
Backbone:
name: LayoutLMForSer
pretrained: True
checkpoints:
num_classes: &num_classes 9
Loss:
name: VQASerTokenLayoutLMLoss
num_classes: *num_classes
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
lr:
name: Linear
learning_rate: 0.00005
epochs: *epoch_num
warmup_epoch: 2
regularizer:
name: L2
factor: 0.00000
PostProcess:
name: VQASerTokenLayoutLMPostProcess
class_path: &class_path ./train_data/SROIE/class_list.txt
Metric:
name: VQASerTokenMetric
main_indicator: hmean
Train:
dataset:
name: SimpleDataSet
data_dir: ./train_data/SROIE/train
label_file_list:
- ./train_data/SROIE/train.txt
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: False
algorithm: *algorithm
class_path: *class_path
use_textline_bbox_info: &use_textline_bbox_info True
- VQATokenPad:
max_seq_len: &max_seq_len 512
return_attention_mask: True
- VQASerTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
loader:
shuffle: True
drop_last: False
batch_size_per_card: 8
num_workers: 4
Eval:
dataset:
name: SimpleDataSet
data_dir: ./train_data/SROIE/test
label_file_list:
- ./train_data/SROIE/test.txt
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: False
algorithm: *algorithm
class_path: *class_path
use_textline_bbox_info: *use_textline_bbox_info
- VQATokenPad:
max_seq_len: *max_seq_len
return_attention_mask: True
- VQASerTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
loader:
shuffle: False
drop_last: False
batch_size_per_card: 8
num_workers: 4
Global:
use_gpu: True
epoch_num: &epoch_num 200
log_smooth_window: 10
print_batch_step: 10
save_model_dir: ./output/ser_layoutlmv2_funsd
save_epoch_step: 2000
# evaluation is run every 10 iterations after the 0th iteration
eval_batch_step: [ 0, 100 ]
cal_metric_during_train: False
save_inference_dir:
use_visualdl: False
seed: 2022
infer_img: train_data/FUNSD/testing_data/images/83624198.png
save_res_path: ./output/ser_layoutlmv2_funsd/res/
Architecture:
model_type: vqa
algorithm: &algorithm "LayoutLMv2"
Transform:
Backbone:
name: LayoutLMv2ForSer
pretrained: True
checkpoints:
num_classes: &num_classes 7
Loss:
name: VQASerTokenLayoutLMLoss
num_classes: *num_classes
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
lr:
name: Linear
learning_rate: 0.00005
epochs: *epoch_num
warmup_epoch: 2
regularizer:
name: L2
factor: 0.00000
PostProcess:
name: VQASerTokenLayoutLMPostProcess
class_path: &class_path train_data/FUNSD/class_list.txt
Metric:
name: VQASerTokenMetric
main_indicator: hmean
Train:
dataset:
name: SimpleDataSet
data_dir: ./train_data/FUNSD/training_data/images/
label_file_list:
- ./train_data/FUNSD/train.json
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: False
algorithm: *algorithm
class_path: *class_path
- VQATokenPad:
max_seq_len: &max_seq_len 512
return_attention_mask: True
- VQASerTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
loader:
shuffle: True
drop_last: False
batch_size_per_card: 8
num_workers: 4
Eval:
dataset:
name: SimpleDataSet
data_dir: ./train_data/FUNSD/testing_data/images/
label_file_list:
- ./train_data/FUNSD/test.json
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: False
algorithm: *algorithm
class_path: *class_path
- VQATokenPad:
max_seq_len: *max_seq_len
return_attention_mask: True
- VQASerTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
loader:
shuffle: False
drop_last: False
batch_size_per_card: 8
num_workers: 4
Global:
use_gpu: True
epoch_num: &epoch_num 200
log_smooth_window: 10
print_batch_step: 10
save_model_dir: ./output/ser_layoutlmv2_sroie
save_epoch_step: 2000
# evaluation is run every 10 iterations after the 0th iteration
eval_batch_step: [ 0, 200 ]
cal_metric_during_train: False
save_inference_dir:
use_visualdl: False
seed: 2022
infer_img: train_data/SROIE/test/X00016469670.jpg
save_res_path: ./output/ser_layoutlmv2_sroie/res/
Architecture:
model_type: vqa
algorithm: &algorithm "LayoutLMv2"
Transform:
Backbone:
name: LayoutLMv2ForSer
pretrained: True
checkpoints:
num_classes: &num_classes 9
Loss:
name: VQASerTokenLayoutLMLoss
num_classes: *num_classes
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
lr:
name: Linear
learning_rate: 0.00005
epochs: *epoch_num
warmup_epoch: 2
regularizer:
name: L2
factor: 0.00000
PostProcess:
name: VQASerTokenLayoutLMPostProcess
class_path: &class_path ./train_data/SROIE/class_list.txt
Metric:
name: VQASerTokenMetric
main_indicator: hmean
Train:
dataset:
name: SimpleDataSet
data_dir: ./train_data/SROIE/train
label_file_list:
- ./train_data/SROIE/train.txt
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: False
algorithm: *algorithm
class_path: *class_path
- VQATokenPad:
max_seq_len: &max_seq_len 512
return_attention_mask: True
- VQASerTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
loader:
shuffle: True
drop_last: False
batch_size_per_card: 8
num_workers: 4
Eval:
dataset:
name: SimpleDataSet
data_dir: ./train_data/SROIE/test
label_file_list:
- ./train_data/SROIE/test.txt
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: False
algorithm: *algorithm
class_path: *class_path
- VQATokenPad:
max_seq_len: *max_seq_len
return_attention_mask: True
- VQASerTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
loader:
shuffle: False
drop_last: False
batch_size_per_card: 8
num_workers: 4
Global:
use_gpu: True
epoch_num: &epoch_num 200
log_smooth_window: 10
print_batch_step: 10
save_model_dir: ./output/ser_layoutxlm_sroie
save_epoch_step: 2000
# evaluation is run every 10 iterations after the 0th iteration
eval_batch_step: [ 0, 200 ]
cal_metric_during_train: False
save_inference_dir:
use_visualdl: False
seed: 2022
infer_img: train_data/SROIE/test/X00016469670.jpg
save_res_path: res_img_aug_with_gt
Architecture:
model_type: vqa
algorithm: &algorithm "LayoutXLM"
Transform:
Backbone:
name: LayoutXLMForSer
pretrained: True
checkpoints:
num_classes: &num_classes 9
Loss:
name: VQASerTokenLayoutLMLoss
num_classes: *num_classes
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
lr:
name: Linear
learning_rate: 0.00005
epochs: *epoch_num
warmup_epoch: 2
regularizer:
name: L2
factor: 0.00000
PostProcess:
name: VQASerTokenLayoutLMPostProcess
class_path: &class_path ./train_data/SROIE/class_list.txt
Metric:
name: VQASerTokenMetric
main_indicator: hmean
Train:
dataset:
name: SimpleDataSet
data_dir: ./train_data/SROIE/train
label_file_list:
- ./train_data/SROIE/train.txt
ratio_list: [ 1.0 ]
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: False
algorithm: *algorithm
class_path: *class_path
- VQATokenPad:
max_seq_len: &max_seq_len 512
return_attention_mask: True
- VQASerTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
loader:
shuffle: True
drop_last: False
batch_size_per_card: 8
num_workers: 4
Eval:
dataset:
name: SimpleDataSet
data_dir: train_data/SROIE/test
label_file_list:
- ./train_data/SROIE/test.txt
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: False
algorithm: *algorithm
class_path: *class_path
- VQATokenPad:
max_seq_len: *max_seq_len
return_attention_mask: True
- VQASerTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
loader:
shuffle: False
drop_last: False
batch_size_per_card: 8
num_workers: 4
Global:
use_gpu: True
epoch_num: &epoch_num 100
log_smooth_window: 10
print_batch_step: 10
save_model_dir: ./output/ser_layoutxlm_wildreceipt
save_epoch_step: 2000
# evaluation is run every 10 iterations after the 0th iteration
eval_batch_step: [ 0, 200 ]
cal_metric_during_train: False
save_inference_dir:
use_visualdl: False
seed: 2022
infer_img: train_data//wildreceipt/image_files/Image_12/10/845be0dd6f5b04866a2042abd28d558032ef2576.jpeg
save_res_path: ./output/ser_layoutxlm_wildreceipt/res
Architecture:
model_type: vqa
algorithm: &algorithm "LayoutXLM"
Transform:
Backbone:
name: LayoutXLMForSer
pretrained: True
checkpoints:
num_classes: &num_classes 51
Loss:
name: VQASerTokenLayoutLMLoss
num_classes: *num_classes
Optimizer:
name: AdamW
beta1: 0.9
beta2: 0.999
lr:
name: Linear
learning_rate: 0.00005
epochs: *epoch_num
warmup_epoch: 2
regularizer:
name: L2
factor: 0.00000
PostProcess:
name: VQASerTokenLayoutLMPostProcess
class_path: &class_path ./train_data/wildreceipt/class_list.txt
Metric:
name: VQASerTokenMetric
main_indicator: hmean
Train:
dataset:
name: SimpleDataSet
data_dir: ./train_data/wildreceipt/
label_file_list:
- ./train_data/wildreceipt/wildreceipt_train.txt
ratio_list: [ 1.0 ]
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: False
algorithm: *algorithm
class_path: *class_path
- VQATokenPad:
max_seq_len: &max_seq_len 512
return_attention_mask: True
- VQASerTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
loader:
shuffle: True
drop_last: False
batch_size_per_card: 8
num_workers: 4
Eval:
dataset:
name: SimpleDataSet
data_dir: train_data/wildreceipt
label_file_list:
- ./train_data/wildreceipt/wildreceipt_test.txt
transforms:
- DecodeImage: # load image
img_mode: RGB
channel_first: False
- VQATokenLabelEncode: # Class handling label
contains_re: False
algorithm: *algorithm
class_path: *class_path
- VQATokenPad:
max_seq_len: *max_seq_len
return_attention_mask: True
- VQASerTokenChunk:
max_seq_len: *max_seq_len
- Resize:
size: [224,224]
- NormalizeImage:
scale: 1
mean: [ 123.675, 116.28, 103.53 ]
std: [ 58.395, 57.12, 57.375 ]
order: 'hwc'
- ToCHWImage:
- KeepKeys:
# dataloader will return list in this order
keep_keys: [ 'input_ids', 'bbox', 'attention_mask', 'token_type_ids', 'image', 'labels']
loader:
shuffle: False
drop_last: False
batch_size_per_card: 8
num_workers: 4
@@ -53,10 +53,11 @@ PP-OCRv3检测模型是对PP-OCRv2中的[CML](https://arxiv.org/pdf/2109.03144.p
 |序号|策略|模型大小|hmean|速度(cpu + mkldnn)|
 |-|-|-|-|-|
-|baseline teacher|DB-R50|99M|83.5%|260ms|
+|baseline teacher|PP-OCR server|49M|83.2%|171ms|
 |teacher1|DB-R50-LK-PAN|124M|85.0%|396ms|
 |teacher2|DB-R50-LK-PAN-DML|124M|86.0%|396ms|
 |baseline student|PP-OCRv2|3M|83.2%|117ms|
+|student0|DB-MV3-RSE-FPN|3.6M|84.5%|124ms|
 |student1|DB-MV3-CML(teacher2)|3M|84.3%|117ms|
 |student2|DB-MV3-RSE-FPN-CML(teacher2)|3.6M|85.4%|124ms|
@@ -184,7 +185,7 @@ UDML(Unified-Deep Mutual Learning)联合互学习是PP-OCRv2中就采用的
 **(6)UIM:无标注数据挖掘方案**
-UIM(Unlabeled Images Mining)是一种非常简单的无标注数据挖掘方案。核心思想是利用高精度的文本识别大模型对无标注数据进行预测,获取伪标签,并且选择预测置信度高的样本作为训练数据,用于训练小模型。使用该策略,识别模型的准确率进一步提升到79.4%(+1%)。
+UIM(Unlabeled Images Mining)是一种非常简单的无标注数据挖掘方案。核心思想是利用高精度的文本识别大模型对无标注数据进行预测,获取伪标签,并且选择预测置信度高的样本作为训练数据,用于训练小模型。使用该策略,识别模型的准确率进一步提升到79.4%(+1%)。实际操作中,我们使用全量数据集训练高精度SVTR-Tiny模型(acc=82.5%)进行数据挖掘,点击获取[模型下载地址和使用教程](../../applications/高精度中文识别模型.md)。
 <div align="center">
 <img src="../ppocr_v3/UIM.png" width="500">
......
@@ -69,6 +69,7 @@
 - [x] [SVTR](./algorithm_rec_svtr.md)
 - [x] [ViTSTR](./algorithm_rec_vitstr.md)
 - [x] [ABINet](./algorithm_rec_abinet.md)
+- [x] [VisionLAN](./algorithm_rec_visionlan.md)
 - [x] [SPIN](./algorithm_rec_spin.md)
 - [x] [RobustScanner](./algorithm_rec_robustscanner.md)
@@ -91,6 +92,7 @@
 |SVTR|SVTR-Tiny| 89.25% | rec_svtr_tiny_none_ctc_en | [训练模型](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar) |
 |ViTSTR|ViTSTR| 79.82% | rec_vitstr_none_ce | [训练模型](https://paddleocr.bj.bcebos.com/rec_vitstr_none_ce_train.tar) |
 |ABINet|Resnet45| 90.75% | rec_r45_abinet | [训练模型](https://paddleocr.bj.bcebos.com/rec_r45_abinet_train.tar) |
+|VisionLAN|Resnet45| 90.30% | rec_r45_visionlan | [训练模型](https://paddleocr.bj.bcebos.com/rec_r45_visionlan_train.tar) |
 |SPIN|ResNet32| 90.00% | rec_r32_gaspin_bilstm_att | coming soon |
 |RobustScanner|ResNet31| 87.77% | rec_r31_robustscanner | coming soon |
......
# 场景文本识别算法-VisionLAN
- [1. 算法简介](#1)
- [2. 环境配置](#2)
- [3. 模型训练、评估、预测](#3)
- [3.1 训练](#3-1)
- [3.2 评估](#3-2)
- [3.3 预测](#3-3)
- [4. 推理部署](#4)
- [4.1 Python推理](#4-1)
- [4.2 C++推理](#4-2)
- [4.3 Serving服务化部署](#4-3)
- [4.4 更多推理部署](#4-4)
- [5. FAQ](#5)
<a name="1"></a>
## 1. 算法简介
论文信息:
> [From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network](https://arxiv.org/abs/2108.09661)
> Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, Yongdong Zhang
> ICCV, 2021
<a name="model"></a>
`VisionLAN`使用MJSynth和SynthText两个文字识别数据集训练,在IIIT, SVT, IC13, IC15, SVTP, CUTE数据集上进行评估,算法复现效果如下:
|模型|骨干网络|配置文件|Acc|下载链接|
| --- | --- | --- | --- | --- |
|VisionLAN|ResNet45|[rec_r45_visionlan.yml](../../configs/rec/rec_r45_visionlan.yml)|90.3%|[预训练、训练模型](https://paddleocr.bj.bcebos.com/rec_r45_visionlan_train.tar)|
<a name="2"></a>
## 2. 环境配置
请先参考[《运行环境准备》](./environment.md)配置PaddleOCR运行环境,参考[《项目克隆》](./clone.md)克隆项目代码。
<a name="3"></a>
## 3. 模型训练、评估、预测
<a name="3-1"></a>
### 3.1 训练
请参考[文本识别训练教程](./recognition.md)。PaddleOCR对代码进行了模块化,训练`VisionLAN`识别模型时需要将配置文件**更换**为`VisionLAN`的[配置文件](../../configs/rec/rec_r45_visionlan.yml)。
#### 启动训练
具体地,在完成数据准备后,便可以启动训练,训练命令如下:
```shell
#单卡训练(训练周期长,不建议)
python3 tools/train.py -c configs/rec/rec_r45_visionlan.yml
#多卡训练,通过--gpus参数指定卡号
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/rec/rec_r45_visionlan.yml
```
<a name="3-2"></a>
### 3.2 评估
可下载已训练完成的[模型文件](#model),使用如下命令进行评估:
```shell
# 注意将pretrained_model的路径设置为本地路径。
python3 tools/eval.py -c configs/rec/rec_r45_visionlan.yml -o Global.pretrained_model=./rec_r45_visionlan_train/best_accuracy
```
<a name="3-3"></a>
### 3.3 预测
使用如下命令进行单张图片预测:
```shell
# 注意将pretrained_model的路径设置为本地路径。
python3 tools/infer_rec.py -c configs/rec/rec_r45_visionlan.yml -o Global.infer_img='./doc/imgs_words/en/word_2.png' Global.pretrained_model=./rec_r45_visionlan_train/best_accuracy
# 预测文件夹下所有图像时,可修改infer_img为文件夹,如 Global.infer_img='./doc/imgs_words_en/'。
```
<a name="4"></a>
## 4. 推理部署
<a name="4-1"></a>
### 4.1 Python推理
首先将训练得到的best模型转换成inference model。这里以已完成训练的模型为例([模型下载地址](https://paddleocr.bj.bcebos.com/rec_r45_visionlan_train.tar)),可以使用如下命令进行转换:
```shell
# 注意将pretrained_model的路径设置为本地路径。
python3 tools/export_model.py -c configs/rec/rec_r45_visionlan.yml -o Global.pretrained_model=./rec_r45_visionlan_train/best_accuracy Global.save_inference_dir=./inference/rec_r45_visionlan/
```
**注意:**
- 如果您是在自己的数据集上训练的模型,并且调整了字典文件,请检查配置文件中的`character_dict_path`是否为所需要的字典文件。
- 如果您修改了训练时的输入大小,请修改`tools/export_model.py`文件中的对应VisionLAN的`infer_shape`
转换成功后,在目录下有三个文件:
```
./inference/rec_r45_visionlan/
├── inference.pdiparams # 识别inference模型的参数文件
├── inference.pdiparams.info # 识别inference模型的参数信息,可忽略
└── inference.pdmodel # 识别inference模型的program文件
```
执行如下命令进行模型推理:
```shell
python3 tools/infer/predict_rec.py --image_dir='./doc/imgs_words/en/word_2.png' --rec_model_dir='./inference/rec_r45_visionlan/' --rec_algorithm='VisionLAN' --rec_image_shape='3,64,256' --rec_char_dict_path='./ppocr/utils/dict36.txt'
# 预测文件夹下所有图像时,可修改image_dir为文件夹,如 --image_dir='./doc/imgs_words_en/'。
```
![](../imgs_words/en/word_2.png)
执行命令后,上面图像的预测结果(识别的文本和得分)会打印到屏幕上,示例如下:
```shell
Predicts of ./doc/imgs_words/en/word_2.png:('yourself', 0.97076982)
```
**注意**
- 训练上述模型采用的图像分辨率是[3,64,256],需要通过参数`rec_image_shape`设置为您训练时的识别图像形状。
- 在推理时需要设置参数`rec_char_dict_path`指定字典,如果您修改了字典,请修改该参数为您的字典文件。
- 如果您修改了预处理方法,需修改`tools/infer/predict_rec.py`中VisionLAN的预处理为您的预处理方法。
<a name="4-2"></a>
### 4.2 C++推理
由于VisionLAN的C++预处理、后处理尚未支持,因此暂不支持C++推理部署。
<a name="4-3"></a>
### 4.3 Serving服务化部署
暂不支持
<a name="4-4"></a>
### 4.4 更多推理部署
暂不支持
<a name="5"></a>
## 5. FAQ
1. MJSynth和SynthText两种数据集来自于[VisionLAN源repo](https://github.com/wangyuxin87/VisionLAN)
2. 我们使用VisionLAN作者提供的预训练模型进行finetune训练。
## 引用
```bibtex
@inproceedings{wang2021two,
title={From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network},
author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={14194--14203},
year={2021}
}
```
...@@ -65,7 +65,7 @@ python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/ ...@@ -65,7 +65,7 @@ python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/
``` ```
上述指令中,通过-c 选择训练使用configs/det/det_db_mv3.yml配置文件。 上述指令中,通过-c 选择训练使用configs/det/det_mv3_db.yml配置文件。
有关配置文件的详细解释,请参考[链接](./config.md) 有关配置文件的详细解释,请参考[链接](./config.md)
您也可以通过-o参数在不需要修改yml文件的情况下,改变训练的参数,比如,调整训练的学习率为0.0001 您也可以通过-o参数在不需要修改yml文件的情况下,改变训练的参数,比如,调整训练的学习率为0.0001
......
...@@ -55,10 +55,11 @@ The ablation experiments are as follows: ...@@ -55,10 +55,11 @@ The ablation experiments are as follows:
|ID|Strategy|Model Size|Hmean|The Inference Time(cpu + mkldnn)| |ID|Strategy|Model Size|Hmean|The Inference Time(cpu + mkldnn)|
|-|-|-|-|-| |-|-|-|-|-|
|baseline teacher|DB-R50|99M|83.5%|260ms| |baseline teacher|PP-OCR server|49M|83.2%|171ms|
|teacher1|DB-R50-LK-PAN|124M|85.0%|396ms| |teacher1|DB-R50-LK-PAN|124M|85.0%|396ms|
|teacher2|DB-R50-LK-PAN-DML|124M|86.0%|396ms| |teacher2|DB-R50-LK-PAN-DML|124M|86.0%|396ms|
|baseline student|PP-OCRv2|3M|83.2%|117ms| |baseline student|PP-OCRv2|3M|83.2%|117ms|
|student0|DB-MV3-RSE-FPN|3.6M|84.5%|124ms|
|student1|DB-MV3-CML(teacher2)|3M|84.3%|117ms| |student1|DB-MV3-CML(teacher2)|3M|84.3%|117ms|
|student2|DB-MV3-RSE-FPN-CML(teacher2)|3.6M|85.4%|124ms| |student2|DB-MV3-RSE-FPN-CML(teacher2)|3.6M|85.4%|124ms|
...@@ -199,7 +200,7 @@ UDML (Unified-Deep Mutual Learning) is a strategy proposed in PP-OCRv2 which is ...@@ -199,7 +200,7 @@ UDML (Unified-Deep Mutual Learning) is a strategy proposed in PP-OCRv2 which is
**(6)UIM:Unlabeled Images Mining** **(6)UIM:Unlabeled Images Mining**
UIM (Unlabeled Images Mining) is a very simple unlabeled data mining strategy. The main idea is to use a high-precision text recognition model to predict unlabeled images to obtain pseudo-labels, and select samples with high prediction confidence as training data for training lightweight models. Using this strategy, the accuracy of the recognition model is further improved to 79.4% (+1%). UIM (Unlabeled Images Mining) is a very simple unlabeled data mining strategy. The main idea is to use a high-precision text recognition model to predict unlabeled images to obtain pseudo-labels, and select samples with high prediction confidence as training data for training lightweight models. Using this strategy, the accuracy of the recognition model is further improved to 79.4% (+1%). In practice, we use the full data set to train the high-precision SVTR_Tiny model (acc=82.5%) for data mining. [SVTR_Tiny model download and tutorial](../../applications/高精度中文识别模型.md).
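As a rough illustration of this selection step, here is a minimal sketch (it is **not** PaddleOCR's actual implementation; `recognize` and `conf_thresh` are placeholder names for the teacher-model predictor and the confidence threshold):
```python
from typing import Callable, Iterable, List, Tuple

def mine_pseudo_labels(image_paths: Iterable[str],
                       recognize: Callable[[str], Tuple[str, float]],
                       conf_thresh: float = 0.95) -> List[Tuple[str, str]]:
    """Keep only unlabeled images whose predicted text is high-confidence (illustrative only)."""
    mined = []
    for path in image_paths:
        text, score = recognize(path)   # prediction from the large, high-accuracy model
        if score >= conf_thresh:
            mined.append((path, text))  # (image, pseudo-label) pair for training the small model
    return mined
```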
<div align="center"> <div align="center">
<img src="../ppocr_v3/UIM.png" width="500"> <img src="../ppocr_v3/UIM.png" width="500">
......
...@@ -68,6 +68,7 @@ Supported text recognition algorithms (Click the link to get the tutorial): ...@@ -68,6 +68,7 @@ Supported text recognition algorithms (Click the link to get the tutorial):
- [x] [SVTR](./algorithm_rec_svtr_en.md) - [x] [SVTR](./algorithm_rec_svtr_en.md)
- [x] [ViTSTR](./algorithm_rec_vitstr_en.md) - [x] [ViTSTR](./algorithm_rec_vitstr_en.md)
- [x] [ABINet](./algorithm_rec_abinet_en.md) - [x] [ABINet](./algorithm_rec_abinet_en.md)
- [x] [VisionLAN](./algorithm_rec_visionlan_en.md)
- [x] [SPIN](./algorithm_rec_spin_en.md) - [x] [SPIN](./algorithm_rec_spin_en.md)
- [x] [RobustScanner](./algorithm_rec_robustscanner_en.md) - [x] [RobustScanner](./algorithm_rec_robustscanner_en.md)
...@@ -90,6 +91,7 @@ Refer to [DTRB](https://arxiv.org/abs/1904.01906), the training and evaluation r ...@@ -90,6 +91,7 @@ Refer to [DTRB](https://arxiv.org/abs/1904.01906), the training and evaluation r
|SVTR|SVTR-Tiny| 89.25% | rec_svtr_tiny_none_ctc_en | [trained model](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar) | |SVTR|SVTR-Tiny| 89.25% | rec_svtr_tiny_none_ctc_en | [trained model](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar) |
|ViTSTR|ViTSTR| 79.82% | rec_vitstr_none_ce | [trained model](https://paddleocr.bj.bcebos.com/rec_vitstr_none_none_train.tar) | |ViTSTR|ViTSTR| 79.82% | rec_vitstr_none_ce | [trained model](https://paddleocr.bj.bcebos.com/rec_vitstr_none_none_train.tar) |
|ABINet|Resnet45| 90.75% | rec_r45_abinet | [trained model](https://paddleocr.bj.bcebos.com/rec_r45_abinet_train.tar) | |ABINet|Resnet45| 90.75% | rec_r45_abinet | [trained model](https://paddleocr.bj.bcebos.com/rec_r45_abinet_train.tar) |
|VisionLAN|Resnet45| 90.30% | rec_r45_visionlan | [trained model](https://paddleocr.bj.bcebos.com/rec_r45_visionlan_train.tar) |
|SPIN|ResNet32| 90.00% | rec_r32_gaspin_bilstm_att | coming soon | |SPIN|ResNet32| 90.00% | rec_r32_gaspin_bilstm_att | coming soon |
|RobustScanner|ResNet31| 87.77% | rec_r31_robustscanner | coming soon | |RobustScanner|ResNet31| 87.77% | rec_r31_robustscanner | coming soon |
......
# VisionLAN
- [1. Introduction](#1)
- [2. Environment](#2)
- [3. Model Training / Evaluation / Prediction](#3)
- [3.1 Training](#3-1)
- [3.2 Evaluation](#3-2)
- [3.3 Prediction](#3-3)
- [4. Inference and Deployment](#4)
- [4.1 Python Inference](#4-1)
- [4.2 C++ Inference](#4-2)
- [4.3 Serving](#4-3)
- [4.4 More](#4-4)
- [5. FAQ](#5)
<a name="1"></a>
## 1. Introduction
Paper:
> [From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network](https://arxiv.org/abs/2108.09661)
> Yuxin Wang, Hongtao Xie, Shancheng Fang, Jing Wang, Shenggao Zhu, Yongdong Zhang
> ICCV, 2021
VisionLAN is trained on the MJSynth and SynthText text recognition datasets and evaluated on the IIIT, SVT, IC13, IC15, SVTP and CUTE datasets. The reproduced results are as follows:
|Model|Backbone|config|Acc|Download link|
| --- | --- | --- | --- | --- |
|VisionLAN|ResNet45|[rec_r45_visionlan.yml](../../configs/rec/rec_r45_visionlan.yml)|90.3%|[trained model](https://paddleocr.bj.bcebos.com/rec_r45_visionlan_train.tar)|
<a name="2"></a>
## 2. Environment
Please refer to ["Environment Preparation"](./environment_en.md) to configure the PaddleOCR environment, and refer to ["Project Clone"](./clone_en.md) to clone the project code.
<a name="3"></a>
## 3. Model Training / Evaluation / Prediction
Please refer to [Text Recognition Tutorial](./recognition_en.md). PaddleOCR modularizes the code, and training different recognition models only requires **changing the configuration file**.
Training:
Specifically, after the data preparation is completed, the training can be started. The training command is as follows:
```
#Single GPU training (long training period, not recommended)
python3 tools/train.py -c configs/rec/rec_r45_visionlan.yml
#Multi GPU training, specify the gpu number through the --gpus parameter
python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/rec/rec_r45_visionlan.yml
```
Evaluation:
```
# GPU evaluation
python3 tools/eval.py -c configs/rec/rec_r45_visionlan.yml -o Global.pretrained_model={path/to/weights}/best_accuracy
```
Prediction:
```
# The configuration file used for prediction must match the training
python3 tools/infer_rec.py -c configs/rec/rec_r45_visionlan.yml -o Global.infer_img='./doc/imgs_words/en/word_2.png' Global.pretrained_model=./rec_r45_visionlan_train/best_accuracy
```
<a name="4"></a>
## 4. Inference and Deployment
<a name="4-1"></a>
### 4.1 Python Inference
First, convert the model saved during the VisionLAN text recognition training process into an inference model. Taking the trained model as an example ([model download link](https://paddleocr.bj.bcebos.com/rec_r45_visionlan_train.tar)), you can use the following command to convert it:
```
python3 tools/export_model.py -c configs/rec/rec_r45_visionlan.yml -o Global.pretrained_model=./rec_r45_visionlan_train/best_accuracy Global.save_inference_dir=./inference/rec_r45_visionlan/
```
**Note:**
- If you are training the model on your own dataset and have modified the dictionary file, please make sure that `character_dict_path` in the configuration file points to your dictionary file.
- If you modified the input size during training, please modify the `infer_shape` corresponding to VisionLAN in the `tools/export_model.py` file accordingly, as sketched below.
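A hypothetical sketch of what that adjustment means (illustrative only, not the exact contents of `tools/export_model.py`; the variable names are assumptions):
```python
# Export scripts typically pin a fixed inference input shape per algorithm,
# so a custom training input size must be reflected there as well (illustrative values).
arch_config = {"algorithm": "VisionLAN"}   # in the real script this is parsed from the yml config
infer_shape = [3, 32, 100]                 # generic default
if arch_config["algorithm"] == "VisionLAN":
    infer_shape = [3, 64, 256]             # replace with your own input size if you changed it
print(infer_shape)
```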
After the conversion is successful, there are three files in the directory:
```
./inference/rec_r45_visionlan/
├── inference.pdiparams
├── inference.pdiparams.info
└── inference.pdmodel
```
For VisionLAN text recognition model inference, the following commands can be executed:
```
python3 tools/infer/predict_rec.py --image_dir='./doc/imgs_words/en/word_2.png' --rec_model_dir='./inference/rec_r45_visionlan/' --rec_algorithm='VisionLAN' --rec_image_shape='3,64,256' --rec_char_dict_path='./ppocr/utils/dict36.txt'
```
![](../imgs_words/en/word_2.png)
After executing the command, the prediction result (recognized text and score) of the image above is printed to the screen. An example is as follows:
```shell
Predicts of ./doc/imgs_words/en/word_2.png:('yourself', 0.97076982)
```
<a name="4-2"></a>
### 4.2 C++ Inference
Not supported
<a name="4-3"></a>
### 4.3 Serving
Not supported
<a name="4-4"></a>
### 4.4 More
Not supported
<a name="5"></a>
## 5. FAQ
1. Note that the MJSynth and SynthText datasets come from [VisionLAN repo](https://github.com/wangyuxin87/VisionLAN).
2. We fine-tune from the pre-trained model provided by the VisionLAN authors.
## Citation
```bibtex
@inproceedings{wang2021two,
title={From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network},
author={Wang, Yuxin and Xie, Hongtao and Fang, Shancheng and Wang, Jing and Zhu, Shenggao and Zhang, Yongdong},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={14194--14203},
year={2021}
}
```
...@@ -51,7 +51,7 @@ python3 tools/train.py -c configs/det/det_mv3_db.yml \ ...@@ -51,7 +51,7 @@ python3 tools/train.py -c configs/det/det_mv3_db.yml \
-o Global.pretrained_model=./pretrain_models/MobileNetV3_large_x0_5_pretrained -o Global.pretrained_model=./pretrain_models/MobileNetV3_large_x0_5_pretrained
``` ```
In the above instruction, use `-c` to select the training to use the `configs/det/det_db_mv3.yml` configuration file. In the above instruction, use `-c` to select the training to use the `configs/det/det_mv3_db.yml` configuration file.
For a detailed explanation of the configuration file, please refer to [config](./config_en.md). For a detailed explanation of the configuration file, please refer to [config](./config_en.md).
You can also use `-o` to change the training parameters without modifying the yml file. For example, adjust the training learning rate to 0.0001 You can also use `-o` to change the training parameters without modifying the yml file. For example, adjust the training learning rate to 0.0001
......
...@@ -25,8 +25,9 @@ from .make_pse_gt import MakePseGt ...@@ -25,8 +25,9 @@ from .make_pse_gt import MakePseGt
from .rec_img_aug import BaseDataAugmentation, RecAug, RecConAug, RecResizeImg, ClsResizeImg, \ from .rec_img_aug import BaseDataAugmentation, RecAug, RecConAug, RecResizeImg, ClsResizeImg, \
SRNRecResizeImg, GrayRecResizeImg, SARRecResizeImg, PRENResizeImg, \ SRNRecResizeImg, GrayRecResizeImg, SARRecResizeImg, PRENResizeImg, \
ABINetRecResizeImg, SVTRRecResizeImg, ABINetRecAug, SPINRecResizeImg, RobustScannerRecResizeImg ABINetRecResizeImg, SVTRRecResizeImg, ABINetRecAug, VLRecResizeImg, SPINRecResizeImg, RobustScannerRecResizeImg
from .ssl_img_aug import SSLRotateResize from .ssl_img_aug import SSLRotateResize
from .randaugment import RandAugment from .randaugment import RandAugment
from .copy_paste import CopyPaste from .copy_paste import CopyPaste
......
...@@ -23,7 +23,10 @@ import string ...@@ -23,7 +23,10 @@ import string
from shapely.geometry import LineString, Point, Polygon from shapely.geometry import LineString, Point, Polygon
import json import json
import copy import copy
from random import sample
from ppocr.utils.logging import get_logger from ppocr.utils.logging import get_logger
from ppocr.data.imaug.vqa.augment import order_by_tbyx
class ClsLabelEncode(object): class ClsLabelEncode(object):
...@@ -97,12 +100,13 @@ class BaseRecLabelEncode(object): ...@@ -97,12 +100,13 @@ class BaseRecLabelEncode(object):
def __init__(self, def __init__(self,
max_text_length, max_text_length,
character_dict_path=None, character_dict_path=None,
use_space_char=False): use_space_char=False,
lower=False):
self.max_text_len = max_text_length self.max_text_len = max_text_length
self.beg_str = "sos" self.beg_str = "sos"
self.end_str = "eos" self.end_str = "eos"
self.lower = False self.lower = lower
if character_dict_path is None: if character_dict_path is None:
logger = get_logger() logger = get_logger()
...@@ -870,6 +874,7 @@ class VQATokenLabelEncode(object): ...@@ -870,6 +874,7 @@ class VQATokenLabelEncode(object):
add_special_ids=False, add_special_ids=False,
algorithm='LayoutXLM', algorithm='LayoutXLM',
use_textline_bbox_info=True, use_textline_bbox_info=True,
order_method=None,
infer_mode=False, infer_mode=False,
ocr_engine=None, ocr_engine=None,
**kwargs): **kwargs):
...@@ -899,6 +904,8 @@ class VQATokenLabelEncode(object): ...@@ -899,6 +904,8 @@ class VQATokenLabelEncode(object):
self.infer_mode = infer_mode self.infer_mode = infer_mode
self.ocr_engine = ocr_engine self.ocr_engine = ocr_engine
self.use_textline_bbox_info = use_textline_bbox_info self.use_textline_bbox_info = use_textline_bbox_info
self.order_method = order_method
assert self.order_method in [None, "tb-yx"]
def split_bbox(self, bbox, text, tokenizer): def split_bbox(self, bbox, text, tokenizer):
words = text.split() words = text.split()
...@@ -938,6 +945,14 @@ class VQATokenLabelEncode(object): ...@@ -938,6 +945,14 @@ class VQATokenLabelEncode(object):
# load bbox and label info # load bbox and label info
ocr_info = self._load_ocr_info(data) ocr_info = self._load_ocr_info(data)
for idx in range(len(ocr_info)):
if "bbox" not in ocr_info[idx]:
ocr_info[idx]["bbox"] = self.trans_poly_to_bbox(ocr_info[idx][
"points"])
if self.order_method == "tb-yx":
ocr_info = order_by_tbyx(ocr_info)
# for re # for re
train_re = self.contains_re and not self.infer_mode train_re = self.contains_re and not self.infer_mode
if train_re: if train_re:
...@@ -977,7 +992,10 @@ class VQATokenLabelEncode(object): ...@@ -977,7 +992,10 @@ class VQATokenLabelEncode(object):
info["bbox"] = self.trans_poly_to_bbox(info["points"]) info["bbox"] = self.trans_poly_to_bbox(info["points"])
encode_res = self.tokenizer.encode( encode_res = self.tokenizer.encode(
text, pad_to_max_seq_len=False, return_attention_mask=True) text,
pad_to_max_seq_len=False,
return_attention_mask=True,
return_token_type_ids=True)
if not self.add_special_ids: if not self.add_special_ids:
# TODO: use tok.all_special_ids to remove # TODO: use tok.all_special_ids to remove
...@@ -1049,10 +1067,10 @@ class VQATokenLabelEncode(object): ...@@ -1049,10 +1067,10 @@ class VQATokenLabelEncode(object):
return data return data
def trans_poly_to_bbox(self, poly): def trans_poly_to_bbox(self, poly):
x1 = np.min([p[0] for p in poly]) x1 = int(np.min([p[0] for p in poly]))
x2 = np.max([p[0] for p in poly]) x2 = int(np.max([p[0] for p in poly]))
y1 = np.min([p[1] for p in poly]) y1 = int(np.min([p[1] for p in poly]))
y2 = np.max([p[1] for p in poly]) y2 = int(np.max([p[1] for p in poly]))
return [x1, y1, x2, y2] return [x1, y1, x2, y2]
def _load_ocr_info(self, data): def _load_ocr_info(self, data):
...@@ -1217,6 +1235,7 @@ class ABINetLabelEncode(BaseRecLabelEncode): ...@@ -1217,6 +1235,7 @@ class ABINetLabelEncode(BaseRecLabelEncode):
dict_character = ['</s>'] + dict_character dict_character = ['</s>'] + dict_character
return dict_character return dict_character
class SPINLabelEncode(AttnLabelEncode): class SPINLabelEncode(AttnLabelEncode):
""" Convert between text-label and text-index """ """ Convert between text-label and text-index """
...@@ -1229,6 +1248,7 @@ class SPINLabelEncode(AttnLabelEncode): ...@@ -1229,6 +1248,7 @@ class SPINLabelEncode(AttnLabelEncode):
super(SPINLabelEncode, self).__init__( super(SPINLabelEncode, self).__init__(
max_text_length, character_dict_path, use_space_char) max_text_length, character_dict_path, use_space_char)
self.lower = lower self.lower = lower
def add_special_char(self, dict_character): def add_special_char(self, dict_character):
self.beg_str = "sos" self.beg_str = "sos"
self.end_str = "eos" self.end_str = "eos"
...@@ -1248,4 +1268,68 @@ class SPINLabelEncode(AttnLabelEncode): ...@@ -1248,4 +1268,68 @@ class SPINLabelEncode(AttnLabelEncode):
padded_text[:len(target)] = target padded_text[:len(target)] = target
data['label'] = np.array(padded_text) data['label'] = np.array(padded_text)
return data return data
\ No newline at end of file
class VLLabelEncode(BaseRecLabelEncode):
""" Convert between text-label and text-index """
def __init__(self,
max_text_length,
character_dict_path=None,
use_space_char=False,
lower=True,
**kwargs):
super(VLLabelEncode, self).__init__(
max_text_length, character_dict_path, use_space_char, lower)
self.character = self.character[10:] + self.character[
1:10] + [self.character[0]]
self.dict = {}
for i, char in enumerate(self.character):
self.dict[char] = i
def __call__(self, data):
text = data['label'] # original string
# generate occluded text
len_str = len(text)
if len_str <= 0:
return None
change_num = 1
order = list(range(len_str))
change_id = sample(order, change_num)[0]
label_sub = text[change_id]
if change_id == (len_str - 1):
label_res = text[:change_id]
elif change_id == 0:
label_res = text[1:]
else:
label_res = text[:change_id] + text[change_id + 1:]
data['label_res'] = label_res # remaining string
data['label_sub'] = label_sub # occluded character
data['label_id'] = change_id # character index
# encode label
text = self.encode(text)
if text is None:
return None
text = [i + 1 for i in text]
data['length'] = np.array(len(text))
text = text + [0] * (self.max_text_len - len(text))
data['label'] = np.array(text)
label_res = self.encode(label_res)
label_sub = self.encode(label_sub)
if label_res is None:
label_res = []
else:
label_res = [i + 1 for i in label_res]
if label_sub is None:
label_sub = []
else:
label_sub = [i + 1 for i in label_sub]
data['length_res'] = np.array(len(label_res))
data['length_sub'] = np.array(len(label_sub))
label_res = label_res + [0] * (self.max_text_len - len(label_res))
label_sub = label_sub + [0] * (self.max_text_len - len(label_sub))
data['label_res'] = np.array(label_res)
data['label_sub'] = np.array(label_sub)
return data
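# Illustration only (not part of the diff): a toy version of the occlusion labels that
# VLLabelEncode builds above, with the random sampling and dictionary encoding omitted.
def make_occlusion_labels(text, change_id):
    label_sub = text[change_id]                           # occluded character
    label_res = text[:change_id] + text[change_id + 1:]   # remaining string
    return label_sub, label_res

print(make_occlusion_labels("paddle", 2))  # ('d', 'padle')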
...@@ -205,6 +205,38 @@ class RecResizeImg(object): ...@@ -205,6 +205,38 @@ class RecResizeImg(object):
return data return data
class VLRecResizeImg(object):
def __init__(self,
image_shape,
infer_mode=False,
character_dict_path='./ppocr/utils/ppocr_keys_v1.txt',
padding=True,
**kwargs):
self.image_shape = image_shape
self.infer_mode = infer_mode
self.character_dict_path = character_dict_path
self.padding = padding
def __call__(self, data):
img = data['image']
imgC, imgH, imgW = self.image_shape
resized_image = cv2.resize(
img, (imgW, imgH), interpolation=cv2.INTER_LINEAR)
resized_w = imgW
resized_image = resized_image.astype('float32')
if self.image_shape[0] == 1:
resized_image = resized_image / 255
norm_img = resized_image[np.newaxis, :]
else:
norm_img = resized_image.transpose((2, 0, 1)) / 255
valid_ratio = min(1.0, float(resized_w / imgW))
data['image'] = norm_img
data['valid_ratio'] = valid_ratio
return data
class SRNRecResizeImg(object): class SRNRecResizeImg(object):
def __init__(self, image_shape, num_heads, max_text_length, **kwargs): def __init__(self, image_shape, num_heads, max_text_length, **kwargs):
self.image_shape = image_shape self.image_shape = image_shape
...@@ -259,6 +291,7 @@ class PRENResizeImg(object): ...@@ -259,6 +291,7 @@ class PRENResizeImg(object):
data['image'] = resized_img.astype(np.float32) data['image'] = resized_img.astype(np.float32)
return data return data
class SPINRecResizeImg(object): class SPINRecResizeImg(object):
def __init__(self, def __init__(self,
image_shape, image_shape,
...@@ -267,7 +300,7 @@ class SPINRecResizeImg(object): ...@@ -267,7 +300,7 @@ class SPINRecResizeImg(object):
std=(127.5, 127.5, 127.5), std=(127.5, 127.5, 127.5),
**kwargs): **kwargs):
self.image_shape = image_shape self.image_shape = image_shape
self.mean = np.array(mean, dtype=np.float32) self.mean = np.array(mean, dtype=np.float32)
self.std = np.array(std, dtype=np.float32) self.std = np.array(std, dtype=np.float32)
self.interpolation = interpolation self.interpolation = interpolation
...@@ -303,6 +336,7 @@ class SPINRecResizeImg(object): ...@@ -303,6 +336,7 @@ class SPINRecResizeImg(object):
data['image'] = img data['image'] = img
return data return data
class GrayRecResizeImg(object): class GrayRecResizeImg(object):
def __init__(self, def __init__(self,
image_shape, image_shape,
......
...@@ -13,12 +13,10 @@ ...@@ -13,12 +13,10 @@
# limitations under the License. # limitations under the License.
from .token import VQATokenPad, VQASerTokenChunk, VQAReTokenChunk, VQAReTokenRelation from .token import VQATokenPad, VQASerTokenChunk, VQAReTokenChunk, VQAReTokenRelation
from .augment import DistortBBox
__all__ = [ __all__ = [
'VQATokenPad', 'VQATokenPad',
'VQASerTokenChunk', 'VQASerTokenChunk',
'VQAReTokenChunk', 'VQAReTokenChunk',
'VQAReTokenRelation', 'VQAReTokenRelation',
'DistortBBox',
] ]
...@@ -16,22 +16,18 @@ import os ...@@ -16,22 +16,18 @@ import os
import sys import sys
import numpy as np import numpy as np
import random import random
from copy import deepcopy
class DistortBBox: def order_by_tbyx(ocr_info):
def __init__(self, prob=0.5, max_scale=1, **kwargs): res = sorted(ocr_info, key=lambda r: (r["bbox"][1], r["bbox"][0]))
"""Random distort bbox for i in range(len(res) - 1):
""" for j in range(i, 0, -1):
self.prob = prob if abs(res[j + 1]["bbox"][1] - res[j]["bbox"][1]) < 20 and \
self.max_scale = max_scale (res[j + 1]["bbox"][0] < res[j]["bbox"][0]):
tmp = deepcopy(res[j])
def __call__(self, data): res[j] = deepcopy(res[j + 1])
if random.random() > self.prob: res[j + 1] = deepcopy(tmp)
return data else:
bbox = np.array(data['bbox']) break
rnd_scale = (np.random.rand(*bbox.shape) - 0.5) * 2 * self.max_scale return res
bbox = np.round(bbox + rnd_scale).astype(bbox.dtype)
data['bbox'] = np.clip(data['bbox'], 0, 1000)
data['bbox'] = bbox.tolist()
sys.stdout.flush()
return data
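# Illustration only (not part of the diff): a condensed standalone copy of the tb-yx ordering
# helper added above, with a toy example. Boxes are first sorted top-to-bottom; boxes whose top
# y-coordinates differ by less than 20 pixels are treated as one line and re-ordered left-to-right.
from copy import deepcopy

def order_by_tbyx_demo(ocr_info):
    res = sorted(ocr_info, key=lambda r: (r["bbox"][1], r["bbox"][0]))
    for i in range(len(res) - 1):
        for j in range(i, 0, -1):
            if abs(res[j + 1]["bbox"][1] - res[j]["bbox"][1]) < 20 and \
                    res[j + 1]["bbox"][0] < res[j]["bbox"][0]:
                res[j], res[j + 1] = deepcopy(res[j + 1]), deepcopy(res[j])
            else:
                break
    return res

boxes = [
    {"text": "title", "bbox": [40, 0, 200, 20]},
    {"text": "left", "bbox": [10, 52, 80, 70]},
    {"text": "right", "bbox": [120, 48, 200, 66]},
]
print([b["text"] for b in order_by_tbyx_demo(boxes)])  # ['title', 'left', 'right']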
...@@ -35,6 +35,7 @@ from .rec_sar_loss import SARLoss ...@@ -35,6 +35,7 @@ from .rec_sar_loss import SARLoss
from .rec_aster_loss import AsterLoss from .rec_aster_loss import AsterLoss
from .rec_pren_loss import PRENLoss from .rec_pren_loss import PRENLoss
from .rec_multi_loss import MultiLoss from .rec_multi_loss import MultiLoss
from .rec_vl_loss import VLLoss
from .rec_spin_att_loss import SPINAttentionLoss from .rec_spin_att_loss import SPINAttentionLoss
# cls loss # cls loss
...@@ -63,7 +64,7 @@ def build_loss(config): ...@@ -63,7 +64,7 @@ def build_loss(config):
'ClsLoss', 'AttentionLoss', 'SRNLoss', 'PGLoss', 'CombinedLoss', 'ClsLoss', 'AttentionLoss', 'SRNLoss', 'PGLoss', 'CombinedLoss',
'CELoss', 'TableAttentionLoss', 'SARLoss', 'AsterLoss', 'SDMGRLoss', 'CELoss', 'TableAttentionLoss', 'SARLoss', 'AsterLoss', 'SDMGRLoss',
'VQASerTokenLayoutLMLoss', 'LossFromOutput', 'PRENLoss', 'MultiLoss', 'VQASerTokenLayoutLMLoss', 'LossFromOutput', 'PRENLoss', 'MultiLoss',
'TableMasterLoss', 'SPINAttentionLoss' 'TableMasterLoss', 'SPINAttentionLoss', 'VLLoss'
] ]
config = copy.deepcopy(config) config = copy.deepcopy(config)
module_name = config.pop('name') module_name = config.pop('name')
......
...@@ -63,18 +63,21 @@ class KLJSLoss(object): ...@@ -63,18 +63,21 @@ class KLJSLoss(object):
def __call__(self, p1, p2, reduction="mean"): def __call__(self, p1, p2, reduction="mean"):
if self.mode.lower() == 'kl': if self.mode.lower() == 'kl':
loss = paddle.multiply(p2, paddle.log((p2 + 1e-5) / (p1 + 1e-5) + 1e-5)) loss = paddle.multiply(p2,
paddle.log((p2 + 1e-5) / (p1 + 1e-5) + 1e-5))
loss += paddle.multiply( loss += paddle.multiply(
p1, paddle.log((p1 + 1e-5) / (p2 + 1e-5) + 1e-5)) p1, paddle.log((p1 + 1e-5) / (p2 + 1e-5) + 1e-5))
loss *= 0.5 loss *= 0.5
elif self.mode.lower() == "js": elif self.mode.lower() == "js":
loss = paddle.multiply(p2, paddle.log((2*p2 + 1e-5) / (p1 + p2 + 1e-5) + 1e-5)) loss = paddle.multiply(
p2, paddle.log((2 * p2 + 1e-5) / (p1 + p2 + 1e-5) + 1e-5))
loss += paddle.multiply( loss += paddle.multiply(
p1, paddle.log((2*p1 + 1e-5) / (p1 + p2 + 1e-5) + 1e-5)) p1, paddle.log((2 * p1 + 1e-5) / (p1 + p2 + 1e-5) + 1e-5))
loss *= 0.5 loss *= 0.5
else: else:
raise ValueError("The mode.lower() if KLJSLoss should be one of ['kl', 'js']") raise ValueError(
"The mode.lower() if KLJSLoss should be one of ['kl', 'js']")
if reduction == "mean": if reduction == "mean":
loss = paddle.mean(loss, axis=[1, 2]) loss = paddle.mean(loss, axis=[1, 2])
elif reduction == "none" or reduction is None: elif reduction == "none" or reduction is None:
...@@ -154,7 +157,9 @@ class LossFromOutput(nn.Layer): ...@@ -154,7 +157,9 @@ class LossFromOutput(nn.Layer):
self.reduction = reduction self.reduction = reduction
def forward(self, predicts, batch): def forward(self, predicts, batch):
loss = predicts[self.key] loss = predicts
if self.key is not None and isinstance(predicts, dict):
loss = loss[self.key]
if self.reduction == 'mean': if self.reduction == 'mean':
loss = paddle.mean(loss) loss = paddle.mean(loss)
elif self.reduction == 'sum': elif self.reduction == 'sum':
......
...@@ -24,6 +24,9 @@ from .distillation_loss import DistillationCTCLoss ...@@ -24,6 +24,9 @@ from .distillation_loss import DistillationCTCLoss
from .distillation_loss import DistillationSARLoss from .distillation_loss import DistillationSARLoss
from .distillation_loss import DistillationDMLLoss from .distillation_loss import DistillationDMLLoss
from .distillation_loss import DistillationDistanceLoss, DistillationDBLoss, DistillationDilaDBLoss from .distillation_loss import DistillationDistanceLoss, DistillationDBLoss, DistillationDilaDBLoss
from .distillation_loss import DistillationVQASerTokenLayoutLMLoss, DistillationSERDMLLoss
from .distillation_loss import DistillationLossFromOutput
from .distillation_loss import DistillationVQADistanceLoss
class CombinedLoss(nn.Layer): class CombinedLoss(nn.Layer):
......
...@@ -21,8 +21,10 @@ from .rec_ctc_loss import CTCLoss ...@@ -21,8 +21,10 @@ from .rec_ctc_loss import CTCLoss
from .rec_sar_loss import SARLoss from .rec_sar_loss import SARLoss
from .basic_loss import DMLLoss from .basic_loss import DMLLoss
from .basic_loss import DistanceLoss from .basic_loss import DistanceLoss
from .basic_loss import LossFromOutput
from .det_db_loss import DBLoss from .det_db_loss import DBLoss
from .det_basic_loss import BalanceLoss, MaskL1Loss, DiceLoss from .det_basic_loss import BalanceLoss, MaskL1Loss, DiceLoss
from .vqa_token_layoutlm_loss import VQASerTokenLayoutLMLoss
def _sum_loss(loss_dict): def _sum_loss(loss_dict):
...@@ -322,3 +324,133 @@ class DistillationDistanceLoss(DistanceLoss): ...@@ -322,3 +324,133 @@ class DistillationDistanceLoss(DistanceLoss):
loss_dict["{}_{}_{}_{}".format(self.name, pair[0], pair[1], loss_dict["{}_{}_{}_{}".format(self.name, pair[0], pair[1],
idx)] = loss idx)] = loss
return loss_dict return loss_dict
class DistillationVQASerTokenLayoutLMLoss(VQASerTokenLayoutLMLoss):
def __init__(self,
num_classes,
model_name_list=[],
key=None,
name="loss_ser"):
super().__init__(num_classes=num_classes)
self.model_name_list = model_name_list
self.key = key
self.name = name
def forward(self, predicts, batch):
loss_dict = dict()
for idx, model_name in enumerate(self.model_name_list):
out = predicts[model_name]
if self.key is not None:
out = out[self.key]
loss = super().forward(out, batch)
loss_dict["{}_{}".format(self.name, model_name)] = loss["loss"]
return loss_dict
class DistillationLossFromOutput(LossFromOutput):
def __init__(self,
reduction="none",
model_name_list=[],
dist_key=None,
key="loss",
name="loss_re"):
super().__init__(key=key, reduction=reduction)
self.model_name_list = model_name_list
self.name = name
self.dist_key = dist_key
def forward(self, predicts, batch):
loss_dict = dict()
for idx, model_name in enumerate(self.model_name_list):
out = predicts[model_name]
if self.dist_key is not None:
out = out[self.dist_key]
loss = super().forward(out, batch)
loss_dict["{}_{}".format(self.name, model_name)] = loss["loss"]
return loss_dict
class DistillationSERDMLLoss(DMLLoss):
"""
"""
def __init__(self,
act="softmax",
use_log=True,
num_classes=7,
model_name_pairs=[],
key=None,
name="loss_dml_ser"):
super().__init__(act=act, use_log=use_log)
assert isinstance(model_name_pairs, list)
self.key = key
self.name = name
self.num_classes = num_classes
self.model_name_pairs = model_name_pairs
def forward(self, predicts, batch):
loss_dict = dict()
for idx, pair in enumerate(self.model_name_pairs):
out1 = predicts[pair[0]]
out2 = predicts[pair[1]]
if self.key is not None:
out1 = out1[self.key]
out2 = out2[self.key]
out1 = out1.reshape([-1, out1.shape[-1]])
out2 = out2.reshape([-1, out2.shape[-1]])
attention_mask = batch[2]
if attention_mask is not None:
active_output = attention_mask.reshape([-1, ]) == 1
out1 = out1[active_output]
out2 = out2[active_output]
loss_dict["{}_{}".format(self.name, idx)] = super().forward(out1,
out2)
return loss_dict
class DistillationVQADistanceLoss(DistanceLoss):
def __init__(self,
mode="l2",
model_name_pairs=[],
key=None,
name="loss_distance",
**kargs):
super().__init__(mode=mode, **kargs)
assert isinstance(model_name_pairs, list)
self.key = key
self.model_name_pairs = model_name_pairs
self.name = name + "_l2"
def forward(self, predicts, batch):
loss_dict = dict()
for idx, pair in enumerate(self.model_name_pairs):
out1 = predicts[pair[0]]
out2 = predicts[pair[1]]
attention_mask = batch[2]
if self.key is not None:
out1 = out1[self.key]
out2 = out2[self.key]
if attention_mask is not None:
max_len = attention_mask.shape[-1]
out1 = out1[:, :max_len]
out2 = out2[:, :max_len]
out1 = out1.reshape([-1, out1.shape[-1]])
out2 = out2.reshape([-1, out2.shape[-1]])
if attention_mask is not None:
active_output = attention_mask.reshape([-1, ]) == 1
out1 = out1[active_output]
out2 = out2[active_output]
loss = super().forward(out1, out2)
if isinstance(loss, dict):
for key in loss:
loss_dict["{}_{}nohu_{}".format(self.name, key,
idx)] = loss[key]
else:
loss_dict["{}_{}_{}_{}".format(self.name, pair[0], pair[1],
idx)] = loss
return loss_dict
# copyright (c) 2022 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This code is refer from:
https://github.com/wangyuxin87/VisionLAN
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle
from paddle import nn
class VLLoss(nn.Layer):
def __init__(self, mode='LF_1', weight_res=0.5, weight_mas=0.5, **kwargs):
super(VLLoss, self).__init__()
self.loss_func = paddle.nn.loss.CrossEntropyLoss(reduction="mean")
assert mode in ['LF_1', 'LF_2', 'LA']
self.mode = mode
self.weight_res = weight_res
self.weight_mas = weight_mas
def flatten_label(self, target):
label_flatten = []
label_length = []
for i in range(0, target.shape[0]):
cur_label = target[i].tolist()
label_flatten += cur_label[:cur_label.index(0) + 1]
label_length.append(cur_label.index(0) + 1)
label_flatten = paddle.to_tensor(label_flatten, dtype='int64')
label_length = paddle.to_tensor(label_length, dtype='int32')
return (label_flatten, label_length)
def _flatten(self, sources, lengths):
return paddle.concat([t[:l] for t, l in zip(sources, lengths)])
def forward(self, predicts, batch):
text_pre = predicts[0]
target = batch[1].astype('int64')
label_flatten, length = self.flatten_label(target)
text_pre = self._flatten(text_pre, length)
if self.mode == 'LF_1':
loss = self.loss_func(text_pre, label_flatten)
else:
text_rem = predicts[1]
text_mas = predicts[2]
target_res = batch[2].astype('int64')
target_sub = batch[3].astype('int64')
label_flatten_res, length_res = self.flatten_label(target_res)
label_flatten_sub, length_sub = self.flatten_label(target_sub)
text_rem = self._flatten(text_rem, length_res)
text_mas = self._flatten(text_mas, length_sub)
loss_ori = self.loss_func(text_pre, label_flatten)
loss_res = self.loss_func(text_rem, label_flatten_res)
loss_mas = self.loss_func(text_mas, label_flatten_sub)
loss = loss_ori + loss_res * self.weight_res + loss_mas * self.weight_mas
return {'loss': loss}
...@@ -17,26 +17,30 @@ from __future__ import division ...@@ -17,26 +17,30 @@ from __future__ import division
from __future__ import print_function from __future__ import print_function
from paddle import nn from paddle import nn
from ppocr.losses.basic_loss import DMLLoss
class VQASerTokenLayoutLMLoss(nn.Layer): class VQASerTokenLayoutLMLoss(nn.Layer):
def __init__(self, num_classes): def __init__(self, num_classes, key=None):
super().__init__() super().__init__()
self.loss_class = nn.CrossEntropyLoss() self.loss_class = nn.CrossEntropyLoss()
self.num_classes = num_classes self.num_classes = num_classes
self.ignore_index = self.loss_class.ignore_index self.ignore_index = self.loss_class.ignore_index
self.key = key
def forward(self, predicts, batch): def forward(self, predicts, batch):
if isinstance(predicts, dict) and self.key is not None:
predicts = predicts[self.key]
labels = batch[5] labels = batch[5]
attention_mask = batch[2] attention_mask = batch[2]
if attention_mask is not None: if attention_mask is not None:
active_loss = attention_mask.reshape([-1, ]) == 1 active_loss = attention_mask.reshape([-1, ]) == 1
active_outputs = predicts.reshape( active_output = predicts.reshape(
[-1, self.num_classes])[active_loss] [-1, self.num_classes])[active_loss]
active_labels = labels.reshape([-1, ])[active_loss] active_label = labels.reshape([-1, ])[active_loss]
loss = self.loss_class(active_outputs, active_labels) loss = self.loss_class(active_output, active_label)
else: else:
loss = self.loss_class( loss = self.loss_class(
predicts.reshape([-1, self.num_classes]), predicts.reshape([-1, self.num_classes]),
labels.reshape([-1, ])) labels.reshape([-1, ]))
return {'loss': loss} return {'loss': loss}
\ No newline at end of file
...@@ -19,6 +19,8 @@ from .rec_metric import RecMetric ...@@ -19,6 +19,8 @@ from .rec_metric import RecMetric
from .det_metric import DetMetric from .det_metric import DetMetric
from .e2e_metric import E2EMetric from .e2e_metric import E2EMetric
from .cls_metric import ClsMetric from .cls_metric import ClsMetric
from .vqa_token_ser_metric import VQASerTokenMetric
from .vqa_token_re_metric import VQAReTokenMetric
class DistillationMetric(object): class DistillationMetric(object):
......
...@@ -73,28 +73,40 @@ class BaseModel(nn.Layer): ...@@ -73,28 +73,40 @@ class BaseModel(nn.Layer):
self.return_all_feats = config.get("return_all_feats", False) self.return_all_feats = config.get("return_all_feats", False)
def forward(self, x, data=None): def forward(self, x, data=None):
y = dict() y = dict()
if self.use_transform: if self.use_transform:
x = self.transform(x) x = self.transform(x)
x = self.backbone(x) x = self.backbone(x)
y["backbone_out"] = x if isinstance(x, dict):
y.update(x)
else:
y["backbone_out"] = x
final_name = "backbone_out"
if self.use_neck: if self.use_neck:
x = self.neck(x) x = self.neck(x)
y["neck_out"] = x if isinstance(x, dict):
y.update(x)
else:
y["neck_out"] = x
final_name = "neck_out"
if self.use_head: if self.use_head:
x = self.head(x, targets=data) x = self.head(x, targets=data)
# for multi head, save ctc neck out for udml # for multi head, save ctc neck out for udml
if isinstance(x, dict) and 'ctc_neck' in x.keys(): if isinstance(x, dict) and 'ctc_neck' in x.keys():
y["neck_out"] = x["ctc_neck"] y["neck_out"] = x["ctc_neck"]
y["head_out"] = x y["head_out"] = x
elif isinstance(x, dict): elif isinstance(x, dict):
y.update(x) y.update(x)
else: else:
y["head_out"] = x y["head_out"] = x
final_name = "head_out"
if self.return_all_feats: if self.return_all_feats:
if self.training: if self.training:
return y return y
elif isinstance(x, dict):
return x
else: else:
return {"head_out": y["head_out"]} return {final_name: x}
else: else:
return x return x
...@@ -84,11 +84,15 @@ class BasicBlock(nn.Layer): ...@@ -84,11 +84,15 @@ class BasicBlock(nn.Layer):
class ResNet45(nn.Layer): class ResNet45(nn.Layer):
def __init__(self, block=BasicBlock, layers=[3, 4, 6, 6, 3], in_channels=3): def __init__(self,
in_channels=3,
block=BasicBlock,
layers=[3, 4, 6, 6, 3],
strides=[2, 1, 2, 1, 1]):
self.inplanes = 32 self.inplanes = 32
super(ResNet45, self).__init__() super(ResNet45, self).__init__()
self.conv1 = nn.Conv2D( self.conv1 = nn.Conv2D(
3, in_channels,
32, 32,
kernel_size=3, kernel_size=3,
stride=1, stride=1,
...@@ -98,18 +102,13 @@ class ResNet45(nn.Layer): ...@@ -98,18 +102,13 @@ class ResNet45(nn.Layer):
self.bn1 = nn.BatchNorm2D(32) self.bn1 = nn.BatchNorm2D(32)
self.relu = nn.ReLU() self.relu = nn.ReLU()
self.layer1 = self._make_layer(block, 32, layers[0], stride=2) self.layer1 = self._make_layer(block, 32, layers[0], stride=strides[0])
self.layer2 = self._make_layer(block, 64, layers[1], stride=1) self.layer2 = self._make_layer(block, 64, layers[1], stride=strides[1])
self.layer3 = self._make_layer(block, 128, layers[2], stride=2) self.layer3 = self._make_layer(block, 128, layers[2], stride=strides[2])
self.layer4 = self._make_layer(block, 256, layers[3], stride=1) self.layer4 = self._make_layer(block, 256, layers[3], stride=strides[3])
self.layer5 = self._make_layer(block, 512, layers[4], stride=1) self.layer5 = self._make_layer(block, 512, layers[4], stride=strides[4])
self.out_channels = 512 self.out_channels = 512
# for m in self.modules():
# if isinstance(m, nn.Conv2D):
# n = m._kernel_size[0] * m._kernel_size[1] * m._out_channels
# m.weight.data.normal_(0, math.sqrt(2. / n))
def _make_layer(self, block, planes, blocks, stride=1): def _make_layer(self, block, planes, blocks, stride=1):
downsample = None downsample = None
if stride != 1 or self.inplanes != planes * block.expansion: if stride != 1 or self.inplanes != planes * block.expansion:
...@@ -137,11 +136,9 @@ class ResNet45(nn.Layer): ...@@ -137,11 +136,9 @@ class ResNet45(nn.Layer):
x = self.conv1(x) x = self.conv1(x)
x = self.bn1(x) x = self.bn1(x)
x = self.relu(x) x = self.relu(x)
# print(x)
x = self.layer1(x) x = self.layer1(x)
x = self.layer2(x) x = self.layer2(x)
x = self.layer3(x) x = self.layer3(x)
# print(x)
x = self.layer4(x) x = self.layer4(x)
x = self.layer5(x) x = self.layer5(x)
return x return x
...@@ -140,4 +140,4 @@ class ResNet_ASTER(nn.Layer): ...@@ -140,4 +140,4 @@ class ResNet_ASTER(nn.Layer):
rnn_feat, _ = self.rnn(cnn_feat) rnn_feat, _ = self.rnn(cnn_feat)
return rnn_feat return rnn_feat
else: else:
return cnn_feat return cnn_feat
\ No newline at end of file
...@@ -22,13 +22,22 @@ from paddle import nn ...@@ -22,13 +22,22 @@ from paddle import nn
from paddlenlp.transformers import LayoutXLMModel, LayoutXLMForTokenClassification, LayoutXLMForRelationExtraction from paddlenlp.transformers import LayoutXLMModel, LayoutXLMForTokenClassification, LayoutXLMForRelationExtraction
from paddlenlp.transformers import LayoutLMModel, LayoutLMForTokenClassification from paddlenlp.transformers import LayoutLMModel, LayoutLMForTokenClassification
from paddlenlp.transformers import LayoutLMv2Model, LayoutLMv2ForTokenClassification, LayoutLMv2ForRelationExtraction from paddlenlp.transformers import LayoutLMv2Model, LayoutLMv2ForTokenClassification, LayoutLMv2ForRelationExtraction
from paddlenlp.transformers import AutoModel
__all__ = ["LayoutXLMForSer", 'LayoutLMForSer'] __all__ = ["LayoutXLMForSer", "LayoutLMForSer"]
pretrained_model_dict = { pretrained_model_dict = {
LayoutXLMModel: 'layoutxlm-base-uncased', LayoutXLMModel: {
LayoutLMModel: 'layoutlm-base-uncased', "base": "layoutxlm-base-uncased",
LayoutLMv2Model: 'layoutlmv2-base-uncased' "vi": "layoutxlm-wo-backbone-base-uncased",
},
LayoutLMModel: {
"base": "layoutlm-base-uncased",
},
LayoutLMv2Model: {
"base": "layoutlmv2-base-uncased",
"vi": "layoutlmv2-wo-backbone-base-uncased",
},
} }
...@@ -36,42 +45,47 @@ class NLPBaseModel(nn.Layer): ...@@ -36,42 +45,47 @@ class NLPBaseModel(nn.Layer):
def __init__(self, def __init__(self,
base_model_class, base_model_class,
model_class, model_class,
type='ser', mode="base",
type="ser",
pretrained=True, pretrained=True,
checkpoints=None, checkpoints=None,
**kwargs): **kwargs):
super(NLPBaseModel, self).__init__() super(NLPBaseModel, self).__init__()
if checkpoints is not None: if checkpoints is not None: # load the trained model
self.model = model_class.from_pretrained(checkpoints) self.model = model_class.from_pretrained(checkpoints)
elif isinstance(pretrained, (str, )) and os.path.exists(pretrained): else: # load the pretrained-model
self.model = model_class.from_pretrained(pretrained) pretrained_model_name = pretrained_model_dict[base_model_class][
else: mode]
pretrained_model_name = pretrained_model_dict[base_model_class]
if pretrained is True: if pretrained is True:
base_model = base_model_class.from_pretrained( base_model = base_model_class.from_pretrained(
pretrained_model_name) pretrained_model_name)
else: else:
base_model = base_model_class( base_model = base_model_class.from_pretrained(pretrained)
**base_model_class.pretrained_init_configuration[ if type == "ser":
pretrained_model_name])
if type == 'ser':
self.model = model_class( self.model = model_class(
base_model, num_classes=kwargs['num_classes'], dropout=None) base_model, num_classes=kwargs["num_classes"], dropout=None)
else: else:
self.model = model_class(base_model, dropout=None) self.model = model_class(base_model, dropout=None)
self.out_channels = 1 self.out_channels = 1
self.use_visual_backbone = True
class LayoutLMForSer(NLPBaseModel): class LayoutLMForSer(NLPBaseModel):
def __init__(self, num_classes, pretrained=True, checkpoints=None, def __init__(self,
num_classes,
pretrained=True,
checkpoints=None,
mode="base",
**kwargs): **kwargs):
super(LayoutLMForSer, self).__init__( super(LayoutLMForSer, self).__init__(
LayoutLMModel, LayoutLMModel,
LayoutLMForTokenClassification, LayoutLMForTokenClassification,
'ser', mode,
"ser",
pretrained, pretrained,
checkpoints, checkpoints,
num_classes=num_classes) num_classes=num_classes, )
self.use_visual_backbone = False
def forward(self, x): def forward(self, x):
x = self.model( x = self.model(
...@@ -85,62 +99,92 @@ class LayoutLMForSer(NLPBaseModel): ...@@ -85,62 +99,92 @@ class LayoutLMForSer(NLPBaseModel):
class LayoutLMv2ForSer(NLPBaseModel): class LayoutLMv2ForSer(NLPBaseModel):
def __init__(self, num_classes, pretrained=True, checkpoints=None, def __init__(self,
num_classes,
pretrained=True,
checkpoints=None,
mode="base",
**kwargs): **kwargs):
super(LayoutLMv2ForSer, self).__init__( super(LayoutLMv2ForSer, self).__init__(
LayoutLMv2Model, LayoutLMv2Model,
LayoutLMv2ForTokenClassification, LayoutLMv2ForTokenClassification,
'ser', mode,
"ser",
pretrained, pretrained,
checkpoints, checkpoints,
num_classes=num_classes) num_classes=num_classes)
self.use_visual_backbone = True
if hasattr(self.model.layoutlmv2, "use_visual_backbone"
) and self.model.layoutlmv2.use_visual_backbone is False:
self.use_visual_backbone = False
def forward(self, x): def forward(self, x):
if self.use_visual_backbone is True:
image = x[4]
else:
image = None
x = self.model( x = self.model(
input_ids=x[0], input_ids=x[0],
bbox=x[1], bbox=x[1],
attention_mask=x[2], attention_mask=x[2],
token_type_ids=x[3], token_type_ids=x[3],
image=x[4], image=image,
position_ids=None, position_ids=None,
head_mask=None, head_mask=None,
labels=None) labels=None)
if not self.training: if self.training:
res = {"backbone_out": x[0]}
res.update(x[1])
return res
else:
return x return x
return x[0]
class LayoutXLMForSer(NLPBaseModel): class LayoutXLMForSer(NLPBaseModel):
def __init__(self, num_classes, pretrained=True, checkpoints=None, def __init__(self,
num_classes,
pretrained=True,
checkpoints=None,
mode="base",
**kwargs): **kwargs):
super(LayoutXLMForSer, self).__init__( super(LayoutXLMForSer, self).__init__(
LayoutXLMModel, LayoutXLMModel,
LayoutXLMForTokenClassification, LayoutXLMForTokenClassification,
'ser', mode,
"ser",
pretrained, pretrained,
checkpoints, checkpoints,
num_classes=num_classes) num_classes=num_classes)
self.use_visual_backbone = True
def forward(self, x): def forward(self, x):
if self.use_visual_backbone is True:
image = x[4]
else:
image = None
x = self.model( x = self.model(
input_ids=x[0], input_ids=x[0],
bbox=x[1], bbox=x[1],
attention_mask=x[2], attention_mask=x[2],
token_type_ids=x[3], token_type_ids=x[3],
image=x[4], image=image,
position_ids=None, position_ids=None,
head_mask=None, head_mask=None,
labels=None) labels=None)
if not self.training: if self.training:
res = {"backbone_out": x[0]}
res.update(x[1])
return res
else:
return x return x
return x[0]
class LayoutLMv2ForRe(NLPBaseModel): class LayoutLMv2ForRe(NLPBaseModel):
def __init__(self, pretrained=True, checkpoints=None, **kwargs): def __init__(self, pretrained=True, checkpoints=None, mode="base",
super(LayoutLMv2ForRe, self).__init__(LayoutLMv2Model, **kwargs):
LayoutLMv2ForRelationExtraction, super(LayoutLMv2ForRe, self).__init__(
're', pretrained, checkpoints) LayoutLMv2Model, LayoutLMv2ForRelationExtraction, mode, "re",
pretrained, checkpoints)
def forward(self, x): def forward(self, x):
x = self.model( x = self.model(
...@@ -158,18 +202,27 @@ class LayoutLMv2ForRe(NLPBaseModel): ...@@ -158,18 +202,27 @@ class LayoutLMv2ForRe(NLPBaseModel):
class LayoutXLMForRe(NLPBaseModel): class LayoutXLMForRe(NLPBaseModel):
def __init__(self, pretrained=True, checkpoints=None, **kwargs): def __init__(self, pretrained=True, checkpoints=None, mode="base",
super(LayoutXLMForRe, self).__init__(LayoutXLMModel, **kwargs):
LayoutXLMForRelationExtraction, super(LayoutXLMForRe, self).__init__(
're', pretrained, checkpoints) LayoutXLMModel, LayoutXLMForRelationExtraction, mode, "re",
pretrained, checkpoints)
self.use_visual_backbone = True
if hasattr(self.model.layoutxlm, "use_visual_backbone"
) and self.model.layoutxlm.use_visual_backbone is False:
self.use_visual_backbone = False
def forward(self, x): def forward(self, x):
if self.use_visual_backbone is True:
image = x[4]
else:
image = None
x = self.model( x = self.model(
input_ids=x[0], input_ids=x[0],
bbox=x[1], bbox=x[1],
attention_mask=x[2], attention_mask=x[2],
token_type_ids=x[3], token_type_ids=x[3],
image=x[4], image=image,
position_ids=None, position_ids=None,
head_mask=None, head_mask=None,
labels=None, labels=None,
......
...@@ -36,6 +36,7 @@ def build_head(config): ...@@ -36,6 +36,7 @@ def build_head(config):
from .rec_spin_att_head import SPINAttentionHead from .rec_spin_att_head import SPINAttentionHead
from .rec_abinet_head import ABINetHead from .rec_abinet_head import ABINetHead
from .rec_robustscanner_head import RobustScannerHead from .rec_robustscanner_head import RobustScannerHead
from .rec_visionlan_head import VLHead
# cls head # cls head
from .cls_head import ClsHead from .cls_head import ClsHead
...@@ -50,7 +51,8 @@ def build_head(config): ...@@ -50,7 +51,8 @@ def build_head(config):
'DBHead', 'PSEHead', 'FCEHead', 'EASTHead', 'SASTHead', 'CTCHead', 'DBHead', 'PSEHead', 'FCEHead', 'EASTHead', 'SASTHead', 'CTCHead',
'ClsHead', 'AttentionHead', 'SRNHead', 'PGHead', 'Transformer', 'ClsHead', 'AttentionHead', 'SRNHead', 'PGHead', 'Transformer',
'TableAttentionHead', 'SARHead', 'AsterHead', 'SDMGRHead', 'PRENHead', 'TableAttentionHead', 'SARHead', 'AsterHead', 'SDMGRHead', 'PRENHead',
'MultiHead', 'ABINetHead', 'TableMasterHead', 'SPINAttentionHead', 'RobustScannerHead' 'MultiHead', 'ABINetHead', 'TableMasterHead', 'SPINAttentionHead',
'VLHead', 'RobustScannerHead'
] ]
#table head #table head
......
# copyright (c) 2022 PaddlePaddle Authors. All Rights Reserve.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This code is refer from:
https://github.com/wangyuxin87/VisionLAN
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import paddle
from paddle import ParamAttr
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.nn.initializer import Normal, XavierNormal
import numpy as np
class PositionalEncoding(nn.Layer):
def __init__(self, d_hid, n_position=200):
super(PositionalEncoding, self).__init__()
self.register_buffer(
'pos_table', self._get_sinusoid_encoding_table(n_position, d_hid))
def _get_sinusoid_encoding_table(self, n_position, d_hid):
''' Sinusoid position encoding table '''
def get_position_angle_vec(position):
return [
position / np.power(10000, 2 * (hid_j // 2) / d_hid)
for hid_j in range(d_hid)
]
sinusoid_table = np.array(
[get_position_angle_vec(pos_i) for pos_i in range(n_position)])
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
sinusoid_table = paddle.to_tensor(sinusoid_table, dtype='float32')
sinusoid_table = paddle.unsqueeze(sinusoid_table, axis=0)
return sinusoid_table
def forward(self, x):
return x + self.pos_table[:, :x.shape[1]].clone().detach()
class ScaledDotProductAttention(nn.Layer):
"Scaled Dot-Product Attention"
def __init__(self, temperature, attn_dropout=0.1):
super(ScaledDotProductAttention, self).__init__()
self.temperature = temperature
self.dropout = nn.Dropout(attn_dropout)
self.softmax = nn.Softmax(axis=2)
def forward(self, q, k, v, mask=None):
k = paddle.transpose(k, perm=[0, 2, 1])
attn = paddle.bmm(q, k)
attn = attn / self.temperature
if mask is not None:
attn = attn.masked_fill(mask, -1e9)
if mask.dim() == 3:
mask = paddle.unsqueeze(mask, axis=1)
elif mask.dim() == 2:
mask = paddle.unsqueeze(mask, axis=1)
mask = paddle.unsqueeze(mask, axis=1)
repeat_times = [
attn.shape[1] // mask.shape[1], attn.shape[2] // mask.shape[2]
]
mask = paddle.tile(mask, [1, repeat_times[0], repeat_times[1], 1])
attn[mask == 0] = -1e9
attn = self.softmax(attn)
attn = self.dropout(attn)
output = paddle.bmm(attn, v)
return output
class MultiHeadAttention(nn.Layer):
" Multi-Head Attention module"
def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1):
super(MultiHeadAttention, self).__init__()
self.n_head = n_head
self.d_k = d_k
self.d_v = d_v
self.w_qs = nn.Linear(
d_model,
n_head * d_k,
weight_attr=ParamAttr(initializer=Normal(
mean=0, std=np.sqrt(2.0 / (d_model + d_k)))))
self.w_ks = nn.Linear(
d_model,
n_head * d_k,
weight_attr=ParamAttr(initializer=Normal(
mean=0, std=np.sqrt(2.0 / (d_model + d_k)))))
self.w_vs = nn.Linear(
d_model,
n_head * d_v,
weight_attr=ParamAttr(initializer=Normal(
mean=0, std=np.sqrt(2.0 / (d_model + d_v)))))
self.attention = ScaledDotProductAttention(temperature=np.power(d_k,
0.5))
self.layer_norm = nn.LayerNorm(d_model)
self.fc = nn.Linear(
n_head * d_v,
d_model,
weight_attr=ParamAttr(initializer=XavierNormal()))
self.dropout = nn.Dropout(dropout)
def forward(self, q, k, v, mask=None):
d_k, d_v, n_head = self.d_k, self.d_v, self.n_head
sz_b, len_q, _ = q.shape
sz_b, len_k, _ = k.shape
sz_b, len_v, _ = v.shape
residual = q
q = self.w_qs(q)
q = paddle.reshape(
q, shape=[-1, len_q, n_head, d_k]) # 4*21*512 ---- 4*21*8*64
k = self.w_ks(k)
k = paddle.reshape(k, shape=[-1, len_k, n_head, d_k])
v = self.w_vs(v)
v = paddle.reshape(v, shape=[-1, len_v, n_head, d_v])
q = paddle.transpose(q, perm=[2, 0, 1, 3])
q = paddle.reshape(q, shape=[-1, len_q, d_k]) # (n*b) x lq x dk
k = paddle.transpose(k, perm=[2, 0, 1, 3])
k = paddle.reshape(k, shape=[-1, len_k, d_k]) # (n*b) x lk x dk
v = paddle.transpose(v, perm=[2, 0, 1, 3])
v = paddle.reshape(v, shape=[-1, len_v, d_v]) # (n*b) x lv x dv
mask = paddle.tile(
mask,
[n_head, 1, 1]) if mask is not None else None # (n*b) x .. x ..
output = self.attention(q, k, v, mask=mask)
output = paddle.reshape(output, shape=[n_head, -1, len_q, d_v])
output = paddle.transpose(output, perm=[1, 2, 0, 3])
output = paddle.reshape(
output, shape=[-1, len_q, n_head * d_v]) # b x lq x (n*dv)
output = self.dropout(self.fc(output))
output = self.layer_norm(output + residual)
return output
class PositionwiseFeedForward(nn.Layer):
def __init__(self, d_in, d_hid, dropout=0.1):
super(PositionwiseFeedForward, self).__init__()
self.w_1 = nn.Conv1D(d_in, d_hid, 1) # position-wise
self.w_2 = nn.Conv1D(d_hid, d_in, 1) # position-wise
self.layer_norm = nn.LayerNorm(d_in)
self.dropout = nn.Dropout(dropout)
def forward(self, x):
residual = x
x = paddle.transpose(x, perm=[0, 2, 1])
x = self.w_2(F.relu(self.w_1(x)))
x = paddle.transpose(x, perm=[0, 2, 1])
x = self.dropout(x)
x = self.layer_norm(x + residual)
return x
class EncoderLayer(nn.Layer):
''' Compose with two layers '''
def __init__(self, d_model, d_inner, n_head, d_k, d_v, dropout=0.1):
super(EncoderLayer, self).__init__()
self.slf_attn = MultiHeadAttention(
n_head, d_model, d_k, d_v, dropout=dropout)
self.pos_ffn = PositionwiseFeedForward(
d_model, d_inner, dropout=dropout)
def forward(self, enc_input, slf_attn_mask=None):
enc_output = self.slf_attn(
enc_input, enc_input, enc_input, mask=slf_attn_mask)
enc_output = self.pos_ffn(enc_output)
return enc_output
class Transformer_Encoder(nn.Layer):
def __init__(self,
n_layers=2,
n_head=8,
d_word_vec=512,
d_k=64,
d_v=64,
d_model=512,
d_inner=2048,
dropout=0.1,
n_position=256):
super(Transformer_Encoder, self).__init__()
self.position_enc = PositionalEncoding(
d_word_vec, n_position=n_position)
self.dropout = nn.Dropout(p=dropout)
self.layer_stack = nn.LayerList([
EncoderLayer(
d_model, d_inner, n_head, d_k, d_v, dropout=dropout)
for _ in range(n_layers)
])
self.layer_norm = nn.LayerNorm(d_model, epsilon=1e-6)
def forward(self, enc_output, src_mask, return_attns=False):
enc_output = self.dropout(
            self.position_enc(enc_output))  # position embedding
for enc_layer in self.layer_stack:
enc_output = enc_layer(enc_output, slf_attn_mask=src_mask)
enc_output = self.layer_norm(enc_output)
return enc_output
class PP_layer(nn.Layer):
def __init__(self, n_dim=512, N_max_character=25, n_position=256):
super(PP_layer, self).__init__()
self.character_len = N_max_character
self.f0_embedding = nn.Embedding(N_max_character, n_dim)
self.w0 = nn.Linear(N_max_character, n_position)
self.wv = nn.Linear(n_dim, n_dim)
self.we = nn.Linear(n_dim, N_max_character)
self.active = nn.Tanh()
self.softmax = nn.Softmax(axis=2)
def forward(self, enc_output):
# enc_output: b,256,512
reading_order = paddle.arange(self.character_len, dtype='int64')
reading_order = reading_order.unsqueeze(0).expand(
[enc_output.shape[0], self.character_len]) # (S,) -> (B, S)
reading_order = self.f0_embedding(reading_order) # b,25,512
# calculate attention
reading_order = paddle.transpose(reading_order, perm=[0, 2, 1])
t = self.w0(reading_order) # b,512,256
t = self.active(
paddle.transpose(
t, perm=[0, 2, 1]) + self.wv(enc_output)) # b,256,512
t = self.we(t) # b,256,25
t = self.softmax(paddle.transpose(t, perm=[0, 2, 1])) # b,25,256
g_output = paddle.bmm(t, enc_output) # b,25,512
return g_output
class Prediction(nn.Layer):
def __init__(self,
n_dim=512,
n_position=256,
N_max_character=25,
n_class=37):
super(Prediction, self).__init__()
self.pp = PP_layer(
n_dim=n_dim, N_max_character=N_max_character, n_position=n_position)
self.pp_share = PP_layer(
n_dim=n_dim, N_max_character=N_max_character, n_position=n_position)
self.w_vrm = nn.Linear(n_dim, n_class) # output layer
self.w_share = nn.Linear(n_dim, n_class) # output layer
self.nclass = n_class
def forward(self, cnn_feature, f_res, f_sub, train_mode=False,
use_mlm=True):
if train_mode:
if not use_mlm:
g_output = self.pp(cnn_feature) # b,25,512
g_output = self.w_vrm(g_output)
f_res = 0
f_sub = 0
return g_output, f_res, f_sub
g_output = self.pp(cnn_feature) # b,25,512
f_res = self.pp_share(f_res)
f_sub = self.pp_share(f_sub)
g_output = self.w_vrm(g_output)
f_res = self.w_share(f_res)
f_sub = self.w_share(f_sub)
return g_output, f_res, f_sub
else:
g_output = self.pp(cnn_feature) # b,25,512
g_output = self.w_vrm(g_output)
return g_output
class MLM(nn.Layer):
"Architecture of MLM"
def __init__(self, n_dim=512, n_position=256, max_text_length=25):
super(MLM, self).__init__()
self.MLM_SequenceModeling_mask = Transformer_Encoder(
n_layers=2, n_position=n_position)
self.MLM_SequenceModeling_WCL = Transformer_Encoder(
n_layers=1, n_position=n_position)
self.pos_embedding = nn.Embedding(max_text_length, n_dim)
self.w0_linear = nn.Linear(1, n_position)
self.wv = nn.Linear(n_dim, n_dim)
self.active = nn.Tanh()
self.we = nn.Linear(n_dim, 1)
self.sigmoid = nn.Sigmoid()
def forward(self, x, label_pos):
# transformer unit for generating mask_c
feature_v_seq = self.MLM_SequenceModeling_mask(x, src_mask=None)
# position embedding layer
label_pos = paddle.to_tensor(label_pos, dtype='int64')
pos_emb = self.pos_embedding(label_pos)
pos_emb = self.w0_linear(paddle.unsqueeze(pos_emb, axis=2))
pos_emb = paddle.transpose(pos_emb, perm=[0, 2, 1])
# fusion position embedding with features V & generate mask_c
att_map_sub = self.active(pos_emb + self.wv(feature_v_seq))
att_map_sub = self.we(att_map_sub) # b,256,1
att_map_sub = paddle.transpose(att_map_sub, perm=[0, 2, 1])
att_map_sub = self.sigmoid(att_map_sub) # b,1,256
# WCL
## generate inputs for WCL
att_map_sub = paddle.transpose(att_map_sub, perm=[0, 2, 1])
f_res = x * (1 - att_map_sub) # second path with remaining string
f_sub = x * att_map_sub # first path with occluded character
## transformer units in WCL
f_res = self.MLM_SequenceModeling_WCL(f_res, src_mask=None)
f_sub = self.MLM_SequenceModeling_WCL(f_sub, src_mask=None)
return f_res, f_sub, att_map_sub
def trans_1d_2d(x):
b, w_h, c = x.shape # b, 256, 512
x = paddle.transpose(x, perm=[0, 2, 1])
x = paddle.reshape(x, [-1, c, 32, 8])
x = paddle.transpose(x, perm=[0, 1, 3, 2]) # [b, c, 8, 32]
return x
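# Editor's note (illustrative): trans_1d_2d is only used to reshape mask_c for
# visualization; it assumes the 256 sequence positions come from an 8 x 32
# feature map, consistent with n_position = 256 used throughout this file.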
class MLM_VRM(nn.Layer):
"""
MLM+VRM, MLM is only used in training.
ratio controls the occluded number in a batch.
The pipeline of VisionLAN in testing is very concise with only a backbone + sequence modeling(transformer unit) + prediction layer(pp layer).
x: input image
label_pos: character index
training_step: LF or LA process
output
text_pre: prediction of VRM
test_rem: prediction of remaining string in MLM
text_mas: prediction of occluded character in MLM
mask_c_show: visualization of Mask_c
"""
def __init__(self,
n_layers=3,
n_position=256,
n_dim=512,
max_text_length=25,
nclass=37):
super(MLM_VRM, self).__init__()
self.MLM = MLM(n_dim=n_dim,
n_position=n_position,
max_text_length=max_text_length)
self.SequenceModeling = Transformer_Encoder(
n_layers=n_layers, n_position=n_position)
self.Prediction = Prediction(
n_dim=n_dim,
n_position=n_position,
N_max_character=max_text_length +
1, # N_max_character = 1 eos + 25 characters
n_class=nclass)
self.nclass = nclass
self.max_text_length = max_text_length
def forward(self, x, label_pos, training_step, train_mode=False):
b, c, h, w = x.shape
nT = self.max_text_length
x = paddle.transpose(x, perm=[0, 1, 3, 2])
x = paddle.reshape(x, [-1, c, h * w])
x = paddle.transpose(x, perm=[0, 2, 1])
if train_mode:
if training_step == 'LF_1':
f_res = 0
f_sub = 0
x = self.SequenceModeling(x, src_mask=None)
text_pre, test_rem, text_mas = self.Prediction(
x, f_res, f_sub, train_mode=True, use_mlm=False)
return text_pre, text_pre, text_pre, text_pre
elif training_step == 'LF_2':
# MLM
f_res, f_sub, mask_c = self.MLM(x, label_pos)
x = self.SequenceModeling(x, src_mask=None)
text_pre, test_rem, text_mas = self.Prediction(
x, f_res, f_sub, train_mode=True)
mask_c_show = trans_1d_2d(mask_c)
return text_pre, test_rem, text_mas, mask_c_show
elif training_step == 'LA':
# MLM
f_res, f_sub, mask_c = self.MLM(x, label_pos)
## use the mask_c (1 for occluded character and 0 for remaining characters) to occlude input
## ratio controls the occluded number in a batch
character_mask = paddle.zeros_like(mask_c)
ratio = b // 2
if ratio >= 1:
with paddle.no_grad():
character_mask[0:ratio, :, :] = mask_c[0:ratio, :, :]
else:
character_mask = mask_c
x = x * (1 - character_mask)
# VRM
## transformer unit for VRM
x = self.SequenceModeling(x, src_mask=None)
## prediction layer for MLM and VSR
text_pre, test_rem, text_mas = self.Prediction(
x, f_res, f_sub, train_mode=True)
mask_c_show = trans_1d_2d(mask_c)
return text_pre, test_rem, text_mas, mask_c_show
else:
raise NotImplementedError
else: # VRM is only used in the testing stage
f_res = 0
f_sub = 0
contextual_feature = self.SequenceModeling(x, src_mask=None)
text_pre = self.Prediction(
contextual_feature,
f_res,
f_sub,
train_mode=False,
use_mlm=False)
text_pre = paddle.transpose(
                text_pre, perm=[1, 0, 2])  # (26, b, 37)
return text_pre, x
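# Editor's sketch (not part of the original VisionLAN implementation): a minimal
# shape check for the inference branch above, assuming backbone features of
# shape [batch, 512, 8, 32] so that h * w = n_position = 256.
def _mlm_vrm_shape_demo():
    feat = paddle.rand([2, 512, 8, 32])
    module = MLM_VRM(
        n_layers=3, n_position=256, n_dim=512, max_text_length=25, nclass=37)
    text_pre, x = module(feat, None, 'LA', train_mode=False)
    # text_pre: [26, 2, 37] -> (max_text_length + 1, batch, nclass)
    # x:        [2, 256, 512] -> flattened visual feature sequence
    return text_pre.shape, x.shape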
class VLHead(nn.Layer):
"""
Architecture of VisionLAN
"""
def __init__(self,
in_channels,
out_channels=36,
n_layers=3,
n_position=256,
n_dim=512,
max_text_length=25,
training_step='LA'):
super(VLHead, self).__init__()
self.MLM_VRM = MLM_VRM(
n_layers=n_layers,
n_position=n_position,
n_dim=n_dim,
max_text_length=max_text_length,
nclass=out_channels + 1)
self.training_step = training_step
def forward(self, feat, targets=None):
if self.training:
label_pos = targets[-2]
text_pre, test_rem, text_mas, mask_map = self.MLM_VRM(
feat, label_pos, self.training_step, train_mode=True)
return text_pre, test_rem, text_mas, mask_map
else:
text_pre, x = self.MLM_VRM(
feat, targets, self.training_step, train_mode=False)
return text_pre, x
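# Editor's note (illustrative): out_channels is the size of the recognition
# character set, and the class count handed to MLM_VRM is out_channels + 1;
# the extra class (index 0 in the decode step) is treated as the end-of-sequence
# token by VLLabelDecode further down in this commit. training_step selects the
# LF_1 / LF_2 / LA stage of VisionLAN training.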
...@@ -77,11 +77,62 @@ class Adam(object):
        self.grad_clip = grad_clip
        self.name = name
        self.lazy_mode = lazy_mode
+         self.group_lr = kwargs.get('group_lr', False)
+         self.training_step = kwargs.get('training_step', None)
    def __call__(self, model):
-         train_params = [
-             param for param in model.parameters() if param.trainable is True
-         ]
+         if self.group_lr:
+             if self.training_step == 'LF_2':
+                 import paddle
+                 if isinstance(model, paddle.fluid.dygraph.parallel.
+                               DataParallel):  # multi gpu
+                     mlm = model._layers.head.MLM_VRM.MLM.parameters()
+                     pre_mlm_pp = model._layers.head.MLM_VRM.Prediction.pp_share.parameters(
+                     )
+                     pre_mlm_w = model._layers.head.MLM_VRM.Prediction.w_share.parameters(
+                     )
+                 else:  # single gpu
+                     mlm = model.head.MLM_VRM.MLM.parameters()
+                     pre_mlm_pp = model.head.MLM_VRM.Prediction.pp_share.parameters(
+                     )
+                     pre_mlm_w = model.head.MLM_VRM.Prediction.w_share.parameters(
+                     )
+                 total = []
+                 for param in mlm:
+                     total.append(id(param))
+                 for param in pre_mlm_pp:
+                     total.append(id(param))
+                 for param in pre_mlm_w:
+                     total.append(id(param))
+                 group_base_params = [
+                     param for param in model.parameters() if id(param) in total
+                 ]
+                 group_small_params = [
+                     param for param in model.parameters()
+                     if id(param) not in total
+                 ]
+                 train_params = [{
+                     'params': group_base_params
+                 }, {
+                     'params': group_small_params,
+                     'learning_rate': self.learning_rate.values[0] * 0.1
+                 }]
+             else:
+                 print(
+                     'group lr currently only support VisionLAN in LF_2 training step'
+                 )
+                 train_params = [
+                     param for param in model.parameters()
+                     if param.trainable is True
+                 ]
+         else:
+             train_params = [
+                 param for param in model.parameters() if param.trainable is True
+             ]
        opt = optim.Adam(
            learning_rate=self.learning_rate,
            beta1=self.beta1,
......
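The `group_lr` branch added above splits the trainable parameters by object identity: the MLM module and the shared prediction layers form the base group, while everything else goes into a second group trained at one tenth of the base learning rate. A minimal sketch of that split (the function name, `base_lr` and `scale` are illustrative; it assumes a single-GPU `model` exposing the same attribute path as the diff):

```python
# Sketch only: reproduce the id()-based parameter split used by the LF_2 branch.
def split_param_groups(model, scale=0.1, base_lr=1e-4):
    shared = (list(model.head.MLM_VRM.MLM.parameters()) +
              list(model.head.MLM_VRM.Prediction.pp_share.parameters()) +
              list(model.head.MLM_VRM.Prediction.w_share.parameters()))
    shared_ids = {id(p) for p in shared}
    base = [p for p in model.parameters() if id(p) in shared_ids]
    rest = [p for p in model.parameters() if id(p) not in shared_ids]
    # The optimizer then receives the groups as a list of dicts; the second
    # dict carries its own learning_rate entry (0.1x the base value above).
    return [{'params': base}, {'params': rest, 'learning_rate': base_lr * scale}]
```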
...@@ -28,12 +28,13 @@ from .fce_postprocess import FCEPostProcess
from .rec_postprocess import CTCLabelDecode, AttnLabelDecode, SRNLabelDecode, \
    DistillationCTCLabelDecode, NRTRLabelDecode, SARLabelDecode, \
    SEEDLabelDecode, PRENLabelDecode, ViTSTRLabelDecode, ABINetLabelDecode, \
-     SPINLabelDecode
+     SPINLabelDecode, VLLabelDecode
from .cls_postprocess import ClsPostProcess
from .pg_postprocess import PGPostProcess
- from .vqa_token_ser_layoutlm_postprocess import VQASerTokenLayoutLMPostProcess
+ from .vqa_token_ser_layoutlm_postprocess import VQASerTokenLayoutLMPostProcess, DistillationSerPostProcess
- from .vqa_token_re_layoutlm_postprocess import VQAReTokenLayoutLMPostProcess
+ from .vqa_token_re_layoutlm_postprocess import VQAReTokenLayoutLMPostProcess, DistillationRePostProcess
from .table_postprocess import TableMasterLabelDecode, TableLabelDecode
+ from .picodet_postprocess import PicoDetPostProcess
def build_post_process(config, global_config=None):
...@@ -45,7 +46,9 @@ def build_post_process(config, global_config=None):
        'SEEDLabelDecode', 'VQASerTokenLayoutLMPostProcess',
        'VQAReTokenLayoutLMPostProcess', 'PRENLabelDecode',
        'DistillationSARLabelDecode', 'ViTSTRLabelDecode', 'ABINetLabelDecode',
-         'TableMasterLabelDecode', 'SPINLabelDecode'
+         'TableMasterLabelDecode', 'SPINLabelDecode',
+         'DistillationSerPostProcess', 'DistillationRePostProcess',
+         'VLLabelDecode', 'PicoDetPostProcess'
    ]
    if config['name'] == 'PSEPostProcess':
......
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from scipy.special import softmax
def hard_nms(box_scores, iou_threshold, top_k=-1, candidate_size=200):
"""
Args:
box_scores (N, 5): boxes in corner-form and probabilities.
iou_threshold: intersection over union threshold.
top_k: keep top_k results. If k <= 0, keep all the results.
candidate_size: only consider the candidates with the highest scores.
Returns:
picked: a list of indexes of the kept boxes
"""
scores = box_scores[:, -1]
boxes = box_scores[:, :-1]
picked = []
indexes = np.argsort(scores)
indexes = indexes[-candidate_size:]
while len(indexes) > 0:
current = indexes[-1]
picked.append(current)
if 0 < top_k == len(picked) or len(indexes) == 1:
break
current_box = boxes[current, :]
indexes = indexes[:-1]
rest_boxes = boxes[indexes, :]
iou = iou_of(
rest_boxes,
np.expand_dims(
current_box, axis=0), )
indexes = indexes[iou <= iou_threshold]
return box_scores[picked, :]
def iou_of(boxes0, boxes1, eps=1e-5):
"""Return intersection-over-union (Jaccard index) of boxes.
Args:
boxes0 (N, 4): ground truth boxes.
boxes1 (N or 1, 4): predicted boxes.
eps: a small number to avoid 0 as denominator.
Returns:
iou (N): IoU values.
"""
overlap_left_top = np.maximum(boxes0[..., :2], boxes1[..., :2])
overlap_right_bottom = np.minimum(boxes0[..., 2:], boxes1[..., 2:])
overlap_area = area_of(overlap_left_top, overlap_right_bottom)
area0 = area_of(boxes0[..., :2], boxes0[..., 2:])
area1 = area_of(boxes1[..., :2], boxes1[..., 2:])
return overlap_area / (area0 + area1 - overlap_area + eps)
def area_of(left_top, right_bottom):
"""Compute the areas of rectangles given two corners.
Args:
left_top (N, 2): left top corner.
right_bottom (N, 2): right bottom corner.
Returns:
area (N): return the area.
"""
hw = np.clip(right_bottom - left_top, 0.0, None)
return hw[..., 0] * hw[..., 1]
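# Editor's sketch (not part of the original file): a tiny numeric check of the
# helpers above. Two unit-offset 2x2 boxes overlap in a 1x1 region, so the
# expected IoU is 1 / (4 + 4 - 1) ~= 0.1429.
def _iou_demo():
    boxes0 = np.array([[0.0, 0.0, 2.0, 2.0]])
    boxes1 = np.array([[1.0, 1.0, 3.0, 3.0]])
    return iou_of(boxes0, boxes1)  # -> array([0.14285...])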
class PicoDetPostProcess(object):
"""
Args:
input_shape (int): network input image size
ori_shape (int): ori image shape of before padding
scale_factor (float): scale factor of ori image
enable_mkldnn (bool): whether to open MKLDNN
"""
def __init__(self,
layout_dict_path,
strides=[8, 16, 32, 64],
score_threshold=0.4,
nms_threshold=0.5,
nms_top_k=1000,
keep_top_k=100):
self.labels = self.load_layout_dict(layout_dict_path)
self.strides = strides
self.score_threshold = score_threshold
self.nms_threshold = nms_threshold
self.nms_top_k = nms_top_k
self.keep_top_k = keep_top_k
def load_layout_dict(self, layout_dict_path):
with open(layout_dict_path, 'r', encoding='utf-8') as fp:
labels = fp.readlines()
return [label.strip('\n') for label in labels]
def warp_boxes(self, boxes, ori_shape):
"""Apply transform to boxes
"""
width, height = ori_shape[1], ori_shape[0]
n = len(boxes)
if n:
# warp points
xy = np.ones((n * 4, 3))
xy[:, :2] = boxes[:, [0, 1, 2, 3, 0, 3, 2, 1]].reshape(
n * 4, 2) # x1y1, x2y2, x1y2, x2y1
# xy = xy @ M.T # transform
xy = (xy[:, :2] / xy[:, 2:3]).reshape(n, 8) # rescale
# create new boxes
x = xy[:, [0, 2, 4, 6]]
y = xy[:, [1, 3, 5, 7]]
xy = np.concatenate(
(x.min(1), y.min(1), x.max(1), y.max(1))).reshape(4, n).T
# clip boxes
xy[:, [0, 2]] = xy[:, [0, 2]].clip(0, width)
xy[:, [1, 3]] = xy[:, [1, 3]].clip(0, height)
return xy.astype(np.float32)
else:
return boxes
def img_info(self, ori_img, img):
origin_shape = ori_img.shape
resize_shape = img.shape
im_scale_y = resize_shape[2] / float(origin_shape[0])
im_scale_x = resize_shape[3] / float(origin_shape[1])
scale_factor = np.array([im_scale_y, im_scale_x], dtype=np.float32)
img_shape = np.array(img.shape[2:], dtype=np.float32)
input_shape = np.array(img).astype('float32').shape[2:]
ori_shape = np.array((img_shape, )).astype('float32')
scale_factor = np.array((scale_factor, )).astype('float32')
return ori_shape, input_shape, scale_factor
def __call__(self, ori_img, img, preds):
scores, raw_boxes = preds['boxes'], preds['boxes_num']
batch_size = raw_boxes[0].shape[0]
reg_max = int(raw_boxes[0].shape[-1] / 4 - 1)
out_boxes_num = []
out_boxes_list = []
results = []
ori_shape, input_shape, scale_factor = self.img_info(ori_img, img)
for batch_id in range(batch_size):
# generate centers
decode_boxes = []
select_scores = []
for stride, box_distribute, score in zip(self.strides, raw_boxes,
scores):
box_distribute = box_distribute[batch_id]
score = score[batch_id]
# centers
fm_h = input_shape[0] / stride
fm_w = input_shape[1] / stride
h_range = np.arange(fm_h)
w_range = np.arange(fm_w)
ww, hh = np.meshgrid(w_range, h_range)
ct_row = (hh.flatten() + 0.5) * stride
ct_col = (ww.flatten() + 0.5) * stride
center = np.stack((ct_col, ct_row, ct_col, ct_row), axis=1)
# box distribution to distance
reg_range = np.arange(reg_max + 1)
box_distance = box_distribute.reshape((-1, reg_max + 1))
box_distance = softmax(box_distance, axis=1)
box_distance = box_distance * np.expand_dims(reg_range, axis=0)
box_distance = np.sum(box_distance, axis=1).reshape((-1, 4))
box_distance = box_distance * stride
# top K candidate
topk_idx = np.argsort(score.max(axis=1))[::-1]
topk_idx = topk_idx[:self.nms_top_k]
center = center[topk_idx]
score = score[topk_idx]
box_distance = box_distance[topk_idx]
# decode box
decode_box = center + [-1, -1, 1, 1] * box_distance
select_scores.append(score)
decode_boxes.append(decode_box)
# nms
bboxes = np.concatenate(decode_boxes, axis=0)
confidences = np.concatenate(select_scores, axis=0)
picked_box_probs = []
picked_labels = []
for class_index in range(0, confidences.shape[1]):
probs = confidences[:, class_index]
mask = probs > self.score_threshold
probs = probs[mask]
if probs.shape[0] == 0:
continue
subset_boxes = bboxes[mask, :]
box_probs = np.concatenate(
[subset_boxes, probs.reshape(-1, 1)], axis=1)
box_probs = hard_nms(
box_probs,
iou_threshold=self.nms_threshold,
top_k=self.keep_top_k, )
picked_box_probs.append(box_probs)
picked_labels.extend([class_index] * box_probs.shape[0])
if len(picked_box_probs) == 0:
out_boxes_list.append(np.empty((0, 4)))
out_boxes_num.append(0)
else:
picked_box_probs = np.concatenate(picked_box_probs)
# resize output boxes
picked_box_probs[:, :4] = self.warp_boxes(
picked_box_probs[:, :4], ori_shape[batch_id])
im_scale = np.concatenate([
scale_factor[batch_id][::-1], scale_factor[batch_id][::-1]
])
picked_box_probs[:, :4] /= im_scale
                # each row: [class label, score, x1, y1, x2, y2]
out_boxes_list.append(
np.concatenate(
[
np.expand_dims(
np.array(picked_labels),
axis=-1), np.expand_dims(
picked_box_probs[:, 4], axis=-1),
picked_box_probs[:, :4]
],
axis=1))
out_boxes_num.append(len(picked_labels))
out_boxes_list = np.concatenate(out_boxes_list, axis=0)
out_boxes_num = np.asarray(out_boxes_num).astype(np.int32)
for dt in out_boxes_list:
clsid, bbox, score = int(dt[0]), dt[2:], dt[1]
label = self.labels[clsid]
result = {'bbox': bbox, 'label': label}
results.append(result)
return results
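# Editor's sketch (not part of the original post-process): the per-side box
# regression above is decoded as the expectation over reg_max + 1 discrete bins
# (a GFL-style "integral" decode), then scaled by the stride. With uniform
# logits over 8 bins the expected bin index is (0 + 1 + ... + 7) / 8 = 3.5,
# so at stride 8 the decoded distance from the center is 28 pixels.
def _distance_decode_demo(stride=8, reg_max=7):
    logits = np.zeros((1, reg_max + 1))            # one box side, uniform logits
    prob = softmax(logits, axis=1)                 # -> every bin equals 1/8
    distance = np.sum(prob * np.arange(reg_max + 1), axis=1)
    return distance * stride                       # -> array([28.])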
...@@ -668,6 +668,7 @@ class ABINetLabelDecode(NRTRLabelDecode):
        dict_character = ['</s>'] + dict_character
        return dict_character
class SPINLabelDecode(AttnLabelDecode):
    """ Convert between text-label and text-index """
...@@ -681,4 +682,106 @@ class SPINLabelDecode(AttnLabelDecode):
        self.end_str = "eos"
        dict_character = dict_character
        dict_character = [self.beg_str] + [self.end_str] + dict_character
        return dict_character
\ No newline at end of file
class VLLabelDecode(BaseRecLabelDecode):
""" Convert between text-label and text-index """
def __init__(self, character_dict_path=None, use_space_char=False,
**kwargs):
super(VLLabelDecode, self).__init__(character_dict_path, use_space_char)
self.max_text_length = kwargs.get('max_text_length', 25)
self.nclass = len(self.character) + 1
self.character = self.character[10:] + self.character[
1:10] + [self.character[0]]
def decode(self, text_index, text_prob=None, is_remove_duplicate=False):
""" convert text-index into text-label. """
result_list = []
ignored_tokens = self.get_ignored_tokens()
batch_size = len(text_index)
for batch_idx in range(batch_size):
selection = np.ones(len(text_index[batch_idx]), dtype=bool)
if is_remove_duplicate:
selection[1:] = text_index[batch_idx][1:] != text_index[
batch_idx][:-1]
for ignored_token in ignored_tokens:
selection &= text_index[batch_idx] != ignored_token
char_list = [
self.character[text_id - 1]
for text_id in text_index[batch_idx][selection]
]
if text_prob is not None:
conf_list = text_prob[batch_idx][selection]
else:
conf_list = [1] * len(selection)
if len(conf_list) == 0:
conf_list = [0]
text = ''.join(char_list)
result_list.append((text, np.mean(conf_list).tolist()))
return result_list
def __call__(self, preds, label=None, length=None, *args, **kwargs):
if len(preds) == 2: # eval mode
text_pre, x = preds
b = text_pre.shape[1]
lenText = self.max_text_length
nsteps = self.max_text_length
if not isinstance(text_pre, paddle.Tensor):
text_pre = paddle.to_tensor(text_pre, dtype='float32')
out_res = paddle.zeros(
shape=[lenText, b, self.nclass], dtype=x.dtype)
out_length = paddle.zeros(shape=[b], dtype=x.dtype)
now_step = 0
for _ in range(nsteps):
if 0 in out_length and now_step < nsteps:
tmp_result = text_pre[now_step, :, :]
out_res[now_step] = tmp_result
tmp_result = tmp_result.topk(1)[1].squeeze(axis=1)
for j in range(b):
if out_length[j] == 0 and tmp_result[j] == 0:
out_length[j] = now_step + 1
now_step += 1
for j in range(0, b):
if int(out_length[j]) == 0:
out_length[j] = nsteps
start = 0
output = paddle.zeros(
shape=[int(out_length.sum()), self.nclass], dtype=x.dtype)
for i in range(0, b):
cur_length = int(out_length[i])
output[start:start + cur_length] = out_res[0:cur_length, i, :]
start += cur_length
net_out = output
length = out_length
else: # train mode
net_out = preds[0]
length = length
net_out = paddle.concat([t[:l] for t, l in zip(net_out, length)])
text = []
if not isinstance(net_out, paddle.Tensor):
net_out = paddle.to_tensor(net_out, dtype='float32')
net_out = F.softmax(net_out, axis=1)
for i in range(0, length.shape[0]):
preds_idx = net_out[int(length[:i].sum()):int(length[:i].sum(
) + length[i])].topk(1)[1][:, 0].tolist()
preds_text = ''.join([
self.character[idx - 1]
if idx > 0 and idx <= len(self.character) else ''
for idx in preds_idx
])
preds_prob = net_out[int(length[:i].sum()):int(length[:i].sum(
) + length[i])].topk(1)[0][:, 0]
preds_prob = paddle.exp(
paddle.log(preds_prob).sum() / (preds_prob.shape[0] + 1e-6))
text.append((preds_text, preds_prob))
if label is None:
return text
label = self.decode(label)
return text, label
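# Editor's note (illustrative): the sequence confidence returned above is the
# geometric mean of the per-character top-1 probabilities,
# exp(sum(log(p_i)) / n); e.g. probabilities [0.9, 0.8, 0.9] give
# exp((ln 0.9 + ln 0.8 + ln 0.9) / 3) ~= 0.865.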
...@@ -49,3 +49,25 @@ class VQAReTokenLayoutLMPostProcess(object):
                result.append((ocr_info_head, ocr_info_tail))
            results.append(result)
        return results
class DistillationRePostProcess(VQAReTokenLayoutLMPostProcess):
"""
DistillationRePostProcess
"""
def __init__(self, model_name=["Student"], key=None, **kwargs):
super().__init__(**kwargs)
if not isinstance(model_name, list):
model_name = [model_name]
self.model_name = model_name
self.key = key
def __call__(self, preds, *args, **kwargs):
output = dict()
for name in self.model_name:
pred = preds[name]
if self.key is not None:
pred = pred[self.key]
output[name] = super().__call__(pred, *args, **kwargs)
return output
...@@ -93,3 +93,25 @@ class VQASerTokenLayoutLMPostProcess(object):
                ocr_info[idx]["pred"] = self.id2label_map_for_show[int(pred_id)]
            results.append(ocr_info)
        return results
class DistillationSerPostProcess(VQASerTokenLayoutLMPostProcess):
"""
DistillationSerPostProcess
"""
def __init__(self, class_path, model_name=["Student"], key=None, **kwargs):
super().__init__(class_path, **kwargs)
if not isinstance(model_name, list):
model_name = [model_name]
self.model_name = model_name
self.key = key
def __call__(self, preds, batch=None, *args, **kwargs):
output = dict()
for name in self.model_name:
pred = preds[name]
if self.key is not None:
pred = pred[self.key]
output[name] = super().__call__(pred, batch=batch, *args, **kwargs)
return output
...@@ -53,8 +53,12 @@ def load_model(config, model, optimizer=None, model_type='det'):
    checkpoints = global_config.get('checkpoints')
    pretrained_model = global_config.get('pretrained_model')
    best_model_dict = {}
+     is_float16 = False
    if model_type == 'vqa':
+         # NOTE: for vqa model, resume training is not supported now
+         if config["Architecture"]["algorithm"] in ["Distillation"]:
+             return best_model_dict
        checkpoints = config['Architecture']['Backbone']['checkpoints']
        # load vqa method metric
        if checkpoints:
...@@ -78,6 +82,7 @@ def load_model(config, model, optimizer=None, model_type='det'):
                logger.warning(
                    "{}.pdopt is not exists, params of optimizer is not loaded".
                    format(checkpoints))
        return best_model_dict
    if checkpoints:
...@@ -96,6 +101,9 @@ def load_model(config, model, optimizer=None, model_type='det'):
                    key, params.keys()))
                continue
            pre_value = params[key]
+             if pre_value.dtype == paddle.float16:
+                 pre_value = pre_value.astype(paddle.float32)
+                 is_float16 = True
            if list(value.shape) == list(pre_value.shape):
                new_state_dict[key] = pre_value
            else:
...@@ -103,7 +111,10 @@ def load_model(config, model, optimizer=None, model_type='det'):
                    "The shape of model params {} {} not matched with loaded params shape {} !".
                    format(key, value.shape, pre_value.shape))
        model.set_state_dict(new_state_dict)
+         if is_float16:
+             logger.info(
+                 "The parameter type is float16, which is converted to float32 when loading"
+             )
        if optimizer is not None:
            if os.path.exists(checkpoints + '.pdopt'):
                optim_dict = paddle.load(checkpoints + '.pdopt')
...@@ -122,9 +133,10 @@ def load_model(config, model, optimizer=None, model_type='det'):
            best_model_dict['start_epoch'] = states_dict['epoch'] + 1
        logger.info("resume from {}".format(checkpoints))
    elif pretrained_model:
-         load_pretrained_params(model, pretrained_model)
+         is_float16 = load_pretrained_params(model, pretrained_model)
    else:
        logger.info('train from scratch')
+     best_model_dict['is_float16'] = is_float16
    return best_model_dict
...@@ -138,19 +150,28 @@ def load_pretrained_params(model, path):
    params = paddle.load(path + '.pdparams')
    state_dict = model.state_dict()
    new_state_dict = {}
+     is_float16 = False
    for k1 in params.keys():
        if k1 not in state_dict.keys():
            logger.warning("The pretrained params {} not in model".format(k1))
        else:
+             if params[k1].dtype == paddle.float16:
+                 params[k1] = params[k1].astype(paddle.float32)
+                 is_float16 = True
            if list(state_dict[k1].shape) == list(params[k1].shape):
                new_state_dict[k1] = params[k1]
            else:
                logger.warning(
                    "The shape of model params {} {} not matched with loaded params {} {} !".
                    format(k1, state_dict[k1].shape, k1, params[k1].shape))
    model.set_state_dict(new_state_dict)
+     if is_float16:
+         logger.info(
+             "The parameter type is float16, which is converted to float32 when loading"
+         )
    logger.info("load pretrain successful from {}".format(path))
-     return model
+     return is_float16
def save_model(model,
...@@ -166,15 +187,19 @@ def save_model(model,
    """
    _mkdir_if_not_exist(model_path, logger)
    model_prefix = os.path.join(model_path, prefix)
-     paddle.save(optimizer.state_dict(), model_prefix + '.pdopt')
+     if config['Architecture']["model_type"] != 'vqa':
+         paddle.save(optimizer.state_dict(), model_prefix + '.pdopt')
    if config['Architecture']["model_type"] != 'vqa':
        paddle.save(model.state_dict(), model_prefix + '.pdparams')
        metric_prefix = model_prefix
-     else:
+     else:  # for vqa system, we follow the save/load rules in NLP
        if config['Global']['distributed']:
-             model._layers.backbone.model.save_pretrained(model_prefix)
+             arch = model._layers
        else:
-             model.backbone.model.save_pretrained(model_prefix)
+             arch = model
+         if config["Architecture"]["algorithm"] in ["Distillation"]:
+             arch = arch.Student
+         arch.backbone.model.save_pretrained(model_prefix)
        metric_prefix = os.path.join(model_prefix, 'metric')
    # save metric and config
    with open(metric_prefix + '.states', 'wb') as f:
......
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
__dir__ = os.path.dirname(os.path.abspath(__file__))
sys.path.append(__dir__)
sys.path.insert(0, os.path.abspath(os.path.join(__dir__, '../..')))
os.environ["FLAGS_allocator_strategy"] = 'auto_growth'
import cv2
import numpy as np
import time
import tools.infer.utility as utility
from ppocr.data import create_operators, transform
from ppocr.postprocess import build_post_process
from ppocr.utils.logging import get_logger
from ppocr.utils.utility import get_image_file_list, check_and_read_gif
from ppstructure.utility import parse_args
from picodet_postprocess import PicoDetPostProcess
logger = get_logger()
class LayoutPredictor(object):
def __init__(self, args):
pre_process_list = [{
'Resize': {
'size': [800, 608]
}
}, {
'NormalizeImage': {
'std': [0.229, 0.224, 0.225],
'mean': [0.485, 0.456, 0.406],
'scale': '1./255.',
'order': 'hwc'
}
}, {
'ToCHWImage': None
}, {
'KeepKeys': {
'keep_keys': ['image']
}
}]
postprocess_params = {
'name': 'PicoDetPostProcess',
"layout_dict_path": args.layout_dict_path,
"score_threshold": args.layout_score_threshold,
"nms_threshold": args.layout_nms_threshold,
}
self.preprocess_op = create_operators(pre_process_list)
self.postprocess_op = build_post_process(postprocess_params)
self.predictor, self.input_tensor, self.output_tensors, self.config = \
utility.create_predictor(args, 'layout', logger)
def __call__(self, img):
ori_im = img.copy()
data = {'image': img}
data = transform(data, self.preprocess_op)
img = data[0]
if img is None:
return None, 0
img = np.expand_dims(img, axis=0)
img = img.copy()
preds, elapse = 0, 1
starttime = time.time()
self.input_tensor.copy_from_cpu(img)
self.predictor.run()
np_score_list, np_boxes_list = [], []
output_names = self.predictor.get_output_names()
num_outs = int(len(output_names) / 2)
for out_idx in range(num_outs):
np_score_list.append(
self.predictor.get_output_handle(output_names[out_idx])
.copy_to_cpu())
np_boxes_list.append(
self.predictor.get_output_handle(output_names[
out_idx + num_outs]).copy_to_cpu())
preds = dict(boxes=np_score_list, boxes_num=np_boxes_list)
post_preds = self.postprocess_op(ori_im, img, preds)
elapse = time.time() - starttime
return post_preds, elapse
def main(args):
image_file_list = get_image_file_list(args.image_dir)
layout_predictor = LayoutPredictor(args)
count = 0
total_time = 0
repeats = 50
for image_file in image_file_list:
img, flag = check_and_read_gif(image_file)
if not flag:
img = cv2.imread(image_file)
if img is None:
logger.info("error in loading image:{}".format(image_file))
continue
layout_res, elapse = layout_predictor(img)
logger.info("result: {}".format(layout_res))
if count > 0:
total_time += elapse
count += 1
logger.info("Predict time of {}: {}".format(image_file, elapse))
if __name__ == "__main__":
main(parse_args())
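# Editor's note (illustrative): this predictor is configured through the layout
# flags added to ppstructure/utility.py in this commit (--layout_model_dir,
# --layout_dict_path, --layout_score_threshold, --layout_nms_threshold) plus the
# usual --image_dir flag; --layout_model_dir should point to an exported PicoDet
# layout inference model and --layout_dict_path to its label dictionary.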
...@@ -32,15 +32,18 @@ def init_args():
        type=str,
        default="../ppocr/utils/dict/table_structure_dict.txt")
    # params for layout
+     parser.add_argument("--layout_model_dir", type=str)
    parser.add_argument(
-         "--layout_path_model",
+         "--layout_dict_path",
        type=str,
-         default="lp://PubLayNet/ppyolov2_r50vd_dcn_365e_publaynet/config")
+         default="../ppocr/utils/dict/layout_pubalynet_dict.txt")
    parser.add_argument(
-         "--layout_label_map",
+         "--layout_score_threshold",
-         type=ast.literal_eval,
+         type=float,
-         default=None,
+         default=0.5,
-         help='label map according to ppstructure/layout/README_ch.md')
+         help="Threshold of score.")
+     parser.add_argument(
+         "--layout_nms_threshold", type=float, default=0.5, help="Threshold of nms.")
    # params for vqa
    parser.add_argument("--vqa_algorithm", type=str, default='LayoutXLM')
    parser.add_argument("--ser_model_dir", type=str)
...@@ -87,7 +90,7 @@ def draw_structure_result(image, result, font_path):
    image = Image.fromarray(image)
    boxes, txts, scores = [], [], []
    for region in result:
-         if region['type'] == 'Table':
+         if region['type'] == 'table':
            pass
        else:
            for text_result in region['res']:
......
...@@ -216,7 +216,7 @@ Use the following command to complete the tandem prediction of `OCR + SER` based
```shell
cd ppstructure
- CUDA_VISIBLE_DEVICES=0 python3.7 vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM --ser_model_dir=../output/ser/infer --ser_dict_path=../train_data/XFUND/class_list_xfun.txt --image_dir=docs/vqa/input/zh_val_42.jpg --output=output
+ CUDA_VISIBLE_DEVICES=0 python3.7 vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM --ser_model_dir=../output/ser/infer --ser_dict_path=../train_data/XFUND/class_list_xfun.txt --vis_font_path=../doc/fonts/simfang.ttf --image_dir=docs/vqa/input/zh_val_42.jpg --output=output
```
After the prediction is successful, the visualization images and results will be saved in the directory specified by the `output` field
......
...@@ -215,7 +215,7 @@ python3.7 tools/export_model.py -c configs/vqa/ser/layoutxlm.yml -o Architecture
```shell
cd ppstructure
- CUDA_VISIBLE_DEVICES=0 python3.7 vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM --ser_model_dir=../output/ser/infer --ser_dict_path=../train_data/XFUND/class_list_xfun.txt --image_dir=docs/vqa/input/zh_val_42.jpg --output=output
+ CUDA_VISIBLE_DEVICES=0 python3.7 vqa/predict_vqa_token_ser.py --vqa_algorithm=LayoutXLM --ser_model_dir=../output/ser/infer --ser_dict_path=../train_data/XFUND/class_list_xfun.txt --vis_font_path=../doc/fonts/simfang.ttf --image_dir=docs/vqa/input/zh_val_42.jpg --output=output
```
After the prediction is successful, the visualization images and results will be saved in the directory specified by the `output` field
......
...@@ -153,7 +153,7 @@ def main(args):
            img_res = draw_ser_results(
                image_file,
                ser_res,
-                 font_path="../doc/fonts/simfang.ttf", )
+                 font_path=args.vis_font_path, )
            img_save_path = os.path.join(args.output,
                                         os.path.basename(image_file))
......
...@@ -114,7 +114,7 @@ Train:
    name: SimpleDataSet
    data_dir: ./train_data/ic15_data/
    label_file_list:
-     - ./train_data/ic15_data/rec_gt_train4w.txt
+     - ./train_data/ic15_data/rec_gt_train.txt
    transforms:
      - DecodeImage:
          img_mode: BGR
......
...@@ -153,7 +153,7 @@ Train:
    data_dir: ./train_data/ic15_data/
    ext_op_transform_idx: 1
    label_file_list:
-     - ./train_data/ic15_data/rec_gt_train4w.txt
+     - ./train_data/ic15_data/rec_gt_train.txt
    transforms:
      - DecodeImage:
          img_mode: BGR
......
...@@ -52,8 +52,9 @@ null:null
===========================infer_benchmark_params==========================
random_infer_input:[{float32,[3,48,320]}]
===========================train_benchmark_params==========================
- batch_size:128
+ batch_size:64
fp_items:fp32|fp16
epoch:1
--profiler_options:batch_range=[10,20];state=GPU;tracer_option=Default;profile_path=model.profile
flags:FLAGS_eager_delete_tensor_gb=0.0;FLAGS_fraction_of_gpu_memory_to_use=0.98;FLAGS_conv_workspace_size_limit=4096
===========================cpp_infer_params===========================
- model_name:ch_ppocr_mobile_v2.0
+ model_name:ch_ppocr_mobile_v2_0
use_opencv:True
infer_model:./inference/ch_ppocr_mobile_v2.0_det_infer/
infer_quant:False
......
===========================ch_ppocr_mobile_v2.0===========================
- model_name:ch_ppocr_mobile_v2.0
+ model_name:ch_ppocr_mobile_v2_0
python:python3.7
infer_model:./inference/ch_ppocr_mobile_v2.0_det_infer/
infer_export:null
......
===========================paddle2onnx_params===========================
- model_name:ch_ppocr_mobile_v2.0
+ model_name:ch_ppocr_mobile_v2_0
python:python3.7
2onnx: paddle2onnx
--det_model_dir:./inference/ch_ppocr_mobile_v2.0_det_infer/
......
===========================serving_params===========================
- model_name:ch_ppocr_mobile_v2.0
+ model_name:ch_ppocr_mobile_v2_0
python:python3.7
trans_model:-m paddle_serving_client.convert
--det_dirname:./inference/ch_ppocr_mobile_v2.0_det_infer/
......
===========================serving_params===========================
- model_name:ch_ppocr_mobile_v2.0
+ model_name:ch_ppocr_mobile_v2_0
python:python3.7
trans_model:-m paddle_serving_client.convert
--det_dirname:./inference/ch_ppocr_mobile_v2.0_det_infer/
......
===========================cpp_infer_params===========================
- model_name:ch_ppocr_mobile_v2.0_det
+ model_name:ch_ppocr_mobile_v2_0_det
use_opencv:True
infer_model:./inference/ch_ppocr_mobile_v2.0_det_infer/
infer_quant:False
......
===========================infer_params===========================
- model_name:ch_ppocr_mobile_v2.0_det
+ model_name:ch_ppocr_mobile_v2_0_det
python:python
infer_model:./inference/ch_ppocr_mobile_v2.0_det_infer
infer_export:null
......
===========================paddle2onnx_params===========================
- model_name:ch_ppocr_mobile_v2.0_det
+ model_name:ch_ppocr_mobile_v2_0_det
python:python3.7
2onnx: paddle2onnx
--det_model_dir:./inference/ch_ppocr_mobile_v2.0_det_infer/
......
===========================serving_params===========================
- model_name:ch_ppocr_mobile_v2.0_det
+ model_name:ch_ppocr_mobile_v2_0_det
python:python3.7
trans_model:-m paddle_serving_client.convert
--det_dirname:./inference/ch_ppocr_mobile_v2.0_det_infer/
......
===========================train_params===========================
- model_name:ch_ppocr_mobile_v2.0_det
+ model_name:ch_ppocr_mobile_v2_0_det
python:python3.7
gpu_list:0|0,1
Global.use_gpu:True|True
......
===========================train_params===========================
- model_name:ch_ppocr_mobile_v2.0_det
+ model_name:ch_ppocr_mobile_v2_0_det
python:python3.7
gpu_list:192.168.0.1,192.168.0.2;0,1
Global.use_gpu:True
......
===========================train_params===========================
- model_name:ch_ppocr_mobile_v2.0_det
+ model_name:ch_ppocr_mobile_v2_0_det
python:python3.7
gpu_list:0|0,1
Global.use_gpu:True|True
......
===========================train_params===========================
- model_name:ch_ppocr_mobile_v2.0_det_PACT
+ model_name:ch_ppocr_mobile_v2_0_det_PACT
python:python3.7
gpu_list:0|0,1
Global.use_gpu:True|True
......
===========================kl_quant_params===========================
- model_name:ch_ppocr_mobile_v2.0_det_KL
+ model_name:ch_ppocr_mobile_v2_0_det_KL
python:python3.7
Global.pretrained_model:null
Global.save_inference_dir:null
......
===========================train_params===========================
- model_name:ch_ppocr_mobile_v2.0_det_FPGM
+ model_name:ch_ppocr_mobile_v2_0_det_FPGM
python:python3.7
gpu_list:0|0,1
Global.use_gpu:True|True
......
===========================train_params===========================
- model_name:ch_ppocr_mobile_v2.0_det_FPGM
+ model_name:ch_ppocr_mobile_v2_0_det_FPGM
python:python3.7
gpu_list:0|0,1
Global.use_gpu:True|True
......
===========================cpp_infer_params===========================
- model_name:ch_ppocr_mobile_v2.0_det_KL
+ model_name:ch_ppocr_mobile_v2_0_det_KL
use_opencv:True
infer_model:./inference/ch_ppocr_mobile_v2.0_det_klquant_infer
infer_quant:False
......
===========================serving_params===========================
- model_name:ch_ppocr_mobile_v2.0_rec_KL
+ model_name:ch_ppocr_mobile_v2_0_det_KL
python:python3.7
trans_model:-m paddle_serving_client.convert
--det_dirname:./inference/ch_ppocr_mobile_v2.0_det_klquant_infer/
......
===========================serving_params===========================
- model_name:ch_ppocr_mobile_v2.0_det_KL
+ model_name:ch_ppocr_mobile_v2_0_det_KL
python:python3.7
trans_model:-m paddle_serving_client.convert
--det_dirname:./inference/ch_ppocr_mobile_v2.0_det_klquant_infer/
......
===========================cpp_infer_params===========================
- model_name:ch_ppocr_mobile_v2.0_det_PACT
+ model_name:ch_ppocr_mobile_v2_0_det_PACT
use_opencv:True
infer_model:./inference/ch_ppocr_mobile_v2.0_det_pact_infer
infer_quant:False
......
===========================serving_params===========================
- model_name:ch_ppocr_mobile_v2.0_rec_PACT
+ model_name:ch_ppocr_mobile_v2_0_det_PACT
python:python3.7
trans_model:-m paddle_serving_client.convert
--det_dirname:./inference/ch_ppocr_mobile_v2.0_det_pact_infer/
......
===========================serving_params===========================
- model_name:ch_ppocr_mobile_v2.0_det_PACT
+ model_name:ch_ppocr_mobile_v2_0_det_PACT
python:python3.7
trans_model:-m paddle_serving_client.convert
--det_dirname:./inference/ch_ppocr_mobile_v2.0_det_pact_infer/
......
===========================cpp_infer_params===========================
- model_name:ch_ppocr_mobile_v2.0_rec
+ model_name:ch_ppocr_mobile_v2_0_rec
use_opencv:True
infer_model:./inference/ch_ppocr_mobile_v2.0_rec_infer/
infer_quant:False
......
===========================paddle2onnx_params===========================
- model_name:ch_ppocr_mobile_v2.0_rec
+ model_name:ch_ppocr_mobile_v2_0_rec
python:python3.7
2onnx: paddle2onnx
--det_model_dir:
......
===========================serving_params===========================
- model_name:ch_ppocr_mobile_v2.0_rec
+ model_name:ch_ppocr_mobile_v2_0_rec
python:python3.7
trans_model:-m paddle_serving_client.convert
--det_dirname:null
......
===========================train_params===========================
- model_name:ch_ppocr_mobile_v2.0_rec
+ model_name:ch_ppocr_mobile_v2_0_rec
python:python3.7
gpu_list:0|0,1
Global.use_gpu:True|True
......
===========================train_params===========================
- model_name:ch_ppocr_mobile_v2.0_rec
+ model_name:ch_ppocr_mobile_v2_0_rec
python:python3.7
gpu_list:192.168.0.1,192.168.0.2;0,1
Global.use_gpu:True
......
===========================train_params===========================
- model_name:ch_ppocr_mobile_v2.0_rec
+ model_name:ch_ppocr_mobile_v2_0_rec
python:python3.7
gpu_list:0|0,1
Global.use_gpu:True|True
......
===========================train_params===========================
- model_name:ch_ppocr_mobile_v2.0_rec_PACT
+ model_name:ch_ppocr_mobile_v2_0_rec_PACT
python:python3.7
gpu_list:0
Global.use_gpu:True|True
...@@ -14,7 +14,7 @@ null:null
##
trainer:pact_train
norm_train:null
- pact_train:deploy/slim/quantization/quant.py -c test_tipc/configs/ch_ppocr_mobile_v2.0_rec_PACT/rec_chinese_lite_train_v2.0.yml -o
+ pact_train:deploy/slim/quantization/quant.py -c test_tipc/configs/ch_ppocr_mobile_v2_0_rec_PACT/rec_chinese_lite_train_v2.0.yml -o
fpgm_train:null
distill_train:null
null:null
...@@ -28,7 +28,7 @@ null:null
Global.save_inference_dir:./output/
Global.checkpoints:
norm_export:null
- quant_export:deploy/slim/quantization/export_model.py -c test_tipc/configs/ch_ppocr_mobile_v2.0_rec_PACT/rec_chinese_lite_train_v2.0.yml -o
+ quant_export:deploy/slim/quantization/export_model.py -c test_tipc/configs/ch_ppocr_mobile_v2_0_rec_PACT/rec_chinese_lite_train_v2.0.yml -o
fpgm_export:null
distill_export:null
export1:null
......
(The diffs of 13 more files are collapsed and not shown.)