diff --git a/doc/doc_ch/dataset/table_datasets.md b/doc/doc_ch/dataset/table_datasets.md
index ae902b23ccf985d522386b7454c7f76a74917502..58f4cf470542ff7ef20f518efb8b6942a3caa2f0 100644
--- a/doc/doc_ch/dataset/table_datasets.md
+++ b/doc/doc_ch/dataset/table_datasets.md
@@ -3,6 +3,7 @@
 - [数据集汇总](#数据集汇总)
 - [1. PubTabNet数据集](#1-pubtabnet数据集)
 - [2. 好未来表格识别竞赛数据集](#2-好未来表格识别竞赛数据集)
+- [3. WTW中文场景表格数据集](#3-wtw中文场景表格数据集)
 
 这里整理了常用表格识别数据集,持续更新中,欢迎各位小伙伴贡献数据集~
 
@@ -12,6 +13,7 @@
 |---|---|---|
 | PubTabNet |https://github.com/ibm-aur-nlp/PubTabNet| jsonl格式,可直接用[pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py)加载 |
 | 好未来表格识别竞赛数据集 |https://ai.100tal.com/dataset| jsonl格式,可直接用[pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py)加载 |
+| WTW中文场景表格数据集 |https://github.com/wangwen-whu/WTW-Dataset| 需要进行转换后才能用[pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py)加载 |
 
 ## 1. PubTabNet数据集
 - **数据简介**:PubTabNet数据集的训练集合中包含50万张图像,验证集合中包含0.9万张图像。部分图像可视化如下所示。
@@ -31,3 +33,12 @@
+
+## 3. WTW中文场景表格数据集
+- **数据简介**:WTW中文场景表格数据集包含表格检测和表格数据两部分数据,数据集中同时包含扫描和拍照两种场景的图像。
+
+https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif
+
+
+
+
diff --git a/doc/doc_en/dataset/table_datasets_en.md b/doc/doc_en/dataset/table_datasets_en.md
index e30147909812a153f311add50f0bef5d1d1e0e32..70ca8309798994c6225ab0c10d4689da2387962b 100644
--- a/doc/doc_en/dataset/table_datasets_en.md
+++ b/doc/doc_en/dataset/table_datasets_en.md
@@ -3,6 +3,7 @@
 - [Dataset Summary](#dataset-summary)
 - [1. PubTabNet](#1-pubtabnet)
 - [2. TAL Table Recognition Competition Dataset](#2-tal-table-recognition-competition-dataset)
+- [3. WTW Chinese scene table dataset](#3-wtw-chinese-scene-table-dataset)
 
 Here are the commonly used table recognition datasets, which are being updated continuously. Welcome to contribute datasets~
 
@@ -12,6 +13,7 @@ Here are the commonly used table recognition datasets, which are being updated c
 |---|---|---|
 | PubTabNet |https://github.com/ibm-aur-nlp/PubTabNet| jsonl format, which can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
 | TAL Table Recognition Competition Dataset |https://ai.100tal.com/dataset| jsonl format, which can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
+| WTW Chinese scene table dataset |https://github.com/wangwen-whu/WTW-Dataset| Must be converted before it can be loaded with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
 
 ## 1. PubTabNet
 - **Data Introduction**:The training set of the PubTabNet dataset contains 500,000 images and the validation set contains 9000 images. Part of the image visualization is shown below.
@@ -30,3 +32,11 @@ Here are the commonly used table recognition datasets, which are being updated c
+
+## 3. WTW Chinese scene table dataset
+- **Data Introduction**:The WTW Chinese scene table dataset consists of two parts: table detection and table data. It contains both scanned and photographed images.
+https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif
+
+
+
+
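The dataset tables above describe PubTabNet and the TAL data as "jsonl format", meaning one JSON annotation per line. The sketch below shows how such a file could be read. It is only an illustration, not the loader in `pubtab_dataset.py`: the field names (`filename`, `html.structure.tokens`, `html.cells[].bbox`) are an assumption based on the published PubTabNet annotation layout, and `iter_table_annotations` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_table_annotations(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one simplified record per non-empty JSON line.

    Assumes a PubTabNet-style layout; this is not taken from the diff.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        html = record["html"]
        yield {
            "filename": record["filename"],
            # HTML structure tokens such as <tr>, <td>, </td>
            "structure": html["structure"]["tokens"],
            # Assumed: empty cells have no "bbox" key, so use None for them
            "bboxes": [cell.get("bbox") for cell in html["cells"]],
        }


# A made-up annotation line used only to exercise the helper.
sample = json.dumps({
    "filename": "example.png",
    "split": "train",
    "html": {
        "structure": {"tokens": ["<tr>", "<td>", "</td>", "<td>", "</td>", "</tr>"]},
        "cells": [
            {"tokens": ["a"], "bbox": [1, 2, 30, 40]},
            {"tokens": []},
        ],
    },
})
records = list(iter_table_annotations([sample, ""]))
```

In practice the same helper can be fed an open file object, since iterating a text file yields its lines. A WTW-to-jsonl converter would need to emit records in whatever layout the real loader expects, which this diff does not specify.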
diff --git a/ppstructure/table/README.md b/ppstructure/table/README.md
index 45c13565ee6d839f3b62a3adab1ece7941259f23..10308b4923ba581a835aa866bf7918e91599948d 100644
--- a/ppstructure/table/README.md
+++ b/ppstructure/table/README.md
@@ -63,7 +63,16 @@ After the operation is completed, the excel table of each image will be saved to
 In this chapter, we only introduce the training of the table structure model, For model training of [text detection](../../doc/doc_en/detection_en.md) and [text recognition](../../doc/doc_en/recognition_en.md), please refer to the corresponding documents
 
 * data preparation
-The training data uses public data set [PubTabNet](https://arxiv.org/abs/1911.10683 ), Can be downloaded from the official [website](https://github.com/ibm-aur-nlp/PubTabNet) 。The PubTabNet data set contains about 500,000 images, as well as annotations in html format。
+
+The Chinese and English models are trained on different data sources, as follows:
+
+English dataset: The training data uses the public dataset [PubTabNet](https://arxiv.org/abs/1911.10683), which can be downloaded from the official [website](https://github.com/ibm-aur-nlp/PubTabNet). The PubTabNet dataset contains about 500,000 images, as well as annotations in HTML format.
+
+Chinese dataset: The Chinese dataset consists of the following two parts, which are sampled at a 1:1 ratio during training.
+> 1. Generated dataset: 40,000 images generated with the [Table Generation Tool](https://github.com/WenmuZhou/TableGeneration).
+> 2. WTW subset: 10,000 images cropped from [WTW](https://github.com/wangwen-whu/WTW-Dataset).
+
+For a detailed introduction to the public datasets, please refer to [table_datasets](../../doc/doc_en/dataset/table_datasets_en.md). The following training and evaluation procedures use the English dataset as an example.
 
 * Start training
 *If you are installing the cpu version of paddle, please modify the `use_gpu` field in the configuration file to false*
diff --git a/ppstructure/table/README_ch.md b/ppstructure/table/README_ch.md
index 21fb7960cc85a5f42ae4daead23a6a43360a49c5..3f31c0106ba39758c0180f64d980480868a14358 100644
--- a/ppstructure/table/README_ch.md
+++ b/ppstructure/table/README_ch.md
@@ -75,7 +75,15 @@ note: 上述模型是在 PubLayNet 数据集上训练的表格识别模型,仅
 
 * 数据准备
 
-训练数据使用公开数据集PubTabNet ([论文](https://arxiv.org/abs/1911.10683),[下载地址](https://github.com/ibm-aur-nlp/PubTabNet))。PubTabNet数据集包含约50万张表格数据的图像,以及图像对应的html格式的注释。
+对于中文模型和英文模型,数据来源不同,分别介绍如下:
+
+英文数据集: 训练数据使用公开数据集PubTabNet ([论文](https://arxiv.org/abs/1911.10683),[下载地址](https://github.com/ibm-aur-nlp/PubTabNet))。PubTabNet数据集包含约50万张表格数据的图像,以及图像对应的html格式的注释。
+
+中文数据集: 中文数据集由下面两部分构成,这两部分按照1:1的采样比例进行训练。
+> 1. 生成数据集: 使用[表格生成工具](https://github.com/WenmuZhou/TableGeneration)生成4w张。
+> 2. 从[WTW](https://github.com/wangwen-whu/WTW-Dataset)中获取1w张。
+
+关于公开数据集的详细介绍可以参考 [table_datasets](../../doc/doc_ch/dataset/table_datasets.md),下述训练和评估流程均以英文数据集为例。
 
 * 启动训练
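The README change above says the two Chinese sources (40,000 generated images and 10,000 WTW crops) are trained with a 1:1 sampling ratio, so each source contributes equally per batch even though their sizes differ. The sketch below illustrates one way such ratio-based sampling can work. It is not taken from the repository: `sample_mixed_batch` is a hypothetical helper, and sampling with replacement is an assumption made here for simplicity.

```python
import random


def sample_mixed_batch(sources, ratios, batch_size, rng):
    """Draw a batch whose per-source share follows `ratios`.

    Each source is sampled with replacement, so a small source can fill
    its share of every batch. If batch_size is not divisible by the ratio
    total, the batch comes out slightly shorter than batch_size.
    """
    total = sum(ratios)
    batch = []
    for items, ratio in zip(sources, ratios):
        count = batch_size * ratio // total
        batch.extend(rng.choices(items, k=count))
    rng.shuffle(batch)  # avoid grouping samples by source inside the batch
    return batch


# Small stand-ins for the two sources; the 4:1 size gap mirrors 40,000 vs 10,000.
generated = [f"gen_{i}.jpg" for i in range(40)]
wtw = [f"wtw_{i}.jpg" for i in range(10)]

rng = random.Random(0)
batch = sample_mixed_batch([generated, wtw], [1, 1], 8, rng)
wtw_count = sum(name.startswith("wtw_") for name in batch)
```

With a 1:1 ratio and a batch size of 8, four samples come from each source. Because the WTW subset is a quarter the size of the generated set, each of its images is seen roughly four times as often per epoch of the larger set.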