From d69b74e4345e57ccb615e2132a9d414ac1721940 Mon Sep 17 00:00:00 2001
From: WenmuZhou <572459439@qq.com>
Date: Tue, 16 Aug 2022 07:45:51 +0000
Subject: [PATCH] add dataset desc

---
 doc/doc_ch/dataset/table_datasets.md    | 11 +++++++++++
 doc/doc_en/dataset/table_datasets_en.md | 10 ++++++++++
 ppstructure/table/README.md             | 11 ++++++++++-
 ppstructure/table/README_ch.md          | 10 +++++++++-
 4 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/doc/doc_ch/dataset/table_datasets.md b/doc/doc_ch/dataset/table_datasets.md
index ae902b23..58f4cf47 100644
--- a/doc/doc_ch/dataset/table_datasets.md
+++ b/doc/doc_ch/dataset/table_datasets.md
@@ -3,6 +3,7 @@
 - [数据集汇总](#数据集汇总)
 - [1. PubTabNet数据集](#1-pubtabnet数据集)
 - [2. 好未来表格识别竞赛数据集](#2-好未来表格识别竞赛数据集)
+- [3. WTW中文场景表格数据集](#3-wtw中文场景表格数据集)
 
 这里整理了常用表格识别数据集,持续更新中,欢迎各位小伙伴贡献数据集~
 
@@ -12,6 +13,7 @@
 |---|---|---|
 | PubTabNet |https://github.com/ibm-aur-nlp/PubTabNet| jsonl格式,可直接用[pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py)加载 |
 | 好未来表格识别竞赛数据集 |https://ai.100tal.com/dataset| jsonl格式,可直接用[pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py)加载 |
+| WTW中文场景表格数据集 |https://github.com/wangwen-whu/WTW-Dataset| 需要进行转换后才能用[pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py)加载 |
 
 ## 1. PubTabNet数据集
 - **数据简介**:PubTabNet数据集的训练集合中包含50万张图像,验证集合中包含0.9万张图像。部分图像可视化如下所示。
@@ -31,3 +33,12 @@
+
+## 3. WTW中文场景表格数据集
+- **数据简介**:WTW中文场景表格数据集包含表格检测和表格数据两部分数据,数据集中同时包含扫描和拍照两种场景的图像。
+
+https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif
+
+
+
+
diff --git a/doc/doc_en/dataset/table_datasets_en.md b/doc/doc_en/dataset/table_datasets_en.md
index e3014790..70ca8309 100644
--- a/doc/doc_en/dataset/table_datasets_en.md
+++ b/doc/doc_en/dataset/table_datasets_en.md
@@ -3,6 +3,7 @@
 - [Dataset Summary](#dataset-summary)
 - [1. PubTabNet](#1-pubtabnet)
 - [2. TAL Table Recognition Competition Dataset](#2-tal-table-recognition-competition-dataset)
+- [3. WTW Chinese scene table dataset](#3-wtw-chinese-scene-table-dataset)
 
 Here are the commonly used table recognition datasets, which are being updated continuously. Welcome to contribute datasets~
 
@@ -12,6 +13,7 @@ Here are the commonly used table recognition datasets, which are being updated c
 |---|---|---|
 | PubTabNet |https://github.com/ibm-aur-nlp/PubTabNet| jsonl format, which can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
 | TAL Table Recognition Competition Dataset |https://ai.100tal.com/dataset| jsonl format, which can be loaded directly with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
+| WTW Chinese scene table dataset |https://github.com/wangwen-whu/WTW-Dataset| Must be converted before it can be loaded with [pubtab_dataset.py](../../../ppocr/data/pubtab_dataset.py) |
 
 ## 1. PubTabNet
 - **Data Introduction**:The training set of the PubTabNet dataset contains 500,000 images and the validation set contains 9000 images. Part of the image visualization is shown below.
@@ -30,3 +32,11 @@ Here are the commonly used table recognition datasets, which are being updated c
+
+## 3. WTW Chinese scene table dataset
+- **Data Introduction**:The WTW Chinese scene table dataset consists of two parts: table detection and table data. The dataset contains images from two scenarios: scanned and photographed.
+https://github.com/wangwen-whu/WTW-Dataset/blob/main/demo/20210816_210413.gif
+
+
+
+
diff --git a/ppstructure/table/README.md b/ppstructure/table/README.md
index 45c13565..10308b49 100644
--- a/ppstructure/table/README.md
+++ b/ppstructure/table/README.md
@@ -63,7 +63,16 @@ After the operation is completed, the excel table of each image will be saved to
 In this chapter, we only introduce the training of the table structure model, For model training of [text detection](../../doc/doc_en/detection_en.md) and [text recognition](../../doc/doc_en/recognition_en.md), please refer to the corresponding documents
 
 * data preparation
-The training data uses public data set [PubTabNet](https://arxiv.org/abs/1911.10683 ), Can be downloaded from the official [website](https://github.com/ibm-aur-nlp/PubTabNet) 。The PubTabNet data set contains about 500,000 images, as well as annotations in html format。
+
+The Chinese model and the English model are trained on different data sources, as follows:
+
+English dataset: The training data uses the public dataset [PubTabNet](https://arxiv.org/abs/1911.10683), which can be downloaded from the official [website](https://github.com/ibm-aur-nlp/PubTabNet). The PubTabNet dataset contains about 500,000 images, as well as annotations in HTML format.
+
+Chinese dataset: The Chinese dataset consists of the following two parts, which are sampled at a 1:1 ratio during training.
+> 1. Generated dataset: 40,000 images generated with the [Table Generation Tool](https://github.com/WenmuZhou/TableGeneration).
+> 2. Cropped dataset: 10,000 images cropped from [WTW](https://github.com/wangwen-whu/WTW-Dataset).
+
+For a detailed introduction to the public datasets, please refer to [table_datasets](../../doc/doc_en/dataset/table_datasets_en.md). The following training and evaluation procedures use the English dataset as an example.
 
 * Start training
 *If you are installing the cpu version of paddle, please modify the `use_gpu` field in the configuration file to false*
diff --git a/ppstructure/table/README_ch.md b/ppstructure/table/README_ch.md
index 21fb7960..3f31c010 100644
--- a/ppstructure/table/README_ch.md
+++ b/ppstructure/table/README_ch.md
@@ -75,7 +75,15 @@ note: 上述模型是在 PubLayNet 数据集上训练的表格识别模型,仅
 
 * 数据准备
 
-训练数据使用公开数据集PubTabNet ([论文](https://arxiv.org/abs/1911.10683),[下载地址](https://github.com/ibm-aur-nlp/PubTabNet))。PubTabNet数据集包含约50万张表格数据的图像,以及图像对应的html格式的注释。
+对于中文模型和英文模型,数据来源不同,分别介绍如下:
+
+英文数据集: 训练数据使用公开数据集PubTabNet ([论文](https://arxiv.org/abs/1911.10683),[下载地址](https://github.com/ibm-aur-nlp/PubTabNet))。PubTabNet数据集包含约50万张表格数据的图像,以及图像对应的html格式的注释。
+
+中文数据集: 中文数据集由下面两部分构成,这两部分按照1:1的采样比例进行训练。
+> 1. 生成数据集: 使用[表格生成工具](https://github.com/WenmuZhou/TableGeneration)生成4w张。
+> 2. 从[WTW](https://github.com/wangwen-whu/WTW-Dataset)中获取1w张。
+
+关于公开数据集的详细介绍可以参考 [table_datasets](../../doc/doc_ch/dataset/table_datasets.md),下述训练和评估流程均以英文数据集为例。
 
 * 启动训练
 
-- 
GitLab
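
Note on the 1:1 sampling ratio mentioned in the patch: the two Chinese sources differ in size (40,000 generated images and 10,000 WTW crops), so a 1:1 ratio presumably means each source contributes equally to training rather than in proportion to its size. The sketch below illustrates that reading only. It is not PaddleOCR's data loader; the function name `sample_balanced_batches` and the file names are hypothetical.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def sample_balanced_batches(
    generated: Sequence[str],
    wtw_crops: Sequence[str],
    batch_size: int,
    num_batches: int,
    seed: int = 0,
) -> Iterator[List[Tuple[str, str]]]:
    """Yield batches that draw half of their items from each source.

    Hypothetical helper: assumes "1:1" means equal draws per batch,
    independent of how many images each source contains.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for an exact 1:1 split")
    rng = random.Random(seed)
    half = batch_size // 2
    for _ in range(num_batches):
        # Sample with replacement so the smaller source is never exhausted.
        batch = [("generated", rng.choice(generated)) for _ in range(half)]
        batch += [("wtw", rng.choice(wtw_crops)) for _ in range(half)]
        rng.shuffle(batch)
        yield batch


# Placeholder file names standing in for the two image sets.
generated_images = [f"gen_{i}.jpg" for i in range(40000)]
wtw_images = [f"wtw_{i}.jpg" for i in range(10000)]
batches = list(sample_balanced_batches(generated_images, wtw_images, 8, 5))
```

Under this reading, each WTW crop is seen about four times as often as each generated image, because the smaller set supplies the same number of samples per batch.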