Unverified commit d587b7ac, author: MissPenguin, committer: GitHub

Merge pull request #7306 from littletomatodonkey/dyg/fix_dep

fix dep
# Key information extraction datasets
Here are the common key information extraction datasets, which are updated continuously. Contributions of new datasets are welcome.
- [FUNSD dataset](#funsd)
- [XFUND dataset](#xfund)
......
## Key Information Extraction dataset
Here are the common key information extraction datasets, which are being updated continuously. Welcome to contribute datasets.
- [FUNSD dataset](#funsd)
- [XFUND dataset](#xfund)
- [wildreceipt dataset](#wildreceipt-dataset)
<a name="funsd"></a>
#### 1. FUNSD dataset
@@ -20,7 +21,8 @@ Here are the common DocVQA datasets, which are being updated continuously. Welco
<a name="xfund"></a>
#### 2. XFUND dataset
- **Data source**: https://github.com/doc-analysis/XFUND
- **Data introduction**: XFUND is a multilingual form understanding dataset that covers 7 languages, with all forms manually annotated as key-value pairs. Each language has 199 forms, split into 149 for training and 50 for testing. Some images and their annotation boxes are shown below.
<div align="center">
<img src="../../datasets/xfund_demo/gt_zh_train_0.jpg" width="500">
<img src="../../datasets/xfund_demo/gt_zh_train_1.jpg" width="500">
......
## Layout Analysis Dataset
Here are the common layout analysis datasets, which are being updated continuously. Welcome to contribute datasets.
- [PubLayNet dataset](#publaynet)
- [CDLA dataset](#CDLA)
- [TableBank dataset](#TableBank)
Most layout analysis datasets are object detection datasets. In addition to the open source datasets, you can also label or synthesize your own data using tools such as [labelme](https://github.com/wkentaro/labelme).
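Because these datasets are treated as object detection data, a region labeled as a polygon or rectangle usually has to be reduced to a bounding box before training. A minimal sketch, assuming each labeled region is available as a list of `[x, y]` points (the layout labelme uses for a shape's `points`) and that the target is a COCO-style `[x, y, width, height]` box; the function name is hypothetical:

```python
def points_to_coco_bbox(points):
    """Return the axis-aligned box [x, y, width, height] enclosing the points.

    `points` is a list of [x, y] pairs, e.g. the `points` field of one
    labelme shape (an assumption about the input layout).
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


# A rectangular text region given by its four corners.
region = [[10, 20], [110, 20], [110, 70], [10, 70]]
bbox = points_to_coco_bbox(region)
print(bbox)
```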
<a name="publaynet"></a>
#### 1. PubLayNet dataset
- **Data source**: https://github.com/ibm-aur-nlp/PubLayNet
- **Data introduction**: The PubLayNet dataset contains 350000 training images and 11000 validation images, annotated with 5 categories: `text, title, list, table, figure`. Some images and their annotations are shown below.
<div align="center">
<img src="../../datasets/publaynet_demo/gt_PMC3724501_00006.jpg" width="500">
<img src="../../datasets/publaynet_demo/gt_PMC5086060_00002.jpg" width="500">
</div>
- **Download address**: https://developer.ibm.com/exchanges/data/all/publaynet/
- **Note**: When using this dataset, you need to follow the [CDLA-Permissive](https://cdla.io/permissive-1-0/) license.
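To check how the 5 categories are distributed after downloading, you can count boxes per category. This is a rough sketch that assumes the annotations follow the COCO-style JSON layout (a `categories` list and an `annotations` list with `category_id`); the tiny in-memory sample below is made up for illustration and is not real PubLayNet data:

```python
from collections import Counter


def count_boxes_per_category(coco):
    """Count annotations per category name in a COCO-style dict."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
    return dict(counts)


# Hypothetical miniature annotation file, not real data.
sample = {
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 4, "name": "table"},
    ],
    "annotations": [
        {"image_id": 0, "category_id": 1, "bbox": [0, 0, 10, 10]},
        {"image_id": 0, "category_id": 1, "bbox": [0, 20, 10, 10]},
        {"image_id": 0, "category_id": 4, "bbox": [0, 40, 50, 30]},
    ],
}
category_counts = count_boxes_per_category(sample)
print(category_counts)
```

For a real file, the same function can be applied to the dict returned by `json.load` on the annotation JSON.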
<a name="CDLA"></a>
#### 2. CDLA dataset
- **Data source**: https://github.com/buptlihang/CDLA
- **Data introduction**: The CDLA dataset contains 5000 training images and 1000 validation images with 10 categories: `Text, Title, Figure, Figure caption, Table, Table caption, Header, Footer, Reference, Equation`. Some images and their annotations are shown below.
<div align="center">
<img src="../../datasets/CDLA_demo/val_0633.jpg" width="500">
<img src="../../datasets/CDLA_demo/val_0941.jpg" width="500">
</div>
- **Download address**: https://github.com/buptlihang/CDLA
- **Note**: When training a detection model on the CDLA dataset with [PaddleDetection](https://github.com/PaddlePaddle/PaddleDetection/tree/develop), you need to remove the labels `__ignore__` and `_background_`.
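One way to drop the two labels is to filter them out when building the category list for the detector. The sketch below assumes the label names are available as a plain list of strings and that category ids start at 1; both the function name and the id scheme are assumptions, so adapt them to your own conversion script:

```python
# Labels that must not become detection categories.
IGNORED_LABELS = {"__ignore__", "_background_"}


def build_categories(label_names):
    """Drop the ignored labels and assign contiguous ids to the rest.

    Ids start at 1 here; this numbering is an assumption, not a
    requirement taken from the dataset or from PaddleDetection.
    """
    kept = [name for name in label_names if name not in IGNORED_LABELS]
    return [{"id": i, "name": name} for i, name in enumerate(kept, start=1)]


labels = ["__ignore__", "_background_", "Text", "Title", "Figure"]
categories = build_categories(labels)
print(categories)
```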
<a name="TableBank"></a>
#### 3. TableBank dataset
- **Data source**: https://doc-analysis.github.io/tablebank-page/index.html
- **Data introduction**: The TableBank dataset contains 2 types of documents: Latex (187199 training images, 7265 validation images and 5719 testing images) and Word (73383 training images, 2735 validation images and 2281 testing images). Some images and their annotations are shown below.
<div align="center">
<img src="../../datasets/tablebank_demo/004.png" height="700">
<img src="../../datasets/tablebank_demo/005.png" height="700">
</div>
- **Download address**: https://doc-analysis.github.io/tablebank-page/index.html
- **Note**: When using this dataset, you need to follow the [Apache-2.0](https://github.com/doc-analysis/TableBank/blob/master/LICENSE) license.
sentencepiece
yacs
seqeval
git+https://github.com/PaddlePaddle/PaddleNLP
pypandoc
attrdict
python_docx
https://paddleocr.bj.bcebos.com/ppstructure/whl/paddlenlp-2.3.0.dev0-py3-none-any.whl