Unverified · Commit cd5d9407 · Author: Evezerest · Committer: GitHub

Merge pull request #5616 from UncleLLD/add_chinese_handwriteen_dataset

add chinese handwriteen dataset scut-ept
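# Handwritten OCR datasets
This page collects commonly used handwriting datasets and is updated continuously. Contributions of additional datasets are welcome.
- [Institute of automation, Chinese Academy of Sciences - handwritten Chinese dataset](#中科院自动化研究所-手写中文数据集)
- [South China University of Technology - handwritten Chinese dataset](#华南理工大学-手写中文数据集)
- [NIST handwritten single character dataset - English](#NIST手写单字数据集-英文)
<a name="中科院自动化研究所-手写中文数据集"></a>
## Institute of automation, Chinese Academy of Sciences - handwritten Chinese dataset
- **Data source**: http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html
- **Data introduction**
  * Includes both online and offline handwriting data. `HWDB1.0~1.2` contains 3,895,135 isolated handwritten character samples in 7,356 classes (7,185 Chinese characters and 171 English letters, digits, and symbols); `HWDB2.0~2.2` contains 5,091 page images, segmented into 52,230 text lines and 1,349,414 characters. All character and text samples are stored as grayscale images. Some single-character samples are shown below.
![](../datasets/CASIA_0.jpg)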
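- **Download address**: http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html
- **Recommended usage**: The data consists of single characters on a white background, so text lines can be synthesized in large quantities for training. The white background can be made transparent, which makes it easy to add various backgrounds. When semantics matter, start from a real corpus and pick the matching single-character samples to compose each text line.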
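
The synthesis idea in the usage recommendation can be sketched roughly as follows. This is a minimal, pure-Python illustration, not the dataset's official tooling: glyphs are assumed to be already decoded into nested lists of grayscale values (0 = ink, 255 = white), and the names `white_to_transparent`, `compose_line`, `line_from_text`, and `glyph_bank` are hypothetical. A real pipeline would more likely use an imaging library and would first need to read the dataset's own file format, which is not shown here.

```python
import random
from typing import Dict, List, Tuple

Gray = List[List[int]]                        # rows of grayscale values, 255 = white
RGBA = List[List[Tuple[int, int, int, int]]]  # rows of (r, g, b, alpha) pixels
CLEAR = (255, 255, 255, 0)                    # fully transparent pixel


def white_to_transparent(glyph: Gray, threshold: int = 250) -> RGBA:
    """Make near-white background pixels transparent; keep ink opaque."""
    return [[(v, v, v, 0 if v >= threshold else 255) for v in row]
            for row in glyph]


def compose_line(glyphs: List[RGBA], gap: int = 1) -> RGBA:
    """Place glyphs left to right, top-aligned, on a transparent canvas."""
    height = max(len(g) for g in glyphs)
    rows: RGBA = [[] for _ in range(height)]
    for i, g in enumerate(glyphs):
        width = len(g[0])
        for y in range(height):
            if i > 0:
                rows[y].extend([CLEAR] * gap)  # spacing between characters
            # Pad glyphs shorter than the line with transparent pixels.
            rows[y].extend(g[y] if y < len(g) else [CLEAR] * width)
    return rows


def line_from_text(text: str, glyph_bank: Dict[str, List[Gray]],
                   rng: random.Random) -> RGBA:
    """Build one text line from a corpus sentence so the line keeps real
    semantics: for each character, pick one handwritten sample at random."""
    glyphs = [white_to_transparent(rng.choice(glyph_bank[ch])) for ch in text]
    return compose_line(glyphs)
```

Because the composed line carries an alpha channel, it can afterwards be blended over any background image. Sampling a different handwritten glyph per character on each call gives many distinct renderings of the same corpus sentence.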
<a name="华南理工大学-手写中文数据集"></a>
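## South China University of Technology - handwritten Chinese dataset (SCUT-EPT Dataset)
- **Data source**: https://github.com/HCIILAB/SCUT-EPT_Dataset_Release
- **Data introduction**
  * The `SCUT-EPT` dataset is suited to training models for handwritten document and character recognition. It was extracted from the examination papers of 2,986 volunteers and contains 50,000 text line images in 4,250 classes (4,033 common Chinese characters, 104 symbols, and 113 rare characters); a rare character is one that is not in the `CASIA-HWDB1.0-1.2` character set. The dataset contains 1,267,161 characters in total, about 25 per text line. Some sample images are shown below.
![](../datasets/SCUT_0.jpg)
- **Download address**: https://pan.baidu.com/s/1h4d1ogn_MAnE_X0LNHowYg
  * If the download link has expired, or to obtain the extraction password, see the [data source home page](https://github.com/HCIILAB/SCUT-EPT_Dataset_Release)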
<a name="NIST手写单字数据集-英文"></a>
## NIST handwritten single character dataset - English (NIST Handprinted Forms and Characters Database)
- **Data source**: [https://www.nist.gov/srd/nist-special-database-19](https://www.nist.gov/srd/nist-special-database-19)
......
# Handwritten OCR dataset
This page collects commonly used handwritten OCR datasets and is updated continuously. Contributions of additional datasets are welcome.
- [Institute of automation, Chinese Academy of Sciences - handwritten Chinese dataset](#Institute of automation, Chinese Academy of Sciences - handwritten Chinese dataset)
- [South China University of Technology - handwritten Chinese dataset](#South China University of Technology - handwritten Chinese dataset)
- [NIST handwritten single character dataset - English](#NIST handwritten single character dataset - English)
<a name="Institute of automation, Chinese Academy of Sciences - handwritten Chinese dataset"></a>
## Institute of automation, Chinese Academy of Sciences - handwritten Chinese dataset
- **Data source**: http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html
- **Data introduction**:
  * Includes both online and offline handwriting data. `HWDB1.0~1.2` contains 3,895,135 isolated handwritten character samples in 7,356 classes (7,185 Chinese characters and 171 English letters, digits, and symbols); `HWDB2.0~2.2` contains 5,091 page images, segmented into 52,230 text lines and 1,349,414 characters. All character and text samples are stored as grayscale images. Some single-character samples are shown below.
![](../datasets/CASIA_0.jpg)
- **Download address**: http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html
- **Recommended usage**: The data consists of single characters on a white background, so text lines can be synthesized in large quantities for training. The white background can be made transparent, which makes it easy to add various backgrounds. When semantics matter, start from a real corpus and pick the matching single-character samples to compose each text line.
<a name="South China University of Technology - handwritten Chinese dataset"></a>
## South China University of Technology - handwritten Chinese dataset
- **Data source**: https://github.com/HCIILAB/SCUT-EPT_Dataset_Release
- **Data introduction**:
  * The SCUT-EPT dataset contains 50,000 text line images selected from the examination papers of 2,986 volunteers. It covers 4,250 classes in total: 4,033 commonly used Chinese characters, 104 symbols, and 113 outlier Chinese characters, where an outlier is a Chinese character outside the character set of the popular CASIA-HWDB1.0-1.2. The dataset contains 1,267,161 character samples in total, about 25 per text line.
![](../datasets/SCUT_0.jpg)
- **Download address**: [https://pan.baidu.com/s/1h4d1ogn_MAnE_X0LNHowYg](https://pan.baidu.com/s/1h4d1ogn_MAnE_X0LNHowYg)
  * If the link has expired, or to obtain the unzip password, see the [GitHub project](https://github.com/HCIILAB/SCUT-EPT_Dataset_Release)
<a name="NIST handwritten single character dataset - English"></a>
## NIST handwritten single character dataset - English (NIST Handprinted Forms and Characters Database)
- **Data source**: [https://www.nist.gov/srd/nist-special-database-19](https://www.nist.gov/srd/nist-special-database-19)
......