ocr_datasets_en.md 7.8 KB
Newer Older
文幕地方's avatar
文幕地方 已提交
1
# OCR datasets
文幕地方's avatar
文幕地方 已提交
2

文幕地方's avatar
文幕地方 已提交
3 4 5 6 7 8 9 10 11
- [1. Text detection](#1-text-detection)
  - [1.1 PaddleOCR text detection format annotation](#11-paddleocr-text-detection-format-annotation)
  - [1.2 Public dataset](#12-public-dataset)
    - [1.2.1 ICDAR 2015](#121-icdar-2015)
- [2. Text recognition](#2-text-recognition)
  - [2.1 PaddleOCR text recognition format annotation](#21-paddleocr-text-recognition-format-annotation)
  - [2.2 Public dataset](#22-public-dataset)
    - [2.1 ICDAR 2015](#21-icdar-2015)
- [3. Data storage path](#3-data-storage-path)
文幕地方's avatar
文幕地方 已提交
12 13 14

Here is a list of public datasets commonly used in OCR, which are being continuously updated. Welcome to contribute datasets~

文幕地方's avatar
文幕地方 已提交
15
## 1. Text detection
文幕地方's avatar
文幕地方 已提交
16

文幕地方's avatar
文幕地方 已提交
17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
### 1.1 PaddleOCR text detection format annotation

The annotation file formats supported by the PaddleOCR text detection algorithm are as follows, separated by "\t":
```
" Image file name             Image annotation information encoded by json.dumps"
ch4_test_images/img_61.jpg    [{"transcription": "MASA", "points": [[310, 104], [416, 141], [418, 216], [312, 179]]}, {...}]
```
The image annotation after **json.dumps()** encoding is a list containing multiple dictionaries.

The `points` in the dictionary represent the coordinates (x, y) of the four points of the text box, arranged clockwise from the point at the upper left corner.

`transcription` represents the text of the current text box. **When its content is "###" it means that the text box is invalid and will be skipped during training.**

If you want to train PaddleOCR on other datasets, please build the annotation file according to the above format.

### 1.2 Public dataset
| dataset | Image download link | PaddleOCR format annotation download link |
文幕地方's avatar
文幕地方 已提交
34 35 36
|---|---|---|
| ICDAR 2015 | https://rrc.cvc.uab.es/?ch=4&com=downloads            | [train](https://paddleocr.bj.bcebos.com/dataset/train_icdar2015_label.txt) / [test](https://paddleocr.bj.bcebos.com/dataset/test_icdar2015_label.txt) |
| ctw1500 | https://paddleocr.bj.bcebos.com/dataset/ctw1500.zip   | Included in the downloaded image zip                                                                                                           |
文幕地方's avatar
文幕地方 已提交
37
| total text | https://paddleocr.bj.bcebos.com/dataset/total_text.tar |  Included in the downloaded image zip  |
文幕地方's avatar
文幕地方 已提交
38

文幕地方's avatar
文幕地方 已提交
39
#### 1.2.1 ICDAR 2015
文幕地方's avatar
文幕地方 已提交
40 41 42 43 44 45 46 47 48 49

The icdar2015 dataset contains train set which has 1000 images obtained with wearable cameras and test set which has 500 images obtained with wearable cameras. The icdar2015 dataset can be downloaded from the link in the table above. Registration is required for downloading.


After registering and logging in, download the part marked in the red box in the figure below. And, the content downloaded by `Training Set Images` should be saved as the folder `icdar_c4_train_imgs`, and the content downloaded by `Test Set Images` is saved as the folder `ch4_test_images`

<p align="center">
 <img src="../../datasets/ic15_location_download.png" align="middle" width = "700"/>
<p align="center">

文幕地方's avatar
文幕地方 已提交
50
Decompress the downloaded dataset to the working directory, assuming it is decompressed under PaddleOCR/train_data/. Then download the PaddleOCR format annotation file from the table above.
文幕地方's avatar
文幕地方 已提交
51

文幕地方's avatar
文幕地方 已提交
52
PaddleOCR also provides a data format conversion script, which can convert the official website label to the PaddleOCR format. The data conversion tool is in `ppocr/utils/gen_label.py`, here is the training set as an example:
文幕地方's avatar
文幕地方 已提交
53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69
```
# Convert the label file downloaded from the official website to train_icdar2015_label.txt
python gen_label.py --mode="det" --root_path="/path/to/icdar_c4_train_imgs/"  \
                    --input_path="/path/to/ch4_training_localization_transcription_gt" \
                    --output_label="/path/to/train_icdar2015_label.txt"
```

After decompressing the data set and downloading the annotation file, PaddleOCR/train_data/ has two folders and two files, which are:
```
/PaddleOCR/train_data/icdar2015/text_localization/
  └─ icdar_c4_train_imgs/         Training data of icdar dataset
  └─ ch4_test_images/             Testing data of icdar dataset
  └─ train_icdar2015_label.txt    Training annotation of icdar dataset
  └─ test_icdar2015_label.txt     Test annotation of icdar dataset
```


文幕地方's avatar
文幕地方 已提交
70 71 72 73 74 75 76 77 78 79
## 2. Text recognition

### 2.1 PaddleOCR text recognition format annotation

The text recognition algorithm in PaddleOCR supports two data formats:
 - `lmdb` is used to train data sets stored in lmdb format, use [lmdb_dataset.py](../../../ppocr/data/lmdb_dataset.py) to load;
 - `通用数据` is used to train data sets stored in text files, use [simple_dataset.py](../../../ppocr/data/simple_dataset.py) to load.


If you want to use your own data for training, please refer to the following to organize your data.
文幕地方's avatar
文幕地方 已提交
80

文幕地方's avatar
文幕地方 已提交
81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
- Training set

It is recommended to put the training images in the same folder, and use a txt file (rec_gt_train.txt) to store the image path and label. The contents of the txt file are as follows:

* Note: by default, the image path and image label are split with \t, if you use other methods to split, it will cause training error

```
" Image file name           Image annotation "

train_data/rec/train/word_001.jpg   简单可依赖
train_data/rec/train/word_002.jpg   用科技让复杂的世界更简单
...
```

The final training set should have the following file structure:

```
|-train_data
  |-rec
    |- rec_gt_train.txt
    |- train
        |- word_001.png
        |- word_002.jpg
        |- word_003.jpg
        | ...
```

- Test set

Similar to the training set, the test set also needs to be provided a folder containing all images (test) and a rec_gt_test.txt. The structure of the test set is as follows:

```
|-train_data
  |-rec
    |-ic15_data
        |- rec_gt_test.txt
        |- test
            |- word_001.jpg
            |- word_002.jpg
            |- word_003.jpg
            | ...
```

### 2.2 Public dataset
| dataset | Image download link | PaddleOCR format annotation download link |
文幕地方's avatar
文幕地方 已提交
126 127
|---|---|---|
| en benchmark(MJ, SJ, IIIT, SVT, IC03, IC13, IC15, SVTP, and CUTE.) | [DTRB](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here) | LMDB format, which can be loaded directly with [lmdb_dataset.py](../../../ppocr/data/lmdb_dataset.py) |
文幕地方's avatar
文幕地方 已提交
128 129 130
|ICDAR 2015| http://rrc.cvc.uab.es/?ch=4&com=downloads | [train](https://paddleocr.bj.bcebos.com/dataset/rec_gt_train.txt)/ [test](https://paddleocr.bj.bcebos.com/dataset/rec_gt_test.txt) |
| Multilingual datasets |[Baidu network disk](https://pan.baidu.com/s/1bS_u207Rm7YbY33wOECKDA) Extraction code: frgi <br> [google drive](https://drive.google.com/file/d/18cSWX7wXSy4G0tbKJ0d9PuIaiwRLHpjA/view) | Included in the downloaded image zip |

文幕地方's avatar
文幕地方 已提交
131
#### 2.1 ICDAR 2015
文幕地方's avatar
文幕地方 已提交
132

文幕地方's avatar
文幕地方 已提交
133
The ICDAR 2015 dataset can be downloaded from the link in the table above for quick validation. The lmdb format dataset required by en benchmark can also be downloaded from the table above.
文幕地方's avatar
文幕地方 已提交
134 135 136 137 138 139 140 141 142 143 144 145 146 147

Then download the PaddleOCR format annotation file from the table above.

PaddleOCR also provides a data format conversion script, which can convert the ICDAR official website label to the data format supported by PaddleOCR. The data conversion tool is in `ppocr/utils/gen_label.py`, here is the training set as an example:

```
# Convert the label file downloaded from the official website to rec_gt_label.txt
python gen_label.py --mode="rec" --input_path="{path/of/origin/label}" --output_label="rec_gt_label.txt"
```

The data format is as follows, (a) is the original picture, (b) is the Ground Truth text file corresponding to each picture:

![](../../datasets/icdar_rec.png)

文幕地方's avatar
文幕地方 已提交
148
## 3. Data storage path
文幕地方's avatar
文幕地方 已提交
149 150 151 152 153 154 155 156 157

The default storage path for PaddleOCR training data is `PaddleOCR/train_data`, if you already have a dataset on your disk, just create a soft link to the dataset directory:

```
# linux and mac os
ln -sf <path/to/dataset> <path/to/paddle_ocr>/train_data/dataset
# windows
mklink /d <path/to/paddle_ocr>/train_data/dataset <path/to/dataset>
```