This is a list of public datasets commonly used in OCR. It is updated continuously, and contributions of new datasets are welcome.
...
@@ -14,6 +15,38 @@ Here is a list of public datasets commonly used in OCR, which are being continuo
| ctw1500 | https://paddleocr.bj.bcebos.com/dataset/ctw1500.zip | Included in the downloaded image zip |
| total text | https://paddleocr.bj.bcebos.com/dataset/total_text.tar | Included in the downloaded image zip |
<a name="11"></a>
#### 1.1 ICDAR 2015
The icdar2015 dataset contains a training set of 1000 images and a test set of 500 images, all captured with wearable cameras. It can be downloaded from the link in the table above. Registration is required for downloading.
After registering and logging in, download the parts marked with the red box in the figure below. Save the content downloaded from `Training Set Images` as the folder `icdar_c4_train_imgs`, and the content downloaded from `Test Set Images` as the folder `ch4_test_images`.
Decompress the downloaded dataset into the working directory; the examples here assume it is decompressed under PaddleOCR/train_data/. Then download the PPOCR format annotation file from the table above.
PaddleOCR also provides a data format conversion script that converts the labels from the official website to the PPOCR format. The conversion tool is `ppocr/utils/gen_label.py`. The training set is used as an example here:
```
# Convert the label file downloaded from the official website to train_icdar2015_label.txt
@@ -20,33 +22,12 @@ This section uses the icdar2015 dataset as an example to introduce the training,
...
### 1.1 Data Preparation
The icdar2015 dataset contains a training set of 1000 images and a test set of 500 images, all captured with wearable cameras. It can be obtained from the [official website](https://rrc.cvc.uab.es/?ch=4&com=downloads). Registration is required for downloading.
### 1.1.1 Public dataset
Public datasets can be downloaded and prepared by referring to [ocr_datasets](./dataset/ocr_datasets_en.md).
### 1.1.2 Custom dataset
After registering and logging in, download the parts marked with the red box in the figure below. Save the content downloaded from `Training Set Images` as the folder `icdar_c4_train_imgs`, and the content downloaded from `Test Set Images` as the folder `ch4_test_images`.
The PaddleOCR text detection algorithm supports the following annotation file format, with fields separated by "\t":
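
As a rough illustration of a tab-separated label line, the sketch below builds and parses one. It assumes the layout used by PaddleOCR detection label files: an image path, a single "\t", then a JSON list in which each entry has a `transcription` string and a `points` list of `[x, y]` corners. The helper names `make_label_line` and `parse_label_line`, the image path, and the coordinates are hypothetical and only serve the example.

```python
import json


def make_label_line(image_path, boxes):
    """Build one label line: image path, a tab, then a JSON-encoded annotation list.

    `boxes` is a list of (text, points) pairs, where `points` is a list of
    [x, y] corner coordinates. The key names below are assumed, not taken
    from this excerpt.
    """
    annotations = [
        {"transcription": text, "points": points} for text, points in boxes
    ]
    # ensure_ascii=False keeps non-ASCII transcriptions readable in the file
    return image_path + "\t" + json.dumps(annotations, ensure_ascii=False)


def parse_label_line(line):
    """Split a label line back into (image_path, annotations)."""
    # Split on the first tab only, in case the JSON payload contains tabs
    image_path, payload = line.rstrip("\n").split("\t", 1)
    return image_path, json.loads(payload)


# Illustrative values only; not taken from a real annotation file
line = make_label_line(
    "ch4_test_images/example.jpg",
    [("HELLO", [[10, 20], [110, 20], [110, 60], [10, 60]])],
)
path, annotations = parse_label_line(line)
```

Splitting on the first tab only keeps the parser robust if a transcription itself happens to contain a tab character.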
Decompress the downloaded dataset into the working directory; the examples here assume it is decompressed under PaddleOCR/train_data/. In addition, PaddleOCR consolidates the many scattered annotation files into two annotation files, one for training and one for testing, which can be downloaded with wget: