From 248669a81e711e87f95b782042dd4cab3acaf7db Mon Sep 17 00:00:00 2001
From: WenmuZhou
Date: Wed, 3 Feb 2021 11:58:24 +0800
Subject: [PATCH] update rec doc

---
 doc/doc_ch/recognition.md    | 140 ++++++++++++++++++-----------------
 doc/doc_en/recognition_en.md | 107 +++++++++++++-------------
 2 files changed, 131 insertions(+), 116 deletions(-)

diff --git a/doc/doc_ch/recognition.md b/doc/doc_ch/recognition.md
index c5f459bd..c2b61a28 100644
--- a/doc/doc_ch/recognition.md
+++ b/doc/doc_ch/recognition.md
@@ -1,60 +1,90 @@
 ## 文字识别

-- [一、数据准备](#数据准备)
-    - [数据下载](#数据下载)
-    - [自定义数据集](#自定义数据集)
-    - [字典](#字典)
-    - [支持空格](#支持空格)
+- [1 数据准备](#数据准备)
+    - [1.1 自定义数据集](#自定义数据集)
+    - [1.2 数据下载](#数据下载)
+    - [1.3 字典](#字典)
+    - [1.4 支持空格](#支持空格)

-- [二、启动训练](#启动训练)
-    - [1. 数据增强](#数据增强)
-    - [2. 训练](#训练)
-    - [3. 小语种](#小语种)
+- [2 启动训练](#启动训练)
+    - [2.1 数据增强](#数据增强)
+    - [2.2 训练](#训练)
+    - [2.3 小语种](#小语种)

-- [三、评估](#评估)
+- [3 评估](#评估)

-- [四、预测](#预测)
-    - [1. 训练引擎预测](#训练引擎预测)
+- [4 预测](#预测)
+    - [4.1 训练引擎预测](#训练引擎预测)

-### 数据准备
+### 1. 数据准备

-PaddleOCR 支持两种数据格式: `lmdb` 用于训练公开数据,调试算法; `通用数据` 训练自己的数据:
-
-请按如下步骤设置数据集:
+PaddleOCR 支持两种数据格式:
+ - `lmdb` 用于训练以lmdb格式存储的数据集;
+ - `通用数据` 用于训练以文本文件存储的数据集。

 训练数据的默认存储路径是 `PaddleOCR/train_data`,如果您的磁盘上已有数据集,只需创建软链接至数据集目录:

 ```
+# linux and mac os
 ln -sf <path/to/dataset> <path/to/paddle_ocr>/train_data/dataset
+# windows
+mklink /d <path/to/paddle_ocr>/train_data/dataset <path/to/dataset>
 ```
-
-* 数据下载
-
-若您本地没有数据集,可以在官网下载 [icdar2015](http://rrc.cvc.uab.es/?ch=4&com=downloads) 数据,用于快速验证。也可以参考[DTRB](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here),下载 benchmark 所需的lmdb格式数据集。
+
+#### 1.1 自定义数据集
+下面以通用数据集为例, 介绍如何准备数据集:

-
-* 使用自己数据集
+* 训练集

-若您希望使用自己的数据进行训练,请参考下文组织您的数据。
+建议将训练图片放入同一个文件夹,并用一个txt文件(rec_gt_train.txt)记录图片路径和标签,txt文件里的内容如下:

-- 训练集
+**注意:** txt文件中默认请将图片路径和图片标签用 \t 分割,如用其他方式分割将造成训练报错。

-首先请将训练图片放入同一个文件夹(train_images),并用一个txt文件(rec_gt_train.txt)记录图片路径和标签。
+```
+" 图像文件名                 图像标注信息 "

-**注意:** 默认请将图片路径和图片标签用 \t 分割,如用其他方式分割将造成训练报错
+train_data/train/word_001.jpg   简单可依赖
+train_data/train/word_002.jpg   用科技让复杂的世界更简单
+...
+```
+最终训练集应有如下文件结构:
+```
+|-train_data
+  |- rec_gt_train.txt
+  |- train
+    |- word_001.png
+    |- word_002.jpg
+    |- word_003.jpg
+    | ...
 ```
-" 图像文件名        图像标注信息 "
-train_data/train_0001.jpg   简单可依赖
-train_data/train_0002.jpg   用科技让复杂的世界更简单
+- 测试集
+
+同训练集类似,测试集也需要提供一个包含所有图片的文件夹(test)和一个rec_gt_test.txt,测试集的结构如下所示:
+
+```
+|-train_data
+  |- rec_gt_test.txt
+  |- test
+    |- word_001.jpg
+    |- word_002.jpg
+    |- word_003.jpg
+    | ...
 ```
-PaddleOCR 提供了一份用于训练 icdar2015 数据集的标签文件,通过以下方式下载:
+
+
+#### 1.2 数据下载
+
+若您本地没有数据集,可以在官网下载 [icdar2015](http://rrc.cvc.uab.es/?ch=4&com=downloads) 数据,用于快速验证。也可以参考[DTRB](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here) ,下载 benchmark 所需的lmdb格式数据集。
+
+如果你使用的是icdar2015的公开数据集,PaddleOCR 提供了一份用于训练 icdar2015 数据集的标签文件,通过以下方式下载:

 ```
 # 训练集标签
 wget -P ./train_data/ic15_data  https://paddleocr.bj.bcebos.com/dataset/rec_gt_train.txt
 # 测试集标签
 wget -P ./train_data/ic15_data  https://paddleocr.bj.bcebos.com/dataset/rec_gt_test.txt
 ```

 PaddleOCR 也提供了数据格式转换脚本,可以将官网 label 转换为 PaddleOCR 支持的数据格式:

 ```
 python gen_label.py --mode="rec" --input_path="{path/of/origin/label}" --output_label="rec_gt_label.txt"
 ```

-最终训练集应有如下文件结构:
-```
-|-train_data
-  |-ic15_data
-    |- rec_gt_train.txt
-    |- train
-      |- word_001.png
-      |- word_002.jpg
-      |- word_003.jpg
-      | ...
-```
-
-- 测试集
-
-同训练集类似,测试集也需要提供一个包含所有图片的文件夹(test)和一个rec_gt_test.txt,测试集的结构如下所示:
-
-```
-|-train_data
-  |-ic15_data
-    |- rec_gt_test.txt
-    |- test
-      |- word_001.jpg
-      |- word_002.jpg
-      |- word_003.jpg
-      | ...
-``` -- 字典 +1.3 字典 最后需要提供一个字典({word_dict_name}.txt),使模型在训练时,可以将所有出现的字符映射为字典的索引。 @@ -114,6 +118,10 @@ n word_dict.txt 每行有一个单字,将字符与数字索引映射在一起,“and” 将被映射成 [2 5 1] +* 内置字典 + +PaddleOCR内置了一部分字典,可以按需使用。 + `ppocr/utils/ppocr_keys_v1.txt` 是一个包含6623个字符的中文字典 `ppocr/utils/ic15_dict.txt` 是一个包含36个字符的英文字典 @@ -129,7 +137,7 @@ word_dict.txt 每行有一个单字,将字符与数字索引映射在一起, `ppocr/utils/dict/en_dict.txt` 是一个包含63个字符的英文字典 -您可以按需使用。 + 目前的多语言模型仍处在demo阶段,会持续优化模型并补充语种,**非常欢迎您为我们提供其他语言的字典和字体**, 如您愿意可将字典文件提交至 [dict](../../ppocr/utils/dict) 将语料文件提交至[corpus](../../ppocr/utils/corpus),我们会在Repo中感谢您。 @@ -140,13 +148,13 @@ word_dict.txt 每行有一个单字,将字符与数字索引映射在一起, 并将 `character_type` 设置为 `ch`。 -- 添加空格类别 +1.4 添加空格类别 如果希望支持识别"空格"类别, 请将yml文件中的 `use_space_char` 字段设置为 `True`。 -### 启动训练 +### 2. 启动训练 PaddleOCR提供了训练脚本、评估脚本和预测脚本,本节将以 CRNN 识别模型为例: @@ -171,7 +179,7 @@ tar -xf rec_mv3_none_bilstm_ctc_v2.0_train.tar && rm -rf rec_mv3_none_bilstm_ctc python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/rec/rec_icdar15_train.yml ``` -- 数据增强 +#### 2.1 数据增强 PaddleOCR提供了多种数据增强方式,如果您希望在训练时加入扰动,请在配置文件中设置 `distort: true`。 @@ -182,7 +190,7 @@ PaddleOCR提供了多种数据增强方式,如果您希望在训练时加入 *由于OpenCV的兼容性问题,扰动操作暂时只支持Linux* -- 训练 +#### 2.2 训练 PaddleOCR支持训练和评估交替进行, 可以在 `configs/rec/rec_icdar15_train.yml` 中修改 `eval_batch_step` 设置评估频率,默认每500个iter评估一次。评估过程中默认将最佳acc模型,保存为 `output/rec_CRNN/best_accuracy` 。 @@ -268,7 +276,7 @@ Eval: **注意,预测/评估时的配置文件请务必与训练一致。** -- 小语种 +#### 2.3 小语种 PaddleOCR目前已支持26种(除中文外)语种识别,`configs/rec/multi_languages` 路径下提供了一个多语言的配置文件模版: [rec_multi_language_lite_train.yml](../../configs/rec/multi_language/rec_multi_language_lite_train.yml)。 @@ -411,7 +419,7 @@ Eval: ... ``` -### 评估 +### 3 评估 评估数据集可以通过 `configs/rec/rec_icdar15_train.yml` 修改Eval中的 `label_file_path` 设置。 @@ -421,10 +429,10 @@ python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/rec ``` -### 预测 +### 4 预测 -* 训练引擎的预测 +#### 4.1 训练引擎的预测 使用 PaddleOCR 训练好的模型,可以通过以下脚本进行快速预测。 diff --git a/doc/doc_en/recognition_en.md b/doc/doc_en/recognition_en.md index 22f89cde..f6c4c105 100644 --- a/doc/doc_en/recognition_en.md +++ b/doc/doc_en/recognition_en.md @@ -1,79 +1,69 @@ ## TEXT RECOGNITION -- [DATA PREPARATION](#DATA_PREPARATION) - - [Dataset Download](#Dataset_download) - - [Costom Dataset](#Costom_Dataset) - - [Dictionary](#Dictionary) - - [Add Space Category](#Add_space_category) +- [1 DATA PREPARATION](#DATA_PREPARATION) + - [1.1 Costom Dataset](#Costom_Dataset) + - [1.2 Dataset Download](#Dataset_download) + - [1.3 Dictionary](#Dictionary) + - [1.4 Add Space Category](#Add_space_category) -- [TRAINING](#TRAINING) - - [Data Augmentation](#Data_Augmentation) - - [Training](#Training) - - [Multi-language](#Multi_language) +- [2 TRAINING](#TRAINING) + - [2.1 Data Augmentation](#Data_Augmentation) + - [2.2 Training](#Training) + - [2.3 Multi-language](#Multi_language) -- [EVALUATION](#EVALUATION) +- [3 EVALUATION](#EVALUATION) -- [PREDICTION](#PREDICTION) - - [Training engine prediction](#Training_engine_prediction) +- [4 PREDICTION](#PREDICTION) + - [4.1 Training engine prediction](#Training_engine_prediction) ### DATA PREPARATION -PaddleOCR supports two data formats: `LMDB` is used to train public data and evaluation algorithms; `general data` is used to train your own data: +PaddleOCR supports two data formats: +- `LMDB` is used to train data sets stored in lmdb format; +- `general data` is used to train data sets stored in text files: Please organize the dataset as follows: The default storage path for training data is `PaddleOCR/train_data`, if you already 
have a dataset on your disk, just create a soft link to the dataset directory:

 ```
+# linux and mac os
 ln -sf <path/to/dataset> <path/to/paddle_ocr>/train_data/dataset
+# windows
+mklink /d <path/to/paddle_ocr>/train_data/dataset <path/to/dataset>
 ```
-
-* Dataset download
-
-If you do not have a dataset locally, you can download it on the official website [icdar2015](http://rrc.cvc.uab.es/?ch=4&com=downloads). Also refer to [DTRB](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here),download the lmdb format dataset required for benchmark
-
-If you want to reproduce the paper indicators of SRN, you need to download offline [augmented data](https://pan.baidu.com/s/1-HSZ-ZVdqBF2HaBZ5pRAKA), extraction code: y3ry. The augmented data is obtained by rotation and perturbation of mjsynth and synthtext. Please unzip the data to {your_path}/PaddleOCR/train_data/data_lmdb_Release/training/path.
-
-* Use your own dataset:
+#### 1.1 Custom dataset

 If you want to use your own data for training, please refer to the following to organize your data.

 - Training set

-First put the training images in the same folder (train_images), and use a txt file (rec_gt_train.txt) to store the image path and label.
+It is recommended to put the training images in the same folder, and use a txt file (rec_gt_train.txt) to store the image path and label. The contents of the txt file are as follows:

 * Note: by default, the image path and image label are split with \t; if you split them in any other way, it will cause a training error.

 ```
 " Image file name           Image annotation "

-train_data/train_0001.jpg   简单可依赖
-train_data/train_0002.jpg   用科技让复杂的世界更简单
-```
-PaddleOCR provides label files for training the icdar2015 dataset, which can be downloaded in the following ways:
-
-```
-# Training set label
-wget -P ./train_data/ic15_data  https://paddleocr.bj.bcebos.com/dataset/rec_gt_train.txt
-# Test Set Label
-wget -P ./train_data/ic15_data  https://paddleocr.bj.bcebos.com/dataset/rec_gt_test.txt
+train_data/train/word_001.jpg   简单可依赖
+train_data/train/word_002.jpg   用科技让复杂的世界更简单
+...
 ```

 The final training set should have the following file structure:
 ```
 |-train_data
-  |-ic15_data
-    |- rec_gt_train.txt
-    |- train
-      |- word_001.png
-      |- word_002.jpg
-      |- word_003.jpg
-      | ...
+  |- rec_gt_train.txt
+  |- train
+    |- word_001.png
+    |- word_002.jpg
+    |- word_003.jpg
+    | ...
 ```

 - Test set

 Similar to the training set, the test set also needs to be provided a folder containing all images (test) and a rec_gt_test.txt. The structure of the test set is as follows:

 ```
 |-train_data
   |-ic15_data
     |- rec_gt_test.txt
     |- test
       |- word_001.jpg
       |- word_002.jpg
       |- word_003.jpg
       | ...
 ```
+
+#### 1.2 Dataset download
+
+If you do not have a dataset locally, you can download it on the official website [icdar2015](http://rrc.cvc.uab.es/?ch=4&com=downloads). Also refer to [DTRB](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here) ,download the lmdb format dataset required for benchmark.
+
+If you want to reproduce the paper indicators of SRN, you need to download offline [augmented data](https://pan.baidu.com/s/1-HSZ-ZVdqBF2HaBZ5pRAKA), extraction code: y3ry. The augmented data is obtained by rotation and perturbation of mjsynth and synthtext. Please unzip the data to {your_path}/PaddleOCR/train_data/data_lmdb_Release/training/path.
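
The tab-separated label files described in section 1.1 above are plain text and easy to generate with a short script. The sketch below is illustrative only — it is not part of this patch or of PaddleOCR's own tooling — and the `labels` mapping and the `train_data/train/` layout are assumptions made for the example.

```
# Minimal sketch: write a rec_gt_train.txt label file in the
# "<image path>\t<label>" layout that PaddleOCR's "general data" format expects.
# The labels dict and paths below are illustrative assumptions, not project APIs.
from pathlib import Path

labels = {
    "word_001.jpg": "简单可依赖",
    "word_002.jpg": "用科技让复杂的世界更简单",
}

train_dir = Path("train_data/train")
out_file = Path("train_data/rec_gt_train.txt")
out_file.parent.mkdir(parents=True, exist_ok=True)

with out_file.open("w", encoding="utf-8") as f:
    for name, text in labels.items():
        # Image path and label must be separated by a single tab;
        # any other separator breaks training, as noted above.
        f.write(f"{train_dir.as_posix()}/{name}\t{text}\n")
```

The same layout applies to `rec_gt_test.txt` for the images in the test folder.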
+ +PaddleOCR provides label files for training the icdar2015 dataset, which can be downloaded in the following ways: + +``` +# Training set label +wget -P ./train_data/ic15_data https://paddleocr.bj.bcebos.com/dataset/rec_gt_train.txt +# Test Set Label +wget -P ./train_data/ic15_data https://paddleocr.bj.bcebos.com/dataset/rec_gt_test.txt +``` + -- Dictionary +#### 1.3 Dictionary Finally, a dictionary ({word_dict_name}.txt) needs to be provided so that when the model is trained, all the characters that appear can be mapped to the dictionary index. @@ -108,6 +115,8 @@ n In `word_dict.txt`, there is a single word in each line, which maps characters and numeric indexes together, e.g "and" will be mapped to [2 5 1] +PaddleOCR has built-in dictionaries, which can be used on demand. + `ppocr/utils/ppocr_keys_v1.txt` is a Chinese dictionary with 6623 characters. `ppocr/utils/ic15_dict.txt` is an English dictionary with 63 characters @@ -123,8 +132,6 @@ In `word_dict.txt`, there is a single word in each line, which maps characters a `ppocr/utils/dict/en_dict.txt` is a English dictionary with 63 characters -You can use it on demand. - The current multi-language model is still in the demo stage and will continue to optimize the model and add languages. **You are very welcome to provide us with dictionaries and fonts in other languages**, If you like, you can submit the dictionary file to [dict](../../ppocr/utils/dict) or corpus file to [corpus](../../ppocr/utils/corpus) and we will thank you in the Repo. @@ -136,14 +143,14 @@ To customize the dict file, please modify the `character_dict_path` field in `co If you need to customize dic file, please add character_dict_path field in configs/rec/rec_icdar15_train.yml to point to your dictionary path. And set character_type to ch. -- Add space category +#### 1.4 Add space category If you want to support the recognition of the `space` category, please set the `use_space_char` field in the yml file to `True`. **Note: use_space_char only takes effect when character_type=ch** -### TRAINING +### 2 TRAINING PaddleOCR provides training scripts, evaluation scripts, and prediction scripts. In this section, the CRNN recognition model will be used as an example: @@ -166,7 +173,7 @@ Start training: python3 -m paddle.distributed.launch --gpus '0,1,2,3' tools/train.py -c configs/rec/rec_icdar15_train.yml ``` -- Data Augmentation +#### 2.1 Data Augmentation PaddleOCR provides a variety of data augmentation methods. If you want to add disturbance during training, please set `distort: true` in the configuration file. @@ -175,7 +182,7 @@ The default perturbation methods are: cvtColor, blur, jitter, Gasuss noise, rand Each disturbance method is selected with a 50% probability during the training process. For specific code implementation, please refer to: [img_tools.py](https://github.com/PaddlePaddle/PaddleOCR/blob/develop/ppocr/data/rec/img_tools.py) -- Training +#### 2.2 Training PaddleOCR supports alternating training and evaluation. You can modify `eval_batch_step` in `configs/rec/rec_icdar15_train.yml` to set the evaluation frequency. By default, it is evaluated every 500 iter and the best acc model is saved under `output/rec_CRNN/best_accuracy` during the evaluation process. @@ -264,7 +271,7 @@ Eval: **Note that the configuration file for prediction/evaluation must be consistent with the training.** -- Multi-language +#### 2.3 Multi-language PaddleOCR currently supports 26 (except Chinese) language recognition. 
A multi-language configuration file template is provided under the path `configs/rec/multi_languages`: [rec_multi_language_lite_train.yml](../../configs/rec/multi_language/rec_multi_language_lite_train.yml).

 ...
 ```

-### EVALUATION
+### 3 EVALUATION

 The evaluation dataset can be set by modifying the `Eval.dataset.label_file_list` field in the `configs/rec/rec_icdar15_train.yml` file.

 ```
 # GPU evaluation, Global.checkpoints is the weight to be tested
 python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/rec_icdar15_train.yml -o Global.checkpoints={path/to/weights}/best_accuracy
 ```

-### PREDICTION
+### 4 PREDICTION

-* Training engine prediction
+#### 4.1 Training engine prediction

 Using the model trained by PaddleOCR, you can quickly get prediction through the following script.
--
GitLab
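
A quick way to sanity-check the dictionary behaviour described in section 1.3 is to rebuild the character-to-index mapping yourself. The sketch below is not PaddleOCR code; it simply reproduces the documented example in which the sample dictionary maps "and" to [2 5 1], assuming the first occurrence of a duplicated character wins (PaddleOCR's own label encoder may resolve duplicates differently).

```
# Minimal sketch (not PaddleOCR code): map a transcription to dictionary
# indexes as described in section 1.3, using the sample dictionary above.
chars = ["l", "d", "a", "d", "r", "n"]   # one character per line in {word_dict_name}.txt

char_to_idx = {}
for idx, ch in enumerate(chars):
    # Keep the first occurrence of a repeated character so that the
    # documented example "and" -> [2 5 1] is reproduced.
    char_to_idx.setdefault(ch, idx)

def encode(text):
    # Characters missing from the dictionary are simply skipped here.
    return [char_to_idx[c] for c in text if c in char_to_idx]

print(encode("and"))  # -> [2, 5, 1]
```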