This article introduces the use of the Python inference engine for the PP-OCR model library. The content is in order of text detection, text recognition, direction classifier and the prediction method of the three in series on the CPU and GPU.
-[TEXT DETECTION MODEL INFERENCE](#DETECTION_MODEL_INFERENCE)
-[TEXT RECOGNITION MODEL INFERENCE](#RECOGNITION_MODEL_INFERENCE)
-[1. LIGHTWEIGHT CHINESE MODEL](#LIGHTWEIGHT_RECOGNITION)
-[2. MULTILINGUAL MODEL INFERENCE](MULTILINGUAL_MODEL_INFERENCE)
-[ANGLE CLASSIFICATION MODEL INFERENCE](#ANGLE_CLASS_MODEL_INFERENCE)
-[TEXT DETECTION ANGLE CLASSIFICATION AND RECOGNITION INFERENCE CONCATENATION](#CONCATENATION)
<aname="DETECTION_MODEL_INFERENCE"></a>
## TEXT DETECTION MODEL INFERENCE
The default configuration is based on the inference setting of the DB text detection model. For lightweight Chinese detection model inference, you can execute the following commands:
The visual text detection results are saved to the ./inference_results folder by default, and the name of the result file is prefixed with'det_res'. Examples of results are as follows:
![](../imgs_results/det_res_00018069.jpg)
You can use the parameters `limit_type` and `det_limit_side_len` to limit the size of the input image,
The optional parameters of `limit_type` are [`max`, `min`], and
`det_limit_size_len` is a positive integer, generally set to a multiple of 32, such as 960.
The default setting of the parameters is `limit_type='max', det_limit_side_len=960`. Indicates that the longest side of the network input image cannot exceed 960,
If this value is exceeded, the image will be resized with the same width ratio to ensure that the longest side is `det_limit_side_len`.
Set as `limit_type='min', det_limit_side_len=960`, it means that the shortest side of the image is limited to 960.
If the resolution of the input picture is relatively large and you want to use a larger resolution prediction, you can set det_limit_side_len to the desired value, such as 1216:
After executing the command, the prediction results (recognized text and score) of the above image will be printed on the screen.
```bash
Predicts of ./doc/imgs_words_en/word_10.png:('PAIN', 0.9897658)
```
<aname="MULTILINGUAL_MODEL_INFERENCE"></a>
### 2. MULTILINGAUL MODEL INFERENCE
If you need to predict other language models, when using inference model prediction, you need to specify the dictionary path used by `--rec_char_dict_path`. At the same time, in order to get the correct visualization results,
You need to specify the visual font path through `--vis_font_path`. There are small language fonts provided by default under the `doc/fonts` path, such as Korean recognition:
After executing the command, the prediction results (classification angle and score) of the above image will be printed on the screen.
```
Predicts of ./doc/imgs_words_en/word_10.png:['0', 0.9999995]
```
<aname="CONCATENATION"></a>
## TEXT DETECTION ANGLE CLASSIFICATION AND RECOGNITION INFERENCE CONCATENATION
When performing prediction, you need to specify the path of a single image or a folder of images through the parameter `image_dir`, the parameter `det_model_dir` specifies the path to detect the inference model, the parameter `cls_model_dir` specifies the path to angle classification inference model and the parameter `rec_model_dir` specifies the path to identify the inference model. The parameter `use_angle_cls` is used to control whether to enable the angle classification model. The parameter `use_mp` specifies whether to use multi-process to infer `total_process_num` specifies process number when using multi-process. The parameter . The visualized recognition results are saved to the `./inference_results` folder by default.
-[Paste Your Document In Here](#paste-your-document-in-here)
-[INTRODUCTION ABOUT OCR](#introduction-about-ocr)
-[INTRODUCTION ABOUT OCR](#introduction-about-ocr)
*[BASIC CONCEPTS OF OCR DETECTION MODEL](#basic-concepts-of-ocr-detection-model)
*[BASIC CONCEPTS OF OCR DETECTION MODEL](#basic-concepts-of-ocr-detection-model)
*[Basic concepts of OCR recognition model](#basic-concepts-of-ocr-recognition-model)
*[Basic concepts of OCR recognition model](#basic-concepts-of-ocr-recognition-model)
...
@@ -8,7 +7,7 @@
...
@@ -8,7 +7,7 @@
*[On the right](#on-the-right)
*[On the right](#on-the-right)
# INTRODUCTION ABOUT OCR
## 1. INTRODUCTION ABOUT OCR
This section briefly introduces the basic concepts of OCR detection model and recognition model, and introduces PaddleOCR's PP-OCR model.
This section briefly introduces the basic concepts of OCR detection model and recognition model, and introduces PaddleOCR's PP-OCR model.
...
@@ -17,7 +16,7 @@ OCR (Optical Character Recognition, Optical Character Recognition) is currently
...
@@ -17,7 +16,7 @@ OCR (Optical Character Recognition, Optical Character Recognition) is currently
OCR text recognition generally includes two parts, text detection and text recognition. The text detection module first uses detection algorithms to detect text lines in the image. And then the recognition algorithm to identify the specific text in the text line.
OCR text recognition generally includes two parts, text detection and text recognition. The text detection module first uses detection algorithms to detect text lines in the image. And then the recognition algorithm to identify the specific text in the text line.
## BASIC CONCEPTS OF OCR DETECTION MODEL
### 1.1 BASIC CONCEPTS OF OCR DETECTION MODEL
Text detection can locate the text area in the image, and then usually mark the word or text line in the form of a bounding box. Traditional text detection algorithms mostly extract features manually, which are characterized by fast speed and good effect in simple scenes, but the effect will be greatly reduced when faced with natural scenes. Currently, deep learning methods are mostly used.
Text detection can locate the text area in the image, and then usually mark the word or text line in the form of a bounding box. Traditional text detection algorithms mostly extract features manually, which are characterized by fast speed and good effect in simple scenes, but the effect will be greatly reduced when faced with natural scenes. Currently, deep learning methods are mostly used.
...
@@ -27,14 +26,14 @@ Text detection algorithms based on deep learning can be roughly divided into the
...
@@ -27,14 +26,14 @@ Text detection algorithms based on deep learning can be roughly divided into the
3. Hybrid target detection and segmentation method.
3. Hybrid target detection and segmentation method.
## Basic concepts of OCR recognition model
### 1.2 Basic concepts of OCR recognition model
The input of the OCR recognition algorithm is generally text lines images which has less background information, and the text information occupies the main part. The recognition algorithm can be divided into two types of algorithms:
The input of the OCR recognition algorithm is generally text lines images which has less background information, and the text information occupies the main part. The recognition algorithm can be divided into two types of algorithms:
1. CTC-based method. The text prediction module of the recognition algorithm is based on CTC, and the commonly used algorithm combination is CNN+RNN+CTC. There are also some algorithms that try to add transformer modules to the network and so on.
1. CTC-based method. The text prediction module of the recognition algorithm is based on CTC, and the commonly used algorithm combination is CNN+RNN+CTC. There are also some algorithms that try to add transformer modules to the network and so on.
2. Attention-based method. The text prediction module of the recognition algorithm is based on Attention, and the commonly used algorithm combination is CNN+RNN+Attention.
2. Attention-based method. The text prediction module of the recognition algorithm is based on Attention, and the commonly used algorithm combination is CNN+RNN+Attention.
## PP-OCR model
### 1.3 PP-OCR model
PaddleOCR integrates many OCR algorithms, text detection algorithms include DB, EAST, SAST, etc., text recognition algorithms include CRNN, RARE, StarNet, Rosetta, SRN and other algorithms.
PaddleOCR integrates many OCR algorithms, text detection algorithms include DB, EAST, SAST, etc., text recognition algorithms include CRNN, RARE, StarNet, Rosetta, SRN and other algorithms.
@@ -62,45 +62,6 @@ Relationship of the above models is as follows.
...
@@ -62,45 +62,6 @@ Relationship of the above models is as follows.
<aname="Multilingual"></a>
<aname="Multilingual"></a>
#### Multilingual Recognition Model(Updating...)
#### Multilingual Recognition Model(Updating...)
**Note:** The configuration file of the new multi language model is generated by code. You can use the `--help` parameter to check which multi language are supported by current PaddleOCR.
```bash
# The code needs to run in the specified directory
Take the Italian configuration file as an example:
##### 1.Generate Italian configuration file to test the model provided
you can generate the default configuration file through the following command, and use the default language dictionary provided by paddleocr for prediction.
```bash
# The code needs to run in the specified directory
# Set the required language configuration file through -l or --language parameter
# This command will write the default parameter to the configuration file.
python3 generate_multi_language_configs.py -l it
```
##### 2. Generate Italian configuration file to train your own data
If you want to train your own model, you can prepare the training set file, verification set file, dictionary file and training data path. Here we assume that the Italian training set, verification set, dictionary and training data path are:
- Training set:{your/path/}PaddleOCR/train_data/train_list.txt
| french_mobile_v2.0_rec | ppocr/utils/dict/french_dict.txt | Lightweight model for French recognition|[rec_french_lite_train.yml](../../configs/rec/multi_language/rec_french_lite_train.yml)|2.65M|[inference model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/multilingual/french_mobile_v2.0_rec_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/multilingual/french_mobile_v2.0_rec_train.tar) |
| french_mobile_v2.0_rec | ppocr/utils/dict/french_dict.txt | Lightweight model for French recognition|[rec_french_lite_train.yml](../../configs/rec/multi_language/rec_french_lite_train.yml)|2.65M|[inference model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/multilingual/french_mobile_v2.0_rec_infer.tar) / [trained model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/multilingual/french_mobile_v2.0_rec_train.tar) |
-[2.2.1 Chinese & English Model and Multilingual Model](#221-chinese---english-model-and-multilingual-model)
-[2.2.1 Chinese & English Model and Multilingual Model](#221-chinese---english-model-and-multilingual-model)
-[2.2.2 LayoutParser](#222-layoutparser)
-[2.2.2 Layout Analysis](#222-layoutAnalysis)
...
@@ -132,9 +132,11 @@ Commonly used multilingual abbreviations include
...
@@ -132,9 +132,11 @@ Commonly used multilingual abbreviations include
| Chinese Traditional | chinese_cht | | Italian | it | | Russian | ru |
| Chinese Traditional | chinese_cht | | Italian | it | | Russian | ru |
A list of all languages and their corresponding abbreviations can be found in [Multi-Language Model Tutorial](./multi_languages_en.md)
A list of all languages and their corresponding abbreviations can be found in [Multi-Language Model Tutorial](./multi_languages_en.md)
<aname="213-layoutparser"></a>
<aname="213-layoutAnalysis"></a>
#### 2.1.3 LayoutParser
#### 2.1.3 Layout Analysis
Layout analysis refers to the division of 5 types of areas of the document, including text, title, list, picture and table. For the first three types of regions, directly use the OCR model to complete the text detection and recognition of the corresponding regions, and save the results in txt. For the table area, after the table structuring process, the table picture is converted into an Excel file of the same table style. The picture area will be individually cropped into an image.
To use the layout analysis function of PaddleOCR, you need to specify `--type=structure`
To use the layout analysis function of PaddleOCR, you need to specify `--type=structure`
-[2. Data and vertical scenes](#2-data-and-vertical-scenes)
-[3. Data and vertical scenes](#3-data-and-vertical-scenes)
*[2.1 Training data](#21-training-data)
*[3.1 Training data](#31-training-data)
*[2.2 Vertical scene](#22-vertical-scene)
*[3.2 Vertical scene](#32-vertical-scene)
*[2.3 Build your own data set](#23-build-your-own-data-set)
*[3.3 Build your own data set](#33-build-your-own-data-set)
*[3. FAQ](#3-faq)
This article will introduce the basic concepts that need to be mastered during model training and the tuning methods during training.
This article will introduce the basic concepts that need to be mastered during model training and the tuning methods during training.
...
@@ -69,34 +69,13 @@ Optimizer:
...
@@ -69,34 +69,13 @@ Optimizer:
(3) End-to-end statistics: End-to-end recall rate: accurately detect and correctly identify the proportion of text lines in all labeled text lines; End-to-end accuracy rate: accurately detect and correctly identify the number of text lines in the detected text lines The standard for accurate detection is that the IOU of the detection box and the labeled box is greater than a certain threshold, and the text in the correctly identified detection box is the same as the labeled text.
(3) End-to-end statistics: End-to-end recall rate: accurately detect and correctly identify the proportion of text lines in all labeled text lines; End-to-end accuracy rate: accurately detect and correctly identify the number of text lines in the detected text lines The standard for accurate detection is that the IOU of the detection box and the labeled box is greater than a certain threshold, and the text in the correctly identified detection box is the same as the labeled text.
<aname="2-faq"></a>
<aname="2-data-and-vertical-scenes"></a>
# 2. FAQ
**Q**: How to choose a suitable network input shape when training CRNN recognition?
# 2. Data and vertical scenes
A: The general height is 32, the longest width is selected, there are two methods:
(1) Calculate the aspect ratio distribution of training sample images. The selection of the maximum aspect ratio considers 80% of the training samples.
(2) Count the number of texts in training samples. The selection of the longest number of characters considers the training sample that satisfies 80%. Then the aspect ratio of Chinese characters is approximately considered to be 1, and that of English is 3:1, and the longest width is estimated.
**Q**: During the recognition training, the accuracy of the training set has reached 90, but the accuracy of the verification set has been kept at 70, what should I do?
A: If the accuracy of the training set is 90 and the test set is more than 70, it should be over-fitting. There are two methods to try:
(1) Add more augmentation methods or increase the [probability] of augmented prob (https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppocr/data/imaug/rec_img_aug.py#L341), The default is 0.4.
<aname="21-training-data"></a>
(2) Increase the [l2 dcay value] of the system (https://github.com/PaddlePaddle/PaddleOCR/blob/a501603d54ff5513fc4fc760319472e59da25424/configs/rec/ch_ppocr_v1.1/rec_chinese_lite_train_v1.1.yml#L47)
**Q**: When the recognition model is trained, loss can drop normally, but acc is always 0
## 2.1 Training data
A: It is normal for the acc to be 0 at the beginning of the recognition model training, and the indicator will come up after a longer training period.
<aname="3-data-and-vertical-scenes"></a>
# 3. Data and vertical scenes
<aname="31-training-data"></a>
## 3.1 Training data
The current open source models, data sets and magnitudes are as follows:
The current open source models, data sets and magnitudes are as follows:
...
@@ -111,14 +90,16 @@ The current open source models, data sets and magnitudes are as follows:
...
@@ -111,14 +90,16 @@ The current open source models, data sets and magnitudes are as follows:
Among them, the public data sets are all open source, users can search and download by themselves, or refer to [Chinese data set](./datasets.md), synthetic data is not open source, users can use open source synthesis tools to synthesize by themselves. Synthesis tools include [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator) etc.
Among them, the public data sets are all open source, users can search and download by themselves, or refer to [Chinese data set](./datasets.md), synthetic data is not open source, users can use open source synthesis tools to synthesize by themselves. Synthesis tools include [text_renderer](https://github.com/Sanster/text_renderer), [SynthText](https://github.com/ankush-me/SynthText), [TextRecognitionDataGenerator](https://github.com/Belval/TextRecognitionDataGenerator) etc.
<aname="32-vertical-scene"></a>
<aname="22-vertical-scene"></a>
## 3.2 Vertical scene
## 2.2 Vertical scene
PaddleOCR mainly focuses on general OCR. If you have vertical requirements, you can use PaddleOCR + vertical data to train yourself;
PaddleOCR mainly focuses on general OCR. If you have vertical requirements, you can use PaddleOCR + vertical data to train yourself;
If there is a lack of labeled data, or if you do not want to invest in research and development costs, it is recommended to directly call the open API, which covers some of the more common vertical categories.
If there is a lack of labeled data, or if you do not want to invest in research and development costs, it is recommended to directly call the open API, which covers some of the more common vertical categories.
<aname="33-build-your-own-data-set"></a>
<aname="23-build-your-own-data-set"></a>
## 3.3 Build your own data set
## 2.3 Build your own data set
There are several experiences for reference when constructing the data set:
There are several experiences for reference when constructing the data set:
...
@@ -133,3 +114,28 @@ There are several experiences for reference when constructing the data set:
...
@@ -133,3 +114,28 @@ There are several experiences for reference when constructing the data set:
a. Manually collect more training data, the most direct and effective way.
a. Manually collect more training data, the most direct and effective way.
b. Basic image processing or transformation based on PIL and opencv. For example, the three modules of ImageFont, Image, ImageDraw in PIL write text into the background, opencv's rotating affine transformation, Gaussian filtering and so on.
b. Basic image processing or transformation based on PIL and opencv. For example, the three modules of ImageFont, Image, ImageDraw in PIL write text into the background, opencv's rotating affine transformation, Gaussian filtering and so on.
c. Use data generation algorithms to synthesize data, such as algorithms such as pix2pix.
c. Use data generation algorithms to synthesize data, such as algorithms such as pix2pix.
<aname="3-faq"></a>
# 3. FAQ
**Q**: How to choose a suitable network input shape when training CRNN recognition?
A: The general height is 32, the longest width is selected, there are two methods:
(1) Calculate the aspect ratio distribution of training sample images. The selection of the maximum aspect ratio considers 80% of the training samples.
(2) Count the number of texts in training samples. The selection of the longest number of characters considers the training sample that satisfies 80%. Then the aspect ratio of Chinese characters is approximately considered to be 1, and that of English is 3:1, and the longest width is estimated.
**Q**: During the recognition training, the accuracy of the training set has reached 90, but the accuracy of the verification set has been kept at 70, what should I do?
A: If the accuracy of the training set is 90 and the test set is more than 70, it should be over-fitting. There are two methods to try:
(1) Add more augmentation methods or increase the [probability] of augmented prob (https://github.com/PaddlePaddle/PaddleOCR/blob/dygraph/ppocr/data/imaug/rec_img_aug.py#L341), The default is 0.4.
(2) Increase the [l2 dcay value] of the system (https://github.com/PaddlePaddle/PaddleOCR/blob/a501603d54ff5513fc4fc760319472e59da25424/configs/rec/ch_ppocr_v1.1/rec_chinese_lite_train_v1.1.yml#L47)
**Q**: When the recognition model is trained, loss can drop normally, but acc is always 0
A: It is normal for the acc to be 0 at the beginning of the recognition model training, and the indicator will come up after a longer training period.