Merge pull request #6111 from Topdu/dygraph

[doc] Add nrtr and svtr en docs.

Merge pull request #6111 from Topdu/dygraph
[doc] Add nrtr and svtr en docs.
8c8ba83b · MissPenguin · GitHub · 77acb3bf · b42984b9 · 8c8ba83b
5 changed file
--- a/doc/doc_ch/algorithm_rec_nrtr.md
+++ b/doc/doc_ch/algorithm_rec_nrtr.md
@@ -61,7 +61,7 @@ python3 -m paddle.distributed.launch --gpus '0,1,2,3'  tools/train.py -c configs
 ```shell
 # 注意将pretrained_model的路径设置为本地路径。
-python3 tools/eval.py -c configs/rec/rec_mtb_nrtr.yml -o Global.pretrained_model=./rec_mtb_nrtr_train/best_accuracy
+python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/rec_mtb_nrtr.yml -o Global.pretrained_model=./rec_mtb_nrtr_train/best_accuracy
 ```
 <a name="3-3"></a>
@@ -144,9 +144,9 @@ Predicts of ./doc/imgs_words_en/word_10.png:('pain', 0.9265879392623901)
 ```bibtex
 @article{Sheng2019NRTR,
-  author    = {Fenfen Sheng and Zhineng Chen andBo Xu},
  title     = {NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition},
-  journal   = {ICDAR},
+  author    = {Fenfen Sheng and Zhineng Chen andBo Xu},
+  booktitle = {ICDAR},
  year      = {2019},
  url       = {http://arxiv.org/abs/1806.00926},
  pages     = {781-786}

--- a/doc/doc_ch/algorithm_rec_svtr.md
+++ b/doc/doc_ch/algorithm_rec_svtr.md
@@ -17,7 +17,7 @@
 ## 1. 算法简介
 论文信息：
-> [SVTR: Scene Text Recognition with a Single Visual Model]()
+> [SVTR: Scene Text Recognition with a Single Visual Model](https://arxiv.org/abs/2205.00159)
 > Yongkun Du and Zhineng Chen and Caiyan Jia Xiaoting Yin and Tianlun Zheng and Chenxia Li and Yuning Du and Yu-Gang Jiang
 > IJCAI, 2022
@@ -80,8 +80,10 @@ python3 -m paddle.distributed.launch --gpus '0,1,2,3'  tools/train.py -c configs
 可下载`SVTR`提供的模型文件和配置文件：[下载地址](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar) ，以`SVTR-T`为例，使用如下命令进行评估：
 ```shell
+# 下载包含SVTR-T的模型文件和配置文件的tar压缩包并解压
+wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar && tar xf rec_svtr_tiny_none_ctc_en_train.tar
 # 注意将pretrained_model的路径设置为本地路径。
-python3 tools/eval.py -c ./rec_svtr_tiny_en_train/rec_svtr_tiny_6local_6global_stn_en.yml -o Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy
+python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c ./rec_svtr_tiny_none_ctc_en_train/rec_svtr_tiny_6local_6global_stn_en.yml -o Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy
 ```
 <a name="3-3"></a>
@@ -90,7 +92,7 @@ python3 tools/eval.py -c ./rec_svtr_tiny_en_train/rec_svtr_tiny_6local_6global_s
 使用如下命令进行单张图片预测：
 ```shell
 # 注意将pretrained_model的路径设置为本地路径。
-python3 tools/infer_rec.py -c ./rec_svtr_tiny_en_train/rec_svtr_tiny_6local_6global_stn_en.yml -o Global.infer_img='./doc/imgs_words_en/word_10.png' Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy
+python3 tools/infer_rec.py -c ./rec_svtr_tiny_none_ctc_en_train/rec_svtr_tiny_6local_6global_stn_en.yml -o Global.infer_img='./doc/imgs_words_en/word_10.png' Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy
 # 预测文件夹下所有图像时，可修改infer_img为文件夹，如 Global.infer_img='./doc/imgs_words_en/'。
 ```
@@ -100,11 +102,11 @@ python3 tools/infer_rec.py -c ./rec_svtr_tiny_en_train/rec_svtr_tiny_6local_6glo
 <a name="4-1"></a>
 ### 4.1 Python推理
-首先将训练得到best模型，转换成inference model。下面以基于`SVTR-T`，在英文数据集训练的模型为例（[模型下载地址](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar) )，可以使用如下命令进行转换：
+首先将训练得到best模型，转换成inference model。下面以基于`SVTR-T`，在英文数据集训练的模型为例（[模型和配置文件下载地址](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar) )，可以使用如下命令进行转换：
 ```shell
 # 注意将pretrained_model的路径设置为本地路径。
-python3 tools/export_model.py -c ./rec_svtr_tiny_en_train/rec_svtr_tiny_6local_6global_stn_en.yml -o Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy Global.save_inference_dir=./inference/rec_svtr_tiny_stn_en
+python3 tools/export_model.py -c ./rec_svtr_tiny_none_ctc_en_train/rec_svtr_tiny_6local_6global_stn_en.yml -o Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy Global.save_inference_dir=./inference/rec_svtr_tiny_stn_en
 ```
 **注意：**
@@ -158,4 +160,17 @@ Predicts of ./doc/imgs_words_en/word_10.png:('pain', 0.9999998807907104)
 <a name="5"></a>
 ## 5. FAQ
-1. 由于`SVTR`使用的op算子大多为矩阵相乘，在GPU环境下，速度具有优势，但在CPU开启mkldnn加速环境下，`SVTR`相比于被优化的卷积网络没有优势。
+1. 由于`SVTR`使用的算子大多为矩阵相乘，在GPU环境下，速度具有优势，但在CPU开启mkldnn加速环境下，`SVTR`相比于被优化的卷积网络没有优势。
+## 引用
+```bibtex
+@article{Du2022SVTR,
+  title     = {SVTR: Scene Text Recognition with a Single Visual Model},
+  author    = {Du, Yongkun and Chen, Zhineng and Jia, Caiyan and Yin, Xiaoting and Zheng, Tianlun and Li, Chenxia and Du, Yuning and Jiang, Yu-Gang},
+  booktitle = {IJCAI},
+  year      = {2022},
+  url       = {https://arxiv.org/abs/2205.00159}
+}
+```
--- a/doc/doc_en/algorithm_rec_nrtr_en.md
+++ b/doc/doc_en/algorithm_rec_nrtr_en.md
+# NRTR
+- [1. Introduction](#1)
+- [2. Environment](#2)
+- [3. Model Training / Evaluation / Prediction](#3)
+    - [3.1 Training](#3-1)
+    - [3.2 Evaluation](#3-2)
+    - [3.3 Prediction](#3-3)
+- [4. Inference and Deployment](#4)
+    - [4.1 Python Inference](#4-1)
+    - [4.2 C++ Inference](#4-2)
+    - [4.3 Serving](#4-3)
+    - [4.4 More](#4-4)
+- [5. FAQ](#5)
+<a name="1"></a>
+## 1. Introduction
+Paper:
+> [NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition](https://arxiv.org/abs/1806.00926)
+> Fenfen Sheng and Zhineng Chen and Bo Xu
+> ICDAR, 2019
+Using MJSynth and SynthText two text recognition datasets for training, and evaluating on IIIT, SVT, IC03, IC13, IC15, SVTP, CUTE datasets, the algorithm reproduction effect is as follows:
+|Model|Backbone|config|Acc|Download link|
+| --- | --- | --- | --- | --- |
+|NRTR|MTB|[rec_mtb_nrtr.yml](../../configs/rec/rec_mtb_nrtr.yml)|84.21%|[train model](https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/rec_mtb_nrtr_train.tar)|
+<a name="2"></a>
+## 2. Environment
+Please refer to ["Environment Preparation"](./environment.md) to configure the PaddleOCR environment, and refer to ["Project Clone"](./clone.md) to clone the project code.
+<a name="3"></a>
+## 3. Model Training / Evaluation / Prediction
+Please refer to [Text Recognition Tutorial](./recognition.md). PaddleOCR modularizes the code, and training different recognition models only requires **changing the configuration file**.
+Training:
+Specifically, after the data preparation is completed, the training can be started. The training command is as follows:
+```
+#Single GPU training (long training period, not recommended)
+python3 tools/train.py -c configs/rec/rec_mtb_nrtr.yml
+#Multi GPU training, specify the gpu number through the --gpus parameter
+python3 -m paddle.distributed.launch --gpus '0,1,2,3'  tools/train.py -c configs/rec/rec_mtb_nrtr.yml
+```
+Evaluation:
+```
+# GPU evaluation
+python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/rec_mtb_nrtr.yml -o Global.pretrained_model={path/to/weights}/best_accuracy
+```
+Prediction:
+```
+# The configuration file used for prediction must match the training
+python3 tools/infer_rec.py -c configs/rec/rec_mtb_nrtr.yml -o Global.infer_img='./doc/imgs_words_en/word_10.png' Global.pretrained_model=./rec_mtb_nrtr_train/best_accuracy
+```
+<a name="4"></a>
+## 4. Inference and Deployment
+<a name="4-1"></a>
+### 4.1 Python Inference
+First, the model saved during the NRTR text recognition training process is converted into an inference model. ( [Model download link](https://paddleocr.bj.bcebos.com/dygraph_v2.0/en/rec_mtb_nrtr_train.tar)) ), you can use the following command to convert:
+```
+python3 tools/export_model.py -c configs/rec/rec_mtb_nrtr.yml -o Global.pretrained_model=./rec_mtb_nrtr_train/best_accuracy  Global.save_inference_dir=./inference/rec_mtb_nrtr
+```
+**Note:**
+- If you are training the model on your own dataset and have modified the dictionary file, please pay attention to modify the `character_dict_path` in the configuration file to the modified dictionary file.
+- If you modified the input size during training, please modify the `infer_shape` corresponding to NRTR in the `tools/export_model.py` file.
+After the conversion is successful, there are three files in the directory:
+```
+/inference/rec_mtb_nrtr/
+    ├── inference.pdiparams
+    ├── inference.pdiparams.info
+    └── inference.pdmodel
+```
+For NRTR text recognition model inference, the following commands can be executed:
+```
+python3 tools/infer/predict_rec.py --image_dir='./doc/imgs_words_en/word_10.png' --rec_model_dir='./inference/rec_mtb_nrtr/' --rec_algorithm='NRTR' --rec_image_shape='1,32,100' --rec_char_dict_path='./ppocr/utils/EN_symbol_dict.txt'
+```
+![](../imgs_words_en/word_10.png)
+After executing the command, the prediction result (recognized text and score) of the image above is printed to the screen, an example is as follows:
+The result is as follows:
+```shell
+Predicts of ./doc/imgs_words_en/word_10.png:('pain', 0.9265879392623901)
+```
+<a name="4-2"></a>
+### 4.2 C++ Inference
+Not supported
+<a name="4-3"></a>
+### 4.3 Serving
+Not supported
+<a name="4-4"></a>
+### 4.4 More
+Not supported
+<a name="5"></a>
+## 5. FAQ
+1. In the `NRTR` paper, Beam search is used to decode characters, but the speed is slow. Beam search is not used by default here, and greedy search is used to decode characters.
+## Citation
+```bibtex
+@article{Sheng2019NRTR,
+  title     = {NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition},
+  author    = {Fenfen Sheng and Zhineng Chen andBo Xu},
+  booktitle = {ICDAR},
+  year      = {2019},
+  url       = {http://arxiv.org/abs/1806.00926},
+  pages     = {781-786}
+}
+```
--- a/doc/doc_en/algorithm_rec_svtr_en.md
+++ b/doc/doc_en/algorithm_rec_svtr_en.md
+# SVTR
+- [1. Introduction](#1)
+- [2. Environment](#2)
+- [3. Model Training / Evaluation / Prediction](#3)
+    - [3.1 Training](#3-1)
+    - [3.2 Evaluation](#3-2)
+    - [3.3 Prediction](#3-3)
+- [4. Inference and Deployment](#4)
+    - [4.1 Python Inference](#4-1)
+    - [4.2 C++ Inference](#4-2)
+    - [4.3 Serving](#4-3)
+    - [4.4 More](#4-4)
+- [5. FAQ](#5)
+<a name="1"></a>
+## 1. Introduction
+Paper:
+> [SVTR: Scene Text Recognition with a Single Visual Model](https://arxiv.org/abs/2205.00159)
+> Yongkun Du and Zhineng Chen and Caiyan Jia Xiaoting Yin and Tianlun Zheng and Chenxia Li and Yuning Du and Yu-Gang Jiang
+> IJCAI, 2022
+<a name="model"></a>
+The accuracy (%) and model files of SVTR on the public dataset of scene text recognition are as follows:
+* Chinese dataset from [Chinese Benckmark](https://arxiv.org/abs/2112.15093) , and the Chinese training evaluation strategy of SVTR follows the paper.
+|   Model    |IC13<br/>857 |  SVT  |IIIT5k<br/>3000 |IC15<br/>1811| SVTP  |CUTE80 | Avg_6 |IC15<br/>2077 |IC13<br/>1015 |IC03<br/>867|IC03<br/>860|Avg_10 | Chinese<br/>scene_test|                                                                                            Download link                                                                                            |
+|:----------:|:------:|:-----:|:---------:|:------:|:-----:|:-----:|:-----:|:-------:|:-------:|:-----:|:-----:|:---------------------------------------------:|:-----:|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|
+| SVTR Tiny  | 96.85  | 91.34 |   94.53   | 83.99  | 85.43 | 89.24 | 90.87 |  80.55  |  95.37  | 95.27 | 95.70 | 90.13 | 67.90 | [English](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar)  / [Chinese](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_ch_train.tar)  |
+| SVTR Small | 95.92  | 93.04 |   95.03   | 84.70  | 87.91 | 92.01 | 91.63 |  82.72  |  94.88  | 96.08 | 96.28 | 91.02 | 69.00 | [English](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_small_none_ctc_en_train.tar) / [Chinese](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_small_none_ctc_ch_train.tar) |
+| SVTR Base  | 97.08  | 91.50 |   96.03   | 85.20  | 89.92 | 91.67 | 92.33 |  83.73  |  95.66  | 95.62 | 95.81 | 91.61 | 71.40 |                          [English](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_base_none_ctc_en_train.tar)  /                                              -                          |
+| SVTR Large | 97.20  | 91.65 |   96.30   | 86.58  | 88.37 | 95.14 | 92.82 |  84.54  |  96.35  | 96.54 | 96.74 | 92.24 | 72.10 | [English](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_large_none_ctc_en_train.tar) / [Chinese](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_large_none_ctc_ch_train.tar) |
+<a name="2"></a>
+## 2. Environment
+Please refer to ["Environment Preparation"](./environment.md) to configure the PaddleOCR environment, and refer to ["Project Clone"](./clone.md) to clone the project code.
+#### Dataset Preparation
+[English dataset download](https://github.com/clovaai/deep-text-recognition-benchmark#download-lmdb-dataset-for-traininig-and-evaluation-from-here)
+[Chinese dataset download](https://github.com/fudanvi/benchmarking-chinese-text-recognition#download)
+<a name="3"></a>
+## 3. Model Training / Evaluation / Prediction
+Please refer to [Text Recognition Tutorial](./recognition.md). PaddleOCR modularizes the code, and training different recognition models only requires **changing the configuration file**.
+Training:
+Specifically, after the data preparation is completed, the training can be started. The training command is as follows:
+```
+#Single GPU training (long training period, not recommended)
+python3 tools/train.py -c configs/rec/rec_svtrnet.yml
+#Multi GPU training, specify the gpu number through the --gpus parameter
+python3 -m paddle.distributed.launch --gpus '0,1,2,3'  tools/train.py -c configs/rec/rec_svtrnet.yml
+```
+Evaluation:
+You can download the model files and configuration files provided by `SVTR`: [download link](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar), take `SVTR-T` as an example, using the following command to evaluate:
+```
+# Download the tar archive containing the model files and configuration files of SVTR-T and extract it
+wget https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar && tar xf rec_svtr_tiny_none_ctc_en_train.tar
+# GPU evaluation
+python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c ./rec_svtr_tiny_none_ctc_en_train/rec_svtr_tiny_6local_6global_stn_en.yml -o Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy
+```
+Prediction:
+```
+python3 tools/infer_rec.py -c ./rec_svtr_tiny_none_ctc_en_train/rec_svtr_tiny_6local_6global_stn_en.yml -o Global.infer_img='./doc/imgs_words_en/word_10.png' Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy
+```
+<a name="4"></a>
+## 4. Inference and Deployment
+<a name="4-1"></a>
+### 4.1 Python Inference
+First, the model saved during the SVTR text recognition training process is converted into an inference model. ( [Model download link](https://paddleocr.bj.bcebos.com/PP-OCRv3/chinese/rec_svtr_tiny_none_ctc_en_train.tar) ), you can use the following command to convert:
+```
+python3 tools/export_model.py -c configs/rec/rec_svtrnet.yml -o Global.pretrained_model=./rec_svtr_tiny_none_ctc_en_train/best_accuracy  Global.save_inference_dir=./inference/rec_svtr_tiny_stn_en
+```
+**Note:**
+- If you are training the model on your own dataset and have modified the dictionary file, please pay attention to modify the `character_dict_path` in the configuration file to the modified dictionary file.
+- If you modified the input size during training, please modify the `infer_shape` corresponding to SVTR in the `tools/export_model.py` file.
+After the conversion is successful, there are three files in the directory:
+```
+/inference/rec_svtr_tiny_stn_en/
+    ├── inference.pdiparams
+    ├── inference.pdiparams.info
+    └── inference.pdmodel
+```
+For SVTR text recognition model inference, the following commands can be executed:
+```
+python3 tools/infer/predict_rec.py --image_dir='./doc/imgs_words_en/word_10.png' --rec_model_dir='./inference/rec_svtr_tiny_stn_en/' --rec_algorithm='SVTR' --rec_image_shape='3,64,256' --rec_char_dict_path='./ppocr/utils/ic15_dict.txt'
+```
+![](../imgs_words_en/word_10.png)
+After executing the command, the prediction result (recognized text and score) of the image above is printed to the screen, an example is as follows:
+The result is as follows:
+```shell
+Predicts of ./doc/imgs_words_en/word_10.png:('pain', 0.9999998807907104)
+```
+<a name="4-2"></a>
+### 4.2 C++ Inference
+Not supported
+<a name="4-3"></a>
+### 4.3 Serving
+Not supported
+<a name="4-4"></a>
+### 4.4 More
+Not supported
+<a name="5"></a>
+## 5. FAQ
+1. Since most of the operators used by `SVTR` are matrix multiplication, in the GPU environment, the speed has an advantage, but in the environment where mkldnn is enabled on the CPU, `SVTR` has no advantage over the optimized convolutional network.
+## Citation
+```bibtex
+@article{Du2022SVTR,
+  title     = {SVTR: Scene Text Recognition with a Single Visual Model},
+  author    = {Du, Yongkun and Chen, Zhineng and Jia, Caiyan and Yin, Xiaoting and Zheng, Tianlun and Li, Chenxia and Du, Yuning and Jiang, Yu-Gang},
+  booktitle = {IJCAI},
+  year      = {2022},
+  url       = {https://arxiv.org/abs/2205.00159}
+}
+```
--- a/ppocr/modeling/backbones/rec_svtrnet.py
+++ b/ppocr/modeling/backbones/rec_svtrnet.py
@@ -169,17 +169,14 @@ class Attention(nn.Layer):
            self.N = H * W
            self.C = dim
        if mixer == 'Local' and HW is not None:
            hk = local_k[0]
            wk = local_k[1]
-            mask = np.ones([H * W, H * W])
+            mask = paddle.ones([H * W, H + hk - 1, W + wk - 1], dtype='float32')
-            for h in range(H):
+            for h in range(0, H):
-                for w in range(W):
+                for w in range(0, W):
-                    for kh in range(-(hk // 2), (hk // 2) + 1):
+                    mask[h * W + w, h:h + hk, w:w + wk] = 0.
-                        for kw in range(-(wk // 2), (wk // 2) + 1):
+            mask_paddle = mask[:, hk // 2:H + hk // 2, wk // 2:W + wk //
-                            if H > (h + kh) >= 0 and W > (w + kw) >= 0:
+                               2].flatten(1)
-                                mask[h * W + w][(h + kh) * W + (w + kw)] = 0
-            mask_paddle = paddle.to_tensor(mask, dtype='float32')
            mask_inf = paddle.full([H * W, H * W], '-inf', dtype='float32')
            mask = paddle.where(mask_paddle < 1, mask_paddle, mask_inf)
            self.mask = mask.unsqueeze([0, 1])