add svtr FAQ and data_aug (#8864)

955c46e0 · topduke · GitHub · b20e2dd5 · 955c46e0 · 955c46e0
7 changed file
--- a/configs/rec/rec_svtrnet.yml
+++ b/configs/rec/rec_svtrnet.yml
@@ -26,10 +26,10 @@ Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.99
-  epsilon: 8.e-8
+  epsilon: 1.e-8
  weight_decay: 0.05
  no_weight_decay_name: norm pos_embed
-  one_dim_param_no_weight_decay: true
+  one_dim_param_no_weight_decay: True
  lr:
    name: Cosine
    learning_rate: 0.0005
@@ -48,7 +48,7 @@ Architecture:
  Backbone:
    name: SVTRNet
    img_size: [32, 100]
-    out_char_num: 25
+    out_char_num: 25 # W//4 or W//8 or W/12
    out_channels: 192
    patch_merging: 'Conv'
    embed_dim: [64, 128, 256]
@@ -57,7 +57,7 @@ Architecture:
    mixer: ['Local','Local','Local','Local','Local','Local','Global','Global','Global','Global','Global','Global']
    local_mixer: [[7, 11], [7, 11], [7, 11]]
    last_stage: True
-    prenorm: false
+    prenorm: False
  Neck:
    name: SequenceEncoder
    encoder_type: reshape
@@ -82,6 +82,8 @@ Train:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
+      - SVTRRecAug:
+          aug_type: 0 # or 1
      - CTCLabelEncode: # Class handling label
      - SVTRRecResizeImg:
          image_shape: [3, 64, 256]
@@ -92,7 +94,7 @@ Train:
    shuffle: True
    batch_size_per_card: 512
    drop_last: True
-    num_workers: 4
+    num_workers: 8

 Eval:
  dataset:

--- a/configs/rec/rec_svtrnet_ch.yml
+++ b/configs/rec/rec_svtrnet_ch.yml
@@ -23,7 +23,7 @@ Optimizer:
  name: AdamW
  beta1: 0.9
  beta2: 0.99
-  epsilon: 8.0e-08
+  epsilon: 1.0e-08
  weight_decay: 0.05
  no_weight_decay_name: norm pos_embed
  one_dim_param_no_weight_decay: true
@@ -40,7 +40,7 @@ Architecture:
    img_size:
    - 32
    - 320
-    out_char_num: 40
+    out_char_num: 40 # W//4 or W//8 or W/12
    out_channels: 96
    patch_merging: Conv
    embed_dim:

--- a/doc/doc_ch/algorithm_rec_svtr.md
+++ b/doc/doc_ch/algorithm_rec_svtr.md
@@ -159,7 +159,23 @@ Predicts of ./doc/imgs_words_en/word_10.png:('pain', 0.9999998807907104)
 <a name="5"></a>
 ## 5. FAQ

-1. 由于`SVTR`使用的算子大多为矩阵相乘，在GPU环境下，速度具有优势，但在CPU开启mkldnn加速环境下，`SVTR`相比于被优化的卷积网络没有优势。
+- 1. GPU和CPU速度对比
+  - 由于`SVTR`使用的算子大多为矩阵相乘，在GPU环境下，速度具有优势，但在CPU开启mkldnn加速环境下，`SVTR`相比于被优化的卷积网络没有优势。
+- 2. SVTR模型转ONNX失败
+  - 保证`paddle2onnx`和`onnxruntime`版本最新，转onnx命令参考[SVTR模型转onnx步骤实例](https://github.com/PaddlePaddle/PaddleOCR/issues/7821#issuecomment-1271214273)。
+- 3. SVTR转ONNX成功但是推理结果不正确
+  - 可能的原因模型参数`out_char_num`设置不正确，应设置为W//4、W//8或者W//12，可以参考[高精度中文场景文本识别模型SVTR的3.3.3章节](https://aistudio.baidu.com/aistudio/projectdetail/5073182?contributionType=1)。
+- 4. 长文本识别优化
+  - 参考[高精度中文场景文本识别模型SVTR的3.3章节](https://aistudio.baidu.com/aistudio/projectdetail/5073182?contributionType=1)。
+- 5. 论文结果复现注意事项
+  - 数据集使用[ABINet](https://github.com/FangShancheng/ABINet)提供的数据集；
+  - 默认使用4卡GPU训练，单卡Batchsize默认为512，总Batchsize为2048，对应的学习率为0.0005，当修改Batchsize或者改变GPU卡数，学习率应等比例修改。
+- 6. 进一步优化的探索点
+  - 学习率调整：可以调整为默认的两倍保持Batchsize不变；或者将Batchsize减小为默认的1/2，保持学习率不变；
+  - 数据增强策略：可选`RecConAug`和`RecAug`；
+  - 如果不使用STN时，可以将`mixer`的`Local`替换为`Conv`、`local_mixer`全部修改为`[5, 5]`；
+  - 网格搜索最优的`embed_dim`、`depth`、`num_heads`配置；
+  - 使用`后Normalization策略`，即是将模型配置`prenorm`修改为`True`。


 ## 引用

--- a/doc/doc_en/algorithm_rec_svtr_en.md
+++ b/doc/doc_en/algorithm_rec_svtr_en.md
@@ -130,7 +130,23 @@ Not supported
 <a name="5"></a>
 ## 5. FAQ

-1. Since most of the operators used by `SVTR` are matrix multiplication, in the GPU environment, the speed has an advantage, but in the environment where mkldnn is enabled on the CPU, `SVTR` has no advantage over the optimized convolutional network.
+- 1. Speed situation on CPU and GPU
+  - Since most of the operators used by `SVTR` are matrix multiplication, in the GPU environment, the speed has an advantage, but in the environment where mkldnn is enabled on the CPU, `SVTR` has no advantage over the optimized convolutional network.
+- 2. SVTR model convert to ONNX failed
+  - Ensure `paddle2onnx` and `onnxruntime` versions are up to date, refer to [SVTR model to onnx step-by-step example](https://github.com/PaddlePaddle/PaddleOCR/issues/7821#issuecomment-) for the convert onnx command. 1271214273).
+- 3. SVTR model convert to ONNX is successful but the inference result is incorrect
+  - The possible reason is that the model parameter `out_char_num` is not set correctly, it should be set to W//4, W//8 or W//12, please refer to [Section 3.3.3 of SVTR, a high-precision Chinese scene text recognition model](https://aistudio.baidu.com/aistudio/) projectdetail/5073182?contributionType=1).
+- 4. Optimization of long text recognition
+  - Refer to [Section 3.3 of SVTR, a high-precision Chinese scene text recognition model](https://aistudio.baidu.com/aistudio/projectdetail/5073182?contributionType=1).
+- 5. Notes on the reproduction of the paper results
+  - Dataset using provided by [ABINet](https://github.com/FangShancheng/ABINet).
+  - By default, 4 cards of GPUs are used for training, the default Batchsize of a single card is 512, and the total Batchsize is 2048, corresponding to a learning rate of 0.0005. When modifying the Batchsize or changing the number of GPU cards, the learning rate should be modified in equal proportion.
+- 6. Exploration Directions for further optimization
+  - Learning rate adjustment: adjusting to twice the default to keep Batchsize unchanged; or reducing Batchsize to 1/2 the default to keep the learning rate unchanged.
+  - Data augmentation strategies: optionally `RecConAug` and `RecAug`.
+  - If STN is not used, `Local` of `mixer` can be replaced by `Conv` and `local_mixer` can all be modified to `[5, 5]`.
+  - Grid search for optimal `embed_dim`, `depth`, `num_heads` configurations.
+  - Use the `Post-Normalization strategy`, which is to modify the model configuration `prenorm` to `True`.

 ## Citation


--- a/ppocr/data/imaug/__init__.py
+++ b/ppocr/data/imaug/__init__.py
@@ -27,7 +27,7 @@ from .make_pse_gt import MakePseGt
 from .rec_img_aug import BaseDataAugmentation, RecAug, RecConAug, RecResizeImg, ClsResizeImg, \
    SRNRecResizeImg, GrayRecResizeImg, SARRecResizeImg, PRENResizeImg, \
    ABINetRecResizeImg, SVTRRecResizeImg, ABINetRecAug, VLRecResizeImg, SPINRecResizeImg, RobustScannerRecResizeImg, \
-    RFLRecResizeImg
+    RFLRecResizeImg, SVTRRecAug
 from .ssl_img_aug import SSLRotateResize
 from .randaugment import RandAugment
 from .copy_paste import CopyPaste

--- a/ppocr/data/imaug/abinet_aug.py
+++ b/ppocr/data/imaug/abinet_aug.py
@@ -405,3 +405,54 @@ class CVColorJitter(object):
    def __call__(self, img):
        if random.random() < self.p: return self.transforms(img)
        else: return img
+
+
+class SVTRDeterioration(object):
+    def __init__(self, var, degrees, factor, p=0.5):
+        self.p = p
+        transforms = []
+        if var is not None:
+            transforms.append(CVGaussianNoise(var=var))
+        if degrees is not None:
+            transforms.append(CVMotionBlur(degrees=degrees))
+        if factor is not None:
+            transforms.append(CVRescale(factor=factor))
+        self.transforms = transforms
+
+    def __call__(self, img):
+        if random.random() < self.p:
+            random.shuffle(self.transforms)
+            transforms = Compose(self.transforms)
+            return transforms(img)
+        else:
+            return img
+
+
+class SVTRGeometry(object):
+    def __init__(self,
+                 aug_type=0,
+                 degrees=15,
+                 translate=(0.3, 0.3),
+                 scale=(0.5, 2.),
+                 shear=(45, 15),
+                 distortion=0.5,
+                 p=0.5):
+        self.aug_type = aug_type
+        self.p = p
+        self.transforms = []
+        self.transforms.append(CVRandomRotation(degrees=degrees))
+        self.transforms.append(CVRandomAffine(
+                degrees=degrees, translate=translate, scale=scale, shear=shear))
+        self.transforms.append(CVRandomPerspective(distortion=distortion))
+
+    def __call__(self, img):
+        if random.random() < self.p:
+            if self.aug_type:
+                random.shuffle(self.transforms)
+                transforms = Compose(self.transforms[:random.randint(1, 3)])
+                img = transforms(img)
+            else:
+                img = self.transforms[random.randint(0, 2)](img)
+            return img
+        else:
+            return img
\ No newline at end of file
--- a/ppocr/data/imaug/rec_img_aug.py
+++ b/ppocr/data/imaug/rec_img_aug.py
@@ -19,7 +19,7 @@ import random
 import copy
 from PIL import Image
 from .text_image_aug import tia_perspective, tia_stretch, tia_distort
-from .abinet_aug import CVGeometry, CVDeterioration, CVColorJitter
+from .abinet_aug import CVGeometry, CVDeterioration, CVColorJitter, SVTRGeometry, SVTRDeterioration
 from paddle.vision.transforms import Compose


@@ -109,8 +109,9 @@ class ABINetRecAug(object):
                scale=(0.5, 2.),
                shear=(45, 15),
                distortion=0.5,
-                p=geometry_p), CVDeterioration(
-                    var=20, degrees=6, factor=4, p=deterioration_p),
+                p=geometry_p),
+            CVDeterioration(
+                var=20, degrees=6, factor=4, p=deterioration_p),
            CVColorJitter(
                brightness=0.5,
                contrast=0.5,
@@ -169,6 +170,39 @@ class RecConAug(object):
        return data


+class SVTRRecAug(object):
+    def __init__(self,
+                 aug_type=0,
+                 geometry_p=0.5,
+                 deterioration_p=0.25,
+                 colorjitter_p=0.25,
+                 **kwargs):
+        self.transforms = Compose([
+            SVTRGeometry(
+                aug_type=aug_type,
+                degrees=45,
+                translate=(0.0, 0.0),
+                scale=(0.5, 2.),
+                shear=(45, 15),
+                distortion=0.5,
+                p=geometry_p),
+            SVTRDeterioration(
+                var=20, degrees=6, factor=4, p=deterioration_p),
+            CVColorJitter(
+                brightness=0.5,
+                contrast=0.5,
+                saturation=0.5,
+                hue=0.1,
+                p=colorjitter_p)
+        ])
+
+    def __call__(self, data):
+        img = data['image']
+        img = self.transforms(img)
+        data['image'] = img
+        return data
+
+
 class ClsResizeImg(object):
    def __init__(self, image_shape, **kwargs):
        self.image_shape = image_shape