add slanet dist training (#7825)

* add slanet dist training * fix * fix

2eabedf0 · littletomatodonkey · GitHub · 4c00402c

Showing 2 changed files with 42 additions and 16 deletions

doc/doc_ch/distributed_training.md +22 -8

doc/doc_en/distributed_training_en.md +20 -8

--- a/doc/doc_ch/distributed_training.md
+++ b/doc/doc_ch/distributed_training.md
@@ -41,16 +41,30 @@ python3 -m paddle.distributed.launch \

 ## Performance comparison

-* On 2 machines with 8 P40 GPUs each, training on the 260k-image public recognition dataset (LSVT, RCTW, MTWI) takes the following time.
+* Models were trained on 2 machines with 8 P40 GPUs each. The accuracy, training time, and multi-machine speedup of each model are shown below.

-| Model   | Config  | Accuracy     | 1x8 GPU training time | 2x8 GPU training time | Speedup |
-|------|-----|--------|--------|--------|-----|
-| CRNN | [rec_chinese_lite_train_v2.0.yml](../../configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml) | 67.0% | 2.50d   | 1.67d  | **1.5** |
+| Model   | Config  | Dataset   | 1x8 GPU training time / accuracy | 2x8 GPU training time / accuracy | Speedup |
+|:------:|:-----:|:--------:|:--------:|:--------:|:-----:|
+| CRNN | [rec_chinese_lite_train_v2.0.yml](../../configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml) |  260k Chinese dataset | 2.50d/66.7%   | 1.67d/67.0%  | **1.5** |


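-* On 4 machines with 8 V100 GPUs each, training on the full dataset takes the following time.
+* Models were trained on 3 machines with 8 V100 GPUs each. The accuracy, training time, and multi-machine speedup of each model are shown below.

+| Model   | Config  | Dataset   | 1x8 GPU training time / accuracy | 3x8 GPU training time / accuracy | Speedup |
+|:------:|:-----:|:--------:|:--------:|:--------:|:-----:|
+| SLANet | [SLANet.yml](../../configs/table/SLANet.yml) |  PubTabNet | 49.8h/76.2%   | 19.75h/74.77%  | **2.52** |

-| Model   | Config  | Accuracy     | 1x8 GPU training time | 4x8 GPU training time | Speedup |
-|------|-----|--------|--------|--------|-----|
-| SVTR | [ch_PP-OCRv3_rec_distillation.yml](../../configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml) | 74.0% | 10d   | 2.84d  | **3.5** |
+
+    > Note: for 3x8 GPU training, the per-GPU batch size is the same as in 1x8 GPU training, and the learning rate is multiplied by 2 (with the default multiplier of 3, accuracy is only 73.42%).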
+
+
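+* Models were trained on 4 machines with 8 V100 GPUs each. The accuracy, training time, and multi-machine speedup of each model are shown below.
+
+
+| Model   | Config  | Dataset   | 1x8 GPU training time / accuracy | 4x8 GPU training time / accuracy | Speedup |
+|:------:|:-----:|:--------:|:--------:|:--------:|:-----:|
+| SVTR | [ch_PP-OCRv3_rec_distillation.yml](../../configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml) |  PP-OCRv3_rec data | 10d/-   | 2.84d/74.0%  | **3.5** |
+
+
+* **Note**
+    * When training on too many GPUs, accuracy drops slightly (by about 1%). Adding warmup or moderately increasing the number of training epochs can help recover it.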
--- a/doc/doc_en/distributed_training_en.md
+++ b/doc/doc_en/distributed_training_en.md
@@ -40,17 +40,29 @@ python3 -m paddle.distributed.launch \

 ## Performance comparison

-* On two 8-card P40 graphics cards, the final time consumption and speedup ratio for public recognition dataset (LSVT, RCTW, MTWI) containing 260k images are as follows.
+* We conducted model training on 2x8 P40 GPUs. Accuracy, training time, and multi machine acceleration ratio of different models are shown below.

+| Model    | Configuration | Configuration   | 8 GPU training time / Accuracy | 3x8 GPU training time / Accuracy | Acceleration ratio  |

-| Model   | Config file  | Recognition acc     | single 8-card training time | two 8-card training time | Speedup ratio |
-|------|-----|--------|--------|--------|-----|
-| CRNN | [rec_chinese_lite_train_v2.0.yml](../../configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml) | 67.0% | 2.50d   | 1.67d  | **1.5** |

+| Model    | Configuration | Configuration   | 8 GPU training time / Accuracy | 3x8 GPU training time / Accuracy | Acceleration ratio  |
+|:------:|:-----:|:--------:|:--------:|:--------:|:-----:|
+| CRNN | [rec_chinese_lite_train_v2.0.yml](../../configs/rec/ch_ppocr_v2.0/rec_chinese_lite_train_v2.0.yml) |  260k Chinese dataset | 2.50d/66.7%   | 1.67d/67.0%  | **1.5** |

-* On four 8-card V100 graphics cards, the final time consumption and speedup ratio for full data are as follows.

+* We conducted model training on 3x8 V100 GPUs. Accuracy, training time, and multi machine acceleration ratio of different models are shown below.

-| Model   | Config file  | Recognition acc     | single 8-card training time | four 8-card training time | Speedup ratio |
-|------|-----|--------|--------|--------|-----|
-| SVTR | [ch_PP-OCRv3_rec_distillation.yml](../../configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml) | 74.0% | 10d   | 2.84d  | **3.5** |
+| Model    | Configuration | Configuration   | 8 GPU training time / Accuracy | 3x8 GPU training time / Accuracy | Acceleration ratio  |
+|:------:|:-----:|:--------:|:--------:|:--------:|:-----:|
+| SLANet | [SLANet.yml](../../configs/table/SLANet.yml) |  PubTabNet | 49.8h/76.2%   | 19.75h/74.77%  | **2.52** |
+
+
+    > Note: when training with 3x8 GPUs, the single card batch size is unchanged compared with the 1x8 GPUs' training process, and the learning rate is multiplied by 2 (if it is multiplied by 3 by default, the accuracy is only 73.42%).
+
+
+* We conducted model training on 4x8 V100 GPUs. Accuracy, training time, and multi machine acceleration ratio of different models are shown below.
+
+
+| Model    | Configuration | Configuration   | 8 GPU training time / Accuracy | 4x8 GPU training time / Accuracy | Acceleration ratio  |
+|:------:|:-----:|:--------:|:--------:|:--------:|:-----:|
+| SVTR | [ch_PP-OCRv3_rec_distillation.yml](../../configs/rec/PP-OCRv3/ch_PP-OCRv3_rec_distillation.yml) |  PP-OCRv3_rec data | 10d/-   | 2.84d/74.0%  | **3.5** |
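
The acceleration ratios in the tables above appear to be the single-machine training time divided by the multi-machine training time. The sketch below recomputes them from the timings in this diff and also derives a scaling efficiency (ratio divided by machine count). It ends with the learning-rate adjustment described in the SLANet note. The function names and the `base_lr` value are hypothetical and chosen only for illustration; none of them come from the PaddleOCR configs or code.

```python
# Timings copied from the tables in this diff; days are converted to hours.
RUNS = {
    # model: (number of machines, 1x8 GPU hours, multi-machine hours)
    "CRNN": (2, 2.50 * 24, 1.67 * 24),
    "SLANet": (3, 49.8, 19.75),
    "SVTR": (4, 10 * 24, 2.84 * 24),
}


def speedup(single_hours: float, multi_hours: float) -> float:
    """Acceleration ratio: single-machine time over multi-machine time."""
    return single_hours / multi_hours


def scaling_efficiency(ratio: float, machines: int) -> float:
    """Fraction of ideal linear scaling achieved (1.0 means perfectly linear)."""
    return ratio / machines


def scaled_lr(base_lr: float, factor: float) -> float:
    """Learning rate after multiplying the single-machine value by `factor`."""
    return base_lr * factor


for model, (machines, single, multi) in RUNS.items():
    ratio = speedup(single, multi)
    eff = scaling_efficiency(ratio, machines)
    print(f"{model}: speedup {ratio:.2f}, efficiency {eff:.0%}")

# Assumption: base_lr is a placeholder, not the value in SLANet.yml.
base_lr = 0.001
print("3x8 GPUs, factor 2 (used in the note):", scaled_lr(base_lr, 2))
print("3x8 GPUs, factor 3 (default per the note):", scaled_lr(base_lr, 3))
```

Rounded to the precision used in the tables, the recomputed ratios match the reported 1.5, 2.52, and 3.5. The efficiency figure is an extra derived quantity that the original tables do not report.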