Update README.md

31dd21d3 · Snow · GitHub · 625a424d · 31dd21d3

Showing 1 changed file with 152 additions and 71 deletions

NVIDIADeepLearningExamples/PyTorch/BERT/README.md +152 -71

--- a/NVIDIADeepLearningExamples/PyTorch/BERT/README.md
+++ b/NVIDIADeepLearningExamples/PyTorch/BERT/README.md
@@ -8,25 +8,28 @@

 ## Table Of Contents

-* [Overview](#---overview)
-* [Table Of Contents](#-----table-of-contents)
-* [Environment](#---environment)
-  + [System](#--)
-    - [Hardware](#--)
-    - [Software](#--)
-  + [NGC Container](#ngc---)
-    * [Feature support matrix](#feature-support-matrix)
-* [Quick Start](#-----quick-start)
-  + [1. Preparation](#1-----)
-    - [Dataset](#---)
-    - [Image and Container](#-----)
-    - [Passwordless SSH](#ssh---)
-  + [2. Running the Tests](#2-----)
-  + [3. Data Processing](#3-----)
-* [Performance](#-----performance)
-  + [FP32 & W/O XLA](#fp32---w-o-xla)
-    - [BERT-Base batch_size = 32](#bert-base-batch-size---32)
-    - [BERT-Base batch_size = 48](#bert-base-batch-size---48)
+  * [Overview](#---overview)
+  * [Table Of Contents](#-----table-of-contents)
+  * [Environment](#---environment)
+    + [System](#--)
+      - [Hardware](#--)
+      - [Software](#--)
+    + [NGC Container](#ngc---)
+        * [Feature support matrix](#feature-support-matrix)
+  * [Quick Start](#-----quick-start)
+    + [1. Preparation](#1-----)
+      - [Dataset](#---)
+      - [Image and Container](#-----)
+      - [Passwordless SSH](#ssh---)
+    + [2. Running the Tests](#2-----)
+    + [3. Data Processing](#3-----)
+  * [Performance](#-----performance)
+    + [FP32](#fp32)
+      - [BERT-Base batch_size = 32](#bert-base-batch-size---32)
+      - [BERT-Base batch_size = 48](#bert-base-batch-size---48)
+    + [FP16](#fp16)
+      - [BERT-Base batch_size = 64](#bert-base-batch-size---64)
+      - [BERT-Base batch_size = 96](#bert-base-batch-size---96)


 ## Environment
@@ -72,8 +75,8 @@
  | Feature                         | BERT PyTorch |
  | ------------------------------- | ------------ |
  | Multi-gpu training              | Yes          |
-  | Multi-node                      | No           |
-  | Automatic mixed precision (AMP) | No           |
+  | Multi-node                      | Yes          |
+  | Automatic mixed precision (AMP) | Yes          |



@@ -207,9 +210,27 @@ bash run_single_node.sh

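 This runs the integrated tests for a single node with 1, 2, 4, and 8 GPUs at batch_size 32, 48, and so on. The logs are saved under /ngc/pytorch/ in the current directory, in a subdirectory named after the distributed configuration: /1n1g means 1 node with 1 GPU, /1n8g means 1 node with 8 GPUs, and so on.

+To test `fp16`, set `PREC` to `fp16` in the script.
+
+- **Multi-node testing**
+
+Move the source files from this repository's /DLPerf/NVIDIADeepLearningExamples/PyTorch/BERT/scripts directory to /workspace/examples/bert/test_scripts (create this directory first). For 2 nodes with 16 GPUs, run: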
+
+```
+bash run_two_node.sh
+```
+
+For 4 nodes with 32 GPUs, run:
+
+```
+bash run_multi_nodes.sh
+```
+
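+This runs the multi-node integrated tests at batch_size 32, 48, and so on, and saves the logs under /ngc/pytorch/ in the current directory, in the subdirectory for the corresponding distributed configuration.
+
 ### 3. Data Processing

-Each configuration is trained several times (5 runs in this test). Each run covers only the first 120 iterations of the first epoch, and only the last 100 iterations are used to compute the training speed, which reduces jitter. The final speed is the median of the 5 runs, and the speedup is computed from that value.
+Each configuration is trained several times (5 runs in this test). Each run covers only the first 150 iterations of the first epoch, and only the last 100 iterations are used to compute the training speed, which reduces jitter. The final speed is the median of the 5 runs, and the speedup is computed from that value.

 Run /DLPerf/NVIDIADeepLearningExamples/PyTorch/BERT/extract_pytorch_logs_time.py to process the logs from the different test configurations: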
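
As a rough illustration of the aggregation described in this section (not the actual extract_pytorch_logs_time.py, whose internals are not shown in this diff), the sketch below takes the five per-run speeds of two configurations, uses the median as the final speed, and divides by the 1n1g median to get the speedup. The helper name `summarize` and the dictionary layout are assumptions made for this example; the numbers are the fp16, batch size 96 values from the README's sample output.

```python
from statistics import mean, median

# Per-run speeds (samples/s), keyed by run index, for two configurations.
runs = {
    "1n1g": {5: 463.85, 1: 462.35, 3: 466.94, 4: 462.14, 2: 462.35},
    "4n8g": {5: 10273.14, 1: 10552.87, 3: 10324.68, 4: 10349.12, 2: 10414.77},
}


def summarize(runs, baseline="1n1g"):
    """Hypothetical helper: median/average speed per config, speedup vs. baseline."""
    base = median(list(runs[baseline].values()))
    result = {}
    for config, speeds in runs.items():
        values = list(speeds.values())
        med = median(values)  # the median damps the effect of outlier runs
        result[config] = {
            "average_speed": round(mean(values), 2),
            "median_speed": med,
            "speedup": round(med / base, 2),
        }
    return result


summary = summarize(runs)
print(summary["4n8g"])
```

With these inputs the sketch gives a 4n8g median of 10349.12 samples/s and a speedup of 22.38, the same values that appear in the summary printed by the script.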

@@ -220,42 +241,75 @@ python extract_pytorch_logs_time.py --log_dir /workspace/examples/bert/test_scri
 The output looks like this:

 ```
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n2g/bert-base-adam-training_b48_fp32_2.log {2: 230.0}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n2g/bert-base-adam-training_b48_fp32_3.log {2: 230.0, 3: 230.45}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n2g/bert-base-adam-training_b48_fp32_4.log {2: 230.0, 3: 230.45, 4: 230.03}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n2g/bert-base-adam-training_b48_fp32_1.log {2: 230.0, 3: 230.45, 4: 230.03, 1: 230.19}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n2g/bert-base-adam-training_b48_fp32_5.log {2: 230.0, 3: 230.45, 4: 230.03, 1: 230.19, 5: 230.74}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n1g/bert-base-adam-training_b48_fp32_2.log {2: 122.79}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n1g/bert-base-adam-training_b48_fp32_3.log {2: 122.79, 3: 122.82}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n1g/bert-base-adam-training_b48_fp32_4.log {2: 122.79, 3: 122.82, 4: 122.96}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n1g/bert-base-adam-training_b48_fp32_1.log {2: 122.79, 3: 122.82, 4: 122.96, 1: 123.12}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n1g/bert-base-adam-training_b48_fp32_5.log {2: 122.79, 3: 122.82, 4: 122.96, 1: 123.12, 5: 122.91}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n8g/bert-base-adam-training_b48_fp32_2.log {2: 938.46}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n8g/bert-base-adam-training_b48_fp32_3.log {2: 938.46, 3: 938.06}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n8g/bert-base-adam-training_b48_fp32_4.log {2: 938.46, 3: 938.06, 4: 938.88}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n8g/bert-base-adam-training_b48_fp32_1.log {2: 938.46, 3: 938.06, 4: 938.88, 1: 936.75}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n8g/bert-base-adam-training_b48_fp32_5.log {2: 938.46, 3: 938.06, 4: 938.88, 1: 936.75, 5: 940.09}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n4g/bert-base-adam-training_b48_fp32_2.log {2: 469.75}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n4g/bert-base-adam-training_b48_fp32_3.log {2: 469.75, 3: 469.92}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n4g/bert-base-adam-training_b48_fp32_4.log {2: 469.75, 3: 469.92, 4: 469.32}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n4g/bert-base-adam-training_b48_fp32_1.log {2: 469.75, 3: 469.92, 4: 469.32, 1: 471.54}
-/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n4g/bert-base-adam-training_b48_fp32_5.log {2: 469.75, 3: 469.92, 4: 469.32, 1: 471.54, 5: 469.6}
-{'bert-base-adam-training': {'1n1g': {'average_speed': 122.92,
-                                      'batch_size_per_device': 48,
-                                      'median_speed': 122.91,
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/4n8g/bert-base-adam-training_b96_fp16_5.log {5: 10273.14}
+end_time:  2020-09-24 02:12:08.999291
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/4n8g/bert-base-adam-training_b96_fp16_1.log {5: 10273.14, 1: 10552.87}
+end_time:  2020-09-24 02:15:18.098056
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/4n8g/bert-base-adam-training_b96_fp16_3.log {5: 10273.14, 1: 10552.87, 3: 10324.68}
+end_time:  2020-09-24 02:16:52.945844
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/4n8g/bert-base-adam-training_b96_fp16_4.log {5: 10273.14, 1: 10552.87, 3: 10324.68, 4: 10349.12}
+end_time:  2020-09-24 02:13:43.531300
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/4n8g/bert-base-adam-training_b96_fp16_2.log {5: 10273.14, 1: 10552.87, 3: 10324.68, 4: 10349.12, 2: 10414.77}
+end_time:  2020-09-24 03:20:44.972941
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n1g/bert-base-adam-training_b96_fp16_5.log {5: 463.85}
+end_time:  2020-09-24 03:15:44.213131
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n1g/bert-base-adam-training_b96_fp16_1.log {5: 463.85, 1: 462.35}
+end_time:  2020-09-24 03:18:14.318222
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n1g/bert-base-adam-training_b96_fp16_3.log {5: 463.85, 1: 462.35, 3: 466.94}
+end_time:  2020-09-24 03:19:29.565003
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n1g/bert-base-adam-training_b96_fp16_4.log {5: 463.85, 1: 462.35, 3: 466.94, 4: 462.14}
+end_time:  2020-09-24 03:16:58.796182
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n1g/bert-base-adam-training_b96_fp16_2.log {5: 463.85, 1: 462.35, 3: 466.94, 4: 462.14, 2: 462.35}
+end_time:  2020-09-24 02:26:50.557894
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/2n8g/bert-base-adam-training_b96_fp16_5.log {5: 5366.7}
+end_time:  2020-09-24 02:20:55.793547
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/2n8g/bert-base-adam-training_b96_fp16_1.log {5: 5366.7, 1: 5426.07}
+end_time:  2020-09-24 02:23:47.979051
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/2n8g/bert-base-adam-training_b96_fp16_3.log {5: 5366.7, 1: 5426.07, 3: 5448.97}
+end_time:  2020-09-24 02:25:18.862542
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/2n8g/bert-base-adam-training_b96_fp16_4.log {5: 5366.7, 1: 5426.07, 3: 5448.97, 4: 5439.94}
+end_time:  2020-09-24 02:22:21.485900
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/2n8g/bert-base-adam-training_b96_fp16_2.log {5: 5366.7, 1: 5426.07, 3: 5448.97, 4: 5439.94, 2: 5410.84}
+end_time:  2020-09-24 03:34:57.059096
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n8g/bert-base-adam-training_b96_fp16_5.log {5: 3339.15}
+end_time:  2020-09-24 03:29:01.089925
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n8g/bert-base-adam-training_b96_fp16_1.log {5: 3339.15, 1: 3260.58}
+end_time:  2020-09-24 03:31:53.242659
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n8g/bert-base-adam-training_b96_fp16_3.log {5: 3339.15, 1: 3260.58, 3: 3260.74}
+end_time:  2020-09-24 03:33:31.687091
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n8g/bert-base-adam-training_b96_fp16_4.log {5: 3339.15, 1: 3260.58, 3: 3260.74, 4: 3310.51}
+end_time:  2020-09-24 03:30:27.478401
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n8g/bert-base-adam-training_b96_fp16_2.log {5: 3339.15, 1: 3260.58, 3: 3260.74, 4: 3310.51, 2: 3287.12}
+end_time:  2020-09-24 03:27:35.906235
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n4g/bert-base-adam-training_b96_fp16_5.log {5: 1727.93}
+end_time:  2020-09-24 03:22:04.678285
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n4g/bert-base-adam-training_b96_fp16_1.log {5: 1727.93, 1: 1734.78}
+end_time:  2020-09-24 03:24:55.125125
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n4g/bert-base-adam-training_b96_fp16_3.log {5: 1727.93, 1: 1734.78, 3: 1731.19}
+end_time:  2020-09-24 03:26:14.931147
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n4g/bert-base-adam-training_b96_fp16_4.log {5: 1727.93, 1: 1734.78, 3: 1731.19, 4: 1726.71}
+end_time:  2020-09-24 03:23:23.394265
+/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n4g/bert-base-adam-training_b96_fp16_2.log {5: 1727.93, 1: 1734.78, 3: 1731.19, 4: 1726.71, 2: 1723.52}
+{'bert-base-adam-training': {'1n1g': {'average_speed': 463.53,
+                                      'batch_size_per_device': 96,
+                                      'median_speed': 462.35,
                                      'speedup': 1.0},
-                             '1n2g': {'average_speed': 230.28,
-                                      'batch_size_per_device': 48,
-                                      'median_speed': 230.19,
-                                      'speedup': 1.87},
-                             '1n4g': {'average_speed': 470.03,
-                                      'batch_size_per_device': 48,
-                                      'median_speed': 469.75,
-                                      'speedup': 3.82},
-                             '1n8g': {'average_speed': 938.45,
-                                      'batch_size_per_device': 48,
-                                      'median_speed': 938.46,
-                                      'speedup': 7.64}}}
+                             '1n4g': {'average_speed': 1728.83,
+                                      'batch_size_per_device': 96,
+                                      'median_speed': 1727.93,
+                                      'speedup': 3.74},
+                             '1n8g': {'average_speed': 3291.62,
+                                      'batch_size_per_device': 96,
+                                      'median_speed': 3287.12,
+                                      'speedup': 7.11},
+                             '2n8g': {'average_speed': 5418.5,
+                                      'batch_size_per_device': 96,
+                                      'median_speed': 5426.07,
+                                      'speedup': 11.74},
+                             '4n8g': {'average_speed': 10382.92,
+                                      'batch_size_per_device': 96,
+                                      'median_speed': 10349.12,
+                                      'speedup': 22.38}}}
 Saving result to ./result/_result.json
 ```

@@ -263,27 +317,54 @@ Saving result to ./result/_result.json

 This section provides the performance results and complete logs for the BERT model tests on NVIDIA's PyTorch framework.

-### FP32 & W/O XLA
+### FP32 

 - #### BERT-Base batch_size = 32

-| gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
-| ---------------- | --------------------- | ------------------ | ------- |
-| 1                | 32                    | 119.61             | 1.00    |
-| 2                | 32                    | 221.18             | 1.85    |
-| 4                | 32                    | 455.7              | 3.81    |
-| 8                | 32                    | 908.85             | 7.6     |
+| node_num | gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
+| -------- | ---------------- | --------------------- | ------------------ | ------- |
+| 1        | 1                | 32                    | 119.69             | 1.00    |
+| 1        | 4                | 32                    | 457.17             | 3.82    |
+| 1        | 8                | 32                    | 921.98             | 7.70    |
+| 2        | 8                | 32                    | 1495.71            | 12.50   |
+| 4        | 8                | 32                    | 2882.5             | 24.08   |

 - #### BERT-Base batch_size = 48

-| gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
-| ---------------- | --------------------- | ------------------ | ------- |
-| 1                | 48                    | 122.91             | 1.00    |
-| 2                | 48                    | 230.19             | 1.87    |
-| 4                | 48                    | 469.75             | 3.82    |
-| 8                | 48                    | 938.46             | 7.64    |
+| node_num | gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
+| -------- | ---------------- | --------------------- | ------------------ | ------- |
+| 1        | 1                | 48                    | 121.94             | 1.00    |
+| 1        | 4                | 48                    | 464.66             | 3.81    |
+| 1        | 8                | 48                    | 928.01             | 7.61    |
+| 2        | 8                | 48                    | 1584.32            | 12.99   |
+| 4        | 8                | 48                    | 3039.3             | 24.92   |
+
+### FP16
+
+- #### BERT-Base batch_size = 64
+
+| node_num | gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
+| -------- | ---------------- | --------------------- | ------------------ | ------- |
+| 1        | 1                | 64                    | 444.51             | 1.00    |
+| 1        | 4                | 64                    | 1671.66            | 3.76    |
+| 1        | 8                | 64                    | 3251.7             | 7.32    |
+| 2        | 8                | 64                    | 4936.92            | 11.11   |
+| 4        | 8                | 64                    | 9331.72            | 20.99   |
+
+
+
+- #### BERT-Base batch_size = 96
+
+| node_num | gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
+| -------- | ---------------- | --------------------- | ------------------ | ------- |
+| 1        | 1                | 96                    | 462.35             | 1.00    |
+| 1        | 4                | 96                    | 1727.93            | 3.74    |
+| 1        | 8                | 96                    | 3287.12            | 7.11    |
+| 2        | 8                | 96                    | 5426.07            | 11.74   |
+| 4        | 8                | 96                    | 10349.12           | 22.38   |
+
+

 For NVIDIA's official PyTorch benchmark results, see [BERT For PyTorch - Performance Results](https://github.com/NVIDIA/DeepLearningExamples/blob/5cc03caa153faab7a2c3b1b5b5d63663f06ce1b4/PyTorch/LanguageModeling/BERT/README.md#results)

 The detailed logs can be downloaded here: [ngc_pytorch_bert.tar](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/NVIDIA/Pytorch/ngc_pytorch_bert.tar)
-
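
The log directory names encode the distributed configuration (for example, 1n8g is 1 node with 8 GPUs), so the total GPU count can be recovered from the name. The README does not report scaling efficiency; the sketch below derives it, as an assumption of this example, as the speedup over 1n1g divided by the total number of GPUs. `parse_config` and `scaling_efficiency` are hypothetical helper names, and the speeds are the median samples/s from the FP16, batch_size = 96 table.

```python
import re


def parse_config(tag):
    """Split a tag such as '4n8g' into (node_num, gpu_num_per_node)."""
    match = re.fullmatch(r"(\d+)n(\d+)g", tag)
    if match is None:
        raise ValueError(f"unrecognized configuration tag: {tag!r}")
    return int(match.group(1)), int(match.group(2))


def scaling_efficiency(tag, speed, baseline_speed):
    """Speedup over the single-GPU baseline divided by the total GPU count."""
    nodes, gpus_per_node = parse_config(tag)
    return (speed / baseline_speed) / (nodes * gpus_per_node)


# Median samples/s taken from the FP16, batch_size = 96 table.
fp16_b96 = {
    "1n1g": 462.35,
    "1n4g": 1727.93,
    "1n8g": 3287.12,
    "2n8g": 5426.07,
    "4n8g": 10349.12,
}

for tag, speed in fp16_b96.items():
    eff = scaling_efficiency(tag, speed, fp16_b96["1n1g"])
    print(f"{tag}: {parse_config(tag)} -> efficiency {eff:.2f}")
```

For 4n8g this comes out at roughly 0.70, that is, about 70% of ideal linear scaling across 32 GPUs.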