Skip to content
体验新版
项目
组织
正在加载...
登录
切换导航
打开侧边栏
Oneflow-Inc
DLPerf
提交
31dd21d3
D
DLPerf
项目概览
Oneflow-Inc
/
DLPerf
上一次同步 2 年多
通知
4
Star
152
Fork
0
代码
文件
提交
分支
Tags
贡献者
分支图
Diff
Issue
0
列表
看板
标记
里程碑
合并请求
0
DevOps
流水线
流水线任务
计划
Wiki
0
Wiki
分析
仓库
DevOps
项目成员
Pages
D
DLPerf
项目概览
项目概览
详情
发布
仓库
仓库
文件
提交
分支
标签
贡献者
分支图
比较
Issue
0
Issue
0
列表
看板
标记
里程碑
合并请求
0
合并请求
0
Pages
DevOps
DevOps
流水线
流水线任务
计划
分析
分析
仓库分析
DevOps
Wiki
0
Wiki
成员
成员
收起侧边栏
关闭侧边栏
动态
分支图
创建新Issue
流水线任务
提交
Issue看板
前往新版Gitcode,体验更适合开发者的 AI 搜索 >>
未验证
提交
31dd21d3
编写于
9月 24, 2020
作者:
S
Snow
提交者:
GitHub
9月 24, 2020
浏览文件
操作
浏览文件
下载
电子邮件补丁
差异文件
Update README.md
上级
625a424d
变更
1
隐藏空白更改
内联
并排
Showing
1 changed file
with
152 addition
and
71 deletion
+152
-71
NVIDIADeepLearningExamples/PyTorch/BERT/README.md
NVIDIADeepLearningExamples/PyTorch/BERT/README.md
+152
-71
未找到文件。
NVIDIADeepLearningExamples/PyTorch/BERT/README.md
浏览文件 @
31dd21d3
...
...
@@ -8,25 +8,28 @@
## 内容目录 Table Of Contents
*
[
概述 Overview
](
#---overview
)
*
[
内容目录 Table Of Contents
](
#-----table-of-contents
)
*
[
环境 Environment
](
#---environment
)
+
[
系统
](
#--
)
-
[
硬件
](
#--
)
-
[
软件
](
#--
)
+
[
NGC 容器
](
#ngc---
)
*
[
Feature support matrix
](
#feature-support-matrix
)
*
[
快速开始 Quick Start
](
#-----quick-start
)
+
[
1. 前期准备
](
#1-----
)
-
[
数据集
](
#---
)
-
[
镜像及容器
](
#-----
)
-
[
SSH 免密
](
#ssh---
)
+
[
2. 运行测试
](
#2-----
)
+
[
3. 数据处理
](
#3-----
)
*
[
性能结果 Performance
](
#-----performance
)
+
[
FP32 & W/O XLA
](
#fp32---w-o-xla
)
-
[
BERT-Base batch_size = 32
](
#bert-base-batch-size---32
)
-
[
BERT-Base batch_size = 48
](
#bert-base-batch-size---48
)
*
[
概述 Overview
](
#---overview
)
*
[
内容目录 Table Of Contents
](
#-----table-of-contents
)
*
[
环境 Environment
](
#---environment
)
+
[
系统
](
#--
)
-
[
硬件
](
#--
)
-
[
软件
](
#--
)
+
[
NGC 容器
](
#ngc---
)
*
[
Feature support matrix
](
#feature-support-matrix
)
*
[
快速开始 Quick Start
](
#-----quick-start
)
+
[
1. 前期准备
](
#1-----
)
-
[
数据集
](
#---
)
-
[
镜像及容器
](
#-----
)
-
[
SSH 免密
](
#ssh---
)
+
[
2. 运行测试
](
#2-----
)
+
[
3. 数据处理
](
#3-----
)
*
[
性能结果 Performance
](
#-----performance
)
+
[
FP32
](
#fp32
)
-
[
BERT-Base batch_size = 32
](
#bert-base-batch-size---32
)
-
[
BERT-Base batch_size = 48
](
#bert-base-batch-size---48
)
*
[
FP16
](
#fp16
)
-
[
BERT-Base batch_size = 64
](
#bert-base-batch-size---64
)
-
[
BERT-Base batch_size = 96
](
#bert-base-batch-size---96
)
## 环境 Environment
...
...
@@ -72,8 +75,8 @@
| Feature | BERT PyTorch |
| ------------------------------- | ------------ |
| Multi-gpu training | Yes |
| Multi-node |
No
|
| Automatic mixed precision (AMP) |
No
|
| Multi-node |
Yes
|
| Automatic mixed precision (AMP) |
Yes
|
...
...
@@ -207,9 +210,27 @@ bash run_single_node.sh
即可执行针对单机单卡、单机 2 卡、4 卡、 8 卡, batch_size 分别取 32、48 等情况的集成测试,并将 log 信息保存在当前目录的 /ngc/pytorch/ 对应分布式配置路径中,如单机单卡为 /1n1g,意为 1 node 1 gpu;单机 8卡 为 /1n8g,意为 1 node 8 gpus,以此类推。
如需测试
`fp16`
,直接修改脚本中的
`PREC`
为
`fp16`
即可。
-
**多机测试**
将本仓库 /DLPerf/NVIDIADeepLearningExamples/PyTorch/BERT/scripts 目录源码移至 /workspace/examples/bert/test_scripts(需新建) 下,2 机 16 卡 执行脚本
```
bash run_two_node.sh
```
4 机 32 卡 执行脚本:
```
bash run_multi_nodes.sh
```
即可执行多节点 batch_size 分别取 32、48 等情况的集成测试,并将 log 信息保存在当前目录的 /ngc/pytorch/ 对应分布式配置路径中。
### 3. 数据处理
测试进行了多组训练(本测试中取 5 次),每次训练过程只取第 1 个 epoch 的前 1
2
0 iter,计算训练速度时只取后 100 iter 的数据,以降低抖动。最后将 5 次训练的结果取中位数得到最终速度,并以此数据计算加速比。
测试进行了多组训练(本测试中取 5 次),每次训练过程只取第 1 个 epoch 的前 1
5
0 iter,计算训练速度时只取后 100 iter 的数据,以降低抖动。最后将 5 次训练的结果取中位数得到最终速度,并以此数据计算加速比。
运行 /DLPerf/NVIDIADeepLearningExamples/PyTorch/BERT/extract_pytorch_logs_time.py,即可得到针对不同配置测试结果 log 数据处理的结果:
...
...
@@ -220,42 +241,75 @@ python extract_pytorch_logs_time.py --log_dir /workspace/examples/bert/test_scri
结果打印如下
```
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n2g/bert-base-adam-training_b48_fp32_2.log {2: 230.0}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n2g/bert-base-adam-training_b48_fp32_3.log {2: 230.0, 3: 230.45}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n2g/bert-base-adam-training_b48_fp32_4.log {2: 230.0, 3: 230.45, 4: 230.03}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n2g/bert-base-adam-training_b48_fp32_1.log {2: 230.0, 3: 230.45, 4: 230.03, 1: 230.19}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n2g/bert-base-adam-training_b48_fp32_5.log {2: 230.0, 3: 230.45, 4: 230.03, 1: 230.19, 5: 230.74}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n1g/bert-base-adam-training_b48_fp32_2.log {2: 122.79}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n1g/bert-base-adam-training_b48_fp32_3.log {2: 122.79, 3: 122.82}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n1g/bert-base-adam-training_b48_fp32_4.log {2: 122.79, 3: 122.82, 4: 122.96}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n1g/bert-base-adam-training_b48_fp32_1.log {2: 122.79, 3: 122.82, 4: 122.96, 1: 123.12}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n1g/bert-base-adam-training_b48_fp32_5.log {2: 122.79, 3: 122.82, 4: 122.96, 1: 123.12, 5: 122.91}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n8g/bert-base-adam-training_b48_fp32_2.log {2: 938.46}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n8g/bert-base-adam-training_b48_fp32_3.log {2: 938.46, 3: 938.06}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n8g/bert-base-adam-training_b48_fp32_4.log {2: 938.46, 3: 938.06, 4: 938.88}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n8g/bert-base-adam-training_b48_fp32_1.log {2: 938.46, 3: 938.06, 4: 938.88, 1: 936.75}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n8g/bert-base-adam-training_b48_fp32_5.log {2: 938.46, 3: 938.06, 4: 938.88, 1: 936.75, 5: 940.09}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n4g/bert-base-adam-training_b48_fp32_2.log {2: 469.75}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n4g/bert-base-adam-training_b48_fp32_3.log {2: 469.75, 3: 469.92}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n4g/bert-base-adam-training_b48_fp32_4.log {2: 469.75, 3: 469.92, 4: 469.32}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n4g/bert-base-adam-training_b48_fp32_1.log {2: 469.75, 3: 469.92, 4: 469.32, 1: 471.54}
/workspace/examples/bert/test_scripts/ngc_bert_b48/pytorch/1n4g/bert-base-adam-training_b48_fp32_5.log {2: 469.75, 3: 469.92, 4: 469.32, 1: 471.54, 5: 469.6}
{'bert-base-adam-training': {'1n1g': {'average_speed': 122.92,
'batch_size_per_device': 48,
'median_speed': 122.91,
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/4n8g/bert-base-adam-training_b96_fp16_5.log {5: 10273.14}
end_time: 2020-09-24 02:12:08.999291
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/4n8g/bert-base-adam-training_b96_fp16_1.log {5: 10273.14, 1: 10552.87}
end_time: 2020-09-24 02:15:18.098056
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/4n8g/bert-base-adam-training_b96_fp16_3.log {5: 10273.14, 1: 10552.87, 3: 10324.68}
end_time: 2020-09-24 02:16:52.945844
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/4n8g/bert-base-adam-training_b96_fp16_4.log {5: 10273.14, 1: 10552.87, 3: 10324.68, 4: 10349.12}
end_time: 2020-09-24 02:13:43.531300
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/4n8g/bert-base-adam-training_b96_fp16_2.log {5: 10273.14, 1: 10552.87, 3: 10324.68, 4: 10349.12, 2: 10414.77}
end_time: 2020-09-24 03:20:44.972941
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n1g/bert-base-adam-training_b96_fp16_5.log {5: 463.85}
end_time: 2020-09-24 03:15:44.213131
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n1g/bert-base-adam-training_b96_fp16_1.log {5: 463.85, 1: 462.35}
end_time: 2020-09-24 03:18:14.318222
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n1g/bert-base-adam-training_b96_fp16_3.log {5: 463.85, 1: 462.35, 3: 466.94}
end_time: 2020-09-24 03:19:29.565003
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n1g/bert-base-adam-training_b96_fp16_4.log {5: 463.85, 1: 462.35, 3: 466.94, 4: 462.14}
end_time: 2020-09-24 03:16:58.796182
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n1g/bert-base-adam-training_b96_fp16_2.log {5: 463.85, 1: 462.35, 3: 466.94, 4: 462.14, 2: 462.35}
end_time: 2020-09-24 02:26:50.557894
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/2n8g/bert-base-adam-training_b96_fp16_5.log {5: 5366.7}
end_time: 2020-09-24 02:20:55.793547
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/2n8g/bert-base-adam-training_b96_fp16_1.log {5: 5366.7, 1: 5426.07}
end_time: 2020-09-24 02:23:47.979051
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/2n8g/bert-base-adam-training_b96_fp16_3.log {5: 5366.7, 1: 5426.07, 3: 5448.97}
end_time: 2020-09-24 02:25:18.862542
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/2n8g/bert-base-adam-training_b96_fp16_4.log {5: 5366.7, 1: 5426.07, 3: 5448.97, 4: 5439.94}
end_time: 2020-09-24 02:22:21.485900
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/2n8g/bert-base-adam-training_b96_fp16_2.log {5: 5366.7, 1: 5426.07, 3: 5448.97, 4: 5439.94, 2: 5410.84}
end_time: 2020-09-24 03:34:57.059096
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n8g/bert-base-adam-training_b96_fp16_5.log {5: 3339.15}
end_time: 2020-09-24 03:29:01.089925
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n8g/bert-base-adam-training_b96_fp16_1.log {5: 3339.15, 1: 3260.58}
end_time: 2020-09-24 03:31:53.242659
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n8g/bert-base-adam-training_b96_fp16_3.log {5: 3339.15, 1: 3260.58, 3: 3260.74}
end_time: 2020-09-24 03:33:31.687091
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n8g/bert-base-adam-training_b96_fp16_4.log {5: 3339.15, 1: 3260.58, 3: 3260.74, 4: 3310.51}
end_time: 2020-09-24 03:30:27.478401
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n8g/bert-base-adam-training_b96_fp16_2.log {5: 3339.15, 1: 3260.58, 3: 3260.74, 4: 3310.51, 2: 3287.12}
end_time: 2020-09-24 03:27:35.906235
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n4g/bert-base-adam-training_b96_fp16_5.log {5: 1727.93}
end_time: 2020-09-24 03:22:04.678285
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n4g/bert-base-adam-training_b96_fp16_1.log {5: 1727.93, 1: 1734.78}
end_time: 2020-09-24 03:24:55.125125
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n4g/bert-base-adam-training_b96_fp16_3.log {5: 1727.93, 1: 1734.78, 3: 1731.19}
end_time: 2020-09-24 03:26:14.931147
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n4g/bert-base-adam-training_b96_fp16_4.log {5: 1727.93, 1: 1734.78, 3: 1731.19, 4: 1726.71}
end_time: 2020-09-24 03:23:23.394265
/workspace/examples/bert/test_scripts/fp16_ngc_bert_b96/pytorch/1n4g/bert-base-adam-training_b96_fp16_2.log {5: 1727.93, 1: 1734.78, 3: 1731.19, 4: 1726.71, 2: 1723.52}
{'bert-base-adam-training': {'1n1g': {'average_speed': 463.53,
'batch_size_per_device': 96,
'median_speed': 462.35,
'speedup': 1.0},
'1n2g': {'average_speed': 230.28,
'batch_size_per_device': 48,
'median_speed': 230.19,
'speedup': 1.87},
'1n4g': {'average_speed': 470.03,
'batch_size_per_device': 48,
'median_speed': 469.75,
'speedup': 3.82},
'1n8g': {'average_speed': 938.45,
'batch_size_per_device': 48,
'median_speed': 938.46,
'speedup': 7.64}}}
'1n4g': {'average_speed': 1728.83,
'batch_size_per_device': 96,
'median_speed': 1727.93,
'speedup': 3.74},
'1n8g': {'average_speed': 3291.62,
'batch_size_per_device': 96,
'median_speed': 3287.12,
'speedup': 7.11},
'2n8g': {'average_speed': 5418.5,
'batch_size_per_device': 96,
'median_speed': 5426.07,
'speedup': 11.74},
'4n8g': {'average_speed': 10382.92,
'batch_size_per_device': 96,
'median_speed': 10349.12,
'speedup': 22.38}}}
Saving result to ./result/_result.json
```
...
...
@@ -263,27 +317,54 @@ Saving result to ./result/_result.json
该小节提供针对 NVIDIA PyTorch 框架的 BERT 模型测试的性能结果和完整 log 日志。
### FP32
& W/O XLA
### FP32
-
#### BERT-Base batch_size = 32
| gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
| ---------------- | --------------------- | ------------------ | ------- |
| 1 | 32 | 119.61 | 1.00 |
| 2 | 32 | 221.18 | 1.85 |
| 4 | 32 | 455.7 | 3.81 |
| 8 | 32 | 908.85 | 7.6 |
| node_num | gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
| -------- | ---------------- | --------------------- | ------------------ | ------- |
| 1 | 1 | 32 | 119.69 | 1.00 |
| 1 | 4 | 32 | 457.17 | 3.82 |
| 1 | 8 | 32 | 921.98 | 7.7 |
| 2 | 8 | 32 | 1495.71 | 12.5 |
| 4 | 8 | 32 | 2882.5 | 24.08 |
-
#### BERT-Base batch_size = 48
| gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
| ---------------- | --------------------- | ------------------ | ------- |
| 1 | 48 | 122.91 | 1.00 |
| 2 | 48 | 230.19 | 1.87 |
| 4 | 48 | 469.75 | 3.82 |
| 8 | 48 | 938.46 | 7.64 |
| node_num | gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
| -------- | ---------------- | --------------------- | ------------------ | ------- |
| 1 | 1 | 48 | 121.94 | 1.00 |
| 1 | 4 | 48 | 464.66 | 3.81 |
| 1 | 8 | 48 | 928.01 | 7.61 |
| 2 | 8 | 48 | 1584.32 | 12.99 |
| 4 | 8 | 48 | 3039.3 | 24.92 |
## FP16
-
#### BERT-Base batch_size = 64
| node_num | gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
| -------- | ---------------- | --------------------- | ------------------ | ------- |
| 1 | 1 | 64 | 444.51 | 1.0 |
| 1 | 4 | 64 | 1671.66 | 3.76 |
| 1 | 8 | 64 | 3251.7 | 7.32 |
| 2 | 8 | 64 | 4936.92 | 11.11 |
| 4 | 8 | 64 | 9331.72 | 20.99 |
-
#### BERT-Base batch_size = 96
| node_num | gpu_num_per_node | batch_size_per_device | samples/s(PyTorch) | speedup |
| -------- | ---------------- | --------------------- | ------------------ | ------- |
| 1 | 1 | 96 | 462.35 | 1.0 |
| 1 | 4 | 96 | 1727.93 | 3.74 |
| 1 | 8 | 96 | 3287.12 | 7.11 |
| 2 | 8 | 96 | 5426.07 | 11.74 |
| 4 | 8 | 96 | 10349.12 | 22.38 |
NVIDIA的 PyTorch 官方测评结果详见
[
BERT For PyTorch - Performance Results
](
https://github.com/NVIDIA/DeepLearningExamples/blob/5cc03caa153faab7a2c3b1b5b5d63663f06ce1b4/PyTorch/LanguageModeling/BERT/README.md#results
)
详细 Log 信息可下载:
[
ngc_pytorch_bert.tar
](
https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/NVIDIA/Pytorch/ngc_pytorch_bert.tar
)
编辑
预览
Markdown
is supported
0%
请重试
或
添加新附件
.
添加附件
取消
You are about to add
0
people
to the discussion. Proceed with caution.
先完成此消息的编辑!
取消
想要评论请
注册
或
登录