Oneflow-Inc
DLPerf
Commit 7cf3c693
Authored Oct 28, 2020 by Flowingsun007
add deepspeed-bert readme
Parent: b1f1b503
Showing 9 changed files with 937 additions and 0 deletions (+937 −0)
DeepSpeed/README.md  +427 −0
DeepSpeed/bert_base.json  +57 −0
DeepSpeed/deepspeed_bsz64k_adam_config_seq128.json  +32 −0
DeepSpeed/extract_deepspeed_logs.py  +130 −0
DeepSpeed/extract_deepspeed_logs_time.py  +146 −0
DeepSpeed/scripts/multi_node_train.sh  +72 −0
DeepSpeed/scripts/run_multi_node.sh  +20 −0
DeepSpeed/scripts/run_single_node.sh  +37 −0
DeepSpeed/scripts/run_two_node.sh  +16 −0
DeepSpeed/README.md  0 → 100644
# 【DLPerf】DeepSpeed - BERT Evaluation

# Overview
This evaluation reproduces [BERT-base](https://github.com/microsoft/DeepSpeedExamples/tree/ba63ad0fa861d28b3b33bc2c20f702647403e258/bing_bert) from Microsoft's official [DeepSpeed repository](https://github.com/microsoft/DeepSpeed/tree/1afca8f722fbcedf4b0ec0bf9e165d60564a7bba). The goal is speed measurement: from the measured throughput we derive the speedup on 1, 2, and 4 machines, assessing the framework's horizontal scalability in distributed multi-node training.

The tests currently cover FP32 and FP16 mixed precision; the suite will be maintained and extended with more configurations.
# Environment
## System
- OS: Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-116-generic x86_64)
- GPU: Tesla V100-SXM2-16GB x 8
- Driver: NVIDIA 440.33.01
- CUDA: 10.2
- cuDNN: 7.6.5
- NCCL: 2.7.3

## Framework
- **torch 1.6.0**
## Feature support matrix
| Feature | BERT-base DeepSpeed |
| ------------------------------ | ------------------- |
| Multi-node, multi-gpu training | Yes |
| NVIDIA NCCL | Yes |
| Mixed precision | Yes |
# Quick Start
## Project code
- Microsoft [DeepSpeed repository](https://github.com/microsoft/DeepSpeed/tree/1afca8f722fbcedf4b0ec0bf9e165d60564a7bba)
- [BERT](https://github.com/microsoft/DeepSpeedExamples/tree/ba63ad0fa861d28b3b33bc2c20f702647403e258/bing_bert)

Download the official source:
```shell
git clone https://github.com/microsoft/DeepSpeed.git
cd DeepSpeed && git checkout 1afca8f722fbcedf4b0ec0bf9e165d60564a7bba
cd DeepSpeedExamples/bing_bert
```
Put all the .json config files and all the .sh scripts from the scripts folder on this page into the `bing_bert/` directory.
## Installing dependencies
If you pulled the official DeepSpeed image via `docker pull deepspeed/deepspeed:latest`, you can start a container with:
```shell
docker pull deepspeed/deepspeed:latest
docker run -it -d -p 12345:22 --shm-size=16g --ulimit memlock=-1 --privileged \
--name deepspeed --cap-add=IPC_LOCK \
-v /datasets/bert/deepspeed/data/test:/datasets \
deepspeed/deepspeed:latest
```
All dependencies are already installed inside the container, so the installation steps below are unnecessary. On a bare-metal machine, follow the steps below to set up the environment.
### Create the environment
```shell
# create a conda environment
conda create -n deepspeed python=3.7.9
conda activate deepspeed
# install pytorch
python3 -m pip install torch==1.6.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install torchvision==0.7.0 matplotlib==3.3.2
sudo apt install pdsh
```
Edit `DeepSpeed/requirements/requirements.txt` and comment out the first two lines:
```
# torch>=1.2
# torchvision>=0.4.0
```
Install the BERT training dependencies:
```shell
# run from the DeepSpeed root directory
python3 -m pip install -r requirements/requirements.txt
python3 -m pip install -r requirements/requirements-dev.txt
```
### Install DeepSpeed
In the `deepspeed` conda environment, run `bash install.sh`. The script builds and installs a series of wheel packages such as apex and deepspeed. The process may fail with an error:
![error.png](https://cdn.nlark.com/yuque/0/2020/png/216914/1602677523508-d71b79b8-625c-453a-9d64-c75d84afba79.png)
The cause is that the pip invoked by the script comes from the local `.local` directory, whose version does not match python3.7 in the `deepspeed` conda environment, producing:
`ERROR: apex-0.1-cp37-cp37m-linux_x86_64.whl is not a supported wheel on this platform`
If you hit this error, change install.sh at [line 153](https://github.com/microsoft/DeepSpeed/blob/1afca8f722fbcedf4b0ec0bf9e165d60564a7bba/install.sh#L153) to:
```
PIP_SUDO="python3 -m "
```
and rerun `bash install.sh`. Compilation takes a few minutes; on success:
![success.png](https://cdn.nlark.com/yuque/0/2020/png/216914/1602678788832-9819f36a-0b68-4a4c-a9dd-3b6b5a24fc03.png)
## NCCL
PyTorch distributed training relies on the NCCL library. Download a build matching your OS and CUDA version from the [NVIDIA NCCL site](https://developer.nvidia.com/nccl/nccl-download) and install it. This test installs NCCL 2.7.3:
```shell
sudo dpkg -i nccl-repo-ubuntu1604-2.7.3-ga-cuda10.2_1-1_amd64.deb
sudo apt update
sudo apt install libnccl2=2.7.3-1+cuda10.2 libnccl-dev=2.7.3-1+cuda10.2
```
## Dataset
### Training set
Training uses the Wikipedia dataset, converted to .hdf5 format with the scripts NVIDIA provides; see the [NVIDIA quick-start guide](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#quick-start-guide).
### Vocabulary files
If you launch training directly, the program downloads the vocabulary file (vocab.txt) from s3.amazonaws.com, which is slow. Instead, download the vocabulary files manually and put them in a new `bing_bert/data` directory. Download links are listed in [tokenization.py](https://github.com/microsoft/DeepSpeedExamples/blob/ba63ad0fa861d28b3b33bc2c20f702647403e258/bing_bert/pytorch_pretrained_bert/tokenization.py). Once the files are in `bing_bert/data`, replace the `PRETRAINED_VOCAB_ARCHIVE_MAP{}` at [tokenization.py line 30](https://github.com/microsoft/DeepSpeedExamples/blob/ba63ad0fa861d28b3b33bc2c20f702647403e258/bing_bert/pytorch_pretrained_bert/tokenization.py#L30) as follows:
```python3
CACHE_DIR = "/your/path/to/DeepSpeed/DeepSpeedExamples/bing_bert/data/"
PRETRAINED_VOCAB_ARCHIVE_MAP = {
    'bert-base-uncased': CACHE_DIR + "bert-base-uncased-vocab.txt",
    'bert-large-uncased': CACHE_DIR + "bert-large-uncased-vocab.txt",
    'bert-base-cased': CACHE_DIR + "bert-base-cased-vocab.txt",
    'bert-large-cased': CACHE_DIR + "bert-large-cased-vocab.txt",
    'bert-base-multilingual-uncased': CACHE_DIR + "bert-base-multilingual-uncased-vocab.txt",
    'bert-base-multilingual-cased': CACHE_DIR + "bert-base-multilingual-cased-vocab.txt",
    'bert-base-chinese': CACHE_DIR + "bert-base-chinese-vocab.txt",
}
```
After this change, the program loads the vocabulary files locally.
# Training
The cluster has 4 nodes:

- NODE1=10.11.0.2
- NODE2=10.11.0.3
- NODE3=10.11.0.4
- NODE4=10.11.0.5

Each node has 8 GPUs. With the default batch size of 32, we ran several groups of training runs from 1 machine with 1 GPU up to 4 machines with 32 GPUs.
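The global batch size of each configuration is simply nodes × GPUs per node × per-GPU batch size; a one-line sketch (an illustrative helper, not part of the repo):

```python
def global_batch(num_nodes, gpus_per_node, batch_per_gpu):
    # samples processed per training step across the whole cluster
    return num_nodes * gpus_per_node * batch_per_gpu

# 4 machines x 8 GPUs x batch 32 per GPU
print(global_batch(4, 8, 32))  # -> 1024
```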
Before training, a few preparation steps are needed:

1. Install dependencies: `python3 -m pip install boto3 h5py` (running BERT pretraining without them fails).
2. Modify deepspeed_train.py at [line 197](https://github.com/microsoft/DeepSpeedExamples/blob/ba63ad0fa861d28b3b33bc2c20f702647403e258/bing_bert/deepspeed_train.py#L197) as follows:
```
else:
    epoch_step += 1
    # Call DeepSpeed engine step on micro steps
    model.network.step()
```
so that the program exits automatically after training for 120 iterations.
## Single machine
In the `DeepSpeed/DeepSpeedExamples/bing_bert/` directory, run:
```shell
bash run_single_node.sh
```
This runs 5 groups of tests each on 1, 4, and 8 GPUs of a single machine, defaulting to fp32 precision and batch_size=32.
### Mixed precision
The batch_size and fp16 can be specified as arguments:
```shell
bash run_single_node.sh 32 fp16
```
Other combinations work the same way, e.g. `bash run_single_node.sh 128 fp16` or `bash run_single_node.sh 64 fp32`.
## 2 machines, 16 GPUs
For multi-machine runs (2 nodes, 4 nodes, etc.), the same dataset must be prepared at the same path on every node, and passwordless ssh must be configured between the nodes. Once that is done, create a hosts file in the `DeepSpeed/DeepSpeedExamples/bing_bert/` directory, e.g. deepspeed_hosts:
```
NODE1 slots=8
NODE2 slots=8
```
This file specifies two nodes, NODE1 and NODE2, each training with 8 GPUs.
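A hostfile like the one above can be checked mechanically; the `parse_hostfile` helper below is a hypothetical sketch (not part of DeepSpeed) that returns the node count and total GPU slots:

```python
def parse_hostfile(text):
    """Parse deepspeed-style hostfile lines such as 'NODE1 slots=8'."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        host, slots = line.split()
        nodes[host] = int(slots.split("=")[1])
    return len(nodes), sum(nodes.values())

# two nodes, 8 GPUs each
print(parse_hostfile("NODE1 slots=8\nNODE2 slots=8"))  # -> (2, 16)
```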
On the master node (NODE1), in the `DeepSpeed/DeepSpeedExamples/bing_bert/` directory, run:
```shell
bash run_two_node.sh
```
to start the 2-machine, 16-GPU training; as before it runs 5 groups by default (fp32 precision, batch_size=32).
### Mixed precision
The batch_size and fp16 can be specified as arguments:
```shell
bash run_two_node.sh 128 fp16
```
## 4 machines, 32 GPUs
Same procedure; on the master node run:
```shell
bash run_multi_node.sh
```
to start the 4-machine, 32-GPU training, again 5 groups by default (fp32 precision, batch_size=32).
### Mixed precision
The batch_size and fp16 can be specified as arguments:
```shell
bash run_multi_node.sh 128 fp16
```
#### Note: multi-machine training at FP32 precision raises an error:
```shell
vs003: Traceback (most recent call last):
vs003:   File "deepspeed_train.py", line 540, in <module>
vs003:     main()
vs003:   File "deepspeed_train.py", line 533, in main
vs003:     run(args, model, optimizer, start_epoch)
vs003:   File "deepspeed_train.py", line 499, in run
vs003:     train(args, index, model, optimizer, pretrain_dataset_provider)
vs003:   File "deepspeed_train.py", line 187, in train
vs003:     report_step_metrics(args, lr_this_step, unscaled_loss,
vs003: UnboundLocalError: local variable 'lr_this_step' referenced before assignment
```
This is an unfixed bug in the official code; see:
https://github.com/microsoft/DeepSpeedExamples/issues/53
https://github.com/microsoft/DeepSpeed/issues/426
# Result
## Throughput and speedup
Run the following to compute the throughput and speedup of each test configuration:
```shell
python extract_deepspeed_logs_time.py --log_dir=logs/deepspeed/bert/bz32 --batch_size_per_device=32
```
Output:
```shell
python extract_deepspeed_logs.py --log_dir=logs/deepspeed/bert/bz32 --batch_size_per_device=32
logs/deepspeed/bert/bz32/4n8g/bert_b32_fp32_4.log {4: 4891.77}
logs/deepspeed/bert/bz32/4n8g/bert_b32_fp32_3.log {4: 4891.77, 3: 4903.74}
logs/deepspeed/bert/bz32/4n8g/bert_b32_fp32_5.log {4: 4891.77, 3: 4903.74, 5: 4900.76}
logs/deepspeed/bert/bz32/4n8g/bert_b32_fp32_2.log {4: 4891.77, 3: 4903.74, 5: 4900.76, 2: 4899.37}
logs/deepspeed/bert/bz32/4n8g/bert_b32_fp32_1.log {4: 4891.77, 3: 4903.74, 5: 4900.76, 2: 4899.37, 1: 4873.92}
logs/deepspeed/bert/bz32/1n8g/bert_b32_fp32_4.log {4: 1150.49}
logs/deepspeed/bert/bz32/1n8g/bert_b32_fp32_3.log {4: 1150.49, 3: 1186.46}
logs/deepspeed/bert/bz32/1n8g/bert_b32_fp32_5.log {4: 1150.49, 3: 1186.46, 5: 1146.9}
logs/deepspeed/bert/bz32/1n8g/bert_b32_fp32_2.log {4: 1150.49, 3: 1186.46, 5: 1146.9, 2: 1145.22}
logs/deepspeed/bert/bz32/1n8g/bert_b32_fp32_1.log {4: 1150.49, 3: 1186.46, 5: 1146.9, 2: 1145.22, 1: 1147.9}
logs/deepspeed/bert/bz32/1n4g/bert_b32_fp32_4.log {4: 587.55}
logs/deepspeed/bert/bz32/1n4g/bert_b32_fp32_3.log {4: 587.55, 3: 575.07}
logs/deepspeed/bert/bz32/1n4g/bert_b32_fp32_5.log {4: 587.55, 3: 575.07, 5: 572.11}
logs/deepspeed/bert/bz32/1n4g/bert_b32_fp32_2.log {4: 587.55, 3: 575.07, 5: 572.11, 2: 573.84}
logs/deepspeed/bert/bz32/1n4g/bert_b32_fp32_1.log {4: 587.55, 3: 575.07, 5: 572.11, 2: 573.84, 1: 577.14}
logs/deepspeed/bert/bz32/1n1g/bert_b32_fp32_4.log {4: 147.81}
logs/deepspeed/bert/bz32/1n1g/bert_b32_fp32_3.log {4: 147.81, 3: 147.93}
logs/deepspeed/bert/bz32/1n1g/bert_b32_fp32_5.log {4: 147.81, 3: 147.93, 5: 143.8}
logs/deepspeed/bert/bz32/1n1g/bert_b32_fp32_2.log {4: 147.81, 3: 147.93, 5: 143.8, 2: 143.61}
logs/deepspeed/bert/bz32/1n1g/bert_b32_fp32_1.log {4: 147.81, 3: 147.93, 5: 143.8, 2: 143.61, 1: 148.42}
logs/deepspeed/bert/bz32/2n8g/bert_b32_fp32_4.log {4: 2273.68}
logs/deepspeed/bert/bz32/2n8g/bert_b32_fp32_3.log {4: 2273.68, 3: 2267.55}
logs/deepspeed/bert/bz32/2n8g/bert_b32_fp32_5.log {4: 2273.68, 3: 2267.55, 5: 2368.65}
logs/deepspeed/bert/bz32/2n8g/bert_b32_fp32_2.log {4: 2273.68, 3: 2267.55, 5: 2368.65, 2: 2264.69}
logs/deepspeed/bert/bz32/2n8g/bert_b32_fp32_1.log {4: 2273.68, 3: 2267.55, 5: 2368.65, 2: 2264.69, 1: 2266.27}
{'bert': {'1n1g': {'average_speed': 146.31,
                   'batch_size_per_device': 32,
                   'median_speed': 147.81,
                   'speedup': 1.0},
          '1n4g': {'average_speed': 577.14,
                   'batch_size_per_device': 32,
                   'median_speed': 575.07,
                   'speedup': 3.89},
          '1n8g': {'average_speed': 1155.39,
                   'batch_size_per_device': 32,
                   'median_speed': 1147.9,
                   'speedup': 7.77},
          '2n8g': {'average_speed': 2288.17,
                   'batch_size_per_device': 32,
                   'median_speed': 2267.55,
                   'speedup': 15.34},
          '4n8g': {'average_speed': 4893.91,
                   'batch_size_per_device': 32,
                   'median_speed': 4899.37,
                   'speedup': 33.15}}}
Saving result to ./result/bz32_result.json
```
## Computation rules
### 1. Speed-measurement scripts
- extract_deepspeed_logs.py
- extract_deepspeed_logs_time.py

The two scripts differ slightly, so their results have a small discrepancy:
extract_deepspeed_logs.py averages the speeds the official code prints in the log, discarding the first 20 of the 120 iterations and averaging the last 100;
extract_deepspeed_logs_time.py instead uses the timestamps printed in the log, computing speed from the actual running time of the last 100 iterations (again excluding the first 20).
The README shows the results computed by extract_deepspeed_logs.py.
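Stated as code, the two rules look roughly like this (illustrative helpers only, not the repo scripts; they assume a list of per-iteration SamplesPerSec values and the measured wall-clock time of the last 100 iterations):

```python
def speed_from_log(samples_per_sec, warmup=20, total=120):
    # Rule 1: average the per-iteration speeds printed in the log,
    # discarding the first `warmup` of the `total` iterations.
    window = samples_per_sec[warmup:total]
    return round(sum(window) / len(window), 2)

def speed_from_time(total_batch_size, cost_seconds, warmup=20, total=120):
    # Rule 2: global batch size divided by the measured time per
    # iteration over the same window.
    iter_num = total - warmup
    return round(total_batch_size / (cost_seconds / iter_num), 2)
```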
### 2. Average speed and median speed
- average_speed: mean speed
- median_speed: median speed

Each batch size is trained 5 times, forming one group; within a group, average_speed is the mean and median_speed the median of the 5 runs.
### 3. Speedup is computed from median speed
The **speedup** in the scripts and tables is computed relative to the median speed of the single-machine, single-GPU run. For example:
if 1 machine/1 GPU runs at 200 samples/s, 1 machine/2 GPUs at 400, and 1 machine/4 GPUs at 700, the speedups are 1.0, 2.0, and 3.5 respectively.
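The worked example above in code form (`speedup` is an illustrative helper, not one of the repo scripts):

```python
def speedup(median_speeds, baseline='1n1g'):
    # Speedup of each configuration's median speed relative to the
    # single-machine, single-GPU baseline.
    base = median_speeds[baseline]
    return {case: round(speed / base, 2) for case, speed in median_speeds.items()}

# 200, 400, 700 samples/s -> speedups 1.0, 2.0, 3.5
print(speedup({'1n1g': 200.0, '1n2g': 400.0, '1n4g': 700.0}))
```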
## BERT-Base FP32
### batch size=32 & without xla
| node_num | gpu_num | samples/s | speedup |
| -------- | ------- | --------- | ------- |
| 1 | 1 | 147.81 | 1 |
| 1 | 4 | 575.07 | 3.89 |
| 1 | 8 | 1147.9 | 7.77 |
| 2 | 16 | 2267.55 | 15.34 |
| 4 | 32 | 4899.37 | 33.15 |
### batch size=64 & without xla
| node_num | gpu_num | samples/s | speedup |
| -------- | ------- | --------- | ------- |
| 1 | 1 | 152.32 | 1 |
| 1 | 4 | 601.64 | 3.95 |
| 1 | 8 | 1197.91 | 7.86 |
| 2 | 16 | 2318.82 | 15.22 |
| 4 | 32 | 4510.15 | 29.61 |
## BERT-Base FP16
### batch size=64 & without xla
| node_num | gpu_num | samples/s | speedup |
| -------- | ------- | --------- | ------- |
| 1 | 1 | 565.3 | 1 |
| 1 | 4 | 2271.61 | 4.02 |
| 1 | 8 | 4512.68 | 7.98 |
| 2 | 16 | 8944.0 | 15.82 |
| 4 | 32 | 16401.67 | 29.01 |
### batch size=128 & without xla
| node_num | gpu_num | samples/s | speedup |
| -------- | ------- | --------- | ------- |
| 1 | 1 | 607.12 | 1 |
| 1 | 4 | 2412.1 | 3.97 |
| 1 | 8 | 4863.79 | 8.01 |
| 2 | 16 | 9892.88 | 16.29 |
| 4 | 32 | 16809.43 | 27.69 |
### batch size=160 & without xla
| node_num | gpu_num | samples/s | speedup |
| -------- | ------- | --------- | ------- |
| 1 | 1 | 619.73 | 1 |
| 1 | 4 | 2528.53 | 4.08 |
| 1 | 8 | 4953.73 | 7.99 |
| 2 | 16 | 10122.54 | 16.33 |
| 4 | 32 | 17751.63 | 28.64 |
## Full logs
- [bert_fp32.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/DeepSpeed/bert/bert_fp32.zip)
- [bert_fp16.zip](https://oneflow-public.oss-cn-beijing.aliyuncs.com/DLPerf/logs/DeepSpeed/bert/bert_fp16.zip)
DeepSpeed/bert_base.json  0 → 100644
```json
{
  "name": "bing_bert_base",
  "bert_token_file": "bert-base-uncased",
  "bert_model_file": "bert-base-uncased",
  "bert_model_config": {
    "vocab_size_or_config_json_file": 30522,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "initializer_range": 0.02
  },
  "data": {
    "flags": {
      "pretrain_dataset": true,
      "pretrain_type": "wiki"
    },
    "mixed_seq_datasets": {
      "128": {
        "pretrain_dataset": "hdf5/wikicorpus_en/128"
      },
      "512": {
        "pretrain_dataset": "hdf5/wikicorpus_en/512"
      }
    }
  },
  "mixed_seq_training": {
    "128": {
      "num_epochs": 1,
      "warmup_proportion": 1.0,
      "learning_rate": 4e-4,
      "num_workers": 4,
      "async_worker": true,
      "decay_rate": 0.99,
      "decay_step": 520,
      "total_training_steps": 125000
    },
    "512": {
      "num_epochs": 160,
      "warmup_proportion": 0.02,
      "learning_rate": 1e-5,
      "num_workers": 0,
      "async_worker": true,
      "decay_rate": 0.90,
      "decay_step": 150,
      "total_training_steps": 7500
    }
  },
  "validation": {
    "path": "validation_set/"
  }
}
```
DeepSpeed/deepspeed_bsz64k_adam_config_seq128.json  0 → 100644
```json
{
  "train_batch_size": 32768,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1,
  "prescale_gradients": false,
  "optimizer": {
    "type": "ADAM",
    "params": {
      "lr": 1e-4,
      "weight_decay": 0.01,
      "bias_correction": false
    }
  },
  "gradient_clipping": 1.0,
  "wall_clock_breakdown": false,
  "fp16": {
    "enabled": false,
    "loss_scale": 0
  },
  "sparse_attention": {
    "mode": "fixed",
    "block": 16,
    "different_layout_per_head": true,
    "num_local_blocks": 4,
    "num_global_blocks": 1,
    "attention": "bidirectional",
    "horizontal_global_attention": false,
    "num_different_global_patterns": 4
  }
}
```
DeepSpeed/extract_deepspeed_logs.py  0 → 100755
```python
import os
import re
import sys
import glob
import json
import argparse
import pprint
import numpy as np

pp = pprint.PrettyPrinter(indent=1)
os.chdir(sys.path[0])

parser = argparse.ArgumentParser(description="flags for benchmark")
parser.add_argument("--log_dir", type=str, default="./logs/deepspeed/bert/bz32", required=True)
parser.add_argument("--output_dir", type=str, default="./result", required=False)
parser.add_argument('--warmup_batches', type=int, default=20)
parser.add_argument('--train_batches', type=int, default=120)
parser.add_argument('--batch_size_per_device', type=int, default=32)
args = parser.parse_args()


class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value


def extract_info_from_file(log_file, result_dict, speed_dict):
    # extract info from file name
    fname = os.path.basename(log_file)
    run_case = log_file.split("/")[-2]  # eg: 1n1g
    model = fname.split("_")[0]
    batch_size = int(fname.split("_")[1].strip("b"))
    precision = fname.split("_")[2]
    test_iter = int(fname.split("_")[3].strip(".log"))
    node_num = int(run_case[0])
    if len(run_case) == 4:
        card_num = int(run_case[-2])
    elif len(run_case) == 5:
        card_num = int(run_case[-3:-1])
    total_batch_size = node_num * card_num * batch_size

    tmp_dict = {
        'average_speed': 0,
        'batch_size_per_device': batch_size,
    }
    avg_speed_list = []

    # extract info from file content
    with open(log_file) as f:
        lines = f.readlines()
        for line in lines:
            if "SamplesPerSec" in line:
                p1 = re.compile(r"SamplesPerSec=(.*\.?.*)\n", re.S)
                item = re.findall(p1, line)
                a = float(item[0].strip())
                avg_speed_list.append(round(a, 4))

    # compute avg throughput
    begin_index = args.warmup_batches - 2
    avg_speed = round(np.mean(avg_speed_list[begin_index:args.train_batches]), 2)
    tmp_dict['average_speed'] = avg_speed
    result_dict[model][run_case]['average_speed'] = tmp_dict['average_speed']
    result_dict[model][run_case]['batch_size_per_device'] = tmp_dict['batch_size_per_device']
    speed_dict[model][run_case][test_iter] = avg_speed
    print(log_file, speed_dict[model][run_case])


def compute_median(iter_dict):
    speed_list = [i for i in iter_dict.values()]
    return round(np.median(speed_list), 2)


def compute_speedup(result_dict, speed_dict):
    model_list = [key for key in result_dict]  # eg. ['vgg16', 'rn50']
    for m in model_list:
        run_case = [key for key in result_dict[m]]  # eg. ['4n8g', '2n8g', '1n8g', '1n4g', '1n1g']
        for d in run_case:
            speed_up = 1.0
            if result_dict[m]['1n1g']['average_speed']:
                result_dict[m][d]['average_speed'] = compute_average(speed_dict[m][d])
                result_dict[m][d]['median_speed'] = compute_median(speed_dict[m][d])
                speed_up = result_dict[m][d]['median_speed'] / compute_median(speed_dict[m]['1n1g'])
            result_dict[m][d]['speedup'] = round(speed_up, 2)


def compute_average(iter_dict):
    i = 0
    total_speed = 0
    for it in iter_dict:
        i += 1
        total_speed += iter_dict[it]
    return round(total_speed / i, 2)


def extract_result():
    result_dict = AutoVivification()
    speed_dict = AutoVivification()
    logs_list = glob.glob(os.path.join(args.log_dir, "*/*.log"))
    for log_file in logs_list:
        extract_info_from_file(log_file, result_dict, speed_dict)

    # compute speedup
    compute_speedup(result_dict, speed_dict)

    # print result
    pp.pprint(result_dict)

    # write to file as JSON format
    os.makedirs(args.output_dir, exist_ok=True)
    framework = args.log_dir.split('/')[-1]
    result_file_name = os.path.join(args.output_dir, framework + "_result.json")
    print("Saving result to {}".format(result_file_name))
    with open(result_file_name, 'w') as f:
        json.dump(result_dict, f)


if __name__ == "__main__":
    extract_result()
```
DeepSpeed/extract_deepspeed_logs_time.py  0 → 100644
```python
import os
import re
import sys
import glob
import json
import argparse
import pprint
import time
import datetime
import numpy as np

pp = pprint.PrettyPrinter(indent=1)
os.chdir(sys.path[0])

parser = argparse.ArgumentParser(description="flags for benchmark")
parser.add_argument("--log_dir", type=str, default="./logs/deepspeed/bert/bz32", required=True)
parser.add_argument("--output_dir", type=str, default="./result", required=False)
parser.add_argument('--warmup_batches', type=int, default=20)
parser.add_argument('--train_batches', type=int, default=120)
parser.add_argument('--batch_size_per_device', type=int, default=32)
args = parser.parse_args()


class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value


def extract_info_from_file(log_file, result_dict, speed_dict):
    # extract info from file name
    fname = os.path.basename(log_file)
    run_case = log_file.split("/")[-2]  # eg: 1n1g
    model = fname.split("_")[0]
    batch_size = int(fname.split("_")[1].strip("b"))
    precision = fname.split("_")[2]
    test_iter = int(fname.split("_")[3].strip(".log"))
    node_num = int(run_case[0])
    if len(run_case) == 4:
        card_num = int(run_case[-2])
    elif len(run_case) == 5:
        card_num = int(run_case[-3:-1])
    total_batch_size = node_num * card_num * batch_size

    tmp_dict = {
        'average_speed': 0,
        'batch_size_per_device': batch_size,
    }
    avg_speed = 0

    # extract info from file content, e.g. 2020-10-27 11:28:12,892
    pt = re.compile(r"(\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2},\d{1,3})", re.S)
    s1 = "[timer.py:157:stop] 0/" + str(args.warmup_batches)
    s2 = "[timer.py:157:stop] 0/" + str(args.train_batches)
    start_time = ''
    end_time = ''
    with open(log_file) as f:
        lines = f.readlines()
        for line in lines:
            if "SamplesPerSec" in line:
                if s1 in line:
                    start_time = re.findall(pt, line)[0]
                    continue
                if s2 in line:
                    end_time = re.findall(pt, line)[0]
                    t1 = datetime.datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S,%f")
                    t2 = datetime.datetime.strptime(end_time, "%Y-%m-%d %H:%M:%S,%f")
                    cost_time = (t2 - t1).total_seconds()
                    iter_num = args.train_batches - args.warmup_batches
                    avg_speed = round(float(total_batch_size) / (cost_time / iter_num), 2)
                    break

    # compute avg throughput
    tmp_dict['average_speed'] = avg_speed
    result_dict[model][run_case]['average_speed'] = avg_speed
    result_dict[model][run_case]['batch_size_per_device'] = tmp_dict['batch_size_per_device']
    speed_dict[model][run_case][test_iter] = avg_speed
    print(log_file, speed_dict[model][run_case])


def compute_speedup(result_dict, speed_dict):
    model_list = [key for key in result_dict]  # eg. ['vgg16', 'rn50']
    for m in model_list:
        run_case = [key for key in result_dict[m]]  # eg. ['4n8g', '2n8g', '1n8g', '1n4g', '1n1g']
        for d in run_case:
            speed_up = 1.0
            if result_dict[m]['1n1g']['average_speed']:
                result_dict[m][d]['average_speed'] = compute_average(speed_dict[m][d])
                result_dict[m][d]['median_speed'] = compute_median(speed_dict[m][d])
                speed_up = result_dict[m][d]['median_speed'] / compute_median(speed_dict[m]['1n1g'])
            result_dict[m][d]['speedup'] = round(speed_up, 2)


def compute_median(iter_dict):
    speed_list = [i for i in iter_dict.values()]
    return round(np.median(speed_list), 2)


def compute_average(iter_dict):
    i = 0
    total_speed = 0
    for it in iter_dict:
        i += 1
        total_speed += iter_dict[it]
    return round(total_speed / i, 4)


def extract_result():
    result_dict = AutoVivification()
    speed_dict = AutoVivification()
    logs_list = glob.glob(os.path.join(args.log_dir, "*/*.log"))
    for log_file in logs_list:
        extract_info_from_file(log_file, result_dict, speed_dict)

    # compute speedup
    compute_speedup(result_dict, speed_dict)

    # print result
    pp.pprint(result_dict)

    # write to file as JSON format
    os.makedirs(args.output_dir, exist_ok=True)
    framework = args.log_dir.split('/')[-1]
    result_file_name = os.path.join(args.output_dir, framework + "_result.json")
    print("Saving result to {}".format(result_file_name))
    with open(result_file_name, 'w') as f:
        json.dump(result_dict, f)


if __name__ == "__main__":
    # The iteration markers printed in the log files are integer multiples of 10
    assert args.warmup_batches % 10 == 0 and args.train_batches % 10 == 0
    extract_result()
```
DeepSpeed/scripts/multi_node_train.sh  0 → 100644
```shell
#!/bin/bash
OUTPUT_DIR=../output  # Where should we save checkpoints and tensorboard events?
rm -rf $OUTPUT_DIR
mkdir -p $OUTPUT_DIR

MODEL=${1:-"bert_base"}
BATCH_SIZE=${2:-32}
gpus=${3:-"0"}
nodes=${4:-$NODE1}
TEST_NUM=${5:-1}
DTYPE=${6:-"fp32"}

a=`expr ${#gpus} + 1`
num_gpus=`expr ${a} / 2`
num_nodes=$(echo $nodes | tr ',' '\n' | wc -l)
train_batch_size=`expr ${BATCH_SIZE} \* 1024`

LOG_FOLDER=../logs-${DTYPE}/deepspeed/bert/bz${BATCH_SIZE}/${num_nodes}n${num_gpus}g
mkdir -p $LOG_FOLDER
LOGFILE=${LOG_FOLDER}/bert_b${BATCH_SIZE}_${DTYPE}_$TEST_NUM.log

job_name=adam_nvidia_data_${MODEL}
config=${MODEL}.json
deepspeed_config=deepspeed_bsz64k_adam_config_seq128.json
# deepspeed_config=deepspeed_bsz4k_onebit_config_seq128.json

if [ ${DTYPE} == "fp16" ]; then
    enabled=true
else
    enabled=false
fi

sed -i "s/\"train_batch_size\":.*$/\"train_batch_size\": $train_batch_size,/" $deepspeed_config
sed -i "s/\"train_micro_batch_size_per_gpu\":.*$/\"train_micro_batch_size_per_gpu\": $BATCH_SIZE,/" $deepspeed_config
sed -i "s/\"enabled\":.*$/\"enabled\": $enabled,/" $deepspeed_config

DATA_PATH_PREFIX=/datasets/bert/deepspeed/data/test

if [ $num_nodes -ge 2 ]; then
    NCCL_TREE_THRESHOLD=0 deepspeed --hostfile=deepspeed_hosts \
        --num_nodes=$num_nodes \
        --num_gpus=$num_gpus deepspeed_train.py \
        --cf ${config} \
        --max_seq_length 128 \
        --output_dir $OUTPUT_DIR \
        --deepspeed \
        --print_steps 1 \
        --lr_schedule "EP" \
        --max_steps_per_epoch 120 \
        --lr_offset 10e-4 \
        --job_name ${job_name} \
        --deepspeed_config $deepspeed_config \
        --data_path_prefix ${DATA_PATH_PREFIX} \
        --use_nvidia_dataset 2>&1 | tee $LOGFILE
else
    NCCL_TREE_THRESHOLD=0 deepspeed \
        --num_nodes=$num_nodes \
        --num_gpus=$num_gpus deepspeed_train.py \
        --cf ${config} \
        --max_seq_length 128 \
        --output_dir $OUTPUT_DIR \
        --deepspeed \
        --print_steps 1 \
        --lr_schedule "EP" \
        --max_steps_per_epoch 120 \
        --lr_offset 10e-4 \
        --job_name ${job_name} \
        --deepspeed_config $deepspeed_config \
        --data_path_prefix ${DATA_PATH_PREFIX} \
        --use_nvidia_dataset 2>&1 | tee $LOGFILE
fi
```
DeepSpeed/scripts/run_multi_node.sh  0 → 100644
```shell
SHELL_FOLDER=$(dirname $(readlink -f "$0"))
BATCH_SIZE=${1:-32}
DTYPE=${2:-"fp32"}

NODE1='10.11.0.2'
NODE2='10.11.0.3'
NODE3='10.11.0.4'
NODE4='10.11.0.5'
nodes=$NODE1,$NODE2,$NODE3,$NODE4

i=1
while [ $i -le 5 ]
do
    bash $SHELL_FOLDER/multi_node_train.sh "bert_base" $BATCH_SIZE 0,1,2,3,4,5,6,7 $nodes $i $DTYPE
    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
    let i++
    sleep 20
done
```
DeepSpeed/scripts/run_single_node.sh  0 → 100644
```shell
SHELL_FOLDER=$(dirname $(readlink -f "$0"))
BATCH_SIZE=${1:-32}
DTYPE=${2:-"fp32"}

NODE1='10.11.0.2'
# NODE2='10.11.0.3'
# NODE3='10.11.0.4'
# NODE4='10.11.0.5'
nodes=$NODE1

i=1
while [ $i -le 5 ]
do
    bash $SHELL_FOLDER/multi_node_train.sh "bert_base" $BATCH_SIZE 0 $nodes $i $DTYPE
    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
    let i++
    sleep 20
done

i=1
while [ $i -le 5 ]
do
    bash $SHELL_FOLDER/multi_node_train.sh "bert_base" $BATCH_SIZE 0,1,2,3 $nodes $i $DTYPE
    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
    let i++
    sleep 20
done

i=1
while [ $i -le 5 ]
do
    bash $SHELL_FOLDER/multi_node_train.sh "bert_base" $BATCH_SIZE 0,1,2,3,4,5,6,7 $nodes $i $DTYPE
    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
    let i++
    sleep 20
done
```
DeepSpeed/scripts/run_two_node.sh  0 → 100644
```shell
SHELL_FOLDER=$(dirname $(readlink -f "$0"))
BATCH_SIZE=${1:-32}
DTYPE=${2:-"fp32"}

NODE1='10.11.0.2'
NODE2='10.11.0.3'
nodes=$NODE1,$NODE2

i=1
while [ $i -le 5 ]
do
    bash $SHELL_FOLDER/multi_node_train.sh "bert_base" $BATCH_SIZE 0,1,2,3,4,5,6,7 $nodes $i $DTYPE
    echo ">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case ${i}!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
    let i++
    sleep 20
done
```