Skip to content
体验新版
项目
组织
正在加载...
登录
切换导航
打开侧边栏
Oneflow-Inc
DLPerf
提交
802e4529
D
DLPerf
项目概览
Oneflow-Inc
/
DLPerf
上一次同步 2 年多
通知
4
Star
152
Fork
0
代码
文件
提交
分支
Tags
贡献者
分支图
Diff
Issue
0
列表
看板
标记
里程碑
合并请求
0
DevOps
流水线
流水线任务
计划
Wiki
0
Wiki
分析
仓库
DevOps
项目成员
Pages
D
DLPerf
项目概览
项目概览
详情
发布
仓库
仓库
文件
提交
分支
标签
贡献者
分支图
比较
Issue
0
Issue
0
列表
看板
标记
里程碑
合并请求
0
合并请求
0
Pages
DevOps
DevOps
流水线
流水线任务
计划
分析
分析
仓库分析
DevOps
Wiki
0
Wiki
成员
成员
收起侧边栏
关闭侧边栏
动态
分支图
创建新Issue
流水线任务
提交
Issue看板
前往新版Gitcode,体验更适合开发者的 AI 搜索 >>
提交
802e4529
编写于
10月 28, 2020
作者:
F
Flowingsun007
浏览文件
操作
浏览文件
下载
电子邮件补丁
差异文件
add scripts for testing bert
上级
9a923cca
变更
4
隐藏空白更改
内联
并排
Showing
4 changed file
with
145 addition
and
0 deletion
+145
-0
DeepSpeed/bert/scripts/multi_node_train.sh
DeepSpeed/bert/scripts/multi_node_train.sh
+72
-0
DeepSpeed/bert/scripts/run_multi_node.sh
DeepSpeed/bert/scripts/run_multi_node.sh
+20
-0
DeepSpeed/bert/scripts/run_single_node.sh
DeepSpeed/bert/scripts/run_single_node.sh
+37
-0
DeepSpeed/bert/scripts/run_two_node.sh
DeepSpeed/bert/scripts/run_two_node.sh
+16
-0
未找到文件。
DeepSpeed/bert/scripts/multi_node_train.sh
0 → 100644
浏览文件 @
802e4529
#!/bin/bash
OUTPUT_DIR
=
../output
# Where should we save checkpoints and tensorboard events?
rm
-rf
$OUTPUT_DIR
mkdir
-p
$OUTPUT_DIR
MODEL
=
${
1
:-
"bert_base"
}
BATCH_SIZE
=
${
2
:-
32
}
gpus
=
${
3
:-
"0"
}
nodes
=
${
4
:-
$NODE1
}
TEST_NUM
=
${
5
:-
1
}
DTYPE
=
${
6
:-
"fp32"
}
a
=
`
expr
${#
gpus
}
+ 1
`
num_gpus
=
`
expr
${
a
}
/ 2
`
num_nodes
=
$(
echo
$nodes
|
tr
','
'\n'
|
wc
-l
)
train_batch_size
=
`
expr
${
BATCH_SIZE
}
\*
1024
`
LOG_FOLDER
=
../logs-
${
DTYPE
}
/deepspeed/bert/bz
${
BATCH_SIZE
}
/
${
num_nodes
}
n
${
num_gpus
}
g
mkdir
-p
$LOG_FOLDER
LOGFILE
=
${
LOG_FOLDER
}
/bert_b
${
BATCH_SIZE
}
_
${
DTYPE
}
_
$TEST_NUM
.log
job_name
=
adam_nvidia_data_
${
MODEL
}
config
=
${
MODEL
}
.json
deepspeed_config
=
deepspeed_bsz64k_adam_config_seq128.json
# deepspeed_config=deepspeed_bsz4k_onebit_config_seq128.json
if
[
${
DTYPE
}
==
"fp16"
]
;
then
enabled
=
true
else
enabled
=
false
fi
sed
-i
"s/
\"
train_batch_size
\"
:.*
$/
\"
train_batch_size
\"
:
$train_batch_size
,/"
$deepspeed_config
sed
-i
"s/
\"
train_micro_batch_size_per_gpu
\"
:.*
$/
\"
train_micro_batch_size_per_gpu
\"
:
$BATCH_SIZE
,/"
$deepspeed_config
sed
-i
"s/
\"
enabled
\"
:.*
$/
\"
enabled
\"
:
$enabled
,/"
$deepspeed_config
DATA_PATH_PREFIX
=
/datasets/bert/deepspeed/data/test
if
[
$num_nodes
-ge
2
]
;
then
NCCL_TREE_THRESHOLD
=
0 deepspeed
--hostfile
=
deepspeed_hosts
\
--num_nodes
=
$num_nodes
\
--num_gpus
=
$num_gpus
deepspeed_train.py
\
--cf
${
config
}
\
--max_seq_length
128
\
--output_dir
$OUTPUT_DIR
\
--deepspeed
\
--print_steps
1
\
--lr_schedule
"EP"
\
--max_steps_per_epoch
120
\
--lr_offset
10e-4
\
--job_name
${
job_name
}
\
--deepspeed_config
$deepspeed_config
\
--data_path_prefix
${
DATA_PATH_PREFIX
}
\
--use_nvidia_dataset
2>&1 |
tee
$LOGFILE
else
NCCL_TREE_THRESHOLD
=
0 deepspeed
\
--num_nodes
=
$num_nodes
\
--num_gpus
=
$num_gpus
deepspeed_train.py
\
--cf
${
config
}
\
--max_seq_length
128
\
--output_dir
$OUTPUT_DIR
\
--deepspeed
\
--print_steps
1
\
--lr_schedule
"EP"
\
--max_steps_per_epoch
120
\
--lr_offset
10e-4
\
--job_name
${
job_name
}
\
--deepspeed_config
$deepspeed_config
\
--data_path_prefix
${
DATA_PATH_PREFIX
}
\
--use_nvidia_dataset
2>&1 |
tee
$LOGFILE
fi
DeepSpeed/bert/scripts/run_multi_node.sh
0 → 100644
浏览文件 @
802e4529
SHELL_FOLDER
=
$(
dirname
$(
readlink
-f
"
$0
"
))
BATCH_SIZE
=
${
1
:-
32
}
DTYPE
=
${
2
:-
"fp32"
}
NODE1
=
'10.11.0.2'
NODE2
=
'10.11.0.3'
NODE3
=
'10.11.0.4'
NODE4
=
'10.11.0.5'
nodes
=
$NODE1
,
$NODE2
,
$NODE3
,
$NODE4
nodes
=
$NODE1
,
$NODE2
,
$NODE3
,
$NODE4
i
=
1
while
[
$i
-le
5
]
do
bash
$SHELL_FOLDER
/multi_node_train.sh
"bert_base"
$BATCH_SIZE
0,1,2,3,4,5,6,7
$nodes
$i
$DTYPE
echo
">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case
${
i
}
!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
let
i++
sleep
20
done
\ No newline at end of file
DeepSpeed/bert/scripts/run_single_node.sh
0 → 100644
浏览文件 @
802e4529
SHELL_FOLDER
=
$(
dirname
$(
readlink
-f
"
$0
"
))
BATCH_SIZE
=
${
1
:-
32
}
DTYPE
=
${
2
:-
"fp32"
}
NODE1
=
'10.11.0.2'
# NODE2='10.11.0.3'
# NODE3='10.11.0.4'
# NODE4='10.11.0.5'
nodes
=
$NODE1
i
=
1
while
[
$i
-le
5
]
do
bash
$SHELL_FOLDER
/multi_node_train.sh
"bert_base"
$BATCH_SIZE
0
$nodes
$i
$DTYPE
echo
">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case
${
i
}
!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
let
i++
sleep
20
done
i
=
1
while
[
$i
-le
5
]
do
bash
$SHELL_FOLDER
/multi_node_train.sh
"bert_base"
$BATCH_SIZE
0,1,2,3
$nodes
$i
$DTYPE
echo
">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case
${
i
}
!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
let
i++
sleep
20
done
i
=
1
while
[
$i
-le
5
]
do
bash
$SHELL_FOLDER
/multi_node_train.sh
"bert_base"
$BATCH_SIZE
0,1,2,3,4,5,6,7
$nodes
$i
$DTYPE
echo
">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case
${
i
}
!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
let
i++
sleep
20
done
\ No newline at end of file
DeepSpeed/bert/scripts/run_two_node.sh
0 → 100644
浏览文件 @
802e4529
SHELL_FOLDER
=
$(
dirname
$(
readlink
-f
"
$0
"
))
BATCH_SIZE
=
${
1
:-
32
}
DTYPE
=
${
2
:-
"fp32"
}
NODE1
=
'10.11.0.2'
NODE2
=
'10.11.0.3'
nodes
=
$NODE1
,
$NODE2
i
=
1
while
[
$i
-le
5
]
do
bash
$SHELL_FOLDER
/multi_node_train.sh
"bert_base"
$BATCH_SIZE
0,1,2,3,4,5,6,7
$nodes
$i
$DTYPE
echo
">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Finished Test Case
${
i
}
!<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<"
let
i++
sleep
20
done
\ No newline at end of file
编辑
预览
Markdown
is supported
0%
请重试
或
添加新附件
.
添加附件
取消
You are about to add
0
people
to the discussion. Proceed with caution.
先完成此消息的编辑!
取消
想要评论请
注册
或
登录