Using the profiler during single-machine multi-GPU training causes the program to hang
Created by: liu-plus-wei
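- Version and environment: 1) commit id: e81b2c93 (develop) 2) Docker image: hub.baidubce.com/paddlepaddle/paddle:latest-dev
- Training info: 1) single machine, multiple GPUs
- Problem description: when running the image_classification model on multiple GPUs, adding the profiler causes the program to hang. (I am also not sure that a single-GPU run never hangs; for example, I don't know whether it would hang after running for a long time.)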
command:
CUDA_VISIBLE_DEVICES=1,2,3 GLOG_vmodule=allocator=3 python train.py --model=ResNet50 --pretrained_model=${path_to_pretrain_model} --batch_size=64 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output/ --with_mem_opt=True --lr_strategy=piecewise_decay --lr=0.1 --enable_ce True
train.py needs the following change so that ResNet50 can run on the flowers dataset:
@@ -128,9 +131,9 @@ def net_config(image, label, model, args):
model_name = args.model
if args.enable_ce:
- assert model_name == "SE_ResNeXt50_32x4d"
- model.params["dropout_seed"] = 100
- class_dim = 102
+ if model_name == "SE_ResNeXt50_32x4d":
+ model.params["dropout_seed"] = 100
+ class_dim = 102
Where the profiler was added:
@@ -280,6 +284,7 @@ def train(args):
test_info = [[], [], []]
train_time = []
batch_id = 0
+ #profiler.start_profiler("All")
try:
while True:
t1 = time.time()
@@ -302,6 +307,7 @@ def train(args):
batch_id += 1
except fluid.core.EOFException:
train_py_reader.reset()
+ #profiler.stop_profiler("total", "/tmp/profile_base")
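
To help narrow down where the hang occurs, one option is to profile only a small window of batches and to make sure the stop call always runs, even if the loop exits early. The following is a minimal sketch, not the actual train.py code: `train_loop`, `step_fn`, `profile_from` and `profile_batches` are hypothetical names introduced here for illustration, and the start/stop callbacks are stubs so that the snippet runs without Paddle. In the real script they would presumably wrap the `profiler.start_profiler("All")` and `profiler.stop_profiler("total", "/tmp/profile_base")` calls shown in the diff above.

```python
def train_loop(batches, step_fn, start_fn, stop_fn,
               profile_from=5, profile_batches=10):
    """Run step_fn on every batch, profiling only a bounded window.

    start_fn / stop_fn are placeholders (assumption) for e.g.
      lambda: profiler.start_profiler("All")
      lambda: profiler.stop_profiler("total", "/tmp/profile_base")
    """
    profiling = False
    done = 0
    try:
        for batch_id, batch in enumerate(batches):
            # Start profiling right before the first batch of the window.
            if batch_id == profile_from:
                start_fn()
                profiling = True
            step_fn(batch)
            done += 1
            # Stop right after the last batch of the window.
            if profiling and batch_id + 1 == profile_from + profile_batches:
                stop_fn()
                profiling = False
    finally:
        # If the loop ended or raised inside the window, still stop once.
        if profiling:
            stop_fn()
    return done


# Demo with stub callbacks that only record the order of events.
events = []
train_loop(
    range(8),
    step_fn=lambda b: events.append(("step", b)),
    start_fn=lambda: events.append("start"),
    stop_fn=lambda: events.append("stop"),
    profile_from=2,
    profile_batches=3,
)
print(events)
```

If the hang still reproduces with a short window like this, that would suggest it is tied to starting or stopping the profiler itself rather than to the amount of profile data collected over a long run; this is only a debugging aid, not a confirmed fix.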
I'm not very good with the editor, so the code is a bit messy, sorry :)