Using the profiler during single-machine multi-GPU training causes the program to hang (#15387) · Issue · PaddlePaddle / Paddle

Created by: liu-plus-wei

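Version and environment:
1) commit id: e81b2c93 (develop)
2) Runtime environment (docker): hub.baidubce.com/paddlepaddle/paddle:latest-dev

Training setup:
1) Single machine, multiple GPUs

Problem description: When running the image_classification model on multiple GPUs, enabling the profiler causes the program to hang. (It has not been confirmed that single-GPU runs never hang; for example, whether a long-running single-GPU job would eventually hang is unknown.)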

Command:

    CUDA_VISIBLE_DEVICES=1,2,3 GLOG_vmodule=allocator=3 python train.py \
        --model=ResNet50 \
        --pretrained_model=${path_to_pretrain_model} \
        --batch_size=64 \
        --total_images=1281167 \
        --class_dim=1000 \
        --image_shape=3,224,224 \
        --model_save_dir=output/ \
        --with_mem_opt=True \
        --lr_strategy=piecewise_decay \
        --lr=0.1 \
        --enable_ce True

train.py needs the following change to enable ResNet50 to run on the flower data:

    @@ -128,9 +131,9 @@ def net_config(image, label, model, args):
         model_name = args.model
         if args.enable_ce:
    -        assert model_name == "SE_ResNeXt50_32x4d"
    -        model.params["dropout_seed"] = 100
    -        class_dim = 102
    +        if model_name == "SE_ResNeXt50_32x4d":
    +            model.params["dropout_seed"] = 100
    +        class_dim = 102

Where the profiler calls are added:

    @@ -280,6 +284,7 @@ def train(args):
             test_info = [[], [], []]
             train_time = []
             batch_id = 0
    +        #profiler.start_profiler("All")
             try:
                 while True:
                     t1 = time.time()
    @@ -302,6 +307,7 @@ def train(args):
                     batch_id += 1
             except fluid.core.EOFException:
                 train_py_reader.reset()
    +        #profiler.stop_profiler("total", "/tmp/profile_base")
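For narrowing down where the hang occurs, it may help to profile only a bounded window of batches and to guarantee that the stop call runs exactly once even if the loop exits through an exception. The following is a minimal sketch under stated assumptions, not the code from train.py: `profiled` and `train_with_profile_window` are hypothetical helpers, and the start/stop callables are stand-ins so the example runs without Paddle. In train.py they would presumably wrap the `profiler.start_profiler("All")` and `profiler.stop_profiler("total", "/tmp/profile_base")` calls shown in the diff.

```python
import itertools
from contextlib import contextmanager


@contextmanager
def profiled(start, stop):
    """Run start() on entry and stop() on exit, even if the body raises."""
    start()
    try:
        yield
    finally:
        stop()


def train_with_profile_window(batches, step, start, stop, window=10):
    """Profile only the first `window` batches; run the rest unprofiled.

    Returns the total number of batches processed.
    """
    it = iter(batches)
    done = 0
    with profiled(start, stop):
        for batch in itertools.islice(it, window):
            step(batch)
            done += 1
    # The remaining batches run after the profiler has been stopped.
    for batch in it:
        step(batch)
        done += 1
    return done


# Stand-ins (assumptions) that only record the order of events.
# With Paddle, start/stop would be, for example:
#   start = lambda: profiler.start_profiler("All")
#   stop = lambda: profiler.stop_profiler("total", "/tmp/profile_base")
events = []
total = train_with_profile_window(
    batches=range(25),
    step=lambda b: events.append(("step", b)),
    start=lambda: events.append(("start",)),
    stop=lambda: events.append(("stop",)),
    window=10,
)
```

If the run completes with a small window but hangs with a larger one, that would suggest the hang depends on how long the profiler stays active rather than on starting or stopping it.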

I'm not very good with the editor, so the code is a bit messy, sorry :)
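To confirm that the process is really stuck, and to see which Python frame it is stuck in, one option is a watchdog built on the standard library's `faulthandler` module (Python 3.3+). This is a sketch under assumptions, not something taken from train.py: `arm_hang_watchdog` and `disarm_hang_watchdog` are hypothetical helper names, and the short timeout and `time.sleep` below only simulate a stalled batch. The watchdog can only show the Python-level stack, so a hang inside native GPU code would appear as the Python call that entered it.

```python
import faulthandler
import tempfile
import time


def arm_hang_watchdog(timeout_s, stream):
    """Dump all thread stacks to `stream` if not re-armed within timeout_s.

    Calling this again (for example once per batch) resets the timer.
    `stream` must be a real file with a file descriptor.
    """
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=stream)


def disarm_hang_watchdog():
    """Cancel the pending stack dump, e.g. after training finishes."""
    faulthandler.cancel_dump_traceback_later()


# Demo: a temporary file stands in for a log file; sleeping longer than
# the timeout simulates a batch that never returns.
log = tempfile.TemporaryFile()
arm_hang_watchdog(timeout_s=0.2, stream=log)
time.sleep(0.6)  # simulated stall, longer than the timeout
disarm_hang_watchdog()

log.seek(0)
dump = log.read().decode("utf-8", errors="replace")
log.close()
```

In a training loop, the watchdog would be re-armed at the top of each batch with a timeout comfortably larger than a normal batch time, so that a dump is written only when a batch fails to complete.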
