distributed training hung with ParallelExecutor
Created by: Yancey1989
Distributed training hung with ParallelExecutor:
2 trainers + 2 pservers running vgg16 on the flowers dataset. The trainers hang; these are the last log lines:
```
I0528 04:26:15.910966 13997 operator.cc:546] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0528 04:26:15.911159 14019 operator.cc:546] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0528 04:26:15.911401 14010 operator.cc:546] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0528 04:26:15.911592 14032 operator.cc:546] expected_kernel_key:data_type[float]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[CUDNN]
I0528 04:26:15.916076 14004 send_vars_op.cc:57] sending batch_norm_0.w_0@GRAD.trainer_1 to 10.1.14.3:18210
I0528 04:26:15.916328 14000 send_vars_op.cc:57] sending conv2d_0.w_0@GRAD.trainer_1 to 10.1.14.3:18210
I0528 04:26:15.916420 14016 send_vars_op.cc:57] sending conv2d_0.b_0@GRAD.trainer_1 to 10.1.14.3:18210
I0528 04:26:15.916592 14014 send_vars_op.cc:57] sending batch_norm_0.b_0@GRAD.trainer_1 to 10.1.14.3:18211
```
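For context, here is a minimal sketch of how a 2-trainer / 2-pserver Fluid job like this is typically wired up with the DistributeTranspiler. This is not the actual reproduction script: the environment variable names (`TRAINING_ROLE`, `PADDLE_TRAINER_ID`, `CURRENT_ENDPOINT`), the endpoint addresses, the SGD hyperparameters, and the `fc` stand-in for the vgg16 network are all assumptions for illustration.

```python
import os

import paddle.fluid as fluid

PSERVER_ENDPOINTS = "10.1.14.3:18210,10.1.14.3:18211"  # placeholder endpoints
TRAINERS = 2

role = os.getenv("TRAINING_ROLE", "TRAINER")           # placeholder env vars
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))

# Build the forward/backward graph. The fc layer is a stand-in for the real
# vgg16 network; the flowers dataset has 102 classes.
image = fluid.layers.data(name="image", shape=[3, 224, 224], dtype="float32")
label = fluid.layers.data(name="label", shape=[1], dtype="int64")
predict = fluid.layers.fc(input=image, size=102, act="softmax")
avg_cost = fluid.layers.mean(
    fluid.layers.cross_entropy(input=predict, label=label))
fluid.optimizer.SGD(learning_rate=0.001).minimize(avg_cost)

# Split the program into a pserver side and a trainer side.
t = fluid.DistributeTranspiler()
t.transpile(trainer_id, pservers=PSERVER_ENDPOINTS, trainers=TRAINERS)

if role == "PSERVER":
    endpoint = os.getenv("CURRENT_ENDPOINT", "10.1.14.3:18210")
    pserver_prog = t.get_pserver_program(endpoint)
    startup = t.get_startup_program(endpoint, pserver_prog)
    exe = fluid.Executor(fluid.CPUPlace())
    exe.run(startup)
    exe.run(pserver_prog)  # blocks, serving parameter updates
else:
    fluid.Executor(fluid.CUDAPlace(0)).run(fluid.default_startup_program())
    # ParallelExecutor drives the transpiled trainer program on the local
    # GPUs; this is the executor the hang was observed with.
    pe = fluid.ParallelExecutor(use_cuda=True,
                                loss_name=avg_cost.name,
                                main_program=t.get_trainer_program())
    # Feeding/reader wiring omitted; each step would call:
    # pe.run(fetch_list=[avg_cost.name], feed={...})
```

In this topology the script is launched four times: once per pserver endpoint with `TRAINING_ROLE=PSERVER`, and once per trainer with `PADDLE_TRAINER_ID` 0 and 1, matching the 2 trainers + 2 pservers above.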