Need to check pserver instance in Fluid distributed training
Created by: Yancey1989
The training job keeps training even when no pserver instance is running; the trainer should check that the pservers are alive and fail fast instead of silently continuing.
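One way to implement the check suggested in the title is to probe each pserver endpoint before the trainer starts. Below is a minimal sketch, assuming the endpoint list comes from the `PSERVERS` environment variable shown in the startup logs; `check_pservers` is a hypothetical helper, not an existing Fluid API:

```python
import os
import socket


def check_pservers(timeout=3):
    """Fail fast if any pserver endpoint is unreachable."""
    endpoints = os.environ.get("PSERVERS", "").split(",")
    for ep in endpoints:
        host, port = ep.rsplit(":", 1)
        try:
            # A plain TCP probe; a real check would presumably go
            # through the same RPC channel the send op uses.
            sock = socket.create_connection((host, int(port)), timeout)
            sock.close()
        except socket.error:
            raise RuntimeError("pserver %s is not reachable" % ep)
```

With no pserver listening on 127.0.0.1:7164, this raises immediately instead of letting the trainer run.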
Startup logs:

```
+ export PSERVERS=127.0.0.1:7164
+ PSERVERS=127.0.0.1:7164
+ export SERVER_ENDPOINT=127.0.0.1:7164
+ SERVER_ENDPOINT=127.0.0.1:7164
+ export TRAINING_ROLE=TRAINER
+ TRAINING_ROLE=TRAINER
+ [[ ON == \O\N ]]
+ GLOG_v=3
+ GLOG_logtostderr=1
+ python test_dist_word2vec.py
```
Debug logs:

```
I0115 06:16:41.042091 2938 executor.cc:118] Op(sgd), inputs:{Grad[fc_1.w_0@GRAD[256, 2073]({})], LearningRate[learning_rate_4[1]({})], Param[fc_1.w_0[256, 2073]({})]}, outputs:{ParamOut[fc_1.w_0[256, 2073]({})]}.
I0115 06:16:41.042105 2938 operator.cc:489] expected_kernel_key:data_type[5]:data_layout[ANY_LAYOUT]:place[CUDAPlace(0)]:library_type[PLAIN]
I0115 06:16:41.042307 2938 executor.cc:118] Op(send), inputs:{X[shared_w@GRAD[2073, 32]({{}}), fc_1.b_0@GRAD[2073]({}), fc_0.b_0@GRAD[256]({}), fc_0.w_0@GRAD[128, 256]({}), fc_1.w_0@GRAD[256, 2073]({})]}, outputs:{Out[shared_w[2073, 32]({}), fc_1.b_0[2073]({}), fc_0.b_0[256]({}), fc_0.w_0[128, 256]({}), fc_1.w_0[256, 2073]({})]}.
I0115 06:16:41.053037 2938 executor.cc:118] Op(fetch), inputs:{X[mean_0.tmp_0[1]({})]}, outputs:{Out[fetch[-1]({{}})]}.
I0115 06:16:41.053076 2938 tensor_util.h:36] Copy 1 from CUDAPlace(0) to CPUPlace
I0115 06:16:41.053151 2938 fetch_op.cc:62] Fetch variable mean_0.tmp_0 to fetch
I0115 06:16:41.053221 2938 feed_fetch_method.h:51] Fetch fetch with index 0 shape= 1
```
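Note that the send op at 06:16:41.042307 completes and the subsequent fetch of mean_0.tmp_0 succeeds, even though nothing is listening on 127.0.0.1:7164. For illustration, here is a sketch of where the startup check above could sit in a trainer script like test_dist_word2vec.py; the import path and the reader/feeder/cost names are assumptions, not the actual script:

```python
import paddle.fluid as fluid  # import path is an assumption

place = fluid.CUDAPlace(0)  # the debug logs show kernels on CUDAPlace(0)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

# Fail fast instead of training against a missing pserver.
check_pservers()

for data in train_reader():  # train_reader/feeder/avg_cost are placeholders
    loss, = exe.run(fluid.default_main_program(),
                    feed=feeder.feed(data),
                    fetch_list=[avg_cost])
```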