fluid hangs when starting the pserver
Created by: flyhighzy
Environment: self-hosted k8s, using the paddle image 1.0.1-gpu-cuda8.0-cudnn7.
Code that starts the pserver:
if training_role == "PSERVER":
    place = fluid.CPUPlace()
    exe = fluid.Executor(place)

if is_local:
    train_loop(fluid.default_main_program())
else:
    eplist = []
    for ip in pserver_ips.split(","):
        eplist.append(':'.join([ip, port]))
    pserver_endpoints = ",".join(eplist)  # ip:port,ip:port...
    print(pserver_endpoints)
    print(trainers)
    print(trainer_id)
    t = fluid.DistributeTranspiler()
    t.transpile(
        trainer_id,
        pservers=pserver_endpoints,
        trainers=trainers)
    if training_role == "PSERVER":
        print("pserver")
        print("current_endpoint: %s" % current_endpoint)
        pserver_prog = t.get_pserver_program(current_endpoint)
        pserver_startup = t.get_startup_program(current_endpoint,
                                                pserver_prog)
        print("start pserver...")
        exe.run(pserver_startup)
        print("pserver start up ok: %s" % pserver_startup)
        exe.run(pserver_prog)
        print("pserver ended")
    elif training_role == "TRAINER":
        print("trainer")
        train_loop(t.get_trainer_program())
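In the pserver log, nothing more is printed after "start pserver...". Meanwhile the trainer keeps waiting for the pserver to come up, repeatedly printing "pserver not ready, wait 3 sec to retry...", and after a while the trainer fails and exits as well. It looks like the pserver hangs inside exe.run(pserver_startup) during startup. What could cause this?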
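One check that might help narrow this down is whether the pserver ports accept connections at all when probed from the trainer pod. Below is a minimal sketch, assuming that plain TCP reachability is a fair proxy for "the pserver is listening". The helpers `endpoint_reachable` and `check_endpoints` are hypothetical names written for this post and are not part of Paddle; they only use the Python standard library.

```python
import socket


def endpoint_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to an "ip:port" string succeeds.

    Hypothetical helper: it only tests that something accepts connections
    on the port, not that it is a working pserver.
    """
    try:
        host, port = endpoint.rsplit(":", 1)
        conn = socket.create_connection((host, int(port)), timeout=timeout)
    except (socket.error, ValueError, OverflowError):
        # Refused, timed out, unresolvable host, or malformed endpoint.
        return False
    conn.close()
    return True


def check_endpoints(pserver_endpoints, timeout=3.0):
    """Map each endpoint of an "ip:port,ip:port" string to its reachability."""
    return {ep: endpoint_reachable(ep, timeout)
            for ep in pserver_endpoints.split(",") if ep}
```

Calling `check_endpoints(pserver_endpoints)` with the same string printed by the script above gives one True/False per pserver. If a port is refused while the pserver process is still alive, that would be consistent with the pserver never getting past exe.run(pserver_startup); if the port times out instead, a k8s networking or pod IP problem seems worth ruling out first.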