Unverified commit 19e877ff, authored by Wu Yi, committed by GitHub

Merge pull request #11690 from typhoonzero/fix_trainer_nccl2_env

fix trainer nccl2 env
@@ -315,7 +315,7 @@ class Trainer(object):
             for ip in worker_ips.split(","):
                 worker_endpoints.append(':'.join([ip, port]))
             self.num_trainers = len(worker_endpoints)
-            current_endpoint = os.getenv("POD_IP") + ":" + port
+            current_endpoint = os.getenv("PADDLE_CURRENT_IP") + ":" + port
             worker_endpoints.remove(current_endpoint)
             # TODO(wuyi): use self.nccl_id_var, self.num_trainers and self.trainer_id
             # in ParallelExecutor to start
......
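For readers following the hunk above, here is a minimal standalone sketch of the endpoint resolution it touches. The helper name `resolve_endpoints`, its parameters, and the explicit error checks are assumptions added for illustration and are not part of the `Trainer` class. Only the `PADDLE_CURRENT_IP` variable and the join/remove logic come from the diff. The checks are there because `os.getenv` returns `None` for an unset variable, so the original concatenation would raise a `TypeError`, and `list.remove` raises `ValueError` when the current endpoint is not in the list.

```python
import os


def resolve_endpoints(worker_ips, port, environ=None):
    """Hypothetical helper mirroring the logic in the hunk above.

    worker_ips: comma-separated IP string; port: port as a string.
    environ: mapping to read PADDLE_CURRENT_IP from (defaults to os.environ).
    """
    environ = os.environ if environ is None else environ

    # Build "ip:port" for every trainer, as the diff does.
    worker_endpoints = [":".join([ip, port]) for ip in worker_ips.split(",")]
    num_trainers = len(worker_endpoints)

    # The diff reads PADDLE_CURRENT_IP (previously POD_IP).
    current_ip = environ.get("PADDLE_CURRENT_IP")
    if current_ip is None:
        # Assumption: fail loudly instead of hitting None + ":" (TypeError).
        raise RuntimeError("PADDLE_CURRENT_IP is not set")
    current_endpoint = current_ip + ":" + port

    if current_endpoint not in worker_endpoints:
        # Assumption: clearer message than list.remove's bare ValueError.
        raise ValueError(
            "current endpoint %s not in worker list" % current_endpoint)

    # Drop this trainer's own endpoint, leaving only its peers.
    peers = list(worker_endpoints)
    peers.remove(current_endpoint)
    return current_endpoint, peers, num_trainers


if __name__ == "__main__":
    env = {"PADDLE_CURRENT_IP": "10.0.0.2"}
    print(resolve_endpoints("10.0.0.1,10.0.0.2", "6174", env))
```

Passing the environment as a mapping keeps the sketch testable without mutating `os.environ`; the real code reads the process environment directly.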