Multi-machine distributed run fails with ValueError: Operator "send" has not been registered.
Created by: jerryniu520126
3) System environment: Windows 10
- Training info: 1) multiple machines, single card
- Reproduction info: `#!/bin/bash
# start pserver0
python train.py \
    --train_data_path train.txt \
    --is_local 0 \
    --role pserver \
    --endpoints 127.0.0.1:6000,127.0.0.1:6001 \
    --current_endpoint 127.0.0.1:6000 \
    --trainers 2 \
    > pserver0.log 2>&1 &
# start pserver1
python train.py \
    --train_data_path train.txt \
    --is_local 0 \
    --role pserver \
    --endpoints 127.0.0.1:6000,127.0.0.1:6001 \
    --current_endpoint 127.0.0.1:6001 \
    --trainers 2 \
    > pserver1.log 2>&1 &
# start trainer0
python train.py \
    --train_data_path train.txt \
    --is_local 0 \
    --role trainer \
    --endpoints 127.0.0.1:6000,127.0.0.1:6001 \
    --trainers 2 \
    --trainer_id 0 \
    > trainer0.log 2>&1 &
# start trainer1
python train.py \
    --train_data_path train.txt \
    --is_local 0 \
    --role trainer \
    --endpoints 127.0.0.1:6000,127.0.0.1:6001 \
    --trainers 2 \
    --trainer_id 1 \
    > trainer1.log 2>&1 &
`
After running this script, the following error appears:
`
2020-02-06 20:36:59,184-WARNING: paddle.fluid.layers.create_py_reader_by_data() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-02-06 20:36:59,418-INFO: run dist training
Traceback (most recent call last):
File "train.py", line 262, in <module>
train()
File "train.py", line 236, in train
t.transpile(args.trainer_id, pservers=args.endpoints, trainers=args.trainers)
File "D:\Anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\transpiler\distribute_transpiler.py", line 699, in transpile
splited_grad_varname
File "D:\Anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\framework.py", line 2506, in _insert_op
op = Operator(block=self, desc=op_desc, *args, **kwargs)
File "D:\Anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\framework.py", line 1788, in __init__
proto = OpProtoHolder.instance().get_op_proto(type)
File "D:\Anaconda3\envs\paddle_env\lib\site-packages\paddle\fluid\framework.py", line 1670, in get_op_proto
raise ValueError("Operator \"%s\" has not been registered." % type)
ValueError: Operator "send" has not been registered.
`
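
For anyone triaging this, here is a minimal sketch for checking whether the installed build registers the `send` operator before `transpile` is called. It relies only on what the traceback shows: `OpProtoHolder.instance().get_op_proto(type)` in `paddle.fluid.framework` raises `ValueError` for an unregistered operator. The helper name `is_op_registered` is hypothetical and not part of Paddle; the lookup is passed in as an argument so the helper can be tried without Paddle installed.

```python
def is_op_registered(op_type, get_op_proto):
    """Return True if get_op_proto(op_type) succeeds, False if it raises ValueError.

    get_op_proto is any callable with the behaviour seen in the traceback:
    it raises ValueError when the operator type is unknown.
    """
    try:
        get_op_proto(op_type)
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    try:
        # Same lookup path the traceback goes through in paddle/fluid/framework.py.
        from paddle.fluid.framework import OpProtoHolder
    except ImportError:
        print("paddle is not installed in this environment")
    else:
        lookup = OpProtoHolder.instance().get_op_proto
        print("send registered:", is_op_registered("send", lookup))
```

If this prints `send registered: False`, the failure comes from the installed package itself rather than from the arguments passed to `train.py`.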