Operator "gen_nccl_id" has not been registered
Created by: JasonChang
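I have been testing for several days and keep hitting this problem without finding a fix. I am running DRCD on the CPU version; the ERNIE version is release/r2.1.0.
[Command] sh script/zh_task/ernie_base/run_drcd.sh
[run_drcd.sh] with use_cuda changed to false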
python3 ./finetune_launch.py \
    --nproc_per_node 1 \
    --selected_gpus 0,1,2,3,4,5,6,7 \
    --node_ips $(hostname -i) \
    --node_id 0 \
    run_mrc.py --use_cuda false \
    --batch_size 4 \
    --in_tokens false \
    --use_fast_executor true \
    --checkpoints ./checkpoints \
    --vocab_path ${MODEL_PATH}/vocab.txt \
    --ernie_config_path ${MODEL_PATH}/ernie_config.json \
    --do_train true \
    --do_val true \
    --do_test true \
    --verbose true \
    --save_steps 1000 \
    --validation_steps 100 \
    --warmup_proportion 0.0 \
    --weight_decay 0.01 \
    --epoch 2 \
    --max_seq_len 512 \
    --do_lower_case true \
    --doc_stride 128 \
    --train_set ${TASK_DATA_PATH}/drcd/train.json \
    --dev_set ${TASK_DATA_PATH}/drcd/dev.json \
    --test_set ${TASK_DATA_PATH}/drcd/test.json \
    --learning_rate 5e-5 \
    --num_iteration_per_drop_scope 1 \
    --init_pretraining_params ${MODEL_PATH}/params \
    --skip_steps 10
[LOG]
2020-01-01 13:17:42,817-INFO: worker_endpoints:['10.144.225.77:6170'] trainers_num:1 current_endpoint XXXXXXXX:6170 trainer_id:0
[INFO] 2020-01-01 13:17:42,817 [ run_mrc.py: 169]: worker_endpoints:['XXXXXXXX:6170'] trainers_num:1 current_endpoint:XXXXXXXX:6170 trainer_id:0
Traceback (most recent call last):
  File "run_mrc.py", line 365, in <module>
    main(args)
  File "run_mrc.py", line 180, in main
    startup_program=startup_prog)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/paddle/fluid/transpiler/distribute_transpiler.py", line 571, in transpile
    wait_port=self.config.wait_port)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/paddle/fluid/transpiler/distribute_transpiler.py", line 355, in _transpile_nccl2
    self.config.hierarchical_allreduce_inter_nranks
  File "/home/ubuntu/.local/lib/python3.6/site-packages/paddle/fluid/framework.py", line 2488, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/ubuntu/.local/lib/python3.6/site-packages/paddle/fluid/framework.py", line 1788, in __init__
    proto = OpProtoHolder.instance().get_op_proto(type)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/paddle/fluid/framework.py", line 1670, in get_op_proto
    raise ValueError("Operator \"%s\" has not been registered." % type)
ValueError: Operator "gen_nccl_id" has not been registered.
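
The traceback goes through _transpile_nccl2 even though use_cuda is false, and the error is raised by OpProtoHolder.instance().get_op_proto. To narrow this down, a small check like the one below might show whether the installed Paddle build registers the op at all. It is only a sketch: the helper name op_is_registered is hypothetical, it relies solely on the get_op_proto call and the ValueError visible in the traceback, and whether a CPU-only build is expected to lack "gen_nccl_id" is an assumption, not something confirmed here.

```python
def op_is_registered(op_type, holder):
    """Return True if holder.get_op_proto(op_type) succeeds.

    `holder` is any object exposing get_op_proto(name) that raises
    ValueError for unknown operators, as framework.py does in the
    traceback. This helper is hypothetical, not part of Paddle.
    """
    try:
        holder.get_op_proto(op_type)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    try:
        # Import path taken from the traceback (paddle/fluid/framework.py).
        from paddle.fluid.framework import OpProtoHolder
    except ImportError:
        print("paddle is not installed in this environment")
    else:
        registered = op_is_registered("gen_nccl_id", OpProtoHolder.instance())
        print("gen_nccl_id registered:", registered)
```

If this reports that the op is not registered, the failure would come from the installed build rather than from the data or the model paths, and the NCCL2 transpile step around run_mrc.py line 180 would be the place to look.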