Single-machine multi-GPU pretraining of ERNIE 1.0 fails with SendRPC name:[NCCLID], ep:[10.12.35.45:6171], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
Created by: zuole
-
Version and environment information: 1) PaddlePaddle version: 1.5.1 3) GPU: Tesla V100 4) System environment: Linux, Python 2.7
-
Training information: 1) Single machine, multiple GPUs 2) GPU memory: 16160MiB 3) Operator information: Linux
-
Problem description: please describe your problem in detail, and include the error message, logs, and a reproducible code snippet.
[INFO] 2019-10-18 19:25:08,789 [ train.py: 210]: Device count 2
[INFO] 2019-10-18 19:25:08,789 [ train.py: 211]: theoretical memory usage:
[INFO] 2019-10-18 19:25:08,833 [ train.py: 213]: (9526.169296073915, 9979.796405410767, 'MB')
[INFO] 2019-10-18 19:25:08,833 [ train.py: 217]: args.is_distributed: True
[INFO] 2019-10-18 19:25:08,834 [ train.py: 225]: train_id == 0, sleep 60s
[INFO] 2019-10-18 19:26:08,894 [ train.py: 229]: worker_endpoints:[u'10.12.35.45:6170', u'10.12.35.45:6171'] trainers_num:2 current_endpoint:10.12.35.45:6170 trainer_id:0
W1018 19:26:10.103168 119202 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 9.2, Runtime API Version: 9.0
W1018 19:26:10.106727 119202 device_context.cc:267] device: 0, cuDNN Version: 7.0.
W1018 19:26:10.106756 119202 device_context.cc:293] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 7.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
I1018 19:26:10.428246 119202 rpc_client.h:101] init rpc client with trainer_id 0
F1018 19:29:10.429860 119306 grpc_client.cc:414] SendRPC name:[NCCLID], ep:[10.12.35.45:6171], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
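The fatal line in the log above appears three minutes after the RPC client is initialized, while trainer 0 is sending the NCCLID to 10.12.35.45:6171. One thing that may be worth checking is whether anything is listening on that endpoint at that moment. The snippet below is only a minimal sketch for that check, not part of train.py or of PaddlePaddle: the helper names `parse_endpoint` and `endpoint_reachable` are hypothetical, and it uses only the Python standard library.

```python
import socket


def parse_endpoint(endpoint):
    """Split an 'ip:port' string (as printed in worker_endpoints) into (host, port)."""
    host, port = endpoint.rsplit(":", 1)
    return host, int(port)


def endpoint_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to 'ip:port' succeeds within `timeout` seconds.

    Hypothetical diagnostic helper: it only tells whether some process is
    accepting connections on the port, not whether that process is the
    expected trainer.
    """
    host, port = parse_endpoint(endpoint)
    try:
        sock = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        # Connection refused, unreachable host, or timed out.
        return False
    sock.close()
    return True
```

For example, calling `endpoint_reachable("10.12.35.45:6171")` on the training machine while trainer 0 is waiting would show whether the second trainer process has opened its port. A `False` result would suggest that the second trainer was not started or exited early, but this is only a hint and does not by itself identify the cause.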