Multi-node training in NCCL mode fails with Paddle fluid v1.5.0
Created by: D0m021ng
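I am running multi-node training in NCCL2 mode with Paddle fluid v1.5.0, using the nccl2 demo from Paddlecloud. Training succeeds with the python2 version of the demo but fails with the python3 version, even though the script has already been modified for Python 3 compatibility.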
-
Version and environment information:
1) PaddlePaddle version: Paddle fluid v1.5.0
2) Python version: python3.6.2
3) CUDA version: cuda10.0
4) NCCL2 version: nccl2.4.7
5) Environment variables already set: FLAGS_rpc_deadline=3000000, NCCL_IB_DISABLE=1
-
Error message:
('commit:', '401c03fc')
START_CMD is not empty: python3 trainer_py3.py
start python3 trainer_py3.py at Wed Jul 17 20:35:58 CST 2019
start cmd is: python3 trainer_py3.py
W0717 20:36:02.107486 3997 device_context.cc:259] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 10.0
W0717 20:36:02.111711 3997 device_context.cc:267] device: 0, cuDNN Version: 7.6.
E0717 20:36:02.127834213 4035 server_chttp2.cc:38] {"created":"@1563366962.127800942","description":"No address added out of total 1 resolved","file":"src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":305,"referenced_errors":[{"created":"@1563366962.127797707","description":"Unable to configure socket","fd":26,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":202,"referenced_errors":[{"created":"@1563366962.127794018","description":"OS Error","errno":99,"file":"src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":175,"os_error":"Cannot assign requested address","syscall":"bind"}]}]}
I0717 20:36:02.127897 4035 grpc_server.cc:435] Server listening on 172.16.14.189:30000 selected port: 0
/root/paddlejob/run.sh: line 268: 3997 Segmentation fault (core dumped) python3 trainer_py3.py
[/root/paddlejob/run.sh : 269] [start_trainer] [FATAL]: execute user cmd failed
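
The gRPC error in the log is errno 99 ("Cannot assign requested address") returned by `bind`, and the following line reports `selected port: 0` for 172.16.14.189:30000. Whether this is what leads to the segmentation fault is not established by the log. As a diagnostic aid, the sketch below tries to bind a plain TCP socket to a given address on the current node, which may help tell whether that address is actually assigned to a local interface. The helper name `can_bind` is hypothetical and is not part of Paddle or gRPC; the address and port are simply the ones that appear in the log.

```python
import errno
import socket


def can_bind(host, port):
    """Try to bind a TCP socket to host:port on this machine.

    Returns (True, None) on success, or (False, errno_value) on failure.
    Hypothetical helper for diagnosis only; not a Paddle or gRPC API.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Allow rebinding a port that is still in TIME_WAIT.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((host, port))
        return True, None
    except OSError as exc:
        return False, exc.errno
    finally:
        sock.close()


if __name__ == "__main__":
    # Endpoint taken from the log above; run this on the failing trainer node.
    ok, err = can_bind("172.16.14.189", 30000)
    if ok:
        print("bind succeeded")
    elif err == errno.EADDRNOTAVAIL:
        # On Linux this is errno 99, the same error gRPC reported.
        print("address is not assigned to any local interface")
    else:
        print("bind failed:", errno.errorcode.get(err, err))
```

If the check reports `EADDRNOTAVAIL` inside the job container, the endpoint passed to the trainer probably does not match an IP visible in that container, which would be worth comparing between the python2 and python3 runs.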