Error during multi-machine, multi-GPU training
Created by: gobigrassland
File "train.py", line 55, in <module>
main()
File "train.py", line 51, in main
ins.train()
File "/export/data/PLSC-master/plsc/entry.py", line 927, in train
exe.run(self.startup_program)
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 783, in run
six.reraise(*sys.exc_info())
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 778, in run
use_program_cache=use_program_cache)
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 831, in _run_impl
use_program_cache=use_program_cache)
File "/usr/local/lib/python2.7/dist-packages/paddle/fluid/executor.py", line 905, in _run_program
fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet:
--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
2 paddle::platform::NCCLCommContext::CreateNCCLComm(ncclUniqueId*, int, int, int, int)
3 paddle::operators::CCommInitOp::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
4 paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
5 paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
6 paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)
----------------------
Error Message Summary:
----------------------
Error: An error occurred here. There is no accurate error hint for this error yet. We are continuously in the process of increasing hint for this kind of error check. It would be helpful if you could inform us of how this conversion went by opening a github issue. And we will resolve it with high priority.
- New issue link: https://github.com/PaddlePaddle/Paddle/issues/new
- Recommended issue content: all error stack information
[unhandled system error] at (/paddle/paddle/fluid/platform/collective_helper.cc:67)
[operator < c_comm_init > error]
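I am running multi-machine distributed training on machines in our data center and hit the error above. I don't know what is causing it. The message suggests that NCCL initialization failed (the failing operator is c_comm_init), but I can't tell what the underlying cause is. Multi-machine, multi-GPU training with other frameworks such as TensorFlow works fine. Any suggestions would be appreciated.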
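Since "unhandled system error" gives little to go on, one way to narrow it down might be to turn on NCCL's own logging and check that the nodes can reach each other over TCP. The sketch below is only an assumption about how one could do that, not anything from PLSC or Paddle: `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are standard NCCL environment variables, while the helper names, the interface name `eth0`, and the host and port in the usage note are hypothetical placeholders.

```python
import os
import socket


def nccl_debug_env(iface=None, base_env=None):
    """Return a copy of the environment with NCCL diagnostics enabled.

    iface is a hypothetical network interface name (e.g. "eth0"); when given,
    NCCL is asked to use only that interface for its socket communication.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # make NCCL print its initialization log
    if iface:
        env["NCCL_SOCKET_IFNAME"] = iface
    return env


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    sock.close()
    return True
```

For example, the dictionary returned by `nccl_debug_env("eth0")` could be passed as `env=` to `subprocess.call` when launching `train.py` on each node, and `can_connect("10.0.0.2", 6170)` (a made-up endpoint) could be run from one node while the trainer on the other node is listening. If the NCCL log shows it picking an interface that the other machines cannot route to, that would be a plausible cause of the error.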