Training error: an illegal memory access
Created by: xiequnyi
The error message is as follows:

W0524 06:06:37.243711 1423 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.0, Runtime API Version: 9.0
W0524 06:06:37.243786 1423 device_context.cc:271] device: 0, cuDNN Version: 7.0.
Traceback (most recent call last):
  File "train.py", line 231, in <module>
    main()
  File "train.py", line 228, in main
    train_net(args)
  File "train.py", line 100, in train_net
    exe.run(fluid.default_startup_program())
  File "/usr/local/lib/python2.7/site-packages/paddle/fluid/executor.py", line 525, in run
    use_program_cache=use_program_cache)
  File "/usr/local/lib/python2.7/site-packages/paddle/fluid/executor.py", line 591, in _run
    exe.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: an illegal memory access was encountered at [/paddle/paddle/fluid/platform/device_context.cc:328]
PaddlePaddle Call Stacks:
0 0x7f0a17a558f5p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 357
1 0x7f0a17a55c79p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 137
2 0x7f0a194e3d56p
3 0x7f0a19506464p paddle::platform::TemporaryAllocator::Release(std::function<void ()> const&) + 100
4 0x7f0a194e5d71p paddle::platform::CUDADeviceContext::Wait() const + 113
5 0x7f0a17b754f7p paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool) + 663
6 0x7f0a17b77335p paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool) + 261
7 0x7f0a17a3a08bp
8 0x7f0a17a811fep
9 0x7f0ad0d6ba38p PyEval_EvalFrameEx + 23816
10 0x7f0ad0d6c6acp PyEval_EvalCodeEx + 2076
11 0x7f0ad0d6a9e6p PyEval_EvalFrameEx + 19638
12 0x7f0ad0d6c6acp PyEval_EvalCodeEx + 2076
13 0x7f0ad0d6a9e6p PyEval_EvalFrameEx + 19638
14 0x7f0ad0d6c6acp PyEval_EvalCodeEx + 2076
15 0x7f0ad0d6a9e6p PyEval_EvalFrameEx + 19638
16 0x7f0ad0d6aaebp PyEval_EvalFrameEx + 19899
17 0x7f0ad0d6c6acp PyEval_EvalCodeEx + 2076
18 0x7f0ad0d6c7c9p PyEval_EvalCode + 25
19 0x7f0ad0d8fdbap PyRun_FileExFlags + 138
20 0x7f0ad0d91167p PyRun_SimpleFileExFlags + 215
21 0x7f0ad0da738ep Py_Main + 3134
22 0x7f0ad0046d20p __libc_start_main + 256
23 0x4006c9p

Locally, with 2 GPUs (K40) and fluid 1.3.2, training runs normally. On the cloud, with 4 GPUs (P40) and fluid 1.3.0, training fails with the error above.