Single-machine multi-GPU training error: Cannot find fetched variable
Created by: dubhex
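The program trains normally on a single machine with a single GPU, but when I use
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=avg_cost.name, main_program=fluid.default_main_program())
for multi-GPU training, it fails with the following error: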
  File "run_train.py", line 199, in <module>
    main(args)
  File "run_train.py", line 140, in main
    feed = feeder.feed(data))
  File "/opt/python/cp27-cp27mu/lib/python2.7/site-packages/paddle/fluid/executor.py", line 472, in run
    self.executor.run(program.desc, scope, 0, True, True)
KeyboardInterrupt
[root@272c5daea05d face_align_150pts]# python run_train.py
Hi, Program begin
W0301 02:37:43.430696 1892 device_context.cc:213] Please NOTE: device: 0, CUDA Capability: 35, Driver Version: 9.0, Runtime Version: 8.0
W0301 02:37:43.430814 1892 device_context.cc:220] device: 0, cuDNN Version: 5.1.
Data reader is ready
Training begin
Traceback (most recent call last):
  File "run_train.py", line 199, in <module>
    main(args)
  File "run_train.py", line 140, in main
    feed = feeder.feed(data))
  File "/opt/python/cp27-cp27mu/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 287, in run
    self.executor.run(fetch_list, fetch_var_name)
paddle.fluid.core.EnforceNotMet: Cannot find fetched variable.(Perhaps the main_program is not set to ParallelExecutor) at [/paddle/paddle/fluid/framework/details/threaded_ssa_graph_executor.cc:166]
PaddlePaddle Call Stacks:
0 0x7f20c96d5406p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 486
1 0x7f20caf4c658p paddle::framework::details::ThreadedSSAGraphExecutor::InsertFetchOps(std::vector<std::string, std::allocatorstd::string > const&, std::vector<paddle::framework::details::FetchOpHandle*, std::allocatorpaddle::framework::details::FetchOpHandle* >, std::unordered_set<paddle::framework::details::VarHandleBase, std::hashpaddle::framework::details::VarHandleBase*, std::equal_topaddle::framework::details::VarHandleBase*, std::allocatorpaddle::framework::details::VarHandleBase* >, std::unordered_map<paddle::framework::details::OpHandleBase, unsigned long, std::hashpaddle::framework::details::OpHandleBase*, std::equal_topaddle::framework::details::OpHandleBase*, std::allocator<std::pair<paddle::framework::details::OpHandleBase* const, unsigned long> > >, std::unordered_set<paddle::framework::details::VarHandleBase, std::hashpaddle::framework::details::VarHandleBase*, std::equal_topaddle::framework::details::VarHandleBase*, std::allocatorpaddle::framework::details::VarHandleBase* >, paddle::framework::BlockingQueuepaddle::framework::details::VarHandleBase*, std::vector<paddle::framework::LoDTensor, std::allocatorpaddle::framework::LoDTensor >*) + 3208
2 0x7f20caf4cdd8p paddle::framework::details::ThreadedSSAGraphExecutor::Run(std::vector<std::string, std::allocatorstd::string > const&) + 1768
3 0x7f20caf51707p paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocatorstd::string > const&) + 391
4 0x7f20c97cebb9p paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocatorstd::string > const&, std::string const&) + 489
5 0x7f20c96ca2dep
6 0x7f20c96fd78ep
7 0x7f210a90cce8p PyEval_EvalFrameEx + 28264
8 0x7f210a90f37dp PyEval_EvalCodeEx + 2061
9 0x7f210a90cd70p PyEval_EvalFrameEx + 28400
10 0x7f210a90ce9ep PyEval_EvalFrameEx + 28702
11 0x7f210a90f37dp PyEval_EvalCodeEx + 2061
12 0x7f210a90f4b2p PyEval_EvalCode + 50
13 0x7f210a9391c2p PyRun_FileExFlags + 146
14 0x7f210a93a559p PyRun_SimpleFileExFlags + 217
15 0x7f210a9501ddp Py_Main + 3149
16 0x7f2109be3d1dp __libc_start_main + 253
17 0x4006b1p
Is there any way to fix this?
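
One way to narrow this down, offered as a sketch rather than a confirmed fix: the error message suggests that a name passed in fetch_list was not found in the graph that ParallelExecutor built from main_program, so it may help to compare the fetched names against the variable names in that program. The helper find_missing_fetch_names and the sample variable names below are hypothetical. In a real script the program's names would presumably come from something like fluid.default_main_program().global_block().vars, which is an assumption about the installed Paddle version and is not confirmed by the traceback.

```python
def find_missing_fetch_names(fetch_names, program_var_names):
    """Return the fetch names that are absent from the program's variables.

    Order of the input fetch list is preserved in the result.
    """
    known = set(program_var_names)
    return [name for name in fetch_names if name not in known]


# Hypothetical stand-in data. In the real script these would be read from
# the program given to ParallelExecutor, for example (assumption):
#   program_var_names = fluid.default_main_program().global_block().vars.keys()
#   fetch_names = [avg_cost.name]
program_var_names = ["image", "label", "fc_0.tmp_1", "mean_0.tmp_0"]
fetch_names = ["mean_0.tmp_0", "accuracy_0.tmp_0"]

missing = find_missing_fetch_names(fetch_names, program_var_names)
print(missing)
```

If the list printed for the real program is non-empty, those names were probably created in a different Program than the one passed as main_program. If it is empty, the cause lies elsewhere and this check does not explain the error.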