Does Paddle single-machine multi-GPU training require root privileges? On my machine, training only works as root
Created by: JiabinYang
Paddle fluid v1.3
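Single-GPU training works. Multi-GPU training works when run as root, but multi-GPU training as a regular user fails with the following error: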
`---------- Configuration Arguments -----------
batch_size: 16
class_dim: 20
image_shape: 3,224,224
imgmodel_save_dir: output/output_img
lr_init_img: 0.01
lr_init_text: 0.01
num_epochs: 60
num_layers: 50
pretrained_model: data/resnet_50
seg_num: 9
textmodel_save_dir: output/output_text
total_videos: 32025
use_gpu: True
with_mem_opt: True
words_id:19959
words_id:19959
W0312 16:43:50.831005 12132 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 35, Driver API Version: 9.2, Runtime API Version: 9.0
W0312 16:43:50.831063 12132 device_context.cc:271] device: 0, cuDNN Version: 7.0.
W0312 16:43:50.831074 12132 device_context.cc:295] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.3, but CUDNN version in your machine is 7.0, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
* Aborted at 1552380234 (unix time) try "date -d @1552380234" if you are using GNU date *
PC: @ 0x0 (unknown)
* SIGSEGV (@0x50) received by PID 12132 (TID 0x7fd794b66700) from PID 80; stack trace: *
@ 0x7fd79431e160 (unknown)
@ 0x7fd6da48f6b0 freeRing()
@ 0x7fd6da4885b0 commFree()
@ 0x7fd6da48c90d ncclCommInitAll
@ 0x7fd74ab66c4c paddle::platform::NCCLContextMap::NCCLContextMap()
@ 0x7fd74ab62952 paddle::framework::ParallelExecutor::ParallelExecutor()
@ 0x7fd74aa80098 ZZN8pybind1112cpp_function10initializeIZNS_6detail8initimpl11constructorIJRKSt6vectorIN5boost7variantIN6paddle8platform9CUDAPlaceENS9_8CPUPlaceENS9_15CUDAPinnedPlaceENS6_6detail7variant5void_ESF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_SF_EESaISG_EERKSt13unordered_setISsSt4hashISsESt8equal_toISsESaISsEERKNS8_9framework11ProgramDescERKSsPNSU_5ScopeERS5_IS11_SaIS11_EERKNSU_7details17ExecutionStrategyERKNS15_13BuildStrategyEEE7executeINS_6class_INSU_16ParallelExecutorEJEEEJELi0EEEvRT_DpRKT0_EUlRNS2_16value_and_holderESK_ST_SX_SZ_S11_S14_S18_S1B_E_vJS1O_SK_ST_SX_SZ_S11_S14_S18_S1B_EJNS_4nameENS_9is_methodENS_7siblingENS2_24is_new_style_constructorEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS2_13function_callEE1_4_FUNES26
@ 0x7fd74aa4e1fe pybind11::cpp_function::dispatcher()
@ 0x7fd79457fdf3 PyObject_Call
@ 0x7fd79458e9cd instancemethod_call
@ 0x7fd79457fdf3 PyObject_Call
@ 0x7fd7945edbaf slot_tp_init
@ 0x7fd7945ea46f type_call
@ 0x7fd79457fdf3 PyObject_Call
@ 0x7fd7946354a6 PyEval_EvalFrameEx
@ 0x7fd79463b0bd PyEval_EvalCodeEx
@ 0x7fd7945b1f85 function_call
@ 0x7fd79457fdf3 PyObject_Call
@ 0x7fd79458e9cd instancemethod_call
@ 0x7fd79457fdf3 PyObject_Call
@ 0x7fd7945edbaf slot_tp_init
@ 0x7fd7945ea46f type_call
@ 0x7fd79457fdf3 PyObject_Call
@ 0x7fd7946354a6 PyEval_EvalFrameEx
@ 0x7fd794638460 PyEval_EvalFrameEx
@ 0x7fd794638460 PyEval_EvalFrameEx
@ 0x7fd79463b0bd PyEval_EvalCodeEx
@ 0x7fd79463b1f2 PyEval_EvalCode
@ 0x7fd794663f42 PyRun_FileExFlags
@ 0x7fd7946652d9 PyRun_SimpleFileExFlags
@ 0x7fd79467b00d Py_Main
@ 0x7fd793878bd5 __libc_start_main
run_train.sh: line 9: 12132 Segmentation fault CUDA_VISIBLE_DEVICES=0,1,2,3 $python video_train.py`
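
The trace shows the crash inside `ncclCommInitAll`, during cleanup (`commFree` / `freeRing`), which suggests NCCL initialization itself failed for the non-root user. One possible cause, which is an assumption and is not confirmed by the log above, is that the regular user lacks read/write access to resources NCCL relies on, such as `/dev/shm` or the `/dev/nvidia*` device files. A minimal sketch for checking this as the affected user follows; the path list and the helper names `check_access` and `candidate_paths` are illustrative, not part of Paddle or NCCL.

```python
import glob
import os


def check_access(paths):
    """Return {path: (readable, writable)} for the current user."""
    return {p: (os.access(p, os.R_OK), os.access(p, os.W_OK)) for p in paths}


def candidate_paths():
    # Assumed locations on a typical Linux + NVIDIA driver setup;
    # adjust if your system places these elsewhere.
    return ["/dev/shm"] + sorted(glob.glob("/dev/nvidia*"))


if __name__ == "__main__":
    # Run this as the regular user, then as root, and compare the output.
    for path, (readable, writable) in sorted(check_access(candidate_paths()).items()):
        print("{}: read={} write={}".format(path, readable, writable))
```

If any path reports `False` for the regular user but `True` for root, that difference is worth investigating first. Independently of this script, running the training command with the NCCL environment variable `NCCL_DEBUG=INFO` set may make NCCL print the reason initialization failed before the segmentation fault occurs.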