ERROR 2019-12-12 04:02:41,061 launch.py:269] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 2] was aborted. Please check its log.
Created by: yanmeizhao
Local environment: CUDA 10, cuDNN 7.6.3, latest code on the develop branch of models, Paddle 1.6.1, gcc version 5.4.0
The following error occurs when training image_classification with DALI:
[root@5d564e9b351a image_classification]# sh train_dali.sh
----------- Configuration Arguments -----------
cluster_node_ips: 127.0.0.1
log_dir: None
node_ip: 127.0.0.1
print_config: True
selected_gpus: None
started_port: 6170
training_script: train.py
training_script_args: ['--model=ResNet50', '--batch_size=32', '--lr_strategy=cosine_decay_warmup', '--num_epochs=240', '--lr=0.05', '--l2_decay=3e-5', '--lower_scale=0.64', '--lower_ratio=0.8', '--upper_ratio=1.2', '--use_dali=True']
use_paddlecloud: True
trainers_endpoints: 127.0.0.1:6170,127.0.0.1:6171,127.0.0.1:6172,127.0.0.1:6173 , node_id: 0 , current_node_ip: 127.0.0.1 , num_nodes: 1 , node_ips: ['127.0.0.1'] , nranks: 4
------------- Configuration Arguments -------------
batch_size : 32
checkpoint : None
class_dim : 1000
data_dir : ./data/ILSVRC2012/
decay_epochs : 2.4
decay_rate : 0.97
drop_connect_rate : 0.2
ema_decay : 0.9999
enable_ce : False
image_mean : [0.485, 0.456, 0.406]
image_shape : [3, 224, 224]
image_std : [0.229, 0.224, 0.225]
interpolation : None
is_profiler : 0
l2_decay : 3e-05
label_smoothing_epsilon : 0.1
lower_ratio : 0.8
lower_scale : 0.64
lr : 0.05
lr_strategy : cosine_decay_warmup
max_iter : 0
mixup_alpha : 0.2
model : ResNet50
model_save_dir : ./output
momentum_rate : 0.9
num_epochs : 240
padding_type : SAME
pretrained_model : None
print_step : 10
profiler_path : ./
random_seed : None
reader_buf_size : 2048
reader_thread : 8
resize_short_size : 256
same_feed : 0
save_step : 1
step_epochs : [30, 60, 90]
test_batch_size : 16
total_images : 1281167
upper_ratio : 1.2
use_aa : False
use_dali : 1
use_ema : False
use_gpu : True
use_label_smoothing : False
use_mixup : False
use_se : True
validate : 1
warm_up_epochs : 5.0
W1212 04:02:31.290369 24045 device_context.cc:235] Please NOTE: device: 3, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 10.0 W1212 04:02:31.302067 24045 device_context.cc:243] device: 3, cuDNN Version: 7.6. W1212 04:02:31.330199 24044 device_context.cc:235] Please NOTE: device: 2, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 10.0 W1212 04:02:31.333243 24043 device_context.cc:235] Please NOTE: device: 1, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 10.0 W1212 04:02:31.337818 24044 device_context.cc:243] device: 2, cuDNN Version: 7.6. W1212 04:02:31.340380 24043 device_context.cc:243] device: 1, cuDNN Version: 7.6. W1212 04:02:31.444458 24042 device_context.cc:235] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.0, Runtime API Version: 10.0 W1212 04:02:31.451699 24042 device_context.cc:243] device: 0, cuDNN Version: 7.6. W1212 04:02:33.282218 24043 init.cc:205] *** Aborted at 1576123353 (unix time) try "date -d @1576123353" if you are using GNU date *** W1212 04:02:33.285073 24043 init.cc:205] PC: @ 0x0 (unknown) W1212 04:02:33.285284 24043 init.cc:205] *** SIGSEGV (@0x0) received by PID 24043 (TID 0x7fe6e3b9a740) from PID 0; stack trace: *** W1212 04:02:33.287878 24043 init.cc:205] @ 0x7fe6e37785d0 (unknown) W1212 04:02:33.288434 24043 init.cc:205] @ 0x7fe68db87e3b (unknown) W1212 04:02:33.288965 24043 init.cc:205] @ 0x7fe68dbc21df (unknown) W1212 04:02:33.289500 24043 init.cc:205] @ 0x7fe68db91522 (unknown) W1212 04:02:33.290030 24043 init.cc:205] @ 0x7fe68dbc1f02 PyInit_backend_impl W1212 04:02:33.291326 24043 init.cc:205] @ 0x559a8cfd08e5 _PyImport_LoadDynamicModuleWithSpec W1212 04:02:33.292248 24043 init.cc:205] @ 0x559a8cfd0ae5 _imp_create_dynamic W1212 04:02:33.293475 24043 init.cc:205] @ 0x559a8cecca61 PyCFunction_Call W1212 04:02:33.294788 24043 init.cc:205] @ 0x559a8cf80fdb _PyEval_EvalFrameDefault W1212 04:02:33.295608 24043 init.cc:205] @ 0x559a8cf52a94 
_PyEval_EvalCodeWithName W1212 04:02:33.296420 24043 init.cc:205] @ 0x559a8cf53941 fast_function W1212 04:02:33.297256 24043 init.cc:205] @ 0x559a8cf59755 call_function W1212 04:02:33.298564 24043 init.cc:205] @ 0x559a8cf7bcba _PyEval_EvalFrameDefault W1212 04:02:33.299376 24043 init.cc:205] @ 0x559a8cf5370b fast_function W1212 04:02:33.300211 24043 init.cc:205] @ 0x559a8cf59755 call_function W1212 04:02:33.301522 24043 init.cc:205] @ 0x559a8cf7bcba _PyEval_EvalFrameDefault W1212 04:02:33.302337 24043 init.cc:205] @ 0x559a8cf5370b fast_function W1212 04:02:33.303174 24043 init.cc:205] @ 0x559a8cf59755 call_function W1212 04:02:33.304481 24043 init.cc:205] @ 0x559a8cf7bcba _PyEval_EvalFrameDefault W1212 04:02:33.305295 24043 init.cc:205] @ 0x559a8cf5370b fast_function W1212 04:02:33.306128 24043 init.cc:205] @ 0x559a8cf59755 call_function W1212 04:02:33.307440 24043 init.cc:205] @ 0x559a8cf7bcba _PyEval_EvalFrameDefault W1212 04:02:33.308254 24043 init.cc:205] @ 0x559a8cf5370b fast_function W1212 04:02:33.309087 24043 init.cc:205] @ 0x559a8cf59755 call_function W1212 04:02:33.310396 24043 init.cc:205] @ 0x559a8cf7bcba _PyEval_EvalFrameDefault W1212 04:02:33.311619 24043 init.cc:205] @ 0x559a8cf53d7b _PyFunction_FastCallDict W1212 04:02:33.312815 24043 init.cc:205] @ 0x559a8cec9f5f _PyObject_FastCallDict W1212 04:02:33.314150 24043 init.cc:205] @ 0x559a8cf0e670 _PyObject_CallMethodIdObjArgs W1212 04:02:33.315428 24043 init.cc:205] @ 0x559a8cec0a70 PyImport_ImportModuleLevelObject W1212 04:02:33.316738 24043 init.cc:205] @ 0x559a8cf7e033 _PyEval_EvalFrameDefault W1212 04:02:33.318001 24043 init.cc:205] @ 0x559a8cf54459 PyEval_EvalCodeEx W1212 04:02:33.319205 24043 init.cc:205] @ 0x559a8cf551ec PyEval_EvalCode W1212 04:02:33.400863 24044 init.cc:205] *** Aborted at 1576123353 (unix time) try "date -d @1576123353" if you are using GNU date *** W1212 04:02:33.403725 24044 init.cc:205] PC: @ 0x0 (unknown) W1212 04:02:33.403934 24044 init.cc:205] *** SIGSEGV (@0x0) 
received by PID 24044 (TID 0x7f6406bd7740) from PID 0; stack trace: *** W1212 04:02:33.406607 24044 init.cc:205] @ 0x7f64067b55d0 (unknown) W1212 04:02:33.407167 24044 init.cc:205] @ 0x7f63b0bc4e3b (unknown) W1212 04:02:33.407702 24044 init.cc:205] @ 0x7f63b0bff1df (unknown) W1212 04:02:33.408239 24044 init.cc:205] @ 0x7f63b0bce522 (unknown) W1212 04:02:33.408773 24044 init.cc:205] @ 0x7f63b0bfef02 PyInit_backend_impl W1212 04:02:33.410063 24044 init.cc:205] @ 0x55615a8c98e5 _PyImport_LoadDynamicModuleWithSpec W1212 04:02:33.410984 24044 init.cc:205] @ 0x55615a8c9ae5 _imp_create_dynamic W1212 04:02:33.412220 24044 init.cc:205] @ 0x55615a7c5a61 PyCFunction_Call W1212 04:02:33.413537 24044 init.cc:205] @ 0x55615a879fdb _PyEval_EvalFrameDefault W1212 04:02:33.414355 24044 init.cc:205] @ 0x55615a84ba94 _PyEval_EvalCodeWithName W1212 04:02:33.415171 24044 init.cc:205] @ 0x55615a84c941 fast_function W1212 04:02:33.416003 24044 init.cc:205] @ 0x55615a852755 call_function W1212 04:02:33.417318 24044 init.cc:205] @ 0x55615a874cba _PyEval_EvalFrameDefault W1212 04:02:33.418129 24044 init.cc:205] @ 0x55615a84c70b fast_function W1212 04:02:33.418969 24044 init.cc:205] @ 0x55615a852755 call_function W1212 04:02:33.419068 24045 init.cc:205] *** Aborted at 1576123353 (unix time) try "date -d @1576123353" if you are using GNU date *** W1212 04:02:33.420289 24044 init.cc:205] @ 0x55615a874cba _PyEval_EvalFrameDefault W1212 04:02:33.421099 24044 init.cc:205] @ 0x55615a84c70b fast_function W1212 04:02:33.421872 24045 init.cc:205] PC: @ 0x0 (unknown) W1212 04:02:33.421942 24044 init.cc:205] @ 0x55615a852755 call_function W1212 04:02:33.422075 24045 init.cc:205] *** SIGSEGV (@0x0) received by PID 24045 (TID 0x7f8121536740) from PID 0; stack trace: *** W1212 04:02:33.423307 24044 init.cc:205] @ 0x55615a874cba _PyEval_EvalFrameDefault W1212 04:02:33.424119 24044 init.cc:205] @ 0x55615a84c70b fast_function W1212 04:02:33.424721 24045 init.cc:205] @ 0x7f81211145d0 (unknown) W1212 
04:02:33.424958 24044 init.cc:205] @ 0x55615a852755 call_function W1212 04:02:33.425238 24045 init.cc:205] @ 0x7f80c0f8de3b (unknown) W1212 04:02:33.425730 24045 init.cc:205] @ 0x7f80c0fc81df (unknown) W1212 04:02:33.426224 24045 init.cc:205] @ 0x7f80c0f97522 (unknown) W1212 04:02:33.426277 24044 init.cc:205] @ 0x55615a874cba _PyEval_EvalFrameDefault W1212 04:02:33.426712 24045 init.cc:205] @ 0x7f80c0fc7f02 PyInit_backend_impl W1212 04:02:33.427100 24044 init.cc:205] @ 0x55615a84c70b fast_function W1212 04:02:33.427959 24044 init.cc:205] @ 0x55615a852755 call_function W1212 04:02:33.428014 24045 init.cc:205] @ 0x5573e9a5f8e5 _PyImport_LoadDynamicModuleWithSpec W1212 04:02:33.428985 24045 init.cc:205] @ 0x5573e9a5fae5 _imp_create_dynamic W1212 04:02:33.429309 24044 init.cc:205] @ 0x55615a874cba _PyEval_EvalFrameDefault W1212 04:02:33.430250 24045 init.cc:205] @ 0x5573e995ba61 PyCFunction_Call W1212 04:02:33.430562 24044 init.cc:205] @ 0x55615a84cd7b _PyFunction_FastCallDict W1212 04:02:33.431604 24045 init.cc:205] @ 0x5573e9a0ffdb _PyEval_EvalFrameDefault W1212 04:02:33.431788 24044 init.cc:205] @ 0x55615a7c2f5f _PyObject_FastCallDict W1212 04:02:33.432447 24045 init.cc:205] @ 0x5573e99e1a94 _PyEval_EvalCodeWithName W1212 04:02:33.433146 24044 init.cc:205] @ 0x55615a807670 _PyObject_CallMethodIdObjArgs W1212 04:02:33.433285 24045 init.cc:205] @ 0x5573e99e2941 fast_function W1212 04:02:33.434139 24045 init.cc:205] @ 0x5573e99e8755 call_function W1212 04:02:33.434453 24044 init.cc:205] @ 0x55615a7b9a70 PyImport_ImportModuleLevelObject W1212 04:02:33.435487 24045 init.cc:205] @ 0x5573e9a0acba _PyEval_EvalFrameDefault W1212 04:02:33.435796 24044 init.cc:205] @ 0x55615a877033 _PyEval_EvalFrameDefault W1212 04:02:33.436323 24045 init.cc:205] @ 0x5573e99e270b fast_function W1212 04:02:33.437090 24044 init.cc:205] @ 0x55615a84d459 PyEval_EvalCodeEx W1212 04:02:33.437180 24045 init.cc:205] @ 0x5573e99e8755 call_function W1212 04:02:33.438321 24044 init.cc:205] @ 
0x55615a84e1ec PyEval_EvalCode W1212 04:02:33.438518 24045 init.cc:205] @ 0x5573e9a0acba _PyEval_EvalFrameDefault W1212 04:02:33.439347 24045 init.cc:205] @ 0x5573e99e270b fast_function W1212 04:02:33.440203 24045 init.cc:205] @ 0x5573e99e8755 call_function W1212 04:02:33.441527 24045 init.cc:205] @ 0x5573e9a0acba _PyEval_EvalFrameDefault W1212 04:02:33.442353 24045 init.cc:205] @ 0x5573e99e270b fast_function W1212 04:02:33.443202 24045 init.cc:205] @ 0x5573e99e8755 call_function W1212 04:02:33.444519 24045 init.cc:205] @ 0x5573e9a0acba _PyEval_EvalFrameDefault W1212 04:02:33.445341 24045 init.cc:205] @ 0x5573e99e270b fast_function W1212 04:02:33.446188 24045 init.cc:205] @ 0x5573e99e8755 call_function W1212 04:02:33.447504 24045 init.cc:205] @ 0x5573e9a0acba _PyEval_EvalFrameDefault W1212 04:02:33.448730 24045 init.cc:205] @ 0x5573e99e2d7b _PyFunction_FastCallDict W1212 04:02:33.449932 24045 init.cc:205] @ 0x5573e9958f5f _PyObject_FastCallDict W1212 04:02:33.451252 24045 init.cc:205] @ 0x5573e999d670 _PyObject_CallMethodIdObjArgs W1212 04:02:33.452535 24045 init.cc:205] @ 0x5573e994fa70 PyImport_ImportModuleLevelObject W1212 04:02:33.453855 24045 init.cc:205] @ 0x5573e9a0d033 _PyEval_EvalFrameDefault W1212 04:02:33.455127 24045 init.cc:205] @ 0x5573e99e3459 PyEval_EvalCodeEx W1212 04:02:33.456331 24045 init.cc:205] @ 0x5573e99e41ec PyEval_EvalCode W1212 04:02:33.783942 24042 init.cc:205] *** Aborted at 1576123353 (unix time) try "date -d @1576123353" if you are using GNU date *** W1212 04:02:33.786825 24042 init.cc:205] PC: @ 0x0 (unknown) W1212 04:02:33.787039 24042 init.cc:205] *** SIGSEGV (@0x0) received by PID 24042 (TID 0x7efd88a12740) from PID 0; stack trace: *** W1212 04:02:33.789687 24042 init.cc:205] @ 0x7efd885f05d0 (unknown) W1212 04:02:33.790251 24042 init.cc:205] @ 0x7efd329ffe3b (unknown) W1212 04:02:33.790787 24042 init.cc:205] @ 0x7efd32a3a1df (unknown) W1212 04:02:33.791326 24042 init.cc:205] @ 0x7efd32a09522 (unknown) W1212 04:02:33.791860 24042 
init.cc:205] @ 0x7efd32a39f02 PyInit_backend_impl W1212 04:02:33.793159 24042 init.cc:205] @ 0x55fcbae418e5 _PyImport_LoadDynamicModuleWithSpec W1212 04:02:33.794075 24042 init.cc:205] @ 0x55fcbae41ae5 _imp_create_dynamic W1212 04:02:33.795310 24042 init.cc:205] @ 0x55fcbad3da61 PyCFunction_Call W1212 04:02:33.796627 24042 init.cc:205] @ 0x55fcbadf1fdb _PyEval_EvalFrameDefault W1212 04:02:33.797446 24042 init.cc:205] @ 0x55fcbadc3a94 _PyEval_EvalCodeWithName W1212 04:02:33.798260 24042 init.cc:205] @ 0x55fcbadc4941 fast_function W1212 04:02:33.799093 24042 init.cc:205] @ 0x55fcbadca755 call_function W1212 04:02:33.800408 24042 init.cc:205] @ 0x55fcbadeccba _PyEval_EvalFrameDefault W1212 04:02:33.801223 24042 init.cc:205] @ 0x55fcbadc470b fast_function W1212 04:02:33.802060 24042 init.cc:205] @ 0x55fcbadca755 call_function W1212 04:02:33.803372 24042 init.cc:205] @ 0x55fcbadeccba _PyEval_EvalFrameDefault W1212 04:02:33.804188 24042 init.cc:205] @ 0x55fcbadc470b fast_function W1212 04:02:33.805022 24042 init.cc:205] @ 0x55fcbadca755 call_function W1212 04:02:33.806337 24042 init.cc:205] @ 0x55fcbadeccba _PyEval_EvalFrameDefault W1212 04:02:33.807153 24042 init.cc:205] @ 0x55fcbadc470b fast_function W1212 04:02:33.807986 24042 init.cc:205] @ 0x55fcbadca755 call_function W1212 04:02:33.809303 24042 init.cc:205] @ 0x55fcbadeccba _PyEval_EvalFrameDefault W1212 04:02:33.810118 24042 init.cc:205] @ 0x55fcbadc470b fast_function W1212 04:02:33.810961 24042 init.cc:205] @ 0x55fcbadca755 call_function W1212 04:02:33.812273 24042 init.cc:205] @ 0x55fcbadeccba _PyEval_EvalFrameDefault W1212 04:02:33.813495 24042 init.cc:205] @ 0x55fcbadc4d7b _PyFunction_FastCallDict W1212 04:02:33.814693 24042 init.cc:205] @ 0x55fcbad3af5f _PyObject_FastCallDict W1212 04:02:33.816009 24042 init.cc:205] @ 0x55fcbad7f670 _PyObject_CallMethodIdObjArgs W1212 04:02:33.817297 24042 init.cc:205] @ 0x55fcbad31a70 PyImport_ImportModuleLevelObject W1212 04:02:33.818611 24042 init.cc:205] @ 0x55fcbadef033 
_PyEval_EvalFrameDefault W1212 04:02:33.819876 24042 init.cc:205] @ 0x55fcbadc5459 PyEval_EvalCodeEx W1212 04:02:33.821080 24042 init.cc:205] @ 0x55fcbadc61ec PyEval_EvalCode ERROR 2019-12-12 04:02:41,061 launch.py:269] ABORT!!! Out of all 4 trainers, the trainer process with rank=[0, 2] was aborted. Please check its log.
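
All four traces pass through PyInit_backend_impl under _imp_create_dynamic and PyImport_ImportModuleLevelObject, so the SIGSEGV appears to be raised while a native extension module is being imported, before any training step runs. A minimal sketch for narrowing this down outside launch.py is to import each suspect module in a fresh interpreter and look at the exit status. The helper name `try_import` and the module names `paddle.fluid` and `nvidia.dali` are assumptions for illustration; substitute whatever train.py actually imports.

```python
import subprocess
import sys


def try_import(module_name, timeout=120):
    """Import module_name in a fresh interpreter and report how the child exited.

    Returns (returncode, stderr_text). On POSIX a negative return code means
    the child was killed by a signal, e.g. -11 for SIGSEGV.
    """
    proc = subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    return proc.returncode, proc.stderr.decode(errors="replace")


if __name__ == "__main__":
    # Assumed module names, not taken from the log above.
    for name in ("paddle.fluid", "nvidia.dali"):
        code, err = try_import(name)
        if code == 0:
            status = "imported cleanly"
        elif code < 0:
            status = f"killed by signal {-code}"
        else:
            status = f"exited with code {code}"
        print(f"{name}: {status}")
        if err.strip():
            # Show only the last line of stderr to keep the output short.
            print("  " + err.strip().splitlines()[-1])
```

If one module is killed by a signal on its own, the problem is likely in that package's binary (for example a build that does not match the installed CUDA or Python version) rather than in the training script. Running the same import with `python -X faulthandler -c "import <module>"` should additionally print the Python-level stack at the point of the crash.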