Training VGG and ResNet with the paddlepaddle/paddle:latest-gpu-cuda10.0-cudnn7 image sometimes runs normally and sometimes fails with the error below. (PaddlePaddle / Paddle, Issue #22243)


Created by: gentelyang

------------- Configuration Arguments -------------
batch_size : 32
checkpoint : None
class_dim : 1000
data_dir : ./data/ILSVRC2012/
enable_ce : False
fp16 : False
image_shape : 3,224,224
is_distill : False
l2_decay : 0.0001
label_smoothing_epsilon : 0.2
lower_ratio : 0.75
lower_scale : 0.08
lr : 0.1
lr_strategy : piecewise_decay
mixup_alpha : 0.2
model : ResNet50
model_save_dir : output/
momentum_rate : 0.9
num_epochs : 120
pretrained_model : None
resize_short_size : 256
scale_loss : 1.0
total_images : 1281167
upper_ratio : 1.33333333333
use_gpu : True
use_label_smoothing : False
use_mixup : False
with_inplace : True
with_mem_opt : 1

2020-01-13 06:33:49,690-WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-01-13 06:33:55,166-WARNING: paddle.fluid.layers.py_reader() may be deprecated in the near future. Please use paddle.fluid.io.DataLoader.from_generator() instead.
2020-01-13 06:33:55,711-WARNING: Caution! paddle.fluid.memory_optimize() is deprecated and not maintained any more, since it is not stable! This API would not take any memory optimizations on your Program now, since we have provided default strategies for you. The newest and stable memory optimization strategies (they are all enabled by default) are as follows:

Garbage collection strategy, which is enabled by exporting environment variable FLAGS_eager_delete_tensor_gb=0 (0 is the default value).
Inplace strategy, which is enabled by setting build_strategy.enable_inplace=True (True is the default value) when using CompiledProgram or ParallelExecutor.

2020-01-13 06:33:55,711-WARNING: Caution! paddle.fluid.memory_optimize() is deprecated and not maintained any more, since it is not stable! This API would not take any memory optimizations on your Program now, since we have provided default strategies for you. The newest and stable memory optimization strategies (they are all enabled by default) are as follows:

Garbage collection strategy, which is enabled by exporting environment variable FLAGS_eager_delete_tensor_gb=0 (0 is the default value).
Inplace strategy, which is enabled by setting build_strategy.enable_inplace=True (True is the default value) when using CompiledProgram or ParallelExecutor.
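
The warning above names two replacement settings for `paddle.fluid.memory_optimize()`: the `FLAGS_eager_delete_tensor_gb` environment variable and `build_strategy.enable_inplace`. As a minimal sketch, the environment variable could be set explicitly when launching the training script. The helper names `training_env` and `launch` are hypothetical and not part of Paddle; since the warning states that 0 is already the default, this only makes the setting explicit.

```python
import os
import subprocess


def training_env(base_env=None):
    """Return a copy of the environment with the garbage-collection flag
    from the warning set explicitly (hypothetical helper, not a Paddle API)."""
    env = dict(os.environ if base_env is None else base_env)
    # Per the warning text, 0 is the default value of this flag.
    env["FLAGS_eager_delete_tensor_gb"] = "0"
    return env


def launch(cmd):
    """Run a training command with the explicit environment (hypothetical helper)."""
    return subprocess.run(cmd, env=training_env(), check=False)
```

For example, `launch(["python", "train.py", "--model=ResNet50", "--batch_size=32"])` would start the script from this report with the flag set. The inplace strategy is configured inside the training script itself, so it is not covered by this sketch.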

W0113 06:33:57.231657 7512 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 70, Driver API Version: 10.1, Runtime API Version: 10.0
W0113 06:33:57.316989 7512 device_context.cc:245] device: 0, cuDNN Version: 7.5.
W0113 06:33:57.317095 7512 device_context.cc:271] WARNING: device: 0. The installed Paddle is compiled with CUDNN 7.6, but CUDNN version in your machine is 7.5, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
W0113 06:34:00.199051 7609 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0113 06:34:00.199091 7609 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0113 06:34:00.199101 7609 init.cc:214] The detail failure signal is:

W0113 06:34:00.199110 7609 init.cc:217] *** Aborted at 1578897240 (unix time) try "date -d @1578897240" if you are using GNU date ***
W0113 06:34:00.200392 7609 init.cc:217] PC: @ 0x0 (unknown)
W0113 06:34:00.201520 7609 init.cc:217] *** SIGSEGV (@0x8e75c) received by PID 7512 (TID 0x7fa6bd771700) from PID 583516; stack trace: ***
W0113 06:34:00.202455 7609 init.cc:217] @ 0x7fa6eb628390 (unknown)
W0113 06:34:00.203292 7609 init.cc:217] @ 0x7fa6eb2d1512 cfree
W0113 06:34:00.203785 7609 init.cc:217] @ 0x7fa640684c89 (unknown)
W0113 06:34:00.204277 7609 init.cc:217] @ 0x7fa6403005b5 (unknown)
W0113 06:34:00.204366 7609 init.cc:217] @ 0x4bce2f PyEval_EvalFrameEx
W0113 06:34:00.204437 7609 init.cc:217] @ 0x4ba506 PyEval_EvalCodeEx
W0113 06:34:00.204509 7609 init.cc:217] @ 0x4c1e32 PyEval_EvalFrameEx
W0113 06:34:00.204573 7609 init.cc:217] @ 0x4ba506 PyEval_EvalCodeEx
W0113 06:34:00.204644 7609 init.cc:217] @ 0x4d5e43 (unknown)
W0113 06:34:00.204690 7609 init.cc:217] @ 0x4a646e PyObject_Call
W0113 06:34:00.204763 7609 init.cc:217] @ 0x53a1dc (unknown)
W0113 06:34:00.204833 7609 init.cc:217] @ 0x4c1c83 PyEval_EvalFrameEx
W0113 06:34:00.204908 7609 init.cc:217] @ 0x4ba506 PyEval_EvalCodeEx
W0113 06:34:00.204983 7609 init.cc:217] @ 0x4d5e43 (unknown)
W0113 06:34:00.205029 7609 init.cc:217] @ 0x4a646e PyObject_Call
W0113 06:34:00.205104 7609 init.cc:217] @ 0x4c2c4a PyEval_EvalFrameEx
W0113 06:34:00.205174 7609 init.cc:217] @ 0x4c1934 PyEval_EvalFrameEx
W0113 06:34:00.205245 7609 init.cc:217] @ 0x4c1934 PyEval_EvalFrameEx
W0113 06:34:00.205308 7609 init.cc:217] @ 0x4ba506 PyEval_EvalCodeEx
W0113 06:34:00.205377 7609 init.cc:217] @ 0x4d5d09 (unknown)
W0113 06:34:00.205451 7609 init.cc:217] @ 0x4ee30e (unknown)
W0113 06:34:00.205497 7609 init.cc:217] @ 0x4a646e PyObject_Call
W0113 06:34:00.205559 7609 init.cc:217] @ 0x4c6690 PyEval_CallObjectWithKeywords
W0113 06:34:00.205633 7609 init.cc:217] @ 0x588e42 (unknown)
W0113 06:34:00.206462 7609 init.cc:217] @ 0x7fa6eb61e6ba start_thread
W0113 06:34:00.207336 7609 init.cc:217] @ 0x7fa6eb35441d clone
W0113 06:34:00.208194 7609 init.cc:217] @ 0x0 (unknown)
run_resnet.sh: line 15: 7512 Segmentation fault python train.py --model=ResNet50 --batch_size=32 --total_images=1281167 --class_dim=1000 --image_shape=3,224,224 --model_save_dir=output/ --with_mem_opt=True --lr_strategy=piecewise_decay --num_epochs=120 --lr=0.1 --l2_decay=1e-4
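
The log above reports that the installed Paddle was compiled with cuDNN 7.6 while the machine provides cuDNN 7.5. Whether that mismatch is related to the intermittent segmentation fault is not established by this report. Because the failure only happens on some runs, one way to check each run's log for the mismatch warning is sketched below. The function name `find_cudnn_mismatch` is hypothetical, and the pattern assumes the warning wording shown in this log.

```python
import re

# Pattern based on the device_context.cc warning text quoted in this report.
_MISMATCH = re.compile(
    r"compiled with CUDNN (\d+(?:\.\d+)*), "
    r"but CUDNN version in your machine is (\d+(?:\.\d+)*)"
)


def find_cudnn_mismatch(log_text):
    """Return (compiled_version, installed_version) if the mismatch warning
    appears in log_text, otherwise None (hypothetical helper)."""
    match = _MISMATCH.search(log_text)
    if match is None:
        return None
    return match.group(1), match.group(2)
```

Running it over the logs of both passing and failing runs would show whether the warning is present in every case or only in the failing ones.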
