Training crashes on Ubuntu
Created by: Darki-luo
-
Version and environment information:
1) PaddlePaddle version: 1.8.4, paddlex 2.1.0
3) GPU: 2080ti, CUDNN 6.5.7
4) System environment: training crashes on ubuntu 18/20 with linux 5.4.0.48; training works normally on ubuntu18 with Linux4.15.118
-
Training information:
1) Single machine, single GPU
2) GPU memory information:
|==============================+======================+======================|
| 0 GeForce RTX 208... Off | 00000000:01:00.0 Off | N/A |
| 36% 52C P2 62W / 250W | 10694MiB / 11016MiB | 0% Default |
3) Operator information
-
Reproduction information: if reporting an error, please provide the reproduction environment and reproduction steps
-
Problem description: please describe your problem in detail and include the error message, logs, and a reproducible code snippet
2020-09-27 13:37:32 [INFO] [TRAIN] Epoch=6/50, Step=85/2205, loss=1.115501, acc1=0.730769, acc5=0.961538, lr=0.01, time_each_step=0.46s, eta=15:3:52
2020-09-27 13:37:32 [INFO] [TRAIN] Epoch=6/50, Step=86/2205, loss=0.654025, acc1=0.807692, acc5=1.0, lr=0.01, time_each_step=0.46s, eta=15:4:48
2020-09-27 13:37:33 [INFO] [TRAIN] Epoch=6/50, Step=87/2205, loss=0.676896, acc1=0.884615, acc5=0.961538, lr=0.01, time_each_step=0.46s, eta=15:5:27
2020-09-27 13:37:33 [INFO] [TRAIN] Epoch=6/50, Step=88/2205, loss=0.946485, acc1=0.769231, acc5=0.923077, lr=0.01, time_each_step=0.46s, eta=15:5:22
2020-09-27 13:37:34 [INFO] [TRAIN] Epoch=6/50, Step=89/2205, loss=0.93403, acc1=0.769231, acc5=0.961538, lr=0.01, time_each_step=0.46s, eta=15:5:44
2020-09-27 13:37:34 [INFO] [TRAIN] Epoch=6/50, Step=90/2205, loss=0.736229, acc1=0.807692, acc5=0.961538, lr=0.01, time_each_step=0.46s, eta=15:5:29
2020-09-27 13:37:35 [INFO] [TRAIN] Epoch=6/50, Step=91/2205, loss=0.646935, acc1=0.846154, acc5=0.961538, lr=0.01, time_each_step=0.46s, eta=15:5:38
2020-09-27 13:37:35 [INFO] [TRAIN] Epoch=6/50, Step=92/2205, loss=0.751287, acc1=0.769231, acc5=1.0, lr=0.01, time_each_step=0.46s, eta=15:6:59
2020-09-27 13:37:36 [INFO] [TRAIN] Epoch=6/50, Step=93/2205, loss=0.856711, acc1=0.730769, acc5=1.0, lr=0.01, time_each_step=0.46s, eta=15:6:48
2020-09-27 13:37:36 [INFO] [TRAIN] Epoch=6/50, Step=94/2205, loss=0.661991, acc1=0.846154, acc5=1.0, lr=0.01, time_each_step=0.46s, eta=15:7:51
2020-09-27 13:37:37 [INFO] [TRAIN] Epoch=6/50, Step=95/2205, loss=0.656951, acc1=0.769231, acc5=1.0, lr=0.01, time_each_step=0.46s, eta=15:6:48
2020-09-27 13:37:37 [INFO] [TRAIN] Epoch=6/50, Step=96/2205, loss=0.797513, acc1=0.769231, acc5=1.0, lr=0.01, time_each_step=0.46s, eta=15:6:51
2020-09-27 13:37:37 [INFO] [TRAIN] Epoch=6/50, Step=97/2205, loss=0.665872, acc1=0.807692, acc5=1.0, lr=0.01, time_each_step=0.46s, eta=15:6:46
2020-09-27 13:37:38 [INFO] [TRAIN] Epoch=6/50, Step=98/2205, loss=1.017552, acc1=0.846154, acc5=0.961538, lr=0.01, time_each_step=0.46s, eta=15:6:20
2020-09-27 13:37:38 [INFO] [TRAIN] Epoch=6/50, Step=99/2205, loss=1.241585, acc1=0.653846, acc5=0.807692, lr=0.01, time_each_step=0.46s, eta=15:5:46
2020-09-27 13:37:39 [INFO] [TRAIN] Epoch=6/50, Step=100/2205, loss=1.026779, acc1=0.730769, acc5=0.961538, lr=0.01, time_each_step=0.46s, eta=15:2:56
2020-09-27 13:37:39 [INFO] [TRAIN] Epoch=6/50, Step=101/2205, loss=1.120948, acc1=0.730769, acc5=0.884615, lr=0.01, time_each_step=0.45s, eta=15:2:19
W0927 13:37:39.821702 10029 init.cc:226] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0927 13:37:39.821796 10029 init.cc:228] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0927 13:37:39.821807 10029 init.cc:231] The detail failure signal is:
W0927 13:37:39.821816 10029 init.cc:234] *** Aborted at 1601185059 (unix time) try "date -d @1601185059" if you are using GNU date ***
W0927 13:37:39.823652 10029 init.cc:234] PC: @ 0x0 (unknown)
Segmentation fault (core dumped)
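The "Aborted at" line gives the crash time as a unix timestamp. As a cross-check that does not depend on GNU date, here is a minimal Python sketch that decodes it. The UTC+8 offset and the names `ASSUMED_TZ` and `decode_abort_time` are assumptions for illustration; the report does not state the machine's timezone.

```python
from datetime import datetime, timedelta, timezone

# Value copied from the "*** Aborted at ... (unix time)" line in the log.
ABORT_UNIX_TIME = 1601185059

# Assumption: the training machine's clock is set to UTC+8.
# The report does not state the timezone; adjust if yours differs.
ASSUMED_TZ = timezone(timedelta(hours=8))


def decode_abort_time(unix_time, tz=ASSUMED_TZ):
    """Return the abort time as a timezone-aware datetime."""
    return datetime.fromtimestamp(unix_time, tz=tz)


abort_time = decode_abort_time(ABORT_UNIX_TIME)
print(abort_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-09-27 13:37:39
```

Under that assumption the decoded time matches the `W0927 13:37:39` warning lines, so the abort happened right after Step=101 was logged.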