Error when training the PaddleCV/image_classification/models/GoogleNet model
Created by: liyutg
Because I am using a custom dataset, I modified the data-reading code.
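Since the data-reading code was changed, one thing that may be worth ruling out is a label outside the range [0, class_dim). Out-of-range labels can trigger illegal memory accesses in GPU loss or accuracy kernels, although nothing in this report confirms that this is the cause here. The following is a minimal sketch, assuming the list file has one `path label` pair per line. The helper name `check_label_file` and the sample paths are hypothetical and not part of the demo.

```python
def check_label_file(lines, class_dim):
    """Return (line_number, line) pairs whose label is not an int in [0, class_dim)."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the label is the last whitespace-separated field.
        raw_label = line.split()[-1]
        try:
            label = int(raw_label)
        except ValueError:
            bad.append((lineno, line))  # label is not an integer
            continue
        if not 0 <= label < class_dim:
            bad.append((lineno, line))  # label is out of range
    return bad


# Hypothetical sample entries; class_dim matches the configuration in this report.
sample = ["img/a.jpg 0", "img/b.jpg 60", "img/c.jpg 61", "img/d.jpg -1", "img/e.jpg x"]
for lineno, line in check_label_file(sample, class_dim=61):
    print("suspicious entry at line %d: %s" % (lineno, line))
```

On a real list file, the same function could be called as `check_label_file(open(path), class_dim=61)`, where `path` is the training or validation list used by the modified reader.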
System information
Paddle version: 1.3.2. I cloned image_classification as a demo and train it on my own dataset.
GPU model: Tesla K40m, 11441MiB. Please NOTE: device: 1, CUDA Capability: 35, Driver API Version: 10.1, Runtime API Version: 9.0; device: 0, cuDNN Version: 7.0.
Version and environment information: 1) GPU: Tesla K40m, 11441MiB; CUDA and cuDNN versions: device: 0, CUDA Capability: 35, Driver API Version: 10.1, Runtime API Version: 9.0 2) System environment: Ubuntu 14.0, Python 3.6
Training information: 1) single machine, single GPU 2) GPU memory: Tesla K40m, 11441MiB
Configuration
------------- Configuration Arguments -------------
batch_size : 128
checkpoint : None
class_dim : 61
data_dir : ./data/ILSVRC2012
enable_ce : False
fp16 : False
image_shape : 3,224,224
infer_model : ./infer_models
l2_decay : 0.0001
lr : 0.01
lr_strategy : piecewise_decay
model : GoogleNet
model_save_dir : output/
momentum_rate : 0.9
num_epochs : 100
pretrained_model : None
scale_loss : 1.0
test_images : 4540
total_images : 31718
use_gpu : True
visual_num : 1000
with_mem_opt : 1
----------------------------------------------------
W0425 00:56:32.982301 18073 device_context.cc:263] Please NOTE: device: 0, CUDA Capability: 35, Driver API Version: 10.1, Runtime API Version: 9.0
W0425 00:56:32.982357 18073 device_context.cc:271] device: 0, cuDNN Version: 7.0.
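As a rough sanity check on the configuration above, the number of batches per pass and the loss expected from an untrained model can be estimated from `batch_size`, `total_images` and `class_dim`. This is only a sketch: the weighting of the two auxiliary GoogleNet outputs (0.3 each, added to the main cross-entropy) is an assumption about how the demo combines its three losses, not something stated in this report.

```python
import math

# Values taken from the configuration arguments above.
batch_size = 128
total_images = 31718
class_dim = 61

# Batches per pass, without and with the final partial batch.
full_batches = total_images // batch_size
batches_with_partial = math.ceil(total_images / batch_size)

# Cross-entropy of a uniform prediction over class_dim classes.
uniform_ce = math.log(class_dim)

# Assumption: total loss = main + 0.3 * aux1 + 0.3 * aux2.
aux_weight = 0.3
expected_initial_loss = (1 + 2 * aux_weight) * uniform_ce

print("full batches per pass:", full_batches)
print("batches per pass incl. partial:", batches_with_partial)
print("expected initial loss: %.4f" % expected_initial_loss)
```

Under that assumption the estimate is about 6.58, close to the first logged training loss (6.58166), which would suggest the model starts at chance level for 61 classes. The later jump back to a loss above 6 in pass 1 also sits near this value, so it looks more like the model collapsing to chance than a logging problem. It does not by itself explain the illegal memory access.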
Training information
Error message
Pass 0, trainbatch 0, loss 6.58166, acc1 0.00781, acc5 0.05469, lr 0.01000, time 1.61 sec, scalar_train_index 0
Pass 0, trainbatch 24, loss 6.21634, acc1 0.07812, acc5 0.18750, lr 0.01000, time 0.91 sec, scalar_train_index 1
Pass 0, trainbatch 48, loss 5.91886, acc1 0.11719, acc5 0.31250, lr 0.01000, time 0.93 sec, scalar_train_index 2
Pass 0, trainbatch 72, loss 5.29668, acc1 0.14062, acc5 0.45312, lr 0.01000, time 0.88 sec, scalar_train_index 3
Pass 0, trainbatch 96, loss 4.89611, acc1 0.21875, acc5 0.55469, lr 0.01000, time 0.90 sec, scalar_train_index 4
Pass 0, trainbatch 120, loss 4.85930, acc1 0.22656, acc5 0.51562, lr 0.01000, time 0.88 sec, scalar_train_index 5
Pass 0, trainbatch 144, loss 4.89562, acc1 0.24219, acc5 0.53125, lr 0.01000, time 0.88 sec, scalar_train_index 6
Pass 0, trainbatch 168, loss 4.63307, acc1 0.24219, acc5 0.53125, lr 0.01000, time 0.88 sec, scalar_train_index 7
Pass 0, trainbatch 192, loss 4.11545, acc1 0.34375, acc5 0.67188, lr 0.01000, time 0.88 sec, scalar_train_index 8
Pass 0, trainbatch 216, loss 4.23675, acc1 0.32031, acc5 0.60938, lr 0.01000, time 0.88 sec, scalar_train_index 9
Pass 0, trainbatch 240, loss 4.18755, acc1 0.28125, acc5 0.63281, lr 0.01000, time 0.88 sec, scalar_train_index 10
Pass 0,testbatch 0,loss 6.68680, acc1 0.00000,acc5 0.06250,time 0.34 sec ,scalar_test_index 0
Pass 0,testbatch 24,loss 1.63023, acc1 0.31250,acc5 1.00000,time 0.07 sec ,scalar_test_index 1
Pass 0,testbatch 48,loss 4.17449, acc1 0.43750,acc5 0.93750,time 0.08 sec ,scalar_test_index 2
Pass 0,testbatch 72,loss 2.61009, acc1 0.43750,acc5 1.00000,time 0.08 sec ,scalar_test_index 3
Pass 0,testbatch 96,loss 6.07975, acc1 0.18750,acc5 0.25000,time 0.08 sec ,scalar_test_index 4
Pass 0,testbatch 120,loss 1.64130, acc1 0.62500,acc5 0.87500,time 0.08 sec ,scalar_test_index 5
Pass 0,testbatch 144,loss 1.27195, acc1 0.75000,acc5 1.00000,time 0.08 sec ,scalar_test_index 6
Pass 0,testbatch 168,loss 6.13589, acc1 0.00000,acc5 0.25000,time 0.08 sec ,scalar_test_index 7
Pass 0,testbatch 192,loss 2.45174, acc1 0.87500,acc5 0.87500,time 0.07 sec ,scalar_test_index 8
Pass 0,testbatch 216,loss 4.23638, acc1 0.37500,acc5 0.75000,time 0.07 sec ,scalar_test_index 9
Pass 0,testbatch 240,loss 1.77212, acc1 0.68750,acc5 0.93750,time 0.07 sec ,scalar_test_index 10
Pass 0,testbatch 264,loss 3.91363, acc1 0.37500,acc5 0.75000,time 0.07 sec ,scalar_test_index 11
End pass 0, train_loss 4.95904, train_acc1 0.22267, train_acc5 0.51104, test_loss 3.59753, test_acc1 0.40449, test_acc5 0.72014
Pass 1, trainbatch 0, loss 4.20429, acc1 0.29688, acc5 0.73438, lr 0.01000, time 1.61 sec, scalar_train_index 11
Pass 1, trainbatch 24, loss 3.92477, acc1 0.36719, acc5 0.68750, lr 0.01000, time 0.90 sec, scalar_train_index 12
Pass 1, trainbatch 48, loss 4.04725, acc1 0.27344, acc5 0.66406, lr 0.01000, time 0.89 sec, scalar_train_index 13
Pass 1, trainbatch 72, loss 3.83627, acc1 0.30469, acc5 0.71094, lr 0.01000, time 0.92 sec, scalar_train_index 14
Pass 1, trainbatch 96, loss 6.53986, acc1 0.07031, acc5 0.20312, lr 0.01000, time 0.88 sec, scalar_train_index 15
Pass 1, trainbatch 120, loss 6.48425, acc1 0.03906, acc5 0.28125, lr 0.01000, time 0.88 sec, scalar_train_index 16
Pass 1, trainbatch 144, loss 6.45244, acc1 0.05469, acc5 0.28906, lr 0.01000, time 0.89 sec, scalar_train_index 17
Pass 1, trainbatch 168, loss 6.45208, acc1 0.08594, acc5 0.20312, lr 0.01000, time 0.87 sec, scalar_train_index 18
Pass 1, trainbatch 192, loss 6.40190, acc1 0.11719, acc5 0.34375, lr 0.01000, time 0.87 sec, scalar_train_index 19
Pass 1, trainbatch 216, loss 6.41929, acc1 0.07812, acc5 0.19531, lr 0.01000, time 0.87 sec, scalar_train_index 20
Pass 1, trainbatch 240, loss 6.36903, acc1 0.09375, acc5 0.28906, lr 0.01000, time 0.87 sec, scalar_train_index 21
Pass 1,testbatch 0,loss 6.78550, acc1 0.00000,acc5 0.00000,time 0.15 sec ,scalar_test_index 12
Pass 1,testbatch 24,loss 5.59329, acc1 1.00000,acc5 1.00000,time 0.06 sec ,scalar_test_index 13
Pass 1,testbatch 48,loss 6.63706, acc1 0.00000,acc5 0.00000,time 0.07 sec ,scalar_test_index 14
Pass 1,testbatch 72,loss 6.58967, acc1 0.00000,acc5 0.00000,time 0.08 sec ,scalar_test_index 15
Pass 1,testbatch 96,loss 6.71620, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 16
Pass 1,testbatch 120,loss 5.87767, acc1 0.00000,acc5 0.93750,time 0.07 sec ,scalar_test_index 17
Pass 1,testbatch 144,loss 5.82598, acc1 0.00000,acc5 1.00000,time 0.06 sec ,scalar_test_index 18
Pass 1,testbatch 168,loss 6.67258, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 19
Pass 1,testbatch 192,loss 6.20472, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 20
Pass 1,testbatch 216,loss 6.26085, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 21
Pass 1,testbatch 240,loss 6.48922, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 22
Pass 1,testbatch 264,loss 6.29944, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 23
End pass 1, train_loss 6.07456, train_acc1 0.16669, train_acc5 0.41099, test_loss 6.34295, test_acc1 0.07768, test_acc5 0.28389
Pass 2, trainbatch 0, loss 6.30323, acc1 0.11719, acc5 0.32031, lr 0.01000, time 1.54 sec, scalar_train_index 22
Pass 2, trainbatch 24, loss 6.33684, acc1 0.08594, acc5 0.25781, lr 0.01000, time 0.91 sec, scalar_train_index 23
Pass 2, trainbatch 48, loss 6.27985, acc1 0.06250, acc5 0.28125, lr 0.01000, time 0.93 sec, scalar_train_index 24
Pass 2, trainbatch 72, loss 6.29569, acc1 0.07031, acc5 0.28125, lr 0.01000, time 0.87 sec, scalar_train_index 25
Pass 2, trainbatch 96, loss 6.24136, acc1 0.09375, acc5 0.28906, lr 0.01000, time 0.87 sec, scalar_train_index 26
Pass 2, trainbatch 120, loss 6.27440, acc1 0.07812, acc5 0.30469, lr 0.01000, time 0.87 sec, scalar_train_index 27
Pass 2, trainbatch 144, loss 6.16809, acc1 0.11719, acc5 0.33594, lr 0.01000, time 0.87 sec, scalar_train_index 28
Pass 2, trainbatch 168, loss 6.29105, acc1 0.06250, acc5 0.21094, lr 0.01000, time 0.87 sec, scalar_train_index 29
Pass 2, trainbatch 192, loss 6.24925, acc1 0.03906, acc5 0.26562, lr 0.01000, time 0.87 sec, scalar_train_index 30
Pass 2, trainbatch 216, loss 6.20490, acc1 0.11719, acc5 0.25000, lr 0.01000, time 0.87 sec, scalar_train_index 31
Pass 2, trainbatch 240, loss 6.21943, acc1 0.06250, acc5 0.25000, lr 0.01000, time 0.87 sec, scalar_train_index 32
Pass 2,testbatch 0,loss 7.05680, acc1 0.00000,acc5 0.00000,time 0.43 sec ,scalar_test_index 24
Pass 2,testbatch 24,loss 4.81013, acc1 1.00000,acc5 1.00000,time 0.07 sec ,scalar_test_index 25
Pass 2,testbatch 48,loss 6.74531, acc1 0.00000,acc5 0.00000,time 0.07 sec ,scalar_test_index 26
Pass 2,testbatch 72,loss 6.61791, acc1 0.00000,acc5 0.00000,time 0.07 sec ,scalar_test_index 27
Pass 2,testbatch 96,loss 6.89023, acc1 0.00000,acc5 0.00000,time 0.07 sec ,scalar_test_index 28
Pass 2,testbatch 120,loss 5.39686, acc1 0.00000,acc5 0.93750,time 0.07 sec ,scalar_test_index 29
Pass 2,testbatch 144,loss 5.29587, acc1 0.00000,acc5 1.00000,time 0.07 sec ,scalar_test_index 30
Pass 2,testbatch 168,loss 6.78932, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 31
Pass 2,testbatch 192,loss 5.88945, acc1 0.00000,acc5 0.00000,time 0.07 sec ,scalar_test_index 32
Pass 2,testbatch 216,loss 6.10050, acc1 0.00000,acc5 0.00000,time 0.07 sec ,scalar_test_index 33
Pass 2,testbatch 240,loss 6.39842, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 34
Pass 2,testbatch 264,loss 6.07916, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 35
End pass 2, train_loss 6.26088, train_acc1 0.07803, train_acc5 0.28214, test_loss 6.19724, test_acc1 0.07768, test_acc5 0.28389
Pass 3, trainbatch 0, loss 6.24757, acc1 0.03125, acc5 0.25000, lr 0.01000, time 1.56 sec, scalar_train_index 33
Pass 3, trainbatch 24, loss 6.22961, acc1 0.07812, acc5 0.25781, lr 0.01000, time 0.87 sec, scalar_train_index 34
Pass 3, trainbatch 48, loss 6.18353, acc1 0.07812, acc5 0.24219, lr 0.01000, time 0.87 sec, scalar_train_index 35
Pass 3, trainbatch 72, loss 6.25078, acc1 0.05469, acc5 0.21875, lr 0.01000, time 0.87 sec, scalar_train_index 36
Pass 3, trainbatch 96, loss 6.16289, acc1 0.03906, acc5 0.29688, lr 0.01000, time 0.89 sec, scalar_train_index 37
Pass 3, trainbatch 120, loss 6.16478, acc1 0.05469, acc5 0.29688, lr 0.01000, time 0.88 sec, scalar_train_index 38
Pass 3, trainbatch 144, loss 6.20497, acc1 0.04688, acc5 0.24219, lr 0.01000, time 0.87 sec, scalar_train_index 39
Pass 3, trainbatch 168, loss 6.09273, acc1 0.07031, acc5 0.32031, lr 0.01000, time 0.87 sec, scalar_train_index 40
Pass 3, trainbatch 192, loss 6.10254, acc1 0.11719, acc5 0.28125, lr 0.01000, time 0.87 sec, scalar_train_index 41
Pass 3, trainbatch 216, loss 6.04383, acc1 0.10938, acc5 0.30469, lr 0.01000, time 0.87 sec, scalar_train_index 42
Pass 3, trainbatch 240, loss 6.13893, acc1 0.07812, acc5 0.25000, lr 0.01000, time 0.87 sec, scalar_train_index 43
Pass 3,testbatch 0,loss 7.25361, acc1 0.00000,acc5 0.00000,time 0.20 sec ,scalar_test_index 36
Pass 3,testbatch 24,loss 4.50071, acc1 1.00000,acc5 1.00000,time 0.07 sec ,scalar_test_index 37
Pass 3,testbatch 48,loss 6.80605, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 38
Pass 3,testbatch 72,loss 6.62261, acc1 0.00000,acc5 0.00000,time 0.07 sec ,scalar_test_index 39
Pass 3,testbatch 96,loss 7.07348, acc1 0.00000,acc5 0.00000,time 0.08 sec ,scalar_test_index 40
Pass 3,testbatch 120,loss 5.02440, acc1 0.00000,acc5 1.00000,time 0.08 sec ,scalar_test_index 41
Pass 3,testbatch 144,loss 5.04806, acc1 0.00000,acc5 1.00000,time 0.06 sec ,scalar_test_index 42
Pass 3,testbatch 168,loss 6.88153, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 43
Pass 3,testbatch 192,loss 5.70489, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 44
Pass 3,testbatch 216,loss 5.98661, acc1 0.00000,acc5 0.00000,time 0.07 sec ,scalar_test_index 45
Pass 3,testbatch 240,loss 6.33267, acc1 0.00000,acc5 0.00000,time 0.06 sec ,scalar_test_index 46
Pass 3,testbatch 264,loss 5.92419, acc1 0.00000,acc5 0.00000,time 0.07 sec ,scalar_test_index 47
End pass 3, train_loss 6.16399, train_acc1 0.07803, train_acc5 0.28201, test_loss 6.13189, test_acc1 0.07768, test_acc5 0.28389
Pass 4, trainbatch 0, loss 6.19262, acc1 0.06250, acc5 0.27344, lr 0.01000, time 1.48 sec, scalar_train_index 44
Pass 4, trainbatch 24, loss 6.09134, acc1 0.04688, acc5 0.29688, lr 0.01000, time 0.91 sec, scalar_train_index 45
Pass 4, trainbatch 48, loss 6.10587, acc1 0.11719, acc5 0.28125, lr 0.01000, time 0.91 sec, scalar_train_index 46
Pass 4, trainbatch 72, loss 6.08864, acc1 0.06250, acc5 0.32031, lr 0.01000, time 0.87 sec, scalar_train_index 47
Pass 4, trainbatch 96, loss 6.17017, acc1 0.06250, acc5 0.26562, lr 0.01000, time 0.87 sec, scalar_train_index 48
Traceback (most recent call last):
File "train.py", line 536, in <module>
main()
File "train.py", line 532, in main
train(args)
File "train.py", line 401, in train
fetch_list=train_fetch_list)
File "/home/fzuir/.local/lib/python3.6/site-packages/paddle/fluid/parallel_executor.py", line 303, in run
self.executor.run(fetch_list, fetch_var_name)
paddle.fluid.core.EnforceNotMet: an illegal memory access was encountered at [/paddle/paddle/fluid/platform/device_context.cc:328]
PaddlePaddle Call Stacks:
0 0x7f93dc29de45p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 357
1 0x7f93dc29e1c9p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 137
2 0x7f93dddf8aa6p
3 0x7f93dde1b1b4p paddle::platform::TemporaryAllocator::Release(std::function<void ()> const&) + 100
4 0x7f93dddfaac1p paddle::platform::CUDADeviceContext::Wait() const + 113
5 0x7f93ddb7b285p paddle::framework::details::ScopeBufferedSSAGraphExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&) + 1445
6 0x7f93dc3d91f2p paddle::framework::ParallelExecutor::Run(std::vector<std::string, std::allocator<std::string> > const&, std::string const&) + 562
7 0x7f93dc28dfcep
8 0x7f93dc2c91eep
9 0x7f942c334302p _PyCFunction_FastCallDict + 258
10 0x7f942c3b995bp
11 0x7f942c3bcd40p _PyEval_EvalFrameDefault + 11328
12 0x7f942c3b8100p
13 0x7f942c3b9b2ap
14 0x7f942c3bd2ccp _PyEval_EvalFrameDefault + 12748
15 0x7f942c3b8100p
16 0x7f942c3b9b2ap
17 0x7f942c3bcd40p _PyEval_EvalFrameDefault + 11328
18 0x7f942c3b7514p
19 0x7f942c3b9c88p
20 0x7f942c3bcd40p _PyEval_EvalFrameDefault + 11328
21 0x7f942c3b8100p
22 0x7f942c3b8583p PyEval_EvalCodeEx + 99
23 0x7f942c3b85cbp PyEval_EvalCode + 59
24 0x7f942c3eaee0p PyRun_FileExFlags + 304
25 0x7f942c3ec4a3p PyRun_SimpleFileExFlags + 371
26 0x7f942c4078d5p Py_Main + 3621
27 0x400c1dp main + 365
28 0x7f942b396f45p __libc_start_main + 245
29 0x4009e9p
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
what(): an illegal memory access was encountered at [/paddle/paddle/fluid/platform/device_context.cc:328]
PaddlePaddle Call Stacks:
0 0x7f93dc29de45p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 357
1 0x7f93dc29e1c9p paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 137
2 0x7f93dddf8aa6p
3 0x7f93dde1b1b4p paddle::platform::TemporaryAllocator::Release(std::function<void ()> const&) + 100
4 0x7f93dddfaac1p paddle::platform::CUDADeviceContext::Wait() const + 113
5 0x7f93dc3d7102p paddle::framework::ParallelExecutor::~ParallelExecutor() + 98
6 0x7f93dc29933ap
7 0x7f93dc2ccbfap
8 0x7f93dc2ccd7fp
9 0x7f942c329c3dp
10 0x7f942c348bd5p
11 0x7f942c30e632p
12 0x7f942c3f65a7p
13 0x7f942c3f65b7p
14 0x7f942c3f65b7p
15 0x7f942c327e57p
16 0x7f942c328690p PyDict_SetItemString + 64
17 0x7f942c3db723p PyImport_Cleanup + 131
18 0x7f942c3e8523p Py_FinalizeEx + 115
19 0x7f942c40711ap Py_Main + 1642
20 0x400c1dp main + 365
21 0x7f942b396f45p __libc_start_main + 245
22 0x4009e9p
*** Aborted at 1556126047 (unix time) try "date -d @1556126047" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGABRT (@0x3e800004699) received by PID 18073 (TID 0x7f942cb6e740) from PID 18073; stack trace: ***
@ 0x7f942c063330 (unknown)
@ 0x7f942b3abc37 gsignal
@ 0x7f942b3af028 abort
@ 0x7f9429eb6415 __gnu_cxx::__verbose_terminate_handler()
@ 0x7f9429eb4206 (unknown)
@ 0x7f9429eb31c9 (unknown)
@ 0x7f9429eb3b38 __gxx_personality_v0
@ 0x7f9429c42f43 (unknown)
@ 0x7f9429c4376e _Unwind_Resume
@ 0x7f93dddfab93 paddle::platform::CUDADeviceContext::Wait()
@ 0x7f93dc3d7102 paddle::framework::ParallelExecutor::~ParallelExecutor()
@ 0x7f93dc29933a pybind11::class_<>::dealloc()
@ 0x7f93dc2ccbfa pybind11::detail::clear_instance()
@ 0x7f93dc2ccd7f pybind11_object_dealloc
@ 0x7f942c329c3d dict_dealloc
@ 0x7f942c348bd5 subtype_dealloc
@ 0x7f942c30e632 frame_dealloc
@ 0x7f942c3f65a7 tb_dealloc
@ 0x7f942c3f65b7 tb_dealloc
@ 0x7f942c3f65b7 tb_dealloc
@ 0x7f942c327e57 insertdict
@ 0x7f942c328690 PyDict_SetItemString
@ 0x7f942c3db723 PyImport_Cleanup
@ 0x7f942c3e8523 Py_FinalizeEx
@ 0x7f942c40711a Py_Main
@ 0x400c1d main
@ 0x7f942b396f45 __libc_start_main
@ 0x4009e9 (unknown)