image_classification cifar-10 train error
Created by: apollos
Follow the guide and run train.sh in Paddle/demo/image_classification, after around 1 minutes, I get a cudnn error.
My env: OS ubuntu 14.04.4 cuda: 7.5 cudnn: libcudnn.so.5.0.5
yu@baymax:~/workspace/Paddle/demo/image_classification$ ./train.sh I0919 23:32:31.849189 32690 Util.cpp:138] commandline: /opt/paddle/bin/paddle_trainer --config=vgg_16_cifar.py --dot_period=10 --log_period=100 --test_all_data_in_one_period=1 --use_gpu=1 --trainer_count=1 --num_passes=200 --save_dir=./cifar_vgg_model I0919 23:32:32.094377 32690 Util.cpp:113] Calling runInitFunctions I0919 23:32:32.094532 32690 Util.cpp:126] Call runInitFunctions done. [INFO 2016-09-19 23:32:32,122 layers.py:1499] channels=3 size=3072 [INFO 2016-09-19 23:32:32,123 layers.py:1499] output size for conv_0 is 32 [INFO 2016-09-19 23:32:32,123 layers.py:1499] channels=64 size=65536 [INFO 2016-09-19 23:32:32,124 layers.py:1499] output size for conv_1 is 32 [INFO 2016-09-19 23:32:32,125 layers.py:1560] output size for pool_0 is 16_16 [INFO 2016-09-19 23:32:32,125 layers.py:1499] channels=64 size=16384 [INFO 2016-09-19 23:32:32,125 layers.py:1499] output size for conv_2 is 16 [INFO 2016-09-19 23:32:32,126 layers.py:1499] channels=128 size=32768 [INFO 2016-09-19 23:32:32,126 layers.py:1499] output size for conv_3 is 16 [INFO 2016-09-19 23:32:32,127 layers.py:1560] output size for pool_1 is 8_8 [INFO 2016-09-19 23:32:32,127 layers.py:1499] channels=128 size=8192 [INFO 2016-09-19 23:32:32,128 layers.py:1499] output size for conv_4 is 8 [INFO 2016-09-19 23:32:32,128 layers.py:1499] channels=256 size=16384 [INFO 2016-09-19 23:32:32,129 layers.py:1499] output size for conv_5 is 8 [INFO 2016-09-19 23:32:32,129 layers.py:1499] channels=256 size=16384 [INFO 2016-09-19 23:32:32,130 layers.py:1499] output size for conv_6 is 8 [INFO 2016-09-19 23:32:32,130 layers.py:1560] output size for pool_2 is 4_4 [INFO 2016-09-19 23:32:32,131 layers.py:1499] channels=256 size=4096 [INFO 2016-09-19 23:32:32,131 layers.py:1499] output size for conv_7 is 4 [INFO 2016-09-19 23:32:32,132 layers.py:1499] channels=512 size=8192 [INFO 2016-09-19 23:32:32,132 layers.py:1499] output size for conv_8 is 4 [INFO 2016-09-19 23:32:32,133 layers.py:1499] channels=512 size=8192 [INFO 2016-09-19 23:32:32,133 layers.py:1499] output size for conv_9 is 4 [INFO 2016-09-19 23:32:32,134 layers.py:1560] output size for pool_3 is 2_2 [INFO 2016-09-19 23:32:32,134 layers.py:1560] output size for pool_4 is 1_1 [INFO 2016-09-19 23:32:32,136 networks.py:1122] The input order is [image, label] [INFO 2016-09-19 23:32:32,136 networks.py:1129] The output order is [cost_0] I0919 23:32:32.141587 32690 Trainer.cpp:169] trainer mode: Normal I0919 23:32:32.147541 32690 PyDataProvider2.cpp:219] loading dataprovider image_provider::processData [INFO 2016-09-19 23:32:32,168 image_provider.py:52] Image size: 32 [INFO 2016-09-19 23:32:32,168 image_provider.py:53] Meta path: data/cifar-out/batches/batches.meta [INFO 2016-09-19 23:32:32,169 image_provider.py:58] DataProvider Initialization finished I0919 23:32:32.169145 32690 PyDataProvider2.cpp:219] loading dataprovider image_provider::processData [INFO 2016-09-19 23:32:32,169 image_provider.py:52] Image size: 32 [INFO 2016-09-19 23:32:32,169 image_provider.py:53] Meta path: data/cifar-out/batches/batches.meta [INFO 2016-09-19 23:32:32,169 image_provider.py:58] DataProvider Initialization finished I0919 23:32:32.169431 32690 GradientMachine.cpp:134] Initing parameters.. I0919 23:32:32.531777 32690 GradientMachine.cpp:141] Init parameters done. ......... I0919 23:32:53.036734 32690 TrainerInternal.cpp:162] Batch=100 samples=12800 AvgCost=2.38877 CurrentCost=2.38877 Eval: classification_error_evaluator=0.834219 CurrentEval: classification_error_evaluator=0.834219 ......... I0919 23:33:03.698333 32690 TrainerInternal.cpp:162] Batch=200 samples=25600 AvgCost=2.17996 CurrentCost=1.97115 Eval: classification_error_evaluator=0.786719 CurrentEval: classification_error_evaluator=0.739219 ......... I0919 23:33:14.348435 32690 TrainerInternal.cpp:162] Batch=300 samples=38400 AvgCost=2.02001 CurrentCost=1.7001 Eval: classification_error_evaluator=0.7425 CurrentEval: classification_error_evaluator=0.654062 .........I0919 23:33:23.978550 32690 TrainerInternal.cpp:179] Pass=0 Batch=391 samples=50048 AvgCost=1.90913 Eval: classification_error_evaluator=0.70658 F0919 23:33:26.893599 32690 hl_cuda_cudnn.cc:779] Check failed: CUDNN_STATUS_SUCCESS == cudnnStat (0 vs. 5) Cudnn Error: CUDNN_STATUS_INVALID_VALUE *_* Check failure stack trace: *** @ 0x7efcce9e2daa (unknown) @ 0x7efcce9e2ce4 (unknown) @ 0x7efcce9e26e6 (unknown) @ 0x7efcce9e5687 (unknown) @ 0x8affc4 hl_convolution_forward() @ 0x64bb3c paddle::CudnnConvLayer::forward() @ 0x5a2130 paddle::NeuralNetwork::forward() @ 0x6bb43f paddle::Tester::testOneBatch() @ 0x6bbd52 paddle::Tester::testOnePeriod() @ 0x6a052c paddle::Trainer::trainOnePass() @ 0x6a3927 paddle::Trainer::train() @ 0x53bfd3 main @ 0x7efccdbeef45 (unknown) @ 0x547795 (unknown) @ (nil) (unknown) /usr/local/bin/paddle: line 81: 32690 Aborted (core dumped) ${DEBUGGER} /opt/paddle/bin/paddle_trainer ${@:2}