CudnnBatchNormLayer::backward() fails when training the aishell model in nvidia-docker
Created by: BulimiaDH
Thread [140347691067136] Forwarding batch_norm_0,
*** Aborted at 1523453910 (unix time) try "date -d @1523453910" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x0) received by PID 19 (TID 0x7fa53e456700) from PID 0; stack trace: ***
    @     0x7fa53e032390 (unknown)
    @     0x7fa518ea6c82 paddle::CudnnBatchNormLayer::backward()
    @     0x7fa518f2e2bd paddle::NeuralNetwork::backward()
    @     0x7fa519270bb0 GradientMachine::forwardBackward()
    @     0x7fa518d295f4 _wrap_GradientMachine_forwardBackward
    @           0x4cb45e PyEval_EvalFrameEx
    @           0x4c2765 PyEval_EvalCodeEx
    @           0x4ca8d1 PyEval_EvalFrameEx
    @           0x4c2765 PyEval_EvalCodeEx
    @           0x4ca099 PyEval_EvalFrameEx
    @           0x4c2765 PyEval_EvalCodeEx
    @           0x4ca099 PyEval_EvalFrameEx
    @           0x4c2765 PyEval_EvalCodeEx
    @           0x4ca099 PyEval_EvalFrameEx
    @           0x4c2765 PyEval_EvalCodeEx
    @           0x4ca8d1 PyEval_EvalFrameEx
    @           0x4c2765 PyEval_EvalCodeEx
    @           0x4ca8d1 PyEval_EvalFrameEx
    @           0x4c2765 PyEval_EvalCodeEx
    @           0x4c2509 PyEval_EvalCode
    @           0x4f1def (unknown)
    @           0x4ec652 PyRun_FileExFlags
    @           0x4eae31 PyRun_SimpleFileExFlags
    @           0x49e14a Py_Main
    @     0x7fa53dc77830 __libc_start_main
    @           0x49d9d9 _start
    @                0x0 (unknown)
Segmentation fault (core dumped)

Model configuration:
I0411 13:38:14.460194 19 Util.cpp:166] commandline: --use_gpu=True --rnn_use_batch=False --log_clipping=True --trainer_count=1
[INFO 2018-04-11 13:38:17,982 layers.py:2606] output for conv_0: c = 32, h = 81, w = 54, size = 139968
[INFO 2018-04-11 13:38:17,983 layers.py:3133] output for batch_norm_0: c = 32, h = 81, w = 54, size = 139968
[INFO 2018-04-11 13:38:17,985 layers.py:7224] output for scale_sub_region_0: c = 32, h = 81, w = 54, size = 139968
[INFO 2018-04-11 13:38:17,986 layers.py:2606] output for conv_1: c = 32, h = 41, w = 54, size = 70848
[INFO 2018-04-11 13:38:17,987 layers.py:3133] output for batch_norm_1: c = 32, h = 41, w = 54, size = 70848
[INFO 2018-04-11 13:38:17,988 layers.py:7224] output for scale_sub_region_1: c = 32, h = 41, w = 54, size = 70848
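Each "size" in the layer log above is just channels × height × width; a quick Python check (numbers copied from the log) confirms the shapes are consistent, so the crash does not look like a malformed layer size:

# Sanity check of the logged layer output sizes (c * h * w).
print(32 * 81 * 54)  # 139968 -> conv_0 / batch_norm_0 / scale_sub_region_0
print(32 * 41 * 54)  # 70848  -> conv_1 / batch_norm_1 / scale_sub_region_1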
add_arg('batch_size',        int,    8,     "Minibatch size.")
add_arg('trainer_count',     int,    1,     "# of Trainers (CPUs or GPUs).")
add_arg('num_passes',        int,    200,   "# of training epochs.")
add_arg('num_proc_data',     int,    16,    "# of CPUs for data preprocessing.")
add_arg('num_conv_layers',   int,    2,     "# of convolution layers.")
add_arg('num_rnn_layers',    int,    3,     "# of recurrent layers.")
add_arg('rnn_layer_size',    int,    1024,  "# of recurrent cells per layer.")
add_arg('num_iter_print',    int,    100,   "Every # iterations for printing train cost.")
add_arg('learning_rate',     float,  5e-4,  "Learning rate.")
add_arg('max_duration',      float,  60.0,  "Longest audio duration allowed.")
add_arg('min_duration',      float,  0.0,   "Shortest audio duration allowed.")
add_arg('test_off',          bool,   False, "Turn off testing.")
add_arg('use_sortagrad',     bool,   True,  "Use SortaGrad or not.")
add_arg('use_gpu',           bool,   True,  "Use GPU or not.")
add_arg('use_gru',           bool,   True,  "Use GRUs instead of simple RNNs.")
add_arg('is_local',          bool,   True,  "Use pserver or not.")
add_arg('share_rnn_weights', bool,   False, "Share input-hidden weights across bi-directional RNNs. Not for GRU.")
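For context, add_arg in the training script is a thin wrapper around argparse; a minimal self-contained sketch of what the calls above amount to is given below (the helper body is an assumption for illustration, not the repo's exact code):

import argparse
import distutils.util
import functools

parser = argparse.ArgumentParser(description=__doc__)

def add_arguments(argname, type, default, help, argparser):
    # Assumed helper: booleans go through strtobool so flags like
    # --use_gpu=False can be passed on the command line.
    type = distutils.util.strtobool if type == bool else type
    argparser.add_argument("--" + argname, default=default, type=type,
                           help=help + " Default: %(default)s.")

add_arg = functools.partial(add_arguments, argparser=parser)

# The first two flags from the configuration above, as an example:
add_arg('batch_size',    int, 8, "Minibatch size.")
add_arg('trainer_count', int, 1, "# of Trainers (CPUs or GPUs).")

args = parser.parse_args()  # e.g. python train.py --batch_size=8 --use_gpu=True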
Language model: Mandarin LM Small.
Environment: VGA compatible controller: NVIDIA Corporation GK110B [GeForce GTX TITAN Black] (rev a1); Ubuntu 16.04.
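In case a library mismatch is relevant, the CUDA runtime and cuDNN versions actually visible inside the nvidia-docker container can be printed with a small ctypes sketch like the one below (the .so names are assumptions; adjust them to whatever the image ships):

import ctypes

# CUDA runtime version, e.g. 8000 means CUDA 8.0.
cudart = ctypes.CDLL("libcudart.so")
ver = ctypes.c_int()
cudart.cudaRuntimeGetVersion(ctypes.byref(ver))
print("CUDA runtime: %d" % ver.value)

# cuDNN version, e.g. 5110 means cuDNN 5.1.10.
cudnn = ctypes.CDLL("libcudnn.so")
cudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN: %d" % cudnn.cudnnGetVersion())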
The manifest files and the mean/stddev file were generated normally, and the provided vocabulary was used. Then this crash occurred. How should I go about fixing it?