Can the released models be used as checkpoints?
Created by: DavidHuie
I was experimenting with adding new training data to the released models, specifically the BaiduEN8k model. I tried running a script that uses the BaiduEN8k model as a checkpoint (borrowing from examples/librispeech/run_train.sh):
export FLAGS_sync_nccl_allreduce=0
export CUDA_VISIBLE_DEVICES=0
python -u train.py \
--batch_size=20 \
--num_epoch=50 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--save_epoch=1 \
--num_samples=280000 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
--test_off=False \
--use_sortagrad=True \
--use_gru=False \
--use_gpu=True \
--is_local=True \
--share_rnn_weights=True \
--train_manifest='manifest.train-clean-100' \
--dev_manifest='manifest.dev-clean' \
--mean_std_path='mean_std.npz' \
--vocab_path='/code/models/deepspeech2/vocab.txt' \
--init_from_pretrained_model='/code/models/deepspeech2' \
--output_model_dir='/code/models/deepspeech2_new' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
The directory /code/models/deepspeech2 contains the BaiduEN8k model files:
$ ls /code/models/deepspeech2
README.md mean_std.npz params.pdparams vocab.txt
$ du -hs /code/models/deepspeech2/*
4.0K /code/models/deepspeech2/README.md
4.0K /code/models/deepspeech2/mean_std.npz
201M /code/models/deepspeech2/params.pdparams
4.0K /code/models/deepspeech2/vocab.txt
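For reference, this is roughly how I assumed --init_from_pretrained_model consumes that params.pdparams file. The helper below is my own sketch, not code taken from train.py; it assumes the release is a combined parameter file loadable with fluid.io.load_params:

import paddle.fluid as fluid

def load_pretrained(exe, program, pretrained_dir):
    # load_params only restores parameters that are already defined in `program`,
    # so the DeepSpeech2 network has to be built into it beforehand.
    fluid.io.load_params(
        executor=exe,
        dirname=pretrained_dir,
        main_program=program,
        filename='params.pdparams')  # combined parameter file shipped with BaiduEN8k

# Usage, assuming `train_program` already holds the network:
# place = fluid.CUDAPlace(0)
# exe = fluid.Executor(place)
# exe.run(fluid.default_startup_program())
# load_pretrained(exe, train_program, '/code/models/deepspeech2')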
When I run the script, I get this segfault:
W0224 02:04:07.079113 149 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.2, Runtime API Version: 10.0
W0224 02:04:07.100364 149 device_context.cc:244] device: 0, cuDNN Version: 7.6.
W0224 02:04:08.754072 149 init.cc:206] *** Aborted at 1582509848 (unix time) try "date -d @1582509848" if you are using GNU date ***
W0224 02:04:08.756754 149 init.cc:206] PC: @ 0x0 (unknown)
W0224 02:04:08.757686 149 init.cc:206] *** SIGSEGV (@0x50) received by PID 149 (TID 0x7f0571ad1700) from PID 80; stack trace: ***
W0224 02:04:08.760175 149 init.cc:206] @ 0x7f05716aa390 (unknown)
W0224 02:04:08.762334 149 init.cc:206] @ 0x7f05718c275c (unknown)
W0224 02:04:08.764492 149 init.cc:206] @ 0x7f05718cb861 (unknown)
W0224 02:04:08.766629 149 init.cc:206] @ 0x7f05718c6574 (unknown)
W0224 02:04:08.768791 149 init.cc:206] @ 0x7f05718cadb9 (unknown)
W0224 02:04:08.771391 149 init.cc:206] @ 0x7f05714125ad (unknown)
W0224 02:04:08.773540 149 init.cc:206] @ 0x7f05718c6574 (unknown)
W0224 02:04:08.776126 149 init.cc:206] @ 0x7f0571412664 __libc_dlopen_mode
W0224 02:04:08.778730 149 init.cc:206] @ 0x7f05713e4a85 (unknown)
W0224 02:04:08.780908 149 init.cc:206] @ 0x7f05716a7a99 __pthread_once_slow
W0224 02:04:08.782840 149 init.cc:206] @ 0x7f05713e4ba4 backtrace
W0224 02:04:08.790917 149 init.cc:206] @ 0x7f0506158844 paddle::platform::GetTraceBackString<>()
W0224 02:04:08.795003 149 init.cc:206] @ 0x7f0506158cfa paddle::platform::EnforceNotMet::EnforceNotMet()
W0224 02:04:08.801776 149 init.cc:206] @ 0x7f0507554b35 paddle::operators::LoadCombineOpKernel<>::LoadParamsFromBuffer()
W0224 02:04:08.807842 149 init.cc:206] @ 0x7f0507554ece paddle::operators::LoadCombineOpKernel<>::Compute()
W0224 02:04:08.811461 149 init.cc:206] @ 0x7f0507555423 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators19LoadCombineOpKernelINS7_17CUDADeviceContextEfEENSA_ISB_dEENSA_ISB_iEENSA_ISB_aEENSA_ISB_lEEEEclEPKcSJ_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W0224 02:04:08.816128 149 init.cc:206] @ 0x7f0508b4dd6b paddle::framework::OperatorWithKernel::RunImpl()
W0224 02:04:08.822257 149 init.cc:206] @ 0x7f0508b4e361 paddle::framework::OperatorWithKernel::RunImpl()
W0224 02:04:08.825278 149 init.cc:206] @ 0x7f0508b47fec paddle::framework::OperatorBase::Run()
W0224 02:04:08.830551 149 init.cc:206] @ 0x7f0506308c86 paddle::framework::Executor::RunPreparedContext()
W0224 02:04:08.833338 149 init.cc:206] @ 0x7f050630c4cf paddle::framework::Executor::Run()
W0224 02:04:08.834825 149 init.cc:206] @ 0x7f0506145f1d _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL22pybind11_init_core_avxERNS_6moduleEEUlRNS2_9framework8ExecutorERKNS6_11ProgramDescEPNS6_5ScopeEibbRKSt6vectorISsSaISsEEE103_vIS8_SB_SD_ibbSI_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES10_
W0224 02:04:08.836560 149 init.cc:206] @ 0x7f050618f086 pybind11::cpp_function::dispatcher()
W0224 02:04:08.836661 149 init.cc:206] @ 0x4c5cd6 PyEval_EvalFrameEx
W0224 02:04:08.836764 149 init.cc:206] @ 0x4ba506 PyEval_EvalCodeEx
W0224 02:04:08.836939 149 init.cc:206] @ 0x4c2418 PyEval_EvalFrameEx
W0224 02:04:08.837083 149 init.cc:206] @ 0x4ba506 PyEval_EvalCodeEx
W0224 02:04:08.837195 149 init.cc:206] @ 0x4c2418 PyEval_EvalFrameEx
W0224 02:04:08.837280 149 init.cc:206] @ 0x4ba506 PyEval_EvalCodeEx
W0224 02:04:08.837378 149 init.cc:206] @ 0x4c1e32 PyEval_EvalFrameEx
W0224 02:04:08.837463 149 init.cc:206] @ 0x4ba506 PyEval_EvalCodeEx
W0224 02:04:08.837559 149 init.cc:206] @ 0x4c1e32 PyEval_EvalFrameEx
/code/code/deepspeech2/train.sh: line 35: 149 Segmentation fault (core dumped) python -u train.py --batch_size=20 --num_epoch=50 --num_conv_layers=2 --num_rnn_layers=3 --rnn_layer_size=2048 --num_iter_print=100 --save_epoch=1 --num_samples=280000 --learning_rate=5e-4 --max_duration=27.0 --min_duration=0.0 --test_off=False --use_sortagrad=True --use_gru=False --use_gpu=True --is_local=True --share_rnn_weights=True --train_manifest='manifest.train-clean-100' --dev_manifest='manifest.dev-clean' --mean_std_path='mean_std.npz' --vocab_path='/code/models/deepspeech2/vocab.txt' --init_from_pretrained_model='/code/models/deepspeech2_copy' --output_model_dir='/code/models/deepspeech2_new' --augment_conf_path='conf/augmentation.config' --specgram_type='linear' --shuffle_method='batch_shuffle_clipped'
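The trace points at paddle::operators::LoadCombineOpKernel, so the crash appears to happen while the combined parameter file is being read. In case the framework build matters, this is how I check the installed PaddlePaddle version (a plain interpreter check, nothing from the repo):

# Report the PaddlePaddle build in use, independent of the DeepSpeech code.
import paddle
print(paddle.__version__)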
If using BaiduEN8k as a checkpoint is possible, am I doing it the right way? I understand that some of the parameters in the train.py command may need to be changed.