Can the released models be used as checkpoints?
Created by: DavidHuie
I was experimenting with adding new training data to the released models, specifically the BaiduEN8k model. I tried running a script that uses the BaiduEN8k model as a checkpoint (borrowing from examples/librispeech/run_train.sh):
export FLAGS_sync_nccl_allreduce=0
export CUDA_VISIBLE_DEVICES=0
python -u train.py \
--batch_size=20 \
--num_epoch=50 \
--num_conv_layers=2 \
--num_rnn_layers=3 \
--rnn_layer_size=2048 \
--num_iter_print=100 \
--save_epoch=1 \
--num_samples=280000 \
--learning_rate=5e-4 \
--max_duration=27.0 \
--min_duration=0.0 \
--test_off=False \
--use_sortagrad=True \
--use_gru=False \
--use_gpu=True \
--is_local=True \
--share_rnn_weights=True \
--train_manifest='manifest.train-clean-100' \
--dev_manifest='manifest.dev-clean' \
--mean_std_path='mean_std.npz' \
--vocab_path='/code/models/deepspeech2/vocab.txt' \
--init_from_pretrained_model='/code/models/deepspeech2' \
--output_model_dir='/code/models/deepspeech2_new' \
--augment_conf_path='conf/augmentation.config' \
--specgram_type='linear' \
--shuffle_method='batch_shuffle_clipped'
The directory /code/models/deepspeech2 contains the BaiduEN8k model files:
$ ls /code/models/deepspeech2
README.md mean_std.npz params.pdparams vocab.txt
$ du -hs /code/models/deepspeech2/*
4.0K /code/models/deepspeech2/README.md
4.0K /code/models/deepspeech2/mean_std.npz
201M /code/models/deepspeech2/params.pdparams
4.0K /code/models/deepspeech2/vocab.txt
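For reference, this is roughly how I assumed --init_from_pretrained_model consumes that params.pdparams file. The helper below is my own sketch, not code taken from train.py; it assumes the release is a combined parameter file loadable with fluid.io.load_params:

import paddle.fluid as fluid

def load_pretrained(exe, program, pretrained_dir):
    # load_params only restores parameters that are already defined in `program`,
    # so the DeepSpeech2 network has to be built into it beforehand.
    fluid.io.load_params(
        executor=exe,
        dirname=pretrained_dir,
        main_program=program,
        filename='params.pdparams')  # combined parameter file shipped with BaiduEN8k

# Usage, assuming `train_program` already holds the network:
# place = fluid.CUDAPlace(0)
# exe = fluid.Executor(place)
# exe.run(fluid.default_startup_program())
# load_pretrained(exe, train_program, '/code/models/deepspeech2')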
When I run the script, I get this segfault:
W0224 02:04:07.079113 149 device_context.cc:236] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 10.2, Runtime API Version: 10.0
W0224 02:04:07.100364 149 device_context.cc:244] device: 0, cuDNN Version: 7.6.
W0224 02:04:08.754072 149 init.cc:206] *** Aborted at 1582509848 (unix time) try "date -d @1582509848" if you are using GNU date ***
W0224 02:04:08.756754 149 init.cc:206] PC: @ 0x0 (unknown)
W0224 02:04:08.757686 149 init.cc:206] *** SIGSEGV (@0x50) received by PID 149 (TID 0x7f0571ad1700) from PID 80; stack trace: ***
W0224 02:04:08.760175 149 init.cc:206] @ 0x7f05716aa390 (unknown)
W0224 02:04:08.762334 149 init.cc:206] @ 0x7f05718c275c (unknown)
W0224 02:04:08.764492 149 init.cc:206] @ 0x7f05718cb861 (unknown)
W0224 02:04:08.766629 149 init.cc:206] @ 0x7f05718c6574 (unknown)
W0224 02:04:08.768791 149 init.cc:206] @ 0x7f05718cadb9 (unknown)
W0224 02:04:08.771391 149 init.cc:206] @ 0x7f05714125ad (unknown)
W0224 02:04:08.773540 149 init.cc:206] @ 0x7f05718c6574 (unknown)
W0224 02:04:08.776126 149 init.cc:206] @ 0x7f0571412664 __libc_dlopen_mode
W0224 02:04:08.778730 149 init.cc:206] @ 0x7f05713e4a85 (unknown)
W0224 02:04:08.780908 149 init.cc:206] @ 0x7f05716a7a99 __pthread_once_slow
W0224 02:04:08.782840 149 init.cc:206] @ 0x7f05713e4ba4 backtrace
W0224 02:04:08.790917 149 init.cc:206] @ 0x7f0506158844 paddle::platform::GetTraceBackString<>()
W0224 02:04:08.795003 149 init.cc:206] @ 0x7f0506158cfa paddle::platform::EnforceNotMet::EnforceNotMet()
W0224 02:04:08.801776 149 init.cc:206] @ 0x7f0507554b35 paddle::operators::LoadCombineOpKernel<>::LoadParamsFromBuffer()
W0224 02:04:08.807842 149 init.cc:206] @ 0x7f0507554ece paddle::operators::LoadCombineOpKernel<>::Compute()
W0224 02:04:08.811461 149 init.cc:206] @ 0x7f0507555423 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EJNS0_9operators19LoadCombineOpKernelINS7_17CUDADeviceContextEfEENSA_ISB_dEENSA_ISB_iEENSA_ISB_aEENSA_ISB_lEEEEclEPKcSJ_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
W0224 02:04:08.816128 149 init.cc:206] @ 0x7f0508b4dd6b paddle::framework::OperatorWithKernel::RunImpl()
W0224 02:04:08.822257 149 init.cc:206] @ 0x7f0508b4e361 paddle::framework::OperatorWithKernel::RunImpl()
W0224 02:04:08.825278 149 init.cc:206] @ 0x7f0508b47fec paddle::framework::OperatorBase::Run()
W0224 02:04:08.830551 149 init.cc:206] @ 0x7f0506308c86 paddle::framework::Executor::RunPreparedContext()
W0224 02:04:08.833338 149 init.cc:206] @ 0x7f050630c4cf paddle::framework::Executor::Run()
W0224 02:04:08.834825 149 init.cc:206] @ 0x7f0506145f1d _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybindL22pybind11_init_core_avxERNS_6moduleEEUlRNS2_9framework8ExecutorERKNS6_11ProgramDescEPNS6_5ScopeEibbRKSt6vectorISsSaISsEEE103_vIS8_SB_SD_ibbSI_EINS_4nameENS_9is_methodENS_7siblingEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES10_
W0224 02:04:08.836560 149 init.cc:206] @ 0x7f050618f086 pybind11::cpp_function::dispatcher()
W0224 02:04:08.836661 149 init.cc:206] @ 0x4c5cd6 PyEval_EvalFrameEx
W0224 02:04:08.836764 149 init.cc:206] @ 0x4ba506 PyEval_EvalCodeEx
W0224 02:04:08.836939 149 init.cc:206] @ 0x4c2418 PyEval_EvalFrameEx
W0224 02:04:08.837083 149 init.cc:206] @ 0x4ba506 PyEval_EvalCodeEx
W0224 02:04:08.837195 149 init.cc:206] @ 0x4c2418 PyEval_EvalFrameEx
W0224 02:04:08.837280 149 init.cc:206] @ 0x4ba506 PyEval_EvalCodeEx
W0224 02:04:08.837378 149 init.cc:206] @ 0x4c1e32 PyEval_EvalFrameEx
W0224 02:04:08.837463 149 init.cc:206] @ 0x4ba506 PyEval_EvalCodeEx
W0224 02:04:08.837559 149 init.cc:206] @ 0x4c1e32 PyEval_EvalFrameEx
/code/code/deepspeech2/train.sh: line 35: 149 Segmentation fault (core dumped) python -u train.py --batch_size=20 --num_epoch=50 --num_conv_layers=2 --num_rnn_layers=3 --rnn_layer_size=2048 --num_iter_print=100 --save_epoch=1 --num_samples=280000 --learning_rate=5e-4 --max_duration=27.0 --min_duration=0.0 --test_off=False --use_sortagrad=True --use_gru=False --use_gpu=True --is_local=True --share_rnn_weights=True --train_manifest='manifest.train-clean-100' --dev_manifest='manifest.dev-clean' --mean_std_path='mean_std.npz' --vocab_path='/code/models/deepspeech2/vocab.txt' --init_from_pretrained_model='/code/models/deepspeech2_copy' --output_model_dir='/code/models/deepspeech2_new' --augment_conf_path='conf/augmentation.config' --specgram_type='linear' --shuffle_method='batch_shuffle_clipped'
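The trace points at paddle::operators::LoadCombineOpKernel, so the crash appears to happen while the combined parameter file is being read. In case the framework build matters, this is how I check the installed PaddlePaddle version (a plain interpreter check, nothing from the repo):

# Report the PaddlePaddle build in use, independent of the DeepSpeech code.
import paddle
print(paddle.__version__)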
If using BaiduEN8k as a checkpoint is possible, am I doing it the right way? I understand that some of the parameters in the train.py command may need to be changed.