How to correctly run transformer?
Created by: sfraczek
Hi,
I have encountered a number of problems with the fluid/neural_machine_translation/transformer model. Am I doing something wrong? How do I run it correctly?
Steps I have taken
Following the instructions in https://github.com/PaddlePaddle/models/blob/develop/fluid/neural_machine_translation/transformer/README_cn.md, I downloaded WMT'16 EN-DE from https://github.com/google/seq2seq/blob/master/docs/data.md by clicking the download link.
Next I extracted it to the wmt16_en_de directory.
Next I ran paste -d '\t' train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.tok.clean.bpe.32000.en-de
Then I ran sed -i '1i\<s>\n<e>\n<unk>' vocab.bpe.32000 to prepend the special tokens to the vocabulary.
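As a quick sanity check of the preprocessing I used this little snippet of my own (not part of the repo; it assumes the files are in wmt16_en_de/ as above):

# Check that the merged training file is tab-separated source/target pairs
# and that the three special tokens now sit at the top of the vocabulary.
with open("wmt16_en_de/train.tok.clean.bpe.32000.en-de") as f:
    for _, line in zip(range(3), f):
        assert line.count("\t") == 1, "expected exactly one tab per line"
with open("wmt16_en_de/vocab.bpe.32000") as f:
    head = [next(f).strip() for _ in range(3)]
print(head)  # expecting ['<s>', '<e>', '<unk>']

If either check fails, the paste/sed steps above would be the likely culprit.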
In config.py I changed use_gpu = True to use_gpu = False.
In train.py I added import multiprocessing and changed dev_count = fluid.core.get_cuda_device_count() to dev_count = fluid.core.get_cuda_device_count() if TrainTaskConfig.use_gpu else multiprocessing.cpu_count().
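Put together, the relevant part of train.py now looks roughly like this (my paraphrase of the edit; TrainTaskConfig is whatever train.py already imports from config.py):

import multiprocessing

import paddle.fluid as fluid

# Fall back to the CPU core count when use_gpu is False.
dev_count = fluid.core.get_cuda_device_count() \
    if TrainTaskConfig.use_gpu else multiprocessing.cpu_count()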
Training
I launched training with
python -u train.py --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 --special_token '<s>' '<e>' '<unk>' --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de --use_token_batch True --batch_size 3200 --sort_type pool --pool_size 200000
but I got:
E0719 14:26:29.439303 55138 graph.cc:43] softmax_with_cross_entropy_grad input var not in all_var list: softmax_with_cross_entropy_0.tmp_0@GRAD
epoch: 0, consumed 0.000161s
Traceback (most recent call last):
File "train.py", line 428, in <module>
train(args)
File "train.py", line 419, in train
"pass_" + str(pass_id) + ".checkpoint"))
File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 288, in save_persistables
filename=filename)
File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 166, in save_vars
filename=filename)
File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 197, in save_vars
executor.run(save_program)
File "/home/sfraczek/Paddle/build/python/paddle/fluid/executor.py", line 449, in run
self.executor.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: holder_ should not be null
Tensor not initialized yet when Tensor::type() is called. at [/home/sfraczek/Paddle/paddle/fluid/framework/tensor.h:139]
PaddlePaddle Call Stacks:
0 0x7f060e948f1cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
1 0x7f060e94b901p paddle::framework::Tensor::type() const + 209
2 0x7f060f617bf6p paddle::operators::SaveOp::SaveLodTensor(boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&, paddle::framework::Variable*) const + 614
3 0x7f060f618472p paddle::operators::SaveOp::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 210
So I commented out
#fluid.io.save_persistables(
#    exe,
#    os.path.join(TrainTaskConfig.ckpt_dir,
#                 "pass_" + str(pass_id) + ".checkpoint"))
and it worked.
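Instead of deleting the call outright, I also considered just guarding it so training keeps going when the save fails (my own stopgap sketch, assuming exe, pass_id and TrainTaskConfig are in scope as they are in train.py):

try:
    fluid.io.save_persistables(
        exe,
        os.path.join(TrainTaskConfig.ckpt_dir,
                     "pass_" + str(pass_id) + ".checkpoint"))
except Exception as e:
    # Skip the checkpoint rather than abort the whole run on the
    # "holder_ should not be null" error above.
    print("Skipping checkpoint for pass %d: %s" % (pass_id, e))

But that only hides the problem, so I would still like to understand the root cause.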
Inference
Next I tried to run inference.
I found that the file wmt16_en_de/newstest2013.tok.bpe.32000.en-de doesn't exist, but based on the README I guessed that I should run
paste -d '\t' newstest2013.tok.bpe.32000.en newstest2013.tok.bpe.32000.de > newstest2013.tok.bpe.32000.en-de
Is this correct?
I then ran
python -u infer.py --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 --special_token '<s>' '<e>' '<unk>' --test_file_pattern wmt16_en_de/newstest2013.tok.bpe.32000.en-de --batch_size 4 model_path trained_models/pass_20.infer.model beam_size 5
but there was no output from the script; it also ended without any error. I tried other input files, but it doesn't output anything either.
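To rule out a trivial cause, here is a small snippet of my own (not part of infer.py) that checks whether the test file pattern actually matches a non-empty file:

import glob

# Make sure --test_file_pattern resolves to something and the file has lines.
pattern = "wmt16_en_de/newstest2013.tok.bpe.32000.en-de"
matches = glob.glob(pattern)
print("matched files:", matches)
for path in matches:
    with open(path) as f:
        print(path, "has", sum(1 for _ in f), "lines")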
I added profiling to infer.py by adding import paddle.fluid.profiler as profiler and
+ parser.add_argument(
+     "--profile",
+     type=bool,
+     default=False,
+     help="Enables/disables profiling.")
and
+ if args.profile:
+     with profiler.profiler("CPU", sorted_key='total') as cpuprof:
+         infer(args)
+ else:
+     infer(args)
But there is no output from the profiler either.
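One thing I am not sure about is the --profile flag itself: with type=bool, argparse turns any non-empty string into True (even "False") and leaves the flag False when it is omitted, so maybe the profiler block never runs at all. A more explicit variant I could try (my own sketch, not from the repo):

import argparse
import distutils.util

parser = argparse.ArgumentParser()
parser.add_argument(
    "--profile",
    # Parse "True"/"False" explicitly instead of relying on bool().
    type=lambda v: bool(distutils.util.strtobool(v)),
    default=False,
    help="Enables/disables profiling.")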
Please help.