How to correctly run transformer?
Created by: sfraczek
Hi,
I have encountered a number of problems with the fluid/neural_machine_translation/transformer model. Am I doing something wrong? How do I run it correctly?
Steps I have taken
Following the instructions in https://github.com/PaddlePaddle/models/blob/develop/fluid/neural_machine_translation/transformer/README_cn.md, I downloaded WMT'16 EN-DE from https://github.com/google/seq2seq/blob/master/docs/data.md using the download link.
Next I extracted it to wmt16_en_de directory.
Next I did paste -d '\t' train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.tok.clean.bpe.32000.en-de
Then I did sed -i '1i\<s>\n<e>\n<unk>' vocab.bpe.32000
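For reference, here is a rough Python sketch of what the two preprocessing commands above are meant to produce, assuming the intended paste delimiter is a single tab character. The helper names join_parallel and prepend_special_tokens are hypothetical and are not part of the repository; this only illustrates the expected file layout.

```python
def join_parallel(src_path, trg_path, out_path):
    """Write one 'source<TAB>target' line per sentence pair.

    Unlike paste, zip() stops at the shorter file, so both inputs
    are assumed to have the same number of lines.
    """
    with open(src_path, encoding="utf-8") as src, \
         open(trg_path, encoding="utf-8") as trg, \
         open(out_path, "w", encoding="utf-8") as out:
        for s, t in zip(src, trg):
            out.write(s.rstrip("\n") + "\t" + t.rstrip("\n") + "\n")


def prepend_special_tokens(vocab_path, tokens=("<s>", "<e>", "<unk>")):
    """Insert the special tokens as the first lines of the vocabulary file."""
    with open(vocab_path, encoding="utf-8") as f:
        body = f.read()
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(tokens) + "\n" + body)
```

After these steps, every line of the .en-de file should contain exactly one tab, and the first three lines of the vocabulary file should be the special tokens.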
In config.py I changed use_gpu = True to False.
In train.py I added import multiprocessing and changed dev_count = fluid.core.get_cuda_device_count() to dev_count = fluid.core.get_cuda_device_count() if TrainTaskConfig.use_gpu else multiprocessing.cpu_count().
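The device-count change described above can be sketched in isolation as follows. The function pick_dev_count and its get_cuda_device_count argument are hypothetical stand-ins (the latter for fluid.core.get_cuda_device_count), so that the snippet runs without Paddle installed; it is not the code in train.py.

```python
import multiprocessing


def pick_dev_count(use_gpu, get_cuda_device_count):
    """Return the GPU count when use_gpu is set, otherwise the CPU core count.

    get_cuda_device_count is a zero-argument callable and is only
    invoked when use_gpu is True, mirroring the conditional expression.
    """
    return get_cuda_device_count() if use_gpu else multiprocessing.cpu_count()
```

With use_gpu = False this returns the number of CPU cores on the machine.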
Training
I launched training with
python -u train.py \
  --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 \
  --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 \
  --special_token '<s>' '<e>' '<unk>' \
  --train_file_pattern wmt16_en_de/train.tok.clean.bpe.32000.en-de \
  --use_token_batch True \
  --batch_size 3200 \
  --sort_type pool \
  --pool_size 200000
but I got
E0719 14:26:29.439303 55138 graph.cc:43] softmax_with_cross_entropy_grad input var not in all_var list: softmax_with_cross_entropy_0.tmp_0@GRAD
epoch: 0, consumed 0.000161s
Traceback (most recent call last):
  File "train.py", line 428, in <module>
    train(args)
  File "train.py", line 419, in train
    "pass_" + str(pass_id) + ".checkpoint"))
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 288, in save_persistables
    filename=filename)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 166, in save_vars
    filename=filename)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/io.py", line 197, in save_vars
    executor.run(save_program)
  File "/home/sfraczek/Paddle/build/python/paddle/fluid/executor.py", line 449, in run
    self.executor.run(program.desc, scope, 0, True, True)
paddle.fluid.core.EnforceNotMet: holder_ should not be null
Tensor not initialized yet when Tensor::type() is called. at [/home/sfraczek/Paddle/paddle/fluid/framework/tensor.h:139]
PaddlePaddle Call Stacks:
0       0x7f060e948f1cp paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int) + 572
1       0x7f060e94b901p paddle::framework::Tensor::type() const + 209
2       0x7f060f617bf6p paddle::operators::SaveOp::SaveLodTensor(boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&, paddle::framework::Variable*) const + 614
3       0x7f060f618472p paddle::operators::SaveOp::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> const&) const + 210
So I commented out
#fluid.io.save_persistables(
#    exe,
#    os.path.join(TrainTaskConfig.ckpt_dir,
#                 "pass_" + str(pass_id) + ".checkpoint"))
and it worked.
Inference
Next I tried to run inference.
I found that the file wmt16_en_de/newstest2013.tok.bpe.32000.en-de doesn't exist, but based on the README I guessed that I should create it with
paste -d '\t' newstest2013.tok.bpe.32000.en newstest2013.tok.bpe.32000.de > newstest2013.tok.bpe.32000.en-de
Is this correct?
Then I ran
python -u infer.py \
  --src_vocab_fpath wmt16_en_de/vocab.bpe.32000 \
  --trg_vocab_fpath wmt16_en_de/vocab.bpe.32000 \
  --special_token '<s>' '<e>' '<unk>' \
  --test_file_pattern wmt16_en_de/newstest2013.tok.bpe.32000.en-de \
  --batch_size 4 \
  model_path trained_models/pass_20.infer.model \
  beam_size 5
but there was no output from the script. It also ended without an error.
I tried giving it other files, but it doesn't output anything for them either.
I added profiling by adding import paddle.fluid.profiler as profiler and
+    parser.add_argument(
+        "--profile",
+        type=bool,
+        default=False,
+        help="Enables/disables profiling.")
and
+    if args.profile:
+        with profiler.profiler("CPU", sorted_key='total') as cpuprof:
+            infer(args)
+    else:
+        infer(args)
But there is no output from the profiler.
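One thing worth checking in the flag definition above: with argparse, type=bool calls bool() on the raw string, so any non-empty value (including "False") turns profiling on, while leaving the flag out keeps the default of False and the profiler branch is never entered. The sketch below is independent of infer.py and Paddle; build_parser and build_type_bool_parser are hypothetical helpers that only contrast the two flag styles.

```python
import argparse


def build_parser():
    """Parser with a presence-only switch: passing --profile means True."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--profile", action="store_true", help="Enable profiling.")
    return parser


def build_type_bool_parser():
    """Parser using type=bool, as in the snippet above, for comparison."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--profile",
        type=bool,
        default=False,
        help="Enables/disables profiling.")
    return parser
```

With the type=bool variant, the flag has to be passed with some value (for example --profile True) for args.profile to be set at all.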
Please help.
