Using the trained model in Word2Vec does not give any meaningful inference
Created by: daming-lu
After training the word2vec model using PaddlePaddle for about 1 hour, I used it to try to predict the next word of a sentence:
```
a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported
```
`among a group of` is part of this sentence, but the inference gave `2072`, which is `<unk>`. Note that the sentence appears verbatim in the training dataset (it is the 5th sentence).
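This can be checked directly against the cached corpus. Below is a sketch; the cache location is an assumption based on where `paddle.dataset.common` normally downloads the Mikolov tarball:

```python
import os
import tarfile

# Assumed cache location: paddle.dataset downloads the Mikolov
# simple-examples tarball under ~/.cache/paddle/dataset/imikolov.
tgz_path = os.path.expanduser(
    '~/.cache/paddle/dataset/imikolov/simple-examples.tgz')

with tarfile.open(tgz_path) as tf:
    # Look the member up by suffix so a leading './' does not matter.
    member = next(m for m in tf.getmembers()
                  if m.name.endswith('data/ptb.train.txt'))
    f = tf.extractfile(member)
    for i, line in enumerate(f, start=1):
        if i == 5:
            print(line.decode().strip())  # should match the sentence above
            break
```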
```
(Pdb) word_dict['among']
211
(Pdb) word_dict['a']
6
(Pdb) word_dict['group']
96
(Pdb) word_dict['of']
4
(Pdb) first_word = fluid.create_lod_tensor([[211]], [[1]], place)
(Pdb) second_word = fluid.create_lod_tensor([[6]], [[1]], place)
(Pdb) third_word = fluid.create_lod_tensor([[96]], [[1]], place)
(Pdb) fourth_word = fluid.create_lod_tensor([[4]], [[1]], place)
(Pdb) result = inferencer.infer({'firstw': first_word, 'secondw': second_word, 'thirdw': third_word, 'fourthw': fourth_word}, return_numpy=False)
(Pdb) result_list = numpy.array(result[0])
(Pdb) np.argmax(result_list[0])
2072
(Pdb) word_dict['<unk>']
2072
```
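The same steps can be wrapped in a small helper that also maps the predicted id back to a word, which makes the failure easier to see. This is a hypothetical helper; `inferencer`, `word_dict`, and `place` are assumed to exist as in the session above:

```python
import numpy as np
import paddle.fluid as fluid

def predict_next(words, inferencer, word_dict, place):
    """Hypothetical helper: feed a 4-word context to the inferencer
    and return the predicted next word."""
    # Each input is a LoD tensor holding a single word id.
    tensors = [fluid.create_lod_tensor([[word_dict[w]]], [[1]], place)
               for w in words]
    feed = dict(zip(['firstw', 'secondw', 'thirdw', 'fourthw'], tensors))
    result = inferencer.infer(feed, return_numpy=False)
    probs = np.array(result[0])
    # Invert the dictionary once to map the argmax id back to a word.
    id_to_word = {idx: w for w, idx in word_dict.items()}
    return id_to_word[int(np.argmax(probs[0]))]

# predict_next(['among', 'a', 'group', 'of'], inferencer, word_dict, place)
# currently returns '<unk>' instead of a real word.
```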
I suspect the training data `imikolov` is not built correctly. The `ptb.train.txt` file has 42068 sentences, but the built `word_dict` only has 2073 words, including `a`, `the`, `<s>`, `<e>`, and `<unk>`.
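One way to check is to rebuild the dictionary directly. A sketch, assuming the standard `paddle.dataset.imikolov` module; note that its `build_dict()` applies a minimum word-frequency cutoff in the versions I have seen, which by itself would make the vocabulary much smaller than the raw word count:

```python
import paddle.dataset.imikolov as imikolov

# build_dict() drops words below a frequency threshold and reserves
# <unk> for everything it dropped, so a 2073-word vocabulary may be
# the cutoff at work rather than a corrupted corpus.
word_dict = imikolov.build_dict()
print(len(word_dict))           # 2073 in the session above
print(word_dict.get('among'))   # 211
print(word_dict['<unk>'])       # 2072
```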