Using the trained model in Word2Vec does not give any meaningful inference
Created by: daming-lu
After training the word2vec model using PaddlePaddle for about 1 hour, I used it to try to predict the next word of a sentence:
```
a form of asbestos once used to make kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than N years ago researchers reported
```
`among a group of` is part of this sentence, but the inference gave `2072`, which is `<unk>`. Note that the sentence appears verbatim in the training dataset (it is the 5th sentence).
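This can be checked directly against the cached corpus. Below is a sketch; the cache location is an assumption based on where `paddle.dataset.common` normally downloads the Mikolov tarball:

```python
import os
import tarfile

# Assumed cache location: paddle.dataset downloads the Mikolov
# simple-examples tarball under ~/.cache/paddle/dataset/imikolov.
tgz_path = os.path.expanduser(
    '~/.cache/paddle/dataset/imikolov/simple-examples.tgz')

with tarfile.open(tgz_path) as tf:
    # Look the member up by suffix so a leading './' does not matter.
    member = next(m for m in tf.getmembers()
                  if m.name.endswith('data/ptb.train.txt'))
    f = tf.extractfile(member)
    for i, line in enumerate(f, start=1):
        if i == 5:
            print(line.decode().strip())  # should match the sentence above
            break
```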
```
(Pdb) word_dict['among']
211
(Pdb) word_dict['a']
6
(Pdb) word_dict['group']
96
(Pdb) word_dict['of']
4
(Pdb) first_word = fluid.create_lod_tensor([[211]], [[1]], place)
(Pdb) second_word = fluid.create_lod_tensor([[6]], [[1]], place)
(Pdb) third_word = fluid.create_lod_tensor([[96]], [[1]], place)
(Pdb) fourth_word = fluid.create_lod_tensor([[4]], [[1]], place)
(Pdb) result = inferencer.infer({'firstw': first_word, 'secondw': second_word, 'thirdw': third_word, 'fourthw': fourth_word}, return_numpy=False)
(Pdb) result_list = numpy.array(result[0])
(Pdb) np.argmax(result_list[0])
2072
(Pdb) word_dict['<unk>']
2072
```
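The same steps can be wrapped in a small helper that also maps the predicted id back to a word, which makes the failure easier to see. This is a hypothetical helper; `inferencer`, `word_dict`, and `place` are assumed to exist as in the session above:

```python
import numpy as np
import paddle.fluid as fluid

def predict_next(words, inferencer, word_dict, place):
    """Hypothetical helper: feed a 4-word context to the inferencer
    and return the predicted next word."""
    # Each input is a LoD tensor holding a single word id.
    tensors = [fluid.create_lod_tensor([[word_dict[w]]], [[1]], place)
               for w in words]
    feed = dict(zip(['firstw', 'secondw', 'thirdw', 'fourthw'], tensors))
    result = inferencer.infer(feed, return_numpy=False)
    probs = np.array(result[0])
    # Invert the dictionary once to map the argmax id back to a word.
    id_to_word = {idx: w for w, idx in word_dict.items()}
    return id_to_word[int(np.argmax(probs[0]))]

# predict_next(['among', 'a', 'group', 'of'], inferencer, word_dict, place)
# currently returns '<unk>' instead of a real word.
```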
I suspect the training data `imikolov` is not built correctly. The `ptb.train.txt` file has 42068 sentences, but the built `word_dict` only has 2073 words, including `a`, `the`, `<s>`, `<e>`, and `<unk>`.
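One way to check is to rebuild the dictionary directly. A sketch, assuming the standard `paddle.dataset.imikolov` module; note that its `build_dict()` applies a minimum word-frequency cutoff in the versions I have seen, which by itself would make the vocabulary much smaller than the raw word count:

```python
import paddle.dataset.imikolov as imikolov

# build_dict() drops words below a frequency threshold and reserves
# <unk> for everything it dropped, so a 2073-word vocabulary may be
# the cutoff at work rather than a corrupted corpus.
word_dict = imikolov.build_dict()
print(len(word_dict))           # 2073 in the session above
print(word_dict.get('among'))   # 211
print(word_dict['<unk>'])       # 2072
```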