In ["Word Embedding: word2vec"](./word2vec.md) we introduced the word2vec word embedding model. In this notebook we show how to train a word2vec model with Gluon, using the skip-gram objective with negative sampling. Besides MXNet Gluon and NumPy we only use standard Python language features, but note that dedicated toolkits for natural language processing, such as the Gluon-NLP toolkit, exist.
First, we import the packages and modules required for the experiment and extract the dataset.
```{.python .input n=1}
import collections
import itertools
import functools
import random
import numpy as np
import mxnet as mx
from mxnet import nd, gluon
from mxnet.gluon import nn
```
## Data
We then load a corpus for training the word embedding model. As for training the language model in ["Recurrent Neural Networks with Gluon"](../chapter_recurrent-neural-networks/rnn-gluon.md), we use the Penn Treebank (PTB) dataset [1], which includes training, validation, and test sets. We split the dataset directly into sentences and tokens, treating newlines as paragraph delimiters and whitespace as the token delimiter. We print the first five words of the first three sentences of the dataset.
```{.python .input n=2}
import zipfile
# Extract the PTB dataset.
with zipfile.ZipFile('../data/ptb.zip', 'r') as zin:
    zin.extractall('../data/')
with open('../data/ptb/ptb.train.txt', 'r') as f:
    dataset = f.readlines()
    # Split each line (sentence) into whitespace-separated tokens.
    dataset = [sentence.split() for sentence in dataset]
for sentence in dataset[:3]:
    print(sentence[:5] + ['...'])
```
### Build the token index
The following builds a dictionary mapping between tokens and integer indices. We first count all tokens in the dataset and assign integer indices to all tokens that occur more than five times in the corpus. We also construct the reverse mapping from token to integer index, `token_to_idx`, and finally replace all tokens in the dataset with their respective indices.
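A minimal sketch of this step, assuming the variable names `idx_to_token`, `token_to_idx`, and `coded_dataset` (the original implementation may differ):

```{.python .input}
# Count token frequencies across the whole training corpus.
counter = collections.Counter(itertools.chain.from_iterable(dataset))
# Keep only tokens that occur more than five times.
idx_to_token = [token for token, freq in counter.items() if freq > 5]
token_to_idx = {token: idx for idx, token in enumerate(idx_to_token)}
# Replace tokens with their indices, dropping infrequent tokens.
coded_dataset = [[token_to_idx[token] for token in sentence
                  if token in token_to_idx] for sentence in dataset]
```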
To inspect the trained embeddings, we use a helper that retrieves the `k` tokens closest to a given example token in the embedding space. A minimal sketch of such a helper follows; the function signature and the dot-product computation ahead of `nd.topk` are assumptions.

```{.python .input}
def get_knn(token_to_idx, idx_to_token, embedding, k, example_token):
    # Dot products between the example token's vector and all embedding vectors
    # (sketch; the original similarity computation may normalize the vectors).
    word_vec = embedding(nd.array([token_to_idx[example_token]])).reshape((-1, 1))
    dot_prod = nd.dot(embedding.weight.data(), word_vec)
    indices = nd.topk(dot_prod.reshape((len(idx_to_token), )), k=k+1,
                      ret_typ='indices')
    indices = [int(i.asscalar()) for i in indices]
    # Remove unknown and input tokens.
    result = [idx_to_token[i] for i in indices[1:]]
    print(f'Closest tokens to "{example_token}": {", ".join(result)}')
    return result
```
We then define the model and initialize its weights randomly. Here we denote the model containing the weights $\mathbf{v}$ as `embedding` and the model containing the weights $\mathbf{u}$ as `embedding_out`.
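A minimal sketch of these two layers, assuming an embedding dimension of 300 and small uniform random initialization (both choices are assumptions, not values from the original):

```{.python .input}
embedding_size = 300  # assumed embedding dimension

# One embedding matrix for center words (v) and one for context words (u).
embedding = nn.Embedding(input_dim=len(idx_to_token), output_dim=embedding_size)
embedding_out = nn.Embedding(input_dim=len(idx_to_token), output_dim=embedding_size)

embedding.initialize(mx.init.Uniform(0.01))
embedding_out.initialize(mx.init.Uniform(0.01))
```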
[2] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in Neural Information Processing Systems. 2013.