diff --git a/word2vec/README.en.md b/word2vec/README.en.md
index e4017d896a3a61a457f8b02af91bdabb8f07c64b..9ed9f720809f3ad8fdd843bd49211deff2ec6b83 100644
--- a/word2vec/README.en.md
+++ b/word2vec/README.en.md
@@ -149,19 +149,257 @@ The advantages of CBOW is that it smooths over the word embeddings of the contex
 As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combine the classification loss of all those $2n$ words by softmax.
 
-## Data Preparation
+## Dataset
+
+We will use the Penn Treebank (PTB) dataset (Tomas Mikolov's pre-processed version). PTB is a small dataset used in the Recurrent Neural Network Language Modeling Toolkit\[[2](#reference)\]. Its statistics are as follows:
+
+<p align="center">
+<table>
+    <tr>
+        <td>training set</td>
+        <td>validation set</td>
+        <td>test set</td>
+    </tr>
+    <tr>
+        <td>ptb.train.txt</td>
+        <td>ptb.valid.txt</td>
+        <td>ptb.test.txt</td>
+    </tr>
+    <tr>
+        <td>42068 lines</td>
+        <td>3370 lines</td>
+        <td>3761 lines</td>
+    </tr>
+</table>
+</p>
+
+### Python Dataset Module
+
+We have encapsulated the PTB dataset in our Python module `paddle.dataset.imikolov`. This module can
+
+1. download the dataset to `~/.cache/paddle/dataset/imikolov`, if it is not already there, and
+2. [preprocess](#preprocessing) the dataset.
+
+### Preprocessing
+
+We will train a 5-gram model: given the first four words in a five-word window, we predict the fifth.
+
+The beginning and end of a sentence carry special meaning, so we add a begin token `<s>` at the front of each sentence and an end token `<e>` at its end. Data instances are then generated by sliding the five-word window over the sentence.
+
+For example, the sentence "I have a dream that one day" generates five data instances:
+
+```text
+<s> I have a dream
+I have a dream that
+have a dream that one
+a dream that one day
+dream that one day <e>
+```
+
+Finally, each data instance is converted into a sequence of integers according to its words' indices in the dictionary (a short sketch that prints a few such instances follows Figure 5 below).
+
+## Training
+
+The neural network that we will be using is illustrated in the graph below:
 
-## Model Configuration
 
 <p align="center">
 Figure 5. N-gram neural network model in model configuration
 </p>
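+
+As a quick check of the dataset module and the preprocessing described above, the following sketch builds the dictionary and prints the first few 5-gram training instances as tuples of integer word indices (the snippet is illustrative and not part of `word2vec/train.py`):
+
+```python
+import paddle.v2 as paddle
+
+word_dict = paddle.dataset.imikolov.build_dict()
+# Each instance is a tuple of 5 integer word indices:
+# four context words followed by the word to predict.
+reader = paddle.dataset.imikolov.train(word_dict, 5)
+for i, instance in enumerate(reader()):
+    print instance
+    if i >= 2:
+        break
+```
+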
+`word2vec/train.py` demonstrates training word2vec using PaddlePaddle:
+
+- Import packages.
+
+```python
+import math
+import paddle.v2 as paddle
+```
+
+- Configure parameters.
+
+```python
+embsize = 32  # word vector dimension
+hiddensize = 256  # hidden layer dimension
+N = 5  # train 5-gram
+```
+
+- Map the $n-1$ words $w_{t-n+1},\ldots,w_{t-1}$ before $w_t$ to a D-dimensional vector through a matrix of dimension $|V|\times D$ (D=32 in this example).
+
+```python
+def wordemb(inlayer):
+    wordemb = paddle.layer.table_projection(
+        input=inlayer,
+        size=embsize,
+        param_attr=paddle.attr.Param(
+            name="_proj",
+            initial_std=0.001,
+            learning_rate=1,
+            l2_rate=0, ))
+    return wordemb
+```
+
+- Define the name and type of each input data layer.
+
+```python
+paddle.init(use_gpu=False, trainer_count=3)
+word_dict = paddle.dataset.imikolov.build_dict()
+dict_size = len(word_dict)
+# Every layer takes integer value of range [0, dict_size)
+firstword = paddle.layer.data(
+    name="firstw", type=paddle.data_type.integer_value(dict_size))
+secondword = paddle.layer.data(
+    name="secondw", type=paddle.data_type.integer_value(dict_size))
+thirdword = paddle.layer.data(
+    name="thirdw", type=paddle.data_type.integer_value(dict_size))
+fourthword = paddle.layer.data(
+    name="fourthw", type=paddle.data_type.integer_value(dict_size))
+nextword = paddle.layer.data(
+    name="fifthw", type=paddle.data_type.integer_value(dict_size))
+
+Efirst = wordemb(firstword)
+Esecond = wordemb(secondword)
+Ethird = wordemb(thirdword)
+Efourth = wordemb(fourthword)
+```
+
+- Concatenate the n-1 word embedding vectors into a single feature vector.
+
+```python
+contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
+```
+
+- The feature vector goes through a fully connected layer, which outputs a hidden feature vector.
+
+```python
+hidden1 = paddle.layer.fc(input=contextemb,
+                          size=hiddensize,
+                          act=paddle.activation.Sigmoid(),
+                          layer_attr=paddle.attr.Extra(drop_rate=0.5),
+                          bias_attr=paddle.attr.Param(learning_rate=2),
+                          param_attr=paddle.attr.Param(
+                              initial_std=1. / math.sqrt(embsize * 8),
+                              learning_rate=1))
+```
+
+- The hidden feature vector goes through another fully connected layer and turns into a $|V|$-dimensional vector. At the same time, softmax is applied to get the probability of each word being the next word.
+
+```python
+predictword = paddle.layer.fc(input=hidden1,
+                              size=dict_size,
+                              bias_attr=paddle.attr.Param(learning_rate=2),
+                              act=paddle.activation.Softmax())
+```
+
+- We use the cross-entropy cost function.
+
+```python
+cost = paddle.layer.classification_cost(input=predictword, label=nextword)
+```
+
+- Create the parameters, the optimizer, and the trainer.
+
+```python
+parameters = paddle.parameters.create(cost)
+adam_optimizer = paddle.optimizer.Adam(
+    learning_rate=3e-3,
+    regularization=paddle.optimizer.L2Regularization(8e-4))
+trainer = paddle.trainer.SGD(cost, parameters, adam_optimizer)
+```
+
+Next, we begin the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` provide our training set and test set. Both functions return a **reader**: in PaddlePaddle, a reader is a Python function that returns a Python iterator, which yields a single data instance at a time.
+
+`paddle.batch` takes a reader as input and outputs a **batched reader**: a reader yields one data instance at a time, while a batched reader yields a minibatch of data instances.
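+
+Before starting the actual training loop, the reader abstraction can be illustrated with a minimal, hypothetical sketch (the `toy_reader` name and its five-integer instances are made up for illustration and are not part of `train.py`):
+
+```python
+import paddle.v2 as paddle
+
+def toy_reader():
+    # A reader is a function that returns an iterator over single instances.
+    # Here each instance is a tuple of five integers, mimicking the 5-gram
+    # instances produced by paddle.dataset.imikolov.
+    def reader():
+        for i in range(8):
+            yield (i, i + 1, i + 2, i + 3, i + 4)
+    return reader
+
+# paddle.batch wraps a reader into a batched reader that yields minibatches.
+batched_reader = paddle.batch(toy_reader(), batch_size=4)
+for minibatch in batched_reader():
+    print minibatch  # each minibatch is a list of 4 instances
+```
+
+With the reader abstraction in mind, we define an event handler and start training: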
+
+```python
+import gzip
+
+def event_handler(event):
+    if isinstance(event, paddle.event.EndIteration):
+        if event.batch_id % 100 == 0:
+            print "Pass %d, Batch %d, Cost %f, %s" % (
+                event.pass_id, event.batch_id, event.cost, event.metrics)
+
+    if isinstance(event, paddle.event.EndPass):
+        result = trainer.test(
+            paddle.batch(
+                paddle.dataset.imikolov.test(word_dict, N), 32))
+        print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
+        with gzip.open("model_%d.tar.gz" % event.pass_id, 'w') as f:
+            parameters.to_tar(f)
+
+trainer.train(
+    paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
+    num_passes=100,
+    event_handler=event_handler)
+```
+
+`trainer.train` starts the training; the output of `event_handler` will be similar to the following:
+
+```text
+Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
+Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
+Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
+...
+```
+
+After 30 passes, we get an average error rate of around 0.735611.
 
-## Model Training
 
 ## Model Application
 
+After the model is trained, we can load the saved model parameters and use them in other models; we can also use the parameters directly in applications.
+
+### Viewing Word Vector
+
+Parameters trained by PaddlePaddle can be retrieved with `parameters.get()`. For example, we can check the word vector for the word `apple`.
+
+```python
+embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
+
+print embeddings[word_dict['apple']]
+```
+
+```text
+[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
+-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
+0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
+-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
+0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
+0.19072419 -0.24286366]
+```
+
+### Modifying Word Vector
+
+The word vectors (`embeddings`) we get are stored in a numpy array. We can modify this array and set it back into `parameters`.
+
+```python
+def modify_embedding(emb):
+    # Add your modification here.
+    pass
+
+modify_embedding(embeddings)
+parameters.set("_proj", embeddings)
+```
+
+### Calculating Cosine Similarity
+
+Cosine similarity is one way of quantifying the similarity between two vectors. The result lies in $[-1, 1]$; the larger the value, the more similar the two vectors are. Note that `scipy.spatial.distance.cosine` used below returns the cosine *distance*, which equals one minus the cosine similarity:
+
+```python
+from scipy import spatial
+
+emb_1 = embeddings[word_dict['world']]
+emb_2 = embeddings[word_dict['would']]
+
+print spatial.distance.cosine(emb_1, emb_2)
+```
+
+```text
+0.99375076448
+```
+
 ## Conclusion
 
 This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding.
diff --git a/word2vec/README.md b/word2vec/README.md
index 414c7839d8b81834a75ae1d51db840804edaa77c..a6e5732ca5e7844cbb45c0a568a9bbac67ff05b6 100644
--- a/word2vec/README.md
+++ b/word2vec/README.md
@@ -144,7 +144,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去
 ### 数据介绍
 
-本教程使用Penn Tree Bank (PTB)数据集。PTB数据集较小,训练速度快,应用于Mikolov的公开语言模型训练工具\[[2](#参考文献)\]中。其统计情况如下:
+本教程使用Penn Treebank (PTB)(经Tomas Mikolov预处理过的版本)数据集。PTB数据集较小,训练速度快,应用于Mikolov的公开语言模型训练工具\[[2](#参考文献)\]中。其统计情况如下:

@@ -183,6 +183,7 @@ a dream that one day
 dream that one day <e>
 ```
 
+最后,每个输入会按其单词在字典里的位置,转化成整数的索引序列,作为PaddlePaddle的输入。
 ## 编程实现
 
 本配置的模型结构如下图所示:
@@ -245,7 +246,6 @@ Efirst = wordemb(firstword)
 Esecond = wordemb(secondword)
 Ethird = wordemb(thirdword)
 Efourth = wordemb(fourthword)
-
 ```
 
 - 将这n-1个词向量经过concat_layer连接成一个大向量作为历史文本特征。
@@ -323,11 +323,12 @@ trainer.train(
     event_handler=event_handler)
 ```
 
-    ...
-    Pass 0, Batch 25000, Cost 4.251861, {'classification_error_evaluator': 0.84375}
-    Pass 0, Batch 25100, Cost 4.847692, {'classification_error_evaluator': 0.8125}
-    Pass 0, Testing metrics {'classification_error_evaluator': 0.7417652606964111}
-
+```text
+Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
+Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
+Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
+...
+```
 
 训练过程是完全自动的,event_handler里打印的日志类似如上所示:
@@ -340,22 +341,23 @@
 ### 查看词向量
 
-PaddlePaddle训练出来的参数可以直接使用`parameters.get()`获取出来。例如查看单词的word的词向量,即为
+PaddlePaddle训练出来的参数可以直接使用`parameters.get()`获取出来。例如查看单词`apple`的词向量,即为
 
 ```python
 embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
 
-print embeddings[word_dict['word']]
+print embeddings[word_dict['apple']]
 ```
 
-    [-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
-    -0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
-    0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
-    -0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
-    0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
-    0.19072419 -0.24286366]
-
+```text
+[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
+-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
+0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
+-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
+0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
+0.19072419 -0.24286366]
+```
 
 ### 修改词向量
@@ -387,8 +389,9 @@ emb_2 = embeddings[word_dict['would']]
 
 print spatial.distance.cosine(emb_1, emb_2)
 ```
 
-    0.99375076448
-
+```text
+0.99375076448
+```
 
 ## 总结
 
 本章中,我们介绍了词向量、语言模型和词向量的关系、以及如何通过训练神经网络模型获得词向量。在信息检索中,我们可以根据向量间的余弦夹角,来判断query和文档关键词这二者间的相关性。在句法分析和语义分析中,训练好的词向量可以用来初始化模型,以得到更好的效果。在文档分类中,有了词向量之后,可以用聚类的方法将文档中同义词进行分组。希望大家在本章后能够自行运用词向量进行相关领域的研究。
diff --git a/word2vec/index.en.html b/word2vec/index.en.html
index 1738b1588ad5363e3f9a83168ab35948fe2664c3..66e4bcc9a4eb1047524729a454c57e5268ceeb86 100644
--- a/word2vec/index.en.html
+++ b/word2vec/index.en.html
@@ -191,19 +191,257 @@ The advantages of CBOW is that it smooths over the word embeddings of the contex
 As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combine the classification loss of all those $2n$ words by softmax.
 
-## Data Preparation
+## Dataset
+
+We will use the Penn Treebank (PTB) dataset (Tomas Mikolov's pre-processed version). PTB is a small dataset used in the Recurrent Neural Network Language Modeling Toolkit\[[2](#reference)\]. Its statistics are as follows:
+
+<p align="center">
+<table>
+    <tr>
+        <td>training set</td>
+        <td>validation set</td>
+        <td>test set</td>
+    </tr>
+    <tr>
+        <td>ptb.train.txt</td>
+        <td>ptb.valid.txt</td>
+        <td>ptb.test.txt</td>
+    </tr>
+    <tr>
+        <td>42068 lines</td>
+        <td>3370 lines</td>
+        <td>3761 lines</td>
+    </tr>
+</table>
+</p>
+
+### Python Dataset Module
+
+We have encapsulated the PTB dataset in our Python module `paddle.dataset.imikolov`. This module can
+
+1. download the dataset to `~/.cache/paddle/dataset/imikolov`, if it is not already there, and
+2. [preprocess](#preprocessing) the dataset.
+
+### Preprocessing
+
+We will train a 5-gram model: given the first four words in a five-word window, we predict the fifth.
+
+The beginning and end of a sentence carry special meaning, so we add a begin token `<s>` at the front of each sentence and an end token `<e>` at its end. Data instances are then generated by sliding the five-word window over the sentence.
+
+For example, the sentence "I have a dream that one day" generates five data instances:
+
+```text
+<s> I have a dream
+I have a dream that
+have a dream that one
+a dream that one day
+dream that one day <e>
+```
+
+Finally, each data instance is converted into a sequence of integers according to its words' indices in the dictionary (a short sketch that prints a few such instances follows Figure 5 below).
+
+## Training
+
+The neural network that we will be using is illustrated in the graph below:
 
-## Model Configuration
 
 <p align="center">
 Figure 5. N-gram neural network model in model configuration
 </p>
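+
+As a quick check of the dataset module and the preprocessing described above, the following sketch builds the dictionary and prints the first few 5-gram training instances as tuples of integer word indices (the snippet is illustrative and not part of `word2vec/train.py`):
+
+```python
+import paddle.v2 as paddle
+
+word_dict = paddle.dataset.imikolov.build_dict()
+# Each instance is a tuple of 5 integer word indices:
+# four context words followed by the word to predict.
+reader = paddle.dataset.imikolov.train(word_dict, 5)
+for i, instance in enumerate(reader()):
+    print instance
+    if i >= 2:
+        break
+```
+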
+`word2vec/train.py` demonstrates training word2vec using PaddlePaddle:
+
+- Import packages.
+
+```python
+import math
+import paddle.v2 as paddle
+```
+
+- Configure parameters.
+
+```python
+embsize = 32  # word vector dimension
+hiddensize = 256  # hidden layer dimension
+N = 5  # train 5-gram
+```
+
+- Map the $n-1$ words $w_{t-n+1},\ldots,w_{t-1}$ before $w_t$ to a D-dimensional vector through a matrix of dimension $|V|\times D$ (D=32 in this example).
+
+```python
+def wordemb(inlayer):
+    wordemb = paddle.layer.table_projection(
+        input=inlayer,
+        size=embsize,
+        param_attr=paddle.attr.Param(
+            name="_proj",
+            initial_std=0.001,
+            learning_rate=1,
+            l2_rate=0, ))
+    return wordemb
+```
+
+- Define the name and type of each input data layer.
+
+```python
+paddle.init(use_gpu=False, trainer_count=3)
+word_dict = paddle.dataset.imikolov.build_dict()
+dict_size = len(word_dict)
+# Every layer takes integer value of range [0, dict_size)
+firstword = paddle.layer.data(
+    name="firstw", type=paddle.data_type.integer_value(dict_size))
+secondword = paddle.layer.data(
+    name="secondw", type=paddle.data_type.integer_value(dict_size))
+thirdword = paddle.layer.data(
+    name="thirdw", type=paddle.data_type.integer_value(dict_size))
+fourthword = paddle.layer.data(
+    name="fourthw", type=paddle.data_type.integer_value(dict_size))
+nextword = paddle.layer.data(
+    name="fifthw", type=paddle.data_type.integer_value(dict_size))
+
+Efirst = wordemb(firstword)
+Esecond = wordemb(secondword)
+Ethird = wordemb(thirdword)
+Efourth = wordemb(fourthword)
+```
+
+- Concatenate the n-1 word embedding vectors into a single feature vector.
+
+```python
+contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
+```
+
+- The feature vector goes through a fully connected layer, which outputs a hidden feature vector.
+
+```python
+hidden1 = paddle.layer.fc(input=contextemb,
+                          size=hiddensize,
+                          act=paddle.activation.Sigmoid(),
+                          layer_attr=paddle.attr.Extra(drop_rate=0.5),
+                          bias_attr=paddle.attr.Param(learning_rate=2),
+                          param_attr=paddle.attr.Param(
+                              initial_std=1. / math.sqrt(embsize * 8),
+                              learning_rate=1))
+```
+
+- The hidden feature vector goes through another fully connected layer and turns into a $|V|$-dimensional vector. At the same time, softmax is applied to get the probability of each word being the next word.
+
+```python
+predictword = paddle.layer.fc(input=hidden1,
+                              size=dict_size,
+                              bias_attr=paddle.attr.Param(learning_rate=2),
+                              act=paddle.activation.Softmax())
+```
+
+- We use the cross-entropy cost function.
+
+```python
+cost = paddle.layer.classification_cost(input=predictword, label=nextword)
+```
+
+- Create the parameters, the optimizer, and the trainer.
+
+```python
+parameters = paddle.parameters.create(cost)
+adam_optimizer = paddle.optimizer.Adam(
+    learning_rate=3e-3,
+    regularization=paddle.optimizer.L2Regularization(8e-4))
+trainer = paddle.trainer.SGD(cost, parameters, adam_optimizer)
+```
+
+Next, we begin the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` provide our training set and test set. Both functions return a **reader**: in PaddlePaddle, a reader is a Python function that returns a Python iterator, which yields a single data instance at a time.
+
+`paddle.batch` takes a reader as input and outputs a **batched reader**: a reader yields one data instance at a time, while a batched reader yields a minibatch of data instances.
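+
+Before starting the actual training loop, the reader abstraction can be illustrated with a minimal, hypothetical sketch (the `toy_reader` name and its five-integer instances are made up for illustration and are not part of `train.py`):
+
+```python
+import paddle.v2 as paddle
+
+def toy_reader():
+    # A reader is a function that returns an iterator over single instances.
+    # Here each instance is a tuple of five integers, mimicking the 5-gram
+    # instances produced by paddle.dataset.imikolov.
+    def reader():
+        for i in range(8):
+            yield (i, i + 1, i + 2, i + 3, i + 4)
+    return reader
+
+# paddle.batch wraps a reader into a batched reader that yields minibatches.
+batched_reader = paddle.batch(toy_reader(), batch_size=4)
+for minibatch in batched_reader():
+    print minibatch  # each minibatch is a list of 4 instances
+```
+
+With the reader abstraction in mind, we define an event handler and start training: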
+
+```python
+import gzip
+
+def event_handler(event):
+    if isinstance(event, paddle.event.EndIteration):
+        if event.batch_id % 100 == 0:
+            print "Pass %d, Batch %d, Cost %f, %s" % (
+                event.pass_id, event.batch_id, event.cost, event.metrics)
+
+    if isinstance(event, paddle.event.EndPass):
+        result = trainer.test(
+            paddle.batch(
+                paddle.dataset.imikolov.test(word_dict, N), 32))
+        print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
+        with gzip.open("model_%d.tar.gz" % event.pass_id, 'w') as f:
+            parameters.to_tar(f)
+
+trainer.train(
+    paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
+    num_passes=100,
+    event_handler=event_handler)
+```
+
+`trainer.train` starts the training; the output of `event_handler` will be similar to the following:
+
+```text
+Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
+Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
+Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
+...
+```
+
+After 30 passes, we get an average error rate of around 0.735611.
 
-## Model Training
 
 ## Model Application
 
+After the model is trained, we can load the saved model parameters and use them in other models; we can also use the parameters directly in applications.
+
+### Viewing Word Vector
+
+Parameters trained by PaddlePaddle can be retrieved with `parameters.get()`. For example, we can check the word vector for the word `apple`.
+
+```python
+embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
+
+print embeddings[word_dict['apple']]
+```
+
+```text
+[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
+-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
+0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
+-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
+0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
+0.19072419 -0.24286366]
+```
+
+### Modifying Word Vector
+
+The word vectors (`embeddings`) we get are stored in a numpy array. We can modify this array and set it back into `parameters`.
+
+```python
+def modify_embedding(emb):
+    # Add your modification here.
+    pass
+
+modify_embedding(embeddings)
+parameters.set("_proj", embeddings)
+```
+
+### Calculating Cosine Similarity
+
+Cosine similarity is one way of quantifying the similarity between two vectors. The result lies in $[-1, 1]$; the larger the value, the more similar the two vectors are. Note that `scipy.spatial.distance.cosine` used below returns the cosine *distance*, which equals one minus the cosine similarity:
+
+```python
+from scipy import spatial
+
+emb_1 = embeddings[word_dict['world']]
+emb_2 = embeddings[word_dict['would']]
+
+print spatial.distance.cosine(emb_1, emb_2)
+```
+
+```text
+0.99375076448
+```
+
 ## Conclusion
 
 This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding.
diff --git a/word2vec/index.html b/word2vec/index.html
index e7b232121b0b1d2c9f47720602c05e171597497a..ad5cd1942de85a946beca2bfd7213e8924316a39 100644
--- a/word2vec/index.html
+++ b/word2vec/index.html
@@ -186,7 +186,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去
 ### 数据介绍
 
-本教程使用Penn Tree Bank (PTB)数据集。PTB数据集较小,训练速度快,应用于Mikolov的公开语言模型训练工具\[[2](#参考文献)\]中。其统计情况如下:
+本教程使用Penn Treebank (PTB)(经Tomas Mikolov预处理过的版本)数据集。PTB数据集较小,训练速度快,应用于Mikolov的公开语言模型训练工具\[[2](#参考文献)\]中。其统计情况如下:

@@ -225,6 +225,7 @@ a dream that one day
 dream that one day <e>
 ```
 
+最后,每个输入会按其单词在字典里的位置,转化成整数的索引序列,作为PaddlePaddle的输入。
 ## 编程实现
 
 本配置的模型结构如下图所示:
@@ -287,7 +288,6 @@ Efirst = wordemb(firstword)
 Esecond = wordemb(secondword)
 Ethird = wordemb(thirdword)
 Efourth = wordemb(fourthword)
-
 ```
 
 - 将这n-1个词向量经过concat_layer连接成一个大向量作为历史文本特征。
@@ -365,11 +365,12 @@ trainer.train(
     event_handler=event_handler)
 ```
 
-    ...
-    Pass 0, Batch 25000, Cost 4.251861, {'classification_error_evaluator': 0.84375}
-    Pass 0, Batch 25100, Cost 4.847692, {'classification_error_evaluator': 0.8125}
-    Pass 0, Testing metrics {'classification_error_evaluator': 0.7417652606964111}
-
+```text
+Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
+Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
+Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
+...
+```
 
 训练过程是完全自动的,event_handler里打印的日志类似如上所示:
@@ -382,22 +383,23 @@
 ### 查看词向量
 
-PaddlePaddle训练出来的参数可以直接使用`parameters.get()`获取出来。例如查看单词的word的词向量,即为
+PaddlePaddle训练出来的参数可以直接使用`parameters.get()`获取出来。例如查看单词`apple`的词向量,即为
 
 ```python
 embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
 
-print embeddings[word_dict['word']]
+print embeddings[word_dict['apple']]
 ```
 
-    [-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
-    -0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
-    0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
-    -0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
-    0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
-    0.19072419 -0.24286366]
-
+```text
+[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
+-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
+0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
+-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
+0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
+0.19072419 -0.24286366]
+```
 
 ### 修改词向量
@@ -429,8 +431,9 @@ emb_2 = embeddings[word_dict['would']]
 
 print spatial.distance.cosine(emb_1, emb_2)
 ```
 
-    0.99375076448
-
+```text
+0.99375076448
+```
 
 ## 总结
 
 本章中,我们介绍了词向量、语言模型和词向量的关系、以及如何通过训练神经网络模型获得词向量。在信息检索中,我们可以根据向量间的余弦夹角,来判断query和文档关键词这二者间的相关性。在句法分析和语义分析中,训练好的词向量可以用来初始化模型,以得到更好的效果。在文档分类中,有了词向量之后,可以用聚类的方法将文档中同义词进行分组。希望大家在本章后能够自行运用词向量进行相关领域的研究。