diff --git a/word2vec/README.en.md b/word2vec/README.en.md
index e4017d896a3a61a457f8b02af91bdabb8f07c64b..9ed9f720809f3ad8fdd843bd49211deff2ec6b83 100644
--- a/word2vec/README.en.md
+++ b/word2vec/README.en.md
@@ -149,19 +149,257 @@ The advantages of CBOW is that it smooths over the word embeddings of the contex
As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combine the classification loss of all those $2n$ words by softmax.
-## Data Preparation
+## Dataset
+
+We will use the Penn Treebank (PTB) dataset (Tomas Mikolov's pre-processed version). PTB is a small dataset that was used in the Recurrent Neural Network Language Modeling Toolkit\[[2](#reference)\]. Its statistics are as follows:
+
+| training set | validation set | test set |
+|---|---|---|
+| ptb.train.txt | ptb.valid.txt | ptb.test.txt |
+| 42068 lines | 3370 lines | 3761 lines |
+
+### Python Dataset Module
+
+We have encapsulated the PTB dataset in the Python module `paddle.dataset.imikolov`. This module can
+
+1. download the dataset to `~/.cache/paddle/dataset/imikolov` if it is not already there, and
+2. [preprocess](#preprocessing) the dataset, as sketched below.
+
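+The snippet below is a minimal usage sketch of this module; it relies only on the `paddle.dataset.imikolov.build_dict()` and `paddle.dataset.imikolov.train()` calls that also appear in the training script later in this chapter.
+
+```python
+import paddle.v2 as paddle
+
+# Build the word-to-id dictionary from the PTB training data
+# (the dataset is downloaded on first use).
+word_dict = paddle.dataset.imikolov.build_dict()
+print("vocabulary size: %d" % len(word_dict))
+
+# train() returns a reader; calling it yields data instances that are
+# already converted to integer word ids (5-word windows, as described
+# in the next subsection).
+reader = paddle.dataset.imikolov.train(word_dict, 5)
+for instance in reader():
+    print(instance)
+    break
+```
+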
+### Preprocessing
+
+We will train a 5-gram model: within each five-word window, the first four words are used to predict the fifth.
+
+The beginning and end of a sentence have special meaning, so we prepend the begin token `<s>` to each sentence and append the end token `<e>` to it. Data instances are then generated by sliding the five-word window over the sentence.
+
+For example, the sentence "I have a dream that one day" generates five data instances:
+
+```text
+<s> I have a dream
+I have a dream that
+have a dream that one
+a dream that one day
+dream that one day <e>
+```
+
+Finally, each data instance is converted into a sequence of integers according to the index of each of its words in the dictionary.
+
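+As an illustration, the sketch below converts one of the windows above into word ids. The `to_ids` helper is hypothetical, and it assumes the dictionary provides an `'<unk>'` entry for out-of-vocabulary words (PTB text is lowercased).
+
+```python
+import paddle.v2 as paddle
+
+word_dict = paddle.dataset.imikolov.build_dict()
+
+def to_ids(window, word_dict):
+    # Hypothetical helper: map each word of a window to its dictionary
+    # index, falling back to the id of '<unk>' for unknown words.
+    unk = word_dict['<unk>']
+    return [word_dict.get(w, unk) for w in window]
+
+print(to_ids(['i', 'have', 'a', 'dream', 'that'], word_dict))
+```
+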
+## Training
+
+The neural network we will be using is illustrated in the figure below:
-## Model Configuration
Figure 5. N-gram neural network model in model configuration
+`word2vec/train.py` demonstrates training word2vec using PaddlePaddle:
+
+- Import packages.
+
+```python
+import math
+import paddle.v2 as paddle
+```
+
+- Configure parameters.
+
+```python
+embsize = 32 # word vector dimension
+hiddensize = 256 # hidden layer dimension
+N = 5 # train 5-gram
+```
+
+- Map the $n-1$ words $w_{t-n+1},...,w_{t-1}$ before $w_t$ to $D$-dimensional vectors through an embedding matrix of dimension $|V|\times D$ ($D=32$ in this example).
+
+```python
+def wordemb(inlayer):
+ wordemb = paddle.layer.table_projection(
+ input=inlayer,
+ size=embsize,
+ param_attr=paddle.attr.Param(
+ name="_proj",
+ initial_std=0.001,
+ learning_rate=1,
+ l2_rate=0, ))
+ return wordemb
+```
+
+- Define the name and type of each data layer's input.
+
+```python
+paddle.init(use_gpu=False, trainer_count=3)
+word_dict = paddle.dataset.imikolov.build_dict()
+dict_size = len(word_dict)
+# Each data layer takes an integer value in the range [0, dict_size)
+firstword = paddle.layer.data(
+ name="firstw", type=paddle.data_type.integer_value(dict_size))
+secondword = paddle.layer.data(
+ name="secondw", type=paddle.data_type.integer_value(dict_size))
+thirdword = paddle.layer.data(
+ name="thirdw", type=paddle.data_type.integer_value(dict_size))
+fourthword = paddle.layer.data(
+ name="fourthw", type=paddle.data_type.integer_value(dict_size))
+nextword = paddle.layer.data(
+ name="fifthw", type=paddle.data_type.integer_value(dict_size))
+
+Efirst = wordemb(firstword)
+Esecond = wordemb(secondword)
+Ethird = wordemb(thirdword)
+Efourth = wordemb(fourthword)
+```
+
+- Concatenate the $n-1$ word embedding vectors into a single feature vector.
+
+```python
+contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
+```
+
+- The feature vector goes through a fully connected layer that outputs a hidden feature vector.
+
+```python
+hidden1 = paddle.layer.fc(input=contextemb,
+ size=hiddensize,
+ act=paddle.activation.Sigmoid(),
+ layer_attr=paddle.attr.Extra(drop_rate=0.5),
+ bias_attr=paddle.attr.Param(learning_rate=2),
+ param_attr=paddle.attr.Param(
+ initial_std=1. / math.sqrt(embsize * 8),
+ learning_rate=1))
+```
+
+- The hidden feature vector goes through another fully connected layer that maps it to a $|V|$-dimensional vector, and softmax is applied to obtain the probability of each word in the vocabulary being the next word.
+
+```python
+predictword = paddle.layer.fc(input=hidden1,
+ size=dict_size,
+ bias_attr=paddle.attr.Param(learning_rate=2),
+ act=paddle.activation.Softmax())
+```
+
+- We use the cross-entropy cost function.
+
+```python
+cost = paddle.layer.classification_cost(input=predictword, label=nextword)
+```
+
+- Create the parameters, the optimizer, and the trainer.
+
+```python
+parameters = paddle.parameters.create(cost)
+adam_optimizer = paddle.optimizer.Adam(
+ learning_rate=3e-3,
+ regularization=paddle.optimizer.L2Regularization(8e-4))
+trainer = paddle.trainer.SGD(cost, parameters, adam_optimizer)
+```
+
+Next, we begin the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` provide our training set and test set. Both functions return a **reader**: in PaddlePaddle, a reader is a Python function that returns a Python iterator yielding one data instance at a time.
+
+`paddle.batch` takes a reader as input and outputs a **batched reader**: while a reader yields a single data instance at a time, a batched reader yields a minibatch of data instances, as sketched below.
+
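+The toy reader below is only a sketch of this convention (its tuples merely mimic the shape of real instances); `paddle.batch` is called the same way as in the training code that follows.
+
+```python
+def toy_reader():
+    # A reader is a function that returns an iterator over single instances.
+    def reader():
+        for i in range(8):
+            yield (i, i + 1, i + 2, i + 3, i + 4)
+    return reader
+
+# A batched reader built from it yields minibatches of 4 instances.
+batched = paddle.batch(toy_reader(), 4)
+for minibatch in batched():
+    print(minibatch)
+```
+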
+```python
+import gzip
+
+def event_handler(event):
+ if isinstance(event, paddle.event.EndIteration):
+ if event.batch_id % 100 == 0:
+ print "Pass %d, Batch %d, Cost %f, %s" % (
+ event.pass_id, event.batch_id, event.cost, event.metrics)
+
+ if isinstance(event, paddle.event.EndPass):
+ result = trainer.test(
+ paddle.batch(
+ paddle.dataset.imikolov.test(word_dict, N), 32))
+ print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
+ with gzip.open("model_%d.tar.gz"%event.pass_id, 'w') as f:
+ parameters.to_tar(f)
+
+trainer.train(
+ paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
+ num_passes=100,
+ event_handler=event_handler)
+```
+
+`trainer.train` starts the training; the output of `event_handler` will be similar to the following:
+```text
+Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
+Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
+Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
+...
+```
+
+After 30 passes, the average error rate is around 0.735611.
-## Model Training
## Model Application
+After the model is trained, we can load the saved parameters and use them to initialize other models, or use them directly in applications, as sketched below.
+
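+For example, the parameters saved by the event handler above can be reloaded later. This sketch assumes `paddle.parameters.Parameters.from_tar` and a `model_30.tar.gz` file produced by an earlier run:
+
+```python
+import gzip
+
+import paddle.v2 as paddle
+
+# Reload parameters saved as model_<pass>.tar.gz by the event handler.
+with gzip.open("model_30.tar.gz") as f:
+    parameters = paddle.parameters.Parameters.from_tar(f)
+```
+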
+### Viewing Word Vector
+
+Parameters trained by PaddlePaddle can be retrieved with `parameters.get()`. For example, we can check the word vector of the word `apple`.
+
+```python
+embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
+
+print embeddings[word_dict['apple']]
+```
+
+```text
+[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
+-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
+0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
+-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
+0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
+0.19072419 -0.24286366]
+```
+
+### Modifying Word Vector
+
+The word vectors (`embeddings`) we obtain form a NumPy array. We can modify this array and set it back into `parameters`.
+
+
+```python
+def modify_embedding(emb):
+ # Add your modification here.
+ pass
+
+modify_embedding(embeddings)
+parameters.set("_proj", embeddings)
+```
+
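+As one concrete (hypothetical) example of such a modification, the sketch below L2-normalizes every word vector before writing the array back, a common preprocessing step for similarity lookups.
+
+```python
+import numpy as np
+
+# L2-normalize each row (word vector) of the embedding matrix.
+norms = np.sqrt((embeddings ** 2).sum(axis=1))[:, np.newaxis]
+normalized = embeddings / np.maximum(norms, 1e-8)
+parameters.set("_proj", normalized.astype(embeddings.dtype))
+```
+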
+### Calculating Cosine Similarity
+
+Cosine similarity is one way of quantifying the similarity between two vectors. It ranges over $[-1, 1]$; the larger the value, the more similar the two vectors are:
+
+
+```python
+from scipy import spatial
+
+emb_1 = embeddings[word_dict['world']]
+emb_2 = embeddings[word_dict['would']]
+
+print spatial.distance.cosine(emb_1, emb_2)
+```
+
+```text
+0.99375076448
+```
+
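+Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, i.e. one minus the cosine similarity, so the value printed above is a distance. A small sketch that computes the similarity itself directly with NumPy:
+
+```python
+import numpy as np
+
+def cosine_similarity(a, b):
+    # Cosine similarity: dot product divided by the product of the norms.
+    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
+
+print(cosine_similarity(embeddings[word_dict['world']],
+                        embeddings[word_dict['would']]))
+```
+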
## Conclusion
This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding.
diff --git a/word2vec/README.md b/word2vec/README.md
index 414c7839d8b81834a75ae1d51db840804edaa77c..a6e5732ca5e7844cbb45c0a568a9bbac67ff05b6 100644
--- a/word2vec/README.md
+++ b/word2vec/README.md
@@ -144,7 +144,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去
### 数据介绍
-本教程使用Penn Tree Bank (PTB)数据集。PTB数据集较小,训练速度快,应用于Mikolov的公开语言模型训练工具\[[2](#参考文献)\]中。其统计情况如下:
+本教程使用Penn Treebank (PTB)(经Tomas Mikolov预处理过的版本)数据集。PTB数据集较小,训练速度快,应用于Mikolov的公开语言模型训练工具\[[2](#参考文献)\]中。其统计情况如下:
@@ -183,6 +183,7 @@ a dream that one day
dream that one day
```
+最后,每个输入会按其单词在字典里的位置,转化成整数的索引序列,作为PaddlePaddle的输入。
## 编程实现
本配置的模型结构如下图所示:
@@ -245,7 +246,6 @@ Efirst = wordemb(firstword)
Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)
-
```
- 将这n-1个词向量经过concat_layer连接成一个大向量作为历史文本特征。
@@ -323,11 +323,12 @@ trainer.train(
event_handler=event_handler)
```
- ...
- Pass 0, Batch 25000, Cost 4.251861, {'classification_error_evaluator': 0.84375}
- Pass 0, Batch 25100, Cost 4.847692, {'classification_error_evaluator': 0.8125}
- Pass 0, Testing metrics {'classification_error_evaluator': 0.7417652606964111}
-
+```text
+Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
+Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
+Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
+...
+```
训练过程是完全自动的,event_handler里打印的日志类似如上所示:
@@ -340,22 +341,23 @@ trainer.train(
### 查看词向量
-PaddlePaddle训练出来的参数可以直接使用`parameters.get()`获取出来。例如查看单词的word的词向量,即为
+PaddlePaddle训练出来的参数可以直接使用`parameters.get()`获取出来。例如查看单词`apple`的词向量,即为
```python
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
-print embeddings[word_dict['word']]
+print embeddings[word_dict['apple']]
```
- [-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
- -0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
- 0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
- -0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
- 0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
- 0.19072419 -0.24286366]
-
+```text
+[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
+-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
+0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
+-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
+0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
+0.19072419 -0.24286366]
+```
### 修改词向量
@@ -387,8 +389,9 @@ emb_2 = embeddings[word_dict['would']]
print spatial.distance.cosine(emb_1, emb_2)
```
- 0.99375076448
-
+```text
+0.99375076448
+```
## 总结
本章中,我们介绍了词向量、语言模型和词向量的关系、以及如何通过训练神经网络模型获得词向量。在信息检索中,我们可以根据向量间的余弦夹角,来判断query和文档关键词这二者间的相关性。在句法分析和语义分析中,训练好的词向量可以用来初始化模型,以得到更好的效果。在文档分类中,有了词向量之后,可以用聚类的方法将文档中同义词进行分组。希望大家在本章后能够自行运用词向量进行相关领域的研究。
diff --git a/word2vec/index.en.html b/word2vec/index.en.html
index 1738b1588ad5363e3f9a83168ab35948fe2664c3..66e4bcc9a4eb1047524729a454c57e5268ceeb86 100644
--- a/word2vec/index.en.html
+++ b/word2vec/index.en.html
@@ -191,19 +191,257 @@ The advantages of CBOW is that it smooths over the word embeddings of the contex
As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combine the classification loss of all those $2n$ words by softmax.
-## Data Preparation
+## Dataset
+
+We will use the Penn Treebank (PTB) dataset (Tomas Mikolov's pre-processed version). PTB is a small dataset that was used in the Recurrent Neural Network Language Modeling Toolkit\[[2](#reference)\]. Its statistics are as follows:
+
+| training set | validation set | test set |
+|---|---|---|
+| ptb.train.txt | ptb.valid.txt | ptb.test.txt |
+| 42068 lines | 3370 lines | 3761 lines |
+
+### Python Dataset Module
+
+We have encapsulated the PTB dataset in the Python module `paddle.dataset.imikolov`. This module can
+
+1. download the dataset to `~/.cache/paddle/dataset/imikolov` if it is not already there, and
+2. [preprocess](#preprocessing) the dataset, as sketched below.
+
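+The snippet below is a minimal usage sketch of this module; it relies only on the `paddle.dataset.imikolov.build_dict()` and `paddle.dataset.imikolov.train()` calls that also appear in the training script later in this chapter.
+
+```python
+import paddle.v2 as paddle
+
+# Build the word-to-id dictionary from the PTB training data
+# (the dataset is downloaded on first use).
+word_dict = paddle.dataset.imikolov.build_dict()
+print("vocabulary size: %d" % len(word_dict))
+
+# train() returns a reader; calling it yields data instances that are
+# already converted to integer word ids (5-word windows, as described
+# in the next subsection).
+reader = paddle.dataset.imikolov.train(word_dict, 5)
+for instance in reader():
+    print(instance)
+    break
+```
+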
+### Preprocessing
+
+We will train a 5-gram model: within each five-word window, the first four words are used to predict the fifth.
+
+The beginning and end of a sentence have special meaning, so we prepend the begin token `<s>` to each sentence and append the end token `<e>` to it. Data instances are then generated by sliding the five-word window over the sentence.
+
+For example, the sentence "I have a dream that one day" generates five data instances:
+
+```text
+<s> I have a dream
+I have a dream that
+have a dream that one
+a dream that one day
+dream that one day <e>
+```
+
+Finally, each data instance is converted into a sequence of integers according to the index of each of its words in the dictionary.
+
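+As an illustration, the sketch below converts one of the windows above into word ids. The `to_ids` helper is hypothetical, and it assumes the dictionary provides an `'<unk>'` entry for out-of-vocabulary words (PTB text is lowercased).
+
+```python
+import paddle.v2 as paddle
+
+word_dict = paddle.dataset.imikolov.build_dict()
+
+def to_ids(window, word_dict):
+    # Hypothetical helper: map each word of a window to its dictionary
+    # index, falling back to the id of '<unk>' for unknown words.
+    unk = word_dict['<unk>']
+    return [word_dict.get(w, unk) for w in window]
+
+print(to_ids(['i', 'have', 'a', 'dream', 'that'], word_dict))
+```
+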
+## Training
+
+The neural network we will be using is illustrated in the figure below:
-## Model Configuration
Figure 5. N-gram neural network model in model configuration
+`word2vec/train.py` demonstrates training word2vec using PaddlePaddle:
+
+- Import packages.
+
+```python
+import math
+import paddle.v2 as paddle
+```
+
+- Configure parameters.
+
+```python
+embsize = 32 # word vector dimension
+hiddensize = 256 # hidden layer dimension
+N = 5 # train 5-gram
+```
+
+- Map the $n-1$ words $w_{t-n+1},...,w_{t-1}$ before $w_t$ to $D$-dimensional vectors through an embedding matrix of dimension $|V|\times D$ ($D=32$ in this example).
+
+```python
+def wordemb(inlayer):
+ wordemb = paddle.layer.table_projection(
+ input=inlayer,
+ size=embsize,
+ param_attr=paddle.attr.Param(
+ name="_proj",
+ initial_std=0.001,
+ learning_rate=1,
+ l2_rate=0, ))
+ return wordemb
+```
+
+- Define the name and type of each data layer's input.
+
+```python
+paddle.init(use_gpu=False, trainer_count=3)
+word_dict = paddle.dataset.imikolov.build_dict()
+dict_size = len(word_dict)
+# Each data layer takes an integer value in the range [0, dict_size)
+firstword = paddle.layer.data(
+ name="firstw", type=paddle.data_type.integer_value(dict_size))
+secondword = paddle.layer.data(
+ name="secondw", type=paddle.data_type.integer_value(dict_size))
+thirdword = paddle.layer.data(
+ name="thirdw", type=paddle.data_type.integer_value(dict_size))
+fourthword = paddle.layer.data(
+ name="fourthw", type=paddle.data_type.integer_value(dict_size))
+nextword = paddle.layer.data(
+ name="fifthw", type=paddle.data_type.integer_value(dict_size))
+
+Efirst = wordemb(firstword)
+Esecond = wordemb(secondword)
+Ethird = wordemb(thirdword)
+Efourth = wordemb(fourthword)
+```
+
+- Concatenate the $n-1$ word embedding vectors into a single feature vector.
+
+```python
+contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
+```
+
+- The feature vector goes through a fully connected layer that outputs a hidden feature vector.
+
+```python
+hidden1 = paddle.layer.fc(input=contextemb,
+ size=hiddensize,
+ act=paddle.activation.Sigmoid(),
+ layer_attr=paddle.attr.Extra(drop_rate=0.5),
+ bias_attr=paddle.attr.Param(learning_rate=2),
+ param_attr=paddle.attr.Param(
+ initial_std=1. / math.sqrt(embsize * 8),
+ learning_rate=1))
+```
+
+- The hidden feature vector goes through another fully connected layer that maps it to a $|V|$-dimensional vector, and softmax is applied to obtain the probability of each word in the vocabulary being the next word.
+
+```python
+predictword = paddle.layer.fc(input=hidden1,
+ size=dict_size,
+ bias_attr=paddle.attr.Param(learning_rate=2),
+ act=paddle.activation.Softmax())
+```
+
+- We use the cross-entropy cost function.
+
+```python
+cost = paddle.layer.classification_cost(input=predictword, label=nextword)
+```
+
+- Create the parameters, the optimizer, and the trainer.
+
+```python
+parameters = paddle.parameters.create(cost)
+adam_optimizer = paddle.optimizer.Adam(
+ learning_rate=3e-3,
+ regularization=paddle.optimizer.L2Regularization(8e-4))
+trainer = paddle.trainer.SGD(cost, parameters, adam_optimizer)
+```
+
+Next, we begin the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` provide our training set and test set. Both functions return a **reader**: in PaddlePaddle, a reader is a Python function that returns a Python iterator yielding one data instance at a time.
+
+`paddle.batch` takes a reader as input and outputs a **batched reader**: while a reader yields a single data instance at a time, a batched reader yields a minibatch of data instances, as sketched below.
+
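+The toy reader below is only a sketch of this convention (its tuples merely mimic the shape of real instances); `paddle.batch` is called the same way as in the training code that follows.
+
+```python
+def toy_reader():
+    # A reader is a function that returns an iterator over single instances.
+    def reader():
+        for i in range(8):
+            yield (i, i + 1, i + 2, i + 3, i + 4)
+    return reader
+
+# A batched reader built from it yields minibatches of 4 instances.
+batched = paddle.batch(toy_reader(), 4)
+for minibatch in batched():
+    print(minibatch)
+```
+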
+```python
+import gzip
+
+def event_handler(event):
+ if isinstance(event, paddle.event.EndIteration):
+ if event.batch_id % 100 == 0:
+ print "Pass %d, Batch %d, Cost %f, %s" % (
+ event.pass_id, event.batch_id, event.cost, event.metrics)
+
+ if isinstance(event, paddle.event.EndPass):
+ result = trainer.test(
+ paddle.batch(
+ paddle.dataset.imikolov.test(word_dict, N), 32))
+ print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
+ with gzip.open("model_%d.tar.gz"%event.pass_id, 'w') as f:
+ parameters.to_tar(f)
+
+trainer.train(
+ paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
+ num_passes=100,
+ event_handler=event_handler)
+```
+
+`trainer.train` starts the training; the output of `event_handler` will be similar to the following:
+```text
+Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
+Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
+Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
+...
+```
+
+After 30 passes, the average error rate is around 0.735611.
-## Model Training
## Model Application
+After the model is trained, we can load the saved parameters and use them to initialize other models, or use them directly in applications, as sketched below.
+
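+For example, the parameters saved by the event handler above can be reloaded later. This sketch assumes `paddle.parameters.Parameters.from_tar` and a `model_30.tar.gz` file produced by an earlier run:
+
+```python
+import gzip
+
+import paddle.v2 as paddle
+
+# Reload parameters saved as model_<pass>.tar.gz by the event handler.
+with gzip.open("model_30.tar.gz") as f:
+    parameters = paddle.parameters.Parameters.from_tar(f)
+```
+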
+### Viewing Word Vector
+
+Parameters trained by PaddlePaddle can be retrieved with `parameters.get()`. For example, we can check the word vector of the word `apple`.
+
+```python
+embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
+
+print embeddings[word_dict['apple']]
+```
+
+```text
+[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
+-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
+0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
+-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
+0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
+0.19072419 -0.24286366]
+```
+
+### Modifying Word Vector
+
+The word vectors (`embeddings`) we obtain form a NumPy array. We can modify this array and set it back into `parameters`.
+
+
+```python
+def modify_embedding(emb):
+ # Add your modification here.
+ pass
+
+modify_embedding(embeddings)
+parameters.set("_proj", embeddings)
+```
+
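+As one concrete (hypothetical) example of such a modification, the sketch below L2-normalizes every word vector before writing the array back, a common preprocessing step for similarity lookups.
+
+```python
+import numpy as np
+
+# L2-normalize each row (word vector) of the embedding matrix.
+norms = np.sqrt((embeddings ** 2).sum(axis=1))[:, np.newaxis]
+normalized = embeddings / np.maximum(norms, 1e-8)
+parameters.set("_proj", normalized.astype(embeddings.dtype))
+```
+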
+### Calculating Cosine Similarity
+
+Cosine similarity is one way of quantifying the similarity between two vectors. It ranges over $[-1, 1]$; the larger the value, the more similar the two vectors are:
+
+
+```python
+from scipy import spatial
+
+emb_1 = embeddings[word_dict['world']]
+emb_2 = embeddings[word_dict['would']]
+
+print spatial.distance.cosine(emb_1, emb_2)
+```
+
+```text
+0.99375076448
+```
+
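+Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, i.e. one minus the cosine similarity, so the value printed above is a distance. A small sketch that computes the similarity itself directly with NumPy:
+
+```python
+import numpy as np
+
+def cosine_similarity(a, b):
+    # Cosine similarity: dot product divided by the product of the norms.
+    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
+
+print(cosine_similarity(embeddings[word_dict['world']],
+                        embeddings[word_dict['would']]))
+```
+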
## Conclusion
This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding.
diff --git a/word2vec/index.html b/word2vec/index.html
index e7b232121b0b1d2c9f47720602c05e171597497a..ad5cd1942de85a946beca2bfd7213e8924316a39 100644
--- a/word2vec/index.html
+++ b/word2vec/index.html
@@ -186,7 +186,7 @@ CBOW的好处是对上下文词语的分布在词向量上进行了平滑,去
### 数据介绍
-本教程使用Penn Tree Bank (PTB)数据集。PTB数据集较小,训练速度快,应用于Mikolov的公开语言模型训练工具\[[2](#参考文献)\]中。其统计情况如下:
+本教程使用Penn Treebank (PTB)(经Tomas Mikolov预处理过的版本)数据集。PTB数据集较小,训练速度快,应用于Mikolov的公开语言模型训练工具\[[2](#参考文献)\]中。其统计情况如下:
@@ -225,6 +225,7 @@ a dream that one day
dream that one day
```
+最后,每个输入会按其单词在字典里的位置,转化成整数的索引序列,作为PaddlePaddle的输入。
## 编程实现
本配置的模型结构如下图所示:
@@ -287,7 +288,6 @@ Efirst = wordemb(firstword)
Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)
-
```
- 将这n-1个词向量经过concat_layer连接成一个大向量作为历史文本特征。
@@ -365,11 +365,12 @@ trainer.train(
event_handler=event_handler)
```
- ...
- Pass 0, Batch 25000, Cost 4.251861, {'classification_error_evaluator': 0.84375}
- Pass 0, Batch 25100, Cost 4.847692, {'classification_error_evaluator': 0.8125}
- Pass 0, Testing metrics {'classification_error_evaluator': 0.7417652606964111}
-
+```text
+Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
+Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
+Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
+...
+```
训练过程是完全自动的,event_handler里打印的日志类似如上所示:
@@ -382,22 +383,23 @@ trainer.train(
### 查看词向量
-PaddlePaddle训练出来的参数可以直接使用`parameters.get()`获取出来。例如查看单词的word的词向量,即为
+PaddlePaddle训练出来的参数可以直接使用`parameters.get()`获取出来。例如查看单词`apple`的词向量,即为
```python
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
-print embeddings[word_dict['word']]
+print embeddings[word_dict['apple']]
```
- [-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
- -0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
- 0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
- -0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
- 0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
- 0.19072419 -0.24286366]
-
+```text
+[-0.38961065 -0.02392169 -0.00093231 0.36301503 0.13538605 0.16076435
+-0.0678709 0.1090285 0.42014077 -0.24119169 -0.31847557 0.20410083
+0.04910378 0.19021918 -0.0122014 -0.04099389 -0.16924137 0.1911236
+-0.10917275 0.13068172 -0.23079982 0.42699069 -0.27679482 -0.01472992
+0.2069038 0.09005053 -0.3282454 0.12717034 -0.24218646 0.25304323
+0.19072419 -0.24286366]
+```
### 修改词向量
@@ -429,8 +431,9 @@ emb_2 = embeddings[word_dict['would']]
print spatial.distance.cosine(emb_1, emb_2)
```
- 0.99375076448
-
+```text
+0.99375076448
+```
## 总结
本章中,我们介绍了词向量、语言模型和词向量的关系、以及如何通过训练神经网络模型获得词向量。在信息检索中,我们可以根据向量间的余弦夹角,来判断query和文档关键词这二者间的相关性。在句法分析和语义分析中,训练好的词向量可以用来初始化模型,以得到更好的效果。在文档分类中,有了词向量之后,可以用聚类的方法将文档中同义词进行分组。希望大家在本章后能够自行运用词向量进行相关领域的研究。