<html>
<head>
  <script type="text/x-mathjax-config">
  MathJax.Hub.Config({
    extensions: ["tex2jax.js", "TeX/AMSsymbols.js", "TeX/AMSmath.js"],
    jax: ["input/TeX", "output/HTML-CSS"],
    tex2jax: {
      inlineMath: [ ['$','$'] ],
      displayMath: [ ['$$','$$'] ],
      processEscapes: true
    },
    "HTML-CSS": { availableFonts: ["TeX"] }
  });
  </script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js" async></script>
  <script type="text/javascript" src="../.tools/theme/marked.js">
  </script>
  <link href="http://cdn.bootcss.com/highlight.js/9.9.0/styles/darcula.min.css" rel="stylesheet">
  <script src="http://cdn.bootcss.com/highlight.js/9.9.0/highlight.min.js"></script>
  <link href="http://cdn.bootcss.com/bootstrap/4.0.0-alpha.6/css/bootstrap.min.css" rel="stylesheet">
  <link href="https://cdn.jsdelivr.net/perfect-scrollbar/0.6.14/css/perfect-scrollbar.min.css" rel="stylesheet">
  <link href="../.tools/theme/github-markdown.css" rel='stylesheet'>
</head>
<style type="text/css" >
.markdown-body {
    box-sizing: border-box;
    min-width: 200px;
    max-width: 980px;
    margin: 0 auto;
    padding: 45px;
}
</style>


<body>

<div id="context" class="container-fluid markdown-body">
</div>

<!-- This block will be replaced by each markdown file content. Please do not change lines below.-->
<div id="markdown" style='display:none'>
# Word2Vec

This is intended as a reference tutorial. The source code of this tutorial is located at [book/word2vec](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec).

For instructions on getting started with PaddlePaddle, see [PaddlePaddle installation guide](https://github.com/PaddlePaddle/book/blob/develop/README.md#running-the-book).

## Background Introduction

This section introduces the concept of **word embeddings**, which are vector representations of words. Word embedding is a popular technique in natural language processing, supporting applications such as search engines, advertising systems, and recommendation systems.

### One-Hot Vectors

Building these applications requires us to quantify the similarity between two words or paragraphs. This calls for a new representation of words that makes them more suitable for computation. An obvious way to achieve this is the vector space model, where every word is represented as a **one-hot vector**.

For each word, the corresponding entry of its vector is 1 and all other entries are 0. The length of a one-hot vector equals the size of the dictionary, and each entry corresponds to the presence (or absence) of one word in the dictionary.

One-hot vectors are intuitive, yet they have limited usefulness. Take the example of an Internet advertising system: suppose a customer enters the query "Mother's Day", while an ad bids for the keyword "carnations". Because the one-hot vectors of these two words are perpendicular, the metric distance (whether Euclidean distance or cosine similarity) between them indicates little relevance. However, *we* know that these two queries are connected semantically, since people often gift their mothers bundles of carnations on Mother's Day. This discrepancy is due to the low information capacity of each vector: comparing the one-hot representations of two words does not assess their relevance sufficiently. To calculate their similarity accurately, we need more information, which can be learned from large amounts of data through machine learning methods.

Like many machine learning models, word embeddings can represent knowledge in various ways. An embedding model may project a one-hot vector onto an embedding vector of lower dimension, e.g. $embedding(\text{mother's day}) = [0.3, 4.2, -1.5, ...]$ and $embedding(\text{carnations}) = [0.2, 5.6, -2.3, ...]$. Mapping one-hot vectors onto an embedded vector space has the potential to bring the embedding vectors of similar words (either semantically or usage-wise) closer to each other, so that the cosine similarity between the corresponding vectors for words like "Mother's Day" and "carnations" is no longer zero.
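
The contrast can be made concrete with a few lines of NumPy. This is an illustration only, not part of the tutorial code; the embedding values are the made-up numbers from the example above.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = a.b / (|a| |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors for a toy 3-word dictionary: mutually orthogonal, so similarity is 0.
onehot = np.eye(3)
print(cosine_similarity(onehot[0], onehot[1]))   # 0.0

# Hypothetical low-dimensional embeddings of "mother's day" and "carnations".
emb_mothers_day = np.array([0.3, 4.2, -1.5])
emb_carnations = np.array([0.2, 5.6, -2.3])
print(cosine_similarity(emb_mothers_day, emb_carnations))   # positive, close to 1
```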

A word embedding model could be a probabilistic model, a co-occurrence matrix model, or a neural network. Before neural networks were used to generate word embeddings, the traditional method was to compute a co-occurrence matrix $X$ of words. Here, $X$ is a $|V| \times |V|$ matrix, where $X_{ij}$ is the number of times the $i$-th and $j$-th words of the vocabulary $V$ co-occur in the corpus, and $|V|$ is the size of the vocabulary. By performing matrix decomposition on $X$, e.g. Singular Value Decomposition \[[5](#references)\]

$$X = USV^T$$

the resulting $U$ can be seen as the word embedding of all the words.
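
As a small illustration (not part of the tutorial code), the decomposition can be sketched in NumPy with a toy co-occurrence matrix whose counts are made up:

```python
import numpy as np

# Toy co-occurrence counts X for a 4-word vocabulary (symmetric, made-up values).
X = np.array([[0., 2., 1., 0.],
              [2., 0., 3., 1.],
              [1., 3., 0., 2.],
              [0., 1., 2., 0.]])

U, S, Vt = np.linalg.svd(X)   # X = U S V^T
k = 2                         # keep the top-k singular directions
word_vectors = U[:, :k]       # one k-dimensional vector per word
print(word_vectors)
```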

However, this method suffers from several drawbacks:
1) Since many pairs of words do not co-occur, the co-occurrence matrix is sparse. To achieve good performance in matrix factorization, further treatment of word frequency is needed;
2) The matrix is large, frequently on the order of $10^6 \times 10^6$;
3) We need to manually filter out stop words (like "although", "a", ...), otherwise these frequent words will distort the matrix factorization.

Neural-network-based models do not require storing huge statistical tables over the whole corpus. They obtain word embeddings by learning from semantic information, and hence avoid the aforementioned problems of the traditional method. In this chapter, we will introduce the details of neural network word embedding models and how to train such models in PaddlePaddle.

## Results Demonstration

In this section, we use the $t$-SNE\[[4](#references)\] data visualization algorithm to draw the word embedding vectors after projecting them onto a two-dimensional space (see the figure below). From the figure we can see that semantically relevant words -- *a*, *the*, and *these*, or *big* and *huge* -- are close to each other in the projected space, while irrelevant words -- *say* and *business*, or *decision* and *japan* -- are far from each other.

<p align="center">
    <img src = "image/2d_similarity.png" width=400><br/>
    Figure 1. Two dimension projection of word embeddings
</p>

### Cosine Similarity

On the other hand, we know that the cosine similarity between two vectors falls in $[-1, 1]$. Specifically, it is 1 when the vectors point in the same direction, 0 when they are perpendicular, and -1 when they point in opposite directions. That is, the cosine similarity between two vectors scales with their relevance, so we can calculate the cosine similarity of two word embedding vectors to represent their relevance.
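
Formally, for two vectors $A$ and $B$, the cosine similarity is

$$\cos\theta = \frac{A \cdot B}{\|A\|\ \|B\|}$$

For example, querying a trained model with two pairs of words gives: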

```text
please input two words: big huge
similarity: 0.899180685161

please input two words: from company
similarity: -0.0997506977351
```

The above results can be obtained by running `calculate_dis.py`, which loads the words in the dictionary and their corresponding trained word embeddings. For detailed instructions, see the [Model Application](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec#model-application) section.


## Model Overview

In this section, we will introduce three word embedding models: the N-gram model, CBOW, and Skip-gram, all of which estimate the probability of a word given its immediate context.

For the N-gram model, we will first introduce the concept of a language model, and then implement it using PaddlePaddle in the [Training](https://github.com/PaddlePaddle/book/tree/develop/04.word2vec#model-application) section.

The latter two models, which became popular more recently, are neural word embedding models developed by Tomas Mikolov at Google \[[3](#references)\]. Despite their apparent simplicity, these models train very effectively.

### Language Model

Before diving into word embedding models, we will first introduce the concept of a **language model**. Language models build the joint probability function $P(w_1, ..., w_T)$ of a sentence, where $w_i$ is the $i$-th word in the sentence. The goal is to assign higher probabilities to meaningful sentences and lower probabilities to meaningless ones.

In general, models that estimate the probability of a sequence can be applied to many fields, such as machine translation, speech recognition, information retrieval, part-of-speech tagging, and handwriting recognition. Take information retrieval, for example: if you were to search for "how long is a football bame" (where "bame" is a medical noun), the search engine would ask whether you meant "how long is a football game" instead. This is because the probability of "how long is a football bame" is very low according to the language model, and, among all of the words easily confused with "bame", "game" builds the most probable sentence.

#### Target Probability
For the language model's target probability $P(w_1, ..., w_T)$, if the words in the sentence were independent, the joint probability of the whole sentence would be the product of each word's probability:

$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t)$$

However, the occurrence of a word in a sentence typically depends on the words before it, so canonical language models are constructed using conditional probabilities:

$$P(w_1, ..., w_T) = \prod_{t=1}^TP(w_t | w_1, ... , w_{t-1})$$


### N-gram neural model

In computational linguistics, the n-gram is an important method of representing text. An n-gram is a contiguous sequence of $n$ items from a given text. Depending on the application, each item can be a letter, a syllable, or a word. The n-gram model is also an important method in statistical language modeling: when training a language model with n-grams, the first $n-1$ words of an n-gram are used to predict the $n$-th word.
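
As a quick illustration (not from the tutorial code), word-level n-grams can be extracted from a sentence like this:

```python
def ngrams(words, n):
    # Return all contiguous n-word windows of the sentence.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I have a dream that one day".split()
print(ngrams(sentence, 2))   # bigrams: ('I', 'have'), ('have', 'a'), ...
print(ngrams(sentence, 3))   # trigrams: ('I', 'have', 'a'), ...
```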

Yoshua Bengio and other scientists described how to train a word embedding model using a neural network in the famous 2003 paper *A Neural Probabilistic Language Model* \[[1](#references)\]. The Neural Network Language Model (NNLM) described in the paper learns the language model and the word embeddings simultaneously through a linear transformation and a non-linear hidden connection. That is, after training on a large corpus, the model learns the word embeddings and then uses them to compute the probability of whole sentences. This type of language model can alleviate the **curse of dimensionality**, i.e. the data sparsity that arises when the test data contains word sequences unseen during training. Note that the term *neural network language model* is ill-defined, so we will not use the name NNLM here, but only refer to it as the *N-gram neural model* in this section.

We have previously described the language model using conditional probability, where the probability of the $t$-th word in a sentence depends on all $t-1$ words before it. Furthermore, since words farther back have less impact on a word, and every word within an n-gram is only affected by its previous $n-1$ words, we have:

$$P(w_1, ..., w_T) = \prod_{t=n}^TP(w_t|w_{t-1}, w_{t-2}, ..., w_{t-n+1})$$

Given some real corpus in which all sentences are meaningful, the n-gram model should maximize the following objective function:

$$\frac{1}{T}\sum_t f(w_t, w_{t-1}, ..., w_{t-n+1};\theta) + R(\theta)$$

where $f(w_t, w_{t-1}, ..., w_{t-n+1})$ represents the conditional log-probability of the current word $w_t$ given its previous $n-1$ words, and $R(\theta)$ represents the parameter regularization term.

<p align="center">
       <img src="image/nnlm_en.png" width=500><br/>
       Figure 2. N-gram neural network model
</p>


Figure 2 shows the N-gram neural network model. From the bottom up, the model has the following components:

 - For each sample, the model takes $w_{t-n+1},...w_{t-1}$ as input and outputs the probability distribution of the $t$-th word over the $|V|$ words in the dictionary.

 Every input word $w_{t-n+1},...w_{t-1}$ first gets transformed into a word embedding $C(w_{t-n+1}),...C(w_{t-1})$ through a transformation matrix.

 - All the word embeddings are concatenated into a single vector, which is mapped (nonlinearly) to the hidden representation of the $t$-th word:

    $$g=U\tanh(\theta^Tx + b_1) + Wx + b_2$$

   where $x$ is the large vector concatenated from all the word embeddings representing the context; $\theta$, $U$, $b_1$, $b_2$, and $W$ are parameters connecting the word embedding layer to the hidden layer. $g$ represents the unnormalized probabilities of the output words, and $g_i$ is the unnormalized probability of the output word being the $i$-th word in the dictionary.

 - Based on the definition of softmax, normalizing $g_i$ gives the probability that the output word is $w_t$:

  $$P(w_t | w_{t-1}, ..., w_{t-n+1}) = \frac{e^{g_{w_t}}}{\sum_{i=1}^{|V|} e^{g_i}}$$

 - The cost of the entire network is the multi-class cross-entropy, described by the following loss function:

   $$J(\theta) = -\sum_{i=1}^N\sum_{k=1}^{|V|}y_k^{i}\log\big(\text{softmax}(g_k^{i})\big)$$

   where $y_k^i$ represents the true label ($0$ or $1$) of the $k$-th class for the $i$-th sample, and $\text{softmax}(g_k^i)$ is the softmax probability of the $k$-th class for the $i$-th sample. A minimal NumPy sketch of this whole forward pass is given below.
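
The sketch below walks through the forward pass in plain NumPy. It is an illustration with made-up shapes and random parameters, not the PaddlePaddle implementation used later in this chapter.

```python
import numpy as np

V, D, H, n = 1000, 32, 256, 5            # vocab size, embedding dim, hidden dim, n-gram size
C = np.random.randn(V, D) * 0.01         # embedding table, one row per word
theta = np.random.randn((n - 1) * D, H) * 0.01
U = np.random.randn(V, H) * 0.01
W = np.random.randn(V, (n - 1) * D) * 0.01
b1, b2 = np.zeros(H), np.zeros(V)

context = [23, 7, 512, 88]               # indices of w_{t-n+1}, ..., w_{t-1}
x = C[context].reshape(-1)               # concatenated embeddings, shape ((n-1)*D,)

g = U @ np.tanh(theta.T @ x + b1) + W @ x + b2   # unnormalized scores over the dictionary
p = np.exp(g - g.max())
p /= p.sum()                             # softmax probabilities P(w_t | context)

true_word = 42                           # id of the actual next word (made up)
loss = -np.log(p[true_word])             # per-sample cross-entropy
```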

### Continuous Bag-of-Words Model (CBOW)

The CBOW model predicts the current word based on the $N$ words both before and after it. When $N=2$, the model is as shown in the figure below:

<p align="center">
    <img src="image/cbow_en.png" width=250><br/>
    Figure 3. CBOW model
</p>

Specifically, ignoring the order of words in the sequence, CBOW uses the average of the context word embeddings to predict the current word:

$$\text{context} = \frac{x_{t-1} + x_{t-2} + x_{t+1} + x_{t+2}}{4}$$

where $x_t$ is the word embedding of the $t$-th word. The classification score vector is $z=U \cdot \text{context}$, the final classification $y$ uses softmax, and the loss function is the multi-class cross-entropy.
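
A minimal NumPy sketch of this step (illustrative shapes and random values only, not the tutorial code):

```python
import numpy as np

V, D = 1000, 32                          # vocab size, embedding dimension
C = np.random.randn(V, D) * 0.01         # embedding table
U = np.random.randn(V, D) * 0.01         # classification matrix

context_ids = [15, 3, 901, 42]           # w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}
context = C[context_ids].mean(axis=0)    # average of the four context embeddings

z = U @ context                          # classification scores over the dictionary
y = np.exp(z - z.max())
y /= y.sum()                             # softmax probability of each candidate current word
```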

### Skip-gram model

The advantage of CBOW is that it smooths over the word embeddings of the context and reduces noise, so it is very effective on small datasets. Skip-gram uses a word to predict its context, producing multiple context examples for each word, so it can be used on larger datasets.

<p align="center">
    <img src="image/skipgram_en.png" width=250><br/>
    Figure 4. Skip-gram model
</p>

As illustrated in the figure above, the skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (for the $n$ words before and the $n$ words after the given word), and then combines the classification losses of all those $2n$ words via softmax.
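
The way skip-gram turns a sentence into training examples can be sketched as follows (an illustration, not the tutorial code): each word is paired with every word within a window of $n$ on either side.

```python
def skipgram_pairs(words, n):
    # Generate (center word, context word) pairs with window size n.
    pairs = []
    for i, center in enumerate(words):
        for j in range(max(0, i - n), min(len(words), i + n + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

print(skipgram_pairs("I have a dream that one day".split(), n=2))
```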

## Dataset

We will use the Penn Treebank (PTB) dataset (Tomas Mikolov's pre-processed version). PTB is a small dataset used in the Recurrent Neural Network Language Modeling Toolkit \[[2](#references)\]. Its statistics are as follows:

<p align="center">
<table>
    <tr>
        <td>training set</td>
        <td>validation set</td>
        <td>test set</td>
    </tr>
    <tr>
        <td>ptb.train.txt</td>
        <td>ptb.valid.txt</td>
        <td>ptb.test.txt</td>
    </tr>
    <tr>
        <td>42068 lines</td>
        <td>3370 lines</td>
        <td>3761 lines</td>
    </tr>
</table>
</p>

### Python Dataset Module

We encapsulate the PTB dataset in the Python module `paddle.dataset.imikolov`. This module can

1. download the dataset to `~/.cache/paddle/dataset/imikolov`, if it is not already there, and
2. [preprocess](#preprocessing) the dataset, as sketched below.
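
A minimal usage sketch, relying only on the calls used later in this chapter (the printed values are omitted here):

```python
import paddle.v2 as paddle

# Build the PTB dictionary; the dataset is downloaded on first use.
word_dict = paddle.dataset.imikolov.build_dict()
print(len(word_dict))   # dictionary size

# Peek at a few 5-gram training instances (tuples of word indices).
reader = paddle.dataset.imikolov.train(word_dict, 5)
for i, instance in enumerate(reader()):
    print(instance)
    if i == 2:
        break
```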

### Preprocessing

We will be training a 5-gram model: given the first four words in a window of five, we predict the fifth.

The beginning and end of a sentence have special meaning, so we add a begin token `<s>` to the front of each sentence and an end token `<e>` to the end. Data instances are then generated by sliding the five-word window over the sentence.

For example, the sentence "I have a dream that one day" generates five data instances:

```text
<s> I have a dream
I have a dream that
have a dream that one
a dream that one day
dream that one day <e>
```

Finally, each data instance is converted into an integer sequence according to its words' indices in the dictionary.
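
For instance, with a hypothetical toy dictionary (the real indices come from `paddle.dataset.imikolov.build_dict()`), the first window of the example above would be converted like this:

```python
# Hypothetical toy dictionary; real indices are assigned by build_dict().
word_dict = {'<s>': 0, '<e>': 1, 'I': 2, 'have': 3, 'a': 4, 'dream': 5,
             'that': 6, 'one': 7, 'day': 8}

window = ['<s>', 'I', 'have', 'a', 'dream']
print([word_dict[w] for w in window])   # [0, 2, 3, 4, 5]
```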

## Training

The neural network that we will be using is illustrated in the graph below:

<p align="center">
    <img src="image/ngram.en.png" width=400><br/>
    Figure 5. N-gram neural network model in model configuration
</p>

`word2vec/train.py` demonstrates training word2vec using PaddlePaddle:

- Import packages.

```python
import math
import numpy
import paddle.v2 as paddle
```

- Configure parameters.

```python
embsize = 32 # word vector dimension
hiddensize = 256 # hidden layer dimension
N = 5 # train 5-gram
```


- Define helper functions to save and load the word dictionary and embedding table.
```python
# save and load word dict and embedding table
def save_dict_and_embedding(word_dict, embeddings):
    with open("word_dict", "w") as f:
        for key in word_dict:
            f.write(key + " " + str(word_dict[key]) + "\n")
    with open("embedding_table", "w") as f:
        numpy.savetxt(f, embeddings, delimiter=',', newline='\n')


def load_dict_and_embedding():
    word_dict = dict()
    with open("word_dict", "r") as f:
        for line in f:
            key, value = line.strip().split(" ")
            word_dict[key] = int(value)

    embeddings = numpy.loadtxt("embedding_table", delimiter=",")
    return word_dict, embeddings
```

- Map the $n-1$ words $w_{t-n+1},...w_{t-1}$ before $w_t$ to a $D$-dimensional vector through a matrix of dimension $|V|\times D$ ($D=32$ in this example).

```python
def wordemb(inlayer):
    wordemb = paddle.layer.table_projection(
        input=inlayer,
        size=embsize,
        param_attr=paddle.attr.Param(
            name="_proj",
            initial_std=0.001,
            learning_rate=1,
            l2_rate=0,
            sparse_update=True))
    return wordemb
```

- Define the name and type of the input for each data layer.

```python
paddle.init(use_gpu=False, trainer_count=3)
word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict)
# Every layer takes integer value of range [0, dict_size)
firstword = paddle.layer.data(
    name="firstw", type=paddle.data_type.integer_value(dict_size))
secondword = paddle.layer.data(
    name="secondw", type=paddle.data_type.integer_value(dict_size))
thirdword = paddle.layer.data(
    name="thirdw", type=paddle.data_type.integer_value(dict_size))
fourthword = paddle.layer.data(
    name="fourthw", type=paddle.data_type.integer_value(dict_size))
nextword = paddle.layer.data(
    name="fifthw", type=paddle.data_type.integer_value(dict_size))

Efirst = wordemb(firstword)
Esecond = wordemb(secondword)
Ethird = wordemb(thirdword)
Efourth = wordemb(fourthword)
```

- Concatenate n-1 word embedding vectors into a single feature vector.

```python
contextemb = paddle.layer.concat(input=[Efirst, Esecond, Ethird, Efourth])
```

- The feature vector goes through a fully connected layer, which outputs a hidden feature vector.

```python
hidden1 = paddle.layer.fc(input=contextemb,
                          size=hiddensize,
                          act=paddle.activation.Sigmoid(),
                          layer_attr=paddle.attr.Extra(drop_rate=0.5),
                          bias_attr=paddle.attr.Param(learning_rate=2),
                          param_attr=paddle.attr.Param(
                                initial_std=1. / math.sqrt(embsize * 8),
                                learning_rate=1))
```

- The hidden feature vector goes through another fully connected layer and turns into a $|V|$-dimensional vector. At the same time, softmax is applied to get the probability of each word being generated.

```python
predictword = paddle.layer.fc(input=hidden1,
                              size=dict_size,
                              bias_attr=paddle.attr.Param(learning_rate=2),
                              act=paddle.activation.Softmax())
```

- We will use the cross-entropy cost function.

```python
cost = paddle.layer.classification_cost(input=predictword, label=nextword)
```

- Create the parameters, optimizer, and trainer.

```python
parameters = paddle.parameters.create(cost)
adagrad = paddle.optimizer.AdaGrad(
    learning_rate=3e-3,
    regularization=paddle.optimizer.L2Regularization(8e-4))
trainer = paddle.trainer.SGD(cost, parameters, adagrad)
```

Next, we begin the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` provide our training and test sets. Both functions return a **reader**: in PaddlePaddle, a reader is a Python function that returns a Python iterator, which outputs a single data instance at a time.

`paddle.batch` takes a reader as input and outputs a **batched reader**: while a reader outputs a single data instance at a time, a batched reader outputs a minibatch of data instances at a time.

```python
def event_handler(event):
    if isinstance(event, paddle.event.EndIteration):
        if event.batch_id % 100 == 0:
            print "Pass %d, Batch %d, Cost %f, %s" % (
                event.pass_id, event.batch_id, event.cost, event.metrics)

    if isinstance(event, paddle.event.EndPass):
        result = trainer.test(
                    paddle.batch(
                        paddle.dataset.imikolov.test(word_dict, N), 32))
        print "Pass %d, Testing metrics %s" % (event.pass_id, result.metrics)
        with open("model_%d.tar"%event.pass_id, 'w') as f:
            trainer.save_parameter_to_tar(f)

trainer.train(
    paddle.batch(paddle.dataset.imikolov.train(word_dict, N), 32),
    num_passes=100,
    event_handler=event_handler)
```

`trainer.train` will start training; the output of `event_handler` will look similar to the following:
```text
Pass 0, Batch 0, Cost 7.870579, {'classification_error_evaluator': 1.0}, Testing metrics {'classification_error_evaluator': 0.999591588973999}
Pass 0, Batch 100, Cost 6.136420, {'classification_error_evaluator': 0.84375}, Testing metrics {'classification_error_evaluator': 0.8328699469566345}
Pass 0, Batch 200, Cost 5.786797, {'classification_error_evaluator': 0.8125}, Testing metrics {'classification_error_evaluator': 0.8328542709350586}
...
```

After 30 passes, the average error rate is around 0.735611.

## Saving the Word Dictionary and Embedding Table

After training, we can save the word dictionary and embedding table for future use.

```python
# save word dict and embedding table
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)
save_dict_and_embedding(word_dict, embeddings)
```


## Model Application

After the model is trained, we can load the saved model parameters and use them in other models. We can also use the parameters in various applications.

### Viewing Word Vector

Parameters trained by PaddlePaddle can be viewed with `parameters.get()`. For example, we can check the word vector for the word `apple`.

```python
embeddings = parameters.get("_proj").reshape(len(word_dict), embsize)

print embeddings[word_dict['apple']]
```

```text
[-0.38961065 -0.02392169 -0.00093231  0.36301503  0.13538605  0.16076435
-0.0678709   0.1090285   0.42014077 -0.24119169 -0.31847557  0.20410083
0.04910378  0.19021918 -0.0122014  -0.04099389 -0.16924137  0.1911236
-0.10917275  0.13068172 -0.23079982  0.42699069 -0.27679482 -0.01472992
0.2069038   0.09005053 -0.3282454   0.12717034 -0.24218646  0.25304323
0.19072419 -0.24286366]
```

### Modifying Word Vector

The word vectors (`embeddings`) we obtain form a NumPy array. We can modify this array and set it back into `parameters`.


```python
def modify_embedding(emb):
    # Add your modification here.
    pass

modify_embedding(embeddings)
parameters.set("_proj", embeddings)
```
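
For example, one possible modification (a sketch, not part of the tutorial; it assumes `numpy` is imported as in the earlier import block) is to L2-normalize every word vector before writing it back, so that cosine similarity reduces to a dot product:

```python
def modify_embedding(emb):
    # L2-normalize each word vector in place; guard against zero vectors.
    norms = numpy.linalg.norm(emb, axis=1, keepdims=True)
    emb /= numpy.maximum(norms, 1e-8)
```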

### Calculating Cosine Similarity

Cosine similarity is one way of quantifying the similarity between two vectors; its value falls in $[-1, 1]$, and the bigger the value, the more similar the two vectors are. Note that `scipy.spatial.distance.cosine` used below returns the cosine *distance*, i.e. one minus the cosine similarity:


```python
from scipy import spatial

emb_1 = embeddings[word_dict['world']]
emb_2 = embeddings[word_dict['would']]

print spatial.distance.cosine(emb_1, emb_2)
```

```text
0.99375076448
```

## Conclusion

This chapter introduced word embeddings, the relationship between language models and word embeddings, and how to train neural networks to learn word embeddings.

In information retrieval, the relevance between a query and the document keywords can be computed through the cosine similarity of their word embeddings. In syntactic and semantic analysis, previously trained word embeddings can be used to initialize models for better performance. In document classification, clustering word embeddings can group synonyms in the documents. We hope that readers can apply word embedding models in their own work after reading this chapter.


## References
1. Bengio Y, Ducharme R, Vincent P, et al. [A neural probabilistic language model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)[J]. Journal of Machine Learning Research, 2003, 3(Feb): 1137-1155.
2. Mikolov T, Kombrink S, Deoras A, et al. [RNNLM - recurrent neural network language modeling toolkit](http://www.fit.vutbr.cz/~imikolov/rnnlm/rnnlm-demo.pdf)[C]//Proc. of the 2011 ASRU Workshop. 2011: 196-201.
3. Mikolov T, Chen K, Corrado G, et al. [Efficient estimation of word representations in vector space](https://arxiv.org/pdf/1301.3781.pdf)[J]. arXiv preprint arXiv:1301.3781, 2013.
4. Maaten L, Hinton G. [Visualizing data using t-SNE](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf)[J]. Journal of Machine Learning Research, 2008, 9(Nov): 2579-2605.
5. [Singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition), Wikipedia.

<br/>
This tutorial is contributed by <a xmlns:cc="http://creativecommons.org/ns#" href="http://book.paddlepaddle.org" property="cc:attributionName" rel="cc:attributionURL">PaddlePaddle</a>, and licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

</div>
<!-- You can change the lines below now. -->

<script type="text/javascript">
marked.setOptions({
  renderer: new marked.Renderer(),
  gfm: true,
  breaks: false,
  smartypants: true,
  highlight: function(code, lang) {
    code = code.replace(/&amp;/g, "&")
    code = code.replace(/&gt;/g, ">")
    code = code.replace(/&lt;/g, "<")
    code = code.replace(/&nbsp;/g, " ")
    return hljs.highlightAuto(code, [lang]).value;
  }
});
document.getElementById("context").innerHTML = marked(
        document.getElementById("markdown").innerHTML)
</script>
</body>