@@ -149,19 +149,257 @@ The advantages of CBOW is that it smooths over the word embeddings of the contex
As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combine the classification loss of all those $2n$ words by softmax.
## Data Preparation
## Dataset
We will use Peen Treebank (PTB) (Tomas Mikolov's pre-processed version) dataset. PTB is a small dataset, used in Recurrent Neural Network Language Modeling Toolkit\[[2](#reference)\]. Its statistics are as follows:
<palign="center">
<table>
<tr>
<td>training set</td>
<td>validation set</td>
<td>test set</td>
</tr>
<tr>
<td>ptb.train.txt</td>
<td>ptb.valid.txt</td>
<td>ptb.test.txt</td>
</tr>
<tr>
<td>42068 lines</td>
<td>3370 lines</td>
<td>3761 lines</td>
</tr>
</table>
</p>
### Python Dataset Module
We encapsulated the PTB Data Set in our Python module `paddle.dataset.imikolov`. This module can
1. download the dataset to `~/.cache/paddle/dataset/imikolov`, if not yet, and
2.[preprocesses](#preprocessing) the dataset.
### Preprocessing
We will be training a 5-gram model. Given five words in a window, we will predict the fifth word given the first four words.
Beginning and end of a sentence have a special meaning, so we will add begin token `<s>` in the front of the sentence. And end token `<e>` in the end of the sentence. By moving the five word window in the sentence, data instances are generated.
For example, the sentence "I have a dream that one day" generates five data instances:
```text
<s> I have a dream
I have a dream that
have a dream that one
a dream that one day
dream that one day <e>
```
At last, each data instance will be converted into an integer sequence according it's words' index inside the dictionary.
## Training
The neural network that we will be using is illustrated in the graph below:
## Model Configuration
<palign="center">
<imgsrc="image/ngram.en.png"width=400><br/>
Figure 5. N-gram neural network model in model configuration
</p>
`word2vec/train.py` demonstrates training word2vec using PaddlePaddle:
- Import packages.
```python
importmath
importpaddle.v2aspaddle
```
- Configure parameter.
```python
embsize=32# word vector dimension
hiddensize=256# hidden layer dimension
N=5# train 5-gram
```
- Map the $n-1$ words $w_{t-n+1},...w_{t-1}$ before $w_t$ to a D-dimensional vector though matrix of dimention $|V|\times D$ (D=32 in this example).
```python
defwordemb(inlayer):
wordemb=paddle.layer.table_projection(
input=inlayer,
size=embsize,
param_attr=paddle.attr.Param(
name="_proj",
initial_std=0.001,
learning_rate=1,
l2_rate=0,))
returnwordemb
```
- Define name and type for input to data layer.
```python
paddle.init(use_gpu=False,trainer_count=3)
word_dict=paddle.dataset.imikolov.build_dict()
dict_size=len(word_dict)
# Every layer takes integer value of range [0, dict_size)
- Feature vector will go through a fully connected layer which outputs a hidden feature vector.
```python
hidden1=paddle.layer.fc(input=contextemb,
size=hiddensize,
act=paddle.activation.Sigmoid(),
layer_attr=paddle.attr.Extra(drop_rate=0.5),
bias_attr=paddle.attr.Param(learning_rate=2),
param_attr=paddle.attr.Param(
initial_std=1./math.sqrt(embsize*8),
learning_rate=1))
```
- Hidden feature vector will go through another fully conected layer, turn into a $|V|$ dimensional vector. At the same time softmax will be applied to get the probability of each word being generated.
Next, we will begin the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` is our training set and test set. Both of the function will return a **reader**: In PaddlePaddle, reader is a python function which returns a Python iterator which output a single data instance at a time.
`paddle.batch` takes reader as input, outputs a **batched reader**: In PaddlePaddle, a reader outputs a single data instance at a time but batched reader outputs a minibatch of data instances.
Word vectors (`embeddings`) that we get is a numpy array. We can modify this array and set it back to `parameters`.
```python
defmodify_embedding(emb):
# Add your modification here.
pass
modify_embedding(embeddings)
parameters.set("_proj",embeddings)
```
### Calculating Cosine Similarity
Cosine similarity is one way of quantifying the similarity between two vectors. The range of result is $[-1, 1]$. The bigger the value, the similar two vectors are:
```python
fromscipyimportspatial
emb_1=embeddings[word_dict['world']]
emb_2=embeddings[word_dict['would']]
printspatial.distance.cosine(emb_1,emb_2)
```
```text
0.99375076448
```
## Conclusion
This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding.
@@ -191,19 +191,257 @@ The advantages of CBOW is that it smooths over the word embeddings of the contex
As illustrated in the figure above, skip-gram model maps the word embedding of the given word onto $2n$ word embeddings (including $n$ words before and $n$ words after the given word), and then combine the classification loss of all those $2n$ words by softmax.
## Data Preparation
## Dataset
We will use Peen Treebank (PTB) (Tomas Mikolov's pre-processed version) dataset. PTB is a small dataset, used in Recurrent Neural Network Language Modeling Toolkit\[[2](#reference)\]. Its statistics are as follows:
<palign="center">
<table>
<tr>
<td>training set</td>
<td>validation set</td>
<td>test set</td>
</tr>
<tr>
<td>ptb.train.txt</td>
<td>ptb.valid.txt</td>
<td>ptb.test.txt</td>
</tr>
<tr>
<td>42068 lines</td>
<td>3370 lines</td>
<td>3761 lines</td>
</tr>
</table>
</p>
### Python Dataset Module
We encapsulated the PTB Data Set in our Python module `paddle.dataset.imikolov`. This module can
1. download the dataset to `~/.cache/paddle/dataset/imikolov`, if not yet, and
2. [preprocesses](#preprocessing) the dataset.
### Preprocessing
We will be training a 5-gram model. Given five words in a window, we will predict the fifth word given the first four words.
Beginning and end of a sentence have a special meaning, so we will add begin token `<s>` in the front of the sentence. And end token `<e>` in the end of the sentence. By moving the five word window in the sentence, data instances are generated.
For example, the sentence "I have a dream that one day" generates five data instances:
```text
<s> I have a dream
I have a dream that
have a dream that one
a dream that one day
dream that one day <e>
```
At last, each data instance will be converted into an integer sequence according it's words' index inside the dictionary.
## Training
The neural network that we will be using is illustrated in the graph below:
## Model Configuration
<palign="center">
<imgsrc="image/ngram.en.png"width=400><br/>
Figure 5. N-gram neural network model in model configuration
</p>
`word2vec/train.py` demonstrates training word2vec using PaddlePaddle:
- Import packages.
```python
import math
import paddle.v2 as paddle
```
- Configure parameter.
```python
embsize = 32 # word vector dimension
hiddensize = 256 # hidden layer dimension
N = 5 # train 5-gram
```
- Map the $n-1$ words $w_{t-n+1},...w_{t-1}$ before $w_t$ to a D-dimensional vector though matrix of dimention $|V|\times D$ (D=32 in this example).
```python
def wordemb(inlayer):
wordemb = paddle.layer.table_projection(
input=inlayer,
size=embsize,
param_attr=paddle.attr.Param(
name="_proj",
initial_std=0.001,
learning_rate=1,
l2_rate=0, ))
return wordemb
```
- Define name and type for input to data layer.
```python
paddle.init(use_gpu=False, trainer_count=3)
word_dict = paddle.dataset.imikolov.build_dict()
dict_size = len(word_dict)
# Every layer takes integer value of range [0, dict_size)
- Feature vector will go through a fully connected layer which outputs a hidden feature vector.
```python
hidden1 = paddle.layer.fc(input=contextemb,
size=hiddensize,
act=paddle.activation.Sigmoid(),
layer_attr=paddle.attr.Extra(drop_rate=0.5),
bias_attr=paddle.attr.Param(learning_rate=2),
param_attr=paddle.attr.Param(
initial_std=1. / math.sqrt(embsize * 8),
learning_rate=1))
```
- Hidden feature vector will go through another fully conected layer, turn into a $|V|$ dimensional vector. At the same time softmax will be applied to get the probability of each word being generated.
Next, we will begin the training process. `paddle.dataset.imikolov.train()` and `paddle.dataset.imikolov.test()` is our training set and test set. Both of the function will return a **reader**: In PaddlePaddle, reader is a python function which returns a Python iterator which output a single data instance at a time.
`paddle.batch` takes reader as input, outputs a **batched reader**: In PaddlePaddle, a reader outputs a single data instance at a time but batched reader outputs a minibatch of data instances.
Word vectors (`embeddings`) that we get is a numpy array. We can modify this array and set it back to `parameters`.
```python
def modify_embedding(emb):
# Add your modification here.
pass
modify_embedding(embeddings)
parameters.set("_proj", embeddings)
```
### Calculating Cosine Similarity
Cosine similarity is one way of quantifying the similarity between two vectors. The range of result is $[-1, 1]$. The bigger the value, the similar two vectors are:
```python
from scipy import spatial
emb_1 = embeddings[word_dict['world']]
emb_2 = embeddings[word_dict['would']]
print spatial.distance.cosine(emb_1, emb_2)
```
```text
0.99375076448
```
## Conclusion
This chapter introduces word embedding, the relationship between language model and word embedding, and how to train neural networks to learn word embedding.